Workflow:Validation Error Analysis and Treatment for PDF-hul 133 Invalid date

From COPTR
Validation Error Analysis and Treatment for PDF-hul 133 Invalid date
Status:Production
Tools:
Input:File with JHOVE validation error PDF-HUL-133 "Improperly formed date"
Output:Fixed file
Organisation:TIB

Workflow Description

The workflow describes the analysis and the fix of a specific instance of a PDF-HUL-133 error. It is a manual workflow. The methodology used here is that introduced in https://hdl.handle.net/2142/121092.

Validation Error

JHOVE v1.30.0RC1
PDF-hul v1.12.5
PDF-HUL-133 Improperly formed date.
Well-formed, but not valid.

Cross-Check with other Tools

Cross-checked with:
pdfcpu v0.6.0dev relaxed mode - no error
pdfcpu v0.6.0dev strict mode - no error
qpdf v9.1.1 - no error
Arlington PDF Model checker by veraPDF 1.26.0-RC1 Greenfield Parser:
Entry ModDate in DocInfo shall have type Date (1 occurrence)
Entry CreationDate in DocInfo shall have type Date (1 occurrence)

Locate Error in Spec

ISO 32000-2:2017 (the PDF 2.0 specification) lists the keys CreationDate and ModDate in the Document Info Dictionary as type Date. Section 7.9.4 defines that a date shall be a text string containing no whitespace, of the form:

(D:YYYYMMDDHHmmSSOHH'mm)

where:
YYYY shall be the year
MM shall be the month (01–12)
DD shall be the day (01–31)
HH shall be the hour (00–23)
mm shall be the minute (00–59)
SS shall be the second (00–59)
O shall be the relationship of local time to Universal Time (UT), and shall be denoted by one of the characters PLUS SIGN (U+002B) (+), HYPHEN-MINUS (U+002D) (-), or LATIN CAPITAL LETTER Z (U+005A) (Z) (see below)
HH followed by APOSTROPHE (U+0027) (') shall be the absolute value of the offset from UT in hours (00–23)
mm shall be the absolute value of the offset from UT in minutes (00–59)

The prefix D: shall be present, the year field (YYYY) shall be present and all other fields may be present but only if all of their preceding fields are also present. The APOSTROPHE following the hour offset field (HH) shall only be present if the HH field is present. The minute offset field (mm) shall only be present if the APOSTROPHE following the hour offset field (HH) is present. The default values for MM and DD shall be both 01; all other numerical fields shall default to zero values. A PLUS SIGN as the value of the O field signifies that local time is later than UT, a HYPHEN-MINUS signifies that local time is earlier than UT, and the LATIN CAPITAL LETTER Z signifies that local time is equal to UT. If no UT information is specified, the relationship of the specified time to UT shall be considered to be GMT. Regardless of whether the time zone is specified, the rest of the date shall be specified in local time.

The specification further notes that versions up to and including 1.7 defined a date string to include a terminating apostrophe. PDF processors are recommended to accept date strings that still follow that convention.
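The grammar quoted above can be expressed as a regular expression. The following is a minimal sketch, not taken from JHOVE or veraPDF; the names `PDF_DATE` and `is_valid_pdf_date` are hypothetical, and the optional trailing apostrophe is accepted only because the specification recommends tolerating the pre-2.0 convention.

```python
import re

# Each optional group may only appear if all preceding fields are present,
# mirroring the nesting rule in the specification text quoted above.
PDF_DATE = re.compile(
    r"""^D:
    (?P<year>\d{4})
    (?:(?P<month>0[1-9]|1[0-2])
    (?:(?P<day>0[1-9]|[12]\d|3[01])
    (?:(?P<hour>[01]\d|2[0-3])
    (?:(?P<minute>[0-5]\d)
    (?:(?P<second>[0-5]\d)
    (?:(?P<rel>[+\-Z])
    (?:(?P<off_hour>[01]\d|2[0-3])'
    (?:(?P<off_min>[0-5]\d)(?P<legacy>')?)?
    )?)?)?)?)?)?)?
    $""",
    re.VERBOSE,
)


def is_valid_pdf_date(value: str) -> bool:
    """Return True if value follows the date form described above.

    Assumption: the surrounding parentheses of the PDF literal string
    have already been stripped. Day numbers are only range-checked
    (01-31), as in the specification text; they are not checked
    against the month.
    """
    return PDF_DATE.match(value) is not None
```

Under this sketch, the value found in the file analysed in this workflow, `D:2018020915415409'00'`, is rejected because a digit appears where the O field is expected.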


Locate Error in File

The offset given by JHOVE can be navigated to using a hex editor such as HxD. The offset points to the Document Info Dictionary with /CreationDate (D:2018020915415409'00') and /ModDate(D:2018020915415409'00'). Mapping those dates to the notation given above shows that the O field, which identifies the relationship to UT, is missing: the seconds (54) are followed directly by the hour offset (09). As an alternative to looking for the dates in a hex editor, exiftool can be used to pull out all dates with:

exiftool -time:all -G -a -s file.pdf
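As a scripted alternative to the hex editor, the date entries and their byte offsets can be located with a byte-level search. This is a rough sketch under stated assumptions: the function name `find_info_dates` is hypothetical, and it only finds dates stored as unencrypted literal strings in uncompressed objects (dates inside object streams or written as hex strings are not found).

```python
import re

# Matches /CreationDate or /ModDate followed by a literal string in parentheses.
DATE_ENTRY = re.compile(rb"/(CreationDate|ModDate)\s*\(([^)]*)\)")


def find_info_dates(data: bytes):
    """Return a list of (byte offset, key, value) for each date entry found."""
    return [
        (m.start(), m.group(1).decode("ascii"), m.group(2).decode("latin-1"))
        for m in DATE_ENTRY.finditer(data)
    ]


# Usage with an in-memory sample; for a real file, read it with
# open(path, "rb").read() first.
sample = b"<< /CreationDate (D:2018020915415409'00') /ModDate(D:2018020915415409'00') >>"
for offset, key, value in find_info_dates(sample):
    print(offset, key, value)
```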

Match?

Yes. The description in the spec and the error in the file match. (D:2018020915415409'00') is missing the O field.

Fixable?

Yes. If other dates, e.g. in the XMP section, are in the correct format, they can be used to determine the correct value for O. If not, the HH'mm' UT offset can be removed to form a valid date.
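Determining O from an XMP date can be sketched as follows. The sketch assumes the XMP value is an ISO 8601 style timestamp such as `2018-02-09T15:41:54+09:00`; the sample value and the function name `utc_relation` are illustrative and not taken from the file analysed here.

```python
import re

# Time part followed by either Z or a signed hh:mm offset at the end of the string.
XMP_OFFSET = re.compile(r"T\d{2}:\d{2}(?::\d{2}(?:\.\d+)?)?(Z|[+-]\d{2}:\d{2})$")


def utc_relation(xmp_date: str):
    """Return (O, HH, mm) taken from an XMP timestamp, or None if it has no offset."""
    m = XMP_OFFSET.search(xmp_date)
    if m is None:
        return None
    tz = m.group(1)
    if tz == "Z":
        return ("Z", "00", "00")
    return (tz[0], tz[1:3], tz[4:6])
```

As a plausibility check, the HH and mm returned here should equal the offset digits already present in the Document Info Dictionary date (09 and 00 in this workflow) before the sign is reused.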

Fix

Option 1: This option only works if the O field is missing and the trailing apostrophe is present. In this case, the trailing apostrophe can be deleted and the O field, if it can be determined from other dates, added to the value, e.g. (D:2018020915415409'00') becomes (D:20180209154154+09'00). This does not change the byte length of the date value and therefore has no impact on the PDF's internal structure.
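The same-length substitution of Option 1 can be sketched as a small function that operates on the raw date bytes. The name `patch_missing_sign` is hypothetical, and the sketch only handles the exact pattern seen in this workflow (full date and time, offset digits, trailing apostrophe, no O field).

```python
import re

# D: + 14 digits (YYYYMMDDHHmmSS), then HH'mm' with no O field in between.
LEGACY_NO_SIGN = re.compile(rb"^(D:\d{14})(\d{2})'(\d{2})'$")


def patch_missing_sign(value: bytes, sign: bytes) -> bytes:
    """Insert the O field and drop the trailing apostrophe, keeping the length."""
    if sign not in (b"+", b"-"):
        raise ValueError("sign must be b'+' or b'-'")
    m = LEGACY_NO_SIGN.match(value)
    if m is None:
        raise ValueError("value does not match the expected broken pattern")
    patched = m.group(1) + sign + m.group(2) + b"'" + m.group(3)
    # One byte added (sign), one byte removed (apostrophe): offsets stay intact.
    if len(patched) != len(value):
        raise AssertionError("patched value changed length")
    return patched
```

Because the length is unchanged, the patched bytes can overwrite the original value in place without touching the cross-reference table.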

Option 2: The date can also be modified using exiftool. This appends an incremental update to the end of the file, which can be rolled back with tools such as pdfresurrect. Because objects are added, the size of the file will change. The exiftool command to update the dates above so that they include + as the O field, for example, is:

exiftool -PDF:ModifyDate="2018:02:09 15:41:54+09:00" -PDF:CreateDate="2018:02:09 15:41:54+09:00" file.pdf

If you do not know the correct value for the O field, you can also omit the offset with:

exiftool -PDF:ModifyDate="2018:02:09 15:41:54" -PDF:CreateDate="2018:02:09 15:41:54" file.pdf
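When several files show the same defect, the exiftool arguments can be derived from the broken value instead of being typed by hand. The following sketch only builds the argument list and does not run exiftool; the function name `exiftool_args` and the default path are hypothetical, while the two tag names are the ones used in the commands above.

```python
import re

# Broken pattern from this workflow: date and time, then HH'mm' without an O field.
BROKEN = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})'(\d{2})'$"
)


def exiftool_args(value: str, sign=None, path: str = "file.pdf"):
    """Build an exiftool argument list; omit the offset when sign is None."""
    m = BROKEN.match(value)
    if m is None:
        raise ValueError("value does not match the expected broken pattern")
    if sign not in (None, "+", "-"):
        raise ValueError("sign must be '+', '-' or None")
    year, month, day, hour, minute, second, off_h, off_m = m.groups()
    stamp = f"{year}:{month}:{day} {hour}:{minute}:{second}"
    if sign is not None:
        stamp += f"{sign}{off_h}:{off_m}"
    # No shell quoting is needed because each list item is passed as one argument.
    return ["exiftool", f"-PDF:ModifyDate={stamp}", f"-PDF:CreateDate={stamp}", path]
```

The resulting list can be handed to `subprocess.run(args, check=True)` on a system where exiftool is installed.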

Check

Re-validate the file with the Arlington PDF Model checker by veraPDF: the error messages are gone, no deviations found.
Re-validate the file with JHOVE: well-formed and valid with option 2; well-formed but not valid with option 1, as JHOVE currently expects dates in the pre-PDF 2.0 format, i.e. with the trailing apostrophe that option 1 removes.

Success?

Yes.

Workflow image


Purpose, Context and Content

This workflow describes the manual analysis and treatment of the JHOVE PDF-hul error message PDF-HUL-133 Improperly formed date, including the process followed and its results.

Evaluation/Review

The workflow is effective and should be replicable for most errors related to improperly formed dates in PDFs. However, as an invalid date has little impact on the rendering behavior of the PDF file itself, institutions may decide not to fix the error. Please note that most PDF readers (e.g., Adobe Acrobat, Foxit) use XMP metadata rather than Document Info Dictionary metadata when they display the creation and last-modified dates in a document's properties.

Further Information