Difference between revisions of "Workflow:Validation Error Analysis and Treatment for PDF-hul 133 Invalid date"

Revision as of 17:11, 8 May 2024

Validation Error Analysis and Treatment for PDF-hul 133 Invalid date
Status:Production
Tools:ExifTool, HxD, Pdfcpu, Qpdf, JHOVE, veraPDF Arlington Model Checker
Input:File with JHOVE validation error PDF-HUL-133 “Invalid date”
Output:Fixed file
Organisation:TIB

Workflow Description

The workflow describes the analysis and the fix of a specific instance of a PDF-HUL-133 error. It is a manual workflow. The methodology used here is that introduced in https://hdl.handle.net/2142/121092.

Validation Error

JHOVE v1.30.0RC1 PDF-hul v1.12.5 PDF-HUL-133 Invalid Date. Well-formed, but not valid.

Cross-Check with other Tools

Cross-checked with:
pdfcpu v0.6.0dev relaxed mode - no error
pdfcpu v0.6.0dev strict mode - no error
qpdf v9.1.1 - no error
Arlington PDF Model checker by veraPDF 1.26.0-RC1, Greenfield Parser:
Entry ModDate in DocInfo shall have type Date (1 occurrence)
Entry CreationDate in DocInfo shall have type Date (1 occurrence)

Locate Error in Spec

ISO 32000-2:2017 (PDF 2.0 Spec) lists the keys CreationDate and ModDate in the Document Info Dictionary as type Date. Section 7.9.4 defines that a date shall be a text string containing no whitespace, of the form:
(D:YYYYMMDDHHmmSSOHH'mm)
where:
YYYY shall be the year
MM shall be the month (01–12)
DD shall be the day (01–31)
HH shall be the hour (00–23)
mm shall be the minute (00–59)
SS shall be the second (00–59)
O shall be the relationship of local time to Universal Time (UT), and shall be denoted by one of the characters PLUS SIGN (U+002B) (+), HYPHEN-MINUS (U+002D) (-), or LATIN CAPITAL LETTER Z (U+005A) (Z) (see below)
HH followed by APOSTROPHE (U+0027) (') shall be the absolute value of the offset from UT in hours (00–23)
mm shall be the absolute value of the offset from UT in minutes (00–59)

The prefix D: shall be present, the year field (YYYY) shall be present and all other fields may be present but only if all of their preceding fields are also present. The APOSTROPHE following the hour offset field (HH) shall only be present if the HH field is present. The minute offset field (mm) shall only be present if the APOSTROPHE following the hour offset field (HH) is present. The default values for MM and DD shall be both 01; all other numerical fields shall default to zero values. A PLUS SIGN as the value of the O field signifies that local time is later than UT, a HYPHEN-MINUS signifies that local time is earlier than UT, and the LATIN CAPITAL LETTER Z signifies that local time is equal to UT. If no UT information is specified, the relationship of the specified time to UT shall be considered to be GMT. Regardless of whether the time zone is specified, the rest of the date shall be specified in local time.

The specification further notes that versions up to and including 1.7 defined a date string to include a terminating apostrophe. PDF processors are recommended to accept date strings that still follow that convention.
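
The date form quoted above can be checked mechanically. The following is a minimal Python sketch, not part of the original workflow: the names PDF_DATE and check_pdf_date are hypothetical, and the pattern only encodes the field rules quoted in this section (every field after the year is optional but requires the preceding ones; a terminating apostrophe after the minute offset is tolerated as the pre-2.0 convention). It does not check calendar validity such as 31 February.

```python
import re

# Hypothetical helper: encodes only the field rules quoted above.
# Each field after the year is optional, but nested so that it can
# appear only when all preceding fields are present.
PDF_DATE = re.compile(
    r"""D:(?P<year>\d{4})
        (?:(?P<month>0[1-9]|1[0-2])
        (?:(?P<day>0[1-9]|[12]\d|3[01])
        (?:(?P<hour>[01]\d|2[0-3])
        (?:(?P<minute>[0-5]\d)
        (?:(?P<second>[0-5]\d)
        (?:(?P<o>[+\-Z])
        (?:(?P<off_h>[01]\d|2[0-3])
        (?:'
        (?:(?P<off_m>[0-5]\d)
        (?P<trailing>')?
        )?
        )?
        )?
        )?
        )?
        )?
        )?
        )?
        )?
    """,
    re.VERBOSE,
)


def check_pdf_date(value: str) -> str:
    """Classify a date string (without the enclosing parentheses)."""
    match = PDF_DATE.fullmatch(value)
    if match is None:
        return "invalid"
    if match.group("trailing"):
        # Terminating apostrophe: the convention of PDF 1.7 and earlier.
        return "valid (pre-2.0 trailing apostrophe)"
    return "valid"
```

Applied to the value found in the file discussed below, D:2018020915415409'00', the sketch reports "invalid", because the digits of the hour offset follow the seconds without an O character in between.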


Locate Error in File

The offset given by JHOVE can be navigated to using a hex editor such as HxD. The offset points to the Document Info Dictionary with /CreationDate (D:2018020915415409'00') and /ModDate(D:2018020915415409'00'). Mapping those dates to the notation given above, it becomes clear that the O key, which identifies the relation to UT, is missing. As an alternative to looking for the dates in a hex editor, exiftool can be used to pull out all dates with:
exiftool -time:all -G -a -s file.pdf
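
As a rough scripted counterpart to the hex editor search, the byte offsets of the date entries can be listed with a few lines of Python. This is a sketch under stated assumptions, not the method used in the workflow: it assumes the dates are stored as unencrypted literal strings in an uncompressed object (not inside an object stream and not as hex strings), and the names DATE_ENTRY and find_info_dates are hypothetical.

```python
import re

# Assumption: date values are plain literal strings such as
# /CreationDate (D:...) in an uncompressed, unencrypted object.
DATE_ENTRY = re.compile(rb"/(CreationDate|ModDate)\s*\(([^)]*)\)")


def find_info_dates(data: bytes):
    """Return (byte offset, key, raw value) for each date entry found."""
    return [
        (m.start(), m.group(1).decode("ascii"), m.group(2).decode("latin-1"))
        for m in DATE_ENTRY.finditer(data)
    ]
```

For a real file, the bytes would be read with something like pathlib.Path("file.pdf").read_bytes() and the reported offsets compared with the offset given by JHOVE.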

Match?

Yes. Description in the spec and error in the file match. (D:2018020915415409'00') is missing the O key.

Fixable?

Yes. If other dates, e.g. in the XMP section, are in the correct format, they can be used to determine the correct value for O. If not, the HH'MM' UT offset can be removed to form a valid date.

Fix

Option 1: This option only works if the O entry is missing and the trailing apostrophe is present. In this case, the trailing apostrophe can be deleted and the O entry, if it can be determined from other dates, added to the value. This does not change the overall byte length of the date value and therefore has no impact on the PDF's internal structure.
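
The byte-neutral edit of Option 1 can be illustrated as follows. This is a minimal sketch with a hypothetical function name (fix_missing_o); the "+" sign is only an assumed default taken from the example value used elsewhere on this page and has to be confirmed from other dates in the file, e.g. the XMP metadata.

```python
import re

# Matches only the broken form seen in this file:
# D:YYYYMMDDHHmmSS followed by HH'mm' with no O character.
BROKEN_DATE = re.compile(rb"D:(\d{14})(\d{2})'(\d{2})'")


def fix_missing_o(value: bytes, sign: bytes = b"+") -> bytes:
    """Insert the O character and drop the trailing apostrophe.

    One byte is added and one is removed, so the length is unchanged.
    The sign is an assumption and must be verified for each file.
    """
    if sign not in (b"+", b"-"):
        raise ValueError("sign must be b'+' or b'-'")
    match = BROKEN_DATE.fullmatch(value)
    if match is None:
        raise ValueError("value is not of the form D:YYYYMMDDHHmmSSHH'mm'")
    stamp, off_h, off_m = match.groups()
    fixed = b"D:" + stamp + sign + off_h + b"'" + off_m
    assert len(fixed) == len(value)  # byte length must stay the same
    return fixed
```

Because the length is preserved, the corrected value can overwrite the original bytes at the same offset without shifting any other object in the file.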

Option 2: The date can also be modified using exiftool. This appends an incremental update to the end of the file, which can be rolled back with tools such as pdfresurrect. Because objects are added, the size of the file will change. The exiftool command to, e.g., update the dates above to include the O key + is:
exiftool -PDF:ModifyDate="2018:02:09 15:41:54+09:00" -PDF:CreateDate="2018:02:09 15:41:54+09:00" file.pdf
If you do not know the correct O key, you can also omit the offset with:
exiftool -PDF:ModifyDate="2018:02:09 15:41:54" -PDF:CreateDate="2018:02:09 15:41:54" file.pdf
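
When several files show the same pattern, the exiftool arguments above can be derived from the broken value instead of being typed by hand. The sketch below is illustrative only: exiftool_date and exiftool_command are hypothetical helper names, the same value is assumed for both dates as in the example file, and the command is only built, not executed. It uses no exiftool options beyond the two tag assignments shown above.

```python
import re
from typing import List, Optional

# Broken form from this file: seconds followed directly by HH'mm'.
# The offset part is optional so a date without offset also converts.
RAW_DATE = re.compile(
    r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})(?:(\d{2})'(\d{2})')?"
)


def exiftool_date(raw: str, sign: Optional[str] = None) -> str:
    """Convert the raw value to the notation used in the commands above.

    The offset is written only if a sign is supplied; the sign must be
    determined from other dates in the file.
    """
    match = RAW_DATE.fullmatch(raw)
    if match is None:
        raise ValueError("unexpected date form: " + raw)
    y, mo, d, h, mi, s, off_h, off_m = match.groups()
    stamp = f"{y}:{mo}:{d} {h}:{mi}:{s}"
    if sign in ("+", "-") and off_h is not None:
        stamp += f"{sign}{off_h}:{off_m}"
    return stamp


def exiftool_command(raw: str, path: str, sign: Optional[str] = None) -> List[str]:
    """Build the argument list; no shell quoting is needed in list form."""
    stamp = exiftool_date(raw, sign)
    return ["exiftool", f"-PDF:ModifyDate={stamp}", f"-PDF:CreateDate={stamp}", path]
```

The returned list could be passed to subprocess.run once the values have been reviewed; leaving the sign unset reproduces the second command, which omits the offset.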

Check

Re-validate file with Arlington PDF Model checker by veraPDF: Error messages are gone, no deviations found.
Re-validate file with JHOVE: Well-formed and valid with option 2. Well-formed but not valid with option 1, as JHOVE currently expects the date in the pre-PDF 2.0 format, i.e. with the trailing apostrophe.

Success?

Yes.

Textual description


Purpose, Context and Content

This workflow describes the analysis and treatment of JHOVE PDF-Hul error message PDF-HUL-133 Invalid date. It describes the process and results of a manual validation error analysis and treatment process.

Evaluation/Review

The workflow is effective and should be replicable for most invalid-date-related errors in PDFs. However, as an invalid date has little impact on the rendering behavior of the PDF file itself, institutions may decide not to fix the error.
Please note that most PDF readers (e.g., Adobe Acrobat, Foxit) make use of XMP metadata and not of Document Info Dictionary metadata when they display the creation and last-modified dates in a document's properties.

Further Information