Difference between revisions of "Pdfcpu"

From COPTR
== User Experiences ==
 
* Caron, Bertrand. ‘“Validation OK” They Said: Fixing the Rendering of a So-Called Valid PDF’. Open Preservation Foundation, 21 May 2025. https://openpreservation.org/blogs/validation-ok-they-said-fixing-the-rendering-of-a-so-called-valid-pdf/.
* Lindlar, Micky. ‘A Date with PDF-HUL-133 “Improperly Formed Date”’. Open Preservation Foundation, 13 May 2024. https://openpreservation.org/blogs/a-date-with-pdf-hul-133-improperly-formed-date/.
* Lindlar, Micky. ‘Destination Null: One of the Many Causes of PDF-Hul 122’. Open Preservation Foundation, 21 January 2024. https://openpreservation.org/blogs/destination-null-one-of-the-many-cases-of-pdf-hul-122/.
* Lindlar, Micky. ‘Tabs Keys in PDF Page Objects – What Are They and Should I Care?’ Open Preservation Foundation, 16 May 2023. https://openpreservation.org/blogs/tabs-keys-in-pdf-page-objects-what-are-they-and-should-i-care/.
* Lindlar, Micky. ‘Trouble-Shooting PDF Validation Errors - a Case of PDF-HUL-38’. Open Preservation Foundation, 27 November 2022. https://openpreservation.org/blogs/trouble-shooting-pdf-validation-errors-a-case-of-pdf-hul-38/.
  
 

Latest revision as of 16:32, 5 February 2026




Pdfcpu
A Go library and command line tool for PDF processing incl. validation
Homepage: https://pdfcpu.io/
License: Apache-2.0 License
Input Formats: PDF
Function: Metadata Extraction, Validation
Content type: Document
Appears in COW: Validation Error Analysis and Treatment for PDF-hul 122 Invalid destination - Destination NULL, Validation Error Analysis and Treatment for PDF-hul 133 Invalid date



Description

pdfcpu is a Go library and command line tool that supports many PDF processing functions, such as reading and writing the xref table, extracting images, fonts and embedded file attachments, detecting encryption, and decrypting files. A full list of commands is available at: https://pdfcpu.io/about/command_set

Validation

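pdfcpu validates PDFs up to version 1.7. There are two levels of validation, strict and relaxed, plus an option to run validation in verbose mode, which shows the output of the entire PDF syntax. In the example below, a file fails strict validation because a Type 1 font dictionary lacks the required FirstChar entry, but passes relaxed validation.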

pdfcpu validate -mode strict this.pdf
validating(mode=strict) this.pdf ...
validation error (try -mode=relaxed): dict=type1FontDict required entry=FirstChar missing
pdfcpu validate -mode relaxed this.pdf
validating(mode=relaxed) this.pdf ...
validation ok
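
When many files are validated in a batch, it can help to turn this text output into a status per file. The following is a minimal Python sketch, not part of pdfcpu. It assumes the output looks like the sample transcript above; the function names `classify_validation` and `summarize` are hypothetical, and the message format may differ between pdfcpu versions.

```python
import re

# Patterns inferred from the sample transcript on this page; they are an
# assumption, not a documented pdfcpu output contract.
_ERROR_RE = re.compile(r"^validation error(?: \(.*?\))?:\s*(.*)$", re.MULTILINE)
_OK_RE = re.compile(r"^validation ok\s*$", re.MULTILINE)


def classify_validation(output):
    """Return ("ok", None), ("error", message) or ("unknown", None)."""
    match = _ERROR_RE.search(output)
    if match:
        return ("error", match.group(1).strip())
    if _OK_RE.search(output):
        return ("ok", None)
    return ("unknown", None)


def summarize(strict_output, relaxed_output):
    """Combine the text of a strict run and a relaxed run into one label."""
    strict_status, strict_msg = classify_validation(strict_output)
    relaxed_status, _ = classify_validation(relaxed_output)
    if strict_status == "ok":
        return "valid (strict)"
    if strict_status == "error" and relaxed_status == "ok":
        return "valid (relaxed only): " + strict_msg
    return "invalid or unrecognized output"
```

The two arguments to `summarize` would be the captured text of the strict and the relaxed run for the same file. Keeping the strict error message in the label preserves the reason a file only passes in relaxed mode.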

Metadata Extraction

Metadata can be extracted in two ways. The info command prints information such as title, author, PDF producer, creation and modification dates, and some technical metadata such as encryption and permission information and whether the PDF is tagged, linearized or includes watermarks. Sample output:

pdfcpu info this.pdf
        PDF version: 1.6
         Page count: 96
          Page size: 21.00 x 29.70 cm
   .........................................
              Title: This is just a test
             Author: Digiman
            Subject: 
       PDF Producer: Adobe PDF Library 15.0
    Content creator: Adobe InDesign CC 207 (Macintosh)
      Creation date: D:20190912181416+02'00'
  Modification date: D:20190918120753+02'00'
           Keywords: key1
                     key2
   ..........................................
             Tagged: Yes
             Hybrid: No
         Linearized: No
 Using XRef streams: Yes
 Using object streams: Yes
         Watermarks: No
   ..........................................
          Encrypted: No
        Permissions: Full access
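
For further processing, the info output can be read into a simple key-value structure. The sketch below is an illustration in Python, not part of pdfcpu. It assumes the "key: value" layout of the sample above, with dotted separator lines and continuation lines for multi-valued entries such as Keywords. The names `parse_info` and `parse_pdf_date` are hypothetical, and the date parser only handles the full date form shown in the sample, not the shortened forms that PDF also allows.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_info(text):
    """Parse "key: value" lines into a dict mapping each key to a list of values.

    Lines without a colon are treated as continuation values of the previous
    key; a continuation value that itself contains a colon would be misread.
    """
    info = {}
    last_key = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or set(stripped) == {"."}:
            continue  # blank line or dotted separator
        key, sep, value = stripped.partition(":")
        if sep:
            last_key = key.strip()
            value = value.strip()
            info[last_key] = [value] if value else []
        elif last_key is not None:
            info[last_key].append(stripped)
    return info


_PDF_DATE_RE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})([+\-Z])?(?:(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(value):
    """Convert a full PDF date string into a datetime, or return None."""
    match = _PDF_DATE_RE.match(value)
    if not match:
        return None
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    sign, off_h, off_m = match.groups()[6:]
    tz = None  # no offset given: return a naive datetime
    if sign == "Z":
        tz = timezone.utc
    elif sign in ("+", "-"):
        delta = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
        tz = timezone(delta if sign == "+" else -delta)
    try:
        return datetime(year, month, day, hour, minute, second, tzinfo=tz)
    except ValueError:
        return None  # e.g. month 13 or an offset outside the allowed range
```

Storing every value as a list keeps single-valued and multi-valued entries uniform, so an empty Subject becomes an empty list and the two keywords become a two-item list.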

The second option is to extract any embedded metadata with the extract command and the -mode meta option. This writes the extracted metadata entries as txt files to a specified directory. Sample output:

pdfcpu extract -mode meta this.pdf mdout
  extracting metadata from this.pdf into mdout/ ...
  writing mdout\this_Metadata_XObject_6499_6500.txt
  writing mdout\this_Metadata_unknown_401_33.txt
  writing mdout\this_Metadata_XObject_292_289.txt
  writing mdout\this_Metadata_Catalog_6455_385.txt
  writing mdout\this_Metadata_XObject_291_290.txt
  writing mdout\this_Metadata_unknown_6491_6475.txt
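
A file can contain many metadata entries, so grouping the written files by the object kind in their names gives a quick overview. This Python sketch is an illustration only. It assumes the filename pattern seen in the sample above; the function name `group_metadata_files` is hypothetical, and the meaning of the two trailing numbers is not documented on this page, so they are left uninterpreted.

```python
import re
from collections import defaultdict

# Filename pattern inferred from the sample output above (an assumption).
_WRITING_RE = re.compile(
    r"^\s*writing\s+(?P<path>.*_Metadata_(?P<kind>[A-Za-z]+)_\d+_\d+\.txt)\s*$"
)


def group_metadata_files(output):
    """Map each object kind (e.g. "XObject", "Catalog") to the paths written."""
    groups = defaultdict(list)
    for line in output.splitlines():
        match = _WRITING_RE.match(line)
        if match:
            groups[match.group("kind")].append(match.group("path"))
    return dict(groups)
```

Lines that do not start with "writing", such as the "extracting metadata" status line, are ignored.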


User Experiences

Development Activity

pdfcpu has an active user and developer community. All activity can be viewed in the GitHub repository: https://github.com/pdfcpu/pdfcpu