Difference between revisions of "Workflow:PDF/A validation and metadata extraction"

From COPTR
Revision as of 10:54, 6 September 2018

PDF/A validation and metadata extraction
Status:Productive
Tools:
  • UKWA search interface
  • GNU_Wget
  • veraPDF
  • XMLstarlet
  • Excel
Input:Corpus of PDF/A files
Output:CSV with validation result and metadata
Organisation:Digital Preservation Coalition

Workflow description

  • UKWA search interface (create URL list of PDF/A files in UKWA) ->
  • GNU_Wget (download files from IA, to create the test corpus) ->
  • veraPDF (validates and extracts metadata from test corpus) ->
  • XMLstarlet (process XML, extract useful fields into a CSV) ->
  • Excel (view and analyse results in spreadsheet form)
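The download and validation steps above can be sketched in Python as follows. This is a minimal illustration, not the commands used in the original work: the file and directory names are hypothetical, and the veraPDF flags (`--extract`, `--format mrr`) are assumptions that should be checked against `verapdf --help` for the installed version. The Wget options `-i` (read URLs from a file) and `-P` (download directory) are standard.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def wget_command(url_list: Path, corpus_dir: Path) -> list[str]:
    """Build a Wget call that fetches every URL listed in url_list."""
    # -i reads URLs from a file, -P sets the directory files are saved to.
    return ["wget", "-i", str(url_list), "-P", str(corpus_dir)]


def verapdf_command(pdf: Path) -> list[str]:
    """Build a veraPDF call for a single file (flags are assumptions)."""
    # Assumed: --extract adds feature (metadata) extraction and
    # --format mrr requests the machine-readable XML report.
    return ["verapdf", "--extract", "--format", "mrr", str(pdf)]


def validate_corpus(corpus_dir: Path, report_dir: Path) -> None:
    """Run veraPDF over each downloaded file, saving one XML report per file."""
    report_dir.mkdir(parents=True, exist_ok=True)
    # Archived downloads may lack a .pdf extension, so take every regular file.
    for pdf in sorted(p for p in corpus_dir.iterdir() if p.is_file()):
        result = subprocess.run(
            verapdf_command(pdf), capture_output=True, text=True
        )
        (report_dir / (pdf.name + ".xml")).write_text(result.stdout)
```

With Wget and veraPDF on the PATH, one might run `subprocess.run(wget_command(Path("urls.txt"), Path("corpus")))` and then `validate_corpus(Path("corpus"), Path("reports"))`. Writing one report per file keeps a failure on a single PDF from affecting the rest of the run.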

The workflow begins with the creation of a corpus of test files, constructed using the UK Web Archive search interface (for example, see this query: https://www.webarchive.org.uk/shine/search?query=content_type:%22application/pdf%22%20content_type_version:%221b%22). The result is a list of URLs. These are fetched with Wget to create a large test corpus of predominantly PDF/A files. veraPDF is then used to validate each file and extract its metadata. XMLstarlet is applied to extract fields of interest from the resulting XML, creating a CSV. The CSV is then imported into Excel for analysis.
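The XML-to-CSV step can be illustrated with a short Python sketch. The original workflow used XMLstarlet; Python's standard library is swapped in here purely for illustration. The sample report is a simplified, assumed structure (element and attribute names such as `validationReport`, `isCompliant` and `failedRules` should be checked against the actual veraPDF output), and the chosen CSV columns are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Simplified, assumed report layout; real veraPDF output contains more detail.
SAMPLE_REPORT = """<report>
  <jobs>
    <job>
      <item size="1024"><name>corpus/a.pdf</name></item>
      <validationReport profileName="PDF/A-1B validation profile" isCompliant="true">
        <details passedRules="102" failedRules="0"/>
      </validationReport>
    </job>
    <job>
      <item size="2048"><name>corpus/b.pdf</name></item>
      <validationReport profileName="PDF/A-1B validation profile" isCompliant="false">
        <details passedRules="99" failedRules="3"/>
      </validationReport>
    </job>
  </jobs>
</report>"""

FIELDS = ["file", "profile", "compliant", "failed_rules"]


def report_to_rows(xml_text):
    """Pull the fields of interest out of each <job> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for job in root.iter("job"):
        report = job.find("validationReport")
        details = report.find("details") if report is not None else None
        rows.append({
            "file": job.findtext("item/name", default=""),
            "profile": report.get("profileName", "") if report is not None else "",
            "compliant": report.get("isCompliant", "") if report is not None else "",
            "failed_rules": details.get("failedRules", "") if details is not None else "",
        })
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling `rows_to_csv(report_to_rows(SAMPLE_REPORT))` yields CSV text that can be saved to a file and opened in Excel. Jobs with no validation report produce empty cells rather than an error, which keeps files that failed to parse visible in the spreadsheet.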

Purpose, context and content

This is a basic, manual workflow for ad-hoc investigation of a set of PDF/A files. It was used for testing purposes as part of the development of veraPDF, and is described further in this blog post: http://dpconline.org/blog/pdf-eh-redux-putting-verapdf-into-practice

Evaluation/Review

The workflow was useful in that it helped to identify a number of bugs in veraPDF that have since been fixed. As it stands, the workflow provides a very basic method of investigating PDF files, for example where some troubleshooting is required.

Prwheatley (talk) 11:51, 8 February 2017 (UTC)