Difference between revisions of "Workflow:PDF/A validation and metadata extraction"

	PDF/A validation and metadata extraction
Status:	Experimental
Tools:	GNU Wget; VeraPDF; UKWA search interface; XMLstarlet; Excel
Input:	Corpus of PDF/A files
Output:	CSV with validation result and metadata
Organisation:	Digital Preservation Coalition

Latest revision as of 08:36, 28 April 2021

Workflow description

UKWA search interface (create URL list of PDF/A files in UKWA) ->
GNU_Wget (download files from IA, to create the test corpus) ->
veraPDF (validates and extracts metadata from test corpus) ->
XMLstarlet (process XML, extract useful fields into a CSV) ->
Excel (view and analyse results in spreadsheet form)
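
The download step in the chain above could be scripted along the following lines. This is a minimal sketch rather than the method used by the workflow's author: the function names and the file and directory names are hypothetical, and it assumes the URL list from the search interface has been saved as a plain text file with one URL per line. The only Wget options used are `--input-file` and `--directory-prefix`.

```python
import subprocess


def clean_url_list(lines):
    """Strip whitespace, drop blank lines, and remove duplicates, keeping order."""
    seen = set()
    urls = []
    for line in lines:
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def build_wget_command(url_file, dest_dir):
    """Build the Wget call that fetches every URL listed in url_file."""
    # --input-file reads URLs from a file; --directory-prefix sets the target folder.
    return ["wget", f"--input-file={url_file}", f"--directory-prefix={dest_dir}"]


def download_corpus(url_file, dest_dir):
    """Run Wget and return its exit code (non-zero if any download failed)."""
    # check=False: a few failed downloads should not abort corpus building.
    return subprocess.run(build_wget_command(url_file, dest_dir), check=False).returncode
```

Calling `download_corpus("urls.txt", "corpus")` would then fill a hypothetical `corpus` directory, provided Wget is installed and on the path.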

The workflow begins by building a corpus of test files using the UK Web Archive search interface (for example, this query: https://www.webarchive.org.uk/shine/search?query=content_type:%22application/pdf%22%20content_type_version:%221b%22). The search yields a list of URLs, which are fetched with Wget to create a large test corpus of predominantly PDF/A files. Each file is then validated with veraPDF, which also extracts its metadata. XMLstarlet extracts the fields of interest from the resulting XML into a CSV, which is imported into Excel for analysis.
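
The XML-to-CSV step can also be illustrated without XMLstarlet. The sketch below uses Python's standard library in its place, so it is a swapped-in technique and not the tool named in the workflow. The sample report is a hypothetical, simplified structure invented for illustration; the element and attribute names in a real veraPDF report may differ and would need to be adjusted.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, simplified report. A real veraPDF report may use other names.
SAMPLE_REPORT = """<report>
  <job>
    <item><name>corpus/a.pdf</name></item>
    <validationReport profileName="PDF/A-1B" isCompliant="true"/>
  </job>
  <job>
    <item><name>corpus/b.pdf</name></item>
    <validationReport profileName="PDF/A-1B" isCompliant="false"/>
  </job>
</report>"""


def report_to_csv(xml_text):
    """Flatten one report into CSV text with one row per validated file."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["file", "profile", "compliant"])
    for job in root.iter("job"):
        name = job.findtext("item/name", default="")
        validation = job.find("validationReport")
        # A job without a validation result still gets a row, with empty fields.
        profile = validation.get("profileName", "") if validation is not None else ""
        compliant = validation.get("isCompliant", "") if validation is not None else ""
        writer.writerow([name, profile, compliant])
    return out.getvalue()


if __name__ == "__main__":
    print(report_to_csv(SAMPLE_REPORT), end="")
```

The resulting text can be written to a `.csv` file and opened in Excel in the same way as the XMLstarlet output.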

Purpose, context and content

This is a very simple, manual workflow for ad hoc investigation of a set of PDF/A files. It was used for testing purposes as part of the development of veraPDF and is described further in this blog post: http://dpconline.org/blog/pdf-eh-redux-putting-verapdf-into-practice

Evaluation/Review

The workflow was useful in that it helped to identify a number of bugs in veraPDF that have since been fixed. As it stands, the workflow provides a very basic method of investigating PDF files, for instance where some troubleshooting is required.

Prwheatley (talk) 11:51, 8 February 2017 (UTC)

@@ Line 1: / Line 1: @@
-{{Infobox_COW
+{{Infobox COW
+|status=Experimental
+|tools=GNU Wget, VeraPDF, UKWA search interface, XMLstarlet, Excel
+|input=Corpus of PDF/A files
+|output=CSV with validation result and metadata
+|organisation=Digital Preservation Coalition
+|organisationurl=http://dpconline.org/
 |name=PDF/A validation and metadata extraction
-|status=Productive
-|tools=UKWA search interface<br />[[GNU_Wget]]<br />[[veraPDF]]<br />XMLstarlet<br />Excel
-|input= Corpus of PDF/A files
-|output= CSV with validationresult and metadata
-|organisation=[http://dpconline.org/ Digital Preservation Coalition]
 }}
-[[Category:COW Workflows]]
 ==Workflow description==
 <!-- Describe your workflow here. If necessary add a diagram -->
@@ Line 18: / Line 17: @@
 The workflow begins with creation of a corpus of test files which is constructed using the UK Web Archive search interface (for example see [https://www.webarchive.org.uk/shine/search?query=content_type:%22application/pdf%22%20content_type_version:%221b%22 this]). The result is a list of URLs. These are fetched with Wget to create a large test corpus of predominantly PDF/A files. VeraPDF is then used to validate and extract metadata for each file. XMLstarlet is applied to extract fields of interest from the resulting XML creating a CSV. The CSV is then imported to Excel for analysis.
-==List of Tools==
-<!-- List the tools in your workflow in a bulleted list (begin each line with an asterisk). Link to tool entries in COPTR where possible -->
-*UKWA search interface
-*[[GNU_Wget]]
-*[[veraPDF]]
-*XMLstarlet
-*Excel
 ==Purpose, context and content==
