Workflow:PDF/A validation and metadata extraction
{{Infobox COW
|status=Experimental
|tools=GNU Wget, VeraPDF, UKWA search interface, XMLstarlet, Excel
|input=Corpus of PDF/A files
|output=CSV with validation result and metadata
|organisation=Digital Preservation Coalition
|organisationurl=http://dpconline.org/
|name=PDF/A validation and metadata extraction
}}
Latest revision as of 08:36, 28 April 2021
Workflow description
- UKWA search interface (create URL list of PDF/A files in UKWA) ->
- GNU Wget (download files from IA to create the test corpus) ->
- veraPDF (validate and extract metadata from test corpus) ->
- XMLstarlet (process XML, extract useful fields into a CSV) ->
- Excel (view and analyse results in spreadsheet form)
The workflow begins with the creation of a corpus of test files, constructed using the UK Web Archive search interface (for example, see this query: https://www.webarchive.org.uk/shine/search?query=content_type:%22application/pdf%22%20content_type_version:%221b%22). The result is a list of URLs, which are fetched with Wget to create a large test corpus of predominantly PDF/A files. veraPDF is then used to validate each file and extract its metadata. XMLstarlet is applied to extract fields of interest from the resulting XML, creating a CSV, which is then imported into Excel for analysis.
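The XML-to-CSV step can be illustrated with a minimal Python sketch, offered as an alternative to XMLstarlet rather than as the method used in the original workflow. The sample report, its element and attribute names (`job`, `item/name`, `validationReport`, `isCompliant`, `details`, `failedRules`) and the CSV column names are all assumptions made for illustration; they are loosely modelled on a machine-readable validation report and should be checked against the actual veraPDF output before use.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample report. The element and attribute names are assumptions
# for illustration and may not match the output of any veraPDF release.
SAMPLE_REPORT = """<report>
  <jobs>
    <job>
      <item><name>corpus/a.pdf</name></item>
      <validationReport profileName="PDF/A-1B validation profile" isCompliant="true">
        <details passedRules="102" failedRules="0"/>
      </validationReport>
    </job>
    <job>
      <item><name>corpus/b.pdf</name></item>
      <validationReport profileName="PDF/A-1B validation profile" isCompliant="false">
        <details passedRules="99" failedRules="3"/>
      </validationReport>
    </job>
  </jobs>
</report>"""

# Hypothetical column names for the resulting CSV.
FIELDS = ["file", "profile", "compliant", "failed_rules"]


def report_to_rows(xml_text):
    """Return one dict per <job> element, holding the fields of interest."""
    root = ET.fromstring(xml_text)
    rows = []
    for job in root.iter("job"):
        row = dict.fromkeys(FIELDS, "")
        row["file"] = job.findtext("item/name", default="")
        validation = job.find("validationReport")
        if validation is not None:
            row["profile"] = validation.get("profileName", "")
            row["compliant"] = validation.get("isCompliant", "")
            details = validation.find("details")
            if details is not None:
                row["failed_rules"] = details.get("failedRules", "")
        # Jobs without a validation report keep empty values for those columns.
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(report_to_rows(SAMPLE_REPORT)), end="")
```

The resulting CSV text can be saved to a file and opened in Excel for the analysis step. Keeping one row per file makes it straightforward to filter on the compliance column or sort by the number of failed rules.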
Purpose, context and content
This is a basic, manual workflow for ad-hoc investigation of a set of PDF/A files. It was used for testing purposes as part of the development of veraPDF, and is described further in this blog post: http://dpconline.org/blog/pdf-eh-redux-putting-verapdf-into-practice
Evaluation/Review
The workflow was useful in that it helped to identify a number of bugs in veraPDF that have since been fixed. As it stands, the workflow provides a very basic method of investigating PDF files, which may be useful where some troubleshooting is required.
Prwheatley (talk) 11:51, 8 February 2017 (UTC)