Web Archive Discovery
It runs format identification using both [[Apache Tika]] and [[DROID]], and also extracts filename extensions and the first few bytes of each resource, to enable analysis of formats unknown to those tools.
It can also be configured to use [[Apache PDFBox Preflight]] to scan PDFs for PDF/A violations and record them in the index for analysis.
It also records any parsing errors or other access problems it discovers, which can help find problematic resources.
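As a rough illustration of the fallback described above (keeping the filename extension and the first few bytes of each resource), the following is a minimal Python sketch. It is not the tool's actual code: the function name <code>fallback_format_hints</code>, the field names <code>extension</code> and <code>first_bytes</code>, and the default of four bytes are all assumptions made for this example.

```python
import os
from urllib.parse import urlparse


def fallback_format_hints(url: str, payload: bytes, n_bytes: int = 4) -> dict:
    """Return the filename extension and leading bytes (as hex) of a resource.

    Hypothetical helper: the names and the 4-byte default are illustrative
    assumptions, not taken from the indexer itself.
    """
    # Use only the path component so query strings do not pollute the extension.
    path = urlparse(url).path
    ext = os.path.splitext(path)[1].lower().lstrip(".")
    return {
        "extension": ext or None,                # None when the path has no extension
        "first_bytes": payload[:n_bytes].hex(),  # hex string of the leading bytes
    }


if __name__ == "__main__":
    hints = fallback_format_hints(
        "http://example.org/docs/report.PDF?x=1", b"%PDF-1.4\n"
    )
    print(hints)
```

Grouping indexed records by values like these is one way to spot clusters of resources that share an extension or byte signature even when neither identification tool recognises the format.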
== User Experiences ==
<!-- Add hotlinks to user experiences with the tool (eg. blog posts). These should illustrate the effectiveness (or otherwise) of the tool. -->
* Used by the [http://www.webarchive.org.uk/ UK Web Archive] to provide access to their collections. More details TBA.
= Development Activity =