Web Archive Discovery
It runs format identification using both [[Apache Tika]] and [[DROID]], and also extracts filename extensions and the first few bytes of each resource, to enable analysis of formats unknown to those tools.
It can also be configured to use [[Apache PDFBox Preflight]] to scan PDFs for PDF/A violations and record them in the index for analysis.
It also records any parsing errors or other access problems it discovers, which can help find problematic resources.
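
The fallback hints mentioned above (the filename extension and the first few bytes of a resource) can be illustrated with a minimal Python sketch. This is not the tool's own code: the function name <code>fallback_format_hints</code>, the field names, and the default of four bytes are assumptions made purely for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def fallback_format_hints(url: str, payload: bytes, n_bytes: int = 4) -> dict:
    """Return simple format hints for a resource that other tools may not recognise.

    Hypothetical helper: the name, the returned keys and the n_bytes default
    are illustrative assumptions, not part of any real indexer API.
    """
    # Use only the path component, so query strings and fragments are ignored.
    path = urlsplit(url).path
    # Lower-case the extension and drop the leading dot; None if there is none.
    extension = PurePosixPath(path).suffix.lower().lstrip(".") or None
    # Hex-encode the leading bytes so they can be stored and grouped as text.
    first_bytes = payload[:n_bytes].hex()
    return {"extension": extension, "first_bytes": first_bytes}


if __name__ == "__main__":
    print(fallback_format_hints("http://example.org/docs/report.PDF?x=1", b"%PDF-1.4\n"))
```

Grouping indexed resources by such a pair of values is one way to spot clusters of files that share a signature but were not identified by the format tools.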