Web Archive Discovery
A full-text indexing system for web archives that uses Apache Solr as its search back-end. It can process ARC and WARC files either from the command line or at large scale as Hadoop map-reduce jobs. It also integrates file format analysis and scans for some known preservation risks.
It can also be configured to use Apache PDFBox Preflight to scan PDFs for PDF/A violations and record them in the index for analysis.
It also records any parsing errors or other access problems it discovers, which can help find problematic resources.
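Because the errors are recorded in the Solr index, they can be summarised with an ordinary faceted query. The following is a minimal sketch of how such a query URL might be built, not the tool's own client code. The Solr base URL, the core name `discovery` and the field name `parse_error` are assumptions made for illustration; check the Solr schema of your own deployment for the actual field names. Only standard Solr `/select` parameters are used.

```python
from urllib.parse import urlencode


def build_error_facet_url(solr_base, core, error_field, rows=0):
    """Build a Solr /select URL that counts records per distinct value of error_field."""
    params = [
        ("q", f"{error_field}:*"),      # match only records where the field has a value
        ("rows", str(rows)),            # 0 rows: we only want the facet counts
        ("facet", "true"),
        ("facet.field", error_field),   # group the matches by error message
        ("facet.mincount", "1"),        # hide values with no matching records
        ("wt", "json"),
    ]
    return f"{solr_base.rstrip('/')}/{core}/select?{urlencode(params)}"


# Hypothetical values: adjust the host, core and field name to your installation.
url = build_error_facet_url("http://localhost:8983/solr", "discovery", "parse_error")
print(url)
```

Fetching the resulting URL from a running Solr instance should return the facet counts as JSON, which gives a quick overview of which kinds of errors are most common in a collection.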
For more information, see the Features page of the tool's wiki.
- Used by the UK Web Archive to provide access to their collections. Some details in this blog post. Feel free to contact @UKWebArchive for more information.
- Also used by a number of other web archives, and by the WARCLight project.
All development activity is visible on GitHub: http://github.com/ukwa/webarchive-discovery
There is also a #webarchive-discovery channel on the IIPC Slack service. Contact https://twitter.com/NetPreserve for details.
Andy Jackson (100.0%)