Web Archive Discovery

Indexing and discovery tools for web archives.
Function: Metadata Extraction, File Format Identification, Content Profiling, Discovery
Content type: Web


A full-text indexing system for web archives that uses Apache Solr as its search back-end. It supports command-line or large-scale MapReduce (Hadoop) processing of ARC and WARC files. It also integrates file format analysis and scans for some known preservation risks.
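Because the index lives in Solr, it can be queried over Solr's standard HTTP select handler. The following is a minimal sketch of building such a query URL, not a documented interface of this tool: the base URL, the core name `discovery`, and the facet field name `content_type` are assumptions for illustration and should be checked against the schema actually deployed.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, text, facet_field, rows=10):
    """Build a Solr /select URL for a full-text query with one facet.

    base_url and facet_field are supplied by the caller; the values used
    in the example below are hypothetical, not taken from the tool's docs.
    """
    params = [
        ("q", text),                   # full-text query string
        ("rows", rows),                # number of hits to return
        ("wt", "json"),                # ask Solr for a JSON response
        ("facet", "true"),             # enable faceting
        ("facet.field", facet_field),  # field to count values for
    ]
    return f"{base_url.rstrip('/')}/select?{urlencode(params)}"


# Hypothetical core name and field name, for illustration only.
url = build_solr_query_url(
    "http://localhost:8983/solr/discovery/", "climate change", "content_type"
)
print(url)
```

Faceting on a format field is one way to get a quick profile of which formats appear among the matching resources.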

It runs format identification using both Apache Tika and DROID, and also extracts filename extensions and the first few bytes of each resource, to enable analysis of formats unknown to those tools.
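The idea of recording the filename extension and the first few bytes can be sketched as follows. This is an illustrative approximation, not the tool's own code: the function name, the returned keys, and the default of four bytes are all assumptions.

```python
import posixpath
from urllib.parse import urlparse


def extension_and_first_bytes(url, payload, n_bytes=4):
    """Return the lower-cased URL path extension and the leading bytes as hex.

    n_bytes=4 is an arbitrary choice for this sketch; the tool's actual
    byte count and field names may differ.
    """
    path = urlparse(url).path  # drops the query string and fragment
    ext = posixpath.splitext(path)[1].lstrip(".").lower()
    return {"extension": ext, "first_bytes": payload[:n_bytes].hex()}


record = extension_and_first_bytes(
    "http://example.org/docs/Report.PDF?download=1", b"%PDF-1.4\n"
)
print(record)
```

Grouping resources by these two values makes it possible to spot clusters of files that share a signature even when no identification tool recognises the format.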

It can also be configured to use Apache PDFBox's Preflight tool to scan PDFs for PDF/A violations and record them in the index for analysis.

It also records any parsing errors or other access problems it discovers, which can help find problematic resources.

For more information, see the Features page of the tool's wiki.

User Experiences

Development Activity

All development activity is visible on GitHub: http://github.com/ukwa/webarchive-discovery

There is also a #webarchive-discovery channel on the IIPC Slack service. Contact https://twitter.com/NetPreserve for details.

Release Feed

The last 3 releases:

2023-06-02 11:04:22: warc-discovery-3.3.0 (by anjackson)
2020-11-27 12:25:29: warc-discovery-3.1.0 (by anjackson)
2018-09-19 09:55:55: warc-discovery-3.0.0 (by anjackson)

Activity Feed

The last 5 commits:

2023-06-27 14:28:40: Update status badges. (commit 9d07645953d252f415b1aad4dc43766258363ea5, by anjackson, https://github.com/anjackson)
2023-06-27 08:10:53: Update Change Log (commit 48da3fb74ac155ddceb8ab85e56660ae2a587439, by anjackson, https://github.com/anjackson)
2023-06-02 11:04:25: [maven-release-plugin] prepare for next development iteration (commit 5faa3848133305b3bd483b8108945e3ac33e03cd, by anjackson, https://github.com/anjackson)
2023-06-02 11:04:21: [maven-release-plugin] prepare release warc-discovery-3.3.0 (commit 07479e3303b3b33b90c9ec8a5d439855319d1adc, by anjackson, https://github.com/anjackson)
2023-03-31 16:51:23: Finally building and testing clean, but had to ignore some tests. To … (commit d41faf85f4ba55844a10879080bc30dc057d50a1, by anjackson, https://github.com/anjackson)