Web Archive Discovery

Indexing and discovery tools for web archives.


Web Archive Discovery is a full-text indexing system that uses Apache Solr as its search back-end. It supports command-line or large-scale MapReduce (Hadoop) processing of ARC and WARC files. It also integrates file format analysis and scans for some known preservation risks.

It runs format identification using both Apache Tika and DROID, and also extracts filename extensions and the first few bytes of each resource, to enable analysis of formats unknown to those tools.
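As a rough illustration of the fallback signals described above (the filename extension and the first few bytes of a resource), the following minimal Python sketch derives both from a URL and a payload. It is not the tool's own code: the function name `fallback_format_hints`, the returned field names, and the default of four bytes are all assumptions made for this example.

```python
import os
from urllib.parse import urlsplit


def fallback_format_hints(url: str, payload: bytes, n_bytes: int = 4) -> dict:
    """Return simple format hints for a resource.

    Hypothetical helper: the name, the output keys and the default of
    4 bytes are illustrative assumptions, not the indexer's real schema.
    """
    # Take the extension from the URL path only, ignoring any query string.
    path = urlsplit(url).path
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    return {
        "extension": ext or None,                # None when the path has no extension
        "first_bytes": payload[:n_bytes].hex(),  # leading bytes as a hex string
    }


if __name__ == "__main__":
    print(fallback_format_hints("http://example.org/docs/report.PDF?x=1", b"%PDF-1.4\n"))
```

Grouping resources by these two values is one way to spot clusters of files that share a signature even when no identification tool recognises them.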

It can also be configured to use Apache PDFBox's Preflight tool to scan PDFs for PDF/A violations and record them in the index for analysis.

It also records any parsing errors or other access problems it discovers, which can help find problematic resources.
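Because these problems are recorded in the Solr index, problematic resources can in principle be found with an ordinary Solr query. The sketch below only builds a query URL for Solr's standard `select` handler. The base URL, the core name `discovery`, and the field name `parse_error` are assumptions for illustration; check the schema of your own index for the actual field names.

```python
from urllib.parse import urlencode


def build_error_query(solr_base: str, error_field: str = "parse_error", rows: int = 10) -> str:
    """Build a Solr select URL matching documents where `error_field` has any value.

    The default field name is a hypothetical placeholder, not a confirmed
    part of the tool's schema.
    """
    params = {
        "q": f"{error_field}:[* TO *]",  # Solr range syntax for "field has a value"
        "rows": rows,                    # number of results to return
        "wt": "json",                    # ask for a JSON response
    }
    return f"{solr_base.rstrip('/')}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local Solr core named "discovery".
    print(build_error_query("http://localhost:8983/solr/discovery"))
```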

For more information, see the Features page of the tool's wiki.

User Experiences

Development Activity

All development activity is visible on GitHub: http://github.com/ukwa/webarchive-discovery

There is also a #webarchive-discovery channel on the IIPC Slack service. Contact https://twitter.com/NetPreserve for details.

Release Feed

The last 3 releases:

2020-11-27 12:25:29: warc-discovery-3.1.0, by anjackson
2018-09-19 09:55:55: warc-discovery-3.0.0, by anjackson
2017-10-25 14:14:15: 3.0.0-BETA-1: Merge pull request #129 from ruebot/archiveit_fields, by anjackson

Activity Feed

The last 5 commits:

2021-10-12 14:14:33: Add opts and second step. (commit 10e6f07134d2eb3fe6535c43ccb9457ce360ef70), by anjackson https://github.com/anjackson
2021-10-12 14:08:49: Fix action workflow. (commit d8a5c5ff964fbd3e55fc04bacf7eec62921f2d7e), by anjackson https://github.com/anjackson
2021-10-12 14:06:26: Add push. (commit b1893be5dbfd1fc2410947df35c3492eec58a6f2), by anjackson https://github.com/anjackson
2021-10-12 13:57:09: Merge pull request #264 from netarchivesuite/double_jar (commit 98f716eaf3ccda288faaf9dd23ec7c504700341d), by anjackson https://github.com/anjackson
2021-10-12 13:56:42: Merge pull request #265 from netarchivesuite/file_flush (commit 45c3c54c45088ebc5c169b4205302bb367e34ab1), by anjackson https://github.com/anjackson