Web Archive Discovery

Revision as of 14:34, 21 April 2021 by Prwheatley (talk | contribs)

Indexing and discovery tools for web archives.
Function: Metadata Extraction, File Format Identification, Content Profiling, Discovery
Content type: Web


A full-text indexing system that uses Apache Solr as its search back-end. It supports command-line or large-scale MapReduce (Hadoop) processing of ARC and WARC files, and also integrates file format analysis and scans for some known preservation risks.
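To give a feel for the input the indexer processes, here is a minimal sketch of reading records from an uncompressed WARC stream. It is not the tool's own parser (which is written in Java). The function name `iter_warc_records` and the `SAMPLE` data are hypothetical, and the sketch assumes well-formed records that each carry a `Content-Length` header.

```python
import io

def iter_warc_records(stream):
    """Yield (headers, payload) pairs from an uncompressed WARC byte stream.

    Illustrative sketch only: real WARC files are usually gzip-compressed
    record by record, which is not handled here.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload is read by declared length, not by scanning for a delimiter.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload

# Hypothetical two-record sample used for illustration.
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.org/a.txt\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello\r\n\r\n"
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.org/b.pdf\r\n"
    b"Content-Length: 8\r\n"
    b"\r\n"
    b"%PDF-1.4\r\n\r\n"
)

for headers, payload in iter_warc_records(io.BytesIO(SAMPLE)):
    print(headers["WARC-Target-URI"], len(payload))
```

Each yielded pair could then be handed to text extraction and format analysis before being posted to the search back-end.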

It runs format identification using both Apache Tika and DROID, and also extracts filename extensions and the first few bytes of each resource, to enable analysis of formats unknown to those tools.
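The extension and leading-bytes extraction can be sketched as follows. This is an illustrative approximation, not the tool's actual implementation. The function names and the default of four bytes are assumptions made for the example.

```python
import posixpath
from urllib.parse import urlsplit

def extension_of(url):
    """Return the lower-cased filename extension from a URL path, or '' if none."""
    path = urlsplit(url).path  # drops any query string and fragment
    _, ext = posixpath.splitext(posixpath.basename(path))
    return ext.lower().lstrip(".")

def first_bytes_hex(payload, n=4):
    """Return the first n bytes of a resource as a hex string (n=4 is an assumption)."""
    return payload[:n].hex()

def fingerprint(url, payload, n=4):
    """Combine both signals into a small record suitable for indexing."""
    return {"ext": extension_of(url), "first_bytes": first_bytes_hex(payload, n)}

print(fingerprint("http://example.org/docs/report.PDF?x=1", b"%PDF-1.4\n"))
```

Grouping resources by these two values makes it possible to spot clusters of files that share a signature even when neither Tika nor DROID recognises the format.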

It can also be configured to use Apache PDFBox's Preflight tool to scan PDFs for PDF/A violations, and record them in the index for analysis.

It also records any parsing errors or other access problems it discovers, which can help find problematic resources.
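Once errors are recorded in the index, a Solr facet query is one way to summarise them. The sketch below only builds the query URL using standard Solr `select` parameters. The host, the core name `discovery`, and the field name `parse_error` are assumptions for illustration and should be checked against the Solr schema actually deployed.

```python
from urllib.parse import urlencode

def build_error_facet_url(base_url, core, field="parse_error", limit=20):
    """Build a Solr select URL that facets on an error field.

    The default field name is an assumption; check the deployed schema.
    """
    params = [
        ("q", f"{field}:*"),         # only documents that have the field set
        ("rows", "0"),               # no documents needed, only facet counts
        ("facet", "true"),
        ("facet.field", field),
        ("facet.limit", str(limit)),
        ("facet.mincount", "1"),
        ("wt", "json"),
    ]
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"

# Hypothetical host and core name.
print(build_error_facet_url("http://localhost:8983/solr", "discovery"))
```

Fetching the resulting URL would return the most frequent error values with their counts, which gives a starting point for locating problematic resources.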

For more information, see the Features page of the tool's wiki.

User Experiences

Development Activity

All development activity is visible on GitHub: http://github.com/ukwa/webarchive-discovery

There is also a #webarchive-discovery channel on the IIPC Slack service. Contact https://twitter.com/NetPreserve for details.

Release Feed

The last 3 releases:

2020-11-27 12:25:29
warc-discovery-3.1.0
by anjackson
2018-09-19 09:55:55
warc-discovery-3.0.0
by anjackson
2017-10-25 14:14:15
3.0.0-BETA-1: Merge pull request #129 from ruebot/archiveit_fields
by anjackson

Activity Feed

The last 5 commits:

2022-04-02 11:05:53
Merge pull request #269 from aponb/elastic2opensearch_migration (commit 0ef8d4cb4940d5377cc247a4ba0837873fb85c23)
by anjackson (https://github.com/anjackson)
2022-04-01 19:38:26
Merge branch 'master' into elastic2opensearch_migration (commit f0d9f7290d2e8fbcdd274187918b8e0a25973c39)
by aponb (https://github.com/aponb)
2022-03-30 08:18:02
Merge pull request #263 from netarchivesuite/elastic_text (commit 02ee73e7a765bc07794475c472581971abfb9e0f)
by anjackson (https://github.com/anjackson)
2022-03-30 08:17:09
Merge pull request #270 from netarchivesuite/warcit (commit 68c3fe4a68369f1748f5bbe9fdc18182801863a8)
by anjackson (https://github.com/anjackson)
2022-03-30 08:15:41
Merge pull request #257 from netarchivesuite/field_rewrite (commit b83a188abf0615c3cebe79aaaf03f18b524c7245)
by anjackson (https://github.com/anjackson)