Web Archive Discovery

From COPTR



Indexing and discovery tools for web archives.
Homepage: https://github.com/ukwa/webarchive-discovery
License: Mixed
Platforms: Java
Function: Metadata Extraction, File Format Identification, Content Profiling, Discovery
Content type: Web



Description

Full-text indexing system, using Apache Solr as the search back-end. Supports command-line or large-scale map-reduce (Hadoop) processing of ARC and WARC files. Also integrates file format analysis and scans for some known preservation risks.

It runs format identification using both Apache Tika and DROID, and also extracts filename extensions and the first few bytes of each resource, to enable analysis of formats unknown to those tools.
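
As a rough illustration of the last point, the sketch below derives a filename extension from a resource URL and renders the leading bytes of its payload as hex. It is not the tool's actual code: the class name `FormatHints`, the methods `extensionOf` and `leadingBytesHex`, the example URL, and the choice of four bytes are all assumptions made for this example.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper class; not part of webarchive-discovery.
public class FormatHints {

    // Returns the lower-cased extension of the last path segment of a URL,
    // or "" when there is none (no dot, a leading dot only, or a trailing dot).
    static String extensionOf(String url) {
        String path;
        try {
            // URI.getPath() excludes the host, query string and fragment.
            path = URI.create(url).getPath();
        } catch (IllegalArgumentException e) {
            return ""; // not parseable as a URI
        }
        if (path == null) {
            return ""; // opaque URI such as mailto:
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        if (dot <= 0 || dot == name.length() - 1) {
            return "";
        }
        return name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Returns up to n leading bytes of the payload as a lower-case hex string.
    static String leadingBytesHex(byte[] payload, int n) {
        int len = Math.min(n, payload.length);
        StringBuilder sb = new StringBuilder(len * 2);
        for (int i = 0; i < len; i++) {
            sb.append(String.format("%02x", payload[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed sample data: the start of a PDF file.
        byte[] payload = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(extensionOf("http://example.org/docs/report.PDF?download=1"));
        System.out.println(leadingBytesHex(payload, 4));
    }
}
```

Stored alongside the Tika and DROID results, values like these would let resources that neither tool recognises still be grouped by extension or by a shared byte signature.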

It can also be configured to use Apache PDFBox's Preflight tool to scan PDFs for PDF/A violations and record them in the index for analysis.

It also records any parsing errors or other access problems it discovers, which can help find problematic resources.

For more information, see the Features page of the tool's wiki.

User Experiences

Development Activity

All development activity is visible on GitHub: http://github.com/ukwa/webarchive-discovery

There is also a #webarchive-discovery channel on the IIPC Slack service. Contact https://twitter.com/NetPreserve for details.

Release Feed

The last 3 releases:

2020-11-27 12:25:29
warc-discovery-3.1.0
by anjackson
2018-09-19 09:55:55
warc-discovery-3.0.0
by anjackson
2017-10-25 14:14:15
3.0.0-BETA-1: Merge pull request #129 from ruebot/archiveit_fields
by anjackson


Activity Feed

The last 5 commits:

2022-09-30 22:31:19
No = in CLI, and added mainClass. (commit 2041fa6b0b05547e8a48f5a60e14dd3878823655)
by anjackson https://github.com/anjackson
2022-09-30 13:20:24
Avoid over-logging. (commit ab92d14d2e70a84c926d4f1d526593d251df29e2)
by anjackson https://github.com/anjackson
2022-09-30 13:16:54
Add JSONL output for #299 (commit ad8b1f6302df8d931f3dccc924afac5aa7f22006)
by anjackson https://github.com/anjackson
2022-09-30 09:41:20
Tweak testing for #299 (commit 5043147e1b94d90ea087f0afff05175462a7434b)
by anjackson https://github.com/anjackson
2022-09-29 23:00:58
Adding in more data, support JSONL output. (commit baa131a5027cfcac8d4089a1ce13088f7b70c774)
by anjackson https://github.com/anjackson