Web Archive Discovery

From COPTR


Indexing and discovery tools for web archives.
Homepage: https://github.com/ukwa/webarchive-discovery
License: Mixed
Platforms: Java
Function: Metadata Extraction, File Format Identification, Content Profiling, Discovery
Content type: Web



Description

Full-text indexing system, using Apache Solr as the search back-end. Supports command-line or large-scale map-reduce (Hadoop) processing of ARC and WARC files. Also integrates file format analysis and scans for some known preservation risks.
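
To illustrate the kind of input the indexer consumes, the following is a minimal sketch of iterating over records in an uncompressed WARC file. It is not the tool's own code (the tool is written in Java); the function name `read_warc_records` is hypothetical, and the sketch assumes well-formed records with a `Content-Length` header and no gzip compression.

```python
import io


def read_warc_records(stream):
    """Yield (headers, body) pairs from an uncompressed WARC byte stream.

    Illustrative only: assumes every record carries a Content-Length
    header and that the stream is not gzip-compressed.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if line.strip() == b"":
            continue  # blank lines separating records
        version = line.strip().decode("ascii")
        if not version.startswith("WARC/"):
            raise ValueError(f"unexpected record start: {version!r}")

        # Named header fields run until the first empty line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The record body is exactly Content-Length bytes long.
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body


if __name__ == "__main__":
    sample = (
        b"WARC/1.0\r\n"
        b"WARC-Type: response\r\n"
        b"WARC-Target-URI: http://example.org/\r\n"
        b"Content-Length: 5\r\n"
        b"\r\n"
        b"hello\r\n\r\n"
    )
    for headers, body in read_warc_records(io.BytesIO(sample)):
        print(headers["WARC-Type"], headers["WARC-Target-URI"], len(body))
```

A real pipeline would additionally handle per-record gzip compression and the older ARC format, both of which are outside the scope of this sketch.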

It runs format identification using both Apache Tika and DROID, and also extracts filename extensions and the first few bytes of each resource, to enable analysis of formats unknown to those tools.
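
The fallback described above (extension plus leading bytes) can be sketched as follows. This is an illustration of the idea rather than the tool's implementation: the function name `fingerprint_resource`, the returned keys, and the default of four bytes are all assumptions made for the example.

```python
import posixpath
from urllib.parse import urlsplit


def fingerprint_resource(url, payload, n_bytes=4):
    """Return the filename extension and leading bytes of a resource.

    Hypothetical helper: `n_bytes` defaults to 4 here purely for
    illustration.
    """
    # Use only the URL path, so query strings do not pollute the extension.
    path = urlsplit(url).path
    ext = posixpath.splitext(path)[1].lower().lstrip(".")
    return {
        "extension": ext or None,
        # Hex encoding keeps binary signatures readable and indexable.
        "first_bytes": payload[:n_bytes].hex(),
    }


if __name__ == "__main__":
    print(fingerprint_resource("http://example.org/docs/report.PDF?x=1", b"%PDF-1.4\n"))
```

Grouping resources by these two values makes it possible to spot clusters of files that share a signature even when no identification tool recognises the format.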

It can also be configured to use Apache PDFBox's Preflight tool to scan PDFs for PDF/A violations, and record them in the index for analysis.

It also records any parsing errors or other access problems it discovers, which can help find problematic resources.
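
Because these problems are recorded in the Solr index, they can be explored with an ordinary faceted query. The sketch below only builds the request URL using standard Solr parameters (`q`, `rows`, `facet`, `facet.field`, `wt`). The base URL, the core name `discovery`, and the field names `parse_error` and `content_type` are assumptions for illustration; check them against the schema actually deployed.

```python
from urllib.parse import urlencode


def build_facet_query(solr_base, core, query, facet_field, rows=0):
    """Build a Solr select URL that facets matching documents on one field.

    The core and field names passed in are placeholders chosen by the
    caller; nothing here is specific to a particular schema.
    """
    params = [
        ("q", query),
        ("rows", rows),          # rows=0: only facet counts are needed
        ("facet", "true"),
        ("facet.field", facet_field),
        ("wt", "json"),
    ]
    return f"{solr_base.rstrip('/')}/{core}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical example: count resources with a recorded parse error,
    # broken down by content type.
    print(build_facet_query(
        "http://localhost:8983/solr", "discovery", "parse_error:*", "content_type"
    ))
```

Fetching the resulting URL from a running Solr instance would return facet counts showing which formats most often fail to parse.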

For more information, see the Features page of the tool's wiki.

User Experiences

Development Activity

All development activity is visible on GitHub: http://github.com/ukwa/webarchive-discovery

There is also a #webarchive-discovery channel on the IIPC Slack service. Contact https://twitter.com/NetPreserve for details.

Release Feed

The three most recent releases:

2020-11-27 12:25:29
warc-discovery-3.1.0
by anjackson
2018-09-19 09:55:55
warc-discovery-3.0.0
by anjackson
2017-10-25 14:14:15
3.0.0-BETA-1: Merge pull request #129 from ruebot/archiveit_fields
by anjackson


Activity Feed

The five most recent commits:

2021-06-14 20:18:59
Merge pull request #255 from netarchivesuite/srcset (commit 510eb711a273c7cbbf7d69dae64b0bbbd426d8ea)
by anjackson https://github.com/anjackson
2021-06-14 13:08:40
Add support for indexing srcset links as image links (commit c5789562305d867d3c06d96f5a8ad9beca7fe038)
by tokee https://github.com/tokee
2021-06-08 21:28:28
Merge pull request #254 from ukwa/upgrade-nanite-1.4.1 (commit ad6d52bbf59a11258b1fc74a3e9bf9823a8f8955)
by anjackson https://github.com/anjackson
2021-06-08 21:26:39
Update copyright headers. (commit 203502219860a71a96b59b7c0f994c8e9565a212)
by anjackson https://github.com/anjackson
2021-06-08 21:17:27
Update Nanite to 1.4.1-97 for #252. (commit ef892a0f0d5e8cb4ca7808e588d14f389d4fce88)
by anjackson https://github.com/anjackson