Difference between revisions of "Web Archive Discovery"

From COPTR
Revision as of 15:27, 24 September 2018



Indexing and discovery tools for web archives.
Homepage: https://github.com/ukwa/webarchive-discovery
License: Mixed
Platforms: Java

Description

Web Archive Discovery is a full-text indexing system that uses Apache Solr as its search back-end. It supports command-line or large-scale MapReduce (Hadoop) processing of ARC and WARC files. It also integrates file format analysis and scans for some known preservation risks.

It runs format identification using both Apache Tika and DROID, and also extracts filename extensions and the first few bytes of each resource, to enable analysis of formats unknown to those tools.
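
To illustrate why the extension and leading bytes are useful when Tika and DROID cannot identify a format, here is a minimal Python sketch of that kind of fallback extraction. It is not the tool's own code (the tool is written in Java); the function name `fallback_format_hints`, the returned keys, and the default of four bytes are assumptions made for this example only.

```python
import posixpath
from urllib.parse import urlsplit


def fallback_format_hints(url, payload, n_bytes=4):
    """Return the filename extension and leading bytes of a resource.

    Hypothetical helper: the name, the keys and the n_bytes default are
    illustrative and are not taken from webarchive-discovery.
    """
    # Use only the path component, so query strings do not pollute the extension.
    path = urlsplit(url).path
    ext = posixpath.splitext(path)[1].lower().lstrip(".")
    return {
        "extension": ext or None,                # None when the path has no extension
        "first_bytes": payload[:n_bytes].hex(),  # hex string of the leading bytes
    }


hints = fallback_format_hints("http://example.org/docs/report.PDF?x=1", b"%PDF-1.4\n")
print(hints)
```

Grouping indexed resources by values like these can reveal clusters of files that share a signature even when no identification tool recognises them.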

It can also be configured to use Apache PDFBox Preflight to scan PDFs for PDF/A violations and record them in the index for analysis.

It also records any parsing errors or other access problems it discovers, which can help find problematic resources.

For more information, see the Features page of the tool's wiki.

User Experiences

  • Used by the UK Web Archive to provide access to their collections. More details TBA.

Development Activity

All development activity is visible on GitHub: http://github.com/ukwa/webarchive-discovery

There is also a #webarchive-discovery channel on the IIPC Slack service. Contact https://twitter.com/NetPreserve for details.

Release Feed

The latest releases are listed in the feed at https://github.com/ukwa/webarchive-discovery/releases.atom


Activity Feed

The latest commits are listed in the feed at https://github.com/ukwa/webarchive-discovery/commits/master.atom



Contributors

Andy Jackson (100.0%)