Web Archive Discovery

{{Infobox tool
|purpose=Indexing and discovery tools for web archives.
|homepage=https://github.com/ukwa/webarchive-discovery
|license=Mixed
|platforms=Java
|function=Metadata Extraction, File Format Identification, Content Profiling, Discovery
|content=Web
}}
{{Infobox tool details
|ohloh_id=Heritrix
}}
== Description ==
<!-- Describe what the tool does, focusing on its digital preservation value. Keep it factual. -->
Web Archive Discovery is a full-text indexing system for web archives that uses Apache Solr as its search back-end. It processes ARC and WARC files either from the command line or at scale as map-reduce (Hadoop) jobs. It also integrates file format analysis and scans for some known preservation risks.

It runs format identification using both [[Apache Tika]] and [[DROID]], and also extracts filename extensions and the first few bytes of each resource, to enable analysis of formats unknown to those tools.
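
The fallback described above (keeping the filename extension and the leading bytes of each resource) can be illustrated with a minimal Python sketch. This is not the tool's own code, which is written in Java; the function name <code>fallback_format_hints</code>, the output keys, and the choice of four bytes are all assumptions made for illustration.

```python
import posixpath
from urllib.parse import urlparse


def fallback_format_hints(url: str, payload: bytes, n_bytes: int = 4) -> dict:
    """Return the filename extension and leading bytes (as hex) of a resource.

    Hypothetical helper: the key names and the default of 4 bytes are
    illustrative assumptions, not the tool's actual index schema.
    """
    # Use only the path component so query strings do not pollute the extension.
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1].lower().lstrip(".")
    return {
        "extension": ext or None,                 # None when the path has no extension
        "first_bytes": payload[:n_bytes].hex(),   # hex string of the leading bytes
    }


hints = fallback_format_hints("http://example.org/docs/report.PDF?x=1", b"%PDF-1.4\n")
print(hints)
```

Grouping resources by such extension and leading-byte values is one way to spot clusters of files that share a signature even when no identification tool recognises them.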

It can also be configured to use [[Apache PDFBox Preflight]] to scan PDFs for PDF/A violations and record them in the index for analysis.

It also records any parsing errors or other access problems it discovers, which can help find problematic resources.
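
Because the results are stored in a Solr index, they can be analysed with ordinary Solr facet queries. The sketch below only builds such a query URL using standard Solr request parameters (<code>q</code>, <code>rows</code>, <code>facet</code>, <code>facet.field</code>, <code>wt</code>). The host, the core name <code>discovery</code>, and the field name <code>content_type</code> are placeholders assumed for illustration; the real names depend on the deployment and on the tool's Solr schema.

```python
from urllib.parse import urlencode


def build_facet_query(solr_base: str, core: str, query: str, facet_field: str) -> str:
    """Build a Solr /select URL that returns facet counts for one field.

    The core and field names passed in are assumptions supplied by the caller;
    this helper does not know the tool's actual schema.
    """
    params = [
        ("q", query),                  # which documents to consider
        ("rows", "0"),                 # no documents needed, only the counts
        ("facet", "true"),             # enable faceting
        ("facet.field", facet_field),  # field to count distinct values of
        ("wt", "json"),                # ask for a JSON response
    ]
    return f"{solr_base.rstrip('/')}/{core}/select?{urlencode(params)}"


# Hypothetical local Solr instance, core and field name.
url = build_facet_query("http://localhost:8983/solr", "discovery", "*:*", "content_type")
print(url)
```

Fetching that URL from a running Solr instance would return value counts for the chosen field, which is the kind of summary useful for format or error profiling.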

For more information, see the [https://github.com/ukwa/webarchive-discovery/wiki/Features Features page of the tool's wiki].

== User Experiences ==
<!-- Add hotlinks to user experiences with the tool (eg. blog posts). These should illustrate the effectiveness (or otherwise) of the tool. -->
* Used by the [http://www.webarchive.org.uk/ UK Web Archive] to provide access to its collections. Some details are given [http://blogs.bl.uk/webarchive/2017/06/revitalising-the-uk-web-archive.html in this blog post]. Contact [https://twitter.com/UKWebArchive @UKWebArchive] for more information.
* Also used by a number of other web archives, and by the [https://github.com/archivesunleashed/warclight WARCLight] project.

== Development Activity ==
<!-- Provide *evidence* of development activity of the tool. For example, RSS feeds for code issues or commits. -->
All development activity is visible on GitHub: http://github.com/ukwa/webarchive-discovery
 
There is also a #webarchive-discovery channel on the IIPC Slack service. Contact https://twitter.com/NetPreserve for details.
 
=== Release Feed ===
The three most recent releases:
<rss max=3>https://github.com/ukwa/webarchive-discovery/releases.atom</rss>
  
 
=== Activity Feed ===
The five most recent commits:
<rss max=5>https://github.com/ukwa/webarchive-discovery/commits/master.atom</rss>