Latest revision as of 14:34, 21 April 2021

Description

Full-text indexing system, using Apache Solr as the search back-end. Supports command-line or large-scale map-reduce (Hadoop) processing of ARC and WARC files. Also integrates file format analysis and scans for some known preservation risks.

It runs format identification using both Apache Tika and DROID, and also extracts filename extensions and the first few bytes of each resource, to enable analysis of formats unknown to those tools.
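The fallback described above (recording the filename extension and the leading bytes of a resource) can be illustrated with a minimal Python sketch. This is not the tool's own Java code: the function name `fallback_signature`, the output keys, and the default of four bytes are all assumptions made for illustration, since the description only says "the first few bytes".

```python
import os
from urllib.parse import urlparse


def fallback_signature(url: str, payload: bytes, n_bytes: int = 4) -> dict:
    """Return the lower-cased filename extension and leading bytes (as hex).

    Hypothetical helper: n_bytes=4 is an assumed default, not the tool's setting.
    """
    # Take the extension from the URL path only, ignoring any query string.
    path = urlparse(url).path
    extension = os.path.splitext(path)[1].lower().lstrip(".")
    # Hex-encode the leading bytes so they can be stored as a plain string.
    return {"extension": extension, "first_bytes": payload[:n_bytes].hex()}


sig = fallback_signature("http://example.org/docs/report.PDF?x=1", b"%PDF-1.4\n")
print(sig)  # {'extension': 'pdf', 'first_bytes': '25504446'}
```

Grouping resources by these two values is one way to spot clusters of files that share a signature even when neither identification tool recognises the format.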

It can also be configured to use Apache PDFBox's Preflight tool to scan PDFs for PDF/A violations and record them in the index for analysis.

It also records any parsing errors or other access problems it discovers, which can help find problematic resources.
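Because these errors are recorded in the Solr index, they can in principle be retrieved with an ordinary Solr query. The sketch below only builds a request URL for Solr's standard `select` handler. The host, the core name `discovery`, and the field name `parse_error` are assumptions for illustration, so check the tool's actual Solr schema before relying on them.

```python
from urllib.parse import urlencode


def build_select_url(base_url: str, core: str, query: str, rows: int = 10) -> str:
    """Build a URL for Solr's select handler, asking for a JSON response."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url.rstrip('/')}/{core}/select?{params}"


# Hypothetical core and field names: match any document that has a value
# in an assumed "parse_error" field.
url = build_select_url("http://localhost:8983/solr", "discovery", "parse_error:[* TO *]")
print(url)
```

The range query `[* TO *]` is standard Solr syntax for "this field has any value", which makes it a convenient way to list every resource with a recorded problem.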

For more information, see the Features page of the tool's wiki: https://github.com/ukwa/webarchive-discovery/wiki/Features

User Experiences

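Used by the UK Web Archive (http://www.webarchive.org.uk/) to provide access to their collections. Some details are given in this blog post: http://blogs.bl.uk/webarchive/2017/06/revitalising-the-uk-web-archive.html. Feel free to contact @UKWebArchive (https://twitter.com/UKWebArchive) for more information.
Also used by a number of other web archives, and by the WARCLight project (https://github.com/archivesunleashed/warclight).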

Development Activity

Purpose:	Indexing and discovery tools for web archives.
Homepage:	https://github.com/ukwa/webarchive-discovery
License:	Mixed
Platforms:	Java
Function:	Metadata Extraction, File Format Identification, Content Profiling, Discovery
Content type:	Web

All development activity is visible on GitHub: http://github.com/ukwa/webarchive-discovery

There is also a #webarchive-discovery channel on the IIPC Slack service. Contact https://twitter.com/NetPreserve for details.

Release Feed

The last 3 releases are listed below:

2024-04-02 09:25:58: Revert of source_file_path (warc-discovery-3.3.1); by GilHoggarth
2023-06-02 11:04:22: warc-discovery-3.3.0; by anjackson
2020-11-27 12:25:29: warc-discovery-3.1.0; by anjackson

Activity Feed

The last 5 commits are listed below:

2025-11-21 19:20:54: Moved .github/workflows to was.github/workflows to disable incurring … (commit 2eb4f08257d31f64faab7d5a4e8f19bad5ca1fdf); by GilHoggarth (https://github.com/GilHoggarth)
2025-07-24 10:32:36: Added warc-indexer announcement, redirection (commit d398e1f90ebc1f95a0a01a2bf5235b30a6efd47d); by GilHoggarth (https://github.com/GilHoggarth)
2025-07-01 15:28:19: Merge pull request #321 from inonwir/patch-1 (commit 229eac3df79527ae359112f8bbd5f11ab4406dbf); by GilHoggarth (https://github.com/GilHoggarth)
2025-06-25 07:41:18: Update CallableTest.java (commit 5bb7c38c53270aa7ca5ad8e90f3d736020caf0fa); by inonwir (https://github.com/inonwir)
2025-03-11 12:48:48: Merge pull request #320 from bnfleb/issue-319 (commit 4898ed804b3edaa3bdff84f46b2d1d3b71325660); by GilHoggarth (https://github.com/GilHoggarth)

@@ Line 1: / Line 1: @@
-<!-- Use the structure provided in this template, do not change it! -->
+{{Infobox tool
-{{Infobox_tool
 |purpose=Indexing and discovery tools for web archives.
 |homepage=https://github.com/ukwa/webarchive-discovery
 |license=Mixed
 |platforms=Java
+|function=Metadata Extraction, File Format Identification, Content Profiling, Discovery
+|content=Web
+}}
+{{Infobox tool details
+|ohloh_id=Heritrix
 }}
+== Description ==
+<!-- Describe the what the tool does, focusing on it's digital preservation value. Keep it factual. -->
+Full-text indexing system, using Apache Solr as the search back-end. Supports command-line or large-scale map-reduce (Hadoop) processing of ARC and WARC files. Also integrates file format analysis and scans for some known preservation risks.
-<!-- Add one ore more categories to describe the function of the tool. Choose carefully, and view the list of existing categories first (see the Navigation sidebar on the left). The following are common category examples, remove those that don't apply -->
+It runs format identification using both [[Apache Tika]] and [[DROID]], and also extracts filename extensions and the first few bytes of each resource, to enable analysis of formats unknown to those tools.
-[[Category:Metadata Extraction]]
-[[Category:File Format Identification]]
-[[Category:Content Profiling]]
-[[Category:Discovery]]
+If can also be configured to use [[Apache PDFBox]]'s Preflight tool to scan PDF's for PDF/A violations, and record them in the index for analysis.
-<!-- Add relevant categories to describe the content type that the tool addresses. Choose carefully, and view the list of existing categories first (see the Navigation sidebar on the left). If the tool works on any content type, do not add a category. The following are common category examples, remove those that don't apply -->
+It also records any parsing errors or other access problems it discovers, which can help find problematic resources.
-[[Category:Web]]
-[[Category:Web_Archive]]
-== Description ==
+For more information, see the [https://github.com/ukwa/webarchive-discovery/wiki/Features Features page of the tool's wiki].
-<!-- Describe the what the tool does, focusing on it's digital preservation value. Keep it factual. -->
-Full-text indexing system, using Apache Solr as the search back-end. Supports command-line or large-scale map-reduce (Hadoop) processing of ARC and WARC files. Also integrates file format analysis and scans for some known preservation risks.
 == User Experiences ==
 <!-- Add hotlinks to user experiences with the tool (eg. blog posts). These should illustrate the effectiveness (or otherwise) of the tool. -->
-* Used by the [http://www.webarchive.org.uk/ UK Web Archive] to provide access to their collections. More details TBA.
+* Used by the [http://www.webarchive.org.uk/ UK Web Archive] to provide access to their collections. Some details [http://blogs.bl.uk/webarchive/2017/06/revitalising-the-uk-web-archive.html in this blog post]. Feel free to contact [https://twitter.com/UKWebArchive @UKWebArchive] for more information.
+* Also used by a number of other web archives, and by the [https://github.com/archivesunleashed/warclight WARCLight] project.
 = Development Activity =
@@ Line 41: / Line 41: @@
 Below the last 5 commits:
 <rss max=5>https://github.com/ukwa/webarchive-discovery/commits/master.atom</rss>
-{{Infobox_tool_details
-|releases_rss=
-|issues_rss=
-|mailing_lists=
-|ohloh_id=Heritrix
-}}

Difference between revisions of "Web Archive Discovery"
