Difference between revisions of "Web Archive Discovery"
Prwheatley (talk | contribs) |
|||
(3 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
− | + | {{Infobox tool | |
− | |||
− | {{ | ||
|purpose=Indexing and discovery tools for web archives. | |purpose=Indexing and discovery tools for web archives. | ||
|homepage=https://github.com/ukwa/webarchive-discovery | |homepage=https://github.com/ukwa/webarchive-discovery | ||
|license=Mixed | |license=Mixed | ||
|platforms=Java | |platforms=Java | ||
+ | |function=Metadata Extraction, File Format Identification, Content Profiling, Discovery | ||
+ | |content=Web | ||
+ | }} | ||
+ | {{Infobox tool details | ||
+ | |ohloh_id=Heritrix | ||
}} | }} | ||
== Description == | == Description == | ||
<!-- Describe the what the tool does, focusing on it's digital preservation value. Keep it factual. --> | <!-- Describe the what the tool does, focusing on it's digital preservation value. Keep it factual. --> | ||
Line 25: | Line 16: | ||
It runs format identification using both [[Apache Tika]] and [[DROID]], and also extracts filename extensions and the first few bytes of each resource, to enable analysis of formats unknown to those tools. | It runs format identification using both [[Apache Tika]] and [[DROID]], and also extracts filename extensions and the first few bytes of each resource, to enable analysis of formats unknown to those tools. | ||
− | It can also be configured to use [[Apache PDFBox | + | It can also be configured to use [[Apache PDFBox]]'s Preflight tool to scan PDFs for PDF/A violations, and record them in the index for analysis. |
It also records any parsing errors or other access problems it discovers, which can help find problematic resources. | It also records any parsing errors or other access problems it discovers, which can help find problematic resources. | ||
Line 33: | Line 24: | ||
== User Experiences == | == User Experiences == | ||
<!-- Add hotlinks to user experiences with the tool (eg. blog posts). These should illustrate the effectiveness (or otherwise) of the tool. --> | <!-- Add hotlinks to user experiences with the tool (eg. blog posts). These should illustrate the effectiveness (or otherwise) of the tool. --> | ||
− | * Used by the [http://www.webarchive.org.uk/ UK Web Archive] to provide access to their collections. | + | * Used by the [http://www.webarchive.org.uk/ UK Web Archive] to provide access to their collections. Some details [http://blogs.bl.uk/webarchive/2017/06/revitalising-the-uk-web-archive.html in this blog post]. Feel free to contact [https://twitter.com/UKWebArchive @UKWebArchive] for more information. |
+ | * Also used by a number of other web archives, and by the [https://github.com/archivesunleashed/warclight WARCLight] project. | ||
= Development Activity = | = Development Activity = | ||
Line 49: | Line 41: | ||
Below the last 5 commits: | Below the last 5 commits: | ||
<rss max=5>https://github.com/ukwa/webarchive-discovery/commits/master.atom</rss> | <rss max=5>https://github.com/ukwa/webarchive-discovery/commits/master.atom</rss> | ||
Latest revision as of 14:34, 21 April 2021
Description
Full-text indexing system, using Apache Solr as the search back-end. Supports command-line or large-scale map-reduce (Hadoop) processing of ARC and WARC files. Also integrates file format analysis and scans for some known preservation risks.
It runs format identification using both Apache Tika and DROID, and also extracts filename extensions and the first few bytes of each resource, to enable analysis of formats unknown to those tools.
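As a rough illustration of the last point, the sketch below derives a lower-cased filename extension from a resource URL and hex-encodes the first few bytes of its payload. It is not taken from the webarchive-discovery code base: the class name `ResourceFingerprint`, its method names, and the choice of four bytes are assumptions made for the example, and it needs Java 11 or later for `readNBytes`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of webarchive-discovery.
public class ResourceFingerprint {

    /** Lower-cased filename extension of the URL's path, or "" if there is none. */
    public static String extension(String url) {
        // URI.getPath() excludes the query string and fragment.
        String path = URI.create(url).getPath();
        if (path == null) {
            return "";
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        // No dot, a leading dot (hidden file), or a trailing dot: no extension.
        if (dot <= 0 || dot == name.length() - 1) {
            return "";
        }
        return name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Hex encoding of up to n leading bytes of the stream. */
    public static String firstBytesHex(InputStream in, int n) throws IOException {
        byte[] head = in.readNBytes(n); // returns fewer bytes if the stream is shorter
        StringBuilder sb = new StringBuilder();
        for (byte b : head) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "%PDF-1.4 example".getBytes(StandardCharsets.US_ASCII);
        System.out.println(extension("http://example.org/docs/report.PDF?download=1")); // pdf
        System.out.println(firstBytesHex(new ByteArrayInputStream(payload), 4));        // 25504446
    }
}
```

Storing both values alongside the Tika and DROID results would let resources that neither tool recognises be grouped by extension or by leading byte pattern.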
It can also be configured to use Apache PDFBox's Preflight tool to scan PDFs for PDF/A violations, and record them in the index for analysis.
It also records any parsing errors or other access problems it discovers, which can help find problematic resources.
For more information, see the Features page of the tool's wiki.
User Experiences
- Used by the UK Web Archive to provide access to their collections. Some details in this blog post. Feel free to contact @UKWebArchive for more information.
- Also used by a number of other web archives, and by the WARCLight project.
Development Activity
All development activity is visible on GitHub: http://github.com/ukwa/webarchive-discovery
There is also a #webarchive-discovery channel on the IIPC Slack service. Contact https://twitter.com/NetPreserve for details.
Release Feed
The last 3 releases:
- 2024-04-02 09:25:58
- Revert of source_file_path (warc-discovery-3.3.1)
- by GilHoggarth
- 2023-06-02 11:04:22
- warc-discovery-3.3.0
- by anjackson
- 2020-11-27 12:25:29
- warc-discovery-3.1.0
- by anjackson
Activity Feed
The last 5 commits:
- 2024-08-09 10:57:54
- Merge pull request #318 from lasztoth/langid-language-analyser (commit 40ce1635f79b8d9d13f3fa2a1577f0ca46aa8404)
- by GilHoggarth https://github.com/GilHoggarth
- 2024-08-09 10:36:40
- Added correct version of artifact (commit 380afa66e0d45e569f0dd2971c1a8039daa90402)
- by KGX747@MC212515.gouv.etat.lu
- 2024-08-09 10:08:55
- Update LanguageAnalyser.java (commit 170c8dfb3543159065af792cf226e2ea1726c852)
- by GilHoggarth https://github.com/GilHoggarth
- 2024-08-09 10:02:40
- Update pom.xml (commit 4def14555648fc24fd2a5e75f367b799e6759bf2)
- by GilHoggarth https://github.com/GilHoggarth
- 2024-08-09 09:00:07
- Merge pull request #317 from lasztoth/langid-language-analyser (commit 61e070adf7ebf91efd49a01928b8216b7019cd58)
- by GilHoggarth https://github.com/GilHoggarth