Difference between revisions of "Web Archive Discovery"
(Added initial webarchive-discovery outline) |
|||
(4 intermediate revisions by the same user not shown) | |||
Line 17: | Line 17: | ||
<!-- Add relevant categories to describe the content type that the tool addresses. Choose carefully, and view the list of existing categories first (see the Navigation sidebar on the left). If the tool works on any content type, do not add a category. The following are common category examples, remove those that don't apply --> | <!-- Add relevant categories to describe the content type that the tool addresses. Choose carefully, and view the list of existing categories first (see the Navigation sidebar on the left). If the tool works on any content type, do not add a category. The following are common category examples, remove those that don't apply --> | ||
[[Category:Web]] | [[Category:Web]] | ||
− | [[Category: | + | [[Category:Web_Archive]] |
== Description == | == Description == | ||
<!-- Describe the what the tool does, focusing on it's digital preservation value. Keep it factual. --> | <!-- Describe the what the tool does, focusing on it's digital preservation value. Keep it factual. --> | ||
− | Full-text indexing system, using Apache Solr as the search back-end. Supports command-line | + | Full-text indexing system, using Apache Solr as the search back-end. Supports command-line or large-scale map-reduce (Hadoop) processing of ARC and WARC files. Also integrates file format analysis and scans for some known preservation risks. |
+ | |||
+ | It runs format identification using both [[Apache Tika]] and [[DROID]], and also extracts filename extensions and the first few bytes of each resource, to enable analysis of formats unknown to those tools. | ||
+ | |||
+ | It can also be configured to use [[Apache PDFBox]]'s Preflight tool to scan PDFs for PDF/A violations and record them in the index for analysis. | ||
+ | |||
+ | It also records any parsing errors or other access problems it discovers, which can help find problematic resources. | ||
+ | |||
+ | For more information, see the [https://github.com/ukwa/webarchive-discovery/wiki/Features Features page of the tool's wiki]. | ||
== User Experiences == | == User Experiences == | ||
<!-- Add hotlinks to user experiences with the tool (eg. blog posts). These should illustrate the effectiveness (or otherwise) of the tool. --> | <!-- Add hotlinks to user experiences with the tool (eg. blog posts). These should illustrate the effectiveness (or otherwise) of the tool. --> | ||
− | * Used by the [http://www.webarchive.org.uk/ UK Web Archive] to provide access to their collections. | + | * Used by the [http://www.webarchive.org.uk/ UK Web Archive] to provide access to their collections. Some details [http://blogs.bl.uk/webarchive/2017/06/revitalising-the-uk-web-archive.html in this blog post]. Feel free to contact [https://twitter.com/UKWebArchive @UKWebArchive] for more information. |
+ | * Also used by a number of other web archives, and by the [https://github.com/archivesunleashed/warclight WARCLight] project. | ||
− | + | = Development Activity = | |
<!-- Provide *evidence* of development activity of the tool. For example, RSS feeds for code issues or commits. --> | <!-- Provide *evidence* of development activity of the tool. For example, RSS feeds for code issues or commits. --> | ||
+ | All development activity is visible on GitHub: http://github.com/ukwa/webarchive-discovery | ||
+ | |||
+ | There is also a #webarchive-discovery channel on the IIPC Slack service. Contact https://twitter.com/NetPreserve for details. | ||
+ | |||
+ | === Release Feed === | ||
+ | The last 3 releases are listed below: | ||
+ | <rss max=3>https://github.com/ukwa/webarchive-discovery/releases.atom</rss> | ||
+ | |||
+ | |||
+ | === Activity Feed === | ||
+ | The last 5 commits are listed below: | ||
+ | <rss max=5>https://github.com/ukwa/webarchive-discovery/commits/master.atom</rss> | ||
+ | |||
− | |||
{{Infobox_tool_details | {{Infobox_tool_details | ||
− | |ohloh_id= | + | |releases_rss= |
+ | |issues_rss= | ||
+ | |mailing_lists= | ||
+ | |ohloh_id=Heritrix | ||
}} | }} |
Revision as of 15:33, 24 September 2018
Description
Full-text indexing system, using Apache Solr as the search back-end. Supports command-line or large-scale map-reduce (Hadoop) processing of ARC and WARC files. Also integrates file format analysis and scans for some known preservation risks.
It runs format identification using both Apache Tika and DROID, and also extracts filename extensions and the first few bytes of each resource, to enable analysis of formats unknown to those tools.
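The fallback signals mentioned above (the filename extension and the first few bytes) can be illustrated with a minimal sketch. This is not the tool's own code: the function name `fallback_format_hints`, the returned keys, and the default of four bytes are assumptions made for illustration only.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def fallback_format_hints(url: str, payload: bytes, n_bytes: int = 4) -> dict:
    """Return simple format hints for a resource (hypothetical helper).

    - "extension": lower-cased filename extension taken from the URL path,
      or None when the path has no extension.
    - "first_bytes": the first n_bytes of the payload, hex-encoded.
    """
    # Only the path component is inspected, so query strings are ignored.
    path = urlsplit(url).path
    extension = PurePosixPath(path).suffix.lower().lstrip(".")
    return {
        "extension": extension or None,
        "first_bytes": payload[:n_bytes].hex(),
    }


if __name__ == "__main__":
    hints = fallback_format_hints(
        "http://example.org/docs/report.PDF?download=1", b"%PDF-1.4\n"
    )
    print(hints)
```

Grouping indexed resources by values like these is one way to spot clusters of files that identification tools could not classify.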
It can also be configured to use Apache PDFBox's Preflight tool to scan PDFs for PDF/A violations and record them in the index for analysis.
It also records any parsing errors or other access problems it discovers, which can help find problematic resources.
For more information, see the Features page of the tool's wiki.
User Experiences
- Used by the UK Web Archive to provide access to their collections. Some details in this blog post. Feel free to contact @UKWebArchive for more information.
- Also used by a number of other web archives, and by the WARCLight project.
Development Activity
All development activity is visible on GitHub: http://github.com/ukwa/webarchive-discovery
There is also a #webarchive-discovery channel on the IIPC Slack service. Contact https://twitter.com/NetPreserve for details.
Release Feed
The last 3 releases are listed below:
- 2024-04-02 09:25:58, warc-discovery-3.3.1, by GilHoggarth: Revert of source_file_path
- 2023-06-02 11:04:22, warc-discovery-3.3.0, by anjackson
- 2020-11-27 12:25:29, warc-discovery-3.1.0, by anjackson
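The release list is drawn from the repository's Atom feed (https://github.com/ukwa/webarchive-discovery/releases.atom). As a rough sketch, the same entries could be read with Python's standard library; the function name `latest_entries` and the returned keys are hypothetical, and the feed text is assumed to have been downloaded already.

```python
import xml.etree.ElementTree as ET

# Namespace used by Atom feeds (RFC 4287).
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def latest_entries(atom_xml: str, limit: int = 3) -> list:
    """Return id, title and updated timestamp for the first `limit` entries."""
    root = ET.fromstring(atom_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS)[:limit]:
        entries.append(
            {
                "id": entry.findtext("atom:id", default="", namespaces=ATOM_NS).strip(),
                "title": entry.findtext("atom:title", default="", namespaces=ATOM_NS).strip(),
                "updated": entry.findtext("atom:updated", default="", namespaces=ATOM_NS).strip(),
            }
        )
    return entries
```

The function returns entries in feed order, so the slice only yields the most recent items if the feed lists newest first.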
Activity Feed
The last 5 commits are listed below:
- 2024-04-02 09:24:57, commit 13595bead029fd44f133ec6c18f689edde202e53, by GilHoggarth (https://github.com/GilHoggarth): Update CHANGES.md
- 2024-04-02 08:33:41, commit 2581409f298d2617fb21461edadd0044f70db617, by GilHoggarth (https://github.com/GilHoggarth): Merge pull request #313 from thomasegense/master
- 2023-12-26 09:58:01, commit f98deaddfde179051ee3ba67adb3263b8111fc81, by teg@kb.dk: typo fix
- 2023-12-24 09:02:03, commit c7873c9a60e7029b70c57a3836690699dd74fa34, by teg@kb.dk: Added comment
- 2023-12-24 08:59:23, commit 9f7e9105841a1aa64613cf39c8be0b9edd1b5947, by teg@kb.dk: Changed to debug. Some harvest tools generate a request record for every