Revision as of 15:51, 22 September 2015

Description

	Heritrix is an open-source web crawler, allowing users to target websites they wish to include in a collection and to harvest an instance of each site.
Homepage:	http://crawler.archive.org
License:	GNU Lesser General Public License 2.1
Platforms:	Written in Java. Requires a Java Runtime Environment (JRE, http://www.java.com/en/download/index.jsp), Java version 5.0 or later. Default heap size is 256MB RAM.
Appears in COW:	Quality Assurance: Iterative Seed Issue Decision Tree, Web Archiving Quality Assurance (QA) Workflow, Web Archiving Quality Assurance Lifecycle

Heritrix is an open-source web crawler, allowing users to target websites they wish to include in a collection and to harvest an instance of each site. The software is most often used as a powerful back-end tool incorporated into a web archiving workflow.

Provider

Internet Archive

Licensing and cost

Apache License, Version 2.0 – free. Some individual source code files are subject to or offered under other licenses.

Development activity

Version 3.1.1 was released in May 2012. Heritrix powers the Internet Archive, and so receives ongoing support.

Platform and interoperability

As a Java application, Heritrix is theoretically platform agnostic; however, only Linux is supported. The software requires Java Runtime Environment 1.6 or higher, and at least 256MB of available RAM.

Functional notes

Web crawls are carried out by configuring a ‘job,’ which is itself an instance of a crawl template called a ‘profile.’ Although they contain the same configurations, the two entities have different functions: a profile records a set of configurations and acts as a starting point for shaping a new job, but only the job itself can execute a crawl. The software will crawl FTP sites in addition to HTTP. Users can examine the results of a crawl by opening its log files, which include information about crawl problems and errors, each URI that was collected, and statistics about the job as a whole. Users can also create reports showing a summary of the crawl’s activity. Heritrix stores the web resources it crawls in an ARC file. The software includes a command-line tool called arcreader which can be used to extract the contents.
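The relationship between a profile and a job can be pictured with a short sketch. This is an illustration only, written in Python rather than taken from Heritrix itself: the class names `Profile` and `Job`, the `new_job` and `run` methods, and the `max_hops` setting are all hypothetical and do not correspond to Heritrix's own classes or configuration keys.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Profile:
    """Hypothetical crawl template: it stores settings but has no way to run a crawl."""
    name: str
    settings: dict

    def new_job(self, job_name, seeds):
        # Copy the settings so that adjusting the job never alters the profile.
        return Job(name=job_name, settings=deepcopy(self.settings), seeds=list(seeds))


@dataclass
class Job:
    """Hypothetical crawl job: starts from a profile's settings and can be executed."""
    name: str
    settings: dict
    seeds: list
    log: list = field(default_factory=list)

    def run(self):
        # Stand-in for a real crawl: record one log entry per seed URI.
        for seed in self.seeds:
            self.log.append(f"queued {seed}")
        return len(self.log)


profile = Profile("default", {"max_hops": 3})
job = profile.new_job("demo-crawl", ["http://example.org/"])
job.settings["max_hops"] = 1   # adjust the job only
job.run()
```

In this sketch the profile keeps its original settings after the job is adjusted, and only the job exposes a `run` method, mirroring the division of roles described above.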

Documentation and user support

The User Guide for versions 3.0 and 3.1 is in the form of a wiki, which at the time of writing is not structured in any obvious narrative order; while detailed, it is very difficult to navigate. The User Manual for version 2.0 is structured and can be used as a reference for navigation. Extensive documentation is available, including release notes, Javadoc API documentation, and FAQs linking within the wiki. Heritrix’s website links to two active mailing lists: a Yahoo discussion group and a SourceForge list distributing source code commits. The project also uses a public JIRA for bug, feature, and issue tracking.

Usability

Heritrix is installed via a command line interface, but once installed the user can launch a web-based interface for configuration. Setting up a crawl requires a significant number of adjustments.

Expertise required

Installation requires solid knowledge of Linux and command line interfaces. As with any web archiving software, deep understanding of the project’s scope and collections policy is essential in order to set up appropriate targets.

Standards compliance

Heritrix does not offer metadata support. The software is designed to respect robots.txt exclusion directives and META robots tags.
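To illustrate what respecting a robots.txt exclusion directive means in practice, the following minimal sketch uses Python's standard `urllib.robotparser` module. It is not Heritrix's own implementation; the `allowed` helper, the sample robots.txt content, and the `example-crawler` user agent are assumptions made for the example, and META robots tags are not covered.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite crawler would skip the first URI and collect the second.
blocked = allowed(EXAMPLE_ROBOTS, "example-crawler", "http://example.org/private/page.html")
permitted = allowed(EXAMPLE_ROBOTS, "example-crawler", "http://example.org/index.html")
```

A crawler that honours the directive would run a check of this kind before fetching each URI and discard any that the site owner has excluded.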

Influence and take-up

Heritrix is extremely influential; as of March 2012 the SourceForge site reports nearly 240,000 downloads. Users include the Internet Archive, the British Library, the United States Library of Congress, and the French National Library. The software powers NetarchiveSuite and the Web Curator Tool.

User Experiences

Development Activity

Mailing List(s)

See https://groups.yahoo.com/neo/groups/archive-crawler/info for more information.

Release Feed

2024-04-12 07:00:21: Merge pull request #583 from internetarchive/dependabot/maven/commons… (commit f81a987110b62f6bf774f791288a7aa67ec5e345); by ato https://github.com/ato
2024-04-12 06:58:23: Bump org.springframework:spring-expression in /commons (commit 7975bf3d7c14ab2a0fd05c3e77b6cd1a430a02d2); by dependabot https://github.com/dependabot
2024-04-12 06:49:17: Merge pull request #575 from internetarchive/pdfbox (commit b3d276b6d8712751ef82c195ebf6a4544496d023); by ato https://github.com/ato
2024-03-20 17:22:05: Merge pull request #582 from galgeek/fix-yt-dlp-typo-too (commit 455250994133a2516bc00ca6be7eecbb57f3c70a); by galgeek https://github.com/galgeek
2024-03-20 17:17:34: update comment (commit 3996195ab34af6c225cfbc1c3cc8a56efc47004e); by barbara@archive.org

Issues Feed

@@ Line 46: / Line 46: @@
 {{Infobox_tool_details
+|releases_rss=https://github.com/internetarchive/heritrix3/commits/master.atom
+|issues_rss=https://webarchive.jira.com/sr/jira.issueviews:searchrequest-rss/temp/SearchRequest.xml?jqlQuery=project+%3D+HER&tempMax=100
+|mailing_lists=https://groups.yahoo.com/neo/groups/archive-crawler/info
 |ohloh_id=Heritrix
 }}

Difference between revisions of "Heritrix"