Heritrix

Homepage: http://crawler.archive.org
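License: GNU Lesser General Public License 2.1 for earlier releases; the 3.x releases are under the Apache License, Version 2.0 (see Licensing and cost)
Platforms: Written in Java. Requires a Java Runtime Environment (JRE, http://www.java.com/en/download/index.jsp), version 5.0 at minimum and 1.6 or higher for the 3.x releases. Default heap size is 256MB RAM.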


Description

Heritrix is an open-source web crawler, allowing users to target websites they wish to include in a collection and to harvest an instance of each site. The software is most often used as a powerful back-end tool incorporated into a web archiving workflow.

Provider

Internet Archive

Licensing and cost

Apache License, Version 2.0 – free. Some individual source code files are subject to or offered under other licenses.

Development activity

Version 3.1.1 was released in May 2012. Heritrix powers the Internet Archive, and so receives ongoing support.

Platform and interoperability

As a Java application, Heritrix is theoretically platform-agnostic; however, only Linux is supported. The software requires Java Runtime Environment 1.6 or higher and at least 256MB of available RAM.

Functional notes

Web crawls are carried out by configuring a ‘job,’ which is itself an instance of a crawl template called a ‘profile.’ Although they contain the same configurations, the two entities serve different functions: a profile records a set of configurations and acts as a starting point for shaping a new job, but only a job can execute a crawl. The software crawls FTP sites as well as HTTP sites. Users can examine the results of a crawl by opening its log files, which record crawl problems and errors, each URI that was collected, and statistics about the job as a whole. Users can also create reports showing a summary of the crawl’s activity. Heritrix stores the web resources it crawls in ARC files. The software includes a command-line tool, arcreader, which can be used to extract their contents.
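As an illustration of examining a crawl through its log files, the following Python sketch tallies fetch status codes and downloaded bytes. It is not part of Heritrix. It assumes each log line is whitespace-separated, with the fetch status code in the second column, the document size in the third and the URI in the fourth; this layout should be checked against the logs your own installation produces. The function name and the sample lines are hypothetical and invented for illustration.

```python
from collections import Counter


def summarize_crawl_log(lines):
    """Tally status codes and downloaded bytes from crawl.log-style lines.

    Assumes whitespace-separated columns where column 2 is the fetch status
    code and column 3 is the document size; verify against your own logs.
    """
    status_counts = Counter()
    total_bytes = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        status, size = fields[1], fields[2]
        status_counts[status] += 1
        if size.isdigit():  # a non-numeric size such as "-" adds nothing
            total_bytes += int(size)
    return status_counts, total_bytes


# Hypothetical log lines, invented for illustration only.
SAMPLE_LINES = [
    "2012-05-01T10:00:00.000Z 200 1299 http://example.org/robots.txt P http://example.org/ text/plain #001 20120501100000000+12 sha1:AAAA - -",
    "2012-05-01T10:00:01.000Z 200 5120 http://example.org/ - - text/html #002 20120501100001000+40 sha1:BBBB - -",
    "2012-05-01T10:00:02.000Z -9998 - http://example.org/private/ L http://example.org/ unknown #003 - - - -",
]

counts, total = summarize_crawl_log(SAMPLE_LINES)
print(dict(counts), total)
```

A real log would be passed in by opening the file and handing the file object to summarize_crawl_log, since iterating over a file yields its lines.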

Documentation and user support

The User Guide for versions 3.0 and 3.1 takes the form of a wiki, which at the time of writing is not structured in any obvious narrative order; while detailed, it is very difficult to navigate. The User Manual for version 2.0 is structured and can serve as a navigational reference. Extensive documentation is available, including release notes, Javadoc API documentation, and FAQs that link within the wiki. Heritrix’s website links to two active mailing lists: a Yahoo discussion group and a SourceForge list that distributes source code commits. The project also uses a public JIRA for bug, feature, and issue tracking.

Usability

Heritrix is installed via a command-line interface, but once it is installed the user can launch a web-based interface for configuration. Setting up a crawl requires a significant number of adjustments.

Expertise required

Installation requires solid knowledge of Linux and command-line interfaces. As with any web archiving software, a deep understanding of the project’s scope and collections policy is essential in order to set up appropriate targets.

Standards compliance

Heritrix does not offer metadata support. The software is designed to respect robots.txt exclusion directives and META robots tags.
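To show what respecting a robots.txt exclusion directive involves, the following Python sketch uses the standard library's urllib.robotparser to decide whether a URL may be fetched. It is only an illustration of the general mechanism and does not reproduce Heritrix's own implementation. The robots.txt content, the user-agent string "examplebot", the helper name and the URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # load rules from text, no network access
    return parser.can_fetch(user_agent, url)


for url in (
    "http://example.org/public/page.html",
    "http://example.org/private/page.html",
):
    print(url, allowed(ROBOTS_TXT, "examplebot", url))
```

A crawler that honours these directives would skip any URL for which such a check returns False.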

Influence and take-up

Heritrix is extremely influential; as of March 2012, the SourceForge site reports nearly 240,000 downloads. Users include the Internet Archive, the British Library, the United States Library of Congress, and the French National Library. The software powers NetarchiveSuite and the Web Curator Tool.


User Experiences

Development Activity

All development activity is visible on GitHub: http://github.com/internetarchive/heritrix3/commits


Release Feed

The most recent releases:

2014-01-11 00:18:30: 3.2.0, by nlevitt
2012-08-07 22:18:54: 3.1.1, by unknown


Activity Feed

The five most recent commits:

2017-12-12 22:20:55: Merge pull request #196 from kngenie/fix-test-failures-with-bdb (commit 5a0326eba19cfe463fc869f4e3c730d661616564), by nlevitt (https://github.com/nlevitt)
2017-12-09 01:15:03: fix for test failures in a workspace on NFS-mounted filesystem (commit a7b7c6cf5a0d567dad91fd6c499279cde2004508), by kngenie (https://github.com/kngenie)
2017-11-16 19:24:20: Merge pull request #194 from internetarchive/formsMaxSize (commit 76bf98a4f0d42a37d61b36c20ebfd8b403e01f60), by nlevitt (https://github.com/nlevitt)
2017-11-16 16:48:06: max size for extracted form elements (commit 998edf3479569f6da08e7a3cc154fe95849b7ddc), by galgeek (https://github.com/galgeek)
2017-11-02 00:20:30: Merge pull request #192 from internetarchive/robotstxt-size-limit (commit deda35eedd8011da6a51233af2dca2bd49c6be84), by nlevitt (https://github.com/nlevitt)


Issues Feed

2017-08-31 17:31:02: [HER-2096] heritrix hitting non existent URLs in wix.com/app-market, by Vangelis Banos
2016-10-28 19:55:16: [HER-2095] Crawl M3U8 files and capture resources they describe, by Barbara Miller
2017-03-07 19:21:51: [HER-2094] Add support for extracting URLs from img srcset attribute, by Adam Miller
2016-06-07 07:21:42: [HER-2093] appCtx.getBean() does no longer work in scripting console, by Robert Jäschke
2016-06-07 07:41:39: [HER-2092] Heritrix ignores robots.txt, by Robert Jäschke

Contributors

Chlara (19.6%), Nullhandle (2.3%), Andy Jackson (15.9%), COPTR Bot (62.2%)