Difference between revisions of "Heritrix"

From COPTR
(Trial import from script.)
 
(7 intermediate revisions by 5 users not shown)
Line 1: Line 1:
{{Infobox_tool
+
{{Infobox tool
|purpose=Heritrix is a flexible, extensible, robust, and scalable Web crawler capable of fetching, archiving, and analyzing Internet-accessible content.
+
|image=heritrix-logo.gif
|image=
+
|purpose=Heritrix is an open-source web crawler, allowing users to target websites they wish to include in a collection and to harvest an instance of each site.
|homepage=http://crawler.archive.org
+
|homepage=https://github.com/internetarchive/heritrix3/wiki
|license=GNU Lesser General Public License 2.1  
+
|license=GNU Lesser General Public License 2.1
|platforms=Written in Java. Must have Java Runtime Environment (JRE, http://www.java.com/en/download/index.jsp) and at least Java version 5.0 installed. Default heap size is 256MB RAM.  
+
|platforms=Written in Java. Must have Java Runtime Environment (JRE, http://www.java.com/en/download/index.jsp) and at least Java version 5.0 installed. Default heap size is 256MB RAM.
 +
|Wikidata ID=Q3097891
 +
|function=Web Capture
 +
|content=Web
 
}}
 
}}
 +
{{Infobox tool details
 +
|ohloh_id=Heritrix
 +
|mailing_lists=https://groups.yahoo.com/neo/groups/archive-crawler/info
 +
|issues_rss=https://webarchive.jira.com/sr/jira.issueviews:searchrequest-rss/temp/SearchRequest.xml?jqlQuery=project+%3D+HER&tempMax=100
 +
}}
 +
= Description =
 +
[https://webarchive.jira.com/wiki/display/Heritrix/Heritrix Heritrix] is an open-source web crawler, allowing users to target websites they wish to include in a collection and to harvest an instance of each site. The software is most often used as a powerful back-end tool incorporated into a web archiving workflow.
 +
====Provider====
 +
Internet Archive
 +
====Licensing and cost====
 +
[http://www.apache.org/licenses/LICENSE-2.0.html Apache License, Version 2.0] – free. Some individual source code files are subject to or offered under other licenses.
 +
====Development activity====
 +
Version 3.1.1 was released in May 2012.
 +
Heritrix powers the Internet Archive, and so receives ongoing support.
 +
====Platform and interoperability====
 +
As a Java application, Heritrix is theoretically platform agnostic; however, only Linux is supported.  The software requires Java Runtime Environment 1.6 or higher, and at least 256MB of available RAM.
 +
====Functional notes====
 +
Web crawls are carried out by configuring a ‘job,’ which itself is an instance of a crawl template called a ‘profile.’ Although they contain the same configurations, these two entities have different functions: profiles record the set of configurations and act as a starting point for shaping a new job, but only the job itself can execute a crawl.
 +
The software will crawl FTP sites in addition to HTTP. Users can examine the results of a crawl by opening its log files, which include information about crawl problems and errors, each URI that was collected, and statistics about the job as a whole. Users can also create reports showing a summary of the crawl’s activity.
 +
Heritrix stores the web resources it crawls in an ARC file. The software includes a command-line tool called arcreader, which can be used to extract the contents.
 +
====Documentation and user support====
 +
The [https://webarchive.jira.com/wiki/display/Heritrix/Heritrix+3.0+and+3.1+User+Guide User Guide for versions 3.0 and 3.1] is in the form of a wiki, which at the time of writing is not structured in any obvious narrative order; while detailed, it is very difficult to navigate. The [http://crawler.archive.org/articles/user_manual/ User Manual for version 2.0] is structured and can be used as a reference for navigation. Extensive documentation is available, including release notes, Javadoc API documentation, and FAQs linking within the wiki.
 +
Heritrix’s website links to two active mailing lists: a Yahoo discussion group and a SourceForge list distributing source code commits. The project also uses a public JIRA for bug, feature, and issue tracking.
 +
====Usability====
 +
Heritrix is installed via a command line interface, but once installed the user can launch a web-based interface for configuration. Setting up a crawl requires a significant number of adjustments.
 +
====Expertise required====
 +
Installation requires solid knowledge of Linux and command line interfaces. As with any web archiving software, deep understanding of the project’s scope and collections policy is essential in order to set up appropriate targets.
 +
====Standards compliance====
 +
Heritrix does not offer metadata support. The software is designed to respect robots.txt exclusion directives and META robots tags.
 +
====Influence and take-up====
 +
Heritrix is extremely influential; as of March 2012, the SourceForge site reports nearly 240,000 downloads. [https://webarchive.jira.com/wiki/display/Heritrix/Users+of+Heritrix Users] include the Internet Archive, the British Library, the United States Library of Congress, and the French National Library. The software powers [http://www.dcc.ac.uk/node/9380 NetarchiveSuite] and the [http://www.dcc.ac.uk/node/9394 Web Curator Tool].
  
<!-- Delete the Categories that do not apply -->
 
[[Category:Web Crawl]]
 
[[Category:Web]]
 
 
 
= Description =
 
Heritrix is a flexible, extensible, robust, and scalable Web crawler capable of fetching, archiving, and analyzing Internet-accessible content. Developed by Internet Archive. Written in Java.
 
  
 
= User Experiences =
 
= User Experiences =
Line 19: Line 46:
  
 
= Development Activity =
 
= Development Activity =
 
+
<!-- Provide *evidence* of development activity of the tool. For example, RSS feeds for code issues or commits. -->
{{Infobox_tool_details
+
All development activity is visible on GitHub: http://github.com/internetarchive/heritrix3/commits
|ohloh_id=Heritrix
+
}}
+
 +
=== Release Feed ===
 +
The last 3 releases are listed below:
 +
<rss max=3>https://github.com/internetarchive/heritrix3/releases.atom</rss>
 +
 
 +
 +
=== Activity Feed ===
 +
The last 5 commits are listed below:
 +
<rss max=5>https://github.com/internetarchive/heritrix3/commits/master.atom</rss>

Latest revision as of 15:59, 26 November 2021


Heritrix
Heritrix is an open-source web crawler, allowing users to target websites they wish to include in a collection and to harvest an instance of each site.
Homepage: https://github.com/internetarchive/heritrix3/wiki
License: GNU Lesser General Public License 2.1
Platforms: Written in Java. Must have Java Runtime Environment (JRE, http://www.java.com/en/download/index.jsp) and at least Java version 5.0 installed. Default heap size is 256MB RAM.
Wikidata ID: Q3097891
Function: Web Capture
Content type: Web
Appears in COW: Quality Assurance: Iterative Seed Issue Decision Tree, Web Archiving Quality Assurance (QA) Workflow, Web Archiving Quality Assurance Lifecycle



Mailing List(s)

https://groups.yahoo.com/neo/groups/archive-crawler/info


Description

Heritrix is an open-source web crawler, allowing users to target websites they wish to include in a collection and to harvest an instance of each site. The software is most often used as a powerful back-end tool incorporated into a web archiving workflow.

Provider

Internet Archive

Licensing and cost

Apache License, Version 2.0 – free. Some individual source code files are subject to or offered under other licenses.

Development activity

Version 3.1.1 was released in May 2012. Heritrix powers the Internet Archive, and so receives ongoing support.

Platform and interoperability

As a Java application, Heritrix is theoretically platform agnostic; however, only Linux is supported. The software requires Java Runtime Environment 1.6 or higher, and at least 256MB of available RAM.

Functional notes

Web crawls are carried out by configuring a ‘job,’ which itself is an instance of a crawl template called a ‘profile.’ Although they contain the same configurations, these two entities have different functions: profiles record the set of configurations and act as a starting point for shaping a new job, but only the job itself can execute a crawl.

The software will crawl FTP sites in addition to HTTP. Users can examine the results of a crawl by opening its log files, which include information about crawl problems and errors, each URI that was collected, and statistics about the job as a whole. Users can also create reports showing a summary of the crawl’s activity.

Heritrix stores the web resources it crawls in an ARC file. The software includes a command-line tool called arcreader, which can be used to extract the contents.
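
As a rough illustration of working with the crawl log files described above, the following Python sketch tallies fetch status codes and collects the URIs recorded in a log. The field layout is an assumption that is not confirmed by this page: whitespace-separated fields, with the fetch status code second and the URI fourth. The function name and the sample lines are hypothetical and written by hand for illustration only.

```python
from collections import Counter


def summarize_crawl_log(lines):
    """Tally status codes and collect URIs from crawl-log lines.

    Assumed layout (not confirmed by this page): whitespace-separated
    fields, where field 2 is the fetch status code and field 4 is the URI.
    """
    status_counts = Counter()
    uris = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        status_counts[fields[1]] += 1
        uris.append(fields[3])
    return status_counts, uris


# Hypothetical sample lines, not taken from a real crawl.
SAMPLE_LOG = [
    "2021-11-26T15:59:00.000Z 200 1299 http://example.org/ - - text/html",
    "2021-11-26T15:59:01.000Z 404 312 http://example.org/missing L http://example.org/ text/html",
    "2021-11-26T15:59:02.000Z 200 5120 http://example.org/about L http://example.org/ text/html",
]

counts, uris = summarize_crawl_log(SAMPLE_LOG)
for status, total in sorted(counts.items()):
    print(status, total)
```

In practice the lines would be read from the job's log file rather than from an in-memory list; the path to that file depends on the job configuration and is not given here.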

Documentation and user support

The User Guide for versions 3.0 and 3.1 is in the form of a wiki, which at the time of writing is not structured in any obvious narrative order; while detailed, it is very difficult to navigate. The User Manual for version 2.0 is structured and can be used as a reference for navigation. Extensive documentation is available, including release notes, Javadoc API documentation, and FAQs linking within the wiki.

Heritrix’s website links to two active mailing lists: a Yahoo discussion group and a SourceForge list distributing source code commits. The project also uses a public JIRA for bug, feature, and issue tracking.

Usability

Heritrix is installed via a command line interface, but once installed the user can launch a web-based interface for configuration. Setting up a crawl requires a significant number of adjustments.

Expertise required

Installation requires solid knowledge of Linux and command line interfaces. As with any web archiving software, deep understanding of the project’s scope and collections policy is essential in order to set up appropriate targets.

Standards compliance

Heritrix does not offer metadata support. The software is designed to respect robots.txt exclusion directives and META robots tags.
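
To show what respecting robots.txt exclusion directives means in practice, here is a minimal Python sketch using the standard library's `urllib.robotparser`. It is not Heritrix's own implementation, which is written in Java; the user-agent string, the URLs, and the sample robots.txt content are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt):
    """Parse robots.txt text and return a parser that can answer can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

checker = build_robots_checker(SAMPLE_ROBOTS)

# A polite crawler asks before fetching each URI.
allowed = checker.can_fetch("example-crawler", "http://example.org/index.html")
blocked = checker.can_fetch("example-crawler", "http://example.org/private/data.html")
print(allowed, blocked)
```

A crawler that honours these directives would fetch the first URL and skip the second.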

Influence and take-up

Heritrix is extremely influential: as of March 2012, the SourceForge site reported nearly 240,000 downloads. Users include the Internet Archive, the British Library, the United States Library of Congress, and the French National Library. The software powers NetarchiveSuite and the Web Curator Tool.


User Experiences

Development Activity

All development activity is visible on GitHub: http://github.com/internetarchive/heritrix3/commits


Release Feed

The last 3 releases:

2022-07-28 08:21:57: 2022-07-27 Interim Release (3.4.0-20220727), by anjackson
2021-09-30 12:56:11: 2021-09-23 Interim Release (3.4.0-20210923), by anjackson
2021-08-03 09:44:29: 2021-08-03 Interim Release (3.4.0-20210803), by anjackson
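
The release list above is rendered from an Atom feed (https://github.com/internetarchive/heritrix3/releases.atom). As a hedged sketch of how such a feed can be reduced to its most recent entries, the Python below parses an Atom document with the standard library and keeps the first few entries. The sample feed is a hand-written stand-in that reuses the titles shown above; its timestamp format and the function name are assumptions, and a real feed carries more elements.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def latest_entries(atom_xml, limit):
    """Return (updated, title) pairs for the first `limit` entries of a feed."""
    root = ET.fromstring(atom_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS)[:limit]:
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        updated = entry.findtext("atom:updated", default="", namespaces=ATOM_NS)
        entries.append((updated, title))
    return entries


# Hand-written sample feed (assumed shape, not a copy of the real feed).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>2022-07-27 Interim Release</title><updated>2022-07-28T08:21:57Z</updated></entry>
  <entry><title>2021-09-23 Interim Release</title><updated>2021-09-30T12:56:11Z</updated></entry>
  <entry><title>2021-08-03 Interim Release</title><updated>2021-08-03T09:44:29Z</updated></entry>
  <entry><title>An older release</title><updated>2021-01-01T00:00:00Z</updated></entry>
</feed>"""

recent = latest_entries(SAMPLE_FEED, 3)
for updated, title in recent:
    print(updated, title)
```

Taking the first entries relies on the feed listing the newest items first, which is assumed here rather than checked.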


Activity Feed

The last 5 commits:

2024-05-15 14:17:26: Merge pull request #589 from internetarchive/jdk22 (commit e38cd3528eaa1006bdbed35c4b1ea2b6c88c0b14), by ato (https://github.com/ato)
2024-05-15 14:14:51: Add JDK 22 to test matrix (commit 5424423e7d59a75d8b62537f1edc699977976bab), by ato (https://github.com/ato)
2024-05-15 14:13:18: Update groovy to 4.0.21 for JDK 22 support (commit 0c251c7c34a63f6fd5fd185bab8611061c8f55e0), by ato (https://github.com/ato)
2024-05-15 12:03:50: Remove workaround for non-https builds.archive.org repository (commit a151937d4c9c0a9beceeb306161e66ea4b4a19f2), by ato (https://github.com/ato)
2024-05-15 06:50:07: Merge pull request #572 from BulkSecurityGeneratorProjectV2/fix/JLL/u… (commit 64c7f0fb1c9a2976cd86127180c772b5bd2555c4), by ato (https://github.com/ato)