Revision as of 22:09, 10 November 2013

Description

	Heritrix is a flexible, extensible, robust, and scalable Web crawler capable of fetching, archiving, and analyzing Internet-accessible content.
Homepage:	http://crawler.archive.org
License:	GNU Lesser General Public License 2.1
Platforms:	Written in Java. Requires a Java Runtime Environment (JRE, http://www.java.com/en/download/index.jsp), version 5.0 or later. Default heap size is 256 MB of RAM.
Appears in COW:	Quality Assurance: Iterative Seed Issue Decision Tree, Web Archiving Quality Assurance (QA) Workflow, Web Archiving Quality Assurance Lifecycle

Heritrix is a flexible, extensible, robust, and scalable Web crawler capable of fetching, archiving, and analyzing Internet-accessible content. It is developed by the Internet Archive and written in Java.
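
The platform notes above call for Java 5.0 or later and mention a 256 MB default heap. As a rough way to see what a local JVM reports for both, here is a minimal sketch; the class name JvmCheck and its two helper methods are hypothetical and are not part of Heritrix. It uses only the standard Runtime.maxMemory() call and the java.version system property.

```java
public class JvmCheck {
    // Maximum heap the running JVM will attempt to use, in whole megabytes.
    // (Hypothetical helper for illustration; not a Heritrix API.)
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    // Value of the standard "java.version" system property.
    static String javaVersion() {
        return System.getProperty("java.version");
    }

    public static void main(String[] args) {
        System.out.println("Java version: " + javaVersion());
        System.out.println("Max heap (MB): " + maxHeapMegabytes());
    }
}
```

The heap ceiling of a JVM is normally raised with the standard -Xmx launcher option (for example, java -Xmx1024m JvmCheck). How Heritrix's own launch scripts pass that option is not covered on this page.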

User Experiences

Development Activity
