Editing Heritrix

{{Infobox_tool
|purpose=Heritrix is a flexible, extensible, robust, and scalable Web crawler capable of fetching, archiving, and analyzing Internet-accessible content.
|image=
|homepage=http://crawler.archive.org
|license=GNU Lesser General Public License 2.1 
|platforms=Written in Java. Requires a Java Runtime Environment (JRE, http://www.java.com/en/download/index.jsp), version 5.0 or later. The default heap size is 256 MB.
}}

<!-- Delete the Categories that do not apply -->
[[Category:Web Crawl]]
[[Category:Web]]


= Description =
Heritrix is a flexible, extensible, robust, and scalable Web crawler capable of fetching, archiving, and analyzing Internet-accessible content. It is developed by the Internet Archive and written in Java.

= User Experiences =


= Development Activity =

{{Infobox_tool_details
|ohloh_id=Heritrix
}}
@@ Line 1: / Line 1: @@
-{{Infobox tool
+{{Infobox_tool
-|image=heritrix-logo.gif
+|purpose=Heritrix is a flexible, extensible, robust, and scalable Web crawler capable of fetching, archiving, and analyzing Internet-accessible content.
-|purpose=Heritrix is an open-source web crawler, allowing users to target websites they wish to include in a collection and to harvest an instance of each site.
+|image=
-|homepage=https://github.com/internetarchive/heritrix3/wiki
+|homepage=http://crawler.archive.org
 |license=GNU Lesser General Public License 2.1
 |platforms=Written in Java. Must have Java Runtime Environment (JRE, http://www.java.com/en/download/index.jsp) and at least Java version 5.0 installed. Default heap size is 256MB RAM.
-|Wikidata ID=Q3097891
-|function=Web Capture
-|content=Web
-}}
-{{Infobox tool details
-|ohloh_id=Heritrix
-|mailing_lists=https://groups.yahoo.com/neo/groups/archive-crawler/info
-|issues_rss=https://webarchive.jira.com/sr/jira.issueviews:searchrequest-rss/temp/SearchRequest.xml?jqlQuery=project+%3D+HER&tempMax=100
 }}
+<!-- Delete the Categories that do not apply -->
+[[Category:Web Crawl]]
+[[Category:Web]]
 = Description =
-[https://webarchive.jira.com/wiki/display/Heritrix/Heritrix Heritrix] is an open-source web crawler, allowing users to target websites they wish to include in a collection and to harvest an instance of each site. The software is most often used as a powerful back-end tool incorporated into a web archiving workflow.
+Heritrix is a flexible, extensible, robust, and scalable Web crawler capable of fetching, archiving, and analyzing Internet-accessible content. Developed by Internet Archive. Written in Java.
-====Provider====
-Internet Archive
-====Licensing and cost====
-[http://www.apache.org/licenses/LICENSE-2.0.html Apache License, Version 2.0] &ndash; free. Some individual source code files are subject to or offered under other licenses.
-====Development activity====
-Version 3.1.1 was released in May 2012.
-Heritrix powers the Internet Archive, and so receives ongoing support.
-====Platform and interoperability====
-As a Java application, Heritrix is theoretically platform agnostic; however, only Linux is supported.&nbsp; The software requires Java Runtime Environment 1.6 or higher, and at least 256MB of available RAM.
-====Functional notes====
-Web crawls are carried out by configuring a &lsquo;job,&rsquo; which itself is an instance of a crawl template called a &lsquo;profile.&rsquo; Although they contain the same configurations, these two entities have different functions; profiles record the set of configurations and act as a starting point for shaping a new job, but only the job itself can execute a crawl.
-The software will crawl FTP sites in addition to HTTP. Users can examine the results of a crawl by opening its log files, which include information about crawl problems and errors, each URI that was collected, and statistics about the job as a whole. Users can also create reports showing a summary of the crawl&rsquo;s activity.
-Heritrix stores the web resources it crawls in an Arc file. The software includes a command-line tool called arcreader which can be used to extract the contents.
-====Documentation and user support====
-The [https://webarchive.jira.com/wiki/display/Heritrix/Heritrix+3.0+and+3.1+User+Guide User Guide for versions 3.0 and 3.1] is in the form of a wiki, which at time of writing is not structured in any obvious narrative order; while detailed, it is very difficult to navigate.&nbsp; The [http://crawler.archive.org/articles/user_manual/ User Manual for version 2.0]&nbsp;is structured and can be used as a reference for navigation.&nbsp; Extensive documentation is available, including release notes, Javadoc API documentation, and FAQs linking within the wiki.
-Heritrix&rsquo;s website links to two active mailing lists: a Yahoo discussion group and a SourceForge list distributing source code commits. The project also uses a public JIRA for bug, feature, and issue tracking.
-====Usability====
-Heritrix is installed via a command line interface, but once installed the user can launch a web-based interface for configuration. Setting up a crawl requires a significant number of adjustments.
-====Expertise required====
-Installation requires solid knowledge of Linux and command line interfaces. As with any web archiving software, deep understanding of the project&rsquo;s scope and collections policy is essential in order to set up appropriate targets.
-====Standards compliance====
-Heritrix does not offer metadata support. The software is designed to respect robots.txt exclusion directives and META robots tags.
-====Influence and take-up====
-Heritrix is extremely influential; as of March 2012, the SourceForge site reports nearly 240,000 downloads. [https://webarchive.jira.com/wiki/display/Heritrix/Users+of+Heritrix Users] include the Internet Archive, the British Library, the United States Library of Congress, and the French National Library. The software powers [http://www.dcc.ac.uk/node/9380 NetarchiveSuite] and the [http://www.dcc.ac.uk/node/9394 Web Curator Tool].
 = User Experiences =
@@ Line 46: / Line 19: @@
 = Development Activity =
-<!-- Provide *evidence* of development activity of the tool. For example, RSS feeds for code issues or commits. -->
-All development activity is visible on GitHub: http://github.com/internetarchive/heritrix3/commits
+{{Infobox_tool_details
+|ohloh_id=Heritrix
+}}
-=== Release Feed ===
-Below the last 3 release feeds:
-<rss max=3>https://github.com/internetarchive/heritrix3/releases.atom</rss>
-=== Activity Feed ===
-Below the last 5 commits:
-<rss max=5>https://github.com/internetarchive/heritrix3/commits/master.atom</rss>