Web Crawl

Function definition: Tools that support the capture of data from the World Wide Web, typically by "crawling" links between resources.
Lifecycle stage: Create or Receive (Acquire)
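The core loop shared by the crawlers below (Heritrix, Wget, HTTrack, and others) is a breadth-first traversal: fetch a page, extract its links, and queue any not yet seen. A minimal sketch in Python, using only the standard library; the in-memory PAGES mapping stands in for real HTTP fetches and its URLs are purely illustrative:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags on one page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

def crawl(fetch, seed):
    """Breadth-first crawl. fetch(url) returns an HTML string or None.
    Returns the set of URLs visited."""
    seen, queue = {seed}, deque([seed])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

# Tiny in-memory "web" standing in for real HTTP fetches (hypothetical pages).
PAGES = {
    "http://example.org/":  '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.org/a": '<a href="/">home</a>',
    "http://example.org/b": '',
}
visited = crawl(PAGES.get, "http://example.org/")
```

Production crawlers layer politeness (robots.txt, rate limits), deduplication, and archival output formats on top of this loop, but the frontier-queue structure is the same.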

Tools for this function

Archive-It: The leading web archiving service for collecting and accessing cultural heritage on the web, provided by the Internet Archive.
ArchiveFacebook: A Firefox extension that allows individuals to save and manage Facebook web content.
ContextMiner: A framework to collect, analyze, and present contextual information along with the data.
Curate.Us: Creates visually compelling clips and quotes of web content, with a click, that are easily embedded in blog posts, email, forums, and websites.
DIMAG: A software suite supporting archives with the long-term preservation of digital information.
DeepArc: A database-to-XML curation tool intended for preserving web sites from the back end.
Find It! Keep It!: A tool to save and organise web content.
GNU Wget: A non-interactive network downloader.
HTTrack: A website copying utility.
Heritrix: An open-source web crawler that allows users to target websites they wish to include in a collection and to harvest an instance of each site.
Heritrix plug-in for rich media capture: The Rich Media Capture (RMC) module, developed in the LiWA (Living Web Archives) project, enhances the crawler's capturing capabilities for different multimedia content types.
Metaproducts: Offers several commercial capture and offline browsing tools.
NetarchiveSuite: A web archiving software package designed to plan, schedule, and run web harvests of parts of the Internet.
NutchWAX: Software for indexing ARC files (archived websites gathered using Heritrix) for full-text search.
pageVault: Supports the archiving of all unique responses generated by a web server.
Pagelyzer: A suite of tools for detecting changes in web pages and their rendering.
rARC (ARC replicator): A distributed system that enables Internet users to provide storage space from their computers to replicate small parts of the archived data stored in the central repository of a web archive.
SiteStory: A transactional web archive that archives the resources of the web server it is associated with.
Spadix Software: Can download websites from a starting URL, search-engine results, or web directories, and is able to follow external links.
Storytracker: Tools for tracking stories on news homepages.
Teleport: A web crawling tool that enables offline browsing.
The DeDuplicator (Heritrix add-on module): An add-on module for Heritrix that reduces the amount of duplicate data collected in a series of snapshot crawls.
TubeKit: A toolkit for creating YouTube crawlers.
WARCreate: A Google Chrome browser extension for creating WARC files from web pages.
WAS (Web Archiving Service): A web-based curatorial tool that enables libraries and archivists to capture, curate, analyze, and preserve web-based government and political information.
WAXToolbar: A Firefox extension to help users with common tasks encountered while browsing a web archive.
WCT (Web Curator Tool): A workflow management application for selective web archiving.
WarcManager: A web-based UI for managing and querying collections of web crawl data.
Warrick: A free utility for reconstructing (or recovering) a website from web archives.
Wayback Machine: A search and discovery tool for use with collections of website "snapshots" collected through web harvesting, usually with Heritrix (ARC or WARC files).
Xenu's Link Sleuth: Checks the hyperlinks on websites.
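Several of the tools above (Heritrix, WARCreate, NutchWAX, Wayback Machine) produce or consume WARC files, the ISO 28500 container format for web captures. A WARC file is a sequence of records, each with CRLF-delimited headers, a blank line, a content block, and a two-CRLF terminator. A minimal sketch of serialising one WARC/1.0 "response" record, with illustrative URL and payload values:

```python
import io
import uuid
from datetime import datetime, timezone

def warc_response_record(target_uri, http_bytes):
    """Serialise one WARC/1.0 'response' record: header lines,
    a blank line, the captured HTTP bytes, then two CRLFs."""
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Date",
         datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(http_bytes))),
    ]
    buf = io.BytesIO()
    buf.write(b"WARC/1.0\r\n")
    for name, value in headers:
        buf.write(("%s: %s\r\n" % (name, value)).encode("utf-8"))
    buf.write(b"\r\n")       # blank line ends the record header
    buf.write(http_bytes)    # captured HTTP response, stored verbatim
    buf.write(b"\r\n\r\n")   # two CRLFs terminate the record
    return buf.getvalue()

# Hypothetical capture of a single page.
http = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
record = warc_response_record("http://example.org/", http)
```

Real crawlers also emit companion record types (warcinfo, request, metadata) and usually gzip each record, but the framing above is what replay tools like the Wayback Machine parse.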