Web Capture

	Function definition: Tools that support the capture of data from the World Wide Web, by "crawling" links between resources or by other approaches.
	Lifecycle stage: Create or Receive (Acquire)

Tools for this function

Tool	Purpose
Archive-It	Archive-It is the leading web archiving service for collecting and accessing cultural heritage on the web. It is a service provided by the Internet Archive.
ArchiveBox	ArchiveBox is an open source tool that lets organizations & individuals archive both public & private web content while retaining control over their data.
ArchiveFacebook	ArchiveFacebook is a Firefox extension which allows individuals to save and manage Facebook web content.
Brozzler	From GitHub (https://github.com/internetarchive/brozzler): Brozzler is a distributed web crawler that uses a real browser (Chrome or Chromium) to fetch pages and embedded URLs and to extract links. Brozzler is designed to work in conjunction with warcprox for web archiving.
CINCH	CINCH (Capture INgest and CHecksum Tool) facilitates batch downloading and ingest of Internet-accessible documents and/or images to a central repository.
ContextMiner	ContextMiner is a framework to collect, analyze, and present contextual information along with the data.
Curate.Us	With a simple click of the mouse, you can create visually compelling clips and quotes of web content that are easily embedded in blog posts, email, forums, and websites.
DeepArc	Intended for preserving web sites from the back-end, this is a database-to-XML curation tool.
Find It! Keep It!	Find It! Keep It! is a tool to save and organise web content.
GNU Wget	Non-interactive network downloader
HTTrack	HTTrack is a website copying utility.
Heritrix	Heritrix is an open-source web crawler, allowing users to target websites they wish to include in a collection and to harvest an instance of each site.
Heritrix plug-in for rich media capture	The Rich Media Capture module (RMC), developed in the LiWA (Living Web Archives) project, is designed to enhance the capturing capabilities of the crawler, with regards to different multimedia content types.
IMacros	iMacros makes it easy to test web-based applications.
Khtml2png	khtml2png is a command line program to create screenshots of webpages.
Library (xklb)	Media indexing multi-tool
Metaproducts	Metaproducts offers several commercial capture and off-line browsing tools.
NetarchiveSuite	NetarchiveSuite is a web archiving software package designed to plan, schedule and run web harvests of parts of the Internet.
NutchWAX	NutchWAX is software for indexing ARC files (archived Web sites gathered using Heritrix) for full text search.
PageVault	pageVault supports the archiving of all unique responses generated by a web server.
Pearl Crescent Page Saver	Pearl Crescent Page Saver is an extension for Mozilla Firefox that lets you capture images of web pages, including Flash content.
RARC (ARC replicator)	rARC is a distributed system that enables Internet users to provide storage space from their computers to replicate small parts of the archived data stored in the central repository of the Web archive.
SPARQLing Unicorn QGIS Plugin	Plugin for QGIS. Fetches data from Wikidata and other Linked Data SPARQL endpoints and adds a new layer in a QGIS project. Just insert a SPARQL query for Geo-Items and get a new vector layer into QGIS.
Screen-scraper	screen-scraper is a tool for extracting data from websites.
SiteStory	SiteStory is a transactional web archive. It archives resources of a web server it is associated with.
Spadix software	Spadix Software can download websites from a starting URL, search engine results or web dirs, and is able to follow external links.
Storytracker	Tools for tracking stories on news homepages
Teleport	Teleport is a web crawling tool that enables offline browsing.
The DeDuplicator (Heritrix add-on module)	The DeDuplicator is an add-on module for Heritrix to reduce the amount of duplicate data collected in a series of snapshot crawls.
TubeKit	TubeKit is a toolkit for creating YouTube crawlers.
WARCreate	Google Chrome browser extension for creating WARC files from web pages
WAS (Web Archiving Service)	The Web Archiving Service (WAS) is a Web-based curatorial tool that enables libraries and archivists to capture, curate, analyze, and preserve Web-based government and political information.
WAXToolbar	WAXToolbar is a Firefox extension to help users with common tasks encountered surfing a web archive.
WCT (Web Curator Tool)	Web Curator Tool (WCT) is a workflow management application for selective web archiving.
WarcManager	The WARC Manager is a web-based UI for managing and querying collections of web crawl data.
Warrick	Warrick is a free utility for reconstructing (or recovering) a website from web archives.
Wayback Machine	The Wayback Machine is a powerful search and discovery tool for use with collections of Web site "snapshots" collected through Web harvesting, usually with Heritrix (ARC or WARC files).
Web Scraper Plus+	Web Scraper Plus+ takes data from the web and puts it into a spreadsheet or database.
WebCite	WebCite is an on-demand web archiving service that takes snapshots of Internet-accessible digital objects at the behest of users, storing the data on their own servers and assigning unique identifiers to those instances of the material.
WebShot	WebShot allows you to take screenshots of web pages and save them as full sized images or thumbnails.
Webkit2png	webkit2png is a command line tool that creates png screenshots of webpages.
Webrecorder	Webrecorder is a hosted web archiving tool with which users can capture what they see as they browse websites and save that information (locally or to a free account).
Xenu's Link Sleuth	The tool checks the hyperlinks on websites.
YT-DLP (You Tube Download P)	Supports downloading YouTube videos; based on the now-defunct YT-DL.
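
The "crawling" approach named in the function definition can be illustrated with a minimal sketch. It is not how any tool listed above is implemented; it only shows the general idea of following links breadth-first from a seed URL. The names LinkExtractor, crawl, and SITE are hypothetical, and the in-memory SITE dictionary stands in for real HTTP responses so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags in one HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop #fragments.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; fetch(url) returns HTML text or None."""
    seen = {seed}
    queue = deque([seed])
    captured = {}
    while queue and len(captured) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            # Unreachable resource: nothing to capture, nothing to follow.
            continue
        captured[url] = html
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return captured


# Hypothetical in-memory "site" (assumption: replaces real HTTP fetching).
SITE = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="b.html#top">B</a>',
    "http://example.org/a.html": '<a href="/">home</a> <a href="/missing.html">x</a>',
    "http://example.org/b.html": "<p>no links</p>",
}

pages = crawl("http://example.org/", SITE.get)
print(list(pages))
```

A real capture tool would replace SITE.get with an HTTP client, respect robots.txt and rate limits, restrict the crawl scope, and write responses to an archival format such as WARC; none of that is attempted here.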
