Difference between revisions of "Web"

From COPTR

Latest revision as of 15:43, 9 December 2021

Content definition: Tools that support the preservation of live web data and archived web data (such as ARC and WARC formats).
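To illustrate the kind of container the definition refers to, the following is a minimal sketch of how a single WARC record is laid out: a version line, named header fields, a blank line, the content block, and a two-line terminator. It assumes the WARC/1.0 layout and uses only the Python standard library. The helper names `build_warc_record` and `parse_warc_record` are hypothetical and are not taken from any tool listed on this page; for real work, one of the dedicated tools below is the better choice.

```python
from __future__ import annotations

import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, payload: bytes, content_type: str) -> bytes:
    """Serialize one WARC/1.0 'resource' record (hypothetical helper for illustration)."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{name}: {value}\r\n" for name, value in headers)
    # A blank line separates the headers from the block; two CRLFs end the record.
    return head.encode("utf-8") + b"\r\n" + payload + b"\r\n\r\n"


def parse_warc_record(record: bytes) -> tuple[dict[str, str], bytes]:
    """Split a single uncompressed record back into its header fields and block."""
    head, _, rest = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    if not lines[0].startswith("WARC/"):
        raise ValueError("not a WARC record")
    fields = dict(line.split(": ", 1) for line in lines[1:])
    # Content-Length tells the reader how many bytes of the block to take.
    length = int(fields["Content-Length"])
    return fields, rest[:length]


if __name__ == "__main__":
    record = build_warc_record("http://example.com/", b"<html>hi</html>", "text/html")
    fields, body = parse_warc_record(record)
    print(fields["WARC-Type"], fields["WARC-Target-URI"], len(body))
```

The sketch handles only one uncompressed record and skips features found in real archives, such as per-record gzip compression, request and response record types, and digests.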

Tools for this content type

Annotation Curation Tool (ACT) w3act is an annotation and curation tool for web archives X
ArchiFiltre Overview of folder trees with fine diagrams XXX
Archive-It Archive-It is the leading web archiving service for collecting and accessing cultural heritage on the web. It is a service provided by the Internet Archive. XX
ArchiveFacebook ArchiveFacebook is a Firefox extension which allows individuals to save and manage Facebook web content. X
Brozzler From GitHub (https://github.com/internetarchive/brozzler): Brozzler is a distributed web crawler that uses a real browser (Chrome or Chromium) to fetch pages and embedded URLs and to extract links. Brozzler is designed to work in conjunction with warcprox for web archiving. X
CINCH CINCH (Capture INgest and CHecksum Tool) facilitates batch downloading and ingest of Internet-accessible documents and/or images to a central repository. X
DeepArc Intended for preserving web sites from the back-end, this is a database-to-XML curation tool. XX
Filestar Universal file converter for 900+ file types. XXX
GNU Wget Non-interactive network downloader X
HTTrack HTTrack is a website copying utility. X
Heritrix Heritrix is an open-source web crawler, allowing users to target websites they wish to include in a collection and to harvest an instance of each site. X
IMacros iMacros makes it easy to test web-based applications. XXX
Khtml2png khtml2png is a command line program to create screenshots of webpages. X
Libsafe libsafe allows organizations to create a fully OAIS-compliant archive, including active and passive digital preservation workflows, and is particularly suited to master image files produced by digitization processes. X
NetarchiveSuite NetarchiveSuite is a web archiving software package designed to plan, schedule and run web harvests of parts of the Internet. X
NutchWAX NutchWAX is software for indexing ARC files (archived Web sites gathered using Heritrix) for full text search. X
PageVault pageVault supports the archiving of all unique responses generated by a web server. X
Pagelyzer Suite of tools for detecting changes in web pages and their rendering XXX
Pearl Crescent Page Saver Pearl Crescent Page Saver is an extension for Mozilla Firefox that lets you capture images of web pages, including Flash content. X
Perma.cc A tool that captures, stores, plays back and provides a new URL for web citation. Built and maintained at the Harvard Law School Library. XX
Screen-scraper screen-scraper is a tool for extracting data from websites. X
SiteStory SiteStory is a transactional web archive. It archives resources of a web server it is associated with. X
Spadix software Spadix Software can download websites from a starting URL, search engine results or web directories, and is able to follow external links. X
Storytracker Tools for tracking stories on news homepages X
Teleport Teleport is a web crawling tool that enables offline browsing X
The DeDuplicator (Heritrix add-on module) The DeDuplicator is an add-on module for Heritrix to reduce the amount of duplicate data collected in a series of snapshot crawls. XX
UKWA Access API Web archives access API XXX
W3C Markup Validation Service This is the World Wide Web Consortium's validation tool. X
WARCreate Google Chrome browser extension for creating WARC files from web pages XX
WAS (Web Archiving Service) The Web Archiving Service (WAS) is a Web-based curatorial tool that enables libraries and archivists to capture, curate, analyze, and preserve Web-based government and political information. X
WAXToolbar WAXToolbar is a Firefox extension to help users with common tasks encountered while browsing a web archive. X
WCT (Web Curator Tool) Web Curator Tool (WCT) is a workflow management application for selective web archiving. XX
WERA (Web ARchive Access) WERA (Web ARchive Access) is a freely available solution for searching and navigating archived web document collections. X
Warc Analyzer A proof-of-concept client-side webapp for analyzing WARC data using Webrecorder's warcio.js. No WARC data is uploaded anywhere; it runs on your machine. The idea is that it would be useful for archivists who have been given a pile of WARC data and would like to quickly know what it contains. X
Warc-proxy Warc-proxy is a simple tool to view WARC content in Firefox XX
WarcManager The WARC Manager is a web-based UI for managing and querying collections of web crawl data. XXX
Warcit Warcit is a command-line tool that converts directories (including nested directories), files (including HTML or other web assets and data files) and ZIP files to Web Archives (WARC). X
Warctools Command line tools and libraries for handling and manipulating WARC files (and HTTP contents) XXX
Warrick Warrick is a free utility for reconstructing (or recovering) a website from web archives. X
Wayback Machine The Wayback Machine is a powerful search and discovery tool for use with collections of Web site "snapshots" collected through Web harvesting, usually with Heritrix (ARC or WARC files). XX
Web Archive Discovery Indexing and discovery tools for web archives. XXXX
Web Recorder Player A tool that replays WARC files on your local computer. XX
Web Scraper Plus+ Web Scraper Plus+ takes data from the web and puts it into a spreadsheet or database. X
WebCite WebCite is an on-demand web archiving service that takes snapshots of Internet-accessible digital objects at the behest of users, storing the data on their own servers and assigning unique identifiers to those instances of the material. XXXX
WebShot WebShot allows you to take screenshots of web pages and save them as full sized images or thumbnails. X
Webkit2png webkit2png is a command line tool that creates png screenshots of webpages. X
Webrecorder Webrecorder is a hosted web archiving tool with which users can capture what they see as they browse websites and save that information (locally or to a free account). X
Xenu's Link Sleuth The tool checks the hyperlinks on websites. X
YT-DLP (You Tube Download P) Supports downloading YouTube videos; based on the now-defunct YT-DL. X

Also see the Web Archiving Community master list of software (https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#the-master-lists).