Workflow:Quality Assurance: Iterative Seed Issue Decision Tree

	Quality Assurance: Iterative Seed Issue Decision Tree
Status:	Production
Tools:	Heritrix; Webrecorder; OpenWayback; Pywb; OutbackCDX; CDX
Input:	Web Archives visual replay and crawl report data
Output:	Adjustments to seed URLs and scopes; the results of a future crawl; documentation
Organisation:	Library of Congress

Workflow Description

Purpose, Context and Content

This decision tree outlines seed-by-seed quality assurance. Review begins with data from Heritrix crawl reports and iterates on the input, the crawler, or other variables until either the captures improve or the seed URL is deemed non-archivable. At our organization, this workflow is carried out entirely by the Web Archiving Team, the technical team that facilitates the contracted crawling, the use of our curatorial workflow tool Digiboard, and the ingest of and access to the web archives (the latter in conjunction with our Office of the Chief Information Officer, or OCIO).
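The iterative loop described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the Web Archiving Team's actual tooling: the names `SeedReview`, `review_seed`, and `capture_is_acceptable`, and the adjustment labels, are all hypothetical. In practice, the acceptability check is a human judgment based on visual replay and crawl report data rather than a function call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SeedReview:
    """Record of one seed's QA pass (hypothetical structure)."""
    seed_url: str
    notes: List[str] = field(default_factory=list)
    outcome: str = "pending"


def review_seed(
    seed_url: str,
    adjustments: List[str],
    capture_is_acceptable: Callable[[str, Optional[str]], bool],
) -> SeedReview:
    """Try each adjustment in turn until the capture improves.

    capture_is_acceptable(seed_url, adjustment) stands in for the manual
    check of replay and crawl reports; adjustment is None for the
    unmodified baseline capture.
    """
    review = SeedReview(seed_url)

    # Start from the existing capture before changing anything.
    if capture_is_acceptable(seed_url, None):
        review.outcome = "ok"
        return review

    # Iterate on input, crawler, or other variables, documenting each try.
    for adjustment in adjustments:
        review.notes.append(f"tried: {adjustment}")
        if capture_is_acceptable(seed_url, adjustment):
            review.outcome = "improved"
            return review

    # Every adjustment failed: the seed is deemed non-archivable.
    review.outcome = "non-archivable"
    return review


# Example run with a stand-in check that only passes once scope is widened.
def fake_check(seed_url: str, adjustment: Optional[str]) -> bool:
    return adjustment == "widen scope"


result = review_seed(
    "https://example.org/",
    ["change seed URL", "widen scope", "switch crawler"],
    fake_check,
)
print(result.outcome, result.notes)
```

The notes list mirrors the documentation output of the workflow: each attempted adjustment is recorded whether or not it helped, so a later reviewer can see what was already tried for a given seed.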

Evaluation/Review

This workflow, or some version of it, has long been in place in our program. It is labor intensive and cannot be completed for 100% of the materials going into the crawls or 100% of the materials coming out of them. Even so, close attention to detail may yield results, even when QA cannot be completed on everything all the time.

Further Information
