Difference between revisions of "Workflow:Web Archiving Quality Assurance Lifecycle"

	Web Archiving Quality Assurance Lifecycle
Status:	Production
Tools:	Heritrix; AWS; Pywb; OpenWayback; CDX
Input:	Seed URLs, SURTs, Exclude lists
Output:	WARCs, CDX files, Curatorial Data
Organisation:	Library of Congress

Revision as of 20:16, 2 June 2023

Workflow Description

Purpose, Context and Content

This workflow illustrates the life cycle of the quality assurance (QA) process in place in the Web Archiving Program at the Library of Congress as of March 25, 2023. The workflow is meant to be vendor-agnostic, but it assumes cloud-based web archiving services and cloud-based transfer. It is designed for a large-scale operation and for an iterative crawling environment in which adjustments to seed URLs, scopes, SURTs, regular expressions, etc. are made from crawl to crawl, rather than missing elements being patched in. A mix of open source and non-open source technologies is in play. The QA itself does not rely on a single technology, but it does require web archive crawl and replay technology.

Evaluation/Review

Further Information

@@ Line 11: / Line 11: @@
 <!-- To add an image of your workflow, open the "Upload File" link on the left in a new browser tab and follow on screen instructions, then return to this page and add the name of your uploaded image to the line below - replacing "workflow.png" with the name of your file. Replace the text "Textual description" with a short description of your image. Filenames are case sensitive! If you don't want to add a workflow diagram or other image, delete the line below  -->
-[[File:577px-QA-Life-Cycle-20230525.jpeg|Quality Assurance Life Cycle at Library of Congress as of March, 25 2023.]]<br>
+[[File:QA-Life-Cycle-20230525.jpeg|Quality Assurance Life Cycle at Library of Congress as of March, 25 2023.]]<br>
 <!-- Describe your workflow here with an overview of the different steps or processes involved-->
 ==Purpose, Context and Content==

