Workflow:Web Archiving Quality Assurance Lifecycle

Revision as of 20:20, 2 June 2023 by Meghly (talk | contribs)
Web Archiving Quality Assurance Lifecycle
Input: Seed URLs, SURTs, Exclude lists
Output: WARCs, CDX files, Curatorial Data
Organisation: Library of Congress

Workflow Description

Quality Assurance Life Cycle at the Library of Congress as of March 25, 2023.

Purpose, Context and Content

This workflow illustrates the life cycle of the quality assurance (QA) process in place in the Web Archiving Program at the Library of Congress as of March 25, 2023. The workflow is meant to be vendor-agnostic, but it assumes cloud-based web archiving services and cloud-based transfer. It is designed for a large-scale operation and for an iterative crawling environment in which adjustments to seed URLs, scopes, SURTs, regular expressions, etc. are made from crawl to crawl, rather than missing elements being patched in. A mix of open source and non-open source technologies is in play; the QA itself does not rely on any single technology, but it does require web archive crawl and replay technology.
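
As a rough illustration of the kind of crawl-to-crawl scope adjustment described above, the following Python sketch checks URLs against a list of SURT prefixes and exclude patterns. It is not the Library of Congress tooling: the function names `to_surt` and `in_scope`, the example.org URLs, and the regular expression are all hypothetical, and the SURT conversion is a simplified form that skips the canonicalisation real crawlers apply.

```python
import re
from urllib.parse import urlsplit


def to_surt(url: str) -> str:
    """Simplified SURT form: scheme://(reversed,host,labels,)/path.

    Assumption: no canonicalisation (port, userinfo, query) is applied.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    labels = ",".join(reversed(host.split("."))) + ","
    path = parts.path or "/"
    return f"{parts.scheme}://({labels}){path}"


def in_scope(url: str, surt_prefixes: list[str], exclude_patterns: list[str]) -> bool:
    """A URL is in scope if no exclude regex matches it and its SURT
    form starts with at least one accepted SURT prefix."""
    if any(re.search(pattern, url) for pattern in exclude_patterns):
        return False
    surt = to_surt(url)
    return any(surt.startswith(prefix) for prefix in surt_prefixes)


# Hypothetical scope for a first crawl: everything under example.org.
prefixes = ["https://(org,example,"]
excludes_crawl_1: list[str] = []

# Hypothetical adjustment after QA finds a calendar crawler trap.
excludes_crawl_2 = [r"/calendar/\?"]

trap_url = "https://www.example.org/calendar/?month=5"
print(in_scope(trap_url, prefixes, excludes_crawl_1))  # captured in crawl 1
print(in_scope(trap_url, prefixes, excludes_crawl_2))  # excluded in crawl 2
```

In this sketch the QA finding changes the configuration for the next crawl instead of patching the existing capture, which mirrors the iterative approach the workflow assumes.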


Further Information