Workflow:Web Archiving Quality Assurance (QA) Workflow

 
{{Infobox COW
|status=Production
|tools=Heritrix, Cathode, Browsertrix, Jira, Screaming Frog
|input=Live website content.
|output=Archived website content served from WARC files.
  
 
We can also add specific checks for a temporary period, for example if we need to update the contents of a particular field due to a process change.
 
  
 
<b>Site Crawled</b>
 
 
The crawl order is generated as an XML file and sent to our vendor. The vendor launches the crawls.
 
  
 
<b>Tracking and Prioritisation</b>

We currently use Jira as our tracking system for crawls.
As soon as a crawl is launched, our vendors set up a Jira ticket containing basic information about the crawl.
All correspondence between TNA and our vendors about the crawl takes place on the Jira ticket.
TNA marks up the Jira tickets of any crawls which need to be treated as ‘High Priority’. We add a standard label and a descriptive comment.
 
  
 
Common reasons for a site being considered High Priority include:

Our supplier will also leave a comment in the ticket if a problem is noticed during the crawl, for example if it is becoming much larger than expected or if the crawler is blocked.
  
Each site has a parent ticket (task) in Jira and each individual crawl has a child ticket (sub-task). This enables us to record information which applies to all crawls at parent level and to move easily between individual crawl tickets.
 
 
We also use Jira to monitor the overall situation with crawling and QA, for example to check TNA and vendor QA progress towards targets.
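
The parent and child ticket structure described above can be pictured as a small data model. The following is only an illustrative sketch: the class names, field names, ticket keys and status values are hypothetical and are not taken from our Jira configuration.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class CrawlTicket:
    """Child ticket (sub-task): one individual crawl of a site."""
    key: str
    status: str  # hypothetical status name, e.g. "Vendor QA" or "Published"
    labels: Set[str] = field(default_factory=set)


@dataclass
class SiteTicket:
    """Parent ticket (task): information that applies to every crawl of a site."""
    key: str
    site_url: str
    notes: List[str] = field(default_factory=list)
    crawls: List[CrawlTicket] = field(default_factory=list)


def qa_progress(sites: List[SiteTicket], done_status: str = "Published") -> Tuple[int, int]:
    """Return (crawls that have reached done_status, total crawls) across all sites."""
    crawls = [crawl for site in sites for crawl in site.crawls]
    done = sum(1 for crawl in crawls if crawl.status == done_status)
    return done, len(crawls)
```

Keeping site-wide notes on the parent and per-crawl state on the children is what makes a simple progress count like <code>qa_progress</code> possible without reading every comment thread.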
 
 
 
 
 
<b>Auto-QA – Crawl Log Analysis (CLA)</b>
 
 
 
After the crawl is complete, the first stage of our ‘Auto-QA’ runs. It takes every URL recorded with an error code in the crawl.log file and checks whether that URL is available on the live web. A list of the URLs that are available is attached to the Jira ticket.
 
Auto-QA is available to all under the MIT Licence: https://github.com/tna-webarchive/open-auto-qa
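
As a rough illustration of the idea behind Crawl Log Analysis, the sketch below pulls failed URLs out of crawl.log lines and filters them by a liveness check. It is not the code from the open-auto-qa repository. It assumes the usual whitespace-separated Heritrix crawl.log layout (second field is the fetch status code, fourth field is the URI), and the function names and the choice of which codes count as errors are assumptions made for this example.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List


def failed_urls(log_lines: Iterable[str]) -> List[str]:
    """Return the URIs of crawl.log lines whose fetch status looks like an error."""
    urls = []
    for line in log_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        try:
            status = int(fields[1])
        except ValueError:
            continue  # status field is not a number
        # Assumption: HTTP 4xx/5xx and Heritrix's negative codes are errors.
        if status >= 400 or status < 0:
            urls.append(fields[3])
    return urls


def http_is_live(url: str, timeout: float = 10.0) -> bool:
    """Liveness check: True if a HEAD request succeeds without an HTTP error."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        return False


def available_on_live_web(urls: Iterable[str],
                          is_live: Callable[[str], bool] = http_is_live) -> List[str]:
    """Return the de-duplicated, sorted URLs that the liveness check accepts."""
    return sorted({url for url in urls if is_live(url)})
```

Passing the liveness check in as a parameter keeps the log parsing testable without network access. A real run would call <code>available_on_live_web(failed_urls(open("crawl.log")))</code> and attach the result to the ticket.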
 
 
 
 
 
<b>Vendor QA</b>
 
 
 
Our vendor runs QA checks on all sites, starting with those marked as 'High Priority' in Jira.
 
They patch in URLs identified during the ‘Auto-QA’ process and other missing files. This usually aims to fix small problems such as missing images or broken formatting.
 
All QA notes are added to the Jira ticket, including information about problems they have been unable to fix at the QA stage which may be fixed by additional work from their engineering team.
 
 
 
 
 
<b>The National Archives Team QA – High Priority Crawls</b>
 
 
 
We run two more parts of the ‘Auto-QA’ process. They are only run on high priority crawls as they take a lot of time and resource.
 
 
 
* Diffex: runs the SEO site audit tool Screaming Frog across the live site, then checks the results against the archived site. A list of URLs which were found by Screaming Frog but are not in the archived site is attached to the Jira ticket. TNA QA checks over the list and asks the vendor to patch them in. Diffex is more effective if the live site is dormant for the duration of the archiving process.
 
 
 
* PDF Flash: crawls all PDF files in the archived site, extracts any in-scope hyperlinks and checks them against the archived site. A list of URLs missing from the archived site is attached to the Jira ticket. TNA QA checks over the list and asks the vendor to patch them in.
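
Both checks above end in the same comparison: a list of URLs found somewhere (by Screaming Frog on the live site, or as hyperlinks inside archived PDFs) is compared with the URLs held in the archive. The sketch below shows that comparison only. It is not the Diffex or PDF Flash code; the function names, the normalisation rules and the definition of "in scope" (the site's host or one of its subdomains) are assumptions made for this example.

```python
from typing import Iterable, List
from urllib.parse import urlsplit, urlunsplit


def normalise(url: str) -> str:
    """Lower-case the scheme and host and drop the fragment so equivalent URLs compare equal."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def in_scope(url: str, site_host: str) -> bool:
    """Assumed scope rule: the URL is on the site's host or one of its subdomains."""
    host = urlsplit(url).hostname or ""
    return host == site_host or host.endswith("." + site_host)


def missing_from_archive(found: Iterable[str],
                         archived: Iterable[str],
                         site_host: str) -> List[str]:
    """Return in-scope URLs that were found but are absent from the archived set."""
    archived_set = {normalise(url) for url in archived}
    candidates = {normalise(url) for url in found if in_scope(url, site_host)}
    return sorted(candidates - archived_set)
```

The output of <code>missing_from_archive</code> corresponds to the list that is attached to the Jira ticket for TNA QA to review before asking the vendor to patch.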
 
 
 
We also undertake visual QA. This involves visually comparing the live site with the archived site, concentrating on areas which are likely to cause problems, such as interactive content, videos and animations, and large publication storage sections. We use information from previous crawls to guide what we check.
 
 
 
If any patching is needed, including from Diffex or PDF Flash, or we find other problems with the archived site, we assign the ticket back to the vendor and ask them to fix it. We add detailed descriptions of the problems to Jira, often illustrated with screenshots or short videos.
 
 
 
 
 
<b>The National Archives Team QA – Regular Priority</b>
 
 
 
We undertake visual QA checks only. As with High Priority crawls, this involves visually comparing the live site with the archived site, concentrating on areas which are likely to cause problems, such as interactive content, videos and animations, and large publication storage sections. We also use information from previous crawls to guide what we check and how much checking to do. If no serious problems were found in previous crawls of a site, and Vendor QA reported no problems with the current crawl, we do not dedicate much QA time.
 
 
 
  
<b>Supplier Patching & Additional Work</b>
 
  
The supplier will do the patching we request.
 
When the work is complete, the ticket will be assigned back to us to check. If we do not think the work is complete, we will assign it back to the vendor to recheck. This will continue until we are happy that the site is as close to ‘state of the art’ as possible.
 
If any problems cannot be resolved by patching, the supplier may suggest that we consider additional work by their engineering team. This is dealt with through a separate workflow.
 
  
 
<b>Site Owner QA for very High Priority Crawls</b>
 
 
If we are in contact with the website owner of a site which is closing or being redesigned, we will ask them to check the site before it is published. We believe this is useful as they have the best knowledge of how their site works. We emphasise that it is the site owner’s responsibility to check that content has been captured and works in the archive before removing it from the live web.
 
If site owners find content which is not working, we will ask the Vendor to investigate and fix if possible.
 
We provide guidance on QA for website owners on our web pages: https://www.nationalarchives.gov.uk/webarchive/archive-a-website/how-to-check-your-archived-website/
 
 
 
<b>Publication</b>
 
 
When we are happy that a site is as close as possible to ‘state of the art’ we approve it for publication. This is the end of the QA process.
 
Any information from the QA process which may be helpful for future crawls will be added either to the Jira parent ticket (for example for content which cannot currently be made to work in the archive) or to WAMDB (if it needs to be added to the crawl order next time).
 
  
 
==Purpose, Context and Content==

<!-- Describe what your workflow is for - i.e. what it is designed to achieve, what the organisational context of the workflow is, and what content it is designed to work with -->
The National Archives is responsible for archiving websites owned by UK central government. The archived websites are made available to the public through the UK Government Web Archive: https://www.nationalarchives.gov.uk/webarchive/
 
 
The QA workflow aims to enable us to work with our vendor (and sometimes our stakeholders) to achieve 'state of the art' capture and replay of websites, while deploying our resources efficiently.
 
 
We define 'state of the art' as the archived website looking and behaving as closely as possible to the live website at the time it was captured, taking into account the limitations of currently available technology and resources.
 
  
 
==Evaluation/Review==

<!-- How effective was the workflow? Was it replaced with a better workflow? Did it work well with some content but not others? What is the current status of the workflow? Does it relate to another workflow already described on the wiki? Link, explain and elaborate -->
The QA workflow is effective but is constantly under review. It is one of our central workflows and we seek to continuously improve it.
 
  
 
==Further Information==

<!-- Note that your workflow will be marked with a CC3.0 licence -->
 
[[User:ClaireUKGWA|ClaireUKGWA]] ([[User talk:ClaireUKGWA|talk]]) 15:22, 7 February 2024 (UTC)
 
