Workflow:Web Archiving Quality Assurance (QA) Workflow

 
{{Infobox COW
|status=Production
|tools=Heritrix, Cathode, Browsertrix, Jira, Screaming Frog
|input=Live website content.
|output=Archived website content served from WARC files.
 
<b>Tracking and Prioritisation</b>
 
  
We currently use Jira as our tracking system for crawls.
As soon as a crawl is launched, our vendors set up a Jira ticket containing basic information about the crawl.
All correspondence between TNA and our vendors about the crawl takes place on the Jira ticket.
TNA marks up the Jira tickets of any crawls which need to be treated as ‘High Priority’. We add a standard label and a descriptive comment.
  
 
Common reasons for a site being considered High Priority include:  
 
 
Our supplier will also leave a comment in the ticket if a problem is noticed during the crawl – for example if it is becoming much larger than expected or if the crawler is blocked.  
 
  
Each site has a parent ticket (task) in Jira and each individual crawl has a child ticket (sub-task). This enables us to record information which applies to all crawls at parent level and to easily move between individual crawl tickets.
 
 
We also use Jira to monitor the overall situation with crawling and QA, for example to check TNA and vendor QA progress towards targets.  
 
  
  
 
<b>Auto-QA – Crawl Log Analysis (CLA)</b>
 
  
After the crawl is complete the first stage of our ‘Auto-QA’ runs. It checks every URL that returned an error code in the crawl.log file to see whether it is available on the live web. A list of the available URLs is attached to the Jira ticket.
 
Auto-QA is available to all under the MIT Licence: https://github.com/tna-webarchive/open-auto-qa
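
The crawl-log check described above can be sketched roughly as follows. This is an illustration only and is not taken from the open-auto-qa repository. It assumes the usual Heritrix crawl.log layout, in which the second whitespace-separated field is the fetch status code and the fourth is the URL, and it treats negative codes and codes of 400 or above as errors. The function names and the use of a HEAD request to test the live web are assumptions made for the example.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List


def failed_urls(log_lines: Iterable[str]) -> List[str]:
    """Return URLs whose status is a crawler error (< 0) or an HTTP error (>= 400)."""
    urls = []
    for line in log_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        try:
            status = int(fields[1])  # assumed: field 2 is the fetch status code
        except ValueError:
            continue
        if status < 0 or status >= 400:
            urls.append(fields[3])  # assumed: field 4 is the URL
    return urls


def live_status_ok(url: str, timeout: float = 10.0) -> bool:
    """Hypothetical live-web check: True if a HEAD request returns a 2xx status."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, ValueError, OSError):
        return False


def available_on_live_web(
    log_lines: Iterable[str],
    is_live: Callable[[str], bool] = live_status_ok,
) -> List[str]:
    """Return each failed URL once, in log order, if it is reachable on the live web."""
    seen = set()
    available = []
    for url in failed_urls(log_lines):
        if url in seen:
            continue
        seen.add(url)
        if is_live(url):
            available.append(url)
    return available
```

Passing the live-web check in as a parameter keeps the log parsing testable without network access. The resulting list corresponds to the list of available URLs that is attached to the ticket.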
 
  
 
<b>Vendor QA</b>
 
  
Our vendor runs QA checks on all sites, starting with those marked as ‘High Priority’ in Jira.
They patch in URLs identified during the ‘Auto-QA’ process and other missing files. This usually aims to fix small problems such as missing images or broken formatting.
All QA notes are added to the Jira ticket, including information about problems they have been unable to fix at the QA stage which may be fixed by additional work from their engineering team.
  
  
 
<b>The National Archives Team QA – High Priority Crawls</b>
 
 
 
We run two more parts of the ‘Auto-QA’ process. They are only run on high priority crawls as they take a lot of time and resource.
  
* Diffex – involves running the ‘SEO Site Audit Tool’, Screaming Frog, across the live site, then checking the outcome against the archived site. A list of URLs which were found by Screaming Frog but not in the archived site is attached to the Jira ticket. TNA QA checks over the list and asks the vendor to patch them in. Diffex is more effective if the live site is dormant for the duration of the archiving process.
  
* PDF Flash – crawls all PDF files in the archived site, extracts any in-scope hyperlinks and checks them against the archived site. A list of URLs missing from the archived site is attached to the Jira ticket. TNA QA checks over the list and asks the vendor to patch them in.
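
Both checks above end with the same comparison: a list of discovered URLs is compared with the URLs held in the archived site, and anything missing is reported. A minimal sketch of that comparison is given below. It is not the Diffex or PDF Flash code; the function names are hypothetical, and the normalisation rules (ignoring scheme, host case and fragments) are assumptions that a real workflow may need to tighten or relax.

```python
from typing import Iterable, List
from urllib.parse import urlsplit


def normalise(url: str) -> str:
    """Reduce a URL to a comparison key: lower-cased host, path and query.

    Assumption: scheme and fragment are ignored, and an empty path counts as "/".
    """
    parts = urlsplit(url.strip())
    key = parts.netloc.lower() + (parts.path or "/")
    if parts.query:
        key += "?" + parts.query
    return key


def missing_from_archive(
    discovered_urls: Iterable[str],
    archived_urls: Iterable[str],
) -> List[str]:
    """Return discovered URLs with no match in the archive, once each, in input order."""
    archived = {normalise(url) for url in archived_urls}
    seen = set()
    missing = []
    for url in discovered_urls:
        key = normalise(url)
        if key not in archived and key not in seen:
            seen.add(key)
            missing.append(url.strip())
    return missing
```

For Diffex, `discovered_urls` would be the URLs exported from the Screaming Frog run on the live site. For PDF Flash, it would be the in-scope hyperlinks extracted from the archived PDF files.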
  
 
We also undertake visual QA. This involves visually comparing the live site with the archived site, concentrating on areas which are likely to cause problems, such as interactive content, videos, animations and large publication storage sections. We use information from previous crawls to guide what we check.
  
If any patching is needed, including from Diffex or PDF Flash, or we find other problems with the archived site, we assign it back to the vendor and ask them to fix the problems. We add detailed descriptions of the problems to Jira, often illustrated with screenshots or short videos.
  
  
