Web Archiving Quality Assurance (QA) Workflow
Workflow Description
Pre-crawl checks
A report is produced from WAMDB, our management database, listing all the websites due to be crawled the following month, along with the information the database holds about each site.
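As an illustration only, a minimal sketch of how such a report might be pulled. WAMDB's actual schema is not public, so the use of SQLite and the table and column names below (sites, seed_url, next_crawl_date) are assumptions.

```python
import sqlite3
from datetime import date

def sites_due_next_month(db_path: str) -> list[dict]:
    """Return the sites scheduled for crawl in the coming month.

    Hypothetical schema: WAMDB's real tables and columns are not shown here.
    """
    first = date.today().replace(day=1)
    # First day of the following month (handles the December rollover).
    next_month = first.replace(month=first.month % 12 + 1,
                               year=first.year + first.month // 12)
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    rows = conn.execute(
        "SELECT seed_url, last_crawl_date, notes FROM sites "
        "WHERE strftime('%Y-%m', next_crawl_date) = ?",
        (next_month.strftime("%Y-%m"),),
    ).fetchall()
    conn.close()
    return [dict(r) for r in rows]
```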
The list is divided among the members of the web archiving team.
The team runs through a list of checks on each website. The aim is to ensure the information we send to the crawler is as complete as possible, so the crawl is ‘right first time’. This helps to avoid additional work, and the temporal problems (parts of one snapshot captured at different times) that can occur if content is added to the archive later (patching).
Checks include:
- Is the entry URL (seed) for the site correct? (See the seed-check sketch at the end of this section.)
- Is the site still being updated, or is it ‘dormant’? If it has not been updated since the previous crawl, we postpone the crawl.
- Is there an XML sitemap? Has it changed since the last crawl? (See the sitemap sketch at the end of this section.)
- Update pagination patterns. Web crawlers are set to capture sites to a specific number of ‘hops’ from the homepage; if a section contains more pages than the hop limit allows, we send the paginated links as additional seeds for the crawler (see the pagination sketch at the end of this section).
- Check for new pagination patterns.
- Check for content hosted at third-party locations (e.g. downloads, images); see the third-party host sketch at the end of this section.
- Has the site been redesigned since the previous crawl?
- Are there any specific checks we need to do or information we need to provide at crawl time?
We can also add specific checks for a temporary period, for example if we need to update the contents of a particular field because of a process change.
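Some of these checks lend themselves to lightweight automation. As an illustration only, a minimal sketch of a seed check using the Python requests library; the function name and report fields are assumptions, not part of our actual tooling.

```python
import requests

def check_seed(seed: str, timeout: int = 30) -> dict:
    """Fetch a seed URL and report whether it still resolves cleanly.

    Illustrative only: the pre-crawl checks are a manual process, and any
    automation would sit alongside them, not replace them.
    """
    resp = requests.get(seed, timeout=timeout, allow_redirects=True)
    return {
        "seed": seed,
        "status": resp.status_code,
        "final_url": resp.url,  # differs from the seed if it redirected
        "redirected": resp.url.rstrip("/") != seed.rstrip("/"),
    }
```

A redirect to a different domain is a strong hint that the seed held in WAMDB needs updating before the crawl runs.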
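Likewise, a sketch of the sitemap comparison, assuming we keep the URL set from the previous crawl to diff against; it uses only the Python standard library.

```python
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(sitemap_url: str) -> set[str]:
    """Return the set of <loc> URLs listed in an XML sitemap."""
    with urllib.request.urlopen(sitemap_url) as resp:
        root = ET.parse(resp).getroot()
    return {loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text}

def sitemap_changes(sitemap_url: str, previous: set[str]) -> dict:
    """Compare the current sitemap against the URL set from the last crawl."""
    current = sitemap_urls(sitemap_url)
    return {"added": current - previous, "removed": previous - current}
```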
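For pagination, a sketch of expanding a known pattern into extra seeds once a section exceeds the crawler's hop budget; the {page} placeholder syntax is an assumption for illustration.

```python
def pagination_seeds(pattern: str, page_count: int, hop_limit: int) -> list[str]:
    """Expand a pagination pattern into explicit seed URLs.

    Pages within the hop limit are reachable from the homepage anyway,
    so only the deeper pages are emitted as extra seeds.
    """
    if page_count <= hop_limit:
        return []  # the crawler will reach every page on its own
    return [pattern.format(page=n) for n in range(hop_limit + 1, page_count + 1)]

# e.g. pagination_seeds("https://example.gov.uk/news?page={page}", 40, 25)
# yields explicit seeds for pages 26..40
```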
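Finally, a sketch of flagging third-party-hosted content by listing the external hosts referenced from a page; the choice of src/href attributes is a reasonable default, not an exhaustive check.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
from urllib.request import urlopen

class HostCollector(HTMLParser):
    """Collect the hosts of src/href attributes found in a page."""
    def __init__(self):
        super().__init__()
        self.hosts: set[str] = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                host = urlparse(value).netloc
                if host:
                    self.hosts.add(host)

def third_party_hosts(page_url: str) -> set[str]:
    """Return hosts referenced by the page that differ from its own host."""
    own_host = urlparse(page_url).netloc
    parser = HostCollector()
    with urlopen(page_url) as resp:
        parser.feed(resp.read().decode("utf-8", errors="replace"))
    return parser.hosts - {own_host}
```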