Difference between revisions of "Workflow:Browsertrix-crawler Workflow"

From COPTR

Revision as of 16:48, 9 December 2021

Browsertrix-crawler Workflow
Status:Experimental
Tools:
Input:Website
Output:WARC file
Organisation:UK Government Web Archive

Workflow Description

Flowchart workflow for capturing a website with Browsertrix-crawler


The workflow involves the decision to capture a website with Browsertrix-crawler. It shows the iterative process of crawling a page with Browsertrix, QAing the results in Conifer and recrawling with adjusted settings.

Purpose, Context and Content

The purpose of this workflow is to determine whether a site is suitable for capture with Browsertrix Crawler and, if so, to run a Browsertrix crawl. The crawl is then subject to Quality Assurance. If the crawl is found to be unsatisfactory, Browsertrix settings are adjusted and the crawl is run again; this process may be repeated several times until a satisfactory crawl is completed. If no satisfactory crawl can be made in this way, the site will be captured with Conifer.
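The crawl, QA and recrawl loop described above can be sketched in Python. This is only an illustrative outline, not the archive's actual tooling: `CrawlSettings`, `capture_with_retries`, the `run_crawl` and `passes_qa` callables, the `page_wait_seconds` setting and the `max_attempts` limit are all hypothetical names and values chosen for the example. The workflow itself only says the process may be repeated "several times".

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class CrawlSettings:
    # Hypothetical settings; the real adjustable options depend on the crawler configuration.
    page_wait_seconds: int = 5
    attempt: int = 1


def capture_with_retries(
    url: str,
    run_crawl: Callable[[str, CrawlSettings], str],
    passes_qa: Callable[[str], bool],
    max_attempts: int = 3,  # assumed limit, not stated in the workflow
) -> Optional[str]:
    """Return the path of a satisfactory WARC, or None to signal a fallback to Conifer."""
    settings = CrawlSettings()
    for attempt in range(1, max_attempts + 1):
        warc_path = run_crawl(url, settings)
        if passes_qa(warc_path):
            return warc_path
        # QA failed: adjust settings before the next attempt (placeholder adjustment).
        settings = replace(
            settings,
            page_wait_seconds=settings.page_wait_seconds * 2,
            attempt=attempt + 1,
        )
    return None
```

A caller would supply its own `run_crawl` (for example, a wrapper that launches the crawler and returns the WARC path) and `passes_qa` (the outcome of the manual or automated QA step); a return value of `None` corresponds to the "capture with Conifer" branch.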

The steps are as follows:

1. A site is identified for capture.

2. The site is assessed to determine which capture method is suitable. At this point we look at:

    * How large is the site?
    * Does the site contain interactive content?
    * What is the planned capture frequency? (If the proposed capture is very frequent, we may be more likely to use an in-house tool like Browsertrix to reduce costs.)
    * What level of fidelity is required?
    * Have previous crawls of the site been attempted, and what was the outcome?

3. An initial decision is made about which capture technology to use.
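The assessment in step 2 could be recorded as a simple structure so that the initial decision in step 3 is traceable. The sketch below is an assumption-laden illustration: the `SiteAssessment` fields, the `favours_in_house_tool` helper and the `FREQUENT_CAPTURES_PER_YEAR` threshold are hypothetical. It encodes only the one rule of thumb stated above (very frequent captures favour an in-house tool such as Browsertrix); the other criteria are recorded but left to human judgement.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold for "very frequent"; the workflow does not give a number.
FREQUENT_CAPTURES_PER_YEAR = 12


@dataclass
class SiteAssessment:
    url: str
    approx_pages: int                 # how large the site is
    has_interactive_content: bool
    captures_per_year: int            # planned capture frequency
    fidelity_required: str            # free-text label, e.g. "high"
    previous_crawl_outcome: Optional[str] = None  # None if never crawled before


def favours_in_house_tool(assessment: SiteAssessment) -> bool:
    """True when the planned frequency alone points towards an in-house tool."""
    return assessment.captures_per_year >= FREQUENT_CAPTURES_PER_YEAR
```

For example, a site assessed with `captures_per_year=52` would be flagged as favouring an in-house tool, while the final choice would still take size, interactivity, fidelity and earlier crawl outcomes into account.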

Evaluation/Review

Further Information