Workflow:Flashback project - optical media imaging workflow

Workflow Description
This workflow is used by the British Library's Digital Preservation team to capture, test and preserve the contents of hand-held optical media in the Library's collection. The workflow is largely automated, though command line arguments can be used to force particular run-time options. Please note that this workflow does not cover the imaging of magnetic media (such as floppy disks), nor does it include other parts of the pre-ingest workflow such as cataloguing and scanning of printed materials. The workflow is performed within a dedicated and customised BitCurator environment, which includes a number of additional tools that are listed below.

Main Workflow - Start

 * Check the number of sessions and tracks on the disc using CD-INFO.
 * If there is only one session and one track on the disc, then proceed on the DD branch.
 * If there is only one session on the disc but multiple tracks, then check what runtime arguments have been provided:
   * If the user has selected 'Console mode' then proceed on the DD branch. Certain 1990s console discs imaged using CDRDAO are left with 'holes' in the data which cannot be read.
   * If the user has not selected 'Console mode' then proceed on the CDRDAO branch. Multi-track discs are generally mixed-mode discs from which audio content needs to be extracted separately.
 * If there is more than one session on the disc, then proceed on the CDRDAO branch. Each session on the disc needs to be captured separately.
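The branching rules above can be summarised as a small decision function. This is only an illustrative sketch: the function name `choose_branch` and its parameters are hypothetical, and it assumes the session and track counts have already been obtained from CD-INFO by some other means.

```python
def choose_branch(sessions: int, tracks: int, console_mode: bool = False) -> str:
    """Return "dd" or "cdrdao" for a disc, following the rules listed above.

    sessions and tracks are assumed to come from cd-info; console_mode stands
    in for the 'Console mode' runtime argument (hypothetical parameter name).
    """
    if sessions < 1 or tracks < 1:
        raise ValueError("expected at least one session and one track")
    if sessions > 1:
        # Multi-session disc: each session has to be captured separately.
        return "cdrdao"
    if tracks == 1:
        # Single session, single track: a plain bitwise image is sufficient.
        return "dd"
    # Single session, multiple tracks: the runtime argument decides.
    return "dd" if console_mode else "cdrdao"
```

For example, `choose_branch(1, 12)` selects the CDRDAO branch, while `choose_branch(1, 12, console_mode=True)` selects the DD branch.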

DD Branch

 * Use MD5SUM to create an MD5 checksum hash from the source media by addressing the device directly (e.g. as /dev/sr0) rather than through its mount point under /media.
 * Create a bitwise image of the source media using DD with the SYNC and NOTRUNC conversion options.
 * Use MD5SUM to create an MD5 checksum hash from the bitwise image. Check that this hash value matches that created from the source media.
 * Proceed to Main Workflow - End
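The hash, image, and re-hash sequence of the DD branch might look roughly like the sketch below. It is not the Library's actual script: the function names are hypothetical, Python's `hashlib` is swapped in for MD5SUM (it computes the same MD5 digest), and the DD invocation assumes that the SYNC and NOTRUNC options are passed as `conv=sync,notrunc`.

```python
import hashlib
import subprocess


def md5_of(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file or device node, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dd_command(device, image):
    """Build the dd command line (assumed form of the SYNC/NOTRUNC options)."""
    return ["dd", f"if={device}", f"of={image}", "conv=sync,notrunc"]


def image_with_dd(device, image):
    """Hash the device, image it with dd, and check that the image hash matches."""
    source_hash = md5_of(device)            # hash taken from the device itself
    subprocess.run(dd_command(device, image), check=True)
    return md5_of(image) == source_hash     # True when the image is verified
```

A call such as `image_with_dd("/dev/sr0", "disc.iso")` would need read access to the drive; the image filename here is only an example.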

CDRDAO Branch

 * Use CDRDAO (CD Read-At-Once) to create BIN (binary image) and TOC (table of contents) files for each session on the disc.
 * Use TOC2CUE to convert each TOC file to a CUE (cue sheet) file. The BinChunker tool requires a CUE rather than a TOC file for each session.
 * Use BCHUNK (BinChunker) to extract the audio and data content from each BIN file, using its CUE file as a guide.
   * Any audio content on the disc is extracted in WAV format.
   * Any file system on the disc is extracted in ISO format.
 * Depending on where any file system was located on the disc - i.e. whether it was before or after any audio tracks - the resulting ISO files may or may not be navigable. For each ISO file extracted:
   * Create a temporary image file zero-padded with the appropriate byte offset (obtained using the data produced by CDRDAO).
   * Use DD to copy the bytes of the ISO file directly to the new temporary file. This file is regarded as temporary as it is not usable within an emulator or VM.
   * Mount the temporary ISO file as a read-only file system.
   * Use GENISOIMAGE to create a new ISO9660 image containing the files in the temporary ISO image. This file can be viewed in an emulator or VM and is effectively an access copy.
   * Unmount the temporary ISO file.
 * Proceed to Main Workflow - End
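As a rough illustration of the CDRDAO branch, the sketch below builds the per-session command lines and shows one way to produce the zero-padded temporary image. Several details are assumptions rather than the Library's actual settings: the file naming scheme, the use of `--read-raw` with CDRDAO, the `-w` (WAV output) switch for BCHUNK, a 2048-byte sector size for the extracted ISO, and the `start_sector` parameter, which stands for the offset derived from the CDRDAO output. The padding is done in Python here instead of with DD.

```python
import shutil

SECTOR_SIZE = 2048  # assumption: bytes per sector in the ISO extracted by bchunk


def session_commands(device, session, basename):
    """Build cdrdao, toc2cue and bchunk command lines for one session."""
    stem = f"{basename}-s{session:02d}"  # hypothetical naming scheme
    return [
        ["cdrdao", "read-cd", "--read-raw", "--device", device,
         "--session", str(session), "--datafile", f"{stem}.bin", f"{stem}.toc"],
        ["toc2cue", f"{stem}.toc", f"{stem}.cue"],
        ["bchunk", "-w", f"{stem}.bin", f"{stem}.cue", f"{stem}-track"],
    ]


def pad_iso(iso_path, padded_path, start_sector):
    """Write a temporary image: zero padding up to the offset, then the ISO bytes."""
    offset = start_sector * SECTOR_SIZE
    with open(iso_path, "rb") as src, open(padded_path, "wb") as dst:
        dst.truncate(offset)   # extend the empty file with zero bytes
        dst.seek(offset)       # position the write pointer after the padding
        shutil.copyfileobj(src, dst)
    return offset


def access_copy_commands(padded_path, mountpoint, output_iso):
    """Build the mount, genisoimage and umount command lines for the access copy."""
    return [
        ["mount", "-o", "ro,loop", padded_path, mountpoint],
        ["genisoimage", "-o", output_iso, mountpoint],
        ["umount", mountpoint],
    ]
```

Each command list can be passed to `subprocess.run(..., check=True)`; mounting normally requires elevated privileges. GENISOIMAGE extensions such as Rock Ridge or Joliet are left out of this sketch and would need to be chosen to suit the disc content.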

Main Workflow - End

 * Attempt to mount the image file as read-only.
 * List and perform shallow identification of all files contained in the mounted disc image. The standard Linux commands LS and FILE are used for this purpose, but other tools such as DROID and Siegfried could be substituted.
 * Virus scan all files contained in the mounted disc image.
 * Unmount the image.
 * Move files to temporary storage area, optionally to a quarantine location depending on whether viruses or potentially unwanted applications (PUAs) were detected in the image.
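The closing steps can be sketched in the same way. Everything named below is illustrative: the function names, the default directory names, and the exact command-line forms are assumptions, and how a detection is read from the scanner's result is left to the caller.

```python
import os


def end_stage_commands(image, mountpoint):
    """Build command lines for mounting, listing, identifying, scanning and unmounting."""
    return [
        ["mount", "-o", "ro,loop", image, mountpoint],
        ["ls", "-lR", mountpoint],
        ["find", mountpoint, "-type", "f", "-exec", "file", "{}", "+"],
        ["savscan", mountpoint],
        ["umount", mountpoint],
    ]


def list_files(root):
    """Return sorted paths, relative to root, of every file under a mounted image."""
    found = []
    for directory, _subdirs, names in os.walk(root):
        for name in names:
            found.append(os.path.relpath(os.path.join(directory, name), root))
    return sorted(found)


def destination(threat_found, storage="temp-storage", quarantine="quarantine"):
    """Choose where the image files go; both directory names are hypothetical."""
    return quarantine if threat_found else storage
```

Here `destination(True)` routes the files to the quarantine location, and `destination(False)` to the ordinary temporary storage area.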

Tools Used

 * BitCurator
   * dd (standard Unix copy & convert tool)
   * md5sum (standard Unix MD5 checksum calculation tool)
   * mount (standard Unix filesystem mount tool)
   * ls (standard Unix file list tool)
   * file (standard Unix file identification tool)
 * libcdio
   * cd-info (get session & track information for optical media)
   * iso-info (get information for an ISO9660 filesystem)
 * cdrdao
   * cdrdao (CD authoring & ripping tool)
   * toc2cue (convert TOC files to CUE files)
 * BinChunker for Unix/Linux
   * bchunk (convert raw BIN images to ISO data tracks and CDR/WAV audio tracks)
 * genisoimage
   * genisoimage (create ISO9660 file system as ISO file)
 * Sophos Antivirus
   * savscan (Sophos Antivirus scan)

Purpose, Context and Content
The British Library has thousands of CDs, DVDs and floppy disks in its collection, and practices for managing them vary throughout the organisation. Known disks date back to the early 1980s. The age of these media means that preservation action is now overdue. The Library must image these media and ensure that the content is preserved properly so that it can ultimately be accessed and used by readers. To this end, we have developed an adaptable workflow that allows the capture of images from both magnetic and optical media. This section of the workflow deals specifically with optical media (CDs and DVDs) and aims to capture the contents of such discs in a usable format, reliably, and with minimal operator intervention.

Evaluation/Review
This workflow has undergone a number of changes since it was initially developed in 2016. In particular, the imaging of multi-session and multi-track discs presented a challenge which required the workflow to 'branch' at an early stage depending on the properties of the disc.

Further Information
Johan van der Knijff's excellent blog post 'Imaging CD-Extra / BlueBook discs' goes through each step of the process to create an image of a mixed audio/data CD. Though the specific tools used at each stage may differ, he provides detailed output for each step and clearly explains the challenges encountered along the way.