DigiPresHack Outline Proposal
The idea is to hold recurring hackathon-style events where we build up the information we need to do digital preservation better. These could be regular fixtures alongside conferences (iPres, IIPC GA, IDCC, etc.) but would also have a strong remote-participation element. There would be three main outcomes:
- More information documenting more formats, risks and other preservation issues.
- Better tools.
- More individuals with the skill to contribute to the above.
That is, there would always be an educational/introductory strand to help people learn about the issues and how to contribute to the registries and data sources.
DigiPresHack: Formats & Tools Hackathon
Andrew N. Jackson
To effectively capture and preserve the web, we need to understand the formats and protocols of the web, and the tools that can be used to manage them over time. This need has manifested itself in a range of digital preservation tool and format registries and test corpora, but so far these have been only a partial success. Many registries have been developed but have failed to take hold, and those that have succeeded are those that have sought to identify, support and recognise the individuals willing to contribute their time, effort and knowledge.
The idea of this DigiPresHack is to support those who wish to contribute in this area by providing a supportive environment and a clear framework for contribution. The hackathon format would be a one-day workshop in the ‘unconference’ style. Suggested activities would include:
- Generating example test files for various formats, e.g.
  - WARC and ARC files demonstrating the different de-duplication methods.
  - HTML files demonstrating particular features, ideally accompanied by screenshots to capture the results (e.g. using emulators for old browsers).
  - Extending the Archival Acid Test suite: https://github.com/machawk1/archivalAcidTest
- Extending the PWG database, possibly combining it with the newly-developed PET tools (https://github.com/pericles-project/pet)
- Reviewing and/or adding web archiving tool information in COPTR (http://coptr.digipres.org/)
- Documenting difficult or particularly interesting/challenging formats (http://fileformats.archiveteam.org)
- Extending the aggregations and visualisations at http://www.digipres.org/ so that we can see how far we’ve come.
To go ahead, this hackathon would require some additional funding to bring in appropriate individuals who could facilitate this event and who would not otherwise be able to attend. If possible, modest prizes for significant contributions could help build momentum. Ideally, we could use a webcast/hangout or similar to enable engagement by those who cannot attend.
These tasks require only basic technical skills and a willingness to learn how to document the findings. Participants would perform basic tasks, such as creating test files and checking how they are rendered:
- Making example test files.
- Checking if existing files still render in different software versions.
- This could also include using the UKWA format index to hunt down the more obscure formats.
- Taking screenshots.
- Adding test files to a suitable corpus (e.g. OPF format corpus).
- Adding documentation to the File Formats Wiki where it concerns formats and access.
- Adding documentation to COPTR where it concerns running preservation tasks (rather than access).
The goal is to better understand formats and software dependencies and to document genuine preservation risks.
Further outcomes would include improved tools and new identification signatures in forms suitable for PRONOM etc.
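As an illustration of how low the technical barrier for the test-file tasks above could be, the following is a minimal sketch of a script that describes a folder of test files before they are added to a corpus. It is not part of any existing DigiPresHack, OPF or PRONOM tooling: the function names (describe_file, build_manifest) and the manifest fields are hypothetical, chosen only for this example. The leading bytes it records are merely a starting point for someone investigating a signature, not a signature in themselves.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def describe_file(path, magic_length=8):
    """Return a small manifest entry for one test file (hypothetical layout)."""
    data = path.read_bytes()
    return {
        "name": path.name,
        "size_bytes": len(data),
        # The checksum lets others confirm they hold the same test file.
        "sha256": hashlib.sha256(data).hexdigest(),
        # First few bytes as hex: a hint for signature work, nothing more.
        "leading_bytes_hex": data[:magic_length].hex(),
    }


def build_manifest(directory):
    """Describe every regular file directly inside `directory`, sorted by name."""
    return [describe_file(p) for p in sorted(directory.iterdir()) if p.is_file()]


if __name__ == "__main__":
    # Demonstrate on a throwaway directory holding one tiny HTML test file.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "minimal.html"
        sample.write_bytes(b"<!DOCTYPE html><title>t</title>")
        print(json.dumps(build_manifest(Path(tmp)), indent=2))
```

A manifest like this could accompany a contribution so that reviewers can check file integrity and see at a glance which files share the same leading bytes.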