DataFlow is a two-stage data management infrastructure that is designed to allow researchers to work with, annotate, publish, and permanently store research data. There are two components to DataFlow: DataStage and DataBank. DataStage is a secure, personalized 'local' file management environment for use at the research group level. It appears as a mapped drive on the end-user's computer, with additional features such as repository submission and addition of metadata available via a web interface. DataBank is a scalable, domain-agnostic data repository system designed specifically to manage and share research data in an institutional setting.
Developed by the University of Oxford, with JISC funding.
Licensing and cost
Free – open source MIT license
DataFlow Version 0.2 was released in February 2012. Version 0.3 was released in May 2012, and Version 1.0 is scheduled for release on 31 May 2012. While the DataFlow project will end in May 2012, the Oxford Bodleian Library has decided to trial the software for its own use and will support development until at least May 2013.
Platform and interoperability
DataBank is written in Python. Both DataBank and DataStage are designed to work with the Ubuntu Linux 11.10 Oneiric Ocelot operating system, and the Virtual Machines work with VMWare Fusion 4.x. The system may be deployed on the Eduserv cloud, on a commercial data storage cloud, or on a local institutional server. Although specifically designed to work together, both DataBank and DataStage offer simple APIs that other services can use to integrate with the individual components.
DataBank’s digital object model is based around "collections," also known as "silos," that function as virtual administrative groups. Each silo has a set of users who can read and write files in the silo, and an administrator who manages it. A data package belongs to a silo and may contain one or more data files, metadata files, and license information for its contents. The administrator can set an embargo period for the whole silo or for an individual data package. All data are stored as ZIP files, which the software unzips when they are accessed. Metadata is added and modified directly through RDF files, although it is exposed in both human- and machine-readable forms. The software can also assign DOIs to data sets. By default, all data held in a non-dark instance of DataBank are visible to Google and other web crawlers, unless the administrator adds a robots file asking crawlers not to index the site.
DataStage gives three levels of password-controlled access: a "private" area accessible only to the file owner and the group leader, a "shared" area giving the group read-only access, and a "collaborative" area giving the group read and write access. The administrator can invite outside collaborators into the group and specify their level of access. Users can also access and annotate files through a web interface. DataStage can be deployed on a local server, or on an institutional or commercial cloud; users can also dynamically invoke additional cloud storage as required, and can integrate the system into existing backup procedures. The repository interface allows researchers to push selected files into a more permanent archive facility. While users can add free-text metadata via the web interface, DataStage also automatically captures a number of general file attributes: date uploaded; file name; last modified; type; owner; location; and size.
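The three DataStage access areas described above can be modelled as a small permission check. The following is an illustrative sketch only, not DataStage's actual implementation: the names `Area`, `User`, `can_read`, and `can_write` are hypothetical, and the rule that owners can always modify their own files is an assumption that the description above does not state.

```python
from dataclasses import dataclass
from enum import Enum


class Area(Enum):
    """The three access areas named in the DataStage description."""
    PRIVATE = "private"
    SHARED = "shared"
    COLLABORATIVE = "collaborative"


@dataclass(frozen=True)
class User:
    """Hypothetical user record; the field names are illustrative."""
    name: str
    is_group_leader: bool = False
    is_group_member: bool = True


def can_read(user: User, owner: str, area: Area) -> bool:
    """Return True if `user` may read a file owned by `owner` in `area`."""
    # The owner and the group leader can read files in every area.
    if user.name == owner or user.is_group_leader:
        return True
    # Other group members can read shared and collaborative files only.
    return area is not Area.PRIVATE and user.is_group_member


def can_write(user: User, owner: str, area: Area) -> bool:
    """Return True if `user` may modify a file owned by `owner` in `area`."""
    # Assumption: owners can always modify their own files.
    if user.name == owner:
        return True
    # Only the collaborative area grants write access to the group.
    return area is Area.COLLABORATIVE and user.is_group_member
```

Under these assumptions, a group member other than the owner can read but not change a file in the "shared" area, and can do both in the "collaborative" area. An outside collaborator invited by the administrator would simply be another `User` whose access is limited to the areas the administrator chooses.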
Documentation and user support
There is a variety of documentation for the project, including an Information for Test Users page, a DataStage documentation wiki, and a DataBank documentation wiki. Users can access an email list at email@example.com, and can report problems through an Issue Tracker. The project is creating video walkthroughs for installation, configuration and use of DataStage and DataBank, to be available from the website by the end of May 2012.
End-users can interact with the DataBank system through a web interface, but metadata must be added by uploading an RDF file. DataStage can be accessed either as a mapped drive on a user’s computer, where it integrates implicitly with the operating system’s existing navigation structure, or through a web interface. Installation and configuration use a command-line interface.
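To give a sense of what preparing an RDF metadata file for upload might involve, here is a minimal sketch that builds an RDF/XML description using Dublin Core terms with Python's standard library. The function name `build_metadata`, the package URI, and the choice of fields are hypothetical; the exact file layout that DataBank expects is not specified here and should be checked against the DataBank documentation.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for RDF and Dublin Core terms.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS_NS = "http://purl.org/dc/terms/"

# Use readable prefixes instead of auto-generated ones when serialising.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dcterms", DCTERMS_NS)


def build_metadata(package_uri: str, fields: dict) -> str:
    """Return an RDF/XML string describing `package_uri`.

    `fields` maps Dublin Core term names (e.g. "title") to literal values.
    """
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        root,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": package_uri},
    )
    for term, value in fields.items():
        ET.SubElement(description, f"{{{DCTERMS_NS}}}{term}").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical data package URI and metadata values.
    print(build_metadata(
        "https://example.org/silo/package-1",
        {"title": "Survey results", "creator": "A. Researcher"},
    ))
```

The resulting string could be saved to a file and uploaded through the web interface; any additional properties a particular DataBank instance requires would need to be added.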
Installation and configuration require solid knowledge of command line interfaces, and benefit from system administration experience. Walkthrough videos should make it possible to get the system running without expertise, but novice users may not be able to get maximum functionality and customisability from the system.
DataStage automatically gathers metadata in RDF format. The system uses the BagIt specification when transferring files to a permanent archive, which must be SWORD-2 compliant. DataBank uses Dublin Core as its default metadata standard. The system is able to assign DOIs using the DataCite API.
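For readers unfamiliar with BagIt, the sketch below packages a set of files as a minimal bag: a `data/` payload directory, a `bagit.txt` declaration, and a checksum manifest. It is not DataStage's own packaging code. The function name `make_bag` is hypothetical, and the sketch assumes BagIt version 1.0 with a SHA-256 manifest; the BagIt version and checksum algorithm that DataStage actually uses may differ.

```python
import hashlib
import shutil
from pathlib import Path


def make_bag(files: list, bag_dir: Path) -> Path:
    """Copy `files` into a new minimal BagIt bag at `bag_dir`.

    Files are placed flat in data/, so their names must be unique.
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=False)

    manifest_lines = []
    for src in files:
        src = Path(src)
        dest = data_dir / src.name
        shutil.copy2(src, dest)
        # Each manifest line pairs a checksum with a path relative to the bag root.
        digest = hashlib.sha256(dest.read_bytes()).hexdigest()
        manifest_lines.append(f"{digest}  data/{src.name}\n")

    # The bag declaration: BagIt version and tag-file encoding.
    (bag_dir / "bagit.txt").write_bytes(
        b"BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n"
    )
    (bag_dir / "manifest-sha256.txt").write_bytes(
        "".join(manifest_lines).encode("utf-8")
    )
    return bag_dir
```

A receiving archive can recompute each checksum and compare it with the manifest to confirm that the transfer completed without corruption. Delivering the bag to a SWORD-2 endpoint is a separate step that is not shown here.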
Influence and take-up
DataFlow is used in a number of settings at the University of Oxford. It is not yet in “production” use outside the project, but the project is developing an active base of test users and interested parties, including:
* UK Data Archive (testing DataStage + Eprints)
* University of Hertfordshire and the Centre for Digital Music, Queen Mary University London (testing DataStage + DSpace)
* RoaDMap project, University of Leeds (testing DataStage and DataBank)
* YHMAN Shared Virtual Data Centre (interest in rolling out DataStage and DataBank on a 'community cloud' of eight universities in Yorkshire & Humber, comprising a cluster resource pool in existing data centres linked by a stretched long-haul network)
* Microsoft Research (interest in DataBank as an archive to hold and publicise large datasets)
* A pool of research group leaders in the University of Oxford, who implemented ADMIRAL (a precursor to DataStage) and will upgrade to DataStage 1.0.