Workflow:Appraise email and other large, unstructured text collections

Input: The workflow is intended for email collections in PST format, but it can be adapted to most other email formats and to common office productivity formats, including PDF and Word.
Output: The outputs include deduplicated & reformatted (PST->EML) files, over 100 extracted metadata attributes about each document, extracted attachments, indexing (OCR & audio transcriptions), plain text versions, and scores indicating whether each document is archival and/or sensitive (with accompanying statistics).
Organisation: University of Illinois

Workflow Description

  1. Using Emailchemy, convert the source PST files to individual EML files for inventory, preservation, and deduplication.
  2. Ingest the EML files into Nuix Discover to extract attachments & metadata, deduplicate, and index/OCR/transcribe.
  3. Create a sample (e.g., 1,000 documents) from the collection and code it (e.g., archival or non-archival). This will be used as a control to evaluate the performance of the predictive coding model.
  4. Create a sample (e.g., 200 documents) from the collection (minus the control set) and code it. This will be the initial set of documents used to train the model.
  5. Create a predictive coding model using the training set. Train the model, then score the control set using that model.
  6. Review the projections to evaluate the model's accuracy at 95% recall.
    • If below 80% accuracy, create a new version of the model using additional documents from the collection (minus the control set). Code those documents, train the new version, score the control set using the new version, and review the new projections. Repeat until you reach 80% accuracy.
    • If the performance drops between two versions, use the Review Conflicts feature to identify instances where the human coder may have erred in coding the control set. Also, create a separate model where the training and control sets are reversed. Use Review Conflicts in that model to identify instances where the human coder may have erred in coding the training set.
  7. Once 80% accuracy is achieved, note the threshold and then apply it to the full population. All documents will be scored.
  8. Export the To/From/CC/BCC, Date, Subject, & score metadata for all documents with a score below the threshold. These non-archival documents will be discarded, but the basic metadata will be retained for network analysis & context.
  9. Export all metadata, attachments, and plain text versions of all documents with a score at or above the threshold. These materials will be preserved and eventually made available to researchers.
    • If either the email or its attachment(s) are archival, both will be retained.
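
The sampling and threshold logic in steps 3 to 9 can be sketched outside of Nuix Discover as follows. This is only an illustration under stated assumptions, not the tool's implementation: the function names (`draw_samples`, `evaluate_at_recall`, `partition`) and the toy scores are hypothetical, and because the workflow does not define how its "accuracy" figure is calculated, the sketch reports both precision and overall accuracy at the chosen threshold.

```python
import math
import random


def draw_samples(doc_ids, control_size=1000, training_size=200, seed=0):
    """Draw disjoint random control and training samples (steps 3 and 4)."""
    shuffled = list(doc_ids)
    random.Random(seed).shuffle(shuffled)
    control = shuffled[:control_size]
    # The training set is taken from the collection minus the control set.
    training = shuffled[control_size:control_size + training_size]
    return control, training


def evaluate_at_recall(scored_control, target_recall=0.95):
    """Find the highest score cutoff that reaches the target recall (step 6).

    scored_control: list of (score, is_archival) pairs for the coded control set.
    Returns the cutoff plus recall, precision, and accuracy measured at it.
    """
    archival_scores = sorted(
        (score for score, archival in scored_control if archival), reverse=True
    )
    if not archival_scores:
        raise ValueError("control set contains no archival documents")
    # Number of archival documents that must score at or above the cutoff
    # (the small epsilon guards against floating-point noise in the product).
    needed = max(1, math.ceil(target_recall * len(archival_scores) - 1e-9))
    threshold = archival_scores[needed - 1]

    tp = sum(1 for s, a in scored_control if a and s >= threshold)
    fp = sum(1 for s, a in scored_control if not a and s >= threshold)
    tn = sum(1 for s, a in scored_control if not a and s < threshold)
    return {
        "threshold": threshold,
        "recall": tp / len(archival_scores),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / len(scored_control),
    }


def partition(scores, threshold):
    """Split document ids into keep/discard lists (steps 8 and 9)."""
    keep = [doc for doc, s in scores.items() if s >= threshold]
    discard = [doc for doc, s in scores.items() if s < threshold]
    return keep, discard


# Toy data: (model score, human coding) for a seven-document control set.
toy_control = [
    (0.9, True), (0.8, True), (0.7, False), (0.6, True),
    (0.5, False), (0.4, True), (0.2, False),
]
control, training = draw_samples(range(5000))
result = evaluate_at_recall(toy_control, target_recall=0.75)
keep, discard = partition({"msg-a": 0.9, "msg-b": 0.3}, result["threshold"])
print(result, keep, discard)
```

Drawing both samples from a single shuffle guarantees that the training set never overlaps the control set, which is what keeps the control set usable as an independent check on each new version of the model.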

Purpose, Context and Content

Huge, unorganized, unstructured data with scattered sensitive content presents challenges that structured, categorized, foldered, and form-based electronic records do not. Email gives context for decisions, and the fact that it is frequently the subject of FOIA requests is an indicator of its documentary and evidentiary value. It is a challenge to know what to keep. The U.S. NARA Capstone approach helps, but non-records exist within Capstone accounts and records exist in non-Capstone accounts. It is also a challenge to determine who can access which content, when, and by what means if it contains PII. As a result, these records end up ignored and deleted, or preserved but neglected, causing us to lose or overlook a valuable resource, reducing awareness of its value and risking format obsolescence.

We have solutions for acquiring (Capstone & PST/MBOX/EPADD, etc.), reformatting (Emailchemy, etc.), indexing (Acrobat, etc.), preserving (Preservica, etc.), arranging (EPADD/Preservica, etc.), describing (EPADD/Preservica, etc.), and accessing (EPADD/Preservica, etc.) email. The purpose of this workflow is to provide a way to code and classify large, unstructured datasets at the discrete item level. To do this, we use predictive coding: a statistical model is built and evaluated using coded training and control samples, then applied to the whole collection to find the desired content, so that decisions based on representative samples can be made with high confidence.


In an NHPRC-funded project completed in 2020 in collaboration with the Illinois State Archives, 516 GB of PST files from 68 senior state officials, dating from 2000 to 2014, were reduced from 5.3M messages to 2.4M messages & attachments classified as archival. In the final classification, 1.4M were truly archival, 1M were false positives (non-archival but kept), 2.2M were truly non-archival, and 70K were false negatives (archival but discarded). This was accomplished by reviewing about 0.1% of the documents (~50 hours). The approach allows more high-quality content to be made accessible more quickly and reduces the footprint of the collection. It also quantifies the risk that data is erroneously deleted, retained, shared, or corrupted. The cost of Nuix Discover is non-trivial, but because it is a monthly subscription, it can be used in sprints to process the data quickly and avoid long-term costs.
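
To show how the four reported counts translate into standard classification metrics, here is a short sketch. The helper name `confusion_metrics` is hypothetical, and the inputs are the rounded figures quoted above, so the results are approximate. Note also that the four counts sum to about 4.67M; the source does not say how this relates to the 5.3M starting figure.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Derive standard metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "total": total,
        # Share of truly archival documents that were kept.
        "recall": tp / (tp + fn),
        # Share of kept documents that are truly archival.
        "precision": tp / (tp + fp),
        # Share of all documents that were classified correctly.
        "accuracy": (tp + tn) / total,
        # Share of discarded documents that were actually archival.
        "false_omission_rate": fn / (tn + fn),
    }


# Rounded counts as reported for the Illinois State Archives project.
reported = confusion_metrics(tp=1_400_000, fp=1_000_000, tn=2_200_000, fn=70_000)
for name, value in reported.items():
    print(f"{name}: {value:,.3f}" if isinstance(value, float) else f"{name}: {value:,}")
```

With these rounded inputs, recall comes out at roughly 95%, in line with the recall target used in step 6, while about 3% of the discarded documents were archival.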

Further Information

Bmwest (talk) 18:53, 6 February 2024 (UTC)