==Workflow Description==
# Using Emailchemy, convert the source PST files to individual EML files for inventory, preservation, and deduplication.
# Ingest the EML files into Nuix Discover to extract attachments and metadata, deduplicate, and index/OCR/transcribe.
# Create a sample (e.g., 1,000 documents) from the collection and code it (e.g., archival or non-archival). This will be used as a control set to evaluate the performance of the predictive coding model.
# Create a sample (e.g., 200 documents) from the collection (minus the control set) and code it. This will be the initial set of documents used to train the model.
# Create a predictive coding model using the training set. Train the model, then score the control set using that model.
# Review the projections to evaluate the model's accuracy at 95% recall.
#* If accuracy is below 80%, create a new version of the model using additional documents from the collection (minus the control set). Code those documents, train the new version, score the control set with it, and review the new projections. Repeat until you reach 80% accuracy.
#* If performance drops between two versions, use the Review Conflicts feature to identify instances where the human coder may have erred in coding the control set. Also, create a separate model with the training and control sets reversed, and use Review Conflicts in that model to identify instances where the human coder may have erred in coding the training set.
# Once 80% accuracy is achieved, note the threshold and apply it to the full population. All documents will be scored.
# Export the To/From/CC/BCC, Date, Subject, and score metadata for all documents with a score below the threshold. These non-archival documents will be discarded, but the basic metadata will be retained for network analysis and context.
# Export all metadata, attachments, and plain-text versions of all documents with a score at or above the threshold. This material will be preserved and eventually made available to researchers.
#* If either an email or its attachment(s) is archival, both will be retained.

==Purpose, Context and Content==
Huge, unorganized, unstructured data with scattered sensitive content presents challenges that structured, categorized, foldered, and form-based electronic records do not. Email gives context for decisions, and the fact that it is frequently the subject of FOIA requests indicates its documentary and evidentiary value. Knowing what to keep is a challenge. The U.S. NARA Capstone approach helps, but non-records exist within a Capstone account and records exist in non-Capstone accounts. It is also a challenge to identify what content can be accessed by whom, at what times, and by what means if it contains PII. As a result, these records end up ignored and deleted, or preserved but neglected, causing us to lose or overlook a valuable resource, reducing awareness of its value and risking format obsolescence.
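The control-set evaluation and threshold selection described in the Workflow Description can be sketched in a few lines of Python. This is a minimal illustration of the arithmetic only, not the Nuix Discover implementation: the `scores` and `labels` values are a hypothetical coded control set (model scores between 0 and 1, and human archival/non-archival codes), and the function picks the highest cutoff that still achieves the target recall, then reports the accuracy at that cutoff.

```python
# Minimal sketch of threshold selection at 95% recall on a coded control set.
# "scores" are hypothetical model scores (0-1); "labels" are the human codes
# (True = archival). This is NOT the Nuix Discover API, just the arithmetic.

def pick_threshold(scores, labels, target_recall=0.95):
    """Return the highest threshold whose recall is >= target_recall,
    along with the accuracy achieved at that threshold."""
    total_archival = sum(labels)
    # Try each observed score as a candidate cutoff, highest first.
    for t in sorted(set(scores), reverse=True):
        predicted = [s >= t for s in scores]
        true_positives = sum(p and l for p, l in zip(predicted, labels))
        if true_positives / total_archival >= target_recall:
            correct = sum(p == l for p, l in zip(predicted, labels))
            return t, correct / len(labels)
    return None

# Toy control set: six documents with model scores and human codes.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.10]
labels = [True, True, True, False, True, False]
threshold, accuracy = pick_threshold(scores, labels)
```

In this toy example the cutoff must drop to 0.40 to keep all the archival documents, and the resulting accuracy on the control set is what would be compared against the 80% target in the workflow above.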
We have solutions for acquiring (Capstone and PST/MBOX/ePADD, etc.), reformatting (Emailchemy, etc.), indexing (Acrobat, etc.), preserving (Preservica, etc.), arranging (ePADD/Preservica, etc.), describing (ePADD/Preservica, etc.), and accessing (ePADD/Preservica, etc.) email. The purpose of this workflow is to provide a way to code and classify large, unstructured datasets at the discrete item level. To do this, we use predictive coding: a statistical model built from training and control sets finds what you want in a large collection, using representative samples to make decisions with high confidence.

==Evaluation/Review==
In an NHPRC-funded project completed in 2020 in collaboration with the Illinois State Archives, 516 GB of PST files from 68 senior state officials dating from 2000 to 2014 were reduced from 5.3M messages to 2.4M archival messages and attachments. Of the documents kept, 1.4M were truly archival and 1M were false positives (non-archival but kept); of those discarded, 2.2M were truly non-archival and 70K were false negatives (archival but discarded). This was accomplished by reviewing about 0.1% of the documents (~50 hours). The approach allows more quality content to be made accessible more quickly and reduces the footprint of the collection. It also quantifies the risk that data is erroneously deleted, retained, shared, or corrupted. The cost of Nuix Discover is non-trivial, but as a monthly subscription it can be used in sprints to process the data quickly and avoid long-term costs.

==Further Information==
[[User:Bmwest|Bmwest]] ([[User talk:Bmwest|talk]]) 18:53, 6 February 2024 (UTC)
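As a rough sanity check on the figures reported in the Evaluation/Review section, the standard recall and precision formulas can be applied to the stated counts (1.4M true positives, 1M false positives, 2.2M true negatives, 70K false negatives). This sketch only restates that arithmetic; the project's own metrics may have been computed differently.

```python
# Confusion-matrix arithmetic for the counts reported above (in millions).
tp, fp, tn, fn = 1.4, 1.0, 2.2, 0.07

recall = tp / (tp + fn)       # share of archival documents actually kept
precision = tp / (tp + fp)    # share of kept documents that are archival
kept = tp + fp                # total retained, in millions
```

Recall comes out at roughly 95%, consistent with the workflow's 95% recall target, and the 2.4M kept documents match the reported reduction from 5.3M to 2.4M.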