Metadata Extraction

From COPTR
Function definition: Tools that support the extraction of metadata from files.
Lifecycle stage: Cross-Lifecycle Functions, Ingest

Tools for this function

Tool | Purpose
ALTAG3D | Open source archive software
Aaru Data Preservation Suite | Media dump software and disc image manager
Adobe Photoshop Elements | A commercial image editor with a metadata module (Organizer).
Apache PDFBox | Java PDF library for creation, manipulation, validation and content extraction of PDF documents
Apache POI - the Java API for Microsoft Documents | The Apache POI Project's mission is to create and maintain Java APIs for manipulating various file formats based upon the Office Open XML standards (OOXML) and Microsoft's OLE 2 Compound Document format (OLE2).
Apache Tika | Java-based tool for identifying file formats using signatures and extracting metadata and text content from documents.
BWF MetaEdit | BWF MetaEdit permits embedding, validating, and exporting of metadata in Broadcast WAVE Format (BWF) files.
BitCurator | The BitCurator Environment is an Ubuntu Linux distribution geared to the needs of archivists and librarians. It includes a suite of open source digital forensics and data analysis tools to help collecting institutions process born-digital materials.
Brunnhilde | Siegfried-based characterization of directories and disk images
C3PO | C3PO is a content profiling tool for visualization and preservation analysis
CloudCompare | CloudCompare is a tool for editing and processing 3D point clouds and triangular meshes.
CyberChef | A forensic tool with workflow capabilities to analyse files and containers
DIMAG | A software suite supporting archives with preservation of digital information for eternity
DIMAG IngestList | Accompanies the ingest process from donor to archive and logs process steps.
DROID (Digital Record Object Identification) | DROID (Digital Record Object Identification) is a software tool developed to perform automated batch identification of file formats.
DUMPBIN Utility | The DUMPBIN utility, which is provided with the 32-bit version of Microsoft Visual C++, combines the abilities of the LINK, LIB, and EXEHDR utilities.
Demystify | Format identification analysis and reporting
DiPS (Digital Preservation Solution) | DiPS (OAIS-compliant Digital Preservation Solution)
Directory List & Print | A universal metadata extractor
Disktype | Tool for detecting the content format of a disk or disk image. It knows about common file systems, partition tables, and boot codes.
Duke Data Accessioner | Data Accessioner provides a graphical user interface to aid in migrating data from physical media to a dedicated file server, documenting the process and using MD5 checksums to identify any errors introduced in transfer.
EMET (Embedded Metadata Extraction Tool) | EMET is a stand-alone tool designed to extract metadata embedded in JPEG and TIFF files.
EPADD | ePADD is a software package developed by Stanford University's Special Collections & University Archives that supports archival processes around the appraisal, ingest, processing, discovery, and delivery of email archives.
EXE Explorer | EXE Explorer reads and displays executable file properties and structure.
EXIF to DC XML normaliser | Extracts EXIF data and normalises it to DC XML.
Easy CD-DA Extractor | Easy CD-DA Extractor is a CD ripper, music converter, audio converter, metadata editor, and CD/DVD burning application.
EpubCheck | Validator for EPUB files
Exact Audio Copy | Exact Audio Copy is an audio grabber for audio CDs using standard CD and DVD-ROM drives on Windows only.
Exempi | Exempi is a library for handling XMP metadata, based on the Adobe XMP SDK
ExifTool | Properties extraction, identification, metadata editing
FFAStrans | Task automation engine, mostly used in audio and video content management.
FIDO (Format Identification for Digital Objects) | A PRONOM-based command-line file format identification tool written in Python
FITS (File Information Tool Set) | FITS allows data curators to identify, validate, and extract technical metadata for the objects in their digital repository.
File Analyzer and Metadata Harvester V2 | The File Analyzer is a general purpose desktop (and command line) tool designed to automate simple, file-based operations. The File Analyzer assembles a toolkit of tasks a user can perform. The tasks that have been written into the File Analyzer code base have been optimized for use by libraries, archives, and other cultural heritage institutions.
FileAlyzer | FileAlyzer allows a basic analysis of files (showing file properties and file contents in hex dump form) and is able to interpret common file contents such as resource structures (text, graphics, HTML, media and PE).
FileTrove | FileTrove indexes files and creates metadata from them. The single binary application walks a directory tree and identifies all regular files by type with Siegfried.
Filestar | Universal file converter for 900+ file types.
Fq | Tool, language and decoders for working with binary data.
GNU libextractor | GNU libextractor is a library used to extract metadata from files of arbitrary type.
Geosetter | A tool that sets coordinates and edits all kinds of embedded image metadata.
GetID3() | Extracts technical and embedded descriptive metadata from common multimedia file formats.
IText | PDF library for manipulation, content extraction and creation
InBoxer | InBoxer is a next generation email archiving, IM archiving, e-discovery, and policy management system.
Index.dat Analyzer v2.5 | Index.dat Analyzer is a tool to view, examine and delete contents of index.dat files.
JHOVE (Harvard Object Validation Environment) | JHOVE provides functions to perform format-specific identification, validation, and characterization of digital objects.
JHOVE2 | JHOVE2 allows data curators to characterise the digital objects in their repositories.
JWAT | Java Web Archive Toolkit
Jp2StructCheck | Simple JP2 file structure checker
Jpylyzer | JP2 validation and properties extraction
Keith Humphreys' PhraseRate | PhraseRate is a program, developed by Keith Humphreys, for extracting a set of meaningful, attractive keywords and key phrases from a web page describing the content of that page.
Lingfo | Lingfo provides a library for developers to use to extract information from Microsoft Excel spreadsheet files.
METS Reader Writer | Python library for processing and outputting METS/PREMIS XML according to the Archivematica METS profile.
MP3::Tag | MP3::Tag is a module for reading tags of MP3 audio files.
MailStore Home | Unifies your private emails into one searchable, platform-independent repository
Mdqc | Tool for managing and comparing digital asset metadata
MediaInfo | Supplies technical and tag information about a video or audio file.
Metadata Extraction Tool | Metadata Extraction Tool automatically extracts a limited set of metadata from the headers of digital files.
Metadata Interrogator | The Metadata Interrogator is a standalone, offline GUI tool for extracting and analysing metadata from a wide variety of file formats.
Metadata transformer | A simple tool for creating new CSV and HTML reports based on the metadata files generated by the Data Accessioner
Metadata++ | Freeware tool to view, edit, modify, extract, and copy metadata of various formats.
Metadata2Go | Web-based EXIF data viewer
NARA File Analyzer and Metadata Harvester | NARA File Analyzer and Metadata Harvester allows a user to analyze the contents of a file system or external drive and generates statistics about the contents of the contained directories.
NARA Video Frame Analyzer | NARA Video Frame Analyzer analyzes technical properties of individual frames of a video file in order to detect quality issues within digitized video files.
Nanite | A friendly swarm of format-identifying robots
ODF Validator | ODF Validator is a tool that validates OpenDocument files and checks them for certain conformance criteria.
Officeparser.py | officeparser.py is a Python script that parses the format of OLE compound documents used by Microsoft Office applications.
OpenJPEG | The OpenJPEG library is an open-source JPEG 2000 codec written in C.
PDF Tools (by Didier Stevens) | Tools for parsing and analysing PDF documents
PET (PERICLES Extraction Tool) | A tool to capture contextual information in a sheer curation scenario
Pagelyzer | Suite of tools for detecting changes in web pages and their rendering
PdfaPilot | pdfaPilot: Conversion of documents and emails into robust, searchable PDF or PDF/A files
Pdfcpu | A Go library and command-line tool for PDF processing, including validation
Pdftk | PDF manipulation tool
Peepdf | peepdf is a Python tool to explore PDF files in order to find out if the file can be harmful or not.
Pre-Ingest Tool | A tool for generating an OAIS SIP for digital preservation. It produces a METS document that contains metadata for digital preservation.
Premissh | Premissh is a simple prototype tool for automatically creating PREMIS XML from a file, using DROID, BASH and XSLT.
Python XMP Toolkit | Library for working with XMP metadata, as well as reading/writing XMP metadata stored in many different file formats
Qpdf | QPDF is a command-line program that does structural, content-preserving transformations on PDF files
RATOM | Review, Appraisal, and Triage of Mail (RATOM) is software to assist archives and other collecting organizations with email analysis, selection, and appraisal tasks
Sheeko | Machine learning implementation package to generate descriptive metadata for digitized historical images.
Smithsonian Cook | Processing of 3D model, mesh, and texture data including the option to define custom processing workflows, where a set of files is processed by multiple tools.
Warctools | Command-line tools and libraries for handling and manipulating WARC files (and HTTP contents)
Web Archive Discovery | Indexing and discovery tools for web archives.
WordHoard | WordHoard is an application for the close reading and scholarly analysis of deeply tagged texts.
Xpdf | Open source PDF viewer that includes a PDF information extractor and font analyzer