NSRL (National Software Reference Library)
The National Software Reference Library (NSRL) is a large collection of software packages from various sources. Technical metadata about millions of files (including MD5 and SHA-1 hashes) is published every three months as the NSRL Reference Data Set (RDS).
The NSRL RDS is primarily used in forensic investigations to exclude non-unique, irrelevant files from examination. It may also help archivists and digital curators who work with unstructured personal archives to automatically separate significant files from operating system and application components.
Downloads are available in several variants. The full RDS release is a 6 GB .iso file. A "minimal" set of 42,060,540 file hashes is a 2.77 GB .zip file, but it lists only one example of every file in the NSRL and therefore cannot be used to determine all possible sources of a file. (All sizes and quantities refer to RDS version 2.49 as of June 2015.) For simply filtering non-significant files out of a personal digital archive, the minimal set should be sufficient.
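The filtering idea can be illustrated with a minimal Python sketch. It assumes the SHA-1 values have already been extracted from an RDS download into a plain list with one hex digest per line; that list, and the function names sha1_of_file, load_known_hashes and split_by_known, are hypothetical and are not part of the NSRL distribution or its record format. The sketch hashes every file under a directory and separates files whose hash appears in the list from those that do not.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the uppercase hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def load_known_hashes(lines):
    """Build a set of SHA-1 strings from an iterable of text lines.

    Assumption: one hex digest per line (a hypothetical pre-extracted
    list, not the RDS record format itself). Digests are uppercased so
    that the comparison is case-insensitive; blank lines are skipped.
    """
    return {line.strip().upper() for line in lines if line.strip()}


def split_by_known(root, known_hashes):
    """Partition regular files under root into (known, candidates).

    'known' holds files whose SHA-1 is in known_hashes; 'candidates'
    holds everything else, i.e. the potentially significant files.
    """
    known, candidates = [], []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            target = known if sha1_of_file(path) in known_hashes else candidates
            target.append(path)
    return known, candidates
```

A call might look like split_by_known("archive/", load_known_hashes(open("known_sha1.txt"))), where both paths are placeholders. Holding tens of millions of digests in an in-memory set needs several gigabytes of RAM, so for the full list a database or a sorted on-disk lookup may be a better fit.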
Details of the record format are available.