De-Duplication

From COPTR
Function definition: Tools that enable the identification and/or removal of duplicate or similar files.
Lifecycle stage: Preservation Action
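
As an illustration of the "identification" half of this function, the following is a minimal sketch of exact-duplicate detection by checksum. It is not taken from any of the tools listed on this page; the function names (file_digest, find_duplicates), the choice of SHA-256, and the size pre-filter are all assumptions made for the example. It finds only byte-identical files, not "similar" ones.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Group byte-identical files under `root` (hypothetical helper).

    Files are first grouped by size, so that only files sharing a size
    with at least one other file are hashed.
    """
    by_size = defaultdict(list)
    for p in Path(root).rglob("*"):
        if p.is_file():
            by_size[p.stat().st_size].append(p)

    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have an exact duplicate
        for p in paths:
            by_digest[file_digest(p)].append(p)

    # Keep only groups that contain more than one file.
    return [sorted(group) for group in by_digest.values() if len(group) > 1]
```

Each returned group lists paths with identical content. Deciding which copy to keep or remove is deliberately left out of the sketch, since that depends on local preservation policy.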

Tools for this function

Tool | Purpose
AllDup | A brief description
CloneSpy | Finds duplicate files, including exact matches, and deletes them according to rules.
DROID Siegfried Sqlite Analysis Engine | Analysis and automatic generation of summary information from DROID output
DupeGuru | A brief description
Emailchemy | Converts proprietary emails to standard portable formats
FileVerifier++ | Windows utility for verifying file contents
FolderMatch | Compares two directory trees and flags duplicates
Fslint | Set of utilities to find and clean various forms of lint on a filesystem, such as duplicate files, empty directories, and bad file names.
GNU Diffutils | A package of several programs related to finding differences between files.
Java library implementing Pairtree | A software library that supports the mapping between identifiers and file paths according to the Pairtree Specification.
Matchbox Tool | Duplicate detection tool for digital document collections.
SSDeep | Recursive piecewise hashing tool
The DeDuplicator (Heritrix add-on module) | An add-on module for Heritrix to reduce the amount of duplicate data collected in a series of snapshot crawls.
WinMerge | A visual tool for differencing and merging file collections, images and texts.
XcorrSound | Compares sound waves using cross correlation.

For some guidance on approaches to de-duplication see: