The Matchbox tool finds duplicate pairs in a collection of digital documents using SIFT features and the SSIM method. The tool takes a collection path and associated parameters as input. Three scenarios are currently implemented:
- Duplicate search in one run (parameter 'all')
- Professional duplicate search (an experienced user can execute individual steps of the 'FindDuplicates' workflow)
- Quick check whether two documents are duplicates (based on a previously built BoW dictionary)
Further parameters that influence and adjust the duplicate analysis are currently being investigated.
Image processing method:
The image processing algorithm can be described in four steps:
1. Document feature extraction
- Interest point detection using Scale Invariant Feature Transform (SIFT) keypoint extraction
- Derivation of local feature descriptors (invariant to geometric or radiometric distortions)
2. Learning visual dictionary
- Cluster the SIFT descriptors of all images using the k-means algorithm
- Run over the collection and collect the local descriptors in a visual dictionary using the Bag-of-Words (BoW) approach
3. Create visual histogram for each image document
4. Detect similar images based on visual histograms and local descriptors. Evaluate the similarity score by pair-wise comparison of the corresponding keyword frequency histograms for all documents. Conduct a structural similarity analysis using the Structural SIMilarity (SSIM) approach (1 means identical and 0 means very different)
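Steps 3 and 4 can be illustrated with a minimal sketch in plain Python. It is not the Matchbox implementation: the function names are hypothetical, descriptors are plain lists of numbers, the visual dictionary is assumed to be already trained, histogram intersection is only one possible choice of histogram comparison, and the SSIM is computed over a single global window with the commonly used constants K1 = 0.01 and K2 = 0.03 rather than over local windows.

```python
def nearest_word(descriptor, dictionary):
    """Return the index of the visual word (centroid) closest to a descriptor."""
    best_index, best_dist = 0, float("inf")
    for index, word in enumerate(dictionary):
        dist = sum((d - w) ** 2 for d, w in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def bow_histogram(descriptors, dictionary):
    """Normalised frequency of each visual word among an image's descriptors."""
    counts = [0] * len(dictionary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, dictionary)] += 1
    total = float(sum(counts))
    return [c / total for c in counts] if total else [0.0] * len(counts)


def histogram_intersection(h1, h2):
    """Similarity of two normalised histograms (1 = same distribution)."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def global_ssim(x, y, dynamic_range=255.0):
    """Single-window SSIM of two equally sized grayscale pixel sequences."""
    n = float(len(x))
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * dynamic_range) ** 2  # stabilises the luminance term
    c2 = (0.03 * dynamic_range) ** 2  # stabilises the contrast/structure term
    return ((2 * mean_x * mean_y + c1) * (2 * cov + c2)) / (
        (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    )


# Toy data: a two-word dictionary and two images with 2-D "descriptors".
dictionary = [[0.0, 0.0], [10.0, 10.0]]
hist_a = bow_histogram([[1.0, 0.0], [9.0, 11.0], [0.0, 2.0]], dictionary)
hist_b = bow_histogram([[0.5, 0.5], [10.0, 9.0], [1.0, 1.0]], dictionary)
print(histogram_intersection(hist_a, hist_b))
print(global_ssim([10, 50, 200, 90], [10, 50, 200, 90]))
```

In this sketch a cheap histogram comparison would first shortlist candidate pairs, and the more expensive SSIM check would then be run only on those candidates. Whether Matchbox orders the two checks exactly this way is an assumption.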
The FindDuplicates script can be invoked from the command line. For standard usage two parameters are required: the path to the collection documents and 'all'.
scape/pc-qa-matchbox/Python# python2.7 FindDuplicates.py -h
usage: FindDuplicates.py [-h] [-threads THREADS|--threads THREADS] [-sdk SDK|--sdk SDK] [-precluster PRECLUSTER|--precluster PRECLUSTER] [-clahe CLAHE|--clahe CLAHE] [-config CONFIG|--config CONFIG] [-featdir FEATDIR|--featdir FEATDIR] [-bowsize BOWSIZE|--bowsize BOWSIZE] [-csv|--csv] [-v] dir all,extract,compare,train,bowhist,clean
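A standard run can also be assembled from another Python script. The following is a minimal sketch: the helper build_command and the path /path/to/collection are hypothetical, and only the positional arguments shown in the usage string (the collection directory followed by one action) are used.

```python
# Actions listed in the usage string of FindDuplicates.py.
ACTIONS = ("all", "extract", "compare", "train", "bowhist", "clean")


def build_command(collection_dir, action="all", interpreter="python2.7"):
    """Build the argument list for one FindDuplicates.py run (hypothetical helper)."""
    if action not in ACTIONS:
        raise ValueError("unknown action: %r" % (action,))
    return [interpreter, "FindDuplicates.py", collection_dir, action]


command = build_command("/path/to/collection")
print(" ".join(command))  # python2.7 FindDuplicates.py /path/to/collection all
```

The resulting list can be passed to subprocess.call from the directory that contains FindDuplicates.py. Checking the action beforehand catches a typo before the script is started.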
Matchbox is currently installed at the Austrian National Library.