Matchbox Tool

= Description =

The Matchbox tool finds duplicate pairs in a collection of digital documents using SIFT features and the SSIM method. It takes the path to a collection, together with associated parameters, as input. Three scenarios are currently implemented:


 * Duplicate search in one turn (parameter 'all')


 * Professional duplicate search (an experienced user can execute individual steps of the 'FindDuplicates' workflow)


 * Quick check whether two documents are duplicates (based on a previously built BoW dictionary).

Further parameters that influence and adjust the duplicate analysis are currently being investigated.

Image processing method:

The image processing algorithm can be described in 4 steps:

1. Document feature extraction


 * Interest point detection (applying Scale Invariant Feature Transform (SIFT) keypoint extraction)
 * Derivation of local feature descriptors (invariant to geometrical or radiometrical distortions)

2. Learning visual dictionary


 * Cluster the SIFT descriptors of all images using the k-means algorithm
 * Run over the collection and collect the local descriptors in a visual dictionary using the Bag-of-Words (BoW) algorithm

3. Create visual histogram for each image document

4. Detect similar images based on visual histograms and local descriptors. Evaluate the similarity score by pair-wise comparison of the corresponding keyword frequency histograms for all documents. Conduct structural similarity analysis by applying the Structural SIMilarity (SSIM) approach (1 means identical and 0 means very different).


 * Rotate
 * Scale
 * Mask
 * Overlaying
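Steps 2 to 4 above can be illustrated with a small, self-contained sketch. It is not the Matchbox implementation: all function names are hypothetical, the descriptors are toy 2-D points instead of 128-dimensional SIFT vectors, k-means uses an evenly spaced initialisation chosen only for reproducibility, histogram intersection is an assumed similarity measure, and SSIM is evaluated once over the whole signal rather than in sliding windows. The rotate, scale, mask and overlay steps are not covered.

```python
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(descriptor, centers):
    """Index of the center (visual word) closest to the descriptor."""
    return min(range(len(centers)), key=lambda i: sq_dist(descriptor, centers[i]))


def learn_dictionary(descriptors, k, iterations=20):
    """Step 2: plain k-means (Lloyd's algorithm); the k centers are the visual words."""
    step = len(descriptors) // k  # assumption: evenly spaced initial centers
    centers = [list(descriptors[i * step]) for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for d in descriptors:  # assignment step
            groups[nearest(d, centers)].append(d)
        for i, group in enumerate(groups):  # update step
            if group:  # an empty cluster keeps its previous center
                centers[i] = [sum(col) / len(group) for col in zip(*group)]
    return centers


def visual_histogram(descriptors, centers):
    """Step 3: relative frequency of each visual word in one image."""
    counts = [0] * len(centers)
    for d in descriptors:
        counts[nearest(d, centers)] += 1
    return [c / len(descriptors) for c in counts]


def histogram_intersection(h1, h2):
    """Step 4: 1.0 for identical normalised histograms, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def global_ssim(x, y, dynamic_range=255.0):
    """Step 4: SSIM evaluated once over two equal-length grey-value sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    # Stabilising constants with the commonly used factors 0.01 and 0.03.
    c1, c2 = (0.01 * dynamic_range) ** 2, (0.03 * dynamic_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx * mx + my * my + c1) * (vx + vy + c2))


# Toy 2-D "descriptors" per image (hypothetical file names and values).
images = {
    "a.png": [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 4.9)],
    "b.png": [(0.2, 0.1), (0.0, 0.0), (4.9, 5.0), (5.0, 5.0)],
    "c.png": [(5.2, 5.1), (4.8, 4.9), (5.0, 5.2), (5.1, 5.0)],
}
pooled = [d for descriptors in images.values() for d in descriptors]
words = learn_dictionary(pooled, k=2)
histograms = {name: visual_histogram(d, words) for name, d in images.items()}
scores = {
    (a, b): histogram_intersection(histograms[a], histograms[b])
    for a, b in combinations(sorted(histograms), 2)
}
```

In this toy data, a.png and b.png use both visual words equally often and receive the highest histogram score, so only that pair would be passed on to the more expensive SSIM comparison.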

Usage:

The FindDuplicates script can be invoked from the command line. Standard usage requires two parameters: the path to the collection documents and 'all'.

scape/pc-qa-matchbox/Python# python2.7 FindDuplicates.py -h

usage: FindDuplicates.py [-h] [-threads THREADS|--threads THREADS] [-sdk SDK|--sdk SDK] [-precluster PRECLUSTER|--precluster PRECLUSTER] [-clahe CLAHE|--clahe CLAHE] [-config CONFIG|--config CONFIG] [-featdir FEATDIR|--featdir FEATDIR] [-bowsize BOWSIZE|--bowsize BOWSIZE] [-csv|--csv] [-v] dir all,extract,compare,train,bowhist,clean
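For scripted runs, the standard invocation can be assembled programmatically. The following is a minimal sketch that uses only the positional arguments named in the usage text above (the collection directory and the action). The helper names `build_command` and `run` and the collection path are hypothetical, and the optional flags are left out deliberately.

```python
import subprocess

# Actions listed in the usage text of FindDuplicates.py.
ACTIONS = ("all", "extract", "compare", "train", "bowhist", "clean")


def build_command(collection_dir, action="all", interpreter="python2.7"):
    """Return the argument list for one FindDuplicates.py run."""
    if action not in ACTIONS:
        raise ValueError("unknown action: %r" % (action,))
    return [interpreter, "FindDuplicates.py", collection_dir, action]


def run(collection_dir, action="all", script_dir="."):
    """Run the tool from the directory that contains FindDuplicates.py."""
    return subprocess.check_call(build_command(collection_dir, action), cwd=script_dir)


# Hypothetical collection path; the command is only built here, not executed.
command = build_command("/path/to/collection")
print(" ".join(command))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when the collection path contains spaces.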

= User Experiences =

Currently installed at the Austrian National Library.

= Development Activity =