Difference between revisions of "Matchbox Tool"

From COPTR
Latest revision as of 14:09, 6 December 2021




Matchbox: Duplicate detection tool for digital document collections.
Homepage: https://github.com/openplanets/scape/tree/master/pc-qa-matchbox
License: Open source
Function: Quality Assurance, De-Duplication
Content type: Image




Description

The Matchbox tool finds duplicate pairs in a collection of digital documents using SIFT features and SSIM methods. It takes a collection path and associated parameters as input. Three scenarios are currently implemented:

  • Duplicate search in one turn (parameter 'all')
  • Professional duplicate search (an experienced user can execute a particular step of the 'FindDuplicates' workflow)
  • Quick check if two documents are duplicates (based on previous BoW dictionary).

Further parameters that influence and adjust the duplicate analysis are currently under investigation.

Image processing method:

The image processing algorithm can be described in 4 steps:

1. Document feature extraction

  • Interest point detection (applying Scale Invariant Feature Transform (SIFT) keypoint extraction)
  • Derivation of local feature descriptors (invariant to geometrical or radiometrical distortions)

2. Learning visual dictionary

  • Clustering method applied to all SIFT descriptors of all images using k-means algorithm
  • Run over collection and collect local descriptors in a visual dictionary using Bag-Of-Words (BoW) algorithm

3. Create visual histogram for each image document

4. Detect similar images based on visual histograms and local descriptors. Evaluate the similarity score by pair-wise comparison of the corresponding keyword frequency histograms for all documents. Conduct structural similarity analysis using the Structural SIMilarity (SSIM) approach (1 means identical and 0 means very different)

  • Rotate
  • Scale
  • Mask
  • Overlaying
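
Steps 3 and 4 can be illustrated with a minimal sketch in plain Python. It is not the Matchbox implementation: the function names, the use of cosine similarity for comparing histograms, and the single-window (global) form of SSIM are assumptions made for illustration, and the descriptors are toy two-dimensional vectors rather than real SIFT descriptors.

```python
import math
from collections import Counter


def nearest_word(descriptor, dictionary):
    """Return the index of the visual word (centroid) closest to a descriptor."""
    def sq_dist(a, b):
        return sum((x - y) * (x - y) for x, y in zip(a, b))
    return min(range(len(dictionary)),
               key=lambda i: sq_dist(descriptor, dictionary[i]))


def bow_histogram(descriptors, dictionary):
    """Step 3: normalised frequency of each visual word in one document."""
    counts = Counter(nearest_word(d, dictionary) for d in descriptors)
    total = float(len(descriptors)) or 1.0  # avoid division by zero
    return [counts[i] / total for i in range(len(dictionary))]


def histogram_similarity(h1, h2):
    """Step 4: cosine similarity of two histograms (assumed measure)."""
    dot = sum(a * b for a, b in zip(h1, h2))
    norm = math.sqrt(sum(a * a for a in h1)) * math.sqrt(sum(b * b for b in h2))
    return dot / norm if norm else 0.0


def global_ssim(x, y, dynamic_range=255.0):
    """Step 4: SSIM over two equal-length, non-empty grayscale pixel lists,
    computed once over the whole image instead of per sliding window."""
    c1 = (0.01 * dynamic_range) ** 2  # stabilising constants from the
    c2 = (0.03 * dynamic_range) ** 2  # standard SSIM definition
    n = float(len(x))
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) * (a - mx) for a in x) / n
    vy = sum((b - my) * (b - my) for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (vx + vy + c2))


# Toy data: a two-word dictionary and two documents with similar descriptors.
dictionary = [[0.0, 0.0], [10.0, 10.0]]
doc_a = [[0.1, 0.2], [9.8, 10.1], [0.3, -0.1]]
doc_b = [[0.2, 0.1], [10.2, 9.9], [-0.2, 0.0]]
hist_a = bow_histogram(doc_a, dictionary)
hist_b = bow_histogram(doc_b, dictionary)
print(histogram_similarity(hist_a, hist_b))
print(global_ssim([10, 20, 30, 40], [10, 20, 30, 40]))
```

In this sketch the histogram comparison acts as a cheap first filter over all document pairs, and the more expensive SSIM check would then be run only on candidate pairs. Note that SSIM as defined here can become slightly negative for strongly anti-correlated images.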

Usage:

The FindDuplicates script can be invoked from the command line. For standard usage two parameters are required: the path to the collection documents and 'all'.

scape/pc-qa-matchbox/Python# python2.7 FindDuplicates.py -h

usage: FindDuplicates.py [-h] [--threads THREADS] [--sdk SDK] [--precluster PRECLUSTER] [--clahe CLAHE] [--config CONFIG] [--featdir FEATDIR] [--bowsize BOWSIZE] [--csv] [-v] dir all,extract,compare,train,bowhist,clean
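
As an illustration of the standard usage, the following sketch assembles a command line from the options listed in the usage text. The helper name build_command, the path /path/to/collection, and the assumption that --threads takes a number are hypothetical; only the option names and the action keywords come from the usage text above.

```python
def build_command(collection_dir, action="all", threads=None, csv=False,
                  script="FindDuplicates.py"):
    """Assemble an argument list for the FindDuplicates script.

    Hypothetical helper: it only arranges arguments in the order shown
    in the usage text and does not run anything itself.
    """
    actions = ("all", "extract", "compare", "train", "bowhist", "clean")
    if action not in actions:
        raise ValueError("unknown action: %r" % (action,))
    cmd = ["python2.7", script]
    if threads is not None:
        cmd += ["--threads", str(threads)]  # assumed to be a thread count
    if csv:
        cmd.append("--csv")
    cmd += [collection_dir, action]  # positional arguments come last
    return cmd


# Standard usage: collection path plus 'all'.
print(" ".join(build_command("/path/to/collection")))
```

The returned list could be passed to subprocess.call from the scape/pc-qa-matchbox/Python directory; passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.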

User Experiences

Currently installed at the Austrian National Library.

Development Activity