Difference between revisions of "Matchbox Tool"
{{Infobox tool
|purpose=Matchbox: Duplicate detection tool for digital document collections.
|homepage=https://github.com/openplanets/scape/tree/master/pc-qa-matchbox
|license=Open source
|function=Quality Assurance, De-Duplication
|content=Image
}}
{{Infobox tool details
|ohloh_id=Matchbox Tool
}}
Latest revision as of 14:09, 6 December 2021
Description
The Matchbox tool finds duplicate pairs in a collection of digital documents using SIFT features and SSIM methods. It takes a collection path and associated parameters as input. Three scenarios are currently implemented:
- Duplicate search in one turn (parameter 'all')
- Professional duplicate search (an experienced user can execute particular steps of the 'FindDuplicates' workflow)
- Quick check of whether two documents are duplicates (based on a previously built BoW dictionary).
Further parameters that influence and adjust the duplicate analysis are currently being investigated.
Image processing method:
The image processing algorithm can be described in four steps:
1. Document feature extraction
- Interest point detection (applying Scale Invariant Feature Transform (SIFT) keypoint extraction)
- Derivation of local feature descriptors (invariant to geometrical or radiometrical distortions)
2. Learning visual dictionary
- Clustering method applied to all SIFT descriptors of all images using k-means algorithm
- Run over collection and collect local descriptors in a visual dictionary using Bag-Of-Words (BoW) algorithm
3. Create visual histogram for each image document
4. Detect similar images based on visual histograms and local descriptors. Evaluate the similarity score by pair-wise comparison of the corresponding keyword frequency histograms for all documents. Conduct structural similarity analysis using the Structural SIMilarity (SSIM) approach (1 means identical and 0 means very different)
- Rotate
- Scale
- Mask
- Overlaying
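Steps 3 and 4 above can be illustrated with a minimal pure-Python sketch. It is not Matchbox's implementation, and it rests on several assumptions: the descriptors are toy 2-D vectors rather than real SIFT descriptors, the visual dictionary is given rather than learned with k-means, cosine similarity is just one possible way to compare histograms, and SSIM is computed once over the whole image instead of in local windows. All function names are hypothetical.

```python
import math


def nearest_word(descriptor, dictionary):
    """Index of the visual word (cluster centre) closest to a descriptor."""
    def sq_dist(word):
        return sum((a - b) ** 2 for a, b in zip(descriptor, word))
    return min(range(len(dictionary)), key=lambda i: sq_dist(dictionary[i]))


def bow_histogram(descriptors, dictionary):
    """Step 3: normalised visual-word frequency histogram for one image."""
    counts = [0] * len(dictionary)
    for d in descriptors:
        counts[nearest_word(d, dictionary)] += 1
    total = sum(counts) or 1  # avoid division by zero for empty images
    return [c / total for c in counts]


def cosine_similarity(h1, h2):
    """Step 4 (assumed metric): compare two histograms of equal length."""
    dot = sum(a * b for a, b in zip(h1, h2))
    norm = math.sqrt(sum(a * a for a in h1)) * math.sqrt(sum(b * b for b in h2))
    return dot / norm if norm else 0.0


def global_ssim(x, y, dynamic_range=255):
    """Single-window SSIM over two equally sized greyscale pixel lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1 = (0.01 * dynamic_range) ** 2  # standard SSIM stabilising constants
    c2 = (0.03 * dynamic_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))


# Toy data (hypothetical): two visual words and two near-identical images.
dictionary = [(0.0, 0.0), (10.0, 10.0)]
image_a = [(0.1, 0.2), (9.8, 10.1), (0.3, 0.1)]
image_b = [(0.2, 0.1), (10.2, 9.9), (0.0, 0.4)]
hist_a = bow_histogram(image_a, dictionary)
hist_b = bow_histogram(image_b, dictionary)
print(cosine_similarity(hist_a, hist_b))
print(global_ssim([0, 255, 0, 255], [0, 255, 0, 255]))
```

In this sketch the histogram comparison acts as a cheap first filter, and SSIM is the more expensive check applied to candidate pairs. Note that the single-window SSIM formula can return negative values for strongly anti-correlated images.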
Usage:
The FindDuplicates script can be invoked from the command line. For standard usage two parameters are required: the path to the collection documents and 'all'.
scape/pc-qa-matchbox/Python# python2.7 FindDuplicates.py -h
usage: FindDuplicates.py [-h] [--threads THREADS] [--sdk SDK] [--precluster PRECLUSTER] [--clahe CLAHE] [--config CONFIG] [--featdir FEATDIR] [--bowsize BOWSIZE] [--csv] [-v] dir all,extract,compare,train,bowhist,clean
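As a rough illustration of how the usage line maps onto a concrete call, the following sketch assembles an argument list for the script. The helper `build_command` and the collection path are hypothetical. Only option names that appear in the usage line are used, and treating the `--threads` value as an integer is an assumption.

```python
def build_command(collection_dir, action="all", threads=None, csv=False,
                  script="FindDuplicates.py", interpreter="python2.7"):
    """Assemble an argument list that follows the usage line above."""
    actions = ("all", "extract", "compare", "train", "bowhist", "clean")
    if action not in actions:
        raise ValueError("unknown action: %s" % action)
    cmd = [interpreter, script]
    if threads is not None:
        cmd += ["--threads", str(threads)]  # value format is an assumption
    if csv:
        cmd.append("--csv")
    # Positional arguments: collection directory first, then the action.
    cmd += [collection_dir, action]
    return cmd


# Hypothetical collection path; the resulting list could be passed to
# subprocess.call(...) from the scape/pc-qa-matchbox/Python directory.
print(" ".join(build_command("/data/collection", threads=4)))
```

Building the command as a list rather than a single string avoids shell quoting problems when the collection path contains spaces.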
User Experiences
Currently installed at the Austrian National Library.