Duplicate detection for quality assurance of document image collections

Reinhold Huber-Mörk, Austrian Institute of Technology, presented a method for quality assurance of scanned content based on computer vision at iPres 2012, Toronto.
In: iPRES 2012 – Proceedings of the 9th International Conference on Preservation of Digital Objects. Toronto 2012, 136-143.
ISBN 978-0-9917997-0-1

Presentation Transcript

  • Duplicate detection for quality assurance of document image collections
    Reinhold Huber-Mörk (1), Alexander Schindler (1,2), Sven Schlarb (3)
    (1) Research Area Intelligent Vision Systems, Department Safety & Security, AIT Austrian Institute of Technology
    (2) Department of Software Technology and Interactive Systems, Vienna University of Technology
    (3) Department for Research and Development, Austrian National Library
  • Overview Digital preservation & quality assurance Digital image preservation workflows Image duplicate detection Keypoints and feature descriptors in Computer Vision Bag of visual words Results on a real-world data set22.11.2012 2
  • SCAPE project and quality assurance
    SCAPE: SCAlable Preservation Environments, EU FP7
    Preservation Components: improve and extend existing tools, develop new ones where necessary, and apply proven approaches such as image and pattern analysis to the problem of ensuring quality in digital preservation
  • Quality assurance in image preservation
    Comparison of image content
    - automatic image processing workflows (e.g. format conversion)
    - reacquisition of images
    Duplicate detection
    - within a single collection (filtering)
    - between collections (merging, comparison)
    Solutions:
    - page segmentation + OCR
    - feature-based approaches
  • Book scan sequence with duplicates (figure)
  • Duplicate detection workflow (figure)
  • Keypoint detection and description (1)
    Keypoints are detected at salient image regions
    A keypoint is described by a descriptor (= vector of features)
    Scale-Invariant Feature Transform - SIFT (Lowe, 2004)
    [Figure: two plots, omitted]
  • Keypoint detection and description (2)
    Invariance w.r.t. color/tone transformation
    Invariance w.r.t. rotation, scaling or translation
    [Figure: four plots, omitted]
  • Keypoint detection and description (3)
    All detections (ordered by scale)
  • Duplicate detection workflow (figure)
  • Bag of visual words (1)
    Bag of words model in text information retrieval:
    Document 1: "Peter likes to read books. Paul likes too."
    Document 2: "Peter also likes to read poems"
    Bag: [ Peter, likes, to, read, books, Paul, too, also, poems ]
    Histogram 1: [ 1, 2, 1, 1, 1, 1, 1, 0, 0 ]
    Histogram 2: [ 1, 1, 1, 1, 0, 0, 0, 1, 1 ]
    Visual analogy: bag of visual words or bag of features
    Document                     Image
    Document made of words       Image made of descriptors
    Bag of words                 Bag of clustered descriptors = visual words
    Word occurrence histogram    Visual word histogram / "fingerprint"
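The text example on this slide can be reproduced in a few lines. The following is a minimal sketch, not taken from the presentation; the helper names `tokenize`, `build_vocabulary` and `histogram` are made up for illustration, and the vocabulary is assumed to be ordered by first appearance, which matches the bag shown on the slide.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into words, dropping punctuation (case is kept, as on the slide)."""
    return re.findall(r"[A-Za-z]+", text)

def build_vocabulary(documents):
    """Collect the distinct words of all documents in order of first appearance."""
    vocabulary = []
    seen = set()
    for doc in documents:
        for word in tokenize(doc):
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary

def histogram(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

doc1 = "Peter likes to read books. Paul likes too"
doc2 = "Peter also likes to read poems"

bag = build_vocabulary([doc1, doc2])
hist1 = histogram(doc1, bag)
hist2 = histogram(doc2, bag)

print(bag)    # ['Peter', 'likes', 'to', 'read', 'books', 'Paul', 'too', 'also', 'poems']
print(hist1)  # [1, 2, 1, 1, 1, 1, 1, 0, 0]
print(hist2)  # [1, 1, 1, 1, 0, 0, 0, 1, 1]
```

The two histograms have the same length regardless of document length, which is what makes them directly comparable.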
  • Bag of visual words (2) (figure)
  • Bag of visual words (3)
    Visual word #104
    Visual word #15
    Visual word #221
    Visual word #312
    Visual word #424
    Visual word #250
  • Duplicate detection workflow (figure)
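To illustrate how descriptors turn into a visual word histogram, here is a rough sketch under simplifying assumptions. The codebook below is hand-picked and two-dimensional purely for readability; in practice the visual words would come from clustering real descriptors (SIFT descriptors have 128 dimensions). The function names are hypothetical and not from the presentation.

```python
import math

def nearest_word(descriptor, codebook):
    """Index of the codebook centre closest to the descriptor (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda k: math.dist(descriptor, codebook[k]))

def visual_word_histogram(descriptors, codebook):
    """Normalised count of descriptors assigned to each visual word."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

# Toy data (assumption): three visual words and four descriptors from one image.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.9), (0.1, 0.8)]

fingerprint = visual_word_histogram(descriptors, codebook)
print(fingerprint)  # [0.25, 0.5, 0.25]
```

Normalising by the number of descriptors keeps fingerprints comparable between pages with different numbers of keypoints; whether the original system normalises this way is not stated on the slides.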
  • Image comparison / duplicate detection schemes
    Comparison of visual histograms - tf ("term frequency") score
    Inverse document frequency - idf
    Spatial verification - sv: detailed image comparison
    [Figure: visual histogram plots, omitted]
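The slide names the tf and idf schemes without giving formulas. The sketch below therefore assumes the common textbook form, idf_k = log(N / n_k) with N images and n_k images containing visual word k, and assumes cosine similarity as the histogram comparison. All names and the toy histograms are illustrative only.

```python
import math

def idf_weights(histograms):
    """idf_k = log(N / n_k); words that occur in no image get weight 0."""
    n_images = len(histograms)
    n_words = len(histograms[0])
    weights = []
    for k in range(n_words):
        df = sum(1 for h in histograms if h[k] > 0)
        weights.append(math.log(n_images / df) if df else 0.0)
    return weights

def tf_idf(hist, weights):
    """Scale each term frequency by the idf weight of its visual word."""
    return [t * w for t, w in zip(hist, weights)]

def cosine_similarity(u, v):
    """Cosine of the angle between two histograms (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy histograms (assumption): pages a and b are duplicates, page c is different.
# Words 1 and 3 occur on every page and so carry no discriminative information.
page_a = [2, 1, 0, 1]
page_b = [2, 1, 0, 1]
page_c = [0, 1, 3, 1]

weights = idf_weights([page_a, page_b, page_c])
sim_tf_ac = cosine_similarity(page_a, page_c)
sim_tfidf_ac = cosine_similarity(tf_idf(page_a, weights), tf_idf(page_c, weights))
sim_tfidf_ab = cosine_similarity(tf_idf(page_a, weights), tf_idf(page_b, weights))
print(sim_tf_ac, sim_tfidf_ac, sim_tfidf_ab)
```

In this toy case the idf weighting suppresses the words shared by all pages, so the non-duplicate pair scores lower than under plain tf while the duplicate pair keeps a similarity of about 1.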
  • Spatial verification (1)
    Bag of visual words maintains no (or limited) spatial information
    Spatial verification:
    1. Ranking of most similar images in a shortlist
    2. Direct matching of descriptors for pairs of images
    3. Overlaying of images
    4. Estimation of similarity
  • Spatial verification (2)
    Pair of possible duplicates
    Descriptor matching
    Estimation of affine transformation
    Image overlay
    Similarity estimation
    Similarity measure: MSSIM
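The last step, the MSSIM similarity measure, can be sketched as follows. This is a simplified illustration and not the presenters' implementation: it assumes the two images are already aligned (the matching and affine overlay steps are skipped), are given as equally sized lists of rows of 8-bit gray values, and it averages SSIM over non-overlapping square blocks instead of the sliding Gaussian window used in the usual definition of SSIM.

```python
def ssim_window(x, y, dynamic_range=255.0):
    """SSIM of two equally long lists of pixel values."""
    c1 = (0.01 * dynamic_range) ** 2   # stabilises the luminance term
    c2 = (0.03 * dynamic_range) ** 2   # stabilises the contrast/structure term
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) * (a - mean_x) for a in x) / n
    var_y = sum((b - mean_y) * (b - mean_y) for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov + c2)
    denominator = (mean_x * mean_x + mean_y * mean_y + c1) * (var_x + var_y + c2)
    return numerator / denominator

def mssim(img_a, img_b, block=4):
    """Mean SSIM over non-overlapping block x block windows of two aligned images."""
    rows, cols = len(img_a), len(img_a[0])
    if rows < block or cols < block:
        raise ValueError("image is smaller than one block")
    scores = []
    for r in range(0, rows - block + 1, block):
        for c in range(0, cols - block + 1, block):
            win_a = [img_a[i][j] for i in range(r, r + block) for j in range(c, c + block)]
            win_b = [img_b[i][j] for i in range(r, r + block) for j in range(c, c + block)]
            scores.append(ssim_window(win_a, win_b))
    return sum(scores) / len(scores)

# Toy images (assumption): an 8 x 8 gradient and its inverted copy.
image = [[(r * 8 + c) * 4 for c in range(8)] for r in range(8)]
inverted = [[255 - v for v in row] for row in image]

score_same = mssim(image, image)
score_inverted = mssim(image, inverted)
print(score_same, score_inverted)
```

An image compared with itself scores about 1, while structurally different content scores lower, which is what makes the measure usable for the final duplicate decision on a shortlisted pair.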
  • Duplicate detection (1)
    Pairwise comparison for a collection of N pages
    [Figure: max(Da) plotted against image index a]
  • Duplicate detection (2)
    Robust outlier detection
    [Figure: max(Da) plotted against image index a; labelled outliers at a=12..15, a=22..25, a=106,107, a=108,109, a=188..197, a=198..207]
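The slides do not say which robust outlier test is used. As one possible illustration, the sketch below assumes that Da holds the similarity scores between page a and every other page, and flags pages whose max(Da) lies far above the median, measured in units of the median absolute deviation (MAD). The MAD rule, the factor k = 3 and all names are assumptions made for this example only.

```python
import statistics

def max_similarity_per_page(similarity):
    """For each page a, the highest similarity to any other page (diagonal skipped)."""
    n = len(similarity)
    return [max(similarity[a][b] for b in range(n) if b != a) for a in range(n)]

def robust_outliers(values, k=3.0):
    """Indices whose value exceeds median + k * 1.4826 * MAD (assumed rule)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    threshold = med + k * 1.4826 * mad   # 1.4826 scales MAD to a std. dev. for normal data
    return [i for i, v in enumerate(values) if v > threshold]

# Toy similarity matrix (assumption): six pages, pages 2 and 3 are duplicates.
n_pages = 6
similarity = [[1.0 if a == b else 0.10 + 0.01 * abs(a - b) for b in range(n_pages)]
              for a in range(n_pages)]
similarity[2][3] = similarity[3][2] = 0.9

max_d = max_similarity_per_page(similarity)
suspects = robust_outliers(max_d)
print(suspects)  # [2, 3]
```

Median and MAD are barely affected by the duplicates themselves, so the threshold adapts to the background similarity level of each book rather than relying on a fixed cut-off.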
  • Comparison of duplicate detection schemes
    a) Visual histogram comparison - tf
    b) tf and inverse document frequency - tf/idf
    c) tf and spatial verification - tf/sv
  • Results Manual vs. automatic detection 59 books, 34805 pages 53 books correctly processed 53/59 ≈ 90% correct 69 of 75 duplicate runs detected 69/75 ≈ 92% correct Missing detections due to heavily mixed content22.11.2012 21
  • Conclusion and outlook
    Workflows for duplicate detection for complex documents
    Keypoint detection and description = purely image based
    Bag of visual words provides fast matching
    Spatial verification applied to shortlist
    Robust thresholding scheme for duplicate identification
    Evaluation at Austrian National Library
    Integration on SCAPE platform for scalable preservation
  • AIT Austrian Institute of Technology
    your ingenious partner
    reinhold.huber-moerk@ait.ac.at