Duplicate detection for quality assurance of document image collections

Reinhold Huber-Mörk, Austrian Institute of Technology, presented a method for quality assurance of scanned content based on computer vision at iPres 2012, Toronto.
In: iPRES 2012 – Proceedings of the 9th International Conference on Preservation of Digital Objects. Toronto 2012, 136-143.
ISBN 978-0-9917997-0-1

Transcript of "Duplicate detection for quality assurance of document image collections"

  1. Duplicate detection for quality assurance of document image collections
     Reinhold Huber-Mörk (1), Alexander Schindler (1,2), Sven Schlarb (3)
     (1) Research Area Intelligent Vision Systems, Department Safety & Security, AIT Austrian Institute of Technology
     (2) Department of Software Technology and Interactive Systems, Vienna University of Technology
     (3) Department for Research and Development, Austrian National Library
  2. Overview
     - Digital preservation & quality assurance
     - Digital image preservation workflows
     - Image duplicate detection
     - Keypoints and feature descriptors in Computer Vision
     - Bag of visual words
     - Results on a real-world data set
     Slide footer date: 22.11.2012
  3. SCAPE project and quality assurance
     - SCAlable Preservation Environments, EU FP7
     - Preservation Components: improve and extend existing tools, develop new ones where necessary, and apply proven approaches such as image and pattern analysis to the problem of ensuring quality in digital preservation
  4. Quality assurance in image preservation
     - Comparison of image content
       - automatic image processing workflows (e.g. format conversion)
       - reacquisition of images
     - Duplicate detection
       - within a single collection (filtering)
       - between collections (merging, comparison)
     - Solutions:
       - page segmentation + OCR
       - feature-based approaches
  5. Book scan sequence with duplicates
  6. Duplicate detection workflow
  7. Keypoint detection and description (1)
     - Keypoints are detected at salient image regions
     - A keypoint is described by a descriptor (= vector of features)
     - Scale-Invariant Feature Transform, SIFT (Lowe, 2004)
     [Figure: two descriptor plots; horizontal axis 0 to 120, vertical axis 0 to 0.2]
  8. Keypoint detection and description (2)
     - Invariance w.r.t. color/tone transformation
     - Invariance w.r.t. rotation, scaling or translation
     [Figure: four descriptor plots; horizontal axis 0 to 120, vertical axis 0 to 0.2]
  9. Keypoint detection and description (3)
     - All detections (ordered by scale)
  10. Duplicate detection workflow
  11. Bag of visual words (1)
      - Bag of words model in text information retrieval:
        Document 1: “Peter likes to read books. Paul likes too.”
        Document 2: “Peter also likes to read poems.”
        Bag: [ Peter, likes, to, read, books, Paul, too, also, poems ]
        Histogram 1: [ 1, 2, 1, 1, 1, 1, 1, 0, 0 ]
        Histogram 2: [ 1, 1, 1, 1, 0, 0, 0, 1, 1 ]
      - Visual analogy: bag of visual words or bag of features

        Document                   | Image
        Document made of words     | Image made of descriptors
        Bag of words               | Bag of clustered descriptors = visual words
        Word occurrence histogram  | Visual word histogram / ”fingerprint”
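The text example on this slide can be reproduced in a few lines. The following is a minimal sketch, assuming a simple letters-only tokenizer and a vocabulary ordered by first appearance; the function names are illustrative and not taken from the slides.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into word tokens, dropping punctuation (assumed tokenizer)."""
    return re.findall(r"[A-Za-z]+", text)


def build_vocabulary(documents):
    """Collect the distinct words of all documents in order of first appearance."""
    vocabulary = []
    seen = set()
    for doc in documents:
        for word in tokenize(doc):
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary


def histogram(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


doc1 = "Peter likes to read books. Paul likes too."
doc2 = "Peter also likes to read poems."

bag = build_vocabulary([doc1, doc2])
hist1 = histogram(doc1, bag)
hist2 = histogram(doc2, bag)

print(bag)
print(hist1)
print(hist2)
```

With this tokenizer, `bag`, `hist1` and `hist2` match the bag and the two histograms listed on the slide.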
  12. Bag of visual words (2)
  13. Bag of visual words (3)
      [Figure: examples of visual words #104, #15, #221, #312, #424 and #250]
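To make the step from descriptors to a visual word histogram concrete, here is a small sketch. It assumes a codebook of cluster centres already exists (the slides only say the descriptors are clustered, so how the centres are obtained is left open), and it uses toy 2-dimensional descriptors in place of real SIFT vectors. All names and values are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, codebook):
    """Index of the cluster centre (visual word) closest to the descriptor."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(descriptor, codebook[i]))


def visual_word_histogram(descriptors, codebook):
    """Count how many descriptors of one image fall on each visual word."""
    hist = [0] * len(codebook)
    for descriptor in descriptors:
        hist[nearest_word(descriptor, codebook)] += 1
    return hist


# Toy data: three visual words and four descriptors (assumed values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
descriptors = [[0.1, 0.1], [0.9, 0.2], [0.8, -0.1], [0.2, 0.9]]

fingerprint = visual_word_histogram(descriptors, codebook)
print(fingerprint)
```

The resulting list plays the role of the per-image "fingerprint" named on the Bag of visual words (1) slide.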
  14. Duplicate detection workflow
  15. Image comparison / duplicate detection schemes
      - Comparison of visual histograms – tf (“term frequency”) score
        [Figure: three visual word histograms; horizontal axis 0 to 500, vertical axis scaled by 10^-3]
      - Inverse document frequency – idf
      - Spatial verification – sv, detailed image comparison
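The tf and idf schemes can be illustrated with a short sketch. The slides do not give the exact formulas, so the choices here are assumptions: tf is the histogram normalised to sum to one, idf is log(N / document frequency), and two pages are compared by cosine similarity. The page histograms are made-up toy values.

```python
import math


def term_frequency(hist):
    """Normalise a visual word histogram so its entries sum to one."""
    total = sum(hist)
    return [c / total for c in hist] if total else [0.0] * len(hist)


def inverse_document_frequency(histograms):
    """idf per visual word: log(N / number of pages containing the word)."""
    n = len(histograms)
    idf = []
    for w in range(len(histograms[0])):
        df = sum(1 for h in histograms if h[w] > 0)
        idf.append(math.log(n / df) if df else 0.0)
    return idf


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def tf_score(h1, h2):
    """Similarity of two pages based on term frequencies only."""
    return cosine_similarity(term_frequency(h1), term_frequency(h2))


def tfidf_score(h1, h2, idf):
    """Similarity of two pages after down-weighting common visual words."""
    w1 = [t * i for t, i in zip(term_frequency(h1), idf)]
    w2 = [t * i for t, i in zip(term_frequency(h2), idf)]
    return cosine_similarity(w1, w2)


# Toy collection: pages 0 and 1 are duplicates, page 2 differs.
pages = [[4, 0, 1, 3], [4, 0, 1, 3], [1, 5, 1, 0]]
idf = inverse_document_frequency(pages)

print(tf_score(pages[0], pages[1]), tf_score(pages[0], pages[2]))
print(tfidf_score(pages[0], pages[1], idf), tfidf_score(pages[0], pages[2], idf))
```

In this toy collection the idf weighting removes the visual words that occur on every page, so the two non-duplicate pages share no weighted words at all, while the duplicate pair stays at the maximum score.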
  16. Spatial verification (1)
      - Bag of visual words maintains no (or limited) spatial information
      - Spatial verification:
        1. Ranking of most similar images in a shortlist
        2. Direct matching of descriptors for pairs of images
        3. Overlaying of images
        4. Estimation of similarity
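The shortlist idea in step 1 can be sketched as a two-stage ranking: a cheap score selects a few candidates, and only those are passed to a costlier check. The `verify` callable below is a hypothetical stand-in for steps 2 to 4, and the scores are invented for illustration.

```python
def shortlist(query_index, scores, size=3):
    """Indices of the `size` pages most similar to the query, best first."""
    candidates = [i for i in range(len(scores)) if i != query_index]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    return candidates[:size]


def rerank(query_index, scores, verify, size=3):
    """Re-score only the shortlisted pages with the costlier `verify` function."""
    verified = [(verify(query_index, i), i)
                for i in shortlist(query_index, scores, size)]
    verified.sort(reverse=True)
    return [(i, score) for score, i in verified]


# Cheap histogram scores of page 0 against all pages (assumed values).
cheap_scores = [1.0, 0.4, 0.9, 0.8, 0.1]

# Hypothetical detailed comparison; a real one would compare the images.
detailed = {2: 0.30, 3: 0.95}
result = rerank(0, cheap_scores, lambda q, i: detailed[i], size=2)
print(result)
```

Here page 2 ranks first on the cheap score but page 3 wins after verification, which is the situation a shortlist followed by a detailed comparison is meant to resolve.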
  17. Spatial verification (2)
      - Pair of possible duplicates
      - Descriptor matching
      - Estimation of affine transformation
      - Image overlay
      - Similarity estimation
      - Similarity measure: MSSIM
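For the final similarity estimate, the slide names MSSIM (mean structural similarity). The sketch below is a simplified version under stated assumptions: it averages the SSIM index over non-overlapping square blocks with uniform weights, whereas implementations commonly use a sliding, Gaussian-weighted window. It also assumes the two grey-level images are already aligned and of equal size. The constants follow the usual SSIM definition, C1 = (0.01 L)^2 and C2 = (0.03 L)^2 for dynamic range L.

```python
def ssim_window(x, y, dynamic_range=255.0):
    """SSIM index for two equally sized flat lists of pixel values."""
    n = len(x)
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def mssim(img_a, img_b, window=4):
    """Mean SSIM over non-overlapping window x window blocks (simplified)."""
    height, width = len(img_a), len(img_a[0])
    if height < window or width < window:
        raise ValueError("image is smaller than the window")
    scores = []
    for top in range(0, height - window + 1, window):
        for left in range(0, width - window + 1, window):
            rows = range(top, top + window)
            cols = range(left, left + window)
            block_a = [img_a[r][c] for r in rows for c in cols]
            block_b = [img_b[r][c] for r in rows for c in cols]
            scores.append(ssim_window(block_a, block_b))
    return sum(scores) / len(scores)


# Toy 8x8 images: a gradient and its inverted copy (assumed data).
image = [[(r * 8 + c) * 4 for c in range(8)] for r in range(8)]
inverted = [[255 - v for v in row] for row in image]

same = mssim(image, image)
different = mssim(image, inverted)
print(same, different)
```

An image compared with itself scores 1 (up to rounding), and the inverted copy scores lower, so a threshold on this value can separate likely duplicates from other pairs.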
  18. Duplicate detection (1)
      - Pairwise comparison for a collection of N pages
      [Figure: max(Da), vertical axis 0 to 1, plotted against image index a, horizontal axis 0 to 500]
  19. Duplicate detection (2)
      - Robust outlier detection
      [Figure: max(Da), vertical axis 0 to 1, against image index a, horizontal axis 0 to 500, with labelled groups a=12..15, a=22..25, a=106,107, a=108,109, a=188..197 and a=198..207]
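The two duplicate detection slides can be read as: take, for each page a, the highest similarity to any other page, max(Da), then flag pages whose value stands out from the rest. The slides do not state which robust rule is used, so the sketch below assumes one common choice, a threshold of median plus k times the scaled median absolute deviation (MAD). The value k = 3 and the data are assumptions.

```python
from statistics import median


def max_similarity_per_page(similarity):
    """For each page a, the highest similarity to any other page b != a."""
    n = len(similarity)
    return [max(similarity[a][b] for b in range(n) if b != a)
            for a in range(n)]


def robust_outliers(values, k=3.0):
    """Indices whose value exceeds median + k * 1.4826 * MAD (assumed rule)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    threshold = med + k * 1.4826 * mad
    return [i for i, v in enumerate(values) if v > threshold]


# Small symmetric similarity matrix for three pages (assumed values).
similarity = [
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.3],
    [0.9, 0.3, 1.0],
]
per_page = max_similarity_per_page(similarity)

# A longer max(Da) curve: most pages score low, two score high.
max_d = [0.20, 0.22, 0.19, 0.21, 0.85, 0.86, 0.23, 0.18, 0.20, 0.21]
duplicates = robust_outliers(max_d)
print(per_page, duplicates)
```

Median and MAD are barely affected by the few high values, so the threshold stays close to the bulk of the data and the two high-scoring pages are flagged.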
  20. Comparison of duplicate detection schemes
      a) Visual histogram comparison - tf
      b) tf and inv. document frequency - tf/idf
      c) tf and spatial verification - tf/sv
  21. Results
      - Manual vs. automatic detection
      - 59 books, 34805 pages
      - 53 books correctly processed: 53/59 ≈ 90% correct
      - 69 of 75 duplicate runs detected: 69/75 ≈ 92% correct
      - Missing detections due to heavily mixed content
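The percentages on the results slide follow directly from the counts; a short check, using only the numbers given above:

```python
books_total, books_correct = 59, 53
runs_total, runs_detected = 75, 69

book_rate = books_correct / books_total   # share of books processed correctly
run_rate = runs_detected / runs_total     # share of duplicate runs detected

print(f"books: {book_rate:.1%}, duplicate runs: {run_rate:.1%}")
```

Rounded to whole percent, these give the 90% and 92% reported on the slide.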
  22. Conclusion and outlook
      - Workflows for duplicate detection for complex documents
      - Keypoint detection and description = purely image based
      - Bag of visual words provides fast matching
      - Spatial verification applied to shortlist
      - Robust thresholding scheme for duplicate identification
      - Evaluation at Austrian National Library
      - Integration on SCAPE platform for scalable preservation
  23. AIT Austrian Institute of Technology, your ingenious partner
      reinhold.huber-moerk@ait.ac.at