An image based approach for content analysis in document collections

Reinhold Huber-Mörk of the Austrian Institute of Technology presented ‘An image based approach for content analysis in document collections’ at
ISVC'13 (9th International Symposium on Visual Computing) in Rethymnon, Crete, Greece, on 31 July 2013.
The presentation covered the development of tools for library workflows for duplicate content detection and content verification for complex documents, together with results of the work.

  1. 1. An image based approach for content analysis in document collections
       Reinhold Huber-Mörk & Alexander Schindler
       AIT Austrian Institute of Technology GmbH, Department Safety & Security, Intelligent Vision Systems, Vienna, Austria
       This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
  2. 2. Motivation
       • Large-scale book and newspaper scanning projects
       • Quality assurance: image quality, content preservation and detection of page duplication
       • Historical material: deprecated language and fonts, handwritten remarks, non-textual content, ...
       • Museum collections: content is essential; different scans with the same content
       • OCR is often difficult
       • Content-based approach = image-based approach
  3. 3. Content duplication and verification
  4. 4. Approach
  5. 5. Local Features: all detections on a book front page (ordered by scale)
  6. 6. Local Features and Descriptors
       • Keypoints are detected at salient image regions
       • A keypoint is described by a descriptor (= vector of features)
       • Scale-Invariant Feature Transform (SIFT) [Lowe, 2004]
       [Figure: two descriptor plots]
  7. 7. Bag of Features (BoF)
  8. 8. Visual Vocabulary
  9. 9. Image comparison
       • Comparison of visual histograms: tf (“term frequency”) score
       • Spatial verification
       • Comparison of 3 schemes for spatial verification, based on:
           • Homography estimation
           • Visual word co-occurrence
           • Global descriptor property statistics
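The slide names a tf score on visual-word histograms but the transcript does not preserve its exact definition. As a minimal sketch, assuming the score behaves like a cosine similarity between normalized term-frequency histograms, the comparison could look as follows. The function names and the toy visual-word lists are illustrative, not taken from the presentation.

```python
import math
from collections import Counter

def tf_histogram(visual_words, vocabulary_size):
    """Normalized term-frequency histogram over visual-word indices."""
    counts = Counter(visual_words)
    total = len(visual_words)
    if total == 0:
        return [0.0] * vocabulary_size
    return [counts.get(w, 0) / total for w in range(vocabulary_size)]

def cosine_similarity(h1, h2):
    """Cosine similarity of two histograms (assumed stand-in for the tf score)."""
    dot = sum(a * b for a, b in zip(h1, h2))
    n1 = math.sqrt(sum(a * a for a in h1))
    n2 = math.sqrt(sum(b * b for b in h2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    return dot / (n1 * n2)

# Toy data: each page is the list of visual words assigned to its keypoints.
page_a = [0, 0, 1, 3, 3, 3]
page_b = [0, 0, 1, 3, 3, 3]   # same content as page_a
page_c = [2, 2, 2, 4, 4, 1]   # mostly different content

hist_a = tf_histogram(page_a, 5)
hist_b = tf_histogram(page_b, 5)
hist_c = tf_histogram(page_c, 5)

sim_ab = cosine_similarity(hist_a, hist_b)
sim_ac = cosine_similarity(hist_a, hist_c)
print(sim_ab, sim_ac)
```

Because the histogram discards keypoint positions, two pages with similar word counts but different layouts can still score highly, which is what the spatial verification step is meant to catch.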
  10. 10. Spatial verification (1): homography estimation and mapping
       [Figure: processing pipeline. Stages shown: image pair, descriptor matching, estimation of affine transformation, image overlay, similarity estimation; similarity measure: MSSIM]
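The pipeline on this slide includes an "estimation of affine transformation" stage fed by matched descriptors. The slides do not say how the estimate is computed, so the following is only a sketch under the assumption of a plain least-squares fit to matched keypoint coordinates. A real pipeline would normally add outlier rejection, and the image overlay and MSSIM steps are not shown. `solve3` and `fit_affine` are hypothetical helper names.

```python
def solve3(m, v):
    """Solve the 3x3 system m * x = v by Gauss-Jordan elimination with pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]

def fit_affine(src, dst):
    """Least-squares affine map from src to dst points.

    Returns (a, b, c, d, e, f) such that
    x' = a*x + b*y + c and y' = d*x + e*y + f.
    """
    # Normal equations (A^T A) p = A^T t, where each row of A is (x, y, 1).
    ata = [[0.0] * 3 for _ in range(3)]
    atx = [0.0] * 3
    aty = [0.0] * 3
    for (x, y), (u, v) in zip(src, dst):
        row = (x, y, 1.0)
        for i in range(3):
            atx[i] += row[i] * u
            aty[i] += row[i] * v
            for j in range(3):
                ata[i][j] += row[i] * row[j]
    return tuple(solve3(ata, atx) + solve3(ata, aty))

# Toy matches: the second scan is scaled by 2 and shifted by (3, -1).
src_pts = [(0, 0), (1, 0), (0, 1), (1, 1)]
dst_pts = [(3, -1), (5, -1), (3, 1), (5, 1)]
params = fit_affine(src_pts, dst_pts)
print(params)
```

With the transformation in hand, one image can be warped onto the other so that a pixel-level similarity measure such as MSSIM compares corresponding regions.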
  11. 11. Spatial verification (2)
       • A co-occurrence matrix of visual words counts the concurrent appearance of two visual words in a spatial neighbourhood
       • Comparison
       [Figure: example co-occurrence matrices]
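To make the co-occurrence idea concrete, here is a small sketch that counts pairs of visual words whose keypoints lie within a fixed radius of each other. The circular neighbourhood, the radius value and the function name are assumptions for illustration; the slides do not specify the neighbourhood shape or how two matrices are compared.

```python
def cooccurrence_matrix(keypoints, vocabulary_size, radius):
    """Symmetric count matrix of visual-word pairs within a given distance.

    keypoints: list of (x, y, word_index) tuples.
    Each unordered pair of distinct keypoints is counted once.
    """
    m = [[0] * vocabulary_size for _ in range(vocabulary_size)]
    r2 = radius * radius
    for i in range(len(keypoints)):
        xi, yi, wi = keypoints[i]
        for j in range(i + 1, len(keypoints)):
            xj, yj, wj = keypoints[j]
            if (xi - xj) ** 2 + (yi - yj) ** 2 <= r2:
                m[wi][wj] += 1
                if wi != wj:
                    m[wj][wi] += 1  # keep the matrix symmetric
    return m

# Toy keypoints: two close pairs, far away from each other.
kps = [(0, 0, 0), (1, 0, 1), (10, 10, 1), (10, 11, 1)]
matrix = cooccurrence_matrix(kps, vocabulary_size=2, radius=2)
print(matrix)
```

The double loop is quadratic in the number of keypoints, which is consistent with this scheme being among the slower options in the timing comparison later in the deck.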
  12. 12. Spatial verification (3a): global keypoint property statistics from position, orientation and scale
       • Spatial inhomogeneity: subdivision of images into a sequence of rectangular tiles [Schilcher et al., 2008]
       • Spatially uniformly distributed: h→0; spatially concentrated: h→1
       [Figure: two examples, with h=0.13 and h=0.21]
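The formula behind h is not reproduced in the transcript. The sketch below is therefore not the measure of Schilcher et al.; it is a stand-in with the same limiting behaviour (0 for an even spread, 1 for full concentration), built from tile occupancy entropy averaged over several grid sizes. The grid sizes and function names are assumptions.

```python
import math

def tile_inhomogeneity(points, width, height, grid):
    """1 minus the normalized entropy of keypoint counts over a grid x grid tiling."""
    counts = [0] * (grid * grid)
    for x, y in points:
        col = min(int(x / width * grid), grid - 1)
        row = min(int(y / height * grid), grid - 1)
        counts[row * grid + col] += 1
    total = len(points)
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return 1.0 - entropy / math.log(grid * grid)

def inhomogeneity(points, width, height, grids=(2, 4, 8)):
    """Average the tile measure over a sequence of tilings (assumed grid sizes)."""
    if not points:
        raise ValueError("no keypoints")
    values = [tile_inhomogeneity(points, width, height, g) for g in grids]
    return sum(values) / len(values)

# Evenly spread keypoints on an 80 x 80 image, one per 10 x 10 cell.
uniform_pts = [(5 + 10 * i, 5 + 10 * j) for i in range(8) for j in range(8)]
# All keypoints at one location.
concentrated_pts = [(1, 1)] * 64

h_uniform = inhomogeneity(uniform_pts, 80, 80)
h_concentrated = inhomogeneity(concentrated_pts, 80, 80)
print(h_uniform, h_concentrated)
```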
  13. 13. Spatial verification (3b)
       • Orientation uniformity: keypoint orientation angles sorted in ascending circular order [Rao, 1972]
       • Uniformly distributed: u→0; dominant orientation: u→1
       [Figure: two examples, with u=0.63 and u=0.76]
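The reference to Rao and to sorted circular angles suggests a spacing-based uniformity statistic. As a sketch, assuming u is Rao's spacing statistic divided by its maximum possible value so that it falls in [0, 1], the computation could be written like this. The normalization is an assumption; the slide only states the two limits.

```python
def orientation_uniformity(angles_deg):
    """Rao's spacing statistic, scaled to [0, 1] by its maximum (assumed scaling).

    Returns 0 for perfectly even spacing and 1 when all angles coincide.
    """
    n = len(angles_deg)
    if n < 2:
        raise ValueError("need at least two orientations")
    a = sorted(x % 360.0 for x in angles_deg)
    # Arc lengths between consecutive angles, including the wrap-around arc.
    spacings = [a[i + 1] - a[i] for i in range(n - 1)]
    spacings.append(360.0 - a[-1] + a[0])
    expected = 360.0 / n
    rao_u = 0.5 * sum(abs(t - expected) for t in spacings)
    return rao_u / (360.0 * (1.0 - 1.0 / n))

u_even = orientation_uniformity([0, 90, 180, 270])
u_dominant = orientation_uniformity([45, 45, 45, 45])
print(u_even, u_dominant)
```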
  14. 14. Spatial verification (3c)
       • Size distribution: SIFT delivers a size estimate S for each keypoint
       • A normalized size s is obtained from [equation not preserved]
       • The variance of S or s is used as a feature
  15. 15. Combination of BoF with spatial verification
       • Derive shortlist L from tf matching
       • Combine tf with a spatial verification term [equation not preserved], where the verification term is based on one of the 3 verification schemes:
           • Homography estimation
           • Visual word co-occurrence
           • Global descriptor property statistics
       • Computational efficiency (book with 256 pages at 72 DPI):
           • tf only: 31 sec.
           • Homography estimation: 449 sec.
           • Visual word co-occurrence: 451 sec.
           • Global descriptor property statistics: 128 sec.
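The two-stage idea on this slide (a cheap tf shortlist followed by a costlier spatial check) can be sketched as below. The combination formula is not preserved in the transcript, so the combining function is passed in by the caller; the product used in the example is purely a hypothetical choice, as are the function name and the toy scores.

```python
def rerank(tf_scores, verify, shortlist_size, combine):
    """Shortlist candidates by tf score, then re-rank with a spatial term.

    tf_scores: dict mapping candidate page -> tf similarity to the query.
    verify:    function(candidate) -> spatial verification score.
    combine:   function(tf, spatial) -> combined score (caller-supplied).
    """
    shortlist = sorted(tf_scores, key=tf_scores.get, reverse=True)[:shortlist_size]
    # The expensive verification runs only on the shortlisted candidates.
    combined = {c: combine(tf_scores[c], verify(c)) for c in shortlist}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Toy scores for three candidate pages.
tf = {"p1": 0.9, "p2": 0.8, "p3": 0.2}
spatial = {"p1": 0.1, "p2": 0.9, "p3": 1.0}
verified = []

def verify(candidate):
    verified.append(candidate)
    return spatial[candidate]

# Hypothetical combination: plain product of the two scores.
ranking = rerank(tf, verify, shortlist_size=2, combine=lambda t, s: t * s)
print(ranking)
```

Restricting verification to the shortlist matters because, by the timings above, every verification scheme costs several times more than tf matching alone.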
  16. 16. Results: combination of BoF with spatial verification (1)
       • The plot shows, for each book page, the maximum combined similarity to all other pages of the same book
  17. 17. Results: combination of BoF with spatial verification (2)
       • How to interpret such plots w.r.t. duplication
  18. 18. Results: combination of BoF with spatial verification (3)
       • Further work: content analysis w.r.t.
           • Sections of similar layout: main body, cover, index, ...
           • Pages of unique content: cover, graphical art, ...
       • Clustering of the 2D space spanned by maximum similarity and page index, using the DBSCAN algorithm [Ester et al., 1996]
       [Figure: maximum similarity (y-axis) against normalized page index (x-axis)]
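For readers unfamiliar with DBSCAN, the following is a compact, unoptimized sketch of the algorithm applied to points of the form (normalized page index, maximum similarity). The sample points and the `eps` and `min_pts` values are invented for illustration and are not results from the presentation.

```python
import math

def dbscan(points, eps, min_pts):
    """Density-based clustering; returns one label per point (-1 means noise).

    A point is a core point if at least min_pts points (itself included)
    lie within distance eps of it.
    """
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point of this cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:  # only core points expand the cluster
                queue.extend(nb)
    return labels

# Invented sample: two dense groups of pages and one isolated page.
pages = [(0.10, 0.90), (0.12, 0.91), (0.14, 0.89),
         (0.60, 0.40), (0.62, 0.41), (0.64, 0.39),
         (0.95, 0.05)]
labels = dbscan(pages, eps=0.05, min_pts=3)
print(labels)
```

DBSCAN needs no preset number of clusters and marks sparse points as noise, which fits a setting where the number of layout sections per book is unknown and isolated pages may carry unique content.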
  19. 19. Results: combination of BoF with spatial verification (4)
       [Figure: maximum similarity (y-axis) against normalized page index (x-axis)]
  20. 20. Results: duplicate detection
       • Manual vs. automatic detection
       • 59 books, 34805 pages
       • 53 books correctly processed: 53/59 ≈ 90% correct
       • 69 of 75 duplicate runs detected: 69/75 ≈ 92% correct
       • Missing detections due to heavily mixed content
  21. 21. Results: content verification between two versions (scans) of book collections (1)
       [Figure: examples of pairs with high similarity, pairs with low similarity (or low overlap), and pairs not matching]
  22. 22. Results: content verification between two versions (scans) of book collections (2)
       Nr.  Book/barcode name               #Pairs  Rate of matches
       1    +Z13641740X_31525197396364410      546  0.9982
       2    +Z13722110X_31525197396362478       18  1.0000
       3    +Z136400800_31525197396361993      269  0.9888
       4    +Z136408008_31525197396361942      291  0.9897
       5    +Z136408409_31525197396362038        1  1.0000
       6    +Z136409104_31525197396361681      182  0.9670
       7    +Z136411408_31525197396362266      219  0.9954
       8    +Z136415001_31525197396363522      360  0.9861
       9    +Z136419900_31525197396360634      219  0.9954
       10   +Z136428500_31525197396360351      249  0.9799
       11   +Z136436004_31525197396361129      273  0.9853
       12   +Z137116108_31525197396265632      589  0.9949
       13   +Z137117708_31525197396287838      651  0.9969
       14   +Z137118403_31525197396265776      505  0.9822
       15   +Z137120100_31525197396265914     1231  0.9992
       16   +Z137150402_31525197396389590        2  1.0000
       17   +Z137219001_31525197396361518      664  0.9774
       18   +Z150800609_31525197396361025      212  0.9858
       19   +Z152471307_31525197396311214      443  0.9910
       20   +Z152472403_31525197396313828      460  0.9913
       21   +Z152472701_31525197396315698      859  0.9953
       rate = 1 means content is identical
  23. 23. Conclusion
       • Tools for library workflows for duplicated content detection and content verification for complex documents
       • Keypoint detection and description = purely image-based approach
       • Bag of visual words and spatial verification
       • Global descriptor statistics provide reasonably good and fast spatial verification
       • Further work: content clustering and classification of defects
  24. 24. AIT Austrian Institute of Technology your ingenious partner Reinhold Huber-Mörk Reinhold.huber-moerk@ait.ac.at
