Quality assurance for document image collections in digital preservation

Reinhold Huber-Mörk, AIT Austrian Institute of Technology, gave a presentation on ‘Quality assurance for document image collections in digital preservation’ at the Acivs conference in Brno, Czech Republic in September 2012. Acivs is short for Advanced Concepts for Intelligent Vision Systems and focuses on techniques for building adaptive, intelligent, safe and secure imaging systems.

  1. Quality assurance for document image collections in digital preservation
     Reinhold Huber-Mörk (1) & Alexander Schindler (1,2)
     (1) Research Area Intelligent Vision Systems, Department Safety & Security, AIT Austrian Institute of Technology
     (2) Department of Software Technology and Interactive Systems, Vienna University of Technology
  2. Overview
     • Digital preservation
     • Quality assurance in digital image preservation workflows
     • Keypoint based approach for image content comparison
     • Spatially distinctive keypoints
     • Document image preprocessing
     • Structural similarity assessment
     • Results on real-world data sets
     Slide footer: 12.09.2013
  3. Digital preservation
     • "Set of processes, activities and management of digital information over time to ensure its long term accessibility" (Source: Wikipedia)
     • Physical damage of digital/digitized content, e.g. "bit rot" related to some storage media
     • Digital obsolescence of hardware/software, e.g. vanishing file formats
     • Content modification in preservation, e.g. error injection during file format conversion, digital manipulation, reacquisition, ...
     Images provided by historical newspaper collection / The British Library
  4. Quality assurance in digital image preservation workflows
     • Automated preservation workflows are common in large digitization projects (e.g. museum collections, Google books PPPs, ...).
     • Automated quality assurance to ensure file format consistency, detection of duplicates and quality and content preservation.
     • SCAPE FP7
  5. Keypoint based approach for document comparison
     • Local features are detected & described by standard LoG/SIFT approach
     • Scaling, rotation, cropping and additional/missing content is handled
     • Affine transformation is sufficient (usually no perspective, bending etc.)
     Images provided by historical newspaper collection / The British Library
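The affine model named on this slide can be illustrated with a minimal sketch. It assumes point correspondences are already available as (x, y) tuples, estimates the six affine parameters from three correspondences, and wraps that in a simple RANSAC loop. The function names, the iteration count and the 2-pixel tolerance are illustrative assumptions, not values from the slides.

```python
import math
import random

def solve3(m, v):
    """Solve the 3x3 linear system m * x = v with Cramer's rule."""
    def det(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
                - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
                + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))
    d = det(m)
    if abs(d) < 1e-12:
        return None  # the three source points are (nearly) collinear
    out = []
    for col in range(3):
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = v[r]
        out.append(det(mc) / d)
    return out

def affine_from_three(src, dst):
    """Return (a, b, tx, c, d, ty) with x' = a*x + b*y + tx and y' = c*x + d*y + ty."""
    m = [[float(x), float(y), 1.0] for x, y in src]
    row_x = solve3(m, [p[0] for p in dst])
    row_y = solve3(m, [p[1] for p in dst])
    if row_x is None or row_y is None:
        return None
    return tuple(row_x + row_y)

def apply_affine(t, p):
    a, b, tx, c, d, ty = t
    return (a * p[0] + b * p[1] + tx, c * p[0] + d * p[1] + ty)

def ransac_affine(src, dst, iters=200, tol=2.0, seed=0):
    """Keep the affine model (from a random 3-point sample) with the most inliers."""
    rng = random.Random(seed)
    best_t, best_inliers = None, []
    for _ in range(iters):
        sample = rng.sample(range(len(src)), 3)
        t = affine_from_three([src[i] for i in sample], [dst[i] for i in sample])
        if t is None:
            continue
        inliers = [i for i in range(len(src))
                   if math.dist(apply_affine(t, src[i]), dst[i]) <= tol]
        if len(inliers) > len(best_inliers):
            best_t, best_inliers = t, inliers
    return best_t, best_inliers
```

A production pipeline would typically refit the transform on all inliers with least squares; this sketch returns the model of the best minimal sample only.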
  6. Spatially distinctive keypoints (SDKs) (1)
     • High-resolution document scans contain a large number of keypoints (e.g. ~50,000 keypoints on ~5000x3000 pixel images)
     • Matching of descriptors results in high computational complexity
     • Changing the detector edge/peak thresholds often results in a spatially uneven distribution of keypoints
     • One solution is dense/regular spatial sampling of keypoints
     • Another solution is adaptive non-maximal suppression (Brown et al., 2005)
     • Our solution is to enforce keypoint selection at positions locally adjacent to spatially uniformly distributed interest regions
  7. Spatially distinctive keypoints (SDKs) (2)
     • Interest regions are distributed over the image using a regular grid
     • Keypoints with highest saliency are selected from each interest region (Harris & Stephens corner strength is used as saliency measure)
     Images provided by International Dunhuang Project / The British Library
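The grid-based selection described on the two SDK slides can be sketched as follows. The sketch assumes that each interest region is simply one cell of the regular grid and that every keypoint already carries a saliency score (for example a corner strength); the slides do not specify the region shape, so both the cell interpretation and the `select_sdks` name are assumptions.

```python
def select_sdks(keypoints, width, height, rows, cols, per_cell=1):
    """Pick the most salient keypoints from each cell of a rows x cols grid.

    keypoints: iterable of (x, y, saliency) tuples.
    Returns the selected keypoints, ordered by grid cell (row, column).
    """
    cells = {}
    for x, y, saliency in keypoints:
        # Map the position to a grid cell; clamp points on the far border.
        r = min(int(y * rows / height), rows - 1)
        c = min(int(x * cols / width), cols - 1)
        cells.setdefault((r, c), []).append((x, y, saliency))

    selected = []
    for key in sorted(cells):
        ranked = sorted(cells[key], key=lambda k: k[2], reverse=True)
        selected.extend(ranked[:per_cell])
    return selected
```

With this scheme the number of keypoints is bounded by `rows * cols * per_cell`, and empty cells (e.g. blank page margins) simply contribute nothing.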
  8. Evaluation SDK (1)
     • Mean SSIM vs. number of SDKs (evaluated on 1560 Dunhuang image pairs)
     Figure panels: #SDKs=64, #SDKs=256, #SDKs=512, #SDKs=1024, #SDKs=2048, all keypoints
  9. Evaluation SDK (2)
     • Dependency of mean SSIM on the number of SDKs (evaluated on 1560 Dunhuang image pairs)
  10. Robust symmetric matching
     • RANSAC constrained by affine transformation
     • Only accept significant matches - distance ratio of best and second best match
     • Enforcing one-to-one matching of descriptors - ignoring ambiguous matches
     Images provided by International Dunhuang Project / The British Library
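The two filtering steps on this slide, the distance-ratio test and one-to-one matching, can be sketched with brute-force nearest-neighbour search. Descriptors are assumed to be equal-length tuples of floats compared with Euclidean distance, and the ratio of 0.8 is a commonly used default rather than a value given in the slides.

```python
import math

def ratio_matches(desc_a, desc_b, ratio=0.8):
    """Map index in desc_a -> index in desc_b for matches that pass the ratio test."""
    matches = {}
    for i, da in enumerate(desc_a):
        dists = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        # Accept only if the best match is clearly better than the second best.
        if len(dists) >= 2 and dists[0][0] < ratio * dists[1][0]:
            matches[i] = dists[0][1]
    return matches

def symmetric_matches(desc_a, desc_b, ratio=0.8):
    """Keep pairs (i, j) where i -> j and j -> i both pass the ratio test."""
    ab = ratio_matches(desc_a, desc_b, ratio)
    ba = ratio_matches(desc_b, desc_a, ratio)
    return sorted((i, j) for i, j in ab.items() if ba.get(j) == i)
```

The surviving pairs would then be passed to the affine-constrained RANSAC step; for tens of thousands of 128-dimensional descriptors an approximate nearest-neighbour index would replace the brute-force search.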
  11. Image preprocessing (1)
     • Content in (historical) book collections is characterized by a mixture of text, graphical art, empty pages & other artefacts
     • E.g. consider a sample from the Dunhuang manuscripts
     Images provided by International Dunhuang Project / The British Library
  12. Image preprocessing (2)
     • Locally adaptive histogram equalization to enhance paper structure while preserving text structure
     • Contrast limited adaptive histogram equalization (CLAHE, Pizer et al., 1987), where grid/tile spacing ~ character size (e.g. 40x50 pixels)
     Images provided by International Dunhuang Project / The British Library
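The contrast-limiting idea behind CLAHE can be shown for a single tile. The sketch below builds the tile histogram, clips it at a limit, spreads the clipped counts evenly over all grey levels, and turns the result into a grey-level mapping. It is deliberately incomplete: full CLAHE additionally interpolates between the mappings of neighbouring tiles, which is omitted here. The function name and the single-pass redistribution are assumptions for illustration.

```python
def clipped_equalization_map(tile, clip_limit, levels=256):
    """Grey-level mapping for one tile using a clipped histogram.

    tile: 2D list of integer grey values in [0, levels - 1].
    clip_limit: maximum count allowed per histogram bin.
    """
    hist = [0] * levels
    for row in tile:
        for value in row:
            hist[value] += 1

    # Clip each bin and collect the excess counts.
    excess = 0
    for g in range(levels):
        if hist[g] > clip_limit:
            excess += hist[g] - clip_limit
            hist[g] = clip_limit

    bonus = excess / levels      # excess spread evenly over all bins
    total = sum(hist) + excess   # equals the number of pixels in the tile

    # Cumulative distribution of the clipped histogram gives the mapping.
    mapping, cumulative = [], 0.0
    for g in range(levels):
        cumulative += hist[g] + bonus
        mapping.append(round((levels - 1) * cumulative / total))
    return mapping
```

A low clip limit flattens the mapping, so a tile dominated by one paper tone is not stretched to full contrast; with the tile size near the character size, as the slide suggests, each mapping adapts to roughly one glyph and its surrounding paper.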
  13. Image preprocessing (3)
     Figure panels: Tile centers, Original, Global hist. eq., CLAHE
     Images provided by International Dunhuang Project / The British Library
  14. Structural similarity (1)
     • MSE, PSNR, etc. are not well suited for content comparison -> perceptual image quality assessment
     • Non-blind/full-reference image quality assessment
     • The mean structural similarity index (SSIM, Wang et al., 2004) compares two images based on luminance, contrast and structure terms.
     • Mean SSIM is evaluated for the overlapping region of image pairs -> registration
     • To lower the influence of misregistration the local minimum of the mean SSIM between the images in the pair is evaluated
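The SSIM index combines the luminance, contrast and structure terms into one expression per local window, and the mean SSIM averages it over the image. The sketch below uses the standard single-scale formula with the usual constants C1 = (0.01 L)^2 and C2 = (0.03 L)^2, but simplifies the windowing to non-overlapping uniform blocks instead of the Gaussian-weighted sliding window of the original paper. It assumes two registered greyscale images of equal size given as 2D lists; the block size of 8 is an arbitrary choice.

```python
def ssim_window(x, y, dynamic_range=255.0):
    """SSIM of two equally long lists of pixel values from the same window."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    return (((2 * mean_x * mean_y + c1) * (2 * cov_xy + c2))
            / ((mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)))

def mean_ssim(img_a, img_b, block=8):
    """Average SSIM over non-overlapping block x block windows."""
    height, width = len(img_a), len(img_a[0])
    scores = []
    for r in range(0, height - block + 1, block):
        for c in range(0, width - block + 1, block):
            win_a = [img_a[i][j] for i in range(r, r + block) for j in range(c, c + block)]
            win_b = [img_b[i][j] for i in range(r, r + block) for j in range(c, c + block)]
            scores.append(ssim_window(win_a, win_b))
    return sum(scores) / len(scores)
```

Identical images score 1, and the score drops as local structure diverges, which is why the measure suits content comparison better than MSE or PSNR.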
  15. Structural similarity (2)
     • Registered and overlaid images (SSIM low ... black, SSIM high ... white)
     Images provided by International Dunhuang Project / The British Library
  16. Results - International Dunhuang Project data (1)
     Category                                 Criterion                          Pairs
     Pairs not matching                       Mean SSIM = 0                      8
     Pairs with low structural similarity     Mean SSIM < 0.67                   78
     Pairs with high structural similarity    Mean SSIM > 0.67 (p=5 quantile)    1482
     Total number                                                                1560
  17. Results - International Dunhuang Project data (2)
     • Pairs of high mean SSIM are not subject to a human verification
     Images provided by International Dunhuang Project / The British Library
  18. Results - International Dunhuang Project data (3)
     • Pairs of medium mean SSIM are possibly subject to human verification
     Images provided by International Dunhuang Project / The British Library
  19. Results - International Dunhuang Project data (4)
     • Pairs of low mean SSIM are subject to human verification
     Images provided by International Dunhuang Project / The British Library
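The triage implied by the results slides can be sketched as a simple thresholding of the mean SSIM per image pair. The only threshold stated in the slides is 0.67 (given as the p=5 quantile; 78 of 1560 pairs is 5%), so the sketch uses that single value and does not model the "medium" band, whose bounds are not given. The category labels and the function name are hypothetical.

```python
def triage(mean_ssim_by_pair, threshold=0.67):
    """Group image pairs by mean SSIM.

    mean_ssim_by_pair: dict mapping a pair identifier to its mean SSIM.
    A score of 0 is treated as "matching failed", following the results table.
    Scores equal to the threshold are conservatively placed in the low group.
    """
    groups = {"not_matching": [], "low": [], "high": []}
    for pair_id, score in mean_ssim_by_pair.items():
        if score == 0:
            groups["not_matching"].append(pair_id)
        elif score > threshold:
            groups["high"].append(pair_id)
        else:
            groups["low"].append(pair_id)
    return groups
```

Only the pairs outside the `high` group would be forwarded to a human operator, which is what makes the automated check worthwhile on large collections.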
  20. Results - Google books redownload workflow (1)
     Rate = 1: content is identical

     Nr.  Book/barcode name                #Pairs  Rate of matches
     1    +Z13641740X_31525197396364410    546     0.9982
     2    +Z13722110X_31525197396362478    18      1.0000
     3    +Z136400800_31525197396361993    269     0.9888
     4    +Z136408008_31525197396361942    291     0.9897
     5    +Z136408409_31525197396362038    1       1.0000
     6    +Z136409104_31525197396361681    182     0.9670
     7    +Z136411408_31525197396362266    219     0.9954
     8    +Z136415001_31525197396363522    360     0.9861
     9    +Z136419900_31525197396360634    219     0.9954
     10   +Z136428500_31525197396360351    249     0.9799
     11   +Z136436004_31525197396361129    273     0.9853
     12   +Z137116108_31525197396265632    589     0.9949
     13   +Z137117708_31525197396287838    651     0.9969
     14   +Z137118403_31525197396265776    505     0.9822
     15   +Z137120100_31525197396265914    1231    0.9992
     16   +Z137150402_31525197396389590    2       1.0000
     17   +Z137219001_31525197396361518    664     0.9774
     18   +Z150800609_31525197396361025    212     0.9858
     19   +Z152471307_31525197396311214    443     0.9910
     20   +Z152472403_31525197396313828    460     0.9913
     21   +Z152472701_31525197396315698    859     0.9953
  21. Results - Google books redownload workflow (2)
     Figure panels: Pairs not matching; Pairs with low similarity (or low overlap); Pairs with high similarity
     Images provided by Google books collection / Austrian National Library
  22. Conclusion and outlook
     • Keypoint based approach for quality assurance in digital book preservation
     • Combination of keypoints approach with perceptual similarity evaluation
     • Recently: combination with bag of keypoints approach for duplicate detection and collection comparison
     • Currently: Evaluation at Austrian National Library (Google books collection) and British Library (historical newspaper collection)
     • Future: Integration on SCAPE platform for scalable distributed computing
  23. AIT Austrian Institute of Technology
     your ingenious partner
     reinhold.huber-moerk@ait.ac.at
