Duplicate detection for quality assurance
of document image collections
Reinhold Huber-Mörk (1) & Alexander Schindler (1,2) & Sven Schlarb (3)

1   Research Area Intelligent Vision Systems, Department Safety & Security
    AIT Austrian Institute of Technology

2   Department of Software Technology and Interactive Systems
    Vienna University of Technology

3   Department for Research and Development
    Austrian National Library
Overview

    Digital preservation & quality assurance

    Digital image preservation workflows

    Image duplicate detection

    Keypoints and feature descriptors in Computer Vision

    Bag of visual words

    Results on a real-world data set




22.11.2012                                                  2
SCAPE project and quality assurance

    SCAlable Preservation Environments, EU FP7

    Preservation Components:

     - improve and extend existing tools,

     - develop new ones where necessary,

     - apply proven approaches such as image and pattern analysis to the
       problem of ensuring quality in digital preservation




Quality assurance in image preservation

    Comparison of image content

     - automatic image processing workflows (e.g. format conversion)

     - reacquisition of images

    Duplicate detection

     - within a single collection (filtering)

     - between collections (merging, comparison)

    Solutions:

     - page segmentation + OCR

     - feature based approaches

Book scan sequence with duplicates




Duplicate detection workflow

Keypoint detection and description (1)

    Keypoints are detected at salient image regions

    A keypoint is described by a descriptor (= vector of features)

    Scale-Invariant Feature Transform - SIFT (Lowe, 2004)


    [Figure: two example descriptor plots (index axis 20 to 120, value axis 0 to 0.2)]

Keypoint detection and description (2)

    Invariance w.r.t. color/tone transformation

    Invariance w.r.t. rotation, scaling or translation

    [Figure: four descriptor plots (index axis 20 to 120, value axis 0 to 0.2)]

Keypoint detection and description (3)

    All detections (ordered by scale)




Duplicate detection workflow

Bag of visual words (1)

    Bag of words model in text information retrieval:

     Document 1: “Peter likes to read books. Paul likes too.”
     Document 2: “Peter also likes to read poems.”
     Bag:         [ Peter, likes, to, read, books, Paul, too, also, poems ]
     Histogram 1: [     1,     2,  1,    1,     1,    1,   1,    0,     0 ]
     Histogram 2: [     1,     1,  1,    1,     0,    0,   0,    1,     1 ]

    Visual analogy: bag of visual words or bag of features

Document                               Image
Document made of words                 Image made of descriptors
Bag of words                           Bag of clustered descriptors = visual words
Word occurrence histogram              Visual word histogram / “fingerprint”

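The text example on this slide can be reproduced in a few lines. The following is a minimal sketch, not code from the presentation: the helper names `tokenize`, `build_vocabulary` and `histogram` are illustrative, and the bag is assumed to be ordered by first appearance of each word.

```python
import re
from collections import Counter

def tokenize(text):
    """Split a text into words, dropping punctuation (case is kept)."""
    return re.findall(r"[A-Za-z]+", text)

def build_vocabulary(documents):
    """Collect the distinct words of all documents in order of first appearance."""
    vocabulary = []
    seen = set()
    for doc in documents:
        for word in tokenize(doc):
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary

def histogram(document, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokenize(document))
    return [counts[word] for word in vocabulary]

doc1 = "Peter likes to read books. Paul likes too."
doc2 = "Peter also likes to read poems."

bag = build_vocabulary([doc1, doc2])
hist1 = histogram(doc1, bag)
hist2 = histogram(doc2, bag)

print(bag)    # ['Peter', 'likes', 'to', 'read', 'books', 'Paul', 'too', 'also', 'poems']
print(hist1)  # [1, 2, 1, 1, 1, 1, 1, 0, 0]
print(hist2)  # [1, 1, 1, 1, 0, 0, 0, 1, 1]
```

Both histograms have the same length regardless of document length, which is what makes them directly comparable.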
Bag of visual words (2)




Bag of visual words (3)

    [Figure: examples of visual words #104, #15, #221, #312, #424 and #250]

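A visual word histogram can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the vocabulary is taken as a given list of cluster centres (how the descriptors were clustered is not stated on the slides), the descriptors are 2-dimensional toy vectors rather than real SIFT descriptors, and all function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_word(descriptor, vocabulary):
    """Index of the cluster centre (visual word) closest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda i: squared_distance(descriptor, vocabulary[i]))

def visual_word_histogram(descriptors, vocabulary):
    """Count how many descriptors of one image fall on each visual word."""
    hist = [0] * len(vocabulary)
    for descriptor in descriptors:
        hist[nearest_word(descriptor, vocabulary)] += 1
    return hist

# Toy data (assumption): three visual words and four descriptors from one image.
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 0.1), (0.8, -0.1), (0.1, 1.2)]

fingerprint = visual_word_histogram(descriptors, vocabulary)
print(fingerprint)  # [1, 2, 1]
```

The resulting list plays the role of the word occurrence histogram in the text analogy: one fixed-length "fingerprint" per image.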
Duplicate detection workflow

Image comparison / duplicate detection schemes

    Comparison of visual histograms - tf (“term frequency”) score

    [Figure: three example visual word histograms (bin axis 50 to 500, values on the order of 10^-3)]

    Inverse document frequency - idf

    Spatial verification - sv: detailed image comparison

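The tf and idf scores can be illustrated with a small sketch. The slides do not give the exact formulas, so the choices here are assumptions taken from standard text retrieval: tf is the histogram normalised to sum 1, idf is log(N / df), and two pages are compared by cosine similarity. The page histograms are made-up toy values.

```python
import math

def term_frequency(hist):
    """Normalise a visual word histogram so that its entries sum to 1."""
    total = sum(hist)
    return [h / total for h in hist] if total else [0.0] * len(hist)

def inverse_document_frequency(histograms):
    """idf per visual word: log(N / number of pages containing the word)."""
    n = len(histograms)
    idf = []
    for w in range(len(histograms[0])):
        df = sum(1 for h in histograms if h[w] > 0)
        idf.append(math.log(n / df) if df else 0.0)
    return idf

def tf_idf(hist, idf):
    """Weight the term frequencies of one page by the collection-wide idf."""
    return [t * w for t, w in zip(term_frequency(hist), idf)]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy collection (assumption): page 1 is a rescan of page 0, page 2 differs.
pages = [
    [4, 0, 1, 3],
    [4, 0, 1, 3],
    [1, 5, 2, 0],
]
idf = inverse_document_frequency(pages)

tf_other = cosine_similarity(term_frequency(pages[0]), term_frequency(pages[2]))
sim_dup = cosine_similarity(tf_idf(pages[0], idf), tf_idf(pages[1], idf))
sim_other = cosine_similarity(tf_idf(pages[0], idf), tf_idf(pages[2], idf))
```

In this toy collection the idf weight removes visual words that occur on every page, so the unrelated pair drops from a small positive tf similarity to zero while the duplicate pair stays at 1.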
Spatial verification (1)

    Bag of visual words maintains no (or limited) spatial information




    Spatial verification:

     1. Ranking of most similar images in a shortlist
     2. Direct matching of descriptors for pairs of images
     3. Overlaying of images
     4. Estimation of similarity

Spatial verification (2)

    1. Pair of possible duplicates
    2. Descriptor matching
    3. Estimation of affine transformation
    4. Image overlay
    5. Similarity estimation (similarity measure: MSSIM)

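The affine estimation step can be sketched as an ordinary least-squares fit over matched keypoint positions. This is only an illustration: it assumes the matches are already free of outliers (a production pipeline would typically add a robust scheme on top), the point coordinates are invented, and the function names are hypothetical. Image overlay and MSSIM are not covered here.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - known) / aug[r][r]
    return solution

def estimate_affine(src_points, dst_points):
    """Least-squares fit of x' = a*x + b*y + c and y' = d*x + e*y + f.

    Returns ((a, b, c), (d, e, f)). Needs at least three non-collinear matches.
    """
    normal = [[0.0] * 3 for _ in range(3)]
    rhs_x = [0.0] * 3
    rhs_y = [0.0] * 3
    for (x, y), (u, v) in zip(src_points, dst_points):
        p = (x, y, 1.0)
        for i in range(3):
            rhs_x[i] += p[i] * u
            rhs_y[i] += p[i] * v
            for j in range(3):
                normal[i][j] += p[i] * p[j]
    return (tuple(solve_linear_system(normal, rhs_x)),
            tuple(solve_linear_system(normal, rhs_y)))

def apply_affine(transform, point):
    """Map a point from the first image into the second image."""
    (a, b, c), (d, e, f) = transform
    x, y = point
    return (a * x + b * y + c, d * x + e * y + f)

# Toy matches (assumption): the second scan is scaled by 2 and shifted by (10, -5).
src = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0), (100.0, 100.0)]
dst = [(10.0, -5.0), (210.0, -5.0), (10.0, 195.0), (210.0, 195.0)]
transform = estimate_affine(src, dst)
mapped = apply_affine(transform, (50.0, 50.0))
```

With the estimated transformation, one image can be warped onto the other so that a pixel-level similarity measure is computed on aligned content.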
Duplicate detection (1)

    Pairwise comparison for a collection of N pages



    [Figure: max(Da) plotted against image index a (a from 0 to 500, max(Da) from 0 to 1)]

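The pairwise comparison can be sketched as below. The slide does not define Da; this sketch assumes Da is the list of similarity scores between page a and every other page, so that max(Da) is the score of the best-matching other page. Histogram intersection is used as a stand-in similarity, and the page fingerprints are toy values.

```python
def histogram_intersection(u, v):
    """Similarity in [0, 1] for two histograms that each sum to 1."""
    return sum(min(a, b) for a, b in zip(u, v))

def best_match_scores(pages, similarity):
    """For each page a, return (max similarity to any other page, index of that page).

    Requires at least two pages; on ties the lowest index wins.
    """
    results = []
    for a, page_a in enumerate(pages):
        best_score, best_index = float("-inf"), None
        for b, page_b in enumerate(pages):
            if a == b:
                continue  # a page is never compared with itself
            score = similarity(page_a, page_b)
            if score > best_score:
                best_score, best_index = score, b
        results.append((best_score, best_index))
    return results

# Toy fingerprints (assumption): pages 0 and 2 are duplicates of each other.
pages = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.5, 0.0],
    [1.0, 0.0, 0.0],
]
results = best_match_scores(pages, histogram_intersection)
print(results)  # [(1.0, 2), (0.5, 0), (1.0, 0), (0.5, 0)]
```

The loop makes on the order of N squared comparisons, which is why a cheap fingerprint comparison is attractive before any detailed image comparison.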
Duplicate detection (2)

    Robust outlier detection



    [Figure: max(Da) plotted against image index a, with annotations at a=12..15, a=22..25, a=106,107, a=108,109, a=188..197 and a=198..207]

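One common way to pick out unusually high scores robustly is a threshold based on the median and the median absolute deviation (MAD). The slides do not say which robust scheme was used, so the following is a stand-in sketch: the factor `k`, the scaling constant 1.4826 and the scores are all assumptions for illustration.

```python
from statistics import median

def robust_outliers(scores, k=3.0):
    """Indices whose score exceeds median + k * scaled MAD.

    The constant 1.4826 makes the MAD comparable to a standard deviation
    for normally distributed data. If the MAD is 0, every score above the
    median is flagged.
    """
    med = median(scores)
    mad = median([abs(s - med) for s in scores])
    threshold = med + k * 1.4826 * mad
    return [i for i, s in enumerate(scores) if s > threshold]

# Toy best-match scores (assumption): pages 4 and 5 stand out as likely duplicates.
scores = [0.20, 0.22, 0.19, 0.21, 0.85, 0.88, 0.23, 0.18, 0.20, 0.21]
duplicates = robust_outliers(scores)
print(duplicates)  # [4, 5]
```

Because median and MAD are barely affected by a few extreme values, the threshold adapts to the typical similarity level of a book instead of relying on a fixed cut-off.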
Comparison of duplicate detection schemes

    a) Visual histogram comparison - tf

    b) tf and inverse document frequency - tf/idf

    c) tf and spatial verification - tf/sv

Results

    Manual vs. automatic detection

    59 books, 34805 pages

    53 books correctly processed

     53/59 ≈ 90% correct

    69 of 75 duplicate runs detected

     69/75 ≈ 92% correct

    Missing detections due to heavily mixed content

Conclusion and outlook

    Workflows for duplicate detection for complex documents

    Keypoint detection and description = purely image based

    Bag of visual words provides fast matching

    Spatial verification applied to shortlist

    Robust thresholding scheme for duplicate identification

    Evaluation at Austrian National Library

    Integration on SCAPE platform for scalable preservation




AIT Austrian Institute of Technology
your ingenious partner



reinhold.huber-moerk@ait.ac.at

More Related Content

Similar to Duplicate detection for quality assurance of document image collections

P02 sparse coding cvpr2012 deep learning methods for vision
P02 sparse coding cvpr2012 deep learning methods for visionP02 sparse coding cvpr2012 deep learning methods for vision
P02 sparse coding cvpr2012 deep learning methods for visionzukun
 
PowerPoint’s Impact on Conference Ratings and Social Media Likes
PowerPoint’s Impact on Conference Ratings and Social Media Likes PowerPoint’s Impact on Conference Ratings and Social Media Likes
PowerPoint’s Impact on Conference Ratings and Social Media Likes Nathan Garrett
 

Similar to Duplicate detection for quality assurance of document image collections (6)

Monovision vs Pinhole
Monovision vs PinholeMonovision vs Pinhole
Monovision vs Pinhole
 
DCT_TR802
DCT_TR802DCT_TR802
DCT_TR802
 
DCT_TR802
DCT_TR802DCT_TR802
DCT_TR802
 
DCT_TR802
DCT_TR802DCT_TR802
DCT_TR802
 
P02 sparse coding cvpr2012 deep learning methods for vision
P02 sparse coding cvpr2012 deep learning methods for visionP02 sparse coding cvpr2012 deep learning methods for vision
P02 sparse coding cvpr2012 deep learning methods for vision
 
PowerPoint’s Impact on Conference Ratings and Social Media Likes
PowerPoint’s Impact on Conference Ratings and Social Media Likes PowerPoint’s Impact on Conference Ratings and Social Media Likes
PowerPoint’s Impact on Conference Ratings and Social Media Likes
 

More from SCAPE Project

SCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with NaniteSCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with NaniteSCAPE Project
 
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...SCAPE Project
 
SCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs AvailableSCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs AvailableSCAPE Project
 
SCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with HadoopSCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with HadoopSCAPE Project
 
SCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation ToolSCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation ToolSCAPE Project
 
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositoriesSCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositoriesSCAPE Project
 
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...SCAPE Project
 
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...SCAPE Project
 
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014SCAPE Project
 
Hadoop and its applications at the State and University Library, SCAPE Inform...
Hadoop and its applications at the State and University Library, SCAPE Inform...Hadoop and its applications at the State and University Library, SCAPE Inform...
Hadoop and its applications at the State and University Library, SCAPE Inform...SCAPE Project
 
Scape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation EnvironmentsScape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation EnvironmentsSCAPE Project
 
LIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven SchlarbLIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven SchlarbSCAPE Project
 
Content profiling and C3PO
Content profiling and C3POContent profiling and C3PO
Content profiling and C3POSCAPE Project
 
Control policy formulation
Control policy formulationControl policy formulation
Control policy formulationSCAPE Project
 
Preservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, AarhusPreservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, AarhusSCAPE Project
 
An image based approach for content analysis in document collections
An image based approach for content analysis in document collectionsAn image based approach for content analysis in document collections
An image based approach for content analysis in document collectionsSCAPE Project
 
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...SCAPE Project
 
TAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionalityTAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionalitySCAPE Project
 
Automatic Preservation Watch
Automatic Preservation WatchAutomatic Preservation Watch
Automatic Preservation WatchSCAPE Project
 

More from SCAPE Project (20)

C sz z6
C sz z6C sz z6
C sz z6
 
SCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with NaniteSCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with Nanite
 
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
 
SCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs AvailableSCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs Available
 
SCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with HadoopSCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with Hadoop
 
SCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation ToolSCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation Tool
 
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositoriesSCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
 
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
 
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
 
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
 
Hadoop and its applications at the State and University Library, SCAPE Inform...
Hadoop and its applications at the State and University Library, SCAPE Inform...Hadoop and its applications at the State and University Library, SCAPE Inform...
Hadoop and its applications at the State and University Library, SCAPE Inform...
 
Scape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation EnvironmentsScape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation Environments
 
LIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven SchlarbLIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven Schlarb
 
Content profiling and C3PO
Content profiling and C3POContent profiling and C3PO
Content profiling and C3PO
 
Control policy formulation
Control policy formulationControl policy formulation
Control policy formulation
 
Preservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, AarhusPreservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, Aarhus
 
An image based approach for content analysis in document collections
An image based approach for content analysis in document collectionsAn image based approach for content analysis in document collections
An image based approach for content analysis in document collections
 
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
 
TAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionalityTAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionality
 
Automatic Preservation Watch
Automatic Preservation WatchAutomatic Preservation Watch
Automatic Preservation Watch
 

Recently uploaded

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 

Recently uploaded (20)

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 

Duplicate detection for quality assurance of document image collections

  • 1. Duplicate detection for quality assurance of document image collections Reinhold Huber-Mörk1 & Alexander Schindler1,2 & Sven Schlarb3 1 Research Area Intelligent Vision Systems, Department Safety & Security AIT Austrian Institute of Technology 2 Department of Software Technology and Interactive Systems Vienna University of Technology 3 Department for Research and Development Austrian National Library
  • 2. Overview
      - Digital preservation & quality assurance
      - Digital image preservation workflows
      - Image duplicate detection
      - Keypoints and feature descriptors in Computer Vision
      - Bag of visual words
      - Results on a real-world data set
      22.11.2012
  • 3. SCAPE project and quality assurance
      - SCAlable Preservation Environments, EU FP7
      - Preservation Components: improve and extend existing tools, develop new ones where necessary, apply proven approaches like image and pattern analysis to the problem of ensuring quality in digital preservation
  • 4. Quality assurance in image preservation
      - Comparison of image content
          - automatic image processing workflows (e.g. format conversion)
          - reacquisition of images
      - Duplicate detection
          - within a single collection (filtering)
          - between collections (merging, comparison)
      - Solutions:
          - page segmentation + OCR
          - feature-based approaches
  • 5. Book scan sequence with duplicates
  • 7. Keypoint detection and description (1)
      - Keypoints are detected at salient image regions
      - A keypoint is described in a descriptor (= vector of features)
      - Scale-Invariant Feature Transform, SIFT (Lowe, 2004)
  • 8. Keypoint detection and description (2)
      - Invariance w.r.t. color/tone transformation
      - Invariance w.r.t. rotation, scaling or translation
  • 9. Keypoint detection and description (3)
      - All detections (ordered by scale)
  • 11. Bag of visual words (1)
      - Bag of words model in text information retrieval:
          Document 1: “Peter likes to read books. Paul likes too”
          Document 2: “Peter also likes to read poems”
          Bag: [ Peter, likes, to, read, books, Paul, too, also, poems ]
          Histogram 1: [ 1, 2, 1, 1, 1, 1, 1, 0, 0 ]
          Histogram 2: [ 1, 1, 1, 1, 0, 0, 0, 1, 1 ]
      - Visual analogy: bag of visual words or bag of features
          Document                     Image
          Document made of words       Image made of descriptors
          Bag of words                 Bag of clustered descriptors = visual words
          Word occurrence histogram    Visual word histogram / “fingerprint”
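The text example on this slide can be reproduced in a few lines. This is a minimal sketch: the function names `tokenize`, `build_bag` and `histogram` are illustrative and not taken from the slides, and punctuation is simply dropped.

```python
import re

def tokenize(text):
    """Split text into words, dropping punctuation (case is kept, as on the slide)."""
    return re.findall(r"[A-Za-z]+", text)

def build_bag(documents):
    """Collect each distinct word once, in order of first appearance."""
    bag = []
    for doc in documents:
        for word in tokenize(doc):
            if word not in bag:
                bag.append(word)
    return bag

def histogram(text, bag):
    """Count how often each word of the bag occurs in the text."""
    words = tokenize(text)
    return [words.count(word) for word in bag]

doc1 = "Peter likes to read books. Paul likes too"
doc2 = "Peter also likes to read poems"

bag = build_bag([doc1, doc2])
hist1 = histogram(doc1, bag)
hist2 = histogram(doc2, bag)

print(bag)    # ['Peter', 'likes', 'to', 'read', 'books', 'Paul', 'too', 'also', 'poems']
print(hist1)  # [1, 2, 1, 1, 1, 1, 1, 0, 0]
print(hist2)  # [1, 1, 1, 1, 0, 0, 0, 1, 1]
```

Both histograms have the same length as the bag, which is what makes documents of different length directly comparable.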
  • 12. Bag of visual words (2)
  • 13. Bag of visual words (3)
      - Visual words #104, #15, #221, #312, #424, #250
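How an image turns into a visual word histogram can be sketched as follows. The slides only state that visual words are clustered descriptors; this sketch assumes the cluster centres already exist (for example from k-means, which is an assumption) and uses toy 2-D descriptors instead of real keypoint descriptors. All names are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two descriptors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest_word(descriptor, vocabulary):
    """Index of the closest cluster centre, i.e. the visual word of the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda i: squared_distance(descriptor, vocabulary[i]))

def visual_word_histogram(descriptors, vocabulary):
    """Normalised count of visual words: the image 'fingerprint'."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        counts[nearest_word(d, vocabulary)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

# Toy vocabulary of three visual words (assumed to come from a clustering step).
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
# Toy descriptors of one image.
descriptors = [(0.1, 0.1), (0.9, 0.1), (1.1, -0.1), (0.0, 0.8)]

fingerprint = visual_word_histogram(descriptors, vocabulary)
print(fingerprint)  # [0.25, 0.5, 0.25]
```

Normalising by the number of descriptors keeps pages with many and few keypoints comparable.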
  • 15. Image comparison / duplicate detection schemes
      - Comparison of visual histograms: tf (“term frequency”) score
      - Inverse document frequency: idf
      - Spatial verification: sv (detailed image comparison)
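The tf and idf schemes can be illustrated on toy histograms. The slides do not state which similarity function or idf formula was used; this sketch assumes cosine similarity and the common weighting idf_i = log(N / n_i), where n_i is the number of images containing visual word i. Function names are hypothetical.

```python
import math

def idf_weights(histograms):
    """idf_i = log(N / n_i); words that occur in no image get weight 0."""
    n_images = len(histograms)
    weights = []
    for i in range(len(histograms[0])):
        containing = sum(1 for h in histograms if h[i] > 0)
        weights.append(math.log(n_images / containing) if containing else 0.0)
    return weights

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def tf_score(h1, h2):
    """Compare raw visual word histograms."""
    return cosine_similarity(h1, h2)

def tf_idf_score(h1, h2, weights):
    """Compare histograms after down-weighting words common to many images."""
    w1 = [a * w for a, w in zip(h1, weights)]
    w2 = [b * w for b, w in zip(h2, weights)]
    return cosine_similarity(w1, w2)

# Three toy images over a vocabulary of four visual words.
histograms = [
    [2, 1, 0, 1],
    [2, 1, 0, 1],  # same fingerprint as the first image
    [1, 0, 3, 1],
]
weights = idf_weights(histograms)

print(tf_score(histograms[0], histograms[1]))
print(tf_score(histograms[0], histograms[2]))
print(tf_idf_score(histograms[0], histograms[2], weights))
```

Words 0 and 3 occur in every image, so their idf weight is zero and they no longer contribute to the tf/idf score; the first and third image then share no weighted word at all.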
  • 16. Spatial verification (1)
      - Bag of visual words maintains no (or limited) spatial information
      - Spatial verification:
          1. Ranking of most similar images in a shortlist
          2. Direct matching of descriptors for pairs of images
          3. Overlaying of images
          4. Estimation of similarity
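Steps 1 and 2 can be sketched as follows. The slides do not say how descriptors are matched; this sketch assumes nearest-neighbour matching with a distance ratio test (as proposed for SIFT by Lowe), a hypothetical ratio of 0.8, and toy 2-D descriptors. All function names are illustrative.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two descriptors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def shortlist(scores, k):
    """Step 1: indices of the k highest-scoring candidate images."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Step 2: pairs (i, j) where desc_b[j] is a clearly best match for desc_a[i].

    A match is accepted only if the best distance is below `ratio` times the
    second-best distance; distances are squared, so the ratio is squared too.
    """
    matches = []
    for i, d in enumerate(desc_a):
        dists = sorted((squared_distance(d, e), j) for j, e in enumerate(desc_b))
        if len(dists) >= 2 and dists[0][0] < (ratio ** 2) * dists[1][0]:
            matches.append((i, dists[0][1]))
    return matches

# Histogram similarity of one query page against four candidate pages.
scores = [0.2, 0.9, 0.4, 0.8]
candidates = shortlist(scores, k=2)
print(candidates)  # [1, 3]

# Toy descriptors of the query page and of one shortlisted page.
desc_a = [(0.0, 0.0), (5.0, 5.0)]
desc_b = [(0.1, 0.0), (5.0, 5.1), (9.0, 9.0)]
matches = match_descriptors(desc_a, desc_b)
print(matches)  # [(0, 0), (1, 1)]
```

Only the shortlisted pairs go through the expensive direct matching, which is the point of ranking first.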
  • 17. Spatial verification (2)
      - Pair of possible duplicates → descriptor matching → estimation of affine transformation → image overlay → similarity estimation → similarity measure (MSSIM)
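The MSSIM measure named on this slide is the mean of the structural similarity index (SSIM) over local windows. The sketch below assumes the two images are already overlaid (registered) and of equal size, uses non-overlapping square windows instead of the usual sliding Gaussian window, and uses the customary constants K1 = 0.01 and K2 = 0.03; these are simplifications, not details from the slides.

```python
def _mean(values):
    return sum(values) / len(values)

def ssim_window(x, y, dynamic_range=255.0):
    """SSIM of two equally sized flat lists of pixel values."""
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mx, my = _mean(x), _mean(y)
    vx = _mean([(a - mx) ** 2 for a in x])
    vy = _mean([(b - my) ** 2 for b in y])
    cov = _mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx * mx + my * my + c1) * (vx + vy + c2))

def mssim(img_a, img_b, window=4):
    """Mean SSIM over non-overlapping window x window blocks.

    Images are lists of rows of grey values and must be at least one window large.
    """
    rows, cols = len(img_a), len(img_a[0])
    scores = []
    for r in range(0, rows - window + 1, window):
        for c in range(0, cols - window + 1, window):
            block_a = [img_a[r + i][c + j] for i in range(window) for j in range(window)]
            block_b = [img_b[r + i][c + j] for i in range(window) for j in range(window)]
            scores.append(ssim_window(block_a, block_b))
    return _mean(scores)

# Toy 8x8 gradient image and its inverted copy.
img = [[(r * 8 + c) * 4 for c in range(8)] for r in range(8)]
inverted = [[255 - p for p in row] for row in img]

same = mssim(img, img)
different = mssim(img, inverted)
print(same, different)
```

Identical images score 1, and the score drops as structure diverges, so a high MSSIM after overlay supports the duplicate hypothesis.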
  • 18. Duplicate detection (1)
      - Pairwise comparison for a collection of N pages
      [Plot: max(Da) versus image index a]
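The pairwise comparison can be sketched as below. It assumes that Da denotes the set of similarity scores between page a and every other page, so that max(Da) is the best match of page a; this reading of the plot label is an assumption, as is the use of histogram intersection as the similarity function.

```python
def histogram_intersection(u, v):
    """Similarity of two normalised histograms, between 0 and 1."""
    return sum(min(a, b) for a, b in zip(u, v))

def max_similarity_per_image(histograms, similarity):
    """For each page a, the highest similarity to any other page.

    Each unordered pair is compared once, i.e. N * (N - 1) / 2 comparisons.
    """
    n = len(histograms)
    best = [0.0] * n
    for a in range(n):
        for b in range(a + 1, n):
            s = similarity(histograms[a], histograms[b])
            best[a] = max(best[a], s)
            best[b] = max(best[b], s)
    return best

# Four toy pages; the first two have identical fingerprints.
pages = [
    [0.5, 0.5, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
    [0.25, 0.25, 0.5],
]
max_d = max_similarity_per_image(pages, histogram_intersection)
print(max_d)  # [1.0, 1.0, 0.5, 0.5]
```

Pages that have a duplicate elsewhere in the collection stand out with a high best-match score.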
  • 19. Duplicate detection (2)
      - Robust outlier detection
      [Plot: max(Da) versus image index a, with annotated index ranges a=12..15, a=22..25, a=106,107, a=108,109, a=188..197, a=198..207]
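The slides do not spell out the outlier detection method. One common robust choice, used here purely as an assumption, is to flag scores that lie more than k scaled median absolute deviations (MAD) above the median; the function name and the factor k = 3 are hypothetical.

```python
import statistics

def robust_outliers(values, k=3.0):
    """Indices of values above median + k * 1.4826 * MAD.

    The factor 1.4826 makes the MAD comparable to a standard deviation
    for normally distributed data. If the MAD is zero, every value above
    the median is flagged.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    threshold = med + k * 1.4826 * mad
    return [i for i, v in enumerate(values) if v > threshold]

# Toy best-match scores per page; pages 4 and 5 have a near-identical partner.
scores = [0.20, 0.22, 0.19, 0.21, 0.85, 0.90, 0.18, 0.23, 0.20, 0.21]
duplicates = robust_outliers(scores)
print(duplicates)  # [4, 5]
```

Median and MAD are barely affected by the duplicates themselves, which is why they suit a threshold that has to adapt to each book.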
  • 20. Comparison of duplicate detection schemes
      - a) Visual histogram comparison: tf
      - b) tf and inverse document frequency: tf/idf
      - c) tf and spatial verification: tf/sv
  • 21. Results
      - Manual vs. automatic detection
      - 59 books, 34805 pages
      - 53 books correctly processed: 53/59 ≈ 90% correct
      - 69 of 75 duplicate runs detected: 69/75 = 92% correct
      - Missing detections due to heavily mixed content
  • 22. Conclusion and outlook
      - Workflows for duplicate detection for complex documents
      - Keypoint detection and description = purely image based
      - Bag of visual words provides fast matching
      - Spatial verification applied to shortlist
      - Robust thresholding scheme for duplicate identification
      - Evaluation at Austrian National Library
      - Integration on SCAPE platform for scalable preservation
  • 23. AIT Austrian Institute of Technology, your ingenious partner
      reinhold.huber-moerk@ait.ac.at