IMPACT Final Conference - Stefan Pletschacher

Stefan Pletschacher from the University of Salford - IMPACT Evaluation Tools, ground truth and datasets

  • 11 characters, 2 substitutions, 1 insertion: error rate of 27% = 73% accuracy
  • IMPACT Final Conference - Stefan Pletschacher

    1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. Evaluation Tools. Stefan Pletschacher
    2. Overview: Digitisation Workflows; Performance Evaluation; Ground Truth; Evaluation Tools; Segmentation and Layout; OCR Text; Interpretation of Results. Stefan Pletschacher - Evaluation Tools, IMPACT Conference, London, 24.10.2011
    3. Digitisation Workflow Evaluation: individual processing steps and complex workflows. ① Scanning. ② Image enhancement: page splitting, border removal, dewarping (page curl, arbitrary warping), noise removal, binarisation. ③ Layout analysis: segmentation of regions, lines, words and characters; region classification; logical layout analysis. ④ OCR. ⑤ Post-processing.
    4. Performance Evaluation Overview: Evaluation Results; Evaluation Tools; Image Repository. Compatibility through one common format (PAGE).
    5. IMPACT Image Repository: central management of metadata, images and ground truth.
    6. Datasets. Total number of images: 667,120. Institutional Datasets (10 libraries): 602,313 images. Demonstrator Sets: 56,141 images. Ground Truth in PAGE format: 36,498 approved instances, still growing. Working Sets (Showcases, Typewritten Set, Dewarping Set, Challenge Sets etc.). Usage statistics (since 6/10/2010): 5,153,347 thumbnails browsed; 810,001 images accessed (724,946 full quality images, 22,676 direct access calls).
    7. Tools for Ground Truth Production. Aletheia: page border, print space; layout regions (incl. metadata); text lines, words and glyphs; Unicode text at all levels; reading order, layers etc. FineReader Engine Exporter (Preproduction). GT Validator.
    8. Ground Truthing Historical Documents: full Unicode support (incl. special characters for historical documents); complex reading order (groups of ordered and unordered elements).
    9. Ground Truth – Image Enhancement: Deskew; Dewarping; Border Removal; Binarisation.
    10. The PAGE Format Framework (Page Analysis and Ground-Truth Elements). Two-level architecture: root structure and task-specific sub-formats. Separate XML Schema definitions. Format identification via namespaces. Mapping of dependencies and processing chains. Representation of alternative processing steps. Linking via IDs. Diagram labels: PAGE root (XML); PAGE gts (XML), shown three times; Processing Results / Ground Truth. http://schema.primaresearch.org/PAGE/
    11. Evaluation Tools: Segmentation and Layout; OCR Text; Deskewing; Dewarping; Border Removal; Binarisation; Double Page Splitting.
    12. Segmentation and Layout. Diagram labels: Ground Truth; Result; Overlap. Error types: Miss / Part. Miss; Misclass.; Merge; Split; False Detection. Differentiation of errors based on reading order: allowable vs. non-allowable.
    13. Example – Ground Truth: Page Header, Paragraph, Paragraph, Caption, Image.
    14. Example – Layout Analysis Result: Header, Paragraph, Paragraph, Image, Image, Image.
    15. Evaluation. Panels: Ground Truth; Layout Analysis Result. Labels: Miss, Partial Miss, Misclassification, Merge, Caption, Paragraph, Split.
    16. OCR Text. Comparison of ground truth and OCR output based on encoded text (ASCII, Unicode). Character accuracy: the distance measure is the minimum number of edit operations (insertions, deletions, substitutions), reported per character class (lower case, upper case, whitespace characters, numbers, symbols, ...). Word accuracy: correctly recognised words vs. total word count; stop words and non-stop words. Rejected and suspicious characters. Substitution errors (higher penalty). Correction effort. Example: "OCR is cool" → "OOR is cod".
    17. Interpretation of Results. Metrics: measurements of conditions; types and number of errors (Miss, Misclass., Merge, Split, False detect.). Scenarios: application context; error weights. Overall success/error rates are based on weighted individual results, type and size of affected regions, and allowable vs. non-allowable errors. Diagram labels: M1, M2, M3, S1, S2; Merge Rate, Split Rate, ...; Error Rate.
    18. For more information visit: PRImA http://www.primaresearch.org; IMPACT http://www.impact-project.eu
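
Slide 16 defines character accuracy through the minimum number of edit operations between ground truth and OCR output. The following is a minimal sketch of that measure, not the IMPACT tool itself: the function names are hypothetical, every edit operation costs 1 (the slide's higher penalty for substitutions is not modelled), and the error rate is normalised by the ground-truth length. Applied to the slide's example it reproduces the figures in the slide note (three errors on 11 characters). The distance is symmetric, so whether the third error counts as an insertion or a deletion depends only on the direction of comparison.

```python
def edit_distance(truth: str, ocr: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn `truth` into `ocr` (Levenshtein distance)."""
    previous = list(range(len(ocr) + 1))  # distances from the empty truth prefix
    for i, t in enumerate(truth, start=1):
        current = [i]  # turning i truth characters into an empty string
        for j, o in enumerate(ocr, start=1):
            current.append(min(
                previous[j] + 1,             # truth character missing in OCR output
                current[j - 1] + 1,          # extra character in OCR output
                previous[j - 1] + (t != o),  # substitution, free if characters match
            ))
        previous = current
    return previous[-1]


def character_error_rate(truth: str, ocr: str) -> float:
    """Edit distance normalised by the ground-truth length."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(truth, ocr) / len(truth)


errors = edit_distance("OCR is cool", "OOR is cod")
rate = character_error_rate("OCR is cool", "OOR is cod")
print(errors, f"{rate:.0%}", f"{1 - rate:.0%}")  # 3 27% 73%
```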
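
Slide 16 also lists word accuracy: correctly recognised words against the total word count, split into stop words and non-stop words. The sketch below is one possible reading of that, under several assumptions: words are separated by whitespace, alignment uses Python's `difflib.SequenceMatcher` (a heuristic aligner, not necessarily what the IMPACT tools use), and `STOP_WORDS` is a hypothetical, deliberately tiny list.

```python
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "is", "of", "the"}


def ratio(correct: int, total: int):
    """Return correct/total, or None when there is nothing to count."""
    return correct / total if total else None


def is_stop(word: str) -> bool:
    return word.lower() in STOP_WORDS


def word_accuracy(truth: str, ocr: str) -> dict:
    truth_words = truth.split()
    ocr_words = ocr.split()
    matcher = SequenceMatcher(a=truth_words, b=ocr_words, autojunk=False)

    # Collect ground-truth words that were aligned with an identical OCR word.
    matched = []
    for block in matcher.get_matching_blocks():
        matched.extend(truth_words[block.a:block.a + block.size])

    return {
        "all": ratio(len(matched), len(truth_words)),
        "stop": ratio(sum(is_stop(w) for w in matched),
                      sum(is_stop(w) for w in truth_words)),
        "non_stop": ratio(sum(not is_stop(w) for w in matched),
                          sum(not is_stop(w) for w in truth_words)),
    }


print(word_accuracy("OCR is cool", "OOR is cod"))
```

For the slide's example only "is" survives intact, so one of three words is correct overall; the single stop word is recognised, while both non-stop words are lost. This illustrates why the two groups are reported separately.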
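
Slides 12 to 15 name the layout error types found by overlaying ground truth and result regions: miss, partial miss, misclassification, merge, split and false detection. A simplified sketch of how such a comparison could work is given below. It assumes axis-aligned rectangles instead of arbitrary region outlines, result regions that do not overlap each other, and a hypothetical `full_cover` threshold; the region names and coordinates are invented for illustration and are not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    kind: str    # e.g. "text" or "image"
    box: tuple   # (left, top, right, bottom)


def area(box) -> int:
    left, top, right, bottom = box
    return max(0, right - left) * max(0, bottom - top)


def overlap(a, b) -> int:
    """Area of the intersection of two boxes (0 if they do not intersect)."""
    return area((max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3])))


def find_errors(ground_truth, result, full_cover=0.95):
    errors = []
    for gt in ground_truth:
        hits = [r for r in result if overlap(gt.box, r.box) > 0]
        if not hits:
            errors.append(("miss", gt.name))
            continue
        covered = sum(overlap(gt.box, r.box) for r in hits) / area(gt.box)
        if covered < full_cover:
            errors.append(("partial miss", gt.name))
        if len(hits) > 1:
            errors.append(("split", gt.name))
        if any(r.kind != gt.kind for r in hits):
            errors.append(("misclassification", gt.name))
    for r in result:
        hits = [gt for gt in ground_truth if overlap(gt.box, r.box) > 0]
        if not hits:
            errors.append(("false detection", r.name))
        elif len(hits) > 1:
            errors.append(("merge", r.name))
    return errors


ground_truth = [
    Region("header", "text", (0, 0, 100, 10)),
    Region("para1", "text", (0, 20, 100, 50)),
    Region("caption", "text", (0, 60, 100, 70)),
    Region("image", "image", (0, 80, 100, 150)),
]
result = [
    Region("r1", "text", (0, 0, 100, 10)),
    Region("r2", "text", (0, 20, 100, 70)),     # spans para1 and caption
    Region("r3", "image", (0, 80, 100, 110)),   # upper half of the image
    Region("r4", "image", (0, 110, 100, 150)),  # lower half of the image
    Region("r5", "text", (200, 0, 250, 10)),    # nothing here in the ground truth
]
print(find_errors(ground_truth, result))
```

In this invented page the image is split in two, one result region merges a paragraph with a caption, and one region is a false detection.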
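
Slide 17 states that overall error rates depend on weighted individual results, on the type and size of the affected regions, and on whether an error is allowable. The sketch below shows one way such a scenario-dependent score could be combined. The formula, the weights and the `allowable_factor` discount are assumptions made for illustration; the slides do not give the actual weighting scheme.

```python
from dataclasses import dataclass


@dataclass
class LayoutError:
    kind: str              # "miss", "merge", "split", ...
    affected_area: float   # size of the affected region, e.g. in pixels
    allowable: bool = False


def weighted_error_rate(errors, total_area, weights, allowable_factor=0.5):
    """Area-weighted penalty as a fraction of the page area, capped at 1."""
    penalty = 0.0
    for error in errors:
        weight = weights.get(error.kind, 1.0)  # unknown types get full weight
        if error.allowable:
            weight *= allowable_factor         # allowable errors count less
        penalty += weight * error.affected_area
    return min(1.0, penalty / total_area)


# Hypothetical scenario in which splits matter less than merges.
scenario = {"miss": 1.0, "merge": 1.0, "split": 0.5, "false detection": 0.25}
errors = [
    LayoutError("merge", affected_area=1000),
    LayoutError("split", affected_area=2000, allowable=True),
]
print(weighted_error_rate(errors, total_area=10000, weights=scenario))
```

Changing the scenario dictionary changes the outcome for the same set of errors, which is the point of the slide: the application context decides how much each error type should cost.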
