Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

OCR-D: An end-to-end open source OCR framework for historical printed documents


Published on

Presented at #DATeCH2019, Brussels, Belgium

Published in: Technology
  • Login to see the comments

OCR-D: An end-to-end open source OCR framework for historical printed documents

  1. 1. OCR-D: An end-to-end open source OCR framework for historical printed documents Clemens Neudecker, Konstantin Baierer, Maria Federbusch, Matthias Boenig, Kay-Michael Würzner, Volker Hartmann, Elisa Herrmann DATeCH2019 8-10 May 2019, Brussels, Belgium
  2. 2. Introduction 2 ● Many OCR-projects in the past: METAe, IMPACT, eMOP, asf. ● Yet there is still a lack of open and comprehensive tools & methods for OCR/OLR specifically targeting historical printed documents ● New breakthroughs through artificial intelligence/machine learning enable competitive OCR quality given sufficient training ● With the advent of the Digital Humanities, the requirement for large scale text corpora with high quality OCR is growing rapidly → Setting up of the OCR-D project in 2015 Main goals of OCR-D: ● Development of OCR solutions suitable for historical prints ● Standardisation of metadata and Ground Truth ● Creation of training and evaluation data
  3. 3. Architecture OCR-D is composed of a “Coordination Project” consisting of ● Herzog-August-Library Wolfenbüttel ● Berlin-Brandenburg Academy of Sciences and Humanities ● Bavarian State Library (until 08/2016) ● Berlin State Library (from 12/2016) ● Karlsruhe Institute of Technology (from 08/2017) well as a total of 8 separate “Module Projects” that ● Develop technical solutions to the identified challenges ● Implement the specifications of the “Coordination Project” OCR-D receives funding from the DFG for 2015 - 2020 3
  4. 4. OCR-D Specifications Specifications and conventions for interfaces and exchange formats: ● Command Line Interface (CLI) ● Metadata and structural data (METS) ● Full Text (PAGE-XML) ● Software (ocrd-tool.json, Dockerfile) ● Long-term preservation (ocrd-zip, BagIt) → 4
  5. 5. OCR-D Specifications ● For sustainability and reuse, documentation beats implementation ● Open and transparent development on GitHub
  6. 6. OCR-D Core Reference implementation of the specifications: ● Utility functions for common tasks (ocrd_utils) ● Programmatic access to data formats (ocrd_models, ocrd_modelfactory) ● Validation of interfaces and data formats (ocrd_validators) ● Toolkit to create compatible command line tools (ocrd) → 6
  7. 7. OCR-D Core ● Easy to install via PyPi: pip install ocrd
  8. 8. OCR-D provides Scientific Workflow Components for OCR using the Apache Taverna Engine ● Via the Workflow Description (in SCUFL2 language), the workflows do become easily reproducible and can be shared and reused by others ● Simplifies the transparent benchmarking and evaluation of modules/components ● This approach also allows the capture of workflow provenance data → OCR-D Workflow 8
  9. 9. OCR-D Ground Truth In order to support the development, training and evaluation of tools and methods, the OCR-D Coordination Project provides: ● Comprehensive transcription guidelines (only in German) ● Ca. 60 complete volumes from 16th - 19th century ● Special corpora with ○ Low quality OCR ○ Challenging images ● “Structure” Ground Truth corpus ● Additional Ground Truth is currently in production All Ground Truth data is made freely available under open licenses: 9
  10. 10. Module 1: Image Optimisation Partner(s): ● DFKI Kaiserslautern Task(s): ● Deskewing ● Dewarping ● Despeckling ● Cropping (print space, border removal) ● Binarization Code: ● 10
  11. 11. Module 2: Layout Analysis Partner(s): ● DFKI Kaiserslautern Task(s): ● Page segmentation ● Line segmentation based on RAST ● Region classification with CNN ● Document analysis (e.g. reconstruction of table of contents) Code: ● Not yet available 11
  12. 12. Module 3: Layout Analysis & Region Extraction and Classification Partner(s): ● University Würzburg Task(s): ● Based on LAREX ● Development of a CNN-based pixel classifier ● Integration of document-specific rules and heuristics ● Highly automated processing Code: ● 12
  13. 13. Module 4: Unsupervised OCR Postcorrection Partner(s): ● University Leipzig Task(s): ● Combine finite-state-transducer and neural network based postcorrection in a noisy channel model ● Provision of (tools to create) language- and domain specific models Code: ● ● 13
  14. 14. Module 5: Tesseract Partner(s): ● University Library Mannheim Task(s): ● Adaptation of Tesseract to the OCR-D specifications and interfaces ● Improvement of the code quality ● Optimisation for high throughput Code: ● ● 14
  15. 15. Module 6: Automated Postcorrection with optional interactive Postcorrection Partner(s): ● University of Munich Task(s): ● Alignment of multiple OCR ● Profiling of OCR ● Automated postcorrection and correction protocol ● (Optional) interactive postcorrection Code: ● ● 15
  16. 16. Module 7: Automated Font Recognition & Model Training Infrastructure Partner(s): ● University of Mainz ● University of Erlangen ● University of Leipzig Task(s): ● Automated identification of fonts from images ● Development of an OCR training infrastructure and model repository Code: ● ● 16
  17. 17. Module 8: Long-term preservation Partner(s): ● Göttingen State and University Library ● GWDG Göttingen Task(s): ● Analysis of requirements for long-term preservation of OCR ● Concept and prototype implementation for ○ Persistent storage and identification of OCR data in the archive ○ Citation of OCR data in the archive ○ Search functionality within the archive Code: ● 17
  18. 18. OCR-D contact/access points ● OCR-D Website: ● OCR-D GitHub: ● OCR-D Specification and Documentation: ● OCR-D Ground Truth: ● OCR-D Gitter: ● OCR-D Docker: 18
  19. 19. Thank you for your attention! Questions, please? DATeCH2019 8-10 May 2019, Brussels, Belgium