OCR-D: An end-to-end open
source OCR framework for
historical printed documents
Clemens Neudecker, Konstantin Baierer, Maria
Federbusch, Matthias Boenig, Kay-Michael
Würzner, Volker Hartmann, Elisa Herrmann
DATeCH2019
8-10 May 2019, Brussels, Belgium
Introduction
● Many OCR projects in the past: METAe, IMPACT, eMOP, etc.
● Yet there is still a lack of open and comprehensive tools & methods
for OCR/OLR specifically targeting historical printed documents
● Breakthroughs in artificial intelligence/machine learning
enable competitive OCR quality, given sufficient training data
● With the advent of the Digital Humanities, the requirement for large
scale text corpora with high quality OCR is growing rapidly
→ The OCR-D project was set up in 2015
Main goals of OCR-D:
● Development of OCR solutions suitable for historical prints
● Standardisation of metadata and Ground Truth
● Creation of training and evaluation data
Architecture
OCR-D is composed of a “Coordination Project” consisting of
● Herzog-August-Library Wolfenbüttel
● Berlin-Brandenburg Academy of Sciences and Humanities
● Bavarian State Library (until 08/2016)
● Berlin State Library (from 12/2016)
● Karlsruhe Institute of Technology (from 08/2017)
...as well as a total of 8 separate “Module Projects” that
● Develop technical solutions to the identified challenges
● Implement the specifications of the “Coordination Project”
OCR-D is funded by the DFG for 2015–2020
OCR-D Specifications
Specifications and conventions for interfaces and exchange formats:
● Command Line Interface (CLI)
● Metadata and structural data (METS)
● Full Text (PAGE-XML)
● Software (ocrd-tool.json, Dockerfile)
● Long-term preservation (ocrd-zip, BagIt)
→ https://ocr-d.github.io/
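
To illustrate how METS can act as the exchange container between tools, here is a minimal sketch that lists the files in each file group of a METS document. It uses only the Python standard library, not OCR-D Core. The sample document, the file group names (OCR-D-IMG, OCR-D-OCR), the file IDs and the PAGE MIME type are assumptions made for illustration; consult the specifications for the actual conventions.

```python
import xml.etree.ElementTree as ET

# Standard METS and XLink namespaces.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Hypothetical, heavily abbreviated METS document (illustration only).
SAMPLE_METS = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="IMG_0001" MIMETYPE="image/tiff">
        <mets:FLocat LOCTYPE="URL" xlink:href="OCR-D-IMG/0001.tif"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="OCR-D-OCR">
      <mets:file ID="OCR_0001" MIMETYPE="application/vnd.prima.page+xml">
        <mets:FLocat LOCTYPE="URL" xlink:href="OCR-D-OCR/0001.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


def files_by_group(mets_xml):
    """Map each fileGrp USE value to a list of (ID, MIME type, href) tuples."""
    root = ET.fromstring(mets_xml)
    groups = {}
    for grp in root.findall(".//mets:fileGrp", NS):
        entries = []
        for f in grp.findall("mets:file", NS):
            flocat = f.find("mets:FLocat", NS)
            href = None
            if flocat is not None:
                # xlink:href is a namespaced attribute.
                href = flocat.get("{%s}href" % NS["xlink"])
            entries.append((f.get("ID"), f.get("MIMETYPE"), href))
        groups[grp.get("USE")] = entries
    return groups


if __name__ == "__main__":
    for use, entries in files_by_group(SAMPLE_METS).items():
        print(use, entries)
```

In this sketch each processing step would read from one file group and write its results to another, so the METS file records every intermediate result of a workflow.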
OCR-D Specifications
● For sustainability and reuse, documentation beats implementation
● Open and transparent development on GitHub
OCR-D Core
Reference implementation of the specifications:
● Utility functions for common tasks (ocrd_utils)
● Programmatic access to data formats (ocrd_models,
ocrd_modelfactory)
● Validation of interfaces and data formats (ocrd_validators)
● Toolkit to create compatible command line tools (ocrd)
→ https://github.com/OCR-D/core
OCR-D Core
● Easy to install via PyPI:
pip install ocrd
OCR-D Workflow
OCR-D provides Scientific Workflow Components for OCR using the
Apache Taverna Engine
● Via the Workflow Description (in SCUFL2 language), workflows
become easily reproducible and can be shared and reused by others
● Simplifies transparent benchmarking and evaluation
of modules/components
● This approach also allows the capture of workflow
provenance data
→ https://github.com/OCR-D/taverna_workflow
OCR-D Ground Truth
In order to support the development, training and evaluation of tools
and methods, the OCR-D Coordination Project provides:
● Comprehensive transcription guidelines (only in German)
● Approx. 60 complete volumes from the 16th to 19th centuries
● Special corpora with
○ Low quality OCR
○ Challenging images
● “Structure” Ground Truth corpus
● Additional Ground Truth is currently in production
All Ground Truth data is made freely available under open licenses:
http://www.ocr-d.de/daten
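
One common way to use Ground Truth for evaluation is the character error rate (CER): the edit distance between OCR output and the Ground Truth transcription, divided by the length of the Ground Truth. The slides do not state which metrics OCR-D uses, so the sketch below is a generic illustration with hypothetical sample strings.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth, ocr_text):
    """CER = edit distance / number of characters in the Ground Truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_text) / len(ground_truth)


if __name__ == "__main__":
    # Hypothetical Ground Truth line and OCR result.
    print(character_error_rate("Haus", "Hans"))
```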
Module 1: Image Optimisation
Partner(s):
● DFKI Kaiserslautern
Task(s):
● Deskewing
● Dewarping
● Despeckling
● Cropping (print space, border removal)
● Binarization
Code:
● https://github.com/syedsaqibbukhari/docanalysis
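
As an illustration of the binarization task, here is a minimal sketch of Otsu's global thresholding in plain Python. This is a textbook method chosen for brevity; the slides do not say which binarization method the module uses, and the pixel values below are made up.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for t in range(256):
        weight_bg += hist[t]          # pixels with value <= t
        sum_bg += t * hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg  # pixels with value > t
        if weight_fg == 0:
            break
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, t
    return best_threshold


def binarize(pixels, threshold):
    """Map values <= threshold to 0 (ink) and all others to 255 (paper)."""
    return [0 if p <= threshold else 255 for p in pixels]


if __name__ == "__main__":
    # Hypothetical greyscale values: dark ink and bright paper.
    sample = [10, 12, 11, 200, 210, 205]
    t = otsu_threshold(sample)
    print(t, binarize(sample, t))
```

A single global threshold often struggles with the uneven illumination and bleed-through found in historical prints, which is one reason binarization is treated as a task of its own.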
Module 2: Layout Analysis
Partner(s):
● DFKI Kaiserslautern
Task(s):
● Page segmentation
● Line segmentation based on RAST
● Region classification with CNN
● Document analysis (e.g.
reconstruction of table of contents)
Code:
● Not yet available
Module 3: Layout Analysis &
Region Extraction and Classification
Partner(s):
● University of Würzburg
Task(s):
● Based on LAREX
● Development of a CNN-based
pixel classifier
● Integration of document-specific
rules and heuristics
● Highly automated processing
Code:
● https://github.com/ocr-d-modul-2-segmentierung/segmentation-runner
Module 4: Unsupervised OCR Postcorrection
Partner(s):
● University of Leipzig
Task(s):
● Combine finite-state transducer and
neural-network-based postcorrection
in a noisy channel model
● Provision of (tools to create)
language- and domain-specific models
Code:
● https://git.informatik.uni-leipzig.de/ocr-d/cor-asv-fst
● https://git.informatik.uni-leipzig.de/ocr-d/cor-asv-ann
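
The noisy channel idea can be sketched in a few lines: pick the candidate word w that maximises P(w) · P(observed | w), where P(w) comes from a language model and P(observed | w) from an error model. The toy below uses hand-written probability tables in place of the finite-state transducer and neural network models named on the slide; every probability and word in it is a hypothetical value for illustration, and it only handles same-length substitutions.

```python
import math

# Hypothetical language model: P(word).
LANGUAGE_MODEL = {"haus": 0.6, "maus": 0.3, "hand": 0.1}

# Hypothetical error model: P(observed char | true char) for known confusions.
# Key order is (true char, observed char).
CONFUSION = {("s", "f"): 0.05}
P_CORRECT = 0.9        # assumed probability that a character is read correctly
P_OTHER_ERROR = 0.001  # assumed probability of any unlisted substitution


def channel_log_prob(observed, candidate):
    """log P(observed | candidate), substitutions only."""
    if len(observed) != len(candidate):
        return -math.inf
    log_p = 0.0
    for obs_char, true_char in zip(observed, candidate):
        if obs_char == true_char:
            log_p += math.log(P_CORRECT)
        else:
            log_p += math.log(CONFUSION.get((true_char, obs_char), P_OTHER_ERROR))
    return log_p


def correct(observed):
    """Return the candidate maximising log P(w) + log P(observed | w)."""
    return max(
        LANGUAGE_MODEL,
        key=lambda w: math.log(LANGUAGE_MODEL[w]) + channel_log_prob(observed, w),
    )


if __name__ == "__main__":
    print(correct("hauf"))  # OCR confused "s" with "f"
```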
Module 5: Tesseract
Partner(s):
● University Library Mannheim
Task(s):
● Adaptation of Tesseract to the
OCR-D specifications and interfaces
● Improvement of the code quality
● Optimisation for high throughput
Code:
● https://github.com/tesseract-ocr
● https://github.com/OCR-D/ocrd_tesserocr
Module 6: Automated Postcorrection with
optional interactive Postcorrection
Partner(s):
● University of Munich
Task(s):
● Alignment of multiple OCR outputs
● Profiling of OCR
● Automated postcorrection
and correction protocol
● (Optional) interactive postcorrection
Code:
● https://github.com/cisocrgroup/ocrd-postcorrection
● https://github.com/cisocrgroup/cis-ocrd-py
Module 7: Automated Font Recognition &
Model Training Infrastructure
Partner(s):
● University of Mainz
● University of Erlangen
● University of Leipzig
Task(s):
● Automated identification of
fonts from images
● Development of an OCR training
infrastructure and model repository
Code:
● https://github.com/seuretm/ocrd_typegroups_classifier
● https://github.com/Doreenruirui/okralact
Module 8: Long-term preservation
Partner(s):
● Göttingen State and University Library
● GWDG Göttingen
Task(s):
● Analysis of requirements for
long-term preservation of OCR
● Concept and prototype implementation for
○ Persistent storage and identification
of OCR data in the archive
○ Citation of OCR data in the archive
○ Search functionality within the archive
Code:
● https://github.com/subugoe/OLA-HD-IMPL
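
The specifications name BagIt as the packaging format for long-term preservation. As a rough illustration of what a BagIt bag records, the sketch below builds the bag declaration and a SHA-512 payload manifest for some in-memory files. It follows the general BagIt layout (payload under data/, one "checksum path" line per file) and does not cover the additional requirements of the ocrd-zip specification; the file names and contents are hypothetical.

```python
import hashlib

# Content of bagit.txt, the bag declaration required by BagIt.
BAG_DECLARATION = "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n"


def payload_manifest(files):
    """Build the text of manifest-sha512.txt.

    `files` maps a path relative to the payload directory to its bytes.
    Each manifest line is "<sha512 hex digest>  data/<path>".
    """
    lines = []
    for path in sorted(files):
        digest = hashlib.sha512(files[path]).hexdigest()
        lines.append(f"{digest}  data/{path}")
    return "\n".join(lines) + "\n"


def verify(files, manifest):
    """Check that every manifest entry matches the given file contents."""
    for line in manifest.splitlines():
        digest, path = line.split(None, 1)
        content = files.get(path[len("data/"):])
        if content is None or hashlib.sha512(content).hexdigest() != digest:
            return False
    return True


if __name__ == "__main__":
    # Hypothetical payload: a METS file and one PAGE-XML result.
    payload = {
        "mets.xml": b"<mets/>",
        "OCR-D-OCR/0001.xml": b"<PcGts/>",
    }
    manifest = payload_manifest(payload)
    print(BAG_DECLARATION + manifest)
    print(verify(payload, manifest))
```

Because the checksums travel with the data, an archive can detect corruption of stored OCR results at any later point.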
OCR-D contact/access points
● OCR-D Website:
http://ocr-d.de/eng
● OCR-D GitHub:
https://github.com/OCR-D
● OCR-D Specification and Documentation:
https://ocr-d.github.io/
● OCR-D Ground Truth:
https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit
● OCR-D Gitter:
https://gitter.im/OCR-D/Lobby
● OCR-D Docker:
https://hub.docker.com/u/ocrd
Thank you for your attention!
Any questions?
