OCR-D: An end-to-end open source OCR framework for historical printed documents

Presented at #DATeCH2019, Brussels, Belgium

  1. OCR-D: An end-to-end open source OCR framework for historical printed documents
     Clemens Neudecker, Konstantin Baierer, Maria Federbusch, Matthias Boenig, Kay-Michael Würzner, Volker Hartmann, Elisa Herrmann
     DATeCH2019, 8-10 May 2019, Brussels, Belgium
  2. Introduction
     ● Many OCR projects in the past: METAe, IMPACT, eMOP, etc.
     ● Yet there is still a lack of open and comprehensive tools and methods for OCR/OLR specifically targeting historical printed documents
     ● New breakthroughs in artificial intelligence/machine learning enable competitive OCR quality given sufficient training
     ● With the advent of the Digital Humanities, the requirement for large-scale text corpora with high-quality OCR is growing rapidly
     → Setting up of the OCR-D project in 2015
     Main goals of OCR-D:
     ● Development of OCR solutions suitable for historical prints
     ● Standardisation of metadata and Ground Truth
     ● Creation of training and evaluation data
  3. Architecture
     OCR-D is composed of a “Coordination Project” consisting of
     ● Herzog-August-Library Wolfenbüttel
     ● Berlin-Brandenburg Academy of Sciences and Humanities
     ● Bavarian State Library (until 08/2016)
     ● Berlin State Library (from 12/2016)
     ● Karlsruhe Institute of Technology (from 08/2017)
     ...as well as a total of 8 separate “Module Projects” that
     ● Develop technical solutions to the identified challenges
     ● Implement the specifications of the “Coordination Project”
     OCR-D receives funding from the DFG for 2015-2020
  4. OCR-D Specifications
     Specifications and conventions for interfaces and exchange formats:
     ● Command Line Interface (CLI)
     ● Metadata and structural data (METS)
     ● Full Text (PAGE-XML)
     ● Software (ocrd-tool.json, Dockerfile)
     ● Long-term preservation (ocrd-zip, BagIt)
     → https://ocr-d.github.io/
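As a rough illustration of the PAGE-XML full-text format listed on this slide, the following sketch reads line-level text from a PAGE document using only the Python standard library. The sample document is hand-written and heavily simplified (it omits the coordinates a real file carries, so it is not schema-valid), and the function name `text_lines` is an assumption made for this example; it is not part of OCR-D Core.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified sample. Real PAGE-XML files also carry Coords,
# metadata and usually many more regions; this is for illustration only.
SAMPLE_PAGE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page_0001.tif" imageWidth="1200" imageHeight="1800">
    <TextRegion id="r1">
      <TextLine id="r1_l1">
        <TextEquiv><Unicode>Erste Zeile</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1_l2">
        <TextEquiv><Unicode>Zweite Zeile</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def text_lines(page_xml: str) -> list[tuple[str, str]]:
    """Return (line id, text) pairs for every TextLine in a PAGE document.

    The "{*}" wildcard (Python 3.8+) matches any namespace, so the sketch
    does not depend on a particular PAGE schema version.
    """
    root = ET.fromstring(page_xml.strip())
    result = []
    for line in root.findall(".//{*}TextLine"):
        unicode_el = line.find("{*}TextEquiv/{*}Unicode")
        text = unicode_el.text if unicode_el is not None and unicode_el.text else ""
        result.append((line.get("id", ""), text))
    return result


for line_id, text in text_lines(SAMPLE_PAGE):
    print(line_id, text)
```

For real work, the `ocrd_models` package mentioned on the OCR-D Core slide provides programmatic access to these formats; the sketch above only shows the shape of the data.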
  5. OCR-D Specifications
     ● For sustainability and reuse, documentation beats implementation
     ● Open and transparent development on GitHub
  6. OCR-D Core
     Reference implementation of the specifications:
     ● Utility functions for common tasks (ocrd_utils)
     ● Programmatic access to data formats (ocrd_models, ocrd_modelfactory)
     ● Validation of interfaces and data formats (ocrd_validators)
     ● Toolkit to create compatible command line tools (ocrd)
     → https://github.com/OCR-D/core
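To give a feel for what a "compatible command line tool" looks like from the outside, here is a minimal sketch of a command line parser that mirrors the option style of the OCR-D CLI convention as I understand it (`-m/--mets`, `-I/--input-file-grp`, `-O/--output-file-grp`, `-p/--parameter`). It deliberately uses plain `argparse` rather than the `ocrd` toolkit. The tool name `ocrd-example-step`, the file group names, the parameter `threshold` and the default values are assumptions for illustration, and the sketch accepts only an inline JSON string for `-p`. The specification at https://ocr-d.github.io/ remains the authoritative source.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the core options of an OCR-D style processor."""
    # "ocrd-example-step" is a hypothetical tool name used for this sketch.
    parser = argparse.ArgumentParser(prog="ocrd-example-step")
    parser.add_argument("-m", "--mets", default="mets.xml",
                        help="METS file describing the workspace")
    parser.add_argument("-I", "--input-file-grp", required=True,
                        help="file group in the METS to read from")
    parser.add_argument("-O", "--output-file-grp", required=True,
                        help="file group in the METS to write to")
    # Simplification: parameters are parsed from an inline JSON string only.
    parser.add_argument("-p", "--parameter", type=json.loads, default={},
                        help="processor parameters as a JSON object")
    return parser


def describe(argv: list[str]) -> str:
    """Summarise what a processor invoked with argv would do."""
    args = build_parser().parse_args(argv)
    return (f"read {args.input_file_grp} from {args.mets}, "
            f"write {args.output_file_grp}, parameters {args.parameter}")


# Illustrative call; the group names and the parameter are made up.
print(describe(["-I", "OCR-D-IMG", "-O", "OCR-D-BIN", "-p", '{"threshold": 0.5}']))
```

Because every step shares this option style and a common METS file, steps can be chained by making the output file group of one the input file group of the next.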
  7. OCR-D Core
     ● Easy to install via PyPI:
       pip install ocrd
  8. OCR-D Workflow
     OCR-D provides Scientific Workflow Components for OCR using the Apache Taverna Engine
     ● Through the workflow description (in the SCUFL2 language), workflows become easily reproducible and can be shared and reused by others
     ● Simplifies transparent benchmarking and evaluation of modules/components
     ● This approach also allows the capture of workflow provenance data
     → https://github.com/OCR-D/taverna_workflow
  9. OCR-D Ground Truth
     In order to support the development, training and evaluation of tools and methods, the OCR-D Coordination Project provides:
     ● Comprehensive transcription guidelines (only in German)
     ● Approx. 60 complete volumes from the 16th to 19th centuries
     ● Special corpora with
       ○ Low quality OCR
       ○ Challenging images
     ● “Structure” Ground Truth corpus
     ● Additional Ground Truth is currently in production
     All Ground Truth data is made freely available under open licenses: http://www.ocr-d.de/daten
  10. Module 1: Image Optimisation
      Partner(s):
      ● DFKI Kaiserslautern
      Task(s):
      ● Deskewing
      ● Dewarping
      ● Despeckling
      ● Cropping (print space, border removal)
      ● Binarization
      Code:
      ● https://github.com/syedsaqibbukhari/docanalysis
  11. Module 2: Layout Analysis
      Partner(s):
      ● DFKI Kaiserslautern
      Task(s):
      ● Page segmentation
      ● Line segmentation based on RAST
      ● Region classification with CNN
      ● Document analysis (e.g. reconstruction of table of contents)
      Code:
      ● Not yet available
  12. Module 3: Layout Analysis & Region Extraction and Classification
      Partner(s):
      ● University Würzburg
      Task(s):
      ● Based on LAREX
      ● Development of a CNN-based pixel classifier
      ● Integration of document-specific rules and heuristics
      ● Highly automated processing
      Code:
      ● https://github.com/ocr-d-modul-2-segmentierung/segmentation-runner
  13. Module 4: Unsupervised OCR Postcorrection
      Partner(s):
      ● University Leipzig
      Task(s):
      ● Combine finite-state transducer and neural network based postcorrection in a noisy channel model
      ● Provision of (tools to create) language- and domain-specific models
      Code:
      ● https://git.informatik.uni-leipzig.de/ocr-d/cor-asv-fst
      ● https://git.informatik.uni-leipzig.de/ocr-d/cor-asv-ann
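The noisy channel idea named on this slide can be sketched in a few lines: given an observed OCR token o, pick the candidate word w that maximises P(w) · P(o | w), where P(w) comes from a language model and P(o | w) from an error (channel) model. The toy below replaces the module's finite-state transducers and neural networks with small hand-made lookup tables. All probabilities, the word list and the function names are invented for illustration, and the channel only handles same-length substitutions.

```python
import math

# Hypothetical unigram language model P(w); the values are made up.
LANGUAGE_MODEL = {"und": 0.6, "uns": 0.3, "um": 0.1}

# Hypothetical character confusions P(observed | intended), keyed as
# (intended, observed); the values are made up.
CONFUSION = {("n", "u"): 0.1, ("d", "b"): 0.05}
P_CORRECT = 0.9    # assumed chance that a character is recognised correctly
P_OTHER = 0.001    # assumed chance of any confusion not listed above


def channel_log_prob(observed: str, intended: str) -> float:
    """log P(observed | intended) for same-length strings, else -infinity."""
    if len(observed) != len(intended):
        return -math.inf
    total = 0.0
    for obs_char, int_char in zip(observed, intended):
        if obs_char == int_char:
            total += math.log(P_CORRECT)
        else:
            total += math.log(CONFUSION.get((int_char, obs_char), P_OTHER))
    return total


def correct(observed: str) -> str:
    """Return the vocabulary word maximising log P(w) + log P(observed | w)."""
    return max(
        LANGUAGE_MODEL,
        key=lambda word: math.log(LANGUAGE_MODEL[word])
        + channel_log_prob(observed, word),
    )


print(correct("uud"))  # the n/u confusion makes "und" the best candidate
```

Working in log space avoids underflow when many character probabilities are multiplied, which matters once tokens get longer than this toy vocabulary.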
  14. Module 5: Tesseract
      Partner(s):
      ● University Library Mannheim
      Task(s):
      ● Adaptation of Tesseract to the OCR-D specifications and interfaces
      ● Improvement of the code quality
      ● Optimisation for high throughput
      Code:
      ● https://github.com/tesseract-ocr
      ● https://github.com/OCR-D/ocrd_tesserocr
  15. Module 6: Automated Postcorrection with optional interactive Postcorrection
      Partner(s):
      ● University of Munich
      Task(s):
      ● Alignment of multiple OCR
      ● Profiling of OCR
      ● Automated postcorrection and correction protocol
      ● (Optional) interactive postcorrection
      Code:
      ● https://github.com/cisocrgroup/ocrd-postcorrection
      ● https://github.com/cisocrgroup/cis-ocrd-py
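Aligning the output of several OCR engines exposes the positions where they disagree, which are natural candidates for correction. The sketch below aligns two strings at character level with `difflib.SequenceMatcher` from the Python standard library. This is a generic stand-in chosen for brevity, not the alignment method used by the module, and the function name `disagreements` is an assumption.

```python
from difflib import SequenceMatcher


def disagreements(ocr_a: str, ocr_b: str) -> list[tuple[str, str]]:
    """Return the (text in A, text in B) pairs where two OCR results differ."""
    # autojunk=False disables the popularity heuristic for long inputs so
    # that frequent characters such as spaces are still used for matching.
    matcher = SequenceMatcher(None, ocr_a, ocr_b, autojunk=False)
    diffs = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((ocr_a[a_start:a_end], ocr_b[b_start:b_end]))
    return diffs


# Two made-up OCR readings of the same line.
print(disagreements("the quick brown fox", "the qu1ck brown fox"))
```

Each returned pair marks a span where the engines disagree; an empty string on one side means that engine produced nothing for that span.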
  16. Module 7: Automated Font Recognition & Model Training Infrastructure
      Partner(s):
      ● University of Mainz
      ● University of Erlangen
      ● University of Leipzig
      Task(s):
      ● Automated identification of fonts from images
      ● Development of an OCR training infrastructure and model repository
      Code:
      ● https://github.com/seuretm/ocrd_typegroups_classifier
      ● https://github.com/Doreenruirui/okralact
  17. Module 8: Long-term preservation
      Partner(s):
      ● Göttingen State and University Library
      ● GWDG Göttingen
      Task(s):
      ● Analysis of requirements for long-term preservation of OCR
      ● Concept and prototype implementation for
        ○ Persistent storage and identification of OCR data in the archive
        ○ Citation of OCR data in the archive
        ○ Search functionality within the archive
      Code:
      ● https://github.com/subugoe/OLA-HD-IMPL
  18. OCR-D contact/access points
      ● OCR-D Website: http://ocr-d.de/eng
      ● OCR-D GitHub: https://github.com/OCR-D
      ● OCR-D Specification and Documentation: https://ocr-d.github.io/
      ● OCR-D Ground Truth: https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit
      ● OCR-D Gitter: https://gitter.im/OCR-D/Lobby
      ● OCR-D Docker: https://hub.docker.com/u/ocrd
  19. Thank you for your attention!
      Questions, please?
      DATeCH2019, 8-10 May 2019, Brussels, Belgium