OCR-D: An end-to-end open
source OCR framework for
historical printed documents
Clemens Neudecker, Konstantin Baierer, Maria
Federbusch, Matthias Boenig, Kay-Michael
Würzner, Volker Hartmann, Elisa Herrmann
DATeCH2019
8-10 May 2019, Brussels, Belgium
Introduction
● Many OCR projects in the past: METAe, IMPACT, eMOP, etc.
● Yet there is still a lack of open and comprehensive tools & methods
for OCR/OLR specifically targeting historical printed documents
● Recent breakthroughs in artificial intelligence/machine learning
enable competitive OCR quality given sufficient training data
● With the advent of the Digital Humanities, the requirement for large
scale text corpora with high quality OCR is growing rapidly
→ OCR-D project set up in 2015
Main goals of OCR-D:
● Development of OCR solutions suitable for historical prints
● Standardisation of metadata and Ground Truth
● Creation of training and evaluation data
Architecture
OCR-D is composed of a “Coordination Project” consisting of
● Herzog-August-Library Wolfenbüttel
● Berlin-Brandenburg Academy of Sciences and Humanities
● Bavarian State Library (until 08/2016)
● Berlin State Library (from 12/2016)
● Karlsruhe Institute of Technology (from 08/2017)
...as well as a total of 8 separate “Module Projects” that
● Develop technical solutions to the identified challenges
● Implement the specifications of the “Coordination Project”
OCR-D receives funding from the DFG for 2015 - 2020
OCR-D Specifications
Specifications and conventions for interfaces and exchange formats:
● Command Line Interface (CLI)
● Metadata and structural data (METS)
● Full Text (PAGE-XML)
● Software (ocrd-tool.json, Dockerfile)
● Long-term preservation (ocrd-zip, BagIt)
→ https://ocr-d.github.io/
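To make the METS convention more concrete, the following minimal sketch lists the files of one file group using only the Python standard library. The sample document, the file group name `OCR-D-IMG`, the file ID and the image path are illustrative assumptions, not project data; the authoritative rules are in the specifications linked above.

```python
import xml.etree.ElementTree as ET

# XML namespaces used by METS and by the xlink:href attribute.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Hypothetical, heavily reduced METS document (illustration only).
SAMPLE_METS = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="IMG_0001" MIMETYPE="image/tiff">
        <mets:FLocat LOCTYPE="URL" xlink:href="images/0001.tif"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


def list_files(mets_xml, file_grp):
    """Return (ID, MIME type, location) for every file in one file group."""
    root = ET.fromstring(mets_xml.strip())
    href_attr = "{%s}href" % NS["xlink"]
    result = []
    for grp in root.iterfind("mets:fileSec/mets:fileGrp", NS):
        if grp.get("USE") != file_grp:
            continue  # skip file groups we were not asked for
        for file_el in grp.iterfind("mets:file", NS):
            flocat = file_el.find("mets:FLocat", NS)
            href = flocat.get(href_attr) if flocat is not None else None
            result.append((file_el.get("ID"), file_el.get("MIMETYPE"), href))
    return result


print(list_files(SAMPLE_METS, "OCR-D-IMG"))
```

Each processing step would typically read one such file group and write its results to another, which is what keeps the steps of a workflow independent of one another.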
OCR-D Specifications
● For sustainability and reuse, documentation beats implementation
● Open and transparent development on GitHub
OCR-D Core
Reference implementation of the specifications:
● Utility functions for common tasks (ocrd_utils)
● Programmatic access to data formats (ocrd_models,
ocrd_modelfactory)
● Validation of interfaces and data formats (ocrd_validators)
● Toolkit to create compatible command line tools (ocrd)
→ https://github.com/OCR-D/core
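As a rough illustration of what a compatible command line tool looks like from the outside, the sketch below parses a few options in the style of the OCR-D CLI convention with plain `argparse`. It deliberately does not use the `ocrd` toolkit itself. The tool name `ocrd-example`, the defaults and the treatment of `-p` as an inline JSON string are assumptions made for this example and should be checked against the specification.

```python
import argparse
import json


def build_parser():
    """Build a parser for a hypothetical processor called ocrd-example."""
    parser = argparse.ArgumentParser(prog="ocrd-example")
    parser.add_argument("-m", "--mets", default="mets.xml",
                        help="path to the METS file (default is an assumption)")
    parser.add_argument("-I", "--input-file-grp", required=True,
                        help="comma-separated input file group(s)")
    parser.add_argument("-O", "--output-file-grp", required=True,
                        help="comma-separated output file group(s)")
    parser.add_argument("-p", "--parameter", default="{}",
                        help="processor parameters as a JSON string")
    return parser


def parse_cli(argv):
    """Turn an argument list into a plain dictionary of settings."""
    args = build_parser().parse_args(argv)
    return {
        "mets": args.mets,
        "input_file_grp": args.input_file_grp.split(","),
        "output_file_grp": args.output_file_grp.split(","),
        "parameter": json.loads(args.parameter),
    }


# Example invocation with an explicit argument list instead of sys.argv.
print(parse_cli(["-I", "OCR-D-IMG", "-O", "OCR-D-BIN", "-p", '{"dpi": 300}']))
```

Because every tool accepts the same kind of arguments, a workflow engine can chain tools without knowing anything about their internals.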
OCR-D Core
● Easy to install via PyPI:
pip install ocrd
OCR-D Workflow
OCR-D provides Scientific Workflow Components for OCR using the
Apache Taverna Engine
● Workflow descriptions (in the SCUFL2 language) make workflows
easily reproducible, so they can be shared and reused by others
● Simplifies the transparent benchmarking and evaluation
of modules/components
● Allows the capture of workflow provenance data
→ https://github.com/OCR-D/taverna_workflow
OCR-D Ground Truth
In order to support the development, training and evaluation of tools
and methods, the OCR-D Coordination Project provides:
● Comprehensive transcription guidelines (only in German)
● Approx. 60 complete volumes from the 16th to the 19th century
● Special corpora with
○ Low quality OCR
○ Challenging images
● “Structure” Ground Truth corpus
● Additional Ground Truth is currently in production
All Ground Truth data is made freely available under open licenses:
http://www.ocr-d.de/daten
Module 1: Image Optimisation
Partner(s):
● DFKI Kaiserslautern
Task(s):
● Deskewing
● Dewarping
● Despeckling
● Cropping (print space, border removal)
● Binarization
Code:
● https://github.com/syedsaqibbukhari/docanalysis
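Binarization, one of the tasks listed above, turns a grayscale scan into a black-and-white image. As a minimal sketch of the idea, the following uses Otsu's global threshold, a classic textbook method chosen here purely for illustration. It is not claimed to be the method implemented in this module, and the tiny pixel list is made up.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0..255) that maximises between-class variance.

    Pixels with a value <= t form the dark class, all others the bright class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of grey values in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break     # bright class empty, nothing left to separate
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize(pixels, threshold):
    """Map every pixel to 0 (dark) or 255 (bright)."""
    return [0 if p <= threshold else 255 for p in pixels]


# Invented grey values: four dark "ink" pixels, four bright "paper" pixels.
sample = [12, 15, 20, 18, 200, 210, 220, 205]
t = otsu_threshold(sample)
print(t, binarize(sample, t))
```

A single global threshold is only a starting point; unevenly lit or stained historical pages usually call for locally adaptive methods.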
Module 2: Layout Analysis
Partner(s):
● DFKI Kaiserslautern
Task(s):
● Page segmentation
● Line segmentation based on RAST
● Region classification with CNN
● Document analysis (e.g.
reconstruction of table of contents)
Code:
● Not yet available
Module 3: Layout Analysis &
Region Extraction and Classification
Partner(s):
● University of Würzburg
Task(s):
● Based on LAREX
● Development of a CNN-based
pixel classifier
● Integration of document-specific
rules and heuristics
● Highly automated processing
Code:
● https://github.com/ocr-d-modul-2-segmentierung/segmentation-runner
Module 4: Unsupervised OCR Postcorrection
Partner(s):
● University of Leipzig
Task(s):
● Combine finite-state transducer and
neural-network-based postcorrection
in a noisy channel model
● Provision of (tools to create)
language- and domain-specific models
Code:
● https://git.informatik.uni-leipzig.de/ocr-d/cor-asv-fst
● https://git.informatik.uni-leipzig.de/ocr-d/cor-asv-ann
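In its generic textbook form, a noisy channel model treats the OCR output o as a corrupted version of an intended string w and picks the most probable correction. The formulation below is this standard form, given as an assumption about the general approach rather than as the module's exact model; the slides do not state how the transducer and the neural network divide the two factors between them.

```latex
\hat{w} \;=\; \arg\max_{w} P(w \mid o) \;=\; \arg\max_{w} P(o \mid w)\, P(w)
```

Here P(o | w) is the error (channel) model, describing how likely the OCR engine is to produce o when the page says w, and P(w) is the language model, describing how plausible w is in the given language and domain.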
Module 5: Tesseract
Partner(s):
● University Library Mannheim
Task(s):
● Adaptation of Tesseract to the
OCR-D specifications and interfaces
● Improvement of the code quality
● Optimisation for high throughput
Code:
● https://github.com/tesseract-ocr
● https://github.com/OCR-D/ocrd_tesserocr
Module 6: Automated Postcorrection with
optional interactive Postcorrection
Partner(s):
● University of Munich
Task(s):
● Alignment of multiple OCR results
● Profiling of OCR
● Automated postcorrection
and correction protocol
● (Optional) interactive postcorrection
Code:
● https://github.com/cisocrgroup/ocrd-postcorrection
● https://github.com/cisocrgroup/cis-ocrd-py
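A character-level alignment of two OCR results for the same line can be sketched with Python's standard `difflib`. This only illustrates what alignment means in this context; it is not the module's algorithm, and the two sample strings are invented.

```python
from difflib import SequenceMatcher


def align(ocr_a, ocr_b):
    """Align two OCR strings and return (tag, segment_a, segment_b) tuples.

    The tag is one of 'equal', 'replace', 'delete' or 'insert'.
    """
    matcher = SequenceMatcher(None, ocr_a, ocr_b, autojunk=False)
    return [
        (tag, ocr_a[i1:i2], ocr_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    ]


def disagreements(ocr_a, ocr_b):
    """Keep only the places where the two OCR results differ."""
    return [op for op in align(ocr_a, ocr_b) if op[0] != "equal"]


# Two invented OCR readings of the same word, differing in one character.
print(disagreements("Die Weisheit", "Die Weifheit"))
```

The positions where the engines disagree are the natural candidates for automated or interactive postcorrection.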
Module 7: Automated Font Recognition &
Model Training Infrastructure
Partner(s):
● University of Mainz
● University of Erlangen
● University of Leipzig
Task(s):
● Automated identification of
fonts from images
● Development of an OCR training
infrastructure and model repository
Code:
● https://github.com/seuretm/ocrd_typegroups_classifier
● https://github.com/Doreenruirui/okralact
Module 8: Long-term preservation
Partner(s):
● Göttingen State and University Library
● GWDG Göttingen
Task(s):
● Analysis of requirements for
long-term preservation of OCR
● Concept and prototype implementation for
○ Persistent storage and identification
of OCR data in the archive
○ Citation of OCR data in the archive
○ Search functionality within the archive
Code:
● https://github.com/subugoe/OLA-HD-IMPL
OCR-D contact/access points
● OCR-D Website:
http://ocr-d.de/eng
● OCR-D GitHub:
https://github.com/OCR-D
● OCR-D Specification and Documentation:
https://ocr-d.github.io/
● OCR-D Ground Truth:
https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit
● OCR-D Gitter:
https://gitter.im/OCR-D/Lobby
● OCR-D Docker:
https://hub.docker.com/u/ocrd
Thank you for your attention!
Questions, please?

OCR-D: An end-to-end open source OCR framework for historical printed documents

  • 1. OCR-D: An end-to-end open source OCR framework for historical printed documents
    Clemens Neudecker, Konstantin Baierer, Maria Federbusch, Matthias Boenig, Kay-Michael Würzner, Volker Hartmann, Elisa Herrmann
    DATeCH2019, 8-10 May 2019, Brussels, Belgium
  • 2. Introduction
    ● Many OCR projects in the past: METAe, IMPACT, eMOP, etc.
    ● Yet there is still a lack of open and comprehensive tools & methods for OCR/OLR specifically targeting historical printed documents
    ● New breakthroughs through artificial intelligence/machine learning enable competitive OCR quality given sufficient training
    ● With the advent of the Digital Humanities, the requirement for large-scale text corpora with high-quality OCR is growing rapidly
    → Setting up of the OCR-D project in 2015
    Main goals of OCR-D:
    ● Development of OCR solutions suitable for historical prints
    ● Standardisation of metadata and Ground Truth
    ● Creation of training and evaluation data
  • 3. Architecture
    OCR-D is composed of a “Coordination Project” consisting of
    ● Herzog-August-Library Wolfenbüttel
    ● Berlin-Brandenburg Academy of Sciences and Humanities
    ● Bavarian State Library (until 08/2016)
    ● Berlin State Library (from 12/2016)
    ● Karlsruhe Institute of Technology (from 08/2017)
    ...as well as a total of 8 separate “Module Projects” that
    ● Develop technical solutions to the identified challenges
    ● Implement the specifications of the “Coordination Project”
    OCR-D receives funding from the DFG for 2015-2020
  • 4. OCR-D Specifications
    Specifications and conventions for interfaces and exchange formats:
    ● Command Line Interface (CLI)
    ● Metadata and structural data (METS)
    ● Full Text (PAGE-XML)
    ● Software (ocrd-tool.json, Dockerfile)
    ● Long-term preservation (ocrd-zip, BagIt)
    → https://ocr-d.github.io/
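To make the role of METS as the container for metadata and structural data more concrete, here is a minimal sketch that lists the file groups of a METS document using only the Python standard library. The sample document, the file group names (`OCR-D-IMG`, `OCR-D-OCR`), the file paths and the helper `list_file_groups` are illustrative assumptions, not taken from the slides; only the METS and XLink namespaces are standard.

```python
import xml.etree.ElementTree as ET

# Standard METS and XLink namespaces.
METS_NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical, heavily shortened METS document for illustration only.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="IMG_0001" MIMETYPE="image/tiff">
        <mets:FLocat LOCTYPE="URL" xlink:href="images/0001.tif"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="OCR-D-OCR">
      <mets:file ID="OCR_0001" MIMETYPE="application/vnd.prima.page+xml">
        <mets:FLocat LOCTYPE="URL" xlink:href="ocr/0001.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>
"""


def list_file_groups(mets_xml):
    """Map each fileGrp's USE attribute to the hrefs of the files it contains."""
    root = ET.fromstring(mets_xml)
    groups = {}
    for grp in root.findall(".//mets:fileGrp", METS_NS):
        hrefs = [
            flocat.get(XLINK_HREF)
            for flocat in grp.findall("mets:file/mets:FLocat", METS_NS)
        ]
        groups[grp.get("USE")] = hrefs
    return groups


if __name__ == "__main__":
    for use, hrefs in list_file_groups(SAMPLE_METS).items():
        print(use, hrefs)
```

In this sketch, images and PAGE-XML full text live in separate file groups of the same METS file, which is one plausible way a processing step could find its input and register its output.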
  • 5. OCR-D Specifications
    ● For sustainability and reuse, documentation beats implementation
    ● Open and transparent development on GitHub
  • 6. OCR-D Core
    Reference implementation of the specifications:
    ● Utility functions for common tasks (ocrd_utils)
    ● Programmatic access to data formats (ocrd_models, ocrd_modelfactory)
    ● Validation of interfaces and data formats (ocrd_validators)
    ● Toolkit to create compatible command line tools (ocrd)
    → https://github.com/OCR-D/core
  • 7. OCR-D Core
    ● Easy to install via PyPI: pip install ocrd
  • 8. OCR-D Workflow
    OCR-D provides Scientific Workflow Components for OCR using the Apache Taverna Engine
    ● Via the Workflow Description (in the SCUFL2 language), the workflows become easily reproducible and can be shared and reused by others
    ● Simplifies the transparent benchmarking and evaluation of modules/components
    ● This approach also allows the capture of workflow provenance data
    → https://github.com/OCR-D/taverna_workflow
  • 9. OCR-D Ground Truth
    In order to support the development, training and evaluation of tools and methods, the OCR-D Coordination Project provides:
    ● Comprehensive transcription guidelines (only in German)
    ● Approx. 60 complete volumes from the 16th to the 19th century
    ● Special corpora with
      ○ Low quality OCR
      ○ Challenging images
    ● “Structure” Ground Truth corpus
    ● Additional Ground Truth is currently in production
    All Ground Truth data is made freely available under open licenses: http://www.ocr-d.de/daten
  • 10. Module 1: Image Optimisation
    Partner(s):
    ● DFKI Kaiserslautern
    Task(s):
    ● Deskewing
    ● Dewarping
    ● Despeckling
    ● Cropping (print space, border removal)
    ● Binarization
    Code:
    ● https://github.com/syedsaqibbukhari/docanalysis
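As an illustration of what a binarization step does, the following sketch applies Otsu's method, a classic global thresholding technique, to a flat list of 8-bit grey values. The slides do not say which binarization algorithm the module uses, so Otsu's method is a stand-in chosen for brevity, and the function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the grey value t (0-255) that maximises between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_bg = 0  # number of pixels with value <= t
    sum_bg = 0     # sum of grey values of those pixels
    for t in range(256):
        weight_bg += hist[t]
        sum_bg += t * hist[t]
        if weight_bg == 0:
            continue  # no pixels in the dark class yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # all pixels are in the dark class
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        between_var = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels):
    """Map every pixel to 0 (dark) or 255 (light) using the Otsu threshold."""
    t = otsu_threshold(pixels)
    return [0 if p <= t else 255 for p in pixels]


if __name__ == "__main__":
    sample = [10, 12, 11, 200, 210, 205]  # dark ink vs. light paper
    print(binarize(sample))
```

Historical prints with uneven illumination or bleed-through often call for local or learned methods instead; a single global threshold is only the simplest possible baseline.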
  • 11. Module 2: Layout Analysis
    Partner(s):
    ● DFKI Kaiserslautern
    Task(s):
    ● Page segmentation
    ● Line segmentation based on RAST
    ● Region classification with CNN
    ● Document analysis (e.g. reconstruction of table of contents)
    Code:
    ● Not yet available
  • 12. Module 3: Layout Analysis & Region Extraction and Classification
    Partner(s):
    ● University Würzburg
    Task(s):
    ● Based on LAREX
    ● Development of a CNN-based pixel classifier
    ● Integration of document-specific rules and heuristics
    ● Highly automated processing
    Code:
    ● https://github.com/ocr-d-modul-2-segmentierung/segmentation-runner
  • 13. Module 4: Unsupervised OCR Postcorrection
    Partner(s):
    ● University Leipzig
    Task(s):
    ● Combine finite-state transducer and neural network based postcorrection in a noisy channel model
    ● Provision of (tools to create) language- and domain-specific models
    Code:
    ● https://git.informatik.uni-leipzig.de/ocr-d/cor-asv-fst
    ● https://git.informatik.uni-leipzig.de/ocr-d/cor-asv-ann
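The noisy channel idea can be shown with a toy example: for an observed OCR token o, pick the candidate word w that maximises P(w) · P(o | w). The sketch below is not the module's transducer or neural network implementation. The three-word lexicon, its probabilities, the fixed per-edit error rate and all function names are made-up assumptions for illustration.

```python
import math

# Hypothetical unigram language model P(w); values are invented.
LEXICON = {"und": 0.6, "uns": 0.3, "unde": 0.1}


def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def channel_log_prob(observed, candidate, error_rate=0.1):
    """Crude error model: log P(o | w), where every edit costs log(error_rate)."""
    return edit_distance(observed, candidate) * math.log(error_rate)


def correct(observed):
    """Return the lexicon word maximising log P(w) + log P(o | w)."""
    return max(
        LEXICON,
        key=lambda w: math.log(LEXICON[w]) + channel_log_prob(observed, w),
    )


if __name__ == "__main__":
    print(correct("vnd"))
    print(correct("uns"))
```

In a real system the language model and the error model would both be learned from data; the split into two independent factors is the part that carries over.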
  • 14. Module 5: Tesseract
    Partner(s):
    ● University Library Mannheim
    Task(s):
    ● Adaptation of Tesseract to the OCR-D specifications and interfaces
    ● Improvement of the code quality
    ● Optimisation for high throughput
    Code:
    ● https://github.com/tesseract-ocr
    ● https://github.com/OCR-D/ocrd_tesserocr
  • 15. Module 6: Automated Postcorrection with optional interactive Postcorrection
    Partner(s):
    ● University of Munich
    Task(s):
    ● Alignment of multiple OCR
    ● Profiling of OCR
    ● Automated postcorrection and correction protocol
    ● (Optional) interactive postcorrection
    Code:
    ● https://github.com/cisocrgroup/ocrd-postcorrection
    ● https://github.com/cisocrgroup/cis-ocrd-py
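One way to picture the alignment of multiple OCR results is to align two engine outputs for the same text line and collect the spans where they disagree, since those are natural candidates for correction. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a simple stand-in; it is not the alignment method of the module, and the function name `disagreements` is hypothetical.

```python
from difflib import SequenceMatcher


def disagreements(ocr_a, ocr_b):
    """Return (span_in_a, span_in_b) pairs where two OCR outputs differ."""
    # autojunk=False disables the popularity heuristic for long inputs.
    matcher = SequenceMatcher(None, ocr_a, ocr_b, autojunk=False)
    return [
        (ocr_a[i1:i2], ocr_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Two hypothetical engine outputs for the same word.
    print(disagreements("Haus", "Hans"))
```

Spans on which all engines agree can be treated as more reliable, while the disagreeing spans are where a profiler or an interactive correction step would focus.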
  • 16. Module 7: Automated Font Recognition & Model Training Infrastructure
    Partner(s):
    ● University of Mainz
    ● University of Erlangen
    ● University of Leipzig
    Task(s):
    ● Automated identification of fonts from images
    ● Development of an OCR training infrastructure and model repository
    Code:
    ● https://github.com/seuretm/ocrd_typegroups_classifier
    ● https://github.com/Doreenruirui/okralact
  • 17. Module 8: Long-term preservation
    Partner(s):
    ● Göttingen State and University Library
    ● GWDG Göttingen
    Task(s):
    ● Analysis of requirements for long-term preservation of OCR
    ● Concept and prototype implementation for
      ○ Persistent storage and identification of OCR data in the archive
      ○ Citation of OCR data in the archive
      ○ Search functionality within the archive
    Code:
    ● https://github.com/subugoe/OLA-HD-IMPL
  • 18. OCR-D contact/access points
    ● OCR-D Website: http://ocr-d.de/eng
    ● OCR-D GitHub: https://github.com/OCR-D
    ● OCR-D Specification and Documentation: https://ocr-d.github.io/
    ● OCR-D Ground Truth: https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit
    ● OCR-D Gitter: https://gitter.im/OCR-D/Lobby
    ● OCR-D Docker: https://hub.docker.com/u/ocrd
  • 19. Thank you for your attention! Questions, please? DATeCH2019 8-10 May 2019, Brussels, Belgium