PoCoTo
An Open Source System for Efficient
Interactive Postcorrection of OCRed
Historical Texts
Thorsten Vobl, Annette Gotscharek, Ulrich Reffle,
Christoph Ringlstetter, Klaus U. Schulz
CIS - Center for Information and Language Processing
University of Munich
Gini GmbH Munich
Motivation
- OCR of historical texts still produces many errors
- Downstream applications are harmed
Option: improve quality with interactive postcorrection
Why: selected and important texts/corpora, or parts of them, can or must be lifted
to a much higher level of accuracy, up to perfection.
Somewhat “business driven”
How: the user experience of the software has a major influence on the time and
effort needed to improve accuracy.
Approach
Features to raise productivity (within our competence, and exploratory):
•  Plug in language technology that unmasks orthographic variation in historical
language and returns document-specific distributions of OCR errors.
•  Tool visualizes series of similar OCR errors
•  Error series can be corrected in one shot
•  Implement productive UX through interface and functionality
Evaluation
Tool developed in a university environment during the EU project IMPACT
and maintained since, despite serious fluctuation.
Practical user tests in three major European libraries have shown
gains in time and correction rates. User ratings from practitioners were high.
Maintaining interest; open for new languages and new functionalities.
Division of language resources and tool through a server-client model.
Published as an open source tool on GitHub.
§  Language technology used to improve
interactive postcorrection
§  Lexica, matching tool and profiler integrated as background technology
§  Document-centric knowledge from unsupervised analysis of the OCRed
document used for detection of error classes and suggested corrections
§  Batch mode for correcting many errors in „one shot“
§  Rich graphical user interface to let users fully benefit
from „knowledge“ on document-derived error classes
Starting Point: Postcorrection Tool as
a Carrier of Technology
Flexible GUI
OCR
Correction candidates,
Special workflows
Image
§  Unlimited configuration of
the views:
–  OCR with image snippets
–  Complete image page
–  Correction candidates, special
workflows
Font-/window size
configuration
§  OCRed text is presented to
the user with word-image
alignment.
§  Natural flow of text is
maintained; comparison
with the original text images is a
lot easier than with focus
hopping
View: OCR + Image Snippets
§  Alternative view with the
complete page image.
–  Useful for difficult-to-read words
–  Useful if word segmentation of the OCR
is too poor
–  Useful if long-distance text understanding
is needed
View: Original Image
§  Classical correction
workflow through sequential
manual input
Manual Correction
§  Speed-up through
selection of proposed
correction candidates
In line with what is usually
offered: „Base Mode“
Drop Down Selection of Correction
Candidates
Modern word form (Wmod) -> word form in ground truth (Wgt) -> word form in OCRed text (Wocr)
Wmod -> Wgt: patterns applied („pattern trace“)
Wgt -> Wocr: OCR errors applied („OCR trace“)
„Interpretation“ of the OCR token, starting from OCR token Wocr
Estimation of the Channel Model
Two-Channel Model for OCRed
historical Text
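A minimal sketch of how one „interpretation“ of an OCR token could be represented and scored. The class name, the field names and the scoring rule (a plain product of word, pattern and OCR-error probabilities, assuming independence) are illustrative assumptions, not taken from the PoCoTo code; the probabilities in the example are made up.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Interpretation:
    """One reading of an OCR token under the two-channel model (hypothetical layout)."""
    w_mod: str            # modern word form
    w_gt: str             # word form in the ground truth (historical spelling)
    w_ocr: str            # word form in the OCRed text
    pattern_trace: tuple  # historical patterns applied, as (modern, historical) pairs
    ocr_trace: tuple      # OCR errors applied, as (correct, recognised) pairs


def score(interp, p_word, p_pattern, p_ocr):
    """Product of word, pattern and OCR-error probabilities (independence assumed)."""
    return (p_word.get(interp.w_mod, 0.0)
            * prod(p_pattern.get(p, 0.0) for p in interp.pattern_trace)
            * prod(p_ocr.get(e, 0.0) for e in interp.ocr_trace))


# Modern "und" -> historical "vnd" (pattern u -> v) -> OCR output "vud" (error n -> u).
example = Interpretation(
    w_mod="und", w_gt="vnd", w_ocr="vud",
    pattern_trace=(("u", "v"),),
    ocr_trace=(("n", "u"),),
)
example_score = score(
    example,
    p_word={"und": 0.02},          # made-up probabilities
    p_pattern={("u", "v"): 0.1},
    p_ocr={("n", "u"): 0.05},
)
```

Ranking all interpretations of a token by such a score gives an ordered list of correction candidates.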
Global guess: improved model for
• words
• patterns
• OCR errors
and their probabilities
Local guess: for each OCR token Wocr, improved list of
interpretations with probabilities
Final result per interpretation: modern word, ground truth, OCR trace, hist trace
Profiling of historical OCRed corpora
with EM
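The alternation between a per-token guess and a document-wide model can be sketched as a small EM-style loop. This is a simplified stand-in, not the profiler itself: the function name, the flat event labels and the single shared normalisation over all events are assumptions made for illustration, and every token is assumed to have at least one interpretation.

```python
from collections import defaultdict
from math import prod


def em_profile(candidates, iterations=10):
    """candidates maps each OCR token to its possible interpretations; each
    interpretation is a tuple of hashable event labels, for example
    (("word", "und"), ("ocr", "n->u")). Returns (model, posteriors)."""
    events = {e for interps in candidates.values() for i in interps for e in i}
    model = {e: 1.0 / len(events) for e in events}  # uniform start
    posteriors = {}
    for _ in range(iterations):
        # Local guess: weight the interpretations of each token under the current model.
        counts = defaultdict(float)
        for token, interps in candidates.items():
            weights = [prod(model[e] for e in i) for i in interps]
            total = sum(weights)
            posteriors[token] = [w / total for w in weights]
            for interp, post in zip(interps, posteriors[token]):
                for e in interp:
                    counts[e] += post
        # Global guess: re-estimate event probabilities from the expected counts.
        norm = sum(counts.values())
        model = {e: counts[e] / norm for e in events}
    return model, posteriors


candidates = {
    # "uud" is either "und" with the OCR error n->u, or a word of its own.
    "uud": [(("word", "und"), ("ocr", "n->u")), (("word", "uud"), ("ocr", "none"))],
    "und": [(("word", "und"), ("ocr", "none"))],
    "kanu": [(("word", "kann"), ("ocr", "n->u"))],
}
model, posteriors = em_profile(candidates)
```

In this toy input the evidence from "und" and "kanu" pushes the first interpretation of "uud" ahead of the second over the iterations. Token frequencies are ignored here for brevity.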
Document Eckartshausen: result, probabilities of historical patterns
LMF
Document Eckartshausen: result, probabilities of OCR errors
§  Valid historical words not
marked as errors even if
not in the lexicon
(„hypothetical lexicon“)
§  Historical variants
proposed as correction
candidates
Lexicons Triggered by Profiles
§  Improved ranking of candidates through document-specific
language and error profile
§  Concordance error view with high-confidence
corrections
Selection of Correction Candidates
§  Identical strings with a high-probability
correction are corrected as a batch
§  Concordance views optional
Rapid Workflow - Batch Processing
Identical Strings
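A minimal sketch of batching identical strings, assuming the top candidate and its probability are already known for each OCR string. The function names, the input layout and the 0.9 threshold are hypothetical choices for illustration.

```python
from collections import defaultdict


def batch_identical(tokens, best_candidate, threshold=0.9):
    """tokens: OCR strings in document order.
    best_candidate: {ocr_string: (correction, probability)}.
    Returns {ocr_string: (correction, [positions])} for confident corrections."""
    positions = defaultdict(list)
    for idx, tok in enumerate(tokens):
        positions[tok].append(idx)
    batches = {}
    for tok, idxs in positions.items():
        correction, prob = best_candidate.get(tok, (tok, 0.0))
        if correction != tok and prob >= threshold:
            batches[tok] = (correction, idxs)
    return batches


def apply_batch(tokens, batch):
    """Return a copy of tokens with every position of the batch corrected."""
    correction, idxs = batch
    corrected = list(tokens)
    for i in idxs:
        corrected[i] = correction
    return corrected


tokens = ["uud", "der", "uud", "nnd"]
best = {"uud": ("und", 0.97), "nnd": ("und", 0.6)}  # made-up probabilities
batches = batch_identical(tokens, best)
corrected = apply_batch(tokens, batches["uud"])
```

Keeping the positions with each batch leaves room for the user to inspect the occurrences, for example in a concordance view, before applying the correction.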
§  Strings with identical error patterns
corrected as batch
§  In the example: n -> u
Rapid Workflow - Batch Processing
Identical Error Patterns
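Grouping suggestions by error pattern, such as n -> u, can be sketched as follows. This is a simplification that handles only same-length character substitutions; the function names are hypothetical and insertions, deletions and multi-character patterns are left out.

```python
from collections import defaultdict


def error_pattern(correct, ocr):
    """Return the differing (correct, recognised) character pairs,
    or None if the two strings differ in length (not handled here)."""
    if len(correct) != len(ocr):
        return None
    return tuple((c, o) for c, o in zip(correct, ocr) if c != o)


def group_by_pattern(suggestions):
    """suggestions: iterable of (ocr_token, proposed_correction) pairs.
    Returns {pattern: [(ocr_token, proposed_correction), ...]}."""
    series = defaultdict(list)
    for ocr, correct in suggestions:
        pattern = error_pattern(correct, ocr)
        if pattern:  # skips identical pairs and unsupported length changes
            series[pattern].append((ocr, correct))
    return dict(series)


suggestions = [("uud", "und"), ("kanu", "kann"), ("vou", "von"), ("clie", "die")]
series = group_by_pattern(suggestions)
n_to_u = series[(("n", "u"),)]
```

Each resulting series can then be shown to the user and corrected as one batch.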
Controlled “Hard” Evaluations
[Plot: BSB Dokument1, corrections made (0 to 800) over time in minutes (0 to 90); series: User1 F, User2 F, User3 B, User4 B, User5 F, User6 B]
§  Measurement points every 10
minutes for 90 minutes
§  Each user with a base/full
session (inter/intra user
comparison)
§  More corrections, on average 1.5x to 3x,
for Full Mode
§  Early gains: first 10 minutes
Closer Look into the Data
Soft Evaluations
Questionnaires with all three institutions.
Favorite aspect:
Batch Corrections
Main problems:
Stability
Correction of Segmentation Errors
Future work
•  Extend to new Languages e.g. Latin
•  New Correction Scenarios e.g. specific Named
Entity Correction
•  Turn Interest into a Community and Implement
Industrial Tool Partnerships for isolated parts of
the Software
Thanks for your attention!
… and special thanks to University of Alicante, Bavarian State Library, Royal
Library of the Netherlands for their Time and Efforts during the Experiments

Datech2014 - Session 3 - PoCoTo - An Open Source System For Efficient Interactive Postcorrection of OCRed Historical Text
