Automatic Semantic Text Tagging on Historical Lexica
by Combining OCR and Typography Classification
A Case Study on Daniel Sanders’ Wörterbuch der Deutschen Sprache
Christian Reul (1), Sebastian Göttel (2), Uwe Springmann (3),
Christoph Wick (1), Kay-Michael Würzner (2), and Frank Puppe (1)
(1) Chair for Artificial Intelligence and Applied Computer Science, University of Würzburg
(2) Berlin-Brandenburg Academy of Sciences and Humanities (BBAW)
(3) Center for Information and Language Processing (CIS), LMU Munich
09.05.2019
 Great progress in the area of historical OCR on various materials.
 But raw textual OCR sometimes not sufficient.
 Typography within a lexicon
 represents semantic meaning.
 encodes a complex structure within the text (lemmata, definitions, grammatical
information, references, possible word formations, …).
 Goal: Thorough indexing of a historical lexicon by combining textual OCR and typography classification.
Motivation
 Treat the problem as two individual sequence classification tasks:
 Textual OCR.
 Typography classification.
 Perform GT production, training, and recognition separately.
 Combine the results afterwards.
 Assign a distinct label to each of the typography classes:
Image: Hello World
OCR: Hello World
Typo: nnnnn bbbbb
 Use this representation to train an open source OCR engine.
Basic Idea
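The representation above pairs every line image with two transcriptions of equal length: the text and a string of typography labels. A minimal sketch of a consistency check for such a pair follows; the function name and the label set passed in are assumptions made for illustration, not part of the authors' tooling.

```python
def is_valid_typography_line(ocr, typo, classes):
    """Check that `typo` can serve as the typography transcription of `ocr`.

    Both strings must have the same length, whitespace must sit at the same
    positions, and every other character must be a known class label.
    """
    if len(ocr) != len(typo):
        return False
    for text_char, label in zip(ocr, typo):
        if (text_char == " ") != (label == " "):
            return False
        if label != " " and label not in classes:
            return False
    return True


# Example from the slide: "Hello" carries class n, "World" class b.
print(is_valid_typography_line("Hello World", "nnnnn bbbbb", {"n", "b"}))  # True
print(is_valid_typography_line("Hello World", "nnnn bbbbbb", {"n", "b"}))  # False
```

A check like this can be run over all ground truth pairs before training, since a shifted whitespace would otherwise teach the typography model a wrong word boundary.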
 Wörterbuch der deutschen Sprache by the famous German lexicographer Daniel Sanders (200th birthday in November 2019).
 Cooperation with the Berlin-Brandenburg Academy of Sciences and Humanities (BBAW).
 Printed between 1859 and 1865.
 Three part-volumes comprising almost 3,000 pages and ca. 800,000 text lines.
 Excellent print and scan quality.
Material: Sanders’ Dictionary I
 Main lemmata always in bold Fraktur (assigned the typographical class label l).
 Followed by grammatical properties in Antiqua (a).
 Definitions in Fraktur (f).
 Typeface of the quotations divided into
 the author's name in a different Fraktur type (n),
 the page number in Antiqua (a).
 Possible word formations in letter-spaced type (F).
Material: Sanders’ Dictionary II
 Binarisation and deskewing (ocropus-nlbin).
 Column segmentation.
 Simple whitespace-based approach.
 https://github.com/wrznr/column-detect
 Deskew columns separately (ocropus-nlbin).
 Line segmentation (ocropus-gpageseg).
 Keep rotational angles and segment/line
coordinates for later use.
Preprocessing and Segmentation
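A whitespace-based column segmentation can be sketched as a vertical projection: pixel columns without any ink that lie between text are treated as column separators. The following is an illustrative stand-in, not the code of the column-detect repository; the function names, the list-of-rows image format, and the `min_gap` parameter are assumptions.

```python
def column_gaps(image, min_gap):
    """Find ink-free runs of pixel columns that separate text columns.

    `image` is a binarised page given as a list of rows (1 = ink, 0 = paper).
    Runs touching the left or right page border are margins and are ignored.
    """
    width = len(image[0])
    ink = [any(row[x] for row in image) for x in range(width)]
    gaps, x = [], 0
    while x < width:
        if ink[x]:
            x += 1
            continue
        start = x
        while x < width and not ink[x]:
            x += 1
        interior = start > 0 and x < width
        if interior and x - start >= min_gap:
            gaps.append((start, x))
    return gaps


def split_columns(image, min_gap):
    """Cut the page in the middle of every detected gap."""
    cuts = [(start + end) // 2 for start, end in column_gaps(image, min_gap)]
    bounds = [0] + cuts + [len(image[0])]
    return [[row[a:b] for row in image] for a, b in zip(bounds, bounds[1:])]


# Tiny two-column toy page: the gap spans pixel columns 3 to 5.
page = [
    [0, 1, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 0, 0, 1, 0],
]
print(column_gaps(page, min_gap=2))         # [(3, 6)]
print(len(split_columns(page, min_gap=2)))  # 2
```

Such a projection only works on deskewed pages, which is consistent with running the deskewing step before the column segmentation.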
 Open Source OCR engine Calamari.
 https://github.com/Calamari-OCR
 Great recognition capabilities (CNN-LSTM) and very fast (GPU support).
 Natively supports accuracy-improving techniques (see below).
 Voting:
 Train model ensemble instead of a single model.
 Combine outputs via confidence voting.
 Better recognition results.
 Pretraining:
 Start training from an existing model instead of from scratch.
 Faster training and better recognition results.
OCR Basics
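The difference between confidence voting and a plain majority vote can be illustrated for a single character position. The sketch below assumes that the outputs of all models are already aligned position by position, which a real voter has to establish first; the function names and the numbers are invented for illustration and do not reproduce Calamari's implementation.

```python
from collections import defaultdict


def confidence_vote(distributions):
    """Combine the outputs of several models for one character position.

    `distributions` holds one dict {character: confidence} per model.
    The character with the highest summed confidence wins.
    """
    totals = defaultdict(float)
    for distribution in distributions:
        for char, confidence in distribution.items():
            totals[char] += confidence
    best = max(totals, key=totals.get)
    return best, totals[best] / len(distributions)


def vote_line(per_model_lines):
    """Vote position by position over equally long, pre-aligned model outputs."""
    return "".join(confidence_vote(position)[0] for position in zip(*per_model_lines))


# Two of three models prefer the long s, but only narrowly;
# the third model is very sure about "f" and tips the vote.
position = [
    {"ſ": 0.60, "f": 0.40},
    {"f": 0.90, "ſ": 0.10},
    {"ſ": 0.55, "f": 0.45},
]
print(confidence_vote(position)[0])  # f
```

A majority vote would have chosen "ſ" here, so the example shows how the confidences of the individual voters enter the decision.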
 Manually transcribing the typography GT
cumbersome and error prone.
 Observation: The typography does not change
within a word.
 Idea: Use the OCR GT and label all characters
of a word at once.
 Example (to the right):
 Input (at the top): OCR GT and the line image.
 Transcription steps:
(1) The first word is highlighted and labelled at once.
(2-4) Repeating step 1 for the next words.
(5) All remaining words can be labelled in one go.
(6) Final OCR and typography GT result.
Ground Truth Production
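The transcription steps above amount to assigning one label per word and expanding it to all characters of that word, with a single label for all remaining words. A minimal sketch, assuming a hypothetical helper `label_words` and an invented dictionary line:

```python
def label_words(ocr_gt, labelled, rest):
    """Derive the typography GT of a line from its OCR GT.

    `labelled` holds the labels of the first words, assigned one word at a
    time; every remaining word receives the label `rest` in one go.
    """
    words = ocr_gt.split(" ")
    if len(labelled) > len(words):
        raise ValueError("more labels than words")
    labels = list(labelled) + [rest] * (len(words) - len(labelled))
    return " ".join(label * len(word) for word, label in zip(words, labels))


# Invented line: lemma (l), grammatical property (a), definition (f).
print(label_words("Hand f. Teil des Armes", ["l", "a"], rest="f"))
# llll aa ffff fff fffff
```

Because the whitespace positions are copied from the OCR GT, the typography GT stays aligned with the text by construction.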
 Voting ensemble consisting of five Calamari models.
 Highly performant mixed Fraktur model as a starting point:
 https://github.com/chreul/19th-century-fraktur-OCR
 Able to recognize 93 distinct characters (Sanders contains over 150).
 Calamari's extended recognition output for each character:
 Voted probability for the most likely character and its top alternatives.
 Start and end positions.
Training and Recognition
 Alignment on word level:
Assign typography output to the words
based on the character positions.
 Typography voting:
Identify most likely label for each word
by confidence voting.
 Final output: JSON file containing:
 OCR and typography label for each word.
 Each word's minimum character confidence.
 Segment, line, and word bounding boxes
with respect to the original scan.
Combining the Outputs
Typography alignment for an example line. From top to bottom:
Line image with OCR whitespace positions (|). Textual OCR output.
Typography output with character positions ( ').
(Slightly flawed) textual typography output on character level.
Final combined output with typography classes assigned on word level.
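The alignment and voting steps can be sketched as follows. The input format (word spans in pixels, typography characters with a position and a confidence), all field names, and the sample values are assumptions chosen for illustration; bounding boxes are left out to keep the example short.

```python
import json
from collections import defaultdict


def combine(ocr_words, typo_chars):
    """Assign one typography label to every OCR word.

    ocr_words:  list of dicts with "text", "start", "end" (horizontal pixel
                span of the word) and "confidences" (one value per character).
    typo_chars: list of (label, position, confidence) tuples from the
                typography model, with the position given in pixels.
    """
    result = []
    for word in ocr_words:
        votes = defaultdict(float)
        for label, position, confidence in typo_chars:
            if word["start"] <= position < word["end"]:
                votes[label] += confidence  # confidence voting per word
        result.append({
            "text": word["text"],
            "typography": max(votes, key=votes.get) if votes else None,
            "min_confidence": min(word["confidences"]),
        })
    return result


# Invented sample line: one typography character of the first word is wrong.
words = [
    {"text": "Hand", "start": 0, "end": 40, "confidences": [0.99, 0.98, 0.97, 0.99]},
    {"text": "f.", "start": 50, "end": 65, "confidences": [0.95, 0.90]},
]
typo = [("l", 5, 0.9), ("l", 15, 0.9), ("f", 25, 0.4), ("l", 35, 0.8),
        ("a", 52, 0.7), ("a", 60, 0.9)]
print(json.dumps(combine(words, typo), ensure_ascii=False, indent=2))
```

In the sample, the isolated "f" inside the first word is outvoted by the three "l" characters, so the word as a whole is labelled l.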
 Full set of training GT: 765 lines.
 Subsets (400, 200, 100, 50 lines) to examine the influence of the number of GT lines.
 Evaluation set: six columns comprising 630 lines.
 OCR: Character Error Rate (CER) calculated using Calamari’s eval script.
 Typography does not change within a word → Word Error Rate (WER) makes sense:
 Collapse each word in the voted output to a single character.
 Remove all whitespaces.
 Example: aaaa ffffffffff fff nnnnn ffff → affnf.
 Calculate CER using analogously preprocessed GT.
Experiments – Data and Performance Measures
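The typography WER described above can be reproduced in a few lines: collapse every word to one label, drop the whitespace, and compute a character error rate on the collapsed strings. The sketch uses a plain Levenshtein distance normalised by the length of the collapsed GT, which is an assumption about the normalisation and not a description of Calamari's eval script.

```python
def collapse(typo_line):
    """Reduce each word to its first label and remove all whitespace."""
    return "".join(word[0] for word in typo_line.split())


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-wise dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def typography_wer(predicted, ground_truth):
    """Word error rate of a typography line; the GT must contain at least one word."""
    gt = collapse(ground_truth)
    return edit_distance(collapse(predicted), gt) / len(gt)


print(collapse("aaaa ffffffffff fff nnnnn ffff"))    # affnf
print(typography_wer("aaaaaaaa", "aaaa ffff"))       # 0.5
```

The second call shows why whitespace errors are expensive under this measure: one missed whitespace removes a whole word from the collapsed prediction.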
 More training lines → lower CER.
 Excellent CER of 0.35% when training
on all available lines.
 Most frequent errors: insertions and
deletions of whitespaces.
 Standard approaches cannot deal
with the peculiarities of the material.
Experiments – OCR
# Lines   Calamari   ABBYY
-         3.69%      10.28%
50        1.83%      -
100       1.05%      -
200       0.67%      -
400       0.43%      -
765       0.35%      -
 More training lines → lower WER.
 Correct typography label
assigned to over 98.5% of words.
 Data augmentation yields minor
improvements (1.38% WER).
 Most frequent errors: insertions and deletions of words resulting from misrecognized whitespaces.
 Short words especially
susceptible to errors.
Experiments – Typography
GT / Pred.   Count   Perc.
f            15      10.0%
f            15      10.0%
a            10      6.7%
a / f        6       4.0%
f / F        4       2.7%

# Lines   WER
50        9.82%
100       4.08%
200       2.66%
400       1.72%
765       1.47%
 Typography recognition possible and very precise.
 Despite several very similar typography classes.
 Flexible approach using an open source OCR engine.
 Efficient GT production method.
 Main problem: insertion and deletion of whitespaces.
 Typography in Sanders’ dictionary ambiguous.
 Subsequent rule-based postprocessing step required
to produce TEI output.
 Enables complex search queries like: “show all
lemmata which include Goethe as a source”.
Discussion
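A query like the one quoted above becomes straightforward once every word carries a typography label. The sketch below uses a deliberately naive grouping rule (a new entry starts at every word labelled l) as a stand-in for the rule-based postprocessing mentioned on the slide; the rule, the function names, and the sample words are assumptions.

```python
def group_entries(words):
    """Group tagged words into dictionary entries.

    `words` is a list of (text, label) pairs in reading order. Every lemma
    (label "l") opens a new entry; author names (label "n") are collected
    as sources of the current entry.
    """
    entries = []
    for text, label in words:
        if label == "l":
            entries.append({"lemma": text, "sources": set()})
        elif label == "n" and entries:
            entries[-1]["sources"].add(text)
    return entries


def lemmata_citing(entries, author):
    """Return all lemmata whose entry names `author` as a source."""
    return [entry["lemma"] for entry in entries if author in entry["sources"]]


# Invented sample: two entries, only the first one cites Goethe.
tagged = [
    ("Hand", "l"), ("f.", "a"), ("Teil", "f"), ("Goethe", "n"), ("12", "a"),
    ("Haus", "l"), ("n.", "a"), ("Schiller", "n"),
]
print(lemmata_citing(group_entries(tagged), "Goethe"))  # ['Hand']
```

Multi-word lemmata and ambiguous typography would break this simple rule, which is where a more careful postprocessing step is needed.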
Example from the Online Dictionary (work in progress)
 Successful case study on a challenging real-world dictionary.
 Hope / Aim: Generic workflow to obtain complete electronic representations of (historical) lexica.
 Further experiments needed (other lexica, different typographical attributes).
 Meta learner judging whitespaces proposed by the OCR and typography models.
 Type-specific OCR models to further increase the accuracy.
 Application on word instead of line level.
 Already promising results.
Conclusion and Future Work
Schweizerisches Idiotikon (https://www.idiotikon.ch)
 Calamari: https://github.com/Calamari-OCR
 OCR4all: https://github.com/OCR4all
 GT Production: https://github.com/ChWick/ocrgtannotator
 Reul, Springmann, Wick, Puppe: Improving OCR Accuracy on Early Printed Books
by combining Pretraining, Voting, and Active Learning.
 Ul-Hasan, Afzal, Shafait, Liwicki, Breuel: A Sequence Learning Approach for
Multiple Script Identification.
 Wick, Reul, Puppe: Comparison of OCR Accuracy on Early Printed Books using
the Open Source Engines Calamari and OCRopus.
Thank you for your Attention!
Word and Type Statistics
Length   #       Perc.
1        711     5.8%
2        1,622   13.2%
3        2,883   23.6%
4        1,754   14.3%
5        1,254   10.2%
>5       4,018   32.8%

         a        f        F       l       n       All
Words    2,754    8,066    363     469     589     12,241
         22.5%    65.9%    3.0%    3.8%    4.8%    100.0%
Chars    8,365    40,682   2,636   3,416   3,424   58,523
         14.3%    69.5%    4.5%    5.8%    5.9%    100.0%
Length   3.04     5.04     7.26    7.28    5.81    4.78
OCR Errors per Type
Type   a            f             F            l            n
GT     2,580        21,936        747          1,768        333
50     76 (2.95%)   260 (1.19%)   36 (4.82%)   98 (5.54%)   22 (6.61%)
200    16 (0.62%)   64 (0.29%)    16 (2.14%)   51 (2.88%)   17 (5.11%)
765    4 (0.16%)    40 (0.18%)    17 (2.28%)   25 (1.41%)   8 (2.40%)
Typography – Error Analysis
 Postprocessing yields no noteworthy improvements on character level.
 Unexpected since: ffffnff → fffffff.
 Dominant errors: insertions and deletions.
 Missed whitespaces can introduce errors:
aaaaffff → aaaaaaaa (GT: aaaa ffff)
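Both examples above can be reproduced with a word-level majority smoothing, sketched here under the assumption that the typography output has already been aligned character by character with the OCR text. The function name and the sample words are invented; ties are resolved in favour of the label that occurs first in the word.

```python
from collections import Counter


def smooth(ocr_text, typo_labels):
    """Give every OCR word the most frequent typography label of its characters."""
    if len(ocr_text) != len(typo_labels):
        raise ValueError("text and labels must be aligned character by character")
    smoothed, start = [], 0
    for word in ocr_text.split(" "):
        labels = typo_labels[start:start + len(word)]
        # most_common lists equal counts in order of first occurrence
        winner = Counter(labels).most_common(1)[0][0] if labels else ""
        smoothed.append(winner * len(word))
        start += len(word) + 1  # skip the whitespace position
    return " ".join(smoothed)


# A single outlier inside a word is repaired ...
print(smooth("Sprache", "ffffnff"))   # fffffff
# ... but a whitespace missed by the OCR merges two words into one label.
print(smooth("adj.rein", "aaaaffff"))  # aaaaaaaa
```

The second call illustrates how the smoothing inherits whitespace errors from the OCR: the word boundary is taken from the text, so a missed whitespace turns a correct character-level output into a wrong word-level one.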
 Using the OCRopus3 ocrodeg module:
 Data augmentation improves the results.
 The more augmentations the better –
but saturation quickly kicks in.
 The fewer real lines available, the bigger the effect.
Typography – Data Augmentation
OCR4all
 Goal: enable non-technical users to
independently capture historical
printings with high accuracy.
 Encapsulating a comprehensive OCR workflow in a single Docker image.
 Platform-independent.
 Easy installation.
 Incorporating open source solutions
(OCRopus, Calamari, LAREX, …).
 Comfortable usage (Web-GUI).
 https://github.com/OCR4all

Session2 02.christian reul

  • 1. Automatic Semantic Text Tagging on Historical Lexica by Combining OCR and Typography Classification A Case Study on Daniel Sanders‘ Wörterbuch der Deutschen Sprache Christian Reul1, Sebastian Göttel2, Uwe Springmann3, Christoph Wick1, Kay-Michael Würzner2, and Frank Puppe1 1Chair for Artificial Intelligence and Applied Computer Science; University of Würzburg 2Berlin-Brandenburg Academy of Sciences and Humanities (BBAW) 3Center for Information and Language Processing (CIS); LMU Munich 09.05.2019
  • 2.  Great progress in the area of historical OCR on various materials.  But raw textual OCR sometimes not sufficient.  Typography within a lexicon  represents semantic meaning.  encodes a complex structure within the text (lemmata, definitions, grammatical information, references, possible word formations, …).  Goal: Thoroughly indexing of a historical lexicon by combining textual OCR and typography classification. Automatic Semantic Text Tagging on Historical Lexica Christian Reul, Sebastian Göttel, Uwe Springmann, Christoph Wick, Kay-Michael Würzner, Frank Puppe 1 Motivation
  • 3.  Treat the problem as two individual sequence classification tasks:  Textual OCR.  Typography classification.  Perform GT production, training, and recognition separately.  Combine the results afterwards.  Assign a distinct label to each of the typography classes: Image: Hello World OCR: Hello World Typo: nnnnn bbbbb  Use this representation to train an open source OCR engine. Automatic Semantic Text Tagging on Historical Lexica Christian Reul, Sebastian Göttel, Uwe Springmann, Christoph Wick, Kay-Michael Würzner, Frank Puppe 2 Basic Idea
  • 4.  Wörterbuch der deutschen Sprache by famous German lexicographer Daniel Sanders (turning 200 this November).  Cooperation with the Berlin- Brandenburgische Academy of Sciences and Humanities (BBAW).  Printed between 1859 and 1865.  Three part-volumes comprising almost 3,000 pages and ca. 800,000 text lines.  Excellent print and scan quality. Automatic Semantic Text Tagging on Historical Lexica Christian Reul, Sebastian Göttel, Uwe Springmann, Christoph Wick, Kay-Michael Würzner, Frank Puppe 3 Material: Sanders’ Dictionary I
  • 5.  Main lemmata always bold Fraktur (assigned label of typographical class l).  Followed by grammatical properties in Antiqua (a).  Definitions in Fraktur (f).  Typeface of the quotations divided in  the authors name, different Fraktur type (n),  the page number (a).  Possible word formations in letter-spacing (F). Automatic Semantic Text Tagging on Historical Lexica Christian Reul, Sebastian Göttel, Uwe Springmann, Christoph Wick, Kay-Michael Würzner, Frank Puppe 4 Material: Sanders’ Dictionary II
  • 6.  Binarisation and deskewing (ocropus-nlbin).  Column segmentation.  Simple whitespace-based approach.  https://github.com/wrznr/column-detect  Deskew columns separately (ocropus-nlbin).  Line segmentation (ocropus-gpageseg).  Keep rotational angles and segment/line coordinates for later use. Automatic Semantic Text Tagging on Historical Lexica Christian Reul, Sebastian Göttel, Uwe Springmann, Christoph Wick, Kay-Michael Würzner, Frank Puppe 5 Preprocessing and Segmentation
  • 7.  Open Source OCR engine Calamari.  https://github.com/Calamari-OCR  Great recognition capabilities (CNN-LSTM) and very fast (GPU support).  Natively supports accuracy improving techniques (see below).  Voting:  Train model ensemble instead of a single model.  Combine outputs via confidence voting.  Better recognition results.  Pretraining:  Start training from an existing model instead from scratch.  Faster training and better recognition results. Automatic Semantic Text Tagging on Historical Lexica Christian Reul, Sebastian Göttel, Uwe Springmann, Christoph Wick, Kay-Michael Würzner, Frank Puppe 6 OCR Basics
  • 8.  Manually transcribing the typography GT cumbersome and error prone.  Observation: The typography does not change within a word.  Idea: Use the OCR GT and label all characters of a word at once.  Example (to the right):  Input (at the top): OCR GT and the line image.  Transcription steps: (1) The first word is highlighted and labelled at once. (2-4) Repeating step 1 for the next words. (5) All remaining words can be labelled in one go. (6) Final OCR and typography GT result. Automatic Semantic Text Tagging on Historical Lexica Christian Reul, Sebastian Göttel, Uwe Springmann, Christoph Wick, Kay-Michael Würzner, Frank Puppe 7 Ground Truth Production
  • 9.  Voting ensemble consisting of five Calamari models.  Highly performant mixed Fraktur model as a starting point:  https://github.com/chreul/19th-century-fraktur-OCR  Able to recognize 93 distinct characters (Sanders contains over 150).  Calamari extended recognition output for each character:  Voted probability for the most likely character and its top alternatives.  Start and end positions. Automatic Semantic Text Tagging on Historical Lexica Christian Reul, Sebastian Göttel, Uwe Springmann, Christoph Wick, Kay-Michael Würzner, Frank Puppe 8 Training and Recognition
  • 10.  Alignment on word level: Assign typography output to the words based on the character positions.  Typography voting: Identify most likely label for each word by confidence voting.  Final output: JSON file containing:  OCR and typography label for each word.  A words minimal character confidence.  Segment, line, and word bounding boxes with respect to the original scan. Automatic Semantic Text Tagging on Historical Lexica Christian Reul, Sebastian Göttel, Uwe Springmann, Christoph Wick, Kay-Michael Würzner, Frank Puppe 9 Combining the Outputs Typography alignment for an example line. From top to bottom: Line image with OCR whitespace positions (|). Textual OCR output. Typography output with character positions ( '). (Slightly flawed) textual typography output on character level. Final combined output with typography classes assigned on word level.
  • 11. Experiments – Data and Performance Measures
    - Full set of training GT: 765 lines.
    - Subsets (400, 200, 100, 50 lines) to examine the influence of the number of GT lines.
    - Evaluation set: six columns comprising 630 lines.
    - OCR: Character Error Rate (CER) calculated using Calamari’s eval script.
    - Typography does not change within a word → Word Error Rate (WER) makes sense:
      - Collapse each word in the voted output to a single character.
      - Remove all whitespace.
      - Example: aaaa ffffffffff fff nnnnn ffff → affnf.
      - Calculate CER using analogously preprocessed GT.
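The typography WER described above can be sketched as follows. This is not Calamari's eval script: the helper names are hypothetical, the edit distance is a plain Levenshtein implementation, and normalizing by the length of the collapsed GT is an assumption.

```python
def collapse(line: str) -> str:
    """Collapse each word to a single label character and drop whitespace."""
    return "".join(word[0] for word in line.split())


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def typography_wer(pred: str, gt: str) -> float:
    """CER on the collapsed strings, i.e. a WER on typography labels."""
    p, g = collapse(pred), collapse(gt)
    return levenshtein(p, g) / len(g)


gt = "aaaa ffffffffff fff nnnnn ffff"
pred = "aaaa ffffffffff nnnnn ffff"  # one word lost to a missed whitespace
print(collapse(gt), typography_wer(pred, gt))
```

Taking the first character of each word is sufficient because the voted output carries a single label per word.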
  • 12. Experiments – OCR
    - More training lines → lower CER.
    - Excellent CER of 0.35% when training on all available lines.
    - Most frequent errors: insertions and deletions of whitespaces.
    - Standard approaches cannot deal with the peculiarities of the material.

      # Lines   Calamari   ABBYY
      -         3.69%      10.28%
      50        1.83%      -
      100       1.05%      -
      200       0.67%      -
      400       0.43%      -
      765       0.35%      -
  • 13. Experiments – Typography
    - More training lines → lower WER.
    - Correct typography label assigned to over 98.5% of words.
    - Data augmentation yields minor improvements (1.38% WER).
    - Most frequent errors: insertions and deletions of words resulting from misrecognized whitespaces.
    - Short words especially susceptible to errors.

      # Lines   WER
      50        9.82%
      100       4.08%
      200       2.66%
      400       1.72%
      765       1.47%

      Most frequent confusions (columns: GT, Pred., Count, Perc.; rows with a single label are insertions or deletions of a word):
      f          15   10.0%
      f          15   10.0%
      a          10    6.7%
      a → f       6    4.0%
      f → F       4    2.7%
  • 14. Discussion
    - Typography recognition possible and very precise.
      - Despite several very similar typography classes.
    - Flexible approach using an open source OCR engine.
    - Efficient GT production method.
    - Main problem: insertion and deletion of whitespaces.
    - Typography in Sanders’ dictionary ambiguous.
    - Subsequent rule-based postprocessing step required to produce TEI output.
    - Enables complex search queries like: “show all lemmata which include Goethe as a source”.
  • 15. Example from the Online Dictionary (work in progress)
  • 16. Conclusion and Future Work
    - Successful case study on a challenging real world dictionary.
    - Hope / Aim: Generic workflow to obtain complete electronic representations of (historical) lexica.
    - Further experiments needed (other lexica, different typographical attributes).
    - Meta learner judging whitespaces proposed by the OCR and typography models.
    - Type-specific OCR models to further increase the accuracy.
    - Application on word instead of line level.
    - Already promising results.
    - Schweizerisches Idiotikon (https://www.idiotikon.ch)
  • 17. Thank you for your Attention!
    - Calamari: https://github.com/Calamari-OCR
    - OCR4all: https://github.com/OCR4all
    - GT Production: https://github.com/ChWick/ocrgtannotator
    - Reul, Springmann, Wick, Puppe: Improving OCR Accuracy on Early Printed Books by combining Pretraining, Voting, and Active Learning.
    - Ul-Hasan, Afzal, Shafait, Liwicki, Breuel: A Sequence Learning Approach for Multiple Script Identification.
    - Wick, Reul, Puppe: Comparison of OCR Accuracy on Early Printed Books using the Open Source Engines Calamari and OCRopus.
  • 18.
  • 19. Word and Type Statistics

      Length   #       Perc.
      1        711     5.8%
      2        1,622   13.2%
      3        2,883   23.6%
      4        1,754   14.3%
      5        1,254   10.2%
      >5       4,018   32.8%

               a       f        F       l       N       All
      Words    2,754   8,066    363     469     589     12,241
      Perc.    22.5%   65.9%    3.0%    3.8%    4.8%    100.0%
      Chars    8,365   40,682   2,636   3,416   3,424   58,523
      Perc.    14.3%   69.5%    4.5%    5.8%    5.9%    100.0%
      Length   3.04    5.04     7.26    7.28    5.81    4.78
  • 20. OCR Errors per Type
    Rows: number of GT characters per type, then errors (count and rate) when training on 50, 200, and 765 lines. Error cells are matched to the GT counts their percentages correspond to.

      Type   a            f             F            l            n
      GT     2,580        21,936        747          1,768        333
      50     76 (2.95%)   260 (1.19%)   36 (4.82%)   98 (5.54%)   22 (6.61%)
      200    16 (0.62%)   64 (0.29%)    16 (2.14%)   51 (2.88%)   17 (5.11%)
      765    4 (0.16%)    40 (0.18%)    17 (2.28%)   25 (1.41%)   8 (2.40%)
  • 21. Typography – Error Analysis
    - Postprocessing yields no noteworthy improvements on character level.
      - Unexpected since: ffffnff → fffffff.
    - Dominant errors: insertions and deletions.
    - Missed whitespaces can introduce errors: aaaaffff → aaaaaaaa (GT: aaaa ffff).
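The two examples on this slide can be reproduced with a small sketch, assuming the postprocessing is a per-word majority vote over the character-level typography output. The slides do not specify the exact rule, so the function `smooth` and its tie-breaking (the label seen first wins) are assumptions.

```python
from collections import Counter


def smooth(typo_line: str) -> str:
    """Replace every label within a whitespace-delimited word by its majority label."""
    smoothed = []
    for word in typo_line.split(" "):
        if not word:
            smoothed.append(word)  # keep empty tokens from repeated spaces
            continue
        # most_common orders equal counts by first occurrence (Python 3.7+).
        majority = Counter(word).most_common(1)[0][0]
        smoothed.append(majority * len(word))
    return " ".join(smoothed)


print(smooth("ffffnff"))    # isolated error inside a word is repaired
print(smooth("aaaaffff"))   # missed whitespace: two words are merged into one
print(smooth("aaaa ffff"))  # with the whitespace present, both labels survive
```

The second call shows why a missed whitespace is costly: the vote is taken over the merged token, so one of the two words necessarily receives the wrong label.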
  • 22. Typography – Data Augmentation
    - Using the OCRopus3 ocrodeg module:
      - Data augmentation improves the results.
      - The more augmentations the better, but saturation quickly kicks in.
      - The fewer real lines available, the bigger the effect.
  • 23. OCR4all
    - Goal: enable non-technical users to independently capture historical printings with high accuracy.
    - Encapsulating a comprehensive OCR workflow in a single Docker image.
      - Platform-independent.
      - Easy installation.
    - Incorporating open source solutions (OCRopus, Calamari, LAREX, …).
    - Comfortable usage (Web-GUI).
    - https://github.com/OCR4all