SlideShare a Scribd company logo
1 of 65
Download to read offline
An Integrated System for
Polytonic Greek OCR
I. Generating the Data
Bruce Robertson, Dept. of Classics, Mount
Allison University, New Brunswick, Canada
Digital Classicist Seminar, Institute of Classical
Studies, London, UK, July 19, 2013
I. Generating the Data
A. Reason
Why Ancient Greek OCR?
1. Rapid digitization of Greek texts not yet in
digital libraries
Why Ancient Greek OCR?
1. Rapid digitization of Greek texts not yet in
digital libraries
2. Study of textual variants and app. crit.
Why Ancient Greek OCR?
1. Rapid digitization of Greek texts not yet in
digital libraries
2. Study of textual variants and app. crit.
3. Text reuse analysis
Why Ancient Greek OCR?
1. Rapid digitization of Greek texts not yet in
digital libraries
2. Study of textual variants and app. crit.
3. Text reuse analysis
4. General-purpose OCR search, like Google
Books
Use Manual Editing? Automatic Spell-
checking?
Digitization ✓ ✓
Textual Variants ✗ ✗
Text Reuse ✗ ✓
OCR Search ✗ ✗ or ✓
Use Manual Editing? Automatic Spell-
checking?
Digitization ✓ ✓
Textual Variants ✗ ✗
Text Reuse ✗ ✓
OCR Search ✗ ✗ or ✓
I. Generating the Data
B. Challenge
Example Text: John 1:1
Ἐν ἀρχῇ ἦν ὁ Λόγος,
καὶ ὁ Λόγος ἦν πρὸς τὸν Θεόν,
καὶ Θεὸς ἦν ὁ Λόγος.
Acute, Grave and Circumflex Accents
Ἐν ἀρχῇ ἦν ὁ Λόγος,
καὶ ὁ Λόγος ἦν πρὸς τὸν Θεόν,
καὶ Θεὸς ἦν ὁ Λόγος.
Smooth and Rough Breathing Marks
Ἐν ἀρχῇ ἦν ὁ Λόγος,
καὶ ὁ Λόγος ἦν πρὸς τὸν Θεόν,
καὶ Θεὸς ἦν ὁ Λόγος.
Iota Subscript
Ἐν ἀρχῇ ἦν ὁ Λόγος,
καὶ ὁ Λόγος ἦν πρὸς τὸν Θεόν,
καὶ Θεὸς ἦν ὁ Λόγος.
Diversity of Greek Fonts in 19th C.
archive.org Texts
Recognizing Lines
Recognizing Lines
Accurate Binarization
Binarization Important to Results
I. Generating the Data
C. Resources
Contextless 'Greekness' Index
● Devised by Dr. Boschetti
● Based on dictionary and likely sequences of
letters, etc.
● Named 'B-score' in these slides
Archive.org
● Provides:
○ Thousands of volumes rendered in high-
resolution (400 ppi +) colour images
○ OCR results from ABBYY Finereader
■ Excellent Latin-script recognition
■ Poor Greek results
■ Top-quality line-segmentation
Open-source OCR Engines
● Gamera
○ Current focus of my team
● Tesseract
○ Nick White has worked extensively on this to generate
good results
● OCRopus
○ Dr. Boschetti recently has been able to use Tesseract
training sets for this engine
Interchange Format: HOCR
<p>
<span class="ocr_line" title="bbox 196 751
1059 808">
<span class="ocr_word" title="bbox 196 769
269 794" xml:lang="grc">νῦν</span>
<span class="ocr_word" title="bbox 301 752
326 793" xml:lang="grc">δ'</span>
<span class="ocr_word" title="bbox 378 755
598 804" xml:lang="grc">ὤρθωσας</span>
</p>
I. Generating the Data
D. Method
HOCR
Output
Page Segmentation Thru HOCR
Input
Gamera 3.3.3 Image Recognition
JP2 Input
Library
Greek OCR For Gamera
Classifiers for Teubner Sans, Teubner Serif, Oxford (Loeb, etc.)
HOCR Results at a range of
binarization thresholds
Parallel
Process
x35
Cores
410,000 word
dictionary from
open Perseus
Greek texts
Weighted
Levenshtein
Automatic OCR
Spellchecker
(x14 cores)
Per-volume
spellcheck
table
Reduction to unique
Greek strings
Images from
Archive.org
ABBYY OCR
Information file
ABBYY to
HOCR
Conversion
Score table for
binarization
thresholds
Select highest-ranking
binarization page
Weighted Edit
Table for Classifier
Replace
spellchecked words
...
HOCR Output
Replace Latin-script
output words with ones
in same position from
Archive's ABBYY
output Does ABBYY
OCR file
contain Latin-
script output?
Rigaudon Greek
OCR Process
Automatic
OCR
Spellcheck
OCR
HOCR Latin /
Greek
Combining
HOCR
"Blending"
Boschetti scoring
Replace non-dictionary words with
dict words from other binarization
pages
Raw HOCR Production Using Gamera
● Plugin for Gamera OCR allows it to import high-quality
line-segmentation information, compensating for Gamera's
poor results in this critical function
● Plugin to output HOCR
● Wrapper function generates a range of output pages based
on binarization threshold (typically 10 - 20 per page)
HOCR 'Blending'
● This step aims to gather word-by-word the 'best' results
from the range of results pages for each image
● Selects the highest-scoring result page overall
● Where a Greek word in this page is not in the dictionary
and another page has a dictionary word in the exact same
physical location, it replaces with dictionary word
Automatic Spellcheck
● All pages in volume are reduced to a set of unique,
decomposed Greek strings
● These are compared to dictionary using Levenshtein
distances
● A 'weighting table', suitable for a given font, indicates
which edits are preferable or allowed
● Result is 'light' correction, esp. of diacritics
Weighting Table
['replace', ur'ϲϹ', ur'σςΣ', 1],#for lunate fonts
['replace', ur'c', ur'σς', 1],#for lunate fonts
['replace', ur'T', ur'Ττ', 1],
['replace', ur'Τr', ur'τ', 1],
['replace', ur'Uu', ur'υ', 1],
['replace', ur'Y', ur'Υ', 1],
['replace', ur'E', ur'Ε', 1],
['replace', ur'E', ur'ε', 2],
['replace', ur'Z', ur'Ζ', 1],
['replace', ur'K', ur'κΚ', 1],
Automatic Spellcheck
Optionally injecting Greek into
Original Latin HOCR
● Don't want to try to get excellent Greek and Latin results,
esp. when ABBYY and others do better job with Latin
● In the case that archive.org provides Latin OCR:
○ If Rigaudon's output word is Greek, replace archive.
org's ABBYY output word with Rigaudon's
Reporting
I. Generating the Data
F. Results
Περὶ δὲ δικαιοσύνης καὶ ἀδικίας ...
τε τυγχάνουσιν οὖσαι πράξεις, καὶ ποία ...
κότος ἀλλ' οὐ τῆς κομιζούσης φύσεως κτῆμα ...
τπ οὖν, ὡς ἓν ἔργον ἡμῖν ἐπιβάλλει μόνον ...
Results
πιπλαττομένω[ν ὴ μὴ
κ[αλ]ῶς ά̓λλως προσ[τιθεμέ-
κράτος τ' ἰσόψυχον ἐκ γυναικῶν
καρδιόδηκτον ἐμοὶ κρατύνεις.
Results
I. Generating the Data
G. Future
Multiple OCR Engines
● Take ABBYY data out of the process
○ With 'cleaning' Tesseract's line-segmentation is often
as good
● Use Nick White's general-purpose polytonic classifier and
ones specifically designed for a font
HOCR Results at a range of
binarization thresholds
Score table for
binarization
thresholds
Select highest-ranking
binarization page
Boschetti scoring
Replace non-dictionary words with
dict words from other binarization
pages
TesseractGameraOCRopus
Line Segmentation
OCR
HOCR
"Blending"
Resources
Output:
http://heml.mta.ca/rigaudon
Code:
https://github.com/brobertson/rigaudon
Further Topics
HPC Computing with Grid Engine
Python Flask Web Microframework
Making Book Images
An Integrated System for Generating and
Correcting Polytonic Greek OCR
Bruce Robertson and Federico Boschetti
Part II
The Proof-reading Process
Federico Boschetti∗
federico.boschetti@ilc.cnr.it
∗ILC-CNR of Pisa
Digital Classicist Seminars – London, 19 July 2013
Federico Boschetti Generating and Correcting Polytonic Greek OCR 1/ 20
Information Aggregation
Proof-reader Web Application
False positives
Introduction
Manual corrections on OCR output may be performed by
Experts Classicists devoted to proof-reading for a long-term
project
Data Entry Firms Professional proof-readers not skilled in
the target language(s)
Crowd Sourcing Students that are learning the target
language(s)
Random Volunteers People with heterogeneous education
and skills
Federico Boschetti Generating and Correcting Polytonic Greek OCR 1/ 20
Information Aggregation
Proof-reader Web Application
False positives
Introduction
For this reason proof-reading tools focused on ancient languages
should be
centralized
easy to use
based on image / text comparison line by line
optimized to catch attention on possible errors, distinguished
by category
efficiently providing the most probable correction
Federico Boschetti Generating and Correcting Polytonic Greek OCR 2/ 20
Information Aggregation
Proof-reader Web Application
False positives
Enriched hocr files
Alignment with other editions
False negatives
Overview
1 Information Aggregation
Enriched hocr files
Alignment with other editions
False negatives
2 Proof-reader Web Application
3 False positives
Federico Boschetti Generating and Correcting Polytonic Greek OCR 3/ 20
Information Aggregation
Proof-reader Web Application
False positives
Enriched hocr files
Alignment with other editions
False negatives
Enriched hocr files
OCR output formatted in hocr microformat
The hocr output produced by Rigaudon is postprocessed, in order
to add information managed by the Proof-reading Web Application
Multiple sources
Dictionaries with and without diacritics
Multiple editions of the same work (if available)
Syllabic repertory
Federico Boschetti Generating and Correcting Polytonic Greek OCR 3/ 20
Information Aggregation
Proof-reader Web Application
False positives
Enriched hocr files
Alignment with other editions
False negatives
Dictionaries
In order to identify possible errors and provide good suggestions to
correct them, the OCR output is spell-checked and the potential
errors are processed step by step
The spell-checker is based on dictionaries generated from Perseus’
text collection. An upper-case dictionary is used to evaluate if a
character sequence is a word with a wrong accent or breathing
mark
Federico Boschetti Generating and Correcting Polytonic Greek OCR 4/ 20
Information Aggregation
Proof-reader Web Application
False positives
Enriched hocr files
Alignment with other editions
False negatives
Alignment with other editions
When another edition of the same work is available, the two
editions are aligned word by word applying the Needleman-Wunsch
algorithm
ὁ Γαδαρεὺς ἐν ταῖς Χάρισιν ἐπιγραφομέναις ἔφη τὸν ῞Ομηρον Σύρον ὄντα τὸ
| | | | | | | | | | | | |
ὁ Γαδαρεὺς ἐν τ αχς Σάρισιν ἐπιγραφομέναις ἔφη τὸν ῞Ομeρον Σύρον ὄντα τὸ
Federico Boschetti Generating and Correcting Polytonic Greek OCR 5/ 20
Information Aggregation
Proof-reader Web Application
False positives
Enriched hocr files
Alignment with other editions
False negatives
False negatives and the risk of digital contaminatio
An example
Rigaudon on the Anecdota Graeca edited by Cramer
recognizes the word χόρος, which is rejected by the current
spellchecker
The spell-checker suggests χορός as a correction
Also the alignment with Koster’s edition of the Prolegomena
de comoedia suggests χορός
But the page image contains χόρος, a late form attested from
Athenaeus to the Byzantine period
Federico Boschetti Generating and Correcting Polytonic Greek OCR 6/ 20
Information Aggregation
Proof-reader Web Application
False positives
Enriched hocr files
Alignment with other editions
False negatives
Syllabication
In order to prevent false negatives due to (rare) variants
ignored by the dictionaries, the system distinguishes between
random character sequences and well-formed syllabic
sequences
Each potential error is divided in syllables and each syllable is
evaluated according to its position
For example, χό-ρος is a well-formed syllabic sequence: χό- is
a valid Greek initial syllable and -ρος is a valid final Greek
syllable
Federico Boschetti Generating and Correcting Polytonic Greek OCR 7/ 20
Information Aggregation
Proof-reader Web Application
False positives
The web interface
Cues
Self-corrections
Overview
1 Information Aggregation
2 Proof-reader Web Application
The web interface
Cues
Self-corrections
3 False positives
Federico Boschetti Generating and Correcting Polytonic Greek OCR 8/ 20
Information Aggregation
Proof-reader Web Application
False positives
The web interface
Cues
Self-corrections
Centralization
The proof-reader is a web application inspired by the Mozilla
hocr Editor interface but employs the WikiSource
collaborative philosophy
Texts are stored in a central XML native database
Federico Boschetti Generating and Correcting Polytonic Greek OCR 8/ 20
Information Aggregation
Proof-reader Web Application
False positives
The web interface
Cues
Self-corrections
The Control Panel
Federico Boschetti Generating and Correcting Polytonic Greek OCR 9/ 20
Information Aggregation
Proof-reader Web Application
False positives
The web interface
Cues
Self-corrections
Image / Text Pairs
Federico Boschetti Generating and Correcting Polytonic Greek OCR 10/ 20
Information Aggregation
Proof-reader Web Application
False positives
The web interface
Cues
Self-corrections
Cues
Wrong accents and breathing marks Attention is focused
on diacritics
Self-corrections Special care is necessary to avoid the risk of
contaminatio
Errors Random errors
Federico Boschetti Generating and Correcting Polytonic Greek OCR 11/ 20
Information Aggregation
Proof-reader Web Application
False positives
The web interface
Cues
Self-corrections
Example
Federico Boschetti Generating and Correcting Polytonic Greek OCR 12/ 20
Information Aggregation
Proof-reader Web Application
False positives
The web interface
Cues
Self-corrections
Self-corrections and suggestions generated by alignment
In a self-correction, the reading has been substituted by the aligned
word of another edition. Self corrections need three conditions:
character sequence is refused by the spell-checker
edit distance between the character sequence and the aligned
edition is very close
the character sequence is not a well-formed syllabic sequence
Federico Boschetti Generating and Correcting Polytonic Greek OCR 13/ 20
Information Aggregation
Proof-reader Web Application
False positives
The web interface
Cues
Self-corrections
Example
Federico Boschetti Generating and Correcting Polytonic Greek OCR 14/ 20
Information Aggregation
Proof-reader Web Application
False positives
The web interface
Cues
Self-corrections
Dynamic Dictionaries
Dictionaries used by the spell-checker are dynamically rebuilt
when a milestone in proof-reading is reached
Enlarging the dictionaries, rare variants are acquired and used
to spell-check the next works
Federico Boschetti Generating and Correcting Polytonic Greek OCR 15/ 20
Information Aggregation
Proof-reader Web Application
False positives
Overview
1 Information Aggregation
2 Proof-reader Web Application
3 False positives
Federico Boschetti Generating and Correcting Polytonic Greek OCR 16/ 20
Information Aggregation
Proof-reader Web Application
False positives
False positives are deceitful
By definition, false positives pass the spell-checking
Specially if they are graphically similar to the correct word,
such as δ and ὁ in Greek or m and ni in Latin, they are
difficult to be seen, in particular by proof-readers not skilled in
the target language(s)
Federico Boschetti Generating and Correcting Polytonic Greek OCR 16/ 20
Information Aggregation
Proof-reader Web Application
False positives
Example
Federico Boschetti Generating and Correcting Polytonic Greek OCR 17/ 20
Information Aggregation
Proof-reader Web Application
False positives
Example
Federico Boschetti Generating and Correcting Polytonic Greek OCR 17/ 20
Information Aggregation
Proof-reader Web Application
False positives
Semantic Distance
Semantic distance is calculated along the nodes of WordNet’s
hierarchy, i.e. along the chain of hyponyms / hypernyms, in
order to reach co-hyponyms
Different translations of the same concepts (e.g. vis in Latin
and efficacia in Italian or efficacy in English) have semantic
distance equal to zero
Semantically unrelated words (e.g. vinum in Latin and
efficacia in Italian) have a large semantic distance
Federico Boschetti Generating and Correcting Polytonic Greek OCR 18/ 20
Information Aggregation
Proof-reader Web Application
False positives
AncientWordNet
Synsets of AncientGreekWordNet and LatinWordNet have
been extracted from bilingual dictionaries
They are aligned to modern languages such as English, Italian,
etc.
Federico Boschetti Generating and Correcting Polytonic Greek OCR 19/ 20
Information Aggregation
Proof-reader Web Application
False positives
Conclusion
The proof-reading Web Application puts together the main
features of individual and collaborative proof-reading tools
currently available
The entire work-flow is circular: Training OCR - Performing
OCR - Spell-checking OCR - Correcting OCR - Enlarging
dictionaries - Retraining OCR
Federico Boschetti Generating and Correcting Polytonic Greek OCR 20/ 20
Information Aggregation
Proof-reader Web Application
False positives
Thank you for your attention
Federico Boschetti Generating and Correcting Polytonic Greek OCR 20/ 20
Information Aggregation
Proof-reader Web Application
False positives
References
S. Feng, R. Manmatha: A Hierarchical, HMM-based Automatic Evaluation of OCR Accuracy for a Digital
Library of Books. JCDL 2006, 109–118 (2006)
W.B. Lund, E.K. Ringger: Improving Optical Character Recognition through Efficient Multiple System
Alignment, JCDL (2009)
M. Reynaert: Non-interactive OCR Post-correction for Giga-Scale Digitization Projects. A. Gelbukh (ed.):
CICLing 2008, LNCS 4919, 617–630 (2008)
M. Reynaert: All, and only, the Errors: more Complete and Consistent Spelling and OCR-Error Correction
Evaluation. 6th International Conference on Language Resources and Evaluation 2008, 1867–1872 (2008)
C. Ringlstetter, K. Schulz, S. Mihov, K. Louka: The same is not the same - postcorrection of alphabet
confusion errors in mixed-alphabet OCR recognition. 8th International Conference on Document Analysis
and Recognition, 1, 406–410 (2005)
M. Spencer, C. Howe: Collating texts using progressive multiple alignment. Computer and the Humanities,
37, 1, 97–109 (2003)
G. Stewart, G. Crane, A. Babeu: A New Generation of Textual Corpora. JCDL 2007, 356–365 (2007)
L. Zhuang, X. Zhu: An OCR Post-processing Approach Based on Multi-knowledge. 9th International
Conference on Knowledge-Based Intelligent Information and Engineering Systems, 346–352 (2005)
Federico Boschetti Generating and Correcting Polytonic Greek OCR 20/ 20

More Related Content

Similar to Digital Classicist London Seminars 2013 - Seminar 7 - Federico Boschetti & Bruce Robertson

Datech2014 - Session 3 - PoCoTo - An Open Source System For Efficient Interac...
Datech2014 - Session 3 - PoCoTo - An Open Source System For Efficient Interac...Datech2014 - Session 3 - PoCoTo - An Open Source System For Efficient Interac...
Datech2014 - Session 3 - PoCoTo - An Open Source System For Efficient Interac...IMPACT Centre of Competence
 
Bitextor: harvest your own parallel corpora from the Web, Miquel Esplà-Gomis,...
Bitextor: harvest your own parallel corpora from the Web, Miquel Esplà-Gomis,...Bitextor: harvest your own parallel corpora from the Web, Miquel Esplà-Gomis,...
Bitextor: harvest your own parallel corpora from the Web, Miquel Esplà-Gomis,...TAUS - The Language Data Network
 
BL Demo Day - July2011 - (5) OCR for IMPACT Part 2
BL Demo Day - July2011 - (5) OCR for IMPACT Part 2BL Demo Day - July2011 - (5) OCR for IMPACT Part 2
BL Demo Day - July2011 - (5) OCR for IMPACT Part 2IMPACT Centre of Competence
 
Multilingualism ifla 2014 08
Multilingualism ifla 2014 08Multilingualism ifla 2014 08
Multilingualism ifla 2014 08Janifer Gatenby
 
Compiler design Introduction
Compiler design IntroductionCompiler design Introduction
Compiler design IntroductionAman Sharma
 
Cork AI Meetup Number 3
Cork AI Meetup Number 3Cork AI Meetup Number 3
Cork AI Meetup Number 3Nick Grattan
 
Corpora, tracked changes, and PDFs: some useful tips, at no cost!
Corpora, tracked changes, and PDFs: some useful tips, at no cost!Corpora, tracked changes, and PDFs: some useful tips, at no cost!
Corpora, tracked changes, and PDFs: some useful tips, at no cost!Patricia Maria Ferreira Larrieux
 
Diachronic Analysis of Language exploiting Google Ngram
Diachronic Analysis of Language exploiting Google NgramDiachronic Analysis of Language exploiting Google Ngram
Diachronic Analysis of Language exploiting Google NgramAnnalina Caputo
 
Online handwritten script recognition (synopsis)
Online handwritten script recognition (synopsis)Online handwritten script recognition (synopsis)
Online handwritten script recognition (synopsis)Mumbai Academisc
 
Enriching the semantic web tutorial session 1
Enriching the semantic web tutorial session 1Enriching the semantic web tutorial session 1
Enriching the semantic web tutorial session 1Tobias Wunner
 
What if-your-application-could-speak, by Marcos Silveira
What if-your-application-could-speak, by Marcos SilveiraWhat if-your-application-could-speak, by Marcos Silveira
What if-your-application-could-speak, by Marcos SilveiraThoughtworks
 
What if-your-application-could-speak
What if-your-application-could-speakWhat if-your-application-could-speak
What if-your-application-could-speakMarcos Vinícius
 

Similar to Digital Classicist London Seminars 2013 - Seminar 7 - Federico Boschetti & Bruce Robertson (20)

Php packages
Php packagesPhp packages
Php packages
 
Datech2014 - Session 3 - PoCoTo - An Open Source System For Efficient Interac...
Datech2014 - Session 3 - PoCoTo - An Open Source System For Efficient Interac...Datech2014 - Session 3 - PoCoTo - An Open Source System For Efficient Interac...
Datech2014 - Session 3 - PoCoTo - An Open Source System For Efficient Interac...
 
IMPACT Final Conference - Asaf Tzadok
IMPACT Final Conference - Asaf TzadokIMPACT Final Conference - Asaf Tzadok
IMPACT Final Conference - Asaf Tzadok
 
Moses
MosesMoses
Moses
 
methods and resources
methods and resourcesmethods and resources
methods and resources
 
Bitextor: harvest your own parallel corpora from the Web, Miquel Esplà-Gomis,...
Bitextor: harvest your own parallel corpora from the Web, Miquel Esplà-Gomis,...Bitextor: harvest your own parallel corpora from the Web, Miquel Esplà-Gomis,...
Bitextor: harvest your own parallel corpora from the Web, Miquel Esplà-Gomis,...
 
BL Demo Day - July2011 - (5) OCR for IMPACT Part 2
BL Demo Day - July2011 - (5) OCR for IMPACT Part 2BL Demo Day - July2011 - (5) OCR for IMPACT Part 2
BL Demo Day - July2011 - (5) OCR for IMPACT Part 2
 
Multilingualism ifla 2014 08
Multilingualism ifla 2014 08Multilingualism ifla 2014 08
Multilingualism ifla 2014 08
 
Session5 04.evangelos varthis
Session5 04.evangelos varthisSession5 04.evangelos varthis
Session5 04.evangelos varthis
 
Compiler design Introduction
Compiler design IntroductionCompiler design Introduction
Compiler design Introduction
 
CROSER
CROSERCROSER
CROSER
 
co:op-READ-Convention Marburg - Enrique Vidal
co:op-READ-Convention Marburg - Enrique Vidalco:op-READ-Convention Marburg - Enrique Vidal
co:op-READ-Convention Marburg - Enrique Vidal
 
Google code search
Google code searchGoogle code search
Google code search
 
Cork AI Meetup Number 3
Cork AI Meetup Number 3Cork AI Meetup Number 3
Cork AI Meetup Number 3
 
Corpora, tracked changes, and PDFs: some useful tips, at no cost!
Corpora, tracked changes, and PDFs: some useful tips, at no cost!Corpora, tracked changes, and PDFs: some useful tips, at no cost!
Corpora, tracked changes, and PDFs: some useful tips, at no cost!
 
Diachronic Analysis of Language exploiting Google Ngram
Diachronic Analysis of Language exploiting Google NgramDiachronic Analysis of Language exploiting Google Ngram
Diachronic Analysis of Language exploiting Google Ngram
 
Online handwritten script recognition (synopsis)
Online handwritten script recognition (synopsis)Online handwritten script recognition (synopsis)
Online handwritten script recognition (synopsis)
 
Enriching the semantic web tutorial session 1
Enriching the semantic web tutorial session 1Enriching the semantic web tutorial session 1
Enriching the semantic web tutorial session 1
 
What if-your-application-could-speak, by Marcos Silveira
What if-your-application-could-speak, by Marcos SilveiraWhat if-your-application-could-speak, by Marcos Silveira
What if-your-application-could-speak, by Marcos Silveira
 
What if-your-application-could-speak
What if-your-application-could-speakWhat if-your-application-could-speak
What if-your-application-could-speak
 

Recently uploaded

POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxmanuelaromero2013
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxOH TEIK BIN
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfUmakantAnnand
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application ) Sakshi Ghasle
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentInMediaRes1
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991RKavithamani
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docxPoojaSen20
 

Recently uploaded (20)

POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Staff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSDStaff of Color (SOC) Retention Efforts DDSD
Staff of Color (SOC) Retention Efforts DDSD
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
 
Hybridoma Technology ( Production , Purification , and Application )
Hybridoma Technology  ( Production , Purification , and Application  ) Hybridoma Technology  ( Production , Purification , and Application  )
Hybridoma Technology ( Production , Purification , and Application )
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Alper Gobel In Media Res Media Component
Alper Gobel In Media Res Media ComponentAlper Gobel In Media Res Media Component
Alper Gobel In Media Res Media Component
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
Industrial Policy - 1948, 1956, 1973, 1977, 1980, 1991
 
MENTAL STATUS EXAMINATION format.docx
MENTAL     STATUS EXAMINATION format.docxMENTAL     STATUS EXAMINATION format.docx
MENTAL STATUS EXAMINATION format.docx
 

Digital Classicist London Seminars 2013 - Seminar 7 - Federico Boschetti & Bruce Robertson

  • 1. An Integrated System for Polytonic Greek OCR I. Generating the Data Bruce Robertson, Dept. of Classics, Mount Allison University, New Brunswick, Canada Digital Classicist Seminar, Institute of Classical Studies, London, UK, July 19, 2013
  • 2. I. Generating the Data A. Reason
  • 3. Why Ancient Greek OCR? 1. Rapid digitization of Greek texts not yet in digital libraries
  • 4. Why Ancient Greek OCR? 1. Rapid digitization of Greek texts not yet in digital libraries 2. Study of textual variants and app. crit.
  • 5. Why Ancient Greek OCR? 1. Rapid digitization of Greek texts not yet in digital libraries 2. Study of textual variants and app. crit. 3. Text reuse analysis
  • 6. Why Ancient Greek OCR? 1. Rapid digitization of Greek texts not yet in digital libraries 2. Study of textual variants and app. crit. 3. Text reuse analysis 4. General-purpose OCR search, like Google Books
  • 7. Use Manual Editing? Automatic Spell- checking? Digitization ✓ ✓ Textual Variants ✗ ✗ Text Reuse ✗ ✓ OCR Search ✗ ✗ or ✓
  • 8. Use Manual Editing? Automatic Spell- checking? Digitization ✓ ✓ Textual Variants ✗ ✗ Text Reuse ✗ ✓ OCR Search ✗ ✗ or ✓
  • 9. I. Generating the Data B. Challenge
  • 10. Example Text: John 1:1 Ἐν ἀρχῇ ἦν ὁ Λόγος, καὶ ὁ Λόγος ἦν πρὸς τὸν Θεόν, καὶ Θεὸς ἦν ὁ Λόγος.
  • 11. Acute, Grave and Circumflex Accents Ἐν ἀρχῇ ἦν ὁ Λόγος, καὶ ὁ Λόγος ἦν πρὸς τὸν Θεόν, καὶ Θεὸς ἦν ὁ Λόγος.
  • 12. Smooth and Rough Breathing Marks Ἐν ἀρχῇ ἦν ὁ Λόγος, καὶ ὁ Λόγος ἦν πρὸς τὸν Θεόν, καὶ Θεὸς ἦν ὁ Λόγος.
  • 13. Iota Subscript Ἐν ἀρχῇ ἦν ὁ Λόγος, καὶ ὁ Λόγος ἦν πρὸς τὸν Θεόν, καὶ Θεὸς ἦν ὁ Λόγος.
  • 14. Diversity of Greek Fonts in 19th C. archive.org Texts
  • 15.
  • 20. I. Generating the Data C. Resources
  • 21. Contextless 'Greekness' Index ● Devised by Dr. Boschetti ● Based on dictionary and likely sequences of letters, etc. ● Named 'B-score' in these slides
  • 22. Archive.org ● Provides: ○ Thousands of volumes rendered in high- resolution (400 ppi +) colour images ○ OCR results from ABBYY Finereader ■ Excellent Latin-script recognition ■ Poor Greek results ■ Top-quality line-segmentation
  • 23. Open-source OCR Engines ● Gamera ○ Current focus of my team ● Tesseract ○ Nick White has worked extensively on this to generate good results ● OCRopus ○ Dr. Boschetti recently has been able to use Tesseract training sets for this engine
  • 24. Interchange Format: HOCR <p> <span class="ocr_line" title="bbox 196 751 1059 808"> <span class="ocr_word" title="bbox 196 769 269 794" xml:lang="grc">νῦν</span> <span class="ocr_word" title="bbox 301 752 326 793" xml:lang="grc">δ'</span> <span class="ocr_word" title="bbox 378 755 598 804" xml:lang="grc">ὤρθωσας</span> </p>
  • 25. I. Generating the Data D. Method
  • 26. HOCR Output Page Segmentation Thru HOCR Input Gamera 3.3.3 Image Recognition JP2 Input Library Greek OCR For Gamera Classifiers for Teubner Sans, Teubner Serif, Oxford (Loeb, etc.) HOCR Results at a range of binarization thresholds Parallel Process x35 Cores 410,000 word dictionary from open Perseus Greek texts Weighted Levenshtein Automatic OCR Spellchecker (x14 cores) Per-volume spellcheck table Reduction to unique Greek strings Images from Archive.org ABBYY OCR Information file ABBYY to HOCR Conversion Score table for binarization thresholds Select highest-ranking binarization page Weighted Edit Table for Classifier Replace spellchecked words ... HOCR Output Replace Latin-script output words with ones in same position from Archive's ABBYY output Does ABBYY OCR file contain Latin- script output? Rigaudon Greek OCR Process Automatic OCR Spellcheck OCR HOCR Latin / Greek Combining HOCR "Blending" Boschetti scoring Replace non-dictionary words with dict words from other binarization pages
  • 27. Raw HOCR Production Using Gamera ● Plugin for Gamera OCR allows it to import high-quality line-segmentation information, compensating for Gamera's poor results in this critical function ● Plugin to output HOCR ● Wrapper function generates a range of output pages based on binarization threshold (typically 10 - 20 per page)
  • 28. HOCR 'Blending' ● This step aims to gather word-by-word the 'best' results from the range of results pages for each image ● Selects the highest-scoring result page overall ● Where a Greek word in this page is not in the dictionary and another page has a dictionary word in the exact same physical location, it replaces with dictionary word
  • 29. Automatic Spellcheck ● All pages in volume are reduced to a set of unique, decomposed Greek strings ● These are compared to dictionary using Levenshtein distances ● A 'weighting table', suitable for a given font, indicates which edits are preferable or allowed ● Result is 'light' correction, esp. of diacritics
  • 30. Weighting Table ['replace', ur'ϲϹ', ur'σςΣ', 1],#for lunate fonts ['replace', ur'c', ur'σς', 1],#for lunate fonts ['replace', ur'T', ur'Ττ', 1], ['replace', ur'Τr', ur'τ', 1], ['replace', ur'Uu', ur'υ', 1], ['replace', ur'Y', ur'Υ', 1], ['replace', ur'E', ur'Ε', 1], ['replace', ur'E', ur'ε', 2], ['replace', ur'Z', ur'Ζ', 1], ['replace', ur'K', ur'κΚ', 1], Automatic Spellcheck
  • 31. Optionally injecting Greek into Original Latin HOCR ● Don't want to try to get excellent Greek and Latin results, esp. when ABBYY and others do better job with Latin ● In the case that archive.org provides Latin OCR: ○ If Rigaudon's output word is Greek, replace archive. org's ABBYY output word with Rigaudon's
  • 33. I. Generating the Data F. Results
  • 34. Περὶ δὲ δικαιοσύνης καὶ ἀδικίας ... τε τυγχάνουσιν οὖσαι πράξεις, καὶ ποία ... κότος ἀλλ' οὐ τῆς κομιζούσης φύσεως κτῆμα ... τπ οὖν, ὡς ἓν ἔργον ἡμῖν ἐπιβάλλει μόνον ... Results
  • 35. πιπλαττομένω[ν ὴ μὴ κ[αλ]ῶς ά̓λλως προσ[τιθεμέ- κράτος τ' ἰσόψυχον ἐκ γυναικῶν καρδιόδηκτον ἐμοὶ κρατύνεις. Results
  • 36. I. Generating the Data G. Future
  • 37. Multiple OCR Engines ● Take ABBYY data out of the process ○ With 'cleaning' Tesseract's line-segmentation is often as good ● Use Nick White's general-purpose polytonic classifier and ones specifically designed for a font HOCR Results at a range of binarization thresholds Score table for binarization thresholds Select highest-ranking binarization page Boschetti scoring Replace non-dictionary words with dict words from other binarization pages TesseractGameraOCRopus Line Segmentation OCR HOCR "Blending"
  • 39. An Integrated System for Generating and Correcting Polytonic Greek OCR Bruce Robertson and Federico Boschetti Part II The Proof-reading Process Federico Boschetti∗ federico.boschetti@ilc.cnr.it ∗ILC-CNR of Pisa Digital Classicist Seminars – London, 19 July 2013 Federico Boschetti Generating and Correcting Polytonic Greek OCR 1/ 20
  • 40. Information Aggregation Proof-reader Web Application False positives Introduction Manual corrections on OCR output may be performed by Experts Classicists devoted to proof-reading for a long-term project Data Entry Firms Professional proof-readers not skilled in the target language(s) Crowd Sourcing Students that are learning the target language(s) Random Volunteers People with heterogeneous education and skills Federico Boschetti Generating and Correcting Polytonic Greek OCR 1/ 20
  • 41. Information Aggregation Proof-reader Web Application False positives Introduction For this reason proof-reading tools focused on ancient languages should be centralized easy to use based on image / text comparison line by line optimized to catch attention on possible errors, distinguished by category efficiently providing the most probable correction Federico Boschetti Generating and Correcting Polytonic Greek OCR 2/ 20
  • 42. Information Aggregation Proof-reader Web Application False positives Enriched hocr files Alignment with other editions False negatives Overview 1 Information Aggregation Enriched hocr files Alignment with other editions False negatives 2 Proof-reader Web Application 3 False positives Federico Boschetti Generating and Correcting Polytonic Greek OCR 3/ 20
  • 43. Information Aggregation Proof-reader Web Application False positives Enriched hocr files Alignment with other editions False negatives Enriched hocr files OCR output formatted in hocr microformat The hocr output produced by Rigaudon is postprocessed, in order to add information managed by the Proof-reading Web Application Multiple sources Dictionaries with and without diacritics Multiple editions of the same work (if available) Syllabic repertory Federico Boschetti Generating and Correcting Polytonic Greek OCR 3/ 20
  • 44. Information Aggregation Proof-reader Web Application False positives Enriched hocr files Alignment with other editions False negatives Dictionaries In order to identify possible errors and provide good suggestions to correct them, the OCR output is spell-checked and the potential errors are processed step by step The spell-checker is based on dictionaries generated from Perseus’ text collection. An upper-case dictionary is used to evaluate if a character sequence is a word with a wrong accent or breathing mark Federico Boschetti Generating and Correcting Polytonic Greek OCR 4/ 20
  • 45. Information Aggregation Proof-reader Web Application False positives Enriched hocr files Alignment with other editions False negatives Alignment with other editions When another edition of the same work is available, the two editions are aligned word by word applying the Needleman-Wunsch algorithm ὁ Γαδαρεὺς ἐν ταῖς Χάρισιν ἐπιγραφομέναις ἔφη τὸν ῞Ομηρον Σύρον ὄντα τὸ | | | | | | | | | | | | | ὁ Γαδαρεὺς ἐν τ αχς Σάρισιν ἐπιγραφομέναις ἔφη τὸν ῞Ομeρον Σύρον ὄντα τὸ Federico Boschetti Generating and Correcting Polytonic Greek OCR 5/ 20
  • 46. Information Aggregation Proof-reader Web Application False positives Enriched hocr files Alignment with other editions False negatives False negatives and the risk of digital contaminatio An example Rigaudon on the Anecdota Graeca edited by Cramer recognizes the word χόρος, which is rejected by the current spellchecker The spell-checker suggests χορός as a correction Also the alignment with Koster’s edition of the Prolegomena de comoedia suggests χορός But the page image contains χόρος, a late form attested from Athenaeus to the Byzantine period Federico Boschetti Generating and Correcting Polytonic Greek OCR 6/ 20
  • 47. Information Aggregation Proof-reader Web Application False positives Enriched hocr files Alignment with other editions False negatives Syllabication In order to prevent false negatives due to (rare) variants ignored by the dictionaries, the system distinguishes between random character sequences and well-formed syllabic sequences Each potential error is divided in syllables and each syllable is evaluated according to its position For example, χό-ρος is a well-formed syllabic sequence: χό- is a valid Greek initial syllable and -ρος is a valid final Greek syllable Federico Boschetti Generating and Correcting Polytonic Greek OCR 7/ 20
  • 48. Information Aggregation Proof-reader Web Application False positives The web interface Cues Self-corrections Overview 1 Information Aggregation 2 Proof-reader Web Application The web interface Cues Self-corrections 3 False positives Federico Boschetti Generating and Correcting Polytonic Greek OCR 8/ 20
  • 49. Information Aggregation Proof-reader Web Application False positives The web interface Cues Self-corrections Centralization The proof-reader is a web application inspired by the Mozilla hocr Editor interface but employs the WikiSource collaborative philosophy Texts are stored in a central XML native database Federico Boschetti Generating and Correcting Polytonic Greek OCR 8/ 20
  • 50. Information Aggregation Proof-reader Web Application False positives The web interface Cues Self-corrections The Control Panel Federico Boschetti Generating and Correcting Polytonic Greek OCR 9/ 20
  • 51. Information Aggregation Proof-reader Web Application False positives The web interface Cues Self-corrections Image / Text Pairs Federico Boschetti Generating and Correcting Polytonic Greek OCR 10/ 20
  • 52. Information Aggregation Proof-reader Web Application False positives The web interface Cues Self-corrections Cues Wrong accents and breathing marks Attention is focused on diacritics Self-corrections Special care is necessary to avoid the risk of contaminatio Errors Random errors Federico Boschetti Generating and Correcting Polytonic Greek OCR 11/ 20
  • 53. Information Aggregation Proof-reader Web Application False positives The web interface Cues Self-corrections Example Federico Boschetti Generating and Correcting Polytonic Greek OCR 12/ 20
  • 54. Information Aggregation Proof-reader Web Application False positives The web interface Cues Self-corrections Self-corrections and suggestions generated by alignment In a self-correction, the reading has been substituted by the aligned word of another edition. Self corrections need three conditions: character sequence is refused by the spell-checker edit distance between the character sequence and the aligned edition is very close the character sequence is not a well-formed syllabic sequence Federico Boschetti Generating and Correcting Polytonic Greek OCR 13/ 20
  • 55. Information Aggregation Proof-reader Web Application False positives The web interface Cues Self-corrections Example Federico Boschetti Generating and Correcting Polytonic Greek OCR 14/ 20
  • 56. Information Aggregation Proof-reader Web Application False positives The web interface Cues Self-corrections Dynamic Dictionaries Dictionaries used by the spell-checker are dynamically rebuilt when a milestone in proof-reading is reached Enlarging the dictionaries, rare variants are acquired and used to spell-check the next works Federico Boschetti Generating and Correcting Polytonic Greek OCR 15/ 20
  • 57. Information Aggregation Proof-reader Web Application False positives Overview 1 Information Aggregation 2 Proof-reader Web Application 3 False positives Federico Boschetti Generating and Correcting Polytonic Greek OCR 16/ 20
  • 58. Information Aggregation Proof-reader Web Application False positives False positives are deceitful By definition, false positives pass the spell-checking Specially if they are graphically similar to the correct word, such as δ and ὁ in Greek or m and ni in Latin, they are difficult to be seen, in particular by proof-readers not skilled in the target language(s) Federico Boschetti Generating and Correcting Polytonic Greek OCR 16/ 20
  • 59. Information Aggregation Proof-reader Web Application False positives Example Federico Boschetti Generating and Correcting Polytonic Greek OCR 17/ 20
  • 60. Information Aggregation Proof-reader Web Application False positives Example Federico Boschetti Generating and Correcting Polytonic Greek OCR 17/ 20
  • 61. Information Aggregation Proof-reader Web Application False positives Semantic Distance Semantic distance is calculated along the nodes of WordNet’s hierarchy, i.e. along the chain of hyponyms / hypernyms, in order to reach co-hyponyms Different translations of the same concepts (e.g. vis in Latin and efficacia in Italian or efficacy in English) have semantic distance equal to zero Semantically unrelated words (e.g. vinum in Latin and efficacia in Italian) have a large semantic distance Federico Boschetti Generating and Correcting Polytonic Greek OCR 18/ 20
  • 62. Information Aggregation Proof-reader Web Application False positives AncientWordNet Synsets of AncientGreekWordNet and LatinWordNet have been extracted from bilingual dictionaries They are aligned to modern languages such as English, Italian, etc. Federico Boschetti Generating and Correcting Polytonic Greek OCR 19/ 20
  • 63. Information Aggregation Proof-reader Web Application False positives Conclusion The proof-reading Web Application puts together the main features of individual and collaborative proof-reading tools currently available The entire work-flow is circular: Training OCR - Performing OCR - Spell-checking OCR - Correcting OCR - Enlarging dictionaries - Retraining OCR Federico Boschetti Generating and Correcting Polytonic Greek OCR 20/ 20
  • 64. Information Aggregation Proof-reader Web Application False positives Thank you for your attention Federico Boschetti Generating and Correcting Polytonic Greek OCR 20/ 20
  • 65. Information Aggregation Proof-reader Web Application False positives References S. Feng, R. Manmatha: A Hierarchical, HMM-based Automatic Evaluation of OCR Accuracy for a Digital Library of Books. JCDL 2006, 109–118 (2006) W.B. Lund, E.K. Ringger: Improving Optical Character Recognition through Efficient Multiple System Alignment, JCDL (2009) M. Reynaert: Non-interactive OCR Post-correction for Giga-Scale Digitization Projects. A. Gelbukh (ed.): CICLing 2008, LNCS 4919, 617–630 (2008) M. Reynaert: All, and only, the Errors: more Complete and Consistent Spelling and OCR-Error Correction Evaluation. 6th International Conference on Language Resources and Evaluation 2008, 1867–1872 (2008) C. Ringlstetter, K. Schulz, S. Mihov, K. Louka: The same is not the same - postcorrection of alphabet confusion errors in mixed-alphabet OCR recognition. 8th International Conference on Document Analysis and Recognition, 1, 406–410 (2005) M. Spencer, C. Howe: Collating texts using progressive multiple alignment. Computer and the Humanities, 37, 1, 97–109 (2003) G. Stewart, G. Crane, A. Babeu: A New Generation of Textual Corpora. JCDL 2007, 356–365 (2007) L. Zhuang, X. Zhu: An OCR Post-processing Approach Based on Multi-knowledge. 9th International Conference on Knowledge-Based Intelligent Information and Engineering Systems, 346–352 (2005) Federico Boschetti Generating and Correcting Polytonic Greek OCR 20/ 20