2013 easy toolsfordifficulttexts

Nada Naji, Andreas Fischer, Micheal Baechler, Marcus Liwicki
Jacques Savoy, Horst Bunke, Rolf Ingold
Nada.Naji@UniNE.ch, Marcus.Liwicki@unifr.ch
Universities of Neuchatel, Bern, and Fribourg ‐ Switzerland
Computer Science Departments

[Figure: Manuscript → Scanned image → Automatic image processing & text recognition → Meta‐data and textual content & alignment (“dem man dirre aventivre giht ...”); figure label “1000000”]

 HisDoc*: Historical Document Analysis, Recognition & Retrieval
 Synergy research project in computer science of the Swiss National Science Foundation
 Fundamental Research.
 Support by domain experts: Prof. Michael Stolz, Dr. Gabriel Viehhauser, Prof. Anton Näf,
Eva Wiedenkeller, Prof. Christoph Flüeler, Prof. Ernst Tremp, Max Bänziger
* hisdoc.unine.ch

 Image Analysis
 Handwriting Recognition
 Information Retrieval
 Conclusions & Outlook

 Goal: recognize page elements such as page numbers, initials, text
blocks, text lines, marginal notes, . . .
 Challenges: paper and parchment texture, ink bleed‐through,
stains, seams, holes, faded ink, . . .
 Approach: pyramid method, i.e., multi‐scale analysis.

 For adjusting and testing the approach, a dataset was created
 First comprehensive, publicly available research database for CS
 Three databases based on extracts of three manuscripts
1. Saint Gall DB: Abbey Library of St. Gall, Cod. Sang. 562, Carolingian
script, Latin, 9th century (60 pages, 30 for learning).
2. Parzival DB: Abbey Library of St. Gall, Cod. Sang. 857, Gothic script,
Middle High German, 13th century (47 pages, 23 for learning).
3. George Washington DB: Library of Congress, G. W. Papers, longhand,
English, 18th century (20 pages, 10 for learning).

 Out‐of‐page (OOP), background (BG), text blocks (TB),
text lines (TL).
[Chart: error [%] (y‐axis 0–14) for OOP, BG, TB and TL]

 HisDoc 2.0: Towards
Computer‐Assisted Paleography
 Integrated text localization and
script analysis
 Handle more complex
documents
 Incorporate existing meta‐data
from databases like e‐codices,
manuscripta mediaevalia, …
 Buzzwords: TEI, RDF, Linked
Data, OWL
 (thanks to SAWS ontology)
St. Gallen, Stiftsbibliothek, Cod.Sang. 863, 11th century. Folio 4

 Layout Analysis
 Handwriting Recognition
 Information Retrieval
 Conclusions & Outlook

 Goal: extract computer‐readable text from images, “reading”.
 Challenges: Character models are learned from samples,
providing learning samples is a costly, time‐consuming task,
Sayre’s paradox (1973).
 Approach: Sliding window features from normalized text line
images (Hidden Markov models and special neural networks)
dem man dirre aventivre giht
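The sliding-window idea can be illustrated with a minimal sketch. It assumes a binarized, height-normalized text line image given as a list of pixel rows (1 = ink, 0 = background) and a window that is one pixel column wide. The three features per column (ink fraction, upper contour, lower contour) are illustrative assumptions, not necessarily the feature set used in HisDoc; the function name `column_features` is hypothetical.

```python
from typing import List


def column_features(image: List[List[int]]) -> List[List[float]]:
    """Return one feature vector per pixel column of a binary text line image.

    image[r][c] is 1 for ink and 0 for background. The features are an
    assumed, simplified set: [ink fraction, upper contour, lower contour],
    with contour positions normalized by the image height.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    features = []
    for c in range(width):
        # Row indices of all ink pixels in this column, top to bottom.
        ink_rows = [r for r in range(height) if image[r][c]]
        if ink_rows:
            ink_fraction = len(ink_rows) / height
            upper = ink_rows[0] / height    # topmost ink pixel
            lower = ink_rows[-1] / height   # bottommost ink pixel
        else:
            # Convention chosen for this sketch: empty columns map to zeros.
            ink_fraction, upper, lower = 0.0, 0.0, 0.0
        features.append([ink_fraction, upper, lower])
    return features


# A tiny 4x2 "image": the left column is empty, the right has ink in rows 0-1.
tiny = [[0, 1],
        [0, 1],
        [0, 0],
        [0, 0]]
feats = column_features(tiny)
```

The resulting sequence of vectors, read left to right, is the kind of input a sequence model such as a Hidden Markov model consumes.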

 Assumptions: perfect text line extraction, known lexicon
of words
[Chart: word error [%] (y‐axis 0–35) for SG, PAR and GW]

 Text Alignment
 Given the photographed image and the transcription, map the text to
the position in the document
 This relieves crowd workers of manual alignment work
 Word Spotting
 Find all occurrences of a queried word
allez
?

 Goal: effective retrieval of textual items (lines/paragraphs)
des pfliget ouch tîvscher erde ein ort
daz ist ein warheit sunder wan
aventivre Search

 Modern English
 Scanned printed text
 5% & 20% char error rates
15
In withdrawing the riskless
principal mark‐up
disclosure proposal in the
1978 Release, the
Commission stated that it
would ''maintain close
scrutiny to prevent
excessive mark‐ups and
take enforcement action
where appropriate.''
With 5% character error rate:
ln withdrawlng the risyless
principal mary‐up
disclosure proposal in the
191W helease1 the
Commission stated that it
would 44maintain close
scrutiny to prevent
excessive mary‐ups and
taye enforcement action
where appropriate.:: 20
With 20% character error rate:
fa ‐thtlrawing the WfUefqs
priucipA mary‐up
dRclosure proposA in the
191@ M,lease, the
ComMssioa stated that it
would amUntdn close
scrutAy to preveat
excessive m=y‐upqe at nd
tttes eaforcemebt actioa
where approphate.. 2e 0

 Challenges:
 Corpus: non‐standard orthography (punctuation; spelling variation: Parcifal = Parzifal), inflectional morphology (Parcifal, Parcifale, Parcifals, Parcivalen)
 User: term confusion (which, witch, watch), spelling errors (whitch)
 Recognition errors:
▪ Word: dem → den
▪ Character: withdrawing → withdrawln5
 Other

 Approaches:
 Multiple recognition hypotheses
 Stemming (light: bookings → booking; aggressive: bookings → book)
 Decompounding: Kühlschrank → Kühl + Schrank
 n‐grams: n=4, information → info, nfor, form, …, tion
 (Massive) query expansion: US, USA, United States, the States, America, États‐Unis, États‐Unis d'Amérique, …
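The n‐gram approach listed above can be sketched in a few lines. The example splits a word into overlapping character 4‐grams and shows why this helps with recognition errors: a word and its misrecognized form still share most of their n‐grams. The function name `char_ngrams` and the treatment of short words (kept whole) are assumptions made for this illustration.

```python
def char_ngrams(word: str, n: int = 4) -> list:
    """Split a word into overlapping character n-grams.

    Words no longer than n are returned whole (a convention assumed here).
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


grams = char_ngrams("information")
# grams starts with 'info', 'nfor', 'form' and ends with 'tion'

# A clean word and a misrecognized variant still overlap in several n-grams,
# so an index built on n-grams can match them.
clean = set(char_ngrams("withdrawing"))
noisy = set(char_ngrams("withdrawlng"))
shared = clean & noisy
```

Here the two spellings share 5 of their 8 four‐grams, whereas a whole‐word index would treat them as unrelated terms.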

Manual transcription: Feirefiz vnt Parcifal
Searched text (BW): Feirefiz Feirefize vnt vnz vart vert vrîe vatr vier Parcifal Parzifal

Parzival: Parcifal, Parcival, Parzifal, Parcivale, Parcivals, Parcivalen
“dem man dirre aventivre giht”
man #36006.7
min #35656.8
mat #35452.5
nam #35424.7
arm #35296.2
nimt #35278.2
gan #35265.7
 BW1 → man (1st best word)
 BW3 → man min mat (top 3 best words)
 BW7 → man min mat nam arm nimt gan
 BW with threshold = 1.5% → man min (words scoring within 1.5% of the best)
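The selection of best words (BW) from the scored hypotheses can be sketched as follows. The top‐k variants are straightforward. For the 1.5% variant, this sketch assumes the threshold is the relative score difference to the best hypothesis, which is consistent with the scores listed here (min is about 0.97% below man, mat about 1.54% below). The function names are hypothetical.

```python
def top_k(hypotheses, k):
    """Return the k best-scoring words (higher score = better)."""
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    return [word for word, _ in ranked[:k]]


def within_threshold(hypotheses, rel=0.015):
    """Return words whose score lies within `rel` (relative) of the best score.

    Assumption: the threshold is relative to the top score.
    """
    if not hypotheses:
        return []
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    best = ranked[0][1]
    return [word for word, score in ranked if (best - score) / best <= rel]


# Hypotheses for "man" in "dem man dirre aventivre giht", as listed on the slide.
hyps = [("man", 36006.7), ("min", 35656.8), ("mat", 35452.5), ("nam", 35424.7),
        ("arm", 35296.2), ("nimt", 35278.2), ("gan", 35265.7)]

bw1 = top_k(hyps, 1)
bw3 = top_k(hyps, 3)
bw_thr = within_threshold(hyps, 0.015)
```

A threshold adapts the number of indexed alternatives to how confident the recognizer is, while a fixed k always stores the same number of words per position.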

Manual transcription
Searched text (BW)
dem zein zem dan den gein win man min dine dirre chrîe dirz dane
Amis dîner aventivre daventivre Aventivre giht gibt

 Two main versions of the corpus
 Ground‐Truth (GT)
▪ Manually transcribed by experts
▪ Error‐free text (Evaluation baseline)
 Automatic recognition version (HisDoc)
▪ Noisy text: Word‐error rate ~6%

 MRR (Mean Reciprocal Rank)
 The inverse of the rank of the first relevant item
retrieved
 Reflects the concern of a user who wants one or a
few good responses to a given request
In other words, every searcher’s dream: the top search result is what s/he’s looking for!
RR = 1 (target at rank 1), RR = 1/2 (rank 2), RR = 1/3 (rank 3), ..., RR = 0 (target not retrieved)
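The MRR definition can be written out as a short sketch. Each query contributes the reciprocal of the rank of its first relevant item, or 0 if none is retrieved, and the values are averaged over all queries. The document identifiers and function names below are made up for illustration.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1/rank of the first relevant item in the result list, 0.0 if none."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Three hypothetical queries: hit at rank 1, hit at rank 2, no hit at all.
runs = [
    (["d1", "d2"], {"d1"}),
    (["d3", "d4"], {"d4"}),
    (["d5"], {"d9"}),
]
mrr = mean_reciprocal_rank(runs)  # (1 + 1/2 + 0) / 3
```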

 User confusion of terms
▪ smashed potatoes vs. mashed potatoes
 Spelling variations
▪ color, colour
▪ Parzifal, Parcifal
 Inflectional morphology
▪ smash vs. smashed
▪ Parcifal vs. Parcifalen
 Spelling errors
▪ mashed → “mached”, “masched”
▪ laptop → “laptpo”
▪ Edit distance: (book, cook) = 1

 Trying to overcome
 User confusion, spelling variations, inflectional
morphology & even recognition errors

Recognition hypotheses (word # score) for three occurrences of “man”:
“dem man dirre aventivre giht”: man # 36006.7, min # 35656.8, mat # 35452.5, nam # 35424.7, arm # 35296.2, nimt # 35278.2, gan # 35265.7
“iwer oder decheines man”: nam # 39678.5, mann # 39166.9, mit # 39134.9, mat # 39133.0, manz # 39001.1, man # 38997.0, mit # 38974.4
“als man von siner helfe saget”: mat # 50135.5, nam # 50115.2, man # 50111.4, min # 50056.5, ram # 50056.4, nimt # 49839.0, mine # 49837.9
...

Expansion list for the term “man” (word, score):
man 39, min 18.02, mat 9.51, nam 5.4, miren 4, manz 3.16, maze 2.35, mann 2.08, dran 2.03, maz 1.75, dan 1.73, maht 1.65, mal 1.23, minen 0.96, erlan 0.84, meine 0.82, gan 0.81, han 0.75
Combined hypothesis lists: (man min mat nam arm nimt gan) (+) (nam mann mit mat manz man mit) (+) (mat nam man min ram nimt mine) (+) ...
1) Calculate scores based on frequency & ranking within each subset
2) Sort accordingly
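The two steps above can be sketched as follows. The slides do not give the scoring formula, so this example assumes a simple stand‐in: every appearance of a word adds 1/rank to its score, which rewards both frequent and highly ranked words. The scores it produces are therefore not the ones shown on the slide, and the function name `merge_hypothesis_lists` is hypothetical.

```python
from collections import defaultdict


def merge_hypothesis_lists(lists):
    """Combine ranked hypothesis lists into one scored, sorted expansion list.

    Assumed scoring (not the authors' formula): each appearance of a word
    contributes 1/rank, so frequency and ranking both raise the score.
    """
    scores = defaultdict(float)
    for hypotheses in lists:
        for rank, word in enumerate(hypotheses, start=1):
            scores[word] += 1.0 / rank          # step 1: calculate scores
    # step 2: sort by descending score, ties broken alphabetically
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# The three hypothesis lists for occurrences of "man" shown on the slides.
lists = [
    ["man", "min", "mat", "nam", "arm", "nimt", "gan"],
    ["nam", "mann", "mit", "mat", "manz", "man", "mit"],
    ["mat", "nam", "man", "min", "ram", "nimt", "mine"],
]
merged = merge_hypothesis_lists(lists)
top_words = [word for word, _ in merged[:3]]
```

With only three lists the order differs from the slide, where the scores are presumably computed over all occurrences in the corpus.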

[Figure: query expansion (“e x p a n d”) of the terms man, man, mit, nam; each term is expanded into a column of candidate words such as mit, man, min, mat, mine, gan, manz, nam, namz, nante, namn, mann, ram, nimt, ...]

 Reverse thesaurus lookup
 Edit (Levenshtein) distance
 Ld(cook, book) = 1
 Expansion term weighting
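Edit‐distance‐based expansion can be sketched as below: the Levenshtein distance counts the minimum number of single‐character insertions, deletions and substitutions, and a query term is expanded with all vocabulary words within a small distance. The vocabulary, the distance limit of 1 and the function name `expand_term` are assumptions for this illustration; term weighting is left out.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions from a to b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution or match
            ))
        previous = current
    return previous[-1]


def expand_term(term, vocabulary, max_distance=1):
    """Return all vocabulary words within max_distance edits of the term."""
    return sorted(w for w in vocabulary if levenshtein(term, w) <= max_distance)


distance = levenshtein("cook", "book")
vocabulary = ["Parzifal", "Parcival", "Parcivalen", "Feirefiz"]
expansion = expand_term("Parcifal", vocabulary)
```

In this example the spelling variants Parzifal and Parcival are one edit away from Parcifal and are added, while the inflected form Parcivalen (three edits) would need a larger limit or a stemming step.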

RR=1
RR=0.5
RR=0.3
.
.
.
RR=0
Mean RR = 0.6
Target document at rank 1.7 on average (1/0.6 ≈ 1.7)

 Parzival DB
 6% word error rate
▪ Clean queries (Q): IR degradation ~ ‐5%
▪ Noisy queries (Q*): IR degradation ~ ‐100%
▪ Noisy queries, expanded (Q*E): IR degradation ~ ‐14%
[Chart: IR degradation [%] (y‐axis 0–100) for Q, Q* and Q*E]

 Scalable (can handle large corpora)
 Applicable to other languages
 The methods can be useful for other similar
problems as well
 Real user logs (feedback, information needs)

 Layout analysis: text line extraction with 8% error in Latin
manuscripts.
 Towards computer assisted paleography for complex documents
 Handwriting recognition: transcription with 6% word error in
SG30 and PAR23, 18% word error in GW10.
 Towards text alignment and word spotting
 Information retrieval: degradation of 5% for PAR23.
 Towards more challenging problems
 Integrate the HisDoc outcomes into tools useful for practice
 We are open to new collaborations in integrated and application‐oriented
projects
 Our methods can be integrated into your tools!

 Printed modern English
 5% error rate (character)
IR degradation ‐17%
 20% error rate (character)
IR degradation ‐46%
 Handwritten 13th century German
 6% error rate (word)
▪ Clean queries (Q): IR degradation ‐5%
▪ Noisy queries (Q*): IR degradation  ‐100%
▪ Noisy queries, expanded (Q*E): IR degradation ‐14%
[Chart: IR degradation [%] (y‐axis 0–100) for printed Modern English at 5% and 20% character error, and for handwritten Middle High German at 6% word error with Q, Q* and Q*E]
