Nada Naji, Andreas Fischer, Micheal Baechler, Marcus Liwicki
Jacques Savoy, Horst Bunke, Rolf Ingold
Nada.Naji@UniNE.ch, Marcus.Liwicki@unifr.ch
Universities of Neuchatel, Bern, and Fribourg ‐ Switzerland
Computer Science Departments
[Figure: manuscript → scanned image → automatic image processing & text recognition → textual content (“dem man dirre aventivre giht ...”), meta‐data & alignment]
 HisDoc*: Historical Document Analysis, Recognition & Retrieval 
 Synergy research project in computer science of the Swiss National Science Foundation 
 Fundamental Research.
 Support by domain experts: Prof. Michael Stolz, Dr. Gabriel Viehhauser, Prof. Anton Näf, 
Eva Wiedenkeller, Prof. Christoph Flüeler, Prof. Ernst Tremp, Max Bänziger
* hisdoc.unine.ch
 Image Analysis
 Handwriting Recognition
 Information Retrieval
 Conclusions & Outlook
 Goal: recognize page elements such as page numbers, initials, text 
blocks, text lines, marginal notes, . . .
 Challenges: paper and parchment texture, ink bleed‐through, 
stains, seams, holes, faded ink, . . .
 Approach: pyramid method, i.e., multi‐scale analysis.
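The slides do not detail the pyramid method. As a rough sketch of multi‐scale analysis, assuming a pyramid built by repeated 2x2 averaging (the names `downsample` and `build_pyramid` and the toy image are illustrative, not taken from HisDoc):

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel intensities


def downsample(image: Image) -> Image:
    """Halve width and height by averaging each 2x2 block of pixels."""
    height = len(image) // 2 * 2
    width = len(image[0]) // 2 * 2
    return [
        [
            (image[y][x] + image[y][x + 1] + image[y + 1][x] + image[y + 1][x + 1]) / 4.0
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]


def build_pyramid(image: Image, levels: int) -> List[Image]:
    """Return [full resolution, 1/2, 1/4, ...] with at most `levels` entries."""
    pyramid = [image]
    while len(pyramid) < levels and len(pyramid[-1]) >= 2 and len(pyramid[-1][0]) >= 2:
        pyramid.append(downsample(pyramid[-1]))
    return pyramid


# Toy 8x8 checkerboard standing in for a scanned page (hypothetical data).
page = [[float((x + y) % 2) for x in range(8)] for y in range(8)]
pyramid = build_pyramid(page, levels=3)
sizes = [(len(level), len(level[0])) for level in pyramid]
```

Coarse levels would typically be used to find large regions such as text blocks, and finer levels to refine smaller elements such as text lines.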
 For adjusting and testing the approach, a dataset was created
 First comprehensive, publicly available research database for CS
 Three databases based on extracts of three manuscripts
1. Saint Gall DB: Abbey Library of St. Gall, Cod. Sang. 562, Carolingian 
script, Latin, 9th century (60 pages, 30 for learning).
2. Parzival DB: Abbey Library of St. Gall, Cod. Sang. 857, Gothic script, 
Middle High German, 13th century (47 pages, 23 for learning).
3. George Washington DB: Library of Congress, G. W. Papers, longhand, 
English, 18th century (20 pages, 10 for learning).
 Out‐of‐page (OOP), background (BG), text blocks (TB), 
text lines (TL).
[Bar chart: Error [%] (axis 0 to 14) for OOP, BG, TB, TL]
 HisDoc 2.0: Towards      
Computer‐Assisted Paleography
 Integrated text localization and 
script analysis
 Handle more complex 
documents
 Incorporate existing meta‐data 
from databases like e‐codices, 
manuscripta mediaevalia, …
 Buzzwords: TEI, RDF, Linked 
Data, OWL
 (thanks to SAWS ontology)
St. Gallen, Stiftsbibliothek, Cod.Sang. 863, 11th century. Folio 4
 Layout Analysis
 Handwriting Recognition
 Information Retrieval
 Conclusions & Outlook
 Goal: extract computer‐readable text from images, “reading”.
 Challenges: character models are learned from samples; 
providing learning samples is a costly, time‐consuming task; 
Sayre’s paradox (1973).
 Approach: Sliding window features from normalized text line 
images (Hidden Markov models and special neural networks)
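A minimal sketch of sliding window feature extraction, assuming a binarized text line image and a one‐pixel‐wide window. The two features (ink density and vertical center of gravity) are illustrative choices, not necessarily the HisDoc feature set; the resulting sequence is what a sequence model such as a Hidden Markov model would consume.

```python
from typing import List, Tuple

Image = List[List[int]]  # binarized text line: 1 = ink, 0 = background


def column_features(image: Image, x: int) -> Tuple[float, float]:
    """Features of a one-pixel-wide window at column x:
    fraction of ink pixels and normalized vertical center of gravity of the ink."""
    height = len(image)
    ink_rows = [y for y in range(height) if image[y][x] == 1]
    if not ink_rows:
        return 0.0, 0.5  # empty column: no ink, center placed mid-height by convention
    density = len(ink_rows) / height
    center = (sum(ink_rows) / len(ink_rows)) / (height - 1) if height > 1 else 0.0
    return density, center


def sliding_window(image: Image) -> List[Tuple[float, float]]:
    """Move the window left to right and emit one feature vector per column."""
    return [column_features(image, x) for x in range(len(image[0]))]


# Toy 4x4 "text line" (hypothetical data).
line = [
    [0, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]
features = sliding_window(line)
```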
dem man   dirre aventivre giht
 Assumptions: perfect text line extraction, known lexicon 
of words
[Bar chart: Word Error [%] (axis 0 to 35) for SG, PAR, GW]
 Text Alignment
 Given the photographed image and the transcription, map the text to 
the position in the document
 Useful in crowdsourcing settings: it spares contributors manual alignment work
 Word Spotting
 Find all occurrences of a queried word
allez
?
 Layout Analysis
 Handwriting Recognition
 Information Retrieval
 Conclusions & Outlook
 Goal: effective retrieval of textual items (lines/ paragraphs)
dem man dirre aventivre giht
des pfliget ouch tîvscher erde ein ort
daz ist ein warheit sunder wan
aventivre Search
 Modern English
 Scanned printed text 
 5% & 20% char error rates
In withdrawing the riskless 
principal mark‐up 
disclosure proposal in the 
1978 Release, the 
Commission stated that it 
would ''maintain close 
scrutiny to prevent 
excessive mark‐ups and 
take enforcement action 
where appropriate.''
ln withdrawlng the risyless
principal mary‐up
disclosure proposal in the 
191W helease1 the 
Commission stated that it 
would 44maintain close 
scrutiny to prevent 
excessive mary‐ups and 
taye enforcement action 
where appropriate.:: 20
fa ‐thtlrawing the WfUefqs
priucipA mary‐up 
dRclosure proposA in the 
191@ M,lease, the 
ComMssioa stated that it 
would amUntdn close 
scrutAy to preveat
excessive m=y‐upqe at nd
tttes eaforcemebt actioa
where approphate.. 2e 0
 Challenges: 
 Corpus: non‐standard orthography (punctuation; spelling variation, e.g., Parcifal = Parzifal), inflectional morphology (Parcifal, Parcifale, Parcifals, Parcivalen)
 User: Term confusion (which, witch, watch), spelling errors (whitch)
 Recognition errors:
▪ Word: dem  den
▪ Character: withdrawing  withdrawln5
 Other
 Approaches: 
 Multiple recognition hypotheses 
 Stemming (light:  bookings  booking, aggressive: bookings  book)
 Decompounding: Kühlschrank  Kühl + Schrank
 n‐grams: n=4, information  info, nfor, form, …, tion
 (Massive) query expansion: US, USA, United States, the States, America, États‐Unis, États‐Unis d'Amérique, …
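The n‐gram approach listed above can be sketched as follows; `char_ngrams` is an illustrative helper name. Overlapping character 4‐grams let spelling variants share at least some index terms.

```python
from typing import List


def char_ngrams(word: str, n: int = 4) -> List[str]:
    """Return all overlapping character n-grams; words no longer than n yield themselves."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


grams = char_ngrams("information")
# Spelling variants still share some n-grams and can therefore match each other.
shared = set(char_ngrams("Parcifal")) & set(char_ngrams("Parzifal"))
```

Here `grams` starts with info, nfor, form and ends with tion, as on the slide, and the two spellings of the name share the 4‐gram ifal.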
Manual transcription
Feirefiz vnt Parcifal
Searched text (BW)
Feirefiz Feirefize vnt vnz vart vert vrîe vatr vier Parcifal Parzifal
dem man   dirre aventivre giht
Parzival Parcifal, 
Parcival, Parzifal, 
Parcivale, Parcivals, 
Parcivalen
“dem man dirre aventivre giht”
man #36006.7
min #35656.8
mat #35452.5
nam #35424.7
arm #35296.2
nimt #35278.2
gan #35265.7
 BW1  man (1st Best Word)
 BW3  man min mat (top 3 Best Words)
 BW7  man min mat nam arm nimt gan
 BW(threshold)  man min (best words scoring within the threshold of the top word)
 threshold = 1.5%
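A small sketch of the two selection schemes, using the recognizer scores listed for “man”. The reading of the 1.5% value as a relative distance from the best score is an assumption; it is consistent with the example (min lies about 1.0% below man, mat about 1.5%), but the slides do not state the rule. The function names are illustrative.

```python
from typing import List, Tuple

Hypothesis = Tuple[str, float]  # (word, recognition score); higher is better

hypotheses: List[Hypothesis] = [
    ("man", 36006.7), ("min", 35656.8), ("mat", 35452.5), ("nam", 35424.7),
    ("arm", 35296.2), ("nimt", 35278.2), ("gan", 35265.7),
]


def best_k(hyps: List[Hypothesis], k: int) -> List[str]:
    """Keep the k highest-scoring words (BW1, BW3, BW7, ...)."""
    ranked = sorted(hyps, key=lambda h: h[1], reverse=True)
    return [word for word, _ in ranked[:k]]


def best_within(hyps: List[Hypothesis], threshold: float) -> List[str]:
    """Keep words whose score is at most `threshold` (relative) below the best score.
    Assumption: the threshold is relative to the top score."""
    ranked = sorted(hyps, key=lambda h: h[1], reverse=True)
    top = ranked[0][1]
    return [word for word, score in ranked if (top - score) / top <= threshold]


bw1 = best_k(hypotheses, 1)
bw3 = best_k(hypotheses, 3)
bw_threshold = best_within(hypotheses, 0.015)
```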
Manual transcription
dem man dirre aventivre giht
Searched text (BW)
dem zein zem dan den gein win man min dine dirre chrîe dirz dane
Amis dîner aventivre daventivre Aventivre giht gibt
 Two main versions of the corpus
 Ground‐Truth (GT)
▪ Manually  transcribed by experts
▪ Error‐free text (Evaluation baseline)
 Automatic recognition version (HisDoc)
▪ Noisy text: Word‐error rate ~6%
 MRR (Mean Reciprocal Rank) 
 Reciprocal rank (RR): the inverse of the rank of the first relevant item retrieved; MRR is the mean RR over all queries
 Reflects the concern of a user who wants one or a few good responses to a given request
In other words…
Every searcher’s dream: 
The top search result 
is what s/he’s looking for!
RR=1
RR=1/2
RR=1/3
.
.
.
RR=0
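As a short worked example of the measure, assuming each query is summarized by the rank of its first relevant item, with `None` when nothing relevant is retrieved (the sample ranks are made up):

```python
from typing import List, Optional


def reciprocal_rank(rank: Optional[int]) -> float:
    """RR = 1/rank of the first relevant item, or 0 if none was retrieved."""
    return 0.0 if rank is None else 1.0 / rank


def mean_reciprocal_rank(ranks: List[Optional[int]]) -> float:
    """Average the reciprocal ranks over all queries."""
    if not ranks:
        raise ValueError("at least one query is required")
    return sum(reciprocal_rank(rank) for rank in ranks) / len(ranks)


# Four hypothetical queries: hits at ranks 1, 2 and 3, and one query with no hit.
ranks = [1, 2, 3, None]
mrr = mean_reciprocal_rank(ranks)
```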
 User confusion of terms
▪ smashed potatoes vs. mashed potatoes
 Spelling variations
▪ color, colour
▪ Parzifal, Parcifal
 Inflectional morphology
▪ smash vs. smashed
▪ Parcifal vs. Parcifalen
 Spelling errors
▪ mashed  “mached”, “masched”
▪ laptop  “laptpo”
▪ Edit distance: (book, cook) = 1
 Trying to overcome
 User confusion, spelling variations, inflectional 
morphology & even recognition errors
“dem man dirre aventivre giht”
man # 36006.7
min # 35656.8
mat # 35452.5
nam # 35424.7
arm # 35296.2
nimt # 35278.2
gan # 35265.7
“iwer oder decheines man”
nam # 39678.5
mann # 39166.9
mit # 39134.9
mat # 39133.0
manz # 39001.1
man # 38997.0
mit # 38974.4
“als man von siner helfe saget”
mat # 50135.5
nam # 50115.2
man # 50111.4
min # 50056.5
ram # 50056.4
nimt # 49839.0
mine # 49837.9
...
man
man 39
min 18.02
mat 9.51
nam 5.4
miren 4
manz 3.16
maze 2.35
mann 2.08
dran 2.03
maz 1.75
dan 1.73
maht 1.65
mal 1.23
minen 0.96
erlan 0.84
meine 0.82
gan 0.81
han 0.75
man
min
mat
nam
arm
nimt
gan
nam
mann
mit
mat
manz
man
mit
mat
nam
man
min
ram
nimt
mine
(+) (+) (+) ...
1) Calculate scores based on frequency & ranking within each subset
2) Sort accordingly
mit
man
min
mat
mine
gan
manz
mat
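The two steps above (score by frequency and rank within each hypothesis list, then sort) could be sketched as follows. The slides do not give the scoring formula, so summing 1/rank over the lists is a stand‐in assumption, and `build_entry` is an illustrative name; the resulting numbers are not expected to reproduce the scores shown for the “man” entry.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothesis lists for three occurrences of "man", best hypothesis first (from the slides).
hypothesis_lists = [
    ["man", "min", "mat", "nam", "arm", "nimt", "gan"],
    ["nam", "mann", "mit", "mat", "manz", "man", "mit"],
    ["mat", "nam", "man", "min", "ram", "nimt", "mine"],
]


def build_entry(lists: List[List[str]]) -> List[Tuple[str, float]]:
    """Score each word by summing 1/rank over the lists it appears in, then sort."""
    scores: Dict[str, float] = defaultdict(float)
    for hypotheses in lists:
        seen = set()
        for rank, word in enumerate(hypotheses, start=1):
            if word in seen:  # count a repeated word once per list, at its best rank
                continue
            seen.add(word)
            scores[word] += 1.0 / rank  # hypothetical weighting, not the authors' formula
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


entry = build_entry(hypothesis_lists)
```

With only these three lists, nam ends up ahead of man; an entry built from many more occurrences, as is presumably the case on the slide, would rank differently.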
nam
namz
nante
man
namn
mann
ram
nimt
mins
miner
mal
nider
immer
man
mante
nante
namen
nemen
mat
mac
maz
mit
ime
mir
mere
man
mach
mage
sage
man man mit nam
expand
 Reverse thesaurus lookup
 Edit (Levenshtein) distance
 Ld(cook, book) = 1
 Expansion term weighting
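A minimal sketch of edit‐distance based expansion, covering only the Levenshtein step; the reverse thesaurus lookup and the term weighting are not reproduced here. The `expand` helper and its `max_distance` parameter are illustrative assumptions.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution or match
            ))
        previous = current
    return previous[-1]


def expand(term: str, vocabulary: List[str], max_distance: int = 1) -> List[str]:
    """Return vocabulary words within `max_distance` edits of the query term."""
    return [word for word in vocabulary if levenshtein(term, word) <= max_distance]


variants = expand("Parcifal", ["Parzifal", "Parcifals", "Parcivalen", "Feirefiz"])
```

Note that a transposition such as laptop/laptpo costs two edits under plain Levenshtein distance, so a threshold of 1 would miss it.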
RR=1
RR=0.5
RR=0.3
.
.
.
RR=0
Mean RR = 0.6, i.e., the target document is found around rank 1.7 on average (1/0.6 ≈ 1.7)
 Parzival DB 
 6% word error rate
▪ Clean queries (Q): IR degradation ~ ‐5%
▪ Noisy queries (Q*): IR degradation ~ ‐100%
▪ Noisy queries, expanded (Q*E): IR degradation ~ ‐14%
[Bar chart: IR degradation [%] (axis 0 to 100) for Q, Q*, Q*E]
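The slides do not define how IR degradation is computed. One plausible reading, assumed here, is the relative change of the retrieval measure (e.g., MRR) on the recognized text with respect to the ground‐truth baseline. The input values below are hypothetical and only illustrate the arithmetic.

```python
def relative_change(baseline: float, observed: float) -> float:
    """Relative change in percent; negative values mean degradation."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (observed - baseline) / baseline * 100.0


# Hypothetical MRR values, not taken from the experiments.
small_loss = relative_change(baseline=0.80, observed=0.76)  # about -5%
total_loss = relative_change(baseline=0.80, observed=0.0)   # nothing relevant retrieved
```

Under this reading, a degradation of ‐100% means that no relevant item is retrieved at all.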
 Scalable (can handle large corpora)
 Applicable to other languages
 The methods can be useful for other similar 
problems as well
 Real user logs (feedback, information needs)
 Layout analysis: text line extraction with 8% error in Latin 
manuscripts.
 Towards computer assisted paleography for complex documents
 Handwriting recognition: transcription with 6% word error in 
SG30 and PAR23, 18% word error in GW10.
 Towards text alignment and word spotting 
 Information retrieval: degradation of 5% for PAR23.
 Towards more challenging problems
 Integrate the HisDoc outcomes into tools useful for practice
 We are open to new collaborations in integrated and application‐oriented projects
 Our methods can be integrated in your tools!
 Printed modern English
 5% error rate (character) 
IR degradation ‐17% 
 20% error rate (character)
IR degradation ‐46%  
 Handwritten 13th century German
 6% error rate (word) 
▪ Clean queries (Q): IR degradation ‐5%
▪ Noisy queries (Q*): IR degradation  ‐100%
▪ Noisy queries, expanded (Q*E): IR degradation ‐14%
[Bar chart: IR degradation [%] (axis 0 to 100). Modern English, printed: 5% and 20% character error. Middle High German, handwritten, 6% word error: Q, Q*, Q*E]

2013 easy toolsfordifficulttexts