This document summarizes research on analyzing and extracting information from scanned historical documents. It discusses developing techniques for layout analysis, handwriting recognition, and information retrieval on datasets of manuscripts from the 8th to 18th centuries. The techniques aim to recognize page elements, extract text from images, and retrieve relevant text despite errors from the recognition systems. Evaluation shows the approaches can analyze layout with 8% error and recognize text with up to 18% word error for certain datasets. The methods aim to support applications like computer-assisted paleography and retrieval on historical collections.
6. For adjusting and testing the approach, a dataset was created
First comprehensive, publicly available research database for CS
Three databases based on extracts of three manuscripts
1. Saint Gall DB: Abbey Library of St. Gall, Cod. Sang. 562, Carolingian
script, Latin, 9th century (60 pages, 30 for learning).
2. Parzival DB: Abbey Library of St. Gall, Cod. Sang. 857, Gothic script,
Middle High German, 13th century (47 pages, 23 for learning).
3. George Washington DB: Library of Congress, G. W. Papers, longhand,
English, 18th century (20 pages, 10 for learning).
15. Modern English
Scanned printed text
5% & 20% char error rates
In withdrawing the riskless
principal mark‐up
disclosure proposal in the
1978 Release, the
Commission stated that it
would ''maintain close
scrutiny to prevent
excessive mark‐ups and
take enforcement action
where appropriate.''
ln withdrawlng the risyless
principal mary‐up
disclosure proposal in the
191W helease1 the
Commission stated that it
would 44maintain close
scrutiny to prevent
excessive mary‐ups and
taye enforcement action
where appropriate.:: 20
fa ‐thtlrawing the WfUefqs
priucipA mary‐up
dRclosure proposA in the
191@ M,lease, the
ComMssioa stated that it
would amUntdn close
scrutAy to preveat
excessive m=y‐upqe at nd
tttes eaforcemebt actioa
where approphate.. 2e 0
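The character error rates quoted on this slide can be illustrated with a small sketch. It assumes the usual definition (Levenshtein edit distance divided by the length of the reference text); the slides do not say how the rates were measured. The function names are hypothetical, and the sample strings are a fragment of the slide text with ASCII hyphens substituted.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(reference, hypothesis):
    """Edit distance divided by the reference length, in characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Fragment of the clean text and of the lightly corrupted version.
clean = "excessive mark-ups and take enforcement action"
noisy = "excessive mary-ups and taye enforcement action"
print(f"CER = {char_error_rate(clean, noisy):.3f}")
```

Because `edit_distance` works on any sequence, passing lists of words instead of strings gives a word-level error count in the same way.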
25. Manual transcription
dem man dirre aventivre giht
Searched text (BW)
dem zein zem dan den gein win man min dine dirre chrîe dirz dane
Amis dîner aventivre daventivre Aventivre giht gibt
27. MRR (Mean Reciprocal Rank)
Reciprocal rank (RR): the inverse of the rank of the first relevant item
retrieved; MRR is the mean of RR over all queries
Reflects the concern of a user who wants to find one or a
few good responses to a given request
In other words…
Every searcher’s dream:
The top search result
is what s/he’s looking for!
RR=1, RR=1/2, RR=1/3, ... (first relevant item at rank 1, 2, 3, ...)
RR=0 (no relevant item retrieved)
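The definition above can be sketched in a few lines. The function names and document ids are hypothetical and serve only to illustrate the computation.

```python
def reciprocal_rank(ranked_items, relevant):
    """Return 1/rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average RR over (ranked_items, relevant_set) pairs, one pair per query."""
    if not runs:
        raise ValueError("need at least one query")
    return sum(reciprocal_rank(items, rel) for items, rel in runs) / len(runs)


# Hypothetical document ids, for illustration only.
runs = [
    (["d3", "d1", "d7"], {"d3"}),  # relevant item at rank 1: RR = 1
    (["d2", "d5", "d9"], {"d5"}),  # relevant item at rank 2: RR = 1/2
    (["d4", "d6", "d8"], {"d0"}),  # nothing relevant retrieved: RR = 0
]
print(mean_reciprocal_rank(runs))  # (1 + 1/2 + 0) / 3 = 0.5
```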
30.
man # 36006.7
min # 35656.8
mat # 35452.5
nam # 35424.7
arm # 35296.2
nimt # 35278.2
gan # 35265.7
nam # 39678.5
mann # 39166.9
mit # 39134.9
mat # 39133.0
manz # 39001.1
man # 38997.0
mit # 38974.4
mat # 50135.5
nam # 50115.2
man # 50111.4
min # 50056.5
ram # 50056.4
nimt # 49839.0
mine # 49837.9
...
“dem man dirre
aventivre giht”
“iwer oder
decheines man”
“als man von siner
helfe saget“
31.
man
man 39
min 18.02
mat 9.51
nam 5.4
miren 4
manz 3.16
maze 2.35
mann 2.08
dran 2.03
maz 1.75
dan 1.73
maht 1.65
mal 1.23
minen 0.96
erlan 0.84
meine 0.82
gan 0.81
han 0.75
man, min, mat, nam, arm, nimt, gan (+)
nam, mann, mit, mat, manz, man, mit (+)
mat, nam, man, min, ram, nimt, mine (+) ...
1) Calculate scores
Based on frequency & ranking within each subset
2) Sort accordingly
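The two steps above can be sketched as follows. The slides do not give the scoring formula, so this sketch assumes a simple stand-in: each occurrence of a word at rank r adds 1/r to its score, which rewards both frequency and high rank. It will not reproduce the scores shown on the slide, and `merge_candidate_lists` is a hypothetical name.

```python
from collections import defaultdict


def merge_candidate_lists(lists):
    """Merge ranked candidate lists into one list sorted by aggregate score.

    ASSUMPTION: an occurrence at rank r contributes 1/r, so words that
    appear often and near the top of the lists score highest. The actual
    formula used in the slides is not stated.
    """
    scores = defaultdict(float)
    for candidates in lists:
        for rank, word in enumerate(candidates, start=1):
            scores[word] += 1.0 / rank  # step 1: calculate scores
    # Step 2: sort by descending score; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# The three candidate lists shown on the slide for the word "man".
lists = [
    ["man", "min", "mat", "nam", "arm", "nimt", "gan"],
    ["nam", "mann", "mit", "mat", "manz", "man", "mit"],
    ["mat", "nam", "man", "min", "ram", "nimt", "mine"],
]
for word, score in merge_candidate_lists(lists)[:5]:
    print(f"{word} {score:.2f}")
```

With only three lists the ordering depends heavily on the chosen weighting; a scheme that weights frequency more strongly, or that uses the recognizer scores themselves, would rank the candidates differently.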
37. Layout analysis: text line extraction with 8% error in Latin
manuscripts.
Towards computer-assisted paleography for complex documents
Handwriting recognition: transcription with 6% word error in
SG30 and PAR23, 18% word error in GW10.
Towards text alignment and word spotting
Information retrieval: degradation of 5% for PAR23.
Towards more challenging problems
Integrate the HisDoc outcomes into tools useful in practice
We are open to new collaborations on integrated, application-oriented
projects
Our methods can be integrated into your tools!
39. Printed modern English
5% error rate (character)
IR degradation ‐17%
20% error rate (character)
IR degradation ‐46%
Handwritten 13th century German
6% error rate (word)
▪ Clean queries (Q): IR degradation ‐5%
▪ Noisy queries (Q*): IR degradation ‐100%
▪ Noisy, expanded queries (Q*E): IR degradation ‐14%
[Figure: bar chart of IR degradation [%] (axis 0 to 100) for printed Modern English at 5% and 20% character error, and for handwritten Middle High German at 6% word error with queries Q, Q* and Q*E]
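The slides do not define "IR degradation". One plausible reading, assumed here, is the relative change of a retrieval score such as MRR between the error-free text and the recognized text, expressed in percent. The function name and the score values below are hypothetical and are not taken from the experiments.

```python
def ir_degradation(baseline_score, noisy_score):
    """Relative change in retrieval quality, in percent.

    ASSUMPTION: degradation = 100 * (noisy - baseline) / baseline, so a
    negative value means retrieval got worse on the recognized text.
    """
    if baseline_score <= 0:
        raise ValueError("baseline score must be positive")
    return 100.0 * (noisy_score - baseline_score) / baseline_score


# Hypothetical MRR values, for illustration only.
print(f"{ir_degradation(0.80, 0.76):.0f}%")
print(f"{ir_degradation(0.80, 0.00):.0f}%")
```

Under this reading, a degradation of ‐100% means that no relevant item was retrieved for any query.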