Language adaptability and performance evaluation of historical text normalization tools VARD2 and TICCL

Background
Data and Methods
Results
Conclusion
Iris Hendrickx and Martin Reynaert
Center for Language Studies, Radboud University Nijmegen, The Netherlands
June 12, 2014

Background
Historical texts are digitized in two ways:
1. by scanning and OCR of old books
2. by manual transcription (original spelling is usually preserved)
Digital historical texts contain many spelling variants because:
no official spelling existed at that time
texts were written by semi-literate authors
in the case of OCR: OCR errors

Motivation
However, spelling variation is a hindrance for:
Lexical or grammatical research
Searching in a digital collection: mismatch with the modern word query
Automatic natural language processing tools developed for modern text
The collection is valuable as a country’s cultural heritage: editions intended for the lay public should be in clean text.

Aim: Automatic spelling variation reduction in historical text collections
We compare two different spelling normalization tools, VARD2 (Baron, 2011) and TICCL (Reynaert, 2010), on historical Spanish and Portuguese data.
TICCL will also be evaluated on historical Dutch as part of the Nederlab project.

Data sets
VARD2
TICCL
Exp Setup
Data collections
Spanish and Portuguese
Project Post Scriptum
Manual digitization of a wide collection of 7000 personal letters (half Spanish, half Portuguese) from different historical archives.
The letters are manually transcribed into an electronic XML-TEI file format including rich and detailed historical and sociological meta-data.
Dutch: future work
17th C book: the 1637 edition of the State Bible with a gold
standard modern Dutch transcription from 2010
18th C book: manually OCR-corrected and transcribed into both
historical and modern gold standards: Kort begrip der
waereld-historie voor de jeugd. Martinet, 1789

Portuguese letter from 1592 addressed to the merchant João Nunes

Manual transcription of the letter
Figure: Full description at: http://ps.clul.ul.pt/index.php?page=infoLetter&carta=CARDS4006.xml

Manual transcription of the letter in XML
Figure: Full description at: http://ps.clul.ul.pt/index.php?page=infoLetter&carta=CARDS4006.xml

Aim: Spelling normalization of the transcription
Figure: English translation: I have more than once asked Your Honour
and begged Your Honour to leave me alone. But Your Honour has
insisted on defying me, dishonouring me, lessening me, engaging in gossip
about me at every corner, both by words spoken and by letters written to
whoever you choose. I remind you, speaking as a friend...

VARD2 normalisation tool
VARD2 (Baron, 2011) was developed for Early Modern English and combines several resources to detect and replace spelling variants with normalised forms.
VARD2 uses:
a modern lexicon
a spelling variants dictionary list that matches variants against
their modern counterparts
a list of letter replacement rules
a phonetic matching algorithm
an edit distance algorithm to determine the most likely
candidate
a training set with encoded normalisations (optional)
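
How these resources might work together can be sketched as below. This is a toy illustration in Python, not VARD2's own code: the lexicon, the variant list, the replacement rules and the function names are all invented for the example, and the phonetic matching and training components are left out.

```python
# Toy resources, invented for this sketch (Early Modern English examples).
MODERN_LEXICON = {"have", "love", "city", "do", "the"}
KNOWN_VARIANTS = {"doe": "do"}                 # variant -> modern form
REPLACEMENT_RULES = [("u", "v"), ("ie", "y")]  # (historical, modern) letters


def edit_distance(a, b):
    """Plain Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def candidates(token):
    """Collect modern candidates from the variant list and the letter rules."""
    found = set()
    if token in KNOWN_VARIANTS:
        found.add(KNOWN_VARIANTS[token])
    for old, new in REPLACEMENT_RULES:
        if old in token:
            rewritten = token.replace(old, new)
            if rewritten in MODERN_LEXICON:
                found.add(rewritten)
    return found


def normalise(token):
    """Leave lexicon words alone; otherwise pick the closest candidate."""
    if token in MODERN_LEXICON:
        return token
    cands = candidates(token)
    if not cands:
        return token  # nothing found: keep the original spelling
    return min(cands, key=lambda c: (edit_distance(token, c), c))


print([normalise(t) for t in ["the", "haue", "citie", "doe"]])
```

Note that a token already present in the lexicon is never touched in this sketch, which is also why archaic forms that happen to be listed in a lexicon stay undetected.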

TICCL
TICCL (Reynaert, 2014)
New C++ implementation, geared towards easy adaptation to other languages and older language varieties.
TICCL uses:
a large lexicon
a numerical list of Known Historical Character Confusions
exhaustive variant look-up up to a given Levenshtein distance
a combination of corpus-induced ranking features to
determine the most likely candidate
a dictionary of known historical-modern word form pairs
(optional)
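
The idea of a numerical character confusion list can be illustrated with a small Python sketch. It is not TICCL's C++ code: the key function (sum of code points raised to the fifth power), the tiny lexicon, the two confusions and all names are assumptions made for the example, and the search only follows the listed confusions rather than being exhaustive.

```python
from collections import defaultdict


def anagram_key(s):
    """Order-independent numeric key (assumed form: sum of code points ** 5)."""
    return sum(ord(ch) ** 5 for ch in s)


def levenshtein(a, b):
    """Plain Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


LEXICON = ["have", "love", "city", "very"]   # hypothetical modern lexicon
CONFUSIONS = [("u", "v"), ("ie", "y")]       # hypothetical (historical, modern)

# Each confusion becomes one number: the key difference it causes.
CONFUSION_VALUES = {anagram_key(new) - anagram_key(old)
                    for old, new in CONFUSIONS}

# Index the lexicon by key so a candidate is found by simple addition.
KEY_INDEX = defaultdict(list)
for word in LEXICON:
    KEY_INDEX[anagram_key(word)].append(word)


def lookup(token, max_distance=2):
    """Return lexicon words reachable via one listed confusion."""
    key = anagram_key(token)
    hits = set()
    for delta in CONFUSION_VALUES:
        for cand in KEY_INDEX.get(key + delta, []):
            # The key ignores character order, so verify with Levenshtein.
            if levenshtein(token, cand) <= max_distance:
                hits.add(cand)
    return sorted(hits)


print(lookup("haue"), lookup("citie"))
```

Because the key ignores character order, anagrams share a key; the Levenshtein check filters those out. Ranking the surviving candidates with corpus-induced features is not shown here.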

Experimental Setup
Experiments were run for both Spanish and Portuguese:
For Spanish, 200 letters from the period 1550 to 1830.
For Portuguese, 200 letters from 1550 to 1911.
Normalisation manually verified by a linguist.
Each data set was split into 100 letters for training the tools and 100 letters for evaluation.
Evaluation scores are computed with recall, precision and
F-score.
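
A minimal sketch of how such token-level scores could be computed is given below. The counting scheme is an assumption made for illustration (a change to the gold form is a true positive, any other change a false positive, a token still differing from gold where a change was needed a false negative), and the sample tokens are invented.

```python
def evaluate(tokens):
    """Score (original, gold, predicted) triples at token level."""
    tp = fp = fn = correct = 0
    for original, gold, predicted in tokens:
        changed = predicted != original
        needed = gold != original
        if predicted == gold:
            correct += 1
        if changed and predicted == gold:
            tp += 1          # variant replaced by the right modern form
        elif changed:
            fp += 1          # token changed, but not into the gold form
        if needed and predicted != gold:
            fn += 1          # variant missed or wrongly normalised
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    accuracy = correct / len(tokens) if tokens else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}


# Invented sample: (original, gold, system output)
sample = [
    ("hum", "um", "um"),                  # correctly normalised
    ("cousa", "coisa", "cousa"),          # missed variant
    ("de", "de", "de"),                   # correctly left alone
    ("tu", "tu", "tú"),                   # changed although no change needed
    ("fizerão", "fizeram", "fizeram"),    # correctly normalised
]
scores = evaluate(sample)
print(scores)
```

Under this scheme accuracy is dominated by the many tokens that need no change, which is why precision, recall and F-score are the more informative figures for normalisation.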

Portuguese
Spanish
Comparison of VARD2 and TICCL on Portuguese
Table: Best-first ranked performance of TICCL and VARD2 on the tokens of the test set. TICCL and VARD2 were trained on the same resources.
Tool              acc   prec  recall  f-score
VARD2-notraining  90.6  93.8  53.1    67.8
TICCL-notraining  89.2  92.0  46.0    61.4
VARD2             94.7  97.0  73.6    83.7
TICCL             93.5  94.4  69.3    79.9
TICCLrank         95.7  96.4  79.8    87.3
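
As a quick consistency check, the F-scores can be recomputed from the precision and recall columns, assuming the balanced F-score (harmonic mean of precision and recall). Since the inputs are already rounded to one decimal, a recomputed value may differ from the reported one in the last digit.

```python
# Precision, recall and reported F-score for Portuguese, copied from the table.
ROWS = {
    "VARD2-notraining": (93.8, 53.1, 67.8),
    "TICCL-notraining": (92.0, 46.0, 61.4),
    "VARD2": (97.0, 73.6, 83.7),
    "TICCL": (94.4, 69.3, 79.9),
    "TICCLrank": (96.4, 79.8, 87.3),
}


def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for tool, (p, r, reported) in ROWS.items():
    print(f"{tool:18s} recomputed {f_score(p, r):.2f}  reported {reported}")
```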

Error analysis Portuguese
Most frequent error: spelling of ‘um’ with ‘h-’. The system does not recognise this because ‘hum’ is listed in the modern lexicon.
For all periods: problems with diacritics.
The older letters have many archaisms (e.g. inda, cousa) that are erroneously part of the modern lexicon list.
The older letters also have many abbreviations (e.g. v., va., etcra.) which are difficult to recognise automatically.
Confusion between different spellings: for 1500-1700, s/c/ss for the sound [s]; for 1701-1800, z/s for the sound [z]; for 1801-1930, the phonetic spelling of ‘i’ for ‘e’ frequently occurs.
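
The lexicon-related errors can be reproduced with a toy detector that flags a token only when it is absent from the lexicon. The lexicon contents, the extra token ‘fizerão’ and the function name are invented for this sketch; pruning archaic entries is shown only as one conceivable remedy and was not part of the experiments.

```python
# Toy "modern" lexicon that, like the real list, still contains archaic forms.
MODERN_LEXICON = {"um", "hum", "coisa", "cousa", "ainda", "inda"}
ARCHAIC_ENTRIES = {"hum", "cousa", "inda"}   # hypothetical list to prune


def flag_variants(tokens, lexicon):
    """Flag tokens for normalisation only if the lexicon does not list them."""
    return [t for t in tokens if t.lower() not in lexicon]


tokens = ["hum", "cousa", "fizerão", "coisa"]

# With the unpruned lexicon the archaic forms pass as valid modern words.
flagged = flag_variants(tokens, MODERN_LEXICON)

# After removing the archaic entries they become visible to the detector.
flagged_after = flag_variants(tokens, MODERN_LEXICON - ARCHAIC_ENTRIES)

print(flagged)
print(flagged_after)
```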

Comparison of VARD2 and TICCL on Spanish
Table: Best-first ranked performance of TICCL and VARD2 on the tokens of the test set. TICCL and VARD2 were trained on the same resources.
Tool              acc   prec  recall  f-score
VARD2-notraining  76.1  71.8  37.3    49.1
TICCL-notraining  74.0  81.3  20.8    33.1
VARD2             87.2  96.4  66.0    78.4
TICCL             89.0  91.6  77.3    83.9

Error analysis Spanish
Typical errors made by VARD2 and TICCL:
Around 41-47% of the words that were not corrected were not spotted as errors because the word occurs in the lexicon. For example, ‘tu’ needs an accent in modern Spanish when used as a personal pronoun (tú) but is written ‘tu’ in the possessive form.
For around 37-43% of the words that were not corrected, the correct form did not occur in the lexicon (names and conjugated verbs, for example), so they could never have been resolved with the current settings.
Around 15% of the errors are due to abbreviations.

Conclusion
VARD2 can be trained on other languages to good effect, but needs manually constructed resources
TICCL can be successfully extended to these languages too, without manual work
VARD2 outperforms TICCL without training on domain-specific examples
When trained, TICCL outperforms VARD2 on Spanish (F-score 83.9 vs. 78.4); on Portuguese it does so only in the TICCLrank setting (87.3 vs. 83.7)
TICCL can handle far greater amounts of language-specific resources, such as lexicons and name lists

Thank you for your attention!
References
Martin Reynaert. On OCR ground truths and OCR post-correction gold standards, tools and formats. Proceedings of DATeCH 2014: Digital Access to Textual Cultural Heritage, Madrid, 2014.
Martin Reynaert. Synergy of Nederlab and @PhilosTEI: diachronic and multilingual Text-Induced Corpus Clean-up. Proceedings of LREC 2014: Language Resources and Evaluation Conference, Reykjavik, 2014.
Rita Marquilhas and Iris Hendrickx. Manuscripts and machines: the automatic replacement of spelling variants in a Portuguese historical corpus. International Journal of Humanities and Arts Computing, 18.1 (2014): 53-68, Edinburgh University Press.
Martin Reynaert, Iris Hendrickx, and Rita Marquilhas. Historical spelling normalization. A comparison of two statistical methods: TICCL and VARD2. Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities (ACRH-2), pages 87-98, 2012.
Alistair Baron. Dealing with spelling variation in Early Modern English texts. PhD thesis, University of Lancaster, Lancaster, UK, 2011.
Martin Reynaert. Character confusion versus focus word-based correction of spelling and OCR variants in corpora. International Journal on Document Analysis and Recognition, 14:173-187, 2010.
Iris Hendrickx and Rita Marquilhas. From Old Texts to Modern Spellings: An Experiment in Automatic Normalisation. Journal for Language Technology and Computational Linguistics (JLCL), 26(2):65-76, 2011.
