Background
Data and Methods
Results
Conclusion
Language adaptability and performance evaluation
of historical text normalization tools VARD2 and
TICCL
Iris Hendrickx and Martin Reynaert
Center for Language Studies, Radboud University Nijmegen, The Netherlands
June 12, 2014
Historical text normalization
Background
Digitizing historical texts:
1. by scanning and OCR of old books
2. by manual transcription (original spelling is usually preserved)
Digital historical texts contain many spelling variants because:
no official spelling existed at that time
texts were written by half-literate authors
in case of OCR: OCR errors
Motivation
However, the spelling variation is a hindrance for:
Lexical or grammatical research
Searching in a digital collection: mismatch with modern word query
Automatic natural language processing tools developed for modern text
The collection is valuable as the country's cultural heritage: editions intended for the lay public should be in clean text.
Aim: Automatic spelling variation reduction in historical
text collections
We compare two different spelling normalization tools, VARD2 (Baron, 2011) and TICCL (Reynaert, 2010), on historical Spanish and Portuguese data.
TICCL will also be evaluated on historical Dutch as part of the Nederlab project.
Data sets
VARD2
TICCL
Exp Setup
Data collections
Spanish and Portuguese
Project Post Scriptum
Manual digitization of a wide collection of 7000 personal letters
(half Spanish/ Portuguese) from different historical archives.
The letters are manually transcribed into an electronic XML-TEI
file format including rich and detailed historical and sociological
meta-data.
Dutch: future work
17th C book: the 1637 edition of the State Bible with a gold
standard modern Dutch transcription from 2010
18th C book: manually OCR-corrected and transcribed into both
historical and modern gold standards: Kort begrip der
waereld-historie voor de jeugd. Martinet, 1789
Portuguese letter from 1592 addressed to merchandiser João Nunes
Manual transcription of the letter
Figure: Full description at http://ps.clul.ul.pt/index.php?page=infoLetter&carta=CARDS4006.xml
Manual transcription of the letter in XML
Figure: Full description at http://ps.clul.ul.pt/index.php?page=infoLetter&carta=CARDS4006.xml
Aim: Spelling normalization of the transcription
Figure : English translation: I have more than once asked Your Honour
and begged Your Honour to leave me alone. But Your Honour has
insisted on defying me, dishonouring me, lessening me, engaging in gossip
about me at every corner, both by words spoken and by letters written to
whoever you choose. I remind you, speaking as a friend...
VARD2 normalisation tool
VARD2 (Baron, 2011) was developed for Early Modern English and combines several resources to detect and replace spelling variants with normalised forms.
VARD2 uses:
a modern lexicon
a spelling variants dictionary list that matches variants against
their modern counterparts
a list of letter replacement rules
a phonetic matching algorithm
an edit distance algorithm to determine the most likely
candidate
a training set with encoded normalisations (optional)
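How such resources can work together is sketched below in Python. This is a minimal illustration, not VARD2's actual code: the lexicon, variant dictionary, and replacement rules are made-up toy entries, the function names are hypothetical, and the order of the steps is an assumption. Phonetic matching, edit-distance ranking, and training are left out.

```python
# Toy resources; every entry is a made-up illustration, not VARD2 data.
MODERN_LEXICON = {"um", "coisa", "ainda", "casa"}
KNOWN_VARIANTS = {"cousa": "coisa"}            # variant -> modern form
REPLACEMENT_RULES = [("hu", "u"), ("z", "s")]  # (historical, modern) substrings


def apply_rules(word, rules):
    """Return every form reachable by applying one rule at one position."""
    forms = set()
    for old, new in rules:
        start = word.find(old)
        while start != -1:
            forms.add(word[:start] + new + word[start + len(old):])
            start = word.find(old, start + 1)
    return forms


def normalise(word):
    """Return a modern form for `word`, or `word` itself if none is found."""
    if word in KNOWN_VARIANTS:      # 1. direct hit in the variants dictionary
        return KNOWN_VARIANTS[word]
    if word in MODERN_LEXICON:      # 2. already a modern form, leave untouched
        return word
    candidates = apply_rules(word, REPLACEMENT_RULES) & MODERN_LEXICON
    if candidates:                  # 3. rule output confirmed by the lexicon
        return min(candidates)      # deterministic pick for this sketch
    return word                     # 4. no candidate found, keep the original


print(normalise("cousa"), normalise("hum"), normalise("caza"))
```

Step 2 also shows the limitation of any lexicon check: a historical form that happens to be listed in the modern lexicon is never flagged as a variant.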
TICCL
TICCL (Reynaert, 2014)
A new C++ implementation designed to be easily adaptable to other languages and older language varieties.
TICCL uses:
a large lexicon
a numerical list of Known Historical Character Confusions
exhaustive variant look-up up to a given Levenshtein distance
a combination of corpus-induced ranking features to
determine the most likely candidate
a dictionary of known historical-modern word form pairs
(optional)
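The idea of looking up variants within a given Levenshtein distance and ranking them can be sketched as follows. This is a hedged toy version in Python, not TICCL's C++ implementation: the brute-force scan over the lexicon is a stand-in for whatever look-up mechanism TICCL uses, corpus frequency is the only ranking feature, the character confusion list is not modelled, and the function names and frequencies are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def rank_candidates(word, lexicon_freq, max_distance=2):
    """Lexicon entries within `max_distance` of `word`, best first.

    Candidates are ordered by smaller distance, then by higher frequency.
    """
    scored = []
    for entry, freq in lexicon_freq.items():
        distance = levenshtein(word, entry)
        if 0 < distance <= max_distance:
            scored.append((distance, -freq, entry))
    return [entry for _, _, entry in sorted(scored)]


# Made-up frequencies, for illustration only.
toy_lexicon = {"casa": 50, "caso": 30, "coisa": 20}
print(rank_candidates("caza", toy_lexicon))
```

A linear scan like this does not scale to large lexicons, which is one reason a dedicated look-up scheme is needed in practice.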
Experimental Setup
Experiments were run for both Spanish and Portuguese:
For Spanish 200 letters from the time period 1550 to 1830.
For Portuguese 200 letters from 1550 until 1911.
Normalisation manually verified by a linguist.
Each data set was split into 100 letters for training the tools and 100 letters for evaluation.
Evaluation scores are computed with recall, precision and
F-score.
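For reference, the three scores can be computed as in the Python sketch below. It assumes the standard definitions, with a true positive being a variant that was normalised correctly; the slides do not spell out the exact counting scheme, so the function names and the count-based interface are illustrative assumptions.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F-score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(true_pos, false_pos, false_neg):
    """Precision, recall and F-score from raw counts.

    true_pos:  variants normalised correctly
    false_pos: tokens changed wrongly
    false_neg: variants left uncorrected
    """
    changed = true_pos + false_pos
    variants = true_pos + false_neg
    precision = true_pos / changed if changed else 0.0
    recall = true_pos / variants if variants else 0.0
    return precision, recall, f_score(precision, recall)


# Precision 97.0 and recall 73.6 are the values reported for trained VARD2
# on Portuguese; the F-score should come out close to the reported 83.7.
print(round(f_score(97.0, 73.6), 1))
```

Because the F-score is a harmonic mean, a tool with high precision but low recall, as both tools show without training, still gets a low F-score.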
Portuguese
Spanish
Comparison of VARD2 and TICCL on Portuguese
Table: Best-first ranked performance of TICCL and VARD2 on the tokens of the test set. TICCL and VARD2 were trained on the same resources.
Tool              Accuracy  Precision  Recall  F-score
VARD2-notraining  90.6      93.8       53.1    67.8
TICCL-notraining  89.2      92.0       46.0    61.4
VARD2             94.7      97.0       73.6    83.7
TICCL             93.5      94.4       69.3    79.9
TICCLrank         95.7      96.4       79.8    87.3
Error analysis Portuguese
Most frequent error: spelling of ‘um’ with ‘h-’. The system does not recognise this since hum is listed in the modern lexicon.
For all periods: diacritics problems.
The older letters have many archaisms (e.g. inda, cousa) that are erroneously part of the modern lexicon list.
The older letters also have many abbreviations (e.g. v., va., etcra.) which are difficult to recognise automatically.
Confusion between different spellings: for 1500-1700, s/c/ss for the sound [s]; for 1701-1800, z/s for the sound [z]; for 1801-1930, the phonetic spelling of ‘i’ for ‘e’ frequently occurs.
Comparison of VARD2 and TICCL on Spanish
Table: Best-first ranked performance of TICCL and VARD2 on the tokens of the test set. TICCL and VARD2 were trained on the same resources.
Tool              Accuracy  Precision  Recall  F-score
VARD2-notraining  76.1      71.8       37.3    49.1
TICCL-notraining  74.0      81.3       20.8    33.1
VARD2             87.2      96.4       66.0    78.4
TICCL             89.0      91.6       77.3    83.9
Error analysis Spanish
Typical errors made by VARD2 and TICCL:
Around 41-47% of the words that were not corrected were not spotted as errors because the word occurred in the lexicon. For example, ‘tu’ needs an accent in modern Spanish when used as a personal pronoun (tú), but is written ‘tu’ in the possessive form.
Around 37-43% of the words that were not corrected had correct forms that did not occur in the lexicon (names and conjugated verbs, for example) and could never have been resolved with the current settings.
Around 15% of errors are due to abbreviations.
Conclusion
VARD2 can be trained on other languages to good effect, but needs manually constructed resources
TICCL can be successfully extended to these languages too, without manual work
When neither tool is trained on domain-specific examples, VARD2 outperforms TICCL
TICCL outperforms VARD2 when trained
TICCL can handle far greater amounts of language-specific resources such as lexicons and name lists
Thank you for your attention!
References
Martin Reynaert. On OCR ground truths and OCR post-correction gold standards, tools and formats. Proceedings of DATeCH 2014: Digital Access to Textual Cultural Heritage, Madrid, 2014.
Martin Reynaert. Synergy of Nederlab and @PhilosTEI: diachronic and multilingual Text-Induced Corpus Clean-up. Proceedings of LREC 2014: Language Resources and Evaluation Conference, Reykjavik, 2014.
Rita Marquilhas and Iris Hendrickx. Manuscripts and machines: the automatic replacement of spelling variants in a Portuguese historical corpus. International Journal of Humanities and Arts Computing, 18.1 (2014): 53-68, Edinburgh University Press.
Martin Reynaert, Iris Hendrickx, and Rita Marquilhas. Historical spelling normalization. A comparison of two statistical methods: TICCL and VARD2. Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities (ACRH-2), pages 87-98, 2012.
Alistair Baron. Dealing with spelling variation in Early Modern English texts. PhD thesis, University of Lancaster, Lancaster, UK, 2011.
Martin Reynaert. Character confusion versus focus word-based correction of spelling and OCR variants in corpora. International Journal on Document Analysis and Recognition, 14:173-187, 2010.
Iris Hendrickx and Rita Marquilhas. From Old Texts to Modern Spellings: An Experiment in Automatic Normalisation. Journal for Language Technology and Computational Linguistics (JLCL), 26(2):65-76, 2011.
 
2024-05-30_meetup_devops_aix-marseille.pdf
2024-05-30_meetup_devops_aix-marseille.pdf2024-05-30_meetup_devops_aix-marseille.pdf
2024-05-30_meetup_devops_aix-marseille.pdf
 
Doctoral Symposium at the 17th IEEE International Conference on Software Test...
Doctoral Symposium at the 17th IEEE International Conference on Software Test...Doctoral Symposium at the 17th IEEE International Conference on Software Test...
Doctoral Symposium at the 17th IEEE International Conference on Software Test...
 
somanykidsbutsofewfathers-140705000023-phpapp02.pptx
somanykidsbutsofewfathers-140705000023-phpapp02.pptxsomanykidsbutsofewfathers-140705000023-phpapp02.pptx
somanykidsbutsofewfathers-140705000023-phpapp02.pptx
 
Bonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdf
Bonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdfBonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdf
Bonzo subscription_hjjjjjjjj5hhhhhhh_2024.pdf
 

Language adaptability and performance evaluation of historical text normalization tools VARD2 and TICCL

  • 1. Language adaptability and performance evaluation of historical text normalization tools VARD2 and TICCL. Iris Hendrickx and Martin Reynaert, Center for Language Studies, Radboud University Nijmegen, The Netherlands. June 12, 2014. Outline: Background; Data and Methods; Results; Conclusion. Running title: Historical text normalization.
  • 2. Background. Digitizing historical texts: (1) scanning and OCR of old books; (2) manual transcription (original spelling is usually preserved). Digital historical texts contain many spelling variants because: no official spelling existed at that time; some texts were written by half-literate authors; in the case of OCR, there are OCR errors.
  • 3. Motivation. However, the spelling variation gets in the way of: lexical or grammatical research; searching in a digital collection (historical spellings do not match a modern word query); automatic natural language processing tools developed for modern text. The collection is also valuable as a country’s cultural heritage: editions intended for the lay public should be in clean text.
  • 4. Aim: automatic spelling variation reduction in historical text collections. We compare two spelling normalization tools, VARD2 (Baron, 2011) and TICCL (Reynaert, 2010), on historical Spanish and Portuguese data. TICCL will also be evaluated on historical Dutch as part of the Nederlab project.
  • 5. Data collections. Spanish and Portuguese: Project Post Scriptum, a manual digitization of a wide collection of 7000 personal letters (half Spanish, half Portuguese) from different historical archives. The letters are manually transcribed into an electronic XML-TEI file format that includes rich and detailed historical and sociological metadata. Dutch (future work): a 17th-century book, the 1637 edition of the State Bible, with a gold standard modern Dutch transcription from 2010; and an 18th-century book, manually OCR-corrected and transcribed into both historical and modern gold standards: Kort begrip der waereld-historie voor de jeugd (Martinet, 1789).
  • 6. Portuguese letter from 1592 addressed to the merchant João Nunes.
  • 7. Manual transcription of the letter. Figure: full description at http://ps.clul.ul.pt/index.php?page=infoLetter&carta=CARDS4006.xml
  • 8. Manual transcription of the letter in XML. Figure: full description at http://ps.clul.ul.pt/index.php?page=infoLetter&carta=CARDS4006.xml
  • 9. Aim: spelling normalization of the transcription. Figure: English translation: I have more than once asked Your Honour and begged Your Honour to leave me alone. But Your Honour has insisted on defying me, dishonouring me, lessening me, engaging in gossip about me at every corner, both by words spoken and by letters written to whoever you choose. I remind you, speaking as a friend...
  • 10. VARD2 normalisation tool. VARD2 (Baron, 2011) was developed for Early Modern English and combines several resources to detect spelling variants and replace them with normalised forms. VARD2 uses: a modern lexicon; a spelling variants dictionary that matches variants against their modern counterparts; a list of letter replacement rules; a phonetic matching algorithm; an edit distance algorithm to determine the most likely candidate; a training set with encoded normalisations (optional).
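One of the listed resources, the letter replacement rules, can be illustrated with a minimal sketch. It is not VARD2's implementation: the function name `rule_candidates`, the two rules and the tiny lexicon are hypothetical toy data, and the sketch assumes that a rule is applied once, at a single position, with the result kept only if it is a known modern word.

```python
def rule_candidates(token, rules, lexicon):
    """Apply each (old, new) rule at every single position of the token
    and keep the rewritten forms that occur in the modern lexicon."""
    candidates = set()
    for old, new in rules:
        start = token.find(old)
        while start != -1:
            candidate = token[:start] + new + token[start + len(old):]
            if candidate in lexicon:
                candidates.add(candidate)
            start = token.find(old, start + 1)
    return sorted(candidates)


# Hypothetical toy rules and lexicon, for illustration only.
RULES = [("y", "i"), ("ph", "f")]
LEXICON = {"rei", "lei", "profeta"}

print(rule_candidates("rey", RULES, LEXICON))       # ['rei']
print(rule_candidates("propheta", RULES, LEXICON))  # ['profeta']
```

A token that needs the same rule at two positions is not resolved by this single-application sketch; a real tool would have to combine rules with the other resources in the list.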
  • 11. TICCL. TICCL (Reynaert, 2014) is a new C++ implementation geared at being easily adaptable to other languages and older language varieties. TICCL uses: a large lexicon; a numerical list of known historical character confusions; exhaustive variant look-up up to a given Levenshtein distance; a combination of corpus-induced ranking features to determine the most likely candidate; a dictionary of known historical-modern word form pairs (optional).
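The idea of looking up all variants within a given Levenshtein distance can be sketched as follows. This is a plain dynamic-programming version that scans the whole lexicon, written only to illustrate the concept; it makes no claim about how TICCL performs its look-up. The names `levenshtein` and `variants_within` and the toy lexicon are assumptions introduced here.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def variants_within(token, lexicon, max_distance):
    """Return (distance, word) pairs for lexicon words within max_distance."""
    scored = ((levenshtein(token, word), word) for word in lexicon)
    return sorted((d, w) for d, w in scored if 0 < d <= max_distance)


# Hypothetical toy lexicon, for illustration only.
LEXICON = {"coisa", "causa", "casa", "mesa"}
print(variants_within("cousa", LEXICON, 1))  # [(1, 'causa'), (1, 'coisa')]
```

Two candidates tie at distance 1 in this toy case, which is why a look-up of this kind has to be followed by a ranking step that picks the most likely candidate.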
  • 12. Experimental setup. Experiments were run for both Spanish and Portuguese. For Spanish: 200 letters from the period 1550 to 1830. For Portuguese: 200 letters from 1550 until 1911. The normalisation was manually verified by a linguist. Each language's data set was split into 100 letters for training the tools and 100 for the evaluation set. Evaluation scores are computed as recall, precision and F-score.
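The evaluation measures can be written out as a short sketch. The counting convention is an assumption, not taken from the slides: a true positive is a variant that was normalised correctly, a false positive is a token that was changed wrongly, and a false negative is a variant that was left uncorrected. The F-score is assumed to be the balanced harmonic mean of precision and recall.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) as fractions from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
p, r = precision_recall(tp=3, fp=1, fn=2)
print(round(p, 2), round(r, 2), round(f_score(p, r), 2))  # 0.75 0.6 0.67
```

The same formula works on percentages: `f_score(97.0, 73.6)` rounds to 83.7, which agrees with the f-score reported for trained VARD2 on Portuguese.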
  • 13. Comparison of VARD2 and TICCL on Portuguese. Table: Best-first ranked performance of TICCL and VARD2 on the tokens of the test set. TICCL and VARD2 were trained on the same resources.
      Tool              acc   prec  recall  f-score
      VARD2-notraining  90.6  93.8  53.1    67.8
      TICCL-notraining  89.2  92.0  46.0    61.4
      VARD2             94.7  97.0  73.6    83.7
      TICCL             93.5  94.4  69.3    79.9
      TICCLrank         95.7  96.4  79.8    87.3
  • 14. Error analysis Portuguese. Most frequent error: spelling of ‘um’ with ‘h-’. The system does not recognise this, since ‘hum’ is listed in the modern lexicon. For all periods: problems with diacritics. The older letters have many archaisms (e.g. inda, cousa) that are erroneously part of the modern lexicon list. The older letters also have many abbreviations (e.g. v., va., etcra.) which are difficult to recognise automatically. Confusion between different spellings: for 1500-1700, s/c/ss for the sound [s]; for 1701-1800, z/s for the sound [z]; for 1801-1930, the phonetic spelling of ‘i’ for ‘e’ occurs frequently.
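Why a lexicon entry such as ‘hum’ hides an error can be shown with a minimal sketch. It assumes that variant detection amounts to a plain lexicon look-up, which is a simplification and not a description of either tool; the function name `flag_variants` and the toy lexicon and tokens are hypothetical.

```python
def flag_variants(tokens, lexicon):
    """Flag tokens that are absent from the modern lexicon.
    Anything listed in the lexicon is accepted without further checks."""
    return [token for token in tokens if token.lower() not in lexicon]


# Hypothetical toy lexicon that wrongly contains the historical form 'hum'.
LEXICON = {"um", "hum", "livro", "tenho"}

print(flag_variants(["tenho", "hum", "liuro"], LEXICON))  # ['liuro']
```

Only ‘liuro’ is flagged. ‘hum’ passes unnoticed because the lexicon lists it, so under this assumption cleaning the lexicon is a precondition for catching such variants.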
  • 15. Comparison of VARD2 and TICCL on Spanish. Table: Best-first ranked performance of TICCL and VARD2 on the tokens of the test set. TICCL and VARD2 were trained on the same resources.
      Tool              acc   prec  recall  f-score
      VARD2-notraining  76.1  71.8  37.3    49.1
      TICCL-notraining  74.0  81.3  20.8    33.1
      VARD2             87.2  96.4  66.0    78.4
      TICCL             89.0  91.6  77.3    83.9
  • 16. Error analysis Spanish. Typical errors made by VARD2 and TICCL: around 41-47% of the words that were not corrected were not spotted as errors because the word occurs in the lexicon. For example, ‘tu’ needs an accent in modern Spanish when used as a personal pronoun (tú), but is written ‘tu’ in the possessive form. For around 37-43% of the words that were not corrected, the correct forms did not occur in the lexicon (names and conjugated verbs, for example) and could never have been resolved with the current settings. Around 15% of the errors are due to abbreviations.
  • 17. Conclusion. VARD2 can be trained on other languages to good effect, but needs manually constructed resources. TICCL can be successfully extended to these languages too, without manual work. VARD2 outperforms TICCL without training on domain-specific examples. TICCL outperforms VARD2 when trained. TICCL can handle far greater amounts of language-specific resources such as lexicons and name lists.
  • 18. Thank you for your attention! References:
      Martin Reynaert. On OCR ground truths and OCR post-correction gold standards, tools and formats. Proceedings of DATeCH 2014: Digital Access to Textual Cultural Heritage, Madrid, 2014.
      Martin Reynaert. Synergy of Nederlab and @PhilosTEI: diachronic and multilingual Text-Induced Corpus Clean-up. Proceedings of LREC 2014: Language Resources and Evaluation Conference, Reykjavik, 2014.
      Rita Marquilhas and Iris Hendrickx. Manuscripts and machines: the automatic replacement of spelling variants in a Portuguese historical corpus. International Journal of Humanities and Arts Computing, 18.1 (2014): 53-68, Edinburgh University Press.
      Martin Reynaert, Iris Hendrickx, and Rita Marquilhas. Historical spelling normalization. A comparison of two statistical methods: TICCL and VARD2. Proceedings of the Second Workshop on Annotation of Corpora for Research in the Humanities (ACRH-2), pages 87-98, 2012.
      Alistair Baron. Dealing with spelling variation in Early Modern English texts. PhD thesis, University of Lancaster, Lancaster, UK, 2011.
      Martin Reynaert. Character confusion versus focus word-based correction of spelling and OCR variants in corpora. International Journal on Document Analysis and Recognition, 14:173-187, 2010.
      Iris Hendrickx and Rita Marquilhas. From Old Texts to Modern Spellings: An Experiment in Automatic Normalisation. Journal for Language Technology and Computational Linguistics (JLCL), 26(2):65-76, 2011.