GSCL, Essen – 2015-09-30 WSD in Old English – Martin Wunderlich, Alexander Fraser, Paul Sander Langeslag
Additional material: http://www.cis.uni-muenchen.de/~martinw/
1 / 25
"God Wat þæt Ic Eom God"
Word Sense Disambiguation in Old English
Bamberg, Staatsbibliothek, Msc.Nat.1 (9th century)
Martin Wunderlich and Alexander Fraser (LMU München)
Paul Sander Langeslag (University of Göttingen)
Can we apply WSD techniques to a historical language like Old English, and what are the specific challenges?
Overview
● Background on the Old English language
● NLP and historical languages – problems and opportunities
● Old English digital resources
● WSD methodologies applied here
● Experiments and results
● Summary and discussion
Background on the OE language 1
● Spoken ca. 450 – 1100 AD
● A Germanic language:
  „God Wat þæt Ic Eom God‟
  → „Gott weiß, dass ich gut bin‟
  („God knows I'm good‟ - David Bowie)
● 5 cases, 3 genders, 3 numbers (singular, dual, plural)
  An example:
  – „Seo cwen geseah þone guman.‟ *
  – „Se guma geseah þa cwen.‟ **
  (from Crystal, 2010)
  * „The woman saw the man.‟ ** „The man saw the woman.‟
Background on the OE language 2
● Initially a runic alphabet known as „futhorc‟ (after the first letters: ᚠᚢᚦᚩᚱᚳ)
● ...keeping Thorn ᚦ and Wynn ƿ and adding Latin
● 24-letter alphabet:
  a æ b c d ð e f ᵹ/g h i l m n o p r s/ſ t þ u ƿ/w x y
● Introduced around 600 AD
Background on the OE language 3
Migrations and settlements:
https://www.uni-due.de/SHE/Germanic_Migration_to_Britain.gif
(site maintained by Prof. Raymond Hickey, Chair of Linguistics)
NLP & historical languages: problems
Largely missing for historical languages:
● Stopword lists
● POS taggers
● Word and sentence tokenizers
● Standard tools and libraries
● Shared tasks with prepared training data
● Existing research … well, a bit ...
NLP & historical languages: related work
● Annotation projection in Germanic languages with parallel Bible texts (Sukhareva and Chiarcos, 2014)
● Application of existing NLP tools to ancient Italian (Pennacchiotti and Zanzotto, 2008)
● Tagging Old East Slavonic texts (Meyer, 2011)
● POS tagging Early Modern German texts (Bollmann, 2013)
● Projection of tags from contemporary English to Middle English (Moon and Baldridge, 2007)
NLP & historical languages: opportunities
1. Digital corpora & dictionaries/lexicons do exist
   (incl. OE Wikipedia: https://ang.wikipedia.org/wiki/H%C4%93afodtramet)
2. Static corpus
3. Few existing NLP applications → lots to explore
Old English digital resources: corpora
● York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE); ca. 1.5 million words
● York-Toronto-Helsinki Parsed Corpus of Old English Poetry (YCOEP); 71,490 words
● Dictionary of Old English Corpus in Electronic Form (DOEC); ca. 3.8 million words
→ all available through the University of Oxford Text Archive (http://www.ota.ahds.ac.uk/)
Old English digital resources: dictionary
Dictionary of Old English (DOE) corpus stats:

  Number of HTML documents      3,037
  Token count                   3,786,753
  Type count                    343,135
  Token count / type count      ca. 11
  Total number of sentences     234,113
  Average sentence length       5.5
  Minimum sentence length       1
  Maximum sentence length       263

Compare to the Brown corpus: ca. 1 million tokens and ca. 50,000 types (token/type ratio = 20)
Spelling variations, e.g. „wundarlic‟, „wundorlic‟, „wunderlic‟
12,568 DOE entries for the letters from A to G (http://tapor.library.utoronto.ca/doe/)
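
The token/type figures above can be reproduced with a few lines of Python. This is only a sketch: the function name `token_type_ratio` and the toy sample are illustrative assumptions, and the authors' own counting code (built around Mallet) is not shown in the slides. The sample also shows why spelling variation lowers the ratio, since each variant counts as its own type.

```python
from collections import Counter


def token_type_ratio(tokens):
    """Return (token count, type count, tokens per type) for a token list."""
    counts = Counter(tokens)
    n_tokens = sum(counts.values())
    n_types = len(counts)
    ratio = n_tokens / n_types if n_types else 0.0
    return n_tokens, n_types, ratio


# Toy sample (hypothetical): the three spellings count as three separate types.
sample = "wundorlic wundarlic wunderlic boc boc boc".split()
print(token_type_ratio(sample))

# Ratios from the counts quoted on the slide.
doe_ratio = 3_786_753 / 343_135    # DOE corpus, roughly 11 tokens per type
brown_ratio = 1_000_000 / 50_000   # Brown corpus (approximate counts)
print(round(doe_ratio, 2), brown_ratio)
```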
WSD methodologies 1
Criteria for selecting the target terms:
➔ minimum count 200, minimum length 3 characters
➔ non-Latin (i.e. no „dictum‟, „confundantur‟, „magister‟...)
➔ common nouns
➔ no proper nouns (e.g. no „Egypta‟, „Micel‟, „Iulianus‟...)
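
The two numeric criteria can be applied mechanically, as in the sketch below. The thresholds come from the slide; the function name `candidate_terms`, the `exclude` set and the toy token list are hypothetical. The Latin, common-noun and proper-noun criteria presumably need manual judgement or a lexicon, so here they are reduced to a hand-made exclusion set.

```python
from collections import Counter

MIN_COUNT = 200   # minimum corpus frequency (from the slide)
MIN_LENGTH = 3    # minimum term length in characters (from the slide)


def candidate_terms(tokens, exclude=frozenset()):
    """Return terms that pass the frequency and length thresholds.

    `exclude` stands in for the Latin and proper-noun filters, which
    are assumed here to be a manually compiled set.
    """
    counts = Counter(t.lower() for t in tokens)
    return sorted(
        term for term, n in counts.items()
        if n >= MIN_COUNT and len(term) >= MIN_LENGTH and term not in exclude
    )


# Hypothetical toy corpus: only "boc" passes all three checks.
tokens = ["boc"] * 250 + ["se"] * 900 + ["dictum"] * 300 + ["ban"] * 150
latin_or_proper = frozenset({"dictum"})
print(candidate_terms(tokens, latin_or_proper))  # ['boc']
```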
WSD methodologies 2
Target terms:

  Target term   Token count in DOE corpus   Basic translation
  Anweald       242                         Power, realm, order of angels
  Fultum        574                         Help, aid, remedy
  Fæder         416                         Father, lord (relig.)
  For           955                         Movement, journey...
  Eadigan       263                         To bless, to make happy
  Boc           567                         Book, volume, legal doc
  Ban           314                         Bone, ivory
  Are           308                         Honour, mercy, property
  Andlang       1743                        Continuous, upright
  Dryhten       261                         Lord (worldly & relig.), chief

100 concordance matches each (random selection)
WSD methodologies 3
Selected word senses of "bōc":
(http://tapor.library.utoronto.ca/doe/dict/indices/headwordsd.html#E03007)
A. book
  A.1. in general, without particular reference to form or content
    Lk (WSCp) 4.17: he þa boc unfeold
B. major division of a larger work
    JnArgGl (Li) 3: ðis uutedlice godspell aurat in ðær meigð æfter ðon in Pathma ealond þæt boc ðæra sighðana eac awrat.
D. legal document
    Birch 862: Þis is ðæs landes boc æt Duntune ðe Eadred cyng edniwon gæbocodæ sanctæ trinitate & Sanctæ Pætræ & Sanctæ Paule into ealdan mynstræ.
WSD methodologies 4
From corpus to feature vectors – bag-of-words model with fixed-size token window
from Ch 540 (Birch 862):
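
A fixed-size window bag of words for the Birch 862 sentence could look roughly like this. The window size of 3, the lowercasing and the whitespace tokenisation are all assumptions made for illustration; the slides do not state the settings used, and the original pipeline was built on Mallet rather than Python.

```python
from collections import Counter


def window_bow(tokens, target, window=3):
    """Count the tokens within `window` positions on either side of
    the first occurrence of `target` (the target itself is left out)."""
    idx = tokens.index(target)
    left = tokens[max(0, idx - window):idx]
    right = tokens[idx + 1:idx + 1 + window]
    return Counter(left + right)


# Start of the Birch 862 example quoted for sense D of "boc".
sentence = "Þis is ðæs landes boc æt Duntune ðe Eadred cyng"
tokens = sentence.lower().split()

features = window_bow(tokens, "boc", window=3)
print(features)
```

Each distinct context word becomes one dimension of the feature vector; words outside the window (here „þis‟, „eadred‟ and „cyng‟) contribute nothing.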
Implementation
● Libraries used:
  – Mallet (NLP and ML library)
  – Jsoup (HTML processing)
● Own implementation:
  – Parsing of corpus and dictionary data
  – Feature extraction and instance creation
  – Pipes for baseline classifiers (Mallet additions)
  – Metrics, summarization and output of results
  ...and much more...
Experiments and results 1
● Baseline 1: most frequent class
  – Accuracy: 0.67
● Baseline 2: random class
  – Accuracy: 0.44
Human annotators' lower and upper bounds: 0.75 – 0.97 (Gale et al., 1992)
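
The two baselines can be sketched as follows. The labels are hypothetical, and the random baseline here draws uniformly from the observed classes, which is an assumption: the slides do not say whether the random class was drawn uniformly or in proportion to class frequency. The authors' versions were implemented as Mallet additions, not in Python.

```python
import random
from collections import Counter


def accuracy(gold, predicted):
    """Fraction of positions where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def most_frequent_baseline(train_labels, n):
    """Predict the most frequent training label for every test item."""
    label = Counter(train_labels).most_common(1)[0][0]
    return [label] * n


def random_baseline(train_labels, n, seed=0):
    """Predict a uniformly random label from the classes seen in training."""
    rng = random.Random(seed)
    classes = sorted(set(train_labels))
    return [rng.choice(classes) for _ in range(n)]


# Hypothetical sense labels for one target term.
train = ["A", "A", "A", "D", "B", "A"]
test = ["A", "D", "A", "A"]

mfc = most_frequent_baseline(train, len(test))
print(accuracy(test, mfc))  # 0.75
print(accuracy(test, random_baseline(train, len(test))))
```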
Experiments and results 2
One-vs-all classification
[Figure, two panels: "A vs. notA - Naive Bayes". Series: Accuracy, Avg Precision, Avg Recall, Avg F1; the second panel adds a linear regression trend line. x-axis 0 to 20, y-axis 0 to 1.]
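
In a one-vs-all ("A vs. notA") setup, each sense in turn is treated as the positive class and all other senses are merged into one negative class. A minimal sketch of that relabelling step, with hypothetical helper names and labels:

```python
def one_vs_all(labels, positive):
    """Map multi-class sense labels to a binary positive / not-positive task."""
    negative = "not" + positive
    return [positive if label == positive else negative for label in labels]


def binary_tasks(labels):
    """Build one binary labelling per sense that occurs in `labels`."""
    return {sense: one_vs_all(labels, sense) for sense in sorted(set(labels))}


# Hypothetical sense annotations for four concordance matches of "boc".
senses = ["A", "B", "D", "A"]
print(one_vs_all(senses, "A"))  # ['A', 'notA', 'notA', 'A']
print(binary_tasks(senses)["D"])
```

A separate binary classifier (Naive Bayes or MaxEnt in the slides) is then trained on each relabelled data set.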
Experiments and results 3
One-vs-all classification
[Figure, two panels: "A vs. notA - MaxEnt". Series: Accuracy, Avg Precision, Avg Recall, Avg F1; the second panel adds a linear regression trend line. x-axis 0 to 20, y-axis 0 to 1.]
Experiments and results 4
(NB = Naive Bayes, ME = MaxEnt, BoW = bag of words)

  Algorithm         Feature   Accuracy           Precision          Recall             F1
                    vector    Avg      Std Dev   Avg      Std Dev   Avg      Std Dev   Avg
  NB, multi-class   BoW       0.7635   0.11      0.7205   0.18      0.7865   0.16      0.7521
  ME, multi-class   BoW       0.7520   0.17      0.8610   0.10      0.6915   0.17      0.7670
  NB, one-vs-all    BoW       0.8400   0.09      0.8458   0.10      0.8368   0.11      0.8295
  ME, one-vs-all    BoW       0.7950   0.12      0.7875   0.13      0.8080   0.12      0.7662
  NB, multi-class   Coll.     0.7245   0.12      0.8135   0.08      0.6510   0.12      0.5895
  ME, multi-class   Coll.     0.7910   0.13      0.8845   0.08      0.6875   0.16      0.6510
  NB, one-vs-all    Coll.     0.8200   0.09      0.8305   0.12      0.8085   0.10      0.7970
  ME, one-vs-all    Coll.     0.7290   0.09      0.7395   0.10      0.7145   0.14      0.6890
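
For reference, the metrics in the table can be computed as sketched below. The labels and per-term scores are hypothetical, and the use of the population standard deviation is an assumption, since the slides do not say how the averages and deviations were aggregated. If the F1 column is a mean of per-term F1 values (also an assumption), it need not equal the harmonic mean of the averaged precision and recall, which may explain rows where the three columns do not line up.

```python
from statistics import mean, pstdev


def precision_recall_f1(gold, predicted, positive):
    """Precision, recall and F1 for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def summarize(values):
    """Average and (population) standard deviation over target terms."""
    return mean(values), pstdev(values)


# Hypothetical gold and predicted labels for one "A vs. notA" task.
gold = ["A", "A", "notA", "notA", "A"]
pred = ["A", "notA", "notA", "A", "A"]
p, r, f = precision_recall_f1(gold, pred, "A")
print(p, r, f)

# Hypothetical per-term F1 scores, averaged as in the table.
f1_per_term = [0.9, 0.7, 0.8]
print(summarize(f1_per_term))
```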
Summary
● Historical languages: interesting, rewarding and difficult to work with
● WSD does give satisfactory results even without stemming etc.
● Best WSD performance: NB (F1), one vs. all, window size: ??
● Annotated data set (available on website)
● Baseline classifiers as contributions to MALLET
● Possible extensions:
  – More advanced vector representations
  – Bootstrapping
  – Train classifiers based on other corpora
  – Distributional thesaurus (DT)?
● Acknowledgements:
  Winfried Rudolf, Göttingen & Juan Carmona Ramirez, Jena
Thanks a lot for your attention!
Any questions?
Paul S. Langeslag, Göttingen
New book: Seasons in the Literatures of the Medieval North
Alexander Fraser, München
References
● Mark Stevenson. Word sense disambiguation: the case for combinations of knowledge sources. CSLI studies in computational linguistics. CSLI Publ., Stanford, Calif., 2003.
● D. Yarowsky. Word sense disambiguation. In Alexander Clark, editor, The handbook of computational linguistics and natural language processing, Blackwell handbooks in linguistics. Wiley-Blackwell, Oxford [u.a.], 1. publ. edition, 2010.
● D. Crystal. The Cambridge Encyclopedia of Language. Cambridge University Press, 2010.
● Clara Cabezas, Philip Resnik, and Jessica Stevens. Supervised sense tagging using support vector machines. In The Proceedings of the Second International Workshop on Evaluating Word Sense Disambiguation Systems, SENSEVAL ’01, pages 59–62, Stroudsburg, PA, USA, 2001. Association for Computational Linguistics.
● Andrew Kachites McCallum. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.
● Marcel Bollmann. POS tagging for historical texts with sparse training data. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability in Discourse, pages 11–18, Sofia, Bulgaria, August 2013. Association for Computational Linguistics.
● Taesun Moon and Jason Baldridge. Part-of-speech tagging for Middle English through alignment and projection of parallel diachronic texts. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 390–399, Prague, Czech Republic, June 2007. Association for Computational Linguistics.
● Roland Meyer. New wine in old wineskins? Tagging Old Russian via annotation projection from modern translations. Russian Linguistics, 35(2):267–281, 2011.
● Marco Pennacchiotti and Fabio Massimo Zanzotto. Natural Language Processing across time: an empirical investigation on Italian, volume 5221, pages 371–382. Springer, 2008.

Word Sense Disambiguation in Old English

  • 1. GSCL, Essen – 2015-09-30 WSD in Old English – Martin Wunderlich, Alexander Fraser, Paul Sander Langeslag Additional material: http://www.cis.uni-muenchen.de/~martinw/ 1 / 25 "God Wat þæt Ic Eom God" Word Sense Disambiguation in Old English Bamberg, Staatsbibliothek, Msc.Nat.1 (9th century) Martin Wunderlich and Alexander Fraser (LMU M nchen) Paul Sander Langeslag (University of G ttingen)
  • 2. GSCL, Essen – 2015-09-30 WSD in Old English – Martin Wunderlich, Alexander Fraser, Paul Sander Langeslag Additional material: http://www.cis.uni-muenchen.de/~martinw/ 2 / 25 Can we apply WSD techniques to a historical language like Old English and what are the specific challenges?
  • 3. GSCL, Essen – 2015-09-30 WSD in Old English – Martin Wunderlich, Alexander Fraser, Paul Sander Langeslag Additional material: http://www.cis.uni-muenchen.de/~martinw/ 3 / 25 Overview ● Background on the Old English language ● NLP and historical languages – problems and opportunities ● Old English digital resources ● WSD methodologies applied here ● Experiments and results ● Summary and discussion
  • 4. Background on the OE language 1
      ● Spoken ca. 450 – 1100 AD
      ● A Germanic language: „God Wat þæt Ic Eom God‟ → „Gott weiß, dass ich gut bin‟ („God knows I'm good‟ - David Bowie)
      ● 5 cases, 3 genders, 3 numbers (singular, dual, plural)
      An example (from Crystal, 2010):
        – „Seo cwen geseah þone guman.‟ („The woman saw the man.‟)
        – „Se guma geseah þa cwen.‟ („The man saw the woman.‟)
  • 5. Background on the OE language 2
      ● Initially a runic alphabet known as „futhorc‟ (after the first letters - ᚠᚢᚦᚩᚱᚳ)
      ● ...keeping Thorn ᚦ and Wynn ƿ and adding Latin
      ● 24-letter alphabet: a æ b c d ð e f ᵹ/g h i l m n o p r s/ſ t þ u ƿ/w x y
      ● Introduced around 600 AD
  • 6. Background on the OE language 3
      Migrations and settlements: https://www.uni-due.de/SHE/Germanic_Migration_to_Britain.gif (site maintained by Prof. Raymond Hickey, Chair of Linguistics)
  • 7. NLP & historical languages: problems
      ● Stopword lists
      ● POS taggers
      ● Word and sentence tokenizers
      ● Standard tools and libraries
      ● Shared tasks with prepared training data
      ● Existing research
  • 8. NLP & historical languages: problems
      ● Stopword lists
      ● POS taggers
      ● Word and sentence tokenizers
      ● Standard tools and libraries
      ● Shared tasks with prepared training data
      ● Existing research … well, a bit ...
  • 9. NLP & historical languages: related work
      ● Annotation projection in Germanic languages with parallel Bible texts (Sukhareva and Chiarcos, 2014)
      ● Application of existing NLP tools to ancient Italian (Pennacchiotti and Zanzotto, 2008)
      ● Tagging Old East Slavonic texts (Meyer, 2011)
      ● POS tagging Early Modern German texts (Bollmann, 2013)
      ● Projection of tags from contemporary EN to ME (Moon and Baldridge, 2007)
  • 10. NLP & historical languages: opportunities
      1. Digital corpora & dictionaries/lexicons do exist (incl. OE Wikipedia: https://ang.wikipedia.org/wiki/H%C4%93afodtramet)
      2. Static corpus
      3. Few existing NLP applications → lots to explore
  • 11. Old English digital resources: corpora
      ● York-Toronto-Helsinki Parsed Corpus of Old English prose (YCOE); ca. 1.5 million words
      ● York-Toronto-Helsinki Parsed Corpus of Old English poetry (YCOEP); 71,490 words
      ● Dictionary of Old English Corpus in Electronic Form (DOEC); ca. 3.8 million words
      → all available through the University of Oxford Text Archive (http://www.ota.ahds.ac.uk/)
  • 12. Old English digital resources: dictionary
      Dictionary of Old English (DOE) corpus stats:
        Number of HTML documents      3,037
        Token count                   3,786,753
        Type count                    343,135
        Token count / type count      ca. 11
        Total number of sentences     234,113
        Average sentence length       5.5
        Minimum sentence length       1
        Maximum sentence length       263
      Compare to Brown corpus: ca. 1 million tokens and ca. 50,000 types (tokens/types = 20)
      Spelling variations, e.g. „wundarlic‟, „wundorlic‟, „wunderlic‟
      12,568 DOE entries for the letters from A to G (http://tapor.library.utoronto.ca/doe/)
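
The token/type figures on slide 12 can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and the counts are simply the ones quoted on the slide.

```python
def tokens_per_type(token_count: int, type_count: int) -> float:
    """Average number of tokens per distinct type (word form)."""
    if type_count <= 0:
        raise ValueError("type_count must be positive")
    return token_count / type_count


def count_tokens_and_types(tokens):
    """Return (token count, type count) for an already tokenized text."""
    return len(tokens), len(set(tokens))


# Counts as quoted on the slide.
doe_ratio = tokens_per_type(3_786_753, 343_135)
brown_ratio = tokens_per_type(1_000_000, 50_000)

# Toy example using the three spelling variants from the slide:
# each variant counts as its own type, which inflates the type count.
toy_counts = count_tokens_and_types(["wundarlic", "wundorlic", "wunderlic", "wundorlic"])

print(round(doe_ratio, 2), brown_ratio, toy_counts)
```

A lower ratio means each word form is seen less often on average, which is consistent with the spelling variation noted on the slide: every variant spelling is counted as a separate type.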
  • 13. WSD methodologies 1
      Criteria for selecting the target terms:
      ➔ minimum count 200, minimum length 3 characters
      ➔ non-Latin (i.e. no „dictum‟, „confundantur‟, „magister‟...)
      ➔ common nouns
      ➔ no proper nouns (e.g. no „Egypta‟, „Micel‟, „Iulianus‟...)
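
The frequency and length criteria from slide 13 can be sketched as a simple filter. This is an illustration under stated assumptions, not the authors' implementation: `candidate_terms` and its parameters are hypothetical names, and the Latin and proper-noun checks are reduced to a caller-supplied exclusion set, since the slides do not say how those checks were carried out.

```python
from collections import Counter


def candidate_terms(tokens, min_count=200, min_length=3, excluded=frozenset()):
    """Return terms that meet the frequency and length thresholds.

    `excluded` stands in for the Latin-word and proper-noun filters
    (an assumption; e.g. a manually compiled list).
    """
    counts = Counter(t.lower() for t in tokens)
    return sorted(
        term
        for term, n in counts.items()
        if n >= min_count and len(term) >= min_length and term not in excluded
    )


# Toy corpus with a lowered threshold so the filter is visible.
toy_tokens = ["boc"] * 3 + ["se"] * 5 + ["dictum"] * 3 + ["ban"]
selected = candidate_terms(toy_tokens, min_count=2, excluded={"dictum"})
print(selected)
```

In the toy run, „se‟ fails the length criterion, „ban‟ fails the count criterion, and „dictum‟ is excluded as Latin.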
  • 14. WSD methodologies 2
      Target terms:
        Target term   Token count in DOE corpus   Basic translation
        Anweald       242                         Power, realm, order of angels
        Fultum        574                         Help, aid, remedy
        Fæder         416                         Father, lord (relig.)
        For           955                         Movement, journey...
        Eadigan       263                         To bless, to make happy
        Boc           567                         Book, volume, legal doc
        Ban           314                         Bone, ivory
        Are           308                         Honour, mercy, property
        Andlang       1743                        Continuous, upright
        Dryhten       261                         Lord (worldly & relig.), chief
      100 concordance matches each (random selection)
  • 15. WSD methodologies 3
      Selected word senses of "bōc": (http://tapor.library.utoronto.ca/doe/dict/indices/headwordsd.html#E03007)
      A. book
        A.1. in general, without particular reference to form or content
          Lk (WSCp) 4.17: he þa boc unfeold
      B. major division of a larger work
          JnArgGl (Li) 3: ðis uutedlice godspell aurat in ðær meigð æfter ðon in Pathma ealond þæt boc ðæra sighðana eac awrat.
      D. legal document
          Birch 862: Þis is ðæs landes boc æt Duntune ðe Eadred cyng edniwon gæbocodæ sanctæ trinitate & Sanctæ Pætræ & Sanctæ Paule into ealdan mynstræ.
  • 16. WSD methodologies 4
      From corpus to feature vectors – bag-of-words model with fixed-size token window from Ch 540 (Birch 862):
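
A bag-of-words feature vector over a fixed-size token window, as named on slide 16, might be built as follows. This is a minimal sketch: the function name, whitespace tokenization, lowercasing, and the window size of 2 are all assumptions, since the transcript does not preserve the slide's illustration or the window size actually used. The example sentence is the opening of Birch 862 quoted on slide 15.

```python
from collections import Counter


def window_features(tokens, target_index, window=3):
    """Count the tokens within `window` positions of the target term.

    The target itself is left out; word order inside the window is
    discarded, which is what makes this a bag-of-words representation.
    """
    left = tokens[max(0, target_index - window):target_index]
    right = tokens[target_index + 1:target_index + 1 + window]
    return Counter(t.lower() for t in left + right)


sentence = "Þis is ðæs landes boc æt Duntune".split()
features = window_features(sentence, sentence.index("boc"), window=2)
print(features)
```

Each distinct context word becomes one dimension of the feature vector, with its count as the value.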
  • 17. Implementation
      ● Libraries used:
        – Mallet (NLP and ML library)
        – Jsoup (HTML processing)
      ● Own implementation:
        – Parsing of corpus and dictionary data
        – Feature extraction and instance creation
        – Pipes for baseline classifiers (Mallet additions)
        – Metrics, summarization and output of results
        ...and much more...
  • 18. Experiments and results 1
      ● Baseline 1: most frequent class.
        – Accuracy: 0.67
      ● Baseline 2: random class.
        – Accuracy: 0.44
      Human annotators' upper and lower bounds: 0.75 – 0.97 (Gale et al., 1992)
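
The two baselines on slide 18 can be sketched in plain Python. This is an illustration only, not the Mallet pipes mentioned on slide 17: the function names are hypothetical, and the random baseline is assumed to pick uniformly among the senses seen in training, which the slides do not specify.

```python
import random
from collections import Counter


def most_frequent_class_baseline(train_labels, test_labels):
    """Accuracy when every test instance gets the majority training sense."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for y in test_labels if y == majority) / len(test_labels)


def random_class_baseline(train_labels, test_labels, seed=0):
    """Accuracy when each test instance gets a uniformly random sense.

    Uniform sampling over the observed senses is an assumption.
    """
    rng = random.Random(seed)
    classes = sorted(set(train_labels))
    hits = sum(1 for y in test_labels if rng.choice(classes) == y)
    return hits / len(test_labels)


# Toy sense labels, loosely modelled on the A / B / D senses of "bōc".
train = ["A"] * 6 + ["B"] * 3 + ["D"]
test = ["A", "A", "B", "A"]
mfc_accuracy = most_frequent_class_baseline(train, test)
random_accuracy = random_class_baseline(train, test)
print(mfc_accuracy, random_accuracy)
```

Any classifier worth reporting should beat the most-frequent-class figure, which is why it serves as the stricter of the two baselines.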
  • 19. Experiments and results 2
      One-vs-all classification
      [Figure: two plots titled "A vs. notA - Naive Bayes" with the series Accuracy, Avg Precision, Avg Recall and Avg F1; the second plot adds a linear regression trend line ("Lin Reg trend"). Horizontal axis 0 to 20, vertical axis 0 to 1.]
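
The "A vs. notA" setup on slide 19 amounts to relabelling every sense other than A as a single negative class and training a binary classifier. The sketch below shows that idea with a small multinomial Naive Bayes using add-one smoothing. It is an assumption-laden illustration, not Mallet's implementation: the class and function names are hypothetical and the training contexts are toy data.

```python
import math
from collections import Counter, defaultdict


def one_vs_all(labels, positive):
    """Collapse all senses except `positive` into one 'not' class."""
    return [positive if y == positive else "not" + positive for y in labels]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, y in zip(docs, labels):
            self.word_counts[y].update(doc)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for y, n in self.class_counts.items():
            denom = sum(self.word_counts[y].values()) + len(self.vocab)
            score = math.log(n / total_docs)  # log prior
            for w in doc:
                if w in self.vocab:  # words unseen in training are ignored
                    score += math.log((self.word_counts[y][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = y, score
        return best_label


# Toy context windows around "boc" with their (multi-class) senses.
contexts = [
    ["he", "þa", "unfeold"],
    ["landes", "æt", "duntune"],
    ["ðæs", "landes", "cyng"],
    ["þa", "unfeold", "he"],
]
senses = ["A", "D", "D", "A"]

binary_labels = one_vs_all(senses, "A")
model = NaiveBayes().fit(contexts, binary_labels)
prediction_legal = model.predict(["landes", "cyng"])
prediction_book = model.predict(["he", "unfeold"])
print(binary_labels, prediction_legal, prediction_book)
```

Repeating the relabelling once per sense yields one binary classifier per sense, which is the usual reading of "one-vs-all".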
  • 20. Experiments and results 3
      One-vs-all classification
      [Figure: two plots titled "A vs. notA - MaxEnt" with the series Accuracy, Avg Precision, Avg Recall and Avg F1; the second plot adds a linear regression trend line ("Lin Reg trend"). Horizontal axis 0 to 20, vertical axis 0 to 1.]
  • 21. WSD methodologies 3
      Selected word senses of "bōc": (http://tapor.library.utoronto.ca/doe/dict/indices/headwordsd.html#E03007)
      A. book
        A.1. in general, without particular reference to form or content
          Lk (WSCp) 4.17: he þa boc unfeold
      B. major division of a larger work
          JnArgGl (Li) 3: ðis uutedlice godspell aurat in ðær meigð æfter ðon in Pathma ealond þæt boc ðæra sighðana eac awrat.
      D. legal document
          Birch 862: Þis is ðæs landes boc æt Duntune ðe Eadred cyng edniwon gæbocodæ sanctæ trinitate & Sanctæ Pætræ & Sanctæ Paule into ealdan mynstræ.
  • 22. Experiments and results 4
        Algorithm         Feature vector   Accuracy          Precision         Recall            F1
                                           Avg      Std Dev  Avg      Std Dev  Avg      Std Dev  Avg
        NB, multi-class   BoW              0.7635   0.11     0.7205   0.18     0.7865   0.16     0.7521
        ME, multi-class   BoW              0.7520   0.17     0.8610   0.10     0.6915   0.17     0.7670
        NB, one-vs-all    BoW              0.8400   0.09     0.8458   0.10     0.8368   0.11     0.8295
        ME, one-vs-all    BoW              0.7950   0.12     0.7875   0.13     0.8080   0.12     0.7662
        NB, multi-class   Coll.            0.7245   0.12     0.8135   0.08     0.6510   0.12     0.5895
        ME, multi-class   Coll.            0.7910   0.13     0.8845   0.08     0.6875   0.16     0.6510
        NB, one-vs-all    Coll.            0.8200   0.09     0.8305   0.12     0.8085   0.10     0.7970
        ME, one-vs-all    Coll.            0.7290   0.09     0.7395   0.10     0.7145   0.14     0.6890
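
The metrics in the table on slide 22 can be computed as sketched below. This is a minimal illustration with hypothetical function names and toy labels. Two points are assumptions rather than facts from the slides: that the "Avg" and "Std Dev" columns summarise per-term scores across the target terms, and that the population standard deviation was used.

```python
from statistics import mean, pstdev


def precision_recall_f1(gold, predicted, positive):
    """Precision, recall and F1 for one class from paired label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def summarise(scores):
    """Average and standard deviation over per-term scores.

    Population standard deviation is an assumption; the slides do not
    say which variant was used.
    """
    return mean(scores), pstdev(scores)


gold = ["A", "A", "notA", "notA", "A"]
predicted = ["A", "notA", "notA", "A", "A"]
p, r, f1 = precision_recall_f1(gold, predicted, "A")
avg, sd = summarise([0.5, 1.0])
print(p, r, f1, avg, sd)
```

If the F1 column is indeed a mean of per-term F1 scores, it need not equal the harmonic mean of the averaged precision and recall in the same row.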
  • 23. Summary
      ● Historical languages: interesting, rewarding and difficult to work with
      ● WSD does give satisfactory results even without stemming etc.
      ● Best WSD performance: NB (F1), one vs. all, window size: ??
      ● Annotated data set (available on website)
      ● Baseline classifiers as contributions to MALLET
      ● Possible extensions:
        – More advanced vector representations
        – Bootstrapping
        – Train classifiers based on other corpora
        – Distributional thesaurus (DT)?
      ● Acknowledgements: Winfried Rudolf, Göttingen & Juan Carmona Ramirez, Jena
  • 24. Thanks a lot for your attention! Any questions?
      Paul S. Langeslag, Göttingen. New book: Seasons in the Literatures of the Medieval North
      Alexander Fraser, München
  • 25. References
      ● Mark Stevenson. Word Sense Disambiguation: The Case for Combinations of Knowledge Sources. CSLI Studies in Computational Linguistics. CSLI Publ., Stanford, Calif., 2003.
      ● D. Yarowsky. Word sense disambiguation. In Alexander Clark, editor, The Handbook of Computational Linguistics and Natural Language Processing, Blackwell Handbooks in Linguistics. Wiley-Blackwell, Oxford [u.a.], 1st edition, 2010.
      ● D. Crystal. The Cambridge Encyclopedia of Language. Cambridge University Press, 2010.
      ● Clara Cabezas, Philip Resnik, and Jessica Stevens. Supervised sense tagging using support vector machines. In The Proceedings of the Second International Workshop on Evaluating Word Sense Disambiguation Systems, SENSEVAL ’01, pages 59–62, Stroudsburg, PA, USA, 2001. Association for Computational Linguistics.
      ● Andrew Kachites McCallum. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.
      ● Marcel Bollmann. POS tagging for historical texts with sparse training data. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability in Discourse, pages 11–18, Sofia, Bulgaria, August 2013. Association for Computational Linguistics.
      ● Taesun Moon and Jason Baldridge. Part-of-speech tagging for Middle English through alignment and projection of parallel diachronic texts. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 390–399, Prague, Czech Republic, June 2007. Association for Computational Linguistics.
      ● Roland Meyer. New wine in old wineskins? - Tagging Old Russian via annotation projection from modern translations. Russian Linguistics, 35(2):267–281, 2011.
      ● Marco Pennacchiotti and Fabio Massimo Zanzotto. Natural Language Processing across time: an empirical investigation on Italian, volume 5221, pages 371–382. Springer, 2008.