GSCL, Essen – 2015-09-30 WSD in Old English – Martin Wunderlich, Alexander Fraser, Paul Sander Langeslag
Additional material: http://www.cis.uni-muenchen.de/~martinw/
"God Wat þæt Ic Eom God"
Word Sense Disambiguation in Old English
Bamberg, Staatsbibliothek, Msc.Nat.1 (9th century)
Martin Wunderlich and Alexander Fraser (LMU München)
Paul Sander Langeslag (University of Göttingen)
Can we apply WSD techniques to a historical language like Old English, and what are the specific challenges?
Overview
● Background on the Old English language
● NLP and historical languages – problems and opportunities
● Old English digital resources
● WSD methodologies applied here
● Experiments and results
● Summary and discussion
Background on the OE language 1
● Spoken ca. 450 – 1100 AD
● A Germanic language:
  "God Wat þæt Ic Eom God"
  → German: "Gott weiß, dass ich gut bin"
  ("God knows I'm good", David Bowie)
● 5 cases, 3 genders, 3 numbers (singular, dual, plural)
  An example (from Crystal, 2010):
  – "Seo cwen geseah þone guman." ("The woman saw the man.")
  – "Se guma geseah þa cwen." ("The man saw the woman.")
Background on the OE language 2
● Initially a runic alphabet known as "futhorc" (after its first letters: ᚠᚢᚦᚩᚱᚳ)
● ...keeping Thorn (ᚦ, written þ) and Wynn (ƿ) and adding Latin letters
● 24 letter alphabet:
  a æ b c d ð e f ᵹ/g h i l m n o p r s/ſ t þ u ƿ/w x y
● Introduced around 600 AD
Background on the OE language 3
Migrations and settlements:
https://www.uni-due.de/SHE/Germanic_Migration_to_Britain.gif
(site maintained by Prof. Raymond Hickey, Chair of Linguistics)
NLP & historical languages: problems
● Stopword lists
● POS taggers
● Word and sentence tokenizers
● Standard tools and libraries
● Shared tasks with prepared training data
● Existing research … well, a bit ...
NLP & historical languages: related work
● Annotation projection in Germanic languages with parallel Bible texts (Sukhareva and Chiarcos, 2014)
● Application of existing NLP tools to ancient Italian (Pennacchiotti and Zanzotto, 2008)
● Tagging Old East Slavonic texts (Meyer, 2011)
● POS tagging Early Modern German texts (Bollmann, 2013)
● Projection of tags from contemporary English to Middle English (Moon and Baldridge, 2007)
NLP & historical languages: opportunities
1. Digital corpora & dictionaries/lexicons do exist
   (incl. OE Wikipedia: https://ang.wikipedia.org/wiki/H%C4%93afodtramet)
2. Static corpus
3. Few existing NLP applications → lots to explore
Old English digital resources: corpora
● York-Toronto-Helsinki Parsed Corpus of Old English prose (YCOE); ca. 1.5 million words
● York-Toronto-Helsinki Parsed Corpus of Old English poetry (YCOEP); 71,490 words
● Dictionary of Old English Corpus in Electronic Form (DOEC); ca. 3.8 million words
→ all available through the University of Oxford Text Archive (http://www.ota.ahds.ac.uk/)
Old English digital resources: dictionary
Dictionary of Old English (DOE) corpus stats:
  Number of HTML documents    3,037
  Token count                 3,786,753
  Type count                  343,135
  Token count / type count    ca. 11
  Total number of sentences   234,113
  Average sentence length     5.5
  Minimum sentence length     1
  Maximum sentence length     263
Compare to the Brown corpus: ca. 1 million tokens and ca. 50,000 types (token count / type count = 20)
Spelling variations, e.g. "wundarlic", "wundorlic", "wunderlic"
12,568 DOE entries for the letters from A to G
(http://tapor.library.utoronto.ca/doe/)
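The two token/type ratios quoted on this slide follow directly from the counts. A minimal Python check, using only the figures given above (the Brown corpus counts are the rounded values from the slide; the function name is chosen here for illustration):

```python
def tokens_per_type(token_count: int, type_count: int) -> float:
    """Average number of occurrences per distinct word form (type)."""
    return token_count / type_count

# Counts as stated on the slide.
doe_ratio = tokens_per_type(3_786_753, 343_135)
brown_ratio = tokens_per_type(1_000_000, 50_000)

print(f"DOE corpus:   {doe_ratio:.2f} tokens per type")
print(f"Brown corpus: {brown_ratio:.2f} tokens per type")
```

A lower ratio means that each word form is seen fewer times on average, which fits the spelling variation noted above: variants such as "wundarlic" and "wundorlic" are counted as separate types.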
WSD methodologies 1
Criteria for selecting the target terms:
➔ minimum count 200, minimum length 3 characters
➔ non-Latin (e.g. no "dictum", "confundantur", "magister"...)
➔ common nouns
➔ no proper nouns (e.g. no "Egypta", "Micel", "Iulianus"...)
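The frequency, length and exclusion criteria can be sketched as a simple filter. This is an illustration only: the slides do not say how Latin words and proper nouns were identified, so the sketch assumes hand-made exclusion lists, and it does not model the "common nouns" criterion, which would need part-of-speech information. All names below are hypothetical.

```python
from collections import Counter

MIN_COUNT = 200   # minimum corpus frequency, from the slide
MIN_LENGTH = 3    # minimum length in characters, from the slide

def candidate_terms(tokens, latin_words, proper_nouns):
    """Return word forms that pass the frequency, length and exclusion filters."""
    counts = Counter(t.lower() for t in tokens)
    # Assumption: exclusion lists are supplied by hand.
    excluded = {w.lower() for w in latin_words} | {w.lower() for w in proper_nouns}
    return sorted(
        term
        for term, n in counts.items()
        if n >= MIN_COUNT and len(term) >= MIN_LENGTH and term not in excluded
    )

# Toy token list standing in for the corpus.
toy_tokens = (
    ["boc"] * 250        # frequent enough, kept
    + ["dictum"] * 300   # Latin, excluded
    + ["Egypta"] * 220   # proper noun, excluded
    + ["se"] * 5000      # too short
    + ["ban"] * 150      # too rare in this toy sample
)
selected = candidate_terms(toy_tokens, latin_words=["dictum"], proper_nouns=["Egypta"])
print(selected)
```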
WSD methodologies 2
Target terms:
  Target term  Token count in DOE corpus  Basic translation
  Anweald      242                        Power, realm, order of angels
  Fultum       574                        Help, aid, remedy
  Fæder        416                        Father, lord (relig.)
  For          955                        Movement, journey...
  Eadigan      263                        To bless, to make happy
  Boc          567                        Book, volume, legal doc
  Ban          314                        Bone, ivory
  Are          308                        Honour, mercy, property
  Andlang      1743                       Continuous, upright
  Dryhten      261                        Lord (worldly & relig.), chief
100 concordance matches each (random selection)
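Drawing 100 random concordance matches per target term could look like the following sketch. It assumes whitespace tokenization and exact matching on the lower-cased word form, neither of which is specified on the slides; the function name and the fixed seed are illustrative choices.

```python
import random

def sample_concordance(sentences, target, k=100, seed=0):
    """Return up to k randomly chosen sentences that contain the target token."""
    # Assumption: whitespace tokenization and exact match on the lower-cased form.
    hits = [s for s in sentences if target in s.lower().split()]
    if len(hits) <= k:
        return hits
    # A fixed seed keeps the selection reproducible.
    return random.Random(seed).sample(hits, k)

# Toy corpus standing in for the DOE corpus sentences.
corpus = ["he þa boc unfeold"] * 150 + ["se guma geseah þa cwen"] * 50
matches = sample_concordance(corpus, "boc")
print(len(matches))
```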
WSD methodologies 3
Selected word senses of "bōc":
(http://tapor.library.utoronto.ca/doe/dict/indices/headwordsd.html#E03007)
A. book
  A.1. in general, without particular reference to form or content
    Lk (WSCp) 4.17: he þa boc unfeold
B. major division of a larger work
    JnArgGl (Li) 3: ðis uutedlice godspell aurat in ðær meigð æfter ðon in Pathma ealond þæt boc ðæra sighðana eac awrat.
D. legal document
    Birch 862: Þis is ðæs landes boc æt Duntune ðe Eadred cyng edniwon gæbocodæ sanctæ trinitate & Sanctæ Pætræ & Sanctæ Paule into ealdan mynstræ.
WSD methodologies 4
From corpus to feature vectors – bag-of-words model with fixed-size token window
from Ch 540 (Birch 862):
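A bag-of-words feature vector over a fixed-size token window can be sketched as follows. This is not the authors' Mallet-based implementation: the symmetric window, the exclusion of the target token itself, the lower-casing and the whitespace tokenization are all assumptions made for illustration. The example sentence is the opening of the Birch 862 citation given for sense D of "bōc".

```python
from collections import Counter

def window_features(tokens, target_index, window_size):
    """Count the tokens within window_size positions on either side of the target."""
    start = max(0, target_index - window_size)
    left = tokens[start:target_index]
    right = tokens[target_index + 1 : target_index + 1 + window_size]
    # The target token itself is left out (an assumption of this sketch).
    return Counter(left + right)

sentence = "Þis is ðæs landes boc æt Duntune ðe Eadred cyng".lower().split()
features = window_features(sentence, sentence.index("boc"), window_size=3)
print(features)
```

With a window size of 3, the features for this occurrence are the three tokens before "boc" and the three after it. A larger window adds more context at the cost of more noise.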
Implementation
● Libraries used:
  – Mallet (NLP and ML library)
  – Jsoup (HTML processing)
● Own implementation:
  – Parsing of corpus and dictionary data
  – Feature extraction and instance creation
  – Pipes for baseline classifiers (Mallet additions)
  – Metrics, summarization and output of results
  ...and much more...
Experiments and results 1
● Baseline 1: most frequent class.
  – Accuracy: 0.67
● Baseline 2: random class.
  – Accuracy: 0.44
Human annotators' lower and upper bounds: 0.75 – 0.97 (Gale et al., 1992)
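The two baselines can be illustrated with a short sketch. The accuracies above come from the authors' own baseline classifiers added to Mallet; the code below only shows the idea. The slides do not say whether the random baseline draws senses uniformly or in proportion to their frequency, so a uniform draw is assumed here, and the function names are hypothetical.

```python
import random
from collections import Counter

def most_frequent_baseline(train_labels, test_labels):
    """Accuracy when every test instance gets the most frequent training sense."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for y in test_labels if y == majority) / len(test_labels)

def random_baseline(train_labels, test_labels, seed=0):
    """Accuracy when every test instance gets a uniformly random sense."""
    rng = random.Random(seed)
    classes = sorted(set(train_labels))
    # Assumption: uniform choice over the senses seen in training.
    return sum(1 for y in test_labels if y == rng.choice(classes)) / len(test_labels)

# Toy sense labels: sense A dominates, as is typical for skewed sense distributions.
train = ["A"] * 7 + ["B"] * 3
test = ["A", "A", "B", "A"]
print(most_frequent_baseline(train, test))
print(random_baseline(train, test))
```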
Experiments and results 2
One-vs-all classification
[Figure: two line charts titled "A vs. notA - Naive Bayes", x-axis 0 to 20, y-axis 0 to 1. Series: Accuracy, Avg Precision, Avg Recall, Avg F1; the second chart adds a linear regression trend line ("Lin Reg trend").]
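In the one-vs-all ("A vs. notA") setting, each sense is separated from all other senses of the same word, which turns the task into a binary classification. The sketch below shows the relabelling step together with a small multinomial Naive Bayes classifier with add-one smoothing. It is a stand-in written for illustration and not Mallet's implementation, which the authors used; the class and function names are hypothetical and the training contexts are toy data built from words in the "bōc" citations.

```python
import math
from collections import Counter, defaultdict

def one_vs_all(labels, positive):
    """Relabel a multi-class problem as `positive` vs. everything else."""
    return [y if y == positive else "not" + positive for y in labels]

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over bag-of-words tokens."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, y in zip(docs, labels):
            self.word_counts[y].update(doc)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for y, n in self.class_counts.items():
            denom = sum(self.word_counts[y].values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for w in doc:
                if w in self.vocab:  # words unseen in training are ignored
                    score += math.log((self.word_counts[y][w] + 1) / denom)
            if score > best_score:
                best, best_score = y, score
        return best

# Toy context windows for three senses of "boc" (A, B, D).
docs = [["landes", "cyng"], ["landes", "mynstræ"], ["godspell", "awrat"], ["unfeold", "he"]]
senses = ["D", "D", "B", "A"]

binary = one_vs_all(senses, "D")
model = NaiveBayes().fit(docs, binary)
print(model.predict(["landes", "cyng", "eadred"]))
print(model.predict(["godspell"]))
```

Repeating the relabelling for each sense in turn gives one binary classifier per sense.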
Experiments and results 3
One-vs-all classification
[Figure: two line charts titled "A vs. notA - MaxEnt", x-axis 0 to 20, y-axis 0 to 1. Series: Accuracy, Avg Precision, Avg Recall, Avg F1; the second chart adds a linear regression trend line ("Lin Reg trend").]
Experiments and results 4
Algorithm        Feature vector  Accuracy         Precision        Recall           F1
                                 Avg     Std Dev  Avg     Std Dev  Avg     Std Dev  Avg
NB, multi-class  BoW             0.7635  0.11     0.7205  0.18     0.7865  0.16     0.7521
ME, multi-class  BoW             0.7520  0.17     0.8610  0.10     0.6915  0.17     0.7670
NB, one-vs-all   BoW             0.8400  0.09     0.8458  0.10     0.8368  0.11     0.8295
ME, one-vs-all   BoW             0.7950  0.12     0.7875  0.13     0.8080  0.12     0.7662
NB, multi-class  Coll.           0.7245  0.12     0.8135  0.08     0.6510  0.12     0.5895
ME, multi-class  Coll.           0.7910  0.13     0.8845  0.08     0.6875  0.16     0.6510
NB, one-vs-all   Coll.           0.8200  0.09     0.8305  0.12     0.8085  0.10     0.7970
ME, one-vs-all   Coll.           0.7290  0.09     0.7395  0.10     0.7145  0.14     0.6890
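For reference, the metrics in the table can be computed per run from gold and predicted labels and then summarized as an average and a standard deviation. The sketch below is illustrative only: the slides do not state over which runs the averages were taken, nor whether the standard deviation is the sample or the population version (the sample version is assumed). The toy labels and scores are made up and are not the authors' results.

```python
from statistics import mean, stdev

def precision_recall_f1(gold, predicted, positive):
    """Precision, recall and F1 for one class, from paired gold/predicted labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# One toy run: 2 true positives, 1 false positive, 1 false negative.
gold = ["A", "A", "notA", "notA", "A"]
pred = ["A", "notA", "notA", "A", "A"]
p, r, f = precision_recall_f1(gold, pred, positive="A")
print(p, r, f)

# Summarizing hypothetical per-run accuracies as average and sample standard deviation.
run_accuracies = [0.84, 0.80, 0.76]
print(mean(run_accuracies), stdev(run_accuracies))
```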
Summary
● Historical languages: interesting, rewarding and difficult to work with
● WSD does give satisfactory results even without stemming etc.
● Best WSD performance: NB (F1), one vs. all, window size: ??
● Annotated data set (available on website)
● Baseline classifiers as contributions to MALLET
● Possible extensions:
  – More advanced vector representations
  – Bootstrapping
  – Train classifiers based on other corpora
  – Distributional thesaurus (DT)?
● Acknowledgements: Winfried Rudolf, Göttingen & Juan Carmona Ramirez, Jena
Thanks a lot for your attention!
Any questions?
Paul S. Langeslag, Göttingen
New book: Seasons in the Literatures of the Medieval North
Alexander Fraser, München
References
● Mark Stevenson. Word sense disambiguation: the case for combinations of knowledge sources. CSLI studies in computational linguistics. CSLI Publ., Stanford, Calif., 2003.
● D. Yarowsky. Word sense disambiguation. In Alexander Clark, editor, The handbook of computational linguistics and natural language processing, Blackwell handbooks in linguistics. Wiley-Blackwell, Oxford [etc.], 1st edition, 2010.
● D. Crystal. The Cambridge Encyclopedia of Language. Cambridge University Press, 2010.
● Clara Cabezas, Philip Resnik, and Jessica Stevens. Supervised sense tagging using support vector machines. In The Proceedings of the Second International Workshop on Evaluating Word Sense Disambiguation Systems, SENSEVAL ’01, pages 59–62, Stroudsburg, PA, USA, 2001. Association for Computational Linguistics.
● Andrew Kachites McCallum. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.
● Marcel Bollmann. POS tagging for historical texts with sparse training data. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability in Discourse, pages 11–18, Sofia, Bulgaria, August 2013. Association for Computational Linguistics.
● Taesun Moon and Jason Baldridge. Part-of-speech tagging for Middle English through alignment and projection of parallel diachronic texts. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 390–399, Prague, Czech Republic, June 2007. Association for Computational Linguistics.
● Roland Meyer. New wine in old wineskins? Tagging Old Russian via annotation projection from modern translations. Russian Linguistics, 35(2):267–281, 2011.
● Marco Pennacchiotti and Fabio Massimo Zanzotto. Natural Language Processing across time: an empirical investigation on Italian, volume 5221, pages 371–382. Springer, 2008.

More Related Content

Similar to Word Sense Disambiguation in Old English

Johannes Hercher Developer Linking Data presentation Fusepool
Johannes Hercher Developer Linking Data presentation Fusepool Johannes Hercher Developer Linking Data presentation Fusepool
Johannes Hercher Developer Linking Data presentation Fusepool
Fusepool SME project
 
Linking Library Data using Fusepool
Linking Library Data using FusepoolLinking Library Data using Fusepool
Linking Library Data using Fusepool
datentaste
 
Finding and managing process engineering information
Finding and managing process engineering informationFinding and managing process engineering information
Finding and managing process engineering information
Thomas Hapke
 
Finding and managing engineering information
Finding and managing engineering informationFinding and managing engineering information
Finding and managing engineering information
Thomas Hapke
 
SCC2011 - Diversifying your audience - working with deaf groups
SCC2011 - Diversifying your audience - working with deaf groupsSCC2011 - Diversifying your audience - working with deaf groups
SCC2011 - Diversifying your audience - working with deaf groups
British Science Association
 
Steven Hartman (NIES) Ecocriticism, Environmental History and the advent of I...
Steven Hartman (NIES) Ecocriticism, Environmental History and the advent of I...Steven Hartman (NIES) Ecocriticism, Environmental History and the advent of I...
Steven Hartman (NIES) Ecocriticism, Environmental History and the advent of I...
Global Human Ecodynamics Alliance
 
Finding and managing engineering information
Finding and managing engineering informationFinding and managing engineering information
Finding and managing engineering information
Thomas Hapke
 
1. EXPERT Winter School Partner Introductions
1. EXPERT Winter School Partner Introductions1. EXPERT Winter School Partner Introductions
1. EXPERT Winter School Partner Introductions
RIILP
 

Similar to Word Sense Disambiguation in Old English (20)

Alenka Šauperl: Abstracts for scientific papers
Alenka Šauperl: Abstracts for scientific papers Alenka Šauperl: Abstracts for scientific papers
Alenka Šauperl: Abstracts for scientific papers
 
2016 05-20-clariah-wp3
2016 05-20-clariah-wp32016 05-20-clariah-wp3
2016 05-20-clariah-wp3
 
Iscram 2015
Iscram 2015Iscram 2015
Iscram 2015
 
Johannes Hercher Developer Linking Data presentation Fusepool
Johannes Hercher Developer Linking Data presentation Fusepool Johannes Hercher Developer Linking Data presentation Fusepool
Johannes Hercher Developer Linking Data presentation Fusepool
 
Linking Library Data using Fusepool
Linking Library Data using FusepoolLinking Library Data using Fusepool
Linking Library Data using Fusepool
 
Fullstendig_Soknad_Opprykk
Fullstendig_Soknad_OpprykkFullstendig_Soknad_Opprykk
Fullstendig_Soknad_Opprykk
 
Finding and managing process engineering information
Finding and managing process engineering informationFinding and managing process engineering information
Finding and managing process engineering information
 
Hinterland, Influence, Environs
Hinterland, Influence, EnvironsHinterland, Influence, Environs
Hinterland, Influence, Environs
 
Finding and managing engineering information
Finding and managing engineering informationFinding and managing engineering information
Finding and managing engineering information
 
Dealing with research data
Dealing with research dataDealing with research data
Dealing with research data
 
SCC2011 - Diversifying your audience - working with deaf groups
SCC2011 - Diversifying your audience - working with deaf groupsSCC2011 - Diversifying your audience - working with deaf groups
SCC2011 - Diversifying your audience - working with deaf groups
 
2014 10 TDWG - Environments-EOL
2014 10 TDWG - Environments-EOL 2014 10 TDWG - Environments-EOL
2014 10 TDWG - Environments-EOL
 
Shebanq roma-2013-10-01
Shebanq roma-2013-10-01Shebanq roma-2013-10-01
Shebanq roma-2013-10-01
 
Scholar voices 1 - international scholars perspective of UK libraries
Scholar voices 1 - international scholars perspective of UK librariesScholar voices 1 - international scholars perspective of UK libraries
Scholar voices 1 - international scholars perspective of UK libraries
 
Carolin Müller-Spitzer & Sascha Wolfer - A quantitative view on dictionary us...
Carolin Müller-Spitzer & Sascha Wolfer - A quantitative view on dictionary us...Carolin Müller-Spitzer & Sascha Wolfer - A quantitative view on dictionary us...
Carolin Müller-Spitzer & Sascha Wolfer - A quantitative view on dictionary us...
 
Steven Hartman (NIES) Ecocriticism, Environmental History and the advent of I...
Steven Hartman (NIES) Ecocriticism, Environmental History and the advent of I...Steven Hartman (NIES) Ecocriticism, Environmental History and the advent of I...
Steven Hartman (NIES) Ecocriticism, Environmental History and the advent of I...
 
Finding and managing engineering information
Finding and managing engineering informationFinding and managing engineering information
Finding and managing engineering information
 
532_Paper
532_Paper532_Paper
532_Paper
 
1. EXPERT Winter School Partner Introductions
1. EXPERT Winter School Partner Introductions1. EXPERT Winter School Partner Introductions
1. EXPERT Winter School Partner Introductions
 
A Short Review Of The Application Of 3D Documentation Methods On Selected UW ...
A Short Review Of The Application Of 3D Documentation Methods On Selected UW ...A Short Review Of The Application Of 3D Documentation Methods On Selected UW ...
A Short Review Of The Application Of 3D Documentation Methods On Selected UW ...
 

Recently uploaded

Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
gindu3009
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
AlMamun560346
 
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
Lokesh Kothari
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
PirithiRaju
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Sérgio Sacani
 

Recently uploaded (20)

Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
High Profile 🔝 8250077686 📞 Call Girls Service in GTB Nagar🍑
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
 
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
American Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptxAmerican Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptx
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 

Word Sense Disambiguation in Old English

  • 1. GSCL, Essen – 2015-09-30 WSD in Old English – Martin Wunderlich, Alexander Fraser, Paul Sander Langeslag Additional material: http://www.cis.uni-muenchen.de/~martinw/ 1 / 25 "God Wat þæt Ic Eom God" Word Sense Disambiguation in Old English Bamberg, Staatsbibliothek, Msc.Nat.1 (9th century) Martin Wunderlich and Alexander Fraser (LMU M nchen) Paul Sander Langeslag (University of G ttingen)
  • 2. GSCL, Essen – 2015-09-30 WSD in Old English – Martin Wunderlich, Alexander Fraser, Paul Sander Langeslag Additional material: http://www.cis.uni-muenchen.de/~martinw/ 2 / 25 Can we apply WSD techniques to a historical language like Old English and what are the specific challenges?
  • 3. GSCL, Essen – 2015-09-30 WSD in Old English – Martin Wunderlich, Alexander Fraser, Paul Sander Langeslag Additional material: http://www.cis.uni-muenchen.de/~martinw/ 3 / 25 Overview ● Background on the Old English language ● NLP and historical languages – problems and opportunities ● Old English digital resources ● WSD methodologies applied here ● Experiments and results ● Summary and discussion
  • 4. GSCL, Essen – 2015-09-30 WSD in Old English – Martin Wunderlich, Alexander Fraser, Paul Sander Langeslag Additional material: http://www.cis.uni-muenchen.de/~martinw/ 4 / 25 Background on the OE language 1 ● Spoken ca. 450 – 1100 AD ● A Germanic language: „God Wat þæt Ic Eom God‟ → „Gott weiß, dass ich gut bin‟ („God knows I'm good‟ - David Bowie) ● 5 cases, 3 genders, 3 numbers (singual, dual, plural) An example: – „Seo cwen geseah þone guman.‟ * – „Se guma geseah þa cwen.‟ ** (from Crystal, 2010) * „The woman saw the man.‟ ** „The man saw the woman‟
  • 5. GSCL, Essen – 2015-09-30 WSD in Old English – Martin Wunderlich, Alexander Fraser, Paul Sander Langeslag Additional material: http://www.cis.uni-muenchen.de/~martinw/ 5 / 25 Background on the OE language 2 ● Initially a runic alphabet known as „futhorc‟ (after the first letters -ᚠᚢᚦᚩᚱᚳ) ● ...keeping Thorn ᚦ and Wynn ƿ and adding Latin ● 24 letter alphabet: a æ b c d ð e f ᵹ/g h i l m n o p r s/ſ t þ u ƿ/w x y ● Introduced around 600 AD
  • 6. GSCL, Essen – 2015-09-30 WSD in Old English – Martin Wunderlich, Alexander Fraser, Paul Sander Langeslag Additional material: http://www.cis.uni-muenchen.de/~martinw/ 6 / 25 Background on the OE language 3 Migrations and settlements: https://www.uni-due.de/SHE/Germanic_Migration_to_Britain.gif (site maintained by Prof. Raymond Hickey, Chair of Linguistics)
  • 7. GSCL, Essen – 2015-09-30 WSD in Old English – Martin Wunderlich, Alexander Fraser, Paul Sander Langeslag Additional material: http://www.cis.uni-muenchen.de/~martinw/ 7 / 25 NLP & historical languages: problems ● Stopword lists ● POS taggers ● Word and sentence tokenizers ● Standard tools and libraries ● Shared tasks with prepared training data ● Existing research
  • 8. GSCL, Essen – 2015-09-30 WSD in Old English – Martin Wunderlich, Alexander Fraser, Paul Sander Langeslag Additional material: http://www.cis.uni-muenchen.de/~martinw/ 8 / 25 NLP & historical languages: problems ● Stopword lists ● POS taggers ● Word and sentence tokenizers ● Standard tools and libraries ● Shared tasks with prepared training data ● Existing research … well, a bit ...
  • 9. GSCL, Essen – 2015-09-30 WSD in Old English – Martin Wunderlich, Alexander Fraser, Paul Sander Langeslag Additional material: http://www.cis.uni-muenchen.de/~martinw/ 9 / 25 NLP & historical languages: related work ● Annotation projection in Germanic languages with parallel bible texts (Sukhareva and Chiarcos, 2014) ● Application of existing NLP tools to ancient Italian (Pennacchiotti and Zanzotto, 2008) ● Tagging Old East Slavonic texts (Meyer, 2011) ● POS tagging Early Modern German texts (Bollmann, 2013) ● Projection of tags from contemporary EN to ME (Moon and Baldridge, 2007)
  • 10. GSCL, Essen – 2015-09-30 WSD in Old English – Martin Wunderlich, Alexander Fraser, Paul Sander Langeslag Additional material: http://www.cis.uni-muenchen.de/~martinw/ 10 / 25 NLP & historical languages: opportunities 1.Digital corpora & dictionaries/lexicons do exist (incl. OE Wikipedia: https://ang.wikipedia.org/wiki/H%C4%93afodtramet) 2.Static corpus 3.Few existing NLP applications → lots to explore
  • 11. GSCL, Essen – 2015-09-30 WSD in Old English – Martin Wunderlich, Alexander Fraser, Paul Sander Langeslag Additional material: http://www.cis.uni-muenchen.de/~martinw/ 11 / 25 Old English digital resources: corpora ● York-Toronto-Helsinki Parsed Corpus of Old English prose (YCOE); ca. 1.5 million words ● York-Toronto-Helsinki Parsed Corpus of Old English poetry (YCOEP); 71,490 words ● Dictionary of Old English Corpus in Electronic Form (DOEC); ca. 3.8 million words → all available through the University of Oxford Text Archive (http://www.ota.ahds.ac.uk/);
  • 12. GSCL, Essen – 2015-09-30 WSD in Old English – Martin Wunderlich, Alexander Fraser, Paul Sander Langeslag Additional material: http://www.cis.uni-muenchen.de/~martinw/ 12 / 25 Old English digital resources: dictionary Dictionary of Old English (DOE) corpus stats: Number of HTML documents 3,037 Token count 3,786,753 Type count 343,135 Token count / type count ca. 11 Total number of sentences 234113 Average sentence length 5.5 Minimum sentence length 1 Maximum sentence length 263 Compare to Brown corpus: ca. 1 Mio tokens and ca. 50.000 types (T/T = 20) Spelling variations. e.g. „wundarlic‟, „wundorlic‟, „wunderlic‟ 12568 DOE entries for the letters from A to G (http://tapor.library.utoronto.ca/doe/)
  • 13. GSCL, Essen – 2015-09-30 WSD in Old English – Martin Wunderlich, Alexander Fraser, Paul Sander Langeslag Additional material: http://www.cis.uni-muenchen.de/~martinw/ 13 / 25 WSD methodologies 1 Criteria for selecting the target terms: ➔ minimum count 200, minimum length 3 characters ➔ non-Latin (i.e. no „dictum‟, „confundantur‟, „magister‟...) ➔ common nouns ➔ no proper nouns (e.g. no „Egypta‟, „Micel‟, „Iulianus‟...)
  • 14. GSCL, Essen – 2015-09-30 WSD in Old English – Martin Wunderlich, Alexander Fraser, Paul Sander Langeslag Additional material: http://www.cis.uni-muenchen.de/~martinw/ 14 / 25 WSD methodologies 2 Target terms: Target term Token count in DOE corpus Basic translation Anweald 242 Power, realm, order of angels Fultum 574 Help, aid, remedy Fæder 416 Father, lord (relig.) For 955 Movement, journey... Eadigan 263 To bless, to make happy Boc 567 Book, volume, legal doc Ban 314 Bone, ivory Are 308 Honour, mercy, property Andlang 1743 Continuous, upright Dryhten 261 Lord (worldly & relig.), chief 100 concordance matches each (random selection)
  • 15. WSD methodologies 3
    Selected word senses of "bōc": (http://tapor.library.utoronto.ca/doe/dict/indices/headwordsd.html#E03007)
    A. book
      A.1. in general, without particular reference to form or content
        Lk (WSCp) 4.17: he þa boc unfeold
    B. major division of a larger work
      JnArgGl (Li) 3: ðis uutedlice godspell aurat in ðær meigð æfter ðon in Pathma ealond þæt boc ðæra sighðana eac awrat.
    D. legal document
      Birch 862: Þis is ðæs landes boc æt Duntune ðe Eadred cyng edniwon gæbocodæ sanctæ trinitate & Sanctæ Pætræ & Sanctæ Paule into ealdan mynstræ.
  • 16. WSD methodologies 4
    From corpus to feature vectors – bag-of-words model with fixed-size token window
    Example from Ch 540 (Birch 862): [figure not preserved in this transcript]
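A fixed-size token window around the target word can be turned into bag-of-words counts roughly as follows. This is a sketch under assumptions: the window size of 3, whitespace tokenisation and lowercasing are choices made here for illustration, not details given on the slide.

```python
from collections import Counter

def window_bow(tokens, target_index, window=3):
    """Count the tokens within `window` positions of the target, excluding the target."""
    start = max(0, target_index - window)
    left = tokens[start:target_index]
    right = tokens[target_index + 1:target_index + 1 + window]
    return Counter(t.lower() for t in left + right)

# Opening of Birch 862, whitespace-tokenised; no stemming or spelling normalisation.
sentence = "Þis is ðæs landes boc æt Duntune ðe Eadred cyng".split()
features = window_bow(sentence, sentence.index("boc"), window=3)
```

With a window of 3 the features for this occurrence of "boc" are the six neighbouring tokens "is", "ðæs", "landes", "æt", "duntune" and "ðe", each with count 1.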
  • 17. Implementation
    ● Libraries used:
      – Mallet (NLP and ML library)
      – Jsoup (HTML processing)
    ● Own implementation:
      – Parsing of corpus and dictionary data
      – Feature extraction and instance creation
      – Pipes for baseline classifiers (Mallet additions)
      – Metrics, summarization and output of results
    ...and much more...
  • 18. Experiments and results 1
    ● Baseline 1: most frequent class.
      – Accuracy: 0.67
    ● Baseline 2: random class.
      – Accuracy: 0.44
    Human annotators' lower and upper bounds: 0.75 – 0.97 (Gale et al., 1992)
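The two baselines can be sketched in a few lines of Python. The authors implemented theirs as Mallet additions; the version below is an independent illustration with invented labels, and it assumes the random baseline draws uniformly from the label set, which the slide does not specify.

```python
import random
from collections import Counter

def most_frequent_class_accuracy(train_labels, test_labels):
    """Accuracy when every test instance gets the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for y in test_labels if y == majority) / len(test_labels)

def random_class_accuracy(train_labels, test_labels, seed=0):
    """Accuracy when each test instance gets a label drawn uniformly from the label set."""
    rng = random.Random(seed)
    classes = sorted(set(train_labels))
    hits = sum(1 for y in test_labels if rng.choice(classes) == y)
    return hits / len(test_labels)

# Invented sense labels for one target term (A = book, D = legal document).
train = ["A"] * 60 + ["D"] * 30
test = ["A"] * 7 + ["D"] * 3
mfc = most_frequent_class_accuracy(train, test)
rnd = random_class_accuracy(train, test)
```

The most-frequent-class score depends only on how skewed the sense distribution is, so a classifier has to beat it before its accuracy says anything about the features.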
  • 19. Experiments and results 2
    One-vs-all classification
    [Figure: two plots titled "A vs. notA - Naive Bayes" showing Accuracy, Avg Precision, Avg Recall and Avg F1 (y-axis 0 to 1, x-axis 0 to 20); the second plot adds a linear regression trend line.]
  • 20. Experiments and results 3
    One-vs-all classification
    [Figure: two plots titled "A vs. notA - MaxEnt" showing Accuracy, Avg Precision, Avg Recall and Avg F1 (y-axis 0 to 1, x-axis 0 to 20); the second plot adds a linear regression trend line.]
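The "A vs. notA" setup collapses all senses other than the one of interest into a single negative class before scoring. A minimal sketch of that relabelling and of the four plotted metrics, using invented gold and predicted labels; the function names are illustrative and not taken from the authors' code.

```python
def one_vs_all(labels, positive):
    """Collapse multi-class sense labels into `positive` vs. 'not' + positive."""
    return [y if y == positive else "not" + positive for y in labels]

def binary_scores(gold, predicted, positive):
    """Return accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Invented gold and predicted senses for a handful of instances.
gold = one_vs_all(["A", "B", "A", "D", "A", "D"], "A")
pred = one_vs_all(["A", "A", "A", "D", "B", "D"], "A")
scores = binary_scores(gold, pred, "A")
```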
  • 22. Experiments and results 4
      Algorithm        Feature vector  Accuracy          Precision         Recall            F1
                                       Avg     Std Dev   Avg     Std Dev   Avg     Std Dev   Avg
      NB, multi-class  BoW             0.7635  0.11      0.7205  0.18      0.7865  0.16      0.7521
      ME, multi-class  BoW             0.7520  0.17      0.8610  0.10      0.6915  0.17      0.7670
      NB, one-vs-all   BoW             0.8400  0.09      0.8458  0.10      0.8368  0.11      0.8295
      ME, one-vs-all   BoW             0.7950  0.12      0.7875  0.13      0.8080  0.12      0.7662
      NB, multi-class  Coll.           0.7245  0.12      0.8135  0.08      0.6510  0.12      0.5895
      ME, multi-class  Coll.           0.7910  0.13      0.8845  0.08      0.6875  0.16      0.6510
      NB, one-vs-all   Coll.           0.8200  0.09      0.8305  0.12      0.8085  0.10      0.7970
      ME, one-vs-all   Coll.           0.7290  0.09      0.7395  0.10      0.7145  0.14      0.6890
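The Avg and Std Dev columns summarise a metric across the target terms. A sketch of that aggregation with invented per-term accuracies, since the slide reports only the aggregates; whether the authors used the population or the sample standard deviation is not stated, so `pstdev` here is an assumption.

```python
from statistics import mean, pstdev

def summarise(per_term_scores):
    """Return (average, standard deviation) of one metric across target terms."""
    values = list(per_term_scores.values())
    return mean(values), pstdev(values)

# Invented per-term accuracies, for illustration only.
accuracy_by_term = {"boc": 0.80, "ban": 0.90, "fultum": 0.70, "are": 0.80}
avg, std = summarise(accuracy_by_term)
```

Because each metric is averaged over terms separately, the Avg F1 in a row need not equal the F1 computed from that row's Avg Precision and Avg Recall.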
  • 23. Summary
    ● Historical languages: interesting, rewarding and difficult to work with
    ● WSD gives satisfactory results even without stemming etc.
    ● Best WSD performance: NB (F1), one-vs-all, window size: ??
    ● Annotated data set (available on website)
    ● Baseline classifiers as contributions to MALLET
    ● Possible extensions:
      – More advanced vector representations
      – Bootstrapping
      – Train classifiers based on other corpora
      – Distributional thesaurus (DT)?
    ● Acknowledgements: Winfried Rudolf, Göttingen & Juan Carmona Ramirez, Jena
  • 24. Thanks a lot for your attention! Any questions?
    Paul S. Langeslag, Göttingen
      New book: Seasons in the Literatures of the Medieval North
    Alexander Fraser, München
  • 25. References
    ● Mark Stevenson. Word Sense Disambiguation: The Case for Combinations of Knowledge Sources. CSLI Studies in Computational Linguistics. CSLI Publications, Stanford, Calif., 2003.
    ● D. Yarowsky. Word sense disambiguation. In Alexander Clark, editor, The Handbook of Computational Linguistics and Natural Language Processing, Blackwell Handbooks in Linguistics. Wiley-Blackwell, Oxford, 1st edition, 2010.
    ● D. Crystal. The Cambridge Encyclopedia of Language. Cambridge University Press, 2010.
    ● Clara Cabezas, Philip Resnik, and Jessica Stevens. Supervised sense tagging using support vector machines. In The Proceedings of the Second International Workshop on Evaluating Word Sense Disambiguation Systems, SENSEVAL ’01, pages 59–62, Stroudsburg, PA, USA, 2001. Association for Computational Linguistics.
    ● Andrew Kachites McCallum. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.
    ● Marcel Bollmann. POS tagging for historical texts with sparse training data. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability in Discourse, pages 11–18, Sofia, Bulgaria, August 2013. Association for Computational Linguistics.
    ● Taesun Moon and Jason Baldridge. Part-of-speech tagging for Middle English through alignment and projection of parallel diachronic texts. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 390–399, Prague, Czech Republic, June 2007. Association for Computational Linguistics.
    ● Roland Meyer. New wine in old wineskins? Tagging Old Russian via annotation projection from modern translations. Russian Linguistics, 35(2):267–281, 2011.
    ● Marco Pennacchiotti and Fabio Massimo Zanzotto. Natural Language Processing across time: an empirical investigation on Italian, volume 5221, pages 371–382. Springer, 2008.