Word Sense Disambiguation in Old English
1. GSCL, Essen – 2015-09-30 WSD in Old English – Martin Wunderlich, Alexander Fraser, Paul Sander Langeslag
Additional material: http://www.cis.uni-muenchen.de/~martinw/
"God Wat þæt Ic Eom God"
Word Sense Disambiguation in Old English
Bamberg, Staatsbibliothek, Msc.Nat.1 (9th century)
Martin Wunderlich and Alexander Fraser (LMU München)
Paul Sander Langeslag (University of Göttingen)
Can we apply WSD techniques to a historical language like Old English, and what are the specific challenges?
Overview
● Background on the Old English language
● NLP and historical languages – problems and opportunities
● Old English digital resources
● WSD methodologies applied here
● Experiments and results
● Summary and discussion
Background on the OE language 1
● Spoken ca. 450 – 1100 AD
● A Germanic language: "God Wat þæt Ic Eom God" → German "Gott weiß, dass ich gut bin" ("God knows I'm good", David Bowie)
● 5 cases, 3 genders, 3 numbers (singular, dual, plural)
An example (from Crystal, 2010):
– "Seo cwen geseah þone guman." ("The woman saw the man.")
– "Se guma geseah þa cwen." ("The man saw the woman.")
Background on the OE language 2
● Initially written in a runic alphabet known as "futhorc" (after its first letters, ᚠᚢᚦᚩᚱᚳ)
● Later written in the Latin alphabet, keeping the runic letters thorn (þ) and wynn (ƿ)
● 24-letter alphabet, introduced around 600 AD:
a æ b c d ð e f ᵹ/g h i l m n o p r s/ſ t þ u ƿ/w x y
Background on the OE language 3
Migrations and settlements:
https://www.uni-due.de/SHE/Germanic_Migration_to_Britain.gif
(site maintained by Prof. Raymond Hickey, Chair of Linguistics)
NLP & historical languages: problems
Largely missing for historical languages such as Old English:
● Stopword lists
● POS taggers
● Word and sentence tokenizers
● Standard tools and libraries
● Shared tasks with prepared training data
● Existing research (well, a bit...)
NLP & historical languages: related work
● Annotation projection in Germanic languages with parallel Bible texts (Sukhareva and Chiarcos, 2014)
● Application of existing NLP tools to ancient Italian (Pennacchiotti and Zanzotto, 2008)
● Tagging Old East Slavonic texts (Meyer, 2011)
● POS tagging Early Modern German texts (Bollmann, 2013)
● Projection of tags from contemporary English to Middle English (Moon and Baldridge, 2007)
NLP & historical languages: opportunities
1. Digital corpora and dictionaries/lexicons do exist (incl. OE Wikipedia: https://ang.wikipedia.org/wiki/H%C4%93afodtramet)
2. Static corpus
3. Few existing NLP applications → lots to explore
Old English digital resources: corpora
● York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE); ca. 1.5 million words
● York-Toronto-Helsinki Parsed Corpus of Old English Poetry (YCOEP); 71,490 words
● Dictionary of Old English Corpus in Electronic Form (DOEC); ca. 3.8 million words
→ all available through the University of Oxford Text Archive (http://www.ota.ahds.ac.uk/)
Old English digital resources: dictionary
Dictionary of Old English (DOE) corpus stats:

  Number of HTML documents      3,037
  Token count                   3,786,753
  Type count                    343,135
  Token count / type count      ca. 11
  Total number of sentences     234,113
  Average sentence length       5.5
  Minimum sentence length       1
  Maximum sentence length       263

Compare to the Brown corpus: ca. 1 million tokens and ca. 50,000 types (token/type ratio = 20).
Spelling variations, e.g. "wundarlic", "wundorlic", "wunderlic".
12,568 DOE entries for the letters from A to G (http://tapor.library.utoronto.ca/doe/)
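The token/type ratios quoted above follow directly from the counts. A minimal Python sketch, assuming whitespace tokenization and lower-casing (the talk does not say how tokens were counted); `token_type_stats` is a hypothetical helper name, and the two short strings are only toy input:

```python
from collections import Counter

def token_type_stats(sentences):
    """Return (token count, type count, tokens per type) for raw sentences.

    Assumption: tokens are whitespace-separated and lower-cased; the
    tokenization actually used for the DOE corpus is not stated in the talk.
    """
    counts = Counter(tok.lower() for sent in sentences for tok in sent.split())
    n_tokens = sum(counts.values())
    n_types = len(counts)
    ratio = n_tokens / n_types if n_types else 0.0
    return n_tokens, n_types, ratio

# Ratio implied by the DOE corpus counts reported above (about 11).
doe_ratio = 3_786_753 / 343_135

# Spelling variants are counted as separate types, which inflates the type count.
stats = token_type_stats(["wundorlic wundarlic wunderlic", "he þa boc unfeold"])
```

Because every spelling variant is its own type, a corpus with unnormalized spelling will show a lower token/type ratio than a standardized one of similar size.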
WSD methodologies 1
Criteria for selecting the target terms:
➔ minimum count 200, minimum length 3 characters
➔ non-Latin (i.e. no "dictum", "confundantur", "magister"...)
➔ common nouns
➔ no proper nouns (e.g. no "Egypta", "Micel", "Iulianus"...)
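The two numeric criteria can be sketched as a simple frequency filter. This is an illustration only: `candidate_terms` and the `excluded` set are hypothetical, the talk does not say how Latin words and proper nouns were identified, and the part-of-speech criterion is not covered here.

```python
from collections import Counter

MIN_COUNT = 200    # minimum corpus frequency (from the slide)
MIN_LENGTH = 3     # minimum length in characters (from the slide)

def candidate_terms(tokens, excluded=frozenset()):
    """Return word forms that pass the frequency and length thresholds.

    `excluded` is a hypothetical hand-compiled set of Latin words and
    proper nouns that should never become target terms.
    """
    counts = Counter(tok.lower() for tok in tokens)
    return sorted(
        term for term, n in counts.items()
        if n >= MIN_COUNT and len(term) >= MIN_LENGTH and term not in excluded
    )

# Toy token stream standing in for the corpus (counts are invented).
tokens = ["boc"] * 250 + ["dictum"] * 300 + ["se"] * 900 + ["ban"] * 150
selected = candidate_terms(tokens, excluded={"dictum"})
```

In this toy input only "boc" survives: "se" is too short, "ban" is too rare, and "dictum" is on the exclusion list.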
WSD methodologies 2
Target terms (100 randomly selected concordance matches each):

  Target term   Token count in DOE corpus   Basic translation
  Anweald       242                         Power, realm, order of angels
  Fultum        574                         Help, aid, remedy
  Fæder         416                         Father, lord (relig.)
  For           955                         Movement, journey...
  Eadigan       263                         To bless, to make happy
  Boc           567                         Book, volume, legal doc
  Ban           314                         Bone, ivory
  Are           308                         Honour, mercy, property
  Andlang       1743                        Continuous, upright
  Dryhten       261                         Lord (worldly & relig.), chief
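Drawing 100 random concordance matches per target term could look like the sketch below. It assumes exact matching on lower-cased, whitespace-separated tokens and a fixed seed for reproducibility; neither detail is given in the talk, `sample_concordance` is a hypothetical name, and the corpus is toy data.

```python
import random

def sample_concordance(sentences, term, k=100, seed=0):
    """Return up to k randomly chosen sentences that contain `term`.

    Assumptions: a match is an exact lower-cased token, and the seed is
    fixed only so that the sketch gives the same sample on every run.
    """
    matches = [s for s in sentences if term in s.lower().split()]
    if len(matches) <= k:
        return matches
    return random.Random(seed).sample(matches, k)

# Toy corpus: numbered dummy sentences plus one sentence without the term.
corpus = [f"he þa boc unfeold {i}" for i in range(567)] + ["se guma geseah þa cwen"]
sample = sample_concordance(corpus, "boc")
```

Sampling without replacement keeps the 100 instances distinct, so no sentence is annotated twice.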
WSD methodologies 3
Selected word senses of "bōc":
(http://tapor.library.utoronto.ca/doe/dict/indices/headwordsd.html#E03007)
A. book
A.1. in general, without particular reference to form or content
Lk (WSCp) 4.17: he þa boc unfeold
B. major division of a larger work
JnArgGl (Li) 3: ðis uutedlice godspell aurat in ðær meigð æfter
ðon in Pathma ealond þæt boc ðæra sighðana eac awrat.
D. legal document
Birch 862: Þis is ðæs landes boc æt Duntune ðe Eadred cyng
edniwon gæbocodæ sanctæ trinitate & Sanctæ Pætræ &
Sanctæ Paule into ealdan mynstræ.
WSD methodologies 4
From corpus to feature vectors: bag-of-words model with a fixed-size token window
[Figure: worked example from Ch 540 (Birch 862)]
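A bag-of-words window can be sketched in a few lines of Python. The window size of 3, the lower-casing and the exclusion of the target token are assumptions made for this illustration, and `window_features` is a hypothetical name; the input is the opening of the Birch 862 citation quoted for sense D of "bōc".

```python
from collections import Counter

def window_features(tokens, target_index, window=3):
    """Bag-of-words counts for up to `window` tokens on each side of the target.

    The target token itself is left out and all tokens are lower-cased.
    The window size of 3 is an arbitrary choice for this sketch.
    """
    left = tokens[max(0, target_index - window):target_index]
    right = tokens[target_index + 1:target_index + 1 + window]
    return Counter(tok.lower() for tok in left + right)

# Opening of the Birch 862 citation (sense D, legal document).
sentence = "Þis is ðæs landes boc æt Duntune ðe Eadred cyng".split()
features = window_features(sentence, sentence.index("boc"), window=3)
```

Word order inside the window is discarded: the classifier only sees which words occur near the target and how often.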
Implementation
● Libraries used:
– Mallet (NLP and ML library)
– Jsoup (HTML processing)
● Own implementation:
– Parsing of corpus and dictionary data
– Feature extraction and instance creation
– Pipes for baseline classifiers (Mallet additions)
– Metrics, summarization and output of results
...and much more...
Experiments and results 1
● Baseline 1: most frequent class
– Accuracy: 0.67
● Baseline 2: random class
– Accuracy: 0.44
Human annotators' upper and lower bounds: 0.75 – 0.97 (Gale et al., 1992)
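The two baselines can be sketched as follows. This is plain Python rather than the Mallet pipes used in the talk, the labels are toy data, and the uniform draw in the random baseline is an assumption (the talk does not say whether the draw was uniform or frequency-weighted).

```python
import random
from collections import Counter

def accuracy(predicted, gold):
    """Fraction of predictions that equal the gold label."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def most_frequent_baseline(train_labels, n_test):
    """Predict the most frequent training label for every test instance."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * n_test

def random_baseline(train_labels, n_test, seed=0):
    """Predict a label drawn uniformly from the set of training labels."""
    rng = random.Random(seed)
    labels = sorted(set(train_labels))
    return [rng.choice(labels) for _ in range(n_test)]

# Toy sense labels, not data from the talk.
train = ["A"] * 6 + ["B"] * 3 + ["D"] * 1
test = ["A", "A", "B", "A", "D"]
mfc_acc = accuracy(most_frequent_baseline(train, len(test)), test)
rnd_pred = random_baseline(train, len(test))
```

With a skewed sense distribution the most-frequent-class baseline is hard to beat, which is why it is reported alongside the random one.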
Experiments and results 2
One-vs-all classification
[Figure: "A vs. notA - Naive Bayes", two panels, the second with a linear regression trend line. Series: Accuracy, Avg Precision, Avg Recall, Avg F1. Horizontal axis 0 to 20, vertical axis 0 to 1.]
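One-vs-all classification turns the multi-sense problem into a binary one (sense A against everything else). The sketch below is a from-scratch multinomial Naive Bayes with add-one smoothing, not Mallet's implementation or API; the class and function names are hypothetical and the context windows are invented toy data.

```python
import math
from collections import Counter

def one_vs_all(labels, positive):
    """Relabel a multi-class problem as `positive` vs. "not" + positive."""
    return [lab if lab == positive else "not" + positive for lab in labels]

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over bag-of-words counts."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, lab in zip(docs, labels):
            self.word_counts[lab].update(doc)
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        self.n_docs = len(labels)
        return self

    def predict(self, doc):
        def log_score(c):
            total = sum(self.word_counts[c].values()) + len(self.vocab)
            score = math.log(self.prior[c] / self.n_docs)
            for w in doc:
                if w in self.vocab:  # words never seen in training are ignored
                    score += math.log((self.word_counts[c][w] + 1) / total)
            return score
        return max(self.classes, key=log_score)

# Toy context windows and sense labels, invented for illustration.
docs = [["godspell", "unfeold"], ["godspell", "rædan"],
        ["landes", "cyng"], ["landes", "mynstre"], ["dæl", "godspell"]]
senses = ["A", "A", "D", "D", "B"]
binary = one_vs_all(senses, "A")
model = NaiveBayes().fit(docs, binary)
prediction = model.predict(["godspell", "unfeold"])
```

Working in log space avoids underflow when many word probabilities are multiplied, and the smoothing keeps a single unseen word/class pair from zeroing out a class.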
Experiments and results 3
One-vs-all classification
[Figure: "A vs. notA - MaxEnt", two panels, the second with a linear regression trend line. Series: Accuracy, Avg Precision, Avg Recall, Avg F1. Horizontal axis 0 to 20, vertical axis 0 to 1.]
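The averaged precision, recall and F1 series in these charts can be computed along the following lines. The sketch assumes an unweighted (macro) average over the two classes; the talk does not state which averaging was used, the function names are hypothetical, and the predictions are toy data.

```python
def precision_recall_f1(predicted, gold, label):
    """Precision, recall and F1 for one class; 0.0 where a value is undefined."""
    pairs = list(zip(predicted, gold))
    tp = sum(p == label and g == label for p, g in pairs)
    fp = sum(p == label and g != label for p, g in pairs)
    fn = sum(p != label and g == label for p, g in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

def macro_average(predicted, gold):
    """Unweighted mean of per-class precision, recall and F1."""
    labels = sorted(set(gold))
    scores = [precision_recall_f1(predicted, gold, lab) for lab in labels]
    return tuple(sum(col) / len(labels) for col in zip(*scores))

# Toy predictions, not results from the talk.
gold = ["A", "A", "A", "notA", "notA", "notA"]
pred = ["A", "A", "notA", "notA", "notA", "A"]
avg_p, avg_r, avg_f1 = macro_average(pred, gold)
```

Macro-averaging gives the rarer class the same weight as the frequent one, which matters when "notA" dominates the instances.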
Summary
● Historical languages: interesting, rewarding and difficult to work with
● WSD gives satisfactory results even without stemming etc.
● Best WSD performance: Naive Bayes (by F1), one-vs-all, window size: ??
● Annotated data set (available on website)
● Baseline classifiers as contributions to MALLET
● Possible extensions:
– More advanced vector representations
– Bootstrapping
– Train classifiers based on other corpora
– Distributional thesaurus (DT)?
● Acknowledgements: Winfried Rudolf, Göttingen & Juan Carmona Ramirez, Jena
Thanks a lot for your attention!
Any questions?
Paul S. Langeslag, Göttingen
New book: Seasons in the Literatures of the Medieval North
Alexander Fraser, München
References
● Mark Stevenson. Word Sense Disambiguation: The Case for Combinations of Knowledge Sources. CSLI Studies in Computational Linguistics. CSLI Publications, Stanford, Calif., 2003.
● D. Yarowsky. Word sense disambiguation. In Alexander Clark, editor, The Handbook of Computational Linguistics and Natural Language Processing, Blackwell Handbooks in Linguistics. Wiley-Blackwell, Oxford [et al.], 1st edition, 2010.
● D. Crystal. The Cambridge Encyclopedia of Language. Cambridge University Press, 2010.
● Clara Cabezas, Philip Resnik, and Jessica Stevens. Supervised sense tagging using support vector machines. In Proceedings of the Second International Workshop on Evaluating Word Sense Disambiguation Systems, SENSEVAL '01, pages 59–62, Stroudsburg, PA, USA, 2001. Association for Computational Linguistics.
● Andrew Kachites McCallum. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.
● Marcel Bollmann. POS tagging for historical texts with sparse training data. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability in Discourse, pages 11–18, Sofia, Bulgaria, August 2013. Association for Computational Linguistics.
● Taesun Moon and Jason Baldridge. Part-of-speech tagging for Middle English through alignment and projection of parallel diachronic texts. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 390–399, Prague, Czech Republic, June 2007. Association for Computational Linguistics.
● Roland Meyer. New wine in old wineskins? Tagging Old Russian via annotation projection from modern translations. Russian Linguistics, 35(2):267–281, 2011.
● Marco Pennacchiotti and Fabio Massimo Zanzotto. Natural Language Processing across time: an empirical investigation on Italian, volume 5221, pages 371–382. Springer, 2008.