Deep Learning-Based Morphological Taggers and
Lemmatizers for Annotating Historical Texts
Helmut Schmid
Centrum für Informations- und Sprachverarbeitung
Ludwig-Maximilian-Universität München
schmid@cis.lmu.de
POS Tagging and Lemmatization
word part-of-speech tag lemma
do AVD dô
begagenda VVFIN.Ind.Past.Sg.3 be-gègenen
imo PPER.Masc.Dat.Sg.3 ër
min DPOSA.Masc.Nom.Sg.* mîn
trohtin NA.Masc.Nom.Sg.st truhtîn
mit APPR mit
ſinero DPOSA.Fem.Dat.Sg.st sîn
arngrihte NA.Fem.Dat.Sg.st êre-grëhte
. $_
Example sentence from the Middle High German Reference Corpus
Helmut Schmid (CIS) Deep Learning-Based Morphological Taggers and Lemmatizers for Historical Texts 2 / 30
High Spelling Variation in Historical Texts
Variants of the Middle High German word tuon (to do)
Dialectal variation
tuon, dun, doyn
Spelling variation
tuon, thuon, tuen
Script variation
tuon, tvon, tûon, tu(o)n, tv(o)n, to(v)n (superscript letters shown in parentheses)
Overview
Word representations
POS tagger architecture
Lemmatizer architecture
Evaluations
Conclusions
Tagging Methods
Statistical Taggers
the most widely used class of annotation systems
They learn from manually annotated data.
Unseen words cause problems:
If thuon was never seen, what should its POS tag be?
Guessing the POS tag based on the final letters does not always help:
The most frequent word with the same 3-letter-suffix is the preposition uon (von).
Word Representations
Statistical systems represent words as atomic units.
We need a better word representation
which reflects word similarity
and provides good representations for unseen words
Solution
Representation of words with an n-dimensional vector (embedding)
(-0.7, 1.5, 2.8, -5.5, 0.2, ...)
Word Embeddings
(Figure: 2D embedding plot in which related words cluster: the capitals Brussels, London, Paris near their countries Belgium, UK, France; the colours orange, blue, pink; house/building; small/tiny; walk/run; the numerals one, two, three, five.)
Embeddings are points in an n-dimensional space.
Embeddings of similar words are close to each other.
Higher-dimensional spaces can model different types of similarity.
syntactic, semantic etc.
Such embeddings can be learned with neural networks.
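As a toy illustration of "similar words are close", here is a minimal NumPy sketch with hand-picked, purely hypothetical 4-dimensional embeddings; real embeddings are learned, not set by hand:

```python
import numpy as np

# Hypothetical toy embeddings (illustration only).
emb = {
    "Paris":  np.array([ 0.9, 0.1, 0.8, 0.0]),
    "London": np.array([ 0.8, 0.2, 0.7, 0.1]),
    "walk":   np.array([-0.5, 0.9, 0.0, 0.6]),
}

def cosine(a, b):
    # Cosine similarity: 1.0 for identical directions, lower for unrelated words.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Capitals lie closer to each other than to an unrelated verb.
cosine(emb["Paris"], emb["London"]) > cosine(emb["Paris"], emb["walk"])  # → True
```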
Part-of-Speech Tagging
(Figure: the word embeddings for "The quick brown ... dog" are processed by a bi-RNN, which predicts the tags DT JJ JJ NN.)
The embedding sequence is processed with a bi-RNN
⇒ representation for each word in context
The output tag is predicted from this contextual representation.
The POS tagger is trained end-to-end on a manually annotated text corpus.
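The pipeline above (embeddings → bi-RNN → per-word tag) can be sketched in NumPy; a plain tanh RNN stands in for the LSTMs of the actual tagger, and all parameters are random placeholders rather than trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_tags, T = 4, 3, 5   # embedding size, tagset size, sentence length

# Hypothetical random parameters; a real tagger learns them end-to-end.
Wf = rng.standard_normal((dim, 2 * dim)) * 0.1      # forward RNN
Wb = rng.standard_normal((dim, 2 * dim)) * 0.1      # backward RNN
Wo = rng.standard_normal((n_tags, 2 * dim)) * 0.1   # tag classifier

def rnn(W, xs):
    """Run a simple tanh RNN over xs and return all states."""
    h, states = np.zeros(dim), []
    for x in xs:
        h = np.tanh(W @ np.concatenate([h, x]))
        states.append(h)
    return states

embs = [rng.standard_normal(dim) for _ in range(T)]  # word embeddings
fwd = rnn(Wf, embs)              # left context for each word
bwd = rnn(Wb, embs[::-1])[::-1]  # right context for each word

# Contextual representation = concatenation of both directions;
# the output tag is predicted from it (here: linear layer + argmax).
tags = [int(np.argmax(Wo @ np.concatenate([f, b]))) for f, b in zip(fwd, bwd)]
```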
Dealing With Unseen Words
The word embeddings of unseen words cannot be learned from the annotated data.
Solution 1: Pre-train the embeddings on a large unannotated corpus
(e.g. by predicting the current word based on the left and right context)
Solution 2: Compute a word representation from the letter sequence.
The two methods can also be combined.
Letter-Based Word Representations
The letters are represented with letter embeddings.
The embedding sequence is processed with a bi-RNN.
The two final representations are concatenated.
⇒ letter-based word representation
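The construction can be sketched in NumPy: run a forward and a backward letter RNN and concatenate their final states (a tanh RNN stands in for the LSTM; embeddings and weights are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 4
# Hypothetical letter embeddings for a lowercase alphabet.
alphabet = {c: rng.standard_normal(dim) for c in "abcdefghijklmnopqrstuvwxyz"}
Wf = rng.standard_normal((dim, 2 * dim)) * 0.1   # forward letter RNN
Wb = rng.standard_normal((dim, 2 * dim)) * 0.1   # backward letter RNN

def final_state(W, xs):
    h = np.zeros(dim)
    for x in xs:
        h = np.tanh(W @ np.concatenate([h, x]))
    return h

def word_representation(word):
    letters = [alphabet[c] for c in word]
    # Concatenate the final states of the forward and backward letter RNNs.
    return np.concatenate([final_state(Wf, letters),
                           final_state(Wb, letters[::-1])])

vec = word_representation("tuon")   # a 2*dim-dimensional word vector
```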
Full Network
(Figure: letter-based representations of the input words feed the word-level bi-RNN, which outputs the tags DT JJ NN.)
Advantages
The letter-based RNN
learns regular spelling variations such as u ↔ v, uo ↔ u(o), etc.
generalizes from words to their unseen spelling variants: tuon → thv(o)n
provides high-quality word representations for unseen words
uses fewer parameters than an embedding table
is slower, but the word representations can be precomputed
Technical Details of the POS Tagger
Word representations
character-based representation computed with a biRNN
Forward-RNN processes a 10-letter suffix of the word
Backward-RNN processes a 10-letter prefix of the word
Prefixes/suffixes are padded with padding symbols if the word is shorter.
⇒ faster and simpler (parallel) processing
optionally: pre-trained word embeddings
concatenated with the character-based representation
most useful when the training corpus is small
RNN over words
uses two biRNN layers with residual connections
i.e. the input is added to the output
All RNNs are implemented with LSTMs.
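The prefix/suffix padding described above might look as follows (the helper name and the '#' padding symbol are arbitrary choices for illustration):

```python
def pad_affixes(word, n=10, pad="#"):
    # The forward RNN reads an n-letter suffix and the backward RNN an
    # n-letter prefix; shorter words are filled with a padding symbol so
    # all inputs have equal length, enabling parallel batch processing.
    suffix = (pad * n + word)[-n:]
    prefix = (word + pad * n)[:n]
    return prefix, suffix

pad_affixes("tuon")   # → ('tuon######', '######tuon')
```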
Lemmatizer
(Figure: the encoder reads the letters b a b i e s; the decoder, starting from <s> and attending over the encoder states (Σ), has emitted b a b and predicts the next letter.)
Encoder-decoder architecture with attention
The encoder computes a representation for each letter in context.
The decoder generates the output letters one by one.
The attention module computes an input summary conditioned on the current decoder state.
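A minimal sketch of the attention step, assuming simple dot-product scoring (the actual scorer may differ):

```python
import numpy as np

def attention(decoder_state, encoder_states):
    # Score each input position against the current decoder state
    # (dot-product scoring, assumed here for simplicity).
    scores = encoder_states @ decoder_state
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over input positions
    # The input summary is the attention-weighted sum of encoder states.
    return weights @ encoder_states

H = np.array([[1.0, 0.0],                  # encoder states for 3 letters
              [0.0, 1.0],
              [1.0, 1.0]])
ctx = attention(np.array([1.0, 0.0]), H)   # summary for this decoder state
```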
POS Tagging Experiments on Tiger
Goal: Check if our tagger is state-of-the-art.
Tiger is the standard corpus for German POS tagging
Resulting accuracy for POS and morphological tagging
system            test
Heigold           93.23
our tagger        93.42
Heigold+emb       93.85
our tagger+emb    93.88
POS Tagging Experiments on the ReM corpus
Corpus Description
Reference Corpus for Middle High German
period 1050–1350
diplomatically transcribed
semi-automatically annotated with
normalized spelling
part of speech
lemma
morphological features
2.9 times more transcribed word types than normalized word types
Contracted word forms such as inhandon (in handon, in hand) are annotated with two tags.
7830 different fine-grained tags including these combined tags
POS Tagging Experiments on the ReM corpus
Resulting POS and morphological tagging accuracies
system                  test
TreeTagger POS          90.40
our tagger POS          95.88
our tagger POS+morph    89.45
POS+morph accuracy on unseen words: 79%
POS+morph accuracy on seen words with unseen tag: 57%
POS Tagging Experiments on Other Corpora
Corpus size tags dev test
Middle Dutch Gysseling corpus 1.5M 3864 91.10 91.01
Syntactic Reference Corpus of Medieval French 1.2M 68 96.45 96.28
York-Toronto-Helsinki Parsed C. of Old English 1.6M 298 97.37 97.13
Ancient Greek Dependency Treebank 2.1 548K 1041 91.35 91.29
Icelandic Parsed Historical Corpus 383K 319 93.88 89.87
Corpus Taurinese (Old Italian) 246K 239 98.46 98.43
GerManC historical German newspaper corpus (POS tags) 770K 62 97.25 98.23
GerManC historical German newspaper corpus (POS+morph tags) 770K 2303 85.65 87.72
The accuracy is high if the tagset is small.
Exception: Icelandic
Lemmatization Experiments on ReM
Encoder-decoder system DL4MT by Kyunghyun Cho
originally developed for machine translation
used for morphological reinflection by Kann & Schütze
input: wordform + POS (split into a sequence of letters)
f r o g e t e # V V F I N . * . P a s t . S g . 3
output: lemma (sequence of letters)
v r â g e n
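Building this letter-sequence input can be sketched as follows (the helper name is hypothetical; the '#' separator follows the example above):

```python
def encode_input(wordform, pos_tag):
    # The word form and the POS tag are both split into single characters,
    # joined by a separator symbol, forming the encoder input sequence.
    return list(wordform) + ["#"] + list(pos_tag)

encode_input("frogete", "VVFIN.*.Past.Sg.3")
# → ['f', 'r', 'o', 'g', 'e', 't', 'e', '#', 'V', 'V', 'F', 'I', 'N', ...]
```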
Lemmatization Experiments on ReM
Test accuracy on unseen word-tag combinations
                               test
overall                        91.87
unseen words                   86.43
seen words with unseen tags    93.33
Conclusions
The annotation of historical corpora is difficult due to dialectal and spelling variations.
Neural networks are well suited to deal with this variation.
The letter-based word representations are robust to spelling variations and reflect the similarity of words.
NN-based POS taggers outperform the traditional POS taggers used in previous work on historical corpora by a large margin.
NN-based lemmatizers also produce accurate results on historical corpora.
The new tagger and lemmatizer (called RNNTagger) is freely available for non-commercial purposes via my homepage:
http://www.cis.lmu.de/~schmid
Thank you for your attention!
POS Tagger: Network Sizes
Letter LSTM layers: 1
Word LSTM layers: 2
Letter embeddings size: 100
Letter LSTM size: 400
Word LSTM size: 400
Dropout Rate: 0.5 (no dropout on embeddings)
Training algorithm: SGD
Initial learning rate: 1.0 (no averaging across the batch)
Learning Rate decay: 5% after each epoch
Gradient clipping threshold: 1.0
Early Stopping
minimal character frequency: 2
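Two of these settings can be made concrete. Note that reading the 5% decay as multiplicative is an assumption about the exact schedule, and the clipping helper rescales by gradient norm:

```python
import numpy as np

def lr_after_epoch(lr, decay=0.05):
    # "5% learning-rate decay after each epoch", read here as a
    # multiplicative decay (an assumption about the exact schedule).
    return lr * (1.0 - decay)

def clip_gradient(g, threshold=1.0):
    # Rescale the gradient vector if its norm exceeds the threshold.
    norm = np.linalg.norm(g)
    return g if norm <= threshold else g * (threshold / norm)

lr_after_epoch(1.0)   # → 0.95
```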
Lemmatizer: Network Sizes
Encoder layers: 2
Decoder layers: 1
Letter embeddings size: 100
Encoder LSTM size: 400
Decoder LSTM size: 400
Dropout Rate: 0.5 (no dropout on embeddings)
Training algorithm: Adam
Initial learning rate: 0.001
Learning Rate decay: 5% after each epoch
Gradient clipping threshold: 1.0
Early Stopping
tied decoder input and output embeddings
Lemmatization Errors
POS tag word lemma predicted
NA.Masc.Akk.Sg.st karger karkære kargære
VVFIN.Ind.Pres.Pl.3 gan’lâzzont geantlâzen gantlâzen
NA.Masc,Fem.Akk.Pl.* wiſeblumen wisebluome wîsbluome
NA.Fem.Nom.Sg.* mæſſenîe massenîe mèssenîe
ADJD.Pos unkunde unkund unkünde
VVFIN.Ind.Pres.Sg.3 geith gân jëhen
NE.Nom.Sg Franzze Franze Franke
ADJA.Pos.Neut.Akk.Pl.st gemaliv gemâl mâl
NA.Masc.Akk.Sg.st buneiz punèiz bunèiz
PAVAP_VVPP inchomen in_komen în_komen
NA.Neut.Dat.Sg.st geſtiche gèstach gestifte
NA.Neut.Nom.Sg.st uber_zinder überzimber überzinter
NA.Masc.Dat.Pl.st rieterin rihtære rîtære
AVD ennot ennoète ennôte
NA.Neut.Nom.Sg.st vzkuchen ûzkûchen ûzkiuchen
AVD_VVIMP.Sg.2 anegenke ane_dènken ane_gènken
NE.Dat.Sg c¯oſtenopele Constantinopel Konstantînôpol
VVINF murmuren murmerièren murmeln
Most Frequent POS Tag Confusions
freq correct erroneous example words
202  NE      FM        Auguſti, Rome, Stephani
111  ADJA    NA        ceſue, zeſwe, vordern
105  NA      PTKNEG    niht, nicht, nit
102  NA      ADJA      weiſen, ríndeſpuch, oel
100  DRELS   DDS       der, ds, dv, die
 82  DDS     DDART     der, daz, den, ds
 76  NA      AVD       vil, iht, mer, lu(o)te
 71  NA      ADJD      gut, weich, recht, gu(o)t
 61  ADJA    AVD       mer, leſtin, mínr, mere
 60  DRELS   DDART     der, ds, di
 56  AVD     NA        vil, her, mer
 56  DRELS   KOUS      daz, Dat, Du, dc
 55  PPER    VAFIN     is, ſi, ſei
 55  PTKNEG  NA        niht, nit, níht
 53  ADJD    AVD       her, ſu(o)ze, ſtete
 52  PTKVZ   AVD       vor, wids, wider, ane
 51  VVFIN   NA        lant, zûg, zspriſt
(superscript letters shown in parentheses)
Most Frequent Morphological Tag Confusions
freq correct erroneous
220 VVFIN.Ind.Past.Sg.3 VVFIN.*.Past.Sg.3
210 PPER.Masc.Nom.Pl.3 PPER.*.Nom.Pl.3
78 VAFIN.Ind.Past.Pl.3 VAFIN.*.Past.Pl.3
60 DDART.Masc.Nom.Pl.* DDART.*.Nom.Pl.*
58 NE.Gen.Sg NE.Nom.Sg
56 DDART.Fem.Akk.Sg.* DDART.Fem.Nom.Sg.*
53 VVFIN.Ind.Past.Pl.3 VVFIN.*.Past.Pl.3
51 PPER.Neut.Nom.Pl.3 PPER.*.Nom.Pl.3
51 NA.Fem.Akk.Sg.st NA.Fem.Nom.Sg.st
50 PPER.Masc,Neut.Nom.Pl.3 PPER.*.Nom.Pl.3
48 NA.Neut.Akk.Sg.st NA.Neut.Nom.Sg.st
47 PPER.Fem.Nom.Pl.3 PPER.*.Nom.Pl.3
47 NA.Neut.Akk.Pl.st NA.Neut.Akk.Sg.st
47 DRELS.Masc.Nom.Pl DRELS.*.Nom.Pl
46 NA.Fem.Dat.Sg.st NA.Fem.Gen.Sg.st
42 NA.Neut.Nom.Sg.st NA.Neut.Akk.Sg.st
42 DDS.Masc.Nom.Pl.* DDS.*.Nom.Pl.*
40 PPER.Neut.Akk.Sg.3 PPER.Neut.Gen.Sg.3
Application: Extraction of Negation Clitics
Middle High German negation particles are often realized as clitics:
clitic prefix:
enſprecheſt (nicht sprichst, not speak+2sg)
nemochte (nicht mochte, not wanted)
niſt (nicht ist, not is)
clitic suffix:
erne (er nicht, he not)
wirn (wir nicht, we not)
double prefix:
ſinenthewalten (sich nicht entwalten, himself not restrain)
Application: Extraction of Negation Clitics
Goal: Automatic extraction of words with negation clitics
Steps:
Training of the tagger on the ReM training corpus
Annotation of the test corpus
Extraction of all words tagged with PTKNEG-..., ...-PTKNEG, or ...-PTKNEG-...
Resulting precision: 96.4%
Resulting recall: 97.8%
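The extraction test in step 3 might be implemented as follows (assuming hyphen-separated combined tags, as written above; the helper name is hypothetical):

```python
def has_negation_clitic(tag):
    # A clitic-bearing word receives a combined tag in which PTKNEG is
    # one component (PTKNEG-..., ...-PTKNEG, or ...-PTKNEG-...).
    parts = tag.split("-")
    return len(parts) > 1 and "PTKNEG" in parts

has_negation_clitic("PTKNEG-VVFIN.Ind.Pres.Sg.2")   # clitic prefix → True
has_negation_clitic("PTKNEG")                       # plain particle → False
```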
Recurrent Neural Network
(Figure: a simple NN, unrolled over time as an RNN)
h_t = tanh(W (h_{t-1} ∘ x_t))
h_t is the RNN state at time t (= vector with the activations of all RNN neurons)
x_t is the input vector
The previous RNN state h_{t-1} and the input x_t are concatenated (∘).
The resulting vector is multiplied with the matrix W (= linear projection to a new space).
Finally, the activation function (here tanh) is applied (it squashes the values of the vector to the range [-1, 1]).
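The update formula translates directly into NumPy (W here is a random placeholder, not a trained matrix):

```python
import numpy as np

def rnn_step(W, h_prev, x):
    # Concatenate the previous state and the input, multiply with W,
    # and squash with tanh -- one application of the RNN update formula.
    return np.tanh(W @ np.concatenate([h_prev, x]))

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))              # state size 3, input size 2
h = rnn_step(W, np.zeros(3), np.ones(2))     # all activations lie in [-1, 1]
```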

GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
Neo4j
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
operationspcvita
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
Chart Kalyan
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 

Recently uploaded (20)

Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)GNSS spoofing via SDR (Criptored Talks 2024)
GNSS spoofing via SDR (Criptored Talks 2024)
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...
 
"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota"Choosing proper type of scaling", Olena Syrota
"Choosing proper type of scaling", Olena Syrota
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...
 
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge GraphGraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
The Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptxThe Microsoft 365 Migration Tutorial For Beginner.pptx
The Microsoft 365 Migration Tutorial For Beginner.pptx
 
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfHow to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 

Session6 01.helmut schmid

Word Embeddings
[Figure: 2-dimensional embedding plot; similar words form clusters such as Brussels/London/Paris, Belgium/UK/France, orange/blue/pink, house/building, small/tiny, walk/run, one/two/three/five]
Embeddings are points in an n-dimensional space.
Embeddings of similar words are close to each other.
Higher-dimensional spaces can model different types of similarity (syntactic, semantic etc.).
Such embeddings can be learned with neural networks.

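The notion of "close to each other" is usually made concrete with cosine similarity. A minimal sketch; the 3-dimensional toy vectors and the `cosine` helper are invented for illustration, not taken from the talk:

```python
import math

def cosine(u, v):
    # cosine similarity: dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy 3-dimensional embeddings, invented for illustration
emb = {
    "paris":  (0.9, 0.1, 0.0),
    "london": (0.8, 0.2, 0.1),
    "walk":   (0.0, 0.9, 0.4),
}

print(cosine(emb["paris"], emb["london"]))  # close to 1: similar words
print(cosine(emb["paris"], emb["walk"]))    # much smaller: dissimilar words
```

Real embeddings have hundreds of dimensions, but the geometry is the same.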
Part-of-Speech Tagging
[Figure: the word embeddings of "The quick brown ... dog" feed into a bi-RNN, which predicts the tags DT JJ JJ NN]
The embedding sequence is processed with a bi-RNN
⇒ a representation for each word in context.
The output tag is predicted from this contextual representation.
The POS tagger is trained end-to-end on a manually annotated text corpus.

Dealing With Unseen Words
The word embeddings of unseen words cannot be learned from the annotated data.
Solution 1: Pre-train the embeddings on a large unannotated corpus (e.g. by predicting the current word from the left and right context).
Solution 2: Compute a word representation from the letter sequence.
The two methods can also be combined.

Letter-Based Word Representations
The letters are represented with letter embeddings.
The embedding sequence is processed with a bi-RNN.
The two final representations are concatenated.
⇒ letter-based word representation
[Figure: character-level bi-RNN reading a word letter by letter]

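The construction can be sketched as follows. This is a toy version with a plain tanh-RNN standing in for the LSTMs used in the talk, invented sizes, and randomly initialized weights:

```python
import numpy as np

rng = np.random.default_rng(0)
EMB, HID = 8, 16                      # toy sizes; the talk uses size-100 embeddings and size-400 LSTMs
alphabet = "abcdefghijklmnopqrstuvwxyzſûô"
letter_emb = {c: rng.normal(size=EMB) for c in alphabet}

# one weight matrix per direction
W_fwd = rng.normal(scale=0.1, size=(HID, HID + EMB))
W_bwd = rng.normal(scale=0.1, size=(HID, HID + EMB))

def run_rnn(W, vectors):
    h = np.zeros(HID)
    for x in vectors:                 # h_t = tanh(W (h_{t-1} o x))
        h = np.tanh(W @ np.concatenate([h, x]))
    return h                          # final state

def word_representation(word):
    vecs = [letter_emb[c] for c in word]
    fwd = run_rnn(W_fwd, vecs)        # left-to-right over the letters
    bwd = run_rnn(W_bwd, vecs[::-1])  # right-to-left
    return np.concatenate([fwd, bwd]) # letter-based word representation

# an unseen spelling variant still receives a representation
print(word_representation("thuon").shape)  # (32,)
```

The point of the design: any letter string maps to a vector, so unseen words and spelling variants are no longer a special case.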
Full Network
[Figure: the full network: character-level bi-RNNs compute a representation for each word, a word-level bi-RNN processes these word representations, and the tags (DT, JJ, NN, ...) are predicted from its output]

Advantages
The letter-based RNN
learns regular spelling variations such as u ↔ v, uo ↔ uͦ etc.
generalizes from words to their unseen spelling variants: tuon → thvͦn
provides high-quality word representations for unseen words
uses fewer parameters than an embedding table
is slower, but the word representations can be precomputed

Technical Details of the POS Tagger
Word representations:
character-based representation computed with a bi-RNN
the forward RNN processes a 10-letter suffix of the word
the backward RNN processes a 10-letter prefix of the word
prefixes/suffixes are padded with padding symbols if the word is shorter
⇒ faster and simpler (parallel) processing
optionally: pre-trained word embeddings, concatenated with the character-based representation
most useful when the training corpus is small
RNN over words:
two bi-RNN layers with residual connections, i.e. the input is added to the output
All RNNs are implemented with LSTMs.

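The fixed-length prefix/suffix trick is what makes batched, parallel processing simple: every word becomes exactly two length-10 character sequences. A minimal sketch, assuming a `<pad>` symbol (the symbol name is an assumption; only the length 10 comes from the slide):

```python
PAD = "<pad>"
N = 10  # fixed prefix/suffix length from the slide

def suffix_input(word, n=N):
    """Last n letters, left-padded: read by the forward character RNN."""
    chars = list(word)[-n:]
    return [PAD] * (n - len(chars)) + chars

def prefix_input(word, n=N):
    """First n letters, right-padded: read by the backward character RNN."""
    chars = list(word)[:n]
    return chars + [PAD] * (n - len(chars))

print(suffix_input("tuon"))
# 6 padding symbols followed by 't', 'u', 'o', 'n'
```

Since every input now has the same shape, a whole batch of words can be packed into one tensor and processed in parallel.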
Lemmatizer
[Figure: encoder-decoder network with attention; the encoder reads the letters b a b i e s, the decoder emits the lemma letters (b a b ...), and the attention module (Σ) summarizes the encoder states]
Encoder-decoder architecture with attention:
The encoder computes a representation for each letter in context.
The decoder generates the output letters one by one.
The attention module computes an input summary conditioned on the current decoder state.

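The attention step can be sketched numerically. This toy version uses a bilinear scoring function (one of several common choices; the talk does not specify which the system uses) and random toy states:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_summary(encoder_states, decoder_state, W):
    # score every encoder state against the current decoder state
    scores = np.array([h @ W @ decoder_state for h in encoder_states])
    weights = softmax(scores)          # attention distribution over the input letters
    summary = sum(w * h for w, h in zip(weights, encoder_states))
    return summary, weights            # attention-weighted input summary

rng = np.random.default_rng(0)
H = 4                                  # toy state size
encoder_states = [rng.normal(size=H) for _ in "babies"]  # one state per input letter
decoder_state = rng.normal(size=H)     # current decoder state
W = rng.normal(size=(H, H))            # bilinear scoring matrix (assumed for illustration)

summary, weights = attention_summary(encoder_states, decoder_state, W)
print(weights.round(2))                # one weight per letter of "babies", summing to 1
```

At each decoding step the weights change, so the decoder can "look at" different input letters while producing the lemma.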
POS Tagging Experiments on Tiger
Goal: check that our tagger is state-of-the-art.
Tiger is the standard corpus for German POS tagging.
Resulting accuracy for POS and morphological tagging:

                 test
Heigold         93.23
our tagger      93.42
Heigold+emb     93.85
our tagger+emb  93.88

POS Tagging Experiments on the ReM Corpus
Corpus description:
Reference Corpus for Middle High German (ReM)
period 1050–1350
diplomatically transcribed
semi-automatically annotated with:
normalized spelling
part of speech
lemma
morphological features
2.9 times more transcribed word types than normalized word types
Contracted word forms such as inhandon (in handon, "in hand") are annotated with two tags.
7830 different fine-grained tags, including these combined tags

POS Tagging Experiments on the ReM Corpus
Resulting POS and morphological tagging accuracies:

                        test
TreeTagger  POS        90.40
our tagger  POS        95.88
our tagger  POS+morph  89.45

POS+morph accuracy on unseen words: 79%
POS+morph accuracy on seen words with an unseen tag: 57%

POS Tagging Experiments on Other Corpora

Corpus                                              size   tags   dev    test
Middle Dutch Gysseling corpus                       1.5M   3864   91.10  91.01
Syntactic Reference Corpus of Medieval French       1.2M     68   96.45  96.28
York-Toronto-Helsinki Parsed Corpus of Old English  1.6M    298   97.37  97.13
Ancient Greek Dependency Treebank 2.1               548K   1041   91.35  91.29
Icelandic Parsed Historical Corpus                  383K    319   93.88  89.87
Corpus Taurinese (Old Italian)                      246K    239   98.46  98.43
GerManC historical German newspaper corpus          770K     62   97.25  98.23
GerManC historical German newspaper corpus          770K   2303   85.65  87.72

The accuracy is high if the tagset is small. Exception: Icelandic.

Lemmatization Experiments on ReM
Encoder-decoder system DL4MT by Kyunghyun Cho
originally developed for machine translation
used for morphological reinflection by Kann & Schütze
input: wordform + POS tag (split into a sequence of letters)
f r o g e t e # V V F I N . * . P a s t . S g . 3
output: lemma (sequence of letters)
v r â g e n

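Building that input sequence is a one-liner; a sketch with the slide's own example (the `#` boundary symbol is taken from the slide, the helper name is mine):

```python
def lemmatizer_input(wordform, pos_tag, sep="#"):
    # both the wordform and the POS tag are split into single characters,
    # joined by a boundary symbol
    return list(wordform) + [sep] + list(pos_tag)

seq = lemmatizer_input("frogete", "VVFIN.*.Past.Sg.3")
print(" ".join(seq))
# f r o g e t e # V V F I N . * . P a s t . S g . 3
```

Feeding the tag as extra characters lets one model lemmatize all parts of speech and disambiguate forms like "frogete" by their morphological analysis.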
Lemmatization Experiments on ReM
Test accuracy on unseen word-tag combinations:

                              test
overall                      91.87
unseen words                 86.43
seen words with unseen tags  93.33

Conclusions
The annotation of historical corpora is difficult due to dialectal and spelling variation.
Neural networks are well suited to deal with this variation.
The letter-based word representations are robust to spelling variations and reflect the similarity of words.
NN-based POS taggers outperform the traditional POS taggers used in previous work on historical corpora by a large margin.
NN-based lemmatizers also produce accurate results on historical corpora.
The new tagger and lemmatizer (called RNNTagger) are freely available for non-commercial purposes via my homepage: http://www.cis.lmu.de/~schmid

Thank you for your attention!

POS Tagger: Network Sizes
Letter LSTM layers: 1
Word LSTM layers: 2
Letter embedding size: 100
Letter LSTM size: 400
Word LSTM size: 400
Dropout rate: 0.5 (no dropout on embeddings)
Training algorithm: SGD
Initial learning rate: 1.0 (no averaging across the batch)
Learning rate decay: 5% after each epoch
Gradient clipping threshold: 1.0
Early stopping
Minimal character frequency: 2

Lemmatizer: Network Sizes
Encoder layers: 2
Decoder layers: 1
Letter embedding size: 100
Encoder LSTM size: 400
Decoder LSTM size: 400
Dropout rate: 0.5 (no dropout on embeddings)
Training algorithm: Adam
Initial learning rate: 0.001
Learning rate decay: 5% after each epoch
Gradient clipping threshold: 1.0
Early stopping
Tied decoder input and output embeddings

Lemmatization Errors

POS tag                  word          lemma           predicted
NA.Masc.Akk.Sg.st        karger        karkære         kargære
VVFIN.Ind.Pres.Pl.3      gan’lâzzont   geantlâzen      gantlâzen
NA.Masc,Fem.Akk.Pl.*     wiſeblumen    wisebluome      wîsbluome
NA.Fem.Nom.Sg.*          mæſſenîe      massenîe        mèssenîe
ADJD.Pos                 unkunde       unkund          unkünde
VVFIN.Ind.Pres.Sg.3      geith         gân             jëhen
NE.Nom.Sg                Franzze       Franze          Franke
ADJA.Pos.Neut.Akk.Pl.st  gemaliv       gemâl           mâl
NA.Masc.Akk.Sg.st        buneiz        punèiz          bunèiz
PAVAP_VVPP               inchomen      in_komen        în_komen
NA.Neut.Dat.Sg.st        geſtiche      gèstach         gestifte
NA.Neut.Nom.Sg.st        uber_zinder   überzimber      überzinter
NA.Masc.Dat.Pl.st        rieterin      rihtære         rîtære
AVD                      ennot         ennoète         ennôte
NA.Neut.Nom.Sg.st        vzkuchen      ûzkûchen        ûzkiuchen
AVD_VVIMP.Sg.2           anegenke      ane_dènken      ane_gènken
NE.Dat.Sg                c¯oſtenopele  Constantinopel  Konstantînôpol
VVINF                    murmuren      murmerièren     murmeln

Most Frequent POS Tag Confusions

freq  correct  erroneous  example words
 202  NE       FM         Auguſti, Rome, Stephani
 111  ADJA     NA         ceſue, zeſwe, vordern
 105  NA       PTKNEG     niht, nicht, nit
 102  NA       ADJA       weiſen, ríndeſpuch, oel
 100  DRELS    DDS        der, ds, dv, die
  82  DDS      DDART      der, daz, den, ds
  76  NA       AVD        vil, iht, mer, luͦte
  71  NA       ADJD       gut, weich, recht, guͦt
  61  ADJA     AVD        mer, leſtin, mínr, mere
  60  DRELS    DDART      der, ds, di
  56  AVD      NA         vil, her, mer
  56  DRELS    KOUS       daz, Dat, Du, dc
  55  PPER     VAFIN      is, ſi, ſei
  55  PTKNEG   NA         niht, nit, níht
  53  ADJD     AVD        her, ſuͦze, ſtete
  52  PTKVZ    AVD        vor, wids, wider, ane
  51  VVFIN    NA         lant, zûg, zs priſt

Most Frequent Morphological Tag Confusions

freq  correct                  erroneous
 220  VVFIN.Ind.Past.Sg.3      VVFIN.*.Past.Sg.3
 210  PPER.Masc.Nom.Pl.3       PPER.*.Nom.Pl.3
  78  VAFIN.Ind.Past.Pl.3      VAFIN.*.Past.Pl.3
  60  DDART.Masc.Nom.Pl.*      DDART.*.Nom.Pl.*
  58  NE.Gen.Sg                NE.Nom.Sg
  56  DDART.Fem.Akk.Sg.*       DDART.Fem.Nom.Sg.*
  53  VVFIN.Ind.Past.Pl.3      VVFIN.*.Past.Pl.3
  51  PPER.Neut.Nom.Pl.3       PPER.*.Nom.Pl.3
  51  NA.Fem.Akk.Sg.st         NA.Fem.Nom.Sg.st
  50  PPER.Masc,Neut.Nom.Pl.3  PPER.*.Nom.Pl.3
  48  NA.Neut.Akk.Sg.st        NA.Neut.Nom.Sg.st
  47  PPER.Fem.Nom.Pl.3        PPER.*.Nom.Pl.3
  47  NA.Neut.Akk.Pl.st        NA.Neut.Akk.Sg.st
  47  DRELS.Masc.Nom.Pl        DRELS.*.Nom.Pl
  46  NA.Fem.Dat.Sg.st         NA.Fem.Gen.Sg.st
  42  NA.Neut.Nom.Sg.st        NA.Neut.Akk.Sg.st
  42  DDS.Masc.Nom.Pl.*        DDS.*.Nom.Pl.*
  40  PPER.Neut.Akk.Sg.3       PPER.Neut.Gen.Sg.3

Application: Extraction of Negation Clitics
Middle High German negation particles are often realized as clitics:
clitic prefix:
enſprecheſt (nicht sprichst, not speak+2sg)
nemochte (nicht mochte, not wanted)
niſt (nicht ist, not is)
clitic suffix:
erne (er nicht, he not)
wirn (wir nicht, we not)
double prefix:
ſinenthewalten (sich nicht entwalten, himself not restrain)

Application: Extraction of Negation Clitics
Goal: automatic extraction of words with negation clitics
Steps:
Train the tagger on the ReM training corpus.
Annotate the test corpus with it.
Extract all words tagged with PTKNEG-..., ...-PTKNEG, or ...-PTKNEG-...
Resulting precision: 96.4%
Resulting recall: 97.8%

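The extraction step above can be sketched as a filter over tagged tokens. This assumes hyphen-joined combined tags as written on the slide; the (word, tag) pairs below are invented in the spirit of the earlier examples:

```python
def has_negation_clitic(tag):
    # matches PTKNEG-..., ...-PTKNEG and ...-PTKNEG-...,
    # but not the plain particle tag PTKNEG itself
    parts = tag.split("-")
    return len(parts) > 1 and "PTKNEG" in parts

# invented (word, tag) pairs for illustration
tagged = [
    ("niſt", "PTKNEG-VAFIN"),   # clitic prefix: nicht + ist
    ("erne", "PPER-PTKNEG"),    # clitic suffix: er + nicht
    ("mit",  "APPR"),
    ("niht", "PTKNEG"),         # plain negation particle, no clitic
]
hits = [word for word, tag in tagged if has_negation_clitic(tag)]
print(hits)  # ['niſt', 'erne']
```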
Recurrent Neural Network
[Figure: a simple feed-forward NN next to an RNN, whose hidden layer has a recurrent connection to itself]

h_t = tanh(W (h_{t-1} ∘ x))

h_t is the RNN state at time t (= a vector with the activations of all RNN neurons).
x is the input vector.
The previous RNN state h_{t-1} and the input x are concatenated (∘).
The resulting vector is multiplied with the matrix W (= a linear projection into a new space).
Finally, the activation function (here tanh) is applied; it squashes the values of the vector into the range [-1, 1].
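The update rule maps directly onto code. A minimal sketch with invented toy sizes and random weights (biases omitted, as in the slide's formula):

```python
import numpy as np

rng = np.random.default_rng(0)
HID, IN = 4, 3                        # toy sizes, chosen for illustration
W = rng.normal(scale=0.5, size=(HID, HID + IN))

def rnn_step(h_prev, x):
    # h_t = tanh(W (h_{t-1} o x)): concatenate, project, squash
    return np.tanh(W @ np.concatenate([h_prev, x]))

h = np.zeros(HID)                     # initial state
for x in rng.normal(size=(5, IN)):    # process a sequence of 5 input vectors
    h = rnn_step(h, x)

print(h.shape)  # (4,)
```

Because the same W is applied at every step, the network handles sequences of any length with a fixed number of parameters; LSTMs replace this cell with a gated variant but keep the same recurrence pattern.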