1. Investigating the Possibilities of
Using SMT for Text Annotation
László J. Laki1,2
laki.laszlo@itk.ppke.hu
1 Pázmány Péter Catholic University, Faculty of Information Technology
2 MTA-PPKE Language Technology Research Group
This work was partially supported by TÁMOP-4.2.1.B – 11/2/KMR-2011–0002
2. OUTLINE
• SMT as POS tagger
• Baseline system
• Decreasing the size of target vocabulary
• Handling OOV words
• Evaluation
• Conclusion
3. STATISTICAL MACHINE TRANSLATION
• Frameworks
– MOSES (Koehn et al., 2007)
– JOSHUA (Li et al., 2009)
– SRILM (Stolcke, 2002)
• Corpus
– Szeged Korpusz 2 (Csendes et al., 2003)
– 1.2 million words
– MSD coding system
4. THE BASELINE SYSTEM
Plain text:
a konszolidációra való törekvés találkozott a budapest#bank igényeivel is - tudjuk meg garadnai#róbert adattárház-menedzsertől .
Reference annotation:
a_[Tf] konszolidáció_[Nc-ss] való_[Afp-sn] törekvés_[Nc-sn] találkozik_[Vmis3s---n] a_[Tf] Budapest#Bank_[Np-sn] igény_[Nc-pi---s3] is_[Ccsp] -_[Punct] tud_[Vmip1p---y] meg_[Rp] Garadnai#Róbert_[Np-sn] adattárház-menedzser_[Nc-sb] ._[Punct]
System’s annotation:
a_[Tf] konszolidáció_[Nc-ss] való_[Afp-sn] törekvés_[Nc-sn] találkozik_[Vmis3s---n] a_[Tf] Budapest#Bank_[Np-sn] igény_[Nc-pi---s3] is_[Ccsp] -_[Punct] tud_[Vmip1p---y] meg_[Rp] Garadnai#Róbert_[Np-sn] adattárház-menedzsertől ._[Punct]
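The example above treats tagging as translation from plain text into `lemma_[tag]` tokens, so training needs a sentence-aligned parallel corpus. The sketch below shows one way such a pair could be produced. The `(surface, lemma, msd)` triple format and the lowercasing of the source side are assumptions made for illustration; the slides do not describe the actual preprocessing.

```python
def to_parallel(sentence):
    """Build one (source, target) training pair.

    `sentence` is a list of (surface, lemma, msd) triples; this input
    format is an assumption, not the corpus' real file layout.
    """
    # Source side: plain surface forms (lowercased, as in the slide's example).
    source = " ".join(surface.lower() for surface, _, _ in sentence)
    # Target side: one lemma_[tag] token per source word.
    target = " ".join(f"{lemma}_[{msd}]" for _, lemma, msd in sentence)
    return source, target


example = [
    ("A", "a", "Tf"),
    ("konszolidációra", "konszolidáció", "Nc-ss"),
    ("való", "való", "Afp-sn"),
]
src, tgt = to_parallel(example)
print(src)
print(tgt)
```

Both sides always have the same number of tokens, which is what lets a phrase-based system act as a tagger.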
• Correct annotation: 24557
• Incorrect annotation: 646
• No annotation: 1697

System   BLEU score   Accuracy
MOSES    98.49%       91.29%
JOSHUA   97.31%       91.07%
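The three counts above appear to add up to the reported MOSES accuracy if untagged words are counted as errors. A quick check of that arithmetic (the interpretation is an inference from the numbers, not something the slide states):

```python
correct, incorrect, untagged = 24557, 646, 1697

total = correct + incorrect + untagged
accuracy = correct / total          # untagged tokens count as wrong
untagged_share = untagged / total   # share of tokens left without a tag

print(f"total tokens:   {total}")
print(f"accuracy:       {accuracy:.2%}")
print(f"untagged share: {untagged_share:.2%}")
```

Under this reading, untagged words account for more of the lost accuracy than wrongly tagged ones.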
5. DECREASING THE SIZE OF TARGET VOCABULARY
• With only POS disambiguation
– Annotate to POS tags without lemmatization
• (e.g. találkozik_[Vmis3s---n] -> [Vmis3s---n])
– Complexity: 152694 -> 1128 tokens
– Accuracy: 91.46% (+0.17 percentage points)
• With simplifying POS tags
– Annotate to main POS tags
• (e.g. [Vmis3s---n] -> V)
– Complexity: 1128 -> 14 tokens
– Accuracy: 92.20% (+0.91 percentage points)
• Conclusion
– None of the OOV words were tagged (1698 words)
– Quality increased slightly, at the cost of significant information loss
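The two reductions of the target vocabulary can be sketched as simple rewrites of the target-side tokens. The helper names are hypothetical, and keeping `Punct` as a whole tag (so that it does not collapse into the MSD category letter `P`) is an assumption; the slides do not say how punctuation was handled.

```python
import re

# Matches "lemma_[TAG]" as well as a bare "[TAG]".
TOKEN_RE = re.compile(r"^(?:(?P<lemma>.+)_)?\[(?P<tag>[^\]]+)\]$")


def strip_lemma(token):
    """lemma_[TAG] -> [TAG]; tokens without a tag are returned unchanged."""
    m = TOKEN_RE.match(token)
    return f"[{m.group('tag')}]" if m else token


def main_pos(token):
    """lemma_[TAG] or [TAG] -> main POS category (first letter of the MSD code)."""
    m = TOKEN_RE.match(token)
    if not m:
        return token
    tag = m.group("tag")
    # Assumption: "Punct" is kept whole instead of being reduced to "P".
    return tag if tag == "Punct" else tag[0]


print(strip_lemma("találkozik_[Vmis3s---n]"))
print(main_pos("[Vmis3s---n]"))
```

Applying either function to every token of the target side of the training corpus yields the smaller vocabulary.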
6. HANDLING OOV WORDS
• OOV words are included in just a few word classes
• Analyze the context of the OOV words
• Create a dictionary based on the frequency of the words calculated from the training set
• The words not included in this dictionary are changed to the string "unk"
• Tested on different thresholds

Token             #
ezt               120
a                 100
kívül             6
diplomáciai       4
magyarországi     4
képességet        2
erőfeszítéseken   2
adhatnák          1

Plain text:
ezt a lobbyerőt és képességet a diplomáciai erőfeszítéseken kívül mindenekelőtt a magyarországi multinacionálisok adhatnák .
Modified text:
ezt a unk és unk a diplomáciai unk kívül mindenekelőtt a magyarországi unk unk .
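The frequency-based replacement can be sketched as follows. The counts for "ezt" through "adhatnák" come from the slide; the counts for "és", "mindenekelőtt" and "." are hypothetical, since the slide does not list them. The `min_count=3` cutoff is likewise an assumption, chosen because it reproduces the modified text shown on the slide.

```python
from collections import Counter


def mask_rare_words(sentence, counts, min_count, unk="unk"):
    """Replace every token seen fewer than `min_count` times with `unk`."""
    return " ".join(
        tok if counts[tok] >= min_count else unk
        for tok in sentence.split()
    )


# Training-set frequencies taken from the slide's table.
counts = Counter({
    "ezt": 120, "a": 100, "kívül": 6, "diplomáciai": 4,
    "magyarországi": 4, "képességet": 2, "erőfeszítéseken": 2, "adhatnák": 1,
})
# Hypothetical frequencies for tokens the slide does not list.
counts.update({"és": 50, "mindenekelőtt": 3, ".": 200})

plain = ("ezt a lobbyerőt és képességet a diplomáciai erőfeszítéseken kívül "
         "mindenekelőtt a magyarországi multinacionálisok adhatnák .")
masked = mask_rare_words(plain, counts, min_count=3)
print(masked)
```

The same masking has to be applied to the training data, so that the translation model learns how "unk" behaves in context.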
7. HANDLING OOV WORDS
Accuracy by threshold:

Threshold        Original text   Lemmatized text   Multiple threshold
X<1 (Baseline)   91.46%
(not given)      93.13%          92.57%            93.28%
(not given)      90.40%          92.25%            90.65%
(not given)      88.41%          91.81%            88.62%
(not given)      87.07%          91.48%            87.40%
(not given)      85.97%          91.10%            86.15%
• In the original text
– Best accuracy: 93.13%
• In case of lemmas
– Best accuracy: 92.57%
• Multiple thresholds
– Best accuracy: 93.28%
8. INTRODUCING POSTFIXES
• Goal: separate nouns, verbs, adjectives, etc.
• Different POS types have characteristic postfixes
• Use the last characters of the OOV words
– Last 2, 3 or 4 characters
– e.g. noun: házból -> unk_ból
– e.g. verb: megállítottuk -> unk_tuk
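A minimal sketch of the postfix idea: an unknown word is replaced by "unk" plus its last few characters, so words with different endings get different placeholders. The function name and the `vocab` set are hypothetical; `n=3` matches the two examples above.

```python
def mask_with_postfix(token, vocab, n=3, unk="unk"):
    """Keep known tokens; replace unknown ones with unk_<last n characters>."""
    if token in vocab:
        return token
    # Slicing works on Unicode code points, so accented letters such as
    # "ó" count as one character (assuming precomposed/NFC text).
    return f"{unk}_{token[-n:]}"


vocab = {"a", "ezt"}  # hypothetical set of known words
print(mask_with_postfix("házból", vocab))
print(mask_with_postfix("megállítottuk", vocab))
print(mask_with_postfix("ezt", vocab))
```

With `n` set to 2 or 4 the same function covers the other postfix lengths mentioned above.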
10. EVALUATION
• Baseline:
– Choose the best
• PurePos:
– Maxent and HMM based
– Includes morphological disambiguation
• OpenNLP
– Maxent based
– Perceptron based

Only POS tagging
System                     Token accuracy   Sentence accuracy
Baseline (BL)              89.66%           25.27%
SMT-Baseline2              91.46%           34.53%
SMT-OOV-postfix            95.96%           56.47%
PurePos                    96.03%           55.87%
PurePos-MorphTable         97.29%           66.40%
OpenNLP Maxent (ONM)       95.28%           26.00%
OpenNLP Perceptron (ONP)   94.98%           26.67%

POS tagging + lemmatization
System                     Token accuracy   Sentence accuracy
SMT-Baseline1              91.29%           33.73%
PurePos                    83.92%           10.00%
PurePos-MorphTable         84.89%           11.60%
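The two metrics reported above can be computed as sketched below, assuming token accuracy is the share of correctly annotated tokens and sentence accuracy is the share of sentences with no error at all. The slides do not give the exact definitions, and the two short sentences are fragments assembled from the earlier example purely for illustration.

```python
def token_and_sentence_accuracy(reference, hypothesis):
    """Return (token accuracy, sentence accuracy) for aligned sentence lists."""
    tok_total = tok_ok = sent_ok = 0
    for ref, hyp in zip(reference, hypothesis):
        ref_toks, hyp_toks = ref.split(), hyp.split()
        matches = sum(r == h for r, h in zip(ref_toks, hyp_toks))
        tok_total += len(ref_toks)
        tok_ok += matches
        # A sentence is correct only if every token matches.
        if len(ref_toks) == len(hyp_toks) and matches == len(ref_toks):
            sent_ok += 1
    return tok_ok / tok_total, sent_ok / len(reference)


reference = [
    "a_[Tf] törekvés_[Nc-sn] ._[Punct]",
    "tud_[Vmip1p---y] meg_[Rp] adattárház-menedzser_[Nc-sb] ._[Punct]",
]
hypothesis = [
    "a_[Tf] törekvés_[Nc-sn] ._[Punct]",
    "tud_[Vmip1p---y] meg_[Rp] adattárház-menedzsertől ._[Punct]",
]
tok_acc, sent_acc = token_and_sentence_accuracy(reference, hypothesis)
print(f"token accuracy: {tok_acc:.2%}, sentence accuracy: {sent_acc:.2%}")
```

A single untagged word is enough to make a whole sentence count as wrong, which is one reason sentence accuracy is so much lower than token accuracy in the tables.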
11. CONCLUSION
• An SMT system was examined for part-of-speech disambiguation and lemmatization in Hungarian
• Fully automated system
• Best token accuracy of the SMT system: about 96% (95.96%, POS tagging only)
• Decreasing the size of the target vocabulary
• Handling OOV words
12. THANK YOU FOR YOUR ATTENTION
laki.laszlo@itk.ppke.hu