Using SMT for Hungarian text annotation

Investigating the Possibilities of
Using SMT for Text Annotation
László J. Laki1,2
laki.laszlo@itk.ppke.hu

1 Pázmány Péter Catholic University, Faculty of Information Technology

2 MTA-PPKE Language Technology Research Group

This work was partially supported by TÁMOP-4.2.1.B – 11/2/KMR-2011–0002

OUTLINE
• SMT as POS tagger
• Baseline system
• Decreasing the size of target vocabulary
• Handling OOV words
• Evaluation
• Conclusion

STATISTICAL MACHINE TRANSLATION

• Frameworks
– MOSES (Koehn et al., 2007)
– JOSHUA (Li et al., 2009)
– SRILM (Stolcke, 2002)
• Corpus
– Szeged Korpusz 2 (Csendes et al., 2003)
– 1.2 million words
– MSD coding system

THE BASELINE SYSTEM
Plain text:
a konszolidációra való törekvés találkozott a budapest#bank igényeivel is - tudjuk meg garadnai#róbert adattárház-menedzsertől .

Reference annotation:
a_[Tf] konszolidáció_[Nc-ss] való_[Afp-sn] törekvés_[Nc-sn] találkozik_[Vmis3s---n] a_[Tf] Budapest#Bank_[Np-sn] igény_[Nc-pi---s3] is_[Ccsp] -_[Punct] tud_[Vmip1p---y] meg_[Rp] Garadnai#Róbert_[Np-sn] adattárház-menedzser_[Nc-sb] ._[Punct]

System’s annotation:
a_[Tf] konszolidáció_[Nc-ss] való_[Afp-sn] törekvés_[Nc-sn] találkozik_[Vmis3s---n] a_[Tf] Budapest#Bank_[Np-sn] igény_[Nc-pi---s3] is_[Ccsp] -_[Punct] tud_[Vmip1p---y] meg_[Rp] Garadnai#Róbert_[Np-sn] adattárház-menedzsertől ._[Punct]

• Correct annotation: 24557
• Incorrect annotation: 646
• No annotation: 1697

System    BLEU score   Accuracy
MOSES     98.49%       91.29%
JOSHUA    97.31%       91.07%
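
The three counts above can be related to the reported accuracy with a short sketch. It assumes that every target token has the form `lemma_[TAG]`, that the reference and the system output are token-aligned, and that an unannotated token counts as an error; the function names `classify`, `count_outcomes`, and `accuracy` are hypothetical and not part of MOSES or JOSHUA.

```python
import re
from collections import Counter

# Assumed target-side token format: lemma_[TAG]
ANNOTATED = re.compile(r"^(.+)_\[([^\]]+)\]$")

def classify(reference_token, system_token):
    """Label one system token as 'correct', 'incorrect', or 'none'."""
    if ANNOTATED.match(system_token) is None:
        return "none"  # the word was passed through without an annotation
    return "correct" if system_token == reference_token else "incorrect"

def count_outcomes(reference, system):
    """Count outcomes for two token-aligned, whitespace-separated sentences."""
    ref_tokens, sys_tokens = reference.split(), system.split()
    if len(ref_tokens) != len(sys_tokens):
        raise ValueError("sentences are not token-aligned")
    return Counter(classify(r, s) for r, s in zip(ref_tokens, sys_tokens))

def accuracy(correct, incorrect, none):
    """Share of correctly annotated tokens among all tokens."""
    return correct / (correct + incorrect + none)

# Counts reported on the slide for the baseline system.
print(f"{accuracy(24557, 646, 1697):.2%}")  # prints 91.29%
```

Under these assumptions the counts reproduce the 91.29% reported for MOSES, which suggests that unannotated (OOV) tokens are scored as errors.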

DECREASING THE SIZE OF TARGET VOCABULARY
• With only POS disambiguation
– Annotate to POS tags without lemmatization
• (e.g. találkozik_[Vmis3s---n] -> [Vmis3s---n])
– Complexity: 152694 -> 1128 tokens
– Accuracy: 91.46% (+0.17%)
• With simplifying POS tags
– Annotate to main POS tags
• (e.g. [Vmis3s---n] -> V)
– Complexity: 1128 -> 14 tokens;
– Accuracy: 92.20% (+0.91%)
• Conclusion
– None of the OOV words were tagged (1698 words)
– Quality increased slightly, at the cost of significant information loss
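
The two reduction steps can be sketched as simple string transformations on the target side. This is a minimal illustration, assuming the `lemma_[TAG]` token format shown earlier and that the main POS is the first character of an MSD code; the handling of the non-MSD tag `Punct` (kept whole) is an assumption, and the function names are hypothetical.

```python
import re

# Assumed target-side token format: lemma_[TAG]
ANNOTATED = re.compile(r"^(.+)_\[([^\]]+)\]$")

# Assumption: tags that are not MSD codes are kept whole.
NON_MSD_TAGS = {"Punct"}

def full_tag(token):
    """találkozik_[Vmis3s---n] -> [Vmis3s---n] (drop the lemma)."""
    match = ANNOTATED.match(token)
    if match is None:
        raise ValueError(f"not an annotated token: {token!r}")
    return f"[{match.group(2)}]"

def main_pos(tag):
    """[Vmis3s---n] -> V (keep only the main category of the MSD code)."""
    code = tag.strip("[]")
    return code if code in NON_MSD_TAGS else code[0]

def vocabulary_size(tokens):
    """Number of distinct target-side symbols."""
    return len(set(tokens))

token = "találkozik_[Vmis3s---n]"
print(full_tag(token))            # [Vmis3s---n]
print(main_pos(full_tag(token)))  # V
```

Applying `vocabulary_size` to the target side of a corpus before and after each step gives the kind of reduction reported above (152694 -> 1128 -> 14).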

HANDLING OOV WORDS
• OOV words are included in just a few word classes
• Analyze the context of the OOV words
• Create a dictionary based on the frequency of the words, calculated from the training set
• The words not included in this dictionary are replaced with the string „unk”
• Tested on different thresholds

Token              #
ezt                120
a                  100
kívül              6
diplomáciai        4
magyarországi      4
képességet         2
erőfeszítéseken    2
adhatnák           1

Plain text:
ezt a lobbyerőt és képességet a diplomáciai erőfeszítéseken kívül mindenekelőtt a magyarországi multinacionálisok adhatnák .

Modified text:
ezt a unk és unk a diplomáciai unk kívül mindenekelőtt a magyarországi unk unk .
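
The replacement step can be sketched as follows. The sketch assumes whitespace-tokenized text and that a word is replaced when its training-set frequency is below the threshold (matching the "X<1" notation of the results); the function names are hypothetical. The counts for `és`, `mindenekelőtt`, and `.` are invented for illustration, since the slide does not list them.

```python
from collections import Counter

UNK = "unk"

def build_frequency_dict(training_sentences):
    """Count how often each token occurs in the training set."""
    counts = Counter()
    for sentence in training_sentences:
        counts.update(sentence.split())
    return counts

def replace_rare(sentence, counts, threshold):
    """Replace every word seen fewer than `threshold` times with UNK."""
    return " ".join(
        word if counts.get(word, 0) >= threshold else UNK
        for word in sentence.split()
    )

# Frequencies from the slide, plus assumed values for words it does not list.
counts = {
    "ezt": 120, "a": 100, "kívül": 6, "diplomáciai": 4, "magyarországi": 4,
    "képességet": 2, "erőfeszítéseken": 2, "adhatnák": 1,
    "és": 50, "mindenekelőtt": 5, ".": 200,  # assumed, not from the slide
}
plain = ("ezt a lobbyerőt és képességet a diplomáciai erőfeszítéseken kívül "
         "mindenekelőtt a magyarországi multinacionálisok adhatnák .")
print(replace_rare(plain, counts, threshold=3))
```

With a threshold of 3, this reproduces the modified text shown above; the threshold actually used for that example is not stated on the slide.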

HANDLING OOV WORDS
Accuracy by threshold

Threshold         Original text   Lemmatized text   Multiple thresholds
X<1 (Baseline)    91.46%
                  93.13%          92.57%            93.28%
                  90.40%          92.25%            90.65%
                  88.41%          91.81%            88.62%
                  87.07%          91.48%            87.40%
                  85.97%          91.10%            86.15%

• In the original text
– Best accuracy: 93.13%
• In case of lemmas
• Multiple thresholds

INTRODUCING POSTFIXES
• Goal: Separate nouns, verbs, adjectives, etc.
• Different POS types have characteristic postfixes
• Use the last characters of the OOV words
– Last 2, 3, or 4 characters
– e.g. noun: házból -> unk_ból
– e.g. verb: megállítottuk -> unk_tuk
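
A minimal sketch of the postfix variant, assuming the same frequency-threshold rule as before and that the postfix is simply the last `n` characters of the word; `unk_with_postfix` and `replace_rare_with_postfix` are hypothetical names.

```python
def unk_with_postfix(word, n):
    """Replace a word with 'unk_' plus its last n characters."""
    return f"unk_{word[-n:]}"

def replace_rare_with_postfix(sentence, counts, threshold, n):
    """Replace words seen fewer than `threshold` times, keeping n final characters."""
    return " ".join(
        word if counts.get(word, 0) >= threshold else unk_with_postfix(word, n)
        for word in sentence.split()
    )

print(unk_with_postfix("házból", 3))         # unk_ból
print(unk_with_postfix("megállítottuk", 3))  # unk_tuk
```

Keeping the postfix lets the translation model distinguish, for example, a noun case ending from a verb ending, even though the stem itself is unknown.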

INTRODUCING POSTFIXES
Accuracy by threshold and number of last characters kept

Threshold         2         3         4
X<1 (Baseline)    91.46%    91.46%    91.46%
                  95.17%    95.83%    95.96%
                  94.17%    95.32%    95.90%
                  93.48%    94.97%    95.73%
                  92.94%    94.70%    95.60%
                  92.61%    94.55%    95.55%

EVALUATION
• Baseline:
– Choose the best
• PurePos:
– Maxent and HMM based
– Include morphological disambiguation
• OpenNLP
– Maxent based
– Perceptron based

Only POS tagging
System                      Token accuracy   Sentence accuracy
Baseline (BL)               89.66%           25.27%
SMT-Baseline2               91.46%           34.53%
SMT-OOV-postfix             95.96%           56.47%
PurePos                     96.03%           55.87%
PurePos-MorphTable          97.29%           66.40%
OpenNLP Maxent (ONM)        95.28%           26.00%
OpenNLP Perceptron (ONP)    94.98%           26.67%

POS tagging + lemmatization
System                      Token accuracy   Sentence accuracy
SMT-Baseline1               91.29%           33.73%
PurePos                     83.92%           10.00%
PurePos-MorphTable          84.89%           11.60%
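
The two metrics can be computed as in the following sketch. It assumes token-aligned reference and system sentences, and that a sentence counts as correct only if every one of its tokens is correct; the slides do not spell out the definition, so this is the usual convention rather than a confirmed detail. The function name is hypothetical.

```python
def token_and_sentence_accuracy(references, hypotheses):
    """Return (token accuracy, sentence accuracy) for aligned sentence lists."""
    if len(references) != len(hypotheses):
        raise ValueError("different number of sentences")
    total_tokens = correct_tokens = correct_sentences = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens, hyp_tokens = ref.split(), hyp.split()
        if len(ref_tokens) != len(hyp_tokens):
            raise ValueError("sentences are not token-aligned")
        matches = sum(r == h for r, h in zip(ref_tokens, hyp_tokens))
        total_tokens += len(ref_tokens)
        correct_tokens += matches
        # A sentence is correct only if all of its tokens match.
        correct_sentences += matches == len(ref_tokens)
    return correct_tokens / total_tokens, correct_sentences / len(references)

refs = ["[Tf] [Nc-sn]", "[Vmis3s---n] [Punct]"]
hyps = ["[Tf] [Nc-sn]", "[Vmis3s---n] [Rp]"]
print(token_and_sentence_accuracy(refs, hyps))  # (0.75, 0.5)
```

Because a single wrong token invalidates a whole sentence, sentence accuracy is much lower than token accuracy, as the figures above show.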

CONCLUSION
• An SMT system was examined for part-of-speech disambiguation and lemmatization in Hungarian
• Fully automated system
• Best accuracy about 96%
• Decreasing the size of the target vocabulary
• Handling OOV words

THANK YOU FOR YOUR ATTENTION

laki.laszlo@itk.ppke.hu
