Investigating the Possibilities of Using SMT for Text Annotation
Published in: Technology

Transcript

  • 1. Investigating the Possibilities of Using SMT for Text Annotation
    László J. Laki (1, 2), laki.laszlo@itk.ppke.hu
    1 Pázmány Péter Catholic University, Faculty of Information Technology
    2 MTA-PPKE Language Technology Research Group
    This work was partially supported by TÁMOP-4.2.1.B – 11/2/KMR-2011–0002
  • 2. OUTLINE
    • SMT as POS tagger
    • Baseline system
    • Decreasing the size of target vocabulary
    • Handling OOV words
    • Evaluation
    • Conclusion
  • 3. STATISTICAL MACHINE TRANSLATION
    • Frameworks
      – MOSES (Koehn et al., 2007)
      – JOSHUA (Li et al., 2009)
      – SRILM (Stolcke, 2002)
    • Corpus
      – Szeged Korpusz 2 (Csendes et al., 2003)
      – 1.2 million words
      – MSD coding system
  • 4. THE BASELINE SYSTEM
    Plain text:
      a konszolidációra való törekvés találkozott a budapest#bank igényeivel is - tudjuk meg garadnai#róbert adattárház-menedzsertől .
    Reference annotation:
      a_[Tf] konszolidáció_[Nc-ss] való_[Afp-sn] törekvés_[Nc-sn] találkozik_[Vmis3s---n] a_[Tf] Budapest#Bank_[Np-sn] igény_[Nc-pi---s3] is_[Ccsp] -_[Punct] tud_[Vmip1p---y] meg_[Rp] Garadnai#Róbert_[Np-sn] adattárház-menedzser_[Nc-sb] ._[Punct]
    System's annotation:
      a_[Tf] konszolidáció_[Nc-ss] való_[Afp-sn] törekvés_[Nc-sn] találkozik_[Vmis3s---n] a_[Tf] Budapest#Bank_[Np-sn] igény_[Nc-pi---s3] is_[Ccsp] -_[Punct] tud_[Vmip1p---y] meg_[Rp] Garadnai#Róbert_[Np-sn] adattárház-menedzsertől ._[Punct]
    • Correct annotation: 24557
    • Incorrect annotation: 646
    • No annotation: 1697

    System   BLEU score   Accuracy
    MOSES    98.49%       91.29%
    JOSHUA   97.31%       91.07%
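The three counts on this slide appear to account for the reported accuracy. A minimal sketch, assuming accuracy is the share of correctly annotated tokens among all tokens, with unannotated tokens counted as errors (the function name is illustrative, not from the slides):

```python
def tagging_accuracy(correct, incorrect, unannotated):
    """Share of tokens with a correct annotation among all tokens.

    Assumption: tokens left without any annotation count as errors.
    """
    total = correct + incorrect + unannotated
    return correct / total


# Counts taken from the slide.
accuracy = tagging_accuracy(correct=24557, incorrect=646, unannotated=1697)
print(f"{accuracy:.2%}")
```

Under this assumption the result rounds to the 91.29% listed for MOSES, which suggests the counts describe the MOSES run.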
  • 5. DECREASING THE SIZE OF TARGET VOCABULARY
    • With only POS disambiguation
      – Annotate to POS tags without lemmatization
        (e.g. találkozik_[Vmis3s---n] -> [Vmis3s---n])
      – Complexity: 152694 -> 1128 tokens
      – Accuracy: 91.46% (+0.17%)
    • With simplifying POS tags
      – Annotate to main POS tags
        (e.g. [Vmis3s---n] -> V)
      – Complexity: 1128 -> 14 tokens
      – Accuracy: 92.20% (+0.91%)
    • Conclusion
      – None of the OOV words were tagged (1698 pieces)
      – Quality increased slightly, at the cost of significant information loss
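The two reductions above can be sketched as simple string transformations on the `lemma_[tag]` target tokens. This is an illustration only: the function names are hypothetical, and keeping the non-MSD label `Punct` whole (rather than cutting it to its first letter) is an assumption.

```python
import re

# Target tokens in the slides have the form lemma_[TAG].
ANNOTATION_RE = re.compile(r"^(?P<lemma>.+)_\[(?P<tag>[^\]]+)\]$")


def split_annotation(token):
    """Split 'lemma_[TAG]' into (lemma, TAG)."""
    match = ANNOTATION_RE.match(token)
    if match is None:
        raise ValueError(f"not an annotated token: {token!r}")
    return match.group("lemma"), match.group("tag")


def tag_only(token):
    """First reduction: drop the lemma, keep the full tag."""
    _, tag = split_annotation(token)
    return f"[{tag}]"


def main_pos(token):
    """Second reduction: keep only the main category of the tag.

    Assumption: the main category is the first character of the MSD
    code; the label 'Punct' is kept as-is.
    """
    _, tag = split_annotation(token)
    return tag if tag == "Punct" else tag[0]


print(tag_only("találkozik_[Vmis3s---n]"))
print(main_pos("találkozik_[Vmis3s---n]"))
```

Shrinking the target vocabulary this way trades information for fewer distinct output symbols, which matches the slide's conclusion.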
  • 6. HANDLING OOV WORDS
    • OOV words are included in just a few word classes
    • Analyze the context of the OOV words
    • Create a dictionary based on the frequency of the words calculated from training set
    • The words not included in this dictionary are changed to string "unk"
    • Tested on different thresholds

    Token             #
    ezt               120
    a                 100
    kívül             6
    diplomáciai       4
    magyarországi     4
    képességet        2
    erőfeszítéseken   2
    adhatnák          1

    Plain text:
      ezt a lobbyerőt és képességet a diplomáciai erőfeszítéseken kívül mindenekelőtt a magyarországi multinacionálisok adhatnák .
    Modified text:
      ezt a unk és unk a diplomáciai unk kívül mindenekelőtt a magyarországi unk unk .
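The replacement step above can be sketched as follows. The slides do not give the threshold values, so the threshold of 3 is an assumption, as are the counts for "és", "mindenekelőtt" and "." (they are not in the slide's table). The rule "replace when the frequency is below the threshold" follows the "X<1" notation used for the baseline on the next slide.

```python
from collections import Counter


def count_words(training_sentences):
    """Count whitespace-separated tokens in the training sentences."""
    return Counter(tok for s in training_sentences for tok in s.split())


def replace_rare_words(sentence, counts, threshold, unk="unk"):
    """Replace every token whose training frequency is below `threshold`."""
    return " ".join(
        tok if counts[tok] >= threshold else unk
        for tok in sentence.split()
    )


# Frequencies from the slide, plus three hypothetical entries (marked).
counts = Counter({
    "ezt": 120, "a": 100, "kívül": 6, "diplomáciai": 4,
    "magyarországi": 4, "képességet": 2, "erőfeszítéseken": 2,
    "adhatnák": 1,
    "és": 50, "mindenekelőtt": 5, ".": 1000,  # hypothetical counts
})

sentence = ("ezt a lobbyerőt és képességet a diplomáciai erőfeszítéseken "
            "kívül mindenekelőtt a magyarországi multinacionálisok adhatnák .")
modified = replace_rare_words(sentence, counts, threshold=3)
print(modified)
```

With these assumed values the output matches the modified text on the slide; words never seen in training have a count of zero and are always replaced.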
  • 7. HANDLING OOV WORDS
    Accuracy by threshold:

    Threshold        Original text   Lemmatized text   Multiple thresholds
    X<1 (Baseline)   91.46%
                     93.13%          92.57%            93.28%
                     90.40%          92.25%            90.65%
                     88.41%          91.81%            88.62%
                     87.07%          91.48%            87.40%
                     85.97%          91.10%            86.15%

    • In the original text
      – Best accuracy: 93.13%
    • In case of lemmas
      – Best accuracy: 92.57%
    • Multiple thresholds
      – Best accuracy: 93.28%
  • 8. INTRODUCING POSTFIXES
    • Goal: Separate nouns, verbs, adjectives, etc.
    • Different POS types have characteristic postfixes
    • Use last characters of the OOV words
      – Last 2, 3, 4 characters
      – e.g. noun: házból -> unk_ból
             verb: megállítottuk -> unk_tuk
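The postfix idea can be sketched in a few lines. This is an illustrative version, not the author's code: the function names and the tiny vocabulary are hypothetical, and the text is normalized to NFC so that accented letters count as single characters.

```python
import unicodedata


def unk_with_postfix(word, n, unk="unk"):
    """Replace a word by 'unk_' plus its last n characters."""
    word = unicodedata.normalize("NFC", word)
    return f"{unk}_{word[-n:]}"


def replace_oov(sentence, vocabulary, n):
    """Keep known tokens; map every other token to unk_<postfix>."""
    return " ".join(
        tok if tok in vocabulary else unk_with_postfix(tok, n)
        for tok in sentence.split()
    )


print(unk_with_postfix("házból", 3))
print(unk_with_postfix("megállítottuk", 3))
print(replace_oov("a házból megállítottuk", {"a"}, 3))
```

Keeping the word ending preserves much of the inflectional cue that separates nouns from verbs, while the number of distinct source tokens stays small.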
  • 9. INTRODUCING POSTFIXES
    Accuracy by threshold and by number of characters left:

    Threshold        2        3        4
    X<1 (Baseline)   91.46%   91.46%   91.46%
                     95.17%   95.83%   95.96%
                     94.17%   95.32%   95.90%
                     93.48%   94.97%   95.73%
                     92.94%   94.70%   95.60%
                     92.61%   94.55%   95.55%
  • 10. EVALUATION
    • Baseline:
      – Choose the best
    • PurePos:
      – Maxent and HMM based
      – Include morphological disambiguation
    • OpenNLP
      – Maxent based
      – Perceptron based

    Only POS tagging
    System                     Token accuracy   Sentence accuracy
    Baseline (BL)              89.66%           25.27%
    SMT-Baseline2              91.46%           34.53%
    SMT-OOV-postfix            95.96%           56.47%
    PurePos                    96.03%           55.87%
    PurePos-MorphTable         97.29%           66.40%
    OpenNLP Maxent (ONM)       95.28%           26.00%
    OpenNLP Perceptron (ONP)   94.98%           26.67%

    POS tagging + lemmatization
    System                     Token accuracy   Sentence accuracy
    SMT-Baseline1              91.29%           33.73%
    PurePos                    83.92%           10.00%
    PurePos-MorphTable         84.89%           11.60%
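The two metrics in these tables can be computed as sketched below. The slides do not define them, so this assumes the usual reading: token accuracy is the share of tokens with the correct label, and sentence accuracy is the share of sentences in which every token is correct. The toy data is made up for illustration.

```python
def token_and_sentence_accuracy(gold, predicted):
    """Return (token accuracy, sentence accuracy) for parallel tag lists.

    `gold` and `predicted` are lists of sentences; each sentence is a
    list of labels of the same length in both inputs.
    """
    if len(gold) != len(predicted):
        raise ValueError("different number of sentences")
    total_tokens = correct_tokens = correct_sentences = 0
    for gold_sent, pred_sent in zip(gold, predicted):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("sentence length mismatch")
        hits = sum(g == p for g, p in zip(gold_sent, pred_sent))
        total_tokens += len(gold_sent)
        correct_tokens += hits
        # A sentence counts as correct only if all its tokens match.
        correct_sentences += hits == len(gold_sent)
    return correct_tokens / total_tokens, correct_sentences / len(gold)


# Made-up example: one wrong tag in the second sentence.
gold = [["[Tf]", "[Nc-sn]"], ["[Vmis3s---n]", "[Punct]"]]
pred = [["[Tf]", "[Nc-sn]"], ["[Vmip3s---n]", "[Punct]"]]
token_acc, sentence_acc = token_and_sentence_accuracy(gold, pred)
print(token_acc, sentence_acc)
```

Because a single wrong token fails the whole sentence, sentence accuracy sits far below token accuracy, as in the tables above.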
  • 11. CONCLUSION
    • An SMT system was examined for part-of-speech disambiguation and lemmatization in Hungarian
    • Fully automated system
    • Best accuracy: about 96%
    • Decreasing the size of the target vocabulary
    • Handling OOV words
  • 12. THANK YOU FOR YOUR ATTENTION
    laki.laszlo@itk.ppke.hu