Investigating the Possibilities of Using SMT for Text Annotation
Published in Technology
Transcript

  • 1. Investigating the Possibilities of Using SMT for Text Annotation
    László J. Laki (1, 2), laki.laszlo@itk.ppke.hu
    1 Pázmány Péter Catholic University, Faculty of Information Technology
    2 MTA-PPKE Language Technology Research Group
    This work was partially supported by TÁMOP-4.2.1.B – 11/2/KMR-2011–0002
  • 2. OUTLINE
    • SMT as POS tagger
    • Baseline system
    • Decreasing the size of target vocabulary
    • Handling OOV words
    • Evaluation
    • Conclusion
  • 3. STATISTICAL MACHINE TRANSLATION
    • Frameworks
      – MOSES (Koehn et al., 2007)
      – JOSHUA (Li et al., 2009)
      – SRILM (Stolcke, 2002)
    • Corpus
      – Szeged Korpusz 2 (Csendes et al., 2003)
      – 1.2 million words
      – MSD coding system
  • 4. THE BASELINE SYSTEM
    Plain text:
      a konszolidációra való törekvés találkozott a budapest#bank igényeivel is - tudjuk meg garadnai#róbert adattárház-menedzsertől .
    Reference annotation:
      a_[Tf] konszolidáció_[Nc-ss] való_[Afp-sn] törekvés_[Nc-sn] találkozik_[Vmis3s---n] a_[Tf] Budapest#Bank_[Np-sn] igény_[Nc-pi---s3] is_[Ccsp] -_[Punct] tud_[Vmip1p---y] meg_[Rp] Garadnai#Róbert_[Np-sn] adattárház-menedzser_[Nc-sb] ._[Punct]
    System's annotation:
      a_[Tf] konszolidáció_[Nc-ss] való_[Afp-sn] törekvés_[Nc-sn] találkozik_[Vmis3s---n] a_[Tf] Budapest#Bank_[Np-sn] igény_[Nc-pi---s3] is_[Ccsp] -_[Punct] tud_[Vmip1p---y] meg_[Rp] Garadnai#Róbert_[Np-sn] adattárház-menedzsertől ._[Punct]
    • Correct annotation: 24557
    • Incorrect annotation: 646
    • No annotation: 1697

    System    BLEU score   Accuracy
    MOSES     98.49%       91.29%
    JOSHUA    97.31%       91.07%
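The three token counts on slide 4 appear to account for the MOSES accuracy figure, if one assumes that tokens left without annotation are counted as errors. The slide does not state this, so the sketch below treats it as an assumption:

```python
# Token counts reported on slide 4.
correct = 24557
incorrect = 646
unannotated = 1697  # tokens the system left untagged

# Assumption: untagged tokens count as errors in the accuracy figure.
total = correct + incorrect + unannotated
accuracy = correct / total

print(total)
print(f"{accuracy:.2%}")
```

Under that assumption the result agrees with the 91.29% reported for MOSES.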
  • 5. DECREASING THE SIZE OF TARGET VOCABULARY
    • With only POS disambiguation
      – Annotate to POS tags without lemmatization
        (e.g. találkozik_[Vmis3s---n] -> [Vmis3s---n])
      – Complexity: 152694 -> 1128 tokens
      – Accuracy: 91.46% (+0.17%)
    • With simplifying POS tags
      – Annotate to main POS tags
        (e.g. [Vmis3s---n] -> V)
      – Complexity: 1128 -> 14 tokens
      – Accuracy: 92.20% (+0.91%)
    • Conclusion
      – None of the OOV words were tagged (1698 pieces)
      – Quality increased slightly, at the cost of significant information loss
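The two reductions on slide 5 can be sketched as string transformations on the `lemma_[TAG]` target tokens. This is a minimal illustration, not the author's preprocessing: the function names are hypothetical, and keeping `Punct` whole (rather than cutting it to its first letter) is an assumption.

```python
import re

# A target-side token has the form lemma_[TAG], e.g. találkozik_[Vmis3s---n].
TOKEN_RE = re.compile(r"^(?P<lemma>.+)_\[(?P<tag>[^\]]+)\]$")

def split_annotation(token):
    """Split 'lemma_[TAG]' into (lemma, TAG)."""
    match = TOKEN_RE.match(token)
    if match is None:
        raise ValueError(f"not an annotated token: {token!r}")
    return match.group("lemma"), match.group("tag")

def full_tag(token):
    """First reduction: drop the lemma, keep the full tag."""
    _, tag = split_annotation(token)
    return f"[{tag}]"

def main_pos(token):
    """Second reduction: keep only the main POS category (first tag letter).

    Assumption: the 'Punct' tag is kept whole.
    """
    _, tag = split_annotation(token)
    return tag if tag == "Punct" else tag[0]

print(full_tag("találkozik_[Vmis3s---n]"))  # [Vmis3s---n]
print(main_pos("találkozik_[Vmis3s---n]"))  # V
```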
  • 6. HANDLING OOV WORDS
    • OOV words are included in just a few word classes
    • Analyze the context of the OOV words
    • Create a dictionary based on the frequency of the words, calculated from the training set
    • The words not included in this dictionary are changed to the string "unk"
    • Tested on different thresholds

    Token              #
    ezt                120
    a                  100
    kívül              6
    diplomáciai        4
    magyarországi      4
    képességet         2
    erőfeszítéseken    2
    adhatnák           1

    Plain text:
      ezt a lobbyerőt és képességet a diplomáciai erőfeszítéseken kívül mindenekelőtt a magyarországi multinacionálisok adhatnák .
    Modified text:
      ezt a unk és unk a diplomáciai unk kívül mindenekelőtt a magyarországi unk unk .
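The replacement step on slide 6 can be sketched as follows. The counts in `slide_counts` come from the slide's table; the counts in `assumed_counts` and the cut-off `min_count=3` are hypothetical values chosen only so that the example reproduces the slide's modified text. The function name is also illustrative.

```python
from collections import Counter

# Frequencies shown in the slide's table.
slide_counts = Counter({
    "ezt": 120, "a": 100, "kívül": 6, "diplomáciai": 4,
    "magyarországi": 4, "képességet": 2, "erőfeszítéseken": 2, "adhatnák": 1,
})
# Hypothetical counts for tokens the slide keeps but does not list.
assumed_counts = Counter({"és": 50, "mindenekelőtt": 5, ".": 200})
counts = slide_counts + assumed_counts

def replace_rare(sentence, counts, min_count, unk="unk"):
    """Replace every token seen fewer than min_count times with unk."""
    return " ".join(
        tok if counts[tok] >= min_count else unk
        for tok in sentence.split()
    )

plain = ("ezt a lobbyerőt és képességet a diplomáciai erőfeszítéseken kívül "
         "mindenekelőtt a magyarországi multinacionálisok adhatnák .")
modified = replace_rare(plain, counts, min_count=3)  # threshold is an assumption
print(modified)
```

Words absent from the table, such as lobbyerőt, have a count of zero and are therefore always replaced.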
  • 7. HANDLING OOV WORDS
    Accuracy by threshold (only the baseline row label, X<1, survives in the transcript):

    Threshold         Original text   Lemmatized text   Multiple thresholds
    X<1 (Baseline)    91.46%
    [label missing]   93.13%          92.57%            93.28%
    [label missing]   90.40%          92.25%            90.65%
    [label missing]   88.41%          91.81%            88.62%
    [label missing]   87.07%          91.48%            87.40%
    [label missing]   85.97%          91.10%            86.15%

    • In the original text
      – Best accuracy: 93.13%
    • In case of lemmas
      – Best accuracy: 92.57%
    • Multiple thresholds
      – Best accuracy: 93.28%
  • 8. INTRODUCING POSTFIXES
    • Goal: Separate nouns, verbs, adjectives, etc.
    • Different POS types have characteristic postfixes
    • Use last characters of the OOV words
      – Last 2, 3, 4 characters
      – e.g. noun: házból -> unk_ból
             verb: megállítottuk -> unk_tuk
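The postfix scheme on slide 8 amounts to keeping the last few characters of each replaced word. A minimal sketch, assuming the text is in composed (NFC) Unicode so that accented letters count as single characters; the function name is hypothetical:

```python
def unk_with_postfix(word, n_chars, unk="unk"):
    """Replace an OOV word with unk plus its last n_chars characters."""
    return f"{unk}_{word[-n_chars:]}"

print(unk_with_postfix("házból", 3))         # unk_ból
print(unk_with_postfix("megállítottuk", 3))  # unk_tuk
```

Both outputs match the slide's examples, which use the last three characters.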
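  • 9. INTRODUCING POSTFIXES
    Accuracy by threshold and number of characters kept (only the baseline row label, X<1, survives in the transcript):

    Threshold         2        3        4
    X<1 (Baseline)    91.46%   91.46%   91.46%
    [label missing]   95.17%   95.83%   95.96%
    [label missing]   94.17%   95.32%   95.90%
    [label missing]   93.48%   94.97%   95.73%
    [label missing]   92.94%   94.70%   95.60%
    [label missing]   92.61%   94.55%   95.55%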
  • 10. EVALUATION
    • Baseline (BL):
      – Choose the best
    • PurePos:
      – Maxent and HMM based
      – Includes morphological disambiguation
    • OpenNLP:
      – Maxent based
      – Perceptron based

    Only POS tagging:
    System                      Token accuracy   Sentence accuracy
    Baseline (BL)               89.66%           25.27%
    SMT_Baseline2               91.46%           34.53%
    SMT_OOV_postfix             95.96%           56.47%
    PurePos                     96.03%           55.87%
    PurePos-MorphTable          97.29%           66.40%
    OpenNLP Maxent (ONM)        95.28%           26.00%
    OpenNLP Perceptron (ONP)    94.98%           26.67%

    POS tagging + lemmatization:
    System                      Token accuracy   Sentence accuracy
    SMT_Baseline1               91.29%           33.73%
    PurePos                     83.92%           10.00%
    PurePos-MorphTable          84.89%           11.60%
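The slides do not define the two metrics in the evaluation tables. A common reading, assumed here, is that token accuracy is the share of correctly tagged tokens and sentence accuracy is the share of sentences in which every token is correct. The sketch below uses a hypothetical function name and toy data, not the author's evaluation script:

```python
def token_and_sentence_accuracy(reference, predicted):
    """Return (token accuracy, sentence accuracy) for parallel tag sequences.

    reference and predicted are lists of sentences; each sentence is a list of tags.
    """
    total_tokens = correct_tokens = correct_sentences = 0
    for ref_sent, pred_sent in zip(reference, predicted):
        if len(ref_sent) != len(pred_sent):
            raise ValueError("sentences must have the same number of tokens")
        matches = sum(r == p for r, p in zip(ref_sent, pred_sent))
        total_tokens += len(ref_sent)
        correct_tokens += matches
        # A sentence counts as correct only if all of its tokens match.
        correct_sentences += matches == len(ref_sent)
    return correct_tokens / total_tokens, correct_sentences / len(reference)

# Toy data: the second sentence has one wrong tag.
reference = [["[Tf]", "[Nc-sn]"], ["[Vmip1p---y]", "[Rp]", "[Punct]"]]
predicted = [["[Tf]", "[Nc-sn]"], ["[Vmip1p---y]", "[Nc-sn]", "[Punct]"]]
token_acc, sentence_acc = token_and_sentence_accuracy(reference, predicted)
print(token_acc, sentence_acc)  # 0.8 0.5
```

A single wrong token lowers sentence accuracy far more than token accuracy, which fits the large gap between the two columns in the tables.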
  • 11. CONCLUSION
    • An SMT system was examined for part-of-speech disambiguation and lemmatization in Hungarian
    • Fully automated system
    • Best accuracy about 96%
    • Decreasing the size of target vocabulary
    • Handling OOV words
  • 12. THANK YOU FOR YOUR ATTENTION laki.laszlo@itk.ppke.hu