Investigating the Possibilities of Using SMT for Text Annotation
László J. Laki (1, 2)
laki.laszlo@itk.ppke.hu

(1) Pázmány Péter Catholic University, Faculty of Information Technology
(2) MTA-PPKE Language Technology Research Group

This work was partially supported by TÁMOP-4.2.1.B – 11/2/KMR-2011–0002
OUTLINE
•   SMT as POS tagger
•   Baseline system
•   Decreasing the size of target vocabulary
•   Handling OOV words
•   Evaluation
•   Conclusion
STATISTICAL MACHINE TRANSLATION
• Frameworks
  – MOSES (Koehn et al., 2007)
  – JOSHUA (Li et al., 2009)
  – SRILM (Stolcke, 2002)
• Corpus
  – Szeged Korpusz 2 (Csendes et al., 2003)
  – 1.2 million words
  – MSD coding system
THE BASELINE SYSTEM
Plain text:
  a konszolidációra való törekvés találkozott a budapest#bank igényeivel is -
  tudjuk meg garadnai#róbert adattárház-menedzsertől .
Reference annotation:
  a_[Tf] konszolidáció_[Nc-ss] való_[Afp-sn] törekvés_[Nc-sn]
  találkozik_[Vmis3s---n] a_[Tf] Budapest#Bank_[Np-sn] igény_[Nc-pi---s3]
  is_[Ccsp] -_[Punct] tud_[Vmip1p---y] meg_[Rp] Garadnai#Róbert_[Np-sn]
  adattárház-menedzser_[Nc-sb] ._[Punct]
System's annotation:
  a_[Tf] konszolidáció_[Nc-ss] való_[Afp-sn] törekvés_[Nc-sn]
  találkozik_[Vmis3s---n] a_[Tf] Budapest#Bank_[Np-sn] igény_[Nc-pi---s3]
  is_[Ccsp] -_[Punct] tud_[Vmip1p---y] meg_[Rp] Garadnai#Róbert_[Np-sn]
  adattárház-menedzsertől ._[Punct]

• Correct annotation: 24557
• Incorrect annotation: 646
• No annotation: 1697

System    BLEU score   Accuracy
MOSES     98.49%       91.29%
JOSHUA    97.31%       91.07%
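As a quick check on how the accuracy figure relates to the three counts, the arithmetic can be reproduced in a few lines of Python. This is only a sketch: it assumes the counts belong to the MOSES row, which seems likely because the resulting ratio matches that row's accuracy, and it assumes accuracy means correct tokens divided by all tokens. The variable names are illustrative.

```python
# Token counts reported on the baseline slide (assumed to be the MOSES run).
correct = 24557
incorrect = 646
unannotated = 1697  # tokens left without a tag, e.g. "adattárház-menedzsertől"

total = correct + incorrect + unannotated
accuracy = correct / total  # assumed definition: correct tokens / all tokens

print(f"total tokens: {total}")
print(f"accuracy: {accuracy:.2%}")
print(f"share of errors that are untagged tokens: {unannotated / (incorrect + unannotated):.2%}")
```

The untagged tokens make up most of the errors, which is what motivates the OOV handling later in the talk.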
DECREASING THE SIZE OF TARGET VOCABULARY
• With only POS disambiguation
   – Annotate to POS tags without lemmatization
      • (e.g. találkozik_[Vmis3s---n] -> [Vmis3s---n])
   – Complexity: 152694 -> 1128 tokens
   – Accuracy: 91.46% (+0.17%)
• With simplifying POS tags
   – Annotate to main POS tags
      • (e.g. [Vmis3s---n] -> V)
   – Complexity: 1128 -> 14 tokens
   – Accuracy: 92.20% (+0.91%)
• Conclusion
   – None of the OOV words were tagged (1698 tokens)
   – Quality increased slightly, at the cost of significant information loss
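The two vocabulary reductions above can be sketched in Python as follows. This is a minimal illustration, not the author's preprocessing script: the function names `tag_only` and `main_pos` are hypothetical, and the sketch assumes that the main category is the first character of an MSD code and that the non-MSD tag `Punct` is kept whole.

```python
import re

# Matches target tokens of the form lemma_[TAG], e.g. "találkozik_[Vmis3s---n]".
ANNOTATED = re.compile(r"^(?P<lemma>.+)_\[(?P<tag>[^\]]+)\]$")


def tag_only(token: str) -> str:
    """Drop the lemma and keep only the bracketed tag (POS disambiguation only)."""
    match = ANNOTATED.match(token)
    if match is None:
        return token  # token carries no annotation; leave it unchanged
    return f"[{match.group('tag')}]"


def main_pos(tag: str) -> str:
    """Reduce a full tag to its main category (assumed: first character)."""
    inner = tag.strip("[]")
    if inner == "Punct":  # assumption: punctuation is not an MSD code
        return inner
    return inner[:1]


print(tag_only("találkozik_[Vmis3s---n]"))
print(main_pos("[Vmis3s---n]"))
```

Applied to the slide's example, `tag_only` yields `[Vmis3s---n]` and `main_pos` yields `V`.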
HANDLING OOV WORDS
• OOV words are included in just a few word classes
• Analyze the context of the OOV words
• Create a dictionary based on the frequency of the words calculated from the training set
• The words not included in this dictionary are changed to the string "unk"
• Tested on different thresholds

Token              #
ezt                120
a                  100
kívül              6
diplomáciai        4
magyarországi      4
képességet         2
erőfeszítéseken    2
adhatnák           1

Plain text:
  ezt a lobbyerőt és képességet a diplomáciai erőfeszítéseken kívül
  mindenekelőtt a magyarországi multinacionálisok adhatnák .
Modified text:
  ezt a unk és unk a diplomáciai unk kívül mindenekelőtt a magyarországi unk
  unk .
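The replacement step can be sketched in Python as below. It is an illustration under stated assumptions, not the original tooling: the function name `mask_rare_words` is hypothetical, a word is replaced when its training-set frequency X is below the threshold, and the counts for "és", "mindenekelőtt" and "." are invented placeholders because the slide does not list them.

```python
from collections import Counter


def mask_rare_words(sentence: str, counts: Counter, threshold: int, unk: str = "unk") -> str:
    """Replace every word whose training-set frequency is below `threshold`."""
    return " ".join(
        word if counts[word] >= threshold else unk  # unseen words count as 0
        for word in sentence.split()
    )


# Frequencies from the slide, plus three hypothetical entries marked below.
counts = Counter({
    "ezt": 120, "a": 100, "kívül": 6, "diplomáciai": 4, "magyarországi": 4,
    "képességet": 2, "erőfeszítéseken": 2, "adhatnák": 1,
    "és": 50, "mindenekelőtt": 5, ".": 1000,  # hypothetical, not on the slide
})

plain = ("ezt a lobbyerőt és képességet a diplomáciai erőfeszítéseken kívül "
         "mindenekelőtt a magyarországi multinacionálisok adhatnák .")

print(mask_rare_words(plain, counts, threshold=3))
```

With a threshold of 3 and these counts, the output matches the modified text shown on the slide.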
HANDLING OOV WORDS
Threshold         Accuracy
                  Original text   Lemmatized text   Multiple thresholds
X<1 (Baseline)    91.46%
                  93.13%          92.57%            93.28%
                  90.40%          92.25%            90.65%
                  88.41%          91.81%            88.62%
                  87.07%          91.48%            87.40%
                  85.97%          91.10%            86.15%

• In the original text
  – Best accuracy: 93.13%
• In case of lemmas
  – Best accuracy: 92.57%
• Multiple thresholds
  – Best accuracy: 93.28%
INTRODUCING POSTFIXES
• Goal: Separate nouns, verbs, adjectives,
  etc.
• Different POS types have characteristic
  postfixes
• Use last characters of the OOV words.
  – Last 2,3,4 characters
  – e.g. noun: házból -> unk_ból
         verb: megállítottuk -> unk_tuk
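The suffix-preserving placeholder can be sketched in Python as follows. The function name `unk_with_suffix` is hypothetical, and the sketch assumes the input is in precomposed (NFC) Unicode form so that each accented letter is a single character.

```python
def unk_with_suffix(word: str, n: int) -> str:
    """Replace an OOV word with 'unk_' plus its last n characters."""
    return f"unk_{word[-n:]}"  # words shorter than n are kept whole after 'unk_'


for oov in ("házból", "megállítottuk"):
    print(oov, "->", unk_with_suffix(oov, 3))
```

Keeping the last characters lets the translation model see the ending that distinguishes, for example, a case-marked noun from an inflected verb, instead of a single undifferentiated `unk` token.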
INTRODUCING POSTFIXES
Threshold         Accuracy by number of characters kept
                  2         3         4
X<1 (Baseline)    91.46%    91.46%    91.46%
                  95.17%    95.83%    95.96%
                  94.17%    95.32%    95.90%
                  93.48%    94.97%    95.73%
                  92.94%    94.70%    95.60%
                  92.61%    94.55%    95.55%
EVALUATION
• Baseline:
  – Choose the best
• PurePos:
  – Maxent and HMM based
  – Includes morphological disambiguation
• OpenNLP
  – Maxent based
  – Perceptron based

Only POS tagging
System                      Token accuracy   Sentence accuracy
Baseline (BL)               89.66%           25.27%
SMT-Baseline2               91.46%           34.53%
SMT-OOV-postfix             95.96%           56.47%
PurePos                     96.03%           55.87%
PurePos-MorphTable          97.29%           66.40%
OpenNLP Maxent (ONM)        95.28%           26.00%
OpenNLP Perceptron (ONP)    94.98%           26.67%

POS tagging + lemmatization
System                      Token accuracy   Sentence accuracy
SMT-Baseline1               91.29%           33.73%
PurePos                     83.92%           10.00%
PurePos-MorphTable          84.89%           11.60%
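The two metrics in the tables can be computed along the following lines. This is a sketch that assumes the usual definitions, which the slides do not spell out: token accuracy is the share of tokens whose annotation equals the reference, and sentence accuracy is the share of sentences in which every token is correct. The function name and the toy data are illustrative.

```python
def token_and_sentence_accuracy(reference, predicted):
    """Return (token accuracy, sentence accuracy) for parallel lists of token lists."""
    total_tokens = correct_tokens = correct_sentences = 0
    for ref, pred in zip(reference, predicted):
        matches = sum(r == p for r, p in zip(ref, pred))
        total_tokens += len(ref)
        correct_tokens += matches
        # A sentence counts only if it has the same length and no wrong token.
        if len(ref) == len(pred) and matches == len(ref):
            correct_sentences += 1
    return correct_tokens / total_tokens, correct_sentences / len(reference)


# Toy example: one untagged OOV token spoils the first sentence.
reference = [["a_[Tf]", "ház_[Nc-sn]"], ["is_[Ccsp]"]]
predicted = [["a_[Tf]", "házból"], ["is_[Ccsp]"]]

token_acc, sentence_acc = token_and_sentence_accuracy(reference, predicted)
print(f"token accuracy: {token_acc:.2%}, sentence accuracy: {sentence_acc:.2%}")
```

A single wrong token makes the whole sentence count as wrong, which is why sentence accuracy in the tables is so much lower than token accuracy.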
CONCLUSION
• An SMT system was examined for part-of-speech disambiguation and lemmatization in Hungarian
• Fully automatic system
• Best accuracy about 96%
• Decreasing the size of target vocabulary
• Handle OOV words
THANK YOU FOR YOUR ATTENTION

     laki.laszlo@itk.ppke.hu
