Purepos -- an open source morphological disambiguator

PurePos – an open source morphological disambiguator
György Orosz, Attila Novák

{oroszgy, novak.attila}@itk.ppke.hu

Pázmány Péter Catholic University, Faculty of Information Technology
MTA-PPKE Language Technology Research Group

This work was partially supported by TÁMOP: 4.2.2/B – 10/1–2010–0014

Outline

PurePos
– Full morphological disambiguation (tag + lemma)
– Integrated morphological analyzer

1) Need for a tagger with an integrated MA
2) Implementation, Contribution
3) Evaluation

Problems with agglutinating languages
• Small word coverage of the corpus
• Even 1000+ possible forms of a word
• Possibly huge tagset
– absent tags
– absent tag sequences
• Standalone lemmatization is not a good solution

Less-resourced languages
• Morphologically complex
• Lack of annotated corpora

Building an annotated corpus:
1) Manually disambiguate/correct
2) Train the tagger
3) Tag some text

Web service scenario
• Need for a high-precision tagging tool
• Noisy and unseen data
• Incremental training

What do we need?
• Full morphological disambiguation
– Including lemmatization
• Integrated morphological analyzer
• Incremental training
• Unicode support
• Fast to train
• Open source
• Easy to use

Where to start?
• From scratch?
• Modifying an existing tool?
– TriTagger
– IceMorphy
– Apertium tagger
– HunPos
– OpenNLP
– ...

HunPos
Pros:
– Trigram tagger (TnT)
– Beam search
– Clever tricks
– Contains a suffix guesser
– Employing a morphological table
– Fast to train and decode
Cons:
– Only POS tagging (no lemmatization)
– Implemented in OCaml
– No support for Unicode
– No real MA
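
To make the "trigram tagger with beam search" point concrete, here is a minimal sketch of beam-search decoding over a trigram tag model. It is not HunPos's code: the toy probability tables `TRANS` and `EMIT`, the fallback constants and the function names are all assumptions made for illustration, and the pruning is a plain top-k cut over hypotheses.

```python
import math

# Toy model (hypothetical numbers): P(tag | two previous tags) and P(word | tag).
TRANS = {("<s>", "<s>", "DET"): 0.8, ("<s>", "DET", "NOUN"): 0.9,
         ("DET", "NOUN", "VERB"): 0.8}
EMIT = {("the", "DET"): 0.9, ("dog", "NOUN"): 0.8,
        ("barks", "VERB"): 0.7, ("barks", "NOUN"): 0.1}
TAGS = ["DET", "NOUN", "VERB"]

def trans_logp(t2, t1, t):
    # Unseen trigrams get a small constant instead of real smoothing.
    return math.log(TRANS.get((t2, t1, t), 0.01))

def emit_logp(word, tag):
    return math.log(EMIT.get((word, tag), 1e-6))

def beam_search(words, tags=TAGS, beam_size=3):
    """Return the best tag sequence, keeping only `beam_size` hypotheses per step."""
    beam = [(0.0, ("<s>", "<s>"))]  # (log score, tag history with two start symbols)
    for word in words:
        expanded = []
        for score, history in beam:
            for tag in tags:
                step = trans_logp(history[-2], history[-1], tag) + emit_logp(word, tag)
                expanded.append((score + step, history + (tag,)))
        expanded.sort(key=lambda hyp: hyp[0], reverse=True)
        beam = expanded[:beam_size]  # prune to the best hypotheses
    return list(beam[0][1][2:])      # drop the two start symbols

tagged = beam_search(["the", "dog", "barks"])
```

A smaller `beam_size` makes decoding faster but risks discarding the hypothesis that would have won later.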

Using the analyzer
• Reducing the search space
• Generating lemma candidates
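
The two roles of the analyzer can be sketched as follows. This is only an illustration under stated assumptions: the analyzer is modelled as a dictionary from word forms to (lemma, tag) analyses, and the tag names, the toy data and the helpers `candidate_tags` and `lemma_candidates` are hypothetical, not the PurePos API.

```python
# Toy analyzer output: word form -> list of (lemma, tag) analyses (hypothetical data).
ANALYSES = {
    "szemét": [("szemét", "N.NOM"), ("szem", "N.POSS.ACC")],
    "a": [("a", "DET")],
}
FULL_TAGSET = {"N.NOM", "N.POSS.ACC", "DET", "V.PRES"}

def candidate_tags(word, analyses=ANALYSES, tagset=FULL_TAGSET):
    """Tags the tagger must consider for `word`.

    Known words are restricted to the tags the analyzer proposes (reduced
    search space); unknown words fall back to the full tagset here, where a
    real system would consult a guesser instead.
    """
    found = analyses.get(word)
    if found:
        return {tag for _, tag in found}
    return set(tagset)

def lemma_candidates(word, tag, analyses=ANALYSES):
    """Lemmas the analyzer offers for `word` under the chosen tag."""
    return [lemma for lemma, t in analyses.get(word, []) if t == tag]
```

For the ambiguous form `szemét`, the tagger chooses between two tags instead of the whole tagset, and the chosen tag then determines the lemma.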

Lemmatization

Morphological guesser
1) Generating candidates
2) Filter by POS tag
3) Select the most probable one

E.g.: Facebookjukba
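
The three guesser steps can be sketched with suffix rewrite rules learned from (word, lemma, tag) triples. This is a simplified illustration, not the PurePos implementation: the training triples, the tag names and the use of raw rule counts as the "probability" are assumptions.

```python
from collections import Counter

# Toy training triples (word form, lemma, tag); tags are hypothetical.
TRAIN = [
    ("autójukba", "autó", "N.POSS.ILL"),
    ("hajójukba", "hajó", "N.POSS.ILL"),
    ("házba", "ház", "N.ILL"),
]

def learn_rules(triples):
    """Count rules of the form (suffix removed, suffix added, tag)."""
    rules = Counter()
    for word, lemma, tag in triples:
        i = 0  # length of the common prefix of word and lemma
        while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
            i += 1
        rules[(word[i:], lemma[i:], tag)] += 1
    return rules

def guess_lemma(word, tag, rules):
    # 1) Generate candidates from every rule whose removed suffix matches.
    candidates = Counter()
    for (removed, added, rule_tag), count in rules.items():
        if word.endswith(removed):
            stem = word[: len(word) - len(removed)]
            candidates[(stem + added, rule_tag)] += count
    # 2) Filter by the POS tag chosen by the tagger.
    matching = {lemma: c for (lemma, t), c in candidates.items() if t == tag}
    # 3) Select the most probable one (here: the highest rule count).
    if not matching:
        return word  # no rule applies: fall back to the word form itself
    return max(matching, key=matching.get)

rules = learn_rules(TRAIN)
lemma = guess_lemma("Facebookjukba", "N.POSS.ILL", rules)
```

With these toy rules the unseen form `Facebookjukba` yields two candidates, `Facebook` and `Facebookjuk`; filtering by the tag keeps only the first.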

Incremental training

Training:
1) Train the tagger
2) Save the model
3) Load the model
4) Add training data to the model
5) Save the model

Tagging:
1) Load the model
2) Compile the model
3) Use the model for tagging
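
One way to read the workflow above: the saved model holds raw counts, so new data can be added at any time, and probabilities are only derived in a separate compile step before tagging. The sketch below illustrates that idea with emission counts only; the class `CountModel`, its method names and the JSON file format are assumptions, not the PurePos model format.

```python
import json
import os
import tempfile
from collections import Counter

class CountModel:
    """Stores raw counts so that training can be resumed later."""

    def __init__(self):
        self.emissions = Counter()  # (tag, word) -> count
        self.tags = Counter()       # tag -> count

    def train(self, tagged_sentences):
        for sentence in tagged_sentences:
            for word, tag in sentence:
                self.emissions[(tag, word)] += 1
                self.tags[tag] += 1

    def save(self, path):
        data = {"emissions": [[t, w, c] for (t, w), c in self.emissions.items()],
                "tags": dict(self.tags)}
        with open(path, "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False)

    @classmethod
    def load(cls, path):
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        model = cls()
        model.emissions = Counter({(t, w): c for t, w, c in data["emissions"]})
        model.tags = Counter(data["tags"])
        return model

    def compile_model(self):
        """Derive P(word | tag) from the counts; done once before tagging."""
        return {(t, w): c / self.tags[t] for (t, w), c in self.emissions.items()}

path = os.path.join(tempfile.mkdtemp(), "model.json")
model = CountModel()
model.train([[("a", "DET"), ("ház", "NOUN")]])    # 1) train
model.save(path)                                  # 2) save
model = CountModel.load(path)                     # 3) load
model.train([[("a", "DET"), ("kert", "NOUN")]])   # 4) add training data
model.save(path)                                  # 5) save
probs = CountModel.load(path).compile_model()     # tagging side: load, compile
```

Keeping counts rather than probabilities in the saved file is what makes step 4 cheap: adding data is just more counting.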

Evaluation

POS tagging accuracy
OpenNLP (perceptron)   97.16%
OpenNLP (maxent)       96.45%
PurePos (without MA)   98.14%
PurePos (with MA)      98.99%

Full disambiguation accuracy of PurePos
Guesser        89.79%
Guesser + MT   90.35%
Guesser + MA   98.35%

Evaluation

[Figure: POS tagging accuracy]

Evaluation

[Figure: Full disambiguation accuracy]

Evaluation

Performance as a web service

Configuration  Lemmatization  Tagging  Combined
Baseline       90.58%         98.14%   89.79%
MT-10k         90.58%         98.14%   89.79%
MT-30k         90.58%         98.17%   89.81%
MT-100k        90.64%         98.30%   89.90%
MT-100k*       90.72%         98.39%   89.97%
PurePos        99.07%         98.99%   98.35%
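
Accuracy differences this close to 100% are easier to compare as relative error reduction. The short sketch below computes it for the Baseline and PurePos rows of the table; the helper name `relative_error_reduction` is ours, and the metric is an added reading of the numbers rather than something reported on the slides.

```python
def relative_error_reduction(baseline_acc, system_acc):
    """Fraction of the baseline's errors that the system removes (accuracies in %)."""
    baseline_err = 100.0 - baseline_acc
    system_err = 100.0 - system_acc
    return (baseline_err - system_err) / baseline_err

# (Baseline, PurePos) accuracies taken from the table above.
ROWS = {
    "Lemmatization": (90.58, 99.07),
    "Tagging": (98.14, 98.99),
    "Combined": (89.79, 98.35),
}

reductions = {name: relative_error_reduction(base, system)
              for name, (base, system) in ROWS.items()}

for name, value in reductions.items():
    print(f"{name}: {value:.1%} fewer errors than the baseline")
```

On these figures the integrated analyzer removes roughly 46% of the tagging errors and roughly 84% of the combined (tag + lemma) errors.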

PurePos
• Reimplementation of HunPos
• Deeply integrated MA
• Full disambiguation
• State-of-the-art accuracy
• Full Unicode support
• Incremental training
• Open source
• Easily extensible

Thank you!

http://nlpg.itk.ppke.hu/software/purepos
