PurePos -- an open source morphological disambiguator

  1. PurePos – an open source morphological disambiguator
     György Orosz, Attila Novák
     {oroszgy, novak.attila}@itk.ppke.hu
     Pázmány Péter Catholic University, Faculty of Information Technology
     MTA-PPKE Language Technology Research Group
     This work was partially supported by TÁMOP: 4.2.2/B – 10/1–2010–0014
  2. Outline
     PurePos
       – Full morphological disambiguation (tag + lemma)
       – Integrated morphological analyzer (MA)
     1) The need for a tagger with an integrated MA
     2) Implementation, contribution
     3) Evaluation
  3. Problems with agglutinating languages
     • The corpus covers only a small share of the possible word forms
     • A single word can have 1000+ possible forms
     • Possibly huge tagset
       – tags absent from the training data
       – tag sequences absent from the training data
     • Standalone lemmatization is not a good solution
  4. Less-resourced languages
     • Morphologically complex
     • Lack of annotated corpora
     Building an annotated corpus:
       1) Manually disambiguate/correct
       2) Train the tagger
       3) Tag some text
  5. Web service scenario
     • Need for a high-precision tagging tool
     • Noisy and unseen data
     • Incremental training
  6. What do we need?
     • Full morphological disambiguation
       – Including lemmatization
     • Integrated morphological analyzer
     • Incremental training
     • Unicode support
     • Fast to train
     • Open source
     • Easy to use
  7. Where to start?
     • From scratch?
     • Modifying an existing tool?
       – TriTagger
       – IceMorphy
       – Apertium tagger
       – HunPos
       – OpenNLP
       – ...
  8. HunPos
     Pros:
       – Trigram tagger (TnT)
       – Beam search
       – Clever tricks
       – Contains a suffix guesser
       – Employing a morphological table
       – Fast to train and decode
     Cons:
       – Only POS tagging (no lemmatization)
       – Implemented in OCaml
       – No support for Unicode
       – No real MA
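To make "trigram tagger" and "beam search" concrete, here is a minimal sketch of trigram decoding with a beam. It is not HunPos or PurePos code: the tagset, the probabilities, the fixed floor values used in place of real smoothing, and the name `beam_search_tag` are all assumptions made up for this illustration.

```python
import math

# Toy model, invented purely for illustration (not taken from any real tagger).
TAGS = ["DET", "N", "V"]
EMIT = {("the", "DET"): 0.9, ("dog", "N"): 0.8, ("dog", "V"): 0.1,
        ("barks", "V"): 0.7, ("barks", "N"): 0.2}
TRANS = {("<s>", "<s>", "DET"): 0.8, ("<s>", "DET", "N"): 0.9,
         ("DET", "N", "V"): 0.8}


def log_emit(word, tag):
    # Unseen (word, tag) pairs get a tiny floor instead of proper smoothing.
    return math.log(EMIT.get((word, tag), 1e-6))


def log_trans(t1, t2, t3):
    # Unseen tag trigrams get a fixed small probability (assumption).
    return math.log(TRANS.get((t1, t2, t3), 0.05))


def beam_search_tag(words, beam_size=3):
    """Return the best tag sequence found while keeping `beam_size` hypotheses."""
    beam = [(0.0, ("<s>", "<s>"))]  # (log score, tags so far incl. two start symbols)
    for word in words:
        expanded = []
        for score, seq in beam:
            for tag in TAGS:
                new_score = (score
                             + log_trans(seq[-2], seq[-1], tag)
                             + log_emit(word, tag))
                expanded.append((new_score, seq + (tag,)))
        # Prune: keep only the highest-scoring hypotheses.
        expanded.sort(key=lambda hyp: hyp[0], reverse=True)
        beam = expanded[:beam_size]
    return list(beam[0][1][2:])  # drop the two start symbols


print(beam_search_tag(["the", "dog", "barks"]))
```

A smaller beam makes decoding faster but can prune away the sequence that would have scored best overall.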
  9. Using the analyzer
     • Reducing the search space
     • Generating lemma candidates
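The two uses of the analyzer listed on this slide can be sketched as follows. This is only an illustration under stated assumptions: `TOY_ANALYSES`, the tag labels, and the functions `analyze`, `allowed_tags`, and `lemma_candidates` are hypothetical stand-ins, not the PurePos interface.

```python
# Hypothetical toy analyzer output: word form -> list of (lemma, tag) analyses.
# The tag labels are made up for this example.
TOY_ANALYSES = {
    "házak": [("ház", "N.PL")],
    "vár": [("vár", "N"), ("vár", "V")],
}

FULL_TAGSET = {"N", "N.PL", "V", "ADJ", "DET"}


def analyze(word):
    """Stand-in for a morphological analyzer; unknown words get no analyses."""
    return TOY_ANALYSES.get(word, [])


def allowed_tags(word):
    """Reduce the search space: only tags licensed by the analyzer are tried.

    Words the analyzer does not know fall back to the full tagset.
    """
    analyses = analyze(word)
    if not analyses:
        return set(FULL_TAGSET)
    return {tag for _, tag in analyses}


def lemma_candidates(word, tag):
    """Generate lemma candidates: lemmas of the analyses that carry `tag`."""
    return {lemma for lemma, t in analyze(word) if t == tag}


print(allowed_tags("vár"))
print(lemma_candidates("házak", "N.PL"))
```

In this sketch the tagger only has to choose between two tags for "vár" instead of scoring all five.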
  10. Lemmatization
      Morphological guesser
        1) Generating candidates
        2) Filter by POS tag
        3) Select the most probable one
      E.g.: Facebookjukba
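The three guesser steps on this slide can be sketched with suffix rules. The rule table, its counts, and the tag labels below are hypothetical, and using a raw count as the "most probable" criterion is a simplification chosen for this example, not necessarily how PurePos scores candidates.

```python
from collections import Counter

# Hypothetical suffix rules, as if counted from training data:
# (suffix to strip, string to append, tag) -> frequency.
RULES = Counter({
    ("jukba", "", "N.Poss3Pl.Ill"): 12,
    ("ba", "", "N.Ill"): 40,
    ("", "", "N.Nom"): 5,
})


def generate_candidates(word):
    """Step 1: apply every rule whose suffix matches the word."""
    candidates = []
    for (strip, add, tag), count in RULES.items():
        if word.endswith(strip) and len(word) > len(strip):
            stem = word[: len(word) - len(strip)]
            candidates.append((stem + add, tag, count))
    return candidates


def lemmatize(word, pos_tag):
    """Steps 2 and 3: filter by the chosen tag, then take the best-scored lemma."""
    matching = [c for c in generate_candidates(word) if c[1] == pos_tag]
    if not matching:
        return word  # no rule applies: fall back to the word form itself
    return max(matching, key=lambda c: c[2])[0]


print(lemmatize("Facebookjukba", "N.Poss3Pl.Ill"))
```

The example shows why the tag filter matters: with the tag `N.Ill` the same rules would yield "Facebookjuk", so the lemma depends on which tag the tagger selected.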
  11. Incremental training
      Training:
        1) Train the tagger
        2) Save the model
        3) Load the model
        4) Add training data to the model
        5) Save the model
      Tagging:
        1) Load the model
        2) Compile the model
        3) Use the model for tagging
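One way to read the two workflows on this slide: the saved model holds raw counts, so more data can be added later, and "compiling" turns the counts into probabilities just before tagging. The sketch below assumes that reading. `CountModel`, its methods, and the JSON file format are hypothetical, and only emission counts are modeled to keep it short.

```python
import json
import os
import tempfile
from collections import Counter, defaultdict


class CountModel:
    """Toy model that stores raw tag -> word counts (illustrative only)."""

    def __init__(self):
        self.emissions = defaultdict(Counter)

    def add_training_data(self, tagged_sentences):
        for sentence in tagged_sentences:
            for word, tag in sentence:
                self.emissions[tag][word] += 1

    def save(self, path):
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.emissions, f, ensure_ascii=False)

    @classmethod
    def load(cls, path):
        model = cls()
        with open(path, encoding="utf-8") as f:
            for tag, words in json.load(f).items():
                model.emissions[tag].update(words)
        return model

    def compile(self):
        """Turn counts into relative frequencies P(word | tag) for tagging."""
        probs = {}
        for tag, words in self.emissions.items():
            total = sum(words.values())
            probs[tag] = {w: c / total for w, c in words.items()}
        return probs


# Training: train, save, load, add more data, save again.
path = os.path.join(tempfile.mkdtemp(), "model.json")
model = CountModel()
model.add_training_data([[("a", "X"), ("b", "X")]])
model.save(path)

model = CountModel.load(path)
model.add_training_data([[("a", "X"), ("a", "X")]])
model.save(path)

# Tagging: load, compile, then use the probabilities.
probs = CountModel.load(path).compile()
print(probs["X"]["a"])
```

Keeping counts rather than probabilities on disk is what makes the second round of training a cheap update instead of a full retrain.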
  12. Evaluation
      POS tagging accuracy:
        OpenNLP (perceptron)    97.16%
        OpenNLP (maxent)        96.45%
        PurePos (without MA)    98.14%
        PurePos (with MA)       98.99%
      Full disambiguation accuracy of PurePos:
        Guesser                 89.79%
        Guesser + MT            90.35%
        Guesser + MA            98.35%
  13. Evaluation: POS tagging accuracy
  14. Evaluation: Full disambiguation accuracy
  15. Evaluation: Performance as a web service
                   Lemmatization   Tagging   Combined
        Baseline      90.58%       98.14%     89.79%
        MT-10k        90.58%       98.14%     89.79%
        MT-30k        90.58%       98.17%     89.81%
        MT-100k       90.64%       98.30%     89.90%
        MT-100k*      90.72%       98.39%     89.97%
        PurePos       99.07%       98.99%     98.35%
  16. PurePos
      • Reimplementation of HunPos
      • Deeply integrated MA
      • Full disambiguation
      • State-of-the-art accuracy
      • Full Unicode support
      • Incremental training
      • Open source
      • Easily extensible
  17. Thank you!
      http://nlpg.itk.ppke.hu/software/purepos
