PurePos -- an open source morphological disambiguator

Transcript

  • 1. PurePos – an open source morphological disambiguator
    György Orosz, Attila Novák
    {oroszgy, novak.attila}@itk.ppke.hu
    Pázmány Péter Catholic University, Faculty of Information Technology
    MTA-PPKE Language Technology Research Group
    This work was partially supported by TÁMOP: 4.2.2/B – 10/1–2010–0014
  • 2. Outline
    PurePos
      – Full morphological disambiguation (tag + lemma)
      – Integrated morphological analyzer (MA)
    1) Need for a tagger with an integrated MA
    2) Implementation, contribution
    3) Evaluation
  • 3. Problems with agglutinating languages
    • The corpus covers only a small share of the possible word forms
    • A single word can have 1000+ possible forms
    • Possibly huge tagset
      – absent tags
      – absent tag sequences
    • Standalone lemmatization is not a good solution
  • 4. Less-resourced languages
    • Morphologically complex
    • Lack of annotated corpora
    Building an annotated corpus:
      1) Manually disambiguate/correct
      2) Train the tagger
      3) Tag some text
  • 5. Web service scenario
    • Need for a high-precision tagging tool
    • Noisy and unseen data
    • Incremental training
  • 6. What do we need?
    • Full morphological disambiguation
      – Including lemmatization
    • Integrated morphological analyzer
    • Incremental training
    • Unicode support
    • Fast to train
    • Open source
    • Easy to use
  • 7. Where to start?
    • From scratch?
    • Modifying an existing tool?
      – TriTagger
      – IceMorphy
      – Apertium tagger
      – HunPos
      – OpenNLP
      – ...
  • 8. HunPos
    Pros:
      – Trigram tagger (TnT)
      – Beam search
      – Clever tricks
      – Contains a suffix guesser
      – Employing a morphological table
      – Fast to train and decode
    Cons:
      – Only POS tagging (no lemmatization)
      – Implemented in OCaml
      – No support for Unicode
      – No real MA
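The suffix guesser listed among the HunPos pros can be illustrated with a minimal sketch. This is not the HunPos or PurePos code: it assumes a simple longest-matching-suffix lookup with no smoothing across suffix lengths, and the function names, tag labels, and toy training words are all hypothetical.

```python
from collections import Counter, defaultdict

def train_suffix_guesser(tagged_words, max_suffix_len=4):
    """Count how often each word-final suffix co-occurs with each tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        for n in range(1, min(max_suffix_len, len(word)) + 1):
            counts[word[-n:]][tag] += 1
    return counts

def guess_tags(counts, word, max_suffix_len=4):
    """Return relative tag frequencies for the longest known suffix of `word`."""
    for n in range(min(max_suffix_len, len(word)), 0, -1):
        tag_counts = counts.get(word[-n:])
        if tag_counts:
            total = sum(tag_counts.values())
            return {tag: c / total for tag, c in tag_counts.items()}
    return {}  # no suffix of the word was seen in training

# Toy training data; the tag labels are invented for illustration.
toy = [("házban", "N.INE"), ("kertben", "N.INE"),
       ("futott", "V.PST"), ("házak", "N.PL")]
model = train_suffix_guesser(toy)
print(guess_tags(model, "Facebookban"))
```

The unseen word shares the suffix "ban" with a training word, so the guesser proposes that word's tag. A guesser of this kind only proposes tags from surface form, which is why the slides contrast it with a real morphological analyzer.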
  • 9. Using the analyzer
    • Reducing the search space
    • Generating lemma candidates
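The two uses of the analyzer named on this slide can be sketched as follows. The sketch assumes the analyzer is available as a plain dictionary from word forms to (lemma, tag) analyses, which is a stand-in for a real morphological analyzer; `restrict_candidates`, the tag labels, and the toy entry are hypothetical.

```python
def restrict_candidates(word, analyzer, guessed_tags):
    """Return (tags, lemmas_by_tag) for `word`.

    If the analyzer knows the word, only the tags it licenses are kept,
    which shrinks the tagger's search space, and each tag comes with its
    lemma candidates. Otherwise the guesser's tags are used unchanged.
    """
    analyses = analyzer.get(word, [])
    if analyses:
        lemmas_by_tag = {}
        for lemma, tag in analyses:
            lemmas_by_tag.setdefault(tag, set()).add(lemma)
        return sorted(lemmas_by_tag), lemmas_by_tag
    return sorted(guessed_tags), {}

# Toy analyzer: "várnak" is ambiguous between a verb and a noun reading.
toy_analyzer = {"várnak": [("vár", "V.3PL"), ("vár", "N.DAT")]}
full_tagset = ["ADJ", "N.DAT", "N.INE", "V.3PL", "V.PST"]

tags, lemmas = restrict_candidates("várnak", toy_analyzer, full_tagset)
print(tags, lemmas)
```

Here the tagger would only have to choose between two tags instead of the whole tagset.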
  • 10. Lemmatization
    Morphological guesser
      1) Generating candidates
      2) Filter by POS tag
      3) Select the most probable one
    E.g.: Facebookjukba
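The three lemmatization steps can be sketched as a small function. The candidate list, its scores, and the tag labels below are invented for illustration and are not output of the PurePos guesser; `pick_lemma` is a hypothetical name.

```python
def pick_lemma(candidates, chosen_tag):
    """Select a lemma for a word whose POS tag has already been decided.

    `candidates` holds (lemma, tag, probability) triples from step 1.
    Step 2 keeps those matching the chosen tag, step 3 takes the most
    probable one. Returns None if no candidate matches.
    """
    matching = [(lemma, p) for lemma, tag, p in candidates if tag == chosen_tag]
    if not matching:
        return None
    return max(matching, key=lambda pair: pair[1])[0]

# Step 1 (invented output): lemma candidates for "Facebookjukba".
candidates = [
    ("Facebook", "N", 0.7),
    ("Facebookjuk", "N", 0.2),
    ("Facebookjukba", "V", 0.1),
]
print(pick_lemma(candidates, "N"))
```

Filtering by tag before ranking is what ties the lemma to the tagger's decision, in line with the earlier point that standalone lemmatization is not a good solution.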
  • 11. Incremental training
    Training
      1) Train the tagger
      2) Save the model
      3) Load the model
      4) Add training data to the model
      5) Save the model
    Tagging
      1) Load the model
      2) Compile the model
      3) Use the model for tagging
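One way to read the workflow above is that the saved model keeps raw counts, so new data can be added later, and probabilities are only derived in the compile step before tagging. The sketch below follows that reading. It is an assumption, not the PurePos model format: `CountModel`, the JSON file, and the toy data are hypothetical.

```python
import json
import os
import tempfile
from collections import Counter, defaultdict

class CountModel:
    """Toy model: raw (word, tag) counts, so new data can simply be added."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def add(self, tagged_words):
        for word, tag in tagged_words:
            self.counts[word][tag] += 1

    def save(self, path):
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.counts, f, ensure_ascii=False)

    @classmethod
    def load(cls, path):
        model = cls()
        with open(path, encoding="utf-8") as f:
            for word, tags in json.load(f).items():
                model.counts[word].update(tags)
        return model

    def compile(self):
        """Turn counts into per-word tag probabilities for tagging."""
        return {
            word: {tag: c / sum(tags.values()) for tag, c in tags.items()}
            for word, tags in self.counts.items()
        }

path = os.path.join(tempfile.mkdtemp(), "model.json")

# Training: train, save, load, add training data, save again.
model = CountModel()
model.add([("vár", "N")])
model.save(path)
model = CountModel.load(path)
model.add([("vár", "V"), ("vár", "N")])
model.save(path)

# Tagging: load, compile, use.
probs = CountModel.load(path).compile()
print(probs["vár"])
```

Keeping counts rather than probabilities on disk is the design choice that makes the "add training data" step cheap: nothing has to be retrained from the original corpus.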
  • 12. Evaluation
    POS tagging accuracy
      OpenNLP (perceptron)   97.16%
      OpenNLP (maxent)       96.45%
      PurePos (without MA)   98.14%
      PurePos (with MA)      98.99%
    Full disambiguation accuracy of PurePos
      Guesser                89.79%
      Guesser + MT           90.35%
      Guesser + MA           98.35%
  • 13. Evaluation: POS tagging accuracy (figure)
  • 14. Evaluation: Full disambiguation accuracy (figure)
  • 15. Evaluation: Performance as a web service
                 Lemmatization   Tagging   Combined
      Baseline   90.58%          98.14%    89.79%
      MT-10k     90.58%          98.14%    89.79%
      MT-30k     90.58%          98.17%    89.81%
      MT-100k    90.64%          98.30%    89.90%
      MT-100k*   90.72%          98.39%    89.97%
      PurePos    99.07%          98.99%    98.35%
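The slides do not define the three columns. A plausible reading, assumed here, is that a token counts as correct in the Combined column only when both its lemma and its tag are correct, which fits the Combined figures never exceeding the other two. Under that assumption the three accuracies could be computed as below; the function name and toy data are hypothetical.

```python
def accuracies(gold, predicted):
    """Return (lemma, tag, combined) accuracy over aligned token lists.

    Each token is a (lemma, tag) pair. Combined accuracy counts a token
    only if both parts match (an assumed definition, see above).
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and aligned")
    n = len(gold)
    lemma_ok = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    tag_ok = sum(g[1] == p[1] for g, p in zip(gold, predicted))
    both_ok = sum(g == p for g, p in zip(gold, predicted))
    return lemma_ok / n, tag_ok / n, both_ok / n

gold = [("vár", "N"), ("fut", "V"), ("ház", "N"), ("kert", "N")]
pred = [("vár", "N"), ("fut", "N"), ("háza", "N"), ("kert", "N")]
print(accuracies(gold, pred))
```

In the toy data one token has a wrong tag and another a wrong lemma, so each single score is 3 out of 4 while the combined score is 2 out of 4.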
  • 16. PurePos
    • Reimplementation of HunPos
    • Deeply integrated MA
    • Full disambiguation
    • State-of-the-art accuracy
    • Full Unicode support
    • Incremental training
    • Open source
    • Easily extensible
  • 17. Thank you!
    http://nlpg.itk.ppke.hu/software/purepos