PurePos -- an open source morphological disambiguator
1. PurePos – an open source morphological disambiguator
György Orosz, Attila Novák
{oroszgy, novak.attila}@itk.ppke.hu
Pázmány Péter Catholic University, Faculty of Information Technology
MTA-PPKE Language Technology Research Group
This work was partially supported by TÁMOP: 4.2.2/B – 10/1–2010–0014
2. Outline
PurePos
– Full morphological disambiguation (tag + lemma)
– Integrated morphological analyzer
1) The need for a tagger with an integrated morphological analyzer (MA)
2) Implementation, Contribution
3) Evaluation
3. Problems with agglutinating languages
• Low word coverage of the corpus
• A single word can have 1000+ possible forms
• Possibly huge tagset
– absent tags
– absent tag sequences
• Standalone lemmatization is not a good solution
4. Less-resourced languages
• Morphologically complex
• Lack of annotated corpora
Building an annotated corpus:
1) Manually disambiguate/correct
2) Train the tagger
3) Tag some text
5. Web service scenario
• Need for a high-precision tagging tool
• Noisy and unseen data
• Incremental training
6. What do we need?
• Full morphological disambiguation
– Including lemmatization
• Integrated morphological analyzer
• Incremental training
• Unicode support
• Fast to train
• Open source
• Easy to use
7. Where to start?
• From scratch?
• Modifying an existing tool?
– TriTagger
– IceMorphy
– Apertium tagger
– HunPos
– OpenNLP
– ...
8. HunPos
Pros:
– Trigram tagger (TnT)
– Beam search
– Clever tricks
– Contains a suffix guesser
– Employing a morphological table
– Fast to train and decode
Cons:
– Only POS tagging (no lemmatization)
– Implemented in OCaml
– No support for Unicode
– No real MA
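The suffix guesser listed among the pros can be illustrated with a minimal sketch. This is not HunPos code: it is a simplified longest-suffix lookup without the smoothing a TnT-style guesser would apply, and the names `MAX_SUFFIX`, `build_suffix_table` and `guess_tags`, the tag labels and the toy training words are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

MAX_SUFFIX = 4  # assumed cap on the suffix length considered


def build_suffix_table(tagged_words: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count, for every word-final substring, how often each tag occurred."""
    table: Dict[str, Counter] = defaultdict(Counter)
    for word, tag in tagged_words:
        for k in range(1, min(MAX_SUFFIX, len(word)) + 1):
            table[word[-k:]][tag] += 1
    return table


def guess_tags(word: str, table: Dict[str, Counter]) -> List[Tuple[str, float]]:
    """Rank tags by relative frequency for the longest suffix seen in training.

    Simplification: a real guesser would interpolate over several suffix
    lengths instead of trusting only the longest match.
    """
    for k in range(min(MAX_SUFFIX, len(word)), 0, -1):
        counts = table.get(word[-k:])
        if counts:
            total = sum(counts.values())
            return [(tag, n / total) for tag, n in counts.most_common()]
    return []  # no suffix of this word was ever seen


# Toy training data; the tag names are hypothetical.
table = build_suffix_table([("házak", "N.PL"), ("várak", "N.PL"), ("olvas", "V")])
guesses = guess_tags("kertek", table)  # unseen word, shares the final "k"
```

In this toy setup the unseen form "kertek" is matched through the shared plural ending, which is the kind of generalization a suffix guesser contributes when the corpus covers few word forms.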
9. Using the analyzer
• Reducing the search space
• Generating lemma candidates
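A minimal sketch of the two roles named on this slide, assuming the analyzer returns (lemma, tag) pairs for each word form. The toy lexicon, the tag names and the functions `candidate_tags` and `lemma_candidates` are hypothetical and do not reflect the PurePos API.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a morphological analyzer:
# word form -> list of (lemma, tag) analyses.
TOY_ANALYSES: Dict[str, List[Tuple[str, str]]] = {
    "házak": [("ház", "N.PL")],
    "vár": [("vár", "N"), ("vár", "V")],
}

ALL_TAGS = ["N", "N.PL", "V", "ADJ"]  # assumed full tagset


def candidate_tags(word: str) -> List[str]:
    """Tags the tagger has to score for this word."""
    analyses = TOY_ANALYSES.get(word)
    if not analyses:
        # Unknown to the analyzer: no restriction, consider every tag.
        return list(ALL_TAGS)
    # Keep only the tags the analyzer allows, in order, without duplicates.
    return list(dict.fromkeys(tag for _, tag in analyses))


def lemma_candidates(word: str, tag: str) -> List[str]:
    """Lemmas the analyzer proposes for this word under the chosen tag."""
    return [lemma for lemma, t in TOY_ANALYSES.get(word, []) if t == tag]
```

With this toy lexicon, "vár" leaves the tagger two tags to choose from instead of four, and once a tag is chosen the lemma comes from the matching analysis.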
11. Incremental training
Training:
1) Train the tagger
2) Save the model
3) Load the model
4) Add training data to the model
5) Save the model
Tagging:
1) Load the model
2) Compile the model
3) Use the model for tagging
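The workflow above can be sketched as follows, under the assumption that the stored model holds raw counts and that probabilities are derived only in the compile step. The class `CountModel`, its methods and the JSON file format are illustrative assumptions, not the PurePos model format.

```python
import json
from typing import Dict, Iterable, List, Optional, Tuple

Sentence = List[Tuple[str, str]]  # (word, tag) pairs


class CountModel:
    """Stores raw counts only, so training data can be added at any time."""

    def __init__(self, emissions: Optional[Dict[str, Dict[str, int]]] = None) -> None:
        # tag -> word -> count
        self.emissions: Dict[str, Dict[str, int]] = emissions or {}

    def add_training_data(self, sentences: Iterable[Sentence]) -> None:
        for sentence in sentences:
            for word, tag in sentence:
                words = self.emissions.setdefault(tag, {})
                words[word] = words.get(word, 0) + 1

    def save(self, path: str) -> None:
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.emissions, f, ensure_ascii=False)

    @classmethod
    def load(cls, path: str) -> "CountModel":
        with open(path, encoding="utf-8") as f:
            return cls(json.load(f))

    def compile(self) -> Dict[str, Dict[str, float]]:
        """Turn counts into relative frequencies P(word | tag) for tagging."""
        probs: Dict[str, Dict[str, float]] = {}
        for tag, words in self.emissions.items():
            total = sum(words.values())
            probs[tag] = {word: n / total for word, n in words.items()}
        return probs
```

Keeping counts rather than probabilities on disk is the design choice that makes step 4 cheap: adding data is just incrementing counters, and the compile step is repeated at load time for tagging.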
15. Evaluation
Performance as a web service
            Lemmatization   Tagging   Combined
Baseline    90.58%          98.14%    89.79%
MT-10k      90.58%          98.14%    89.79%
MT-30k      90.58%          98.17%    89.81%
MT-100k     90.64%          98.30%    89.90%
MT-100k*    90.72%          98.39%    89.97%
PurePos     99.07%          98.99%    98.35%
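A sketch of how three such token-level accuracies could be computed, assuming that "Combined" counts a token as correct only when both its lemma and its tag are correct. The slides do not define the metric, so this reading, the function `accuracies` and the toy data are assumptions.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (lemma, tag)


def accuracies(gold: List[Token], predicted: List[Token]) -> Tuple[float, float, float]:
    """Return (lemmatization, tagging, combined) token-level accuracy."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    n = len(gold)
    lemma_ok = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    tag_ok = sum(g[1] == p[1] for g, p in zip(gold, predicted))
    both_ok = sum(g == p for g, p in zip(gold, predicted))
    return lemma_ok / n, tag_ok / n, both_ok / n


# Toy data: one lemma error and one tag error, on different tokens.
gold = [("ház", "N.PL"), ("vár", "V"), ("kert", "N"), ("olvas", "V")]
pred = [("ház", "N.PL"), ("vár", "N"), ("kerte", "N"), ("olvas", "V")]
lemma_acc, tag_acc, combined_acc = accuracies(gold, pred)
```

Under this definition the combined figure can never exceed the lower of the other two, which matches every row of the table.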
16. PurePos
• Reimplementation of HunPos
• Deeply integrated MA
• Full disambiguation
• State-of-the-art accuracy
• Full Unicode support
• Incremental training
• Open source
• Easily extensible