4. pymorphy2: features
• Morphological analysis of Russian words;
• morphological generation: lemmatization, inflection,
number agreement;
• P(tag | word) estimates;
• handling of out-of-vocabulary words;
• experimental support for the Ukrainian language.
5. pymorphy2: implementation
• Python library and a command line tool
• Permissive open-source license: MIT for code,
Creative Commons BY-SA for data
• 600+ unit tests; 90%+ test coverage
• Memory usage: 30MB = 15MB pymorphy2 + 15MB
Python interpreter
• Speed: 20-100K words per second with an optional
C++ extension
6. Analysis of Vocabulary Words
• OpenCorpora dictionary for Russian (5M word
forms, 400K lemmas);
• a dictionary based on LanguageTool data (2.5M
word forms) by Andrey Rysin, Dmitry Chaplinsky,
Mariana Romanyshyn, Vladimir Sevastyanov &
others.
7. Analysis of Vocabulary Words
Source dictionaries provide lexemes:
ёж NOUN,anim,masc sing,nomn
ежа NOUN,anim,masc sing,gent
ежу NOUN,anim,masc sing,datv
...
ежами NOUN,anim,masc plur,ablt
ежах NOUN,anim,masc plur,loct
8. Tasks
• Analyze: look the word up in the dictionary and return its tag
• Lemmatize: find the word in the dictionary and return the
first word of its lexeme
• Inflect: find the word in the dictionary and return a
compatible word from its lexeme
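The three tasks can be sketched with a toy in-memory dictionary. This is a minimal illustration, not pymorphy2's implementation: the function names, the `INDEX` structure and the `grammemes` helper are assumptions made for this example, and the lexeme is the one shown on the previous slide.

```python
# Toy lexeme; by convention the first form is the lemma (normal form).
LEXEME = [
    ("ёж", "NOUN,anim,masc sing,nomn"),
    ("ежа", "NOUN,anim,masc sing,gent"),
    ("ежу", "NOUN,anim,masc sing,datv"),
    ("ежами", "NOUN,anim,masc plur,ablt"),
    ("ежах", "NOUN,anim,masc plur,loct"),
]

# Hypothetical index: word form -> list of (lexeme, position of the form in it).
INDEX = {}
for position, (form, _tag) in enumerate(LEXEME):
    INDEX.setdefault(form, []).append((LEXEME, position))


def grammemes(tag):
    """Split a tag such as 'NOUN,anim,masc sing,nomn' into a set of grammemes."""
    return set(tag.replace(" ", ",").split(","))


def analyze(word):
    """Return all tags recorded for the word."""
    return [lexeme[pos][1] for lexeme, pos in INDEX.get(word, [])]


def lemmatize(word):
    """Return the first form of every lexeme that contains the word."""
    return [lexeme[0][0] for lexeme, _pos in INDEX.get(word, [])]


def inflect(word, required):
    """Return forms from the word's lexeme whose tags contain all required grammemes."""
    required = set(required)
    return [
        form
        for lexeme, _pos in INDEX.get(word, [])
        for form, tag in lexeme
        if required <= grammemes(tag)
    ]


print(analyze("ежу"))                    # ['NOUN,anim,masc sing,datv']
print(lemmatize("ежами"))                # ['ёж']
print(inflect("ёж", {"plur", "loct"}))   # ['ежах']
```

A plain Python dict is used here only for readability; the following slides explain why this layout does not scale to the full dictionary.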
9. Efficiency considerations
• The OpenCorpora XML dictionary is 400MB on disk
• Lookup in the XML file is O(N)
• When loaded into an in-memory hash table (Python
dict), the dictionary takes several GB of RAM
10. Solution
• Extract paradigms from lexemes; encode words as a
DAFSA (deterministic acyclic finite state automaton).
• Also tried: succinct tries, two double-array tries
• 5M Russian word forms in DAFSA == 3MB RAM
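To show why a DAFSA is compact, here is a minimal sketch that builds a plain trie and then merges equivalent states bottom-up, so that shared suffixes are stored once. It is a simplified construction chosen for clarity and is not claimed to be the builder pymorphy2 uses; all function names are assumptions for this example, and the empty string `""` is used as an end-of-word marker.

```python
def build_trie(words):
    """Build a nested-dict trie; the key "" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return root


def count_trie_nodes(node):
    """Count nodes in the trie, including end-of-word marker nodes."""
    return 1 + sum(count_trie_nodes(child) for child in node.values())


def minimize(node, registry):
    """Return the id of the canonical state equivalent to `node`.

    Two nodes are equivalent when they have the same outgoing labels
    leading to equivalent states, so identical subtrees share one state.
    """
    signature = tuple(sorted(
        (ch, minimize(child, registry)) for ch, child in node.items()
    ))
    return registry.setdefault(signature, len(registry))


def build_dafsa(words):
    """Return (root_id, states), where states maps id -> {label: next_id}."""
    registry = {}
    root_id = minimize(build_trie(words), registry)
    states = {state_id: dict(sig) for sig, state_id in registry.items()}
    return root_id, states


def contains(dafsa, word):
    """Check whether the word is accepted by the automaton."""
    state_id, states = dafsa
    for ch in word:
        state_id = states[state_id].get(ch)
        if state_id is None:
            return False
    return "" in states[state_id]


WORDS = ["ежа", "ежу", "ужа", "ужу"]
dafsa = build_dafsa(WORDS)
print(count_trie_nodes(build_trie(WORDS)))  # 13 nodes in the trie
print(len(dafsa[1]))                        # 5 states after merging
```

Russian word forms share many endings, which is the kind of redundancy this merging step removes.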
12. Lexeme
prefix  stem     suffix  tag
        хомяков  ый      ADJF,Qual masc,sing,nomn
        хомяков  ого     ADJF,Qual masc,sing,gent
...
        хомяков  ы       ADJS,Qual plur
        хомяков  ее      COMP,Qual
        хомяков  ей      COMP,Qual V-ej
по      хомяков  ее      COMP,Qual Cmp2
по      хомяков  ей      COMP,Qual Cmp2,V-ej
13. Paradigm
prefix  suffix  tag
        ый      ADJF,Qual masc,sing,nomn
        ого     ADJF,Qual masc,sing,gent
...
        ы       ADJS,Qual plur
        ее      COMP,Qual
        ей      COMP,Qual V-ej
по      ее      COMP,Qual Cmp2
по      ей      COMP,Qual Cmp2,V-ej
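The step from a lexeme to a paradigm can be sketched as follows. The sketch assumes the stem is the longest substring common to all forms of the lexeme; real dictionaries may need extra rules (for example, a list of allowed prefixes), which are omitted here. The function names and the second lexeme (built by swapping the stem for "бобров") are hypothetical and serve only to show that different words can share one paradigm.

```python
def longest_common_substring(words):
    """Return the longest substring that occurs in every word ('' if none)."""
    shortest = min(words, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in word for word in words):
                return candidate
    return ""


def extract_paradigm(lexeme):
    """Split a lexeme into a stem and a paradigm of (prefix, suffix, tag) rows."""
    stem = longest_common_substring([word for word, _tag in lexeme])
    paradigm = []
    for word, tag in lexeme:
        if stem:
            prefix, _stem, suffix = word.partition(stem)
        else:
            prefix, suffix = "", word
        paradigm.append((prefix, suffix, tag))
    return stem, tuple(paradigm)


LEXEME = [
    ("хомяковый", "ADJF,Qual masc,sing,nomn"),
    ("хомякового", "ADJF,Qual masc,sing,gent"),
    ("хомяковы", "ADJS,Qual plur"),
    ("хомяковее", "COMP,Qual"),
    ("хомяковей", "COMP,Qual V-ej"),
    ("похомяковее", "COMP,Qual Cmp2"),
    ("похомяковей", "COMP,Qual Cmp2,V-ej"),
]
# Hypothetical second lexeme with the same endings but a different stem.
OTHER_LEXEME = [(w.replace("хомяков", "бобров"), t) for w, t in LEXEME]

stem, paradigm = extract_paradigm(LEXEME)
other_stem, other_paradigm = extract_paradigm(OTHER_LEXEME)
print(stem, other_stem)            # хомяков бобров
print(paradigm == other_paradigm)  # True: one paradigm serves both lexemes
```

Because many lexemes map to the same paradigm, the dictionary only needs to store each paradigm once and refer to it by id.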
17. Common prefix removal: language-specific lists of common immutable prefixes (e.g. "не", "псевдо")
• недопсевдоавиашоу == недо + псевдоавиашоу
• псевдоавиашоу == псевдо + авиашоу
• авиашоу == авиа + шоу
• шоу is a known word
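The chain above can be reproduced with a short recursive sketch. The prefix list and the one-word dictionary are toy assumptions for this example, and the function name is hypothetical; the sketch only shows the idea of stripping known prefixes until a dictionary word remains.

```python
# Toy data: an assumed prefix list and a dictionary with a single known word.
KNOWN_PREFIXES = ("не", "недо", "псевдо", "авиа")
DICTIONARY = {"шоу"}


def strip_known_prefixes(word, dictionary=DICTIONARY, prefixes=KNOWN_PREFIXES):
    """Return (stripped_prefixes, known_word), or None if no split works."""
    if word in dictionary:
        return [], word
    for prefix in prefixes:
        rest = word[len(prefix):]
        if word.startswith(prefix) and rest:
            result = strip_known_prefixes(rest, dictionary, prefixes)
            if result is not None:
                stripped, base = result
                return [prefix] + stripped, base
    return None


print(strip_known_prefixes("недопсевдоавиашоу"))
# (['недо', 'псевдо', 'авиа'], 'шоу')
```

Note the backtracking: stripping "не" first leaves "допсевдоавиашоу", which cannot be resolved, so the sketch moves on to "недо". The analysis of the unknown word can then be taken from the known word, since the prefixes are immutable.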
18. Words Ending with Other Dictionary Words
Example: котопсина
• the word being analyzed ends with another word from the
dictionary (the "suffix" word);
• the "suffix" word is at least 3 characters long;
• the rest of the word (without the "suffix") is at most
5 characters long;
• the "suffix" word belongs to an open class (noun, verb,
adjective, participle, gerund)
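The conditions above translate into a small filter. This is a sketch under stated assumptions: the toy dictionary maps words to a part-of-speech label only, the set of open-class labels is an assumed encoding of the list on the slide, and the function name is hypothetical.

```python
# Assumed labels for the open classes named on the slide.
OPEN_CLASSES = {"NOUN", "VERB", "ADJF", "PRTF", "GRND"}

# Toy dictionary: word -> part of speech.
DICTIONARY = {"псина": "NOUN", "на": "PREP"}


def known_suffix_candidates(word, dictionary=DICTIONARY,
                            min_suffix_len=3, max_rest_len=5):
    """Return (rest, suffix_word, pos) splits that satisfy all conditions."""
    candidates = []
    for split in range(1, max_rest_len + 1):
        rest, suffix = word[:split], word[split:]
        if len(suffix) < min_suffix_len:
            break  # longer `rest` only makes the suffix shorter
        if dictionary.get(suffix) in OPEN_CLASSES:
            candidates.append((rest, suffix, dictionary[suffix]))
    return candidates


print(known_suffix_candidates("котопсина"))  # [('кото', 'псина', 'NOUN')]
print(known_suffix_candidates("стена"))      # [] ("на" is too short and not open class)
```

The unknown word can then inherit the analysis of the "suffix" word, e.g. котопсина is treated like псина.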
19. Endings Matching
Example: бурбуляторовый
• words with common endings often have the same
grammatical form
• pymorphy2 builds an index of all word endings of
1-5 characters and their analyses
• a (frequency, paradigm_id, form_index) triple is
stored for each ending
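A minimal sketch of such an ending index, assuming toy word forms with made-up paradigm ids and form indexes. The function names are hypothetical, and the choice to prefer the longest matching ending is an assumption of this sketch rather than a statement about pymorphy2's internals.

```python
from collections import Counter, defaultdict

# Toy data: (word form, paradigm_id, form_index); the ids are made up.
FORMS = [
    ("хомяковый", 0, 0),
    ("бобровый", 0, 0),
    ("ежовый", 0, 0),
    ("новый", 1, 0),
    ("хомякового", 0, 1),
]


def build_ending_index(forms, max_len=5):
    """Map every 1..max_len character ending to counts of its analyses."""
    index = defaultdict(Counter)
    for word, paradigm_id, form_index in forms:
        for n in range(1, min(max_len, len(word)) + 1):
            index[word[-n:]][(paradigm_id, form_index)] += 1
    return index


def predict(word, index, max_len=5):
    """Return (frequency, paradigm_id, form_index) triples for the longest known ending."""
    for n in range(min(max_len, len(word)), 0, -1):
        counts = index.get(word[-n:])
        if counts:
            return [(freq, pid, fidx)
                    for (pid, fidx), freq in counts.most_common()]
    return []


index = build_ending_index(FORMS)
print(predict("бурбуляторовый", index))  # [(1, 0, 0)], matched on "ровый"
print(predict("тестовый", index))        # [(3, 0, 0), (1, 1, 0)], matched on "овый"
```

The frequency lets the analyzer rank competing guesses: the more dictionary words share an ending with a given analysis, the more plausible that analysis is for an unknown word.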
20. Words with a Hyphen
• adverbs with a hyphen: по-хорошему
• particles separated by a hyphen: смотри-ка
• compound words: интернет-магазин, человек-паук
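The three cases can be told apart with a simple dispatcher. This is only a sketch: the particle list is an assumed, incomplete sample, the function name and return format are hypothetical, and a real analyzer would go on to analyze the parts rather than just label them.

```python
# Assumed sample of particles that can follow a hyphen.
PARTICLES_AFTER_HYPHEN = {"ка", "то", "таки"}


def classify_hyphenated(word):
    """Return (kind, (left, right)) for a word with exactly one inner hyphen."""
    if word.count("-") != 1:
        return ("not-handled", (word,))
    left, right = word.split("-")
    if not left or not right:
        return ("not-handled", (word,))
    if left == "по":
        return ("adverb", (left, right))
    if right in PARTICLES_AFTER_HYPHEN:
        return ("particle", (left, right))
    return ("compound", (left, right))


print(classify_hyphenated("по-хорошему"))       # ('adverb', ('по', 'хорошему'))
print(classify_hyphenated("смотри-ка"))         # ('particle', ('смотри', 'ка'))
print(classify_hyphenated("интернет-магазин"))  # ('compound', ('интернет', 'магазин'))
```

For the particle case only the left part needs morphological analysis; for compounds both parts may need to be analyzed, since in человек-паук both inflect while in интернет-магазин only the right part does.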
21. P(tag | word) estimation
• Based on partially disambiguated OpenCorpora
data;
• MLE with Laplace smoothing
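A common form of Laplace (add-one) smoothing for this estimate is P(tag | word) = (count(word, tag) + 1) / (count(word) + K), where K is the number of candidate tags the analyzer proposes for the word. The sketch below assumes this form; the corpus counts, the short tag labels and the function name are made up for the example.

```python
from collections import Counter


def estimate_tag_probabilities(word, candidate_tags, corpus_counts, alpha=1.0):
    """Estimate P(tag | word) over the candidate tags with additive smoothing.

    corpus_counts maps (word, tag) to the number of disambiguated
    occurrences; alpha=1.0 gives Laplace (add-one) smoothing.
    """
    counts = [corpus_counts[(word, tag)] for tag in candidate_tags]
    total = sum(counts) + alpha * len(candidate_tags)
    return {tag: (count + alpha) / total
            for tag, count in zip(candidate_tags, counts)}


# Made-up counts: "стали" seen 7 times as a verb and once as a noun.
CORPUS_COUNTS = Counter({("стали", "VERB"): 7, ("стали", "NOUN"): 1})

print(estimate_tag_probabilities("стали", ["VERB", "NOUN"], CORPUS_COUNTS))
# {'VERB': 0.8, 'NOUN': 0.2}
print(estimate_tag_probabilities("ежи", ["NOUN", "VERB"], CORPUS_COUNTS))
# {'NOUN': 0.5, 'VERB': 0.5}  (unseen word: uniform over candidates)
```

Smoothing keeps every candidate tag at a non-zero probability, so an analysis that never occurred in the disambiguated data is ranked lower but not ruled out.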
22. Evaluation: bad ideas
• evaluate pymorphy2 on OpenCorpora data
• evaluate Mystem on ruscorpora.ru (НКРЯ) data
23. Evaluation Setup
• pymorphy2 and Mystem 3.0;
• 100 randomly selected sentences from OpenCorpora
("microcorpus");
• 100 randomly selected sentences from ruscorpora.ru;
• tagsets are different; evaluation requires complicated
tag matching and manual checking of all errors;
• available online (http://goo.gl/BNXQXf)
24. Evaluation: errors
(full grammatical tags, recall, errors in hyphenated words are not considered errors)
[Bar chart: number of errors for pymorphy2 and Mystem 3.0 on the microcorpus and ruscorpora samples. Bar labels as extracted: 8, 9, 15, 10.]
26. Evaluation: results
• Both pymorphy2 and Mystem had an error rate below 1%
(without disambiguation); most errors are in special cases.
• Hard to draw a conclusion; interpretation of
evaluation results is important.
• 6 errors in the ruscorpora.ru gold annotations were found
by parsing the sample with pymorphy2; 1 error in the microcorpus
gold annotations was found by parsing it with Mystem.
27. Future work
• Improve parsing of personal names, abbreviations and
hyphenated words;
• improve non-contextual P(tag|word) estimates;
• improve Ukrainian language support;
• add Belarusian language support;
• there is room for speed improvements;
• a nicer command-line utility;
• ideas?