Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages

Morphological Analyzer
and Generator for Russian
and Ukrainian Languages
Mikhail Korobov
AIST 2015

Morphological Analysis:
word -> possible grammatical tags
• стали: VERB,perf,intr plur,past,indc
(ГЛ,сов,неперех мн,прош,изъяв);
• стали: NOUN,inan,femn [sing,gent; sing,datv; sing,loct; plur,nomn; plur,accs]
(СУЩ,неод,жр [ед,рд; ед,дт; ед,пр; мн,им; мн,вн]);
• бутявка: NOUN,inan,femn sing,nomn
(СУЩ,неод,жр ед,им)

Morphological Generation
• lemmatization: стали -> стать, ежом -> ёж
• inflection: стали -> (sing,3per,fut) -> станет
• inflection: ёж -> (datv) -> ежу

pymorphy2: features
• morphological analysis of Russian words;
• morphological generation: lemmatization, inflection,
number agreement;
• P(tag | word) estimates;
• out-of-vocabulary words handling;
• experimental support for Ukrainian language.

pymorphy2: implementation
• Python library and a command line tool
• Permissive open-source license: MIT for code,
Creative Commons BY-SA for data
• 600+ unit tests; 90%+ test coverage
• Memory usage: 30MB = 15MB pymorphy2 + 15MB
Python interpreter
• Speed: 20-100K words per second with an optional
C++ extension

Analysis of Vocabulary Words
• Russian: OpenCorpora dictionary (5M word
forms, 400K lemmas);
• Ukrainian: a dictionary based on LanguageTool data (2.5M
word forms) by Andrey Rysin, Dmitry Chaplinsky,
Mariana Romanyshyn, Vladimir Sevastyanov &
others.

Analysis of Vocabulary Words
Source dictionaries provide lexemes:
ёж NOUN,anim,masc sing,nomn
ежа NOUN,anim,masc sing,gent
ежу NOUN,anim,masc sing,datv
...
ежами NOUN,anim,masc plur,ablt
ежах NOUN,anim,masc plur,loct

Tasks
• Analyze: find a word in the dictionary, return its tag
• Lemmatize: find a word in the dictionary, return the 1st word
of its lexeme
• Inflect: find a word in the dictionary, return a compatible
word from its lexeme
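
The three tasks above can be sketched over a toy in-memory dictionary. This is only an illustration built from the ёж lexeme shown earlier: the `INDEX` layout and the function names are hypothetical, and "compatible" is simplified to "the first form whose tag contains all requested grammemes".

```python
# Toy lexeme copied from the slides; a real dictionary holds many lexemes.
LEXEME = [
    ("ёж", "NOUN,anim,masc sing,nomn"),
    ("ежа", "NOUN,anim,masc sing,gent"),
    ("ежу", "NOUN,anim,masc sing,datv"),
    ("ежами", "NOUN,anim,masc plur,ablt"),
    ("ежах", "NOUN,anim,masc plur,loct"),
]

# Hypothetical index: word form -> list of (lexeme, position of the form).
INDEX = {}
for pos, (form, _tag) in enumerate(LEXEME):
    INDEX.setdefault(form, []).append((LEXEME, pos))


def grammemes(tag):
    """Split a tag such as 'NOUN,anim,masc sing,datv' into a set of grammemes."""
    return set(tag.replace(" ", ",").split(","))


def analyze(word):
    """Return the tags of every dictionary entry matching the word."""
    return [lexeme[pos][1] for lexeme, pos in INDEX.get(word, [])]


def lemmatize(word):
    """Return the first form of each lexeme the word belongs to."""
    return [lexeme[0][0] for lexeme, _pos in INDEX.get(word, [])]


def inflect(word, required):
    """Return, per lexeme, the first form carrying all required grammemes."""
    results = []
    for lexeme, _pos in INDEX.get(word, []):
        for form, tag in lexeme:
            if required <= grammemes(tag):
                results.append(form)
                break
    return results
```

With this toy data, `lemmatize("ежами")` goes back to the first form of the lexeme and `inflect("ёж", {"datv"})` picks the dative form.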

Efficiency considerations
• OpenCorpora XML dictionary is 400MB on disk
• Searching the XML for a word is O(N)
• When loaded into an in-memory hash table (Python
dict), the dictionary takes several GB of RAM

Solution
• Extract paradigms from lexemes; encode words as a
DAFSA (deterministic acyclic finite state automaton).
• Also tried: succinct tries, two double-array tries
• 5M Russian word forms in DAFSA == 3MB RAM

Lexeme
word tag
хомяковый ADJF,Qual masc,sing,nomn
хомякового ADJF,Qual masc,sing,gent
...
хомяковы ADJS,Qual plur
хомяковее COMP,Qual
хомяковей COMP,Qual V-ej
похомяковее COMP,Qual Cmp2
похомяковей COMP,Qual Cmp2,V-ej

Lexeme
prefix  stem     suffix  tag
        хомяков  ый      ADJF,Qual masc,sing,nomn
        хомяков  ого     ADJF,Qual masc,sing,gent
...
        хомяков  ы       ADJS,Qual plur
        хомяков  ее      COMP,Qual
        хомяков  ей      COMP,Qual V-ej
по      хомяков  ее      COMP,Qual Cmp2
по      хомяков  ей      COMP,Qual Cmp2,V-ej

Paradigm
prefix  suffix  tag
        ый      ADJF,Qual masc,sing,nomn
        ого     ADJF,Qual masc,sing,gent
...
        ы       ADJS,Qual plur
        ее      COMP,Qual
        ей      COMP,Qual V-ej
по      ее      COMP,Qual Cmp2
по      ей      COMP,Qual Cmp2,V-ej

Paradigm, encoded
prefix_id suffix_id tag_id
0 66 78
0 67 79
...
0 37 94
0 82 95
0 121 96
1 82 97
1 121 98
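
A minimal sketch of how a paradigm could be extracted from a lexeme and then encoded as id triples. It assumes the stem is the longest substring shared by all forms; the function names are hypothetical, and the ids come from fresh lookup tables, so they differ from the numbers shown above.

```python
def longest_common_substring(words):
    """Brute-force search; fine for the handful of forms in one lexeme."""
    shortest = min(words, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in word for word in words):
                return candidate
    return ""


def extract_paradigm(lexeme):
    """Split each form into prefix + stem + suffix and keep (prefix, suffix, tag)."""
    stem = longest_common_substring([word for word, _tag in lexeme])
    paradigm = []
    for word, tag in lexeme:
        start = word.find(stem)
        paradigm.append((word[:start], word[start + len(stem):], tag))
    return stem, paradigm


def encode(paradigm, prefix_ids, suffix_ids, tag_ids):
    """Replace strings with integer ids, assigning new ids on first use."""
    def intern(table, value):
        return table.setdefault(value, len(table))

    return [
        (intern(prefix_ids, p), intern(suffix_ids, s), intern(tag_ids, t))
        for p, s, t in paradigm
    ]


LEXEME = [
    ("хомяковый", "ADJF,Qual masc,sing,nomn"),
    ("хомякового", "ADJF,Qual masc,sing,gent"),
    ("хомяковы", "ADJS,Qual plur"),
    ("хомяковее", "COMP,Qual"),
    ("хомяковей", "COMP,Qual V-ej"),
    ("похомяковее", "COMP,Qual Cmp2"),
    ("похомяковей", "COMP,Qual Cmp2,V-ej"),
]

stem, paradigm = extract_paradigm(LEXEME)
encoded = encode(paradigm, {}, {}, {})
```

Because the id tables are shared across all lexemes, words that inflect the same way end up with identical encoded paradigms, which only need to be stored once.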

DAFSA
[Figure: DAFSA graph encoding the triples listed below; edges are labelled with letters, a separator (sep), paradigm ids and form indices.]
(word, paradigm_id, form_index) triples:
(двор, 103, 0); (ёж, 104, 0);
(дворник, 101, 2); (дворник, 102, 2);
(ёжик, 101, 2); (ёжик, 102, 2)
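
The lookup idea behind these triples can be sketched without a real automaton. In the sketch below a sorted list of "word + separator + payload" keys stands in for the DAFSA (an assumption made for brevity; a real DAFSA additionally shares common prefixes and suffixes between keys, which is where the memory savings come from). `SEP` and `lookup` are hypothetical names.

```python
import bisect

SEP = "\x01"  # assumed separator that never occurs inside a word

TRIPLES = [
    ("двор", 103, 0), ("ёж", 104, 0),
    ("дворник", 101, 2), ("дворник", 102, 2),
    ("ёжик", 101, 2), ("ёжик", 102, 2),
]

# Each triple becomes one key: the word, the separator, then the payload.
KEYS = sorted(f"{word}{SEP}{pid},{idx}" for word, pid, idx in TRIPLES)


def lookup(word):
    """Return all (paradigm_id, form_index) pairs stored for the word."""
    prefix = word + SEP
    start = bisect.bisect_left(KEYS, prefix)
    found = []
    for key in KEYS[start:]:
        if not key.startswith(prefix):
            break
        pid, idx = key[len(prefix):].split(",")
        found.append((int(pid), int(idx)))
    return found
```

Analyzing a word then amounts to following the word and the separator, and reading off every payload reachable from that point.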

Common prefixes removal: language-specific lists of common
immutable prefixes (e.g. "не", "псевдо")
• недопсевдоавиашоу == недо + псевдоавиашоу
• псевдоавиашоу == псевдо + авиашоу
• авиашоу == авиа + шоу
• шоу - a known word
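
The decomposition above can be sketched as a recursive prefix-stripping routine. The prefix list and the one-word dictionary below are hypothetical toy data chosen to reproduce the example, and `strip_prefixes` is not a pymorphy2 function.

```python
KNOWN_PREFIXES = ["недо", "псевдо", "авиа", "не"]  # hypothetical list
DICTIONARY = {"шоу"}  # toy dictionary of known words


def strip_prefixes(word):
    """Return [prefix, ..., known_word], or None if no split works."""
    if word in DICTIONARY:
        return [word]
    for prefix in KNOWN_PREFIXES:
        rest = word[len(prefix):]
        if word.startswith(prefix) and rest:
            parsed = strip_prefixes(rest)
            if parsed is not None:
                return [prefix] + parsed
    return None
```

Since the prefixes are immutable, the analysis of the known word at the end of the chain can be reused for the whole word.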

Words Ending with Other Dictionary Words
Example: котопсина = кото + псина
• the word being analyzed ends with another dictionary
word (its "suffix");
• the "suffix" word is at least 3 characters long;
• the part of the word before the "suffix" is at most
5 characters long;
• the "suffix" word belongs to an open class (noun, verb,
adjective, participle, gerund)
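
A sketch of the conditions above as a single function. The toy dictionary, the mapping of open classes to OpenCorpora part-of-speech tags, and the function name are assumptions made for illustration.

```python
# Assumed mapping of the open classes to OpenCorpora POS tags.
OPEN_CLASSES = {"NOUN", "VERB", "INFN", "ADJF", "ADJS", "PRTF", "PRTS", "GRND"}

# Toy dictionary: word -> tag.
DICTIONARY = {
    "псина": "NOUN,anim,femn sing,nomn",
    "когда": "CONJ",
}


def analyze_by_known_suffix(word, min_suffix_len=3, max_prefix_len=5):
    """Try every split that respects both length limits."""
    results = []
    last_split = min(max_prefix_len, len(word) - min_suffix_len)
    for split in range(1, last_split + 1):
        prefix, suffix = word[:split], word[split:]
        tag = DICTIONARY.get(suffix)
        # The part of speech is the first grammeme of the tag.
        if tag is not None and tag.split(",")[0] in OPEN_CLASSES:
            results.append((prefix, suffix, tag))
    return results
```

For котопсина the only accepted split is кото + псина, so the unknown word inherits the tag of псина.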

Endings Matching
Example: бурбуляторовый
• words with common endings often have the same
grammatical form
• pymorphy2 builds an index of all 1-5 char word
endings and their analyses
• (frequency, paradigm_id, form_index) triple is
stored for each ending
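
A sketch of such an ending index, assuming the longest known ending wins and candidates are ranked by frequency. The training words and paradigm ids are made up for the example, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_ending_index(triples, max_len=5):
    """Count (paradigm_id, form_index) pairs for every 1..max_len char ending."""
    index = defaultdict(Counter)
    for word, paradigm_id, form_index in triples:
        for n in range(1, min(max_len, len(word)) + 1):
            index[word[-n:]][(paradigm_id, form_index)] += 1
    return index


def guess(word, index, max_len=5):
    """Return (frequency, paradigm_id, form_index) for the longest known ending."""
    for n in range(min(max_len, len(word)), 0, -1):
        counts = index.get(word[-n:])
        if counts:
            return [(freq, pid, idx) for (pid, idx), freq in counts.most_common()]
    return []


# Made-up dictionary entries: (word, paradigm_id, form_index).
KNOWN = [
    ("хомяковый", 7, 0),
    ("ежовый", 7, 0),
    ("новый", 7, 0),
    ("седой", 9, 0),
]
INDEX = build_ending_index(KNOWN)
```

Here бурбуляторовый shares the 4-character ending "овый" with three known words, so it is assigned their paradigm and form index.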

Words with a Hyphen
• adverbs with a hyphen: по-хорошему
• particles separated by a hyphen: смотри-ка
• compound words: интернет-магазин, человек-паук

P(tag | word) estimation
• Based on partially disambiguated OpenCorpora
data;
• maximum likelihood estimation (MLE) with Laplace smoothing
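
A sketch of one common form of this estimate, add-one smoothing over the candidate tags of a word: P(tag | word) = (count(word, tag) + 1) / (count(word) + number of candidate tags). The exact formula used by pymorphy2 is not given here, so this form is an assumption, and the counts below are invented.

```python
from collections import Counter


def estimate(word, candidate_tags, counts):
    """Laplace-smoothed P(tag | word) over the word's candidate tags.

    counts is a Counter keyed by (word, tag), built from disambiguated tokens.
    """
    total = sum(counts[(word, tag)] for tag in candidate_tags)
    denominator = total + len(candidate_tags)
    return {
        tag: (counts[(word, tag)] + 1) / denominator
        for tag in candidate_tags
    }


# Invented counts, for illustration only.
COUNTS = Counter({("стали", "VERB"): 8, ("стали", "NOUN"): 1})
probs = estimate("стали", ["VERB", "NOUN"], COUNTS)
```

Smoothing keeps every candidate tag at a non-zero probability, and a word that never occurs in the corpus gets a uniform distribution over its tags.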

Evaluation: bad ideas
• evaluate pymorphy2 on OpenCorpora data
• evaluate Mystem on ruscorpora.ru (НКРЯ) data

Evaluation Setup
• pymorphy2 and Mystem 3.0;
• 100 randomly selected sentences from OpenCorpora
("microcorpus");
• 100 randomly selected sentences from ruscorpora.ru;
• tagsets are different; evaluation requires complicated
tag matching and manual checking of all errors;
• available online (http://goo.gl/BNXQXf)

Evaluation: errors
(full grammatical tags, recall, errors in
hyphenated words are not considered errors)
[Bar chart: number of errors for pymorphy2 and Mystem 3.0 on the microcorpus and on ruscorpora; bar labels in extraction order: 8, 9, 15, 10]

Evaluation: errors
[Bar chart: number of errors by category (Abbreviations, People Names, Regular Words, Other, Hyphenated Words*) for pymorphy2 and Mystem 3.0; bar labels as extracted: 11, 2, 6, 1, 14, 02, 44, 9]

Evaluation: results
• Both pymorphy2 and Mystem had an error rate below 1%
(without disambiguation); most errors are in
special cases.
• It is hard to draw a conclusion; interpretation of
evaluation results is important.
• Parsing ruscorpora.ru with pymorphy2 uncovered 6 errors in
its gold annotations; parsing the microcorpus with Mystem
uncovered 1 error in its gold annotations.

Future work
• improve parsing of people's names, abbreviations and
hyphenated words;
• improve non-contextual P(tag|word) estimates;
• improve Ukrainian language support;
• add Belarusian language support;
• there is room for speed improvements;
• nicer command-line utility;
• ideas?

You can help
https://github.com/kmike/pymorphy2
