Morphological Analyzer
and Generator for Russian
and Ukrainian Languages
Mikhail Korobov
AIST 2015
Morphological Analysis:
word -> possible grammatical tags
• стали: VERB,perf,intr plur,past,indc
(ГЛ,сов,неперех мн,прош,изъяв);
• стали: NOUN,inan,femn [sing,gent; sing,datv; sing,loct; plur,nomn; plur,accs]
(СУЩ,неод,жр [ед,рд; ед,дт; ед,пр; мн,им; мн,вн])
• бутявка: NOUN,inan,femn sing,nomn
(СУЩ,неод,жр ед,им)
Morphological Generation
• lemmatization: стали -> стать, ежом -> ёж
• inflection: стали -> (sing,3per,fut) -> станет
• inflection: ёж -> (datv) -> ежу
pymorphy2: features
• Morphological analysis of Russian words;
• morphological generation: lemmatization, inflection,
number agreement;
• P(tag | word) estimates;
• out-of-vocabulary words handling;
• experimental support for Ukrainian language.
pymorphy2: implementation
• Python library and a command line tool
• Permissive open-source license: MIT for code,
Creative Commons BY-SA for data
• 600+ unit tests; 90%+ test coverage
• Memory usage: 30MB = 15MB pymorphy2 + 15MB
Python interpreter
• Speed: 20-100K words per second with an optional
C++ extension
Analysis of Vocabulary
Words
• OpenCorpora dictionary for Russian (5M word
forms, 400K lemmas);
• for Ukrainian: a dictionary based on LanguageTool data (2.5M
word forms) by Andrey Rysin, Dmitry Chaplinsky,
Mariana Romanyshyn, Vladimir Sevastyanov &
others.
Analysis of Vocabulary
Words
Source dictionaries provide lexemes:
ёж NOUN,anim,masc sing,nomn
ежа NOUN,anim,masc sing,gent
ежу NOUN,anim,masc sing,datv
...
ежами NOUN,anim,masc plur,ablt
ежах NOUN,anim,masc plur,loct
Tasks
• Analyze: get a word from dictionary, return its tag
• Lemmatize: find a word in dictionary, get 1st word
from its lexeme
• Inflect: find a word in dictionary, get a compatible
word from its lexeme
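
The three tasks above can be sketched on a toy lexeme table. This is only an illustration under simplifying assumptions: the names `LEXEMES`, `INDEX`, `analyze`, `lemmatize` and `inflect` are hypothetical and are not the pymorphy2 API, and a plain Python dict stands in for the real storage.

```python
# Toy data: one lexeme, listed lemma first (hypothetical, not the real dictionary).
LEXEMES = [
    [
        ("ёж", "NOUN,anim,masc sing,nomn"),
        ("ежа", "NOUN,anim,masc sing,gent"),
        ("ежу", "NOUN,anim,masc sing,datv"),
        ("ежом", "NOUN,anim,masc sing,ablt"),
    ],
]

# word form -> list of (lexeme index, form index)
INDEX = {}
for lex_id, lexeme in enumerate(LEXEMES):
    for form_id, (word, _tag) in enumerate(lexeme):
        INDEX.setdefault(word, []).append((lex_id, form_id))


def grammemes(tag):
    """Split a tag such as 'NOUN,anim,masc sing,datv' into a set of grammemes."""
    return set(tag.replace(" ", ",").split(","))


def analyze(word):
    """Return every tag recorded for the word form."""
    return [LEXEMES[lex][form][1] for lex, form in INDEX.get(word, [])]


def lemmatize(word):
    """Return the first form of each lexeme that contains the word."""
    return [LEXEMES[lex][0][0] for lex, _form in INDEX.get(word, [])]


def inflect(word, required):
    """Return forms from the word's lexemes whose tags include all required grammemes."""
    results = []
    for lex, _form in INDEX.get(word, []):
        for form, tag in LEXEMES[lex]:
            if required <= grammemes(tag):
                results.append(form)
    return results
```

For example, `lemmatize("ежом")` walks from the form to its lexeme and returns the first entry, and `inflect("ёж", {"datv"})` scans the same lexeme for a form carrying the requested grammeme.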
Efficiency considerations
• OpenCorpora XML dictionary is 400MB on disk
• XML search lookup is O(N)
• When loaded to an in-memory hash table (Python
dict) dictionary takes several GB of RAM
Solution
• Extract paradigms from lexemes; encode words as
DAFSA.
• Also tried: succinct tries, two double-array tries
• 5M Russian word forms in DAFSA == 3MB RAM
Lexeme
word tag
хомяковый ADJF,Qual masc,sing,nomn
хомякового ADJF,Qual masc,sing,gent
...
хомяковы ADJS,Qual plur
хомяковее COMP,Qual
хомяковей COMP,Qual V-ej
похомяковее COMP,Qual Cmp2
похомяковей COMP,Qual Cmp2,V-ej
Lexeme
prefix  stem     suffix  tag
-       хомяков  ый      ADJF,Qual masc,sing,nomn
-       хомяков  ого     ADJF,Qual masc,sing,gent
...
-       хомяков  ы       ADJS,Qual plur
-       хомяков  ее      COMP,Qual
-       хомяков  ей      COMP,Qual V-ej
по      хомяков  ее      COMP,Qual Cmp2
по      хомяков  ей      COMP,Qual Cmp2,V-ej
Paradigm
prefix  suffix  tag
-       ый      ADJF,Qual masc,sing,nomn
-       ого     ADJF,Qual masc,sing,gent
...
-       ы       ADJS,Qual plur
-       ее      COMP,Qual
-       ей      COMP,Qual V-ej
по      ее      COMP,Qual Cmp2
по      ей      COMP,Qual Cmp2,V-ej
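
One way to go from a lexeme to a paradigm is to take the longest substring shared by all forms as the stem and keep what is left on either side. The sketch below assumes that approach; the helper names are hypothetical, and the real library may choose stems and prefixes differently.

```python
def longest_common_substring(words):
    """Brute-force search for the longest substring present in every word."""
    shortest = min(words, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in word for word in words):
                return candidate
    return ""


def to_paradigm(lexeme):
    """Split each (word, tag) pair into (prefix, suffix, tag) around a shared stem."""
    stem = longest_common_substring([word for word, _tag in lexeme])
    paradigm = []
    for word, tag in lexeme:
        if stem:
            prefix, _stem, suffix = word.partition(stem)
        else:
            # No shared stem (e.g. suppletive forms): keep the whole word as the suffix.
            prefix, suffix = "", word
        paradigm.append((prefix, suffix, tag))
    return stem, paradigm


LEXEME = [
    ("хомяковый", "ADJF,Qual masc,sing,nomn"),
    ("хомякового", "ADJF,Qual masc,sing,gent"),
    ("хомяковы", "ADJS,Qual plur"),
    ("хомяковее", "COMP,Qual"),
    ("хомяковей", "COMP,Qual V-ej"),
    ("похомяковее", "COMP,Qual Cmp2"),
    ("похомяковей", "COMP,Qual Cmp2,V-ej"),
]
STEM, PARADIGM = to_paradigm(LEXEME)
```

Because the paradigm no longer mentions the stem, every adjective that inflects like хомяковый can share the same paradigm.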
Paradigm, encoded
prefix_id suffix_id tag_id
0 66 78
0 67 79
...
0 37 94
0 82 95
0 121 96
1 82 97
1 121 98
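
The encoded form can be thought of as string interning: each distinct prefix, suffix and tag is stored once in a table and replaced by its index. A minimal sketch follows, with a hypothetical `Interner` class; the ids it produces depend on insertion order and will not match the numbers on the slide.

```python
class Interner:
    """Assigns consecutive integer ids to distinct strings."""

    def __init__(self):
        self.ids = {}
        self.values = []

    def intern(self, value):
        if value not in self.ids:
            self.ids[value] = len(self.values)
            self.values.append(value)
        return self.ids[value]


prefixes, suffixes, tags = Interner(), Interner(), Interner()


def encode(paradigm):
    """Replace (prefix, suffix, tag) strings with ids from the shared tables."""
    return [(prefixes.intern(p), suffixes.intern(s), tags.intern(t))
            for p, s, t in paradigm]


def decode(encoded):
    """Look the ids up again to recover the original strings."""
    return [(prefixes.values[p], suffixes.values[s], tags.values[t])
            for p, s, t in encoded]


PARADIGM = [
    ("", "ый", "ADJF,Qual masc,sing,nomn"),
    ("", "ее", "COMP,Qual"),
    ("по", "ее", "COMP,Qual Cmp2"),
]
ENCODED = encode(PARADIGM)
```

As on the slide, rows that share a suffix (here "ее") receive the same suffix id, so the repeated strings are stored only once.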
DAFSA
[Figure: DAFSA graph for the entries below, including edges labelled "sep".]
(word, paradigm_id, form_index) triples:
(двор, 103, 0); (ёж, 104, 0);
(дворник, 101, 2); (дворник, 102, 2);
(ёжик, 101, 2); (ёжик, 102, 2)
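
To illustrate how such triples can live in a key-only automaton, the sketch below stores each entry as the key "word + separator + payload" and answers a lookup by walking the word, then the separator, then collecting every payload below that point. It uses a plain, unminimized trie built from nested dicts, so it is not a DAFSA: a DAFSA would additionally merge identical sub-graphs, such as the payload tails shared by дворник and ёжик. The separator character and payload format are assumptions made for this example.

```python
SEP = "\x00"  # assumed separator; must not occur inside words


def build_trie(triples):
    """Insert 'word SEP paradigm_id,form_index' keys into a nested-dict trie."""
    root = {}
    for word, paradigm_id, form_index in triples:
        key = f"{word}{SEP}{paradigm_id},{form_index}"
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node[None] = True  # end-of-key marker
    return root


def lookup(root, word):
    """Return sorted (paradigm_id, form_index) pairs stored for the word."""
    node = root
    for ch in word + SEP:
        node = node.get(ch)
        if node is None:
            return []
    results = []
    stack = [(node, "")]
    while stack:
        current, payload = stack.pop()
        for ch, child in current.items():
            if ch is None:
                paradigm_id, form_index = payload.split(",")
                results.append((int(paradigm_id), int(form_index)))
            else:
                stack.append((child, payload + ch))
    return sorted(results)


TRIE = build_trie([
    ("двор", 103, 0), ("ёж", 104, 0),
    ("дворник", 101, 2), ("дворник", 102, 2),
    ("ёжик", 101, 2), ("ёжик", 102, 2),
])
```

Requiring the separator right after the word is what keeps a prefix such as "дво" from matching the entries for двор or дворник.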
Out of Vocabulary
Words
Common prefix removal: language-specific
lists of common immutable
prefixes (e.g. "не", "псевдо")
• недопсевдоавиашоу == недо + псевдоавиашоу
• псевдоавиашоу == псевдо + авиашоу
• авиашоу == авиа + шоу
• шоу - a known word
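
The chain above can be sketched as a recursive strip-and-retry loop. The prefix list and the one-word dictionary below are toy assumptions, and `split_prefixes` is a hypothetical helper, not a pymorphy2 function.

```python
KNOWN_WORDS = {"шоу"}                        # toy dictionary
PREFIXES = ("недо", "псевдо", "авиа", "не")  # toy list of immutable prefixes


def split_prefixes(word, known=KNOWN_WORDS, prefixes=PREFIXES):
    """Strip known prefixes until a dictionary word remains.

    Returns the list of stripped prefixes followed by the known word,
    or None if no sequence of prefixes leads to a dictionary word.
    """
    if word in known:
        return [word]
    for prefix in prefixes:
        rest = word[len(prefix):]
        if word.startswith(prefix) and rest:
            tail = split_prefixes(rest, known, prefixes)
            if tail is not None:
                return [prefix] + tail
    return None
```

Since the prefixes are immutable, the analysis of the remaining known word can presumably be reused for the whole word, with only the prefix glued back on.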
Words Ending with Other Dictionary Words
Example: котопсина
• a word being analyzed has another word from a
dictionary as a suffix;
• the length of this "suffix" word is no less than 3;
• the length of the word without the "suffix" is no
greater than 5;
• "suffix" word is of an open class (noun, verb,
adjective, participle, gerund)
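
The four conditions above translate fairly directly into a loop over split points. In this sketch the dictionary, its tags and the set of open-class part-of-speech labels are illustrative assumptions, and `analyze_by_known_suffix` is a hypothetical name.

```python
OPEN_CLASSES = {"NOUN", "VERB", "ADJF", "PRTF", "GRND"}  # assumed POS labels
DICTIONARY = {                                            # toy entries, illustrative tags
    "псина": "NOUN,anim,femn sing,nomn",
    "сина": "CONJ",  # deliberately closed-class, to show the filter
}


def analyze_by_known_suffix(word, dictionary=DICTIONARY,
                            min_suffix_len=3, max_rest_len=5):
    """Find dictionary words that end the given word, under the slide's limits."""
    results = []
    for rest_len in range(1, max_rest_len + 1):
        suffix = word[rest_len:]
        if len(suffix) < min_suffix_len:
            break  # longer splits only make the suffix shorter
        tag = dictionary.get(suffix)
        if tag is None:
            continue
        part_of_speech = tag.split(",")[0].split()[0]
        if part_of_speech in OPEN_CLASSES:
            results.append((word[:rest_len], suffix, tag))
    return results
```

For котопсина the only accepted split is кото + псина, so the unknown word inherits the tag of псина; the toy closed-class entry "сина" is found but rejected by the open-class check.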
Endings Matching
Example: бурбуляторовый
• words with common endings often have the same
grammatical form
• pymorphy2 builds an index of all 1-5 char word
endings and their analyses
• (frequency, paradigm_id, form_index) triple is
stored for each ending
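
A minimal sketch of such an endings index: count, for every 1-5 character ending seen in the dictionary, how often each (paradigm_id, form_index) pair occurs, then answer an unknown word from its longest indexed ending. The function names, the paradigm ids and the toy entries are assumptions made for this example.

```python
from collections import Counter, defaultdict


def build_endings_index(entries, max_len=5):
    """Map each ending to (frequency, paradigm_id, form_index) triples, most frequent first."""
    counts = defaultdict(Counter)
    for word, paradigm_id, form_index in entries:
        for n in range(1, min(max_len, len(word)) + 1):
            counts[word[-n:]][(paradigm_id, form_index)] += 1
    return {
        ending: [(freq, pid, idx) for (pid, idx), freq in counter.most_common()]
        for ending, counter in counts.items()
    }


def guess(word, index, max_len=5):
    """Return the triples stored for the longest ending of the word found in the index."""
    for n in range(min(max_len, len(word)), 0, -1):
        hits = index.get(word[-n:])
        if hits:
            return hits
    return []


ENDINGS = build_endings_index([
    ("хомяковый", 7, 0),
    ("дубовый", 7, 0),
    ("новый", 9, 0),
    ("хомякового", 7, 1),
])
```

With these entries, бурбуляторовый has no 5-character ending in the index, so the guess falls back to "овый" and ranks paradigm 7 above paradigm 9 by frequency.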
Words with a Hyphen
• adverbs with a hyphen: по-хорошему
• particles separated by a hyphen: смотри-ка
• compound words: интернет-магазин, человек-паук
P(tag | word) estimation
• Based on partially disambiguated OpenCorpora
data;
• MLE with Laplace smoothing
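
As a sketch of what MLE with Laplace (add-one) smoothing could look like here: count how often each candidate tag was chosen for a word in the disambiguated data, add a pseudo-count to every candidate, and normalise. The function name, the choice to smooth over the word's candidate tags, and the toy counts are assumptions, not a statement of how pymorphy2 computes its estimates.

```python
from collections import Counter


def estimate_tag_probabilities(candidate_tags, observed_tags, alpha=1.0):
    """Smoothed P(tag | word) over the tags the analyzer proposes for one word.

    candidate_tags: tags the analyzer returns for the word
    observed_tags:  tags assigned to that word in disambiguated corpus data
    alpha:          pseudo-count added to every candidate (1.0 = Laplace)
    """
    counts = Counter(tag for tag in observed_tags if tag in candidate_tags)
    denominator = sum(counts.values()) + alpha * len(candidate_tags)
    return {tag: (counts[tag] + alpha) / denominator for tag in candidate_tags}


# Toy counts: the verb reading was chosen 8 times, the noun reading never.
PROBS = estimate_tag_probabilities(["VERB", "NOUN"], ["VERB"] * 8)
```

The smoothing keeps a reading that never occurs in the corpus from getting probability zero: here the noun reading still receives 1/10, while the verb reading gets 9/10.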
Evaluation: bad ideas
• evaluate pymorphy2 on OpenCorpora data
• evaluate Mystem on ruscorpora.ru (НКРЯ) data
Evaluation Setup
• pymorphy2 and Mystem 3.0;
• 100 randomly selected sentences from OpenCorpora
("microcorpus");
• 100 randomly selected sentences from ruscorpora.ru;
• tagsets are different; evaluation requires complicated
tag matching and manual checking of all errors;
• available online (http://goo.gl/BNXQXf)
Evaluation: errors
(full grammatical tags, recall, errors in
hyphenated words are not considered errors)
[Bar chart: error counts for pymorphy2 and Mystem 3.0 on the microcorpus and ruscorpora samples; extracted bar labels: 8, 9, 15, 10.]
Evaluation: errors
[Bar chart: error counts by category (Abbreviations, People Names, Regular Words, Other, Hyphenated Words*) for pymorphy2 and Mystem 3.0.]
Evaluation: results
• Both pymorphy2 and Mystem made fewer than 1%
errors (without disambiguation); most errors are in
special cases.
• Hard to draw a conclusion; interpretation of
evaluation results is important.
• 6 errors in the ruscorpora.ru gold data were found by
parsing it with pymorphy2; 1 error in the microcorpus
gold data was found by parsing it with Mystem.
Future work
• Improve parsing of people's names, abbreviations and
hyphenated words;
• improve non-contextual P(tag|word) estimates;
• improve Ukrainian language support;
• add Belarusian language support;
• there is room for speed improvements;
• nicer command-line utility;
• ideas?
You can help
https://github.com/kmike/pymorphy2
