SlideShare a Scribd company logo
Part-of-Speech Tagging: The Power of the Linear
SVM-based Filtration Method for Russian
Language
Anton Kazennikov
IQMEN LLC
01 June 2017Dialogue 20172
MorphoRuEval Shared Task Goals
● Extended POS Tagging
● POS
● Morphological features
● Lemmatization
● Different text sources
● News
● Social media texts
● Fiction texts
01 June 2017Dialogue 20173
Shared Task Tagset
# Category Features
1 POS NOUN, PROPN, ADJ, PRON, NUM, VERB, ADV, DET, CONJ*, ADP,
PART*, H*, INT*, PUNCT*
2 Case Num, Gen, Dat, Acc, Loc, Ins
3 Number Sing, Plur
4 Gender Masc, Fem, Neut
5 Animacy Anim*, Inan*
6 Tense Past, Notpast
7 Person 1, 2, 3
8 VerbForm Inf, Fin, Conv
9 Mood Ind, Imp
10 Variant Short/Brev
11 Degree Pos, Cmp
12 NumForm Digit
* - non-evaluated features
01 June 2017Dialogue 20174
Training Corpora
Corpus Tokens Unique
Lemmas
Unique
Feature sets
Unique words
GICR 1M 43k 303 115k
SynTagRus 0.9M 43k 250 104k
RNC 1.2M 53k 557 127k
OpenCorpora 0.4M 42.5k 337 79k
01 June 2017Dialogue 20175
Proposed approach
● Knowledge-based + Machine Learning
● Unified solution for both tasks
● Based on normalizing substitution
<WF ending, NF Form ending, Feature set>
● Two stages:
1) Generate candidate parses for each word
2) Select optimal parse in left-to-right manner
01 June 2017Dialogue 20176
Candidate Generation
● Close to classical morphological analysis: based on
AOT dictionary
● Guesser: FSA on reversed words (split stem + ending)
Algorithm:
1) Collect parses from AOT Dictionary
2) Collect parses from corpus-based dictionary (GICR)
3) Collect parses from hand-generated dictionary (~50 entries)
4) If no parses found, then guess them
01 June 2017Dialogue 20177
AOT Dictionary
● Based on A. A. Zaliznyak’s
morphological dictionary
● 175k entries
● 4.6M wordforms
Entry
● Entry-wide features
● Stem
● Paradigm
Paradigm – a sorted list
of endings
● First ending – normal form ending
Ending
● Ending string
● Ending features
01 June 2017Dialogue 20178
AOT Dictionary Conversion
Preprocessing
● Feature-based mapping
● Paradigm
transformations
● Verb splitting (participles → Adj)
● Adj/Cmp → Adv
● Immutable nouns processing
● Induces some tolerable
noise
Conversion procedure
● Build partially converted dictionary
● Filter through GICR+SynTagRus parses
● Keep only unambiguous substs
● Converts 1.6M of 4.6M
wordforms
● Various corpora/AOT
mismatches:
● Some nouns have Anim/Inan ambiguity -
“Земля” vs. “земля”
● Gender-No Gender: мр-жр - “пьяница”
● Proper names vs. locations
01 June 2017Dialogue 20179
Parse Filtering
sent[i] – i-th word
parses[i] – set of parses for i-th word
for i=0 to len(sent):
context = extract_context(sent, parses, i)
scores = score(parses[i], context)
max_parse = argmax(parses[i], scores)
sent[i].feats = max_parse.feats
sent[i].lemma = max_parse.lemma
01 June 2017Dialogue 201710
Parse Filtering: Scorer construction
● Score – sum of scores of each features
● Multi-class SVM Classifier on context features
● Trained on each group of features
● Scores POS and morphological features separately
● Hash kernel for model size reduction
dot(class,w,x) = sum(w_class[i] x[i])
→
dot(class,w,x) = sum(w[hash(i, class)] x[i])
01 June 2017Dialogue 201711
Parse Filtering: Context Features
Window of ±3 words around current word features:
● Prefixes, suffixes, wordform itself
● Stems and endings
● POS tag and features combinations
● Ambiguity classes for unparsed context
● Graphical features: capitalization, punctuation etc.
● Bigram and trigram feature schemes
● ~ 150 features for each word
Ambiguity class for unparsed context - a sorted set of candidate
features: “gen/acc” for case ambiguity
01 June 2017Dialogue 201714
Shared Task Results on Closed Track
News
POS Sent
POS
Lemma Sent
Lemma
93,99 64,8 93,01 56,42
93,83 63,13 92,96 54,19
93,71 61,45 89,61 40,22
93,35 55,03 89,23 37,71
Vkontakte
POS Sent
POS
Lemma Sent
Lemma
92,42 65,85 91,69 61,09
92,39 64,08 90,97 60,21
92,29 63,56 88,65 52,64
91,49 61,44 90,97 48,94
Average
POS Sent
POS
Lemma Sent
Lemma
93,39 65,29 92,22 58,21
93,08 62,71 91,81 56,49
92,64 61,01 89,32 45,18
92,57 58,40 88,47 44,78
Fiction
POS Sent
POS
Lemma Sent
Lemma
94,16 65,23 92,01 57,11
92,87 60,91 91,46 55,08
92,16 60,15 90,28 45,18
92,40 56,60 88,84 35,28
01 June 2017Dialogue 201715
Additional Experiments
# Team POS Sent POS Lemma Sent Lemma
1 AOT-2017 93,64 63,93 92,76 59,39
2 C 93,39 65,29
3 AOT-2012 93,08 62,71 92,22 58,21
4 w/o guesser 93,00 60,68 92,21 56,59
5 H 92,64 58,4 80,71 25,01
6 A 92,57 61,01 91,81 56,49
7 w/o AOT 91,13 53,48 80,72 24,54
8 w/o AOT, guesser 88,43 45,60 86,74 40,46
01 June 2017Dialogue 201716
Conclusions
● Solid, high-performance approach
● Robustness across different text sources
● Highly sensitive to dictionary quality
● Efforts on corpus and dictionary unification could
further improve the performance of the presented
approach
Thanks for your attention!
Questions?
E-mail: kazennikov@iqmen.ru
Code and data available at:
https://github.com/kzn/morphoRuEval

More Related Content

Recently uploaded

11.1 Role of physical biological in deterioration of grains.pdf
11.1 Role of physical biological in deterioration of grains.pdf11.1 Role of physical biological in deterioration of grains.pdf
11.1 Role of physical biological in deterioration of grains.pdf
PirithiRaju
 
HUMAN EYE By-R.M Class 10 phy best digital notes.pdf
HUMAN EYE By-R.M Class 10 phy best digital notes.pdfHUMAN EYE By-R.M Class 10 phy best digital notes.pdf
HUMAN EYE By-R.M Class 10 phy best digital notes.pdf
Ritik83251
 
The cost of acquiring information by natural selection
The cost of acquiring information by natural selectionThe cost of acquiring information by natural selection
The cost of acquiring information by natural selection
Carl Bergstrom
 
Tissue fluids_etiology_volume regulation_pressure.pptx
Tissue fluids_etiology_volume regulation_pressure.pptxTissue fluids_etiology_volume regulation_pressure.pptx
Tissue fluids_etiology_volume regulation_pressure.pptx
muralinath2
 
Microbiology of Central Nervous System INFECTIONS.pdf
Microbiology of Central Nervous System INFECTIONS.pdfMicrobiology of Central Nervous System INFECTIONS.pdf
Microbiology of Central Nervous System INFECTIONS.pdf
sammy700571
 
8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf
by6843629
 
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
Sérgio Sacani
 
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
Advanced-Concepts-Team
 
Alternate Wetting and Drying - Climate Smart Agriculture
Alternate Wetting and Drying - Climate Smart AgricultureAlternate Wetting and Drying - Climate Smart Agriculture
Alternate Wetting and Drying - Climate Smart Agriculture
International Food Policy Research Institute- South Asia Office
 
Randomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNERandomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNE
University of Maribor
 
Sciences of Europe journal No 142 (2024)
Sciences of Europe journal No 142 (2024)Sciences of Europe journal No 142 (2024)
Sciences of Europe journal No 142 (2024)
Sciences of Europe
 
cathode ray oscilloscope and its applications
cathode ray oscilloscope and its applicationscathode ray oscilloscope and its applications
cathode ray oscilloscope and its applications
sandertein
 
Anti-Universe And Emergent Gravity and the Dark Universe
Anti-Universe And Emergent Gravity and the Dark UniverseAnti-Universe And Emergent Gravity and the Dark Universe
Anti-Universe And Emergent Gravity and the Dark Universe
Sérgio Sacani
 
Pests of Storage_Identification_Dr.UPR.pdf
Pests of Storage_Identification_Dr.UPR.pdfPests of Storage_Identification_Dr.UPR.pdf
Pests of Storage_Identification_Dr.UPR.pdf
PirithiRaju
 
The debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically youngThe debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically young
Sérgio Sacani
 
Summary Of transcription and Translation.pdf
Summary Of transcription and Translation.pdfSummary Of transcription and Translation.pdf
Summary Of transcription and Translation.pdf
vadgavevedant86
 
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
Travis Hills MN
 
JAMES WEBB STUDY THE MASSIVE BLACK HOLE SEEDS
JAMES WEBB STUDY THE MASSIVE BLACK HOLE SEEDSJAMES WEBB STUDY THE MASSIVE BLACK HOLE SEEDS
JAMES WEBB STUDY THE MASSIVE BLACK HOLE SEEDS
Sérgio Sacani
 
LEARNING TO LIVE WITH LAWS OF MOTION .pptx
LEARNING TO LIVE WITH LAWS OF MOTION .pptxLEARNING TO LIVE WITH LAWS OF MOTION .pptx
LEARNING TO LIVE WITH LAWS OF MOTION .pptx
yourprojectpartner05
 
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdfMending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Selcen Ozturkcan
 

Recently uploaded (20)

11.1 Role of physical biological in deterioration of grains.pdf
11.1 Role of physical biological in deterioration of grains.pdf11.1 Role of physical biological in deterioration of grains.pdf
11.1 Role of physical biological in deterioration of grains.pdf
 
HUMAN EYE By-R.M Class 10 phy best digital notes.pdf
HUMAN EYE By-R.M Class 10 phy best digital notes.pdfHUMAN EYE By-R.M Class 10 phy best digital notes.pdf
HUMAN EYE By-R.M Class 10 phy best digital notes.pdf
 
The cost of acquiring information by natural selection
The cost of acquiring information by natural selectionThe cost of acquiring information by natural selection
The cost of acquiring information by natural selection
 
Tissue fluids_etiology_volume regulation_pressure.pptx
Tissue fluids_etiology_volume regulation_pressure.pptxTissue fluids_etiology_volume regulation_pressure.pptx
Tissue fluids_etiology_volume regulation_pressure.pptx
 
Microbiology of Central Nervous System INFECTIONS.pdf
Microbiology of Central Nervous System INFECTIONS.pdfMicrobiology of Central Nervous System INFECTIONS.pdf
Microbiology of Central Nervous System INFECTIONS.pdf
 
8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf8.Isolation of pure cultures and preservation of cultures.pdf
8.Isolation of pure cultures and preservation of cultures.pdf
 
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
 
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
ESA/ACT Science Coffee: Diego Blas - Gravitational wave detection with orbita...
 
Alternate Wetting and Drying - Climate Smart Agriculture
Alternate Wetting and Drying - Climate Smart AgricultureAlternate Wetting and Drying - Climate Smart Agriculture
Alternate Wetting and Drying - Climate Smart Agriculture
 
Randomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNERandomised Optimisation Algorithms in DAPHNE
Randomised Optimisation Algorithms in DAPHNE
 
Sciences of Europe journal No 142 (2024)
Sciences of Europe journal No 142 (2024)Sciences of Europe journal No 142 (2024)
Sciences of Europe journal No 142 (2024)
 
cathode ray oscilloscope and its applications
cathode ray oscilloscope and its applicationscathode ray oscilloscope and its applications
cathode ray oscilloscope and its applications
 
Anti-Universe And Emergent Gravity and the Dark Universe
Anti-Universe And Emergent Gravity and the Dark UniverseAnti-Universe And Emergent Gravity and the Dark Universe
Anti-Universe And Emergent Gravity and the Dark Universe
 
Pests of Storage_Identification_Dr.UPR.pdf
Pests of Storage_Identification_Dr.UPR.pdfPests of Storage_Identification_Dr.UPR.pdf
Pests of Storage_Identification_Dr.UPR.pdf
 
The debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically youngThe debris of the ‘last major merger’ is dynamically young
The debris of the ‘last major merger’ is dynamically young
 
Summary Of transcription and Translation.pdf
Summary Of transcription and Translation.pdfSummary Of transcription and Translation.pdf
Summary Of transcription and Translation.pdf
 
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
Travis Hills of MN is Making Clean Water Accessible to All Through High Flux ...
 
JAMES WEBB STUDY THE MASSIVE BLACK HOLE SEEDS
JAMES WEBB STUDY THE MASSIVE BLACK HOLE SEEDSJAMES WEBB STUDY THE MASSIVE BLACK HOLE SEEDS
JAMES WEBB STUDY THE MASSIVE BLACK HOLE SEEDS
 
LEARNING TO LIVE WITH LAWS OF MOTION .pptx
LEARNING TO LIVE WITH LAWS OF MOTION .pptxLEARNING TO LIVE WITH LAWS OF MOTION .pptx
LEARNING TO LIVE WITH LAWS OF MOTION .pptx
 
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdfMending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
 

Featured

Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
Kurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
SpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Lily Ray
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
Rajiv Jayarajah, MAppComm, ACC
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Christy Abraham Joy
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
Vit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
MindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
GetSmarter
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
Alireza Esmikhani
 
More than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike RoutesMore than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike Routes
Project for Public Spaces & National Center for Biking and Walking
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
DevGAMM Conference
 
Barbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy Presentation
Erica Santiago
 
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellGood Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Saba Software
 

Featured (20)

Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 
More than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike RoutesMore than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike Routes
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
 
Barbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy Presentation
 
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellGood Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
 

MorphoRuEval-2017. Part-of-Speech Tagging: The Power of the Linear SVM-based Filtration Method for Russian Language

  • 1. Part-of-Speech Tagging: The Power of the Linear SVM-based Filtration Method for Russian Language Anton Kazennikov IQMEN LLC
  • 2. 01 June 2017Dialogue 20172 MorphoRuEval Shared Task Goals ● Extended POS Tagging ● POS ● Morphological features ● Lemmatization ● Different text sources ● News ● Social media texts ● Fiction texts
  • 3. 01 June 2017Dialogue 20173 Shared Task Tagset # Category Features 1 POS NOUN, PROPN, ADJ, PRON, NUM, VERB, ADV, DET, CONJ*, ADP, PART*, H*, INT*, PUNCT* 2 Case Num, Gen, Dat, Acc, Loc, Ins 3 Number Sing, Plur 4 Gender Masc, Fem, Neut 5 Animacy Anim*, Inan* 6 Tense Past, Notpast 7 Person 1, 2, 3 8 VerbForm Inf, Fin, Conv 9 Mood Ind, Imp 10 Variant Short/Brev 11 Degree Pos, Cmp 12 NumForm Digit * - non-evaluated features
  • 4. 01 June 2017Dialogue 20174 Training Corpora Corpus Tokens Unique Lemmas Unique Feature sets Unique words GICR 1M 43k 303 115k SynTagRus 0.9M 43k 250 104k RNC 1.2M 53k 557 127k OpenCorpora 0.4M 42.5k 337 79k
  • 5. 01 June 2017Dialogue 20175 Proposed approach ● Knowledge-based + Machine Learning ● Unified solution for both tasks ● Based on normalizing substitution <WF ending, NF Form ending, Feature set> ● Two stages: 1) Generate candidate parses for each word 2) Select optimal parse in left-to-right manner
  • 6. 01 June 2017Dialogue 20176 Candidate Generation ● Close to classical morphological analysis: based on AOT dictionary ● Guesser: FSA on reversed words (split stem + ending) Algorithm: 1) Collect parses from AOT Dictionary 2) Collect parses from corpus-based dictionary (GICR) 3) Collect parses from hand-generated dictionary (~50 entries) 4) If no parses found, then guess them
  • 7. 01 June 2017Dialogue 20177 AOT Dictionary ● Based on A. A. Zaliznyak’s morphological dictionary ● 175k entries ● 4.6M wordforms Entry ● Entry-wide features ● Stem ● Paradigm Paradigm – a sorted list of endings ● First ending – normal form ending Ending ● Ending string ● Ending features
  • 8. 01 June 2017Dialogue 20178 AOT Dictionary Conversion Preprocessing ● Feature-based mapping ● Paradigm transformations ● Verb splitting (participles → Adj) ● Adj/Cmp → Adv ● Immutable nouns processing ● Induces some tolerable noise Conversion procedure ● Build partially converted dictionary ● Filter through GICR+SynTagRus parses ● Keep only unambiguous substs ● Converts 1.6M of 4.6M wordforms ● Various corpora/AOT mismatches: ● Some nouns have Anim/Inan ambiguity - “Земля” vs. “земля” ● Gender-No Gender: мр-жр - “пьяница” ● Proper names vs. locations
  • 9. 01 June 2017Dialogue 20179 Parse Filtering sent[i] – i-th word parses[i] – set of parses for i-th word for i=0 to len(sent): context = extract_context(sent, parses, i) scores = score(parses[i], context) max_parse = argmax(parses[i], scores) sent[i].feats = max_parse.feats sent[i].lemma = max_parse.lemma
  • 10. 01 June 2017Dialogue 201710 Parse Filtering: Scorer construction ● Score – sum of scores of each features ● Multi-class SVM Classifier on context features ● Trained on each group of features ● Scores POS and morphological features separately ● Hash kernel for model size reduction dot(class,w,x) = sum(w_class[i] x[i]) → dot(class,w,x) = sum(w[hash(i, class)] x[i])
  • 11. 01 June 2017Dialogue 201711 Parse Filtering: Context Features Window of ±3 words around current word features: ● Prefixes, suffixes, wordform itself ● Stems and endings ● POS tag and features combinations ● Ambiguity classes for unparsed context ● Graphical features: capitalization, punctuation etc. ● Bigram and trigram feature schemes ● ~ 150 features for each word Ambiguity class for unparsed context - a sorted set of candidate features: “gen/acc” for case ambiguity
  • 12. 01 June 2017Dialogue 201714 Shared Task Results on Closed Track News POS Sent POS Lemma Sent Lemma 93,99 64,8 93,01 56,42 93,83 63,13 92,96 54,19 93,71 61,45 89,61 40,22 93,35 55,03 89,23 37,71 Vkontakte POS Sent POS Lemma Sent Lemma 92,42 65,85 91,69 61,09 92,39 64,08 90,97 60,21 92,29 63,56 88,65 52,64 91,49 61,44 90,97 48,94 Average POS Sent POS Lemma Sent Lemma 93,39 65,29 92,22 58,21 93,08 62,71 91,81 56,49 92,64 61,01 89,32 45,18 92,57 58,40 88,47 44,78 Fiction POS Sent POS Lemma Sent Lemma 94,16 65,23 92,01 57,11 92,87 60,91 91,46 55,08 92,16 60,15 90,28 45,18 92,40 56,60 88,84 35,28
  • 13. 01 June 2017Dialogue 201715 Additional Experiments # Team POS Sent POS Lemma Sent Lemma 1 AOT-2017 93,64 63,93 92,76 59,39 2 C 93,39 65,29 3 AOT-2012 93,08 62,71 92,22 58,21 4 w/o guesser 93,00 60,68 92,21 56,59 5 H 92,64 58,4 80,71 25,01 6 A 92,57 61,01 91,81 56,49 7 w/o AOT 91,13 53,48 80,72 24,54 8 w/o AOT, guesser 88,43 45,60 86,74 40,46
  • 14. 01 June 2017Dialogue 201716 Conclusions ● Solid, high-performance approach ● Robustness across different text sources ● Highly sensitive to dictionary quality ● Efforts on corpus and dictionary unification could further improve the performance of the presented approach
  • 15. Thanks for your attention! Questions? E-mail: kazennikov@iqmen.ru Code and data available at: https://github.com/kzn/morphoRuEval