SlideShare a Scribd company logo
1 of 15
Download to read offline
Part-of-Speech Tagging: The Power of the Linear
SVM-based Filtration Method for Russian
Language
Anton Kazennikov
IQMEN LLC
01 June 2017Dialogue 20172
MorphoRuEval Shared Task Goals
● Extended POS Tagging
● POS
● Morphological features
● Lemmatization
● Different text sources
● News
● Social media texts
● Fiction texts
01 June 2017Dialogue 20173
Shared Task Tagset
# Category Features
1 POS NOUN, PROPN, ADJ, PRON, NUM, VERB, ADV, DET, CONJ*, ADP,
PART*, H*, INT*, PUNCT*
2 Case Num, Gen, Dat, Acc, Loc, Ins
3 Number Sing, Plur
4 Gender Masc, Fem, Neut
5 Animacy Anim*, Inan*
6 Tense Past, Notpast
7 Person 1, 2, 3
8 VerbForm Inf, Fin, Conv
9 Mood Ind, Imp
10 Variant Short/Brev
11 Degree Pos, Cmp
12 NumForm Digit
* - non-evaluated features
01 June 2017Dialogue 20174
Training Corpora
Corpus Tokens Unique
Lemmas
Unique
Feature sets
Unique words
GICR 1M 43k 303 115k
SynTagRus 0.9M 43k 250 104k
RNC 1.2M 53k 557 127k
OpenCorpora 0.4M 42.5k 337 79k
01 June 2017Dialogue 20175
Proposed approach
● Knowledge-based + Machine Learning
● Unified solution for both tasks
● Based on normalizing substitution
<WF ending, NF Form ending, Feature set>
● Two stages:
1) Generate candidate parses for each word
2) Select optimal parse in left-to-right manner
01 June 2017Dialogue 20176
Candidate Generation
● Close to classical morphological analysis: based on
AOT dictionary
● Guesser: FSA on reversed words (split stem + ending)
Algorithm:
1) Collect parses from AOT Dictionary
2) Collect parses from corpus-based dictionary (GICR)
3) Collect parses from hand-generated dictionary (~50 entries)
4) If no parses found, then guess them
01 June 2017Dialogue 20177
AOT Dictionary
● Based on A. A. Zaliznyak’s
morphological dictionary
● 175k entries
● 4.6M wordforms
Entry
● Entry-wide features
● Stem
● Paradigm
Paradigm – a sorted list
of endings
● First ending – normal form ending
Ending
● Ending string
● Ending features
01 June 2017Dialogue 20178
AOT Dictionary Conversion
Preprocessing
● Feature-based mapping
● Paradigm
transformations
● Verb splitting (participles → Adj)
● Adj/Cmp → Adv
● Immutable nouns processing
● Induces some tolerable
noise
Conversion procedure
● Build partially converted dictionary
● Filter through GICR+SynTagRus parses
● Keep only unambiguous substs
● Converts 1.6M of 4.6M
wordforms
● Various corpora/AOT
mismatches:
● Some nouns have Anim/Inan ambiguity -
“Земля” vs. “земля”
● Gender-No Gender: мр-жр - “пьяница”
● Proper names vs. locations
01 June 2017Dialogue 20179
Parse Filtering
sent[i] – i-th word
parses[i] – set of parses for i-th word
for i=0 to len(sent):
context = extract_context(sent, parses, i)
scores = score(parses[i], context)
max_parse = argmax(parses[i], scores)
sent[i].feats = max_parse.feats
sent[i].lemma = max_parse.lemma
01 June 2017Dialogue 201710
Parse Filtering: Scorer construction
● Score – sum of scores of each features
● Multi-class SVM Classifier on context features
● Trained on each group of features
● Scores POS and morphological features separately
● Hash kernel for model size reduction
dot(class,w,x) = sum(w_class[i] x[i])
→
dot(class,w,x) = sum(w[hash(i, class)] x[i])
01 June 2017Dialogue 201711
Parse Filtering: Context Features
Window of ±3 words around current word features:
● Prefixes, suffixes, wordform itself
● Stems and endings
● POS tag and features combinations
● Ambiguity classes for unparsed context
● Graphical features: capitalization, punctuation etc.
● Bigram and trigram feature schemes
● ~ 150 features for each word
Ambiguity class for unparsed context - a sorted set of candidate
features: “gen/acc” for case ambiguity
01 June 2017Dialogue 201714
Shared Task Results on Closed Track
News
POS Sent
POS
Lemma Sent
Lemma
93,99 64,8 93,01 56,42
93,83 63,13 92,96 54,19
93,71 61,45 89,61 40,22
93,35 55,03 89,23 37,71
Vkontakte
POS Sent
POS
Lemma Sent
Lemma
92,42 65,85 91,69 61,09
92,39 64,08 90,97 60,21
92,29 63,56 88,65 52,64
91,49 61,44 90,97 48,94
Average
POS Sent
POS
Lemma Sent
Lemma
93,39 65,29 92,22 58,21
93,08 62,71 91,81 56,49
92,64 61,01 89,32 45,18
92,57 58,40 88,47 44,78
Fiction
POS Sent
POS
Lemma Sent
Lemma
94,16 65,23 92,01 57,11
92,87 60,91 91,46 55,08
92,16 60,15 90,28 45,18
92,40 56,60 88,84 35,28
01 June 2017Dialogue 201715
Additional Experiments
# Team POS Sent POS Lemma Sent Lemma
1 AOT-2017 93,64 63,93 92,76 59,39
2 C 93,39 65,29
3 AOT-2012 93,08 62,71 92,22 58,21
4 w/o guesser 93,00 60,68 92,21 56,59
5 H 92,64 58,4 80,71 25,01
6 A 92,57 61,01 91,81 56,49
7 w/o AOT 91,13 53,48 80,72 24,54
8 w/o AOT, guesser 88,43 45,60 86,74 40,46
01 June 2017Dialogue 201716
Conclusions
● Solid, high-performance approach
● Robustness across different text sources
● Highly sensitive to dictionary quality
● Efforts on corpus and dictionary unification could
further improve the performance of the presented
approach
Thanks for your attention!
Questions?
E-mail: kazennikov@iqmen.ru
Code and data available at:
https://github.com/kzn/morphoRuEval

More Related Content

Recently uploaded

Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Luciferase in rDNA technology (biotechnology).pptx
Luciferase in rDNA technology (biotechnology).pptxLuciferase in rDNA technology (biotechnology).pptx
Luciferase in rDNA technology (biotechnology).pptxAleenaTreesaSaji
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.PraveenaKalaiselvan1
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physicsvishikhakeshava1
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCEPRINCE C P
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Scheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docxScheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docxyaramohamed343013
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡anilsa9823
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
The Black hole shadow in Modified Gravity
The Black hole shadow in Modified GravityThe Black hole shadow in Modified Gravity
The Black hole shadow in Modified GravitySubhadipsau21168
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzohaibmir069
 
TOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsTOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsssuserddc89b
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfSELF-EXPLANATORY
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfSwapnil Therkar
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trssuser06f238
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhousejana861314
 

Recently uploaded (20)

Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
Luciferase in rDNA technology (biotechnology).pptx
Luciferase in rDNA technology (biotechnology).pptxLuciferase in rDNA technology (biotechnology).pptx
Luciferase in rDNA technology (biotechnology).pptx
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
 
Work, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE PhysicsWork, Energy and Power for class 10 ICSE Physics
Work, Energy and Power for class 10 ICSE Physics
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Scheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docxScheme-of-Work-Science-Stage-4 cambridge science.docx
Scheme-of-Work-Science-Stage-4 cambridge science.docx
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
The Black hole shadow in Modified Gravity
The Black hole shadow in Modified GravityThe Black hole shadow in Modified Gravity
The Black hole shadow in Modified Gravity
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistan
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
TOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physicsTOPIC 8 Temperature and Heat.pdf physics
TOPIC 8 Temperature and Heat.pdf physics
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 tr
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 

Featured

Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at WorkGetSmarter
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...DevGAMM Conference
 
Barbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationErica Santiago
 
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellGood Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellSaba Software
 

Featured (20)

Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 
More than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike RoutesMore than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike Routes
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
 
Barbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy Presentation
 
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them wellGood Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
Good Stuff Happens in 1:1 Meetings: Why you need them and how to do them well
 

MorphoRuEval-2017. Part-of-Speech Tagging: The Power of the Linear SVM-based Filtration Method for Russian Language

  • 1. Part-of-Speech Tagging: The Power of the Linear SVM-based Filtration Method for Russian Language Anton Kazennikov IQMEN LLC
  • 2. 01 June 2017Dialogue 20172 MorphoRuEval Shared Task Goals ● Extended POS Tagging ● POS ● Morphological features ● Lemmatization ● Different text sources ● News ● Social media texts ● Fiction texts
  • 3. 01 June 2017Dialogue 20173 Shared Task Tagset # Category Features 1 POS NOUN, PROPN, ADJ, PRON, NUM, VERB, ADV, DET, CONJ*, ADP, PART*, H*, INT*, PUNCT* 2 Case Num, Gen, Dat, Acc, Loc, Ins 3 Number Sing, Plur 4 Gender Masc, Fem, Neut 5 Animacy Anim*, Inan* 6 Tense Past, Notpast 7 Person 1, 2, 3 8 VerbForm Inf, Fin, Conv 9 Mood Ind, Imp 10 Variant Short/Brev 11 Degree Pos, Cmp 12 NumForm Digit * - non-evaluated features
  • 4. 01 June 2017Dialogue 20174 Training Corpora Corpus Tokens Unique Lemmas Unique Feature sets Unique words GICR 1M 43k 303 115k SynTagRus 0.9M 43k 250 104k RNC 1.2M 53k 557 127k OpenCorpora 0.4M 42.5k 337 79k
  • 5. 01 June 2017Dialogue 20175 Proposed approach ● Knowledge-based + Machine Learning ● Unified solution for both tasks ● Based on normalizing substitution <WF ending, NF Form ending, Feature set> ● Two stages: 1) Generate candidate parses for each word 2) Select optimal parse in left-to-right manner
  • 6. 01 June 2017Dialogue 20176 Candidate Generation ● Close to classical morphological analysis: based on AOT dictionary ● Guesser: FSA on reversed words (split stem + ending) Algorithm: 1) Collect parses from AOT Dictionary 2) Collect parses from corpus-based dictionary (GICR) 3) Collect parses from hand-generated dictionary (~50 entries) 4) If no parses found, then guess them
  • 7. 01 June 2017Dialogue 20177 AOT Dictionary ● Based on A. A. Zaliznyak’s morphological dictionary ● 175k entries ● 4.6M wordforms Entry ● Entry-wide features ● Stem ● Paradigm Paradigm – a sorted list of endings ● First ending – normal form ending Ending ● Ending string ● Ending features
  • 8. 01 June 2017Dialogue 20178 AOT Dictionary Conversion Preprocessing ● Feature-based mapping ● Paradigm transformations ● Verb splitting (participles → Adj) ● Adj/Cmp → Adv ● Immutable nouns processing ● Induces some tolerable noise Conversion procedure ● Build partially converted dictionary ● Filter through GICR+SynTagRus parses ● Keep only unambiguous substs ● Converts 1.6M of 4.6M wordforms ● Various corpora/AOT mismatches: ● Some nouns have Anim/Inan ambiguity - “Земля” vs. “земля” ● Gender-No Gender: мр-жр - “пьяница” ● Proper names vs. locations
  • 9. 01 June 2017Dialogue 20179 Parse Filtering sent[i] – i-th word parses[i] – set of parses for i-th word for i=0 to len(sent): context = extract_context(sent, parses, i) scores = score(parses[i], context) max_parse = argmax(parses[i], scores) sent[i].feats = max_parse.feats sent[i].lemma = max_parse.lemma
  • 10. 01 June 2017Dialogue 201710 Parse Filtering: Scorer construction ● Score – sum of scores of each features ● Multi-class SVM Classifier on context features ● Trained on each group of features ● Scores POS and morphological features separately ● Hash kernel for model size reduction dot(class,w,x) = sum(w_class[i] x[i]) → dot(class,w,x) = sum(w[hash(i, class)] x[i])
  • 11. 01 June 2017Dialogue 201711 Parse Filtering: Context Features Window of ±3 words around current word features: ● Prefixes, suffixes, wordform itself ● Stems and endings ● POS tag and features combinations ● Ambiguity classes for unparsed context ● Graphical features: capitalization, punctuation etc. ● Bigram and trigram feature schemes ● ~ 150 features for each word Ambiguity class for unparsed context - a sorted set of candidate features: “gen/acc” for case ambiguity
  • 12. 01 June 2017Dialogue 201714 Shared Task Results on Closed Track News POS Sent POS Lemma Sent Lemma 93,99 64,8 93,01 56,42 93,83 63,13 92,96 54,19 93,71 61,45 89,61 40,22 93,35 55,03 89,23 37,71 Vkontakte POS Sent POS Lemma Sent Lemma 92,42 65,85 91,69 61,09 92,39 64,08 90,97 60,21 92,29 63,56 88,65 52,64 91,49 61,44 90,97 48,94 Average POS Sent POS Lemma Sent Lemma 93,39 65,29 92,22 58,21 93,08 62,71 91,81 56,49 92,64 61,01 89,32 45,18 92,57 58,40 88,47 44,78 Fiction POS Sent POS Lemma Sent Lemma 94,16 65,23 92,01 57,11 92,87 60,91 91,46 55,08 92,16 60,15 90,28 45,18 92,40 56,60 88,84 35,28
  • 13. 01 June 2017Dialogue 201715 Additional Experiments # Team POS Sent POS Lemma Sent Lemma 1 AOT-2017 93,64 63,93 92,76 59,39 2 C 93,39 65,29 3 AOT-2012 93,08 62,71 92,22 58,21 4 w/o guesser 93,00 60,68 92,21 56,59 5 H 92,64 58,4 80,71 25,01 6 A 92,57 61,01 91,81 56,49 7 w/o AOT 91,13 53,48 80,72 24,54 8 w/o AOT, guesser 88,43 45,60 86,74 40,46
  • 14. 01 June 2017Dialogue 201716 Conclusions ● Solid, high-performance approach ● Robustness across different text sources ● Highly sensitive to dictionary quality ● Efforts on corpus and dictionary unification could further improve the performance of the presented approach
  • 15. Thanks for your attention! Questions? E-mail: kazennikov@iqmen.ru Code and data available at: https://github.com/kzn/morphoRuEval