Combining Knowledge and CRF-based Approach to Named Entity Recognition in Russian
Mozharova V. A.,
Loukachevitch N. V.
Lomonosov Moscow State University
Named entity recognition task
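A named entity is a word or phrase that denotes a specific object or event and distinguishes it from other similar objects.
1. Президент [Владимир Путин] PER 17 декабря провел традиционную пресс-конференцию перед Новым Годом.
   (President [Vladimir Putin] PER held his traditional pre-New Year press conference on 17 December.)
2. Студенты и Татьяны получат эксклюзивный пропуск на Главный каток страны.
   (Students and Tatyanas will receive an exclusive pass to the country's Main Skating Rink.)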
Methods:
1. Machine learning
2. Rule-based approach
3. Combination
Related works
• English
  • Many works
  • Evaluations (MUC, CoNLL, ACE, …)
• Russian
  • Machine learning (CRF)
    • (Antonova, Soloviev, 2013; Podobryaev, 2013; Gareev, 2013)
    • Most works use the authors' own collections
  • Rule-based
    • (Trofimov, 2014)
    • Open collection “Persons-1000”
Outline
Our approach: CRF-based Named Entity recognition
Features:
• Token-based
• Lexicon-based
• Context-based
Labeling representation
• IO-scheme
• BIO-scheme
Experiments on open collections
• Persons-1000
• Persons-1111F (Eastern names)
CRF-based machine learning
A conditional random field (CRF) is a probabilistic model for labeling sequential data.
• CRF++ (open-source implementation)
Preprocessing
• Morphological analyzer (POS tagging, lemmatization, gender and grammatical case tagging)
Scheme of text processing
Text → Feature extraction (token-based, lexicon-based, context-based) → CRF → Name extraction
Token features
The most traditional features:
1. Token initial form (lemma)
2. Number of characters in a token
3. Letter case: BigBig, BigSmall, SmallSmall, Fence
4. Token type
• part of speech
• type of punctuation
5. The presence of a vowel (a binary feature)
6. Whether a token contains a known letter n-gram from a predefined set:
• Кузнецов, Матвиенко, Джугашвили
• Госдепартамент, Газпром
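The token features listed above can be sketched in Python roughly as follows. This is an illustration, not the authors' code. The reading of the case classes (BigBig = all capitals, BigSmall = capitalized, SmallSmall = lowercase, Fence = mixed), the extra `NoLetters` class, and the contents of `KNOWN_NGRAMS` are assumptions; the slides do not list the actual n-gram set.

```python
RUSSIAN_VOWELS = set("аеёиоуыэюя")
LATIN_VOWELS = set("aeiouy")

# Hypothetical n-gram set; the real predefined set is not given in the slides.
KNOWN_NGRAMS = {"швили", "енко", "пром"}


def letter_case(token: str) -> str:
    """Map a token to a letter-case class (class meanings are assumed)."""
    letters = [ch for ch in token if ch.isalpha()]
    if not letters:
        return "NoLetters"  # assumed extra class for punctuation and digits
    if len(letters) > 1 and all(ch.isupper() for ch in letters):
        return "BigBig"
    if letters[0].isupper() and all(ch.islower() for ch in letters[1:]):
        return "BigSmall"
    if all(ch.islower() for ch in letters):
        return "SmallSmall"
    return "Fence"


def has_vowel(token: str) -> bool:
    """Binary feature: does the token contain at least one vowel?"""
    return any(ch in RUSSIAN_VOWELS or ch in LATIN_VOWELS for ch in token.lower())


def token_features(token: str, lemma: str, token_type: str) -> dict:
    """Collect the six token-level features for one token."""
    lowered = token.lower()
    return {
        "lemma": lemma,
        "length": len(token),
        "case": letter_case(token),
        "type": token_type,
        "has_vowel": has_vowel(token),
        "known_ngram": any(ngram in lowered for ngram in KNOWN_NGRAMS),
    }
```

The lemma and token type are passed in because they come from the morphological analyzer, which is outside this sketch.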
Features based on lexicons
We used vocabularies that store lists of useful expressions (words or phrases).
Sources:
• Phonebook
• Wikipedia
• RuThes thesaurus (РуТез)
A single feature for each lexicon
Example:
«Набережные[geo2] Челны[geo2]» (Naberezhnye Chelny, a city)
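One way to compute such a lexicon feature is a longest-match lookup over lemmas. The sketch below assumes that the number in a value such as geo2 is the length in words of the matched lexicon entry; the slides do not state this, and the function name, `max_len`, and the `"False"` default are illustrative choices.

```python
def lexicon_feature(lemmas, lexicon, name, max_len=4):
    """Return one feature value per token for a single lexicon.

    lexicon is a set of tuples of lowercase lemmas. A token covered by a
    matched entry of n words gets the value f"{name}{n}" (assumed meaning
    of the numeric suffix); all other tokens get "False".
    """
    features = ["False"] * len(lemmas)
    i = 0
    while i < len(lemmas):
        matched = 0
        # Try the longest candidate phrase first.
        for n in range(min(max_len, len(lemmas) - i), 0, -1):
            if tuple(lemmas[i:i + n]) in lexicon:
                matched = n
                break
        if matched:
            for j in range(i, i + matched):
                features[j] = f"{name}{matched}"
            i += matched
        else:
            i += 1
    return features


geo = {("набережные", "челны"), ("россия",)}
print(lexicon_feature(["набережные", "челны"], geo, "geo"))
```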
Lexicons
Vocabulary          Size (entries)  Description                            Examples
Famous persons      31482           Famous people                          Владимир Путин
First names         2773            First names                            Василий, Анна, Том
Surnames            66108           Surnames                               Кузнецов, Грибоедов
Verbs of informing  1729            Verbs that usually occur with persons  высказать (to express), признаться (to confess)
Companies           33380           Organization names                     Сбербанк
Company types       6774            Organization types                     организация (organization), авиафирма (airline company)
Geography           8969            Geographical objects                   Балтийское море (Baltic Sea)
Equipment           44094           Devices, equipment, tools              устройство (device), телефон (telephone)
Context features and example
Token   Lemma   Register  Token type  Second name  Geo    Label
В       В       Small     Auxiliary   False        False  NO
России  РОССИЯ  BigSmall  Noun        False        Geo1   GEOPOLIT
Алиев   АЛИЕВ   BigSmall  Noun        Sname1       False  PER
третий  ТРЕТИЙ  Small     Numeral     False        False  NO
раз     РАЗ     Small     Auxiliary   False        False  NO
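A table like this maps directly onto the CRF++ input format: one token per line, columns separated by whitespace, the label in the last column, and an empty line between sentences. Context features are then expressed in the CRF++ template file through `%x[row,col]` macros with relative row offsets. The sketch below writes both; the window size and the helper names are assumptions, not the settings used in the presented system.

```python
def to_crfpp(sentences):
    """Serialize sentences to CRF++ column format.

    Each sentence is a list of rows; each row is a list of strings with the
    label last. CRF++ expects the same number of columns on every line.
    """
    widths = {len(row) for sentence in sentences for row in sentence}
    if len(widths) > 1:
        raise ValueError("all rows must have the same number of columns")
    blocks = ["\n".join("\t".join(row) for row in sentence) for sentence in sentences]
    return "\n\n".join(blocks) + "\n"


def crfpp_template(n_feature_columns, window=2):
    """Build a CRF++ template with unigram features over a context window.

    window=2 (an assumed value) uses the two tokens before and after the
    current one for every feature column.
    """
    lines = []
    index = 0
    for col in range(n_feature_columns):
        for offset in range(-window, window + 1):
            lines.append(f"U{index:02d}:%x[{offset},{col}]")
            index += 1
    lines.append("B")  # bigram feature over adjacent output labels
    return "\n".join(lines) + "\n"


rows = [
    ["В", "В", "Small", "Auxiliary", "False", "False", "NO"],
    ["России", "РОССИЯ", "BigSmall", "Noun", "False", "Geo1", "GEOPOLIT"],
]
print(to_crfpp([rows]))
print(crfpp_template(n_feature_columns=6, window=1))
```

The resulting files would be passed to `crf_learn` and `crf_test` from the CRF++ distribution.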
Expert labeling. Brat annotation tool
Labeling representation
IO-scheme (Inside-Outside)
• I - belongs to a named entity
• O - does not belong to a named entity
|C| + 1 classes, where |C| is the number of named entity types
BIO-scheme (Begin-Inside-Outside)
• B - named entity beginning
• I - named entity continuation
• O - not a named entity
2*|C| + 1 classes
Example: Владимир Путин посетил Англию (Vladimir Putin visited England)
Token     IO-labels   BIO-labels
Владимир  I-PER       B-PER
Путин     I-PER       I-PER
посетил   OUTSIDE     OUTSIDE
Англию    I-GEOPOLIT  B-GEOPOLIT
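The two schemes can be illustrated with a small encoder that turns entity spans into per-token labels. This is a sketch; the span representation `(start, end, type)` with an exclusive end and the function names are assumptions made for the example.

```python
def encode(n_tokens, entities, scheme="BIO"):
    """Produce one label per token from (start, end, type) spans (end exclusive)."""
    labels = ["OUTSIDE"] * n_tokens
    for start, end, entity_type in entities:
        for i in range(start, end):
            # Only the BIO-scheme marks the first token of an entity with B.
            prefix = "B" if scheme == "BIO" and i == start else "I"
            labels[i] = f"{prefix}-{entity_type}"
    return labels


def n_classes(n_types, scheme="BIO"):
    """|C| + 1 classes for IO, 2*|C| + 1 classes for BIO."""
    return (2 if scheme == "BIO" else 1) * n_types + 1


# Владимир Путин посетил Англию
spans = [(0, 2, "PER"), (3, 4, "GEOPOLIT")]
print(encode(4, spans, "IO"))
print(encode(4, spans, "BIO"))
```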
IO-labeling: aggregation of tokens into named entities
[Figure: sequences of tokens labeled I-PER aggregated into one or more Person entities; the example token is Петр.]
[Figure: tokens labeled I-ORG aggregated into an Organization entity; repeated occurrences of a token X1, labeled I-PER and OUTSIDE, mapped to Person entities.]
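With IO-labels, entity boundaries have to be restored after classification. The simplest aggregation, sketched below, merges each maximal run of identical I-labels into one entity. It is only a baseline: the additional rules shown in the figures are not recoverable from the transcript and are not reproduced here.

```python
def aggregate_io(labels):
    """Merge maximal runs of the same I-label into (start, end, type) spans.

    The end index is exclusive. Any label that does not start with "I-"
    is treated as outside.
    """
    entities = []
    start = None
    current = None
    # A trailing OUTSIDE sentinel closes an entity that ends the sequence.
    for i, label in enumerate(list(labels) + ["OUTSIDE"]):
        entity_type = label[2:] if label.startswith("I-") else None
        if entity_type != current:
            if current is not None:
                entities.append((start, i, current))
            start, current = i, entity_type
    return entities


print(aggregate_io(["I-PER", "I-PER", "OUTSIDE", "I-GEOPOLIT"]))
```

Two adjacent entities of the same type are merged into one by this procedure, which is the ambiguity that the B label of the BIO-scheme removes.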
Target metric
intersectionCount is the number of named entities labeled by both the classifier and the expert;
classifierCount is the number of named entities labeled only by the classifier;
expertCount is the number of named entities labeled only by the expert.
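The formula on this slide is not preserved in the transcript. Since the results are reported as F-score, a standard F-measure over these counts can be sketched as follows. It reads the definitions literally, so the two "only" counts are added to the intersection to obtain the totals; if classifierCount and expertCount are instead meant as totals, the additions should be dropped. The function and parameter names are illustrative.

```python
def f_score(intersection_count, classifier_only, expert_only):
    """F-measure from entity counts, under the literal reading of the slide.

    intersection_count: entities labeled by both classifier and expert
    classifier_only:    entities labeled only by the classifier
    expert_only:        entities labeled only by the expert
    """
    found = intersection_count + classifier_only   # all classifier entities
    expected = intersection_count + expert_only    # all expert entities
    precision = intersection_count / found if found else 0.0
    recall = intersection_count / expected if expected else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(100 * f_score(8, 2, 2), 2))
```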
Text collections
• “Persons-1000” (1000 news documents)
  • Russian names: Александр Игнатенко, Алексей Волков
• “Persons-1111F” (1111 news documents)
  • Eastern names: Абдалла Халаф, Иттё Ито
We additionally labeled:
• Organizations (ORG)
• Media organizations, whose specific function is providing information (MEDIA)
• Locations (LOC)
• States and capitals in the role of a state (GEOPOLIT)
Experiments on collection “Persons-1000”
F-score, %, three entity types:
NE type   IO     IO + rules  BIO
PER       94.95  95.09       96.08
ORG       80.03  80.23       83.84
LOC       92.60  92.60       94.57
Average   89.54  89.67       91.71
F-score, %, five entity types:
NE type   IO     IO + rules  BIO
PER       94.95  95.01       95.63
ORG       75.90  76.16       80.06
MEDIA     87.95  87.95       87.99
LOC       84.53  84.53       86.91
GEOPOLIT  94.65  94.65       94.50
Average   88.21  88.37       89.93
Cross-validation 3:1
Experiments on collection with Eastern names (Persons-1111F)
Person name extraction
“Persons-1000”: cross-validation 3:1
“Persons-1111F”: training on “Persons-1000”
F-score, %
Collection     Rule-based (Trofimov, 2014)  Our system
Persons-1000   96.62                        96.08
Persons-1111F  64.43                        81.68
Conclusion
• We presented a system for Russian named entity recognition that combines a knowledge-based approach with a CRF classifier
• We tested our system on two open text collections, “Persons-1000” and “Persons-1111F”, and compared our results with a rule-based system
• We compared two labeling schemes for Russian texts: the IO-scheme and the BIO-scheme

Valeriia Mozharova and Natalia Loukachevitch - Combining Knowledge and CRF-based Approach to Named Entity Recognition in Russian
