Polyglot-NER: Massive Multilingual
Named Entity Recognition
SDM
May 2, 2015
Rami Al-Rfou, Vivek Kulkarni, Bryan Perozzi, Steve Skiena
Stony Brook University
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
Named Entity Recognition (NER) Problem
■Input:
Plain text, T
■Output:
The spans of T that constitute proper names,
and the classification of the entity’s type.
NER Examples
Input: Vancouver is a coastal seaport city on the mainland
of British Columbia. The city's mayor is Gregor Robertson.
Output: [Vancouver: Location] is a coastal seaport city on the mainland
of [British Columbia: Location]. The city's mayor is [Gregor Robertson: Person].
Multilingual NER
❑NLTK
■ English
❑Stanford
■ English, Spanish,
Chinese, Arabic
❑OpenNLP
■ English, German, Dutch,
Spanish
❑Polyglot-NER
■ 40 Major Languages!
(English, Spanish, French, German,
Russian, Polish, Portuguese, Italian,
Dutch, Arabic, Hebrew, Hindi, Korean,
Japanese, Vietnamese, …)
While many pipelines exist, most languages are unsupported
Does Multilingual Matter?
Yes!
Only 55% of the top 10 million websites are in English! [1]
There are 51 languages on Wikipedia with 100,000+
articles. [2]
[1] http://w3techs.com/technologies/history_overview/content_language/ms/y
[2] http://meta.wikimedia.org/wiki/List_of_Wikipedias
Multilingual is Hard
Feature Scarcity
NLP tasks typically rely on language-specific feature engineering
❑ Orthographic features
❑ Part of Speech Tags
❑ Parallel Corpora
❑ WordNet
Our solution: neural word embeddings.
Annotation Scarcity
Need NER examples; labeled data is expensive.
Our solution: Wikipedia/Freebase for training examples.
Sub-problem: Word Representation
Input: Unstructured text
Output: Low dimensional word embeddings
Distributed Word Representations
Big Idea: Give similar words similar representations
[Figure: the words pine, oak, rose, daisy, reading, writing, read, write shown first over |V| explicit dimensions (|V|: size of vocabulary), then over d latent dimensions, with d << |V|]
Similar words share similar representations.
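The contrast between explicit and latent dimensions can be illustrated with a small sketch. The vocabulary is the one on the slide; the 2-dimensional vectors are invented for illustration and are not taken from the Polyglot embeddings.

```python
import math

# Toy vocabulary taken from the slide.
VOCAB = ["pine", "oak", "rose", "daisy", "reading", "writing", "read", "write"]


def one_hot(word):
    """Explicit representation: one dimension per vocabulary word (|V| dimensions)."""
    return [1.0 if w == word else 0.0 for w in VOCAB]


# Hypothetical latent representation with d = 2 << |V| = 8.
# These numbers are made up so that plants and reading/writing words cluster.
EMBEDDINGS = {
    "pine": [0.9, 0.1], "oak": [0.8, 0.2],
    "rose": [0.7, 0.4], "daisy": [0.6, 0.5],
    "reading": [0.1, 0.9], "writing": [0.2, 0.8],
    "read": [0.15, 0.95], "write": [0.25, 0.85],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# With one-hot vectors, every pair of distinct words is equally dissimilar.
one_hot_sim = cosine(one_hot("pine"), one_hot("oak"))
# With dense vectors, related words can be close and unrelated words far apart.
dense_sim = cosine(EMBEDDINGS["pine"], EMBEDDINGS["oak"])
dense_far = cosine(EMBEDDINGS["pine"], EMBEDDINGS["write"])
```

In the explicit space "pine" and "oak" have similarity 0, exactly like any other pair of distinct words; in the latent space their similarity is higher than that of "pine" and "write".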
Polyglot Embeddings
● Wikipedia article text
● 137 Languages
● Available:
○ http://bit.ly/embeddings
[Al-Rfou, Perozzi, Skiena, 13]
[Figure: network scoring the window "Imagination is greater than detail": each word is looked up in C (projection layer), followed by a hidden layer H and a score output]
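The layer stack named in the figure (projection C, hidden layer H, score) can be sketched as a forward pass. The sizes, the random weights, the tanh activation, and the names `w_out` and `score` are assumptions for illustration; the slide does not specify them, and no training step is shown.

```python
import math
import random

random.seed(0)

# Window of words from the figure.
WINDOW = ["Imagination", "is", "greater", "than", "detail"]
D, HIDDEN = 4, 3  # assumed toy sizes, not the sizes of the released embeddings

# C: embedding table (projection layer), one D-dimensional row per word.
C = {w: [random.uniform(-0.1, 0.1) for _ in range(D)] for w in WINDOW}
# H: hidden-layer weights applied to the concatenated window vectors.
H = [[random.uniform(-0.1, 0.1) for _ in range(D * len(WINDOW))]
     for _ in range(HIDDEN)]
# w_out: weights that reduce the hidden activations to a single score.
w_out = [random.uniform(-0.1, 0.1) for _ in range(HIDDEN)]


def score(window):
    """Project each word through C, apply the hidden layer, return a scalar score."""
    x = [value for word in window for value in C[word]]
    h = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in H]
    return sum(w * a for w, a in zip(w_out, h))


window_score = score(WINDOW)
```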
Sub-Problem: Annotation Mining
Input: Wikipedia, Freebase
Output: Labeled NER training examples
Related Work
Annotations from Wikipedia
Inter-wiki links are a great potential source of mentions.
Freebase tells us which articles are entity articles.
[Diagram: Wikipedia, Freebase]
Example
Wiki Text:
Vancouver is a coastal seaport city on the mainland of
British Columbia. The city's mayor is Gregor Robertson.
String             | Freebase MID | Freebase Category | NER Label
“Vancouver”        | /m/080h2     | City              | Location
“British Columbia” | /m/015jr     | Region            | Location
“Gregor Robertson” | /m/0grlms    | Person            | Person
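The mapping in this example can be sketched in a few lines. The dictionary `CATEGORY_TO_LABEL` covers only the three categories shown here, and the string matching in `label_tokens` is a stand-in for the link offsets that real wiki markup would provide; both are assumptions for illustration.

```python
# Hypothetical mapping from Freebase categories to coarse NER labels;
# only the categories that appear in the example are listed.
CATEGORY_TO_LABEL = {"City": "Location", "Region": "Location", "Person": "Person"}

# Linked mentions from the example: (anchor text, Freebase MID, Freebase category).
LINKS = [
    ("Vancouver", "/m/080h2", "City"),
    ("British Columbia", "/m/015jr", "Region"),
    ("Gregor Robertson", "/m/0grlms", "Person"),
]

TEXT = ("Vancouver is a coastal seaport city on the mainland of "
        "British Columbia. The city's mayor is Gregor Robertson.")


def label_tokens(text, links):
    """Tag the first occurrence of each linked anchor; all other tokens get "O"."""
    tokens = text.split()
    labels = ["O"] * len(tokens)
    for anchor, _mid, category in links:
        anchor_tokens = anchor.split()
        n = len(anchor_tokens)
        for i in range(len(tokens) - n + 1):
            # Ignore sentence punctuation attached to a token.
            window = [t.strip(".,") for t in tokens[i:i + n]]
            if window == anchor_tokens:
                for j in range(i, i + n):
                    labels[j] = CATEGORY_TO_LABEL[category]
                break  # only the first mention is treated as linked
    return list(zip(tokens, labels))


LABELED = label_tokens(TEXT, LINKS)
```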
The Bad News
Many false negatives in our dataset!
■ Wikipedia editors annotate only the first mention of
an entity but not later ones.
■ Most of the named entity mentions are not linked!
Example:
Vancouver is a coastal seaport city on the
mainland of British Columbia. Vancouver’s
mayor is Gregor Robertson.
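A later slide names "Exact Matching" without describing it. One plausible reading, sketched here as an assumption rather than the authors' procedure, is to copy the label of a linked mention to later unlinked tokens with the same surface form. The helpers `normalize` and `propagate_exact_matches` are hypothetical, and the matching is done token by token for brevity.

```python
def normalize(token):
    """Strip trailing punctuation and a possessive suffix from a token."""
    token = token.strip(".,")
    for suffix in ("'s", "’s"):
        if token.endswith(suffix):
            token = token[: -len(suffix)]
    return token


def propagate_exact_matches(tokens, labels):
    """Give unlabeled tokens the label of an identical, already labeled token."""
    known = {}
    for token, label in zip(tokens, labels):
        if label != "O":
            known.setdefault(normalize(token), label)
    return [
        known.get(normalize(token), "O") if label == "O" else label
        for token, label in zip(tokens, labels)
    ]


TEXT = ("Vancouver is a coastal seaport city on the mainland of British Columbia. "
        "Vancouver’s mayor is Gregor Robertson.")
TOKENS = TEXT.split()

# Labels from linked mentions only: the second "Vancouver" is not linked.
LABELS = ["O"] * len(TOKENS)
for index, label in {0: "Location", 10: "Location", 11: "Location",
                     15: "Person", 16: "Person"}.items():
    LABELS[index] = label

RESULT = propagate_exact_matches(TOKENS, LABELS)
```

After propagation, the token "Vancouver’s" carries the Location label that the link-based annotation missed.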
The Good News
Positive labels are very
high quality!
Need to emphasize this in
our training.
‘Learning classifiers from only positive and unlabeled data’ [Elkan & Noto, 08]
The trick: Oversampling
We can change the label distribution by oversampling from the positive labels.
p is the fraction of positive labels in the training dataset.
[Figure: results initially with no oversampling, and with p = 0.5 (much better)]
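A minimal sketch of oversampling to a target fraction p of positive labels, assuming sampling with replacement and a label "O" for non-entity examples. The function name `oversample_positives` and the toy data are hypothetical; the slides do not say how the sampling was implemented.

```python
import random


def oversample_positives(examples, p, seed=0):
    """Return a shuffled list in which about a fraction p of examples are positive.

    examples: list of (features, label) pairs; label "O" marks a non-entity.
    """
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    rng = random.Random(seed)
    positives = [e for e in examples if e[1] != "O"]
    negatives = [e for e in examples if e[1] == "O"]
    if not positives:
        raise ValueError("no positive examples to sample from")
    # Solve positives / (positives + negatives) = p for the number of positives.
    target = round(p * len(negatives) / (1 - p))
    extra = max(0, target - len(positives))
    # Draw the extra positives with replacement from the existing ones.
    sampled = positives + [rng.choice(positives) for _ in range(extra)]
    combined = sampled + negatives
    rng.shuffle(combined)
    return combined


# Toy dataset: 10 positive and 90 negative examples (made-up data).
DATA = [(i, "Person") for i in range(10)] + [(i, "O") for i in range(90)]
BALANCED = oversample_positives(DATA, p=0.5)
POSITIVE_COUNT = sum(1 for _, label in BALANCED if label != "O")
```

With p = 0.5 the 10 positives are resampled up to 90, so the balanced set holds 180 examples, half of them positive.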
Cross-Domain Performance
[Figure: cross-domain testing on CoNLL; panels: Oversampling, Oversampling + Exact Matching]
NER Demo
@ http://bit.ly/polyglot-ner
Legend: Location Organization Person
But How to Evaluate?
■We have labeled data for a few languages
■Would like to evaluate everything
Distant Evaluation
John proviene de la ciudad de Nueva York.
John is coming from New York City.
(The two sentences are related by machine translation.)
Entity counts for the two sentences:
Person: 1, Location: 1, Organization: 0
Person: 0, Location: 1, Organization: 1
Calculate the error of omitting entities and the error of adding entities: here 1 and 1.
Experimental Design
Distant Evaluation for Polyglot-NER:
1. Annotate English Wikipedia sentences using Stanford NER.
2. Randomly pick 1500 sentences that have at least one entity detected.
3. Translate these sentences into 40 languages using Google Translate.
4. Run Polyglot-NER on the translated datasets.
5. Compare, per sentence, the number of entity chunks our annotators found to the
number detected by Stanford NER.
6. Calculate the error of omitting entities (ℰ_𝓜) and the error of adding entities (ℰ_𝒜).
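The slides do not give a formula for the two errors. The sketch below assumes one plausible definition: sum the shortfall (or surplus) of detected entities against the reference counts and normalize by the reference total. The function name `distant_errors` and the normalization are assumptions, as is the choice of which set of counts is the reference.

```python
def distant_errors(reference_counts, predicted_counts):
    """Return (missing_error, adding_error) under the assumed definition.

    reference_counts: entity counts from the reference tagger, one per unit
                      (for example per sentence, or per sentence and category).
    predicted_counts: counts from the system under evaluation, aligned with
                      reference_counts.
    """
    if len(reference_counts) != len(predicted_counts):
        raise ValueError("count lists must be aligned")
    total = sum(reference_counts)
    if total == 0:
        raise ValueError("reference contains no entities")
    # Entities the reference found that the system did not.
    missing = sum(max(0, r - p) for r, p in zip(reference_counts, predicted_counts))
    # Entities the system found beyond what the reference reported.
    added = sum(max(0, p - r) for r, p in zip(reference_counts, predicted_counts))
    return missing / total, added / total


# Counts in the order Person, Location, Organization, assuming the first list
# is the reference annotation and the second the system output.
MISSING_ERROR, ADDING_ERROR = distant_errors([1, 1, 0], [0, 1, 1])
```

Because only counts are compared, this measure does not check that the detected spans refer to the same entities.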
Effect of Data Size
■ Size of training data
matters!
■ Tokenization is quite
important when the
word embeddings
coverage is limited.
[Figure: missing error versus number of words (log scale); annotated regions: "More Data Will Help", "Anomalies", "Good"]
Performance by Category
[Figure: adding error (ℰ_𝒜) and missing error (ℰ_𝓜) by category; panels: Person, Location]
Limitations
■Named entities don’t always translate well:
❑Ex: “Γείτονας Shanna Rudd δήλωσε στο CNN …”
■Need a working translation system for the language
Take-aways
■NER in 40 languages!
■Word embeddings & oversampling offer performance equal to
or better than feature engineering for
NER annotation mining.
■Translation-based evaluation?
Thanks!
NER Demo: http://bit.ly/polyglot-ner
NER Code: http://polyglot-nlp.com
bperozzi@cs.stonybrook.edu
www.perozzi.net
Bryan Perozzi

Massive Multilingual Named Entity Recognition Using Polyglot-NER

  • 1. Polyglot-NER: Massive Multilingual Named Entity Recognition SDM May 2, 2015 Rami Al-Rfou, Vivek Kulkarni, Bryan Perozzi, Steve Skiena Stony Brook University
  • 2. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition Named Entity Recognition (NER) Problem ■Input: Plain text, T ■Output: The spans of T that constitute proper names, and the classification of the entity’s type.
  • 3. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition NER Examples Input: Vancouver is a coastal seaport city on the mainland of British Columbia. The city's mayor is Gregor Robertson. Output: Vancouver is a coastal seaport city on the mainland of British Columbia. The city's mayor is Gregor Robertson. Location Location Person
  • 4. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition Multilingual NER ❑NLTK ■ English ❑Stanford ■ English, Spanish, Chinese, Arabic ❑OpenNLP ■ English, German, Dutch, Spanish ❑Polyglot-NER ■ 40 Major Languages! (English, Spanish, French, German, Russian, Polish, Portuguese, Italian, Dutch, Arabic, Hebrew, Hindi, Korean, Japanese, Vietnamese, …) While many pipelines exist, most languages are unsupported
  • 5. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition Does Multilingual Matter? Yes! Only 55% of the top 10 million websites are in English! [1] There are 51 languages on Wikipedia with 100,000+ articles. [2] [1] http://w3techs.com/technologies/history_overview/content_language/ms/y [2] http://meta.wikimedia.org/wiki/List_of_Wikipedias
  • 6. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition Multilingual is Hard Feature Scarcity NLP tasks typically rely on language-specific feature engineering ❑ Orthographic features ❑ Part of Speech Tags ❑ Parallel Corpora ❑ WordNet Annotation Scarcity Need NER examples - labeled data is expensive. Our solution: neural word embeddings. Our solution: Wikipedia/Freebase for training examples
  • 7. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition Sub-problem: Word Representation Input: Unstructured text Output: Low dimensional word embeddings
  • 8. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition Distributed Word Representations Big Idea: Give similar words similar representations pine oak rose daisy reading writing read write |V| |V|: size of vocabulary pine oak rose daisy reading writing read write d d << |V| Similar words share similar representations. Latent Dimensions Explicit Dimensions
  • 9. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition Polyglot Embeddings ● Wikipedia article text ● 137 Languages ● Available: ○ http://bit.ly/embeddings [Al-Rfou, Perozzi, Skiena, 13] C Imagination C is C greater C than C detail Score Hidden Layer H Projection Layer
  • 10. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition Sub-Problem: Annotation Mining Input: Wikipedia, Freebase Output: Labeled NER training examples
  • 11. Related Work
  • 12. Annotations from Wikipedia. Inter-wiki links are a great potential source of mentions. Freebase tells us which articles are entity articles. [Figure: Wikipedia and Freebase.]
  • 13. Example. Wiki text: "Vancouver is a coastal seaport city on the mainland of British Columbia. The city's mayor is Gregor Robertson." Table (string, Freebase MID, Freebase category, NER label): “Vancouver”, /m/080h2, City, Location; “British Columbia”, /m/015jr, Region, Location; “Gregor Robertson”, /m/0grlms, Person, Person.
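The mapping in this example (linked string, Freebase MID, Freebase category, NER label) can be sketched as a chain of lookups. This is an illustration under stated assumptions rather than the authors' pipeline: the three tables hold only the values shown on the slide, the `(text, linked_article)` segment format and the function `label_segments` are hypothetical, and unlinked tokens are assumed to receive the label "O".

```python
# Lookup tables copied from the slide's example; a real pipeline would
# build them from Wikipedia and Freebase dumps.
ARTICLE_TO_MID = {
    "Vancouver": "/m/080h2",
    "British Columbia": "/m/015jr",
    "Gregor Robertson": "/m/0grlms",
}
MID_TO_CATEGORY = {"/m/080h2": "City", "/m/015jr": "Region", "/m/0grlms": "Person"}
CATEGORY_TO_LABEL = {"City": "Location", "Region": "Location", "Person": "Person"}


def label_segments(segments):
    """Turn (text, linked_article_or_None) segments into (token, label) pairs.

    Tokens outside a link, or linked to an article with no known entity
    category, get the label "O" (outside any entity).
    """
    labeled = []
    for text, article in segments:
        mid = ARTICLE_TO_MID.get(article)
        category = MID_TO_CATEGORY.get(mid)
        label = CATEGORY_TO_LABEL.get(category, "O")
        for token in text.split():
            labeled.append((token, label))
    return labeled


# The slide's sentence, split by hand into linked and unlinked segments.
SEGMENTS = [
    ("Vancouver", "Vancouver"),
    ("is a coastal seaport city on the mainland of", None),
    ("British Columbia", "British Columbia"),
    (". The city's mayor is", None),
    ("Gregor Robertson", "Gregor Robertson"),
    (".", None),
]
LABELED = label_segments(SEGMENTS)
```

Labeling every unlinked token "O" is the simplest choice, and it is exactly what turns an unlinked mention of an entity into a false negative.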
  • 14. The Bad News. Many false negatives in our dataset! Wikipedia editors annotate only the first mention of an entity but not later ones. Most of the named entity mentions are not linked! Example: "Vancouver is a coastal seaport city on the mainland of British Columbia. Vancouver’s mayor is Gregor Robertson."
  • 15. The Good News. Positive labels are very high quality! Need to emphasize this in our training. ‘Learning Classifiers from Only Positive and Unlabeled Examples’ [Elkan & Noto, 08]
  • 16. The Trick: Oversampling. We can change the label distribution by oversampling from the positive labels. p is the percentage of positive labels in the training dataset. [Figure: results with no oversampling initially, then with p = 0.5, which is much better.]
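One simple way to realize oversampling toward a target positive fraction p is to repeat the positive examples until they make up that share of the training set. The sketch below is an assumption about the mechanics, not the authors' implementation; `oversample` and its arguments are hypothetical names, and a real training loop would also shuffle the result.

```python
import itertools
import math


def oversample(positives, negatives, p):
    """Return a training list in which roughly a fraction p is positive.

    Positives are repeated (cycled) as often as needed; negatives are kept
    once each. Positives are never dropped, so the fraction can exceed p
    when the data already contains more positives than required.
    """
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    if not positives:
        raise ValueError("need at least one positive example")
    # Solve n_pos / (n_pos + n_neg) = p for n_pos.
    target_pos = math.ceil(p * len(negatives) / (1 - p))
    target_pos = max(target_pos, len(positives))
    repeated = list(itertools.islice(itertools.cycle(positives), target_pos))
    return repeated + list(negatives)


# Toy data: 2 entity tokens among 18 non-entity tokens (10% positive).
positives = [("Vancouver", "Location"), ("Robertson", "Person")]
negatives = [("word%d" % i, "O") for i in range(18)]
balanced = oversample(positives, negatives, p=0.5)
```

With p = 0.5 the two positives are repeated until they match the 18 negatives, so the classifier sees entity and non-entity tokens equally often.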
  • 17. Cross-Domain Performance. [Figure: cross-domain testing on CoNLL, comparing oversampling with oversampling + exact matching.]
  • 18. NER Demo: http://bit.ly/polyglot-ner (legend: Location, Organization, Person)
  • 19. But How to Evaluate? We have labeled data for a few languages. We would like to evaluate everything.
  • 20. Distant Evaluation. Machine translation pair: "John proviene de la ciudad de Nueva York." / "John is coming from New York City." Calculate the error of omitting entities and the error of adding entities. [Figure: entity counts for the two sentences, Person: 1, Location: 1, Organization: 0 versus Person: 0, Location: 1, Organization: 1, with resulting errors 1 and 1.]
  • 21. Experimental Design. Distant evaluation for Polyglot-NER: 1. Annotate English Wikipedia sentences using Stanford NER. 2. Randomly pick 1500 sentences that have at least one entity detected. 3. Translate these sentences using Google Translate to 40 languages. 4. Run Polyglot-NER on the translated datasets. 5. Compare the number of entity chunks our annotators found to the ones detected by Stanford per sentence. 6. Calculate the error of omitting entities (ℰ_𝓜) and the error of adding entities (ℰ_𝒜).
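The per-sentence comparison in steps 5 and 6 can be sketched as a difference of entity counts per category. The slides do not give the formulas for the omitting and adding errors, so the version below is an assumption: it returns raw counts of omitted and added entities for one sentence pair and leaves out any normalization. `distant_errors` is a hypothetical name, and treating the first set of counts as the reference is also an assumption.

```python
CATEGORIES = ("Person", "Location", "Organization")


def distant_errors(reference_counts, predicted_counts):
    """Return (omitted, added) entity counts for one sentence pair.

    omitted: entities present in the reference but missing in the prediction.
    added:   entities present in the prediction but absent from the reference.
    Both are summed over categories.
    """
    omitted = 0
    added = 0
    for category in CATEGORIES:
        ref = reference_counts.get(category, 0)
        pred = predicted_counts.get(category, 0)
        omitted += max(0, ref - pred)
        added += max(0, pred - ref)
    return omitted, added


# Counts from the "John is coming from New York City" example.
reference = {"Person": 1, "Location": 1, "Organization": 0}
predicted = {"Person": 0, "Location": 1, "Organization": 1}
ERRORS = distant_errors(reference, predicted)  # (1, 1): one omitted, one added
```

Counting per category rather than in total matters here: the two sentences contain the same number of entities overall, yet one Person is missed and one Organization is spurious.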
  • 22. Effect of Data Size. Size of training data matters! Tokenization is quite important when the word embeddings' coverage is limited. [Figure: missing error versus number of words (log scale), with regions annotated "More Data Will Help", "Anomalies", and "Good".]
  • 23. Performance by Category. [Figure: adding error (ℰ_𝒜) and missing error (ℰ_𝓜) for the Person and Location categories.]
  • 24. Limitations. Named entities don’t always translate well, e.g. “Γείτονας Shanna Rudd δήλωσε στο CNN …”. Need a working translation system for the language.
  • 25. Take-aways. NER in 40 languages! Word embeddings & oversampling offer performance equal to or better than feature engineering for NER annotation mining. Translation-based evaluation?
  • 26. Thanks! NER Demo: http://bit.ly/polyglot-ner NER Code: http://polyglot-nlp.com Contact: Bryan Perozzi, bperozzi@cs.stonybrook.edu, www.perozzi.net