SlideShare a Scribd company logo
Polyglot-NER: Massive Multilingual
Named Entity Recognition
SDM
May 2, 2015
Rami Al-Rfou, Vivek Kulkarni, Bryan Perozzi, Steve Skiena
Stony Brook University
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
Named Entity Recognition (NER) Problem
■Input:
Plain text, T
■Output:
The spans of T that constitute proper names,
and the classification of the entity’s type.
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
NER Examples
Input: Vancouver is a coastal seaport city on the mainland
of British Columbia. The city's mayor is Gregor Robertson.
Output: Vancouver is a coastal seaport city on the mainland
of British Columbia. The city's mayor is Gregor Robertson.
Location
Location Person
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
Multilingual NER
❑NLTK
■ English
❑Stanford
■ English, Spanish,
Chinese, Arabic
❑OpenNLP
■ English, German, Dutch,
Spanish
❑Polyglot-NER
■ 40 Major Languages!
(English, Spanish, French, German,
Russian, Polish, Portuguese, Italian,
Dutch, Arabic, Hebrew, Hindi, Korean,
Japanese, Vietnamese, …)
While many pipelines exist, most languages are unsupported
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
Does Multilingual Matter?
Yes!
Only 55% of the top 10 million websites are in English! [1]
There are 51 languages on Wikipedia with 100,000+
articles. [2]
[1] http://w3techs.com/technologies/history_overview/content_language/ms/y
[2] http://meta.wikimedia.org/wiki/List_of_Wikipedias
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
Multilingual is Hard
Feature Scarcity
NLP tasks typically rely on
language-specific feature
engineering
❑ Orthographic features
❑ Part of Speech Tags
❑ Parallel Corpora
❑ WordNet
Annotation Scarcity
Need NER examples -
labeled data is expensive.
Our solution: neural word
embeddings.
Our solution:
Wikipedia/Freebase for training
examples
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
Sub-problem: Word Representation
Input: Unstructured text
Output: Low dimensional word embeddings
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
Distributed Word Representations
Big Idea: Give similar words similar representations
pine
oak
rose
daisy
reading
writing
read
write
|V|
|V|: size of vocabulary
pine
oak
rose
daisy
reading
writing
read
write
d
d << |V|
Similar words share similar
representations.
Latent
Dimensions
Explicit
Dimensions
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
Polyglot Embeddings
● Wikipedia article text
● 137 Languages
● Available:
○ http://bit.ly/embeddings
[Al-Rfou, Perozzi, Skiena, 13] C
Imagination
C
is
C
greater
C
than
C
detail
Score
Hidden
Layer
H
Projection
Layer
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
Sub-Problem: Annotation Mining
Input: Wikipedia, Freebase
Output: Labeled NER training examples
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
Related Work
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
Annotations from Wikipedia
Inter-wiki links are a great
potential source of mentions.
WikipediaFreebase
Freebase tells us which articles
are entity articles.
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
Example
Wiki Text:
Vancouver is a coastal seaport city on the mainland of
British Columbia. The city's mayor is Gregor Robertson.
“Vancouver”
“British Columbia”
“Gregor Robertson”
Strings
/m/080h2
/m/015jr
/m/0grlms
Freebase MID
City
Region
Person
Freebase
Category
Location
Location
Person
NER Label
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
The Bad News
Many false negatives in our dataset!
■ Wikipedia editors annotate only the first mention of
an entity but not later ones.
■ Most of the named entity mentions are not linked!
Example:
Vancouver is a coastal seaport city on the
mainland of British Columbia. Vancouver’s
mayor is Gregor Robertson.
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
The Good News
Positive labels are very
high quality!
Need to emphasize this in
our training.
?
?
?
?
?
?
?
‘Learning Classifiers from only positive and unlabeled examples’ [Elkin & Noto, 08]
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
The trick: Oversampling
p
We can change the label
distribution by
oversampling from the
positive labels.
p is the percentage of positive
labels in the training dataset.
Initially no
oversampling
p = 0.5, much
better
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
Cross-Domain Performance
Oversampling
Oversampling +
Exact Matching
Cross-Domain Testing on CoNLL
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
NER Demo
@ http://bit.ly/polyglot-ner
Legend: Location Organization Person
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
But How to Evaluate?
■We have labeled data for a few languages
■Would like to evaluate everything
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
Distant Evaluation
John proviene de la ciudad de
Nueva York.
John is coming from New York City.
Machine
Translation
Calculate the error of omitting entities and the error of adding entities.
Person: 1
Location: 1
Organization: 0
Person: 0
Location: 1
Organization: 1
1
1
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
Experimental Design
Distant Evaluation for Polyglot-NER:
1. Annotate English Wikipedia sentences using Stanford NER.
2. Randomly pick 1500 sentences that have at least one entity detected.
3. Translate these sentences using Google translate to 40 languages.
4. Run Polyglot-NER on the translated datasets.
5. Compare the number of entity chunks our annotators found to the
ones detected by Stanford per sentence.
6. Calculate the error of omitting (ℰ 𝓜) and adding entities (ℰ 𝒜)
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
Effect of Data Size
■ Size of training data
matters!
■ Tokenization is quite
important when the
word embeddings
coverage is limited.
# Words (Log Scale)
ErrorMissing
More
Data Will
Help
Anomalies
Good
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
Performance by Category
ℰ 𝒜: Adding Error ℰ 𝓜: Missing Error
Person Location
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
Limitations
■Named entities don’t always translate well:
❑Ex: “Γείτονας Shanna Rudd δήλωσε στο CNN …”
■Need a working translation system for the language
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
Take-aways
■NER in 40 languages!
■Word embeddings & oversampling offers equal
or better performance to feature engineering for
NER annotation mining.
■Translation based evaluation?
Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
Thanks!
NER Demo: http://bit.ly/polyglot-ner
NER Code: http://polyglot-nlp.com
bperozzi@cs.stonybrook.edu
www.perozzi.net
Bryan Perozzi

More Related Content

What's hot

How to build a virtual machine
How to build a virtual machineHow to build a virtual machine
How to build a virtual machine
Terence Parr
 
Web scraping &amp; browser automation
Web scraping &amp; browser automationWeb scraping &amp; browser automation
Web scraping &amp; browser automation
BHAWESH RAJPAL
 
Web Scraping and Data Extraction Service
Web Scraping and Data Extraction ServiceWeb Scraping and Data Extraction Service
Web Scraping and Data Extraction Service
PromptCloud
 
Svelte the future of frontend development
Svelte   the future of frontend developmentSvelte   the future of frontend development
Svelte the future of frontend development
twilson63
 
Ruby Rails 老司機帶飛
Ruby Rails 老司機帶飛Ruby Rails 老司機帶飛
Ruby Rails 老司機帶飛
Wen-Tien Chang
 
The Hot Rod Protocol in Infinispan
The Hot Rod Protocol in InfinispanThe Hot Rod Protocol in Infinispan
The Hot Rod Protocol in Infinispan
Galder Zamarreño
 
[211] HBase 기반 검색 데이터 저장소 (공개용)
[211] HBase 기반 검색 데이터 저장소 (공개용)[211] HBase 기반 검색 데이터 저장소 (공개용)
[211] HBase 기반 검색 데이터 저장소 (공개용)
NAVER D2
 
Migrating from RDBMS to MongoDB
Migrating from RDBMS to MongoDBMigrating from RDBMS to MongoDB
Migrating from RDBMS to MongoDB
MongoDB
 
Copy of MongoDB .pptx
Copy of MongoDB .pptxCopy of MongoDB .pptx
Copy of MongoDB .pptx
nehabsairam
 
Elements Of Web Design
Elements Of Web DesignElements Of Web Design
Elements Of Web Design
Dan Dixon
 
Mongoose - Melhores práticas usando MongoDB e Node.js
Mongoose - Melhores práticas usando MongoDB e Node.jsMongoose - Melhores práticas usando MongoDB e Node.js
Mongoose - Melhores práticas usando MongoDB e Node.js
Suissa
 
Web Cache Deception Attack
Web Cache Deception AttackWeb Cache Deception Attack
Web Cache Deception Attack
Omer Gil
 
NOSQL and MongoDB Database
NOSQL and MongoDB DatabaseNOSQL and MongoDB Database
NOSQL and MongoDB Database
Tariqul islam
 
Protocol Buffers and Hadoop at Twitter
Protocol Buffers and Hadoop at TwitterProtocol Buffers and Hadoop at Twitter
Protocol Buffers and Hadoop at Twitter
Kevin Weil
 
Web development
Web developmentWeb development
Web development
Kanan Rahimov
 
What is Web-scraping?
What is Web-scraping?What is Web-scraping?
What is Web-scraping?
Yu-Chang Ho
 
Web бағдарламалау
Web бағдарламалауWeb бағдарламалау
Web бағдарламалау
Rauan Ibraikhan
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
Ignaz Wanders
 
Introduction to mongodb
Introduction to mongodbIntroduction to mongodb
Introduction to mongodb
neela madheswari
 

What's hot (20)

How to build a virtual machine
How to build a virtual machineHow to build a virtual machine
How to build a virtual machine
 
Web scraping &amp; browser automation
Web scraping &amp; browser automationWeb scraping &amp; browser automation
Web scraping &amp; browser automation
 
Web Scraping and Data Extraction Service
Web Scraping and Data Extraction ServiceWeb Scraping and Data Extraction Service
Web Scraping and Data Extraction Service
 
Svelte the future of frontend development
Svelte   the future of frontend developmentSvelte   the future of frontend development
Svelte the future of frontend development
 
Ruby Rails 老司機帶飛
Ruby Rails 老司機帶飛Ruby Rails 老司機帶飛
Ruby Rails 老司機帶飛
 
The Hot Rod Protocol in Infinispan
The Hot Rod Protocol in InfinispanThe Hot Rod Protocol in Infinispan
The Hot Rod Protocol in Infinispan
 
[211] HBase 기반 검색 데이터 저장소 (공개용)
[211] HBase 기반 검색 데이터 저장소 (공개용)[211] HBase 기반 검색 데이터 저장소 (공개용)
[211] HBase 기반 검색 데이터 저장소 (공개용)
 
Migrating from RDBMS to MongoDB
Migrating from RDBMS to MongoDBMigrating from RDBMS to MongoDB
Migrating from RDBMS to MongoDB
 
Copy of MongoDB .pptx
Copy of MongoDB .pptxCopy of MongoDB .pptx
Copy of MongoDB .pptx
 
Elements Of Web Design
Elements Of Web DesignElements Of Web Design
Elements Of Web Design
 
Mongoose - Melhores práticas usando MongoDB e Node.js
Mongoose - Melhores práticas usando MongoDB e Node.jsMongoose - Melhores práticas usando MongoDB e Node.js
Mongoose - Melhores práticas usando MongoDB e Node.js
 
Web Cache Deception Attack
Web Cache Deception AttackWeb Cache Deception Attack
Web Cache Deception Attack
 
NOSQL and MongoDB Database
NOSQL and MongoDB DatabaseNOSQL and MongoDB Database
NOSQL and MongoDB Database
 
Protocol Buffers and Hadoop at Twitter
Protocol Buffers and Hadoop at TwitterProtocol Buffers and Hadoop at Twitter
Protocol Buffers and Hadoop at Twitter
 
Web development
Web developmentWeb development
Web development
 
What is Web-scraping?
What is Web-scraping?What is Web-scraping?
What is Web-scraping?
 
Infinispan for Dummies
Infinispan for DummiesInfinispan for Dummies
Infinispan for Dummies
 
Web бағдарламалау
Web бағдарламалауWeb бағдарламалау
Web бағдарламалау
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
 
Introduction to mongodb
Introduction to mongodbIntroduction to mongodb
Introduction to mongodb
 

Viewers also liked

Currículo Nacional de la Educación Básica
Currículo Nacional de la Educación BásicaCurrículo Nacional de la Educación Básica
Currículo Nacional de la Educación Básica
Diego Ponce de Leon
 
Portafolio de Evidencias de mi Práctica Docente
Portafolio de Evidencias de mi Práctica DocentePortafolio de Evidencias de mi Práctica Docente
Portafolio de Evidencias de mi Práctica Docente
Norma Vega
 
JULIOPARI - Elaborando un Plan de Negocios
JULIOPARI - Elaborando un Plan de NegociosJULIOPARI - Elaborando un Plan de Negocios
JULIOPARI - Elaborando un Plan de NegociosJulio Pari
 
El emprendedor y el empresario profesional cert
El emprendedor y el empresario profesional certEl emprendedor y el empresario profesional cert
El emprendedor y el empresario profesional cert
Maestros Online
 
PMP Sonora Saludable 2010 2015
PMP Sonora Saludable 2010   2015  PMP Sonora Saludable 2010   2015
1ºBACH Economía Tema 5 Oferta y demanda
1ºBACH Economía Tema 5 Oferta y demanda1ºBACH Economía Tema 5 Oferta y demanda
1ºBACH Economía Tema 5 Oferta y demandaGeohistoria23
 
Onderzoeksrapport acrs v3.0_definitief
Onderzoeksrapport acrs v3.0_definitiefOnderzoeksrapport acrs v3.0_definitief
Onderzoeksrapport acrs v3.0_definitief
rloggen
 
Como hacer un plan de negocios
Como hacer un plan de negociosComo hacer un plan de negocios
Como hacer un plan de negocios
XPINNERPablo
 
Schrijven voor het web
Schrijven voor het webSchrijven voor het web
Schrijven voor het web
Simone Levie
 
Evidence: Describing my kitchen. ENGLISH DOT WORKS 2. SENA.
Evidence: Describing my kitchen. ENGLISH DOT WORKS 2. SENA.Evidence: Describing my kitchen. ENGLISH DOT WORKS 2. SENA.
Evidence: Describing my kitchen. ENGLISH DOT WORKS 2. SENA.
.. ..
 
Estrategias competitivas básicas
Estrategias competitivas básicasEstrategias competitivas básicas
Estrategias competitivas básicas
LarryJimenez
 
Rodriguez alvarez
Rodriguez alvarezRodriguez alvarez
Rodriguez alvarez
Roxana Saldaña
 
2. describing cities and places. ENGLISH DOT WORKS 2. SENA. semana 4 acitivda...
2. describing cities and places. ENGLISH DOT WORKS 2. SENA. semana 4 acitivda...2. describing cities and places. ENGLISH DOT WORKS 2. SENA. semana 4 acitivda...
2. describing cities and places. ENGLISH DOT WORKS 2. SENA. semana 4 acitivda...
.. ..
 
3.Evidence: Getting to Bogota.ENGLISH DOT WORKS 2. SENA.semana 4 actividad 3.
3.Evidence: Getting to Bogota.ENGLISH DOT WORKS 2. SENA.semana 4 actividad 3.3.Evidence: Getting to Bogota.ENGLISH DOT WORKS 2. SENA.semana 4 actividad 3.
3.Evidence: Getting to Bogota.ENGLISH DOT WORKS 2. SENA.semana 4 actividad 3.
.. ..
 
Evidence: Going to the restaurant . ENGLISH DOT WORKS 2. SENA.
Evidence: Going to the restaurant . ENGLISH DOT WORKS 2. SENA.Evidence: Going to the restaurant . ENGLISH DOT WORKS 2. SENA.
Evidence: Going to the restaurant . ENGLISH DOT WORKS 2. SENA.
.. ..
 
Evidence: I can’t believe it.ENGLISH DOT WORKS 2. semana 3 actividad 1.SENA.
Evidence: I can’t believe it.ENGLISH DOT WORKS 2. semana 3 actividad 1.SENA.Evidence: I can’t believe it.ENGLISH DOT WORKS 2. semana 3 actividad 1.SENA.
Evidence: I can’t believe it.ENGLISH DOT WORKS 2. semana 3 actividad 1.SENA.
.. ..
 
Evidence: Memorable moments.ENGLISH DOT WORKS 2. SENA. semana 2 actividad 2.
Evidence: Memorable moments.ENGLISH DOT WORKS 2. SENA. semana 2 actividad 2.Evidence: Memorable moments.ENGLISH DOT WORKS 2. SENA. semana 2 actividad 2.
Evidence: Memorable moments.ENGLISH DOT WORKS 2. SENA. semana 2 actividad 2.
.. ..
 
Evidence: Planning my trip. ENGLISH DOT WORKS 2. SENA. semana 4 actividad 1.
Evidence: Planning my trip. ENGLISH DOT WORKS 2. SENA. semana 4 actividad 1.Evidence: Planning my trip. ENGLISH DOT WORKS 2. SENA. semana 4 actividad 1.
Evidence: Planning my trip. ENGLISH DOT WORKS 2. SENA. semana 4 actividad 1.
.. ..
 

Viewers also liked (20)

Currículo Nacional de la Educación Básica
Currículo Nacional de la Educación BásicaCurrículo Nacional de la Educación Básica
Currículo Nacional de la Educación Básica
 
Portafolio de Evidencias de mi Práctica Docente
Portafolio de Evidencias de mi Práctica DocentePortafolio de Evidencias de mi Práctica Docente
Portafolio de Evidencias de mi Práctica Docente
 
JULIOPARI - Elaborando un Plan de Negocios
JULIOPARI - Elaborando un Plan de NegociosJULIOPARI - Elaborando un Plan de Negocios
JULIOPARI - Elaborando un Plan de Negocios
 
El emprendedor y el empresario profesional cert
El emprendedor y el empresario profesional certEl emprendedor y el empresario profesional cert
El emprendedor y el empresario profesional cert
 
PMP Sonora Saludable 2010 2015
PMP Sonora Saludable 2010   2015  PMP Sonora Saludable 2010   2015
PMP Sonora Saludable 2010 2015
 
Tears In The Rain
Tears In The RainTears In The Rain
Tears In The Rain
 
1ºBACH Economía Tema 5 Oferta y demanda
1ºBACH Economía Tema 5 Oferta y demanda1ºBACH Economía Tema 5 Oferta y demanda
1ºBACH Economía Tema 5 Oferta y demanda
 
Onderzoeksrapport acrs v3.0_definitief
Onderzoeksrapport acrs v3.0_definitiefOnderzoeksrapport acrs v3.0_definitief
Onderzoeksrapport acrs v3.0_definitief
 
Como hacer un plan de negocios
Como hacer un plan de negociosComo hacer un plan de negocios
Como hacer un plan de negocios
 
Schrijven voor het web
Schrijven voor het webSchrijven voor het web
Schrijven voor het web
 
Evidence: Describing my kitchen. ENGLISH DOT WORKS 2. SENA.
Evidence: Describing my kitchen. ENGLISH DOT WORKS 2. SENA.Evidence: Describing my kitchen. ENGLISH DOT WORKS 2. SENA.
Evidence: Describing my kitchen. ENGLISH DOT WORKS 2. SENA.
 
Estrategias competitivas básicas
Estrategias competitivas básicasEstrategias competitivas básicas
Estrategias competitivas básicas
 
Cápsula 1. estudios de mercado
Cápsula 1. estudios de mercadoCápsula 1. estudios de mercado
Cápsula 1. estudios de mercado
 
Rodriguez alvarez
Rodriguez alvarezRodriguez alvarez
Rodriguez alvarez
 
2. describing cities and places. ENGLISH DOT WORKS 2. SENA. semana 4 acitivda...
2. describing cities and places. ENGLISH DOT WORKS 2. SENA. semana 4 acitivda...2. describing cities and places. ENGLISH DOT WORKS 2. SENA. semana 4 acitivda...
2. describing cities and places. ENGLISH DOT WORKS 2. SENA. semana 4 acitivda...
 
3.Evidence: Getting to Bogota.ENGLISH DOT WORKS 2. SENA.semana 4 actividad 3.
3.Evidence: Getting to Bogota.ENGLISH DOT WORKS 2. SENA.semana 4 actividad 3.3.Evidence: Getting to Bogota.ENGLISH DOT WORKS 2. SENA.semana 4 actividad 3.
3.Evidence: Getting to Bogota.ENGLISH DOT WORKS 2. SENA.semana 4 actividad 3.
 
Evidence: Going to the restaurant . ENGLISH DOT WORKS 2. SENA.
Evidence: Going to the restaurant . ENGLISH DOT WORKS 2. SENA.Evidence: Going to the restaurant . ENGLISH DOT WORKS 2. SENA.
Evidence: Going to the restaurant . ENGLISH DOT WORKS 2. SENA.
 
Evidence: I can’t believe it.ENGLISH DOT WORKS 2. semana 3 actividad 1.SENA.
Evidence: I can’t believe it.ENGLISH DOT WORKS 2. semana 3 actividad 1.SENA.Evidence: I can’t believe it.ENGLISH DOT WORKS 2. semana 3 actividad 1.SENA.
Evidence: I can’t believe it.ENGLISH DOT WORKS 2. semana 3 actividad 1.SENA.
 
Evidence: Memorable moments.ENGLISH DOT WORKS 2. SENA. semana 2 actividad 2.
Evidence: Memorable moments.ENGLISH DOT WORKS 2. SENA. semana 2 actividad 2.Evidence: Memorable moments.ENGLISH DOT WORKS 2. SENA. semana 2 actividad 2.
Evidence: Memorable moments.ENGLISH DOT WORKS 2. SENA. semana 2 actividad 2.
 
Evidence: Planning my trip. ENGLISH DOT WORKS 2. SENA. semana 4 actividad 1.
Evidence: Planning my trip. ENGLISH DOT WORKS 2. SENA. semana 4 actividad 1.Evidence: Planning my trip. ENGLISH DOT WORKS 2. SENA. semana 4 actividad 1.
Evidence: Planning my trip. ENGLISH DOT WORKS 2. SENA. semana 4 actividad 1.
 

Recently uploaded

Structural Classification Of Protein (SCOP)
Structural Classification Of Protein  (SCOP)Structural Classification Of Protein  (SCOP)
Structural Classification Of Protein (SCOP)
aishnasrivastava
 
GBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram StainingGBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram Staining
Areesha Ahmad
 
insect taxonomy importance systematics and classification
insect taxonomy importance systematics and classificationinsect taxonomy importance systematics and classification
insect taxonomy importance systematics and classification
anitaento25
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
ChetanK57
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
Sérgio Sacani
 
Hemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxHemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptx
muralinath2
 
ESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptxESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptx
muralinath2
 
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
Scintica Instrumentation
 
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdfSCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SELF-EXPLANATORY
 
in vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptxin vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptx
yusufzako14
 
Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
silvermistyshot
 
Citrus Greening Disease and its Management
Citrus Greening Disease and its ManagementCitrus Greening Disease and its Management
Citrus Greening Disease and its Management
subedisuryaofficial
 
erythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptxerythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptx
muralinath2
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
AlaminAfendy1
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
RenuJangid3
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
YOGESH DOGRA
 
Comparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebratesComparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebrates
sachin783648
 
Richard's entangled aventures in wonderland
Richard's entangled aventures in wonderlandRichard's entangled aventures in wonderland
Richard's entangled aventures in wonderland
Richard Gill
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
University of Maribor
 
GBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture MediaGBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture Media
Areesha Ahmad
 

Recently uploaded (20)

Structural Classification Of Protein (SCOP)
Structural Classification Of Protein  (SCOP)Structural Classification Of Protein  (SCOP)
Structural Classification Of Protein (SCOP)
 
GBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram StainingGBSN- Microbiology (Lab 3) Gram Staining
GBSN- Microbiology (Lab 3) Gram Staining
 
insect taxonomy importance systematics and classification
insect taxonomy importance systematics and classificationinsect taxonomy importance systematics and classification
insect taxonomy importance systematics and classification
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
 
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
THE IMPORTANCE OF MARTIAN ATMOSPHERE SAMPLE RETURN.
 
Hemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptxHemoglobin metabolism_pathophysiology.pptx
Hemoglobin metabolism_pathophysiology.pptx
 
ESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptxESR_factors_affect-clinic significance-Pathysiology.pptx
ESR_factors_affect-clinic significance-Pathysiology.pptx
 
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
(May 29th, 2024) Advancements in Intravital Microscopy- Insights for Preclini...
 
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdfSCHIZOPHRENIA Disorder/ Brain Disorder.pdf
SCHIZOPHRENIA Disorder/ Brain Disorder.pdf
 
in vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptxin vitro propagation of plants lecture note.pptx
in vitro propagation of plants lecture note.pptx
 
Lateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensiveLateral Ventricles.pdf very easy good diagrams comprehensive
Lateral Ventricles.pdf very easy good diagrams comprehensive
 
Citrus Greening Disease and its Management
Citrus Greening Disease and its ManagementCitrus Greening Disease and its Management
Citrus Greening Disease and its Management
 
erythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptxerythropoiesis-I_mechanism& clinical significance.pptx
erythropoiesis-I_mechanism& clinical significance.pptx
 
In silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptxIn silico drugs analogue design: novobiocin analogues.pptx
In silico drugs analogue design: novobiocin analogues.pptx
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
 
Mammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also FunctionsMammalian Pineal Body Structure and Also Functions
Mammalian Pineal Body Structure and Also Functions
 
Comparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebratesComparative structure of adrenal gland in vertebrates
Comparative structure of adrenal gland in vertebrates
 
Richard's entangled aventures in wonderland
Richard's entangled aventures in wonderlandRichard's entangled aventures in wonderland
Richard's entangled aventures in wonderland
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
 
GBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture MediaGBSN - Microbiology (Lab 4) Culture Media
GBSN - Microbiology (Lab 4) Culture Media
 

POLYGLOT-NER: Massive Multilingual Named Entity Recognition

  • 1. Polyglot-NER: Massive Multilingual Named Entity Recognition SDM May 2, 2015 Rami Al-Rfou, Vivek Kulkarni, Bryan Perozzi, Steve Skiena Stony Brook University
  • 2. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition Named Entity Recognition (NER) Problem ■Input: Plain text, T ■Output: The spans of T that constitute proper names, and the classification of the entity’s type.
  • 3. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition NER Examples Input: Vancouver is a coastal seaport city on the mainland of British Columbia. The city's mayor is Gregor Robertson. Output: Vancouver is a coastal seaport city on the mainland of British Columbia. The city's mayor is Gregor Robertson. Location Location Person
  • 4. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition Multilingual NER ❑NLTK ■ English ❑Stanford ■ English, Spanish, Chinese, Arabic ❑OpenNLP ■ English, German, Dutch, Spanish ❑Polyglot-NER ■ 40 Major Languages! (English, Spanish, French, German, Russian, Polish, Portuguese, Italian, Dutch, Arabic, Hebrew, Hindi, Korean, Japanese, Vietnamese, …) While many pipelines exist, most languages are unsupported
  • 5. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition Does Multilingual Matter? Yes! Only 55% of the top 10 million websites are in English! [1] There are 51 languages on Wikipedia with 100,000+ articles. [2] [1] http://w3techs.com/technologies/history_overview/content_language/ms/y [2] http://meta.wikimedia.org/wiki/List_of_Wikipedias
  • 6. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition Multilingual is Hard Feature Scarcity NLP tasks typically rely on language-specific feature engineering ❑ Orthographic features ❑ Part of Speech Tags ❑ Parallel Corpora ❑ WordNet Annotation Scarcity Need NER examples - labeled data is expensive. Our solution: neural word embeddings. Our solution: Wikipedia/Freebase for training examples
  • 7. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition Sub-problem: Word Representation Input: Unstructured text Output: Low dimensional word embeddings
  • 8. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition Distributed Word Representations Big Idea: Give similar words similar representations pine oak rose daisy reading writing read write |V| |V|: size of vocabulary pine oak rose daisy reading writing read write d d << |V| Similar words share similar representations. Latent Dimensions Explicit Dimensions
  • 9. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition Polyglot Embeddings ● Wikipedia article text ● 137 Languages ● Available: ○ http://bit.ly/embeddings [Al-Rfou, Perozzi, Skiena, 13] C Imagination C is C greater C than C detail Score Hidden Layer H Projection Layer
  • 10. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition Sub-Problem: Annotation Mining Input: Wikipedia, Freebase Output: Labeled NER training examples
  • 11. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition Related Work
  • 12. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition Annotations from Wikipedia Inter-wiki links are a great potential source of mentions. WikipediaFreebase Freebase tells us which articles are entity articles.
  • 13. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition Example Wiki Text: Vancouver is a coastal seaport city on the mainland of British Columbia. The city's mayor is Gregor Robertson. “Vancouver” “British Columbia” “Gregor Robertson” Strings /m/080h2 /m/015jr /m/0grlms Freebase MID City Region Person Freebase Category Location Location Person NER Label
  • 14. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition The Bad News Many false negatives in our dataset! ■ Wikipedia editors annotate only the first mention of an entity but not later ones. ■ Most of the named entity mentions are not linked! Example: Vancouver is a coastal seaport city on the mainland of British Columbia. Vancouver’s mayor is Gregor Robertson.
  • 15. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition The Good News Positive labels are very high quality! Need to emphasize this in our training. ? ? ? ? ? ? ? ‘Learning Classifiers from only positive and unlabeled examples’ [Elkin & Noto, 08]
  • 16. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition The trick: Oversampling p We can change the label distribution by oversampling from the positive labels. p is the percentage of positive labels in the training dataset. Initially no oversampling p = 0.5, much better
  • 17. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition Cross-Domain Performance Oversampling Oversampling + Exact Matching Cross-Domain Testing on CoNLL
  • 18. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition NER Demo @ http://bit.ly/polyglot-ner Legend: Location Organization Person
  • 19. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition But How to Evaluate? ■We have labeled data for a few languages ■Would like to evaluate everything
  • 20. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition Distant Evaluation John proviene de la ciudad de Nueva York. John is coming from New York City. Machine Translation Calculate the error of omitting entities and the error of adding entities. Person: 1 Location: 1 Organization: 0 Person: 0 Location: 1 Organization: 1 1 1
  • 21. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition Experimental Design Distant Evaluation for Polyglot-NER: 1. Annotate English Wikipedia sentences using Stanford NER. 2. Randomly pick 1500 sentences that have at least one entity detected. 3. Translate these sentences using Google translate to 40 languages. 4. Run Polyglot-NER on the translated datasets. 5. Compare the number of entity chunks our annotators found to the ones detected by Stanford per sentence. 6. Calculate the error of omitting (ℰ 𝓜) and adding entities (ℰ 𝒜)
  • 22. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition Effect of Data Size ■ Size of training data matters! ■ Tokenization is quite important when the word embeddings coverage is limited. # Words (Log Scale) ErrorMissing More Data Will Help Anomalies Good
  • 23. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition Performance by Category ℰ 𝒜: Adding Error ℰ 𝓜: Missing Error Person Location
  • 24. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition Limitations ■Named entities don’t always translate well: ❑Ex: “Γείτονας Shanna Rudd δήλωσε στο CNN …” ■Need a working translation system for the language
  • 25. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition Take-aways ■NER in 40 languages! ■Word embeddings & oversampling offers equal or better performance to feature engineering for NER annotation mining. ■Translation based evaluation?
  • 26. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition Thanks! NER Demo: http://bit.ly/polyglot-ner NER Code: http://polyglot-nlp.com bperozzi@cs.stonybrook.edu www.perozzi.net Bryan Perozzi