The increasing diversity of languages used on the web introduces a new level of complexity to Information Retrieval (IR) systems. We can no longer assume that textual content is written in one language or even the same language family. In this paper, we demonstrate how to build massive multilingual annotators with minimal human expertise and intervention. We describe a system that builds Named Entity Recognition (NER) annotators for 40 major languages using Wikipedia and Freebase. Our approach does not require human-annotated NER datasets or language-specific resources like treebanks, parallel corpora, and orthographic rules. The novelty of our approach lies in using only language-agnostic techniques while achieving competitive performance.
Our method learns distributed word representations (word embeddings) which encode semantic and syntactic features of words in each language. Then, we automatically generate datasets from Wikipedia link structure and Freebase attributes. Finally, we apply two preprocessing stages (oversampling and exact surface form matching) which do not require any linguistic expertise.
Our evaluation is twofold. First, we demonstrate the system's performance on human-annotated datasets. Second, for languages where no gold-standard benchmarks are available, we propose a new method, distant evaluation, based on statistical machine translation.
2. Bryan Perozzi Polyglot-NER: Massive Multilingual Named Entity Recognition
Named Entity Recognition (NER) Problem
■ Input: plain text, T
■ Output: the spans of T that constitute proper names, and the classification of each entity’s type.
NER Examples
Input: Vancouver is a coastal seaport city on the mainland
of British Columbia. The city's mayor is Gregor Robertson.
Output: [Vancouver | Location] is a coastal seaport city on the mainland
of [British Columbia | Location]. The city's mayor is [Gregor Robertson | Person].
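One possible way to represent this output in code is a list of typed character spans. The sketch below is only illustrative: the `EntitySpan` class and the `span_for` helper are hypothetical names introduced here, not part of Polyglot-NER.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySpan:
    start: int  # character offset where the mention begins (inclusive)
    end: int    # character offset where the mention ends (exclusive)
    label: str  # entity type, e.g. "Location" or "Person"


text = ("Vancouver is a coastal seaport city on the mainland "
        "of British Columbia. The city's mayor is Gregor Robertson.")


def span_for(text: str, mention: str, label: str) -> EntitySpan:
    """Locate the first occurrence of `mention` and wrap it as a typed span."""
    start = text.index(mention)
    return EntitySpan(start, start + len(mention), label)


# The expected NER output for the example sentence, written by hand.
spans = [
    span_for(text, "Vancouver", "Location"),
    span_for(text, "British Columbia", "Location"),
    span_for(text, "Gregor Robertson", "Person"),
]

for s in spans:
    print(text[s.start:s.end], "->", s.label)
```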
Multilingual NER
❑ NLTK
■ English
❑ Stanford
■ English, Spanish, Chinese, Arabic
❑ OpenNLP
■ English, German, Dutch, Spanish
❑ Polyglot-NER
■ 40 Major Languages! (English, Spanish, French, German, Russian, Polish, Portuguese, Italian, Dutch, Arabic, Hebrew, Hindi, Korean, Japanese, Vietnamese, …)
While many pipelines exist, most languages are unsupported.
Does Multilingual Matter?
Yes!
Only 55% of the top 10 million websites are in English! [1]
There are 51 languages on Wikipedia with 100,000+
articles. [2]
[1] http://w3techs.com/technologies/history_overview/content_language/ms/y
[2] http://meta.wikimedia.org/wiki/List_of_Wikipedias
Multilingual is Hard
Feature Scarcity
NLP tasks typically rely on language-specific feature engineering:
❑ Orthographic features
❑ Part of Speech Tags
❑ Parallel Corpora
❑ WordNet
Our solution: neural word embeddings.
Annotation Scarcity
Need NER examples, but labeled data is expensive.
Our solution: Wikipedia/Freebase for training examples.
Sub-problem: Word Representation
Input: Unstructured text
Output: Low dimensional word embeddings
Distributed Word Representations
Big Idea: Give similar words similar representations
[Figure: the words pine, oak, rose, daisy, reading, writing, read, and write, shown first as vectors over |V| explicit dimensions (|V|: size of vocabulary) and then as vectors over d latent dimensions, with d << |V|.]
Similar words share similar representations.
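The contrast between explicit and latent dimensions can be sketched in a few lines. The dense vectors below are hand-picked, hypothetical values with d = 2, chosen only so that plant words and reading/writing words fall into two groups; they are not the Polyglot embeddings.

```python
import math

vocab = ["pine", "oak", "rose", "daisy", "reading", "writing", "read", "write"]


def one_hot(word):
    """Explicit representation: one dimension per vocabulary word (|V| in total)."""
    return [1.0 if w == word else 0.0 for w in vocab]


# Hypothetical dense vectors with d = 2 latent dimensions (d << |V|).
# The values are made up for illustration, not learned from data.
dense = {
    "pine": [0.9, 0.1], "oak": [0.8, 0.2],
    "rose": [0.7, 0.4], "daisy": [0.6, 0.5],
    "reading": [0.1, 0.9], "writing": [0.2, 0.9],
    "read": [0.1, 0.8], "write": [0.2, 0.8],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# One-hot vectors of two different words are always orthogonal,
# so they carry no notion of similarity.
print(cosine(one_hot("pine"), one_hot("oak")))
# Dense vectors can place related words close together.
print(cosine(dense["pine"], dense["oak"]), cosine(dense["pine"], dense["write"]))
```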
Polyglot Embeddings
● Wikipedia article text
● 137 Languages
● Available:
○ http://bit.ly/embeddings
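[Al-Rfou, Perozzi, Skiena, 13]
[Figure: embedding network. Each word of the window "Imagination is greater than detail" is looked up in the embedding matrix C (projection layer), and the result feeds a hidden layer (H) that outputs a score.]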
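As a rough illustration of the layers named on this slide, the sketch below runs a forward pass that looks up each window word in C, concatenates the vectors, applies a hidden layer H, and outputs a scalar score. The layer sizes, the tanh nonlinearity, the random initialization, and the hinge ranking loss against a corrupted window are all assumptions made here for illustration; the slide itself shows only the layer names.

```python
import math
import random

random.seed(0)

d = 4       # embedding size (illustrative assumption)
hidden = 3  # number of hidden units (illustrative assumption)
window = ["Imagination", "is", "greater", "than", "detail"]
# Tiny vocabulary: the window words plus one extra word used for corruption.
vocab = {w: i for i, w in enumerate(window + ["banana"])}


def rand_matrix(rows, cols):
    """Small random weights; a real model would learn these by training."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


C = rand_matrix(len(vocab), d)            # projection layer: one row per word
H = rand_matrix(hidden, d * len(window))  # hidden layer weights
w_out = [random.uniform(-0.1, 0.1) for _ in range(hidden)]  # scoring weights


def score(words):
    """Score a window: lookup in C, concatenate, tanh hidden layer, linear output."""
    x = [value for w in words for value in C[vocab[w]]]
    h = [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in H]
    return sum(wo * hi for wo, hi in zip(w_out, h))


def ranking_loss(real, corrupted):
    """Hinge loss that asks the real window to outscore the corrupted one by 1."""
    return max(0.0, 1.0 - score(real) + score(corrupted))


# Corrupt the window by replacing its middle word.
corrupted = window[:2] + ["banana"] + window[3:]
print(score(window), ranking_loss(window, corrupted))
```

With untrained weights both scores are close to zero, so the loss is close to 1; training would push real windows above corrupted ones.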
Sub-Problem: Annotation Mining
Input: Wikipedia, Freebase
Output: Labeled NER training examples
Annotations from Wikipedia
■ Wikipedia: inter-wiki links are a great potential source of mentions.
■ Freebase: tells us which articles are entity articles.
Example
Wiki Text:
Vancouver is a coastal seaport city on the mainland of
British Columbia. The city's mayor is Gregor Robertson.
Strings              Freebase MID   Freebase Category   NER Label
“Vancouver”          /m/080h2       City                Location
“British Columbia”   /m/015jr       Region              Location
“Gregor Robertson”   /m/0grlms      Person              Person
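The mapping in this example can be sketched as a chain of lookups from link target to Freebase MID to category to NER label. The three dictionaries below are hypothetical stand-ins holding only the rows from this slide, the `[[target|anchor]]` link syntax is simplified, and the `"O"` label for unlinked tokens is an assumed convention; none of this is the actual Polyglot-NER code.

```python
import re

# Hypothetical lookup tables containing only the rows from the example above.
TITLE_TO_MID = {
    "Vancouver": "/m/080h2",
    "British Columbia": "/m/015jr",
    "Gregor Robertson": "/m/0grlms",
}
MID_TO_CATEGORY = {"/m/080h2": "City", "/m/015jr": "Region", "/m/0grlms": "Person"}
CATEGORY_TO_LABEL = {"City": "Location", "Region": "Location", "Person": "Person"}

# Matches [[Target]] and [[Target|anchor text]].
LINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def label_for(title):
    """Map a link target to an NER label, or "O" if it is not a known entity."""
    mid = TITLE_TO_MID.get(title)
    category = MID_TO_CATEGORY.get(mid)
    return CATEGORY_TO_LABEL.get(category, "O")


def mine(wikitext):
    """Turn linked wiki text into (token, label) training pairs."""
    examples = []
    pos = 0
    for m in LINK.finditer(wikitext):
        # Text before the link is unlabeled.
        examples += [(tok, "O") for tok in wikitext[pos:m.start()].split()]
        title, anchor = m.group(1), m.group(2) or m.group(1)
        label = label_for(title)
        examples += [(tok, label) for tok in anchor.split()]
        pos = m.end()
    examples += [(tok, "O") for tok in wikitext[pos:].split()]
    return examples


wikitext = ("[[Vancouver]] is a coastal seaport city on the mainland of "
            "[[British Columbia]]. The city's mayor is [[Gregor Robertson]].")
examples = mine(wikitext)
print([pair for pair in examples if pair[1] != "O"])
```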
The Bad News
Many false negatives in our dataset!
■ Wikipedia editors annotate only the first mention of
an entity but not later ones.
■ Most of the named entity mentions are not linked!
Example (the second “Vancouver” is not linked):
Vancouver is a coastal seaport city on the
mainland of British Columbia. Vancouver’s
mayor is Gregor Robertson.
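The exact surface form matching step mentioned in the abstract targets this kind of false negative. A minimal sketch of the idea, assuming pre-tokenized text and an `"O"` label for unlinked tokens, is to copy the label of each linked mention onto identical unlinked token sequences in the same text. The function name and the rule that a mention is a run of consecutive tokens with the same label are assumptions made here, so two adjacent entities of the same type would be merged.

```python
def exact_match_relabel(tokens, labels):
    """Copy labels from linked mentions to identical unlinked token sequences."""
    # Collect surface forms: runs of consecutive tokens sharing a non-"O" label.
    forms = {}
    i = 0
    while i < len(tokens):
        if labels[i] != "O":
            j = i
            while j < len(tokens) and labels[j] == labels[i]:
                j += 1
            forms[tuple(tokens[i:j])] = labels[i]
            i = j
        else:
            i += 1

    relabeled = list(labels)  # leave the input labels untouched
    # Try longer surface forms first so they win over their own substrings.
    for form, label in sorted(forms.items(), key=lambda kv: -len(kv[0])):
        n = len(form)
        for s in range(len(tokens) - n + 1):
            window_is_unlabeled = all(l == "O" for l in relabeled[s:s + n])
            if tuple(tokens[s:s + n]) == form and window_is_unlabeled:
                relabeled[s:s + n] = [label] * n
    return relabeled


tokens = ("Vancouver is a coastal seaport city on the mainland of "
          "British Columbia . Vancouver 's mayor is Gregor Robertson .").split()
labels = ["O"] * len(tokens)
labels[0] = "Location"                 # first "Vancouver" (linked)
labels[10] = labels[11] = "Location"   # "British Columbia" (linked)
labels[17] = labels[18] = "Person"     # "Gregor Robertson" (linked)

fixed = exact_match_relabel(tokens, labels)
print(tokens[13], labels[13], "->", fixed[13])
```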
The Good News
Positive labels are very high quality!
Need to emphasize this in our training.
‘Learning Classifiers from Only Positive and Unlabeled Data’ [Elkan & Noto, 08]
The trick: Oversampling
We can change the label distribution by oversampling from the positive labels.
p is the fraction of positive labels in the training dataset.
[Figure: results initially, with no oversampling, and with p = 0.5, which is much better.]
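One simple way to reach a target p is to duplicate randomly chosen positive examples until they make up the desired fraction. The sketch below assumes examples are (token, label) pairs with `"O"` marking negatives; the function name and the sampling-with-replacement strategy are assumptions for illustration, not necessarily the procedure used in the system.

```python
import math
import random


def oversample(examples, p, seed=0):
    """Duplicate positive examples at random until they make up fraction p."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    positives = [e for e in examples if e[1] != "O"]
    negatives = [e for e in examples if e[1] == "O"]
    if not positives:
        raise ValueError("need at least one positive example to oversample")

    rng = random.Random(seed)
    # Solve n_pos / (n_pos + n_neg) = p for n_pos, rounding up.
    target = math.ceil(p * len(negatives) / (1 - p))
    extra = max(0, target - len(positives))
    resampled = positives + [rng.choice(positives) for _ in range(extra)] + negatives
    rng.shuffle(resampled)
    return resampled


# Toy dataset: 2 positive and 8 negative examples, so p starts at 0.2.
data = [("Vancouver", "Location"), ("Robertson", "Person")] + [("the", "O")] * 8
balanced = oversample(data, p=0.5)
fraction = sum(1 for _, label in balanced if label != "O") / len(balanced)
print(len(balanced), fraction)
```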
Cross-Domain Performance
[Figure: cross-domain testing on CoNLL, comparing Oversampling with Oversampling + Exact Matching.]
NER Demo
@ http://bit.ly/polyglot-ner
Legend: Location Organization Person
But How to Evaluate?
■ We have labeled data for only a few languages.
■ We would like to evaluate every language.
Distant Evaluation
English: John is coming from New York City.
    Person: 1, Location: 1, Organization: 0
        ↓ Machine Translation
Spanish: John proviene de la ciudad de Nueva York.
    Person: 0, Location: 1, Organization: 1
Calculate the error of omitting entities and the error of adding entities.
Entities omitted: 1. Entities added: 1.
Experimental Design
Distant Evaluation for Polyglot-NER:
1. Annotate English Wikipedia sentences using Stanford NER.
2. Randomly pick 1500 sentences that have at least one entity detected.
3. Translate these sentences into 40 languages using Google Translate.
4. Run Polyglot-NER on the translated datasets.
5. For each sentence, compare the number of entity chunks our annotators found to the number detected by Stanford NER.
6. Calculate the error of omitting entities (ℰ 𝓜) and the error of adding entities (ℰ 𝒜).
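Steps 5 and 6 can be sketched as a comparison of per-sentence entity counts. The slides do not give the exact formula, so the version below is an assumption: it sums, per category, the entities missing from or surplus in the translated annotation and normalizes both totals by the number of reference entities. The function and variable names are hypothetical.

```python
CATEGORIES = ("Person", "Location", "Organization")


def distant_errors(reference, predicted):
    """Return (omission error, addition error) from per-sentence entity counts.

    `reference` holds the counts from the English annotation and `predicted`
    the counts found on the translated sentences, as dicts of category -> count.
    Normalizing by the reference total is an assumption of this sketch.
    """
    omitted = added = total = 0
    for ref, pred in zip(reference, predicted):
        for category in CATEGORIES:
            r, p = ref.get(category, 0), pred.get(category, 0)
            omitted += max(0, r - p)  # entities the translated annotation missed
            added += max(0, p - r)    # entities it found beyond the reference
            total += r
    if total == 0:
        raise ValueError("reference contains no entities")
    return omitted / total, added / total


# Counts for "John is coming from New York City." and its Spanish translation.
reference = [{"Person": 1, "Location": 1, "Organization": 0}]
predicted = [{"Person": 0, "Location": 1, "Organization": 1}]
print(distant_errors(reference, predicted))
```

Because only counts are compared, this check cannot tell whether the entity found in the translation is the same entity as in the source sentence.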
Effect of Data Size
■ Size of training data matters!
■ Tokenization is quite important when the word embeddings coverage is limited.
[Figure: missing error versus number of words (log scale), with regions marked "More Data Will Help", "Anomalies", and "Good".]
Performance by Category
ℰ 𝒜: Adding Error; ℰ 𝓜: Missing Error
[Figure: error by category, with panels for Person and Location.]
Limitations
■ Named entities don’t always translate well:
❑ Ex: “Γείτονας Shanna Rudd δήλωσε στο CNN …” (Greek: “Neighbor Shanna Rudd told CNN …”)
■ Need a working translation system for the language.
Take-aways
■ NER in 40 languages!
■ Word embeddings and oversampling offer performance equal to or better than feature engineering for NER annotation mining.
■ Translation-based evaluation?
Thanks!
NER Demo: http://bit.ly/polyglot-ner
NER Code: http://polyglot-nlp.com
bperozzi@cs.stonybrook.edu
www.perozzi.net
Bryan Perozzi