Babak Rasolzadeh: The importance of entities

Meltwater Budapest, April 2016
The importance of entities
Babak Rasolzadeh, Director of Data Science Research

1. Company background
2. Data Science @ Meltwater
3. Challenges with NLP at Large scale
4. Entities, entities, entities
a. Social NER
b. ELS
c. Knowledge Graph

3
What is Meltwater?
● A business intelligence company → Providing insights from data outside
the firewall (news, blogs, social media, etc.)
● Born in Oslo, in 2001.
● Founder and CEO: Jorn Lyssegen
● www.meltwater.com
● 30K+ clients all over the World.
● 1000+ employees
● 60+ offices around the world, mostly sale.
● Tech offices: USA, Germany, Sweden, Hungary, India.

4
Why?
own brand
competitors
leads
partners product reviews
own industry

5
What?
Uses Meltwater to find out about new
instances of vandalism and break-ins.
Often, the victim is in need of services
Uses Meltwater to help determine
how public perception of certain
ingredient chemicals will influence
adoption & sales
Uses Meltwater to be alerted of
when certain patent will expire in
target markets
Uses Meltwater to monitor the
performance and popularity of news
anchors and programs
Uses Meltwater social listening
to estimate and prevent
infrastructure attacks

Unstructured
Document
Stream
Pipeline
Enrichments
Search
/Storage
Enriched Documents
High Performance
Indexes
Processing
Services
API Layer
APPSBackup
Storage
Raw Documents
15 supported languages in pipeline
(EN, DE, SV, NO, FI, ZH, JP, FR, ES, DA, NL, PT, AR, IT,
HI)
Typical enrichments
○ Sentiment analysis
○ Thematic analysis
○ Categorization
○ Keyphrase extraction
○ Named Entity Recognition
○ Named Entity Disambiguation
NLP & Data Science at Meltwater

8
What other than NLP?
● Recommendation Engines
DOC3
DOC3
DOC3
DOC3
DOC3
DOC8
Realtime
recommender
engine
● Correlation and predictive pattern recognition
● Word2vec techniques
concept 3
concept 1
concept 2
“British American Tobacco" or "British
American Tobbaco" or (BAT near tobacco) or
"英美煙草" or (("Lucky Strike" or "Dunhill"
or "Pall Mall") near/15 cigarette*)

9
Machine Learning Terminology

10
Challenges with Data Science (NLP) at scale
• High DPS (~2000) and a lot to do! (tokenization, lemmatization,
stemming, POS tagging, categorization, sentiment, NER, ...) with racing
conditions!
Pipeline
Enrichments
SV
EN
DE
POS NER• Training data labelling is costly! x15
• Contextual information expensive (computationally).
• Noise, missing data, variation (e.g. slang), data types, ...

Knowledge Base Strategy
Entities, entities, entities
don - July 2015

12
Knowledge Base StrategyWhat are Named Entities (NE)?
● Non-linguistic definition
○ Referable entities
○ Usually Proper Names
○ Single or multi-word
→ I know this man. He might be Charles.
→ He lives in Stockholm. He is Swedish.

13
Knowledge Base StrategyWhat is Named Entity Recognition (NER)?
1. Extracting NEs from a text.
2. Categorizing NEs from a set of predefined categories.
John lives in Stockholm. He works at Ericsson.
Categories of {PER, LOC, ORG, MISC, PROD}

14
Knowledge Base StrategyWhat NER is not?
● NER is not event recognition.
● NER recognises entities in text, and classifies them in some way, but it
does not create templates, nor does it perform co-reference or entity
linking.
● NER is not just matching text strings with pre-defined lists of names. It only
recognises entities which are being used as entities in a given context.
(i.e. not easy!)

15
● Key part of Information Extraction system
● Robust handling of proper names essential for many applications
● Pre-processing for different classification levels
● Information filtering
● Information linking
● Entity level sentiment
● Knowledge graph
Why NER?

16
Knowledge Base StrategyWhy NER?

17
Knowledge Base StrategyWhy NER?
Pepsi spooks Coke with
this Halloween themed ad.
Entity specific sentiment analysis a.k.a ELS

So what about Social…?

19
Supervised Learning
❏ Hidden Markov Model (HMM) Freitag and Mccallum, 1999; Leek, 1997.
❏ Conditional Markov Model (CMM) Borthwick, 1999; McCallum et al., 2000.
✓ Conditional Random Field (CRF) Lafferty, 2001; Ratinov and Roth, 2009.
How to do NER? (state-of-the-art)

20
● Ground truth data collection for NER is very expensive
● Solutions:
○ Automatic NER annotation using Wikipedia data
○ Applying Latent Dirichlet Analysis (LDA) based NER detection
using Gazetteer data.
Training data

22
Extensive lists of names for a specific category
● PER
○ First names (male-female) and surnames, their frequency
● LOC
○ Cities, Countries
○ Population
● ORG
○ Name of companies from Yellow pages.
Gazetteers help
Disadvantages
○ Difficult to create and maintain (or expensive if commercial)
○ Usefulness varies depending on category
○ Ambiguity
○ Words occur in more lists of different types (PER, LOC, FAC,...)

23
Let’s say we want to estimate the likelihood of the bi-gram "to Shanghai",
without having seen this in a training set.
The system can obtain a good estimate if it can cluster "Shanghai" with
other city names (like “London”, “Beijing”), then make its estimate based
on the likelihood of phrases such as "to London", "to Beijing" and "to
Denver"
Brown clustering - motivation

24
● Proposed by Brown et al. (1992) (a.k.a “IBM clustering”)
● Hierarchical class-based labeling method.
● Bottom-up
● Unsupervised learning
○ Doesn't need labeled data but rather large set of raw text.
● Greedy technique to maximize bi-gram MI.
● Merge words by contextual similarity.
Brown clustering (1)
( )

25
Brown clustering (2)
● Large amount of data
○ Similar words appear in similar contexts.
○ More precisely: similar words have similar distribution of words to their
immediate left and right.
● Example: “the” and “a” both are determinant.
○ Frequency of immediate words on their left and right:

27
Hmm...easy?
● What are the challenges in real applications?
● What about moving to other languages?
● What about moving to social domain?

28
Disambiguation
What is the entity
category of
“Washington”?

29
Different languages
● Tokenization
○ Chinese & Japanese: Words not separated
● Part of speech
○ Nouns
■ English: only number inflection
■ German: number, gender and case inflection
○ Verbs
■ English: regular verb 4, irregular verb up to 8 distinct forms
■ Finnish: more than 10,000 forms
● NER: Shape feature
○ English: Only proper nouns capitalized
○ German: All nouns capitalized

31
Different languages
Studying of linguistic
properties of a language is
important!

33
Challenges in Social NER
● The performance of “off-the-shelf” NER methods degrades severely when
applied on Twitter data
● Tweets
○ are short: 140 character limit.
○ cover wide range of topics.
○ are written grammatically in broken language.
○ are written fast and posted from anywhere: a lot of mis-spelling.
→ a solution which considers social characteristics of text

34
Challenges in Social NER
Examples of noisy data
● Jaguar's gonna like this episode of #MadMen even less than last week's, I bet.
● Dane Bowers is in Asda I cant believe.it luckiest girl in the world omf i cant believe
it omg
● A feel good story RT @DailyBreezeNews: Santa Claus arrives by helicopter at LAX
to greet local school

35
Solution (1)
Adapting existing features to social properties
(POS tagger of editorial NER performs really poor
when it comes to social documents.)

36
Solution (2)
Weight (importance) of each CRF feature

37
Results
● Training Data
○ ~76K tweets labeled by human
annotator.
● Inter agreement of two
annotators.
● Test Data
○ ~9.1K tweets labeled by human
annotator.
● Improvement compared state-of-
the-art method
Ritter, A. et al. Named entity recognition in tweets: An
experimental study. EMNLP ’11, pages 1524–1534.

What about sentiment….?

Document Level Sentiment - how it works
Inter-annotator agreement ~80%*
* http://bit.ly/human-sentiment

Machine Learning Magic
Supervised learning
Naive bayes - BernoulliNB, GaussianNB, MultinomialNB
Support Vector Machines - LinearSVM, RbfSVM
Maximum Entropy Model - GIS, IIS, MEGAM, TADM
MLP - RecurrentNN

Machine Learning Magic

Document Level Sentiment - current status
~60-70% (depending on language)
Not too terrible, considering that human
performance is at best ~80%...
...but why is it so hard?

Document Level Sentiment - how it’s used

Document Level Sentiment - the problem

Negative
Neutral

“Those numbers underline a growing gap between McDonald's and today's fast-
food customers. It will only get wider with another year's worth of the same
uninspired fare that has made McDonald's customers easy pickings for Panera
Bread, Chick-fil-A, Chipotle Mexican Grill and others.
”
Negative
Positive
Does not make sense for our industry!

Entity Level Sentiment (ELS)

Entity Level Sentiment - motivation
● DLS imprecise and wrong for our customers
● Entities are of main importance for our customers
● We already have NER (Named Entity Recognition) technology
Idea:
Identify the sentiment towards each particular entity in a text!

Entity Level Sentiment - how it works
NER
BMW: Positive
Mercedes: Neutral
Toyota: Negative
…

Entity1: Positive
Entity2: Neutral
Entity3: Negative
…
E1:Positive
E2: Neutral
E3: Negative
E1:Positive
E2: Neutral
E3: Negative
E1:Positive
E2: Neutral
E3: Negative

Entity1: Positive
Entity2: Neutral
Entity3: Negative
…
NER

Entity Level Sentiment - use case

Entity Level Sentiment - current status
● ELS is considered a very tough problem in NLP/ML
● The accuracy of state-of-the-art ELS is currently very low
(~45%)

The holy grail : The Graph Knowledge Base
don - July 2015

56
Entities + Relationships
As the types of entities and their
relationships grows so does
the capacity to infer insights
that depend on connectivity
and eventually one can
answer questions that
would otherwise not be
possible with only separate
datasets!

57
KB Architecture
Unstructured
Document
Stream
Pipeline
Enrichments
Graph
Search
Enriched Documents
High Performance
Indexes
Processing
Services
API Layer
Knowledge
Base
(Graph)
I/O
External Data Providers
Updates/subscriptions
Lookups
APPSBackup
Storage
Raw Documents

Why is it hard?

60
Data Acquisition trade-offs
Highvolume
High quality
Cheap
Manual data
acquisition
Special crawlers,
Smart algorithms
Acquisitions,
partnerships
low
quality
expensive
low
volume

61
Composing the KB - Scalability

62
Scalability Requirements - next steps
Companies ~ 100 million worldwide
People ~ 500 million (including media influencers)
Products ~ 500 million
~1 billion entities all the connections
between them
→
billions of nodes, trillions of edges!

63
Composing the KB - New features

64
Improve entity search - company NED

65
Improve entity search - person NED
Robert Gates
22nd Secretary of Defense
William Henry Gates III
former CEO & cofounder of Microsoft
“Who is Mr. Gates?”

67
Map influencer network
influencer score ~ eg. PageRank

68
Suggested read
● Ratinov 2009 (challenges in NER): paper.
● ArkCMU (social): paper, code.
● Ritter et al (social): paper, code.
● Stanford NLP NER (editorial): paper, code.
● Brown clustering
○ brown clustering: video
○ Word Representations and N-grams: video
● Transforming Wikipedia into Named Entity Training Data: paper.

Babak Rasolzadeh: The importance of entities

Recommended

Recommended

More Related Content

Viewers also liked

Viewers also liked (11)

Similar to Babak Rasolzadeh: The importance of entities

Similar to Babak Rasolzadeh: The importance of entities (20)

More from Zoltan Varju

More from Zoltan Varju (20)

Recently uploaded

Recently uploaded (20)

Babak Rasolzadeh: The importance of entities