Meltwater Budapest, April 2016
The importance of entities
Babak Rasolzadeh, Director of Data Science Research
1. Company background
2. Data Science @ Meltwater
3. Challenges with NLP at Large scale
4. Entities, entities, entities
a. Social NER
b. ELS
c. Knowledge Graph
3
What is Meltwater?
● A business intelligence company → Providing insights from data outside
the firewall (news, blogs, social media, etc.)
● Born in Oslo, in 2001.
● Founder and CEO: Jørn Lyseggen
● www.meltwater.com
● 30K+ clients all over the World.
● 1000+ employees
● 60+ offices around the world, mostly sales.
● Tech offices: USA, Germany, Sweden, Hungary, India.
4
Why?
own brand
competitors
leads
partners product reviews
own industry
5
What?
Uses Meltwater to find out about new
instances of vandalism and break-ins.
Often, the victim is in need of services
Uses Meltwater to help determine
how public perception of certain
ingredient chemicals will influence
adoption & sales
Uses Meltwater to be alerted
when certain patents will expire in
target markets
Uses Meltwater to monitor the
performance and popularity of news
anchors and programs
Uses Meltwater social listening
to estimate and prevent
infrastructure attacks
6
How?
Unstructured
Document
Stream
Pipeline
Enrichments
Search
/Storage
Enriched Documents
High Performance
Indexes
Processing
Services
API Layer
APPS
Backup Storage
Raw Documents
15 supported languages in pipeline
(EN, DE, SV, NO, FI, ZH, JP, FR, ES, DA, NL, PT, AR, IT,
HI)
Typical enrichments
○ Sentiment analysis
○ Thematic analysis
○ Categorization
○ Keyphrase extraction
○ Named Entity Recognition
○ Named Entity Disambiguation
NLP & Data Science at Meltwater
8
What other than NLP?
● Recommendation Engines
DOC3
DOC3
DOC3
DOC3
DOC3
DOC8
Realtime
recommender
engine
● Correlation and predictive pattern recognition
● Word2vec techniques
concept 3
concept 1
concept 2
“British American Tobacco" or "British
American Tobbaco" or (BAT near tobacco) or
"英美煙草" or (("Lucky Strike" or "Dunhill"
or "Pall Mall") near/15 cigarette*)
9
Machine Learning Terminology
10
Challenges with Data Science (NLP) at scale
• High DPS (~2000) and a lot to do! (tokenization, lemmatization,
stemming, POS tagging, categorization, sentiment, NER, ...) with race
conditions!
Pipeline
Enrichments
SV
EN
DE
POS NER
• Training data labelling is costly! x15
• Contextual information expensive (computationally).
• Noise, missing data, variation (e.g. slang), data types, ...
Knowledge Base Strategy
Entities, entities, entities
12
What are Named Entities (NE)?
● Non-linguistic definition
○ Referable entities
○ Usually Proper Names
○ Single or multi-word
→ I know this man. He might be Charles.
→ He lives in Stockholm. He is Swedish.
13
What is Named Entity Recognition (NER)?
1. Extracting NEs from a text.
2. Categorizing NEs from a set of predefined categories.
John lives in Stockholm. He works at Ericsson.
Categories of {PER, LOC, ORG, MISC, PROD}
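A minimal sketch of what the output of these two steps can look like for the example sentence. The BIO tagging scheme (B = beginning, I = inside, O = outside) is a common convention and an assumption here, not something the slides specify; the entity spans are written by hand where a real system would predict them.

```python
from typing import List, Tuple

# (start_token, end_token_exclusive, category); hand-written for illustration.
Span = Tuple[int, int, str]


def spans_to_bio(tokens: List[str], spans: List[Span]) -> List[str]:
    """Convert entity spans into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, category in spans:
        tags[start] = f"B-{category}"
        for i in range(start + 1, end):
            tags[i] = f"I-{category}"
    return tags


tokens = "John lives in Stockholm . He works at Ericsson .".split()
spans = [(0, 1, "PER"), (3, 4, "LOC"), (8, 9, "ORG")]
tags = spans_to_bio(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Step 1 (extraction) corresponds to finding the span boundaries, step 2 (categorization) to choosing the label attached to each span.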
14
What NER is not?
● NER is not event recognition.
● NER recognises entities in text, and classifies them in some way, but it
does not create templates, nor does it perform co-reference or entity
linking.
● NER is not just matching text strings with pre-defined lists of names. It only
recognises entities which are being used as entities in a given context.
(i.e. not easy!)
15
● Key part of Information Extraction system
● Robust handling of proper names essential for many applications
● Pre-processing for different classification levels
● Information filtering
● Information linking
● Entity level sentiment
● Knowledge graph
Why NER?
16
Why NER?
17
Why NER?
Pepsi spooks Coke with
this Halloween themed ad.
Entity specific sentiment analysis a.k.a ELS
So what about Social…?
19
Supervised Learning
❏ Hidden Markov Model (HMM) Freitag and McCallum, 1999; Leek, 1997.
❏ Conditional Markov Model (CMM) Borthwick, 1999; McCallum et al., 2000.
✓ Conditional Random Field (CRF) Lafferty et al., 2001; Ratinov and Roth, 2009.
How to do NER? (state-of-the-art)
20
● Ground truth data collection for NER is very expensive
● Solutions:
○ Automatic NER annotation using Wikipedia data
○ Applying Latent Dirichlet Allocation (LDA) based NER detection
using Gazetteer data.
Training data
21
NER pipeline
22
Extensive lists of names for a specific category
● PER
○ First names (male-female) and surnames, their frequency
● LOC
○ Cities, Countries
○ Population
● ORG
○ Name of companies from Yellow pages.
Gazetteers help
Disadvantages
○ Difficult to create and maintain (or expensive if commercial)
○ Usefulness varies depending on category
○ Ambiguity
○ Words can occur in several lists of different types (PER, LOC, FAC,...)
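A rough sketch of a gazetteer used as a feature source, showing the ambiguity problem listed above. The three tiny lists and the function name `gazetteer_features` are hypothetical; real gazetteers hold many thousands of names per category.

```python
from typing import Dict, List, Set

# Tiny hypothetical gazetteers, for illustration only.
GAZETTEERS: Dict[str, Set[str]] = {
    "PER": {"washington", "john", "charles"},
    "LOC": {"washington", "stockholm", "shanghai"},
    "ORG": {"ericsson", "jaguar"},
}


def gazetteer_features(token: str) -> List[str]:
    """Return every category whose list contains the token (sorted)."""
    key = token.lower()
    return sorted(cat for cat, names in GAZETTEERS.items() if key in names)


print(gazetteer_features("Stockholm"))   # ['LOC']
print(gazetteer_features("Washington"))  # ['LOC', 'PER']
print(gazetteer_features("Meltwater"))   # []
```

A token that appears in more than one list, or in none, cannot be categorized by lookup alone, which is why the lists serve as features for a model that also sees the context.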
23
Let’s say we want to estimate the likelihood of the bi-gram "to Shanghai",
without having seen this in a training set.
The system can obtain a good estimate if it can cluster "Shanghai" with
other city names (like “London”, “Beijing”), then make its estimate based
on the likelihood of phrases such as "to London", "to Beijing" and "to
Denver"
Brown clustering - motivation
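The "to Shanghai" argument can be sketched with a class-based bigram estimate, P(w2 | w1) ≈ P(class(w2) | class(w1)) · P(w2 | class(w2)). The toy corpus and the hand-assigned classes below are assumptions for illustration; in practice the classes would come out of the clustering itself.

```python
from collections import Counter

corpus = [
    "we fly to London",
    "we fly to Beijing",
    "we drive to Denver",
    "Shanghai is large",  # "Shanghai" is seen, but never after "to"
]
# Hand-assigned classes (hypothetical); everything else falls into OTHER.
word_class = {"London": "CITY", "Beijing": "CITY", "Denver": "CITY",
              "Shanghai": "CITY", "to": "TO"}


def cls(word: str) -> str:
    return word_class.get(word, "OTHER")


word_bigrams, class_bigrams = Counter(), Counter()
left_word, left_class = Counter(), Counter()
word_counts, class_counts = Counter(), Counter()

for sentence in corpus:
    tokens = sentence.split()
    for w in tokens:
        word_counts[w] += 1
        class_counts[cls(w)] += 1
    for w1, w2 in zip(tokens, tokens[1:]):
        word_bigrams[(w1, w2)] += 1
        left_word[w1] += 1
        class_bigrams[(cls(w1), cls(w2))] += 1
        left_class[cls(w1)] += 1


def word_bigram_prob(w1: str, w2: str) -> float:
    """Plain maximum-likelihood bigram estimate."""
    return word_bigrams[(w1, w2)] / left_word[w1] if left_word[w1] else 0.0


def class_bigram_prob(w1: str, w2: str) -> float:
    """Class-based estimate: P(c2 | c1) * P(w2 | c2)."""
    c1, c2 = cls(w1), cls(w2)
    if not left_class[c1] or not class_counts[c2]:
        return 0.0
    p_class = class_bigrams[(c1, c2)] / left_class[c1]
    p_word = word_counts[w2] / class_counts[c2]
    return p_class * p_word


print(word_bigram_prob("to", "Shanghai"))   # 0.0
print(class_bigram_prob("to", "Shanghai"))  # 0.25
```

The word-level estimate is zero because the bigram was never observed; the class-level estimate borrows evidence from "to London", "to Beijing" and "to Denver".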
24
● Proposed by Brown et al. (1992) (a.k.a “IBM clustering”)
● Hierarchical class-based labeling method.
● Bottom-up
● Unsupervised learning
○ Doesn't need labeled data but rather a large set of raw text.
● Greedy technique to maximize bi-gram MI.
● Merge words by contextual similarity.
Brown clustering (1)
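A naive sketch of one greedy, bottom-up step: start with one class per word, try every pair of classes, and keep the merge that preserves the most mutual information (MI) between adjacent classes. This brute-force version is only meant to show the objective; the published algorithm uses a far more efficient update, and the toy corpus is an assumption.

```python
import math
from collections import Counter
from itertools import combinations


def average_mutual_information(tokens, assignment):
    """Mutual information (bits) between the classes of adjacent tokens."""
    classes = [assignment[t] for t in tokens]
    pairs = Counter(zip(classes, classes[1:]))
    total = sum(pairs.values())
    left, right = Counter(), Counter()
    for (c1, c2), n in pairs.items():
        left[c1] += n
        right[c2] += n
    return sum(
        (n / total) * math.log2(n * total / (left[c1] * right[c2]))
        for (c1, c2), n in pairs.items()
    )


def best_merge(tokens, assignment):
    """Try all class pairs; return the merge with the highest remaining MI."""
    best = None
    for a, b in combinations(sorted(set(assignment.values())), 2):
        merged = {w: (a if c == b else c) for w, c in assignment.items()}
        score = average_mutual_information(tokens, merged)
        if best is None or score > best[0]:
            best = (score, a, b, merged)
    return best


tokens = "the cat sat on a mat the dog sat on a rug".split()
assignment = {w: w for w in set(tokens)}  # one class per word to begin with
score, a, b, merged = best_merge(tokens, assignment)
print(f"merge {a!r} + {b!r}, MI after merge = {score:.3f} bits")
```

Repeating this step until one class remains yields the merge hierarchy that gives each word its hierarchical class label.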
25
Brown clustering (2)
● Large amount of data
○ Similar words appear in similar contexts.
○ More precisely: similar words have similar distribution of words to their
immediate left and right.
● Example: “the” and “a” are both determiners.
○ Frequency of immediate words on their left and right:
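A small sketch of how such left and right neighbour frequencies can be counted. The toy sentence is made up for illustration; a real run would use a large raw-text corpus.

```python
from collections import Counter, defaultdict


def neighbour_counts(tokens):
    """Count the immediate left and right neighbours of each word.

    The first and last token are skipped as centre words because they
    lack one of the two neighbours.
    """
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        left[word][prev] += 1
        right[word][nxt] += 1
    return left, right


tokens = "she saw the dog and he saw a dog then the cat chased a cat".split()
left, right = neighbour_counts(tokens)

for word in ("the", "a"):
    print(word, "left:", dict(left[word]), "right:", dict(right[word]))
```

In this toy text "the" and "a" are followed by exactly the same words, which is the kind of distributional similarity that leads the clustering to merge them.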
26
Brown clustering (3)
27
Hmm...easy?
● What are the challenges in real applications?
● What about moving to other languages?
● What about moving to social domain?
28
Disambiguation
What is the entity
category of
“Washington”?
29
Different languages
● Tokenization
○ Chinese & Japanese: Words not separated
● Part of speech
○ Nouns
■ English: only number inflection
■ German: number, gender and case inflection
○ Verbs
■ English: regular verb 4, irregular verb up to 8 distinct forms
■ Finnish: more than 10,000 forms
● NER: Shape feature
○ English: Only proper nouns capitalized
○ German: All nouns capitalized
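A minimal sketch of a word-shape feature as it is often used in NER models. The exact mapping (X for upper case, x for lower case, d for digits, repeats collapsed) is one common variant and an assumption here, not necessarily the one used in the talk.

```python
def word_shape(token: str) -> str:
    """Map characters to X / x / d and collapse repeated symbols."""
    shape = []
    for ch in token:
        if ch.isupper():
            symbol = "X"
        elif ch.islower():
            symbol = "x"
        elif ch.isdigit():
            symbol = "d"
        else:
            symbol = ch  # keep punctuation such as '#' or '-' as is
        if not shape or shape[-1] != symbol:
            shape.append(symbol)
    return "".join(shape)


for token in ["Stockholm", "lives", "BMW", "Haus", "#MadMen", "2016"]:
    print(token, "->", word_shape(token))
```

The German common noun "Haus" gets the same shape as the proper noun "Stockholm", which illustrates why capitalization is a much weaker signal in German than in English.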
30
Different languages
31
Different languages
Studying of linguistic
properties of a language is
important!
32
Editorial vs. Social
33
Challenges in Social NER
● The performance of “off-the-shelf” NER methods degrades severely when
applied on Twitter data
● Tweets
○ are short: 140 character limit.
○ cover wide range of topics.
○ are written in grammatically broken language.
○ are written fast and posted from anywhere: a lot of misspellings.
→ a solution which considers social characteristics of text
34
Challenges in Social NER
Examples of noisy data
● Jaguar's gonna like this episode of #MadMen even less than last week's, I bet.
● Dane Bowers is in Asda I cant believe.it luckiest girl in the world omf i cant believe
it omg
● A feel good story RT @DailyBreezeNews: Santa Claus arrives by helicopter at LAX
to greet local school
35
Solution (1)
Adapting existing features to social properties
(The POS tagger of editorial NER performs really poorly
when it comes to social documents.)
36
Solution (2)
Weight (importance) of each CRF feature
37
Results
● Training Data
○ ~76K tweets labeled by human
annotator.
● Inter-annotator agreement between two
annotators.
● Test Data
○ ~9.1K tweets labeled by human
annotator.
● Improvement compared to the state-of-the-art method
Ritter, A. et al. Named entity recognition in tweets: An
experimental study. EMNLP ’11, pages 1524–1534.
What about sentiment….?
Document Level Sentiment - how it works
Inter-annotator agreement ~80%*
* http://bit.ly/human-sentiment
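A short sketch of how an agreement figure like this can be computed from two annotators' labels. The ten labels are made up for illustration. Cohen's kappa is added as a chance-corrected companion measure; the slides only quote raw agreement.

```python
from collections import Counter


def percent_agreement(a, b):
    """Share of items on which both annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(a)
    observed = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Made-up sentiment labels from two hypothetical annotators.
annotator_1 = ["pos", "neg", "neu", "pos", "neg", "neu", "pos", "neu", "neg", "pos"]
annotator_2 = ["pos", "neg", "neu", "pos", "neu", "neu", "pos", "neg", "neg", "pos"]

print(f"agreement = {percent_agreement(annotator_1, annotator_2):.2f}")
print(f"kappa     = {cohens_kappa(annotator_1, annotator_2):.2f}")
```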
Document Level Sentiment - how it works
Machine Learning Magic
Supervised learning
Naive Bayes - BernoulliNB, GaussianNB, MultinomialNB
Support Vector Machines - LinearSVM, RbfSVM
Maximum Entropy Model - GIS, IIS, MEGAM, TADM
MLP - RecurrentNN
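To make the first family on this list concrete, here is a from-scratch sketch of a multinomial Naive Bayes classifier with add-one smoothing. The four training documents are invented, and the class is an illustration of the technique, not the production model described in the talk.

```python
import math
from collections import Counter, defaultdict


class MultinomialNaiveBayes:
    """Bag-of-words Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(documents, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)  # log prior
            for word in text.lower().split():
                if word in self.vocabulary:  # skip words unseen in training
                    score += math.log((counts[word] + 1) / denominator)
            scores[label] = score
        return max(scores, key=scores.get)


# Invented toy training set.
documents = [
    "great product love it",
    "excellent service great value",
    "terrible support hate it",
    "awful product terrible value",
]
labels = ["positive", "positive", "negative", "negative"]

model = MultinomialNaiveBayes().fit(documents, labels)
print(model.predict("great value"))       # positive
print(model.predict("terrible support"))  # negative
```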
Document Level Sentiment - current status
~60-70% (depending on language)
Not too terrible, considering that human
performance is at best ~80%...
...but why is it so hard?
Document Level Sentiment - how it’s used
Document Level Sentiment - the problem
Document Level Sentiment - the problem
Negative
Neutral
Document Level Sentiment - the problem
“Those numbers underline a growing gap between McDonald's and today's fast-
food customers. It will only get wider with another year's worth of the same
uninspired fare that has made McDonald's customers easy pickings for Panera
Bread, Chick-fil-A, Chipotle Mexican Grill and others.
”
Negative
Positive
Does not make sense for our industry!
Entity Level Sentiment (ELS)
Entity Level Sentiment - motivation
● DLS imprecise and wrong for our customers
● Entities are of main importance for our customers
● We already have NER (Named Entity Recognition) technology
Idea:
Identify the sentiment towards each particular entity in a text!
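A deliberately simple sketch of the idea: attribute sentiment to each entity from the sentences that mention it. The word lists, the sentence-level scope and the example text are all assumptions for illustration; the entity list stands in for NER output, and this is not the method used in the talk.

```python
# Tiny hypothetical polarity lexicons.
POSITIVE = {"great", "impressed", "excellent"}
NEGATIVE = {"terrible", "recall", "poor"}


def entity_sentiment(text, entities):
    """Label each entity from the polarity of the sentences mentioning it."""
    totals = {entity: 0 for entity in entities}
    for sentence in text.split("."):
        words = sentence.lower().split()
        polarity = (sum(w in POSITIVE for w in words)
                    - sum(w in NEGATIVE for w in words))
        for entity in entities:
            if entity.lower() in words:
                totals[entity] += polarity
    return {
        entity: "Positive" if s > 0 else "Negative" if s < 0 else "Neutral"
        for entity, s in totals.items()
    }


text = ("BMW impressed reviewers with a great new engine. "
        "Mercedes announced a new model. "
        "Toyota faces a terrible recall.")
result = entity_sentiment(text, ["BMW", "Mercedes", "Toyota"])
print(result)
```

A single document-level label would have to average these three opinions away; per-entity labels keep them apart. Sentences that mention several entities with opposite sentiment are exactly where this simple scheme breaks down.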
Entity Level Sentiment - how it works
NER
BMW: Positive
Mercedes: Neutral
Toyota: Negative
…
Entity Level Sentiment - how it works
Entity1: Positive
Entity2: Neutral
Entity3: Negative
…
E1: Positive
E2: Neutral
E3: Negative
Entity Level Sentiment - how it works
Entity1: Positive
Entity2: Neutral
Entity3: Negative
…
NER
Entity Level Sentiment - use case
Entity Level Sentiment - current status
● ELS is considered a very tough problem in NLP/ML
● The accuracy of state-of-the-art ELS is currently very low
(~45%)
The holy grail : The Graph Knowledge Base
56
Entities + Relationships
As the types of entities and their relationships grow, so does
the capacity to infer insights that depend on connectivity.
Eventually one can answer questions that would not be
answerable from separate datasets alone!
57
KB Architecture
Unstructured
Document
Stream
Pipeline
Enrichments
Graph
Search
Enriched Documents
High Performance
Indexes
Processing
Services
API Layer
Knowledge
Base
(Graph)
I/O
External Data Providers
Updates/subscriptions
Lookups
APPS
Backup Storage
Raw Documents
Why is it hard?
59
Composing the KB
60
Data Acquisition trade-offs
High volume
High quality
Cheap
Manual data
acquisition
Special crawlers,
Smart algorithms
Acquisitions,
partnerships
low
quality
expensive
low
volume
61
Composing the KB - Scalability
62
Scalability Requirements - next steps
Companies ~ 100 million worldwide
People ~ 500 million (including media influencers)
Products ~ 500 million
~1 billion entities + all the connections
between them
→
billions of nodes, trillions of edges!
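A back-of-the-envelope check of these orders of magnitude. The entity counts are the ones on the slide; the average number of connections per entity is a purely hypothetical parameter, chosen only to show what "trillions of edges" implies.

```python
# Entity counts from the slide.
entity_counts = {
    "companies": 100_000_000,
    "people": 500_000_000,
    "products": 500_000_000,
}
total_entities = sum(entity_counts.values())


def edge_count(nodes: int, average_degree: float) -> int:
    """Undirected edges when each node has `average_degree` neighbours on average."""
    return int(nodes * average_degree / 2)


print(f"entities: {total_entities:,}")
# average_degree values are assumptions, not figures from the talk.
for average_degree in (20, 200, 2000):
    print(f"degree {average_degree:>4}: {edge_count(total_entities, average_degree):,} edges")
```

Under these assumptions, a trillion edges corresponds to roughly a couple of thousand connections per entity.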
63
Composing the KB - New features
64
Improve entity search - company NED
65
Improve entity search - person NED
Robert Gates
22nd Secretary of Defense
William Henry Gates III
former CEO & cofounder of Microsoft
“Who is Mr. Gates?”
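One simple way to approach the "Mr. Gates" question is to compare the words around the mention with a keyword profile for each candidate. The sketch below does this by plain word overlap; the profiles are built from the descriptions on the slide plus a few assumed keywords, and the function name is hypothetical.

```python
# Candidate profiles: slide descriptions plus a few assumed keywords.
CANDIDATES = {
    "Robert Gates": {"secretary", "defense", "pentagon"},
    "William Henry Gates III": {"microsoft", "ceo", "cofounder", "software"},
}


def disambiguate(context: str, candidates=CANDIDATES) -> str:
    """Pick the candidate whose profile shares the most words with the context.

    Ties go to the candidate listed first.
    """
    cleaned = context.lower().replace(".", " ").replace(",", " ")
    words = set(cleaned.split())
    return max(candidates, key=lambda name: len(candidates[name] & words))


print(disambiguate("Mr. Gates testified at the Pentagon as Secretary of Defense."))
print(disambiguate("Mr. Gates stepped down as CEO of Microsoft."))
```

With a knowledge graph, the keyword profiles can be replaced by each candidate's neighbours in the graph (employers, locations, co-mentioned people), which is where the connection to the knowledge base comes in.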
66
Emerging competition
67
Map influencer network
influencer score ~ e.g. PageRank
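A compact sketch of PageRank by power iteration on a tiny, made-up "who mentions whom" graph. The names, the damping factor of 0.85 and the iteration count are assumptions; the slide only names PageRank as one possible influencer score.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each node to the nodes it links to."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / len(nodes)
        rank = new_rank
    return rank


# Hypothetical mention graph: an edge A -> B means "A mentions B".
mentions = {
    "alice": ["carol"],
    "bob": ["carol"],
    "carol": ["alice"],
    "dave": ["carol", "alice"],
}
scores = pagerank(mentions)
for name, value in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {value:.3f}")
```

Here "carol" ends up with the highest score because she is mentioned by the most accounts, including one that is itself well connected.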
68
Suggested read
● Ratinov 2009 (challenges in NER): paper.
● ArkCMU (social): paper, code.
● Ritter et al (social): paper, code.
● Stanford NLP NER (editorial): paper, code.
● Brown clustering
○ brown clustering: video
○ Word Representations and N-grams: video
● Transforming Wikipedia into Named Entity Training Data: paper.

More Related Content

Viewers also liked

Sorok között olvasni
Sorok között olvasniSorok között olvasni
Sorok között olvasniZoltan Varju
 
Milyenek a trollok
Milyenek a trollokMilyenek a trollok
Milyenek a trollokZoltan Varju
 
Munkanélküliség jelenbecslése
Munkanélküliség jelenbecsléseMunkanélküliség jelenbecslése
Munkanélküliség jelenbecsléseZoltan Varju
 
Digitális testbeszéd
Digitális testbeszédDigitális testbeszéd
Digitális testbeszédZoltan Varju
 
Érzelmek hálójában – hálózat- és tartalomelemzés
Érzelmek hálójában – hálózat- és tartalomelemzésÉrzelmek hálójában – hálózat- és tartalomelemzés
Érzelmek hálójában – hálózat- és tartalomelemzésZoltan Varju
 
Szabó - Varjú: Automatikus értékelés- és érzelemelemzés magyar nyelvű szöveg...
Szabó - Varjú: Automatikus  értékelés- és érzelemelemzés magyar nyelvű szöveg...Szabó - Varjú: Automatikus  értékelés- és érzelemelemzés magyar nyelvű szöveg...
Szabó - Varjú: Automatikus értékelés- és érzelemelemzés magyar nyelvű szöveg...Zoltan Varju
 
Balogh Kitti - Szűcs Krisztina: Képes beszéd
Balogh Kitti - Szűcs Krisztina: Képes beszédBalogh Kitti - Szűcs Krisztina: Képes beszéd
Balogh Kitti - Szűcs Krisztina: Képes beszédZoltan Varju
 
Coparative analysis-of-pepsi-and-coke
Coparative analysis-of-pepsi-and-cokeCoparative analysis-of-pepsi-and-coke
Coparative analysis-of-pepsi-and-cokePrabhpreet Singh
 
Kisvilágunk, a nyelv
Kisvilágunk, a nyelvKisvilágunk, a nyelv
Kisvilágunk, a nyelvZoltan Varju
 
Balogh Kitti - Szűcs Krisztina - Varjú Zoltán: TechTea: Szövegvizualizációk a...
Balogh Kitti - Szűcs Krisztina - Varjú Zoltán: TechTea: Szövegvizualizációk a...Balogh Kitti - Szűcs Krisztina - Varjú Zoltán: TechTea: Szövegvizualizációk a...
Balogh Kitti - Szűcs Krisztina - Varjú Zoltán: TechTea: Szövegvizualizációk a...Zoltan Varju
 
Érzelmek és témák a szülészeti ellátásban
Érzelmek és témák a szülészeti ellátásbanÉrzelmek és témák a szülészeti ellátásban
Érzelmek és témák a szülészeti ellátásbankttblgh
 

Viewers also liked (11)

Sorok között olvasni
Sorok között olvasniSorok között olvasni
Sorok között olvasni
 
Milyenek a trollok
Milyenek a trollokMilyenek a trollok
Milyenek a trollok
 
Munkanélküliség jelenbecslése
Munkanélküliség jelenbecsléseMunkanélküliség jelenbecslése
Munkanélküliség jelenbecslése
 
Digitális testbeszéd
Digitális testbeszédDigitális testbeszéd
Digitális testbeszéd
 
Érzelmek hálójában – hálózat- és tartalomelemzés
Érzelmek hálójában – hálózat- és tartalomelemzésÉrzelmek hálójában – hálózat- és tartalomelemzés
Érzelmek hálójában – hálózat- és tartalomelemzés
 
Szabó - Varjú: Automatikus értékelés- és érzelemelemzés magyar nyelvű szöveg...
Szabó - Varjú: Automatikus  értékelés- és érzelemelemzés magyar nyelvű szöveg...Szabó - Varjú: Automatikus  értékelés- és érzelemelemzés magyar nyelvű szöveg...
Szabó - Varjú: Automatikus értékelés- és érzelemelemzés magyar nyelvű szöveg...
 
Balogh Kitti - Szűcs Krisztina: Képes beszéd
Balogh Kitti - Szűcs Krisztina: Képes beszédBalogh Kitti - Szűcs Krisztina: Képes beszéd
Balogh Kitti - Szűcs Krisztina: Képes beszéd
 
Coparative analysis-of-pepsi-and-coke
Coparative analysis-of-pepsi-and-cokeCoparative analysis-of-pepsi-and-coke
Coparative analysis-of-pepsi-and-coke
 
Kisvilágunk, a nyelv
Kisvilágunk, a nyelvKisvilágunk, a nyelv
Kisvilágunk, a nyelv
 
Balogh Kitti - Szűcs Krisztina - Varjú Zoltán: TechTea: Szövegvizualizációk a...
Balogh Kitti - Szűcs Krisztina - Varjú Zoltán: TechTea: Szövegvizualizációk a...Balogh Kitti - Szűcs Krisztina - Varjú Zoltán: TechTea: Szövegvizualizációk a...
Balogh Kitti - Szűcs Krisztina - Varjú Zoltán: TechTea: Szövegvizualizációk a...
 
Érzelmek és témák a szülészeti ellátásban
Érzelmek és témák a szülészeti ellátásbanÉrzelmek és témák a szülészeti ellátásban
Érzelmek és témák a szülészeti ellátásban
 

Similar to Babak Rasolzadeh: The importance of entities

The Rise Of Conversational AI with David Low
The Rise Of Conversational AI with David LowThe Rise Of Conversational AI with David Low
The Rise Of Conversational AI with David LowDatabricks
 
State of the art in Natural Language Processing (March 2019)
State of the art in Natural Language Processing (March 2019)State of the art in Natural Language Processing (March 2019)
State of the art in Natural Language Processing (March 2019)Liad Magen
 
GATE: a text analysis tool for social media
GATE: a text analysis tool for social mediaGATE: a text analysis tool for social media
GATE: a text analysis tool for social mediaDiana Maynard
 
Introduction to Named Entity Recognition
Introduction to Named Entity RecognitionIntroduction to Named Entity Recognition
Introduction to Named Entity RecognitionTomer Lieber
 
How can text-mining leverage developments in Deep Learning? Presentation at ...
How can text-mining leverage developments in Deep Learning?  Presentation at ...How can text-mining leverage developments in Deep Learning?  Presentation at ...
How can text-mining leverage developments in Deep Learning? Presentation at ...jcscholtes
 
Do We Need Better Presentations
Do We Need Better PresentationsDo We Need Better Presentations
Do We Need Better PresentationsJose Ramon Macias
 
Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)Andre Freitas
 
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...Andre Freitas
 
Guenther Krumpak: The Book and The Internet - the Antithesis between Paper an...
Guenther Krumpak: The Book and The Internet - the Antithesis between Paper an...Guenther Krumpak: The Book and The Internet - the Antithesis between Paper an...
Guenther Krumpak: The Book and The Internet - the Antithesis between Paper an...KISK FF MU
 
Getting started in Data Science (April 2017, Los Angeles)
Getting started in Data Science (April 2017, Los Angeles)Getting started in Data Science (April 2017, Los Angeles)
Getting started in Data Science (April 2017, Los Angeles)Thinkful
 
Text analysis and Semantic Search with GATE
Text analysis and Semantic Search with GATEText analysis and Semantic Search with GATE
Text analysis and Semantic Search with GATEDiana Maynard
 
Broad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition ResourceBroad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition ResourceLeon Derczynski
 
DataScientist Job : Between Myths and Reality.pdf
DataScientist Job : Between Myths and Reality.pdfDataScientist Job : Between Myths and Reality.pdf
DataScientist Job : Between Myths and Reality.pdfJedha Bootcamp
 
Boosting Named Entity Extraction through Crowdsourcing
Boosting Named Entity Extraction through CrowdsourcingBoosting Named Entity Extraction through Crowdsourcing
Boosting Named Entity Extraction through Crowdsourcingoanainel
 
Rigourous evaluation of nlp models in real world deployment
Rigourous evaluation of nlp models in real world deploymentRigourous evaluation of nlp models in real world deployment
Rigourous evaluation of nlp models in real world deploymentSandy Man
 
Towards Responsible NLP: Walking the walk
Towards Responsible NLP: Walking the walkTowards Responsible NLP: Walking the walk
Towards Responsible NLP: Walking the walkMonaDiab7
 
Silk data - machine learning
Silk data - machine learning Silk data - machine learning
Silk data - machine learning SaltoDigitale
 
Machine learning technology for publishing industry / Buchmesse 2018
Machine learning technology for publishing industry / Buchmesse 2018Machine learning technology for publishing industry / Buchmesse 2018
Machine learning technology for publishing industry / Buchmesse 2018Maxim Kublitski
 
Openbar Leuven // Less is more. Working with less data in NLP by Yves Peirsman
Openbar Leuven // Less is more. Working with less data in NLP by Yves PeirsmanOpenbar Leuven // Less is more. Working with less data in NLP by Yves Peirsman
Openbar Leuven // Less is more. Working with less data in NLP by Yves PeirsmanOpenbar
 
"Understanding Humans with Machines" (Arthur Tisi)
"Understanding Humans with Machines" (Arthur Tisi)"Understanding Humans with Machines" (Arthur Tisi)
"Understanding Humans with Machines" (Arthur Tisi)Maryam Farooq
 

Similar to Babak Rasolzadeh: The importance of entities (20)

The Rise Of Conversational AI with David Low
The Rise Of Conversational AI with David LowThe Rise Of Conversational AI with David Low
The Rise Of Conversational AI with David Low
 
State of the art in Natural Language Processing (March 2019)
State of the art in Natural Language Processing (March 2019)State of the art in Natural Language Processing (March 2019)
State of the art in Natural Language Processing (March 2019)
 
GATE: a text analysis tool for social media
GATE: a text analysis tool for social mediaGATE: a text analysis tool for social media
GATE: a text analysis tool for social media
 
Introduction to Named Entity Recognition
Introduction to Named Entity RecognitionIntroduction to Named Entity Recognition
Introduction to Named Entity Recognition
 
How can text-mining leverage developments in Deep Learning? Presentation at ...
How can text-mining leverage developments in Deep Learning?  Presentation at ...How can text-mining leverage developments in Deep Learning?  Presentation at ...
How can text-mining leverage developments in Deep Learning? Presentation at ...
 
Do We Need Better Presentations
Do We Need Better PresentationsDo We Need Better Presentations
Do We Need Better Presentations
 
Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)Question Answering over Linked Data (Reasoning Web Summer School)
Question Answering over Linked Data (Reasoning Web Summer School)
 
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...
Question Answering over Linked Data: Challenges, Approaches & Trends (Tutoria...
 
Guenther Krumpak: The Book and The Internet - the Antithesis between Paper an...
Guenther Krumpak: The Book and The Internet - the Antithesis between Paper an...Guenther Krumpak: The Book and The Internet - the Antithesis between Paper an...
Guenther Krumpak: The Book and The Internet - the Antithesis between Paper an...
 
Getting started in Data Science (April 2017, Los Angeles)
Getting started in Data Science (April 2017, Los Angeles)Getting started in Data Science (April 2017, Los Angeles)
Getting started in Data Science (April 2017, Los Angeles)
 
Text analysis and Semantic Search with GATE
Text analysis and Semantic Search with GATEText analysis and Semantic Search with GATE
Text analysis and Semantic Search with GATE
 
Broad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition ResourceBroad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition Resource
 
DataScientist Job : Between Myths and Reality.pdf
DataScientist Job : Between Myths and Reality.pdfDataScientist Job : Between Myths and Reality.pdf
DataScientist Job : Between Myths and Reality.pdf
 
Boosting Named Entity Extraction through Crowdsourcing
Boosting Named Entity Extraction through CrowdsourcingBoosting Named Entity Extraction through Crowdsourcing
Boosting Named Entity Extraction through Crowdsourcing
 
Rigourous evaluation of nlp models in real world deployment
Rigourous evaluation of nlp models in real world deploymentRigourous evaluation of nlp models in real world deployment
Rigourous evaluation of nlp models in real world deployment
 
Towards Responsible NLP: Walking the walk
Towards Responsible NLP: Walking the walkTowards Responsible NLP: Walking the walk
Towards Responsible NLP: Walking the walk
 
Silk data - machine learning
Silk data - machine learning Silk data - machine learning
Silk data - machine learning
 
Machine learning technology for publishing industry / Buchmesse 2018
Machine learning technology for publishing industry / Buchmesse 2018Machine learning technology for publishing industry / Buchmesse 2018
Machine learning technology for publishing industry / Buchmesse 2018
 
Openbar Leuven // Less is more. Working with less data in NLP by Yves Peirsman
Openbar Leuven // Less is more. Working with less data in NLP by Yves PeirsmanOpenbar Leuven // Less is more. Working with less data in NLP by Yves Peirsman
Openbar Leuven // Less is more. Working with less data in NLP by Yves Peirsman
 
"Understanding Humans with Machines" (Arthur Tisi)
"Understanding Humans with Machines" (Arthur Tisi)"Understanding Humans with Machines" (Arthur Tisi)
"Understanding Humans with Machines" (Arthur Tisi)
 

More from Zoltan Varju

NLP meetup 2016.10.05 - Bódogh Attila: xdroid
NLP meetup 2016.10.05 - Bódogh Attila: xdroidNLP meetup 2016.10.05 - Bódogh Attila: xdroid
NLP meetup 2016.10.05 - Bódogh Attila: xdroidZoltan Varju
 
NLP meetup 2016.10.05 - Szabó Martina Katalin: Precognox
NLP meetup 2016.10.05 - Szabó Martina Katalin: PrecognoxNLP meetup 2016.10.05 - Szabó Martina Katalin: Precognox
NLP meetup 2016.10.05 - Szabó Martina Katalin: PrecognoxZoltan Varju
 
NLP meetup 2016.10.05 - Szekeres Péter: Neticle
NLP meetup 2016.10.05 - Szekeres Péter: NeticleNLP meetup 2016.10.05 - Szekeres Péter: Neticle
NLP meetup 2016.10.05 - Szekeres Péter: NeticleZoltan Varju
 
Balogh Kitti: Szövegbányászat
Balogh Kitti: SzövegbányászatBalogh Kitti: Szövegbányászat
Balogh Kitti: SzövegbányászatZoltan Varju
 
Balogh Kitti: Politika a sorok között - Politikai témájú szövegelemzések
Balogh Kitti: Politika a sorok között - Politikai témájú szövegelemzésekBalogh Kitti: Politika a sorok között - Politikai témájú szövegelemzések
Balogh Kitti: Politika a sorok között - Politikai témájú szövegelemzésekZoltan Varju
 
Mókus (Koncsik Anita, Varjú Zoltán)
Mókus (Koncsik Anita, Varjú Zoltán)Mókus (Koncsik Anita, Varjú Zoltán)
Mókus (Koncsik Anita, Varjú Zoltán)Zoltan Varju
 
Születésház - Adatozz okosan hackathon (Schmidt Erika, Balogh Kitti, Hudy Rób...
Születésház - Adatozz okosan hackathon (Schmidt Erika, Balogh Kitti, Hudy Rób...Születésház - Adatozz okosan hackathon (Schmidt Erika, Balogh Kitti, Hudy Rób...
Születésház - Adatozz okosan hackathon (Schmidt Erika, Balogh Kitti, Hudy Rób...Zoltan Varju
 
Danics Szabina Lívia: A magyar és az orosz melléknévi igenevek a megfelelteté...
Danics Szabina Lívia: A magyar és az orosz melléknévi igenevek a megfelelteté...Danics Szabina Lívia: A magyar és az orosz melléknévi igenevek a megfelelteté...
Danics Szabina Lívia: A magyar és az orosz melléknévi igenevek a megfelelteté...Zoltan Varju
 
Rasztik Zita: A стартовать jövevényszó fejlődési útja
Rasztik Zita: A стартовать jövevényszó fejlődési útjaRasztik Zita: A стартовать jövevényszó fejlődési útja
Rasztik Zita: A стартовать jövevényszó fejlődési útjaZoltan Varju
 
Kontextus és a hivatkozások ereje
Kontextus és a hivatkozások erejeKontextus és a hivatkozások ereje
Kontextus és a hivatkozások erejeZoltan Varju
 
Simon Eszter: Silver standard korpuszok tulajdonnév-felismeréshez
Simon Eszter: Silver standard korpuszok tulajdonnév-felismeréshezSimon Eszter: Silver standard korpuszok tulajdonnév-felismeréshez
Simon Eszter: Silver standard korpuszok tulajdonnév-felismeréshezZoltan Varju
 
Vincze Veronika: Korpuszok az információkinyerésben
Vincze Veronika: Korpuszok az információkinyerésben Vincze Veronika: Korpuszok az információkinyerésben
Vincze Veronika: Korpuszok az információkinyerésben Zoltan Varju
 
Miháltz Márton: Magyar wordnet
Miháltz Márton: Magyar wordnetMiháltz Márton: Magyar wordnet
Miháltz Márton: Magyar wordnetZoltan Varju
 
Ács Judit: Online soknyelvű szótárak
Ács Judit: Online soknyelvű szótárakÁcs Judit: Online soknyelvű szótárak
Ács Judit: Online soknyelvű szótárakZoltan Varju
 
Sass Bálint: 28 millió szintaktikailag elemzett mondat és 500000 igei szerkezet
Sass Bálint: 28 millió szintaktikailag elemzett mondat és 500000 igei szerkezetSass Bálint: 28 millió szintaktikailag elemzett mondat és 500000 igei szerkezet
Sass Bálint: 28 millió szintaktikailag elemzett mondat és 500000 igei szerkezetZoltan Varju
 
Vincze Veronika: Korpuszok az információkinyerésben
Vincze Veronika: Korpuszok az információkinyerésben Vincze Veronika: Korpuszok az információkinyerésben
Vincze Veronika: Korpuszok az információkinyerésben Zoltan Varju
 
Vincze Veronika: A Szeged Korpusz és Treebank
Vincze Veronika: A Szeged Korpusz és Treebank Vincze Veronika: A Szeged Korpusz és Treebank
Vincze Veronika: A Szeged Korpusz és Treebank Zoltan Varju
 
Textus; szövegek hálójában
Textus; szövegek hálójábanTextus; szövegek hálójában
Textus; szövegek hálójábanZoltan Varju
 
Szabó - Vincze - Morvay: Magyar nyelvű szövegek emócióelemzésének elméleti és...
Szabó - Vincze - Morvay: Magyar nyelvű szövegek emócióelemzésénekelméleti és...Szabó - Vincze - Morvay: Magyar nyelvű szövegek emócióelemzésénekelméleti és...
Szabó - Vincze - Morvay: Magyar nyelvű szövegek emócióelemzésének elméleti és...Zoltan Varju
 

More from Zoltan Varju (20)

NLP meetup 2016.10.05 - Bódogh Attila: xdroid
NLP meetup 2016.10.05 - Bódogh Attila: xdroidNLP meetup 2016.10.05 - Bódogh Attila: xdroid
NLP meetup 2016.10.05 - Bódogh Attila: xdroid
 
NLP meetup 2016.10.05 - Szabó Martina Katalin: Precognox
NLP meetup 2016.10.05 - Szabó Martina Katalin: PrecognoxNLP meetup 2016.10.05 - Szabó Martina Katalin: Precognox
NLP meetup 2016.10.05 - Szabó Martina Katalin: Precognox
 
NLP meetup 2016.10.05 - Szekeres Péter: Neticle
NLP meetup 2016.10.05 - Szekeres Péter: NeticleNLP meetup 2016.10.05 - Szekeres Péter: Neticle
NLP meetup 2016.10.05 - Szekeres Péter: Neticle
 
Balogh Kitti: Szövegbányászat
Balogh Kitti: SzövegbányászatBalogh Kitti: Szövegbányászat
Babak Rasolzadeh: The importance of entities

  • 1. Meltwater Budapest, April 2016 The importance of entities Babak Rasolzadeh, Director of Data Science Research
  • 2. 1. Company background 2. Data Science @ Meltwater 3. Challenges with NLP at Large scale 4. Entities, entities, entities a. Social NER b. ELS c. Knowledge Graph
  • 3. What is Meltwater? ● A business intelligence company → Providing insights from data outside the firewall (news, blogs, social media, etc.) ● Born in Oslo in 2001. ● Founder and CEO: Jørn Lyseggen ● www.meltwater.com ● 30K+ clients all over the world. ● 1000+ employees ● 60+ offices around the world, mostly sales. ● Tech offices: USA, Germany, Sweden, Hungary, India.
  • 5. What? ● Uses Meltwater to find out about new instances of vandalism and break-ins; often, the victim is in need of services. ● Uses Meltwater to help determine how public perception of certain ingredient chemicals will influence adoption and sales. ● Uses Meltwater to be alerted when certain patents will expire in target markets. ● Uses Meltwater to monitor the performance and popularity of news anchors and programs. ● Uses Meltwater social listening to estimate and prevent infrastructure attacks.
  • 7. NLP & Data Science at Meltwater. Diagram labels: Unstructured Document Stream; Pipeline Enrichments; Search/Storage; Enriched Documents; High Performance Indexes; Processing Services; API Layer; APPS; Backup Storage; Raw Documents. 15 supported languages in pipeline (EN, DE, SV, NO, FI, ZH, JP, FR, ES, DA, NL, PT, AR, IT, HI). Typical enrichments ○ Sentiment analysis ○ Thematic analysis ○ Categorization ○ Keyphrase extraction ○ Named Entity Recognition ○ Named Entity Disambiguation
  • 8. What other than NLP? ● Recommendation engines (diagram labels: DOC3, DOC8, Realtime recommender engine) ● Correlation and predictive pattern recognition ● Word2vec techniques (diagram labels: concept 1, concept 2, concept 3). Query: "British American Tobacco" or "British American Tobbaco" or (BAT near tobacco) or "英美煙草" or (("Lucky Strike" or "Dunhill" or "Pall Mall") near/15 cigarette*)
  • 10. Challenges with Data Science (NLP) at scale • High DPS (~2000) and a lot to do (tokenization, lemmatization, stemming, POS tagging, categorization, sentiment, NER, ...), with race conditions! (diagram labels: Pipeline Enrichments; SV, EN, DE; POS, NER) • Training data labelling is costly (x15). • Contextual information is expensive (computationally). • Noise, missing data, variation (e.g. slang), data types, ...
  • 11. Knowledge Base Strategy Entities, entities, entities don - July 2015
  • 12. What are Named Entities (NE)? ● Non-linguistic definition ○ Referable entities ○ Usually proper names ○ Single or multi-word → I know this man. He might be Charles. → He lives in Stockholm. He is Swedish.
  • 13. What is Named Entity Recognition (NER)? 1. Extracting NEs from a text. 2. Categorizing NEs into a set of predefined categories. Example: John lives in Stockholm. He works at Ericsson. Categories: {PER, LOC, ORG, MISC, PROD}
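The two steps on this slide (extract, then categorize) can be illustrated with a minimal Python sketch. It assumes the tagger emits BIO tags (B- for the first token of an entity, I- for a continuation, O for everything else); the BIO scheme and the helper name extract_entities are assumptions for illustration and are not taken from the talk.

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (text, category) spans.

    An I- tag that does not continue the current category is treated like O.
    """
    entities = []
    current_tokens, current_cat = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts: flush the previous one, if any.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_cat))
            current_tokens, current_cat = [token], tag[2:]
        elif tag.startswith("I-") and current_cat == tag[2:]:
            # Continuation of the current entity.
            current_tokens.append(token)
        else:
            # Outside any entity: flush and reset.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_cat))
            current_tokens, current_cat = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_cat))
    return entities


# Hand-written tags for the slide's example sentence (not model output).
tokens = ["John", "lives", "in", "Stockholm", ".", "He", "works", "at", "Ericsson", "."]
tags = ["B-PER", "O", "O", "B-LOC", "O", "O", "O", "O", "B-ORG", "O"]
print(extract_entities(tokens, tags))
```

The tags here are written by hand; in a real system they would come from a trained sequence model.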
  • 14. What is NER not? ● NER is not event recognition. ● NER recognises entities in text and classifies them in some way, but it does not create templates, nor does it perform co-reference resolution or entity linking. ● NER is not just matching text strings against pre-defined lists of names. It only recognises entities that are being used as entities in a given context (i.e. not easy!).
  • 15. Why NER? ● Key part of an Information Extraction system ● Robust handling of proper names is essential for many applications ● Pre-processing for different classification levels ● Information filtering ● Information linking ● Entity level sentiment ● Knowledge graph
  • 17. Why NER? Pepsi spooks Coke with this Halloween themed ad. Entity specific sentiment analysis, a.k.a. ELS
  • 18. So what about Social…?
  • 19. How to do NER? (state-of-the-art) Supervised Learning ❏ Hidden Markov Model (HMM): Freitag and McCallum, 1999; Leek, 1997. ❏ Conditional Markov Model (CMM): Borthwick, 1999; McCallum et al., 2000. ✓ Conditional Random Field (CRF): Lafferty et al., 2001; Ratinov and Roth, 2009.
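For reference, the standard linear-chain CRF scores a tag sequence y for a token sequence x as shown below. The symbols (feature functions f_k, weights λ_k, sequence length T) follow the usual textbook notation and are not taken from the slides; the weights λ_k are what a later slide calls the "weight (importance) of each CRF feature".

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)
```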
  • 20. Training data ● Ground truth data collection for NER is very expensive ● Solutions: ○ Automatic NER annotation using Wikipedia data ○ Applying Latent Dirichlet Allocation (LDA) based NER detection using gazetteer data.
  • 22. Gazetteers help. Extensive lists of names for a specific category ● PER ○ First names (male and female) and surnames, with their frequency ● LOC ○ Cities, countries ○ Population ● ORG ○ Names of companies from the Yellow Pages. Disadvantages ○ Difficult to create and maintain (or expensive if commercial) ○ Usefulness varies depending on category ○ Ambiguity ○ The same word can occur in several lists of different types (PER, LOC, FAC, ...)
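A gazetteer feature and its ambiguity problem can be sketched in a few lines of Python. The tiny lists below are invented for illustration; real gazetteers are far larger, and the function name gazetteer_categories is hypothetical.

```python
# Toy gazetteers: invented entries, for illustration only.
GAZETTEERS = {
    "PER": {"john", "charles", "washington"},
    "LOC": {"stockholm", "shanghai", "washington"},
    "ORG": {"ericsson", "meltwater"},
}


def gazetteer_categories(token):
    """Return every category whose list contains the token (case-insensitive)."""
    key = token.lower()
    return sorted(cat for cat, names in GAZETTEERS.items() if key in names)


print(gazetteer_categories("Ericsson"))
print(gazetteer_categories("Washington"))
```

"Washington" matches more than one list, so the lookup alone cannot decide the category. That is why gazetteer hits are typically used as features for a context-aware model rather than as the final answer.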
  • 23. Brown clustering - motivation. Let's say we want to estimate the likelihood of the bi-gram "to Shanghai" without having seen it in a training set. The system can obtain a good estimate if it can cluster "Shanghai" with other city names (like "London" and "Beijing") and then base its estimate on the likelihood of phrases such as "to London", "to Beijing" and "to Denver".
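The idea can be made concrete with a class-based bigram estimate, P(word | prev) ≈ P(class | prev) · P(word | class). The sketch below uses a made-up toy corpus and a hand-assigned cluster map (both are assumptions; Brown clustering would learn the clusters from raw text).

```python
# Toy training bigrams; note that ("to", "Shanghai") never occurs.
bigrams = [
    ("to", "London"), ("to", "Beijing"), ("to", "Denver"), ("to", "work"),
    ("in", "Shanghai"), ("in", "London"),
]

# Hypothetical cluster assignments (hand-written here, normally learned).
cluster = {
    "London": "CITY", "Beijing": "CITY", "Denver": "CITY",
    "Shanghai": "CITY", "work": "OTHER",
}


def mle_bigram_prob(prev, word):
    """Plain maximum-likelihood estimate of P(word | prev)."""
    prev_total = sum(1 for p, _ in bigrams if p == prev)
    pair_count = sum(1 for p, w in bigrams if p == prev and w == word)
    return pair_count / prev_total


def class_bigram_prob(prev, word):
    """Class-based estimate: P(class | prev) * P(word | class)."""
    c = cluster[word]
    prev_total = sum(1 for p, _ in bigrams if p == prev)
    prev_class = sum(1 for p, w in bigrams if p == prev and cluster[w] == c)
    class_total = sum(1 for _, w in bigrams if cluster[w] == c)
    word_count = sum(1 for _, w in bigrams if w == word)
    return (prev_class / prev_total) * (word_count / class_total)


print(mle_bigram_prob("to", "Shanghai"))    # unseen bigram
print(class_bigram_prob("to", "Shanghai"))  # borrows evidence from other cities
```

The plain estimate assigns the unseen bigram zero probability, while the class-based one gives it a non-zero value because "to" is often followed by other members of the same cluster.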
  • 24. Brown clustering (1) ● Proposed by Brown et al. (1992) (a.k.a. "IBM clustering") ● Hierarchical class-based labeling method. ● Bottom-up ● Unsupervised learning ○ Doesn't need labeled data, but rather a large set of raw text. ● Greedy technique to maximize bi-gram mutual information (MI). ● Merges words by contextual similarity.
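The quantity the greedy merging tries to keep as high as possible is, in the usual formulation, the average mutual information between adjacent classes. The notation below is the common textbook form and is an assumption about what the slide's "bi-gram MI" refers to: p(c, c') is the relative frequency with which a word of class c is immediately followed by a word of class c'.

```latex
I(C) = \sum_{c,\,c'} p(c, c') \, \log \frac{p(c, c')}{p(c)\, p(c')}
```

Starting with one class per word, each step merges the pair of classes whose merge causes the smallest drop in I(C).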
  • 25. Brown clustering (2) ● Large amount of data ○ Similar words appear in similar contexts. ○ More precisely: similar words have a similar distribution of words to their immediate left and right. ● Example: "the" and "a" are both determiners. ○ Compare the frequency of the words immediately to their left and right.
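Counting the immediate left and right neighbours of a word is straightforward. The following Python sketch uses a made-up toy sentence (an assumption, not data from the talk) to show that "the" and "a" share neighbours on both sides.

```python
from collections import Counter


def context_counts(tokens, target):
    """Count the words immediately to the left and right of each target occurrence."""
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        if i > 0:
            left[tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            right[tokens[i + 1]] += 1
    return left, right


# Toy corpus, invented for illustration.
corpus = "he saw the dog and she saw a dog near the house and a cat".split()

left_the, right_the = context_counts(corpus, "the")
left_a, right_a = context_counts(corpus, "a")

# Contexts that "the" and "a" have in common.
print(set(left_the) & set(left_a))
print(set(right_the) & set(right_a))
```

On a large corpus these neighbour distributions become much more informative, which is what allows an unsupervised method to group the two words together.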
  • 27. Hmm...easy? ● What are the challenges in real applications? ● What about moving to other languages? ● What about moving to the social domain?
  • 28. Disambiguation: What is the entity category of "Washington"?
  • 29. Different languages ● Tokenization ○ Chinese & Japanese: words are not separated ● Part of speech ○ Nouns ■ English: only number inflection ■ German: number, gender and case inflection ○ Verbs ■ English: regular verbs have 4 distinct forms, irregular verbs up to 8 ■ Finnish: more than 10,000 forms ● NER: shape feature ○ English: only proper nouns are capitalized ○ German: all nouns are capitalized
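A shape feature maps each character to a coarse class so that the model sees capitalization and digit patterns rather than the exact string. The mapping below (X for uppercase, x for lowercase, d for digit) is one common convention and is an assumption; the talk does not say which variant was used.

```python
def word_shape(token):
    """Map a token to its shape: uppercase -> X, lowercase -> x, digit -> d."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)  # keep punctuation and other symbols as-is
    return "".join(shape)


for token in ["Ericsson", "house", "Haus", "A380"]:
    print(token, word_shape(token))
```

The English common noun "house" and the German common noun "Haus" get different shapes, while "Haus" has the same capitalized pattern as a proper name. This is why the capitalization cue is much weaker for German.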
  • 31. Different languages: Studying the linguistic properties of a language is important!
  • 33. Challenges in Social NER ● The performance of "off-the-shelf" NER methods degrades severely when applied to Twitter data ● Tweets ○ are short: 140 character limit. ○ cover a wide range of topics. ○ are written in grammatically broken language. ○ are written fast and posted from anywhere: a lot of misspellings. → We need a solution that considers the social characteristics of the text
  • 34. Challenges in Social NER: examples of noisy data ● Jaguar's gonna like this episode of #MadMen even less than last week's, I bet. ● Dane Bowers is in Asda I cant believe.it luckiest girl in the world omf i cant believe it omg ● A feel good story RT @DailyBreezeNews: Santa Claus arrives by helicopter at LAX to greet local school
  • 35. Solution (1): Adapting existing features to social properties (the POS tagger of the editorial NER performs very poorly on social documents).
  • 36. Solution (2): Weight (importance) of each CRF feature
  • 37. Results ● Training data ○ ~76K tweets labeled by a human annotator. ● Inter-annotator agreement of two annotators. ● Test data ○ ~9.1K tweets labeled by a human annotator. ● Improvement compared with the state-of-the-art method: Ritter, A. et al. Named entity recognition in tweets: An experimental study. EMNLP '11, pages 1524–1534.
  • 38. What about sentiment…?
  • 39. Document Level Sentiment - how it works. Inter-annotator agreement ~80% (source: http://bit.ly/human-sentiment)
  • 40. Document Level Sentiment - how it works. Machine Learning Magic: supervised learning ● Naive Bayes: BernoulliNB, GaussianNB, MultinomialNB ● Support Vector Machines: LinearSVM, RbfSVM ● Maximum Entropy Model: GIS, IIS, MEGAM, TADM ● MLP: RecurrentNN
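As an illustration of the supervised approach, here is a minimal multinomial Naive Bayes classifier written from scratch in Python. The four training documents are invented, and the function names train_nb and predict_nb are hypothetical; this is a sketch of the technique, not the pipeline described in the talk.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """docs: list of (text, label). Returns per-label word counts, label counts, vocabulary."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in docs:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict_nb(model, text):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    n_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / n_docs)  # log prior
        total = sum(word_counts[label].values())
        for tok in text.lower().split():
            if tok not in vocab:
                continue  # ignore words never seen in training
            score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy training set.
model = train_nb([
    ("great service and great food", "positive"),
    ("friendly staff", "positive"),
    ("terrible service", "negative"),
    ("uninspired food", "negative"),
])

print(predict_nb(model, "great food"))
print(predict_nb(model, "terrible food"))
```

With realistic data the same structure applies, but the training set needs thousands of labeled documents per language, which is the labelling cost mentioned earlier in the talk.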
  • 41. Document Level Sentiment - how it works Machine Learning Magic
  • 42. Document Level Sentiment - current status ~60-70% (depending on language) Not too terrible, considering that human performance is at best ~80%... ...but why is it so hard?
  • 43. Document Level Sentiment - how it’s used
  • 44. Document Level Sentiment - how it’s used
  • 45. Document Level Sentiment - the problem
  • 46. Document Level Sentiment - the problem Negative Neutral
  • 47. Document Level Sentiment - the problem. "Those numbers underline a growing gap between McDonald's and today's fast-food customers. It will only get wider with another year's worth of the same uninspired fare that has made McDonald's customers easy pickings for Panera Bread, Chick-fil-A, Chipotle Mexican Grill and others." Negative Positive. Does not make sense for our industry!
  • 48. Entity Level Sentiment (ELS)
  • 49. Entity Level Sentiment - motivation ● DLS (document level sentiment) is imprecise and wrong for our customers ● Entities are of main importance for our customers ● We already have NER (Named Entity Recognition) technology. Idea: identify the sentiment towards each particular entity in a text!
  • 50. Entity Level Sentiment - how it works. NER. BMW: Positive; Mercedes: Neutral; Toyota: Negative; …
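To make the idea tangible, here is a deliberately naive Python sketch: for each entity mention it sums the polarity of lexicon words within a small token window. The lexicon, the window size, the example sentence and the function name entity_sentiment are all assumptions for illustration; the talk does not describe its ELS method at this level, and a real system would need far more than a word window.

```python
# Tiny invented polarity lexicon.
LEXICON = {"excellent": 1, "great": 1, "awful": -1, "terrible": -1}


def entity_sentiment(tokens, entities, window=2):
    """Label each entity by the lexicon words found within `window` tokens of it."""
    results = {}
    for i, tok in enumerate(tokens):
        if tok not in entities:
            continue
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        score = sum(LEXICON.get(t.lower(), 0) for t in tokens[start:end])
        if score > 0:
            results[tok] = "Positive"
        elif score < 0:
            results[tok] = "Negative"
        else:
            results[tok] = "Neutral"
    return results


sentence = ("The new BMW is excellent while the Mercedes is available "
            "and the Toyota is awful").split()
# The entity list would come from an NER step; it is hand-written here.
labels = entity_sentiment(sentence, {"BMW", "Mercedes", "Toyota"})
print(labels)
```

A document-level classifier would have to give this sentence a single label, whereas the per-entity view separates the three brands.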
  • 51. Entity Level Sentiment - how it works. Entity1: Positive; Entity2: Neutral; Entity3: Negative; … (diagram repeats the labels E1: Positive, E2: Neutral, E3: Negative three times)
  • 52. Entity Level Sentiment - how it works Entity1: Positive Entity2: Neutral Entity3: Negative … NER
  • 54. Entity Level Sentiment - current status ● ELS is considered a very tough problem in NLP/ML ● The accuracy of state-of-the-art ELS is currently very low (~45%)
  • 55. The holy grail: The Graph Knowledge Base
  • 56. Entities + Relationships. As the types of entities and their relationships grow, so does the capacity to infer insights that depend on connectivity. Eventually one can answer questions that would not be possible with separate datasets alone!
  • 57. KB Architecture. Diagram labels: Unstructured Document Stream; Pipeline Enrichments; Graph Search; Enriched Documents; High Performance Indexes; Processing Services; API Layer; Knowledge Base (Graph) I/O; External Data Providers; Updates/subscriptions; Lookups; APPS; Backup Storage; Raw Documents
  • 60. Data Acquisition trade-offs. Diagram labels: High volume; High quality; Cheap; Manual data acquisition; Special crawlers, smart algorithms; Acquisitions, partnerships; low quality; expensive; low volume
  • 61. Composing the KB - Scalability
  • 62. Scalability Requirements - next steps. Companies: ~100 million worldwide. People: ~500 million (including media influencers). Products: ~500 million. ~1 billion entities, plus all the connections between them → billions of nodes, trillions of edges!
  • 63. Composing the KB - New features
  • 64. Improve entity search - company NED
  • 65. Improve entity search - person NED. "Who is Mr. Gates?" Robert Gates, 22nd Secretary of Defense; William Henry Gates III, former CEO & cofounder of Microsoft.
  • 67. Map influencer network: influencer score ~ e.g. PageRank
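Since the slide names PageRank as one possible influencer score, here is a minimal power-iteration sketch in Python. The tiny graph, the damping factor of 0.85 and the fixed iteration count are assumptions for illustration; an edge from A to B is read as "A links to (or mentions) B".

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {node: [nodes it links to]}."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's rank evenly across its outgoing links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Invented toy network: three accounts point at "carol".
graph = {
    "alice": ["carol"],
    "bob": ["carol"],
    "carol": ["alice"],
    "dave": ["carol"],
}
scores = pagerank(graph)
print(max(scores, key=scores.get))
```

At the scale mentioned in the talk (billions of nodes), this in-memory loop would be replaced by a distributed graph computation, but the scoring idea is the same.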
  • 68. Suggested reading ● Ratinov 2009 (challenges in NER): paper. ● ArkCMU (social): paper, code. ● Ritter et al. (social): paper, code. ● Stanford NLP NER (editorial): paper, code. ● Brown clustering ○ Brown clustering: video ○ Word Representations and N-grams: video ● Transforming Wikipedia into Named Entity Training Data: paper.