Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Generating Pseudo-ground Truth for
Predicting New Concepts in Social Streams
David Graus, Manos Tsagkias, Lars Buitinck, Maarten de Rijke

What is "anema"?
2d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media

For content interpretation and complex filtering
tasks we want to know who/what people talk about.

Entity Linking
TANK (VEHICLE)
Knowledge
Base (KB)
Document r
TANK
query q
?
?
TANK JOHNSON

Named Entity Recognition

Challenges
1. Entity "importance"
2. Noisy & short text (Twitter), updates in the KB

Challenge 1:
Entity Importance
Q: When should an entity exist in Wikipedia?
A: When it is important or has impact
!
!

Challenge 1:
Entity Importance
Q: When should an entity exist in Wikipedia?
A: When it is important or has impact
!
Q: How do you know an entity is important or has impact?
A: If it is in Wikipedia, it is/has

Challenge 1:
Entity Importance
Can we leverage today's entities to learn  
to predict tomorrow's entities?

Challenge 1:
Entity Importance
74

Challenge 1:
Entity Importance
74/140

Challenge 2:
Noisy data & changing KB
Unsupervised method for generating  
pseudo-ground truth (for training NER)

Assumption
A named-entity recognizer trained only on  
KB entities will learn to recognize KB entities

Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus

Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet

Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker

Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2

Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.

Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2

d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2

Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2

Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?

Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Hahaha! Are we sure Jillert Anema isn't
Canadian? RT @rzbh: Dutch Coach's Anti-
America Rant http://on.cc.com/1htk9Wo

Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?

Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus

Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2

Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
product
organization
organization

Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC

Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model

Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
Predictions
m1
, c1
m2, c2
…

Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
small KB

Future
KB
Unlabeled Tweet
?
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
small KB

Future
KB
Unlabeled Tweet
?
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
full KB
small KB

Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
Future
KB

Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
Unlabeled Tweet
?
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
Future
KB
Ground Truth
m1
, c1
m2
, c2
…

Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
Unlabeled Tweet
?
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
Future
KB
Ground Truth
m1
, c1
m2
, c2
…
Evaluate

Evaluation
• Mention level (NER style)
• Entity level
!

Evaluation
Prediction
• Entity level
!

• Entity level
!
Evaluation
Prediction
Ground Truth

Evaluation
• Entity level
Prediction
Ground Truth

Tweets Entities Tweets EntitiesPrediction
Ground Truth

Experimental setup
Data:
Corpus: Twitter (TREC'11 MB: 4,832,838 tweets)
KB: Wikipedia (Jan 4th, 2012)
!
Components:
EL: Semanticizer
NERC: Custom

"links": [
{
"text": "ASAP",
"linkProbability": 0.17446043165467626,
"id": "30864663",
"senseProbability": 0.11690647482014388,
"title": "ASAP (variety show)",
"url": "http://en.wikipedia.org/wiki/ASAP%20%28variety%20show%29",
"label": "ASAP",
"priorProbability": 0.631578947368421
},
{
"text": "ASAP Rocky",
"id": "33754098",
"title": "ASAP Rocky",
"url": "http://en.wikipedia.org/wiki/ASAP%20Rocky",
"label": "ASAP Rocky",
},
{
"text": "Kendrick Lamar",
"id": "29909823",
"title": "Kendrick Lamar",
"url": "http://en.wikipedia.org/wiki/Kendrick%20Lamar",
"label": "Kendrick Lamar",
},
"ASAP Rocky and Kendrick Lamar, that's when I started listening again"

NERC
Two-stage approach [1]
1. Recognition
• Predict entity span
• For each token predict B, I, or O tag.
• Structured perceptron
2. Classification
• Given entity span, predict entity class (PER/LOC/ORG)
• SVMs
[1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012.

NERC
1. Recognition
2. Classification
• SVMs
"ASAP Rocky and Kendrick Lamar,
that's when I started listening again"

NERC
1. Recognition
2. Classification
• SVMs
B I O B I
O O O O O O

NERC
1. Recognition
2. Classification
• SVMs
person person

m1
, e1
m2
, e2
Future
KB
NERC
Tweet
NERC
Model
Unlabeled Tweet
?
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Predictions
m1
, c1
m2, c2
…
Entity
Linker
Sample
Corpus
Training data
m1
, c1
m2
, c2

From tweet to training sample
1. Convert EL output (Wikipedia concepts) to NERC labels;
• Label entity span (B-I-O) & class (PER/LOC/ORG)
!
2. Pick "good" samples
• entity linker's confidence score
• textual quality

"links": [
{
"text": "ASAP Rocky",
"id": "33754098",
"title": "ASAP Rocky",
"url": "http://en.wikipedia.org/wiki/ASAP%20Rocky",
"label": "ASAP Rocky",
},
{
"text": "Kendrick Lamar",
"id": "29909823",
"title": "Kendrick Lamar",
"url": "http://en.wikipedia.org/wiki/Kendrick%20Lamar",
"label": "Kendrick Lamar",
},
"ASAP Rocky and Kendrick Lamar, that's when I started listening again"

Entity Class
1. Map Wikipedia entity to DBpedia entity
2. Retrieve entity class (ontology);
• if Person: PER
• if Organisation, Company, or Non-ProfitOrganisation: ORG
• if Place, PopulatedPlace, City, Country: LOC
• …?

Sampling Methods
1. Entity linker confidence score
2. Textual quality

Sampling 1: Confidence Score
• Extract anchor text (a) to Wikipedia page (W)-mappings
• Confidence score combines two signals:
1. How common is it that a is used as a link
2. How commonly is a used as a link to W

Sampling 1: Confidence Score
• Higher threshold = fewer entities, less noise
• Lower threshold = fewer entities, more noise

Sampling 2: Textual quality

Highest scoring tweets
1. Watching the History channel, Hitler’”⁹s
Family. Hitler hid his true family
heritage, while others had to measure
up to Aryan purity.
2. When you sense yourself becoming
negative, stop and consider what it
would mean to apply that negative
energy in the opposite direction.
3. So. After school tomorrow, french
revision class. Tuesday, Drama
rehearsal and then at 8, cricket
training. Wednesday, Drama. Thursday
… (c)
Lowest scoring tweets
1. Toni Braxton ~ He Wasn't Man
Enough for Me _HASHTAG_
_HASHTAG_? _URL_ RT _Mention_
2. tell me what u think The GetMore
Girls, Part One _URL_
3. this girl better not go off on me rt

• Compare different sampling strategies;
• top tweets
• medium tweets
• medium+top tweets
• low+medium+top tweets (no sampling)

Results

m1
, e1
m2
, e2
NERC
NERC
Model
Unlabeled Tweet
?Entity
Linker
Future
KB
Tweet
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus Sample
Corpus
Predictions
m1
, c1
m2, c2
…
RQ1:
What is the impact of our sampling methods for
 

generating
pseudo-ground truth?
Training data
m1
, c1
m2
, c2

Findings: EL confidence score threshold
1. Higher threshold, higher accuracy
!
!
!
!
!
!
!

Solid: Precision 

Dotted: Recall
0"
5"
10"
15"
20"
25"
30"
35"
0.1" 0.2" 0.3" 0.4" 0.5" 0.6" 0.7" 0.8" 0.9"

Findings: EL confidence score threshold
2. Higher threshold, more predictions

Findings: Textual Quality Sampling

Future
KB
Tweet
Unlabeled Tweet
?
NERC
Model
NERC
Training data
m1
, c1
m2
, c2
Sample
Corpus
m1
, e1
m2
, e2
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
RQ2:
What is the impact of the size of prior knowledge

on detecting unknown entities?
Entity
Linker
Today's
KB
Predictions
m1
, c1
m2, c2
…

Results: RQ2
Sampling 2: KB size (mentions)
!
!
!
!
!
!

blue: Our method 

red: Baseline 
0"
5"
10"
15"
20"
25"
30"
35"
40"
45"
50"
20%" 30%" 40%" 50%" 60%" 70%" 80%" 90%"

Conclusions
Recall increases as amount of prior knowledge grows:
1. Able to deal with missing labels, justifying approach
2. Rate of unknown entity detection increases as KB grows

Future Work
• Next step: Closing the loop
• Feed back to KB (entity normalization)
• From PER/LOC/ORG entities to other classes:
• Books, buildings, drugs, artists, …?
• Apply to other domains, languages
• From random sampling to time-based sampling

Fin
Questions?
!
!
!
!
!
!
!

Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Recommended

Recommended

More Related Content

Similar to Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Similar to Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams (20)

More from David Graus

More from David Graus (15)

Recently uploaded

Recently uploaded (20)

Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams