SlideShare a Scribd company logo
1 of 78
Download to read offline
Generating Pseudo-ground Truth for
Predicting New Concepts in Social Streams
David Graus, Manos Tsagkias, Lars Buitinck, Maarten de Rijke
What is "anema"?
2d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
3d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
4d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
5d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
6d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
For content interpretation and complex filtering
tasks we want to know who/what people talk about.
7d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Entity Linking
8d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
TANK (VEHICLE)
Knowledge
Base (KB)
Document r
TANK
query q
?
?
TANK JOHNSON
9d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
10d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
11d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Named Entity Recognition
12d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Named Entity Recognition
13d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Challenges
1. Entity "importance"
2. Noisy & short text (Twitter), updates in the KB
14d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Challenge 1:
Entity Importance
Q: When should an entity exist in Wikipedia?
A: When it is important or has impact
!
!
15d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Challenge 1:
Entity Importance
Q: When should an entity exist in Wikipedia?
A: When it is important or has impact
!
Q: How do you know an entity is important or has impact?
A: If it is in Wikipedia, it is/has
16d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Challenge 1:
Entity Importance
Can we leverage today's entities to learn 

to predict tomorrow's entities?
17d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Challenge 1:
Entity Importance
Can we leverage today's entities to learn 

to predict tomorrow's entities?
18d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
74
Challenge 1:
Entity Importance
Can we leverage today's entities to learn 

to predict tomorrow's entities?
19d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
74/140
Challenge 2:
Noisy data & changing KB
Unsupervised method for generating 

pseudo-ground truth (for training NER)
20d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Assumption
A named-entity recognizer trained only on 

KB entities will learn to recognize KB entities
21d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
22d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
23d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
24d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
25d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
26d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
27d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
29d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
30d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
31d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Hahaha! Are we sure Jillert Anema isn't
Canadian? RT @rzbh: Dutch Coach's Anti-
America Rant http://on.cc.com/1htk9Wo
32d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
33d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
34d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
35d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
36d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
product
organization
organization
37d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
38d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
39d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
40d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
Predictions
m1
, c1
m2, c2
…
41d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
small KB
42d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Future
KB
Unlabeled Tweet
?
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
small KB
Future
KB
43d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Unlabeled Tweet
?
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
full KB
small KB
44d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
Future
KB
45d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
Unlabeled Tweet
?
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
Future
KB
Ground Truth
m1
, c1
m2
, c2
…
46d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
Unlabeled Tweet
?
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
Future
KB
Ground Truth
m1
, c1
m2
, c2
…
Evaluate
Evaluation
• Mention level (NER style)
• Entity level
!
47d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
48d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Evaluation
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
Prediction
• Mention level (NER style)
• Entity level
!
• Mention level (NER style)
• Entity level
!
49d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Evaluation
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
Prediction
Ground Truth
50d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Evaluation
• Mention level (NER style)
• Entity level
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
Prediction
Ground Truth
51d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweets Entities Tweets EntitiesPrediction
Ground Truth
52d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweets Entities Tweets EntitiesPrediction
Ground Truth
Experimental setup
Data:
Corpus: Twitter (TREC'11 MB: 4,832,838 tweets)
KB: Wikipedia (Jan 4th, 2012)
!
Components:
EL: Semanticizer
NERC: Custom
53d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
54d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
"links": [
{
"text": "ASAP",
"linkProbability": 0.17446043165467626,
"id": "30864663",
"senseProbability": 0.11690647482014388,
"title": "ASAP (variety show)",
"url": "http://en.wikipedia.org/wiki/ASAP%20%28variety%20show%29",
"label": "ASAP",
"priorProbability": 0.631578947368421
},
{
"text": "ASAP Rocky",
"linkProbability": 0.9333333333333333,
"id": "33754098",
"senseProbability": 0.9333333333333333,
"title": "ASAP Rocky",
"url": "http://en.wikipedia.org/wiki/ASAP%20Rocky",
"label": "ASAP Rocky",
"priorProbability": 1.0
},
{
"text": "Kendrick Lamar",
"linkProbability": 0.9533333333333334,
"id": "29909823",
"senseProbability": 0.9533333333333334,
"title": "Kendrick Lamar",
"url": "http://en.wikipedia.org/wiki/Kendrick%20Lamar",
"label": "Kendrick Lamar",
"priorProbability": 1.0
},
"ASAP Rocky and Kendrick Lamar, that's when I started listening again"
NERC
Two-stage approach [1]
1. Recognition
• Predict entity span
• For each token predict B, I, or O tag.
• Structured perceptron
2. Classification
• Given entity span, predict entity class (PER/LOC/ORG)
• SVMs
55d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
[1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012.
NERC
Two-stage approach [1]
1. Recognition
• Predict entity span
• For each token predict B, I, or O tag.
• Structured perceptron
2. Classification
• Given entity span, predict entity class (PER/LOC/ORG)
• SVMs
56d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
[1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012.
"ASAP Rocky and Kendrick Lamar,
that's when I started listening again"
NERC
Two-stage approach [1]
1. Recognition
• Predict entity span
• For each token predict B, I, or O tag.
• Structured perceptron
2. Classification
• Given entity span, predict entity class (PER/LOC/ORG)
• SVMs
57d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
[1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012.
"ASAP Rocky and Kendrick Lamar,
that's when I started listening again"
B I O B I
O O O O O O
NERC
Two-stage approach [1]
1. Recognition
• Predict entity span
• For each token predict B, I, or O tag.
• Structured perceptron
2. Classification
• Given entity span, predict entity class (PER/LOC/ORG)
• SVMs
58d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
[1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012.
"ASAP Rocky and Kendrick Lamar,
that's when I started listening again"
person person
59d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
m1
, e1
m2
, e2
Future
KB
NERC
Tweet
NERC
Model
Unlabeled Tweet
?
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Predictions
m1
, c1
m2, c2
…
Entity
Linker
Sample
Corpus
Training data
m1
, c1
m2
, c2
From tweet to training sample
1. Convert EL output (Wikipedia concepts) to NERC labels;
• Label entity span (B-I-O) & class (PER/LOC/ORG)
!
2. Pick "good" samples
• entity linker's confidence score
• textual quality
60d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
61d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
"links": [
{
"text": "ASAP Rocky",
"linkProbability": 0.9333333333333333,
"id": "33754098",
"senseProbability": 0.9333333333333333,
"title": "ASAP Rocky",
"url": "http://en.wikipedia.org/wiki/ASAP%20Rocky",
"label": "ASAP Rocky",
"priorProbability": 1.0
},
{
"text": "Kendrick Lamar",
"linkProbability": 0.9533333333333334,
"id": "29909823",
"senseProbability": 0.9533333333333334,
"title": "Kendrick Lamar",
"url": "http://en.wikipedia.org/wiki/Kendrick%20Lamar",
"label": "Kendrick Lamar",
"priorProbability": 1.0
},
"ASAP Rocky and Kendrick Lamar, that's when I started listening again"
Entity Class
1. Map Wikipedia entity to DBpedia entity
2. Retrieve entity class (ontology);
• if Person: PER
• if Organisation, Company, or Non-ProfitOrganisation: ORG
• if Place, PopulatedPlace, City, Country: LOC
• …?
62d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Sampling Methods
1. Entity linker confidence score
2. Textual quality
63d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Sampling 1: Confidence Score
• Extract anchor text (a) to Wikipedia page (W)-mappings
• Confidence score combines two signals:
1. How common is it that a is used as a link
2. How commonly is a used as a link to W
64d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Sampling 1: Confidence Score
• Higher threshold = fewer entities, less noise
• Lower threshold = fewer entities, more noise
65d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Sampling 2: Textual quality
66d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Sampling 2: Textual quality
67d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Highest scoring tweets
1. Watching the History channel, Hitler’”⁹s
Family. Hitler hid his true family
heritage, while others had to measure
up to Aryan purity.
2. When you sense yourself becoming
negative, stop and consider what it
would mean to apply that negative
energy in the opposite direction.
3. So. After school tomorrow, french
revision class. Tuesday, Drama
rehearsal and then at 8, cricket
training. Wednesday, Drama. Thursday
… (c)
Lowest scoring tweets
1. Toni Braxton ~ He Wasn't Man
Enough for Me _HASHTAG_
_HASHTAG_? _URL_ RT _Mention_
2. tell me what u think The GetMore
Girls, Part One _URL_
3. this girl better not go off on me rt
Sampling 2: Textual quality
• Compare different sampling strategies;
• top tweets
• medium tweets
• medium+top tweets
• low+medium+top tweets (no sampling)
68d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Results
69d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
70d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
m1
, e1
m2
, e2
NERC
NERC
Model
Unlabeled Tweet
?Entity
Linker
Future
KB
Tweet
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus Sample
Corpus
Predictions
m1
, c1
m2, c2
…
RQ1: 	
 What is the impact of our sampling methods for 	
 

	
	
 	
 	
 	
 generating	
pseudo-ground truth?
Training data
m1
, c1
m2
, c2
Findings: EL confidence score threshold
1. Higher threshold, higher accuracy
!
!
!
!
!
!
!
	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 Solid: Precision

	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 Dotted: Recall
71d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
0"
5"
10"
15"
20"
25"
30"
35"
0.1" 0.2" 0.3" 0.4" 0.5" 0.6" 0.7" 0.8" 0.9"
Findings: EL confidence score threshold
2. Higher threshold, more predictions
72d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Findings: Textual Quality Sampling
73d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
74d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Future
KB
Tweet
Unlabeled Tweet
?
NERC
Model
NERC
Training data
m1
, c1
m2
, c2
Sample
Corpus
m1
, e1
m2
, e2
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
RQ2: 	
What is the impact of the size of prior knowledge
	
	
 	
 	
 	
 on detecting unknown entities?
Entity
Linker
Today's
KB
Predictions
m1
, c1
m2, c2
…
Results: RQ2
Sampling 2: KB size (mentions)
!
!
!
!
!
!
	
	
 	
 	
 	
 	
 	
 	
 	
 	
	
	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 blue: Our method

	
	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 red: Baseline

75d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
0"
5"
10"
15"
20"
25"
30"
35"
40"
45"
50"
20%" 30%" 40%" 50%" 60%" 70%" 80%" 90%"
Conclusions
Recall increases as amount of prior knowledge grows:
1. Able to deal with missing labels, justifying approach
2. Rate of unknown entity detection increases as KB grows
76d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Future Work
• Next step: Closing the loop
• Feed back to KB (entity normalization)
• From PER/LOC/ORG entities to other classes:
• Books, buildings, drugs, artists, …?
• Apply to other domains, languages
• From random sampling to time-based sampling
77d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Fin
Questions?
!
!
!
!
!
!
!
78d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media

More Related Content

Similar to Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Government ICT 2.0 London 2014 – Open Source Drupal Empowering Government
Government ICT 2.0 London 2014 – Open Source Drupal Empowering GovernmentGovernment ICT 2.0 London 2014 – Open Source Drupal Empowering Government
Government ICT 2.0 London 2014 – Open Source Drupal Empowering GovernmentJeffrey McGuire
 
Adaptive and social learning in the New Worl of Work & Living
Adaptive and social learning in the New Worl of Work & LivingAdaptive and social learning in the New Worl of Work & Living
Adaptive and social learning in the New Worl of Work & LivingThe Learning Hub
 
Everything that happens, happens somewhere
Everything that happens, happens somewhereEverything that happens, happens somewhere
Everything that happens, happens somewhereJene van der Heide
 
Taking your learning further SEND
Taking your learning further SENDTaking your learning further SEND
Taking your learning further SENDSpectronics
 
Technology Driven Diferentiation
Technology Driven DiferentiationTechnology Driven Diferentiation
Technology Driven DiferentiationVicki Davis
 
DrupalGov Canberra 2014 Keynote: Code for a better world: Open Source Drupal ...
DrupalGov Canberra 2014 Keynote: Code for a better world: Open Source Drupal ...DrupalGov Canberra 2014 Keynote: Code for a better world: Open Source Drupal ...
DrupalGov Canberra 2014 Keynote: Code for a better world: Open Source Drupal ...Jeffrey McGuire
 
20191114 ECP Jaarcongres 2019 Solid Activties & Use Cases in The Netherlands
20191114 ECP Jaarcongres 2019   Solid Activties & Use Cases in The Netherlands20191114 ECP Jaarcongres 2019   Solid Activties & Use Cases in The Netherlands
20191114 ECP Jaarcongres 2019 Solid Activties & Use Cases in The NetherlandsPlatform Linked Data Netherlands (PLDN)
 
Building a social graph for the history of Europe: the CUbRIK histoGraph
Building a social graph for the history of Europe: the CUbRIK histoGraphBuilding a social graph for the history of Europe: the CUbRIK histoGraph
Building a social graph for the history of Europe: the CUbRIK histoGraphCUbRIK Project
 
histoGraph: a case study in Digital Humanities
histoGraph: a case study in Digital HumanitieshistoGraph: a case study in Digital Humanities
histoGraph: a case study in Digital HumanitiesCUbRIK Project
 
Now What? Creating Innovative LODLAM Sites and Apps
Now What? Creating Innovative LODLAM Sites and AppsNow What? Creating Innovative LODLAM Sites and Apps
Now What? Creating Innovative LODLAM Sites and AppsDesign for Context
 
Building Responsible AI - London Oct 2019
Building Responsible AI - London Oct 2019Building Responsible AI - London Oct 2019
Building Responsible AI - London Oct 2019Ariadna Font Llitjos
 
2013.05.13 tom de nies - snow 2013
2013.05.13   tom de nies - snow 20132013.05.13   tom de nies - snow 2013
2013.05.13 tom de nies - snow 2013tdenies
 
Possibilities to Practices II
Possibilities to Practices IIPossibilities to Practices II
Possibilities to Practices IIDerrick Willard
 
Here Comes The iPad Generation - Future of Higher Education 2015
Here Comes The iPad Generation - Future of Higher Education 2015Here Comes The iPad Generation - Future of Higher Education 2015
Here Comes The iPad Generation - Future of Higher Education 2015Martin Hamilton
 
MaRS Market Intelligence (Spring 2015)
MaRS Market Intelligence (Spring 2015)MaRS Market Intelligence (Spring 2015)
MaRS Market Intelligence (Spring 2015)NORCAT
 
Lean at ITP info session 2015
Lean at ITP info session 2015Lean at ITP info session 2015
Lean at ITP info session 2015Jen van der Meer
 

Similar to Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams (20)

Government ICT 2.0 London 2014 – Open Source Drupal Empowering Government
Government ICT 2.0 London 2014 – Open Source Drupal Empowering GovernmentGovernment ICT 2.0 London 2014 – Open Source Drupal Empowering Government
Government ICT 2.0 London 2014 – Open Source Drupal Empowering Government
 
Adaptive and social learning in the New Worl of Work & Living
Adaptive and social learning in the New Worl of Work & LivingAdaptive and social learning in the New Worl of Work & Living
Adaptive and social learning in the New Worl of Work & Living
 
Partner-Driven Innovation
Partner-Driven InnovationPartner-Driven Innovation
Partner-Driven Innovation
 
Everything that happens, happens somewhere
Everything that happens, happens somewhereEverything that happens, happens somewhere
Everything that happens, happens somewhere
 
Taking your learning further SEND
Taking your learning further SENDTaking your learning further SEND
Taking your learning further SEND
 
Technology Driven Diferentiation
Technology Driven DiferentiationTechnology Driven Diferentiation
Technology Driven Diferentiation
 
DrupalGov Canberra 2014 Keynote: Code for a better world: Open Source Drupal ...
DrupalGov Canberra 2014 Keynote: Code for a better world: Open Source Drupal ...DrupalGov Canberra 2014 Keynote: Code for a better world: Open Source Drupal ...
DrupalGov Canberra 2014 Keynote: Code for a better world: Open Source Drupal ...
 
20191114 ECP Jaarcongres 2019 Solid Activties & Use Cases in The Netherlands
20191114 ECP Jaarcongres 2019   Solid Activties & Use Cases in The Netherlands20191114 ECP Jaarcongres 2019   Solid Activties & Use Cases in The Netherlands
20191114 ECP Jaarcongres 2019 Solid Activties & Use Cases in The Netherlands
 
Building a social graph for the history of Europe: the CUbRIK histoGraph
Building a social graph for the history of Europe: the CUbRIK histoGraphBuilding a social graph for the history of Europe: the CUbRIK histoGraph
Building a social graph for the history of Europe: the CUbRIK histoGraph
 
histoGraph: a case study in Digital Humanities
histoGraph: a case study in Digital HumanitieshistoGraph: a case study in Digital Humanities
histoGraph: a case study in Digital Humanities
 
Now What? Creating Innovative LODLAM Sites and Apps
Now What? Creating Innovative LODLAM Sites and AppsNow What? Creating Innovative LODLAM Sites and Apps
Now What? Creating Innovative LODLAM Sites and Apps
 
Building Responsible AI - London Oct 2019
Building Responsible AI - London Oct 2019Building Responsible AI - London Oct 2019
Building Responsible AI - London Oct 2019
 
2013.05.13 tom de nies - snow 2013
2013.05.13   tom de nies - snow 20132013.05.13   tom de nies - snow 2013
2013.05.13 tom de nies - snow 2013
 
Possibilities to Practices II
Possibilities to Practices IIPossibilities to Practices II
Possibilities to Practices II
 
Here Comes The iPad Generation - Future of Higher Education 2015
Here Comes The iPad Generation - Future of Higher Education 2015Here Comes The iPad Generation - Future of Higher Education 2015
Here Comes The iPad Generation - Future of Higher Education 2015
 
Digital Health in Real LIfe #RotReuma
Digital Health in Real LIfe #RotReuma Digital Health in Real LIfe #RotReuma
Digital Health in Real LIfe #RotReuma
 
MaRS Market Intelligence (Spring 2015)
MaRS Market Intelligence (Spring 2015)MaRS Market Intelligence (Spring 2015)
MaRS Market Intelligence (Spring 2015)
 
Tim Berners-Lee's 5-Star Open Data Scheme
Tim Berners-Lee's 5-Star Open Data SchemeTim Berners-Lee's 5-Star Open Data Scheme
Tim Berners-Lee's 5-Star Open Data Scheme
 
Lean at ITP info session 2015
Lean at ITP info session 2015Lean at ITP info session 2015
Lean at ITP info session 2015
 
Digital Storytelling
Digital Storytelling Digital Storytelling
Digital Storytelling
 

More from David Graus

Pragmatic ethical and fair AI for data scientists
Pragmatic ethical and fair AI for data scientistsPragmatic ethical and fair AI for data scientists
Pragmatic ethical and fair AI for data scientistsDavid Graus
 
Bias in Recommendations
Bias in RecommendationsBias in Recommendations
Bias in RecommendationsDavid Graus
 
RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.
RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.
RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.David Graus
 
CAT/AI: Computer Assisted Translation 
Assessment for Impact
CAT/AI: Computer Assisted Translation 
Assessment for ImpactCAT/AI: Computer Assisted Translation 
Assessment for Impact
CAT/AI: Computer Assisted Translation 
Assessment for ImpactDavid Graus
 
Opening the Black Box of User Profiles in Content-based Recommender Systems
Opening the Black Box of User Profiles in Content-based Recommender SystemsOpening the Black Box of User Profiles in Content-based Recommender Systems
Opening the Black Box of User Profiles in Content-based Recommender SystemsDavid Graus
 
Zoeken, vinden, en aanbevelen: personalisatie vs. privacy
Zoeken, vinden, en aanbevelen: personalisatie vs. privacyZoeken, vinden, en aanbevelen: personalisatie vs. privacy
Zoeken, vinden, en aanbevelen: personalisatie vs. privacyDavid Graus
 
Layman's Talk: Entities of Interest --- Discovery in Digital Traces
Layman's Talk: Entities of Interest --- Discovery in Digital TracesLayman's Talk: Entities of Interest --- Discovery in Digital Traces
Layman's Talk: Entities of Interest --- Discovery in Digital TracesDavid Graus
 
Financial News Mining @ PyData Amsterdam
Financial News Mining @ PyData AmsterdamFinancial News Mining @ PyData Amsterdam
Financial News Mining @ PyData AmsterdamDavid Graus
 
De Macht van Data --- Hoe algoritmen ons leven vormgeven
De Macht van Data --- Hoe algoritmen ons leven vormgevenDe Macht van Data --- Hoe algoritmen ons leven vormgeven
De Macht van Data --- Hoe algoritmen ons leven vormgevenDavid Graus
 
Financial News Mining @ FD Mediagroep/Company.info
Financial News Mining @ FD Mediagroep/Company.infoFinancial News Mining @ FD Mediagroep/Company.info
Financial News Mining @ FD Mediagroep/Company.infoDavid Graus
 
Dynamic Collective Entity Representations for Entity Ranking
Dynamic Collective Entity Representations for Entity RankingDynamic Collective Entity Representations for Entity Ranking
Dynamic Collective Entity Representations for Entity RankingDavid Graus
 
Understanding Email Traffic
Understanding Email TrafficUnderstanding Email Traffic
Understanding Email TrafficDavid Graus
 
David Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27th
David Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27thDavid Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27th
David Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27thDavid Graus
 
Semantic Search in E-Discovery
Semantic Search in E-DiscoverySemantic Search in E-Discovery
Semantic Search in E-DiscoveryDavid Graus
 
Semantic annotation, clustering and visualization
Semantic annotation, clustering and visualizationSemantic annotation, clustering and visualization
Semantic annotation, clustering and visualizationDavid Graus
 

More from David Graus (15)

Pragmatic ethical and fair AI for data scientists
Pragmatic ethical and fair AI for data scientistsPragmatic ethical and fair AI for data scientists
Pragmatic ethical and fair AI for data scientists
 
Bias in Recommendations
Bias in RecommendationsBias in Recommendations
Bias in Recommendations
 
RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.
RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.
RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.
 
CAT/AI: Computer Assisted Translation 
Assessment for Impact
CAT/AI: Computer Assisted Translation 
Assessment for ImpactCAT/AI: Computer Assisted Translation 
Assessment for Impact
CAT/AI: Computer Assisted Translation 
Assessment for Impact
 
Opening the Black Box of User Profiles in Content-based Recommender Systems
Opening the Black Box of User Profiles in Content-based Recommender SystemsOpening the Black Box of User Profiles in Content-based Recommender Systems
Opening the Black Box of User Profiles in Content-based Recommender Systems
 
Zoeken, vinden, en aanbevelen: personalisatie vs. privacy
Zoeken, vinden, en aanbevelen: personalisatie vs. privacyZoeken, vinden, en aanbevelen: personalisatie vs. privacy
Zoeken, vinden, en aanbevelen: personalisatie vs. privacy
 
Layman's Talk: Entities of Interest --- Discovery in Digital Traces
Layman's Talk: Entities of Interest --- Discovery in Digital TracesLayman's Talk: Entities of Interest --- Discovery in Digital Traces
Layman's Talk: Entities of Interest --- Discovery in Digital Traces
 
Financial News Mining @ PyData Amsterdam
Financial News Mining @ PyData AmsterdamFinancial News Mining @ PyData Amsterdam
Financial News Mining @ PyData Amsterdam
 
De Macht van Data --- Hoe algoritmen ons leven vormgeven
De Macht van Data --- Hoe algoritmen ons leven vormgevenDe Macht van Data --- Hoe algoritmen ons leven vormgeven
De Macht van Data --- Hoe algoritmen ons leven vormgeven
 
Financial News Mining @ FD Mediagroep/Company.info
Financial News Mining @ FD Mediagroep/Company.infoFinancial News Mining @ FD Mediagroep/Company.info
Financial News Mining @ FD Mediagroep/Company.info
 
Dynamic Collective Entity Representations for Entity Ranking
Dynamic Collective Entity Representations for Entity RankingDynamic Collective Entity Representations for Entity Ranking
Dynamic Collective Entity Representations for Entity Ranking
 
Understanding Email Traffic
Understanding Email TrafficUnderstanding Email Traffic
Understanding Email Traffic
 
David Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27th
David Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27thDavid Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27th
David Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27th
 
Semantic Search in E-Discovery
Semantic Search in E-DiscoverySemantic Search in E-Discovery
Semantic Search in E-Discovery
 
Semantic annotation, clustering and visualization
Semantic annotation, clustering and visualizationSemantic annotation, clustering and visualization
Semantic annotation, clustering and visualization
 

Recently uploaded

DETECTION OF MUTATION BY CLB METHOD.pptx
DETECTION OF MUTATION BY CLB METHOD.pptxDETECTION OF MUTATION BY CLB METHOD.pptx
DETECTION OF MUTATION BY CLB METHOD.pptx201bo007
 
Loudspeaker- direct radiating type and horn type.pptx
Loudspeaker- direct radiating type and horn type.pptxLoudspeaker- direct radiating type and horn type.pptx
Loudspeaker- direct radiating type and horn type.pptxpriyankatabhane
 
Pests of Sunflower_Binomics_Identification_Dr.UPR
Pests of Sunflower_Binomics_Identification_Dr.UPRPests of Sunflower_Binomics_Identification_Dr.UPR
Pests of Sunflower_Binomics_Identification_Dr.UPRPirithiRaju
 
Science (Communication) and Wikipedia - Potentials and Pitfalls
Science (Communication) and Wikipedia - Potentials and PitfallsScience (Communication) and Wikipedia - Potentials and Pitfalls
Science (Communication) and Wikipedia - Potentials and PitfallsDobusch Leonhard
 
EGYPTIAN IMPRINT IN SPAIN Lecture by Dr Abeer Zahana
EGYPTIAN IMPRINT IN SPAIN Lecture by Dr Abeer ZahanaEGYPTIAN IMPRINT IN SPAIN Lecture by Dr Abeer Zahana
EGYPTIAN IMPRINT IN SPAIN Lecture by Dr Abeer ZahanaDr.Mahmoud Abbas
 
Environmental Acoustics- Speech interference level, acoustics calibrator.pptx
Environmental Acoustics- Speech interference level, acoustics calibrator.pptxEnvironmental Acoustics- Speech interference level, acoustics calibrator.pptx
Environmental Acoustics- Speech interference level, acoustics calibrator.pptxpriyankatabhane
 
Understanding Nutrition, 16th Edition pdf
Understanding Nutrition, 16th Edition pdfUnderstanding Nutrition, 16th Edition pdf
Understanding Nutrition, 16th Edition pdfHabibouKarbo
 
Anatomy of Duodenum- detailed ppt presentation for science students
Anatomy of Duodenum- detailed ppt presentation for science studentsAnatomy of Duodenum- detailed ppt presentation for science students
Anatomy of Duodenum- detailed ppt presentation for science studentsgargipatil3210
 
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learning
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep LearningCombining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learning
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learningvschiavoni
 
Efficient Fourier Pricing of Multi-Asset Options: Quasi-Monte Carlo & Domain ...
Efficient Fourier Pricing of Multi-Asset Options: Quasi-Monte Carlo & Domain ...Efficient Fourier Pricing of Multi-Asset Options: Quasi-Monte Carlo & Domain ...
Efficient Fourier Pricing of Multi-Asset Options: Quasi-Monte Carlo & Domain ...Chiheb Ben Hammouda
 
Timeless Cosmology: Towards a Geometric Origin of Cosmological Correlations
Timeless Cosmology: Towards a Geometric Origin of Cosmological CorrelationsTimeless Cosmology: Towards a Geometric Origin of Cosmological Correlations
Timeless Cosmology: Towards a Geometric Origin of Cosmological CorrelationsDanielBaumann11
 
Environmental acoustics- noise criteria.pptx
Environmental acoustics- noise criteria.pptxEnvironmental acoustics- noise criteria.pptx
Environmental acoustics- noise criteria.pptxpriyankatabhane
 
Food_safety_Management_pptx.pptx in microbiology
Food_safety_Management_pptx.pptx in microbiologyFood_safety_Management_pptx.pptx in microbiology
Food_safety_Management_pptx.pptx in microbiologyHemantThakare8
 
BACTERIAL SECRETION SYSTEM by Dr. Chayanika Das
BACTERIAL SECRETION SYSTEM by Dr. Chayanika DasBACTERIAL SECRETION SYSTEM by Dr. Chayanika Das
BACTERIAL SECRETION SYSTEM by Dr. Chayanika DasChayanika Das
 
Speed Breeding in Vegetable Crops- innovative approach for present era of cro...
Speed Breeding in Vegetable Crops- innovative approach for present era of cro...Speed Breeding in Vegetable Crops- innovative approach for present era of cro...
Speed Breeding in Vegetable Crops- innovative approach for present era of cro...jana861314
 
AICTE activity on Water Conservation spreading awareness
AICTE activity on Water Conservation spreading awarenessAICTE activity on Water Conservation spreading awareness
AICTE activity on Water Conservation spreading awareness1hk20is002
 
Rabies ,a deadly viral disease transmitted by the most beautiful beings. by ...
Rabies ,a deadly viral disease transmitted by the most beautiful beings.  by ...Rabies ,a deadly viral disease transmitted by the most beautiful beings.  by ...
Rabies ,a deadly viral disease transmitted by the most beautiful beings. by ...uzmashireenmbe01
 
Interpreting SDSS extragalactic data in the era of JWST
Interpreting SDSS extragalactic data in the era of JWSTInterpreting SDSS extragalactic data in the era of JWST
Interpreting SDSS extragalactic data in the era of JWSTAlexander F. Mayer
 
Remote patient monitoring :Health care transformation
Remote patient monitoring :Health care transformationRemote patient monitoring :Health care transformation
Remote patient monitoring :Health care transformationfahad Alotaibiu
 
Introduction of Organ-On-A-Chip - Creative Biolabs
Introduction of Organ-On-A-Chip - Creative BiolabsIntroduction of Organ-On-A-Chip - Creative Biolabs
Introduction of Organ-On-A-Chip - Creative BiolabsCreative-Biolabs
 

Recently uploaded (20)

DETECTION OF MUTATION BY CLB METHOD.pptx
DETECTION OF MUTATION BY CLB METHOD.pptxDETECTION OF MUTATION BY CLB METHOD.pptx
DETECTION OF MUTATION BY CLB METHOD.pptx
 
Loudspeaker- direct radiating type and horn type.pptx
Loudspeaker- direct radiating type and horn type.pptxLoudspeaker- direct radiating type and horn type.pptx
Loudspeaker- direct radiating type and horn type.pptx
 
Pests of Sunflower_Binomics_Identification_Dr.UPR
Pests of Sunflower_Binomics_Identification_Dr.UPRPests of Sunflower_Binomics_Identification_Dr.UPR
Pests of Sunflower_Binomics_Identification_Dr.UPR
 
Science (Communication) and Wikipedia - Potentials and Pitfalls
Science (Communication) and Wikipedia - Potentials and PitfallsScience (Communication) and Wikipedia - Potentials and Pitfalls
Science (Communication) and Wikipedia - Potentials and Pitfalls
 
EGYPTIAN IMPRINT IN SPAIN Lecture by Dr Abeer Zahana
EGYPTIAN IMPRINT IN SPAIN Lecture by Dr Abeer ZahanaEGYPTIAN IMPRINT IN SPAIN Lecture by Dr Abeer Zahana
EGYPTIAN IMPRINT IN SPAIN Lecture by Dr Abeer Zahana
 
Environmental Acoustics- Speech interference level, acoustics calibrator.pptx
Environmental Acoustics- Speech interference level, acoustics calibrator.pptxEnvironmental Acoustics- Speech interference level, acoustics calibrator.pptx
Environmental Acoustics- Speech interference level, acoustics calibrator.pptx
 
Understanding Nutrition, 16th Edition pdf
Understanding Nutrition, 16th Edition pdfUnderstanding Nutrition, 16th Edition pdf
Understanding Nutrition, 16th Edition pdf
 
Anatomy of Duodenum- detailed ppt presentation for science students
Anatomy of Duodenum- detailed ppt presentation for science studentsAnatomy of Duodenum- detailed ppt presentation for science students
Anatomy of Duodenum- detailed ppt presentation for science students
 
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learning
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep LearningCombining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learning
Combining Asynchronous Task Parallelism and Intel SGX for Secure Deep Learning
 
Efficient Fourier Pricing of Multi-Asset Options: Quasi-Monte Carlo & Domain ...
Efficient Fourier Pricing of Multi-Asset Options: Quasi-Monte Carlo & Domain ...Efficient Fourier Pricing of Multi-Asset Options: Quasi-Monte Carlo & Domain ...
Efficient Fourier Pricing of Multi-Asset Options: Quasi-Monte Carlo & Domain ...
 
Timeless Cosmology: Towards a Geometric Origin of Cosmological Correlations
Timeless Cosmology: Towards a Geometric Origin of Cosmological CorrelationsTimeless Cosmology: Towards a Geometric Origin of Cosmological Correlations
Timeless Cosmology: Towards a Geometric Origin of Cosmological Correlations
 
Environmental acoustics- noise criteria.pptx
Environmental acoustics- noise criteria.pptxEnvironmental acoustics- noise criteria.pptx
Environmental acoustics- noise criteria.pptx
 
Food_safety_Management_pptx.pptx in microbiology
Food_safety_Management_pptx.pptx in microbiologyFood_safety_Management_pptx.pptx in microbiology
Food_safety_Management_pptx.pptx in microbiology
 
BACTERIAL SECRETION SYSTEM by Dr. Chayanika Das
BACTERIAL SECRETION SYSTEM by Dr. Chayanika DasBACTERIAL SECRETION SYSTEM by Dr. Chayanika Das
BACTERIAL SECRETION SYSTEM by Dr. Chayanika Das
 
Speed Breeding in Vegetable Crops- innovative approach for present era of cro...
Speed Breeding in Vegetable Crops- innovative approach for present era of cro...Speed Breeding in Vegetable Crops- innovative approach for present era of cro...
Speed Breeding in Vegetable Crops- innovative approach for present era of cro...
 
AICTE activity on Water Conservation spreading awareness
AICTE activity on Water Conservation spreading awarenessAICTE activity on Water Conservation spreading awareness
AICTE activity on Water Conservation spreading awareness
 
Rabies ,a deadly viral disease transmitted by the most beautiful beings. by ...
Rabies ,a deadly viral disease transmitted by the most beautiful beings.  by ...Rabies ,a deadly viral disease transmitted by the most beautiful beings.  by ...
Rabies ,a deadly viral disease transmitted by the most beautiful beings. by ...
 
Interpreting SDSS extragalactic data in the era of JWST
Interpreting SDSS extragalactic data in the era of JWSTInterpreting SDSS extragalactic data in the era of JWST
Interpreting SDSS extragalactic data in the era of JWST
 
Remote patient monitoring :Health care transformation
Remote patient monitoring :Health care transformationRemote patient monitoring :Health care transformation
Remote patient monitoring :Health care transformation
 
Introduction of Organ-On-A-Chip - Creative Biolabs
Introduction of Organ-On-A-Chip - Creative BiolabsIntroduction of Organ-On-A-Chip - Creative Biolabs
Introduction of Organ-On-A-Chip - Creative Biolabs
 

Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

  • 1. Generating Pseudo-ground Truth for Predicting New Concepts in Social Streams David Graus, Manos Tsagkias, Lars Buitinck, Maarten de Rijke
  • 2. What is "anema"? 2d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 3. 3d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 4. 4d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 5. 5d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 6. 6d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 7. For content interpretation and complex filtering tasks we want to know who/what people talk about. 7d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 8. Entity Linking 8d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media TANK (VEHICLE) Knowledge Base (KB) Document r TANK query q ? ? TANK JOHNSON
  • 9. 9d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 10. 10d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 11. 11d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 12. Named Entity Recognition 12d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 13. Named Entity Recognition 13d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 14. Challenges 1. Entity "importance" 2. Noisy & short text (Twitter), updates in the KB 14d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 15. Challenge 1: Entity Importance Q: When should an entity exist in Wikipedia? A: When it is important or has impact ! ! 15d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 16. Challenge 1: Entity Importance Q: When should an entity exist in Wikipedia? A: When it is important or has impact ! Q: How do you know an entity is important or has impact? A: If it is in Wikipedia, it is/has 16d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 17. Challenge 1: Entity Importance Can we leverage today's entities to learn 
 to predict tomorrow's entities? 17d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 18. Challenge 1: Entity Importance Can we leverage today's entities to learn 
 to predict tomorrow's entities? 18d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media 74
  • 19. Challenge 1: Entity Importance Can we leverage today's entities to learn 
 to predict tomorrow's entities? 19d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media 74/140
  • 20. Challenge 2: Noisy data & changing KB Unsupervised method for generating 
 pseudo-ground truth (for training NER) 20d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 21. Assumption A named-entity recognizer trained only on 
 KB entities will learn to recognize KB entities 21d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 22. 22d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus
  • 23. 23d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet
  • 24. 24d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker
  • 25. 25d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2
  • 26. 26d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I.
  • 27. 27d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I.
  • 28. d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I.
  • 29. 29d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2
  • 30. 30d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ?
  • 31. 31d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Hahaha! Are we sure Jillert Anema isn't Canadian? RT @rzbh: Dutch Coach's Anti- America Rant http://on.cc.com/1htk9Wo
  • 32. 32d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ?
  • 33. 33d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus
  • 34. 34d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2
  • 35. 35d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I.
  • 36. 36d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I. product organization organization
  • 37. 37d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 NERC
  • 38. 38d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model
  • 39. 39d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model
  • 40. 40d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model Predictions m1 , c1 m2, c2 …
  • 41. 41d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model Predictions m1 , c1 m2, c2 … Today's KB small KB
  • 42. 42d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Future KB Unlabeled Tweet ? Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model Predictions m1 , c1 m2, c2 … Today's KB small KB
  • 43. Future KB 43d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Unlabeled Tweet ? Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model Predictions m1 , c1 m2, c2 … Today's KB full KB small KB
  • 44. 44d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model Predictions m1 , c1 m2, c2 … Today's KB Future KB
  • 45. 45d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker Unlabeled Tweet ? NERC Model Predictions m1 , c1 m2, c2 … Today's KB Future KB Ground Truth m1 , c1 m2 , c2 …
  • 46. 46d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker Unlabeled Tweet ? NERC Model Predictions m1 , c1 m2, c2 … Today's KB Future KB Ground Truth m1 , c1 m2 , c2 … Evaluate
  • 47. Evaluation • Mention level (NER style) • Entity level ! 47d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 48. 48d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Evaluation This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I. Prediction • Mention level (NER style) • Entity level !
  • 49. • Mention level (NER style) • Entity level ! 49d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Evaluation This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I. This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I. Prediction Ground Truth
  • 50. 50d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Evaluation • Mention level (NER style) • Entity level This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I. This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I. Prediction Ground Truth
  • 51. 51d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweets Entities Tweets EntitiesPrediction Ground Truth
  • 52. 52d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweets Entities Tweets EntitiesPrediction Ground Truth
  • 53. Experimental setup Data: Corpus: Twitter (TREC'11 MB: 4,832,838 tweets) KB: Wikipedia (Jan 4th, 2012) ! Components: EL: Semanticizer NERC: Custom 53d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 54. 54d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media "links": [ { "text": "ASAP", "linkProbability": 0.17446043165467626, "id": "30864663", "senseProbability": 0.11690647482014388, "title": "ASAP (variety show)", "url": "http://en.wikipedia.org/wiki/ASAP%20%28variety%20show%29", "label": "ASAP", "priorProbability": 0.631578947368421 }, { "text": "ASAP Rocky", "linkProbability": 0.9333333333333333, "id": "33754098", "senseProbability": 0.9333333333333333, "title": "ASAP Rocky", "url": "http://en.wikipedia.org/wiki/ASAP%20Rocky", "label": "ASAP Rocky", "priorProbability": 1.0 }, { "text": "Kendrick Lamar", "linkProbability": 0.9533333333333334, "id": "29909823", "senseProbability": 0.9533333333333334, "title": "Kendrick Lamar", "url": "http://en.wikipedia.org/wiki/Kendrick%20Lamar", "label": "Kendrick Lamar", "priorProbability": 1.0 }, "ASAP Rocky and Kendrick Lamar, that's when I started listening again"
  • 55. NERC Two-stage approach [1] 1. Recognition • Predict entity span • For each token predict B, I, or O tag. • Structured perceptron 2. Classification • Given entity span, predict entity class (PER/LOC/ORG) • SVMs 55d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media [1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012.
  • 56. NERC Two-stage approach [1] 1. Recognition • Predict entity span • For each token predict B, I, or O tag. • Structured perceptron 2. Classification • Given entity span, predict entity class (PER/LOC/ORG) • SVMs 56d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media [1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012. "ASAP Rocky and Kendrick Lamar, that's when I started listening again"
  • 57. NERC Two-stage approach [1] 1. Recognition • Predict entity span • For each token predict B, I, or O tag. • Structured perceptron 2. Classification • Given entity span, predict entity class (PER/LOC/ORG) • SVMs 57d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media [1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012. "ASAP Rocky and Kendrick Lamar, that's when I started listening again" B I O B I O O O O O O
  • 58. NERC Two-stage approach [1] 1. Recognition • Predict entity span • For each token predict B, I, or O tag. • Structured perceptron 2. Classification • Given entity span, predict entity class (PER/LOC/ORG) • SVMs 58d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media [1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012. "ASAP Rocky and Kendrick Lamar, that's when I started listening again" person person
  • 59. 59d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media m1 , e1 m2 , e2 Future KB NERC Tweet NERC Model Unlabeled Tweet ? Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Predictions m1 , c1 m2, c2 … Entity Linker Sample Corpus Training data m1 , c1 m2 , c2
  • 60. From tweet to training sample 1. Convert EL output (Wikipedia concepts) to NERC labels; • Label entity span (B-I-O) & class (PER/LOC/ORG) ! 2. Pick "good" samples • entity linker's confidence score • textual quality 60d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 61. 61d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media "links": [ { "text": "ASAP Rocky", "linkProbability": 0.9333333333333333, "id": "33754098", "senseProbability": 0.9333333333333333, "title": "ASAP Rocky", "url": "http://en.wikipedia.org/wiki/ASAP%20Rocky", "label": "ASAP Rocky", "priorProbability": 1.0 }, { "text": "Kendrick Lamar", "linkProbability": 0.9533333333333334, "id": "29909823", "senseProbability": 0.9533333333333334, "title": "Kendrick Lamar", "url": "http://en.wikipedia.org/wiki/Kendrick%20Lamar", "label": "Kendrick Lamar", "priorProbability": 1.0 }, "ASAP Rocky and Kendrick Lamar, that's when I started listening again"
  • 62. Entity Class 1. Map Wikipedia entity to DBpedia entity 2. Retrieve entity class (ontology); • if Person: PER • if Organisation, Company, or Non-ProfitOrganisation: ORG • if Place, PopulatedPlace, City, Country: LOC • …? 62d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 63. Sampling Methods 1. Entity linker confidence score 2. Textual quality 63d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 64. Sampling 1: Confidence Score • Extract anchor text (a) to Wikipedia page (W)-mappings • Confidence score combines two signals: 1. How common is it that a is used as a link 2. How commonly is a used as a link to W 64d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 65. Sampling 1: Confidence Score • Higher threshold = fewer entities, less noise • Lower threshold = fewer entities, more noise 65d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 66. Sampling 2: Textual quality 66d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 67. Sampling 2: Textual quality 67d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Highest scoring tweets 1. Watching the History channel, Hitler’”⁹s Family. Hitler hid his true family heritage, while others had to measure up to Aryan purity. 2. When you sense yourself becoming negative, stop and consider what it would mean to apply that negative energy in the opposite direction. 3. So. After school tomorrow, french revision class. Tuesday, Drama rehearsal and then at 8, cricket training. Wednesday, Drama. Thursday … (c) Lowest scoring tweets 1. Toni Braxton ~ He Wasn't Man Enough for Me _HASHTAG_ _HASHTAG_? _URL_ RT _Mention_ 2. tell me what u think The GetMore Girls, Part One _URL_ 3. this girl better not go off on me rt
  • 68. Sampling 2: Textual quality • Compare different sampling strategies; • top tweets • medium tweets • medium+top tweets • low+medium+top tweets (no sampling) 68d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 69. Results 69d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 70. 70d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media m1 , e1 m2 , e2 NERC NERC Model Unlabeled Tweet ?Entity Linker Future KB Tweet Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Sample Corpus Predictions m1 , c1 m2, c2 … RQ1: What is the impact of our sampling methods for 
 generating pseudo-ground truth? Training data m1 , c1 m2 , c2
  • 71. Findings: EL confidence score threshold 1. Higher threshold, higher accuracy ! ! ! ! ! ! ! Solid: Precision
 Dotted: Recall 71d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media 0" 5" 10" 15" 20" 25" 30" 35" 0.1" 0.2" 0.3" 0.4" 0.5" 0.6" 0.7" 0.8" 0.9"
  • 72. Findings: EL confidence score threshold 2. Higher threshold, more predictions 72d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 73. Findings: Textual Quality Sampling 73d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 74. 74d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Future KB Tweet Unlabeled Tweet ? NERC Model NERC Training data m1 , c1 m2 , c2 Sample Corpus m1 , e1 m2 , e2 Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus RQ2: What is the impact of the size of prior knowledge on detecting unknown entities? Entity Linker Today's KB Predictions m1 , c1 m2, c2 …
  • 75. Results: RQ2 Sampling 2: KB size (mentions) ! ! ! ! ! ! blue: Our method
 red: Baseline
 75d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media 0" 5" 10" 15" 20" 25" 30" 35" 40" 45" 50" 20%" 30%" 40%" 50%" 60%" 70%" 80%" 90%"
  • 76. Conclusions Recall increases as amount of prior knowledge grows: 1. Able to deal with missing labels, justifying approach 2. Rate of unknown entity detection increases as KB grows 76d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 77. Future Work • Next step: Closing the loop • Feed back to KB (entity normalization) • From PER/LOC/ORG entities to other classes: • Books, buildings, drugs, artists, …? • Apply to other domains, languages • From random sampling to time-based sampling 77d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 78. Fin Questions? ! ! ! ! ! ! ! 78d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media