SlideShare a Scribd company logo
Generating Pseudo-ground Truth for
Predicting New Concepts in Social Streams
David Graus, Manos Tsagkias, Lars Buitinck, Maarten de Rijke
What is "anema"?
2d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
3d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
4d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
5d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
6d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
For content interpretation and complex filtering
tasks we want to know who/what people talk about.
7d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Entity Linking
8d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
TANK (VEHICLE)
Knowledge
Base (KB)
Document r
TANK
query q
?
?
TANK JOHNSON
9d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
10d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
11d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Named Entity Recognition
12d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Named Entity Recognition
13d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Challenges
1. Entity "importance"
2. Noisy & short text (Twitter), updates in the KB
14d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Challenge 1:
Entity Importance
Q: When should an entity exist in Wikipedia?
A: When it is important or has impact
!
!
15d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Challenge 1:
Entity Importance
Q: When should an entity exist in Wikipedia?
A: When it is important or has impact
!
Q: How do you know an entity is important or has impact?
A: If it is in Wikipedia, it is/has
16d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Challenge 1:
Entity Importance
Can we leverage today's entities to learn 

to predict tomorrow's entities?
17d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Challenge 1:
Entity Importance
Can we leverage today's entities to learn 

to predict tomorrow's entities?
18d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
74
Challenge 1:
Entity Importance
Can we leverage today's entities to learn 

to predict tomorrow's entities?
19d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
74/140
Challenge 2:
Noisy data & changing KB
Unsupervised method for generating 

pseudo-ground truth (for training NER)
20d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Assumption
A named-entity recognizer trained only on 

KB entities will learn to recognize KB entities
21d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
22d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
23d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
24d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
25d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
26d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
27d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
29d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
30d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
31d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Hahaha! Are we sure Jillert Anema isn't
Canadian? RT @rzbh: Dutch Coach's Anti-
America Rant http://on.cc.com/1htk9Wo
32d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
33d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
34d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
35d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
36d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
product
organization
organization
37d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
38d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
39d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
40d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
Predictions
m1
, c1
m2, c2
…
41d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
small KB
42d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Future
KB
Unlabeled Tweet
?
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
small KB
Future
KB
43d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Unlabeled Tweet
?
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
full KB
small KB
44d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
m1
, e1
m2
, e2
Unlabeled Tweet
?
Sample
Corpus
Training data
m1
, c1
m2
, c2
NERC
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
Future
KB
45d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
Unlabeled Tweet
?
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
Future
KB
Ground Truth
m1
, c1
m2
, c2
…
46d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet Entity
Linker
Unlabeled Tweet
?
NERC
Model
Predictions
m1
, c1
m2, c2
…
Today's
KB
Future
KB
Ground Truth
m1
, c1
m2
, c2
…
Evaluate
Evaluation
• Mention level (NER style)
• Entity level
!
47d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
48d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Evaluation
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
Prediction
• Mention level (NER style)
• Entity level
!
• Mention level (NER style)
• Entity level
!
49d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Evaluation
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
Prediction
Ground Truth
50d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Evaluation
• Mention level (NER style)
• Entity level
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
This is like IBM buying Apple after
the Homebrew Computing Club
demo of the Apple I.
Prediction
Ground Truth
51d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweets Entities Tweets EntitiesPrediction
Ground Truth
52d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Tweets Entities Tweets EntitiesPrediction
Ground Truth
Experimental setup
Data:
Corpus: Twitter (TREC'11 MB: 4,832,838 tweets)
KB: Wikipedia (Jan 4th, 2012)
!
Components:
EL: Semanticizer
NERC: Custom
53d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
54d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
"links": [
{
"text": "ASAP",
"linkProbability": 0.17446043165467626,
"id": "30864663",
"senseProbability": 0.11690647482014388,
"title": "ASAP (variety show)",
"url": "http://en.wikipedia.org/wiki/ASAP%20%28variety%20show%29",
"label": "ASAP",
"priorProbability": 0.631578947368421
},
{
"text": "ASAP Rocky",
"linkProbability": 0.9333333333333333,
"id": "33754098",
"senseProbability": 0.9333333333333333,
"title": "ASAP Rocky",
"url": "http://en.wikipedia.org/wiki/ASAP%20Rocky",
"label": "ASAP Rocky",
"priorProbability": 1.0
},
{
"text": "Kendrick Lamar",
"linkProbability": 0.9533333333333334,
"id": "29909823",
"senseProbability": 0.9533333333333334,
"title": "Kendrick Lamar",
"url": "http://en.wikipedia.org/wiki/Kendrick%20Lamar",
"label": "Kendrick Lamar",
"priorProbability": 1.0
},
"ASAP Rocky and Kendrick Lamar, that's when I started listening again"
NERC
Two-stage approach [1]
1. Recognition
• Predict entity span
• For each token predict B, I, or O tag.
• Structured perceptron
2. Classification
• Given entity span, predict entity class (PER/LOC/ORG)
• SVMs
55d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
[1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012.
NERC
Two-stage approach [1]
1. Recognition
• Predict entity span
• For each token predict B, I, or O tag.
• Structured perceptron
2. Classification
• Given entity span, predict entity class (PER/LOC/ORG)
• SVMs
56d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
[1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012.
"ASAP Rocky and Kendrick Lamar,
that's when I started listening again"
NERC
Two-stage approach [1]
1. Recognition
• Predict entity span
• For each token predict B, I, or O tag.
• Structured perceptron
2. Classification
• Given entity span, predict entity class (PER/LOC/ORG)
• SVMs
57d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
[1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012.
"ASAP Rocky and Kendrick Lamar,
that's when I started listening again"
B I O B I
O O O O O O
NERC
Two-stage approach [1]
1. Recognition
• Predict entity span
• For each token predict B, I, or O tag.
• Structured perceptron
2. Classification
• Given entity span, predict entity class (PER/LOC/ORG)
• SVMs
58d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
[1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012.
"ASAP Rocky and Kendrick Lamar,
that's when I started listening again"
person person
59d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
m1
, e1
m2
, e2
Future
KB
NERC
Tweet
NERC
Model
Unlabeled Tweet
?
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Predictions
m1
, c1
m2, c2
…
Entity
Linker
Sample
Corpus
Training data
m1
, c1
m2
, c2
From tweet to training sample
1. Convert EL output (Wikipedia concepts) to NERC labels;
• Label entity span (B-I-O) & class (PER/LOC/ORG)
!
2. Pick "good" samples
• entity linker's confidence score
• textual quality
60d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
61d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
"links": [
{
"text": "ASAP Rocky",
"linkProbability": 0.9333333333333333,
"id": "33754098",
"senseProbability": 0.9333333333333333,
"title": "ASAP Rocky",
"url": "http://en.wikipedia.org/wiki/ASAP%20Rocky",
"label": "ASAP Rocky",
"priorProbability": 1.0
},
{
"text": "Kendrick Lamar",
"linkProbability": 0.9533333333333334,
"id": "29909823",
"senseProbability": 0.9533333333333334,
"title": "Kendrick Lamar",
"url": "http://en.wikipedia.org/wiki/Kendrick%20Lamar",
"label": "Kendrick Lamar",
"priorProbability": 1.0
},
"ASAP Rocky and Kendrick Lamar, that's when I started listening again"
Entity Class
1. Map Wikipedia entity to DBpedia entity
2. Retrieve entity class (ontology);
• if Person: PER
• if Organisation, Company, or Non-ProfitOrganisation: ORG
• if Place, PopulatedPlace, City, Country: LOC
• …?
62d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Sampling Methods
1. Entity linker confidence score
2. Textual quality
63d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Sampling 1: Confidence Score
• Extract anchor text (a) to Wikipedia page (W)-mappings
• Confidence score combines two signals:
1. How common is it that a is used as a link
2. How commonly is a used as a link to W
64d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Sampling 1: Confidence Score
• Higher threshold = fewer entities, less noise
• Lower threshold = fewer entities, more noise
65d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Sampling 2: Textual quality
66d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Sampling 2: Textual quality
67d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Highest scoring tweets
1. Watching the History channel, Hitler’”⁹s
Family. Hitler hid his true family
heritage, while others had to measure
up to Aryan purity.
2. When you sense yourself becoming
negative, stop and consider what it
would mean to apply that negative
energy in the opposite direction.
3. So. After school tomorrow, french
revision class. Tuesday, Drama
rehearsal and then at 8, cricket
training. Wednesday, Drama. Thursday
… (c)
Lowest scoring tweets
1. Toni Braxton ~ He Wasn't Man
Enough for Me _HASHTAG_
_HASHTAG_? _URL_ RT _Mention_
2. tell me what u think The GetMore
Girls, Part One _URL_
3. this girl better not go off on me rt
Sampling 2: Textual quality
• Compare different sampling strategies;
• top tweets
• medium tweets
• medium+top tweets
• low+medium+top tweets (no sampling)
68d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Results
69d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
70d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
m1
, e1
m2
, e2
NERC
NERC
Model
Unlabeled Tweet
?Entity
Linker
Future
KB
Tweet
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus Sample
Corpus
Predictions
m1
, c1
m2, c2
…
RQ1: 	
 What is the impact of our sampling methods for 	
 

	
	
 	
 	
 	
 generating	
pseudo-ground truth?
Training data
m1
, c1
m2
, c2
Findings: EL confidence score threshold
1. Higher threshold, higher accuracy
!
!
!
!
!
!
!
	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 Solid: Precision

	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 Dotted: Recall
71d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
0"
5"
10"
15"
20"
25"
30"
35"
0.1" 0.2" 0.3" 0.4" 0.5" 0.6" 0.7" 0.8" 0.9"
Findings: EL confidence score threshold
2. Higher threshold, more predictions
72d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Findings: Textual Quality Sampling
73d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
74d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Future
KB
Tweet
Unlabeled Tweet
?
NERC
Model
NERC
Training data
m1
, c1
m2
, c2
Sample
Corpus
m1
, e1
m2
, e2
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
Tweet
Corpus
RQ2: 	
What is the impact of the size of prior knowledge
	
	
 	
 	
 	
 on detecting unknown entities?
Entity
Linker
Today's
KB
Predictions
m1
, c1
m2, c2
…
Results: RQ2
Sampling 2: KB size (mentions)
!
!
!
!
!
!
	
	
 	
 	
 	
 	
 	
 	
 	
 	
	
	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 blue: Our method

	
	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 	
 red: Baseline

75d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
0"
5"
10"
15"
20"
25"
30"
35"
40"
45"
50"
20%" 30%" 40%" 50%" 60%" 70%" 80%" 90%"
Conclusions
Recall increases as amount of prior knowledge grows:
1. Able to deal with missing labels, justifying approach
2. Rate of unknown entity detection increases as KB grows
76d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Future Work
• Next step: Closing the loop
• Feed back to KB (entity normalization)
• From PER/LOC/ORG entities to other classes:
• Books, buildings, drugs, artists, …?
• Apply to other domains, languages
• From random sampling to time-based sampling
77d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
Fin
Questions?
!
!
!
!
!
!
!
78d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media

More Related Content

Similar to Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

Government ICT 2.0 London 2014 – Open Source Drupal Empowering Government
Government ICT 2.0 London 2014 – Open Source Drupal Empowering GovernmentGovernment ICT 2.0 London 2014 – Open Source Drupal Empowering Government
Government ICT 2.0 London 2014 – Open Source Drupal Empowering Government
Jeffrey McGuire
 
Adaptive and social learning in the New Worl of Work & Living
Adaptive and social learning in the New Worl of Work & LivingAdaptive and social learning in the New Worl of Work & Living
Adaptive and social learning in the New Worl of Work & Living
The Learning Hub
 
Partner-Driven Innovation
Partner-Driven InnovationPartner-Driven Innovation
Partner-Driven Innovation
Daniel X. O'Neil
 
Everything that happens, happens somewhere
Everything that happens, happens somewhereEverything that happens, happens somewhere
Everything that happens, happens somewhere
Jene van der Heide
 
Taking your learning further SEND
Taking your learning further SENDTaking your learning further SEND
Taking your learning further SEND
Spectronics
 
Technology Driven Diferentiation
Technology Driven DiferentiationTechnology Driven Diferentiation
Technology Driven Diferentiation
Vicki Davis
 
DrupalGov Canberra 2014 Keynote: Code for a better world: Open Source Drupal ...
DrupalGov Canberra 2014 Keynote: Code for a better world: Open Source Drupal ...DrupalGov Canberra 2014 Keynote: Code for a better world: Open Source Drupal ...
DrupalGov Canberra 2014 Keynote: Code for a better world: Open Source Drupal ...
Jeffrey McGuire
 
20191114 ECP Jaarcongres 2019 Solid Activties & Use Cases in The Netherlands
20191114 ECP Jaarcongres 2019   Solid Activties & Use Cases in The Netherlands20191114 ECP Jaarcongres 2019   Solid Activties & Use Cases in The Netherlands
20191114 ECP Jaarcongres 2019 Solid Activties & Use Cases in The Netherlands
Platform Linked Data Netherlands (PLDN)
 
Building a social graph for the history of Europe: the CUbRIK histoGraph
Building a social graph for the history of Europe: the CUbRIK histoGraphBuilding a social graph for the history of Europe: the CUbRIK histoGraph
Building a social graph for the history of Europe: the CUbRIK histoGraph
CUbRIK Project
 
histoGraph: a case study in Digital Humanities
histoGraph: a case study in Digital HumanitieshistoGraph: a case study in Digital Humanities
histoGraph: a case study in Digital Humanities
CUbRIK Project
 
Now What? Creating Innovative LODLAM Sites and Apps
Now What? Creating Innovative LODLAM Sites and AppsNow What? Creating Innovative LODLAM Sites and Apps
Now What? Creating Innovative LODLAM Sites and Apps
Design for Context
 
Building Responsible AI - London Oct 2019
Building Responsible AI - London Oct 2019Building Responsible AI - London Oct 2019
Building Responsible AI - London Oct 2019
Ariadna Font Llitjos
 
2013.05.13 tom de nies - snow 2013
2013.05.13   tom de nies - snow 20132013.05.13   tom de nies - snow 2013
2013.05.13 tom de nies - snow 2013
tdenies
 
Possibilities to Practices II
Possibilities to Practices IIPossibilities to Practices II
Possibilities to Practices II
Derrick Willard
 
Here Comes The iPad Generation - Future of Higher Education 2015
Here Comes The iPad Generation - Future of Higher Education 2015Here Comes The iPad Generation - Future of Higher Education 2015
Here Comes The iPad Generation - Future of Higher Education 2015
Martin Hamilton
 
Digital Health in Real LIfe #RotReuma
Digital Health in Real LIfe #RotReuma Digital Health in Real LIfe #RotReuma
Digital Health in Real LIfe #RotReuma
Carlo V Caballero Uribe MD
 
MaRS Market Intelligence (Spring 2015)
MaRS Market Intelligence (Spring 2015)MaRS Market Intelligence (Spring 2015)
MaRS Market Intelligence (Spring 2015)
NORCAT
 
Tim Berners-Lee's 5-Star Open Data Scheme
Tim Berners-Lee's 5-Star Open Data SchemeTim Berners-Lee's 5-Star Open Data Scheme
Tim Berners-Lee's 5-Star Open Data Scheme
The Open Data Institute of North Carolina
 
Lean at ITP info session 2015
Lean at ITP info session 2015Lean at ITP info session 2015
Lean at ITP info session 2015
Jen van der Meer
 
Digital Storytelling
Digital Storytelling Digital Storytelling
Digital Storytelling
Yasser Monkachi
 

Similar to Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams (20)

Government ICT 2.0 London 2014 – Open Source Drupal Empowering Government
Government ICT 2.0 London 2014 – Open Source Drupal Empowering GovernmentGovernment ICT 2.0 London 2014 – Open Source Drupal Empowering Government
Government ICT 2.0 London 2014 – Open Source Drupal Empowering Government
 
Adaptive and social learning in the New Worl of Work & Living
Adaptive and social learning in the New Worl of Work & LivingAdaptive and social learning in the New Worl of Work & Living
Adaptive and social learning in the New Worl of Work & Living
 
Partner-Driven Innovation
Partner-Driven InnovationPartner-Driven Innovation
Partner-Driven Innovation
 
Everything that happens, happens somewhere
Everything that happens, happens somewhereEverything that happens, happens somewhere
Everything that happens, happens somewhere
 
Taking your learning further SEND
Taking your learning further SENDTaking your learning further SEND
Taking your learning further SEND
 
Technology Driven Diferentiation
Technology Driven DiferentiationTechnology Driven Diferentiation
Technology Driven Diferentiation
 
DrupalGov Canberra 2014 Keynote: Code for a better world: Open Source Drupal ...
DrupalGov Canberra 2014 Keynote: Code for a better world: Open Source Drupal ...DrupalGov Canberra 2014 Keynote: Code for a better world: Open Source Drupal ...
DrupalGov Canberra 2014 Keynote: Code for a better world: Open Source Drupal ...
 
20191114 ECP Jaarcongres 2019 Solid Activties & Use Cases in The Netherlands
20191114 ECP Jaarcongres 2019   Solid Activties & Use Cases in The Netherlands20191114 ECP Jaarcongres 2019   Solid Activties & Use Cases in The Netherlands
20191114 ECP Jaarcongres 2019 Solid Activties & Use Cases in The Netherlands
 
Building a social graph for the history of Europe: the CUbRIK histoGraph
Building a social graph for the history of Europe: the CUbRIK histoGraphBuilding a social graph for the history of Europe: the CUbRIK histoGraph
Building a social graph for the history of Europe: the CUbRIK histoGraph
 
histoGraph: a case study in Digital Humanities
histoGraph: a case study in Digital HumanitieshistoGraph: a case study in Digital Humanities
histoGraph: a case study in Digital Humanities
 
Now What? Creating Innovative LODLAM Sites and Apps
Now What? Creating Innovative LODLAM Sites and AppsNow What? Creating Innovative LODLAM Sites and Apps
Now What? Creating Innovative LODLAM Sites and Apps
 
Building Responsible AI - London Oct 2019
Building Responsible AI - London Oct 2019Building Responsible AI - London Oct 2019
Building Responsible AI - London Oct 2019
 
2013.05.13 tom de nies - snow 2013
2013.05.13   tom de nies - snow 20132013.05.13   tom de nies - snow 2013
2013.05.13 tom de nies - snow 2013
 
Possibilities to Practices II
Possibilities to Practices IIPossibilities to Practices II
Possibilities to Practices II
 
Here Comes The iPad Generation - Future of Higher Education 2015
Here Comes The iPad Generation - Future of Higher Education 2015Here Comes The iPad Generation - Future of Higher Education 2015
Here Comes The iPad Generation - Future of Higher Education 2015
 
Digital Health in Real LIfe #RotReuma
Digital Health in Real LIfe #RotReuma Digital Health in Real LIfe #RotReuma
Digital Health in Real LIfe #RotReuma
 
MaRS Market Intelligence (Spring 2015)
MaRS Market Intelligence (Spring 2015)MaRS Market Intelligence (Spring 2015)
MaRS Market Intelligence (Spring 2015)
 
Tim Berners-Lee's 5-Star Open Data Scheme
Tim Berners-Lee's 5-Star Open Data SchemeTim Berners-Lee's 5-Star Open Data Scheme
Tim Berners-Lee's 5-Star Open Data Scheme
 
Lean at ITP info session 2015
Lean at ITP info session 2015Lean at ITP info session 2015
Lean at ITP info session 2015
 
Digital Storytelling
Digital Storytelling Digital Storytelling
Digital Storytelling
 

More from David Graus

Pragmatic ethical and fair AI for data scientists
Pragmatic ethical and fair AI for data scientistsPragmatic ethical and fair AI for data scientists
Pragmatic ethical and fair AI for data scientists
David Graus
 
Bias in Recommendations
Bias in RecommendationsBias in Recommendations
Bias in Recommendations
David Graus
 
RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.
RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.
RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.
David Graus
 
CAT/AI: Computer Assisted Translation 
Assessment for Impact
CAT/AI: Computer Assisted Translation 
Assessment for ImpactCAT/AI: Computer Assisted Translation 
Assessment for Impact
CAT/AI: Computer Assisted Translation 
Assessment for Impact
David Graus
 
Opening the Black Box of User Profiles in Content-based Recommender Systems
Opening the Black Box of User Profiles in Content-based Recommender SystemsOpening the Black Box of User Profiles in Content-based Recommender Systems
Opening the Black Box of User Profiles in Content-based Recommender Systems
David Graus
 
Zoeken, vinden, en aanbevelen: personalisatie vs. privacy
Zoeken, vinden, en aanbevelen: personalisatie vs. privacyZoeken, vinden, en aanbevelen: personalisatie vs. privacy
Zoeken, vinden, en aanbevelen: personalisatie vs. privacy
David Graus
 
Layman's Talk: Entities of Interest --- Discovery in Digital Traces
Layman's Talk: Entities of Interest --- Discovery in Digital TracesLayman's Talk: Entities of Interest --- Discovery in Digital Traces
Layman's Talk: Entities of Interest --- Discovery in Digital Traces
David Graus
 
Financial News Mining @ PyData Amsterdam
Financial News Mining @ PyData AmsterdamFinancial News Mining @ PyData Amsterdam
Financial News Mining @ PyData Amsterdam
David Graus
 
De Macht van Data --- Hoe algoritmen ons leven vormgeven
De Macht van Data --- Hoe algoritmen ons leven vormgevenDe Macht van Data --- Hoe algoritmen ons leven vormgeven
De Macht van Data --- Hoe algoritmen ons leven vormgeven
David Graus
 
Financial News Mining @ FD Mediagroep/Company.info
Financial News Mining @ FD Mediagroep/Company.infoFinancial News Mining @ FD Mediagroep/Company.info
Financial News Mining @ FD Mediagroep/Company.info
David Graus
 
Dynamic Collective Entity Representations for Entity Ranking
Dynamic Collective Entity Representations for Entity RankingDynamic Collective Entity Representations for Entity Ranking
Dynamic Collective Entity Representations for Entity Ranking
David Graus
 
Understanding Email Traffic
Understanding Email TrafficUnderstanding Email Traffic
Understanding Email Traffic
David Graus
 
David Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27th
David Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27thDavid Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27th
David Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27th
David Graus
 
Semantic Search in E-Discovery
Semantic Search in E-DiscoverySemantic Search in E-Discovery
Semantic Search in E-Discovery
David Graus
 
Semantic annotation, clustering and visualization
Semantic annotation, clustering and visualizationSemantic annotation, clustering and visualization
Semantic annotation, clustering and visualization
David Graus
 

More from David Graus (15)

Pragmatic ethical and fair AI for data scientists
Pragmatic ethical and fair AI for data scientistsPragmatic ethical and fair AI for data scientists
Pragmatic ethical and fair AI for data scientists
 
Bias in Recommendations
Bias in RecommendationsBias in Recommendations
Bias in Recommendations
 
RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.
RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.
RecSys in the Media Industry: Relevance, Recency, Popularity, and Diversity.
 
CAT/AI: Computer Assisted Translation 
Assessment for Impact
CAT/AI: Computer Assisted Translation 
Assessment for ImpactCAT/AI: Computer Assisted Translation 
Assessment for Impact
CAT/AI: Computer Assisted Translation 
Assessment for Impact
 
Opening the Black Box of User Profiles in Content-based Recommender Systems
Opening the Black Box of User Profiles in Content-based Recommender SystemsOpening the Black Box of User Profiles in Content-based Recommender Systems
Opening the Black Box of User Profiles in Content-based Recommender Systems
 
Zoeken, vinden, en aanbevelen: personalisatie vs. privacy
Zoeken, vinden, en aanbevelen: personalisatie vs. privacyZoeken, vinden, en aanbevelen: personalisatie vs. privacy
Zoeken, vinden, en aanbevelen: personalisatie vs. privacy
 
Layman's Talk: Entities of Interest --- Discovery in Digital Traces
Layman's Talk: Entities of Interest --- Discovery in Digital TracesLayman's Talk: Entities of Interest --- Discovery in Digital Traces
Layman's Talk: Entities of Interest --- Discovery in Digital Traces
 
Financial News Mining @ PyData Amsterdam
Financial News Mining @ PyData AmsterdamFinancial News Mining @ PyData Amsterdam
Financial News Mining @ PyData Amsterdam
 
De Macht van Data --- Hoe algoritmen ons leven vormgeven
De Macht van Data --- Hoe algoritmen ons leven vormgevenDe Macht van Data --- Hoe algoritmen ons leven vormgeven
De Macht van Data --- Hoe algoritmen ons leven vormgeven
 
Financial News Mining @ FD Mediagroep/Company.info
Financial News Mining @ FD Mediagroep/Company.infoFinancial News Mining @ FD Mediagroep/Company.info
Financial News Mining @ FD Mediagroep/Company.info
 
Dynamic Collective Entity Representations for Entity Ranking
Dynamic Collective Entity Representations for Entity RankingDynamic Collective Entity Representations for Entity Ranking
Dynamic Collective Entity Representations for Entity Ranking
 
Understanding Email Traffic
Understanding Email TrafficUnderstanding Email Traffic
Understanding Email Traffic
 
David Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27th
David Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27thDavid Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27th
David Graus - Entity Linking (at SEA), Search Engines Amsterdam, Fri June 27th
 
Semantic Search in E-Discovery
Semantic Search in E-DiscoverySemantic Search in E-Discovery
Semantic Search in E-Discovery
 
Semantic annotation, clustering and visualization
Semantic annotation, clustering and visualizationSemantic annotation, clustering and visualization
Semantic annotation, clustering and visualization
 

Recently uploaded

AJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR NIET GreNo Guava Project File.pdfAJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR
 
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
Scintica Instrumentation
 
Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub...
Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub...Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub...
Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub...
Sérgio Sacani
 
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at 𝐳 = 2.9 wi...
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at  𝐳 = 2.9  wi...Discovery of An Apparent Red, High-Velocity Type Ia Supernova at  𝐳 = 2.9  wi...
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at 𝐳 = 2.9 wi...
Sérgio Sacani
 
gastroretentive drug delivery system-PPT.pptx
gastroretentive drug delivery system-PPT.pptxgastroretentive drug delivery system-PPT.pptx
gastroretentive drug delivery system-PPT.pptx
Shekar Boddu
 
Mechanisms and Applications of Antiviral Neutralizing Antibodies - Creative B...
Mechanisms and Applications of Antiviral Neutralizing Antibodies - Creative B...Mechanisms and Applications of Antiviral Neutralizing Antibodies - Creative B...
Mechanisms and Applications of Antiviral Neutralizing Antibodies - Creative B...
Creative-Biolabs
 
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...
PsychoTech Services
 
Sustainable Land Management - Climate Smart Agriculture
Sustainable Land Management - Climate Smart AgricultureSustainable Land Management - Climate Smart Agriculture
Sustainable Land Management - Climate Smart Agriculture
International Food Policy Research Institute- South Asia Office
 
Summary Of transcription and Translation.pdf
Summary Of transcription and Translation.pdfSummary Of transcription and Translation.pdf
Summary Of transcription and Translation.pdf
vadgavevedant86
 
Signatures of wave erosion in Titan’s coasts
Signatures of wave erosion in Titan’s coastsSignatures of wave erosion in Titan’s coasts
Signatures of wave erosion in Titan’s coasts
Sérgio Sacani
 
Reaching the age of Adolescence- Class 8
Reaching the age of Adolescence- Class 8Reaching the age of Adolescence- Class 8
Reaching the age of Adolescence- Class 8
abhinayakamasamudram
 
Introduction_Ch_01_Biotech Biotechnology course .pptx
Introduction_Ch_01_Biotech Biotechnology course .pptxIntroduction_Ch_01_Biotech Biotechnology course .pptx
Introduction_Ch_01_Biotech Biotechnology course .pptx
QusayMaghayerh
 
Injection: Risks and challenges - Injection of CO2 into geological rock forma...
Injection: Risks and challenges - Injection of CO2 into geological rock forma...Injection: Risks and challenges - Injection of CO2 into geological rock forma...
Injection: Risks and challenges - Injection of CO2 into geological rock forma...
Oeko-Institut
 
Pests of Storage_Identification_Dr.UPR.pdf
Pests of Storage_Identification_Dr.UPR.pdfPests of Storage_Identification_Dr.UPR.pdf
Pests of Storage_Identification_Dr.UPR.pdf
PirithiRaju
 
Post translation modification by Suyash Garg
Post translation modification by Suyash GargPost translation modification by Suyash Garg
Post translation modification by Suyash Garg
suyashempire
 
Direct Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart AgricultureDirect Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart Agriculture
International Food Policy Research Institute- South Asia Office
 
Immunotherapy presentation from clinical immunology
Immunotherapy presentation from clinical immunologyImmunotherapy presentation from clinical immunology
Immunotherapy presentation from clinical immunology
VetriVel359477
 
2001_Book_HumanChromosomes - Genéticapdf
2001_Book_HumanChromosomes - Genéticapdf2001_Book_HumanChromosomes - Genéticapdf
2001_Book_HumanChromosomes - Genéticapdf
lucianamillenium
 
fermented food science of sauerkraut.pptx
fermented food science of sauerkraut.pptxfermented food science of sauerkraut.pptx
fermented food science of sauerkraut.pptx
ananya23nair
 
23PH301 - Optics - Unit 1 - Optical Lenses
23PH301 - Optics  -  Unit 1 - Optical Lenses23PH301 - Optics  -  Unit 1 - Optical Lenses
23PH301 - Optics - Unit 1 - Optical Lenses
RDhivya6
 

Recently uploaded (20)

AJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR NIET GreNo Guava Project File.pdfAJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR NIET GreNo Guava Project File.pdf
 
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
 
Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub...
Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub...Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub...
Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub...
 
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at 𝐳 = 2.9 wi...
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at  𝐳 = 2.9  wi...Discovery of An Apparent Red, High-Velocity Type Ia Supernova at  𝐳 = 2.9  wi...
Discovery of An Apparent Red, High-Velocity Type Ia Supernova at 𝐳 = 2.9 wi...
 
gastroretentive drug delivery system-PPT.pptx
gastroretentive drug delivery system-PPT.pptxgastroretentive drug delivery system-PPT.pptx
gastroretentive drug delivery system-PPT.pptx
 
Mechanisms and Applications of Antiviral Neutralizing Antibodies - Creative B...
Mechanisms and Applications of Antiviral Neutralizing Antibodies - Creative B...Mechanisms and Applications of Antiviral Neutralizing Antibodies - Creative B...
Mechanisms and Applications of Antiviral Neutralizing Antibodies - Creative B...
 
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...
 
Sustainable Land Management - Climate Smart Agriculture
Sustainable Land Management - Climate Smart AgricultureSustainable Land Management - Climate Smart Agriculture
Sustainable Land Management - Climate Smart Agriculture
 
Summary Of transcription and Translation.pdf
Summary Of transcription and Translation.pdfSummary Of transcription and Translation.pdf
Summary Of transcription and Translation.pdf
 
Signatures of wave erosion in Titan’s coasts
Signatures of wave erosion in Titan’s coastsSignatures of wave erosion in Titan’s coasts
Signatures of wave erosion in Titan’s coasts
 
Reaching the age of Adolescence- Class 8
Reaching the age of Adolescence- Class 8Reaching the age of Adolescence- Class 8
Reaching the age of Adolescence- Class 8
 
Introduction_Ch_01_Biotech Biotechnology course .pptx
Introduction_Ch_01_Biotech Biotechnology course .pptxIntroduction_Ch_01_Biotech Biotechnology course .pptx
Introduction_Ch_01_Biotech Biotechnology course .pptx
 
Injection: Risks and challenges - Injection of CO2 into geological rock forma...
Injection: Risks and challenges - Injection of CO2 into geological rock forma...Injection: Risks and challenges - Injection of CO2 into geological rock forma...
Injection: Risks and challenges - Injection of CO2 into geological rock forma...
 
Pests of Storage_Identification_Dr.UPR.pdf
Pests of Storage_Identification_Dr.UPR.pdfPests of Storage_Identification_Dr.UPR.pdf
Pests of Storage_Identification_Dr.UPR.pdf
 
Post translation modification by Suyash Garg
Post translation modification by Suyash GargPost translation modification by Suyash Garg
Post translation modification by Suyash Garg
 
Direct Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart AgricultureDirect Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart Agriculture
 
Immunotherapy presentation from clinical immunology
Immunotherapy presentation from clinical immunologyImmunotherapy presentation from clinical immunology
Immunotherapy presentation from clinical immunology
 
2001_Book_HumanChromosomes - Genéticapdf
2001_Book_HumanChromosomes - Genéticapdf2001_Book_HumanChromosomes - Genéticapdf
2001_Book_HumanChromosomes - Genéticapdf
 
fermented food science of sauerkraut.pptx
fermented food science of sauerkraut.pptxfermented food science of sauerkraut.pptx
fermented food science of sauerkraut.pptx
 
23PH301 - Optics - Unit 1 - Optical Lenses
23PH301 - Optics  -  Unit 1 - Optical Lenses23PH301 - Optics  -  Unit 1 - Optical Lenses
23PH301 - Optics - Unit 1 - Optical Lenses
 

Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

  • 1. Generating Pseudo-ground Truth for Predicting New Concepts in Social Streams David Graus, Manos Tsagkias, Lars Buitinck, Maarten de Rijke
  • 2. What is "anema"? 2d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 3. 3d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 4. 4d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 5. 5d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 6. 6d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 7. For content interpretation and complex filtering tasks we want to know who/what people talk about. 7d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 8. Entity Linking 8d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media TANK (VEHICLE) Knowledge Base (KB) Document r TANK query q ? ? TANK JOHNSON
  • 9. 9d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 10. 10d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 11. 11d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 12. Named Entity Recognition 12d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 13. Named Entity Recognition 13d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 14. Challenges 1. Entity "importance" 2. Noisy & short text (Twitter), updates in the KB 14d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 15. Challenge 1: Entity Importance Q: When should an entity exist in Wikipedia? A: When it is important or has impact ! ! 15d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 16. Challenge 1: Entity Importance Q: When should an entity exist in Wikipedia? A: When it is important or has impact ! Q: How do you know an entity is important or has impact? A: If it is in Wikipedia, it is/has 16d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 17. Challenge 1: Entity Importance Can we leverage today's entities to learn 
 to predict tomorrow's entities? 17d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 18. Challenge 1: Entity Importance Can we leverage today's entities to learn 
 to predict tomorrow's entities? 18d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media 74
  • 19. Challenge 1: Entity Importance Can we leverage today's entities to learn 
 to predict tomorrow's entities? 19d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media 74/140
  • 20. Challenge 2: Noisy data & changing KB Unsupervised method for generating 
 pseudo-ground truth (for training NER) 20d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 21. Assumption A named-entity recognizer trained only on 
 KB entities will learn to recognize KB entities 21d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 22. 22d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus
  • 23. 23d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet
  • 24. 24d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker
  • 25. 25d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2
  • 26. 26d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I.
  • 27. 27d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I.
  • 28. d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I.
  • 29. 29d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2
  • 30. 30d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ?
  • 31. 31d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Hahaha! Are we sure Jillert Anema isn't Canadian? RT @rzbh: Dutch Coach's Anti- America Rant http://on.cc.com/1htk9Wo
  • 32. 32d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ?
  • 33. 33d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus
  • 34. 34d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2
  • 35. 35d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I.
  • 36. 36d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I. product organization organization
  • 37. 37d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 NERC
  • 38. 38d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model
  • 39. 39d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model
  • 40. 40d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model Predictions m1 , c1 m2, c2 …
  • 41. 41d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model Predictions m1 , c1 m2, c2 … Today's KB small KB
  • 42. 42d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Future KB Unlabeled Tweet ? Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model Predictions m1 , c1 m2, c2 … Today's KB small KB
  • 43. Future KB 43d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Unlabeled Tweet ? Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model Predictions m1 , c1 m2, c2 … Today's KB full KB small KB
  • 44. 44d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model Predictions m1 , c1 m2, c2 … Today's KB Future KB
  • 45. 45d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker Unlabeled Tweet ? NERC Model Predictions m1 , c1 m2, c2 … Today's KB Future KB Ground Truth m1 , c1 m2 , c2 …
  • 46. 46d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker Unlabeled Tweet ? NERC Model Predictions m1 , c1 m2, c2 … Today's KB Future KB Ground Truth m1 , c1 m2 , c2 … Evaluate
  • 47. Evaluation • Mention level (NER style) • Entity level ! 47d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 48. 48d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Evaluation This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I. Prediction • Mention level (NER style) • Entity level !
  • 49. • Mention level (NER style) • Entity level ! 49d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Evaluation This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I. This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I. Prediction Ground Truth
  • 50. 50d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Evaluation • Mention level (NER style) • Entity level This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I. This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I. Prediction Ground Truth
  • 51. 51d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweets Entities Tweets EntitiesPrediction Ground Truth
  • 52. 52d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweets Entities Tweets EntitiesPrediction Ground Truth
  • 53. Experimental setup Data: Corpus: Twitter (TREC'11 MB: 4,832,838 tweets) KB: Wikipedia (Jan 4th, 2012) ! Components: EL: Semanticizer NERC: Custom 53d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 54. 54d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media "links": [ { "text": "ASAP", "linkProbability": 0.17446043165467626, "id": "30864663", "senseProbability": 0.11690647482014388, "title": "ASAP (variety show)", "url": "http://en.wikipedia.org/wiki/ASAP%20%28variety%20show%29", "label": "ASAP", "priorProbability": 0.631578947368421 }, { "text": "ASAP Rocky", "linkProbability": 0.9333333333333333, "id": "33754098", "senseProbability": 0.9333333333333333, "title": "ASAP Rocky", "url": "http://en.wikipedia.org/wiki/ASAP%20Rocky", "label": "ASAP Rocky", "priorProbability": 1.0 }, { "text": "Kendrick Lamar", "linkProbability": 0.9533333333333334, "id": "29909823", "senseProbability": 0.9533333333333334, "title": "Kendrick Lamar", "url": "http://en.wikipedia.org/wiki/Kendrick%20Lamar", "label": "Kendrick Lamar", "priorProbability": 1.0 }, "ASAP Rocky and Kendrick Lamar, that's when I started listening again"
  • 55. NERC Two-stage approach [1] 1. Recognition • Predict entity span • For each token predict B, I, or O tag. • Structured perceptron 2. Classification • Given entity span, predict entity class (PER/LOC/ORG) • SVMs 55d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media [1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012.
  • 56. NERC Two-stage approach [1] 1. Recognition • Predict entity span • For each token predict B, I, or O tag. • Structured perceptron 2. Classification • Given entity span, predict entity class (PER/LOC/ORG) • SVMs 56d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media [1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012. "ASAP Rocky and Kendrick Lamar, that's when I started listening again"
  • 57. NERC Two-stage approach [1] 1. Recognition • Predict entity span • For each token predict B, I, or O tag. • Structured perceptron 2. Classification • Given entity span, predict entity class (PER/LOC/ORG) • SVMs 57d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media [1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012. "ASAP Rocky and Kendrick Lamar, that's when I started listening again" B I O B I O O O O O O
  • 58. NERC Two-stage approach [1] 1. Recognition • Predict entity span • For each token predict B, I, or O tag. • Structured perceptron 2. Classification • Given entity span, predict entity class (PER/LOC/ORG) • SVMs 58d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media [1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012. "ASAP Rocky and Kendrick Lamar, that's when I started listening again" person person
  • 59. 59d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media m1 , e1 m2 , e2 Future KB NERC Tweet NERC Model Unlabeled Tweet ? Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Predictions m1 , c1 m2, c2 … Entity Linker Sample Corpus Training data m1 , c1 m2 , c2
  • 60. From tweet to training sample 1. Convert EL output (Wikipedia concepts) to NERC labels; • Label entity span (B-I-O) & class (PER/LOC/ORG) ! 2. Pick "good" samples • entity linker's confidence score • textual quality 60d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 61. 61d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media "links": [ { "text": "ASAP Rocky", "linkProbability": 0.9333333333333333, "id": "33754098", "senseProbability": 0.9333333333333333, "title": "ASAP Rocky", "url": "http://en.wikipedia.org/wiki/ASAP%20Rocky", "label": "ASAP Rocky", "priorProbability": 1.0 }, { "text": "Kendrick Lamar", "linkProbability": 0.9533333333333334, "id": "29909823", "senseProbability": 0.9533333333333334, "title": "Kendrick Lamar", "url": "http://en.wikipedia.org/wiki/Kendrick%20Lamar", "label": "Kendrick Lamar", "priorProbability": 1.0 }, "ASAP Rocky and Kendrick Lamar, that's when I started listening again"
  • 62. Entity Class 1. Map Wikipedia entity to DBpedia entity 2. Retrieve entity class (ontology); • if Person: PER • if Organisation, Company, or Non-ProfitOrganisation: ORG • if Place, PopulatedPlace, City, Country: LOC • …? 62d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 63. Sampling Methods 1. Entity linker confidence score 2. Textual quality 63d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 64. Sampling 1: Confidence Score • Extract anchor text (a) to Wikipedia page (W)-mappings • Confidence score combines two signals: 1. How common is it that a is used as a link 2. How commonly is a used as a link to W 64d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 65. Sampling 1: Confidence Score • Higher threshold = fewer entities, less noise • Lower threshold = fewer entities, more noise 65d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 66. Sampling 2: Textual quality 66d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 67. Sampling 2: Textual quality 67d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Highest scoring tweets 1. Watching the History channel, Hitler’”⁹s Family. Hitler hid his true family heritage, while others had to measure up to Aryan purity. 2. When you sense yourself becoming negative, stop and consider what it would mean to apply that negative energy in the opposite direction. 3. So. After school tomorrow, french revision class. Tuesday, Drama rehearsal and then at 8, cricket training. Wednesday, Drama. Thursday … (c) Lowest scoring tweets 1. Toni Braxton ~ He Wasn't Man Enough for Me _HASHTAG_ _HASHTAG_? _URL_ RT _Mention_ 2. tell me what u think The GetMore Girls, Part One _URL_ 3. this girl better not go off on me rt
  • 68. Sampling 2: Textual quality • Compare different sampling strategies; • top tweets • medium tweets • medium+top tweets • low+medium+top tweets (no sampling) 68d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 69. Results 69d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 70. 70d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media m1 , e1 m2 , e2 NERC NERC Model Unlabeled Tweet ?Entity Linker Future KB Tweet Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Sample Corpus Predictions m1 , c1 m2, c2 … RQ1: What is the impact of our sampling methods for 
 generating pseudo-ground truth? Training data m1 , c1 m2 , c2
  • 71. Findings: EL confidence score threshold 1. Higher threshold, higher accuracy ! ! ! ! ! ! ! Solid: Precision
 Dotted: Recall 71d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media 0" 5" 10" 15" 20" 25" 30" 35" 0.1" 0.2" 0.3" 0.4" 0.5" 0.6" 0.7" 0.8" 0.9"
  • 72. Findings: EL confidence score threshold 2. Higher threshold, more predictions 72d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 73. Findings: Textual Quality Sampling 73d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 74. 74d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Future KB Tweet Unlabeled Tweet ? NERC Model NERC Training data m1 , c1 m2 , c2 Sample Corpus m1 , e1 m2 , e2 Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus RQ2: What is the impact of the size of prior knowledge on detecting unknown entities? Entity Linker Today's KB Predictions m1 , c1 m2, c2 …
  • 75. Results: RQ2 Sampling 2: KB size (mentions) ! ! ! ! ! ! blue: Our method
 red: Baseline
 75d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media 0" 5" 10" 15" 20" 25" 30" 35" 40" 45" 50" 20%" 30%" 40%" 50%" 60%" 70%" 80%" 90%"
  • 76. Conclusions Recall increases as amount of prior knowledge grows: 1. Able to deal with missing labels, justifying approach 2. Rate of unknown entity detection increases as KB grows 76d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 77. Future Work • Next step: Closing the loop • Feed back to KB (entity normalization) • From PER/LOC/ORG entities to other classes: • Books, buildings, drugs, artists, …? • Apply to other domains, languages • From random sampling to time-based sampling 77d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  • 78. Fin Questions? ! ! ! ! ! ! ! 78d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media