Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

The manual curation of knowledge bases is a bottleneck in fast-paced domains where new concepts constantly emerge. Identification of nascent concepts is important for improving early entity linking, content interpretation, and recommendation of new content in real-time applications. We present an unsupervised method for generating pseudo-ground truth for training a named entity recognizer to specifically identify entities that will become concepts in a knowledge base in the setting of social streams. We show that our method is able to deal with missing labels, justifying the use of pseudo-ground truth generation in this task. Finally, we show that our method significantly outperforms a lexical-matching baseline by leveraging strategies for sampling pseudo-ground truth based on entity confidence scores and textual quality of input documents.

Presentation Transcript

  • Generating Pseudo-ground Truth for Predicting New Concepts in Social Streams David Graus, Manos Tsagkias, Lars Buitinck, Maarten de Rijke
  • What is "anema"? d.p.graus@uva.nl | @dvdgrs | www.graus.co | April 15 2014 | ECIR ‘14 | Mining Social Media
  • For content interpretation and complex filtering tasks we want to know who/what people talk about.
  • Entity Linking [Diagram: the mention "TANK" (query q) in document r is matched against the Knowledge Base (KB); candidate entities include TANK (VEHICLE) and TANK JOHNSON.]
  • Named Entity Recognition
  • Challenges
      1. Entity "importance"
      2. Noisy & short text (Twitter), updates in the KB
  • Challenge 1: Entity Importance
      Q: When should an entity exist in Wikipedia?
      A: When it is important or has impact.
      Q: How do you know an entity is important or has impact?
      A: If it is in Wikipedia, it is/has.
  • Challenge 1: Entity Importance
      Can we leverage today's entities to learn to predict tomorrow's entities? (74/140)
  • Challenge 2: Noisy data & changing KB
      Unsupervised method for generating pseudo-ground truth (for training NER)
  • Assumption: A named-entity recognizer trained only on KB entities will learn to recognize KB entities
  • [Pipeline diagram, built up over slides 22 to 46]
      Training: Tweet Corpus → Entity Linker (with Today's KB, labeled "small KB") → linked mentions (m1, e1), (m2, e2) → Sample Corpus → Training data (m1, c1), (m2, c2) → NERC → NERC Model
      Prediction: Unlabeled Tweet → NERC Model → Predictions (m1, c1), (m2, c2), …
      Evaluation: Predictions are evaluated against Ground Truth (m1, c1), (m2, c2), … from the Future KB (labeled "full KB")
      Example linked tweet: "This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I." (class labels shown on the slide: product, organization, organization)
      Example unlabeled tweet: "Hahaha! Are we sure Jillert Anema isn't Canadian? RT @rzbh: Dutch Coach's Anti-America Rant http://on.cc.com/1htk9Wo"
  • Evaluation
      • Mention level (NER style)
      • Entity level
      Example (Prediction vs. Ground Truth): "This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I."
      [Diagram: Tweets and Entities, shown for Prediction and for Ground Truth]
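A minimal sketch of how the two evaluation levels above could be scored. It assumes that mention-level evaluation compares every (tweet, span) occurrence, while entity-level evaluation collapses mentions to unique surface forms across tweets. The tuple layout, function names, and example spans are illustrative assumptions and are not taken from the slides.

```python
from typing import Iterable, Set, Tuple

# (tweet_id, start_token, end_token, surface_form): a hypothetical layout
Mention = Tuple[str, int, int, str]


def precision_recall(predicted: Set, gold: Set) -> Tuple[float, float]:
    """Return (precision, recall) for two sets of hashable items."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def mention_level(pred: Iterable[Mention], gold: Iterable[Mention]) -> Tuple[float, float]:
    """Score every occurrence: tweet id and token span must match exactly."""
    return precision_recall(set(pred), set(gold))


def entity_level(pred: Iterable[Mention], gold: Iterable[Mention]) -> Tuple[float, float]:
    """Score unique entities: collapse mentions to lowercased surface forms."""
    pred_entities = {surface.lower() for _, _, _, surface in pred}
    gold_entities = {surface.lower() for _, _, _, surface in gold}
    return precision_recall(pred_entities, gold_entities)


# Hypothetical predictions and ground truth over two tweets
gold = [
    ("t1", 3, 4, "IBM"),
    ("t1", 5, 6, "Apple"),
    ("t1", 8, 11, "Homebrew Computing Club"),
    ("t2", 0, 1, "Apple"),
]
pred = [
    ("t1", 3, 4, "IBM"),
    ("t2", 0, 1, "Apple"),
    ("t2", 4, 5, "Club"),
]

mention_scores = mention_level(pred, gold)
entity_scores = entity_level(pred, gold)
```

In this toy data the missed "Apple" mention in t1 lowers mention-level recall but not entity-level recall, because "Apple" is still found once in t2.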
  • Experimental setup
      Data:
        Corpus: Twitter (TREC'11 MB: 4,832,838 tweets)
        KB: Wikipedia (Jan 4th, 2012)
      Components:
        EL: Semanticizer
        NERC: Custom
  • Example entity linker output for the tweet "ASAP Rocky and Kendrick Lamar, that's when I started listening again":
      "links": [
        { "text": "ASAP", "linkProbability": 0.17446043165467626, "id": "30864663", "senseProbability": 0.11690647482014388, "title": "ASAP (variety show)", "url": "http://en.wikipedia.org/wiki/ASAP%20%28variety%20show%29", "label": "ASAP", "priorProbability": 0.631578947368421 },
        { "text": "ASAP Rocky", "linkProbability": 0.9333333333333333, "id": "33754098", "senseProbability": 0.9333333333333333, "title": "ASAP Rocky", "url": "http://en.wikipedia.org/wiki/ASAP%20Rocky", "label": "ASAP Rocky", "priorProbability": 1.0 },
        { "text": "Kendrick Lamar", "linkProbability": 0.9533333333333334, "id": "29909823", "senseProbability": 0.9533333333333334, "title": "Kendrick Lamar", "url": "http://en.wikipedia.org/wiki/Kendrick%20Lamar", "label": "Kendrick Lamar", "priorProbability": 1.0 },
  • NERC: Two-stage approach [1]
      1. Recognition
          • Predict entity span
          • For each token predict B, I, or O tag
          • Structured perceptron
      2. Classification
          • Given entity span, predict entity class (PER/LOC/ORG)
          • SVMs
      Example: "ASAP Rocky and Kendrick Lamar, that's when I started listening again"
          Recognition: B I O B I O O O O O O
          Classification: person, person
      [1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012.
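The B/I/O labeling on this slide can be illustrated with a short sketch. It assumes whitespace tokenization, trailing punctuation stripped for matching, and entity spans supplied as plain strings (for instance the "text" fields of the entity linker output). The function name `bio_tags` is hypothetical, and the sketch only shows how such labels are derived, not how the structured perceptron predicts them.

```python
from typing import List


def bio_tags(tokens: List[str], entity_texts: List[str]) -> List[str]:
    """Tag each token B (begins an entity), I (inside one), or O (outside)."""
    tags = ["O"] * len(tokens)
    # Strip trailing punctuation so "Lamar," still matches "Lamar"
    normalized = [tok.strip(",.!?") for tok in tokens]
    for text in entity_texts:
        words = text.split()
        n = len(words)
        for start in range(len(tokens) - n + 1):
            window_free = all(tag == "O" for tag in tags[start:start + n])
            if normalized[start:start + n] == words and window_free:
                tags[start] = "B"
                for i in range(start + 1, start + n):
                    tags[i] = "I"
    return tags


tweet = "ASAP Rocky and Kendrick Lamar, that's when I started listening again"
tokens = tweet.split()
tags = bio_tags(tokens, ["ASAP Rocky", "Kendrick Lamar"])
print(" ".join(tags))  # B I O B I O O O O O O
```

Under these assumptions the output reproduces the tag sequence shown on the slide for the example tweet.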
  • From tweet to training sample
      1. Convert EL output (Wikipedia concepts) to NERC labels
          • Label entity span (B-I-O) & class (PER/LOC/ORG)
      2. Pick "good" samples
          • entity linker's confidence score
          • textual quality
  • Example entity linker output for the tweet "ASAP Rocky and Kendrick Lamar, that's when I started listening again":
      "links": [
        { "text": "ASAP Rocky", "linkProbability": 0.9333333333333333, "id": "33754098", "senseProbability": 0.9333333333333333, "title": "ASAP Rocky", "url": "http://en.wikipedia.org/wiki/ASAP%20Rocky", "label": "ASAP Rocky", "priorProbability": 1.0 },
        { "text": "Kendrick Lamar", "linkProbability": 0.9533333333333334, "id": "29909823", "senseProbability": 0.9533333333333334, "title": "Kendrick Lamar", "url": "http://en.wikipedia.org/wiki/Kendrick%20Lamar", "label": "Kendrick Lamar", "priorProbability": 1.0 },
  • Entity Class
      1. Map Wikipedia entity to DBpedia entity
      2. Retrieve entity class (ontology):
          • if Person: PER
          • if Organisation, Company, or Non-ProfitOrganisation: ORG
          • if Place, PopulatedPlace, City, Country: LOC
          • …?
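The class rules on this slide can be written as a lookup table. The sketch below assumes the DBpedia ontology types of an entity have already been retrieved and are passed in as a list of strings; the DBpedia lookup itself is not shown. Returning `None` for entities outside the three classes is an assumption, since the slide leaves that case open ("…?").

```python
from typing import Iterable, Optional

# Mapping taken from the slide: DBpedia ontology class -> NERC class
DBPEDIA_TO_NERC = {
    "Person": "PER",
    "Organisation": "ORG",
    "Company": "ORG",
    "Non-ProfitOrganisation": "ORG",
    "Place": "LOC",
    "PopulatedPlace": "LOC",
    "City": "LOC",
    "Country": "LOC",
}


def nerc_class(dbpedia_types: Iterable[str]) -> Optional[str]:
    """Return the NERC class of the first recognized DBpedia type, else None."""
    for dbpedia_type in dbpedia_types:
        label = DBPEDIA_TO_NERC.get(dbpedia_type)
        if label is not None:
            return label
    # ASSUMPTION: entities of any other class are left unlabeled
    return None


print(nerc_class(["Agent", "Person"]))  # PER
print(nerc_class(["Work", "Book"]))     # None
```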
  • Sampling Methods
      1. Entity linker confidence score
      2. Textual quality
  • Sampling 1: Confidence Score
      • Extract anchor text (a) to Wikipedia page (W) mappings
      • Confidence score combines two signals:
          1. How commonly a is used as a link
          2. How commonly a is used as a link to W
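A hedged sketch of the two signals, computed from anchor-text counts. The slide does not say how the signals are combined, so the product used in `confidence` is an assumption made for illustration only. The function names and all counts are hypothetical.

```python
from typing import Dict, Tuple


def link_probability(anchor: str, linked: Dict[str, int],
                     occurrences: Dict[str, int]) -> float:
    """Signal 1: fraction of all occurrences of `anchor` that are links."""
    total = occurrences.get(anchor, 0)
    return linked.get(anchor, 0) / total if total else 0.0


def commonness(anchor: str, page: str, targets: Dict[Tuple[str, str], int],
               linked: Dict[str, int]) -> float:
    """Signal 2: fraction of links with text `anchor` that point to `page`."""
    total = linked.get(anchor, 0)
    return targets.get((anchor, page), 0) / total if total else 0.0


def confidence(anchor: str, page: str, targets: Dict[Tuple[str, str], int],
               linked: Dict[str, int], occurrences: Dict[str, int]) -> float:
    """ASSUMPTION: combine the two signals by multiplying them."""
    return (link_probability(anchor, linked, occurrences)
            * commonness(anchor, page, targets, linked))


# Hypothetical counts: "ASAP" occurs 1000 times, 200 of them as a link
occurrences = {"ASAP": 1000}
linked = {"ASAP": 200}
targets = {("ASAP", "ASAP (variety show)"): 120, ("ASAP", "ASAP Rocky"): 80}

score = confidence("ASAP", "ASAP (variety show)", targets, linked, occurrences)
```

With these made-up counts the anchor is a link in 20% of its occurrences, and 60% of those links point to the variety show page.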
  • Sampling 1: Confidence Score
      • Higher threshold = fewer entities, less noise
      • Lower threshold = more entities, more noise
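The threshold trade-off can be seen on the example entity linker output from the earlier slide. This sketch assumes that `senseProbability` is the score being thresholded; the slides do not state which field is used, so `score_key` is left configurable. The function name `filter_links` is hypothetical.

```python
from typing import Dict, List


def filter_links(links: List[Dict], threshold: float,
                 score_key: str = "senseProbability") -> List[Dict]:
    """Keep only links whose confidence score reaches the threshold."""
    return [link for link in links if link[score_key] >= threshold]


# Scores copied from the example output shown on the slides
links = [
    {"text": "ASAP", "senseProbability": 0.11690647482014388},
    {"text": "ASAP Rocky", "senseProbability": 0.9333333333333333},
    {"text": "Kendrick Lamar", "senseProbability": 0.9533333333333334},
]

for threshold in (0.1, 0.5, 0.94):
    kept = [link["text"] for link in filter_links(links, threshold)]
    print(threshold, kept)
```

At 0.1 all three links survive, including the spurious "ASAP"; at 0.5 only "ASAP Rocky" and "Kendrick Lamar" remain; at 0.94 only "Kendrick Lamar" is kept.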
  • Sampling 2: Textual quality
      Highest scoring tweets
        1. Watching the History channel, Hitler's Family. Hitler hid his true family heritage, while others had to measure up to Aryan purity.
        2. When you sense yourself becoming negative, stop and consider what it would mean to apply that negative energy in the opposite direction.
        3. So. After school tomorrow, french revision class. Tuesday, Drama rehearsal and then at 8, cricket training. Wednesday, Drama. Thursday …
      Lowest scoring tweets
        1. Toni Braxton ~ He Wasn't Man Enough for Me _HASHTAG_ _HASHTAG_? _URL_ RT _Mention_
        2. tell me what u think The GetMore Girls, Part One _URL_
        3. this girl better not go off on me rt
  • Sampling 2: Textual quality
      • Compare different sampling strategies:
          • top tweets
          • medium tweets
          • medium+top tweets
          • low+medium+top tweets (no sampling)
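A rough sketch of the four sampling strategies. The slides do not say how the low, medium, and top groups are delimited or how the quality score is computed, so this example assumes precomputed scores and three equal-sized buckets after sorting. All names and scores are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Strategy name -> quality buckets it draws from (names follow the slide)
STRATEGIES = {
    "top": ("top",),
    "medium": ("medium",),
    "medium+top": ("medium", "top"),
    "low+medium+top": ("low", "medium", "top"),  # no sampling
}


def split_by_quality(scored: Sequence[Tuple[str, float]]) -> Dict[str, List[str]]:
    """ASSUMPTION: sort by score and cut into three equal-sized buckets."""
    ranked = sorted(scored, key=lambda pair: pair[1])
    n = len(ranked)
    cut1, cut2 = n // 3, 2 * n // 3
    return {
        "low": [tweet for tweet, _ in ranked[:cut1]],
        "medium": [tweet for tweet, _ in ranked[cut1:cut2]],
        "top": [tweet for tweet, _ in ranked[cut2:]],
    }


def sample(scored: Sequence[Tuple[str, float]], strategy: str) -> List[str]:
    """Return the tweets selected by one of the four strategies."""
    buckets = split_by_quality(scored)
    return [tweet for name in STRATEGIES[strategy] for tweet in buckets[name]]


# Hypothetical (tweet_id, quality_score) pairs
scored = [("a", 0.1), ("b", 0.9), ("c", 0.5), ("d", 0.3), ("e", 0.7), ("f", 0.8)]
print(sample(scored, "top"))         # ['f', 'b']
print(sample(scored, "medium+top"))  # ['c', 'e', 'f', 'b']
```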
  • Results
  • RQ1: What is the impact of our sampling methods for generating pseudo-ground truth?
  • Findings: EL confidence score threshold
      1. Higher threshold, higher accuracy
      [Plot: precision (solid) and recall (dotted); axis ticks 0 to 35 and 0.1 to 0.9]
  • Findings: EL confidence score threshold
      2. Higher threshold, more predictions
  • Findings: Textual Quality Sampling
  • RQ2: What is the impact of the size of prior knowledge on detecting unknown entities?
  • Results: RQ2
      Sampling 2: KB size (mentions)
      [Plot: our method (blue) vs. baseline (red); axis ticks 0 to 50 and 20% to 90%]
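One way to vary the amount of prior knowledge for RQ2 is to keep a random fraction of the KB as the known part and treat the rest as unknown entities to be detected. This is an assumption drawn from the diagram labels ("small KB", "full KB"), the 20% to 90% axis, and the mention of random sampling under future work; the slides do not spell out the exact procedure. The function name and entity ids are hypothetical.

```python
import random
from typing import Iterable, Set, Tuple


def split_kb(entities: Iterable[str], fraction: float,
             seed: int = 0) -> Tuple[Set[str], Set[str]]:
    """Randomly keep `fraction` of the KB as known; the rest is held out."""
    ordered = sorted(entities)          # sort first so the split is reproducible
    random.Random(seed).shuffle(ordered)
    cutoff = round(fraction * len(ordered))
    return set(ordered[:cutoff]), set(ordered[cutoff:])


# Hypothetical KB of ten entity ids
full_kb = {f"E{i}" for i in range(10)}
for fraction in (0.2, 0.5, 0.9):
    known, unknown = split_kb(full_kb, fraction)
    print(f"{fraction:.0%}: {len(known)} known, {len(unknown)} held out")
```

A model trained with only the known part can then be scored on how many held-out entities it detects, which mirrors the Today's KB and Future KB setup in the pipeline diagram.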
  • Conclusions
      Recall increases as amount of prior knowledge grows:
        1. Able to deal with missing labels, justifying approach
        2. Rate of unknown entity detection increases as KB grows
  • Future Work
      • Next step: Closing the loop
          • Feed back to KB (entity normalization)
      • From PER/LOC/ORG entities to other classes:
          • Books, buildings, drugs, artists, …?
      • Apply to other domains, languages
      • From random sampling to time-based sampling
  • Fin. Questions?