Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

2,310 views

Published on

The manual curation of knowledge bases is a bottleneck in fast paced domains where new concepts constantly emerge. Identification of nascent concepts is important for improving early entity linking, content interpretation, and recommendation of new content in real-time applications. We present an unsupervised method for generating pseudo-ground truth for training a named entity recognizer to specifically identify entities that will become concepts in a knowledge base in the setting of social streams. We show that our method is able to deal with missing labels, justifying the use of pseudo-ground truth generation in this task. Finally, we show how our method significantly outperforms a lexical-matching baseline, by leveraging strategies for sampling pseudo-ground truth based on entity confidence scores and textual quality of input documents.

Published in: Science, Business, Technology
  • Be the first to comment

  • Be the first to like this

Generating Pseudo-ground Truth for Detecting New Concepts in Social Streams

  1. 1. Generating Pseudo-ground Truth for Predicting New Concepts in Social Streams David Graus, Manos Tsagkias, Lars Buitinck, Maarten de Rijke
  2. 2. What is "anema"? 2d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  3. 3. 3d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  4. 4. 4d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  5. 5. 5d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  6. 6. 6d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  7. 7. For content interpretation and complex filtering tasks we want to know who/what people talk about. 7d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  8. 8. Entity Linking 8d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media TANK (VEHICLE) Knowledge Base (KB) Document r TANK query q ? ? TANK JOHNSON
  9. 9. 9d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  10. 10. 10d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  11. 11. 11d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  12. 12. Named Entity Recognition 12d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  13. 13. Named Entity Recognition 13d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  14. 14. Challenges 1. Entity "importance" 2. Noisy & short text (Twitter), updates in the KB 14d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  15. 15. Challenge 1: Entity Importance Q: When should an entity exist in Wikipedia? A: When it is important or has impact ! ! 15d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  16. 16. Challenge 1: Entity Importance Q: When should an entity exist in Wikipedia? A: When it is important or has impact ! Q: How do you know an entity is important or has impact? A: If it is in Wikipedia, it is/has 16d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  17. 17. Challenge 1: Entity Importance Can we leverage today's entities to learn 
 to predict tomorrow's entities? 17d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  18. 18. Challenge 1: Entity Importance Can we leverage today's entities to learn 
 to predict tomorrow's entities? 18d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media 74
  19. 19. Challenge 1: Entity Importance Can we leverage today's entities to learn 
 to predict tomorrow's entities? 19d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media 74/140
  20. 20. Challenge 2: Noisy data & changing KB Unsupervised method for generating 
 pseudo-ground truth (for training NER) 20d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  21. 21. Assumption A named-entity recognizer trained only on 
 KB entities will learn to recognize KB entities 21d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  22. 22. 22d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus
  23. 23. 23d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet
  24. 24. 24d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker
  25. 25. 25d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2
  26. 26. 26d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I.
  27. 27. 27d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I.
  28. 28. d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I.
  29. 29. 29d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2
  30. 30. 30d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ?
  31. 31. 31d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Hahaha! Are we sure Jillert Anema isn't Canadian? RT @rzbh: Dutch Coach's Anti- America Rant http://on.cc.com/1htk9Wo
  32. 32. 32d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ?
  33. 33. 33d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus
  34. 34. 34d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2
  35. 35. 35d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I.
  36. 36. 36d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I. product organization organization
  37. 37. 37d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 NERC
  38. 38. 38d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model
  39. 39. 39d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model
  40. 40. 40d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model Predictions m1 , c1 m2, c2 …
  41. 41. 41d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model Predictions m1 , c1 m2, c2 … Today's KB small KB
  42. 42. 42d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Future KB Unlabeled Tweet ? Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model Predictions m1 , c1 m2, c2 … Today's KB small KB
  43. 43. Future KB 43d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Unlabeled Tweet ? Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model Predictions m1 , c1 m2, c2 … Today's KB full KB small KB
  44. 44. 44d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker m1 , e1 m2 , e2 Unlabeled Tweet ? Sample Corpus Training data m1 , c1 m2 , c2 NERC NERC Model Predictions m1 , c1 m2, c2 … Today's KB Future KB
  45. 45. 45d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker Unlabeled Tweet ? NERC Model Predictions m1 , c1 m2, c2 … Today's KB Future KB Ground Truth m1 , c1 m2 , c2 …
  46. 46. 46d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Entity Linker Unlabeled Tweet ? NERC Model Predictions m1 , c1 m2, c2 … Today's KB Future KB Ground Truth m1 , c1 m2 , c2 … Evaluate
  47. 47. Evaluation • Mention level (NER style) • Entity level ! 47d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  48. 48. 48d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Evaluation This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I. Prediction • Mention level (NER style) • Entity level !
  49. 49. • Mention level (NER style) • Entity level ! 49d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Evaluation This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I. This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I. Prediction Ground Truth
  50. 50. 50d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Evaluation • Mention level (NER style) • Entity level This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I. This is like IBM buying Apple after the Homebrew Computing Club demo of the Apple I. Prediction Ground Truth
  51. 51. 51d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweets Entities Tweets EntitiesPrediction Ground Truth
  52. 52. 52d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Tweets Entities Tweets EntitiesPrediction Ground Truth
  53. 53. Experimental setup Data: Corpus: Twitter (TREC'11 MB: 4,832,838 tweets) KB: Wikipedia (Jan 4th, 2012) ! Components: EL: Semanticizer NERC: Custom 53d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  54. 54. 54d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media "links": [ { "text": "ASAP", "linkProbability": 0.17446043165467626, "id": "30864663", "senseProbability": 0.11690647482014388, "title": "ASAP (variety show)", "url": "http://en.wikipedia.org/wiki/ASAP%20%28variety%20show%29", "label": "ASAP", "priorProbability": 0.631578947368421 }, { "text": "ASAP Rocky", "linkProbability": 0.9333333333333333, "id": "33754098", "senseProbability": 0.9333333333333333, "title": "ASAP Rocky", "url": "http://en.wikipedia.org/wiki/ASAP%20Rocky", "label": "ASAP Rocky", "priorProbability": 1.0 }, { "text": "Kendrick Lamar", "linkProbability": 0.9533333333333334, "id": "29909823", "senseProbability": 0.9533333333333334, "title": "Kendrick Lamar", "url": "http://en.wikipedia.org/wiki/Kendrick%20Lamar", "label": "Kendrick Lamar", "priorProbability": 1.0 }, "ASAP Rocky and Kendrick Lamar, that's when I started listening again"
  55. 55. NERC Two-stage approach [1] 1. Recognition • Predict entity span • For each token predict B, I, or O tag. • Structured perceptron 2. Classification • Given entity span, predict entity class (PER/LOC/ORG) • SVMs 55d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media [1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012.
  56. 56. NERC Two-stage approach [1] 1. Recognition • Predict entity span • For each token predict B, I, or O tag. • Structured perceptron 2. Classification • Given entity span, predict entity class (PER/LOC/ORG) • SVMs 56d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media [1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012. "ASAP Rocky and Kendrick Lamar, that's when I started listening again"
  57. 57. NERC Two-stage approach [1] 1. Recognition • Predict entity span • For each token predict B, I, or O tag. • Structured perceptron 2. Classification • Given entity span, predict entity class (PER/LOC/ORG) • SVMs 57d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media [1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012. "ASAP Rocky and Kendrick Lamar, that's when I started listening again" B I O B I O O O O O O
  58. 58. NERC Two-stage approach [1] 1. Recognition • Predict entity span • For each token predict B, I, or O tag. • Structured perceptron 2. Classification • Given entity span, predict entity class (PER/LOC/ORG) • SVMs 58d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media [1] L. Buitinck and M. Marx. Two-stage named-entity recognition using averaged perceptrons. In NLDB'12, 2012. "ASAP Rocky and Kendrick Lamar, that's when I started listening again" person person
  59. 59. 59d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media m1 , e1 m2 , e2 Future KB NERC Tweet NERC Model Unlabeled Tweet ? Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Predictions m1 , c1 m2, c2 … Entity Linker Sample Corpus Training data m1 , c1 m2 , c2
  60. 60. From tweet to training sample 1. Convert EL output (Wikipedia concepts) to NERC labels; • Label entity span (B-I-O) & class (PER/LOC/ORG) ! 2. Pick "good" samples • entity linker's confidence score • textual quality 60d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  61. 61. 61d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media "links": [ { "text": "ASAP Rocky", "linkProbability": 0.9333333333333333, "id": "33754098", "senseProbability": 0.9333333333333333, "title": "ASAP Rocky", "url": "http://en.wikipedia.org/wiki/ASAP%20Rocky", "label": "ASAP Rocky", "priorProbability": 1.0 }, { "text": "Kendrick Lamar", "linkProbability": 0.9533333333333334, "id": "29909823", "senseProbability": 0.9533333333333334, "title": "Kendrick Lamar", "url": "http://en.wikipedia.org/wiki/Kendrick%20Lamar", "label": "Kendrick Lamar", "priorProbability": 1.0 }, "ASAP Rocky and Kendrick Lamar, that's when I started listening again"
  62. 62. Entity Class 1. Map Wikipedia entity to DBpedia entity 2. Retrieve entity class (ontology); • if Person: PER • if Organisation, Company, or Non-ProfitOrganisation: ORG • if Place, PopulatedPlace, City, Country: LOC • …? 62d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  63. 63. Sampling Methods 1. Entity linker confidence score 2. Textual quality 63d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  64. 64. Sampling 1: Confidence Score • Extract anchor text (a) to Wikipedia page (W)-mappings • Confidence score combines two signals: 1. How common is it that a is used as a link 2. How commonly is a used as a link to W 64d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  65. 65. Sampling 1: Confidence Score • Higher threshold = fewer entities, less noise • Lower threshold = fewer entities, more noise 65d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  66. 66. Sampling 2: Textual quality 66d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  67. 67. Sampling 2: Textual quality 67d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Highest scoring tweets 1. Watching the History channel, Hitler’”⁹s Family. Hitler hid his true family heritage, while others had to measure up to Aryan purity. 2. When you sense yourself becoming negative, stop and consider what it would mean to apply that negative energy in the opposite direction. 3. So. After school tomorrow, french revision class. Tuesday, Drama rehearsal and then at 8, cricket training. Wednesday, Drama. Thursday … (c) Lowest scoring tweets 1. Toni Braxton ~ He Wasn't Man Enough for Me _HASHTAG_ _HASHTAG_? _URL_ RT _Mention_ 2. tell me what u think The GetMore Girls, Part One _URL_ 3. this girl better not go off on me rt
  68. 68. Sampling 2: Textual quality • Compare different sampling strategies; • top tweets • medium tweets • medium+top tweets • low+medium+top tweets (no sampling) 68d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  69. 69. Results 69d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  70. 70. 70d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media m1 , e1 m2 , e2 NERC NERC Model Unlabeled Tweet ?Entity Linker Future KB Tweet Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Sample Corpus Predictions m1 , c1 m2, c2 … RQ1: What is the impact of our sampling methods for 
 generating pseudo-ground truth? Training data m1 , c1 m2 , c2
  71. 71. Findings: EL confidence score threshold 1. Higher threshold, higher accuracy ! ! ! ! ! ! ! Solid: Precision
 Dotted: Recall 71d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media 0" 5" 10" 15" 20" 25" 30" 35" 0.1" 0.2" 0.3" 0.4" 0.5" 0.6" 0.7" 0.8" 0.9"
  72. 72. Findings: EL confidence score threshold 2. Higher threshold, more predictions 72d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  73. 73. Findings: Textual Quality Sampling 73d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  74. 74. 74d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media Future KB Tweet Unlabeled Tweet ? NERC Model NERC Training data m1 , c1 m2 , c2 Sample Corpus m1 , e1 m2 , e2 Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus Tweet Corpus RQ2: What is the impact of the size of prior knowledge on detecting unknown entities? Entity Linker Today's KB Predictions m1 , c1 m2, c2 …
  75. 75. Results: RQ2 Sampling 2: KB size (mentions) ! ! ! ! ! ! blue: Our method
 red: Baseline
 75d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media 0" 5" 10" 15" 20" 25" 30" 35" 40" 45" 50" 20%" 30%" 40%" 50%" 60%" 70%" 80%" 90%"
  76. 76. Conclusions Recall increases as amount of prior knowledge grows: 1. Able to deal with missing labels, justifying approach 2. Rate of unknown entity detection increases as KB grows 76d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  77. 77. Future Work • Next step: Closing the loop • Feed back to KB (entity normalization) • From PER/LOC/ORG entities to other classes: • Books, buildings, drugs, artists, …? • Apply to other domains, languages • From random sampling to time-based sampling 77d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media
  78. 78. Fin Questions? ! ! ! ! ! ! ! 78d.p.graus@uva.nl | @dvdgrs | www.graus.co April 15 2014 | ECIR ‘14 | Mining Social Media

×