
A Concentric-based Approach to Represent News Topics in Tweets


This is the BSc thesis presentation of Enya Nieland.

Published in: Science


  1. A Concentric-based Approach to Represent News Topics in Tweets
     Presenter: Enya Nieland
     Supervisor: Oana Inel
  2. Problem Definition
     ● Tweets are short (max. 140 characters)
     ● Redundant information
     ● Many tweets repeat the same information
  3. Research question: Can we contextualize topics in tweets by determining a relevance score for the event-related information contained in the tweets?
  4. Related Work
     Concentric Model for news videos
     ● Core
       ○ Key entities
       ○ Summarizes the main fact
       ○ Frequently mentioned entities
     ● Crust
       ○ Describes particular details
       ○ Not necessarily frequent
       ○ Based on relations to the Core
     Introducing the Concentric Model
     ● Relevancy Dimension: the rings in the concentric model; each ring is a different level of relevancy, and relevancy depends on interpretation
     ● Finding Predicates to Entity Relations: finding relations between entities
     ● Tracking Stories over Time: how a news topic evolves over time
     José Luis Redondo, Giuseppe Rizzo, and Raphaël Troncy. “Capturing News Stories Once, Retelling a Thousand Ways”.
     José Luis Redondo, Giuseppe Rizzo, and Raphaël Troncy. “The Concentric Nature of News Semantic Snapshots”.
  5. Dataset
     ● 817 tweets about the whaling event
     ● From 2014 and 2015
     ● The dataset contains:
       ○ Tweet text
       ○ Relevant Mentions
       ○ Scores: Tweet Event Relevance Score, Relevant Mentions Score, Sentiment Score, Novelty Score
     ● Scores defined through crowdsourcing
     Oana Inel, Tommaso Caselli, and Lora Aroyo. “Crowdsourcing Salient Information from Tweets and News”.
     Example with a high Tweet Event Relevance Score (1.00):
     “Japan Sets Off for First Whaling Since UN Court Ruling - See more at: http://t.co/5BiHSWqjYu ( #japancc live at http://t.co/MVOUQb5AwD)”
     Example with a low Tweet Event Relevance Score (0.24):
     “#health Why Norway Needs to Let Whaling Die - Despite best industry efforts, the whaling industry in Norway is fai... http://t.co/kC2c8odoS9”
  6. Approach
     1. Take the dataset from Inel et al.
     2. Use the scores provided
     3. Data analysis
     4. Determine a Core and a Crust
     5. Combine Core and Crust to make the Concentric Model
     6. Evaluation of the results
  12. Baseline model: replication of the approach of Redondo et al.
      Named Entity Expansion and Ranking
      1. Generate a list of entities → dataset
      Core Generation
      2. Identify entities with a higher level of representativeness → frequency of Relevant Mentions
      3. Order the entities (high → low)
      4. Add top-ranked entities to the Core until one is found that is not semantically connected
      Crust Generation
      5. Add entities with a semantic relationship to Core elements
      Fig 2. Scaled-down representation of the Baseline model
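Steps 2-4 of the Core generation above amount to a frequency ranking with a connectivity cutoff. A minimal sketch, assuming a hypothetical `is_connected` predicate standing in for the semantic-connection check (which the slides do not specify):

```python
from collections import Counter

def build_core(mentions, is_connected):
    """Rank entities by mention frequency (high -> low) and grow the Core
    from the top until the first entity that is not semantically connected
    to the Core built so far."""
    ranked = [entity for entity, _ in Counter(mentions).most_common()]
    core = []
    for entity in ranked:
        if core and not is_connected(entity, core):
            break  # stop at the first non-connected entity, per step 4
        core.append(entity)
    return core

mentions = ["japan", "whaling", "japan", "norway", "japan", "whaling", "icj"]
# Toy connectivity check for illustration only: everything but "icj" connects.
core = build_core(mentions, lambda e, core: e != "icj")
print(core)  # ['japan', 'whaling', 'norway']
```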
  13. First approach
      ● Calculate the frequency of the Relevant Mentions
      ● Calculate the average Relevant Mention Score
      ● Determine Core and Crust based on thresholds
      ● Core: average Relevant Mention Score ≥ 0.70 and number of mentions > 10
      ● Crust: average Relevant Mention Score ≥ 0.50 and number of mentions > 10
      Fig 3. Representation of the First approach
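The threshold rules above can be sketched as a simple ring-assignment function (the function name and return values are illustrative, not from the slides):

```python
def assign_ring(avg_score, n_mentions):
    """Assign an entity to the Core, the Crust, or neither, using the
    thresholds on average Relevant Mention Score and mention count."""
    if n_mentions > 10 and avg_score >= 0.70:
        return "core"
    if n_mentions > 10 and avg_score >= 0.50:
        return "crust"
    return None

print(assign_ring(0.82, 25))  # core
print(assign_ring(0.55, 12))  # crust
print(assign_ring(0.90, 3))   # None (too few mentions for either ring)
```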
  14. Limitations of the First approach
      ● The same Relevant Mentions occur repeatedly, but some contain symbols (#, :)
      ● Not all Relevant Mentions are lowercase
      Ways to improve
      ● Use stemming/lemmatization (stemming works better)
      ● Get rid of all symbols: tweets and Relevant Mentions should only contain the letters a-z
      ● Make better use of the scores from the dataset
      Fig 3. Representation of the First approach
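The two normalization fixes (a-z only, then stemming) could be combined into one cleaning step. A minimal sketch; the crude suffix-stripping stemmer below is a stand-in for a real stemmer such as NLTK's PorterStemmer, which the slides do not name:

```python
import re

def normalize(mention):
    """Lowercase a mention, strip every character outside a-z, and apply
    a crude suffix-stripping stem so '#Whaling' and 'whale' variants can
    be matched against each other."""
    word = re.sub(r"[^a-z]", "", mention.lower())
    for suffix in ("ing", "ed", "s"):
        # Only strip when a reasonably long stem remains.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(normalize("#Whaling"))  # whal
print(normalize("Japan:"))    # japan
```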
  15. Final approach
      1. Only use a-z and implement stemming
      2. Filter on Tweet Relevance Score ≥ 0.5
      3. Filter on Relevant Mention Score ≥ 0.5
      Core
      4. Find all single-word Relevant Mentions
      5. Count their occurrences in the tweets and order them
      6. Count their occurrences in the other Relevant Mentions
      7. Starting from the top, add to the Core until the occurrences in other Relevant Mentions reach 0
      Crust
      8. Find Relevant Mentions that contain Core entities
      9. Count the Core words in the Relevant Mentions
      10. Filter out Relevant Mentions with 1 or 2 words
      11. Filter out mentions that only contain ‘whale’ or ‘japan’
      Fig 5. Scaled-down representation of the Final approach
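Steps 4-7 of the Core construction can be sketched as follows (a minimal reading of the slide, with toy data; variable names are illustrative):

```python
def final_core(tweets, mentions):
    """Steps 4-7: take the single-word Relevant Mentions, rank them by how
    many tweets they occur in, then add them to the Core from the top until
    a word no longer occurs inside any other Relevant Mention."""
    singles = [m for m in mentions if len(m.split()) == 1]
    ranked = sorted(singles, key=lambda w: sum(w in t for t in tweets), reverse=True)
    core = []
    for word in ranked:
        in_others = sum(word in m for m in mentions if m != word)
        if in_others == 0:
            break  # stop once a word has no occurrences in other mentions
        core.append(word)
    return core

tweets = ["japan resumes whaling", "whaling in japan", "norway whaling"]
mentions = ["japan", "whaling", "norway", "icj", "japan whaling", "whaling in norway"]
print(final_core(tweets, mentions))  # ['whaling', 'japan', 'norway']
```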
  16. Evaluation

      Model                                     Precision  Recall  F1-Score
      Baseline Model                            0.56       0.17    0.26
      Final Approach, by Relevance Score threshold:
        0.3                                     0.64       0.71    0.67
        0.4                                     0.83       0.51    0.64
        0.5                                     0.85       0.43    0.57
        0.6                                     0.97       0.55    0.70
        0.7                                     0.72       0.33    0.46
        0.8                                     0.50       0.64    0.56

      Baseline Model (true relevance in columns, examined relevance in rows):
                            True Pos.  True Neg.  Total
        Examined Positive   94         75         169
        Examined Negative   463        285        648
        Total               557        360        817

      Final Approach, threshold 0.60:
                            True Pos.  True Neg.  Total
        Examined Positive   35         1          36
        Examined Negative   29         55         84
        Total               64         56         120
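The precision, recall, and F1 rows can be recomputed directly from the confusion-matrix counts above:

```python
def metrics(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts:
    tp = examined positive & truly positive, fp = examined positive but
    truly negative, fn = examined negative but truly positive."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)

# Baseline Model: 94 true positives, 75 false positives, 463 false negatives
print(metrics(94, 75, 463))  # (0.56, 0.17, 0.26)
# Final Approach at threshold 0.60: 35 true positives, 1 false positive, 29 false negatives
print(metrics(35, 1, 29))    # (0.97, 0.55, 0.7)
```

The recomputed values match the table, including the best F1 of 0.70 at threshold 0.60.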
  18. Conclusion
      ● Only following the approach of Redondo et al. does not work
      ● Relevance scores need to be taken into account
      ● The First approach does not work: too many symbols, not all mentions lowercase, and stemming is needed
      ● The Final approach with a Relevance Score threshold of 0.60 works best
      Research question: Can we contextualize topics in tweets by determining a relevance score for the event-related information contained in the tweets?
  19. Ideas for further research
      ● Does the model also work on other data?
      ● Are tweets with links (to articles) more relevant?
      ● Does implementing the Novelty Score for every day in the dataset give a better Concentric Model?
      ● Does the model also work on news topics that are only mentioned during one day (e.g. sports)?
