2. Problem Definition
● Tweets are short
○ max. 140 characters
● Redundant information
○ many tweets repeat the same information
3. Can we contextualize topics in tweets by determining
a relevance score for the event-related information
contained in the tweets?
4. Related Work
Concentric Model
● Concentric Model for news videos
○ Core
■ Key entities
■ Summarizes main fact
■ Frequently mentioned entities
○ Crust
■ Describe particular details
■ Not necessarily frequent
■ Based on relations to Core
Introducing the Concentric Model
● Relevancy Dimension
○ Rings in concentric model
■ each ring is a different level of
relevancy
○ Relevancy depends on interpretation
● Finding Predicates to Entity Relations
○ Finding relations between entities
● Tracking Stories over Time
○ How does a news topic evolve over time?
José Luis Redondo, Giuseppe Rizzo, and Raphaël Troncy.
“Capturing News Stories Once, Retelling a Thousand Ways”.
José Luis Redondo, Giuseppe Rizzo, and Raphaël Troncy. “The
Concentric Nature of News Semantic Snapshots”.
5. Dataset
Dataset
● 817 tweets about the whaling event
● collected in 2014 and 2015
● Dataset contains (record layout sketched below):
○ Tweet text
○ Relevant Mentions
○ Scores:
■ Tweet Event Relevance Score
■ Relevant Mentions Score
■ Sentiment Score
■ Novelty Score
● Scores defined through crowdsourcing
Oana Inel, Tommaso Caselli, and Lora Aroyo. “Crowdsourcing
Salient Information from Tweets and News”.
High Tweet Event Relevance Score (1.00)
Japan Sets Off for First Whaling Since UN Court
Ruling - See more at: http://t.co/5BiHSWqjYu (
#japancc live at http://t.co/MVOUQb5AwD)
Low Tweet Event Relevance Score (0.24)
#health Why Norway Needs to Let Whaling Die -
Despite best industry efforts, the whaling industry
in Norway is fai... http://t.co/kC2c8odoS9
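To make the record structure concrete, here is a minimal sketch of how one annotated tweet could be represented in Python. The field names are assumptions for illustration, not the actual column names used by Inel et al., and the example values are illustrative only.

from dataclasses import dataclass, field
from typing import Dict

@dataclass
class AnnotatedTweet:
    # Field names are assumed for illustration; they do not mirror the
    # actual column names of the Inel et al. dataset.
    text: str                    # tweet text (max. 140 characters)
    event_relevance: float       # Tweet Event Relevance Score (0.0 - 1.0)
    sentiment: float             # Sentiment Score
    novelty: float               # Novelty Score
    relevant_mentions: Dict[str, float] = field(default_factory=dict)  # mention -> Relevant Mention Score

# Illustrative values only, loosely modelled on the high-score example above
example = AnnotatedTweet(
    text="Japan Sets Off for First Whaling Since UN Court Ruling ...",
    event_relevance=1.00,
    sentiment=0.0,
    novelty=0.0,
    relevant_mentions={"japan": 0.9, "whaling": 0.95},
)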
6. Approach
1. Take the dataset from Inel et al.
2. Use the scores provided
3. Analyse the data
4. Determine a Core and a Crust
5. Combine Core and Crust into the Concentric Model
6. Evaluate the results
12. Baseline model
Named Entity Expansion and Ranking
1. Generate the list of entities → dataset
Core Generation
2. Identify entities with a higher level of representativeness → frequency of Relevant Mentions
3. Order the entities (high → low)
4. Add top-ranked entities to the Core until one is found that is not semantically connected
Crust Generation
5. Add entities with a semantic relationship to Core elements
Replication of the approach of Redondo et al.; a code sketch follows below.
Fig 2. Scaled down representation of the Baseline model
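As a rough illustration of how this baseline could be coded, the sketch below ranks entities by mention frequency and grows the Core until an unconnected entity is reached. The semantic-connection test is not specified on this slide, so it is passed in as a pluggable predicate; this is an illustration, not the actual implementation of Redondo et al.

from collections import Counter
from typing import Callable, Iterable, List, Set, Tuple

def baseline_core_crust(
    relevant_mentions: Iterable[str],
    is_connected: Callable[[str, Set[str]], bool],
) -> Tuple[List[str], List[str]]:
    # 1.-3. List the entities and rank them by mention frequency (high -> low)
    ranked = [entity for entity, _ in Counter(relevant_mentions).most_common()]

    # 4. Grow the Core until the next entity is not semantically connected
    core: List[str] = []
    remaining: List[str] = []
    for position, entity in enumerate(ranked):
        if core and not is_connected(entity, set(core)):
            remaining = ranked[position:]
            break
        core.append(entity)

    # 5. Crust: remaining entities that still relate to a Core element
    crust = [entity for entity in remaining if is_connected(entity, set(core))]
    return core, crust

Here `is_connected` stands in for whatever semantic relation Redondo et al. use (for instance a knowledge-graph lookup); the slides do not say which one.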
13. First approach
● Calculate the frequency of the Relevant Mentions
● Calculate the average Relevant Mention Score
● Determine Core and Crust based on thresholds (sketched below)
● Core
○ average Relevant Mention Score >= 0.70
○ number of mentions > 10
● Crust
○ average Relevant Mention Score >= 0.50
○ number of mentions > 10
Fig 3. Representation of the First approach
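The thresholds of the First approach translate almost directly into code. A minimal sketch, assuming the input is one (mention, Relevant Mention Score) pair per occurrence of a mention in the dataset:

from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

def first_approach(mention_scores: Iterable[Tuple[str, float]]) -> Tuple[List[str], List[str]]:
    # Collect every score per mention
    scores: Dict[str, List[float]] = defaultdict(list)
    for mention, score in mention_scores:
        scores[mention].append(score)

    core: List[str] = []
    crust: List[str] = []
    for mention, values in scores.items():
        frequency = len(values)          # number of mentions
        average = mean(values)           # average Relevant Mention Score
        if frequency > 10 and average >= 0.70:
            core.append(mention)         # Core thresholds
        elif frequency > 10 and average >= 0.50:
            crust.append(mention)        # Crust thresholds
    return core, crust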
14. Limitations of the First approach
● The same Relevant Mention can appear in different forms
○ some contain symbols (#, :)
○ not all are lowercase
Ways to improve
● Use stemming/lemmatization
○ stemming works better
● Get rid of all symbols
○ Tweets and Relevant Mentions should only contain the letters a-z (see the normalization sketch below)
● Make better use of the scores from the dataset
Fig 3. Representation of the First approach
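A minimal sketch of the proposed clean-up: lowercase everything, keep only the letters a-z, and stem. The Porter stemmer from NLTK is an assumption; the slides only state that stemming worked better than lemmatization.

import re
from nltk.stem import PorterStemmer  # stemming library is an assumption, not prescribed by the slides

_stemmer = PorterStemmer()

def normalise(text: str) -> str:
    # Lowercase, replace everything outside a-z with a space, then stem each word
    text = re.sub(r"[^a-z ]", " ", text.lower())
    return " ".join(_stemmer.stem(word) for word in text.split())

# e.g. normalise("#Whaling: Japan") == normalise("whaling Japan") == "whale japan",
# so differently written Relevant Mentions collapse to the same form.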
15. Final Approach (sketched in code below)
1. Only use a-z & implement stemming
2. Filter on Tweet Relevance Score ≥ 0.5
3. Filter on Relevant Mention Score ≥ 0.5
Core
4. Find all single-word Relevant Mentions
5. Count the occurrences in Tweets + order
6. Count the occurrences in other Relevant Mentions
7. Start from the top and add to the Core until the occurrences in other Relevant Mentions = 0
Crust
8. Find Relevant Mentions that contain Core entities
9. Count the Core words in the Relevant Mentions
10. Filter out Relevant Mentions with 1 or 2 words
11. Filter out Relevant Mentions that only contain 'whale' or 'japan'
Fig 5. Scaled down representation of the Final approach
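A hedged end-to-end sketch of steps 1-11, assuming the tweets and Relevant Mentions have already been normalized with the clean-up above and are passed in as (text, score) pairs. The score threshold is kept as a parameter because this slide filters at ≥ 0.5 while the conclusion reports 0.60 as the best value.

from collections import Counter
from typing import Iterable, List, Tuple

def final_approach(
    tweets: Iterable[Tuple[str, float]],     # (normalised tweet text, Tweet Event Relevance Score)
    mentions: Iterable[Tuple[str, float]],   # (normalised Relevant Mention, Relevant Mention Score)
    score_threshold: float = 0.5,
) -> Tuple[List[str], List[str]]:
    # 2.-3. Keep only sufficiently relevant tweets and Relevant Mentions
    tweet_texts = [text for text, score in tweets if score >= score_threshold]
    kept_mentions = [m for m, score in mentions if score >= score_threshold]

    # 4.-5. Single-word Relevant Mentions, ranked by how often they occur in the tweets
    single_word = {m for m in kept_mentions if len(m.split()) == 1}
    tweet_counts = Counter(w for text in tweet_texts for w in text.split() if w in single_word)

    # 6.-7. Walk the ranking and add words to the Core until a word no longer
    #       occurs inside any other Relevant Mention
    core: List[str] = []
    for word, _ in tweet_counts.most_common():
        occurs_elsewhere = any(word in m.split() for m in kept_mentions if m != word)
        if not occurs_elsewhere:
            break
        core.append(word)

    # 8.-11. Crust: Relevant Mentions with 3+ words that contain a Core word and
    #        consist of more than just 'whale' / 'japan' (step 9's count is folded
    #        into the containment test here)
    crust = [
        m for m in set(kept_mentions)
        if len(m.split()) >= 3
        and any(word in m.split() for word in core)
        and set(m.split()) - {"whale", "japan"}
    ]
    return core, crust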
18. Conclusion
● Only following the approach of Redondo et al. does not work
● The relevance scores need to be taken into account
● The First approach does not work
○ too many symbols and not all mentions are lowercase
○ stemming is needed
● The Final approach with a Relevance Score threshold of 0.60 works best
Research Question: Can we contextualize topics in tweets by
determining a relevance score for the event-related information
contained in the tweets?
19. Ideas for further research
● Does the model also work on other data?
● Are Tweets with links (to articles) more relevant?
● Does using the Novelty Score for each day in the dataset give a better Concentric Model?
● Does the model also work on news topics that are only mentioned during one
day (e.g. sports)?