Using Twitter
Early Detection of Trending Topics
D.C NLP
Meetup
June 10, 2015
Topics
• Motivation
• Underlying Theory
• Challenge
• Approach
• Initial Results
• Potential Implications
Timeline
• 9:31 AM – Explosion occurred
• +1 min – First tweet
• +20 min – Local news reported
Reference: https://gigaom.com/2014/03/16/how-twitter-confirmed-the-explosion-in-harlem-before-the-news-did/
Harlem Gas Explosion in NYC (March 2014)
Find the ‘tweet-dle’ within the ‘tweet-stack’
Motivation
Tweet → Me: ‘Interesting!’ → Action: Retweet/tweet
Tweet → Me: ‘Meh’ → Action: No retweet/tweet
It’s Not Why We Share But How We Share
Theory
Step-wise
Gradual
Quick Rise
Tweet Rate Over Time Across Topics
Implications
• Multiple ways topics can ‘trend’
• Approaches
– Parametric
• Too many variations.
– Non-parametric
• Supports wide variations in a more automated fashion
Time Before Tagged As Trending Topic (min)
All Roads Lead to Rome!
Challenge
Step-wise
Gradual
Quick Rise
Tweet Rate Over Time Across Topics
Time Before Tagged As Trending Topic (min)
Clustering
(Used to classify new trends)
Time-Series Clustering
Approach
Data Collection
• Tweets (Streaming API)
• Topics (Trend API)
Notes:
• Streaming API: 1% of tweets
• English only
• 2-week sample (Jan ’15)
Feature Engineering
• Topics: split trending vs. non-trending, filter for topic of the day
• Tweets: normalization/interpolation
Topic Identification
• Trending (#, unigrams)
• Non-trending (#)
Trending Topics
• Exclude recurring or spurious
• Include topic within 24 hrs
Modeling
• K-Means Clustering
Distance metric
• Use dynamic time warping to align time series
Data Pipeline
Approach
1. Normalization
• Time series plot based on tweet rate
• Fixed length (120 min)
• Tweet rate based on tweet 120 min ago
2. Linear Interpolation
• The Streaming API returns only 1% of tweets
• This leaves gaps in the data
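The two steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the speaker's code: the function name `to_fixed_length_rate`, the one-minute bins, and the choice to normalize each series so it sums to 1 are all hypothetical. Only the 120-minute fixed length and the use of linear interpolation come from the slide.

```python
import numpy as np

def to_fixed_length_rate(minute_offsets, counts, window=120):
    """Return a gap-filled, normalized tweet-rate series of length `window`.

    minute_offsets: minute bins (0 .. window-1) in which tweets were observed
    counts:         number of sampled tweets seen in each of those bins
    """
    x = np.asarray(minute_offsets, dtype=float)
    y = np.asarray(counts, dtype=float)

    # np.interp needs increasing x values, so sort the observed bins first.
    order = np.argsort(x)
    x, y = x[order], y[order]

    # Linear interpolation onto a regular one-minute grid fills the gaps
    # left by the 1% sample. Outside the observed range the edge value is held.
    grid = np.arange(window, dtype=float)
    filled = np.interp(grid, x, y)

    # Assumed normalization: express each bin as a share of the window total,
    # so topics with very different volumes become comparable.
    total = filled.sum()
    return filled / total if total > 0 else filled

series = to_fixed_length_rate([0, 2, 4], [0, 2, 4], window=5)
print(series)
```

Normalizing by the window total is only one possible reading of "Tweet Rate %"; a baseline-relative rate would work with the same interpolation step.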
1
Topics
Tweets
On-going Event (more than 24 hours): Excluded
• Wimbledon
• 9 Iowa State
Spurious (less than 30 min): Excluded
• Time for Pretty Little Liars
• The Weekend - Earned It
Topic of the Day (within 24 hours and more than 30 min): Included
• State of the Union
• UnityMarch
2
Feature Engineering
Approach
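The topic filter on this slide (exclude topics that trend for less than 30 minutes or more than 24 hours) might look like the sketch below. The function names, the label strings, and the treatment of durations exactly at the two thresholds are assumptions made for illustration. The slide gives only the thresholds and the example topics.

```python
from datetime import timedelta

# Thresholds taken from the slide.
MIN_DURATION = timedelta(minutes=30)
MAX_DURATION = timedelta(hours=24)

def classify_topic(trending_duration):
    """Label a topic by how long it stayed on the trending list."""
    if trending_duration < MIN_DURATION:
        return "spurious"          # excluded
    if trending_duration > MAX_DURATION:
        return "ongoing_event"     # excluded
    return "topic_of_the_day"      # included

def keep_topic(trending_duration):
    """True only for topics that should enter the modeling step."""
    return classify_topic(trending_duration) == "topic_of_the_day"

# Hypothetical durations, for illustration only.
durations = {
    "short_lived": timedelta(minutes=10),
    "multi_day": timedelta(days=3),
    "same_day": timedelta(hours=5),
}
kept = [name for name, d in durations.items() if keep_topic(d)]
print(kept)
```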
K-Means Clustering with Dynamic Time Warping
• Similar to speech: identify the same word when said by different people
• Distance metric is Euclidean distance
Alignment using Dynamic Time Warping
Before…
Modeling
Approach
…After
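A minimal sketch of clustering with dynamic time warping is shown below, assuming equal-length series such as the 120-minute tweet-rate curves. The names `dtw_distance` and `kmeans_dtw` are hypothetical. The centroid update uses a plain pointwise mean, which is a simplification chosen for brevity and not necessarily what the talk used.

```python
import numpy as np

def dtw_distance(a, b):
    """Euclidean distance between two series after DTW alignment."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = cheapest cumulative squared error aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(np.sqrt(cost[n, m]))

def kmeans_dtw(series, k, n_iter=10, seed=0):
    """K-means where points are assigned to the nearest centroid under DTW."""
    series = np.asarray(series, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct series picked at random.
    centroids = series[rng.choice(len(series), size=k, replace=False)].copy()
    labels = np.zeros(len(series), dtype=int)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by DTW distance.
        labels = np.array(
            [int(np.argmin([dtw_distance(s, c) for c in centroids])) for s in series]
        )
        # Update step (simplified): pointwise mean of each cluster's members.
        for j in range(k):
            members = series[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Two step-shaped series that differ only in when the step happens.
early_step = [0, 0, 0, 1, 1, 1, 1, 1]
late_step = [0, 0, 0, 0, 1, 1, 1, 1]
print(dtw_distance(early_step, late_step))
```

The two step curves differ pointwise but align perfectly once warped, which is the speech analogy from the slide: the same shape at a different pace counts as the same pattern.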
Step-wise
Time(min) – Before Trending
Tweet Rate %
Step-wise Burst Gradual
Steady blimps blimps blimps
Time(min)
Tweet Rate %
‘Library’ Of Trends
Initial Results
• Labeling - Identification of Trending Topics
• Forecasting – Ranking of Topics by Volume
• Other social media streams (Tumblr, Instagram, etc.)
Potential Implications
Next Steps


Editor's Notes

  1. Explain goal: Takeaways – use case of Twitter (the way)/ the method/ implications. For inspiration, use Dataminr press (https://www.dataminr.com/press/), http://www.nytimes.com/2014/09/24/business/media/Dataminr-Scours-Social-Media-for-Hot-Tips.html?_r=0 and https://gigaom.com/2014/03/16/how-twitter-confirmed-the-explosion-in-harlem-before-the-news-did/ Breaking news events – something light-hearted or inspirational. I like this more from a storytelling point of view; I count success as how much fun I have.
  2. Talking points: Twitter as eyewitness/citizen reporting. Many eyes. Real time, live chat… Faster than any local news. Potential to detect events/topics at scale. Lots of noise, so how do we detect events? >>> proxy for events >>> looking for trending topics (topics that people care about). Twitter has trending topics, but we want to be first in the know; we want to know beforehand.
  3. Talking points: How do we find topics that will be trending topics? Do we try to find the why? Why do certain topics trend while others don’t (maybe some taxonomy)? That is hard… say, the dress that was white and gold or blue and black: why was it so viral? How topics spread may be predictable? Math: if 3 friends share then 1 shares; if 6 friends share then 3 share (some exponential curve). Collectively that may resemble certain curves (is this a bit of a leap here?). Come back later, maybe add a collective curve… Measure of virality: ratio at which a message spreads.
  4. More data driven, more automated. Later data decide…more automated (do not have to worry about linear vs non-linear, support non-linear) – mostly could be automated and support wide variations in automated fashions Each time new trend would have to figure out what the parameters are.
  5. More data driven, more automated. Later data decide…more automated (do not have to worry about linear vs non-linear, support non-linear) – mostly could be automated and support wide variations in automated fashions Each time new trend would have to figure out what the parameters are.
  6. Some takeaways: Overcoming limitations of the Streaming API, which gives only 1% of tweets. Also, binning the time series helped to smooth it. Also, why 120 min: before that, most topics show very little trend. Fixed length makes it easier to compare how topics will trend.
  7. Takeaways: Why dynamic time warping: identify common paths/functions. Can then tag the topic. How would we know when it would be trending? How can we trend forward? More classification. Also, how about the fixed length? Why Euclidean distance? Why K-means? Why dynamic time warping?
  8. Findings/ Initial Results of trend vs non trend Tweet rate and time span May need to explain tweet rate. Topics to choose if needed: happykyungsooday stateoftheunion camandkianvideo daystogokianandjc