3. Timeline
• 9:31 AM – Explosion occurred
• +1 min – First tweet
• +20 min – Local news reported
Reference: https://gigaom.com/2014/03/16/how-twitter-confirmed-the-explosion-in-harlem-before-the-news-did/
Harlem Gas Explosion in NYC (March 2014)
Find the ‘tweet-dle’ within the ‘tweet-stack’
Motivation
5. [Figure: Tweet Rate Over Time Across Topics. Panels: Step-wise, Gradual, Quick Rise. x-axis: Time Before Tagged As Trending Topic (min)]
Implications
• Multiple ways topics can ‘trend’
• Approaches
– Parametric
• Too many variations
– Non-parametric
• Supports wide variations in a more automated fashion
All Roads Lead to Rome!
Challenge
6. [Figure: Tweet Rate Over Time Across Topics. Panels: Step-wise, Gradual, Quick Rise. x-axis: Time Before Tagged As Trending Topic (min)]
Clustering
(Used to classify new trends)
Time-Series Clustering
Approach
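A minimal sketch of the clustering idea on this slide, assuming each topic has already been reduced to a fixed-length tweet-rate series. The function names `kmeans` and `classify`, the toy data, and the use of plain Euclidean distance are illustrative assumptions, not the deck's actual implementation (a later slide says the series are aligned with dynamic time warping).

```python
import numpy as np

def kmeans(series, k, n_iter=50, seed=0):
    """Plain k-means over fixed-length series (Euclidean distance)."""
    X = np.asarray(series, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct series picked at random (assumed init scheme).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every series to every centroid, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid; keep the old one if its cluster is empty.
        new_centroids = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

def classify(new_series, centroids):
    """Assign a new series to the nearest learned centroid."""
    d = np.linalg.norm(centroids - np.asarray(new_series, dtype=float), axis=1)
    return int(d.argmin())

# Toy data: three flat series and three quickly rising ones.
data = [[1, 1, 1], [1, 1.1, 1], [0.9, 1, 1],
        [1, 5, 10], [1, 6, 10], [1, 5, 11]]
centroids, labels = kmeans(data, k=2)
new_label = classify([1, 5.5, 10.5], centroids)
```

Here the learned centroids act as trend "shapes", and a new topic is tagged with the shape it most resembles.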
7. Data Collection → Feature Engineering → Modeling
[Diagram: pipeline steps]
• Tweets (Streaming API)
• Topics (Trend API)
• Filter for topic of the day
• Split trending vs non-trending topics
• K-Means Clustering
Notes:
• Streaming API: 1% of tweets
• English only
• 2-week sample (Jan ’15)
Tweet Normalization/Interpolation
Topic Identification
• Trending (#, unigrams)
• Non-trending (#)
Trending Topics
• Exclude recurring or spurious
• Include topics within 24 hrs
Distance metric
• Use dynamic time warping to align time series
Data Pipeline
Approach
8. 1. Normalization
• Time series plot based on tweet rate
• Fixed length (120 min)
• Tweet rate based on tweet 120 min ago
2. Linear Interpolation
• Streaming API provides only 1% of tweets
• Fills gaps in the data
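The two steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the deck's code: the helper names `interpolate_gaps` and `normalize` are hypothetical, and dividing by the first value of the window (the rate 120 min ago) is only one reading of the normalization bullet.

```python
import numpy as np

def interpolate_gaps(minutes, counts, length=120):
    """Linearly interpolate tweet counts onto a fixed grid of `length` minutes.

    `minutes` are the (increasing) minute offsets where tweets were observed,
    `counts` the tweet counts at those minutes; missing minutes are filled in.
    """
    grid = np.arange(length)
    return np.interp(grid, minutes, counts)

def normalize(series):
    """Scale a series by its first value (assumed baseline: start of window)."""
    series = np.asarray(series, dtype=float)
    baseline = series[0] if series[0] > 0 else 1.0  # avoid division by zero
    return series / baseline

# Tiny example: observations at minutes 0, 2 and 4 of a 5-minute window.
rates = normalize(interpolate_gaps([0, 2, 4], [2, 4, 8], length=5))
# rates -> [1.0, 1.5, 2.0, 3.0, 4.0]
```

A fixed length keeps every topic's series the same size, which makes them directly comparable during clustering.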
Topics
• On-going Event (more than 24 hours) – Excluded
– Wimbledon
– 9 Iowa State
• Spurious (less than 30 mins) – Excluded
– Time for Pretty Little Liars
– The Weekend - Earned It
• Topic of the Day (within 24 hours and more than 30 mins) – Included
– State of the Union
– UnityMarch
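The inclusion rule on this slide could be expressed as a small filter. This is a sketch: it assumes the thresholds refer to how long a topic stays trending, measured in minutes, and the function names and the handling of the exact 30-minute and 24-hour boundaries are assumptions.

```python
def categorize_topic(trending_minutes):
    """Bucket a topic by how long it trended (assumed unit: minutes)."""
    if trending_minutes < 30:
        return "spurious"          # less than 30 mins: excluded
    if trending_minutes > 24 * 60:
        return "on-going"          # more than 24 hours: excluded
    return "topic of the day"      # between 30 mins and 24 hours: included

def keep_topic(trending_minutes):
    """Keep only topics of the day for feature engineering."""
    return categorize_topic(trending_minutes) == "topic of the day"

# Hypothetical durations, for illustration only.
durations = {"short_lived": 10, "one_evening": 180, "multi_day": 3000}
kept = [name for name, mins in durations.items() if keep_topic(mins)]
```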
Feature Engineering
Approach
9. K-Means Clustering with Dynamic Time Warping
• Similar to speech recognition: identify the same word said by different people
• Distance metric is Euclidean distance
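To make the alignment idea concrete, here is a minimal sketch of the textbook dynamic time warping recurrence with a squared-difference local cost. The function names and toy series are illustrative assumptions; the deck does not say which implementation was used.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D series."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = cheapest cumulative cost of aligning a[:i] with b[:j].
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j],      # a[i-1] repeats b[j-1]
                                 cost[i, j - 1],      # b[j-1] repeats a[i-1]
                                 cost[i - 1, j - 1])  # both advance
    return float(np.sqrt(cost[n, m]))

def euclidean_distance(a, b):
    """Point-by-point Euclidean distance (no alignment)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Same shape, shifted by one time step.
early = [0, 0, 1, 2, 1, 0]
late = [0, 1, 2, 1, 0, 0]
aligned = dtw_distance(early, late)
unaligned = euclidean_distance(early, late)
```

Without alignment the two series look different; after warping they match exactly, much as the same word spoken at different speeds would.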
[Figure: Alignment using Dynamic Time Warping, before and after]
Modeling
Approach
11. • Labeling – Identification of Trending Topics
• Forecasting – Ranking of Topics by Volume
• Other social media streams (Tumblr, Instagram, etc.)
Potential Implications
Next Steps
Editor's Notes
Explain goal: Takeaways – use case of Twitter (the why)/ the method/ implications
For inspiration, use Dataminr press (https://www.dataminr.com/press/)
http://www.nytimes.com/2014/09/24/business/media/Dataminr-Scours-Social-Media-for-Hot-Tips.html?_r=0
https://gigaom.com/2014/03/16/how-twitter-confirmed-the-explosion-in-harlem-before-the-news-did/
Breaking news events – something light hearted or inspirational
I like this more from a storytelling point of view; I count success as how much fun I have.
Talking points:
Twitter as eyewitness/citizen reporting. Many eyes.
Real time, live chat… faster than any local news
Potential to detect events/topics at scale
Lots of noise: how to detect events >>> proxy for events >>> looking for trending topics (topics that people care about). Twitter has trending topics, but we want to be first in the know; we want to know beforehand.
Talking points:
How do we find topics that will be trending topics?
Do we try to find the why? Why do certain topics trend while others don’t (maybe some taxonomy)? That is hard… say, the dress that was white and gold or blue and black: why was it so viral?
How topics spread may be predictable?
Math: if 3 friends share, then 1 shares
If 6 friends share, then 3 share (some exponential curve)
Collectively that may resemble certain curves (is this a bit of a leap here?). Come back later, maybe add a collective curve…
Measure of virality: ratio at which a message spreads
More data-driven, more automated. Let the data decide (do not have to worry about linear vs non-linear; supports non-linear). Mostly could be automated and support wide variations in an automated fashion.
With a parametric approach, each new trend would require figuring out what the parameters are.
Some takeaways:
Overcoming limitations of the Streaming API: only 1% of tweets
Also, binning the time series helped to smooth it out
Also, why 120 min: before that, most topics show very little trend. A fixed length makes it easier to compare series and see how a topic will trend.
Takeaways:
Why dynamic time warping – identify common paths/functions
Can then tag the topic
How would we know when it would be trending? How can trend forward? More classification. Also, how about the fixed length?
Why Euclidean distance
Why K-means
Why dynamic time warping
Findings/ Initial Results of trend vs non trend
Tweet rate and time span
May need to explain tweet rate.
Topics to choose if needed:
happykyungsooday
stateoftheunion
camandkianvideo
daystogokianandjc