SocialCom 2013
 

These slides refer to the talk I gave at the ASE/IEEE SocialCom 2013 International Conference, where I presented the research work entitled "Trending Topics on Twitter Improve the Prediction of Google Hot Queries", which was selected among the top 5% of accepted papers.

Once every five minutes, Twitter publishes a list of trending topics by monitoring and analyzing tweets from its users. Similarly, Google publishes an hourly list of hot queries that have been issued to the search engine. In this work, we analyze the time series derived from the daily volume index of each trend, whether from Twitter or Google. Our study on a real-world dataset reveals that about 26% of the trending topics arising on Twitter are also found "as-is" among the hot queries issued to Google. We also find that about 72% of the similar trends appear first on Twitter. We therefore assess the relation between comparable Twitter and Google trends by testing three classes of time series regression models. We validate the forecasting power of Twitter by showing that models which use Google as the dependent variable and Twitter as the explanatory variable retain the past values of Twitter as significant 60% of the time.

Usage Rights

© All Rights Reserved

    SocialCom 2013 Presentation Transcript

    • Trending Topics on Twitter Improve the Prediction of Google Hot Queries
      • Gabriele Tolomei, Università Ca’ Foscari Venezia, Italy
      • Federica Giummolè, Università Ca’ Foscari Venezia, Italy
      • Salvatore Orlando, Università Ca’ Foscari Venezia, Italy
      • 2013 ASE/IEEE International Conference on Social Computing, September 8th-14th, 2013, Washington D.C., USA
      • Monday, September 30, 13
    • Agenda: Social vs. Web Trends
      • Introduction
      • Methodology
      • Experiments & Results
      • Conclusion
      • 2013 ASE/IEEE SocialCom, 09/09/2013, Washington DC, USA
    • Twitter
      • The most popular real-time microblogging service
        • ~500M users
        • ~400M tweets per day on average (as of 2012)
        • tweets limited to 140 characters
      • Social trends pushed by the social network via user-generated content
        • hashtags (#)
        • trending topics
    • Google
      • The most popular Web search engine
        • ~5B search queries per day on average (as of 2012)
      • Web trends derived from search keywords issued by users
        • Zeitgeist
        • Google (Hot) Trends
    • Social vs. Web Trends (example trend keywords, shown in three groups)
      • ... 49ers ... dow jones ... nba ... obama 2016 ... world war z ...
      • ... 50 cent ... democrats ... iphone 5 ... romney ... windows 8 ...
      • ... anne hathaway ... barack obama ... election ... nyc marathon ... veterans day ...
    • Which Came First?
      • [Figure: Volume Index (0 to 100) vs. Timestamp (11-01 to 11-15) for the keyword "election", Google vs. Twitter]
      • Our claim is that a trending topic on Twitter could later become a hot query on Google
    • Data Collection
      • 15 consecutive days of crawling
        • from 2012-11-01 00:00:00 UTC to 2012-11-15 23:59:59 UTC
      • Google
        • Hot Trends
      • Twitter
        • Trending Topics
        • Public Timelines
      • Access methods shown on the slide: Streaming API, Search API, Atom feed
    • Google Hot Trends
      • Top-20 hourly US queries
      • Pre-processing & Cleaning
      • Resulting vocabulary of keywords y: |VY| = 190 (e.g., 49ers ... election ... obama 2016 ... world war z)
    • Search Volume Index
      • Normalized integer score in [0,100]
      • Daily relative searches for a keyword, limited to a specific country and within a range of dates
    • Twitter Trending Topics
      • Top-10 trending topics every 5 minutes
      • Top-10 hourly aggregated
      • Pre-processing & Cleaning
      • Resulting vocabulary of keywords x: |VX| = 892 (e.g., 50 cent ... iphone 5 ... election ... windows 8)
    • Trend Volume Index
      • Use the public timelines crawled
        • ~260M tweets = 10% random sampling
      • To be consistent with Google
        • daily relative number of tweets mentioning a particular keyword (could be hourly!)
        • normalized integer score in [0,100]
        • limited to US and within a range of dates
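The slides do not spell out how raw counts are mapped to an integer score in [0,100]. A minimal sketch, assuming the score is obtained by scaling each day's count by the peak day in the date range (the function name `volume_index` and the peak-scaling rule are assumptions, not taken from the talk):

```python
def volume_index(daily_counts):
    """Scale raw daily counts to integer scores in [0, 100].

    Assumption: the busiest day in the range maps to 100 and every
    other day is expressed relative to it.
    """
    peak = max(daily_counts)
    if peak == 0:
        # A keyword that never occurs gets a flat zero series.
        return [0] * len(daily_counts)
    return [round(100 * count / peak) for count in daily_counts]


# Toy counts of tweets mentioning one keyword on three consecutive days.
print(volume_index([5, 20, 10]))
```

Under this rule every non-empty series peaks at exactly 100, which matches the example series shown on the slides.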
    • Trend Time Series
      • 15 daily observations: $T = \langle t_1, \dots, t_{15} \rangle$
      • Google
        • Hot Trends + Search Volume Index
        • e.g., $Y_t$ for "election" = <5,...,7,40,100,...,15,...>
      • Twitter
        • Trending Topics + Trend Volume Index
        • e.g., $X_t$ for "election" = <6,...,10,100,55,...,5,...>
    • Trend Pairing
      • Not every pair of Google/Twitter trend time series is worth analyzing!
        • anne hathaway vs. veterans day
      • We focus only on trends that are “similar enough” to each other
        • election vs. election
        • election vs. barack obama
    • Trend Bipartite Graph
      • [Figure: bipartite graph whose two node sets are the trend vocabularies $V_X$ and $V_Y$ (e.g., 49ers, dow jones, election, nba, obama 2016, world war z; 50 cent, democrats, iphone 5, election, romney, windows 8); an edge between keywords x and y is weighted by their trend similarity]
    • Trend Similarity
      • Edge weighting scheme of the TBG
        • string/lexical: e.g., Levenshtein, Jaccard, n-grams, etc.
        • semantic: e.g., Wikipedia-based
      • We use the normalized longest common subsequence (nlcs) between two keywords
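The slides name the normalized longest common subsequence but do not give the normalization. A minimal sketch, assuming the LCS length is divided by the length of the longer keyword (the helper names `lcs_length` and `nlcs` and this normalization are assumptions):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)  # DP row for the previous character of a
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def nlcs(a, b):
    """LCS length divided by the longer length (assumed normalization)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


print(nlcs("election", "election"))      # identical keywords score 1.0
print(nlcs("election", "barack obama"))
```

With this choice the score is 1.0 only for identical keywords, so a threshold of 1.0 keeps exact matches and lower thresholds admit near matches.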
    • Datasets
      • 2 thresholds on nlcs, $\eta_1 = 1.0$ and $\eta_2 = 0.6$, lead to 2 TBGs
        • $D_1 = \{(X_t, Y_t) \mid \mathrm{nlcs}(x, y) = \eta_1\}$, $|D_1| = 50$
        • $D_2 = \{(X_t, Y_t) \mid \mathrm{nlcs}(x, y) \geq \eta_2\}$, $|D_2| = 69$
      • Aggregate and normalize Twitter time series linked to the same Google keyword in the TBG
        • $|V_X| > |V_Y|$
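The thresholding and aggregation steps could look roughly like the sketch below. It is illustrative only: `similarity` is a stand-in built on the standard library's `difflib.SequenceMatcher`, not the nlcs measure used in the talk, and summing the linked Twitter series before rescaling to [0,100] is an assumed aggregation rule.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in similarity in [0, 1]; NOT the nlcs used in the talk.
    return SequenceMatcher(None, a, b).ratio()


def pair_trends(twitter_vocab, google_vocab, threshold):
    """Keep the (x, y) keyword pairs whose similarity reaches the threshold."""
    return [(x, y) for x in twitter_vocab for y in google_vocab
            if similarity(x, y) >= threshold]


def aggregate(series_list):
    """Sum several Twitter series element-wise, then rescale to [0, 100].

    Assumption: summation followed by peak scaling; the slides only say
    "aggregate and normalize".
    """
    summed = [sum(values) for values in zip(*series_list)]
    peak = max(summed)
    return [round(100 * v / peak) if peak else 0 for v in summed]


twitter = ["election", "elections", "iphone 5"]
google = ["election", "49ers"]
print(pair_trends(twitter, google, 1.0))  # exact matches only
print(pair_trends(twitter, google, 0.6))  # also admits near matches
```

Because several Twitter keywords can link to one Google keyword at the lower threshold, aggregation yields a single Twitter series per Google keyword.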
    • Research Questions
      • 1) Is there any relation between a particular pair $(X_t, Y_t)$?
        • Cross-Correlation (lagged relationship)
      • 2) Are variables from Twitter time series useful to forecast those from Google?
        • Time series regression
      • Why not the opposite? Because in our data the same trend appears first on Twitter about 70% of the time
    • Cross-Correlation
      • Measures the correlation between two time series $X_t$, $Y_t$ shifted by $\delta$ time units
        • $X_t$ refers to Twitter and $Y_t$ refers to Google
        • min $\delta$ = 1 day
      • Check for which $\delta$ the cross-correlation is maximum
        • X leads Y if one or more $X_{t+\delta}$ are predictors of $Y_t$ and $\delta < 0$
        • X lags Y, otherwise
    • Lagged Relationship
      • Most pairs of time series exhibit their max cross-correlation at lag $\delta = 0$
      • Nevertheless, some exceptions occur, and cross-correlation at lag $\delta = -1$ is still significant
      • Twitter as measured one day before could help explain Google
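A minimal sketch of a lagged cross-correlation, following the convention above (a negative lag pairs an earlier Twitter value with a later Google value). It computes a plain Pearson correlation over the overlapping window, which is a simplification of the usual sample cross-correlation function; the function name and the toy series are assumptions made for illustration.

```python
from statistics import mean


def cross_correlation(x, y, lag):
    """Pearson correlation between x[t + lag] and y[t] on the overlap.

    Requires abs(lag) < len(x) and non-constant overlapping windows.
    """
    n = len(x)
    pairs = [(x[t + lag], y[t]) for t in range(n) if 0 <= t + lag < n]
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in pairs)
    var_x = sum((a - mx) ** 2 for a in xs)
    var_y = sum((b - my) ** 2 for b in ys)
    return cov / (var_x * var_y) ** 0.5


# Toy data: the "Google" series repeats the "Twitter" series one day later.
twitter = [0, 10, 100, 40, 5, 0]
google = [0, 0, 10, 100, 40, 5]

best_lag = max(range(-2, 3), key=lambda d: cross_correlation(twitter, google, d))
print(best_lag)  # -1: Twitter leads Google by one day on this toy data
```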
    • Time Series Regression
      • Relate Y (dependent variable) to a parametric function of a set of explanatory variables $X_1, \dots, X_r$
      • The most widely used function is linear in the parameters
      • Linear Regression: $Y = X\beta + \varepsilon$
        • $Y$: $k \times 1$ column vector
        • $X$: $k \times r$ matrix of observed values for $X_1, \dots, X_r$
        • $\beta$: the parameters
        • $\varepsilon$: $k \times 1$ column vector of errors
    • Ordinary Least Squares
      • Technique to estimate the real vector of coefficients $\beta$
      • Choose $\beta'$ such that:
        • $\beta' = \arg\min_{\beta} \{(Y - X\beta)^T (Y - X\beta)\}$
        • $\beta' = (X^T X)^{-1} X^T Y$
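The step from the minimization problem to the closed form is the standard normal-equations argument, sketched here under the assumption that $X^T X$ is invertible (the symbol $S$ for the objective is introduced only for this derivation):

```latex
\begin{aligned}
S(\beta) &= (Y - X\beta)^T (Y - X\beta) \\
\frac{\partial S}{\partial \beta} &= -2\, X^T (Y - X\beta) = 0 \\
X^T X \beta' &= X^T Y \\
\beta' &= (X^T X)^{-1} X^T Y
\end{aligned}
```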
    • Autoregressive: AR(p)
      • The simplest time series regression model
      • Relate a variable $Y_t$ to a linear combination of up to p of its previous values
      • $Y_t = \alpha + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + \varepsilon_t$
        • $\alpha, \phi_1, \dots, \phi_p$: parameters
        • $\varepsilon_t$: random noise
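For the special case p = 1 the least-squares estimates have a simple closed form (slope = covariance over variance). A minimal sketch, with the function name `fit_ar1` and the noise-free toy series chosen purely for illustration:

```python
from statistics import mean


def fit_ar1(y):
    """Least-squares fit of Y_t = alpha + phi * Y_{t-1} + eps_t."""
    prev, curr = y[:-1], y[1:]          # (Y_{t-1}, Y_t) pairs
    m_prev, m_curr = mean(prev), mean(curr)
    phi = (sum((a - m_prev) * (b - m_curr) for a, b in zip(prev, curr))
           / sum((a - m_prev) ** 2 for a in prev))
    alpha = m_curr - phi * m_prev
    return alpha, phi


# Noise-free toy series generated by Y_t = 2 + 0.5 * Y_{t-1}, starting at 0.
series = [0, 2, 3, 3.5, 3.75, 3.875]
alpha, phi = fit_ar1(series)
print(alpha, phi)
```

Since the toy series has no noise, the fit should recover the generating values (alpha close to 2, phi close to 0.5) up to floating-point error.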
    • Distributed Lag: DL(q)
      • The dependent variable $Y_t$ is related only to q+1 values of the explanatory variable: $X_t$ and its q previous values
      • $Y_t = \alpha + \psi_1 X_t + \psi_2 X_{t-1} + \dots + \psi_{q+1} X_{t-q} + \varepsilon_t$
        • $\alpha, \psi_1, \dots, \psi_{q+1}$: parameters
        • $\varepsilon_t$: random noise
    • Autoregressive Distributed Lag: ADL(p,q)
      • Relate the dependent variable $Y_t$ to lags of itself and of an explanatory variable $X_t$
      • $Y_t = \alpha + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + \psi_1 X_t + \psi_2 X_{t-1} + \dots + \psi_{q+1} X_{t-q} + \varepsilon_t$
        • $\alpha, \phi_1, \dots, \phi_p, \psi_1, \dots, \psi_{q+1}$: parameters
        • $\varepsilon_t$: random noise
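A minimal sketch of fitting ADL(p,q) by ordinary least squares: build one row per time step from the lagged values, then solve the normal equations. All names (`solve`, `adl_design`, `fit_adl`) and the toy data are assumptions for illustration; a real analysis would more likely rely on a statistics package.

```python
def solve(a, b):
    """Solve a z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def adl_design(x, y, p, q):
    """Rows [1, Y_{t-1..t-p}, X_t, X_{t-1..t-q}] with targets Y_t."""
    rows, targets = [], []
    for t in range(max(p, q), len(y)):
        lags_y = [y[t - i] for i in range(1, p + 1)]
        lags_x = [x[t - j] for j in range(0, q + 1)]
        rows.append([1.0] + lags_y + lags_x)
        targets.append(y[t])
    return rows, targets


def fit_adl(x, y, p, q):
    """Return [alpha, phi_1..phi_p, psi_1..psi_{q+1}] from the normal equations."""
    rows, targets = adl_design(x, y, p, q)
    cols = list(zip(*rows))
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, targets)) for ci in cols]
    return solve(xtx, xty)


# Noise-free toy data from Y_t = 1 + 0.5 Y_{t-1} + 0.2 X_t + 0.3 X_{t-1}.
x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
y = [0.0]
for t in range(1, len(x)):
    y.append(1 + 0.5 * y[-1] + 0.2 * x[t] + 0.3 * x[t - 1])

print(fit_adl(x, y, 1, 1))
```

Calling `fit_adl` with p = 0 drops the autoregressive terms and fits DL(q) with the same code.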
    • Model Comparison
      • We measure how often each model, AR(p), DL(q), or ADL(p,q), retains its lagged component as significant
      • Null hypothesis $H_0$: “the lagged coefficient is not significant”
      • Rejecting $H_0$ means that the lagged coefficient is useful to fit the data
      • $H_0$ is rejected whenever the p-value is below a significance level $\alpha$ (e.g., $\alpha = .05$)
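The slides do not say which test produces the p-value. A common choice for a single regression coefficient is the t-test, sketched below under the classical assumptions (independent Gaussian errors with constant variance) and assuming $r$ counts every column of $X$, including a constant column if present:

```latex
\hat{\sigma}^2 = \frac{(Y - X\beta')^T (Y - X\beta')}{k - r}, \qquad
\mathrm{se}(\beta'_j) = \sqrt{\hat{\sigma}^2 \left[ (X^T X)^{-1} \right]_{jj}}, \qquad
t_j = \frac{\beta'_j}{\mathrm{se}(\beta'_j)}
```

Under $H_0: \beta_j = 0$, $t_j$ follows a Student's t distribution with $k - r$ degrees of freedom, and the two-sided p-value is the probability of observing a value at least as large as $|t_j|$ in absolute terms.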
    • Model Evaluation
      • Compute both $R^2 \in [0,1]$ and its adjusted variant, which penalizes models with too many explanatory terms
        • Describes how well a regression line fits the observed data
        • Provides a measure of how well future observations are likely to be predicted by the model
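A minimal sketch of both measures using their textbook definitions; the function names and the toy values are assumptions for illustration, and `n_regressors` is taken to exclude the intercept:

```python
def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot for observed y and fitted y_hat."""
    mean_y = sum(y) / len(y)
    ss_res = sum((obs - fit) ** 2 for obs, fit in zip(y, y_hat))
    ss_tot = sum((obs - mean_y) ** 2 for obs in y)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_regressors):
    """Penalize R^2 for the number of regressors (intercept excluded)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)


observed = [1, 2, 3, 4]
fitted = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(observed, fitted)
print(r2, adjusted_r_squared(r2, n_obs=4, n_regressors=1))
```

The adjustment matters here because ADL(p,q) has more terms than DL(q) or AR(p), so plain $R^2$ alone would favor it regardless of whether the extra terms help.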
    • AR(p) vs. DL(q)
      • On both $D_1$ and $D_2$, DL(q) models retain their q-lagged coefficient much more often than AR(p) models
      • Twitter is actually useful to fit Google data!
    • ADL(p,q)
      • Slightly fewer cases where the lagged component of Twitter is significant to predict Google data...
      • But adjusted $R^2$ is much better than for DL(q)
    • Wrap Up
      • ADL(1,1) is the best model
      • Reasonable! It mixes the autoregressive component of Google with the prediction of Twitter, captured one day before
    • Overcome Limitations
      • We might expect better results if finer-grained (hourly) analysis were possible...
      • Twitter vs. Wikipedia: upcoming CIKM’13 Workshop
    • Conclusion
      • Relate Twitter trending topics (social trends) with Google hot queries (web trends)
      • Trend Bipartite Graph (TBG) links social and web trends
      • Time Series Analysis
        • maximum cross-correlation occurs at lag 0, but Twitter leads Google significantly (~60% of the time)
        • the very best model to explain the data uses both Twitter and Google lagged coefficients
    • Thank You! Questions?
    • Backup
    • Trend Vocabularies $V_X$ and $V_Y$ (example trend keywords, shown in three groups)
      • ... 49ers ... dow jones ... nba ... obama 2016 ... world war z ...
      • ... 50 cent ... democrats ... iphone 5 ... romney ... windows 8 ...
      • ... anne hathaway ... barack obama ... election ... nyc marathon ... veterans day ...
    • Trend Scores
      • Given a discrete time interval $T = \langle t_1, \dots, t_T \rangle$
      • Assign 2 scores (social and web) to each trending keyword during each time unit
      • The score measures how strongly a keyword is trending at a given time
    • Trend Time Series
      • Model each Twitter/Google trending keyword as a time series of $t_T$ random variables
      • Each random variable evaluates to the trending score of the keyword
      • The observed time series for a trend is the sequence of values of its trending score
    • Trend Bipartite Graph
      • 2 disjoint sets of nodes are the vocabularies of Twitter and Google trends
      • Weighted edges measure the pairwise trend similarity
        • string/lexical: edit distance, LCS, n-grams
        • semantic: Wikipedia-based
      • TBG identifies a set of pairs of comparable time series associated with similar trends
    • (Weak) Stationarity
      • Autocorrelation of a stationary variable decays into “noise” and/or negative values within a few lags
      • [Figure: autocorrelation plots for Google and Twitter]