Predicting The Future With Social Media


These slides were used for an internal presentation of the SoNet group.
Every week, one member of the SoNet group presents a research paper to the other members. The papers mentioned are therefore written by other researchers.

This is the abstract of the original paper by
Sitaram Asur, Bernardo A. Huberman

In recent years, social media has become ubiquitous and important for social networking and content sharing. And yet, the content that is generated from these websites remains largely untapped. In this paper, we demonstrate how social media content can be used to predict real-world outcomes. In particular, we use the chatter from Twitter to forecast box-office revenues for movies. We show that a simple model built from the rate at which tweets are created about particular topics can outperform market-based predictors. We further demonstrate how sentiments extracted from Twitter can be further utilized to improve the forecasting power of social media.

  1. Predicting the Future With Social Media
     Sitaram Asur, Bernardo A. Huberman, Social Computing Lab
     The Social Computing Lab focuses on methods for harvesting the collective intelligence of groups of people in order to realize greater value from the interaction between users and information.
     Published on arXiv (Cornell University), March 2010
     Maurizio Napolitano, SoNet group, April 2010
  2. SoNet Research Meetings
     These slides were used for an internal presentation of the SoNet group. Every week, one member of the SoNet group presents a research paper to the other members. The papers mentioned are therefore written by other researchers. Being internal presentations, these slides might be a bit rough and unpolished. You can find more information (including this presentation) about the SoNet group at
  3. The question
     How can social media content be used to predict real-world outcomes?
     The case study: predicting box-office revenues for movies using the chatter from Twitter.
     Why Twitter?
     • Several tens of millions of users actively participate in the creation and propagation of content.
     Why movies?
     • The topic of movies is of considerable interest among the social media user community.
     • The real-world outcomes can be easily observed from box-office revenue for movies.
  4. Topics
     Viral marketing
     • How buzz and attention are created for different movies
     • How buzz and attention change over time
     • Will movies that are well talked about be well watched?
     Sentiments
     • How they are created
     • How positive and negative opinions propagate
     • How they influence people
  5. What was discovered
     • Social media feeds can be effective indicators of real-world performance.
     • The rate at which movie tweets are generated can be used to build a powerful model for predicting movie box-office revenue.
     • The predictions are better than those produced by the Hollywood Stock Exchange, the gold standard in the industry.
  6. The dataset
     Collected through the Twitter Search API by using the movies' keywords:
     • 2.89 million tweets referring to 24 different movies
     • from 1.2 million users
     • over a period of 3 months (Nov-Feb)
     • data items: tweets, @userid, retweets
     Movie                              Release date
     Armored                            2009-12-04
     Avatar                             2009-12-18
     The Blind Side                     2009-11-15
     The Book of Eli                    2010-01-15
     Daybreakers                        2010-01-08
     Dear John                          2010-02-05
     Did You Hear About The Morgans     2009-12-08
     Edge of Darkness                   2010-01-29
     Extraordinary Measures             2010-02-22
     From Paris With Love               2010-02-05
     The Imaginarium of Dr Parnassus    2010-01-08
     Invictus                           2009-12-11
     Leap Year                          2010-01-08
     Legion                             2010-01-22
     Twilight: New Moon                 2009-11-20
     Pirate Radio                       2009-11-13
     Princess And The Frog              2009-11-13
     Sherlock Holmes                    2009-12-15
     Spy Next Door                      2010-01-15
     The Crazies                        2010-02-26
     Tooth Fairy                        2010-02-26
     Transylmania                       2009-12-04
     When in Rome                       2010-01-29
     Youth in Revolt                    2010-01-08
     critical period = the time from the week before a movie's release to two weeks after it
  7. Dataset characteristics
     Figure: number of tweets per unique author for different movies (x → days, y → tweets, lines → movies).
     The curves look like the box-office trends.
  8. Dataset characteristics
     Figure: number of tweets per unique author for different movies (x → days, y → tweets per author, lines → movies).
     The ratio remains fairly consistent, between 1 and 1.5.
  9. Dataset characteristics
     Figure: log distribution of authors and tweets over the critical period (x → log(number of tweets), y → log(frequency of authors)).
     Power law (Zipfian distribution): a few authors generate a large number of tweets.
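The log-log view described on slide 9 can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the function name `loglog_slope`, the input format (one author id per tweet) and the synthetic data are all made up for illustration, and a least-squares line through a log-log histogram is a rough visual check rather than a rigorous power-law fit.

```python
import math
from collections import Counter


def loglog_slope(tweet_authors):
    """Slope of log10(number of authors) against log10(tweets per author).

    tweet_authors: one author id per tweet (hypothetical input format).
    """
    tweets_per_author = Counter(tweet_authors)
    # How many authors wrote exactly k tweets, for each observed k.
    authors_per_count = Counter(tweets_per_author.values())
    xs = [math.log10(k) for k in authors_per_count]
    ys = [math.log10(v) for v in authors_per_count.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic data, NOT from the paper: 10000 authors with 1 tweet,
# 100 authors with 10 tweets and 1 author with 100 tweets.
tweets = []
author_id = 0
for tweets_each, n_authors in [(1, 10000), (10, 100), (100, 1)]:
    for _ in range(n_authors):
        tweets.extend([author_id] * tweets_each)
        author_id += 1

slope = loglog_slope(tweets)
print(slope)
```

A clearly negative slope on such a plot is what the slide summarises as "a few authors generate a large number of tweets".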
  10. Dataset characteristics
      Figure: distribution of total authors and the movies they comment on (x → number of movies, y → authors).
      Power law: a majority of the authors talk about only a few movies.
  11. Attention and popularity: Twitter and the real world
      “Prior to the release of a movie, media companies and producers generate promotional information in the form of trailer videos, news, blogs and photos. We expect the tweets for movies before the time of their release to consist primarily of such promotional campaigns, geared to promote word-of-mouth cascades”
      On Twitter: tweets and retweets referring to a particular URL (photos, trailers and other promotional material)
  12. Attention and popularity
      Figure: percentages of URLs in tweets for different movies.
      There is a greater percentage of tweets containing URLs in the week prior to release than afterwards.
  13. Attention and popularity: tweets with URLs vs retweets
      URL and retweet percentages for the critical weeks
      Features   Week 0   Week 1   Week 2
      url        39.5     25.5     22.5
      retweet    12.1     12.1     11.66
      Correlation and coefficient of determination (R2) values for URLs and retweets before release
      Features   Correlation   R2
      url        0.64          0.39
      retweet    0.5           0.20
      “This result is quite surprising since we would expect promotional material to contribute significantly to a movie’s box-office income”
  14. Prediction: first-weekend box-office revenues
      “Using the tweets referring to movies prior to their release, can we accurately predict the box-office revenue generated by the movie in its opening weekend?”
      How can we define a quantifiable measure on the tweets?
      TWEETRATE: the number of tweets referring to a particular movie per hour
      $\text{Tweetrate}(mov) = \dfrac{|\text{tweets}(mov)|}{|\text{Time (hours)}|}$
      “the correlation of the average tweetrate with the box-office gross for the 24 movies considered showed a strong positive correlation, with a correlation coefficient value of 0.90”
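The tweet-rate definition on slide 14 can be sketched in Python. This is a minimal illustration, not the authors' code: the function name `tweet_rate`, the half-open time window and the sample timestamps are assumptions made up for the example.

```python
from datetime import datetime, timedelta


def tweet_rate(timestamps, start, end):
    """Return tweets per hour for one movie over the window [start, end).

    timestamps: datetimes of tweets that mention the movie (hypothetical input).
    """
    hours = (end - start).total_seconds() / 3600.0
    if hours <= 0:
        raise ValueError("end must be later than start")
    in_window = [t for t in timestamps if start <= t < end]
    return len(in_window) / hours


# Made-up data: 48 tweets spread evenly over the day before a release.
release = datetime(2010, 1, 8)
window_start = release - timedelta(days=1)
sample = [window_start + timedelta(minutes=30 * i) for i in range(48)]

rate = tweet_rate(sample, window_start, release)
print(rate)  # 2.0 tweets per hour
```

Calling the function once per day of the week before release would give a tweet-rate time series of the kind the slides later use as regression input.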
  15. Prediction: using regression analysis
      Predictions were compared with the real box-office revenue information extracted from the Box Office Mojo website => POSITIVE RESULTS
      Regression analysis with:
      • Time series values of the tweet rate for the 7 days before the release
      • thent → number of theaters in which the movies were released
      • HSX Index → the index of the Hollywood Stock Exchange
  16. Prediction: linear regression results
      Features                        Adjusted R2   p-value***
      Avg Tweet-rate                  0.80          3.65e-09
      Tweet-rate timeseries           0.93          5.279e-09
      Tweet-rate timeseries + thent   0.973         9.14e-12
      HSX timeseries + thent          0.963         1.030e-10
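The simplest row of the slide 16 results, average tweet rate as the only predictor, corresponds to a one-variable linear regression scored with adjusted R2. The sketch below shows that computation with plain Python; the helper names `fit_line` and `adjusted_r2` and all numbers are assumptions for illustration and are not the paper's data or code. The time-series rows would need a multi-variable fit, which is not shown here.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def adjusted_r2(x, y, intercept, slope, n_predictors=1):
    """Adjusted coefficient of determination for a fitted line."""
    n = len(y)
    mean_y = sum(y) / n
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    # Penalise R2 for the number of predictors relative to the sample size.
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)


# Made-up numbers, NOT the paper's data: average tweets per hour and
# opening-weekend gross (millions of dollars) for five imaginary movies.
avg_rate = [10.0, 20.0, 30.0, 40.0, 50.0]
gross = [5.0, 9.0, 16.0, 19.0, 26.0]

b0, b1 = fit_line(avg_rate, gross)
score = adjusted_r2(avg_rate, gross, b0, b1)
print(b0, b1, score)
```

Adjusted R2 is used rather than plain R2 because it does not automatically increase when more predictors, such as extra days of the time series, are added.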
  17. Prediction
      Figure: predicted vs actual box-office scores using tweet-rate and HSX predictors.
  18. Prediction: predicting prices
      Prediction of HSX end of opening weekend price
      Predictor                       Adjusted R2   p-value***
      HSX timeseries + thent          0.95          4.495e-10
      Tweet-rate timeseries + thent   0.97          2.379e-11
      “The Hollywood Stock Exchange de-lists movie stocks after 4 weeks of release, which means that there is no timeseries available for movies after 4 weeks. In the case of tweets, people continue to discuss movies long after they are released”
      Coefficient of determination (R2) values using tweet-rate timeseries for different weekends
      Weekend     Adjusted R2
      Jan 15-17   0.92
      Jan 22-24   0.97
      Jan 29-31   0.92
      Feb 05-07   0.95
  19. Sentiment Analysis
      Goal: investigate the importance of sentiments in predicting future outcomes
      • For each tweet, assign the label Positive, Negative or Neutral
      • Clean the data (no stop-words; remove URLs and userids; replace titles, question marks and exclamation marks)
      • Amazon Mechanical Turk (1000 workers)
      • Use LingPipe – DynamicLMClassifier
      • Obtained an accuracy of 98%
      Define two variables:
      $\text{Subjectivity} = \dfrac{|\text{Positive and Negative Tweets}|}{|\text{Neutral Tweets}|}$
      $\text{PNratio} = \dfrac{|\text{Tweets with Positive Sentiment}|}{|\text{Tweets with Negative Sentiment}|}$
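Once every tweet carries a label, the two ratios defined on slide 19 reduce to simple counting. The sketch below assumes lowercase string labels and a hypothetical helper name `sentiment_ratios`; returning `None` for a zero denominator is a design choice of this example, not something the slides specify.

```python
from collections import Counter


def sentiment_ratios(labels):
    """Compute (subjectivity, pn_ratio) from per-tweet sentiment labels.

    labels: iterable of "positive", "negative" or "neutral" strings.
    A ratio is None when its denominator is zero.
    """
    counts = Counter(labels)
    pos, neg, neu = counts["positive"], counts["negative"], counts["neutral"]
    subjectivity = (pos + neg) / neu if neu else None
    pn_ratio = pos / neg if neg else None
    return subjectivity, pn_ratio


# Made-up labels for illustration only.
labels = ["positive"] * 6 + ["negative"] * 2 + ["neutral"] * 4
print(sentiment_ratios(labels))  # (2.0, 3.0)
```

Computing the pair separately for tweets before and after release would allow the kind of comparison the following slides describe.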
  20. Sentiment Analysis
      Figure: subjectivity for different movies (x → movies, y → subjectivity).
      The subjectivity increases after release.
  21. Sentiment Analysis
      Figure: polarity for different movies (x → movies, y → polarity).
      The positive and negative sentiments move in the same direction as the movies' success.
  22. Sentiment Analysis: regression analysis and polarity (PNRatio)
      Predictor                         Adjusted R2   p-value
      Avg Tweet-rate                    0.79          8.39e-09
      Avg Tweet-rate + thent            0.83          7.93e-09
      Avg Tweet-rate + PNRatio          0.92          4.31e-12
      Tweet-rate timeseries             0.84          4.18e-06
      Tweet-rate timeseries + thent     0.863         3.64e-06
      Tweet-rate timeseries + PNRatio   0.94          1.84e-08
      The sentiments do provide improvements, although they are not as important as the rate of tweets themselves.
  23. General prediction model for social media
      $y = \beta_a \cdot A + \beta_p \cdot P + \beta_d \cdot D + \epsilon$
      A: rate of attention seeking
      P: polarity of sentiments and reviews
      D: distribution parameter
      y: the revenue to be predicted
      $\epsilon$: the error
      The $\beta$ values correspond to the regression coefficients.
  24. Bibliography
      • D. M. Pennock, S. Lawrence, C. L. Giles, and F. A. Nielsen. The real power of artificial markets. Science, 291(5506):987–988, Jan 2001.
      • W. Zhang and S. Skiena. Improving movie gross prediction through news analysis. In Web Intelligence, pages 301–304, 2009.
  25. These slides are released under Creative Commons Attribution-ShareAlike 2.5
      You are free:
      ● to copy, distribute, display, and perform the work
      ● to make derivative works
      ● to make commercial use of the work
      Under the following conditions:
      ● Attribution. You must attribute the work in the manner specified by the author or licensor.
      ● Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under a license identical to this one.
      ● For any reuse or distribution, you must make clear to others the license terms of this work.
      ● Any of these conditions can be waived if you get permission from the copyright holder.
      Your fair use and other rights are in no way affected by the above.
      More info at