Social Media Mining: HICSS 46, Part 1

HICSS 46 Tutorial on Social Media Mining and Analysis - Part 1 -- Delivered on 1/7/13

  • Requires structured data (numbers and categories well-defined). Transformed by data preparation or collected with an a priori design in mind. Typically housed and organized in a relational database, data mart or data warehouse.
  • First step in textual data preparation is to systematically collect samples of text, i.e. the documents related to the context being studied. Range of possibilities: Word documents, PDFs, emails, IM chat, Web pages, RSS feeds, blogs, tweets, open-ended surveys, transcripts of helpline calls … Convert into an organized set of texts, called a corpus, standardized and prepared for the purpose of knowledge discovery.
  •  “Do you have the flu? Fever, Chest discomfort, Weakness, Aches, Headache, Cough.”

    1. Mining and Analyzing Social Media: Part 1 Dave King January 7, 2013
    2. Abstract
        Overview of the data mining and analysis of social media, exploring the application of various data mining, text mining and analytical techniques to social media data sources. The focus will be on the practical application of these techniques for the purposes of:
        • Monitoring of social media sources
        • Analyzing content to identify leading issues and sentiment
        • Analyzing and forecasting trends
        • Identifying and profiling influential participants, subgroups and communities
        Copyright 2011 JDA Software Group, Inc. - CONFIDENTIAL
    3. Agenda: Part 1
        • My Biography
        • Resources
        • Social Media Defined
        • Data Mining Example
        • Text Mining Processes
        • Using Text Mining for Prediction
        • Brief Look at Programming for Prediction
    4. Agenda: Part 2
        • Sentiment Analysis & Opinion Mining
            – Defined
            – Business Interest & Software Packages
            – Levels of Analysis
            – Automated Classification
        • Social Network Analysis
            – Defined
            – History
            – Basic techniques and measures
            – Ego and Social-Centric Analysis
    5. Biography: Dave King
        • EVP of Product Development and Management at JDA Software
        • 30 years in enterprise package software business
        • 15 years as university professor
        • 15 years as Co-Chair of the Internet & Digital Economy Track (HICSS)
        • Long-time interest in various aspects of E-Commerce, Business Intelligence, Analytics (including Text Analytics)
    6. Personal Experiences with Analytics
        • Taught applied statistics and math modeling
        • In software R&D
            – Optimization in the 80s
            – Natural Language Frontends
                • NLI Query & CMU Robotics Lab
            – EIS Competitive Analysis
                • Dow Jones and Reuters
                • Verity Topics
                • NewsAlert
            – InXight’s Hyperbolic Tree
            – Supply Chain Analytics
            – Sentiment Analysis for Retailers
        • In the case of many of these advanced techniques, the audiences have often been small, sometimes bewildered, and often fleeting.
    7. Text Mining Resources
    8. Social Networking Analysis Resources
    9. What is Social Media?
    10. Defined: Social media are online technologies and practices for social interaction, enabling the sharing of opinions, insights, experiences, perspectives and media itself.
    11. Defined: Social media is the media we use to be social. That’s it.
    12. Social Media Types: Take Your Pick
    13. Social Media is Still Huge!
        Alexa Traffic, Oct 6, 2012
        Rank  Website       Type
        1     Facebook      Social
        2     Google        Search
        3     YouTube       Social
        4     Yahoo!        Search
        5     Baidu.com     Search
        6     Wikipedia     Social
        7     Windows Live  Search
        8     Twitter       Social
        9     QQ.COM        Portal
        10    Amazon.com    E-Commerce
        11    Blogspot.com  Social
        12    LinkedIn      Social
        13    Taobao.com    E-Commerce
        14    Google India  Search
        15    Yahoo! Japan  Search
    14. Social Media is Still Huge!
        Growth in Registered Users, 2011 to 2012
        • Facebook: 750M - 1B
        • Twitter: 200M - 500M
        • LinkedIn: 100M - 175M
    15. Social Media is Still Huge!
        If Social Media sites were countries…
        • China: 1.4B
        • India: 1.2B
        • Facebook: 1.0B
        • Twitter: 500M
        • US: 310M
    16. Social Media is Still Huge!
        Usage Per Day
        • Facebook: 3.2B Likes & Comments
        • Twitter: 340M Tweets
        • LinkedIn: 14M Searches
    17. Analyzing Social Media: Two Paths
        • Media - Content
        • Social - Network
    18. Analyzing Social Media: Two Paths
        An Example: Which Blogs are Similar?

        Blog-by-term matrix, input to Cluster Analysis (e.g. K-Means):
                Term1  Term2  Term3  …  TermM
        Blog1   1      0      0      …  1
        Blog2   0      0      1      …  0
        Blog3   0      1      0      …  1
        …       …      …      …      …  …
        BlogN   0      0      0      …  1

        Blog-by-blog matrix, input to Social Network (Graph) Analysis:
                Blog1  Blog2  Blog3  …  BlogN
        Blog1   -      1      0      …  1
        Blog2   0      -      1      …  0
        Blog3   1      1      -      …  0
        …       …      …      …      -  …
        BlogN   1      0      1      …  -
    19. Social Media Formats
        • Articles
        • Comments
        • Messages
        • Reviews
        • Ratings
        • Rankings
        • Pictures
        • Videos
        • Music
        • Locations
        • Tags
        • …
    20. Social Media Data: One Commonality
    21. Data Mining: Defined
        Discovering meaningful patterns from large datasets using pattern recognition technologies.
    22. Data Mining: CRISP-DM (Cross-Industry Standard Process for Data Mining)
        Process phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment
        Data Preparation steps: Real-World Data → Data Consolidation → Data Cleaning → Data Transformation → Data Reduction → Well-Formed Data
    23. Data Mining: General Data Assumptions
        • Structured
        • Transformed
        • Well-Formed
    24. Data Mining: Simple Example
        Market Basket Analysis
    25. Market Basket Analysis: Applications
        • Cross Selling
        • Product Placement
        • Affinity Promotion
        • Customer Segmentation Analysis
    26. Market Basket Analysis: The Ole Beer and Diaper Legend…
        Transaction items:
        Transaction  Items
        1            Bread, Milk
        2            Beer, Bread, Diapers, Eggs
        3            Beer, Cola, Diapers, Milk
        4            Beer, Bread, Diapers, Milk
        5            Bread, Cola, Diapers, Milk

        Binary representation:
        Transaction  Beer  Bread  Cola  Diapers  Eggs  Milk
        1            0     1      0     0        0     1
        2            1     1      0     1        1     0
        3            1     0      1     1        0     1
        4            1     1      0     1        0     1
        5            0     1      1     1        0     1

        Goal: Empirically determine those itemsets that occur frequently together in a set of transactions, producing a set of Association Rules of the form LHS -> RHS
    27. Market Basket Analysis: The Ole Beer and Diaper Legend…
        Transaction  Beer  Bread  Cola  Diapers  Eggs  Milk
        1            0     1      0     0        0     1
        2            1     1      0     1        1     0
        3            1     0      1     1        0     1
        4            1     1      0     1        0     1
        5            0     1      1     1        0     1

        Concepts (definition; example):
        • Itemset: A specific collection of items in a transaction; {Diapers, Beer}
        • Support Count: Number of transactions with the itemset; Support {Diapers, Beer} = 3
        • Transactions: Number of transactions = N; N = 5
        • Association Rule: Implication rule of the form LHS -> RHS where LHS & RHS are itemsets; {Diapers} -> {Beer}
        • Rule Support: Fraction of transactions in which the rule appears, #tuples(LHS, RHS)/N; 3/5 = .6
        • Rule Confidence: Fraction of transactions with LHS in which RHS also occurs, #tuples(LHS, RHS)/#tuples(LHS); 3/4 = .75
        • Rule Lift: Strength of association over random co-occurrence of LHS and RHS, (#tuples(LHS, RHS)/N)/((#tuples(LHS)/N) x (#tuples(RHS)/N)); (3/5)/((4/5) x (3/5)) = 1.25
            Equivalently: Confidence(RHS|LHS)/Support(RHS), or Support(LHS, RHS)/(Support(LHS) x Support(RHS))
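
The support, confidence and lift definitions on slide 27 can be checked with a few lines of Python. This is a minimal sketch rather than the presenter's code: the function names are chosen here for illustration, and the transactions are copied from the slide's table.

```python
# Transactions from the beer-and-diaper table on the slide.
transactions = [
    {"Bread", "Milk"},
    {"Beer", "Bread", "Diapers", "Eggs"},
    {"Beer", "Cola", "Diapers", "Milk"},
    {"Beer", "Bread", "Diapers", "Milk"},
    {"Bread", "Cola", "Diapers", "Milk"},
]


def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(lhs, rhs):
    """Support of the whole rule divided by support of its left-hand side."""
    return support(lhs | rhs) / support(lhs)


def lift(lhs, rhs):
    """Rule support relative to what independent LHS and RHS would give."""
    return support(lhs | rhs) / (support(lhs) * support(rhs))


lhs, rhs = {"Diapers"}, {"Beer"}
rule_support = support(lhs | rhs)
rule_confidence = confidence(lhs, rhs)
rule_lift = lift(lhs, rhs)
print(rule_support, rule_confidence, rule_lift)
```

For {Diapers} -> {Beer} the three values agree (up to floating-point rounding) with the slide's .6, .75 and 1.25.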
    28. Market Basket Analysis: What if it were a text analysis problem?
        Joe went to the 7-11 to pick up some cigarettes. While he was there he also bought some dipers and beer.
        Sally was on her weekly shopping run at Wal-Mart. She had picked up some diapers and formula for her infant. She also thought about buying beer for her husband, but they were out of the brand he liked.
    29. Market Basket Analysis: What if it were a text analysis problem?
        • No specified format
        • Variable length
        • Variable spelling
        • Punctuation and non-alphanumeric characters
        • Contents are not predefined and no predefined set of values
    30. Text Mining (aka Text Analytics): Defined
        Using natural language processing & data mining to discover patterns in a collection of “documents”
    31. Text Mining: Document Collections
        • Word Documents
        • PDFs
        • Emails
        • IM Chat
        • Web Pages
        • Blogs
        • Tweets
        • Open-ended surveys
        • Transcripts of Helpline calls
    32. Text Mining: CRISP-Like Processes
        Process phases: Business Understanding, Document Understanding, Document Preparation, Modeling, Evaluation, Deployment
        Document Preparation steps: Real-World Text Data (Documents) → Document Consolidation → Establish the Corpus → Corpus Refinement (Token, Stem, Stop…) → Feature Selection & Weighting → Doc-Term Matrix*
    33. Text Mining: Creating the corpus
        A large and “structured” or “organized” collection of text
    34. Text Mining Process: Corpus Refinement
        Common representation of tokens within and between documents
        Tokenization → Normalize → Eliminate Stop Words → Stemming
        • Tokenization: Parse the text to generate terms. Sophisticated analyzers can also extract phrases from the text.
        • Normalize: Convert the terms to lowercase.
        • Eliminate stop words: Eliminate terms that appear very often (e.g. the, and, …).
        • Stemming: Convert the terms into their stemmed form by removing plurals and different word forms (e.g. achieve, achieves, achieved → achiev) [note: word about synonyms – WordNet Synset]
    35. Text Mining Processes: Feature Extraction & Weighting
        • Feature Extraction: “Bag of Words, Terms or Tokens”
        • Vector Representation: Word, Term or Token/Doc Matrix
        • “Bag of Words” (BOW) or Vector Space Model (VSM): Words or Tokens are attributes and documents are examples
    36. Text Mining Processes: Transforming Frequencies
        • Binary Frequencies: tf = 1 for tf > 0; otherwise 0
        • Term Frequencies: tf(i,j)/Sum of tf(i,j) in Doc K
        • Log Frequencies: 1 + log(tf) for tf > 0; otherwise 0
        • Normalized Frequencies: Divide each frequency by SQRT of Sum of Squares of the frequencies within the vector (column)
        • Term Frequency–Inverse Document Frequency
            – TF * IDF
            – Inverse Document Frequency: log(N/(1+D)) where N is total number of docs and D is number with term
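
The frequency transforms listed on slide 36 are small enough to write out directly. The sketch below is illustrative only: the three toy documents and the function names are assumptions made for the example, and the IDF uses the slide's log(N/(1+D)) form.

```python
import math
from collections import Counter

# Hypothetical toy corpus: document id -> list of already-prepared terms.
docs = {
    "d1": ["flu", "shot", "flu"],
    "d2": ["sore", "throat"],
    "d3": ["flu", "season"],
}


def binary_tf(tf):
    """Binary frequency: 1 if the term occurs at all, otherwise 0."""
    return 1 if tf > 0 else 0


def log_tf(tf):
    """Log frequency: 1 + log(tf) for tf > 0, otherwise 0."""
    return 1 + math.log(tf) if tf > 0 else 0.0


def idf(term, corpus):
    """Inverse document frequency in the slide's form: log(N / (1 + D))."""
    n_docs = len(corpus)
    n_with_term = sum(1 for terms in corpus.values() if term in terms)
    return math.log(n_docs / (1 + n_with_term))


def tf_idf(term, doc_id, corpus):
    """Raw term count in the document times the term's IDF."""
    tf = Counter(corpus[doc_id])[term]
    return tf * idf(term, corpus)


print(tf_idf("sore", "d2", docs), tf_idf("flu", "d1", docs))
```

With this smoothed IDF, a term that appears in all but one document ("flu" here, in 2 of 3) gets weight log(3/3) = 0, so only rarer terms such as "sore" keep a positive weight.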
    37. Text Mining Processes: Twitter Example – Problem Features
        • Each tweet <= 140 characters (avg. 10-15 words/message)
        • Heavy presence of non-alpha symbols, abbrevs, misspellings and slang
        • Tweets often include retweets (original tweet repeated)
    38. Text Mining Processes: Twitter Example
        One of the things I love and adore about Twitter … is how its open API has lit a fierce fire of innovation when it comes to analytics. Anyone and their brother and ma-in-law can develop a tool, and they have! Much to the benefit of the rest of us.
    39. Text Mining Processes: Twitter Example – Twitter API
        Get Search
        • http://search.twitter.com/search.json?q=<query>
        • search.twitter.com/search.json?q=Obama&rpp=100&page=5
    40. Text Mining Processes: Twitter Example – JSON
        • JSON (JavaScript Object Notation)
        • Lightweight, Text-Based, Data-Interchange Format
        • Built on Two Structures:
            – A collection of name:value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
            – An ordered list of values. In most programming languages, this is realized as an array, vector, list, or sequence.
    41. Text Mining Processes: Twitter Example – Establish Corpus
        Query API:
            search.twitter.com/search.json?q=%3A)+feel+feeling&rpp=100&page=5
            search.twitter.com/search.json?q=%3A(+feel+feeling&rpp=100&page=5
        Result:
            {…
             results:[
              {iso_language_code: en,
               to_user_name: Andrea,
               ...
               text: u”Love is everything in this world! Its a feeling like no other. I can't wait 2 feel that emotion again..but patience is key :-)”
               ...
               created_at: Thu, 18 Oct 2012 20:11:22 +0000,
               …}}
             ...}
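
Pulling the tweet text out of a search response of the shape shown on slide 41 amounts to indexing the "results" list. The sketch below works on a hard-coded payload instead of a live request; the second tweet, its user name and the helper function names are made up for illustration.

```python
import json

# Hypothetical response modeled on the slide's example; only "results" and
# "text" are relied on. The second entry is invented to show retweet removal.
payload = """
{"results": [
  {"iso_language_code": "en", "to_user_name": "Andrea",
   "text": "Love is everything in this world! :-)",
   "created_at": "Thu, 18 Oct 2012 20:11:22 +0000"},
  {"iso_language_code": "en", "to_user_name": null,
   "text": "RT @example_user: hate feeling sick :(",
   "created_at": "Thu, 18 Oct 2012 20:12:01 +0000"}
]}
"""


def extract_texts(raw_json):
    """Return the "text" field of every tweet in the "results" list."""
    response = json.loads(raw_json)
    return [item["text"] for item in response["results"]]


def drop_retweets(texts):
    """Keep only tweets that do not contain the retweet marker 'RT '."""
    return [t for t in texts if "RT " not in t]


texts = extract_texts(payload)
corpus = drop_retweets(texts)
print(corpus)
```

The JSON object maps onto a Python dict and the JSON array onto a list, which is the "two structures" point made on slide 40.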
    42. Text Mining Processes: Twitter Example – Simple Question
        Are there any language differences between “feeling” tweets containing :) and :( symbols?
    43. Text Mining Processes: Twitter Example – Establish Corpus
        Take the text: value pairs in the JSON object, then remove RTs:
        1. Love is everything in this world! Its a feeling like no other. I can't wait 2 feel that emotion again..but patience is key :-)
        2. @mzxAmaZiiN Whats up ma. How ya feeling? Lemme make that soul feel better. :)
        3. "..Ive got a good feeling about today :P Something makes me feel i might make a sale or two :) *fingers crossed* #etsy #shop #seller #cra...
        4. @IWontForgetDemi Awww poor thing :( hate feeling sick! Hope you feel better soon
        5. @IzabelaLeafsfan no the worst feeling ever is when u feel like total crap. cuz u think no one luvs u :(
        6. …
    44. Text Mining Processes: Twitter Example – Doc Preparation
        Tokenization → Normalize → Eliminate Stop Words → Stemming
        Tweet: Love is everything in this world! Its a feeling like no other. I can't wait 2 feel that emotion again..but patience is key :-)
        Words: [Love, is, everything, in, this, world!, Its, a, feeling, like, no, other., I, can't, wait, 2, feel, that, emotion, again..but, patience, is, key, :-)]
        Tokens: [Love, is, everything, in, this, world, !, Its, a, feeling, like, no, other., I, ca, n't, wait, 2, feel, that, emotion, again..but, patience, is, key, :, -, )]
    45. Text Mining Processes: Twitter Example – Doc Preparation
        Tokenization → Normalize → Eliminate Stop Words → Stemming
        • Normalize – [love, is, everything, in, this, world!, its, a, feeling, like, no, other., i, can't, wait, 2, feel, that, emotion, again..but, patience, is, key, :-)]
        • Alpha – [love, is, everything, in, this, its, feeling, like, no, wait, feel, that, emotion, patience, is, key]
        • Remove Stopwords – [love, everything, feeling, like, wait, feel, emotion, patience, key]
        • Stemming – [love, everyth, feel, like, wait, feel, emot, patienc, key]
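
The first three preparation steps on slide 45 (normalize, keep alphabetic words, remove stop words) can be reproduced with the standard library alone. This is a sketch under stated assumptions: the stop word set is a tiny hypothetical stand-in for a full English list, the tweet is written with the apostrophe in "can't" restored, and the Porter stemming step is left out because it needs NLTK.

```python
import re

# Hypothetical mini stop word list covering only the words in this example;
# a real run would use a full English list such as NLTK's.
STOPWORDS = {"is", "in", "this", "its", "no", "that"}


def prepare(tweet):
    """Split on whitespace, lowercase, keep alphabetic words of 2+ letters,
    then drop stop words."""
    words = tweet.split()
    lowered = [w.lower() for w in words]
    alpha = [w for w in lowered if re.fullmatch(r"[a-z]{2,}", w)]
    return [w for w in alpha if w not in STOPWORDS]


tweet = ("Love is everything in this world! Its a feeling like no other. "
         "I can't wait 2 feel that emotion again..but patience is key :-)")
terms = prepare(tweet)
print(terms)
```

The alphabetic filter is what discards "world!", "other.", "can't", "2", "again..but" and the emoticon, which matches the "Alpha" row on the slide.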
    46. Text Mining Processes: Twitter Example – Analysis
        Item              Collection  List  Set   Lex Div  Avg Len in Chars  Avg No. per Tweet (w/o RTs)
        Tweets            HF          498   -     -        108               -
                          SF          499   -     -        100               -
        Tweets w/o "RT"   HF          409   -     -        105               -
                          SF          429   -     -        98                -
        Words             HF          8077  2346  3        4                 20
                          SF          8149  2073  4        4                 19
        Alpha (lower)     HF          5622  1197  5        4                 14
                          SF          5733  1041  6        4                 13
        Alpha w/o Stops   HF          3400  1092  3        5                 8
                          SF          3469  936   4        5                 8
        Stems             HF          3400  978   3        4                 8
                          SF          3469  844   4        4                 8
        Stems w/o "feel"  HF          2619  977   3        4                 6
                          SF          2635  843   3        4                 6
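
The "Lex Div" column in the table on slide 46 is consistent with lexical diversity computed as list size divided by set size, i.e. total tokens over distinct tokens, rounded to a whole number. A minimal sketch, with a made-up token list and a function name chosen for illustration:

```python
def lexical_diversity(tokens):
    """Average number of times each distinct token is used."""
    return len(tokens) / len(set(tokens))


# Hypothetical token list: 4 tokens, 3 distinct.
sample = ["feel", "good", "feel", "better"]
sample_diversity = lexical_diversity(sample)

# Same ratio applied to the List and Set counts in the "Words" HF row.
words_hf_diversity = round(8077 / 2346)
print(sample_diversity, words_hf_diversity)
```

Under this reading a higher value means more repetition, which fits the long literary texts on the next slide scoring far above the tweet collections.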
    47. Text Mining Processes: Twitter Example – Analysis
        Corpus                   Avg Word Len  Avg Sent Len  Lexical Diversity
        HF Tweets                4             20            3
        SF Tweets                4             19            4
        austen-emma.txt          4             21            26
        austen-persuasion.txt    4             23            16
        austen-sense.txt         4             23            22
        bible-kjv.txt            4             33            79
        blake-poems.txt          4             18            5
        bryant-stories.txt       4             17            14
        burgess-busterbrown.txt  4             17            12
        carroll-alice.txt        4             16            12
        chesterton-ball.txt      4             17            11
        chesterton-brown.txt     4             19            11
        chesterton-thursday.txt  4             16            10
        edgeworth-parents.txt    4             17            24
        melville-moby_dick.txt   4             24            15
        milton-paradise.txt      4             52            10
        shakespeare-caesar.txt   4             11            8
        shakespeare-hamlet.txt   4             12            7
        shakespeare-macbeth.txt  4             12            6
        whitman-leaves.txt       4             35            12
    48. Text Mining Processes: Twitter Example – Analysis
    49. Text Mining Processes: Twitter Example – Doc-Term Matrix
        Column totals (3538 overall): like 216, better 166, make 111, good 86, hope 83, know 76, get 66, sick 64, hate 58, realli 58, …, and 4 each for ye, rain, quit, chang, stress, happier, longer, cheer, without, everyth
        Total,Tweet,Face,like,better,make,good,hope,know,get,sick,hate,realli,…,ye,rain,quit,chang,stress,happier,longer,cheer,without,everyth
        5,Tweet1,HF,0,1,0,0,1,0,0,0,0,0,…,0,0,0,0,0,0,0,0,0,0
        0,Tweet2,HF,0,0,0,0,0,0,0,0,0,0,…,0,0,0,0,0,0,0,0,0,0
        7,Tweet3,HF,0,0,0,0,0,0,0,0,0,0,…,0,0,0,0,0,0,0,0,0,0
        3,Tweet4,HF,0,0,0,0,0,0,0,0,0,1,…,0,0,0,0,0,0,0,0,0,0
        2,Tweet5,HF,0,0,1,0,0,0,0,0,0,0,…,0,0,0,0,0,0,0,0,0,0
        1,Tweet6,HF,0,1,0,0,0,0,0,0,0,0,…,0,0,0,0,0,0,0,0,0,0
        5,Tweet7,HF,1,0,0,0,0,0,0,0,0,0,…,0,0,0,0,0,0,0,0,0,0
        7,Tweet8,HF,0,0,0,0,1,0,0,0,0,0,…,0,0,0,0,0,0,0,0,0,0
        5,Tweet9,HF,0,1,0,0,1,0,0,1,0,0,…,0,0,0,0,0,0,0,0,0,0
        2,Tweet10,HF,0,0,0,0,0,0,0,0,0,0,…,0,0,0,0,0,0,0,0,0,0
        …
        5,Tweet829,SF,0,0,0,0,0,0,0,0,0,0,…,0,0,0,0,0,0,0,0,0,0
        5,Tweet830,SF,1,0,0,0,0,0,0,0,0,0,…,0,0,0,0,0,0,0,0,0,0
        3,Tweet831,SF,0,0,0,0,0,1,0,0,0,0,…,0,0,0,0,0,0,0,0,0,0
        4,Tweet832,SF,0,0,1,0,0,0,0,0,0,0,…,0,0,0,0,0,0,0,0,0,0
        3,Tweet833,SF,1,0,0,0,0,0,0,0,0,0,…,0,0,1,0,0,0,0,0,0,0
        3,Tweet834,SF,0,1,1,0,0,0,0,0,0,0,…,0,0,0,0,0,0,0,0,0,0
        5,Tweet835,SF,1,0,0,0,0,1,0,0,1,0,…,0,0,0,0,0,0,0,0,0,0
        3,Tweet836,SF,0,0,0,0,0,0,0,1,1,0,…,0,0,0,0,0,0,0,0,0,0
        5,Tweet837,SF,0,1,0,0,0,0,1,0,0,0,…,0,0,0,0,0,0,0,0,0,0
        3,Tweet838,SF,1,0,0,0,0,0,0,0,0,0,…,0,0,0,0,0,0,0,0,0,0
    50. Prediction
        Information + Epidemiology = Infodemiology
        Monitoring and analyzing queries from Internet search engines or people's status updates on microblogs for syndromic surveillance to predict disease outbreaks
    51. Prediction: Syndromic Surveillance
        Surveillance using health-related data that precede diagnosis and signal a sufficient probability of a case or an outbreak to warrant further public health response
    52. Monitoring the Onset of the Flu Season: The Official Standard Way
        ILI - influenza-like illness
        Sentinel Physician → ILI Report → Public Health Authority → Aggregated Data → Public Reports
        • Costly
        • 1-2 Week Lag
    53. Prediction: Infodemiology
        What is the first thing some people do before they see a doctor or take OTC medicines?
        • search
        • tweet
    54. Prediction - What are the terms, keywords, phrases…?
        What is the first thing people do before they see a doctor or take OTC medicines? (search, tweet)
        • Flu
        • Flu symptoms
        • H1N1
        • Swine Flu
        • Cold
        • Fever
        • Headache
        • …
    55. Prediction: Infodemiology Studies
        • Eysenbach, “Infodemiology: Tracking Flu-Related Searches on the Web for Syndromic Surveillance” (Nov-06; Type: Search)
            – Data Source: Impressions and clicks from a Google Ad displaying the question: “Do you have the flu? Fever, Chest discomfort, Weakness, Aches, Headache, Cough.” Covers Oct. 2004-May 2005
            – Dependent Variables: Number of influenza lab tests, number of positive influenza lab tests and number of ILI reports from Sentinel GPs reported by PHA in Canada
            – Explanatory Variables: Number of clicks on Google Ad dealing with influenza
            – Model: Linear correlation between clicks and 3 measures of publicly reported influenza
            – Results: Correlation of .81 with ILI data and .90 with lab tests and positive cases.
        • Hulth et al., “Web Queries as a Source for Syndromic Surveillance” (Feb-09; Type: Search)
            – Data Source: Queries to Swedish health advice site (Varguiden) from June 2005 to June 2007
            – Dependent Variables: Weekly lab diagnosed cases of influenza and % of ILI reports of influenza from GPs
            – Explanatory Variables: Weekly ratio of influenza related Web queries to all queries at Varguiden site
            – Model: Two linear regression models, one predicting lab diagnosed cases and the other ILI reports, where the explanatory variable is based on a composite of the best predicting query terms
            – Results: Average R squared was .90 for the two years.
        • Ginsberg et al., “Detecting influenza epidemics using search engine query data” (Feb-09; Type: Search)
            – Data Source: Google Search: Historical web logs of ILI related Google Searches 2003-2008
            – Dependent Variables: % US ILI Visits in week reported to CDC
            – Explanatory Variables: ILI-related search queries
            – Model: logit(P) = b0 + b1 x logit(Q) + e where P is % visits and Q is normalized number of queries
            – Results: Mean correlation of .90 between P & Q for 9 CDC US healthcare regions.
        • Lampos & Cristianini, “Tracking the flu pandemic by monitoring the Social Web” (Jun-10; Type: Tweets)
            – Data Source: Twitter: 24 weeks of Twitter corpus in UK from June 09 to Dec 09
            – Dependent Variables: Weekly ILI UK Health Protection Agency, smoothed for daily values
            – Explanatory Variables: Average daily "flu-score" for all tweets. Flu score for single tweet is proportion of all ILI "stem" markers (ngrams) that appear in the tweet.
            – Model: Linear least squares regression time series of HPA flu rates on aggregated tweet flu-scores
            – Results: Average correlation for 5 UK health care regions was about .92.
        • Culotta, “Towards detecting influenza epidemics by analyzing Twitter messages” (Jul-10; Type: Tweets)
            – Data Source: 575K flu-related Twitter messages and % ILI reports from CDC for period Feb 2010 to Apr 2010.
            – Dependent Variables: % ILI weekly reports from CDC for specified period
            – Explanatory Variables: % of messages reporting an ILI or related symptom (based on detailed classification and statistical procedures to determine whether ILI or not)
            – Model: Compares simple logit model with multiple regression model having different counts for separate keywords & phrases
            – Results: Aggregating keyword frequencies using separate keywords (multi-reg) works better than single Tweet aggregated (simple logit) predictor. Simple BOW classifier can be used to filter ILI messages. Achieved r=.78 for 5 weeks of validation data.
    56. Prediction: Infodemiology Studies
        • Acherekar, “Predicting Flu Trends using Twitter Data” (Apr-11; Type: Tweets)
            – Data Source: ILI "influenza" visits to Sentinel physicians reported weekly by CDC from Oct 2009 to Oct 2010 and tweets mentioning "flu-like" symptoms during same time period
            – Dependent Variables: ILI influenza visits
            – Explanatory Variables: Previous weeks' ILI visits and aggregate number of tweets mentioning flu-like symptoms
            – Model: Auto-regression model with: ILI(t) = a1*ILI(t-2) + b*Tweets(t) + e
            – Results: Analysis shows time-lagged ILI is not a strong predictor but the aggregate number of tweets with flu-like symptoms is very strong with an r=.9846.
        • Chan et al., “Using Web Search Query Data to Monitor Dengue Epidemics: A New Model for Neglected Tropical Disease Surveillance” (May-11; Type: Search)
            – Data Source: Google Queries in select countries from 2003-10 including Bolivia, Brazil, India, Indonesia and Singapore
            – Dependent Variables: Official weekly dengue case count from Ministry of Health or WHO
            – Explanatory Variables: Computed daily counts of queries referencing Dengue fever for separate countries
            – Model: O = b0 + b1*Q + e where O is official weekly count of cases in specified countries and Q is dengue related search query in those countries
            – Results: Very strong correlation (~.9) except for Singapore (.82).
        • Signorini et al., “The Use of Twitter to Track Levels of Disease Activity and Public Concern in the US during the Influenza A H1N1 Pandemic” (May-11; Type: Tweets)
            – Data Source: Several panels of tweets and ILI reports. Prediction panel includes weekly ILI %s and US tweets from Oct 2009 thru May 2010.
            – Dependent Variables: % ILI weekly reports from CDC for specified period. Reports include nationwide and 10 regional reports.
            – Explanatory Variables: Fraction of influenza related tweets for total US and CDC health regions. Utilizes complex Support Vector Machine using term-frequency statistics.
            – Model: Support Vector Regression utilizing SVM feature sets (collection of terms occurring 10X per week) to estimate weekly ILI.
            – Results: Strong SVR fit for both national and regional figures.
        • Paul, “You are what you tweet: Analyzing Twitter for Public Health” (Jul-11; Type: Tweets)
            – Data Source: Covers a number of ailments including influenza. Based on 2 billion tweets from May 2009 to October 2010.
            – Dependent Variables: % ILI weekly reports from CDC for specified period
            – Explanatory Variables: Utilizes an Ailment Topic Aspect Model (ATAM) to determine the type of ailment associated with a tweet (in this case Flu)
            – Model: Correlation between the probability of the "flu ailment" designation for each week and the ILI rate in the US.
            – Results: Two types of ATAM/ATAM+ models were compared. The correlations with ILI were .934 and .958.
        • Lampos et al., “Flu detector - Tracking epidemics on Twitter” (Sep-12; Type: Tweets)
            – Data Source: Twitter: 40 weeks of Twitter corpus in UK from June 09 to Mar 10
            – Dependent Variables: Weekly ILI UK Health Protection Agency, smoothed for daily values
            – Explanatory Variables: Sum across all tweets of daily "flu-score" for each tweet based on number of ILI "stem" markers that appear in the tweet.
            – Model: Linear least squares regression time series of HPA flu rates on aggregated tweet flu-scores
            – Results: Correlations around .90 for 3 UK health care regions.
    57. Infodemiology: Example Prediction Analysis
        “Nowcasting Events from the Social Web with Statistical Learning,” Lampos and Cristianini, ACM IS&T, 9/11
        • Twitter API → 50M Tweets, 5 UK Regions, 6’09-12’09
        • N-Gram Stems, ILI Markers → Avg. Weekly Wght. Flu-Score by Region (time t)
        • Compared with Weekly ILI Reports by health region (time t): r ~.89-.96
    58. Infodemiology: Example Prediction Analysis
        • Corpus: 50M Tweets, 5 Region UK, 6/09-4/10
        • Corpus Refinement: Lower Case, Stop Words, Tokens, Stems
        • Feature Selection: 1-Grams, 2-Grams, Hybrid Grams
        • Tweet-Hybrid Matrix: For each tweet, number of times each hybrid occurs times hybrid weights. The sum of the weighted values equals the flu-score for the tweet.
    59. Infodemiology: Example Prediction Analysis
        • flu :(
        • RT @OMGFacts: The US death toll from the 1918 flu epidemic was so high that it created a coffin shortage
        • News_SwineFlu: #swineflu Delayed treatment for swine flu can lead to... http://t.co/ZSOjIvAW
        • The last thing I want right now is the flu
        • Much better with that lovely message from u,thank u! (heavy flu since 4 days ;
        • Supposedly only about 1% of people in the world present flu-like symptoms after flu shot. Unfortunately 66% of the people in my house do
        • Back on track...Fuck you flu!
        • Sounds like either the flu or Killian walked into the room.
        • Ill try haha, I got the flu!
        • It costs an average company $135 a day for every employee sick at home with the flu. Encourage your employees to get vaccinated....
        • i wouldnt recommend it, but a couple of days with violent stomach flu does make you appreciate the small things, natures reset
        • Is it possible to get the flu from the flu jab?
        • Sedatives + cold/flu + ana = cant get up without having to hold a wall for 2 minutes",
    60. Infodemiology: Example Prediction Analysis (3)
        Bigram                      Frequency    Bigram                  Frequency
        (flu, shot)                 11           (common, cold)          2
        (flu, season)               8            (man, flu)              2
        (flu, symptom)              5            (southeast, asia,)      2
        (symptom, busters)          4            (please, go)            2
        (mtm, blog)                 4            (headache, since)       2
        (sore, throat)              4            (get, flu)              2
        (immune, system)            4            (every, single)         2
        (season, \u2013)            4            (can't, get)            2
        (every, symptom)            3            (gonna, get)            2
        (since, last)               3            (stuffy, head)          2
        (flu, jab)                  3            (fever,, flu,)          2
        (sore, throat,)             3            (feeling, like)         2
        (feel, like)                3            (symptom, checker)      2
        (jonas, brothers)           3            (runny, nose)           2
        (flu, shots)                2            (riva, offering)        2
        (+, sore)                   2            (flu, pills)            2
        (like, i'm)                 2            (#nutrition, foods)     2
        (didn't, go)                2            (cure, #colds)          2
        (don't, feel)               2            (cold,, flu)            2
        (offering, complimentary)   2            (i'm, gonna)            2
        (flu-like, symptom)         2            (ear, infection)        2
        (asia,, millions)           2            (flu,, sore)            2
    61. Infodemiology: Example Prediction Analysis (3)
        1-Grams, 2-Grams, Hybrids
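
A bigram frequency table like the one on slide 60 can be produced by pairing each token with its successor and counting the pairs. The sketch below uses only the standard library; the tokenized tweets are invented for illustration and the function name is an assumption.

```python
from collections import Counter


def bigrams(tokens):
    """Adjacent token pairs, in order of appearance."""
    return list(zip(tokens, tokens[1:]))


# Hypothetical, already-tokenized tweets.
tokenized_tweets = [
    ["get", "flu", "shot", "today"],
    ["flu", "shot", "sore", "throat"],
    ["sore", "throat", "flu", "season"],
]

bigram_counts = Counter(
    pair for tokens in tokenized_tweets for pair in bigrams(tokens)
)
top_bigrams = bigram_counts.most_common(2)
print(top_bigrams)
```

Bigrams are counted within each tweet only, so the last token of one tweet is never paired with the first token of the next.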
    62. Infodemiology: Example Prediction Analysis
        Weekly ILI Reports by health region (time t) = b0 + b1 x Avg. Weekly Wght. Flu-Score by Region (time t) + e

        Tweet-hybrid matrix (Nij = number of times hybrid j occurs in tweet i; wgtj = weight of hybrid j):
        Tweets    Week  Hybrid 1   Hybrid 2   Hybrid 3   …  Hybrid n   Flu-Score  Indep Var
        Tweet1    1     wgt1*N11   wgt2*N12   wgt3*N13   …  wgtn*N1n   Sum1 W*N
        Tweet2    1     wgt1*N21   wgt2*N22   wgt3*N23   …  wgtn*N2n   Sum2 W*N
        Tweet3    1     wgt1*N31   wgt2*N32   wgt3*N33   …  wgtn*N3n   Sum3 W*N   Aver Wk-1
        …         …     …          …          …          …  …          …          …
        Tweet m   24    wgt1*Nm1   wgt2*Nm2   wgt3*Nm3   …  wgtn*Nmn   Summ W*N   Aver Wk-m
    63. Infodemiology: Example Prediction Analysis
        http://geopatterns.enm.bris.ac.uk/epidemics/
    64. Programming Text Mining for Prediction with Python
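
The flu-score and regression steps on slide 62 can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the marker weights, the weekly averages and the ILI rates are invented, markers are single tokens rather than hybrid n-grams, and the fit is ordinary least squares on one explanatory variable.

```python
def flu_score(tokens, weights):
    """Sum over markers of weight times number of occurrences in the tweet."""
    return sum(w * tokens.count(marker) for marker, w in weights.items())


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


# Hypothetical marker weights and one tokenized tweet.
weights = {"flu": 0.5, "fever": 0.25}
score = flu_score(["flu", "fever", "flu"], weights)

# Hypothetical weekly average flu-scores and weekly ILI rates.
weekly_avg_score = [1.0, 2.0, 3.0, 4.0]
weekly_ili_rate = [3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_line(weekly_avg_score, weekly_ili_rate)
print(score, b0, b1)
```

In the slide's setup the explanatory variable for each week is the average of the per-tweet scores in that week, so `weekly_avg_score` stands in for the "Aver Wk" column.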
    65. Text Mining for Prediction: Programming with Python & NLTK
        # Utilizes "nltk", a Python "natural language toolkit"
        # Step 1: Initialize modules, stopwords and stemmer
        import simplejson
        import urllib
        import re
        import nltk
        from nltk.corpus import stopwords

        # Load the English stopword list once, under a name that does not
        # shadow the imported "stopwords" module
        english_stopwords = stopwords.words('english')
        porter = nltk.PorterStemmer()

        def remove_stopwords(text):
            # Keep only the words that are not English stopwords
            content = [w for w in text if w.lower() not in english_stopwords]
            return content
    66. Text Mining for Prediction: Programming with Python & NLTK
        # Step 2: Utilize Twitter Search API to collect "flu" tweets in JSON format.
        # Then extract the "text" fields associated with each tweet.
        itemsFlu = []
        for page in range(1, 15):
            # Request result pages 1..14, 100 tweets per page
            urlFlu = 'http://search.twitter.com/search.json?q=flu&rpp=100&page=' + str(page)
            resultFlu = simplejson.load(urllib.urlopen(urlFlu))
            itemsFlu = itemsFlu + resultFlu['results']
        tweetsFlu = [item['text'] for item in itemsFlu]

        # Step 3: Eliminate retweets
        FluTweetsNoRTs = []
        for tweetText in tweetsFlu:
            if not re.search('RT ', tweetText):
                FluTweetsNoRTs.append(tweetText)
    67. Text Mining for Prediction: Programming with Python & NLTK
        # Step 3 (continued): Preparing to do text analysis
        noTweets = 0
        Flu_tmp_word = []; Flu_tot_word = []
        Flu_tmp_low = []; Flu_tot_low = []
        Flu_tmp_alpha = []; Flu_tot_alpha = []
        Flu_tmp_stop = []; Flu_tot_stop = []
        Flu_tmp_stem = []; Flu_tot_stem = []
        Flu_stems_dict = {}  # dictionary holding the stems of tweets 1..N

        # Step 4: Text processing. Produces lowercase, alpha stems devoid of stopwords
        for tweet in FluTweetsNoRTs:
            noTweets = noTweets + 1
            # Split the tweet into words
            Flu_tmp_word = [w for w in tweet.split()]
            Flu_tot_word = Flu_tot_word + Flu_tmp_word
            # Normalize to lowercase
            Flu_tmp_low = [w.lower() for w in Flu_tmp_word]
            Flu_tot_low = Flu_tot_low + Flu_tmp_low
            # Keep purely alphabetic words of two or more letters
            Flu_tmp_alpha = [cv for w in Flu_tmp_low for cv in re.findall(r'^[a-z]+[a-z]+$', w)]
            Flu_tot_alpha = Flu_tot_alpha + Flu_tmp_alpha
            # Remove stopwords, then stem
            Flu_tmp_stop = remove_stopwords(Flu_tmp_alpha)
            Flu_tot_stop = Flu_tot_stop + Flu_tmp_stop
            Flu_tmp_stem = [porter.stem(t) for t in Flu_tmp_stop]
            Flu_tot_stem = Flu_tot_stem + Flu_tmp_stem
            Flu_stems_dict[noTweets] = Flu_tmp_stem
    68. Text Mining for Prediction: Programming with Python & NLTK
        # Step 5: Analyzing frequency distributions
        fdist_Flu_stems = nltk.FreqDist(Flu_tot_stem)
        fdist_Flu_stems.plot(25)
        fdist_Flu_stems.items()[0:25]

        # Step 6: Produce "wc" (a dictionary holding the counts associated
        # with each word), used to build the doc-term matrix.
        wc = {}
        for dCnt in range(1, len(Flu_stems_dict) + 1):  # tweet keys run 1..N
            wlist = Flu_stems_dict[dCnt]
            for word in wlist:
                wc.setdefault(word, 0)
                wc[word] += 1
    69. Text Mining for Prediction: Programming with Python & NLTK

        # Step 7: Eliminate all words that occur less than 3 times
        apcount = {}
        for word, wcnt in wc.items():
            if wcnt >= 3:  # keep only words seen at least 3 times
                apcount[word] = wcnt

        # Step 8: Write out doc-term matrix to a file
        out = open('flu-doc-matrix.txt', 'w'); out.write('Tweets')
        for word, count in apcount.items():
            out.write(","); out.write(word)
        out.write('\n')  # end of header row
        for tweetno, twlist in Flu_stems_dict.items():
            tweetname = "Tweet" + str(tweetno); out.write(tweetname)
            for word, count in apcount.items():
                if word in twlist: out.write(","); out.write("1")
                else: out.write(","); out.write("0")
            out.write("\n")  # end of this tweet's row
        out.close()
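The slide code targets Python 2 and a Twitter search endpoint that may no longer be reachable, so the following is a minimal Python 3 sketch of the same idea on a few in-memory tweets. It is an illustration under stated assumptions, not the presenter's code: `STOPWORDS` is a toy list, `sample_tweets`, `tokenize` and `doc_term_matrix` are hypothetical names, and the Porter stemming step is left out to keep the sketch free of external libraries.

```python
import re
from collections import Counter

# Toy stopword list (an assumption for illustration; the slides use an NLTK-based helper).
STOPWORDS = {"the", "a", "i", "have", "is", "and", "my", "got", "to"}


def tokenize(tweet):
    """Lowercase, keep tokens of two or more letters only, drop stopwords."""
    words = [w.lower() for w in tweet.split()]
    alpha = [w for w in words if re.fullmatch(r"[a-z]{2,}", w)]
    return [w for w in alpha if w not in STOPWORDS]


def doc_term_matrix(tweets, min_count=3):
    """Return (vocabulary, binary matrix) with one row per non-retweet."""
    # Skip retweets using the same simple "RT " test as the slides.
    docs = [tokenize(t) for t in tweets if "RT " not in t]
    counts = Counter(w for doc in docs for w in doc)
    # Keep only words that occur at least min_count times overall.
    vocab = sorted(w for w, c in counts.items() if c >= min_count)
    matrix = [[1 if w in doc else 0 for w in vocab] for doc in docs]
    return vocab, matrix


# Hypothetical sample data standing in for collected tweets.
sample_tweets = [
    "I have the flu and a fever",
    "RT @bob: flu season is here",
    "Flu shot today, no fever yet",
    "my cough and fever got worse, flu?",
]

vocab, matrix = doc_term_matrix(sample_tweets, min_count=2)
for row in matrix:
    print(dict(zip(vocab, row)))
```

Tokens with attached punctuation (such as "flu?") are dropped by the alphabetic filter, which mirrors the behaviour of the regular expression in Step 4.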
    70. Text Mining for Prediction: Programming with R, tm and RTextTools
    71. Text Mining for Prediction: Programming with R, tm and RTextTools

        # initialize libraries
        library(twitteR)
        library(tm)
        library(RTextTools)
        library(plyr)
        library(stringr)

        # utilize Twitter Search API to collect "flu" Tweets
        # in JSON format
        fluTweets = searchTwitter("flu", n=1500, lang="en")

        # extract text from JSON objects
        fluTweetText = laply(fluTweets, function(t) t$getText())

        # remove retweets
        indxFluRetweets <- grep("RT", fluTweetText)
        indxFluTweetText <- c(1:length(fluTweetText))
        indxFluTweetNonRTs <- !(indxFluTweetText %in% indxFluRetweets)
        fluTweetTextNonRTs <- fluTweetText[indxFluTweetNonRTs]

        # text preprocessing function
        toLowerCaseAlpha <- function(x){
          removeHTTP <- gsub("http:[/a-zA-Z0-9._]+", "", x)
          removeNonAlpha <- gsub("[^a-zA-Z]", " ", removeHTTP)
          removeMultipleSpaces <- gsub(" +", " ", removeNonAlpha)
          changeToLowerCase <- tolower(removeMultipleSpaces)
          return(changeToLowerCase)
        }

        # clean tweet text, convert lower case, remove stop words
        # create corpus for analysis
        # (corpusStopwords is not defined on this slide)
        fluTweetTextCleaned <- toLowerCaseAlpha(fluTweetTextNonRTs)
        corpusFluTweets <- Corpus(VectorSource(fluTweetTextCleaned))
        corpusFluText <- tm_map(corpusFluTweets, removeWords, corpusStopwords)
    72. Text Mining for Prediction: Programming with R, tm and RTextTools

        # function to convert tokens/words to stems
        convertToStems <- function(x){
          tokens <- strsplit(x, " +")
          listToVect <- unlist(tokens)
          stems <- wordStem(listToVect)
          return(stems)
        }

        # convert to stems and create new corpus
        FluStems <- sapply(corpusFluText, convertToStems)
        corpusFluStems <- Corpus(VectorSource(FluStems))

        # create document-term matrix
        corpusFluStemsDTM <- DocumentTermMatrix(corpusFluStems)

        # create vector of individual stems and the associated
        # frequencies with which they occur across all tweets
        fluStemsDTMMat <- as.matrix(corpusFluStemsDTM)
        fluStemsDTMMatSorted <- sort(colSums(fluStemsDTMMat))
    73. Text Mining for Prediction: Programming with R, tm and RTextTools

        # plot frequency distribution of stems
        noOfPlotPnts = 50
        numOfStems = length(fluStemsDTMMatSorted)
        plotStart = numOfStems - noOfPlotPnts + 1
        plotEnd = numOfStems
        # the vector is sorted in ascending order, so the last 50 entries are the most frequent stems
        top50FluStems = fluStemsDTMMatSorted[plotStart:plotEnd]
        barplot(top50FluStems, horiz=TRUE, cex.names=0.5, space=.5, las=1,
                xlim=c(0,100), main="Freq of Terms")
    74. Text Mining for Prediction: Programming with R, tm and RTextTools

        # create wordcloud of stems based on frequency
        library(wordcloud)
        library(RColorBrewer)
        fluStemNames <- names(fluStemsDTMMatSorted)
        dfFluStemDTMMat <- data.frame(word=fluStemNames, freq=fluStemsDTMMatSorted)
        pal2 <- brewer.pal(8, "Dark2")
        wordcloud(dfFluStemDTMMat$word, dfFluStemDTMMat$freq, min.freq=10, colors=pal2)
    75. Infodemiology is a form of Weather in the next 6 hours
    76. Now + Forecasting: Predicting the present by analyzing large volumes of data that can be used to "forecast" current events for which official analysis has not been released
    77. Basic Regression Analysis

        $y_t = b_0 + b_1\, y_{t-n} + b_2\, x_{t-n} + b_3\, s_t + e$

        where
        $y_t$: dependent variable at time t (standard publicly available measure)
        $y_{t-n}$: dependent variable at time t - n (standard publicly available measure)
        $x_{t-n}$: traditional, publicly available explanatory variable at time t - n
        $s_t$: aggregate search index or social media frequency count at time t
        $e$: error term
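As a rough illustration of the regression on this slide, the sketch below fits the current value of an official measure on its own lagged value, a lagged traditional explanatory variable, and a same-period search or social media index. It uses only the standard library and synthetic, noise-free series; the names `official`, `traditional` and `search` and all numeric values are assumptions made up for the example, not data from any of the cited studies.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_ols(rows, targets):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def nowcast_design(official, traditional, search, lag):
    """Rows [1, official[t-lag], traditional[t-lag], search[t]] with target official[t]."""
    rows = [[1.0, official[t - lag], traditional[t - lag], search[t]]
            for t in range(lag, len(official))]
    targets = [official[t] for t in range(lag, len(official))]
    return rows, targets


# Synthetic series (assumed values): the official measure is generated from
# known coefficients so the fit can be checked by eye.
traditional = [float((3 * t) % 7) for t in range(30)]
search = [float((7 * t) % 11) for t in range(30)]
official = [5.0]
for t in range(1, 30):
    official.append(2.0 + 0.5 * official[t - 1] + 0.2 * traditional[t - 1] + 0.3 * search[t])

rows, targets = nowcast_design(official, traditional, search, lag=1)
coeffs = fit_ols(rows, targets)
print([round(c, 3) for c in coeffs])  # intercept, lagged official, lagged traditional, search
```

With real data the series would contain noise, and a statistics package would normally be used to obtain standard errors and compare the model against a benchmark that omits the search or social media term.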
    78. Examples

        Authors: Kholodilin et al.
        Title: Do Google Searches Help in Nowcasting Private Consumption? A Real-Time Evidence for the US
        Date (M-Y): Apr-10
        Type: Search
        Data Source: Federal Reserve data on US private consumption and related Google search terms from Jan 05-Dec 09.
        Dependent Variables: Year-on-year growth rate of monthly US real private consumption, ALFRED database of the Federal Reserve Bank of St. Louis
        Explanatory Variables: 220 Google Trend/Insights search terms related to private consumption, reduced to 10 principal components for monthly periods from Jan 2005 to Dec 2009
        Model: Y-o-Y monthly URPC growth rates for 3 sets of regressors -- Sentiment (consumer sentiment and confidence); Financial (short term and long term interest rates and S&P 500); Query (combinations of principal components of query terms)
        Results: Query term principal components outperform standard Sentiment and Financial indicators. A combination of two of the factors works best -- those related to mobility and health care consumption.

        Authors: Sakaki et al.
        Title: Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors
        Date (M-Y): Apr-10
        Type: Tweets
        Data Source: Earthquake occurrences/intensity and tweets containing the words "earthquake" or "shaking" in Japan from Aug 09-Sep 09
        Dependent Variables: Occurrence, intensity and location of an earthquake in Japan
        Explanatory Variables: Tweets that contain the query words "earthquake" and/or "shaking", by location
        Model: Utilized a support vector machine (SVM) to determine whether a tweet report refers to an earthquake occurrence. The reports are then matched against actual occurrence at a particular location to see if it is detected within 1 min of occurrence.
        Results: Of those earthquakes occurring in the Aug-Sept time frame in Japan, 96% of 24 quakes above an intensity of 3 were reported in a tweet. Of the 24 quakes, 80% were reported within a minute of occurrence. This is much faster than the reports issued by the Japan Meteorological Agency.

        Authors: Carrière-Swallow & Labbé
        Title: Nowcasting With Google Trends in an Emerging Market
        Date (M-Y): Jul-10
        Type: Search
        Data Source: Auto sales and Google search indices for specific automobiles in Chile from 2005 thru 2010.
        Dependent Variables: % year-on-year change in auto sales in Chile
        Explanatory Variables: Google Search index of interest in automobile purchases in Chile
        Model: % change in y-on-y auto sales in Chile regressed against a composite auto Google Search index based on queries about 9 leading automobile manufacturers in Chile.
        Results: Despite relatively low rates of Internet usage in Chile, models incorporating the Google Trends Automotive Index outperform benchmark specifications in both in-sample and out-of-sample nowcasts of y-on-y % changes in auto sales while providing substantial gains in information delivery times.

        Authors: Ciulla
        Title: Beating the news using social media: the case study of American Idol
        Date (M-Y): May-12
        Type: Tweets
        Data Source: Tweets from US users that related to American Idol contestants and were tweeted during the voting time window for each episode during the 11th season from Jan 12 - May 12
        Dependent Variables: Contestants who were eliminated or won the final episode
        Explanatory Variables: Number of Tweets with hashtags signifying contestants
        Model: Contestants with the fewest mentions predicted to be the candidate eliminated.
        Results: In general the simple tweet frequencies are a strong predictor of the contestant who will be eliminated.

        Authors: Chadwick & Sengul
        Title: Nowcasting Unemployment Rate in Turkey: Let's Ask Google
        Date (M-Y): Jun-12
        Type: Search
        Data Source: Monthly Turkey non-agricultural unemployment rate from Jan 05-Dec 11
        Dependent Variables: Monthly Turkey non-agricultural unemployment rate
        Explanatory Variables: Google Search Index in Turkey for terms directly ("looking for job") or indirectly (job announcements) related to unemployment.
        Model: Linear autoregression models and Bayesian Model Averaging procedure to investigate whether Google search query data can improve
        Results: Models with Google Search Indicators perform better in nowcasting the 1 period, 2 periods and 3 periods ahead unemployment rate than the benchmark where we use only the lag values of the unemployment rate.

        Authors: Song, Pan, Ng
        Title: Forecasting hotel room demand using search engine data
        Date (M-Y): Sep-12
        Type: Search
        Data Source: Weekly data on hotel bookings in Charleston, SC and Google trend data for specific travel tourism search terms from Jan 08-Aug 09
        Dependent Variables: Weekly hotel bookings in Charleston, SC
        Explanatory Variables: Indexed search volumes from Google Trends/Insights, Jan 2008-Aug 2009
        Model: Log of Room Nights for Log of Search Volumes - Charleston, Travel Charleston, Charleston Hotels, Charleston Restaurants, Charleston Tourism
        Results: Tested various statistical models; all gave reasonable forecasts. Best fit model was Autoregressive Distributed Lag (ADLM) with a lag period of 6 weeks.

        Authors: McLaren, Shanbhogue
        Title: Using internet search data as economic indicators
        Date (M-Y): Q2-11
        Type: Search
        Data Source: UK monthly unemployment data and housing price growth from June 04-Jan 11 associated with Google Trend query indices for job seeking and unemployment
        Dependent Variables: Official monthly unemployment data and housing price growth in the UK from June 2004-Jan 2011
        Explanatory Variables: Google Trend/Insight query indexes for the term "Job Seekers Allowance (JSA)" for unemployment and "Estate Agents" for housing price growth
        Model: For unemployment, linear AR model with query term, claimant count, and GfK consumer confid. as exp vars; for housing price growth, with query term, Home Builders and Royal Instit. of Chartered Surveyors balances as exp vars.
        Results: For unemployment forecasts, claimant count strongest followed by query term. For housing prices, the query term was much stronger than HBF and RICS data.
    79. Examples

        Authors: Gruhl et al.
        Title: The Predictive Power of Online Chatter.
        Date (M-Y): Aug-05
        Type: Blogs
        Data Source: Amazon Sales Rank for best selling books from Jul 04-Aug 04 and number of mentions in blogs for same time period
        Dependent Variables: Amazon Sales Rank for 2340 bestselling books in 4 month period (Jul 2004-Aug 2004) and spikes in these sales ranks
        Explanatory Variables: Number of mentions of the book/author in over 300K blogs whose postings were maintained by IBM's WebFountain project (over 200K postings/day)
        Model: Cross correlation of time series for sales rank and mentions.
        Results: While sales rank is a poor predictor of the change in sales rankings, a prior spike in mentions predicts quite well a future spike in sales rank.

        Authors: Choi, Varian
        Title: Predicting the Present with Google Trends
        Date (M-Y): Apr-09
        Type: Search
        Data Source: Monthly data for search from Google Trends associated with various retail sales and travel data from Jan 04-Aug 08
        Dependent Variables: US Census Bureau Advance Monthly Retail Sales (general and specific) and Travel (visitor arrival in Hong Kong)
        Explanatory Variables: Google Trend/Insight query indices for categories and subcategories related to retail sales (general and specific) and related to Travel
        Model: Google Trend indices for query subcategories related to (log values) of overall monthly retail trade (NAICS categories), automotive sales, home sales and travel.
        Results: Simple seasonal AR models and fixed-effects models that include relevant Google Trend variables tend to outperform models that exclude these variables. In some cases small gains, in others substantial.

        Authors: Suhoy
        Title: Query Indices and a 2008 Downturn: Israeli Data
        Date (M-Y): Jul-09
        Type: Search
        Data Source: Monthly official economic growth data for the months and quarters from 2nd quarter 2004 to 2nd quarter 2009 along with various Google search indices during the same period
        Dependent Variables: Monthly percent changes of various real values of industrial production, retail trade, revenue of trade and services, consumer imports, exports of services and the employment rate
        Explanatory Variables: 30 Google Search Index categories related to consumption and employment
        Model: Bayesian probabilities of downturn calculated by Hamilton's two-state Markov switching AR(0) model. Used to determine whether changes in query indices can predict changes in official economic variables.
        Results: Six leading query categories, including HR (recruiting and staffing), home appliances, travel, real estate, food and drink and beauty, contain cyclical components which conform with cycles of economic growth. The strongest relationship was between HR and unemployment.

        Authors: Sadikov et al.
        Title: Blogs as Predictors of Movie Success
        Date (M-Y): Aug-09
        Type: Blogs
        Data Source: Weekly movie box office sales, gross sales, and critic and user rankings from Nov 07-Nov 08 and counts for movie references and sentiment measures for same time period.
        Dependent Variables: Movie critic ranking, user ranking, 2008 gross sales, weekly box office sales (weeks 1-5)
        Explanatory Variables: Analysis of spinn3r.com blog data set 11/07-11/08, counting movie references and sentiment within specified time window before and after movie release date.
        Model: Linear regression for weekly rankings and sales data by blog references and sentiment.
        Results: Minimal correlation between rankings and references and sentiment. Strong correlation between references and gross sales but weak with sentiment. Strongest relationships with timing of references in weeks after release.

        Authors: Zhang & Skiena
        Title: Improving Movie Gross Prediction Through News Analysis
        Date (M-Y): Sep-09
        Type: Blogs & News
        Data Source: Online news stories & blogs 1960-2008 along with movie receipt and IMDB data. News stories analyzed for various pre-release time periods and mentions of movie titles, director, top 3 & top 15 actors
        Dependent Variables: Gross receipts from movies
        Explanatory Variables: Variety of IMDB variables (e.g. movie genre), movie budget, number of first week theaters, number of stories and
        Model: Linear regression of receipts for various combinations of explanatory variables. Also, K-NN nearest neighbor analysis determining factors associated with gross receipts
        Results: Number of news articles mentioning the movie 1 week prior have highest correlation (~.7) and predictive ability

        Authors: Wu & Brynjolfsson
        Title: The Future of Prediction: How Google Searches Foreshadow Housing Prices and Sales
        Date (M-Y): Dec-09
        Type: Search
        Data Source: Quarterly Housing Sales and Housing Price Index for 50 US states along with the Google Search Index for Real Estate, Real Estate Agencies, and Home Appliances from 4th Quarter 2007 to 2nd Quarter 2007
        Dependent Variables: Housing Sales and Housing Price Index (HPI)
        Explanatory Variables: Google Search Index for Real Estate, Real Estate Agencies, and Home Appliances
        Model: Linear autoregression between Housing Sales and prior sales, the HPI, and Search Indices for Real Estate and Real Estate Agencies as well as the same regression for the HPI
        Results: Strong predictive relationships between Housing Sales and searches for Real Estate Agencies. Similar relationships for HPI.

        Authors: Asur, Huberman
        Title: Predicting the Future with Social Media
        Date (M-Y): Mar-10
        Type: Tweets
        Data Source: 3 million tweets mentioning 24 movies from Nov 09-Feb 10 along with associated box-office revenues
        Dependent Variables: Box-office revenues for (24) movies
        Explanatory Variables: Promotion tweets-retweets for a particular movie, tweet rates for particular movie per hour, ratio of positive to negative sentiments for the movie
        Model: Regression of 1st weekend box office revenues by promotional tweets-retweets, by tweet rates vs. Hollywood Stock Exchange prices, and 2nd weekend revenues by tweet rates and the sentiment ratio.
        Results: Promotional tweets are weakly correlated with 1st weekend revs. Tweet rates are very strongly correlated (min .9) and a stronger predictor than HSX. Finally, tweet rates are strongly correlated with 2nd weekend revenues and sentiments improve the forecasts slightly.
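Several of the studies listed above relate an online signal to an outcome that follows it after some delay, for example the cross correlation of blog mentions and sales rank. The sketch below shows one simple way to look for such a lead, by correlating a signal series with an outcome series shifted by 0, 1, 2, ... periods. It is an illustration only: `mentions` and `sales` are made-up series, the function names are hypothetical, and none of the cited papers is claimed to use exactly this procedure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def lagged_correlations(signal, outcome, max_lag):
    """Correlate signal[t] with outcome[t + lag] for lag = 0 .. max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        xs = signal[:len(signal) - lag]
        ys = outcome[lag:]
        result[lag] = pearson(xs, ys)
    return result


# Made-up data: sales follow mentions with a delay of two periods.
mentions = [1, 3, 2, 8, 4, 2, 9, 3, 1, 5, 7, 2]
sales = [0, 0] + [10 * m for m in mentions[:-2]]

corrs = lagged_correlations(mentions, sales, max_lag=4)
best_lag = max(corrs, key=corrs.get)
print(best_lag, round(corrs[best_lag], 3))
```

A high correlation at a positive lag only suggests that the signal leads the outcome; the regression-based studies in the table go further by testing whether the signal improves forecasts over a benchmark model.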
