What The Future Holds For Social
Media Data Analysis
Predictive analytics using Twitter data
Peter Wlodarczak wlodarczak@gmail.com
Agenda
 Introduction
 Research methodology
 Applications
 Challenges
 Conclusions
Introduction I
 Shift from publisher-generated to user-
created content
 90% of the content on the Internet is now
user generated (Graham et al. 2011)
 Unprecedented amount of opinionated
data on the Internet
 Online social networks (OSN) are one of
the biggest data sources of the internet
(Oboler, Welsh & Cruz 2012)
Introduction II
 Opinions can be expressed on the
Internet without programming knowledge
(Web 2.0)
 Opinions are key influences of human
behavior
 People increasingly consult the Internet
before making decisions
Introduction III
 OSN give new insights into people's
opinions, interests and views
 Social networking Web sites are amassing vast
quantities of data
 Computational social science is providing tools to
process this data (Oboler, Welsh & Cruz 2012)
 Social computing, a new paradigm of computing
and technology development, has become a
central theme across a number of information and
communication technology fields (Wang et al.
2007, p. 79)
Introduction IV
 Growing interest in Social Media Mining
(SMM) in the market
 Gnip, Klout, DataSift and Sprout Social specialize
in SM data analysis
 Apple bought Topsy for 200 million US dollars
(Harris 2013)
 TV stations buy Facebook data to see how
popular their shows are (Rusli 2013)
 No surveys necessary
Introduction V
 Research in the area of computational
social science and Big Data
 Social computing is a cross-disciplinary research
and application field with theoretical underpinnings
including both computational and social sciences
(Wang et al. 2007, p. 80)
 Big Data is the ability of society to harness
information in novel ways to produce useful
insights or goods and services of significant value
(Mayer-Schonberger & Cukier 2013, p. 2)
Introduction VI
 Analyzing data to:
 Understand the underlying structure of it
and gain knowledge
 Make predictions from new, unseen
examples
Introduction VII
 Current behavior is an indicator of future
decisions
 New area of research: predictive
analytics
 Machine learning techniques used for
prediction
 Learning from experience, “data”, to predict
future behavior of individuals
 Support decision making process
Introduction VIII
 Big Data
 Big Data is usually defined by the three
V’s: volume, velocity and variety (Klein,
Tran-Gia & Hartmann 2013, p. 320)
 High volume
 Created at high velocity
 Structured, semi-structured and unstructured
Introduction IX
 Big Data principles
 No sample selection, all data analysed
 Data doesn’t have to be of high quality
 Structured and unstructured data
Introduction X
 Data mining
 Techniques for finding and describing
structural patterns in data
 Tool for helping to explain that data and
make predictions from it (Witten, Frank &
Hall 2011, p. 8)
 Used to
 gain knowledge
 make predictions
Introduction XI
 Data analysis steps
 Analyze mood by means of sentiment
analysis
 Create time series and correlate them to real-
world phenomena
 Make predictions based on new data
 Support decision making process
Introduction XII
 Social Media data has been analysed to
predict
 Financial indicators (Bollen, Mao & Zeng
2010)
 Elections (Tumasjan et al. 2011)
 Box office revenue (Asur & Huberman 2010)
 Disease outbreak (Achrekar et al. 2011)
 Natural disasters (Sakaki, Okazaki and
Matsuo 2010)
Research methodology I
 Predictive analysis of Social Media
consists of two phases
 Data conditioning phase
 Predictive analysis phase
Research methodology II
Data Conditioning Phase
 Collection and filtering of raw data
   Determination of time window
   Selection of search terms
   Selection of data extraction method
 Computation of predictor variables
   Selection of prediction variables
   Measurement of predictor variables
Predictive Analysis Phase
 Creation of predictive model
   Selection of predictive method
   Identification of data for evaluation of prediction
 Evaluation of the predictive performance
   Selection of the evaluation method
   Specification of the prediction baseline
Analysis phases
Research methodology III
Input and output variables
 Input: Twitter sentiments, expressed as binary sentiment classification
 Input: Share price, expressed in dollars
 Output: Future share price, expressed in dollars
Research methodology IV
[Figure labels: Mood towards Apple, Number of Tweets, Apple stock price]
Data collection and analysis overview
Data collection
• Query Twitter through API
• Store in MongoDB
Preprocessing
• Remove stopwords
• Remove Tweets with links
Model evaluation
• Classification algorithm
• Neural network
Time series
• Twitter volume
• Binary sentiment classification
Correlation
• Correlation between sentiment and financial data
 Collection and analysis steps overview
 Some steps like model evaluation are
iterative
Data collection I
[Figure: pipeline overview (Data collection, Preprocessing, Model evaluation, Time series, Correlation), with Tweets stored in a DB]
Data collection II
Data Source: Twitter (Query API, Firehose API, Gardenhose API)
Data Store: MongoDB
 Historic data collected through Twitter
APIs
 Timestamp, message text, region
Data collection III
 Data collected through Twitter query
API
 Using the Java programming language
 Using the Twitter4j library
 Stored as JSON (JavaScript Object
Notation) in MongoDB
Data collection IV
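import java.util.List;

import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterException;
import twitter4j.TwitterFactory;
import twitter4j.auth.AccessToken;

// ACCESS_TOKEN, ACCESS_TOKEN_SECRET, CUSTOMER_KEY and CUSTOMER_SECRET are the
// OAuth credentials of the Twitter application, defined elsewhere in the class.
public void runQuery() {
    Twitter twitter = new TwitterFactory().getInstance();
    AccessToken accessToken = new AccessToken(ACCESS_TOKEN, ACCESS_TOKEN_SECRET);
    twitter.setOAuthConsumer(CUSTOMER_KEY, CUSTOMER_SECRET);
    twitter.setOAuthAccessToken(accessToken);
    try {
        // Search for Tweets that mention the cashtag of Apple's ticker symbol.
        Query query = new Query("$AAPL");
        QueryResult result = twitter.search(query);
        List<Status> tweets = result.getTweets();
        // Print the author and text of every Tweet returned.
        for (Status tweet : tweets) {
            System.out.println("@" + tweet.getUser().getScreenName() + " - " + tweet.getText());
        }
    } catch (TwitterException te) {
        te.printStackTrace();
        System.out.println("Failed to search tweets: " + te.getMessage());
        System.exit(-1);
    }
}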
Twitter query algorithm to retrieve Tweets on Apple
Data preprocessing I
[Figure: pipeline overview (Data collection, Preprocessing, Model evaluation, Time series, Correlation)]
Data preprocessing II
 Remove stop-words such as “the”, “then”, “at” …
 Remove punctuation: apostrophes, brackets, colons …
 Discard Tweets with no explicit statements
like “Going to the Apple store”
 Discard irrelevant Tweets like “I love apples
and pears”
 Discard possible spam by discarding Tweets
that match the regular expression “http:” or
“www”
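The filtering rules above can be sketched in Java, the language used for data collection in this talk. This is a minimal illustration, not the original implementation: the class name `TweetPreprocessor`, the short stop-word list and the tokenization rule are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original slides.
final class TweetPreprocessor {
    // Tiny illustrative stop-word list (assumption); real lists are much longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "then", "at", "a", "is", "to"));

    // Matches the link markers named on the slide: "http:" or "www".
    static final Pattern LINK = Pattern.compile("http:|www", Pattern.CASE_INSENSITIVE);

    // Tweets containing links are treated as possible spam.
    static boolean isPossibleSpam(String tweet) {
        return LINK.matcher(tweet).find();
    }

    // Lower-cases the text, replaces punctuation with spaces and drops stop-words.
    static List<String> tokenize(String tweet) {
        String cleaned = tweet.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{Nd}\\s]", " ");
        List<String> tokens = new ArrayList<>();
        for (String token : cleaned.trim().split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String tweet = "The new iPhone is great!";
        if (!isPossibleSpam(tweet)) {
            System.out.println(tokenize(tweet));
        }
    }
}
```

Note that the pattern taken from the slide does not match `https:` links, and `www` also matches inside ordinary words; whether to tighten it depends on how aggressively spam should be filtered.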
Data preprocessing III
 Machine learning algorithms don’t take text
as input
 Create feature vector
 Word frequencies
 n-grams, unigram, bigram, trigram …
 “good”, “very good”, “not very good”
 Create sentiment lexicon
 Sentiment analysis highly domain specific
 “This mattress had a valley after one month”
 “This car uses a lot of fuel”
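As a rough illustration of how text becomes a feature vector, the sketch below extracts n-grams from a tokenized Tweet and counts their frequencies. The class and method names are hypothetical, and a real pipeline would map the counts onto a fixed vocabulary before training.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the original slides.
final class NGramFeatures {
    // Returns all contiguous n-grams of the token list, joined by a space.
    static List<String> nGrams(List<String> tokens, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            grams.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return grams;
    }

    // Counts how often each feature occurs; the counts form a sparse feature vector.
    static Map<String, Integer> frequencies(List<String> features) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String feature : features) {
            counts.merge(feature, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("not", "very", "good");
        System.out.println(nGrams(tokens, 1)); // unigrams
        System.out.println(nGrams(tokens, 2)); // bigrams
        System.out.println(nGrams(tokens, 3)); // trigrams
        System.out.println(frequencies(nGrams(tokens, 2)));
    }
}
```

The example shows why higher-order n-grams matter for sentiment: the unigram "good" alone looks positive, while the trigram "not very good" keeps the negation.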
Model evaluation I
[Figure: pipeline overview (Data collection, Preprocessing, Model evaluation, Time series, Correlation)]
[Chart values: 90.2 %, 84.7 %, 97.3 %; chart labels: Neural Network, Naïve Bayes, Nearest Neighbor]
Model evaluation II
 Experience shows that no single machine
learning scheme is appropriate to all data
mining problems (Witten, Frank & Hall 2011,
p. 403)
 Different algorithms are trained
 The best performing algorithm will be
selected
Model evaluation III
 Data classification and analysis through
 Machine learning techniques
 System can learn from data, e. g. detect spam
 Finding and describing structural patterns in
data and generalize
 Data classification is a supervised
learning problem
 Class label is known
Model evaluation IV
 Other machine learning models are
 Unsupervised learning
 Class label is unknown
 Used for cluster analysis
 Semi-supervised learning
 Small amount of labeled data, big volumes of
unlabeled data
Model evaluation V
 Model evaluation through iterative supervised
machine learning process
 Select classification algorithm, Naïve Bayes, k-
NN, Decision tree induction …
 Find a function ƒ that classifies Tweets into
positive and negative Tweets
 Data is divided into training and test data
 Model is trained using the training data
 Trained model is verified using the test data
Model evaluation VI
 Determine through loss function how well the
model performs on future, unseen data
 Calculate error:
 Training error = fraction of training examples misclassified
 Test error = fraction of test examples misclassified
 Generalization error = probability of misclassifying new
random example
Model evaluation VII
 Testing determines the classification
accuracy
Accuracy = (number of correct classifications) / (total number of test cases)
 Simple, but very optimistic if the training
data is also used for testing
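The accuracy formula can be checked with a few lines of Java. The sketch below is illustrative only; the class name and the sample labels (true = positive Tweet, false = negative Tweet) are made up for the example.

```java
// Hypothetical helper, not part of the original slides.
final class AccuracyExample {
    // Fraction of test cases whose predicted label equals the actual label.
    static double accuracy(boolean[] predicted, boolean[] actual) {
        if (predicted.length != actual.length || actual.length == 0) {
            throw new IllegalArgumentException("label arrays must be non-empty and of equal length");
        }
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (predicted[i] == actual[i]) {
                correct++;
            }
        }
        return (double) correct / actual.length;
    }

    public static void main(String[] args) {
        boolean[] predicted = {true, true, false, false};
        boolean[] actual = {true, false, false, false};
        double acc = accuracy(predicted, actual);
        System.out.println("accuracy = " + acc + ", error rate = " + (1.0 - acc));
    }
}
```

The error rate used on the previous slide is simply one minus this accuracy, computed on whichever data set (training or test) the labels come from.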
Model evaluation VIII
 n-fold cross-validation
 Divide data into n folds, where typically 4 < n < 11
 Data divided randomly into n folds
 n – 1 folds used for training, 1 holdout fold for
testing
 Error rate is calculated on the holdout fold
 repeated n times such that each fold is the holdout
fold once
 Error estimate is averaged over all n error rates
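The cross-validation procedure above might be sketched as follows. The names are hypothetical, and the classifier is abstracted as a function that receives the training indices and the holdout indices and returns the error rate on the holdout fold.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.ToDoubleBiFunction;

// Hypothetical helper, not part of the original slides.
final class CrossValidation {
    // Randomly assigns the indices 0..size-1 to n folds of (nearly) equal size.
    static List<List<Integer>> folds(int size, int n, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < n; f++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < size; i++) {
            folds.get(i % n).add(indices.get(i));
        }
        return folds;
    }

    // Runs n rounds; each fold is the holdout fold once, the rest is training data.
    // Returns the error estimate averaged over all n rounds.
    static double averageError(List<List<Integer>> folds,
                               ToDoubleBiFunction<List<Integer>, List<Integer>> trainAndTest) {
        double sum = 0.0;
        for (int holdout = 0; holdout < folds.size(); holdout++) {
            List<Integer> training = new ArrayList<>();
            for (int f = 0; f < folds.size(); f++) {
                if (f != holdout) {
                    training.addAll(folds.get(f));
                }
            }
            sum += trainAndTest.applyAsDouble(training, folds.get(holdout));
        }
        return sum / folds.size();
    }

    public static void main(String[] args) {
        List<List<Integer>> folds = folds(10, 5, 42L);
        // Placeholder "classifier" that always reports a 20 % error rate.
        System.out.println(averageError(folds, (training, holdout) -> 0.2));
    }
}
```

Working on indices rather than on the Tweets themselves keeps the sketch independent of how the data and the classifier are represented.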
Model evaluation IX
 Typical data mining task goes through many
iterations
 As many iterations as necessary until the result
is satisfactory, i.e. accuracy converges
 Best data mining scheme is selected
 Used against unseen data for classification
 Can be used on real-time data
Model evaluation X
RapidMiner workbench
Model evaluation XI
Training data
name      sex     mask  cape  tie  ears  smokes  class
Batman    male    yes   yes   no   yes   no      Good
Robin     male    yes   yes   no   no    no      Good
Alfred    male    no    no    yes  no    no      Good
Penguin   male    no    no    yes  no    yes     Bad
Catwoman  female  yes   no    no   yes   no      Bad
Joker     male    no    no    no   no    no      Bad
Test data
name      sex     mask  cape  tie  ears  smokes  class
Batgirl   female  yes   yes   no   yes   no      ?
Riddler   male    yes   no    no   no    no      ?
Model evaluation XII
 Description of data:
 Generalisation for new examples
if sex = male and mask = yes and cape = yes
and tie = no and ears = yes and smokes = no
then character = Good
if mask = yes and ears = yes and smokes = no
then character = Good
Model evaluation XIII
tie?
  no  -> cape?
           no  -> bad
           yes -> good
  yes -> smokes?
           no  -> good
           yes -> bad
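A decision tree of this shape is just a set of nested tests. The sketch below assumes the leaf assignment that is consistent with the training data table (root test on tie, then cape or smokes); the class name is hypothetical. It classifies the two test characters, Batgirl and Riddler.

```java
// Hypothetical helper, not part of the original slides.
final class SuperheroTree {
    // Root splits on "tie"; the left subtree tests "cape", the right subtree tests "smokes".
    static String classify(boolean tie, boolean cape, boolean smokes) {
        if (tie) {
            return smokes ? "Bad" : "Good";
        }
        return cape ? "Good" : "Bad";
    }

    public static void main(String[] args) {
        // Attribute values are taken from the test data table.
        System.out.println("Batgirl: " + classify(false, true, false));
        System.out.println("Riddler: " + classify(false, false, false));
    }
}
```

Only three of the six attributes are consulted, which illustrates how a small tree can generalise where the fully specific rule for Batman cannot.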
Model evaluation XIV
 Trees must be:
 Big enough to fit training data
 Big enough to capture true patterns
 Not too big (Ockham’s razor):
 Overfitting
 Capture noise
 Find spurious patterns
Model evaluation XIV
 Best tree size cannot be determined
from training error
Schapire 2004
Model evaluation XV
Schapire 2004
Model evaluation XVI
 For building an accurate classifier:
 Enough training examples
 Good performance on training set
 Classifier that is not too complex
 Strategy for controlling tree size:
 Build large tree that fully fits training data
 Prune back
Model evaluation XVII
 Grow on just part of the training data, then
prune using minimum error on held out
data
Classifiers I
 Decision trees:
 Best known:
 C4.5 (Quinlan), successor C5.0
 CART for classification and regression trees
(Breiman et al.)
 Fast to train and evaluate
 Relatively easy to interpret
 Accuracy often not satisfactory
Classifiers II
 Perceptron (Neuron)
 Linear classifier
 Data linearly separable using a hyperplane:
w0a0 + w1a1 + w2a2 + … + wkak = 0
 Where w = weights, a = real-valued feature
vector, a0 = bias input
 Binary classifier f(a) that maps its input
vector a to a single, binary output value
Classifiers III
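[Figure: perceptron with a bias input 1 (weight w0) and attribute inputs a1, a2, a3 (weights w1, w2, w3)]
f(a) = Σ_k w_k a_k + b
Output class depends on whether f(a) > 0 or f(a) < 0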
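The perceptron's decision function is a weighted sum followed by a threshold. The sketch below is illustrative: the class name is hypothetical, and the weights are set by hand so that the unit computes a logical AND, rather than being learned from data.

```java
// Hypothetical helper, not part of the original slides.
final class Perceptron {
    // f(a) = sum over k of w_k * a_k, plus the bias b.
    static double activation(double[] w, double[] a, double b) {
        if (w.length != a.length) {
            throw new IllegalArgumentException("weights and inputs must have equal length");
        }
        double sum = b;
        for (int k = 0; k < w.length; k++) {
            sum += w[k] * a[k];
        }
        return sum;
    }

    // Class 1 if f(a) > 0, class 0 otherwise.
    static int classify(double[] w, double[] a, double b) {
        return activation(w, a, b) > 0 ? 1 : 0;
    }

    public static void main(String[] args) {
        double[] w = {1.0, 1.0}; // hand-picked weights (assumption)
        double b = -1.5;         // hand-picked bias (assumption)
        for (double a1 = 0; a1 <= 1; a1++) {
            for (double a2 = 0; a2 <= 1; a2++) {
                System.out.println(a1 + " AND " + a2 + " -> " + classify(w, new double[]{a1, a2}, b));
            }
        }
    }
}
```

The hyperplane a1 + a2 - 1.5 = 0 separates the point (1, 1) from the other three inputs, which is exactly what linear separability means here.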
Classifiers IV
 Multilayer Perceptron
 Non-linear classifier
 Perceptrons are connected in a
hierarchical structure
Classifiers V
 Not all data is linearly separable
Classifiers VI
[Figure: multilayer perceptron with a bias input 1 and attribute inputs a1 and a2; input layer, hidden layer, output layer]
Classifiers VII
 Multilayer Perceptron
 Perceptrons organized in several layers
 Each layer is fully connected to the next
layer
 All nodes except the input nodes are perceptrons
 Feedforward neural network
 Uses backpropagation for training
 Error propagated back to minimize loss function
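XOR is the standard example of data that a single perceptron cannot separate but a multilayer perceptron can. In the sketch below the weights are set by hand for illustration instead of being learned with backpropagation, and the class name is hypothetical.

```java
// Hypothetical helper, not part of the original slides.
final class XorNetwork {
    // Threshold activation of a single perceptron.
    static int step(double x) {
        return x > 0 ? 1 : 0;
    }

    // Two inputs, two hidden perceptrons, one output perceptron.
    static int xor(int a1, int a2) {
        int hiddenOr = step(a1 + a2 - 0.5);       // fires if at least one input is 1
        int hiddenAnd = step(a1 + a2 - 1.5);      // fires only if both inputs are 1
        return step(hiddenOr - hiddenAnd - 0.5);  // fires for OR but not for AND
    }

    public static void main(String[] args) {
        for (int a1 = 0; a1 <= 1; a1++) {
            for (int a2 = 0; a2 <= 1; a2++) {
                System.out.println(a1 + " XOR " + a2 + " = " + xor(a1, a2));
            }
        }
    }
}
```

Each hidden node draws one line through the input space; the output node combines the two half-planes into a region that no single line could describe.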
Classifiers VIII
 Allow approximate solutions to be found for
very complex problems
 Support Vector Machines (SVM) are a
much simpler alternative to artificial neural networks (ANN)
 Many more classifiers
 k-Nearest Neighbor
 Naïve Bayes
 …
Data classification I
[Figure: pipeline overview (Data collection, Preprocessing, Model evaluation, Time series, Correlation)]
Data Classification II
 Data classification:
 Binary mood polarity: positive, negative
 Represented graphically as time series
Positive Tweets
Negative Tweets
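Turning classified Tweets into a time series amounts to counting positive and negative Tweets per time bucket. The sketch below uses one bucket per day; the class names, the daily granularity and the sample dates are assumptions made for the example.

```java
import java.time.LocalDate;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the original slides.
final class SentimentSeries {
    // One classified Tweet: the day it was posted and its binary polarity.
    static final class LabeledTweet {
        final LocalDate day;
        final boolean positive;

        LabeledTweet(LocalDate day, boolean positive) {
            this.day = day;
            this.positive = positive;
        }
    }

    // Returns, per day in chronological order, {positive count, negative count}.
    static Map<LocalDate, int[]> dailyCounts(List<LabeledTweet> tweets) {
        Map<LocalDate, int[]> series = new TreeMap<>();
        for (LabeledTweet tweet : tweets) {
            int[] counts = series.computeIfAbsent(tweet.day, d -> new int[2]);
            counts[tweet.positive ? 0 : 1]++;
        }
        return series;
    }

    public static void main(String[] args) {
        List<LabeledTweet> tweets = Arrays.asList(
                new LabeledTweet(LocalDate.of(2014, 2, 13), true),
                new LabeledTweet(LocalDate.of(2014, 2, 13), false),
                new LabeledTweet(LocalDate.of(2014, 2, 14), true));
        dailyCounts(tweets).forEach((day, counts) ->
                System.out.println(day + ": +" + counts[0] + " / -" + counts[1]));
    }
}
```

Summing both counts per day gives the Twitter volume series mentioned in the pipeline overview.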
Correlations I
[Figure: pipeline overview (Data collection, Preprocessing, Model evaluation, Time series, Correlation)]
[Chart labels: Sentiment polarity, Share price]
Correlations II
 Finding correlations:
 Binary sentiment classification time series
compared against stock price over same
time frame
 Does a rise in the number of positive Tweets
precede a surge in the Apple stock price?
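One simple way to probe whether sentiment leads the stock price is to correlate the sentiment series with the price series shifted by a lag. The sketch below uses the Pearson coefficient, which only captures linear association and is not the Granger test; the class name and all numbers are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the original slides.
final class LaggedCorrelation {
    // Pearson correlation coefficient of two series of equal length.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two series of equal length >= 2");
        }
        double meanX = Arrays.stream(x).average().orElse(0.0);
        double meanY = Arrays.stream(y).average().orElse(0.0);
        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    // Correlates sentiment on day t with the price on day t + lagDays.
    static double lagged(double[] sentiment, double[] price, int lagDays) {
        int n = Math.min(sentiment.length, price.length) - lagDays;
        return pearson(Arrays.copyOfRange(sentiment, 0, n),
                       Arrays.copyOfRange(price, lagDays, lagDays + n));
    }

    public static void main(String[] args) {
        double[] positiveTweets = {1, 2, 3, 4, 5}; // hypothetical daily counts
        double[] price = {9, 2, 4, 6, 8};          // hypothetical daily prices
        for (int lag = 0; lag <= 2; lag++) {
            System.out.println("lag " + lag + ": r = " + lagged(positiveTweets, price, lag));
        }
    }
}
```

A clearly higher coefficient at a positive lag than at lag 0 would hint that sentiment precedes price moves, but it says nothing about causation.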
Correlations III
Microsoft stock price (Yahoo! Finance 2014)
Correlations IV
Tweet polarity and MSFT stock price
Correlations V
 If there are correlations in historic data,
trained model used against real time
data
 Access real-time Tweets using Twitter's
streaming API
 Firehose API (100% of real time Tweets)
 Gardenhose API (10% of real time Tweets)
 Spritzer API (1% of real time Tweets)
Correlations VI
 Since the correlations are almost certainly
non-linear, the correlation analysis has to be automated
 Bivariate Granger causality test
 Determines whether one time series can be
used to predict another
 If past values of X improve the prediction
of Y, X is said to Granger-cause Y
 X provides statistically significant information
about future values of Y
Correlations VII
 Granger test examines linear causality
among bivariate or multivariate time series
 Many real-world phenomena are not
linear
 Non-linear extensions to Granger have
been developed
 Other correlation techniques
 Phase Slope Index measures temporal flux
between time series
Correlations VII
 The Phase Slope Index is more robust than
the Granger test since it is less sensitive to noise
 Machine learning techniques such as
ANN can be used for finding
correlations
Applications I
 Technologies for predictive analysis
have matured
 IBM SPSS
 Stata
 SAS
Applications II
 Free open source
 WEKA
 Partly open source
 RapidMiner
 Cloud solutions
 IBM WatsonAnalytics
 Google BigQuery
 SAS Cloud Analytics
Challenges I
 Real-world data is often of very poor quality
 Social Media vast, noisy and
unstructured
 Getting relevant posts is challenging
 Spam has become a serious issue
 Detecting sarcasm very difficult
 Political opinions full of irony and sarcasm
 Data preprocessing one of the most
important steps
Challenges II
 Opinion mining remains challenging
task
 Overall statement often difficult to
determine
 No ground truth
 Not everybody is using Social Media
 Self-selection bias
Conclusions I
 Predictive analysis poses many
interesting research problems
 Many opportunities for future research
 Determining the credibility of posts (catfish,
sock puppet)
 Better filtering mechanisms
 More research in Machine Learning
than feature extraction
Conclusions II
 Correlation does not mean causation
 Finding causative mechanism for
correlation
Thank you for your attention
 Questions?
References I
 Achrekar, H, Gandhe, A, Lazarus, R, Ssu-Hsin, Y and Benyuan, L 2011, 'Predicting Flu Trends using Twitter data', Computer
Communications Workshops (INFOCOM WKSHPS), IEEE, pp. 702-7.
 Arias, M, Arratia, A & Xuriguera, R 2014, 'Forecasting with twitter data', ACM Trans. Intell. Syst. Technol., vol. 5, no. 1, pp. 1-24.
 Asur, S & Huberman, BA 2010, 'Predicting the Future with Social Media', in Web Intelligence and Intelligent Agent Technology
(WI-IAT), 2010 IEEE/WIC/ACM International Conference on, vol. 1, pp. 492-9.
 Berman, JJ 2013, Principles of Big Data, Elsevier Inc., Waltham, USA.
 Bollen, J, Mao, H & Zeng, X-J 2010, 'Twitter mood predicts the stock market', Journal of Computational Science, vol. 2, p. 8.
 Buhl, H, Röglinger, M, Moser, F & Heidemann, J 2013, 'Big Data', WIRTSCHAFTSINFORMATIK, vol. 55, no. 2, pp. 63-8.
 Bulysheva, L & Bulyshev, A 2012, 'Segmentation modeling algorithm: a novel algorithm in data mining', Information Technology
and Management, vol. 13, no. 4, pp. 263-71.
 Darwish, A & Lakhtaria, KI 2011, The Impact of the New Web 2.0 Technologies in Communication, Development, and
Revolutions of Societies, vol. 2, 2011.
 Goh, KY, Heng, CS & Lin, Z 2012, ‘Social Media Brand Community and Consumer Behavior: Quantifying the Relative Impact of
User- and Marketer-Generated Content’, School of Computing, National University of Singapore, viewed 9 April 2013,
<https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2048614>.
 Graham, DM, Hale, SA & Stephens, M 2011, 'User-generated Content in Google', Oxford University, Oxford, UK, viewed 27
October 2013, < http://www.oii.ox.ac.uk/vis/?id=4e3c030d>.
 Harris, D 2013, 'DataSift raises $42M', Gigaom, viewed 27 December 2013, <http://gigaom.com/2013/12/03/datasift-raises-42m-
maybe-theres-something-to-this-social-data-after-all/>.
 Huang, S, Peng, W, Li, J & Lee, D 2013, 'Sentiment and topic analysis on social media: a multi-task multi-label classification
approach', paper presented to Proceedings of the 5th Annual ACM Web Science Conference, Paris, France.
 Kao, A, Ferng, W, Poteet, S, Quach, L & Tjoelker, R 2013, 'TALISON - Tensor analysis of social media data', in Intelligence and
Security Informatics (ISI), 2013 IEEE International Conference on, pp. 137-42.
 Klein, D, Tran-Gia, P & Hartmann, M 2013, 'Big Data', Informatik-Spektrum, vol. 36, no. 3, p. 319.
 Kumar, P, Nitin, Chauhan, DS & Sehgal, VK 2012, 'Selection of evolutionary approach based hybrid data mining algorithms for
decision support systems and business intelligence', paper presented to Proceedings of the International Conference on
Advances in Computing, Communications and Informatics, Chennai, India.
References II
 Kumar, P, Kumar Sehgal, N, Kumar Sehgal, V & Singh Chauhan, D 2012, 'A Benchmark to Select Data Mining Based
Classification Algorithms for Business Intelligence and Decision Support Systems', International Journal of Data Mining &
Knowledge Management Process, vol. 2, no. 5, pp. 25-42.
 Lim, E-P, Chen, H & Chen, G 2013, 'Business Intelligence and Analytics: Research Directions', ACM Trans. Manage. Inf. Syst.,
vol. 3, no. 4, pp. 1-10.
 Manyika, J, Chui, M, Brown, B, Bughin, J, Dobbs, R, Roxburgh, C & Byers, AH 2011, Big data: The next frontier for innovation,
competition, and productivity, McKinsey Global Institute.
 Mayer-Schonberger, V & Cukier, K 2013, Big Data: A Revolution That Will Transform How We Live, Work, and Think, Houghton
Mifflin Harcourt Publishing Company, New York, USA.
 Mayer, A 2009, 'Online social networks in economics', Decision Support Systems, vol. 47, no. 3, pp. 169-184, viewed 22
September 2013, < http://sistemas-humano-computacionais.wdfiles.com/local--files/capitulo%3Aredes-sociais/amayer.pdf>.
 McKelvey, K, Rudnick, A, Conover, MD & Menczer, F 2012, 'Visualizing Communication on Social Media, Making Big Data
Accessible', Indiana University School of Informatics and Computing, viewed 29 September 2013,
<http://arxiv.org/pdf/1202.1367v1.pdf>.
 Neri, F, Aliprandi, C, Capeci, F, Cuadros, M & By, T 2012, 'Sentiment Analysis on Social Media', in Advances in Social Networks
Analysis and Mining (ASONAM), 2012 IEEE/ACM International Conference on, pp. 919-26.
 Oboler, A, Welsh, K & Cruz, L 2012, The danger of big data: Social media as computational social science, 2012.
 Ostrowski, DA 2011, 'Predictive Semantic Social Media Analysis', in Semantic Computing (ICSC), 2011 Fifth IEEE International
Conference on, pp. 283-90.
 Paltoglou, G & Thelwall, M 2012, 'Twitter, MySpace, Digg: Unsupervised Sentiment Analysis in Social Media', ACM Trans. Intell.
Syst. Technol., vol. 3, no. 4, pp. 1-19.
 Rusli, EM 2013, Facebook Woos TV Networks With Data, Digits, viewed 15 February 2014,
<http://blogs.wsj.com/digits/2013/09/29/facebook-woos-tv-networks-with-more-data/>.
 Smith, MS, Ventura, AD, Dewey, DP, Knutson, CD & Embley, DW 2011, ‘A Computational Framework for Social Capital in Online
Communities’, Brigham Young University, viewed 28 July 2013, <http://posts.smithworx.com/publications/d.pdf>.
References III
 Yahoo! Finance 2014, Microsoft Corporation (MSFT), Yahoo, viewed 15 February 2014,
<http://finance.yahoo.com/echarts?s=MSFT+Interactive#symbol=msft;range=20130102,20140214;compare=;indicator=volume;charttype=area;crosshair=on;ohlcvalues=0;logscale=off;source=;>.
 Trif, S 2011, 'Using Genetic Algorithms in Secured Business Intelligence Mobile Applications', Informatica economica, vol. 15, no. 1,
pp. 69-79.
 Tumasjan, A, Welpe, IM, Sandner, PG, Tumasjan, A & Sprenger, TO 2011, 'Election Forecasts With Twitter: How 140 Characters
Reflect the Political Landscape', Social science computer review, vol. 29, no. 4, pp. 402-18.
 Sakaki, T, Okazaki, M and Matsuo, Y 2010, 'Earthquake shakes Twitter users: real-time event detection by social sensors', Proc. of the
19th international conference on World wide web, Raleigh.
 Twitter Statistics 2014, Statistic brain, viewed 18 February 2014, <http://www.statisticbrain.com/twitter-statistics/>.
 Walton, A 2014, ‘Twitter Usage by Region’, Chron, viewed 18 February 2014, < http://smallbusiness.chron.com/twitter-usage-region-
62762.html>.
 Wang, F-Y, Carley, KM, Zeng, D & Mao, W 2007, 'Social Computing: From Social Informatics to Social Intelligence', Intelligent
Systems, IEEE, vol. 22, no. 2, pp. 79-83.
 Weka knowledge explorer, viewed 15 February 2014, <http://www.cs.waikato.ac.nz/~ml/weka/gui_explorer.html>.
 Witten, IH, Frank, E & Hall, MA 2011, Data Mining, 3 edn, Elsevier, Burlington, MA, USA.
 Wlodarczak, P 2014, ‘Big Personal Data’, Social Science Research Network, <http://dx.doi.org/10.2139/ssrn.2514721>.
 World Stock Exchanges 2011, viewed 18 February 2014, <http://www.world-stock-exchanges.net/top10.html>.
 Wong, FMF, Sen, S & Chiang, M 2012, 'Why Watching Movie Tweets Won’t Tell the Whole Story?', Cornell University, viewed 14 May
2013, <http://arxiv.org/pdf/1203.4642v1.pdf>.
 Wu, X, Kumar, V, Ross Quinlan, J, Ghosh, J, Yang, Q, Motoda, H, McLachlan, GJ, Ng, A, Liu, B, Yu, PS, Zhou, Z-H, Steinbach, M,
Hand, DJ & Steinberg, D 2007, 'Top 10 algorithms in data mining', Knowledge and Information Systems, vol. 14, no. 1, pp. 1-37.
 Zeng, D, Chen, H, Lusch, R & Li, S-H 2010, 'Social Media Analytics and Intelligence', Intelligent Systems, IEEE, vol. 25, no. 6, pp. 13-
6.
 Zeng, L, Li, L & Duan, L 2012, 'Business intelligence in enterprise computing environment', Information Technology and Management,
vol. 13, no. 4, pp. 297-310.

Predicting the future with social media

  • 1.
    What The FutureHolds For Social Media Data Analysis Predictive analytics using Twitter data Peter Wlodarczak wlodarczak@gmail.com
  • 2.
    Agenda  Introduction  Researchmethodology  Applications  Challenges  Conclusions
  • 3.
    Introduction I  Shiftfrom publisher-generated to user- created content  90% of the content on the Internet is now user generated (Graham et al. 2011)  Unprecedented amount of opinionated data on the Internet  Online social networks (OSN) are one of the biggest data sources of the internet (Oboler, Welsh & Cruz 2012)
  • 4.
    Introduction II  Opinionscan be expressed on the Internet without programming knowledge (Web 2.0)  Opinions are key influences of human behavior  People increasingly consult the Internet before making decisions
  • 5.
    Introduction III  OSNgive new insights into peoples opinions, interests and views  Social networking Web sites are amassing vast quantities of data  Computational social science is providing tools to process this data (Oboler, Welsh & Cruz 2012)  Social computing, a new paradigm of computing and technology development, has become a central theme across a number of information and communication technology fields (Wang et al. 2007, p. 79)
  • 6.
    Introduction IV  Growinginterest in Social Media Mining (SMM) in the market  Gnip, Klout, DataSift and Sprout social specialized in SM data analysis  Apple bought Topsy for 200 million US dollars (Harris 2013)  TV stations buy Facebook data to see how popular their shows are (Rusli 2013)  No surveys necessary
  • 7.
    Introduction V  Researchin the area of computational social science and Big Data  Social computing is a cross-disciplinary research and application field with theoretical underpinnings including both computational and social sciences (Wang et al. 2007, p. 80)  Big Data is the ability of society to harness information in novel ways to produce useful insights or goods and services of significant value (Mayer-Schonberger & Cukier 2013, p. 2)
  • 8.
    Introduction VI  Analyzingdata to:  Understand the underlying structure of it and gain knowledge  Make predictions from new, unseen examples
  • 9.
    Introduction VII  Currentbehavior indication for future decisions  New area of research: predictive analytics  Machine learning techniques used for prediction  Learning from experience, “data”, to predict future behavior of individuals  Support decision making process
  • 10.
    Introduction VIII  BigData  Big Data is usually defined by the three V’s. Volume, velocity and variety (Klein, Tran-Gia & Hartmann 2013, p. 320)  High volume  Created at high velocity  Structured, semi-structured and unstructured
  • 11.
    Introduction IX  BigData principles  No sample selection, all data analysed  Data doesn’t have to be of high quality  Structured and unstructured data
  • 12.
    Introduction X  Datamining  Techniques for finding and describing structural patterns in data  Tool for helping to explain that data and make predictions from it (Witten, Frank & Hall 2011, p. 8)  Used to  gain knowledge  make predictions
  • 13.
    Introduction XI  Dataanalysis steps  Analyze mood by means of sentiment analysis  Create time series and correlate it to real world phenomenon  Make predictions based on new data  Support decision making process
  • 14.
    Introduction XII  SocialMedia data has been analysed to predict  Financial indicators (Bollen, Mao & Zeng 2010)  Elections (Tumasjan et al. 2011)  Box office revenue (Asur & Huberman 2010)  Disease outbreak (Achrekar et al. 2011)  Natural disasters (Sakaki, Okazaki and Matsuo 2010)
  • 15.
    Research methodology I Predictive analysis of Social Media consists of two phases  Data conditioning phase  Predictive analysis phase
  • 16.
    Research methodology II Determination of time window  Selection of search terms  Selection of data extraction method Collection and filtering of raw data  Selection of prediction variables  Measurement of predictor variables Computation of Predictor Variables Data Conditioning Phase  Selection of predictive method  Identification of data for evaluation of prediction Creation of Predictive Mode  Selection of the evaluation method  Specification of the prediction baseline Evaluation of the Predictive Performance Predictive Analysis Phase Analysis phases
  • 17.
    Research methodology III Inputand output variables Twitter sentiments Share price Future share price Expressed as binary sentiment classification Expressed in dollars Expressed in dollars
  • 18.
    Research methodology IV Moodtowards Apple Number of Tweets Apple stock price
  • 19.
    Data collection andanalysis overview Data collection •Query Twitter through API •Store in MongoDB Preprocessing •Remove stopwords •Remove Tweets with Links Model evaluation •Classification algorithm •Neural network Time series •Twitter volume •Binary sentiment classification Correlation • Correlation between sentiment and financial data  Collection and analysis steps overview  Some steps like model evaluation are iterative
  • 20.
    Data collection I Datacollection •Query Twitter through API •Store in DB Preprocessing •Remove stopwords •Remove Tweets with Links Model evaluation •Classification algorithm •Neural network Time series •Twitter volume •Binary sentiment classification Correlation • Correlation between sentiment and financial data DB
  • 21.
    Data collection II DataSource Twitter Query API Firehose API Gardenhose API Data Store MongoDB  Historic data collected through Twitter APIs  Timestamp, message text, region
  • 22.
    Data collection III Data collected through Twitter query API  Using the Java programming language  Using the Twitter4j library  Stored as JSON (JavaScript Object Notation) in a MongoDB
  • 23.
    Data collection IV publicvoid runQuery() { Twitter twitter = new TwitterFactory().getInstance(); AccessToken accessToken = new AccessToken(ACCESS_TOKEN, ACCESS_TOKEN_SECRET); twitter.setOAuthConsumer(CUSTOMER_KEY, CUSTOMER_SECRET); twitter.setOAuthAccessToken(accessToken); try { Query query = new Query(“$Appl"); QueryResult result; result = twitter.search(query); List<Status> tweets = result.getTweets(); for (Status tweet : tweets) { System.out.println("@" + tweet.getUser().getScreenName() + " - " + tweet.getText()); } } catch (TwitterException te) { te.printStackTrace(); System.out.println("Failed to search tweets: " + te.getMessage()); System.exit(-1); } } Twitter query algorithm to retrieve Tweets on Apple
  • 24.
    Data preprocessing I Datacollection •Query Twitter through API •Store in DB Preprocessing •Remove stopwords •Remove Tweets with Links Model evaluation •Classification algorithm •Neural network Time series •Twitter volume •Binary sentiment classification Correlation • Correlation between sentiment and financial data
  • 25.
    Data preprocessing II Remove stop-words, “the”, “then”, “at” …  Punctuation, apostrophe, brackets, colon ..  Discard Tweets with no explicit statements like “Going to the Apple store”  Discard irrelevant Tweeds like “I love apples and pears”  Discard possible spam by discarding Tweets that match the regular expression “http:” and “www”
  • 26.
    Data preprocessing III Machine learning algorithms don’t take text as input  Create feature vector  Word frequencies  n-grams, unigram, bigram, trigram …  “good”, “very good”, “not very good”  Create sentiment lexicon  Sentiment analysis highly domain specific  “This mattress had a valley after one month”  “This car uses a lot of fuel”
  • 27.
    Model evaluation I Datacollection •Query Twitter through API •Store in DB Preprocessing •Remove stopwords •Remove Tweets with Links Model evaluation •Classification algorithm •Neural network Time series •Twitter volume •Binary sentiment classification Correlation • Correlation between sentiment and financial data 90.2 % 84.7 % 97.3 % Neural Network Naïve Bayes Nearest Neighbor
  • 28.
    Model evaluation II Experience shows that no single machine learning scheme is appropriate to all data mining problems (Witten, Frank & Hall 2011, p. 403)  Different algorithms are trained  The best performing algorithm will be selected
  • 29.
    Model evaluation III
      • Data classification and analysis through machine learning techniques
        • System can learn from data, e.g. detect spam
        • Finds and describes structural patterns in data and generalizes from them
      • Data classification is a supervised learning problem
        • Class label is known
  • 30.
    Model evaluation IV
      • Other machine learning models are:
        • Unsupervised learning
          • Class label is unknown
          • Used for cluster analysis
        • Semi-supervised learning
          • Small amount of labeled data, big volumes of unlabeled data
  • 31.
    Model evaluation V
      • Model evaluation through an iterative supervised machine learning process
        • Select classification algorithm: Naïve Bayes, k-NN, decision tree induction, …
        • Find a function ƒ that classifies Tweets into positive and negative Tweets
        • Data is divided into training and test data
        • Model is trained using the training data
        • Trained model is verified using the test data
  • 32.
    Model evaluation VI
      • Determine through a loss function how well the model performs on future, unseen data
      • Calculate error:
        • Training error = fraction of training examples misclassified
        • Test error = fraction of test examples misclassified
        • Generalization error = probability of misclassifying a new random example
  • 33.
    Model evaluation VII
      • Testing determines the classification accuracy
        $$\text{Accuracy} = \frac{\text{number of correct classifications}}{\text{total number of test cases}}$$
      • Simple, but very optimistic if the training data is also used for testing
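The accuracy formula, and the error rate as its complement, can be computed in a few lines. The sketch below uses hypothetical label arrays; `Accuracy` is an illustrative class name, not part of any library.

```java
class Accuracy {
    // Fraction of test cases whose predicted label equals the actual label.
    static double accuracy(String[] predicted, String[] actual) {
        if (predicted.length != actual.length || actual.length == 0) {
            throw new IllegalArgumentException("label arrays must be non-empty and of equal length");
        }
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (predicted[i].equals(actual[i])) {
                correct++;
            }
        }
        return (double) correct / actual.length;
    }

    public static void main(String[] args) {
        // Hypothetical predictions for four test Tweets.
        String[] predicted = {"pos", "neg", "pos", "pos"};
        String[] actual = {"pos", "neg", "neg", "pos"};
        double acc = accuracy(predicted, actual);
        System.out.println("accuracy = " + acc);         // prints accuracy = 0.75
        System.out.println("test error = " + (1 - acc)); // prints test error = 0.25
    }
}
```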
  • 34.
    Model evaluation VIII
      • n-fold cross-validation
        • Data is divided randomly into n folds, where typically 4 < n < 11
        • n – 1 folds are used for training, 1 holdout fold for testing
        • Error rate is calculated on the holdout fold
        • Repeated n times so that each fold is the holdout fold once
        • Error estimate is averaged over all n error rates
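A minimal sketch of the n-fold procedure described above. The learner itself is left abstract: `errorOnFold` is a hypothetical callback that trains on the given training indices and returns the error rate on the holdout indices. All names here are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.BiFunction;

class CrossValidation {
    // Randomly splits the indices 0..size-1 into n folds of near-equal size.
    static List<List<Integer>> folds(int size, int n, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < n; f++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < size; i++) {
            folds.get(i % n).add(indices.get(i));
        }
        return folds;
    }

    // Each fold is the holdout fold once; the n error rates are averaged.
    static double crossValidate(int size, int n, long seed,
            BiFunction<List<Integer>, List<Integer>, Double> errorOnFold) {
        List<List<Integer>> folds = folds(size, n, seed);
        double sum = 0.0;
        for (int holdout = 0; holdout < n; holdout++) {
            List<Integer> training = new ArrayList<>();
            for (int f = 0; f < n; f++) {
                if (f != holdout) {
                    training.addAll(folds.get(f));
                }
            }
            sum += errorOnFold.apply(training, folds.get(holdout));
        }
        return sum / n;
    }

    public static void main(String[] args) {
        // Dummy learner (assumption): always reports an error rate of 0.25.
        double estimate = crossValidate(100, 10, 42L, (train, test) -> 0.25);
        System.out.println("cross-validated error = " + estimate);
    }
}
```

Fixing the random seed makes the split reproducible, which helps when several algorithms are compared on the same folds.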
  • 35.
    Model evaluation IX
      • Typical data mining task goes through many iterations
        • As many iterations as necessary until the result is satisfactory, i.e. accuracy converges
      • Best data mining scheme is selected
        • Used against unseen data for classification
        • Can be used on real-time data
  • 36.
  • 37.
    Model evaluation XI
      Training data
                  sex     mask  cape  tie  ears  smokes  class
        Batman    male    yes   yes   no   yes   no      Good
        Robin     male    yes   yes   no   no    no      Good
        Alfred    male    no    no    yes  no    no      Good
        Penguin   male    no    no    yes  no    yes     Bad
        Catwoman  female  yes   no    no   yes   no      Bad
        Joker     male    no    no    no   no    no      Bad
      Test data
        Batgirl   female  yes   yes   no   yes   no      ?
        Riddler   male    yes   no    no   no    no      ?
  • 38.
    Model evaluation XII
      • Description of data:
        • Generalisation for new examples
          if sex = male and mask = yes and cape = yes and tie = yes and ears = yes and smokes = no then character = Good
          if mask = yes and ears = yes and smokes = no then character = Good
  • 39.
    Model evaluation XIII
      Decision tree:
        tie = no
          cape = no:     bad
          cape = yes:    good
        tie = yes
          smokes = no:   good
          smokes = yes:  bad
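A decision tree is just a set of nested conditions. The sketch below assumes the tree on this slide reads: root on tie, then cape when tie = no and smokes when tie = yes. That reading is a reconstruction, chosen because it classifies all six training examples of the superhero table correctly. The class name `CharacterTree` is hypothetical.

```java
class CharacterTree {
    // Root tests tie; the left subtree tests cape, the right subtree tests smokes.
    static String classify(boolean tie, boolean cape, boolean smokes) {
        if (tie) {
            return smokes ? "Bad" : "Good";
        }
        return cape ? "Good" : "Bad";
    }

    public static void main(String[] args) {
        // Test examples from the table: neither wears a tie, only Batgirl has a cape.
        System.out.println("Batgirl -> " + classify(false, true, false));  // prints Batgirl -> Good
        System.out.println("Riddler -> " + classify(false, false, false)); // prints Riddler -> Bad
    }
}
```

Note that the tree ignores sex, mask and ears entirely: three attributes are enough to separate the training data.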
  • 40.
    Model evaluation XIV Trees must be:  Big enough to fit training data  Big enough to capture true patterns  Not too big (Ockham’s razor):  Overfitting  Capture noise  Find spurious patterns
  • 41.
    Model evaluation XIV Best tree size cannot be determined from training error Schapire 2004
  • 42.
  • 43.
    Model evaluation XVI
      • For building an accurate classifier:
        • Enough training examples
        • Good performance on training set
        • Classifier that is not too complex
      • Strategy for controlling tree size:
        • Build large tree that fully fits training data
        • Prune back
  • 44.
    Model evaluation XVII
      • Grow on just part of the training data, then prune using minimum error on held-out data
  • 45.
    Classifiers I
      • Decision trees:
        • Best known:
          • C4.5 (Quinlan), successor C5.0
          • CART, for classification and regression trees (Breiman et al.)
        • Fast to train and evaluate
        • Relatively easy to interpret
        • Accuracy often not satisfactory
  • 46.
    Classifiers II
      • Perceptron (neuron)
        • Linear classifier
        • Data linearly separable using a hyperplane:
          $$w_0 a_0 + w_1 a_1 + w_2 a_2 + \dots + w_k a_k = 0$$
        • Where w = weights, a = real-valued feature vector, a_0 = bias
        • Binary classifier f(a) that maps its input vector a to a single, binary output value
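The hyperplane equation can be turned into a small classifier. The following is a sketch of the classic perceptron learning rule, not necessarily what was used in this work. It assumes the bias input a0 is fixed to 1, so `w[0]` acts as the bias weight, and it uses the logical AND function as a linearly separable toy data set.

```java
class Perceptron {
    // w[0] is the weight of the bias input a0 = 1 (assumption); w[1..k] belong to a1..ak.
    final double[] w;

    Perceptron(int features) {
        w = new double[features + 1];
    }

    // Returns 1 if w0*a0 + w1*a1 + ... + wk*ak > 0, otherwise 0.
    int predict(double[] a) {
        double sum = w[0];
        for (int i = 0; i < a.length; i++) {
            sum += w[i + 1] * a[i];
        }
        return sum > 0 ? 1 : 0;
    }

    // Perceptron learning rule: shift the hyperplane towards each misclassified example.
    void train(double[][] data, int[] labels, int epochs) {
        for (int e = 0; e < epochs; e++) {
            for (int j = 0; j < data.length; j++) {
                int error = labels[j] - predict(data[j]);
                if (error != 0) {
                    w[0] += error;
                    for (int i = 0; i < data[j].length; i++) {
                        w[i + 1] += error * data[j][i];
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        double[][] inputs = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        int[] and = {0, 0, 0, 1}; // logical AND is linearly separable
        Perceptron p = new Perceptron(2);
        p.train(inputs, and, 20);
        for (double[] a : inputs) {
            System.out.println((int) a[0] + " AND " + (int) a[1] + " -> " + p.predict(a));
        }
    }
}
```

Replacing the labels with XOR (`{0, 1, 1, 0}`) gives a data set that no single hyperplane separates, so this rule never settles on a correct solution; that is the motivation for multilayer networks.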
  • 47.
  • 48.
    Classifiers IV
      • Multilayer perceptron
        • Non-linear classifier
        • Perceptrons are connected in a hierarchical structure
  • 49.
    Classifiers V
      • Not all data is linearly separable
  • 50.
  • 51.
    Classifiers VII
      • Multilayer perceptron
        • Perceptrons organized in several layers
        • Each layer is fully connected to the next layer
        • All nodes except the input nodes are perceptrons
        • Feedforward neural network
        • Uses backpropagation for training
          • Error is propagated back to minimize the loss function
  • 52.
    Classifiers VIII
      • Allows approximate solutions to be found for very complex problems
      • Support Vector Machines (SVM) are a much simpler alternative to ANN
      • Many more classifiers
        • k-Nearest Neighbor
        • Naïve Bayes
        • …
  • 53.
    Data classification I
      Data collection
        • Query Twitter through API
        • Store in DB
      Preprocessing
        • Remove stopwords
        • Remove Tweets with links
      Model evaluation
        • Classification algorithm
        • Neural network
      Time series
        • Twitter volume
        • Binary sentiment classification
      Correlation
        • Correlation between sentiment and financial data
  • 54.
    Data classification II
      • Data classification:
        • Binary mood polarity: positive, negative
        • Represented graphically as time series
      Chart: positive Tweets and negative Tweets
  • 55.
    Correlations I
      Data collection
        • Query Twitter through API
        • Store in DB
      Preprocessing
        • Remove stopwords
        • Remove Tweets with links
      Model evaluation
        • Classification algorithm
        • Neural network
      Time series
        • Twitter volume
        • Binary sentiment classification
      Correlation
        • Correlation between sentiment and financial data
      Chart: sentiment polarity and share price
  • 56.
    Correlations II
      • Finding correlations:
        • Binary sentiment classification time series compared against stock price over the same time frame
        • Does a rise in the number of positive Tweets precede a surge in the Apple stock price?
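One simple way to ask whether Tweet sentiment precedes price moves is to shift one series against the other and compute the Pearson correlation at each lag. This is a plain lagged correlation, a deliberately simpler stand-in and not a Granger causality test. The series in `main` are made-up illustration values, not results from this study.

```java
class LaggedCorrelation {
    // Pearson correlation between x[t] and y[t + lag] for lag >= 0.
    // Returns NaN if either overlapping segment is constant.
    static double correlation(double[] x, double[] y, int lag) {
        int n = Math.min(x.length, y.length) - lag;
        if (lag < 0 || n < 2) {
            throw new IllegalArgumentException("need lag >= 0 and at least two overlapping points");
        }
        double meanX = 0, meanY = 0;
        for (int t = 0; t < n; t++) {
            meanX += x[t];
            meanY += y[t + lag];
        }
        meanX /= n;
        meanY /= n;
        double cov = 0, varX = 0, varY = 0;
        for (int t = 0; t < n; t++) {
            double dx = x[t] - meanX;
            double dy = y[t + lag] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Hypothetical daily counts of positive Tweets and daily price changes.
        double[] positiveTweets = {1, 3, 2, 5, 4, 6};
        double[] priceChange = {0, 2, 6, 4, 10, 8, 12};
        for (int lag = 0; lag <= 2; lag++) {
            System.out.println("lag " + lag + ": r = "
                    + correlation(positiveTweets, priceChange, lag));
        }
    }
}
```

In the toy data the price series is the Tweet series doubled and delayed by one day, so the correlation peaks at lag 1. A high lagged correlation is only a hint; it does not establish that sentiment drives the price.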
  • 57.
    Correlations III
      Chart: Microsoft stock price (Yahoo! Finance 2014)
  • 58.
    Correlations IV
      Chart: Tweet polarity and MSFT stock price
  • 59.
    Correlations V
      • If there are correlations in historic data, the trained model is used against real-time data
      • Access real-time Tweets using Twitter’s streaming API
        • Firehose API (100% of real-time Tweets)
        • Gardenhose API (10% of real-time Tweets)
        • Spritzer API (1% of real-time Tweets)
  • 60.
    Correlations VI
      • Since correlations are most certainly non-linear, correlating has to be automated
      • Bivariate Granger causality test
        • Determines whether one time series can be used to predict another
        • If time series X helps to predict time series Y, X is said to Granger-cause Y
        • X provides statistically significant information about Y
  • 61.
    Correlations VII
      • Granger test examines linear causality among bivariate or multivariate time series
        • Many real-world phenomena are not linear
        • Non-linear extensions to Granger have been developed
      • Other correlation techniques
        • Phase Slope Index measures temporal flux between time series
  • 62.
    Correlations VIII
      • Phase Slope Index is more robust than Granger since it is more immune to noise
      • Machine learning techniques such as ANN can be used for finding correlations
  • 63.
    Applications I
      • Technologies for predictive analysis have matured
        • IBM SPSS
        • Stata
        • SAS
  • 64.
    Applications II
      • Free open source
        • WEKA
      • Partly open source
        • RapidMiner
      • Cloud solutions
        • IBM Watson Analytics
        • Google BigQuery
        • SAS Cloud Analytics
  • 65.
    Challenges I
      • Real-world data is often of very poor quality
        • Social Media is vast, noisy and unstructured
        • Getting relevant posts is challenging
        • Spam has become a serious issue
      • Detecting sarcasm is very difficult
        • Political opinions are full of irony and sarcasm
      • Data preprocessing is one of the most important steps
  • 66.
    Challenges II
      • Opinion mining remains a challenging task
        • Overall statement often difficult to determine
        • No ground truth
      • Not everybody is using Social Media
        • Self-selection bias
  • 67.
    Conclusions I
      • Predictive analysis poses many interesting research problems
      • Many opportunities for future research
        • Determining the credibility of posts (catfish, sock puppet)
        • Better filtering mechanisms
        • More research in Machine Learning than in feature extraction
  • 68.
    Conclusions II
      • Correlation does not mean causation
        • Finding the causative mechanism behind a correlation
  • 69.
    Thank you for the attention
      • Questions?
  • 70.
    References I
      • Achrekar, H, Gandhe, A, Lazarus, R, Ssu-Hsin, Y & Benyuan, L 2011, 'Predicting Flu Trends using Twitter data', Computer Communications Workshops (INFOCOM WKSHPS), IEEE, pp. 702-7.
      • Arias, M, Arratia, A & Xuriguera, R 2014, 'Forecasting with twitter data', ACM Trans. Intell. Syst. Technol., vol. 5, no. 1, pp. 1-24.
      • Asur, S & Huberman, BA 2010, 'Predicting the Future with Social Media', in Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on, vol. 1, pp. 492-9.
      • Berman, JJ 2013, Principles of Big Data, Elsevier Inc., Waltham, USA.
      • Bollen, J, Mao, H & Zeng, X-J 2010, 'Twitter mood predicts the stock market', Journal of Computational Science, vol. 2, p. 8.
      • Buhl, H, Röglinger, M, Moser, F & Heidemann, J 2013, 'Big Data', WIRTSCHAFTSINFORMATIK, vol. 55, no. 2, pp. 63-8.
      • Bulysheva, L & Bulyshev, A 2012, 'Segmentation modeling algorithm: a novel algorithm in data mining', Information Technology and Management, vol. 13, no. 4, pp. 263-71.
      • Darwish, A & Lakhtaria, KI 2011, The Impact of the New Web 2.0 Technologies in Communication, Development, and Revolutions of Societies, vol. 2, 2011.
      • Goh, KY, Heng, CS & Lin, Z 2012, 'Social Media Brand Community and Consumer Behavior: Quantifying the Relative Impact of User- and Marketer-Generated Content', School of Computing, National University of Singapore, viewed 9 April 2013, <https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2048614>.
      • Graham, DM, Hale, SA & Stephens, M 2011, 'User-generated Content in Google', Oxford University, Oxford, UK, viewed 27 October 2013, <http://www.oii.ox.ac.uk/vis/?id=4e3c030d>.
      • Harris, D 2013, 'DataSift raises $42M', Gigaom, viewed 27 December 2013, <http://gigaom.com/2013/12/03/datasift-raises-42m-maybe-theres-something-to-this-social-data-after-all/>.
      • Huang, S, Peng, W, Li, J & Lee, D 2013, 'Sentiment and topic analysis on social media: a multi-task multi-label classification approach', paper presented to Proceedings of the 5th Annual ACM Web Science Conference, Paris, France.
      • Kao, A, Ferng, W, Poteet, S, Quach, L & Tjoelker, R 2013, 'TALISON - Tensor analysis of social media data', in Intelligence and Security Informatics (ISI), 2013 IEEE International Conference on, pp. 137-42.
      • Klein, D, Tran-Gia, P & Hartmann, M 2013, 'Big Data', Informatik-Spektrum, vol. 36, no. 3, p. 319.
      • Kumar, P, Nitin, Chauhan, DS & Sehgal, VK 2012, 'Selection of evolutionary approach based hybrid data mining algorithms for decision support systems and business intelligence', paper presented to Proceedings of the International Conference on Advances in Computing, Communications and Informatics, Chennai, India.
  • 71.
    References II
      • Kumar, P, Kumar Sehgal, N, Kumar Sehgal, V & Singh Chauhan, D 2012, 'A Benchmark to Select Data Mining Based Classification Algorithms for Business Intelligence and Decision Support Systems', International Journal of Data Mining & Knowledge Management Process, vol. 2, no. 5, pp. 25-42.
      • Lim, E-P, Chen, H & Chen, G 2013, 'Business Intelligence and Analytics: Research Directions', ACM Trans. Manage. Inf. Syst., vol. 3, no. 4, pp. 1-10.
      • Manyika, J, Chui, M, Brown, B, Bughin, J, Dobbs, R, Roxburgh, C & Byers, AH 2011, Big data: The next frontier for innovation, competition, and productivity, McKinsey Global Institute.
      • Mayer-Schonberger, V & Cukier, K 2013, Big Data: A Revolution That Will Transform How We Live, Work, and Think, Houghton Mifflin Harcourt Publishing Company, New York, USA.
      • Mayer, A 2009, 'Online social networks in economics', Decision Support Systems, vol. 47, no. 3, pp. 169-184, viewed 22 September 2013, <http://sistemas-humano-computacionais.wdfiles.com/local--files/capitulo%3Aredes-sociais/amayer.pdf>.
      • McKelvey, K, Rudnick, A, Conover, MD & Menczer, F 2012, 'Visualizing Communication on Social Media, Making Big Data Accessible', Indiana University School of Informatics and Computing, viewed 29 September 2013, <http://arxiv.org/pdf/1202.1367v1.pdf>.
      • Neri, F, Aliprandi, C, Capeci, F, Cuadros, M & By, T 2012, 'Sentiment Analysis on Social Media', in Advances in Social Networks Analysis and Mining (ASONAM), 2012 IEEE/ACM International Conference on, pp. 919-26.
      • Oboler, A, Welsh, K & Cruz, L 2012, The danger of big data: Social media as computational social science, 2012.
      • Ostrowski, DA 2011, 'Predictive Semantic Social Media Analysis', in Semantic Computing (ICSC), 2011 Fifth IEEE International Conference on, pp. 283-90.
      • Paltoglou, G & Thelwall, M 2012, 'Twitter, MySpace, Digg: Unsupervised Sentiment Analysis in Social Media', ACM Trans. Intell. Syst. Technol., vol. 3, no. 4, pp. 1-19.
      • Rusli, EM 2013, Facebook Woos TV Networks With Data, Digits, viewed 15 February 2014, <http://blogs.wsj.com/digits/2013/09/29/facebook-woos-tv-networks-with-more-data/>.
      • Smith, MS, Ventura, AD, Dewey, DP, Knutson, CD & Embley, DW 2011, 'A Computational Framework for Social Capital in Online Communities', Brigham Young University, viewed 28 July 2013, <http://posts.smithworx.com/publications/d.pdf>.
  • 72.
    References III
      • Yahoo! Finance 2014, Microsoft Corporation (MSFT), Yahoo, viewed 15 February 2014, <http://finance.yahoo.com/echarts?s=MSFT+Interactive#symbol=msft;range=20130102,20140214;compare=;indicator=volume;charttype=area;crosshair=on;ohlcvalues=0;logscale=off;source=;>.
      • Trif, S 2011, 'Using Genetic Algorithms in Secured Business Intelligence Mobile Applications', Informatica economica, vol. 15, no. 1, pp. 69-79.
      • Tumasjan, A, Sprenger, TO, Sandner, PG & Welpe, IM 2011, 'Election Forecasts With Twitter: How 140 Characters Reflect the Political Landscape', Social science computer review, vol. 29, no. 4, pp. 402-18.
      • Sakaki, T, Okazaki, M & Matsuo, Y 2010, 'Earthquake shakes Twitter users: real-time event detection by social sensors', Proc. of the 19th international conference on World wide web, Raleigh.
      • Twitter Statistics 2014, Statistic brain, viewed 18 February 2014, <http://www.statisticbrain.com/twitter-statistics/>.
      • Walton, A 2014, 'Twitter Usage by Region', Chron, viewed 18 February 2014, <http://smallbusiness.chron.com/twitter-usage-region-62762.html>.
      • Wang, F-Y, Carley, KM, Zeng, D & Mao, W 2007, 'Social Computing: From Social Informatics to Social Intelligence', Intelligent Systems, IEEE, vol. 22, no. 2, pp. 79-83.
      • Weka knowledge explorer, viewed 15 February 2014, <http://www.cs.waikato.ac.nz/~ml/weka/gui_explorer.html>.
      • Witten, IH, Frank, E & Hall, MA 2011, Data Mining, 3rd edn, Elsevier, Burlington, MA, USA.
      • Wlodarczak, P 2014, 'Big Personal Data', Social Science Research Network, <http://dx.doi.org/10.2139/ssrn.2514721>.
      • World Stock Exchanges 2011, viewed 18 February 2014, <http://www.world-stock-exchanges.net/top10.html>.
      • Wong, FMF, Sen, S & Chiang, M 2012, 'Why Watching Movie Tweets Won’t Tell the Whole Story?', Cornell University, viewed 14 May 2013, <http://arxiv.org/pdf/1203.4642v1.pdf>.
      • Wu, X, Kumar, V, Ross Quinlan, J, Ghosh, J, Yang, Q, Motoda, H, McLachlan, GJ, Ng, A, Liu, B, Yu, PS, Zhou, Z-H, Steinbach, M, Hand, DJ & Steinberg, D 2007, 'Top 10 algorithms in data mining', Knowledge and Information Systems, vol. 14, no. 1, pp. 1-37.
      • Zeng, D, Chen, H, Lusch, R & Li, S-H 2010, 'Social Media Analytics and Intelligence', Intelligent Systems, IEEE, vol. 25, no. 6, pp. 13-6.
      • Zeng, L, Li, L & Duan, L 2012, 'Business intelligence in enterprise computing environment', Information Technology and Management, vol. 13, no. 4, pp. 297-310.