MQF Program
Team Members:
Chang Liu
Jun Li
Wei He
Text Mining:
Using Tweets to Predict the
Stock Price Fluctuation
Content
• Part I. Fundamentals of Sentiment Analysis
• Part II. Naive Bayes Classifier
• Part III. Pre-implementation
• Part IV. Implementation Steps
• Part V. Results and Conclusions
Fundamentals of Sentiment Analysis - Market Anomaly
Sentiment vs. Returns
Many studies have documented that investor sentiment plays an important role in determining prices. Behavioral finance attempts to show that investors' irrational behavior actually affects stock prices. Especially for stocks without enough arbitrage forces to absorb shocks, investor sentiment systematically affects the movement of stock prices.
Fundamentals of Sentiment Analysis - EMH
Efficient Market Hypothesis (EMH)
Under the EMH, stocks always trade at their fair value on stock exchanges, making it impossible for investors to either purchase undervalued stocks or sell stocks at inflated prices. As such, it should be impossible to outperform the overall market through expert stock selection or market timing, and the only way an investor can possibly obtain higher returns is by purchasing riskier investments.
But the EMH is not always true!
Fundamentals of Sentiment Analysis - Sentiment Analysis
• Sentiment analysis refers to the use of natural language processing, text analysis, and computational linguistics to identify and extract subjective information from source materials.
• The basic task of sentiment analysis is to determine whether the opinion expressed in a given text is positive, negative, or neutral.
• Beyond this, sentiment analysis can also determine more specific emotional states such as "angry," "sad," and "happy."
Fundamentals of Sentiment Analysis - Sentiment Analysis (Cont.)
• Sentiment tracking techniques have improved to the point where indicators of public mood can be extracted directly from social media content such as blog posts and, in particular, large-scale Twitter feeds.
• Although each tweet is limited to only 140 characters, the aggregate of the millions of tweets submitted to Twitter at any given time may provide an accurate representation of public mood and sentiment.
Fundamentals of Sentiment Analysis - Sentiment Analysis (Cont.)
Example: "It's a product that tries too hard to do too much. It's trying to be a tablet and a notebook and it really succeeds at being neither. It's sort of diluted."
dilute: "to diminish the strength, flavor, or brilliance of by admixture" (Merriam-Webster)
dilute: "make something weaker in force, content, or value by modifying it or adding other elements to it" (Oxford)
Verdict: NEGATIVE!
Fundamentals of Sentiment Analysis – Basic Idea of Our Project
• Typical sentiment analysis is based on training sets built by manually dividing words into groups such as positive, neutral, and negative across thousands of text samples.
• Unlike the typical approach, we collect and aggregate all available tweets that mention a specific stock (by name or ticker) within each minute and determine whether this collection of tweets influences that stock's price in the next minute.
• In other words, each sample combines the tweets within one minute with the change in the stock price in the following minute. For instance, if the stock price increases, i.e., its return is positive, we assign that sample to the positive group.
Naive Bayes Classifier
• Probability Basics
• Bayes Theorem
• MAP Classification Rule
• Naive Bayes Classification
• Naive Bayes Algorithm
Naive Bayes Classifier – Probability Basics
• We define prior, conditional and joint probability for random variables
– Prior probability: P(X)
– Conditional probability: P(X_1 | X_2), P(X_2 | X_1)
– Joint probability: X = (X_1, X_2), P(X) = P(X_1, X_2)
– Relationship: P(X_1, X_2) = P(X_2 | X_1) P(X_1) = P(X_1 | X_2) P(X_2)
– Independence: P(X_2 | X_1) = P(X_2), P(X_1 | X_2) = P(X_1), P(X_1, X_2) = P(X_1) P(X_2)
• Bayesian Rule
P(C | X) = P(X | C) P(C) / P(X), i.e., Posterior = Likelihood × Prior / Evidence
Naive Bayes Classifier – Bayes Theorem
Assume that we have two classes: c_1 = Male and c_2 = Female, and a person whose sex we do not know, named "Drew".
Classifying this "Drew" as male or female is equivalent to asking which is more probable, that "Drew" is male or that "Drew" is female, i.e., which value is greater: p(male | Drew) or p(female | Drew)?
p(male | Drew) = p(Drew | male) × p(male) / p(Drew)
where p(Drew | male) is the probability of being called "Drew" given that you are male, p(male) is the probability of being male, and p(Drew) is the probability of being named "Drew". ("Drew" can be either sex: Drew Chadwick, Drew Barrymore.)
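A short worked example with purely hypothetical counts (not taken from the slides): suppose a sample of 8 people contains 5 males and 3 females, with 1 of the males and 2 of the females named Drew. Then p(Drew) = 3/8 and
p(male | Drew) = (1/5)(5/8) / (3/8) = 1/3,
p(female | Drew) = (2/3)(3/8) / (3/8) = 2/3,
so this "Drew" would be classified as female.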
Naive Bayes Classifier – MAP Classification Rule
• MAP classification rule
– MAP (Maximum A Posteriori)
– Assign x to c* if P(C = c* | X = x) > P(C = c | X = x), c ≠ c*, c ∈ {c_1, ..., c_L}
• Method of generative classification with the MAP rule
1. Apply the Bayesian rule to convert likelihoods and priors into posterior probabilities:
P(C = c_i | X = x) = P(X = x | C = c_i) P(C = c_i) / P(X = x) ∝ P(X = x | C = c_i) P(C = c_i), for i = 1, 2, ..., L
2. Then apply the MAP rule (e.g., to classify another "Drew").
Naive Bayes Classifier – Naive Bayes Classification
• Bayes classification
P(C | X) ∝ P(X | C) P(C) = P(X_1, ..., X_n | C) P(C)
• Naive Bayes classification
– Assume that all input attributes are conditionally independent:
P(X_1, X_2, ..., X_n | C) = P(X_1 | X_2, ..., X_n, C) P(X_2, ..., X_n | C)
= P(X_1 | C) P(X_2, ..., X_n | C)
= P(X_1 | C) P(X_2 | C) ... P(X_n | C)
– MAP classification rule: for x = (x_1, x_2, ..., x_n), assign x to c* if
[P(x_1 | c*) ... P(x_n | c*)] P(c*) > [P(x_1 | c) ... P(x_n | c)] P(c), c ≠ c*, c ∈ {c_1, ..., c_L}
Naive Bayes Classifier – Naive Bayes Algorithm
• The Naive Bayes Algorithm (for discrete input attributes) has two phases
– 1. Learning Phase: Given a training set S,
for each target value c_i of C (c_i = c_1, ..., c_L), estimate P̂(C = c_i) with examples in S;
for every attribute value x_jk of each attribute X_j (j = 1, ..., n; k = 1, ..., N_j), estimate P̂(X_j = x_jk | C = c_i) with examples in S.
Output: conditional probability tables for the attributes X_j.
– 2. Test Phase: Given an unknown instance X' = (a_1, ..., a_n),
look up the tables to assign the label c* to X' if
[P̂(a_1 | c*) ... P̂(a_n | c*)] P̂(c*) > [P̂(a_1 | c) ... P̂(a_n | c)] P̂(c), c ≠ c*, c ∈ {c_1, ..., c_L}
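A minimal Python sketch of the two phases above, for illustration only (it is not the project's code); the variable names, the add-one smoothing, and the assumption of binary attribute values are our own choices.

from collections import Counter, defaultdict

def learn(samples):
    # Learning phase: samples is a list of (attribute_tuple, label) pairs.
    prior = Counter(label for _, label in samples)        # counts of C = c_i
    cond = defaultdict(Counter)                           # counts of X_j = x_jk given C = c_i
    for attrs, label in samples:
        for j, value in enumerate(attrs):
            cond[label][(j, value)] += 1
    return prior, cond, len(samples)

def classify(attrs, prior, cond, n_samples):
    # Test phase: apply the MAP rule, with add-one smoothing for unseen values
    # (the "+ 2" denominator assumes each attribute is binary).
    best_label, best_score = None, float('-inf')
    for label, count in prior.items():
        score = count / n_samples                         # estimate of P(C = c_i)
        for j, value in enumerate(attrs):
            score *= (cond[label][(j, value)] + 1) / (count + 2)   # estimate of P(X_j = x_jk | C = c_i)
        if score > best_score:
            best_label, best_score = label, score
    return best_label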
Naive Bayes Classifier
Pre-implementation – Basic Idea
• Whether people are saying something good or bad about Microsoft
"Love the new surface pro3. Good work Microsoft!"
"Having a great internship in Microsoft."
• Whether companies are announcing something good or bad about Microsoft that will move the stock price of MSFT up or down
"The quarterly report of Microsoft seems pretty good!"
"Microsoft has closed 5 major stores in east China."
• Whether internet sensations are saying they like or dislike the products of Microsoft
"Tim Cook: The new surface book is not interesting."
"Ma Yun: We will cooperate with Microsoft to create new online shopping experience."
Pre-implementation – Basic Idea
• Traditional Sentiment Model
• Naïve Bayes Classification and Support Vector Machine (SVM)
• Single tweet? Or tweet-block?
a) Does one single tweet really have an impact on the stock price of Microsoft?
b) Will a tweet-block, which contains all the tweets in a minute, have more effect on the stock price?
c) We believe the latter. This is different from the traditional sentiment model.
• Decision Tree model?
• Since we use one-minute tweet-blocks, each of which may contain hundreds or even thousands of related tweets, a decision tree may not be a good choice here.
• Why Python?
• Fast and open-source
• Packages: nltk, csv, tweepy
Pre-implementation – Data Source
• Free historical Twitter data is difficult to find.
• Python package tweepy (a collection sketch follows below)
a) A free way to access tweet data
b) Only the latest public tweets are available
c) We let the system run from 9:30 am to 4:00 pm every day between 25th Nov and 2nd Dec
d) Potential problem: only business-day data will be useful in our model.
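A minimal collection sketch, assuming the older tweepy 3.x streaming API (tweepy.StreamListener and tweepy.Stream); newer tweepy versions expose a different interface, and the credential placeholders and output file name are assumptions.

import tweepy

class SaveListener(tweepy.StreamListener):
    def __init__(self, path):
        super().__init__()
        self.out = open(path, 'a')

    def on_data(self, raw_data):
        # raw_data is the raw JSON string for one streamed message;
        # store one message per line for later filtering
        self.out.write(raw_data.strip() + '\n')
        return True

    def on_error(self, status_code):
        # returning False on HTTP 420 stops the stream instead of retrying aggressively
        return status_code != 420

auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
stream = tweepy.Stream(auth, SaveListener('microsoft_tweets.json'))
stream.filter(track=['microsoft'])   # same keyword filter as on the slides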
Pre-implementation
– Data Source (Cont.)
• Intraday Stock Price of MSFT
• Price changes per minute
Pre-implementation – Data Filtration
• JSON (a field-extraction sketch follows below)
a) JavaScript Object Notation: a syntax used to transfer data between servers
b) Different tag names carry different pieces of information
c) E.g. "created_at": the time the tweet was created
"id": the ID of the tweet
"text": the text content of the tweet
"lang": the language the tweet was written in
"place" and "country", etc.
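A minimal sketch of reading the fields named above with the standard json module; the file name is an assumption carried over from the collection sketch.

import json

with open('microsoft_tweets.json') as f:
    for line in f:
        tweet = json.loads(line)
        created_at = tweet.get('created_at')   # time the tweet was created
        text = tweet.get('text')               # text content of the tweet
        lang = tweet.get('lang')               # language code, e.g. 'en'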
Pre-implementation – Data Filtration (Cont.)
• Keywords
a) We only care about tweets related to the Microsoft company or its stock.
b) stream.filter(track = ['microsoft']);
c) Other keywords? (MSFT, MSFT stocks)
• Language
a) We only care about tweets written in English.
b) "lang" = 'en'
Pre-implementation – Data Filtration (Cont.)
• Encoding
a) The data source uses UTF-8
b) Even though a tweet is marked as English, it may still contain some Chinese or Korean characters
c) We use a regular expression to detect such characters (see the sketch below)
• Some JSON messages are not tweets
a) In practice we found that we sometimes receive a JSON message whose content starts with {"limit"...}
b) We simply discard it:
c) if line[0:8] != '{"limit"':
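A minimal sketch of the two checks above; the character ranges used to detect Chinese, Japanese, and Korean text are our assumption, since the slides only say that a regular expression is used.

import re

CJK = re.compile(u'[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]')   # kana, CJK ideographs, Hangul

def is_tweet_line(line):
    # rate-limit notices such as {"limit"...} are not tweets
    return line[0:8] != '{"limit"'

def strip_cjk(text):
    # drop Chinese/Japanese/Korean characters that slip into English tweets
    return CJK.sub('', text)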
Pre-implementation – Data Filtration (Cont.)
• NICE & nice & Nice
a) These are the same word, so we convert everything to lower case
b) tweet = tweet.lower()
• @ChangLiu & www.twitter.com
a) The author handles and URLs contained in a tweet are not related to our model.
b) Regular expressions:
c) tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
d) tweet = re.sub(r'@[^\s]+', 'ATUSER', tweet)
• Repeating letters
a) E.g. hunggrryyy, huuuuuuungry for 'hungry'.
b) We look for 2 or more repeated letters in a word and replace them with exactly 2 of the same letter (a combined normalization sketch follows below).
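A minimal sketch that combines the normalization steps above into one function; the repeated-letter pattern is our own, since the slides describe the rule but give no code for it.

import re

def normalize(tweet):
    tweet = tweet.lower()                                               # NICE / Nice / nice -> nice
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)  # URLs -> URL
    tweet = re.sub(r'@[^\s]+', 'ATUSER', tweet)                         # @handles -> ATUSER
    tweet = re.sub(r'(.)\1+', r'\1\1', tweet)                           # huuuungry -> huungry
    return tweet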
Pre-implementation – Data Filtration (Cont.)
• Stop words
a) a, is, the, with, etc.
b) These words don't indicate any sentiment and can be removed.
c) Before processing the tweets, we read in a file which contains all the stop words.
d) Since these words carry essentially no meaning, we simply ignore them (a removal sketch follows below).
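A minimal sketch of the stop-word removal step; 'stopwords.txt' is an assumed name for the stop-word file the slides mention.

def load_stop_words(path='stopwords.txt'):
    with open(path) as f:
        return set(word.strip() for word in f if word.strip())

def remove_stop_words(tweet, stop_words):
    # keep only the words that are not in the stop-word list
    return ' '.join(w for w in tweet.split() if w not in stop_words)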
Implementation Steps
1. Download the data.
2. Process and filter the data.
3. Split the tweets into one-minute tweet-blocks.
a) We read the time tag of each tweet and, for each minute, simply concatenate all the tweets within that minute into one "sentence".
4. Combine with the stock price data (a block-building and labeling sketch follows below).
a) We assume the information time lag is 1 minute: information made public one minute ago will have an impact on this minute's stock price.
b) E.g. we have a tweet-block for 9:50 am to 9:51 am.
c) We find that the stock price change of MSFT from 9:51 am to 9:52 am is 0.03%.
d) We therefore classify this tweet-block as 'positive'.
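A minimal sketch of steps 3 and 4, assuming tweets arrive as (minute, text) pairs and prices as a dict mapping each minute to a price; both layouts are assumptions made for illustration.

from collections import defaultdict
from datetime import timedelta

def build_blocks(tweets):
    # step 3: concatenate all tweets within the same minute into one "sentence"
    blocks = defaultdict(list)
    for minute, text in tweets:
        blocks[minute].append(text)
    return {minute: ' '.join(texts) for minute, texts in blocks.items()}

def label_blocks(blocks, prices):
    # step 4: label each block with the sign of the following minute's return
    samples = []
    for minute, text in blocks.items():
        t1, t2 = minute + timedelta(minutes=1), minute + timedelta(minutes=2)
        if t1 in prices and t2 in prices:
            ret = prices[t2] / prices[t1] - 1
            label = 'positive' if ret > 0 else 'negative' if ret < 0 else 'neutral'
            samples.append((text, label))
    return samples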
Implementation Steps (Cont.)
5. Form the training data set
a) We classify a positive return rate as "Positive".
b) A negative return rate as "Negative".
c) A zero return rate as "Neutral".
Implementation Steps (Cont.)
6. Build a classifier based on the Naive Bayes classification method (a training sketch follows below)
a) It builds a feature list containing the words that appear in the 'positive', 'negative', and 'neutral' tweet-blocks.
b) Then, for a test tweet-block, it checks whether the block contains words from the 'positive', 'negative', or 'neutral' list.
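A minimal training sketch, assuming the nltk package listed earlier; the word-presence features shown here are a standard choice, not necessarily the project's exact feature definition.

import nltk

def word_features(block_text):
    # one boolean feature per word appearing in the tweet-block
    return {word: True for word in block_text.split()}

def train_classifier(samples):
    # samples: list of (tweet_block_text, label) pairs from the labeling step
    train_set = [(word_features(text), label) for text, label in samples]
    return nltk.NaiveBayesClassifier.train(train_set)

Usage sketch: classifier = train_classifier(samples); classifier.classify(word_features(new_block_text)) then predicts 'positive', 'negative', or 'neutral' for a new tweet-block, and nltk.classify.accuracy(classifier, test_set) gives the in-sample or out-of-sample accuracy.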
Implementation Steps (Cont.)
6. Build a classifier based on the Naïve Bayes classification method
a) Training set: data from 9:30 am to 4:00 pm
b) Dates: 25th Nov, 27th Nov, 30th Nov, 1st Dec, 2nd Dec (9:30 am – 2:50 pm)
c) Total number of tweets: nearly 110,000 single tweets
d) Sample size: 1,518 one-minute tweet-block records
e) Training time: 377 seconds
Implementation Steps (Cont.)
7. In-sample test
a) Testing set: 2nd Dec (9:30 am – 2:50 pm)
b) Size of testing set: 317
c) Correct prediction: 97.16%
d) Testing time: 431 seconds
Implementation Steps (Cont.)
8. Out-of-sample test
a) Testing set: 2nd Dec (2:50 pm – 4:00 pm)
b) Size of testing set: 61
c) Correct prediction: 40.98%
d) Testing time: 111 seconds
Results and Conclusions –
Result Analysis (Cont.)
[Figure: In-sample Test Result. Predicted value and real value (left axis, -2.5 to 2.5) and change rate (right axis, -0.004 to 0.004), per minute from Min0 to Min315.]
Results and Conclusions –
Result Analysis
[Figure: Out-of-sample Test Result. Predicted value and real value (left axis, -2.5 to 2.5) and change rate (right axis, -0.003 to 0.003), per minute from Min 0 to Min 60.]
Results and Conclusions – Drawbacks
1. Not enough training data
a) A traditional sentiment model requires at least 10,000 training records
b) Our model requires even more, but we only have about 1,500 records
2. Not enough testing data to verify the model
3. Single-word features vs. bi-word features (see the sketch after this list)
a) "Not interesting"
b) "Don't like"
c) "Not bad"
With single-word features these phrases lose their negation; bi-word (bigram) features would keep the word pairs together.
4. Time lag problem
a) 1 minute or 5 minutes?
b) The tweets posted between 4:30 pm and 9:30 am the next morning may have more impact on the next day's stock price
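A sketch of the bi-word (bigram) features suggested above, assuming nltk; the feature layout matches the word-presence sketch from the implementation steps.

import nltk

def bigram_features(block_text):
    words = block_text.split()
    features = {word: True for word in words}       # single-word features
    for pair in nltk.bigrams(words):                # e.g. ('not', 'bad')
        features[' '.join(pair)] = True             # bi-word features keep negations
    return features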
Results and Conclusions – Further Improvement
• Gather more data
• Change the time lag to find the optimal one
• Consider other languages
  Some automatic translation tools could help
• Use other classification methods
  SVM, an entropy approach, or even a modified decision tree