MQF Program
Team Members:
Chang Liu
Jun Li
Wei He
Text Mining:
Using Tweets to Predict the
Stock Price Fluctuation
Content
• Part I. Fundamentals of Sentiment Analysis
• Part II. Naive Bayes Classifier
• Part III. Pre-implementation
• Part IV. Implementation Steps
• Part V. Results and Conclusions
Fundamentals of Sentiment Analysis - Market Anomaly
Sentiment vs. Returns
Many studies have documented that investor sentiment plays an important role in determining prices. Behavioral finance attempts to show that investors' irrational behavior actually affects stock prices. Especially for stocks without enough arbitrage forces to absorb shocks, investor sentiment systematically affects the movement of stock prices.
Fundamentals of Sentiment Analysis - EMH
Efficient Market Hypothesis (EMH)
Under the EMH, stocks always trade at their fair value on stock exchanges, making it impossible for investors to either purchase undervalued stocks or sell stocks at inflated prices. As such, it should be impossible to outperform the overall market through expert stock selection or market timing, and the only way an investor can possibly obtain higher returns is by purchasing riskier investments.
But the EMH is not always true!
Fundamentals of Sentiment Analysis - Sentiment Analysis
• Sentiment analysis refers to the use of natural language processing, text analysis, and computational linguistics to identify and extract subjective information from source materials.
• The basic task of sentiment analysis is to determine whether the opinion expressed in a given text is positive, negative, or neutral.
• Beyond this, sentiment analysis can also determine more specific emotional states such as "angry," "sad," and "happy."
Fundamentals of Sentiment Analysis - Sentiment Analysis (Cont.)
• Sentiment tracking techniques have improved to the point where indicators of public mood can be extracted directly from social media content such as blog posts and, in particular, large-scale Twitter feeds.
• Although each tweet is limited to only 140 characters, the aggregate of the millions of tweets submitted to Twitter at any given time may provide an accurate representation of public mood and sentiment.
Fundamentals of Sentiment Analysis - Sentiment Analysis (Cont.)
Example: "It's a product that tries too hard to do too much. It's trying to be a tablet and a notebook and it really succeeds at being neither. It's sort of diluted."
dilute: "to diminish the strength, flavor, or brilliance of by admixture" (Merriam-Webster)
dilute: "make something weaker in force, content, or value by modifying it or adding other elements to it" (Oxford)
Verdict: NEGATIVE!
Fundamentals of Sentiment Analysis – Basic Idea of Our Project
• Typical sentiment analysis is based on training sets built by manually dividing words into groups such as positive, neutral, and negative across thousands of text samples.
• Unlike the typical approach, we collect and aggregate all available tweets that mention a specific stock (by name or ticker) within each minute and determine whether this collection of tweets influences that stock's price in the next minute.
• In other words, each sample combines the tweets within one minute with the change in the stock price in the following minute. For instance, if the stock price increases, i.e., its return is positive, we assign that sample to the positive group.
Naive Bayes Classifier
• Probability Basics
• Bayes Theorem
• MAP Classification Rule
• Naive Bayes Classification
• Naive Bayes Algorithm
Naive Bayes Classifier – Probability Basics
• We define prior, conditional and joint probability for random variables
– Prior probability: P(X)
– Conditional probability: P(X_1 | X_2), P(X_2 | X_1)
– Joint probability: X = (X_1, X_2), P(X) = P(X_1, X_2)
– Relationship: P(X_1, X_2) = P(X_2 | X_1) P(X_1) = P(X_1 | X_2) P(X_2)
– Independence: P(X_2 | X_1) = P(X_2), P(X_1 | X_2) = P(X_1), P(X_1, X_2) = P(X_1) P(X_2)
• Bayesian Rule
P(C | X) = P(X | C) P(C) / P(X), i.e., Posterior = Likelihood × Prior / Evidence
Naive Bayes Classifier – Bayes Theorem
Assume that we have two classes: c_1 = Male and c_2 = Female, and a person whose sex we do not know, named "Drew".
Classifying this "Drew" as male or female is equivalent to asking which is more probable, that "Drew" is male or that "Drew" is female, i.e., which value is greater: p(male | Drew) or p(female | Drew)?
p(male | Drew) = p(Drew | male) × p(male) / p(Drew)
where p(Drew | male) is the probability of being called "Drew" given that you are male, p(male) is the probability of being male, and p(Drew) is the probability of being named "Drew". ("Drew" can be either sex: Drew Chadwick, Drew Barrymore.)
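A short worked example with purely hypothetical counts (not taken from the slides): suppose a sample of 8 people contains 5 males and 3 females, with 1 of the males and 2 of the females named Drew. Then p(Drew) = 3/8 and
p(male | Drew) = (1/5)(5/8) / (3/8) = 1/3,
p(female | Drew) = (2/3)(3/8) / (3/8) = 2/3,
so this "Drew" would be classified as female.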
Naive Bayes Classifier – MAP Classification Rule
• MAP classification rule
– MAP (Maximum A Posteriori)
– Assign x to c* if P(C = c* | X = x) > P(C = c | X = x), c ≠ c*, c ∈ {c_1, ..., c_L}
• Method of generative classification with the MAP rule
1. Apply the Bayesian rule to convert likelihoods and priors into posterior probabilities:
P(C = c_i | X = x) = P(X = x | C = c_i) P(C = c_i) / P(X = x) ∝ P(X = x | C = c_i) P(C = c_i), for i = 1, 2, ..., L
2. Then apply the MAP rule (e.g., to classify another "Drew").
Naive Bayes Classifier – Naive Bayes Classification
• Bayes classification
P(C | X) ∝ P(X | C) P(C) = P(X_1, ..., X_n | C) P(C)
• Naive Bayes classification
– Assume that all input attributes are conditionally independent:
P(X_1, X_2, ..., X_n | C) = P(X_1 | X_2, ..., X_n, C) P(X_2, ..., X_n | C)
= P(X_1 | C) P(X_2, ..., X_n | C)
= P(X_1 | C) P(X_2 | C) ... P(X_n | C)
– MAP classification rule: for x = (x_1, x_2, ..., x_n), assign x to c* if
[P(x_1 | c*) ... P(x_n | c*)] P(c*) > [P(x_1 | c) ... P(x_n | c)] P(c), c ≠ c*, c ∈ {c_1, ..., c_L}
Naive Bayes Classifier – Naive Bayes Algorithm
• The Naive Bayes Algorithm (for discrete input attributes) has two phases
– 1. Learning Phase: Given a training set S,
for each target value c_i of C (c_i = c_1, ..., c_L), estimate P̂(C = c_i) with examples in S;
for every attribute value x_jk of each attribute X_j (j = 1, ..., n; k = 1, ..., N_j), estimate P̂(X_j = x_jk | C = c_i) with examples in S.
Output: conditional probability tables for the attributes X_j.
– 2. Test Phase: Given an unknown instance X' = (a_1, ..., a_n),
look up the tables to assign the label c* to X' if
[P̂(a_1 | c*) ... P̂(a_n | c*)] P̂(c*) > [P̂(a_1 | c) ... P̂(a_n | c)] P̂(c), c ≠ c*, c ∈ {c_1, ..., c_L}
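A minimal Python sketch of the two phases above, for illustration only (it is not the project's code); the variable names, the add-one smoothing, and the assumption of binary attribute values are our own choices.

from collections import Counter, defaultdict

def learn(samples):
    # Learning phase: samples is a list of (attribute_tuple, label) pairs.
    prior = Counter(label for _, label in samples)        # counts of C = c_i
    cond = defaultdict(Counter)                           # counts of X_j = x_jk given C = c_i
    for attrs, label in samples:
        for j, value in enumerate(attrs):
            cond[label][(j, value)] += 1
    return prior, cond, len(samples)

def classify(attrs, prior, cond, n_samples):
    # Test phase: apply the MAP rule, with add-one smoothing for unseen values
    # (the "+ 2" denominator assumes each attribute is binary).
    best_label, best_score = None, float('-inf')
    for label, count in prior.items():
        score = count / n_samples                         # estimate of P(C = c_i)
        for j, value in enumerate(attrs):
            score *= (cond[label][(j, value)] + 1) / (count + 2)   # estimate of P(X_j = x_jk | C = c_i)
        if score > best_score:
            best_label, best_score = label, score
    return best_label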
Naive Bayes Classifier
Pre-implementation – Basic Idea
• Whether people are saying something good or bad about Microsoft
"Love the new surface pro3. Good work Microsoft!"
"Having a great internship in Microsoft."
• Whether companies are announcing something good or bad about Microsoft that will move the stock price of MSFT up or down
"The quarterly report of Microsoft seems pretty good!"
"Microsoft has closed 5 major stores in east China."
• Whether internet sensations are saying they like or dislike the products of Microsoft
"Tim Cook: The new surface book is not interesting."
"Ma Yun: We will cooperate with Microsoft to create new online shopping experience."
Pre-implementation – Basic Idea
• Traditional Sentiment Model
• Naïve Bayes Classification and Support Vector Machine (SVM)
• Single tweet? Or tweet-block?
a) Does one single tweet really have an impact on the stock price of Microsoft?
b) Will a tweet-block, which contains all the tweets in a minute, have more effect on the stock price?
c) We believe the latter. This is different from the traditional sentiment model.
• Decision Tree model?
• Since we use one-minute tweet-blocks, each of which may contain hundreds or even thousands of related tweets, a decision tree may not be a good choice here.
• Why Python?
• Fast and open-source
• Packages: nltk, csv, tweepy
Pre-implementation – Data Source
• Free historical Twitter data is difficult to find.
• Python package tweepy (a collection sketch follows below)
a) A free way to access tweet data
b) Only the latest public tweets are available
c) We let the system run from 9:30 am to 4:00 pm every day between 25th Nov and 2nd Dec
d) Potential problem: only business-day data will be useful in our model.
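A minimal collection sketch, assuming the older tweepy 3.x streaming API (tweepy.StreamListener and tweepy.Stream); newer tweepy versions expose a different interface, and the credential placeholders and output file name are assumptions.

import tweepy

class SaveListener(tweepy.StreamListener):
    def __init__(self, path):
        super().__init__()
        self.out = open(path, 'a')

    def on_data(self, raw_data):
        # raw_data is the raw JSON string for one streamed message;
        # store one message per line for later filtering
        self.out.write(raw_data.strip() + '\n')
        return True

    def on_error(self, status_code):
        # returning False on HTTP 420 stops the stream instead of retrying aggressively
        return status_code != 420

auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
stream = tweepy.Stream(auth, SaveListener('microsoft_tweets.json'))
stream.filter(track=['microsoft'])   # same keyword filter as on the slides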
Pre-implementation
– Data Source (Cont.)
• Intraday Stock Price of MSFT
• Price changes per minute
Pre-implementation – Data Filtration
• JSON (a field-extraction sketch follows below)
a) JavaScript Object Notation: a syntax used to transfer data between servers
b) Different tag names carry different pieces of information
c) E.g. "created_at": the time the tweet was created
"id": the ID of the tweet
"text": the text content of the tweet
"lang": the language the tweet was written in
"place" and "country", etc.
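A minimal sketch of reading the fields named above with the standard json module; the file name is an assumption carried over from the collection sketch.

import json

with open('microsoft_tweets.json') as f:
    for line in f:
        tweet = json.loads(line)
        created_at = tweet.get('created_at')   # time the tweet was created
        text = tweet.get('text')               # text content of the tweet
        lang = tweet.get('lang')               # language code, e.g. 'en'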
Pre-implementation – Data Filtration (Cont.)
• Keywords
a) We only care about tweets related to the Microsoft company or its stock.
b) stream.filter(track = ['microsoft']);
c) Other keywords? (MSFT, MSFT stocks)
• Language
a) We only care about tweets written in English.
b) "lang" = 'en'
Pre-implementation – Data Filtration (Cont.)
• Encoding
a) The data source uses UTF-8
b) Even though a tweet is marked as English, it may still contain some Chinese or Korean characters
c) We use a regular expression to detect such characters (see the sketch below)
• Some JSON messages are not tweets
a) In practice we found that we sometimes receive a JSON message whose content starts with {"limit"...}
b) We simply discard it:
c) if line[0:8] != '{"limit"':
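A minimal sketch of the two checks above; the character ranges used to detect Chinese, Japanese, and Korean text are our assumption, since the slides only say that a regular expression is used.

import re

CJK = re.compile(u'[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]')   # kana, CJK ideographs, Hangul

def is_tweet_line(line):
    # rate-limit notices such as {"limit"...} are not tweets
    return line[0:8] != '{"limit"'

def strip_cjk(text):
    # drop Chinese/Japanese/Korean characters that slip into English tweets
    return CJK.sub('', text)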
Pre-implementation – Data Filtration (Cont.)
• NICE & nice & Nice
a) These are the same word, so we convert everything to lower case
b) tweet = tweet.lower()
• @ChangLiu & www.twitter.com
a) The author handles and URLs contained in a tweet are not related to our model.
b) Regular expressions:
c) tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
d) tweet = re.sub(r'@[^\s]+', 'ATUSER', tweet)
• Repeating letters
a) E.g. hunggrryyy, huuuuuuungry for 'hungry'.
b) We look for 2 or more repeated letters in a word and replace them with exactly 2 of the same letter (a combined normalization sketch follows below).
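A minimal sketch that combines the normalization steps above into one function; the repeated-letter pattern is our own, since the slides describe the rule but give no code for it.

import re

def normalize(tweet):
    tweet = tweet.lower()                                               # NICE / Nice / nice -> nice
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)  # URLs -> URL
    tweet = re.sub(r'@[^\s]+', 'ATUSER', tweet)                         # @handles -> ATUSER
    tweet = re.sub(r'(.)\1+', r'\1\1', tweet)                           # huuuungry -> huungry
    return tweet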
Pre-implementation – Data Filtration (Cont.)
• Stop words
a) a, is, the, with, etc.
b) These words don't indicate any sentiment and can be removed.
c) Before processing the tweets, we read in a file which contains all the stop words.
d) Since these words carry essentially no meaning, we simply ignore them (a removal sketch follows below).
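A minimal sketch of the stop-word removal step; 'stopwords.txt' is an assumed name for the stop-word file the slides mention.

def load_stop_words(path='stopwords.txt'):
    with open(path) as f:
        return set(word.strip() for word in f if word.strip())

def remove_stop_words(tweet, stop_words):
    # keep only the words that are not in the stop-word list
    return ' '.join(w for w in tweet.split() if w not in stop_words)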
Implementation Steps
1. Download the data.
2. Process and filter the data.
3. Split the tweets into one-minute tweet-blocks.
a) We read the time tag of each tweet and, for each minute, simply concatenate all the tweets within that minute into one "sentence".
4. Combine with the stock price data (a block-building and labeling sketch follows below).
a) We assume the information time lag is 1 minute: information made public one minute ago will have an impact on this minute's stock price.
b) E.g. we have a tweet-block for 9:50 am to 9:51 am.
c) We find that the stock price change of MSFT from 9:51 am to 9:52 am is 0.03%.
d) We therefore classify this tweet-block as 'positive'.
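A minimal sketch of steps 3 and 4, assuming tweets arrive as (minute, text) pairs and prices as a dict mapping each minute to a price; both layouts are assumptions made for illustration.

from collections import defaultdict
from datetime import timedelta

def build_blocks(tweets):
    # step 3: concatenate all tweets within the same minute into one "sentence"
    blocks = defaultdict(list)
    for minute, text in tweets:
        blocks[minute].append(text)
    return {minute: ' '.join(texts) for minute, texts in blocks.items()}

def label_blocks(blocks, prices):
    # step 4: label each block with the sign of the following minute's return
    samples = []
    for minute, text in blocks.items():
        t1, t2 = minute + timedelta(minutes=1), minute + timedelta(minutes=2)
        if t1 in prices and t2 in prices:
            ret = prices[t2] / prices[t1] - 1
            label = 'positive' if ret > 0 else 'negative' if ret < 0 else 'neutral'
            samples.append((text, label))
    return samples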
Implementation Steps (Cont.)
5. Form the training data set
a) We classify a positive return rate as "Positive".
b) A negative return rate as "Negative".
c) A zero return rate as "Neutral".
Implementation Steps (Cont.)
6. Build a classifier based on the Naive Bayes classification method (a training sketch follows below)
a) It builds a feature list containing the words that appear in the 'positive', 'negative', and 'neutral' tweet-blocks.
b) Then, for a test tweet-block, it checks whether the block contains words from the 'positive', 'negative', or 'neutral' list.
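A minimal training sketch, assuming the nltk package listed earlier; the word-presence features shown here are a standard choice, not necessarily the project's exact feature definition.

import nltk

def word_features(block_text):
    # one boolean feature per word appearing in the tweet-block
    return {word: True for word in block_text.split()}

def train_classifier(samples):
    # samples: list of (tweet_block_text, label) pairs from the labeling step
    train_set = [(word_features(text), label) for text, label in samples]
    return nltk.NaiveBayesClassifier.train(train_set)

Usage sketch: classifier = train_classifier(samples); classifier.classify(word_features(new_block_text)) then predicts 'positive', 'negative', or 'neutral' for a new tweet-block, and nltk.classify.accuracy(classifier, test_set) gives the in-sample or out-of-sample accuracy.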
Implementation Steps (Cont.)
6. Build a classifier based on the Naïve Bayes classification method
a) Training set: data from 9:30 am to 4:00 pm
b) Dates: 25th Nov, 27th Nov, 30th Nov, 1st Dec, 2nd Dec (9:30 am – 2:50 pm)
c) Total number of tweets: nearly 110,000 single tweets
d) Sample size: 1,518 one-minute tweet-block records
e) Training time: 377 seconds
Implementation Steps (Cont.)
7. In-sample test
a) Testing set: 2nd Dec (9:30 am – 2:50 pm)
b) Size of testing set: 317
c) Correct prediction: 97.16%
d) Testing time: 431 seconds
Implementation Steps (Cont.)
8. Out-of-sample test
a) Testing set: 2nd Dec (2:50 pm – 4:00 pm)
b) Size of testing set: 61
c) Correct prediction: 40.98%
d) Testing time: 111 seconds
Results and Conclusions –
Result Analysis (Cont.)
[Figure: In-sample Test Result. Predicted value and real value (left axis, -2.5 to 2.5) and change rate (right axis, -0.004 to 0.004), per minute from Min0 to Min315.]
Results and Conclusions –
Result Analysis
[Figure: Out-of-sample Test Result. Predicted value and real value (left axis, -2.5 to 2.5) and change rate (right axis, -0.003 to 0.003), per minute from Min 0 to Min 60.]
Results and Conclusions – Drawbacks
1. Not enough training data
a) A traditional sentiment model requires at least 10,000 training records
b) Our model requires even more, but we only have about 1,500 records
2. Not enough testing data to verify the model
3. Single-word features vs. bi-word features (see the sketch after this list)
a) "Not interesting"
b) "Don't like"
c) "Not bad"
With single-word features these phrases lose their negation; bi-word (bigram) features would keep the word pairs together.
4. Time lag problem
a) 1 minute or 5 minutes?
b) The tweets posted between 4:30 pm and 9:30 am the next morning may have more impact on the next day's stock price
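A sketch of the bi-word (bigram) features suggested above, assuming nltk; the feature layout matches the word-presence sketch from the implementation steps.

import nltk

def bigram_features(block_text):
    words = block_text.split()
    features = {word: True for word in words}       # single-word features
    for pair in nltk.bigrams(words):                # e.g. ('not', 'bad')
        features[' '.join(pair)] = True             # bi-word features keep negations
    return features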
Results and Conclusions – Further Improvement
• Gather more data
• Change the time lag to find the optimal one
• Consider other languages
  Some automatic translation tools could help
• Use other classification methods
  SVM, an entropy approach, or even a modified decision tree