1) The document discusses using machine learning methods such as Naive Bayes classification to analyze sentiment in news data from social media sources like Twitter in order to enhance trading strategies.
2) It provides an example of parsing Twitter data related to the SPY ETF using NLTK and Naive Bayes, achieving 79.2% accuracy on the training set and 36.3% on the test set.
3) The document also outlines more advanced approaches to measuring sentiment, such as Stanford NLP and VADER, and discusses ongoing research into improving news sentiment analysis methods.
6. Claims
1. A Wall Street news analytics company: Sentiment data is a determinant of market moves after Federal Open Market Committee (FOMC) rate announcements, with a 75% accuracy rate in 2014.
2. A hedge fund report: We capture a burst of negative sentiment on ResMed at 11:14 AM, October 9, 2014. Despite the serious allegations and the seeming validity of the report, it took the market over 60 minutes to react.
3. An institutional investor: A news-sentiment Open-to-Close (OTC) strategy on SPY returned 29.76% (before costs) over 2014 with a Sharpe ratio of 3.1.
7. Preview
1. First look at social media data
2. Implementation: parsing Twitter news sentiment
3. Improvement: a brief summary of advanced methods
4. Trade the news: tentative trading practices
10. Take Twitter SPY in 2010 as a simple example
What does financial Twitter news look like?
• 2010-01-19T15:14:52Z: $SPY looks strong. riding 5EMA - large gap from SMA50 - concern about 113 level gone for now. [Positive]
• 2010-12-09T13:28:49Z: $SPY managed to reclaim the 1227 support level, which should bode well for further price appreciation. [Positive]
• 2010-12-10T15:59:50Z: $SPY long [Positive]
• 2010-01-21T20:57:21Z: $SPY closing @ the lows! [Negative]
• 2010-09-07T00:10:55Z: Last Sunday, strength in patterns showing a bearish market move [Negative]
• 2010-12-08T16:25:17Z: $SPY has now failed a breakout. Could recover, but for now this is a perfect picture of a failed breakout [Negative]
• 2010-12-16T15:20:07Z: this weeks patterns $SPY see here [Neutral]
11. News Mining: Step 2
How can a machine interpret news sentiment?
12. Parsing News Sentiment using NLTK and Naive Bayes (supervised learning) for classification
An introduction to NLTK:
NLTK is a platform for building Python programs to work with human
language. It provides easy-to-use interfaces to over 50 corpora and lexical
resources along with a suite of text processing libraries for classification,
tokenization, stemming, tagging, parsing, and semantic reasoning.
Twitter text Database description:
Source: http://stocktwits.com/
Format: JSON
Size: over 15 million entries
Fields: id, body, created_at, username, followers, following, …
{"id":918510, "body":"Options Trade in Nordstrom Today $JWN - http://www.cnbc.com/id/34644850/site/14081545/for/cnbc/", "created_at":"2010-01-01T00:09:02Z",
 "user":{"id":6328, "username":"OptionsHawk", "name":"Joe Kunkle", "avatar_url":"http://avatars.stocktwits.net/production/6328/thumb-1290207489.png",
  "avatar_url_ssl":"https://s3.amazonaws.com/st-avatars/production/6328/thumb-1290207489.png", "official":false, "identity":"User", "classification":[],
  "join_date":"2009-11-01", "followers":7072, "following":31, "ideas":18866, "following_stocks":0, "location":"Boston",
  "bio":"Active Options Trader - OptionsHawk.com Founder", "website_url":"http://www.OptionsHawk.com",
  "trading_strategy":{"assets_frequently_traded":["Equities","Options","Forex","Futures"], "approach":"Technical", "holding_period":"Swing Trader", "experience":"Professional"}},
 "source":{"id":1, "title":"StockTwits", "url":"http://stocktwits.com"},
 "symbols":[{"id":6039, "symbol":"JWN", "title":"Nordstrom Inc.", "exchange":"NYSE", "sector":"Services", "industry":"Apparel Stores", "trending":false}],
 "entities":{"sentiment":null}}
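A message in this format can be reduced to the fields a sentiment pipeline needs with the standard library alone. A minimal sketch (the `raw` string below is abbreviated to a few of the fields shown above):

```python
import json

# A single StockTwits message (abbreviated to the fields used here).
raw = '''{"id": 918510,
 "body": "Options Trade in Nordstrom Today $JWN - http://www.cnbc.com/id/34644850/site/14081545/for/cnbc/",
 "created_at": "2010-01-01T00:09:02Z",
 "user": {"username": "OptionsHawk", "followers": 7072},
 "symbols": [{"symbol": "JWN", "exchange": "NYSE"}],
 "entities": {"sentiment": null}}'''

msg = json.loads(raw)

# Keep the text body, timestamp, ticker symbols, and any pre-labeled sentiment.
record = {
    "body": msg["body"],
    "created_at": msg["created_at"],
    "symbols": [s["symbol"] for s in msg["symbols"]],
    "sentiment": msg["entities"]["sentiment"],  # often null -> needs a classifier
}
print(record["symbols"])  # ['JWN']
```

Note that `"sentiment"` is null in the example above, which is exactly why a classifier has to be trained on manually labeled messages.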
13. • Starting set: 10,000 manually labeled Twitter news items
• Distribution of sentiment in the initial training/test split:

SENTIMENT   TRAINING SET   TESTING   TOTAL
POSITIVE    2379           807       3186
NEUTRAL     3849           1248      5097
NEGATIVE    1214           428       1642
SUBTOTAL    7442           2483      9925
NULL        58             17        75
TOTAL       7500           2500      10000
14. 1) Prepare training set - data cleaning (removing nulls, web links, etc.) & import
• pos_tweets = '$SPY looks strong. riding 5EMA - large gap from…', 'positive', …
• neg_tweets = '$SPY closing @ the lows', 'negative', …
2) Split the text sentence into word features
• 'spy', 'looks', 'strong', 'riding', '5ema', 'large', 'gap', 'from', 'sma50', 'concern', 'about', 'level', 'gone', 'for', 'now', …
3) Build a dictionary
A collection of all the recognized word features in the training set
4) Map text onto word features:
• 'contains(spy)': True,
• 'contains(support)': False,
• 'contains(strong)': True, …
Parsing News Sentiment: Dictionary Mapping
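The four steps above can be sketched in plain Python. This is a minimal illustration of the NLTK-style "contains(word)" feature convention, not the deck's actual code; names like `extract_features` and the tokenization rules are assumptions:

```python
import re

# Illustrative labeled tweets (taken from the slide's examples).
pos_tweets = [('$SPY looks strong. riding 5EMA - large gap from SMA50', 'positive')]
neg_tweets = [('$SPY closing @ the lows', 'negative')]

def clean_and_tokenize(text):
    """Steps 1-2: strip web links and punctuation, lowercase, split into word features."""
    text = re.sub(r'http\S+', '', text)            # remove web links
    words = re.findall(r'[a-z0-9]+', text.lower())
    return [w for w in words if len(w) > 2]        # drop very short tokens

# Step 3: the dictionary is every word feature seen in the training set.
all_words = set()
for text, _label in pos_tweets + neg_tweets:
    all_words.update(clean_and_tokenize(text))

def extract_features(text):
    """Step 4: map a tweet onto 'contains(word)' booleans, NLTK-style."""
    words = set(clean_and_tokenize(text))
    return {'contains(%s)' % w: (w in words) for w in sorted(all_words)}

feats = extract_features('$SPY looks strong today')
print(feats['contains(strong)'])   # True
print(feats['contains(lows)'])     # False
```

Every tweet is thus mapped onto the same fixed-length dictionary of boolean features, which is what makes the next step a standard classification problem.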
15. 5) Apply this mapping to all the news texts to get the following form:
Parsing News Sentiment:

            'spy'   'support'   'gone'   …
Twitter_1   True    False       True     …
Twitter_2   True    False       False    …
Twitter_3   …       …           …        …

Encoded numerically:

            Sentiment   WordFeature_1   WordFeature_2   WordFeature_3   …
Twitter_1   1 (Pos)     1 (True)        0 (False)       1 (True)        …
Twitter_2   -1 (Neg)    1 (True)        0 (False)       0 (False)       …
Twitter_3   0 (Neu)     …               …               …               …

A typical classification problem!
16. 6) Classification: Naive Bayes classifier
6*) A simple description of Naive Bayes
Bayes formula:
P(C_k | x) = P(x | C_k) P(C_k) / P(x)
The "naive" assumption comes into play: assume that each feature is conditionally independent of every other feature, so
P(x | C_k) = P(x_1 | C_k) P(x_2 | C_k) … P(x_n | C_k)
In this Twitter example, it means the word features independently affect the sentiment of the text.
The conditional probabilities are estimated by counting:
p(x_i | C_k) = (# of news items with word feature x_i and class C_k) / (# of news items with class C_k)
Parsing News Sentiment:
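These counting estimates can be sketched from scratch in a few lines. This is an illustration of the formula, not the deck's NLTK implementation; the tiny training set is invented, and add-one (Laplace) smoothing is added to avoid zero probabilities, which the slide does not mention:

```python
from collections import Counter, defaultdict
import math

# Tiny illustrative training set: (word features, sentiment class).
train = [
    (['spy', 'looks', 'strong'], 'positive'),
    (['spy', 'reclaim', 'support'], 'positive'),
    (['spy', 'closing', 'lows'], 'negative'),
    (['spy', 'failed', 'breakout'], 'negative'),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)   # word_counts[c][w] = # items of class c containing w
for words, label in train:
    for w in set(words):
        word_counts[label][w] += 1

def log_p_word_given_class(w, c):
    # p(x_i | C_k) = (# items with feature x_i and class C_k) / (# items with class C_k),
    # with add-one smoothing.
    return math.log((word_counts[c][w] + 1) / (class_counts[c] + 2))

def classify(words):
    n = sum(class_counts.values())
    scores = {}
    for c in class_counts:
        score = math.log(class_counts[c] / n)   # prior P(C_k)
        score += sum(log_p_word_given_class(w, c) for w in set(words))
        scores[c] = score                        # log P(C_k) + sum_i log p(x_i | C_k)
    return max(scores, key=scores.get)

print(classify(['spy', 'strong']))   # 'positive'
print(classify(['spy', 'lows']))     # 'negative'
```

Working in log space avoids numeric underflow when thousands of word features are multiplied together.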
17. 7) Model trained. We got 14,356 word features. Most informative features include:

Result:
NEWS ITEM CONTAINS   RATIO
'widely'             positive : negative = 219.8 : 1.0
'held'               positive : negative = 172.4 : 1.0
'most'               positive : negative = 45.7 : 1.0
'fall'               negative : positive = 45.4 : 1.0
'might'              negative : neutral  = 30.6 : 1.0

8) In-sample and out-of-sample test:
• Tweet = '$SPY has now failed a breakout. We could recover, but for now this is a perfect picture of a failed breakout'
  Classified Negative; Prob('negative') = 0.85525
• Tweet = 'SPY UP! I like that!'
  Classified Positive; Prob('positive') = 0.4123, Prob('negative') = 0.1936
• TOTAL in-sample accuracy: 79.2%
• TOTAL out-of-sample accuracy: 36.3%
• With a large enough training set, the accuracy rate should improve substantially.
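The ratios in the table above are likelihood ratios of the kind NLTK's most-informative-features report prints: how much more likely a word is to appear in one class than another. A minimal sketch of that statistic, on an invented toy data set (the smoothing constant is an assumption):

```python
from collections import Counter

# Toy labeled tweets (word lists) -- illustrative only, not the deck's 10,000-item set.
data = [
    (['widely', 'held', 'strong'], 'positive'),
    (['widely', 'most', 'gap'], 'positive'),
    (['fall', 'lows'], 'negative'),
    (['fall', 'breakout'], 'negative'),
]

n_class = Counter(label for _, label in data)
contains = {c: Counter() for c in n_class}
for words, label in data:
    for w in set(words):
        contains[label][w] += 1

def informativeness(w, c1, c2):
    """Smoothed ratio P(contains(w) | c1) / P(contains(w) | c2)."""
    p1 = (contains[c1][w] + 0.5) / (n_class[c1] + 1)
    p2 = (contains[c2][w] + 0.5) / (n_class[c2] + 1)
    return p1 / p2

ratio = informativeness('widely', 'positive', 'negative')
print(round(ratio, 1))  # 'widely' appears only in positive tweets -> large ratio
```

A word like 'widely' that appears almost exclusively in one class gets a large ratio, which is why such words dominate the informative-features table.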
18. A simple summary: NLTK and Naive Bayes method
Pros:
1. A basic approach to sentiment analysis; easy to use;
2. Effective if the training set is large enough;
3. Ability to learn: as the training set gets larger, the results get more and more accurate (intelligence);
Cons:
1. Fails to grasp the connections between words;
2. Does not consider the sequence of words;
3. Includes non-relevant word features;
Possible improvements:
1. Larger training set
2. PCA, to address the problem of too many features;
3. Filtering, to remove spam and meaningless tweets;
4. Detecting short sequences of words;
Currently working on them…
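Improvement 4 (detecting short word sequences) can be sketched by adding bigram features alongside the unigram ones, so a phrase like "not bad" becomes a single feature the classifier can learn as positive. This is an illustrative extension, not the deck's implementation:

```python
def word_and_bigram_features(words):
    """Unigram 'contains(w)' features plus bigram 'contains(w1 w2)' features."""
    feats = {'contains(%s)' % w: True for w in words}
    for w1, w2 in zip(words, words[1:]):       # consecutive word pairs
        feats['contains(%s %s)' % (w1, w2)] = True
    return feats

# With unigrams alone, 'not bad' only contributes two individually negative
# words; the bigram feature 'contains(not bad)' captures the phrase itself.
feats = word_and_bigram_features(['spy', 'not', 'bad'])
print('contains(not bad)' in feats)   # True
```

The cost is a much larger feature dictionary, which makes the filtering and dimensionality-reduction improvements above more important.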
19. News Mining: Step 3
Other advanced methods for measuring news sentiment?
20. Other advanced approaches
Stanford NLP: http://nlp.stanford.edu/
Paper: Christopher Manning and Dan Klein. 2003. Optimization, Maxent Models, and
Conditional Estimation without Magic. Tutorial at HLT-NAACL 2003 and ACL 2003.
Core idea:
Maximum entropy classifier, also known as multiclass logistic regression. Unlike Naive Bayes, MaxEnt does not assume that the features are conditionally independent of each other.
Vivekn: http://github.com/vivekn/sentiment/
Paper: Fast and accurate sentiment classification using an enhanced Naive Bayes
model. Intelligent Data Engineering and Automated Learning IDEAL 2013 Lecture
Notes in Computer Science Volume 8206, 2013, pp 194-201
Core idea:
This tool works by examining individual words and short sequences of words (n-grams). "not bad" will be classified as positive despite containing two individually negative words.
21. More advanced approaches
• Other ones I am currently working on:
• VADER sentiment - https://github.com/cjhutto/vaderSentiment
  Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
• Indico - https://indico.io/
[Figure: daily averaged news sentiment from the indico and vader engines plotted against SPY cumulative return (spy_cum_return), December 2009 - February 2011, based on 580,000 SPY tweets from 2010.]
22. More advanced approaches
• Thomson Reuters News Analytics: http://thomsonreuters.com/en.html
• GATE (+ANNIE) - http://gate.ac.uk/
• LingPipe - http://alias-i.com/lingpipe
• WEKA NLP - http://www.cs.waikato.ac.nz/ml/w...
• OpenNLP - http://incubator.apache.org/open...
• JULIE - http://www.julielab.de/
• Research still ongoing…
• Visit my personal site:
https://public.tableausoftware.com/views/SPYVadernewssentiment2010/SPY?:showVizHome=no#1