1) The document discusses using machine learning methods such as Naive Bayes classification to analyze sentiment in news data from social media sources like Twitter in order to enhance trading strategies.
2) It provides an example of parsing Twitter data related to the SPY ETF using NLTK and Naive Bayes, achieving 79.2% accuracy on the training set and 36.3% on the test set.
3) The document also outlines more advanced approaches to measuring sentiment, such as Stanford NLP and VADER, and discusses ongoing research into improving news sentiment analysis methods.
6. Claims
1. A Wall Street news analytics company: Sentiment data is a determinant of market moves after Federal Open Market Committee (FOMC) rate announcements, with a 75% accuracy rate in 2014.
2. A hedge fund report: We capture a burst of negative sentiment on ResMed at 11:14 AM, October 9, 2014. Despite the serious allegations and the seeming validity of the report, it took the market over 60 minutes to react.
3. An institutional investor: A news-sentiment Open-to-Close (OTC) strategy on SPY returned 29.76% (before costs) over 2014 with a Sharpe ratio of 3.1.
7. Preview
1. First look at social media data
2. Implementation: parsing Twitter news sentiment
3. Improvement: a brief summary of advanced methods
4. Trade the news: tentative trading practices
10. Take Twitter SPY in 2010 as a simple example
What does financial Twitter news look like?
• 2010-01-19T15:14:52Z: $SPY looks strong. riding 5EMA - large gap from SMA50 - concern about 113 level gone for now. [Positive]
• 2010-12-09T13:28:49Z: $SPY managed to reclaim the 1227 support level, which should bode well for further price appreciation. [Positive]
• 2010-12-10T15:59:50Z: $SPY long [Positive]
• 2010-01-21T20:57:21Z: $SPY closing @ the lows! [Negative]
• 2010-09-07T00:10:55Z: Last Sunday, strength in patterns showing a bearish market move [Negative]
• 2010-12-08T16:25:17Z: $SPY has now failed a breakout. Could recover, but for now this is a perfect picture of a failed breakout [Negative]
• 2010-12-16T15:20:07Z: this weeks patterns $SPY see here [Neutral]
11. News Mining: Step 2
How can a machine interpret news sentiment?
12. Parsing News Sentiment using NLTK and Naive Bayes (supervised learning) for classification
An introduction to NLTK:
NLTK is a platform for building Python programs to work with human
language. It provides easy-to-use interfaces to over 50 corpora and lexical
resources along with a suite of text processing libraries for classification,
tokenization, stemming, tagging, parsing, and semantic reasoning.
Twitter text Database description:
Source: http://stocktwits.com/
Format: JSON
Size: over 15 million entries
Fields: id, body, created_at, username, followers, following, …
{"id":918510, "body":"Options Trade in Nordstrom Today $JWN - http://www.cnbc.com/id/34644850/site/14081545/for/cnbc/", "created_at":"2010-01-01T00:09:02Z",
 "user":{"id":6328, "username":"OptionsHawk", "name":"Joe Kunkle", "avatar_url":"http://avatars.stocktwits.net/production/6328/thumb-1290207489.png",
  "avatar_url_ssl":"https://s3.amazonaws.com/st-avatars/production/6328/thumb-1290207489.png", "official":false, "identity":"User", "classification":[],
  "join_date":"2009-11-01", "followers":7072, "following":31, "ideas":18866, "following_stocks":0, "location":"Boston",
  "bio":"Active Options Trader - OptionsHawk.com Founder", "website_url":"http://www.OptionsHawk.com",
  "trading_strategy":{"assets_frequently_traded":["Equities","Options","Forex","Futures"], "approach":"Technical", "holding_period":"Swing Trader", "experience":"Professional"}},
 "source":{"id":1, "title":"StockTwits", "url":"http://stocktwits.com"},
 "symbols":[{"id":6039, "symbol":"JWN", "title":"Nordstrom Inc.", "exchange":"NYSE", "sector":"Services", "industry":"Apparel Stores", "trending":false}],
 "entities":{"sentiment":null}}
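A message in this format can be reduced to the fields a sentiment pipeline needs with the standard library alone. A minimal sketch (the `raw` string below is abbreviated to a few of the fields shown above):

```python
import json

# A single StockTwits message (abbreviated to the fields used here).
raw = '''{"id": 918510,
 "body": "Options Trade in Nordstrom Today $JWN - http://www.cnbc.com/id/34644850/site/14081545/for/cnbc/",
 "created_at": "2010-01-01T00:09:02Z",
 "user": {"username": "OptionsHawk", "followers": 7072},
 "symbols": [{"symbol": "JWN", "exchange": "NYSE"}],
 "entities": {"sentiment": null}}'''

msg = json.loads(raw)

# Keep the text body, timestamp, ticker symbols, and any pre-labeled sentiment.
record = {
    "body": msg["body"],
    "created_at": msg["created_at"],
    "symbols": [s["symbol"] for s in msg["symbols"]],
    "sentiment": msg["entities"]["sentiment"],  # often null -> needs a classifier
}
print(record["symbols"])  # ['JWN']
```

Note that `"sentiment"` is null in the example above, which is exactly why a classifier has to be trained on manually labeled messages.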
13. • Starting set: 10,000 manually labeled Twitter news items
• Distribution of sentiment in the initial training/test split:

SENTIMENT   TRAINING SET   TESTING   TOTAL
POSITIVE    2379           807       3186
NEUTRAL     3849           1248      5097
NEGATIVE    1214           428       1642
SUBTOTAL    7442           2483      9925
NULL        58             17        75
TOTAL       7500           2500      10000
14. 1) Prepare training set - data cleaning (removing nulls, web links, etc.) & import
• pos_tweets = '$SPY looks strong. riding 5EMA - large gap from…', 'positive', …
• neg_tweets = '$SPY closing @ the lows', 'negative', …
2) Split the text sentence into word features
• 'spy', 'looks', 'strong', 'riding', '5ema', 'large', 'gap', 'from', 'sma50', 'concern', 'about', 'level', 'gone', 'for', 'now', …
3) Build a dictionary
A collection of all the recognized word features in the training set
4) Map text onto word features:
• 'contains(spy)': True,
• 'contains(support)': False,
• 'contains(strong)': True, …
Parsing News Sentiment: Dictionary Mapping
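The four steps above can be sketched in plain Python. This is a minimal illustration of the NLTK-style "contains(word)" feature convention, not the deck's actual code; names like `extract_features` and the tokenization rules are assumptions:

```python
import re

# Illustrative labeled tweets (taken from the slide's examples).
pos_tweets = [('$SPY looks strong. riding 5EMA - large gap from SMA50', 'positive')]
neg_tweets = [('$SPY closing @ the lows', 'negative')]

def clean_and_tokenize(text):
    """Steps 1-2: strip web links and punctuation, lowercase, split into word features."""
    text = re.sub(r'http\S+', '', text)            # remove web links
    words = re.findall(r'[a-z0-9]+', text.lower())
    return [w for w in words if len(w) > 2]        # drop very short tokens

# Step 3: the dictionary is every word feature seen in the training set.
all_words = set()
for text, _label in pos_tweets + neg_tweets:
    all_words.update(clean_and_tokenize(text))

def extract_features(text):
    """Step 4: map a tweet onto 'contains(word)' booleans, NLTK-style."""
    words = set(clean_and_tokenize(text))
    return {'contains(%s)' % w: (w in words) for w in sorted(all_words)}

feats = extract_features('$SPY looks strong today')
print(feats['contains(strong)'])   # True
print(feats['contains(lows)'])     # False
```

Every tweet is thus mapped onto the same fixed-length dictionary of boolean features, which is what makes the next step a standard classification problem.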
15. 5) Apply this mapping to all the news texts to get the following form:
Parsing News Sentiment:

            'spy'   'support'   'gone'   …
Twitter_1   True    False       True     …
Twitter_2   True    False       False    …
Twitter_3   …       …           …        …

Encoded numerically:

            Sentiment   WordFeature_1   WordFeature_2   WordFeature_3   …
Twitter_1   1 (Pos)     1 (True)        0 (False)       1 (True)        …
Twitter_2   -1 (Neg)    1 (True)        0 (False)       0 (False)       …
Twitter_3   0 (Neu)     …               …               …               …

A typical classification problem!
16. 6) Classification: Naive Bayes classifier
6*) A simple description of Naive Bayes
Bayes formula:
P(C_k | x) = P(x | C_k) P(C_k) / P(x)
The "naive" assumption comes into play: assume that each feature is conditionally independent of every other feature, so
P(x | C_k) = P(x_1 | C_k) P(x_2 | C_k) … P(x_n | C_k)
In this Twitter example, it means the word features independently affect the sentiment of the text.
The conditional probabilities are estimated by counting:
p(x_i | C_k) = (# of news items with word feature x_i and class C_k) / (# of news items with class C_k)
Parsing News Sentiment:
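These counting estimates can be sketched from scratch in a few lines. This is an illustration of the formula, not the deck's NLTK implementation; the tiny training set is invented, and add-one (Laplace) smoothing is added to avoid zero probabilities, which the slide does not mention:

```python
from collections import Counter, defaultdict
import math

# Tiny illustrative training set: (word features, sentiment class).
train = [
    (['spy', 'looks', 'strong'], 'positive'),
    (['spy', 'reclaim', 'support'], 'positive'),
    (['spy', 'closing', 'lows'], 'negative'),
    (['spy', 'failed', 'breakout'], 'negative'),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)   # word_counts[c][w] = # items of class c containing w
for words, label in train:
    for w in set(words):
        word_counts[label][w] += 1

def log_p_word_given_class(w, c):
    # p(x_i | C_k) = (# items with feature x_i and class C_k) / (# items with class C_k),
    # with add-one smoothing.
    return math.log((word_counts[c][w] + 1) / (class_counts[c] + 2))

def classify(words):
    n = sum(class_counts.values())
    scores = {}
    for c in class_counts:
        score = math.log(class_counts[c] / n)   # prior P(C_k)
        score += sum(log_p_word_given_class(w, c) for w in set(words))
        scores[c] = score                        # log P(C_k) + sum_i log p(x_i | C_k)
    return max(scores, key=scores.get)

print(classify(['spy', 'strong']))   # 'positive'
print(classify(['spy', 'lows']))     # 'negative'
```

Working in log space avoids numeric underflow when thousands of word features are multiplied together.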
17. 7) Model trained. We got 14,356 word features. Most informative features include:

Result:
NEWS ITEM CONTAINS   RATIO
'widely'             positive : negative = 219.8 : 1.0
'held'               positive : negative = 172.4 : 1.0
'most'               positive : negative = 45.7 : 1.0
'fall'               negative : positive = 45.4 : 1.0
'might'              negative : neutral  = 30.6 : 1.0

8) In-sample and out-of-sample test:
• Tweet = '$SPY has now failed a breakout. We could recover, but for now this is a perfect picture of a failed breakout'
  Classified Negative; Prob('negative') = 0.85525
• Tweet = 'SPY UP! I like that!'
  Classified Positive; Prob('positive') = 0.4123, Prob('negative') = 0.1936
• TOTAL in-sample accuracy: 79.2%
• TOTAL out-of-sample accuracy: 36.3%
• With a large enough training set, the accuracy rate should improve substantially.
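The ratios in the table above are likelihood ratios of the kind NLTK's most-informative-features report prints: how much more likely a word is to appear in one class than another. A minimal sketch of that statistic, on an invented toy data set (the smoothing constant is an assumption):

```python
from collections import Counter

# Toy labeled tweets (word lists) -- illustrative only, not the deck's 10,000-item set.
data = [
    (['widely', 'held', 'strong'], 'positive'),
    (['widely', 'most', 'gap'], 'positive'),
    (['fall', 'lows'], 'negative'),
    (['fall', 'breakout'], 'negative'),
]

n_class = Counter(label for _, label in data)
contains = {c: Counter() for c in n_class}
for words, label in data:
    for w in set(words):
        contains[label][w] += 1

def informativeness(w, c1, c2):
    """Smoothed ratio P(contains(w) | c1) / P(contains(w) | c2)."""
    p1 = (contains[c1][w] + 0.5) / (n_class[c1] + 1)
    p2 = (contains[c2][w] + 0.5) / (n_class[c2] + 1)
    return p1 / p2

ratio = informativeness('widely', 'positive', 'negative')
print(round(ratio, 1))  # 'widely' appears only in positive tweets -> large ratio
```

A word like 'widely' that appears almost exclusively in one class gets a large ratio, which is why such words dominate the informative-features table.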
18. A simple summary: NLTK and Naive Bayes method
Pros:
1. A basic approach to sentiment analysis; easy to use;
2. Effective if the training set is large enough;
3. Ability to learn: as the training set gets larger, the results get more and more accurate (intelligence);
Cons:
1. Fails to grasp the connections between words;
2. Does not consider the sequence of words;
3. Includes non-relevant word features;
Possible improvements:
1. Larger training set
2. PCA, to address the problem of too many features;
3. Filtering, to remove spam and meaningless tweets;
4. Detecting short sequences of words;
Currently working on them…
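Improvement 4 (detecting short word sequences) can be sketched by adding bigram features alongside the unigram ones, so a phrase like "not bad" becomes a single feature the classifier can learn as positive. This is an illustrative extension, not the deck's implementation:

```python
def word_and_bigram_features(words):
    """Unigram 'contains(w)' features plus bigram 'contains(w1 w2)' features."""
    feats = {'contains(%s)' % w: True for w in words}
    for w1, w2 in zip(words, words[1:]):       # consecutive word pairs
        feats['contains(%s %s)' % (w1, w2)] = True
    return feats

# With unigrams alone, 'not bad' only contributes two individually negative
# words; the bigram feature 'contains(not bad)' captures the phrase itself.
feats = word_and_bigram_features(['spy', 'not', 'bad'])
print('contains(not bad)' in feats)   # True
```

The cost is a much larger feature dictionary, which makes the filtering and dimensionality-reduction improvements above more important.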
19. News Mining: Step 3
Other advanced methods for measuring news sentiment?
20. Other advanced approaches
Stanford NLP: http://nlp.stanford.edu/
Paper: Christopher Manning and Dan Klein. 2003. Optimization, Maxent Models, and
Conditional Estimation without Magic. Tutorial at HLT-NAACL 2003 and ACL 2003.
Core idea:
Maximum entropy classifier, also known as multiclass logistic regression. Unlike Naive Bayes, MaxEnt does not assume that the features are conditionally independent of each other.
Vivekn: http://github.com/vivekn/sentiment/
Paper: Fast and accurate sentiment classification using an enhanced Naive Bayes
model. Intelligent Data Engineering and Automated Learning IDEAL 2013 Lecture
Notes in Computer Science Volume 8206, 2013, pp 194-201
Core idea:
This tool works by examining individual words and short sequences of words (n-grams). "not bad" will be classified as positive despite containing two individually negative words.
21. More advanced approaches
• Other ones I am currently working on:
• VADER sentiment - https://github.com/cjhutto/vaderSentiment
  Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
• Indico - https://indico.io/
[Figure: daily averaged news sentiment from the indico and vader engines plotted against SPY cumulative return (spy_cum_return), December 2009 - February 2011, based on 580,000 SPY tweets from 2010.]
22. More advanced approaches
• Thomson Reuters News Analytics: http://thomsonreuters.com/en.html
• GATE (+ANNIE) - http://gate.ac.uk/
• LingPipe - http://alias-i.com/lingpipe
• WEKA NLP - http://www.cs.waikato.ac.nz/ml/w...
• OpenNLP - http://incubator.apache.org/open...
• JULIE - http://www.julielab.de/
• Research still ongoing…
• Visit my personal site:
https://public.tableausoftware.com/views/SPYVadernewssentiment2010/SPY?:showVizHome=no#1