Data Mining Methods in Trading Strategies
An analysis based on news sentiment
Wendi Zhu
wendi.zhu1991@gmail.com
The Age of Big Data
Twitter: 8 terabytes (8,000,000,000,000 bytes)
Take Twitter SPY in 2010 as a simple example
Question: Can mining news data from social media enhance trading?
Yes!
Claims
1. A Wall Street news analytics company: Sentiment data is a determinant of market moves after Federal Open Market Committee (FOMC) rate announcements, with a 75% accuracy rate in 2014.
2. A hedge fund report: We captured a burst of negative sentiment about ResMed at 11:14 AM, October 9, 2014. Despite the serious allegations and the seeming validity of the report, it took the market over 60 minutes to react.
3. An institutional investor: A news sentiment Open-to-Close (OTC) strategy on SPY returned 29.76% (before costs) over 2014 with a Sharpe ratio of 3.1.
1. Preview: First look at social media data
2. Implementation: Parsing Twitter news sentiment
3. Improvement: A brief summary of advanced methods
4. Trade the news: Tentative trading practices
News Mining: Step 1
What does typical social media news look like?
A typical twitter user interface
Take Twitter SPY in 2010 as a simple example
What does financial Twitter news look like?
• 2010-01-19T15:14:52Z: $SPY looks strong. riding 5EMA - large gap from SMA50 - concern about 113 level gone for now. [Positive]
• 2010-12-09T13:28:49Z: $SPY managed to reclaim the 1227 support level, which should bode well for further price appreciation. [Positive]
• 2010-12-10T15:59:50Z: $SPY long [Positive]
• 2010-01-21T20:57:21Z: $SPY closing @ the lows! [Negative]
• 2010-09-07T00:10:55Z: Last Sunday, strength in patterns showing a bearish market move [Negative]
• 2010-12-08T16:25:17Z: $SPY has now failed a breakout. Could recover, but for now this is a perfect picture of a failed breakout [Negative]
• 2010-12-16T15:20:07Z: this weeks patterns $SPY see here [Neutral]
News Mining: Step 2
How can we interpret the news
sentiment by machine?
Parsing News Sentiment using NLTK and Naive Bayes (supervised learning) for classification
An introduction to NLTK:
NLTK is a platform for building Python programs to work with human
language. It provides easy-to-use interfaces to over 50 corpora and lexical
resources along with a suite of text processing libraries for classification,
tokenization, stemming, tagging, parsing, and semantic reasoning.
Twitter text database description:
Source: http://stocktwits.com/
Format: JSON
Size: over 15 million
Fields: id, body, created_at, user name, followers, following …
Sample record:
{
  "id": 918510,
  "body": "Options Trade in Nordstrom Today $JWN - http://www.cnbc.com/id/34644850/site/14081545/for/cnbc/",
  "created_at": "2010-01-01T00:09:02Z",
  "user": {
    "id": 6328,
    "username": "OptionsHawk",
    "name": "Joe Kunkle",
    "avatar_url": "http://avatars.stocktwits.net/production/6328/thumb-1290207489.png",
    "avatar_url_ssl": "https://s3.amazonaws.com/st-avatars/production/6328/thumb-1290207489.png",
    "official": false,
    "identity": "User",
    "classification": [],
    "join_date": "2009-11-01",
    "followers": 7072,
    "following": 31,
    "ideas": 18866,
    "following_stocks": 0,
    "location": "Boston",
    "bio": "Active Options Trader - OptionsHawk.com Founder",
    "website_url": "http://www.OptionsHawk.com",
    "trading_strategy": {
      "assets_frequently_traded": ["Equities", "Options", "Forex", "Futures"],
      "approach": "Technical",
      "holding_period": "Swing Trader",
      "experience": "Professional"
    }
  },
  "source": {"id": 1, "title": "StockTwits", "url": "http://stocktwits.com"},
  "symbols": [
    {"id": 6039, "symbol": "JWN", "title": "Nordstrom Inc.", "exchange": "NYSE",
     "sector": "Services", "industry": "Apparel Stores", "trending": false}
  ],
  "entities": {"sentiment": null}
}
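A record in this format can be read with Python's standard json module. The sketch below is illustrative only: the helper name parse_message and the choice of fields to keep are assumptions, and the embedded record is a trimmed copy of the sample above.

```python
import json
from datetime import datetime, timezone

# Trimmed copy of the sample record (only the fields used below).
RAW = (
    '{"id": 918510, '
    '"body": "Options Trade in Nordstrom Today $JWN - '
    'http://www.cnbc.com/id/34644850/site/14081545/for/cnbc/", '
    '"created_at": "2010-01-01T00:09:02Z", '
    '"user": {"username": "OptionsHawk", "followers": 7072, "following": 31}, '
    '"symbols": [{"symbol": "JWN"}], '
    '"entities": {"sentiment": null}}'
)


def parse_message(line):
    """Hypothetical helper: turn one JSON line into a flat dict."""
    record = json.loads(line)
    created = datetime.strptime(record["created_at"], "%Y-%m-%dT%H:%M:%SZ")
    return {
        "id": record["id"],
        "body": record["body"],
        "created_at": created.replace(tzinfo=timezone.utc),
        "username": record["user"]["username"],
        "followers": record["user"]["followers"],
        "symbols": [s["symbol"] for s in record.get("symbols", [])],
        # JSON null becomes None: this message carries no sentiment label.
        "sentiment": record.get("entities", {}).get("sentiment"),
    }


message = parse_message(RAW)
print(message["symbols"], message["created_at"].isoformat(), message["sentiment"])
```

A null sentiment field is why the training items in this deck had to be labeled manually.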
Initial training set: sample test
• Starting set: 10,000 manually labeled Twitter news items
• Distribution of sentiment:

SENTIMENT   TRAINING SET   TEST SET   TOTAL
POSITIVE            2379        807    3186
NEUTRAL             3849       1248    5097
NEGATIVE            1214        428    1642
SUBTOTAL            7442       2483    9925
NULL                  58         17      75
TOTAL               7500       2500   10000
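The class shares implied by this table give a simple reference point for reading accuracy figures. The sketch below only redoes the arithmetic from the table (NULL items excluded); treating the neutral share as a majority-class baseline is an interpretation added here, not something stated in the slides.

```python
# Counts copied from the table above (NULL rows excluded).
COUNTS = {
    "positive": {"train": 2379, "test": 807},
    "neutral": {"train": 3849, "test": 1248},
    "negative": {"train": 1214, "test": 428},
}


def class_shares(split):
    """Fraction of labeled items per sentiment class in one split."""
    total = sum(row[split] for row in COUNTS.values())
    return {label: row[split] / total for label, row in COUNTS.items()}


train_shares = class_shares("train")
test_shares = class_shares("test")

# A classifier that always predicts the most common class would score
# exactly that class's share of the test set.
majority_label = max(test_shares, key=test_shares.get)

for label in COUNTS:
    print(f"{label:9s} train={train_shares[label]:.3f} test={test_shares[label]:.3f}")
print("majority-class baseline:", majority_label, f"{test_shares[majority_label]:.3f}")
```

Neutral items make up about half of both splits, so always answering "neutral" would be right on roughly half of the labeled test items.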
Parsing News Sentiment: Dictionary Mapping
1) Prepare the training set: data cleaning (removing nulls, web links, etc.) and import
• pos_tweets = [('$SPY looks strong. riding 5EMA - large gap from…', 'positive'), …]
• neg_tweets = [('$SPY closing @ the lows', 'negative'), …]
2) Split each text into word features
• 'spy', 'looks', 'strong', 'riding', '5ema', 'large', 'gap', 'from', 'sma50', 'concern', 'about', 'level', 'gone', 'for', 'now'…
3) Build a dictionary: a collection of all the recognized word features in the training set
4) Map text onto word features:
• 'contains(spy)': True,
• 'contains(support)': False,
• 'contains(strong)': True, …
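Steps 1 to 4 can be sketched in plain Python. The function names and regular expressions below are assumptions chosen to reproduce the word list on the slide (lowercased tokens, web links removed, purely numeric tokens such as "113" dropped); the author's actual cleaning rules are not shown in the deck.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")   # web links removed in step 1
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")    # runs of letters and digits


def tokenize(text):
    """Step 2: lowercase, strip links, split into word features."""
    text = URL_PATTERN.sub(" ", text.lower())
    # Drop purely numeric tokens (price levels such as 113 or 1227).
    return [tok for tok in TOKEN_PATTERN.findall(text) if not tok.isdigit()]


def build_dictionary(texts):
    """Step 3: every word feature seen in the training texts."""
    return sorted({tok for text in texts for tok in tokenize(text)})


def extract_features(text, dictionary):
    """Step 4: map a text onto 'contains(word)' booleans."""
    present = set(tokenize(text))
    return {"contains(%s)" % word: (word in present) for word in dictionary}


tweets = [
    "$SPY looks strong. riding 5EMA - large gap from SMA50 - "
    "concern about 113 level gone for now.",
    "$SPY managed to reclaim the 1227 support level, which should "
    "bode well for further price appreciation.",
]
dictionary = build_dictionary(tweets)
features = extract_features(tweets[0], dictionary)
print(tokenize(tweets[0]))
print(features["contains(spy)"], features["contains(support)"], features["contains(strong)"])
```

Because every tweet is mapped onto the same dictionary, each feature dict has the same keys, which is what allows the tweets to be stacked into a table.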
Parsing News Sentiment:
5) Apply this mapping to all the news texts to get the following form:

            'spy'   'support'   'gone'   …
Twitter_1   True    False       True     …
Twitter_2   True    False       False    …
Twitter_3   …       …           …        …

            Sentiment   WordFeature_1   WordFeature_2   WordFeature_3   …
Twitter_1   1 (Pos)     1 (True)        0 (False)       1 (True)        …
Twitter_2   -1 (Neg)    1 (True)        0 (False)       0 (False)       …
Twitter_3   0 (Neu)     …               …               …               …

A typical classification problem!
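Parsing News Sentiment:
6) Classification: Naive Bayes classifier
6*) A simple description of Naive Bayes
Bayes formula, for a class C_k and word features x_1, …, x_n:
\[ p(C_k \mid x_1, \dots, x_n) = \frac{p(C_k)\, p(x_1, \dots, x_n \mid C_k)}{p(x_1, \dots, x_n)} \]
The "naive" assumption comes into play: assume that each feature is conditionally independent of every other feature, given the class:
\[ p(x_1, \dots, x_n \mid C_k) = \prod_{i=1}^{n} p(x_i \mid C_k) \]
In this Twitter example, it means the word features independently affect the sentiment of the text. Each factor is estimated from counts:
\[ p(x_i \mid C_k) = \frac{\#\text{ of news items with word feature } x_i \text{ and class } C_k}{\#\text{ of news items with class } C_k} \]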
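The count ratio for p(x_i | C_k) can be turned into a small classifier by hand. This is a minimal sketch, not the NLTK implementation: the toy training items are shortened versions of the example tweets, absent words contribute 1 - p(x_i | C_k) (a Bernoulli-style reading of the contains(...) features), and the smoothing constant alpha is an assumption added to avoid zero probabilities. With alpha = 0 the likelihood reduces to the plain count ratio.

```python
from collections import Counter, defaultdict


def train(labeled_items):
    """Count items per class and items per (class, word)."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in labeled_items:
        class_counts[label] += 1
        for word in set(words):
            feature_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, feature_counts, vocabulary


def likelihood(word, label, class_counts, feature_counts, alpha=0.0):
    """p(x_i | C_k): share of class-C_k items that contain the word."""
    return (feature_counts[label][word] + alpha) / (class_counts[label] + 2 * alpha)


def posterior(words, class_counts, feature_counts, vocabulary, alpha=0.5):
    """p(C_k | x) proportional to p(C_k) times the product of per-word factors."""
    words = set(words)
    total = sum(class_counts.values())
    scores = {}
    for label, n_items in class_counts.items():
        score = n_items / total  # prior p(C_k)
        for word in vocabulary:
            p = likelihood(word, label, class_counts, feature_counts, alpha)
            score *= p if word in words else (1.0 - p)
        scores[label] = score
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}


# Toy data (hypothetical, shortened from the example tweets).
labeled_items = [
    (["spy", "looks", "strong"], "positive"),
    (["spy", "long"], "positive"),
    (["spy", "closing", "the", "lows"], "negative"),
    (["spy", "has", "now", "failed", "a", "breakout"], "negative"),
]
class_counts, feature_counts, vocabulary = train(labeled_items)
probs = posterior(["spy", "failed", "breakout"], class_counts, feature_counts, vocabulary)
print(likelihood("failed", "negative", class_counts, feature_counts))
print(max(probs, key=probs.get), probs)
```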
7) Model trained. We got 14,356 word features. The most informative features include:

NEWS ITEM CONTAINS   RATIO
'widely'             positive : negative = 219.8 : 1.0
'held'               positive : negative = 172.4 : 1.0
'most'               positive : negative =  45.7 : 1.0
'fall'               negative : positive =  45.4 : 1.0
'might'              negative : neutral  =  30.6 : 1.0

8) In-sample and out-of-sample tests:
Result:
• Tweet = '$SPY has now failed a breakout. We could recover, but for now this is a perfect picture of a failed breakout'
  Classified as negative: Prob('negative') = 0.85525
• Tweet = 'SPY UP! I like that!'
  Classified as positive: Prob('positive') = 0.4123, Prob('negative') = 0.1936
• Total in-sample accuracy: 79.2%
• Total out-of-sample accuracy: 36.3%
• The large gap between in-sample and out-of-sample accuracy suggests the training set is too small; a larger training set is expected to raise out-of-sample accuracy.
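Steps 6 to 8 map onto a few NLTK calls. The sketch below assumes the nltk package is installed and uses four shortened example tweets as a stand-in training set, so its numbers will not match the figures reported above. The tokenizer is the same hypothetical one used for the dictionary mapping; NaiveBayesClassifier.train, prob_classify, nltk.classify.accuracy and show_most_informative_features are NLTK's own calls.

```python
import re

import nltk


def tokenize(text):
    """Lowercase word features; purely numeric tokens are dropped."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if not t.isdigit()]


# Tiny stand-in training set (the deck used 7,500 labeled items).
labeled_tweets = [
    ("$SPY looks strong. riding 5EMA", "positive"),
    ("$SPY long", "positive"),
    ("$SPY closing @ the lows!", "negative"),
    ("$SPY has now failed a breakout", "negative"),
]
dictionary = sorted({w for text, _ in labeled_tweets for w in tokenize(text)})


def extract_features(text):
    present = set(tokenize(text))
    return {"contains(%s)" % w: (w in present) for w in dictionary}


train_set = [(extract_features(text), label) for text, label in labeled_tweets]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Class probabilities for one tweet, as in the Prob('negative') lines.
dist = classifier.prob_classify(extract_features("$SPY has now failed a breakout"))
print(dist.max(), round(dist.prob("negative"), 3))

# In-sample accuracy; an out-of-sample figure needs a held-out test set.
in_sample_accuracy = nltk.classify.accuracy(classifier, train_set)
print(in_sample_accuracy)

# Prints likelihood ratios in the same "label : label = ratio" layout.
classifier.show_most_informative_features(5)
```

NLTK truncates label names in this printout, which is presumably where abbreviations such as "positi" and "negati" in the original slide came from.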
A simple summary: NLTK and Naive Bayes method
Pros:
1. A basic approach to sentiment analysis; easy to use.
2. Effective if the training set is large enough.
3. Ability to learn: as the training set gets larger, the results become more accurate (intelligence).
Cons:
1. Fails to grasp the connections between words.
2. Does not consider the order of words.
3. Includes non-relevant word features.
Possible improvements:
1. Larger training set.
2. PCA, to address the problem of too many features.
3. Filtering, to remove spam and meaningless tweets.
4. Detecting short sequences of words.
Currently working on them…
News Mining: Step 3
Other Advanced Methods in
measuring news sentiment?
Other advanced approaches
Stanford NLP: http://nlp.stanford.edu/
Paper: Christopher Manning and Dan Klein. 2003. Optimization, Maxent Models, and
Conditional Estimation without Magic. Tutorial at HLT-NAACL 2003 and ACL 2003.
Core idea:
Maximum entropy classifier, otherwise known as multiclass logistic regression. Unlike Naive Bayes, a maximum entropy classifier does not assume that the features are conditionally independent of each other.
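For reference, multiclass logistic regression is usually written in the softmax form below. The weight vectors w_k and the feature vector x are notation introduced here for illustration and do not appear in the slides; C_k is the sentiment class, as in the Naive Bayes description.

```latex
\[
  p(C_k \mid \mathbf{x})
  = \frac{\exp\left(\mathbf{w}_k^{\top} \mathbf{x}\right)}
         {\sum_{j} \exp\left(\mathbf{w}_j^{\top} \mathbf{x}\right)}
\]
```

The weights are fitted jointly to the training data rather than estimated one feature at a time, which is why no independence assumption between features is needed.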
Vivekn: http://github.com/vivekn/sentiment/
Paper: Fast and accurate sentiment classification using an enhanced Naive Bayes
model. Intelligent Data Engineering and Automated Learning IDEAL 2013 Lecture
Notes in Computer Science Volume 8206, 2013, pp 194-201
Core idea:
This tool works by examining individual words and short sequences of words (n-grams). For example, "not bad" will be classified as positive even though both of its individual words carry a negative sentiment.
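The n-gram idea can be illustrated by extending the contains(...) features to short word sequences. This is a sketch of the general technique only, not the vivekn tool's implementation; the function name and the n_max parameter are hypothetical.

```python
def ngram_features(tokens, n_max=2):
    """Emit one 'contains(...)' feature per word and per short word sequence."""
    features = {}
    for n in range(1, n_max + 1):
        for start in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[start:start + n])
            features["contains(%s)" % ngram] = True
    return features


features = ngram_features(["not", "bad", "at", "all"])
print(sorted(features))
```

With the bigram "not bad" available as its own feature, a classifier can assign it a positive weight even when "not" and "bad" individually lean negative.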
More advanced approaches
• Other ones I am currently working on:
• vaderSentiment - https://github.com/cjhutto/vaderSentiment
  Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
• Indico - https://indico.io/
[Figure: Daily averaged news sentiment. Series: indico, vader, spy_cum_return. X-axis: dates from 12/3/2009 to 2/26/2011. A plot of sentiment engines based on 580,000 SPY Twitter news items from 2010.]
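The daily averaging behind such a plot can be sketched as follows. The sentiment scores and the fourth timestamp are made-up placeholders; in practice each score would come from an engine such as VADER or Indico, and the function name daily_average is hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def daily_average(scored_tweets):
    """Average per-tweet sentiment scores by calendar day (UTC timestamps)."""
    totals = defaultdict(lambda: [0.0, 0])  # day -> [sum of scores, count]
    for timestamp, score in scored_tweets:
        day = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").date()
        totals[day][0] += score
        totals[day][1] += 1
    return {day: total / count for day, (total, count) in sorted(totals.items())}


# (timestamp, score) pairs; the scores are hypothetical placeholders.
scored_tweets = [
    ("2010-12-08T16:25:17Z", -0.5),
    ("2010-12-08T20:00:00Z", 0.25),
    ("2010-12-09T13:28:49Z", 0.5),
    ("2010-12-10T15:59:50Z", 0.25),
]
averages = daily_average(scored_tweets)
for day, value in averages.items():
    print(day.isoformat(), value)
```

The resulting daily series can then be aligned by date with SPY returns for comparison.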
More advanced approaches
• Thomson Reuters news analytics: http://thomsonreuters.com/en.html
• Gate (+Annie) - http://gate.ac.uk/
• LingPipe - http://alias-i.com/lingpipe
• WEKA NLP - http://www.cs.waikato.ac.nz/ml/w...
• OpenNLP - http://incubator.apache.org/open...
• JULIE - http://www.julielab.de/
• Research still ongoing…
• Visit my personal site: https://public.tableausoftware.com/views/SPYVadernewssentiment2010/SPY?:showVizHome=no#1
Thank you!
