The document describes a project using tweets to predict stock price fluctuations of Microsoft (MSFT). It discusses:
1. Using sentiment analysis and naive Bayes classification to determine if tweets in a 1-minute period are positive or negative, and linking this to stock price movement in the next minute.
2. Pre-implementation considerations like data sources, filtration of tweets by keywords, language, encoding, and normalization.
3. The implementation steps of downloading tweet data, processing and filtering it, splitting into 1-minute blocks and linking to intraday stock prices.
Lexicon-based approaches to Twitter sentiment analysis are gaining much popularity due to their simplicity, domain independence, and relatively good performance. These approaches rely on sentiment lexicons, in which a collection of words is marked with fixed sentiment polarities. However, words' sentiment orientation (positive, neutral, negative) and/or sentiment strength can change depending on context and targeted entities. In this paper we present SentiCircle, a novel lexicon-based approach that takes into account the contextual and conceptual semantics of words when calculating their sentiment orientation and strength in Twitter. We evaluate our approach on three Twitter datasets using three different sentiment lexicons. Results show that our approach significantly outperforms two lexicon baselines. Results are competitive but inconclusive when compared with the state-of-the-art SentiStrength, and vary from one dataset to another. SentiCircle outperforms SentiStrength in accuracy on average, but falls marginally behind in F-measure.
Preference Amplification in Recommender Systems (taeseon ryu)
Hello, this is the Deep Learning Paper Reading Group. Today's uploaded paper review video covers 'Preference Amplification in Recommender Systems', a recommender-system paper published by Facebook Research in 2021!
Recommender systems are becoming increasingly accurate at suggesting content, and as a result users now consume content mainly through recommendations. This can narrow a user's interests to the recommended content, a phenomenon called 'preference amplification'. While it can contribute to increased engagement, it can also lead to negative experiences such as reduced diversity and echo chambers. The paper focuses on addressing this problem by analyzing amplification in a matrix-factorization-based recommender system!
Thank you in advance for your interest!
Probabilistic programming is a new approach to machine learning and data science that is currently the focus of intense academic research, including an ongoing DARPA program. If successful, probabilistic programming systems will allow sophisticated predictive models to be written by a wide range of domain experts. Before we get to the promised land, though, some basic challenges need to be addressed, including performance on real-world datasets, programming tools support, and education.
A Multiscale Visualization of Attention in the Transformer Model (taeseon ryu)
Hello, this is the Deep Learning Paper Reading Group. Today's uploaded paper review video covers 'A Multiscale Visualization of Attention in the Transformer Model', a paper presented at ACL 2019.
This paper presents a tool that can visualize, from multiple perspectives, the computations performed by the immensely popular Transformer.
Along with a brief introduction to the Transformer, BERT, and GPT, Jiyoon Baek of the NLP team gives a detailed review of several use cases showing how this visualization can be applied.
Twitter has attracted much attention recently as a hot research topic in the domain of sentiment analysis. Training sentiment classifiers from tweet data often faces the data sparsity problem, partly due to the large variety of short and irregular forms introduced to tweets by the 140-character limit. In this work we propose using two different sets of features to alleviate the data sparseness problem. One is the semantic feature set, where we extract semantically hidden concepts from tweets and then incorporate them into classifier training through interpolation. The other is the sentiment-topic feature set, where we extract latent topics and the associated topic sentiment from tweets, then augment the original feature space with these sentiment-topics. Experimental results on the Stanford Twitter Sentiment Dataset show that both feature sets outperform the baseline model using unigrams only. Moreover, using semantic features rivals the previously reported best result. Using sentiment-topic features achieves 86.3% sentiment classification accuracy, which outperforms existing approaches.
Artwork Personalization at Netflix (Fernando Amat, RecSys 2018)
For many years, the main goal of the Netflix personalized recommendation system has been to get the right titles in front of our members at the right time. But the job of recommendation does not end there. The homepage should be able to convey to the member enough evidence of why a title may be good for her, especially for shows that the member has never heard of. One way to address this challenge is to personalize the way we portray the titles on our service. An important aspect of how to portray titles is through the artwork or imagery we display to visually represent each title. The artwork may highlight an actor that you recognize, capture an exciting moment like a car chase, or contain a dramatic scene that conveys the essence of a movie or show. It is important to select good artwork because it may be the first time a member becomes aware of a title (and sometimes the only time), so it must speak to them in a meaningful way. In this talk, we will present an approach for personalizing the artwork we use on the Netflix homepage. The system selects an image for each member and video to give better visual evidence for why the title might be appealing to that particular member.
Twitter Sentiment & Investing - modeling stock price movements with twitter s... (Eric Brown)
In this presentation, I provide an overview of my research into using Twitter sentiment and message volume as inputs for modeling stock price movements. A quick-and-dirty linear regression model using Twitter sentiment, the number of tweets per day, the VIX closing price, and the VIX price change delivers a simple model for the S&P 500 SPY ETF that has an accuracy of 57% over 6 months (tested on out-of-sample data). This model was built using data from July 11, 2011 to August 11, 2011.
This presentation is about sentiment analysis using machine learning, a modern way to perform sentiment analysis. It describes and compares various techniques and algorithms for SA.
Lexicon-Based Sentiment Analysis at GHC 2014 (Bo Hyun Kim)
Attended the Grace Hopper Celebration to present this work in the Data Science track. The presentation covers using HP Vertica Pulse, enhancing accuracy with the right pre-processing methods, and training for accuracy using the Naive Bayes theorem.
Update: Social Harvest is going open source, see http://www.socialharvest.io for more information.
My MongoSV 2011 talk about implementing machine learning and other algorithms in MongoDB. With a little real-world example at the end about what Social Harvest is doing with MongoDB. For more updates about my research, check out my blog at www.shift8creative.com
Sentiment Analysis/Opinion Mining of Twitter Data on Unigram/Bigram/Unigram+Bigram Model using:
1. Machine Learning
2. Lexical Scores
3. Emoticon Scores
YouTube Video: https://youtu.be/VuR16P87yPE
Link to the WebPage: http://akirato.github.io/Twitter-Sentiment-Analysis-Tool
Github Page: https://github.com/Akirato/Twitter-Sentiment-Analysis-Tool
A Deep Dive into Classification with Naive Bayes. Along the way we take a look at some basics from Ian Witten's Data Mining book and dig into the algorithm.
Presented on Wed Apr 27 2011 at SeaHUG in Seattle, WA.
Sentiment analysis using Naive Bayes classifier (Dev Sahu)
This deck gives a brief description of the Naive Bayes classifier algorithm, a machine learning approach to sentiment detection and text classification.
Mobile Recommendation Engine
A collaborative filtering and content-based approach combined in a hybrid manner, with a Genetic Algorithm to enhance the recommendation engine. With this, marketers also learn the unique characteristics of the products that should be created and recommended to users.
Dive into the world of sentiment analysis applied to movie reviews. Explore how data science techniques can uncover the true sentiments behind the words, providing valuable insights for filmmakers and critics alike. Join us as we analyze the highs and lows of movie emotions. visit https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/ for more data science insights
Lecture related to machine learning. Here you can read multiple things.
DoWhy: An end-to-end library for causal inference (Amit Sharma)
In addition to efficient statistical estimators of a treatment's effect, successful application of causal inference requires specifying assumptions about the mechanisms underlying observed data and testing whether, and to what extent, they are valid. However, most libraries for causal inference focus only on the task of providing powerful statistical estimators. We describe DoWhy, an open-source Python library that is built with causal assumptions as its first-class citizens, based on the formal framework of causal graphs to specify and test causal assumptions. DoWhy presents an API for the four steps common to any causal analysis: 1) modeling the data using a causal graph and structural assumptions, 2) identifying whether the desired effect is estimable under the causal model, 3) estimating the effect using statistical estimators, and finally 4) refuting the obtained estimate through robustness checks and sensitivity analyses. In particular, DoWhy implements a number of robustness checks including placebo tests, bootstrap tests, and tests for unobserved confounding. DoWhy is an extensible library that supports interoperability with other implementations, such as EconML and CausalML, for the estimation step.
Make a query on a topic of interest and see the day's sentiment as a pie chart, or the week's as a line chart, for tweets gathered from twitter.com.
2. Content
• Part I. Fundamentals of Sentiment Analysis
• Part II. Naive Bayes Classifier
• Part III. Pre-implementation
• Part IV. Implementation Steps
• Part V. Results and Conclusions
3. Fundamentals of Sentiment Analysis - Market Anomaly
Sentiment vs. Returns
Many studies have documented that investor sentiment plays an important role in determining prices. Behavioral finance attempts to show that investors' irrational behavior actually affects stock prices. Especially for stocks without enough arbitrage forces to absorb shocks, investor sentiment systematically affects the movement of stock prices.
4. Fundamentals of Sentiment Analysis - EMH
Efficient Market Hypothesis (EMH)
Stocks always trade at their fair value on stock exchanges, making it impossible for investors to either purchase undervalued stocks or sell stocks at inflated prices. As such, it should be impossible to outperform the overall market through expert stock selection or market timing, and the only way an investor can possibly obtain higher returns is by purchasing riskier investments.
But the EMH is not always true!
5. Fundamentals of Sentiment Analysis - Sentiment Analysis
• Sentiment analysis refers to the use of natural language processing, text analysis, and computational linguistics to identify and extract subjective information from source materials.
• The basic task of sentiment analysis is to determine whether the opinion expressed in a given text is positive, negative, or neutral.
• Beyond this, sentiment analysis can also determine various emotional states such as "angry," "sad," and "happy."
6. Fundamentals of Sentiment Analysis - Sentiment Analysis (Cont.)
• Sentiment tracking techniques have improved to the point of extracting indicators of public mood directly from social media content, such as blog posts and, in particular, large-scale Twitter feeds.
• Although each tweet is limited to only 140 characters, the aggregate of millions of tweets submitted to Twitter at any given time may provide an accurate representation of public mood and sentiment.
7. Fundamentals of Sentiment Analysis - Sentiment Analysis (Cont.)
"It's a product that tries too hard to do too much. It's trying to be a tablet and a notebook and it really succeeds at being neither. It's sort of diluted."
dilute: "to diminish the strength, flavor, or brilliance of by admixture" (Merriam-Webster); "make something weaker in force, content, or value by modifying it or adding other elements to it" (Oxford)
NEGATIVE!
8. Fundamentals of Sentiment Analysis – Basic Idea of Our Project
• Typical sentiment analysis is based on training sets built by manually dividing words into groups such as positive, neutral, and negative from thousands of text samples.
• Unlike the typical approach, we collect and aggregate all available tweets that mention a specific stock (name or ticker) within each minute and determine whether this collection of tweets influences that stock's price in the next minute.
• In other words, we combine the tweets within one minute and the change of the stock price in the next minute into one sample. For instance, if the stock price increases, i.e. its return is positive, we assign that sample to the positive group.
10. Naive Bayes Classifier – Probability Basics
• We define prior, conditional, and joint probability for random variables:
– Prior probability: P(X)
– Conditional probability: P(X1|X2), P(X2|X1)
– Joint probability: X = (X1, X2), P(X) = P(X1, X2)
– Relationship: P(X1, X2) = P(X2|X1) P(X1) = P(X1|X2) P(X2)
– Independence: P(X2|X1) = P(X2), P(X1|X2) = P(X1), P(X1, X2) = P(X1) P(X2)
• Bayesian Rule:
P(C|X) = P(X|C) P(C) / P(X)
Posterior = Likelihood × Prior / Evidence
11. Naive Bayes Classifier – Bayes Theorem
Assume that we have two classes: c1 = Male and c2 = Female.
We have a person whose sex we do not know, named "Drew".
Classifying "Drew" as male or female is equivalent to asking which is more probable: p(male|drew) or p(female|drew)?
p(male|Drew) = p(Drew|male) × p(male) / p("Drew")
where p(Drew|male) is the probability of being called "Drew" given that you are male, p(male) is the probability of being male, and p("Drew") is the probability of being named "Drew".
(Examples: Drew Chadwick, Drew Barrymore)
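The Bayes formula above can be checked with a tiny numeric example; the counts below are invented purely for illustration.

```python
# Hypothetical counts: among 8 people, 3 are male and 5 female;
# 1 of the 3 males and 2 of the 5 females are named "Drew".
p_male = 3 / 8
p_female = 5 / 8
p_drew_given_male = 1 / 3
p_drew_given_female = 2 / 5

# Evidence: total probability of being named "Drew".
p_drew = p_drew_given_male * p_male + p_drew_given_female * p_female

# Bayes theorem: posterior = likelihood * prior / evidence.
p_male_given_drew = p_drew_given_male * p_male / p_drew
p_female_given_drew = p_drew_given_female * p_female / p_drew

print(round(p_male_given_drew, 3))    # (1/8) / (3/8) = 1/3
print(round(p_female_given_drew, 3))  # (2/8) / (3/8) = 2/3
```

Note that the two posteriors sum to 1, and the evidence p("Drew") simply normalizes the two likelihood-times-prior terms.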
12. Naive Bayes Classifier – MAP Classification Rule
• MAP classification rule
– MAP (Maximum A Posteriori)
– Assign x to c* if P(C = c*|X = x) > P(C = c|X = x), for all c ≠ c*, c, c* ∈ {c1, ..., cL}
• Method of generative classification with the MAP rule:
1. Apply the Bayesian rule to convert likelihoods and priors into posterior probabilities:
P(C = ci|X = x) = P(X = x|C = ci) P(C = ci) / P(X = x), for i = 1, 2, ..., L
2. Then apply the MAP rule.
(Another "Drew")
13. Naive Bayes Classifier – Naive Bayes Classification
• Bayes classification:
P(C|X) ∝ P(X|C) P(C) = P(X1, ..., Xn|C) P(C)
• Naive Bayes classification
– Assume that all input attributes are conditionally independent given the class:
P(X1, X2, ..., Xn|C) = P(X1|X2, ..., Xn, C) P(X2, ..., Xn|C)
= P(X1|C) P(X2, ..., Xn|C)
= P(X1|C) P(X2|C) ... P(Xn|C)
– MAP classification rule: for x = (x1, x2, ..., xn), assign x to c* if
[P(x1|c*) ... P(xn|c*)] P(c*) > [P(x1|c) ... P(xn|c)] P(c), c ≠ c*, c ∈ {c1, ..., cL}
14. Naive Bayes Classifier – Naive Bayes Algorithm
• The Naive Bayes algorithm (for discrete input attributes) has two phases:
– 1. Learning Phase: given a training set S,
For each target value ci (i = 1, ..., L):
P̂(C = ci) ← estimate P(C = ci) with examples in S;
For every attribute value xjk of each attribute Xj (j = 1, ..., n; k = 1, ..., Nj):
P̂(Xj = xjk|C = ci) ← estimate P(Xj = xjk|C = ci) with examples in S;
Output: conditional probability tables; for Xj, Nj × L elements
– 2. Test Phase: given an unknown instance X′ = (a1, ..., an),
look up the tables to assign the label c* to X′ if
[P̂(a1|c*) ... P̂(an|c*)] P̂(c*) > [P̂(a1|c) ... P̂(an|c)] P̂(c), c ≠ c*, c ∈ {c1, ..., cL}
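The two phases above can be sketched in plain Python. This is a minimal illustration, not the project's code: Laplace smoothing is added to avoid zero probabilities, and the toy weather data is invented.

```python
from collections import Counter, defaultdict

def train_nb(samples):
    """Learning phase: count statistics for P(C) and P(Xj = v | C)
    from (attribute-tuple, label) pairs."""
    class_counts = Counter(label for _, label in samples)
    cond_counts = Counter()        # (attribute index, value, label) -> count
    values = defaultdict(set)      # distinct values seen for each attribute
    for feats, label in samples:
        for j, v in enumerate(feats):
            cond_counts[(j, v, label)] += 1
            values[j].add(v)
    return class_counts, cond_counts, values, len(samples)

def classify_nb(x, model):
    """Test phase: MAP rule, pick the class maximizing P(c) * prod_j P(xj | c)."""
    class_counts, cond_counts, values, n = model
    best, best_score = None, -1.0
    for c, cc in class_counts.items():
        score = cc / n             # prior P(c)
        for j, v in enumerate(x):
            # Laplace-smoothed estimate of P(Xj = v | C = c)
            score *= (cond_counts[(j, v, c)] + 1) / (cc + len(values[j]))
        if score > best_score:
            best, best_score = c, score
    return best

# Toy data: (outlook, windy) -> play
data = [(("sunny", "no"), "yes"), (("sunny", "yes"), "no"),
        (("rainy", "yes"), "no"), (("overcast", "no"), "yes")]
model = train_nb(data)
print(classify_nb(("sunny", "no"), model))  # -> yes
```

The conditional probability tables of the slide correspond to `cond_counts` together with `class_counts`; a real implementation would precompute the normalized tables rather than renormalize at test time.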
16. Pre-implementation – Basic Idea
• If people are saying something good or bad about Microsoft:
"Love the new Surface Pro 3. Good work Microsoft!"
"Having a great internship at Microsoft."
• If companies are announcing something good or bad about Microsoft, the stock price of MSFT will move:
"The quarterly report of Microsoft seems pretty good!"
"Microsoft has closed 5 major stores in east China."
• If internet sensations are saying they like or dislike the products of Microsoft:
"Tim Cook: The new Surface Book is not interesting."
"Ma Yun: We will cooperate with Microsoft to create a new online shopping experience."
17. Pre-implementation – Basic Idea
• Traditional sentiment models
• Naive Bayes classification and Support Vector Machines (SVM)
• Single tweet, or tweet-block?
a) Does one single tweet really have an impact on the stock price of Microsoft?
b) Will a tweet-block, which contains all the tweets in a minute, have more effect on the stock price?
c) We believe the second. This is different from the traditional sentiment model.
• Decision tree model?
• Since we use one-minute tweet-blocks, each block may hold hundreds or even thousands of related tweets, so a decision tree may not be a good choice here.
• Why Python?
• Fast and open-source
• Packages: nltk, csv, tweepy
18. Pre-implementation – Data Source
• Free historical Twitter data is difficult to find.
• Python package tweepy
a) A free way to access tweet data
b) Only the latest public tweets
c) We let the system run from 9:30 am to 4:00 pm every day between 25th Nov and 2nd Dec.
d) Potential problem: only business-day data will be useful in our model.
20. Pre-implementation – Data Filtration
• JSON
a) JavaScript Object Notation: a syntax used to transfer data between servers
b) Different tag names contain different information
c) E.g. "created_at": the time the tweet was created
"id": the id of the tweet
"text": the text content of the tweet
"lang": the language of the tweet
"place" and "country", etc.
21. Pre-implementation – Data Filtration (Cont.)
• Keywords
a) We only care about tweets related to the Microsoft company or stock.
b) stream.filter(track = ['microsoft']);
c) Other words? (MSFT, MSFT stocks)
• Language
a) We only care about tweets written in English.
b) "lang" = 'en'
22. Pre-implementation – Data Filtration (Cont.)
• Encoding
a) The data source uses 'UTF-8'.
b) Even though a tweet is written in English, it may still contain some Chinese or Korean characters.
c) Use a regular expression to detect such characters.
• Some JSON packages are not tweets
a) In practice we found that we sometimes get a JSON package with content like {"limit"….}
b) Just get rid of it:
c) if line[0:8] != '{"limit"':
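The checks on this and the previous slides (skipping limit notices, keeping only English tweets, dropping non-ASCII text) can be sketched as one small filter. This is a simplified stand-in for the project's code; the field names follow the Twitter JSON tags described earlier.

```python
import json
import re

NON_ASCII = re.compile(r'[^\x00-\x7f]')  # catches Chinese, Korean, etc.

def keep_tweet(line):
    """Return the tweet text if the JSON line is a usable English tweet, else None."""
    if line[0:8] == '{"limit"':         # rate-limit notice, not a tweet
        return None
    obj = json.loads(line)
    if obj.get('lang') != 'en':         # keep only English tweets
        return None
    text = obj.get('text', '')
    if NON_ASCII.search(text):          # drop tweets with non-ASCII characters
        return None
    return text

print(keep_tweet('{"limit": {"track": 5}}'))                      # None
print(keep_tweet('{"lang": "en", "text": "Love the Surface!"}'))  # Love the Surface!
```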
23. Pre-implementation – Data Filtration (Cont.)
• NICE & nice & Nice
a) These are the same word, so transfer them all into lower case:
b) tweet = tweet.lower();
• @ChangLiu & www.twitter.com
a) The author and any URL contained in the tweet have nothing to do with our model.
b) Regular expressions:
c) tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet);
d) tweet = re.sub(r'@[^\s]+', 'ATUSER', tweet);
• Repeating letters
a) E.g. hunggrryyy, huuuuuuungry for 'hungry'.
b) We look for 2 or more repeated letters in words and replace them with 2 of the same.
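The three cleaning steps above can be combined into one function. This is a sketch; the URL/ATUSER placeholder names follow the slide, and the repeated-letter rule is implemented with a backreference.

```python
import re

def clean_tweet(tweet):
    tweet = tweet.lower()                                               # NICE / Nice -> nice
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)  # strip links
    tweet = re.sub(r'@[^\s]+', 'ATUSER', tweet)                         # strip @mentions
    tweet = re.sub(r'(.)\1{2,}', r'\1\1', tweet)                        # huuuungry -> huungry
    return tweet

print(clean_tweet('@ChangLiu SOOOO huuuungry! www.twitter.com'))
# -> ATUSER soo huungry! URL
```

The order matters: lower-casing comes first, and the repeated-letter rule runs last so it never touches the original URL or user handle.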
24. Pre-implementation – Data Filtration (Cont.)
• Stop words
a) a, is, the, with, etc.
b) These words don't indicate any sentiment and can be removed.
c) Before processing the tweets, we read in a file which contains all the stop words.
d) Since these words typically carry no meaning, we just ignore them.
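A minimal sketch of the stop-word step; in the project the word list is read from a file, and the tiny inline set here merely stands in for it.

```python
# A tiny stand-in for the stop-word file the project reads in.
STOP_WORDS = {'a', 'is', 'the', 'with', 'and', 'in', 'it'}

def remove_stop_words(tweet):
    """Drop words that carry no sentiment before feature extraction."""
    return [w for w in tweet.split() if w not in STOP_WORDS]

print(remove_stop_words('the new surface is a great product'))
# -> ['new', 'surface', 'great', 'product']
```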
25. Implementation Steps
1. Download data.
2. Data processing and filtering.
3. Split tweet-blocks by one minute.
a) We read the time tag of each tweet, and for each minute we simply concatenate all the tweets within that minute into one sentence.
4. Combine the stock price data.
a) We believe the information time lag is 1 minute: information made public one minute ago will have an impact on this minute's stock price.
b) E.g. we have a tweet-block for 9:50 am to 9:51 am.
c) We find the stock price change of MSFT from 9:51 am to 9:52 am is 0.03%.
d) Then we classify this tweet-block as 'positive'.
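Steps 3 and 4 can be sketched as follows. The timestamps and returns below are invented; each one-minute block is labeled by the sign of the stock's return in the following minute, as the slide describes.

```python
from collections import defaultdict

def build_samples(tweets, returns):
    """tweets: list of (minute, text); returns: {minute: return of the NEXT minute}.
    Concatenate each minute's tweets into one block and label it by the
    sign of the next-minute return."""
    blocks = defaultdict(list)
    for minute, text in tweets:
        blocks[minute].append(text)
    samples = []
    for minute, texts in sorted(blocks.items()):
        r = returns.get(minute)
        if r is None:               # no price data for the following minute
            continue
        label = 'positive' if r > 0 else 'negative' if r < 0 else 'neutral'
        samples.append((' '.join(texts), label))
    return samples

tweets = [('9:50', 'love the new surface'), ('9:50', 'good work microsoft'),
          ('9:51', 'msft store closed')]
# Return of MSFT in the minute AFTER each block, e.g. 9:51-9:52 for the 9:50 block.
returns = {'9:50': 0.0003, '9:51': -0.0001}
print(build_samples(tweets, returns))
```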
26. Implementation Steps (Cont.)
5. Form the training data set.
a) We label a positive return rate as "Positive".
b) A negative return rate as "Negative".
c) A zero return rate as "Neutral".
27. Implementation Steps (Cont.)
6. Build a classifier based on the Naive Bayes classification method.
a) It produces a feature list containing the words that appeared for 'positive', 'negative' and 'neutral'.
b) Then, for a test tweet, it figures out whether the tweet contains words from the 'positive' list, the 'negative' list or the 'neutral' list.
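The idea in (a) and (b) can be sketched without any library: build per-class word lists from labeled blocks, then score a test tweet by how many words it shares with each list. This is a simplification of the actual Naive Bayes classifier, and the sample data is invented.

```python
from collections import defaultdict

def build_feature_lists(samples):
    """samples: (text, label) pairs -> {label: set of words seen with that label}."""
    lists = defaultdict(set)
    for text, label in samples:
        lists[label].update(text.split())
    return lists

def classify(tweet, lists):
    """Assign the label whose word list overlaps the tweet the most."""
    words = set(tweet.split())
    return max(lists, key=lambda label: len(words & lists[label]))

samples = [('love the new surface great work', 'positive'),
           ('store closed bad quarter', 'negative')]
lists = build_feature_lists(samples)
print(classify('great surface love it', lists))  # -> positive
```

A proper Naive Bayes classifier would weight each shared word by its estimated class-conditional probability rather than counting overlaps equally.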
28. Implementation Steps (Cont.)
6. Build a classifier based on the Naive Bayes classification method (cont.)
a) Training set: data from 9:30 am to 4:00 pm
b) Dates: 25th Nov, 27th Nov, 30th Nov, 1st Dec, 2nd Dec (9:30 am–2:50 pm)
c) Total number of tweets: nearly 110,000 single tweets
d) Sample size: 1,518 one-minute tweet-block records
e) Training time: 377 seconds
29. Implementation Steps (Cont.)
7. In-sample test
a) Testing set: 2nd Dec (9:30 am – 2:50 pm)
b) Size of testing set: 317
c) Correct prediction: 97.16%
d) Testing time: 431 seconds
30. Implementation Steps (Cont.)
8. Out-of-sample test
a) Testing set: 2nd Dec (2:50 pm – 4:00 pm)
b) Size of testing set: 61
c) Correct prediction: 40.98%
d) Testing time: 111 seconds
32. Results and Conclusions – Result Analysis
[Chart: out-of-sample test result over minutes 0–60, plotting Predicted Value and Real Value (left axis, -2.5 to 2.5) against the Change Rate (right axis, -0.003 to 0.003).]
33. Results and Conclusions – Drawbacks
1. Not enough training data
a) A traditional sentiment model requires at least 10,000 training records.
b) Our model requires even more, but we only have 1,500 records.
2. Not enough testing data to verify the model
3. Single-word set vs. bi-word set
a) "Not interesting"
b) "Don't like"
c) "Not bad"
4. Time lag problem
a) 1 minute or 5 minutes?
b) Maybe the tweets between 4:30 pm and 9:30 am the next morning have more impact on tomorrow's stock price.
34. Results and Conclusions – Further Improvement
Gather more data
Change time lags to find the optimal one
Consider other languages
– Some automatic translation tools
Use other classification methods
– SVM, the entropy approach
– Even a modified decision tree