Yi-Shan Shir
Instructor: Dr. Nam P. Nguyen
Department of Computer and Information
Science
Towson University
EXPLORING CORRELATION
BETWEEN SENTIMENT OF
ENVIRONMENTAL TWEETS
AND THE STOCK MARKET
Overview
• Motivation
• Research Approaches
• Tools
• Data Collection
• Sentiment Analysis
• Analysis of Correlation between Sentiments and Stock Price
• Conclusion
Motivation
• Hiroko Okajima and Barin Nag, Department of e-Business and
Technology Management, Towson University
• Previous studies:
Sentiment on social media can predict stock market fluctuations
• Question:
What about specific terms?
-- Environmental tweets over 5 years.
Tools
• Environment: Ubuntu 16.04
• Language: Python, SQL
• Database: MySQL
• Approaches:
1. Natural Language Processing
-- Sentiment Analysis
2. Machine Learning
Locating Target Enterprises
• Target set 1:
Top 100 from the Fortune 500 list
• Target set 2:
Enterprises with a significant (notorious) reputation on environmental
issues
-- accounts: tweets > 30K or top 50%

Category                     Company/Brand
IT (renewable energy)        Amazon, Samsung, Google
Oil                          Shell, BP, Exxon
Palm Oil (Deforestation)     Nestle, JNJ, Unilever
Wastes                       Starbucks, CocaCola, PepsiCo
Fast Food (Deforestation)    McDonalds, BurgerKing, KFC, TacoBell
Data Collection(1): Twitter API
• Twitter API
• Python implementation: Tweepy
• Cons: only allows data collection for the most recent week
Data Collection(2): Advanced Search
• Scraping tweets from search result of Twitter advanced search
• Source code: Jefferson Henrique
https://github.com/Jefferson-Henrique/GetOldTweets-python
• Cons: adjustments have to be made whenever Twitter changes something.
Data Storage
• Raw data:
-- tweets: 5,818,254
-- accounts: 158
• Database schema:
1. Raw data 2. Filtered data 3. Stock tickers
Data Filtering
• 1. Filter with Python:
-- Filtering through a list of keywords
-- Pros: fast, keep as much data as possible
-- Cons: lower accuracy
-- e.g. “He has a lot of energy.”
• 2. Filter with SQL
-- Filtering inside the DB
-- Pros: higher accuracy
-- Cons: slow, may leave out tweets
-- e.g. “energy efficiency” vs “energy with efficiency”
• Examining the data after filtering: not practical for a large dataset.
-- e.g. a Google tweet, “we recycle gmail accounts.”, matches a keyword but is off topic
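The trade-off of phrase filtering inside the database can be sketched as follows. This is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table name `tweets`, its columns, and the sample rows are hypothetical rather than the project's schema.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL instance (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO tweets (text) VALUES (?)",
    [
        ("He has a lot of energy.",),
        ("We are investing in energy efficiency.",),
        ("Saving energy with efficiency upgrades.",),
    ],
)

# Phrase match: more precise than the single keyword "energy", so the first
# tweet is rejected, but the third tweet (on topic) is left out as well.
rows = conn.execute(
    "SELECT text FROM tweets WHERE text LIKE ?", ("%energy efficiency%",)
).fetchall()
matched = [row[0] for row in rows]
print(matched)
```

Only the second sample tweet survives the phrase match, which shows both the gain in accuracy and the risk of leaving out relevant tweets.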
Data Filtering
• Keywords:
# emission
# renewable
# climate
# recycle
# waste
# resource
# pollution
# deforestation
# environmental
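A minimal sketch of the Python keyword filter, using the keyword list above. Matching by case-insensitive substring is an assumption (the slides do not say how matching was done), and the function name and sample tweets are hypothetical.

```python
# Keyword list taken from the slide. Substring matching (an assumption)
# lets "recycle" also catch "recycled" and "recycles".
KEYWORDS = [
    "emission", "renewable", "climate", "recycle", "waste",
    "resource", "pollution", "deforestation", "environmental",
]

def matches_keywords(text, keywords=KEYWORDS):
    """Return True if any keyword occurs in the lower-cased tweet text."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)

# Made-up sample tweets for illustration.
tweets = [
    "Our stores now run on 100% renewable energy.",
    "We recycle gmail accounts.",  # false positive: not an environmental topic
    "New menu items arrive next week!",
]
filtered = [t for t in tweets if matches_keywords(t)]
print(filtered)
```

The second tweet passes the filter even though it is off topic, which is the kind of lower accuracy the keyword approach trades for speed and coverage.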
Data After Filtering
• Tweets: 68,655
• Accounts: 154
• Distribution:
Stock Price Collection: Quandl
• Financial Database
• Quandl API
• Python implementation: quandl
• Source: WIKI Prices DB from Quandl
Sentiment Analysis
• 2 approaches:
• 1. Vader: a sentiment analysis package in the Python NLTK library
-- does all the NLP work for you!
-- claims to achieve 96% accuracy on tweets
• 2. Scikit-Learn: a machine learning library
-- input data has to be preprocessed
-- various choices of models
Sentiment Analysis: Vader
• Pro: easy to use, fast to run
• Steps: cleaning text,
weighting by booster words,
assigning sentiment scores according to a lexicon.
• Output:
1. normalized compound polarity score: -1 to 1
2. positive, neutral, negative
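One way to turn the compound score into a positive, neutral, or negative label is sketched below. The cutoffs of +0.05 and -0.05 are the ones commonly suggested in the Vader documentation; the slides do not state which cutoffs this project used, so treat them as an assumption. The function name is hypothetical.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label.

    The 0.05 threshold is the conventional Vader cutoff, assumed here
    because the slides do not give the value that was actually used.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

print(label_from_compound(0.62))   # positive
print(label_from_compound(0.0))    # neutral
print(label_from_compound(-0.4))   # negative
```

With NLTK installed and the Vader lexicon downloaded, the compound score itself would come from `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]` in `nltk.sentiment.vader`.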
Sentiment Analysis: Data Preprocessing (1)
• Text Cleaning:
-- converting cases
-- removing additional white space, repeated characters
-- replacing URL, @, # with stopwords
Sentiment Analysis: Data Preprocessing (2)
• Feature Extraction:
-- removing stopwords
-- mapping contractions to their original forms
-- appending cleaned words to the feature vector
• Feature Vector
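The two preprocessing steps can be sketched roughly as follows. The placeholder tokens `url` and `at_user`, the tiny stopword and contraction lists, and the choice to keep the word after a `#` sign are all assumptions for illustration; the project's actual lists and rules are not given in the slides.

```python
import re

# Tiny illustrative lists (assumptions, not the project's real lists).
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "don't": "do not"}
STOPWORDS = {"a", "an", "the", "is", "to", "of", "and", "url", "at_user"}

def clean_text(text):
    """Lower-case the tweet and normalize URLs, mentions, hashtags, repeats."""
    text = text.lower()
    text = re.sub(r"(https?://\S+)|(www\.\S+)", "url", text)  # URLs -> placeholder
    text = re.sub(r"@\w+", "at_user", text)                   # mentions -> placeholder
    text = re.sub(r"#(\w+)", r"\1", text)                     # drop '#', keep the word
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)                # 3+ repeats -> 2
    return re.sub(r"\s+", " ", text).strip()                  # collapse whitespace

def feature_vector(text):
    """Return the cleaned, non-stopword tokens of a tweet as a list."""
    words = []
    for token in clean_text(text).split():
        expanded = CONTRACTIONS.get(token, token)  # map contraction to full form
        for word in expanded.split():
            word = word.strip(".,!?;:\"'()")
            if word and word not in STOPWORDS:
                words.append(word)
    return words

print(feature_vector("@Shell We can't waaaait to cut #emissions!!!   https://example.com"))
```

Because the placeholders are listed as stopwords, URLs and mentions disappear from the feature vector while the remaining words are kept in order.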
Sentiment Analysis: Scikit-Learn(1)
• Training Data:
-- tweets: 1,615,343
-- source:
1. Sentiment140
2. Crowdflower's Data for Everyone library
• Feature Extraction methods:
1. Bag of Words
2. TF-IDF (Term Frequency - Inverse Document Frequency)
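To make the difference between the two feature extraction methods concrete, here is a small pure-Python sketch on made-up token lists. It uses the textbook weighting tf x log(N / df); scikit-learn's `TfidfVectorizer` applies a smoothed IDF and normalizes each row by default, so its numbers would differ from these.

```python
import math
from collections import Counter

# Made-up, already tokenized documents for illustration.
docs = [
    ["renewable", "energy", "is", "great"],
    ["waste", "is", "bad"],
    ["renewable", "renewable", "future"],
]

# Bag of Words: each document becomes a mapping of term -> raw count.
bow = [Counter(doc) for doc in docs]

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))
n_docs = len(docs)

def tfidf(doc_counts):
    """Weight each raw count by log(N / df), the textbook IDF."""
    return {term: count * math.log(n_docs / df[term])
            for term, count in doc_counts.items()}

weights = [tfidf(counts) for counts in bow]
print(bow[2])
print(weights[2])
```

Terms that appear in many documents, such as "is", receive a lower weight than rare terms, whereas Bag of Words counts every term equally.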
Sentiment Analysis: Scikit-Learn(2)
• Models
1. Multinomial Naïve Bayes
2. Logistic Regression
3. SVM
• (2 feature extraction methods) x (3 models) = 6 results
• For each feature set, take the mode of the 3 models' results
• Accuracy:

                           Bag of Words   TF-IDF
Multinomial Naïve Bayes    0.767          0.761
Logistic Regression        0.777          0.779
SVM                        0.769          0.772
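Taking the mode of the three models' results for a tweet amounts to a majority vote. A minimal sketch follows; the function name is hypothetical, and because the slides do not describe how a three-way disagreement was resolved, this version simply returns `None` in that case.

```python
from collections import Counter

def majority_label(predictions):
    """Return the most common label, or None when the top labels tie.

    Tie handling is an assumption; the slides do not describe it.
    """
    counts = Counter(predictions).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]

# Predictions for one tweet from Naive Bayes, Logistic Regression, and SVM.
print(majority_label(["positive", "positive", "negative"]))
print(majority_label(["positive", "negative", "neutral"]))
```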
Sentiment Analysis
                positive   negative   neutral
Bag of Words    45,915     21,865     874
TF-IDF          46,442     21,933     279
Vader           48,851     11,386     8,417
Data Integration
• Inputs: sentiment data, stock price data, and a Twitter username vs. stock ticker mapping
• 1. Merge sentiment data with stock ticker data on username
• 2. For date delta from 1 to 7, merge (1) with stock price data on
date and ticker.
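The two merge steps could look roughly like the sketch below. It assumes that "date delta" means the price is taken delta days after the tweet date, which the slides do not spell out. All field names, the ticker mapping, and the prices are made up for illustration.

```python
from datetime import date, timedelta

# Hypothetical inputs; none of these values come from the project's data.
ticker_by_user = {"Google": "GOOGL"}
sentiments = [{"username": "Google", "date": date(2016, 3, 1), "compound": 0.6}]
prices = {
    ("GOOGL", date(2016, 3, 2)): 100.0,
    ("GOOGL", date(2016, 3, 3)): 101.0,
}

def integrate(sentiments, ticker_by_user, prices, max_delta=7):
    """Join each tweet's sentiment with the closing price delta days later."""
    merged = []
    for row in sentiments:
        ticker = ticker_by_user.get(row["username"])  # step 1: username -> ticker
        if ticker is None:
            continue
        for delta in range(1, max_delta + 1):         # step 2: one row per delta
            close = prices.get((ticker, row["date"] + timedelta(days=delta)))
            if close is not None:                     # skip days with no price
                merged.append({"ticker": ticker, "delta": delta,
                               "compound": row["compound"], "close": close})
    return merged

rows = integrate(sentiments, ticker_by_user, prices)
print(rows)
```

Days without a price, such as weekends, simply produce no row, so each tweet contributes at most seven rows.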
Linear Regression Analysis
• Sentiment data is highly sparse
--> time series analysis is not applicable
• Dealing with sparseness:
-- goal: joining all sentiment data into one dataset
-- method: normalizing all stock prices before integration
For t = delta of date, Y = closing price, D = normalized closing price
-- output:
Linear Regression Analysis
• Variance score = 0
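To illustrate what a variance score of 0 means, the sketch below fits a line by ordinary least squares and computes the coefficient of determination R², which is what scikit-learn's linear regression examples report as the variance score (assumed here to be the metric used). The normalization formula did not survive in the slides, so `normalize` uses relative change from a base price purely as an assumption. All numbers are made up.

```python
def normalize(closes):
    """Assumed normalization: relative change from the first closing price."""
    base = closes[0]
    return [(y - base) / base for y in closes]

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(x, y, slope, intercept):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot

# Made-up sentiment scores and normalized price changes with no relationship.
sentiment = [-1.0, -1.0, 1.0, 1.0]
change = [0.01, -0.01, 0.01, -0.01]
slope, intercept = fit_line(sentiment, change)
score = r_squared(sentiment, change, slope, intercept)
print(slope, intercept, score)
```

A score of 0 means the fitted line explains no more of the variation in price than always predicting the mean would.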
Plotting the Result (1)
• Google
Plotting the Result (2)
• Johnson & Johnson
Plotting the Result (3)
• McDonald’s
Plotting the Result (4)
• PepsiCo
Plotting the Result (5)
• Exxon
Conclusion
• A correlation between the sentiment of environmental tweets and
drops in stock price might exist in some cases.
• Issues:
1. Most tweets posted by official accounts are positive.
2. Different types of enterprises might focus on different aspects of
their corporate images.
• Future Work:
1. Improving the filtering strategy
2. Exploring other analysis models and plotting strategies
3. Adding tweets mentioning these companies by other users
4. Analyzing tweets by environmental NGOs
5. Incorporating other Social Network Analysis approaches
