Exploring Correlation Between Sentiment of Environmental Tweets and the Stock Market

Yi-Shan Shir
Instructor: Dr. Nam P. Nguyen
Department of Computer and Information Science
Towson University

Overview
• Motivation
• Research Approaches
• Tools
• Data Collection
• Sentiment Analysis
• Analysis of Correlation between Sentiments and Stock Price
• Conclusion

Motivation
• Hiroko Okajima and Barin Nag, Department of e-Business and
Technology Management, Towson University
• Previous studies:
Sentiment on social media can predict stock market fluctuations
• Question:
What about specific terms?
-- Environmental tweets over 5 years.

Tools
• Environment: Ubuntu 16.04
• Language: Python, SQL
• Database: MySQL
• Approaches:
1. Natural Language Processing
-- Sentiment Analysis
2. Machine Learning

Locating Target Enterprises
• PHOTOGRAPH BY KAREN DUCEY, GETTY IMAGES

Locating Target Enterprises
• Target set 1:
Top 100 from the Fortune 500 list
• Target set 2:
Enterprises with a significant (notorious) reputation on environmental
issues
-- accounts: tweets > 30K or top 50%
Category | Company/Brand
IT (renewable energy) | Amazon, Samsung, Google
Oil | Shell, BP, Exxon
Palm Oil (Deforestation) | Nestle, JNJ, Unilever
Wastes | Starbucks, CocaCola, PepsiCo
Fast Food (Deforestation) | McDonalds, BurgerKing, KFC, TacoBell

Data Collection(1): Twitter API
• Twitter API
• Python implementation: Tweepy
• Cons: only allows data collection for the most recent week

Data Collection(2): Advanced Search
• Scraping tweets from search result of Twitter advanced search
• Source code: Jefferson Henrique
https://github.com/Jefferson-Henrique/GetOldTweets-python
• Cons: adjustments have to be made whenever Twitter changes its search page.

Data Storage
• Raw data:
-- tweets: 5,818,254
-- accounts: 158
• Database schema:
1. Raw data 2. Filtered data 3. Stock tickers

Data Filtering
• 1. Filter with Python:
-- Filtering through a list of keywords
-- Pros: fast, keep as much data as possible
-- Cons: lower accuracy
-- e.g. “He has a lot of energy.”
• 2. Filter with SQL
-- Filtering inside the DB
-- Pros: higher accuracy
-- Cons: slow, may leave out tweets
-- e.g. “energy efficiency” vs “energy with efficiency”
• Examining the data by hand after filtering: not practical for a large dataset.
-- e.g. Google: “we recycle gmail accounts.”
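The trade-off of the SQL filter can be sketched as follows. This is a minimal illustration, not the project's actual query: `sqlite3` stands in for MySQL so the example is self-contained, and the table name `raw_tweets`, its columns, and the sample rows are all hypothetical.

```python
import sqlite3

# Assumption: sqlite3 replaces MySQL here; table and column names are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_tweets (id INTEGER PRIMARY KEY, username TEXT, text TEXT)"
)
conn.executemany(
    "INSERT INTO raw_tweets (username, text) VALUES (?, ?)",
    [
        ("acme", "We improved energy efficiency across our data centers."),
        ("acme", "He has a lot of energy."),
        ("acme", "Saving energy with efficiency upgrades."),
    ],
)

# Phrase match inside the database: the idiom "a lot of energy" is rejected,
# but the reworded third tweet is left out as well.
rows = conn.execute(
    "SELECT text FROM raw_tweets WHERE LOWER(text) LIKE ? ORDER BY id",
    ("%energy efficiency%",),
).fetchall()
filtered = [text for (text,) in rows]
print(filtered)
```

Only the first tweet survives, which shows both sides of the slide's point: higher accuracy, at the cost of dropping a relevant tweet that words the phrase differently.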

Data Filtering
• Keywords:
# emission
# renewable
# climate
# recycle
# waste
# resource
# pollution
# deforestation
# environmental
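A keyword filter over this list might look like the sketch below. It assumes case-insensitive substring matching, which the slides do not confirm; the function name `matches_keywords` and the sample tweets are hypothetical.

```python
KEYWORDS = (
    "emission", "renewable", "climate", "recycle", "waste",
    "resource", "pollution", "deforestation", "environmental",
)

def matches_keywords(text, keywords=KEYWORDS):
    """Return True if any keyword occurs as a substring (case-insensitive)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)

tweets = [
    "Our stores now run on 100% renewable power.",
    "We recycle Gmail accounts.",   # off-topic, but still contains "recycle"
    "New phone launches next week!",
]
kept = [t for t in tweets if matches_keywords(t)]
print(kept)
```

The second tweet passes the filter even though it is not about the environment, which is the kind of false positive that motivates the stricter SQL step.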

Data After Filtering
• Tweets: 68,655
• Accounts: 154
• Distribution:

Stock Price Collection: Quandl
• Financial Database
• Quandl API
• Python implementation: quandl
• Source: WIKI Prices DB from Quandl

Sentiment Analysis
• 2 approaches:
• 1. VADER: a sentiment analysis package in the Python NLTK library
-- does all the NLP work for you!
-- claims to achieve 96% accuracy on tweets
• 2. Scikit-Learn: a machine learning library
-- input data has to be preprocessed
-- various choices of models

Sentiment Analysis: VADER
• Pro: easy to use, fast to run
• Steps:
-- cleaning text
-- weighting by booster words
-- assigning a sentiment score according to a lexicon
• Output:
1. normalized compound polarity score: -1 to 1
2. positive, neutral, negative
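The lexicon-plus-booster idea can be illustrated with a toy scorer. This is not VADER itself: the lexicon and booster values below are invented for the example, the 0.05 cutoff is a commonly used convention rather than something stated in the slides, and `alpha = 15` is assumed to mirror the normalization constant in VADER's published code.

```python
import math
import re

# Toy tables with illustrative values; VADER's real lexicon is far larger.
LEXICON = {"clean": 1.5, "great": 3.0, "proud": 2.0, "spill": -2.0, "toxic": -2.5}
BOOSTERS = {"very": 0.3, "really": 0.3, "slightly": -0.3}

def compound_score(text, alpha=15.0):
    """Sum lexicon valences (adjusted by a preceding booster), squash to (-1, 1)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    for i, token in enumerate(tokens):
        valence = LEXICON.get(token)
        if valence is None:
            continue
        if i > 0 and tokens[i - 1] in BOOSTERS:
            boost = BOOSTERS[tokens[i - 1]]
            # A booster moves the valence away from zero, a dampener toward it.
            valence += boost if valence > 0 else -boost
        total += valence
    return total / math.sqrt(total * total + alpha)

def label(score, threshold=0.05):
    """Map a compound score to one of three classes."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

print(label(compound_score("We are really proud of our clean energy")))
print(label(compound_score("A toxic spill")))
print(label(compound_score("Meeting at noon")))
```

In the real workflow the score would come from NLTK's VADER analyzer; the sketch only shows why a tweet with no lexicon words ends up neutral and how booster words shift the score.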

Sentiment Analysis: Data Preprocessing (1)
• Text Cleaning:
-- converting cases
-- removing additional white space, repeated characters
-- replacing URL, @, # with stopwords
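The cleaning steps can be sketched with regular expressions. The placeholder tokens `URL` and `AT_USER` are assumptions (the slides do not name them), and so is the choice to keep a hashtag's word while dropping only the `#`.

```python
import re

def clean_tweet(text):
    """Normalize one tweet; placeholder tokens are hypothetical choices."""
    text = text.lower()                                     # convert case
    text = re.sub(r"(https?://\S+|www\.\S+)", "URL", text)  # links -> placeholder
    text = re.sub(r"@\w+", "AT_USER", text)                 # mentions -> placeholder
    text = re.sub(r"#(\w+)", r"\1", text)                   # "#word" -> "word" (assumption)
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)              # squeeze 3+ repeats to 2
    text = re.sub(r"\s+", " ", text).strip()                # collapse white space
    return text

print(clean_tweet("Sooooo   proud of @Shell and our #Renewable plan!!! https://t.co/abc"))
```

The placeholders are inserted after lowercasing so they stay recognizable and can be dropped later together with ordinary stopwords.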

Sentiment Analysis: Data Preprocessing (2)
• Feature Extraction:
-- removing stopwords
-- mapping contractions to their original forms
-- appending cleaned words to the feature vector
• Feature Vector
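A minimal sketch of building the feature vector from an already cleaned tweet follows. The stopword set and contraction table are tiny hypothetical samples, and `URL` / `AT_USER` are assumed placeholder tokens from the cleaning step.

```python
# Hypothetical, deliberately small lookup tables for illustration.
CONTRACTIONS = {"don't": "do not", "we're": "we are", "it's": "it is"}
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and",
             "we", "it", "do", "URL", "AT_USER"}

def feature_vector(cleaned_text):
    """Expand contractions, drop stopwords, and collect the remaining words."""
    words = []
    for raw in cleaned_text.split():
        raw = raw.strip("\"?!,.:;")              # trim surrounding punctuation
        expanded = CONTRACTIONS.get(raw, raw)    # e.g. "don't" -> "do not"
        for word in expanded.split():
            if word and word not in STOPWORDS:
                words.append(word)
    return words

print(feature_vector("we're proud of the cleanup, URL"))
print(feature_vector("we don't waste water"))
```

Note that "not" is kept out of the stopword set on purpose, so the negation in an expanded contraction survives into the feature vector.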

Sentiment Analysis: Scikit-Learn(1)
• Training Data:
-- tweets: 1,615,343
-- source:
1. Sentiment140
2. Crowdflower's Data for Everyone library
• Feature Extraction methods:
1. Bag of Words
2. TF-IDF (Term Frequency - Inverse Document Frequency)
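The two feature extraction methods can be compared on a toy corpus. The sketch uses the textbook formula tf × ln(N / df); scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers would differ. The three documents are made up.

```python
import math
from collections import Counter

# Toy corpus of already tokenized tweets (hypothetical data).
docs = [
    ["renewable", "energy", "goal"],
    ["waste", "reduction", "goal"],
    ["renewable", "renewable", "power"],
]
vocabulary = sorted({word for doc in docs for word in doc})

def bag_of_words(doc):
    """Raw term counts, one slot per vocabulary word."""
    counts = Counter(doc)
    return [counts[word] for word in vocabulary]

def tf_idf(doc):
    """Term frequency times inverse document frequency (unsmoothed)."""
    counts = Counter(doc)
    vector = []
    for word in vocabulary:
        tf = counts[word] / len(doc)
        df = sum(1 for d in docs if word in d)   # documents containing the word
        vector.append(tf * math.log(len(docs) / df))
    return vector

print(vocabulary)
print(bag_of_words(docs[2]))
print(tf_idf(docs[0]))
```

In the first document "energy" and "goal" have the same count, but "energy" gets the larger TF-IDF weight because it appears in fewer documents.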

Sentiment Analysis: Scikit-Learn(2)
• Models
1. Multinomial Naïve Bayes
2. Logistic Regression
3. SVM
• Accuracy:
• (2 feature extraction methods) x (3 models) = 6 results
• For each feature set, take the mode of the 3 results
Model | Bag of Words | TF-IDF
Multinomial Naïve Bayes | 0.767 | 0.761
Logistic Regression | 0.777 | 0.779
SVM | 0.769 | 0.772
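Taking the mode of three results can be read as a per-tweet majority vote across the three models, which is the interpretation assumed in this sketch. The predicted labels are invented; in the real pipeline they would come from the trained classifiers.

```python
from collections import Counter

# Hypothetical per-model predictions for four tweets (one feature set).
predictions = {
    "naive_bayes":         ["positive", "negative", "positive", "neutral"],
    "logistic_regression": ["positive", "negative", "negative", "neutral"],
    "svm":                 ["positive", "positive", "negative", "positive"],
}

def majority_vote(per_model):
    """Return the most common label per tweet; ties go to the first model listed."""
    voted = []
    for labels in zip(*per_model.values()):
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted

final_labels = majority_vote(predictions)
print(final_labels)
```

With three models and three classes a three-way tie is possible; here it falls to the first model's label, which is a design choice of the sketch rather than something the slides specify.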

Sentiment Analysis
Method | Positive | Negative | Neutral
Bag of Words | 45,915 | 21,865 | 874
TF-IDF | 46,442 | 21,933 | 279
VADER | 48,851 | 11,386 | 8,417

Data Integration
• Inputs: sentiment data, stock price data, and a Twitter username vs.
stock ticker mapping
• 1. Merge sentiment data with stock ticker data on username.
• 2. For each date delta from 1 to 7, merge the result of (1) with stock
price data on date and ticker.
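The two merge steps can be sketched with plain dictionaries. All usernames, tickers, scores, and prices below are made up, and the inner-join behavior (rows with no ticker or no price on a given date are dropped) is an assumption about how the merge was done.

```python
from datetime import date, timedelta

# Hypothetical inputs.
ticker_by_username = {"acme_official": "AAA"}
sentiments = [
    {"username": "acme_official", "date": date(2017, 3, 1), "compound": 0.62},
    {"username": "unlisted_brand", "date": date(2017, 3, 1), "compound": -0.20},
]
close_by_ticker_date = {
    ("AAA", date(2017, 3, 2)): 100.0,
    ("AAA", date(2017, 3, 3)): 101.5,
    ("AAA", date(2017, 3, 6)): 99.0,
}

def integrate(sentiments, ticker_by_username, close_by_ticker_date, max_delta=7):
    """Join sentiment rows to closing prices 1..max_delta days after the tweet."""
    merged = []
    for row in sentiments:
        ticker = ticker_by_username.get(row["username"])   # step 1: on username
        if ticker is None:
            continue
        for delta in range(1, max_delta + 1):              # step 2: on date + ticker
            price_date = row["date"] + timedelta(days=delta)
            close = close_by_ticker_date.get((ticker, price_date))
            if close is None:
                continue   # no price row for that date (e.g. a non-trading day)
            merged.append({**row, "ticker": ticker, "delta": delta, "close": close})
    return merged

merged = integrate(sentiments, ticker_by_username, close_by_ticker_date)
print([(r["delta"], r["close"]) for r in merged])
```

Each tweet can therefore contribute up to seven rows, one per date delta, which is what later allows the effect to be examined at different lags.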

Linear Regression Analysis
• Sentiment data is highly sparse
--> time series analysis is not applicable
• Dealing with sparseness:
-- goal: joining all sentiment data into one dataset
-- method: normalizing all stock prices before integration
For t = date delta, Y = closing price, D = normalized closing price
-- output:

Linear Regression Analysis
• Variance score = 0
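To show what a score of 0 means, here is a small least-squares sketch. It assumes the variance score is the coefficient of determination R², which the slides do not state, and the sentiment and normalized-price values are invented for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Hypothetical data: sentiment score vs. normalized closing price D.
sentiment = [-1.0, -1.0, 1.0, 1.0]
normalized_close = [0.02, -0.02, 0.02, -0.02]

slope, intercept = fit_line(sentiment, normalized_close)
score = r_squared(sentiment, normalized_close, slope, intercept)
print(slope, intercept, score)
```

A score of 0 means the fitted line predicts no better than the mean of D alone, so in this pooled form sentiment explains none of the variation in the normalized price.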

Plotting the Result
• Google
• Johnson & Johnson
• McDonald’s
• PepsiCo
• Exxon

Conclusion
• A correlation between the sentiment of environmental tweets and
drops in stock price might exist in some cases.
• Issues:
1. Most tweets posted by official accounts are positive.
2. Different types of enterprises might focus on different aspects of
their corporate image.
• Future Work:
1. Improving filtering strategy
2. Exploring other analysis models and plotting strategies
3. Adding tweets mentioning these companies by other users
4. Analyzing tweets by environmental NGOs
5. Incorporating other Social Network Analysis approaches
