Video of the presentation is available here: https://youtu.be/L6EMnvALYtU
Talk: Fortune 500 Company Performance Analysis Using Social Networks
Speaker: Yi-Shan Shir
This presentation focuses on the correlation between the financial performance of Fortune 500 companies and their social media relationships and behavior. The findings can assist in predicting Fortune 500 stock performance from a number of social network analysis metrics.
Exploring Correlation Between Sentiment of Environmental Tweets and the Stock Market
1. Yi-Shan Shir
Instructor: Dr. Nam P. Nguyen
Department of Computer and Information Science
Towson University
2. Overview
• Motivation
• Research Approaches
• Tools
• Data Collection
• Sentiment Analysis
• Analysis of Correlation between Sentiments and Stock Price
• Conclusion
3. Motivation
• Hiroko Okajima and Barin Nag, Department of e-Business and
Technology Management, Towson University
• Previous studies:
Sentiment on social media can predict stock market fluctuations
• Question:
What about specific terms?
-- Environmental tweets over 5 years.
6. Locating Target Enterprises
• Target set 1:
Top 100 from the Fortune 500 list
• Target set 2:
Enterprises with a significant (notorious) reputation on environmental issues
-- accounts: tweets > 30K or top 50%
Category                    Company/Brand
IT (renewable energy)       Amazon, Samsung, Google
Oil                         Shell, BP, Exxon
Palm Oil (Deforestation)    Nestle, JNJ, Unilever
Wastes                      Starbucks, CocaCola, PepsiCo
Fast Food (Deforestation)   McDonalds, BurgerKing, KFC, TacoBell
7. Data Collection(1): Twitter API
• Twitter API
• Python implementation: Tweepy
• Cons: only allows collecting tweets from the most recent week
9. Data Collection(2): Advanced Search
• Scraping tweets from the search results of Twitter advanced search
• Source code: Jefferson Henrique
https://github.com/Jefferson-Henrique/GetOldTweets-python
• Cons: adjustments have to be made whenever Twitter changes something.
10. Data Storage
• Raw data:
-- tweets: 5,818,254 tweets
-- accounts: 158
• Database schema:
1. Raw data 2. Filtered data 3. Stock tickers
12. Data Filtering
• 1. Filter with Python:
-- Filtering through a list of keywords
-- Pros: fast, keep as much data as possible
-- Cons: lower accuracy
-- e.g. “He has a lot of energy.”
• 2. Filter with SQL
-- Filtering inside the DB
-- Pros: higher accuracy
-- Cons: slow, may leave out tweets
-- e.g. “energy efficiency” vs “energy with efficiency”
• Examining the data after filtering: not practical for a large dataset.
-- e.g. Google: “we recycle gmail accounts.”
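The trade-off between the two filters can be sketched in a few lines. This is a minimal illustration, not the code used in the talk: the keyword and phrase lists are hypothetical, and the strict filter uses a Python regular expression as a stand-in for the SQL pattern matching done inside the database.

```python
import re

# Hypothetical lists; the actual keywords used in the talk are not shown.
KEYWORDS = ["energy", "recycle", "climate"]
PHRASES = ["energy efficiency", "renewable energy"]

def loose_filter(tweet):
    """Keep a tweet if any keyword appears as a substring (fast, keeps more data)."""
    text = tweet.lower()
    return any(keyword in text for keyword in KEYWORDS)

def strict_filter(tweet):
    """Keep a tweet only if an exact phrase appears on word boundaries."""
    text = tweet.lower()
    return any(re.search(r"\b" + re.escape(p) + r"\b", text) for p in PHRASES)

tweets = [
    "He has a lot of energy.",
    "We improved energy efficiency in our data centers.",
    "Saving energy with efficiency upgrades.",
]
loose_kept = [t for t in tweets if loose_filter(t)]
strict_kept = [t for t in tweets if strict_filter(t)]
```

The loose filter keeps all three tweets, including the irrelevant first one. The strict filter keeps only the second and drops the third, which is relevant but phrased differently.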
15. Stock Price Collection: Quandl
• Financial Database
• Quandl API
• Python implementation: quandl
• Source: WIKI Prices DB from Quandl
16. Sentiment Analysis
• 2 approaches:
• 1. VADER: a sentiment analysis tool included in Python's NLTK library
-- does all the NLP work for you!
-- claims to achieve 96% accuracy on tweets
• 2. Scikit-Learn: a machine learning library
-- input data has to be preprocessed
-- various choices of models
17. Sentiment Analysis: Vader
• Pro: easy to use, fast to run
• Steps: cleaning text,
weighting by booster words,
assigning a sentiment score according to a lexicon.
• Output:
1. normalized compound polarity score: -1 to 1
2. positive, neutral, negative
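The lexicon-and-booster idea can be illustrated with a toy scorer. This is not VADER itself: the lexicon values, booster words, and the 0.05 labeling threshold below are invented or assumed for illustration, and the squashing function x / sqrt(x^2 + alpha) is only modeled on the normalization VADER is understood to use.

```python
import math

# Toy lexicon and booster table, invented for illustration (not VADER's real data).
LEXICON = {"good": 1.9, "great": 3.1, "bad": -2.5, "terrible": -2.1}
BOOSTERS = {"very": 0.3, "extremely": 0.3}

def compound_score(text, alpha=15.0):
    """Sum boosted word valences, then squash the sum into the range (-1, 1)."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    total = 0.0
    for i, word in enumerate(words):
        valence = LEXICON.get(word, 0.0)
        if valence and i > 0 and words[i - 1] in BOOSTERS:
            # A booster pushes the valence further away from zero.
            valence += math.copysign(BOOSTERS[words[i - 1]], valence)
        total += valence
    return total / math.sqrt(total * total + alpha)

def label(score, threshold=0.05):
    """Map a compound score to a coarse label (threshold is an assumption)."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

boosted = compound_score("very good")
plain = compound_score("good")
```

Here `boosted` comes out larger than `plain`, and text with no lexicon words scores 0 and is labeled neutral.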
18. Sentiment Analysis: Data Preprocessing (1)
• Text Cleaning:
-- converting case
-- removing additional white space, repeated characters
-- replacing URL, @, # with stopwords
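A minimal sketch of these cleaning steps, under some assumptions: the placeholder tokens `URL` and `AT_USER` are hypothetical names chosen here so they can later be dropped as stopwords, hashtags keep their word and lose only the "#", and runs of three or more identical characters are collapsed to two.

```python
import re

def clean_tweet(text):
    """Lowercase a tweet and replace noisy elements with placeholder tokens."""
    text = text.lower()
    # Replace links with a placeholder token (hypothetical token name).
    text = re.sub(r"(https?://\S+|www\.\S+)", "URL", text)
    # Replace @mentions with a placeholder token (hypothetical token name).
    text = re.sub(r"@\w+", "AT_USER", text)
    # Keep the hashtag's word but drop the '#' (an assumption, see lead-in).
    text = re.sub(r"#(\w+)", r"\1", text)
    # Collapse runs of 3+ identical characters to two: "sooooo" -> "soo".
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Collapse any remaining extra white space.
    return " ".join(text.split())

cleaned = clean_tweet("Sooooo   happy!!! @Google check https://t.co/abc #Recycle")
```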
19. Sentiment Analysis: Data Preprocessing (2)
• Feature Extraction:
-- removing stopwords
-- mapping contractions to their original forms
-- appending cleaned words to the feature vector
• Feature Vector
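The feature extraction steps might look like the sketch below. The stopword and contraction tables are small hypothetical subsets, and the placeholder tokens `URL` and `AT_USER` are assumed to come from an earlier cleaning step.

```python
# Hypothetical subsets; a real run would use much larger tables.
STOPWORDS = {"a", "an", "the", "is", "to", "of", "URL", "AT_USER"}
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}

def feature_vector(cleaned_text):
    """Turn a cleaned tweet into a list of feature words."""
    features = []
    for token in cleaned_text.split():
        # Strip surrounding punctuation but keep inner apostrophes.
        token = token.strip(".,!?:;()\"")
        # Expand contractions such as "don't" -> "do not".
        expanded = CONTRACTIONS.get(token, token)
        for word in expanded.split():
            # Skip stopwords, placeholders, and tokens not starting with a letter.
            if word and word not in STOPWORDS and word[0].isalpha():
                features.append(word)
    return features

features = feature_vector("AT_USER we don't recycle the plastic, URL")
```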
20. Sentiment Analysis: Scikit-Learn(1)
• Training Data:
-- tweets: 1,615,343
-- source:
1. Sentiment140
2. CrowdFlower's Data for Everyone library
• Feature Extraction methods:
1. Bag of Words
2. TF-IDF (Term Frequency - Inverse Document Frequency)
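To make the difference between the two feature extraction methods concrete, here is a pure-Python sketch on a made-up corpus. It uses the textbook TF-IDF formula; scikit-learn's vectorizers apply a smoothed IDF and normalize each vector by default, so their numbers would differ.

```python
import math
from collections import Counter

# Made-up tokenized tweets, for illustration only.
docs = [
    ["clean", "energy", "is", "great"],
    ["energy", "prices", "rise"],
    ["great", "recycling", "program"],
]

def bag_of_words(doc):
    """Bag of Words: raw count of each term in the document."""
    return Counter(doc)

def tf_idf(doc, corpus):
    """Textbook TF-IDF: term frequency times log(N / document frequency)."""
    counts = Counter(doc)
    n_docs = len(corpus)
    weights = {}
    for term, count in counts.items():
        df = sum(1 for d in corpus if term in d)
        weights[term] = (count / len(doc)) * math.log(n_docs / df)
    return weights

bow = bag_of_words(docs[0])
weights = tf_idf(docs[0], docs)
```

In the first document, Bag of Words gives "clean" and "energy" the same count, while TF-IDF weights "clean" higher because it appears in fewer documents.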
21. Sentiment Analysis: Scikit-Learn(2)
• Models
1. Multinomial Naïve Bayes
2. Logistic Regression
3. SVM
• Accuracy:
• (2 feature extraction methods) x (3 models) = 6 results
• For each feature set, take the mode of the 3 results
Model                     Bag of Words   TF-IDF
Multinomial Naïve Bayes   0.767          0.761
Logistic Regression       0.777          0.779
SVM                       0.769          0.772
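Taking the mode of the three results can be read as a majority vote over the three models' predicted labels for each tweet. That reading is an assumption, and the predictions below are made up for illustration.

```python
from collections import Counter

def majority_label(predictions):
    """Return the most common label among the models' predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical per-tweet predictions (Naive Bayes, Logistic Regression, SVM)
# for a single feature set.
predictions_per_tweet = [
    ("positive", "positive", "negative"),
    ("negative", "negative", "negative"),
    ("neutral", "positive", "neutral"),
]
final_labels = [majority_label(p) for p in predictions_per_tweet]
```

If all three models disagree there is no true mode; this sketch then falls back to the first model's label, which is one possible tie-breaking choice.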
23. Data Integration
• Inputs: sentiment data, stock price data, and a Twitter username to stock ticker mapping
• 1. Merge sentiment data with the stock ticker data on username.
• 2. For each date delta from 1 to 7, merge the result of (1) with the stock price data on date and ticker.
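The two merge steps can be sketched with plain dictionaries. The field names, the account `acme`, the ticker `ACME`, and the prices are all hypothetical, and the sketch assumes the date delta is counted in calendar days after the tweet date.

```python
from datetime import date, timedelta

# Hypothetical sample data; field names are illustrative.
sentiments = [{"username": "acme", "date": date(2017, 3, 1), "score": -0.4}]
tickers = {"acme": "ACME"}  # Twitter username -> stock ticker
prices = {                  # (ticker, date) -> closing price
    ("ACME", date(2017, 3, 2)): 101.0,
    ("ACME", date(2017, 3, 3)): 99.5,
}

def integrate(sentiments, tickers, prices, max_delta=7):
    """Join each sentiment row with the closing prices 1..max_delta days later."""
    rows = []
    for s in sentiments:
        # Step 1: attach the ticker by username.
        ticker = tickers.get(s["username"])
        if ticker is None:
            continue  # no ticker known for this account
        # Step 2: look up the price for each date delta.
        for delta in range(1, max_delta + 1):
            price_date = s["date"] + timedelta(days=delta)
            close = prices.get((ticker, price_date))
            if close is not None:  # skips days without a price, e.g. weekends
                rows.append({"ticker": ticker, "score": s["score"],
                             "delta": delta, "close": close})
    return rows

rows = integrate(sentiments, tickers, prices)
```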
24. Linear Regression Analysis
• Sentiment data is highly sparse
--> time series analysis is not applicable
• Dealing with sparseness:
-- goal: joining all sentiment data into one dataset
-- method: normalizing all stock prices before integration
For t = date delta, Y = closing price, D = normalized closing price
-- output:
31. Conclusion
• A correlation between the sentiment of environmental tweets and
drops in stock price might exist in some cases.
• Issues:
1. Most tweets tweeted by official accounts are positive.
2. Different types of enterprises might focus on different aspects of
their corporate image.
• Future Work:
1. Improving filtering strategy
2. Exploring other analysis models and plotting strategies
3. Adding tweets mentioning these companies by other users
4. Analyzing tweets by environmental NGOs
5. Incorporating other Social Network Analysis approaches