Objective of the Project
Tweet sentiment analysis gives businesses insight into customers and competitors. In this project, we combined several text pre-processing techniques with machine learning algorithms. Neural network, Random Forest and Logistic Regression models were trained on the Sentiment140 Twitter dataset. We then predicted the sentiment of a hold-out test set of tweets. We used both Python and PySpark (with a local SparkContext) for different parts of the pre-processing and modelling.
2. Predicting Sentiment from Tweets
Business Goals
● To help businesses understand customers’ feelings towards their products and brands, and the response to their advertising campaigns or product launches, based on the sentiment (positive/negative) of the tweets concerning them.
Project Goals
● Build and assess different models that determine whether a tweet is positive or negative: train each model on a large split of the data (training data) and measure how accurately it predicts sentiment on the test data, using Python and its libraries.
● Gain experience using Hadoop, Spark and PySpark.
3. Sentiment140 Dataset
Dataset
● 1.6M tweets from 2009, 800k positive and 800k negative.
● Tweets were labeled automatically by treating emoticons as noisy sentiment labels (“distant supervision”).
● Tweet labels: 0 = negative, 4 = positive.
Attributes - 6 fields
1. target: the polarity of the tweet (0 = negative, 4 = positive)
2. ids: the id of the tweet (2087)
3. date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
4. flag: the query (lyx). If there is no query, this value is NO_QUERY.
5. user: the user that tweeted (robotickilldozr)
6. text: the text of the tweet (Lyx is cool)
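The six fields above can be read with Python's standard csv module. This is a minimal sketch, assuming the file is a headerless CSV with quoted fields; the function name read_tweets, the added "label" key and the one-row sample (built from the example values above) are illustrative, not part of the project code.

```python
import csv
import io

# Column names follow the six attributes listed above.
FIELDS = ["target", "ids", "date", "flag", "user", "text"]


def read_tweets(handle):
    """Yield one dict per CSV row, with the 0/4 polarity mapped to 0/1."""
    for row in csv.DictReader(handle, fieldnames=FIELDS):
        # Assumption: a binary 0/1 label is more convenient for modelling.
        row["label"] = 1 if row["target"] == "4" else 0
        yield row


# Hypothetical one-row sample; a real run would pass an open file instead.
sample = io.StringIO(
    '"4","2087","Sat May 16 23:58:44 UTC 2009","lyx","robotickilldozr","Lyx is cool"\n'
)
rows = list(read_tweets(sample))
print(rows[0]["user"], rows[0]["label"])
```

Mapping 4 to 1 keeps the two classes but gives the usual binary encoding expected by most classifiers.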
4. Data Examples
target | id | date | flag | user | text
0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
4 | 1467822272 | Mon Apr 06 22:22:45 PDT 2009 | NO_QUERY | ersle | I LOVE @Health4UandPets u guys r the best!!
5. Analysis Plan
Cleanse and Preprocess data
Explore the Data
● Word Clouds
● Bar graphs of Hashtags
Assess pros and cons of different models
Choose models
Split Data into Train and Test Sets
For each Model
● Choose parameters
● Make predictions
● Assess performance of model
Discussion and Comparison
Conclusion
6. Data Cleansing, Pre-processing and Modeling
Using Python
Clean, Pre-process and Explore
● Clean Tweets
● Tokenize, etc.
● EDA- Word Clouds
● EDA- Hashtag Graphs
Models
● TF-IDF + Logistic Regression
● Random Forest
● fastText
Evaluate Performance
Using PySpark
Load Clean Tweets into PySpark
● Remove Nulls (a few thousand)
Divide data set into train and test sets
Build Models
● TF-IDF + Logistic Regression
● 2-Gram
● 3-Gram
Compare Predictions to Labels
Evaluate Performance
7. Data Cleansing and Pre-processing
Taming the Tweets
● Used NLTK's PorterStemmer for stemming
● Defined find-and-replace patterns to remove special characters and stop words
Processing flow:
Tweets → Remove special chars (not #) → Remove words < 3 chars → Tokenize words → Stem the words → Stitch back together = Clean Tweets
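The cleaning steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's actual code: the function name clean_tweet, the lowercasing step and the whitespace tokenizer are choices made here, stop-word removal is left out, and the stemmer is passed in as a callable so the sketch runs without NLTK installed.

```python
import re


def clean_tweet(text, stem=lambda word: word, min_len=3):
    """Return a cleaned tweet string following the steps listed above."""
    # Replace everything except letters and '#' with a space.
    text = re.sub(r"[^a-zA-Z#]", " ", text)
    # Tokenize on whitespace (lowercased, an assumption) and drop words
    # shorter than min_len characters.
    tokens = [w for w in text.lower().split() if len(w) >= min_len]
    # Stem each token, then stitch the tokens back together.
    return " ".join(stem(w) for w in tokens)


print(clean_tweet("I LOVE @Health4UandPets u guys r the best!!"))
# prints: love health uandpets guys the best
```

To use the Porter stemmer mentioned above, pass stem=nltk.stem.PorterStemmer().stem when NLTK is available.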
8. Data Exploration: WordClouds of Positive Tweets
Positive Tweets
● Many positive words such as love, nice, well.
● Many neutral words like time, think, today, work.
● Positive and negative tweets are different, but not in an obvious way.
9. WordClouds of Negative Tweets
Negative Tweets
● Some negative words such as hate, sick, damn.
● Many neutral words like time, think, today, work.
● Difficult to say what actions should be taken.
10. Hashtags- Positive - Many items reflect the time frame (4/09-6/09)
● #FollowFriday or #ff started in 2009
● McFly British rock band, #1 album May 2009
11. Hashtags- Negative - What do we learn here?
● Iran Election took place in June 2009, then the arrests
● iPhone 3GS released June 9, 2009 (not great...)
12. Model selection-Why we chose what we chose
We read several articles on sentiment analysis of tweets and of the Sentiment140 dataset. Among the successful models were:
● TF-IDF + N-Gram + Logistic Regression
● Random Forest
● fastText
13. Pros and Cons of Chosen Approaches
Approach | Pros | Cons
Logistic Regression | Well-understood binary classification method | Prone to over-fitting
Random Forest | Decorrelates trees, reduced variance | Not as easy to visually interpret
fastText | Fast, Conformed performance, Memory | C++ language, Fixed data format, File I/O
N-gram | Captures “not good” or “not the best” instead of “good” or “best” | Not “linguistically based”, Lacks longer N dependencies
14. Assessing Model Performance
AUC and Confusion Matrix
AUC: Area under the Receiver Operating Characteristic (ROC) curve
● How much better than chance does the model separate positives from negatives? (0.5 = chance, 1.0 = perfect)
Confusion Matrix, Recall and Precision (illustrated with a cancer-screening example)
● Recall: What % of actual cancers are detected?
● Precision: What % of detected cancers are actual cancers?
15. Performance Metrics to Evaluate Models
● Accuracy = (True positives + True negatives) / Total Examples
● Precision = True positives / (True positives + False Positives)
● Recall =True positives / (True positives + False Negatives)
● AUC = Area under the Receiver Operating Characteristic (ROC) curve
● F1 score = 2 * (Precision * Recall) / (Precision + Recall)
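The formulas above translate directly into code. The sketch below is illustrative only: the function names and the example counts are hypothetical, no guard is included for empty denominators, and AUC is computed with the pairwise-ranking definition (the fraction of positive/negative pairs in which the positive example receives the higher score), which equals the area under the ROC curve.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics above from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical counts and scores, for illustration only.
metrics = classification_metrics(tp=8, fp=2, tn=6, fn=4)
print(metrics["accuracy"], metrics["precision"])
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

The pairwise AUC is quadratic in the number of examples, so it suits small checks rather than the full 1.6M-tweet dataset.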
16. Model 1: Random Forest
Pre-processing | # of words | Accuracy | Precision | Recall | AUC | F1-score
BoW | 6000 | 0.6856 | 0.6856 | 0.6856 | 0.6856 | 0.6856
Run time and results by configuration:
● Bag of Words, 300 words, 1 tree: 14 s, results 0.61~0.62
● Bag of Words, 300 words, 100 trees: 1114 s, results 0.61~0.63
● Bag of Words, 6000 words, 1 tree: 266 s
● Bag of Words, 6000 words, 100 trees: more than 12 hours
17. What is fastText?
fastText is a library for efficient learning of word representations and sentence classification.
Requirements
fastText builds on modern Mac OS and Linux distributions. Since it uses C++11 features, it requires a compiler with good C++11 support. These include:
● (gcc-4.6.3 or newer) or (clang-3.3 or newer)
Compilation is carried out using a Makefile, so you will need to have a working make. For the word-similarity evaluation script you will need:
● Python 2.6 or newer
● numpy & scipy
(fastText)
19. Model 2: fastText
Model # of epoch Precision Recall
fastText 1 0.7497 0.7497
fastText 30 0.7444 0.7444
fastText 50 0.7398 0.7398
The number of epochs is a hyperparameter that defines the number of times that the learning algorithm works through the entire training dataset.
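fastText's supervised mode reads training data from a plain-text file with one example per line, where each label carries the prefix `__label__` by default. The sketch below shows one way to convert Sentiment140 rows into that format; the function name and the label names "positive" and "negative" are choices made here for illustration, not taken from the project.

```python
def to_fasttext_line(target, text):
    """Format one tweet as a fastText supervised-training line."""
    # Assumption: target uses the dataset's coding, 0 = negative, 4 = positive.
    label = "positive" if target == 4 else "negative"
    # Newlines inside a tweet would break the one-example-per-line format.
    return f"__label__{label} {text.replace(chr(10), ' ')}"


lines = [
    to_fasttext_line(4, "I LOVE @Health4UandPets u guys r the best!!"),
    to_fasttext_line(0, "School today also. Blah!"),
]
print("\n".join(lines))
```

A file written this way could then be passed to the official Python bindings, for example fasttext.train_supervised(input=path, epoch=n), where the epoch argument is the hyperparameter varied in the table above.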
20. Model 3: N-Grams (Groups of N Consecutive Words)
2-Gram Process
Hadoop and Apache Spark are both big-data frameworks
3-Gram Process
Hadoop and Apache Spark are both big-data frameworks that
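As a small illustration of how N-grams are formed, the sketch below slides a window of N words over the example sentence. The helper name ngrams and the whitespace tokenization are assumptions made for this example.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "Hadoop and Apache Spark are both big-data frameworks".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams[:3])
# prints: ['Hadoop and', 'and Apache', 'Apache Spark']
print(trigrams[:2])
# prints: ['Hadoop and Apache', 'and Apache Spark']
```

A sentence of 8 words yields 7 bigrams and 6 trigrams, so the feature space grows quickly as N increases.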
21. Pipeline for 1, 2 and 3 gram (N-Gram, TF/IDF, LR)
Definitions
● TF = Term Frequency = (# times term t appears in doc) / (Total # terms in doc)
● IDF = Inverse Document Frequency = log_e(Total number of documents / Number of documents with term t in it)
Pipeline stages
1. Clean_Tweets: columns Tidy_Tweet and Target (0, 4)
2. Tokenizer: tokenize the words of the clean tweets. Input: Tidy_Tweet. Output: “words”
3. N-Gram: Input: “words”. Output: NGrams
4. Hashing TF + IDF: create term frequencies, then create the inverse document frequency. Input: “words”. Output: tf-grams, idf-grams
5. Logistic Regression: train the model and predict the test data. Input: target, tf-grams, idf-grams. Output: Predictions
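The TF and IDF definitions above can be worked through on a toy corpus. This sketch uses plain Python dictionaries instead of the feature hashing used in the Spark pipeline, and the three example documents are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(doc):
    """TF: (# times term t appears in doc) / (total # terms in doc)."""
    counts = Counter(doc)
    return {term: count / len(doc) for term, count in counts.items()}


def inverse_document_frequencies(docs):
    """IDF: log_e(total # documents / # documents containing term t)."""
    # set(doc) ensures each document counts a term at most once.
    doc_counts = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(len(docs) / df) for term, df in doc_counts.items()}


# Hypothetical tokenized tweets, for illustration only.
docs = [
    ["love", "this", "phone"],
    ["hate", "this", "phone"],
    ["love", "love", "today"],
]
idf = inverse_document_frequencies(docs)
tfidf = [
    {term: tf * idf[term] for term, tf in term_frequencies(doc).items()}
    for doc in docs
]
print(round(tfidf[2]["love"], 4), round(tfidf[1]["hate"], 4))
```

Terms that occur in many documents receive a small IDF, so frequent but uninformative words contribute little to the final feature vector.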
24. Discussion and Conclusions
● It was hard to beat 0.76 accuracy
● Many more combinations are possible to pursue
○ Preprocessing (BOW, TF/IDF, stemming, lexicons)
○ Algorithms (Naive-Bayes, Support Vector Machines, Logistic Regression, etc.)
● For each preprocessing and model, there are also many parameters to adjust
● Sentiment mining is still at a very early stage of development
● Text analysis doesn’t handle sarcasm (saying the opposite of what is meant).
● Even humans have trouble assessing the sentiment of Tweets
25. Future Work
● Try different combinations of pre-processing and machine learning techniques
● Test how practical these approaches are for 10x or 100x the number of rows.
● What changes are needed for much larger volumes (e.g. sampling strategy)?
● Analyze topics over longer time frames to characterize sentiment waves
● Attribute causes to changes in tweet sentiment, volume etc.
27. References
Bourguignat, C. (2015, July 19). 6 Differences Between Pandas And Spark DataFrames. Retrieved from https://medium.com/@chris_bour/6-differences-between-pandas-and-spark-dataframes-1380cec394d2
FastText. (n.d.). FastText. Retrieved from https://fasttext.cc/
Giachanou, A., & Crestani, F. (2016, June). Like it or not: A survey of Twitter sentiment analysis methods. ACM Comput. Surv. 49, 2, Article 28. Retrieved from https://www.researchgate.net/publication/304916478_Like_It_or_Not_A_Survey_of_Twitter_Sentiment_Analysis_Methods
Go, A., Bhayani, R., & Huang, L. (2009). Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(12), 2009.
Joshi, P. (2018, July 30). Comprehensive Hands on Guide to Twitter Sentiment Analysis with dataset & code. Retrieved from https://www.analyticsvidhya.com/blog/2018/07/hands-on-sentiment-analysis-dataset-python/
Kim, R. (2018, March 13). Sentiment Analysis with PySpark. Retrieved from https://towardsdatascience.com/sentiment-analysis-with-pyspark-bc8e83f80c35
Kim, R. (2018, January 13). Another Twitter sentiment analysis with Python - Part 5 (Tfidf vectorizer, model comparison... Retrieved from https://towardsdatascience.com/another-twitter-sentiment-analysis-with-python-part-5-50b4e87d9bdd