Predicting the Sentiment
Behind Millions of Tweets
Jaeduck Han
Lucinda Linde
Omesh Gadhave
May 16, 2019
ALY 6110 Final Project
Image credit ©Towards Data Science
Predicting Sentiment from Tweets
Business Goals
● To help businesses understand customers’ feelings toward their products and brands, and the response to their advertising campaigns or product launches, based on the sentiment (positive/negative) of the tweets about them.
Project Goals
● Build and assess different models that determine whether a tweet is positive or negative, by training each model on a large split of the data (the training set) and measuring how accurately it predicts sentiment on the held-out test set, using Python and its libraries.
● Gain experience using Hadoop, Spark and PySpark.
Sentiment140 Dataset
Dataset
● 1.6M tweets from 2009: 800k positive and 800k negative.
● Tweets were labeled automatically by treating emoticons as noisy labels (“distant supervision”).
● Tweet labels: 0 = negative, 4 = positive.
Attributes - 6 fields
1. target: the polarity of the tweet (0 = negative, 4 = positive)
2. ids: the id of the tweet (2087)
3. date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
4. flag: the query (lyx). If there is no query, then this value is NO_QUERY.
5. user: the user that tweeted (robotickilldozr)
6. text: the text of the tweet (Lyx is cool)
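The field list above can be read into (label, text) pairs with a short sketch like the one below. It assumes the file is a quoted CSV with no header row and the six fields in the listed order; the helper name `read_tweets` and the 0/1 relabeling are illustrative choices, not part of the original project.

```python
import csv
import io

# Field names in the order listed on the slide (assumed to match the file).
FIELDS = ["target", "ids", "date", "flag", "user", "text"]

def read_tweets(handle):
    """Yield (label, text) pairs, mapping target 4 -> 1 and 0 -> 0."""
    reader = csv.DictReader(handle, fieldnames=FIELDS)
    for row in reader:
        yield (1 if row["target"] == "4" else 0, row["text"])

# Two rows in the dataset's layout, standing in for the real file.
sample = io.StringIO('''"0","1467810672","Mon Apr 06 22:19:49 PDT 2009","NO_QUERY","scotthamilton","is upset that he can't update his Facebook by texting it..."
"4","1467822272","Mon Apr 06 22:22:45 PDT 2009","NO_QUERY","ersle","I LOVE @Health4UandPets u guys r the best!!"
''')
tweets = list(read_tweets(sample))
```

For the real file, the same function could be given an open file handle instead of the in-memory sample.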
Data Examples
target | id | date | flag | user | text
0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
4 | 1467822272 | Mon Apr 06 22:22:45 PDT 2009 | NO_QUERY | ersle | I LOVE @Health4UandPets u guys r the best!!
Analysis Plan
Cleanse and Preprocess data
Explore the Data
● Word Clouds
● Bar graphs of Hashtags
Assess pros and cons of different models
Choose models
Split Data into Train and Test Sets
For each Model
● Choose parameters
● Make predictions
● Assess performance of model
Discussion and Comparison
Conclusion
Data Cleansing, Pre-processing and Modeling
Using Python
Clean, Pre-process and Explore
● Clean Tweets
● Tokenize, etc.
● EDA- Word Clouds
● EDA- Hashtag Graphs
Models
● TF-IDF + Logistic Regression
● Random Forest
● fastText
Evaluate Performance
Using PySpark
Load Clean Tweets into PySpark
● Remove nulls (a few thousand)
Divide data set into train and test sets
Build Models
● TF-IDF + Logistic Regression
● 2-Gram
● 3-Gram
Compare Predictions to Labels
Evaluate Performance
Data Cleansing and Pre-processing
Taming the Tweets
● Used NLTK's PorterStemmer for stemming
● Defined find-and-replace patterns for special-character and stop-word removal
Tweets → Remove special chars (not #) → Remove words < 3 chars → Tokenize words → Stem the words → Stitch back together = Clean Tweets
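The cleaning steps above can be sketched in a few lines. This is a minimal illustration, not the project's code: the function name `clean_tweet` and the regular expression are assumptions, stop-word removal is left out, and the stemmer is passed in as a callable (for example the `stem` method of NLTK's `PorterStemmer`) so the sketch runs without extra packages.

```python
import re

# Anything that is not a letter, '#', or whitespace is replaced by a space.
NON_ALPHA = re.compile(r"[^a-zA-Z#\s]")

def clean_tweet(text, stem=lambda word: word, min_len=3):
    """Strip special characters, drop short words, stem, and re-join."""
    text = NON_ALPHA.sub(" ", text)
    tokens = text.lower().split()                       # tokenize
    tokens = [t for t in tokens if len(t) >= min_len]   # remove words < 3 chars
    return " ".join(stem(t) for t in tokens)            # stem and stitch back

cleaned = clean_tweet("I LOVE @Health4UandPets u guys r the best!!")
```

With NLTK installed, `clean_tweet(text, stem=PorterStemmer().stem)` would apply Porter stemming to each remaining token.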
Data Exploration: WordClouds of Positive Tweets
Positive Tweets
● Many positive words such as love, nice, well.
● Many neutral words like time, think, today, work.
● Positive and Negative Tweets are different, but not in an obvious way
WordClouds of Negative Tweets
Negative Tweets
● Some negative words such as hate, sick, damn.
● Many neutral words like time, think, today, work.
● Difficult to say what actions should be taken
Hashtags- Positive - Many items reflect the time frame (4/09-6/09)
● #FollowFriday or #ff started in 2009
● McFly British rock band, #1 album May 2009
Hashtags- Negative - What do we learn here?
● Iran Election took place in June 2009, then the arrests
● iPhone 3GS announced June 8, 2009 and released June 19, 2009 (not great...)
Model selection-Why we chose what we chose
We read several articles on sentiment analysis of tweets and on the Sentiment140 data set. Among the successful models were:
● TF-IDF + N-Gram + Logistic Regression
● Random Forest
● fastText
Pros and Cons of Chosen Approaches
Approach | Pros | Cons
Logistic Regression | Well-understood binary classification method | Prone to over-fitting
Random Forest | Decorrelates trees, reduced variance | Not as easy to visually interpret
fastText | Fast, Conformed performance, Memory | C++ language, Fixed data format, File I/O
N-gram | Captures “not good” or “not the best” instead of “good” or “best” | Not “linguistically based”, Lacks longer N dependencies
Assessing Model Performance
AUC: Area Under the Receiver Operating Characteristic (ROC) Curve
How much better does the model predict than guessing from each class's prevalence in the population?
Confusion Matrix, Recall and Precision
Recall: What % of actual cancers are detected?
Precision: What % of detected cancers are actual?
Performance Metrics to Evaluate Models
● Accuracy = (True Positives + True Negatives) / Total Examples
● Precision = True Positives / (True Positives + False Positives)
● Recall = True Positives / (True Positives + False Negatives)
● AUC = Area Under the Receiver Operating Characteristic (ROC) Curve
● F1 score = 2 * (Precision * Recall) / (Precision + Recall)
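As a worked illustration of these formulas, the sketch below computes each metric from confusion-matrix counts, and AUC as the fraction of positive/negative pairs that the scores rank correctly (ties count as half). The function names and the sample counts are made up for the example, and zero denominators are not handled.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the slide's metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical counts: 40 TP, 10 FP, 20 FN, 30 TN.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
area = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise AUC is quadratic in the number of examples, so it suits small checks like this one rather than the full 1.6M-tweet data set.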
Model 1: Random Forest
Pre-processing | # of words | Accuracy | Precision | Recall | AUC | F1-score
BoW | 6000 | 0.6856 | 0.6856 | 0.6856 | 0.6856 | 0.6856
Run time and results by configuration:
● Bag of Words, 300 words, 1 tree: 14 s. Results: 0.61~0.62
● Bag of Words, 300 words, 100 trees: 1114 s. Results: 0.61~0.63
● Bag of Words, 6000 words, 1 tree: 266 s
● Bag of Words, 6000 words, 100 trees: more than 12 hours
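To make the bag-of-words input concrete, here is a small sketch that caps the vocabulary at the most frequent words and turns each tweet into a count vector. It assumes that "300 words" and "6000 words" above refer to the vocabulary size; the helper names and the toy documents are hypothetical.

```python
from collections import Counter

def build_vocabulary(docs, max_words):
    """Return the max_words most frequent words across all documents."""
    counts = Counter(word for doc in docs for word in doc.split())
    return [word for word, _ in counts.most_common(max_words)]

def bag_of_words(doc, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

docs = ["love love this", "hate this", "love that"]
vocab = build_vocabulary(docs, max_words=2)
vectors = [bag_of_words(doc, vocab) for doc in docs]
```

Each vector has one column per vocabulary word, so a larger vocabulary widens the feature matrix the forest has to search, which is consistent with the longer run times reported above.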
What is fastText?
fastText is a library for efficient learning of word representations and sentence classification.
Requirements
fastText builds on modern Mac OS and Linux distributions. Since it uses C++11 features, it requires a compiler with good C++11 support. These include:
● (gcc-4.6.3 or newer) or (clang-3.3 or newer)
Compilation is carried out using a Makefile, so you will need to have a working make. For the word-similarity evaluation script you will need:
● python 2.6 or newer
● numpy & scipy
(fastText)
Model 2: fastText
Model | # of epochs | Precision | Recall
fastText | 1 | 0.7497 | 0.7497
fastText | 30 | 0.7444 | 0.7444
fastText | 50 | 0.7398 | 0.7398
The number of epochs is a hyperparameter that defines the number of times the learning algorithm works through the entire training dataset.
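fastText's supervised mode reads one example per line, with the label marked by the `__label__` prefix. The sketch below shows how Sentiment140 rows could be converted to that layout; the label names `positive` and `negative` and the helper name are illustrative, and the training call appears only as a comment because it assumes the official `fasttext` Python bindings, which the slides do not confirm were used.

```python
def to_fasttext_line(target, text):
    """Format one tweet as '__label__<name> <text>' on a single line."""
    label = "positive" if target == 4 else "negative"
    # fastText treats each line as one example, so newlines are flattened.
    return f"__label__{label} {text.replace(chr(10), ' ')}"

lines = [
    to_fasttext_line(4, "love guys best"),
    to_fasttext_line(0, "upset cry blah"),
]

# With the bindings installed, training might then look like:
#   model = fasttext.train_supervised(input="train.txt", epoch=30)
# where train.txt holds the lines above (file name is hypothetical).
```

When every tweet has exactly one true label and the model returns one prediction, precision at 1 and recall at 1 coincide, which may be why the two columns in the table above are identical.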
Model 3: N-Grams (Groups of N Consecutive Words)
2-Gram Process
Hadoop and Apache Spark are both big-data frameworks
3-Gram Process
Hadoop and Apache Spark are both big-data frameworks that
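The sliding-window idea behind the 2-gram and 3-gram processes can be sketched as follows, using the slide's example sentence. The function name `ngrams` is an illustrative choice.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Hadoop and Apache Spark are both big-data frameworks".split()
bigrams = ngrams(tokens, 2)    # starts with "Hadoop and", "and Apache"
trigrams = ngrams(tokens, 3)   # starts with "Hadoop and Apache"
```

A sentence of k tokens yields k - n + 1 n-grams, so longer n-grams are both fewer per tweet and rarer across the corpus.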
Pipeline for 1-, 2- and 3-Gram Models (N-Gram, TF/IDF, LR)
Clean_Tweets (columns: Tidy_Tweet, Target (0,4))
→ Tokenizer. Input: Tidy_Tweet. Output: “words”.
→ N-Gram. Input: “words”. Output: NGrams.
→ Hashing TF + IDF. Input: “words”. Output: tf-grams, idf-grams.
→ Logistic Regression. Input: target, tf-grams, idf-grams. Output: Predictions.
Steps: tokenize the words of clean tweets, create term frequencies, create inverse document frequency, train the model and predict the test data.
TF (Term Frequency) = (# times term t appears in doc) / (total # terms in doc)
IDF (Inverse Document Frequency) = log_e(total number of documents / number of documents with term t in it)
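The TF and IDF definitions above translate directly into code. The sketch below follows the formulas exactly as written on the slide; library implementations may differ in details such as smoothing, and the function names and toy documents here are made up for the example.

```python
import math

def term_frequency(term, doc):
    """(# times term appears in doc) / (total # terms in doc)."""
    return doc.count(term) / len(doc)

def inverse_document_frequency(term, docs):
    """log_e(total # documents / # documents containing the term)."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)

# Each document is a list of tokens (or n-grams).
docs = [["good", "day"], ["bad", "day"]]
score_good = tf_idf("good", docs[0], docs)  # term found in one document
score_day = tf_idf("day", docs[0], docs)    # term found in every document
```

A term that occurs in every document gets an IDF of zero, so it contributes nothing to the features, while a term found in only some documents is weighted up.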
2-Gram Approach Yields Best Results
Pre-processing | Model | Accuracy | Precision | Recall | AUC | F1-score
1-Gram | Logistic Regression | 0.7566 | 0.7495 | 0.7755 | 0.8274 | 0.7623
2-Gram | Logistic Regression | 0.7702 | 0.7581 | 0.7980 | 0.8457 | 0.7768
3-Gram | Logistic Regression | 0.7696 | 0.7578 | 0.7867 | 0.8458 | 0.7768
Packages used: pyspark.ml.feature (NGram and VectorAssembler)
Table Comparing Model Results
Pre-processing | Model | Accuracy | Precision | Recall | AUC | F1-score
BoW | Random forest | 0.6856 | 0.6856 | 0.6856 | 0.6856 | 0.6856
fastText | fastText | - | 0.7497 | 0.7497 | - | -
1-Gram | Logistic Regression | 0.7566 | 0.7495 | 0.7755 | 0.8274 | 0.7623
2-Gram | Logistic Regression | 0.7702 | 0.7581 | 0.7980 | 0.8457 | 0.7768
3-Gram | Logistic Regression | 0.7696 | 0.7578 | 0.7867 | 0.8458 | 0.7768
Discussion and Conclusions
● It was hard to beat 0.76 accuracy
● Many more combinations are possible to pursue
○ Preprocessing (BOW, TF/IDF, stemming, lexicons)
○ Algorithms (Naive Bayes, Support Vector Machines, Logistic Regression, etc.)
● For each preprocessing method and model, there are also many parameters to adjust
● Sentiment mining is still at a very early stage of development
● Text analysis doesn’t handle sarcasm (saying the opposite of what is meant).
● Even humans have trouble assessing the sentiment of Tweets
Future Work
● Try different combinations of pre-processing and machine learning techniques
● Test how practical these approaches are for 10x or 100x the number of rows.
● Determine what changes are needed for much larger volumes (e.g. sampling strategy)
● Analyze topics over longer time frames to characterize sentiment waves
● Attribute causes to changes in tweet sentiment, volume etc.
Image Credit: https://englishalcalans.wordpress.com/the-questions-suggestions-corner/
References
Bourguignat, C. (2015, July 19). 6 Differences Between Pandas And Spark DataFrames. Retrieved from https://medium.com/@chris_bour/6-differences-between-pandas-and-spark-dataframes-1380cec394d2
FastText. (n.d.). FastText. Retrieved from https://fasttext.cc/
Giachanou, A., & Crestani, F. (2016, June). Like it or not: A survey of Twitter sentiment analysis methods. ACM Comput. Surv. 49, 2, Article 28. Retrieved from https://www.researchgate.net/publication/304916478_Like_It_or_Not_A_Survey_of_Twitter_Sentiment_Analysis_Methods
Go, A., Bhayani, R., & Huang, L. (2009). Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(12).
Joshi, P. (2018, July 30). Comprehensive Hands on Guide to Twitter Sentiment Analysis with dataset & code. Retrieved from https://www.analyticsvidhya.com/blog/2018/07/hands-on-sentiment-analysis-dataset-python/
Kim, R. (2018, March 13). Sentiment Analysis with PySpark. Retrieved from https://towardsdatascience.com/sentiment-analysis-with-pyspark-bc8e83f80c35
Kim, R. (2018, January 13). Another Twitter sentiment analysis with Python - Part 5 (Tfidf vectorizer, model comparison... Retrieved from https://towardsdatascience.com/another-twitter-sentiment-analysis-with-python-part-5-50b4e87d9bdd
Data source
● http://www.sentiment140.com/
END OF PRESENTATION

Predicting Tweet Sentiment

  • 1. Predicting the Sentiment Behind Millions of Tweets Jaeduck Han Lucinda Linde Omesh Gadhave May 16, 2019 ALY 6110 Final Project Image credit ©Towards Data Science
  • 2. Predicting Sentiment from Tweets Business Goals ● To help businesses understand customers’ feelings towards products/brands, response towards their advertising campaigns or product launches based on the sentiment (positive/negative) behind the tweets concerning them. Project Goals ● Build and assess different models to accurately determine whether a Tweet is positive or negative by training the models on a large split of data (training data) and testing their accuracy of predicting sentiment for the test data using python & its libraries. ● Gain experience using Hadoop, Spark and Pyspark.
  • 3. Sentiment140 Dataset DataSet ● 1.6M tweets from 2009, 800k positive and 800k negative. ● Tweets were labeled by interpreting emoticons as “distant supervised” data. ● Tweet labels: 0 = negative, 4 = positive. Attributes - 6 fields 1. target: the polarity of the tweet (0 = negative, 4 = positive) 2. ids: The id of the tweet ( 2087) 3. date: the date of the tweet (Sat May 16 23:58:44 UTC 2009) 4. flag: The query (lyx). If there is no query, then this value is NO_QUERY. 5. user: the user that tweeted (robotickilldozr) 6. text: the text of the tweet (Lyx is cool)
  • 4. Data Examples target id date flag user text 0 14678 10672 Mon Apr 06 22:19:49 PDT 2009 NO_QUERY scotthamilton is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah! 4 14678 22272 Mon Apr 06 22:22:45 PDT 2009 NO_QUERY ersle I LOVE @Health4UandPets u guys r the best!!
  • 5. Analysis Plan Cleanse and Preprocess data Explore the Data ● Word Clouds ● Bar graphs of Hashtags Assess different pros and cons of models Choose models Split Data into Train and Test Sets For each Model ● Choose parameters ● Make predictions ● Assess performance of model Discussion and Comparison Conclusion
  • 6. Data Cleansing, Pre-processing and Modeling Using Python Clean, Pre-process and Explore ● Clean Tweets ● Tokenize, etc. ● EDA- Word Clouds ● EDA- Hashtag Graphs Models ● TF-IDF + Logistic Regression ● Random Forest ● fastText Evaluate Performance Using Pyspark Load Clean Tweets into Pyspark ● Remove Nulls (few thousand) Divide data set into train and test sets ● Build Models ● TF-IDF + Logistic Regression ● 2-Gram ● 3-Gram Compare Predictions to Labels Evaluate Performance
  • 7. Data Cleansing and Pre-processing Taming the Tweets ● Used NLTK for stemming- PorterStemmer ● Defined pattern find and replace for special character and stop word removal Tweets Remove special chars (not #) Remove words < 3 chars. Stem the words Stitch back together = Clean Tweets Tokenize words
  • 8. Data Exploration: WordClouds of Positive Tweets Positive Tweets ● Many positive words such as love, nice, well. ● Many neutral words like time, think, today, work. ● Positive and Negative Tweets are different, but not in an obvious way
  • 9. WordClouds of Negative Tweets Negative Tweets ● Some negative words such as hate, sick, damn, ● Many neutral words like time, think, today, work. ● Difficult to say what actions should be taken
  • 10. Hashtags- Positive - Many items reflect the time frame (4/09-6/09) ● #FollowFriday or #ff started in 2009 ● McFly British rock band, #1 album May 2009
  • 11. Hashtags- Negative - What do we learn here? ● Iran Election took place in June 2009, then the arrests ● iPhone 3GS released June 9, 2009 (not great...)
  • 12. Model selection-Why we chose what we chose We read several articles on sentiment analysis of tweets and of the sentiment140 data set. Among the successful models were: ● TF-IDF + N-Gram+ Logistic Regression ● Random Forest ● fastText
  • 13. Pros and Cons of Chosen Approaches
      ● Logistic Regression. Pros: well-understood binary classification method. Cons: prone to over-fitting.
      ● Random Forest. Pros: decorrelates trees, which reduces variance. Cons: not as easy to interpret visually.
      ● fastText. Pros: fast, confirmed performance, memory. Cons: C++ language, fixed data format, file I/O.
      ● N-gram. Pros: captures “not good” or “not the best” instead of “good” or “best”. Cons: not “linguistically based”; lacks dependencies longer than N words.
  • 14. Assessing Model Performance ● AUC: area under the Receiver Operating Characteristic (ROC) curve. How much better than chance does the model separate positives from negatives? ● Confusion matrix, recall and precision, illustrated with a cancer-screening example. Recall: what % of actual cancers are detected? Precision: what % of detected cancers are actual cancers?
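One way to make AUC concrete: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting one half. The sketch below computes it by brute force over all positive/negative pairs. The function name `roc_auc` and the toy scores are illustrative assumptions, and this quadratic loop is meant for small examples only.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 1 (positive) or 0 (negative)
    scores: model scores, higher meaning "more likely positive"
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

# Toy data: three of the four positive/negative pairs are ordered correctly.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.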
  • 15. Performance Metrics to Evaluate Models ● Accuracy = (True Positives + True Negatives) / Total Examples ● Precision = True Positives / (True Positives + False Positives) ● Recall = True Positives / (True Positives + False Negatives) ● AUC = Area Under the Receiver Operating Characteristic (ROC) Curve ● F1 score = 2 * (Precision * Recall) / (Precision + Recall)
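The formulas above translate directly into code. The following sketch computes accuracy, precision, recall and F1 from the four confusion-matrix counts; the function name `classification_metrics` and the example counts are made up for illustration and are not results from this project.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the slide's metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (e.g. a model that never predicts positive).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts: 80 true positives, 20 false positives,
# 70 true negatives, 30 false negatives.
metrics = classification_metrics(tp=80, fp=20, tn=70, fn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because F1 is the harmonic mean of precision and recall, it always lies between the two and sits closer to the smaller one.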
  • 16. Model 1: Random Forest
      Pre-processing  # of words  Accuracy  Precision  Recall  AUC     F1-score
      BoW             6000        0.6856    0.6856     0.6856  0.6856  0.6856
      Run times by configuration (BoW = Bag of Words):
      ● 300 words, 1 tree: 14 s (results: 0.61~0.62)
      ● 300 words, 100 trees: 1114 s (results: 0.61~0.63)
      ● 6000 words, 1 tree: 266 s
      ● 6000 words, 100 trees: more than 12 hours
  • 17. What is fastText? fastText is a library for efficient learning of word representations and sentence classification. Requirements: fastText builds on modern Mac OS and Linux distributions. Since it uses C++11 features, it requires a compiler with good C++11 support, such as gcc-4.6.3 or newer, or clang-3.3 or newer. Compilation is carried out using a Makefile, so you will need a working make. For the word-similarity evaluation script you will need: ● python 2.6 or newer ● numpy & scipy (fastText)
  • 19. Model 2: fastText
      Model     # of epochs  Precision  Recall
      fastText  1            0.7497     0.7497
      fastText  30           0.7444     0.7444
      fastText  50           0.7398     0.7398
      The number of epochs is a hyperparameter that defines the number of times the learning algorithm works through the entire training dataset.
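fastText's supervised mode reads a plain text file with one example per line, each line starting with a label that carries the `__label__` prefix (the library's default). The sketch below shows how Sentiment140 rows might be converted to that format. The function name `to_fasttext_line` and the label names `negative` and `positive` are assumptions for illustration, not the authors' actual choices.

```python
def to_fasttext_line(target, text):
    """Format one (target, tweet) pair as a fastText training line."""
    # Sentiment140 targets: 0 = negative, 4 = positive.
    labels = {0: "negative", 4: "positive"}
    # Collapse all whitespace so each example stays on a single line.
    one_line = " ".join(text.split())
    return f"__label__{labels[target]} {one_line}"

rows = [
    (0, "is upset that he can't update his Facebook"),
    (4, "u guys r the best!!"),
]
lines = [to_fasttext_line(target, text) for target, text in rows]
print("\n".join(lines))
```

Writing `lines` to a file gives an input that fastText's supervised trainer can read, where the epoch count discussed above is one of the training options.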
  • 20. Model 3: N-Grams (groups of N consecutive words) ● 2-gram process, example text: “Hadoop and Apache Spark are both big-data frameworks” ● 3-gram process, example text: “Hadoop and Apache Spark are both big-data frameworks that”
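To illustrate what "groups of N consecutive words" means, the sketch below slides a window of size N over a tokenized sentence and joins each window into a single string. The function name `ngrams` is an assumption of this example; in the PySpark pipeline the equivalent step is handled by the `NGram` transformer.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Hadoop and Apache Spark are both big-data frameworks".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams[:3])   # ['Hadoop and', 'and Apache', 'Apache Spark']
print(trigrams[:2])  # ['Hadoop and Apache', 'and Apache Spark']
```

A sentence of 8 tokens yields 7 bigrams and 6 trigrams; a sentence shorter than N yields none.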
  • 21. Pipeline for 1-, 2- and 3-gram models (N-Gram, TF/IDF, LR)
      Stages, in order:
      ● Clean_Tweets: columns Tidy_Tweet and Target (0, 4)
      ● Tokenizer: input Tidy_Tweet, output “words”. Tokenizes the words of the clean tweets.
      ● N-Gram: input “words”, output NGrams.
      ● Hashing TF + IDF: input “words”, output tf-grams and idf-grams. Creates the term frequencies and inverse document frequencies.
      ● Logistic Regression: input target, tf-grams and idf-grams; output predictions. Trains the model and predicts the test data.
      Definitions:
      ● TF (Term Frequency) = (# times term t appears in doc) / (total # terms in doc)
      ● IDF (Inverse Document Frequency) = log_e(total number of documents / number of documents with term t in it)
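The TF and IDF definitions on this slide can be worked through on a tiny corpus. The sketch below implements them exactly as written, in plain Python; the function name `tf_idf` and the three toy documents are assumptions for illustration. Spark's own `IDF` stage uses a smoothed formula, so its values would differ from these.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights per document using the slide's definitions.

    docs: list of token lists. Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total_terms = len(doc)
        weights.append({
            # TF = count / total terms in doc; IDF = ln(N / document frequency).
            term: (count / total_terms) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [["not", "good"], ["good", "day"], ["bad", "day"]]
for doc, w in zip(docs, tf_idf(docs)):
    print(doc, {term: round(value, 4) for term, value in w.items()})
```

In the first document, "not" appears in only one of three documents and so gets a higher weight than "good", which appears in two. A term present in every document would get weight 0.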
  • 22. 2-Gram Approach Yields Best Results
      Pre-processing  Model                Accuracy  Precision  Recall  AUC     F1-score
      1-Gram          Logistic Regression  0.7566    0.7495     0.7755  0.8274  0.7623
      2-Gram          Logistic Regression  0.7702    0.7581     0.7980  0.8457  0.7768
      3-Gram          Logistic Regression  0.7696    0.7578     0.7867  0.8458  0.7768
      Packages used: pyspark.ml.feature (NGram and VectorAssembler)
  • 23. Table Comparing Model Results
      Pre-processing  Model                Accuracy  Precision  Recall  AUC     F1-score
      BoW             Random Forest        0.6856    0.6856     0.6856  0.6856  0.6856
      fastText        fastText             -         0.7497     0.7497  -       -
      1-Gram          Logistic Regression  0.7566    0.7495     0.7755  0.8274  0.7623
      2-Gram          Logistic Regression  0.7702    0.7581     0.7980  0.8457  0.7768
      3-Gram          Logistic Regression  0.7696    0.7578     0.7867  0.8458  0.7768
  • 24. Discussion and Conclusions ● It was hard to beat 0.76 accuracy ● Many more combinations are possible to pursue ○ Preprocessing (BoW, TF/IDF, stemming, lexicons) ○ Algorithms (Naive Bayes, Support Vector Machines, Logistic Regression, etc.) ● For each preprocessing method and model, there are also many parameters to adjust ● Sentiment mining is still at a very early stage of development ● Text analysis doesn’t handle sarcasm (saying the opposite of what is meant) ● Even humans have trouble assessing the sentiment of Tweets
  • 25. Future Work ● Try different combinations of pre-processing and machine learning techniques ● Test how practical these approaches are for 10x or 100x the number of rows ● Determine what changes are needed for much larger volumes (e.g. sampling strategy) ● Analyze topics over longer time frames to characterize sentiment waves ● Attribute causes to changes in tweet sentiment, volume, etc.
  • 27. References
      Bourguignat, C. (2015, July 19). 6 Differences Between Pandas And Spark DataFrames. Retrieved from https://medium.com/@chris_bour/6-differences-between-pandas-and-spark-dataframes-1380cec394d2
      FastText. (n.d.). FastText. Retrieved from https://fasttext.cc/
      Giachanou, A., & Crestani, F. (2016, June). Like it or not: A survey of Twitter sentiment analysis methods. ACM Computing Surveys, 49(2), Article 28. Retrieved from https://www.researchgate.net/publication/304916478_Like_It_or_Not_A_Survey_of_Twitter_Sentiment_Analysis_Methods
      Go, A., Bhayani, R., & Huang, L. (2009). Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(12).
      Joshi, P. (2018, July 30). Comprehensive Hands on Guide to Twitter Sentiment Analysis with dataset & code. Retrieved from https://www.analyticsvidhya.com/blog/2018/07/hands-on-sentiment-analysis-dataset-python/
      Kim, R. (2018, March 13). Sentiment Analysis with PySpark. Retrieved from https://towardsdatascience.com/sentiment-analysis-with-pyspark-bc8e83f80c35
      Kim, R. (2018, January 13). Another Twitter sentiment analysis with Python - Part 5 (Tfidf vectorizer, model comparison... Retrieved from https://towardsdatascience.com/another-twitter-sentiment-analysis-with-python-part-5-50b4e87d9bdd