Amazon Alexa Reviews
Nikhil Shrivastava
Positive or Negative Alexa Reviews
Love my Echo!
Not working
Not good at all!
Amazing product
Focus of the Project: Alexa Reviews: Is this review positive or negative?
Dataset
Sentiment Classification for Alexa Reviews
Amazon Alexa Reviews Classification: Take a list of 3150 Amazon customer
reviews for Alexa Echo, Firestick, Echo Dot, etc. and classify each one as
positive or negative.
Source of Dataset: https://www.kaggle.com/sid321axn/amazon-alexa-reviews/metadata
Alexa Reviews Kaggle Dataset
Rating 5
• I love my Echo. It's easy to
operate, loads of fun. It is
everything as advertised. I use it
mainly to play my favorite tunes
and test Alexa's knowledge.
• Being able to add speakers is a
plus. I take it on my deck when I
am outside. Just love it. I have
my big Alexia in my bedroom
Ratings 4-1
• I didn't like that almost every
time i asked Alexa a question she
would say I don't know that, or I
haven't learned that.
• This device does not interact
with my home filled with Apple
devices. How disappointing!
Alexa Reviews Dataset Deep Dive
Dataset Snapshot:
Total number of reviews: 3150
Number of reviews per rating:
Ratings 1, 2, 3 and 4 are combined into negative sentiment; rating 5 is positive sentiment.
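The labelling rule above can be sketched in a few lines. This is a minimal illustration, not the author's code: the function name `to_sentiment` and the sample ratings are hypothetical.

```python
from collections import Counter

def to_sentiment(rating: int) -> str:
    """Map a 1-5 star rating to the binary label used in this deck."""
    if not 1 <= rating <= 5:
        raise ValueError(f"rating must be between 1 and 5, got {rating}")
    # Rating 5 is positive; ratings 1-4 are all treated as negative.
    return "positive" if rating == 5 else "negative"

# Hypothetical sample ratings, not taken from the dataset.
sample_ratings = [5, 4, 1, 5, 3, 2, 5]
labels = [to_sentiment(r) for r in sample_ratings]
label_counts = Counter(labels)
print(label_counts)
```

Note that this rule places 4-star reviews in the negative class, which is a modelling choice of the project rather than a property of the data.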
Dataset Deep Dive (Word Clouds for Positive and Negative Sentiments)
For positive sentiment (rating 5), we can
see words like love, great, good, easy, etc.
For negative sentiment (ratings 1-4), we can
see words like disappointed, return, need, etc.
Most common words in entire dataset
The word "love" occurs 545 times, so it is very common in the dataset.
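A word-frequency count like the one behind this slide can be sketched with the standard library. The helper `word_counts`, the regular expression and the sample reviews are assumptions made for illustration; the original analysis may have tokenized differently.

```python
import re
from collections import Counter

def word_counts(reviews):
    """Count lowercase word occurrences across all reviews."""
    counts = Counter()
    for text in reviews:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

# Hypothetical sample reviews, not the full dataset.
sample_reviews = ["Love my Echo!", "Not working", "I love it, love it"]
counts = word_counts(sample_reviews)
print(counts.most_common(3))
```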
Sentiment Analysis Setup
Feature Engineering and Baseline Algorithms
1. Tokenization
2. Vectorization
3. Classification using
   1. Naïve Bayes Classifier
   2. Random Forest Classifier
Tokenization
• First remove stop words to get clean reviews
• Tokenize the cleaned reviews using word_tokenize()
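The two steps above can be sketched without external dependencies. This is an assumption-laden stand-in: `simple_tokenize` is a regex substitute for `word_tokenize()`, `STOP_WORDS` is a tiny hypothetical list rather than a real stop-word corpus, and stop words are filtered per token after splitting.

```python
import re

# A tiny illustrative stop-word list; real stop-word lists are much longer.
STOP_WORDS = {"i", "my", "it", "is", "a", "the", "to", "and", "at", "all"}

def simple_tokenize(text):
    """Regex stand-in for word_tokenize(): lowercase words, punctuation dropped."""
    return re.findall(r"[a-z0-9']+", text.lower())

def clean_tokens(text):
    """Tokenize a review and drop stop words."""
    return [tok for tok in simple_tokenize(text) if tok not in STOP_WORDS]

print(clean_tokens("I love my Echo. It is easy to operate!"))
# prints ['love', 'echo', 'easy', 'operate']
```

One design point worth checking in any real pipeline: standard stop-word lists often include negations such as "not", and removing them can change the meaning of a review like "Not good at all!".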
Vectorization: Creating Bag-of-Words model
• Used both Count Vectorizer and TF-IDF Vectorizer to measure the occurrences and
frequency of tokens and build a sparse matrix of documents x tokens.
• Count Vectorizer: Counts the occurrences of each token to build the matrix.
• TF-IDF Vectorizer: TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a
statistical measure used to evaluate how important a word is to a document in
the collection.
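The difference between the two weightings can be shown on a toy corpus. The sketch below uses plain lists instead of a sparse matrix and the textbook idf = ln(N / df); library implementations typically use a smoothed variant and normalize each row, so the numbers will not match a real vectorizer exactly. The function names and documents are hypothetical.

```python
import math
from collections import Counter

def count_matrix(docs):
    """Return (vocabulary, rows) where rows[d][j] counts vocabulary[j] in doc d."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: j for j, tok in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok, n in Counter(doc).items():
            row[index[tok]] = n
        rows.append(row)
    return vocab, rows

def tfidf_matrix(rows):
    """Weight raw counts by idf = ln(N / df), df = number of docs containing the term."""
    n_docs = len(rows)
    n_terms = len(rows[0]) if rows else 0
    df = [sum(1 for row in rows if row[j] > 0) for j in range(n_terms)]
    return [[row[j] * math.log(n_docs / df[j]) for j in range(n_terms)] for row in rows]

# Hypothetical pre-tokenized reviews.
docs = [["love", "echo"], ["love", "great", "sound"], ["not", "working"]]
vocab, counts = count_matrix(docs)
weights = tfidf_matrix(counts)
print(vocab)
print(counts[0], [round(w, 3) for w in weights[0]])
```

In the first document, "love" and "echo" both have a count of 1, but "echo" gets the larger TF-IDF weight because it appears in fewer documents.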
Count and TF-IDF Vectorizer
Finally proceeded with Count Vectorizer, as it
gave better results with the ML models.
For TF-IDF to work better, I could have added
bi-gram and tri-gram features, which could
give a more accurate bag-of-words model.
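To make the bi-gram and tri-gram idea concrete, here is a minimal sketch of how n-gram features extend a token list. The helper `ngrams` is hypothetical, and this is an illustration of the idea rather than something the project actually ran.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["not", "good", "at", "all"]
features = tokens + ngrams(tokens, 2) + ngrams(tokens, 3)
print(features)
```

A bi-gram such as "not good" keeps the negation attached to the word it modifies, which single-word features cannot do.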
Multinomial Naïve Bayes Classifier
• To choose the label to assign to a document w = {w1, w2, …, wn}, the
multinomial NB classifier begins by calculating the prior probability Pr(c) of
each label c, which is determined by the frequency of each label in the
training set. The contribution from each word is then combined with Pr(c) to
arrive at a likelihood estimate for each label. It can be defined formally as:
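A common way to write this decision rule, assuming the standard multinomial Naive Bayes formulation with add-one smoothing (the symbols f_i, N_ci, N_c and V are introduced here for illustration and may differ from the notation on the original slide):

```latex
\hat{c} = \arg\max_{c} \; \Pr(c) \prod_{i=1}^{n} \Pr(w_i \mid c)^{f_i},
\qquad
\Pr(w_i \mid c) = \frac{N_{ci} + 1}{N_{c} + |V|}
```

Here f_i is the number of times word w_i occurs in the document, N_ci is the number of times w_i occurs in training documents with label c, N_c is the total word count for label c, and |V| is the vocabulary size.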
Multinomial NB Classifier: Train and Test
Started by training the model on the
training set and then checking the accuracy
on the test set. The test set was 33% of the
entire dataset.
Accuracy is 80% and the F-score, which is
the harmonic mean of precision and
recall, is 87%.
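A from-scratch sketch of training and prediction with multinomial Naive Bayes is shown below. It assumes add-one smoothing and log probabilities, uses hypothetical function names (`train_nb`, `predict_nb`) and a four-review toy training set, and omits the 67/33 train/test split for brevity. It is not the code used for the reported results.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate log priors and add-one-smoothed log likelihoods."""
    vocab = {tok for doc in docs for tok in doc}
    label_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    log_prior = {c: math.log(n / len(labels)) for c, n in label_counts.items()}
    log_like = {}
    for c in label_counts:
        total = sum(word_counts[c].values()) + len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + 1) / total) for w in vocab}
    return log_prior, log_like

def predict_nb(model, doc):
    """Return the label with the highest log posterior; unseen words are skipped."""
    log_prior, log_like = model
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in doc if w in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical pre-tokenized training reviews.
train_docs = [["love", "echo"], ["great", "sound"],
              ["not", "working"], ["disappointed", "return"]]
train_labels = ["positive", "positive", "negative", "negative"]
model = train_nb(train_docs, train_labels)
print(predict_nb(model, ["love", "great"]))
```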
Weighted Precision, Recall and Confusion Matrix
• Precision is the fraction of predicted positives that are truly
positive: TP/(TP+FP). High precision means that an algorithm returned
more relevant results than irrelevant ones.
• Recall is the fraction of actual positives that were retrieved:
TP/(TP+FN). High recall means that an algorithm returned most of the
relevant results.
• TP = 701, TN = 135, FP = 169, FN = 35
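The reported scores can be recomputed directly from these four counts. The sketch below treats the positive class as the class of interest, which is an assumption, but it reproduces the 80% accuracy and 87% F-score quoted for this model.

```python
tp, tn, fp, fn = 701, 135, 169, 35

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of all predicted positives, how many were right
recall = tp / (tp + fn)      # of all actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints accuracy=0.804 precision=0.806 recall=0.952 f1=0.873
```

The four counts sum to 1040, which matches a 33% test split of 3150 reviews.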
Why weighted?
Used weighted precision and recall: metrics are computed for each label
and averaged, weighted by support (the number of true instances for each
label). This alters 'macro' averaging to account for label imbalance.
Note that weighted averaging can result in an F-score that is not
between precision and recall.
Precision is 0.80
Recall is 0.80
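Support-weighted averaging can be worked through by hand from the confusion matrix given earlier (TP = 701, TN = 135, FP = 169, FN = 35). The variable names below are chosen for illustration.

```python
tp, tn, fp, fn = 701, 135, 169, 35

# Per-class precision and recall, treating each label in turn as the target class.
prec_pos, rec_pos = tp / (tp + fp), tp / (tp + fn)
prec_neg, rec_neg = tn / (tn + fn), tn / (tn + fp)

# Support is the number of true instances of each label.
support_pos, support_neg = tp + fn, tn + fp
total = support_pos + support_neg

weighted_precision = (support_pos * prec_pos + support_neg * prec_neg) / total
weighted_recall = (support_pos * rec_pos + support_neg * rec_neg) / total

print(f"weighted precision={weighted_precision:.2f} weighted recall={weighted_recall:.2f}")
# prints weighted precision=0.80 weighted recall=0.80
```

Weighted recall always equals overall accuracy. The average also hides a gap between classes: recall for the negative class alone is only about 0.44 (135 of 304 negative reviews).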
Random Forest Classifier
• A random forest is considered a highly accurate and robust method
because of the number of decision trees participating in the process.
• It is less prone to overfitting than a single decision tree. The main
reason is that it averages the predictions of many trees, each trained on
a random sample of rows and features (“feature bagging”), which reduces variance.
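The three ingredients named above (row sampling, feature sampling, and combining predictions) can be sketched on their own. This is not a full random forest: no trees are trained here, and the helper names and sizes are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the 'bagging' step)."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices for one tree ('feature bagging')."""
    return sorted(rng.sample(range(n_features), k))

def majority_vote(predictions):
    """Combine per-tree class predictions into one label."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(10))  # stand-in for 10 training rows
sample = bootstrap_sample(rows, rng)
features = random_feature_subset(n_features=6, k=2, rng=rng)
print(majority_vote(["positive", "negative", "positive"]))
# prints positive
```

Because each tree sees a different sample and a different feature subset, their individual errors are less correlated, and the vote is more stable than any single tree.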
Grid Search – To get the best estimator
• Used Grid Search to get the
best estimator in terms of
max_features, max_depth of
the tree, min_samples_split
and min_samples_leaf.
• Predicted on the test set using
the best-estimator Random Forest
model.
• Accuracy is slightly better at
approx. 81.05%, and the F-score is
also good at 87.5%.
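Grid search itself is just an exhaustive loop over parameter combinations. The sketch below shows that loop with the standard library; the parameter names follow scikit-learn's random forest conventions (an assumption), the grid values are hypothetical because the slides do not list them, and `toy_score` is a placeholder for the cross-validated score of a real model.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid; the values actually searched are not given in the slides.
param_grid = {
    "max_features": ["sqrt", "log2"],
    "max_depth": [10, 50, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

def toy_score(params):
    """Placeholder for the cross-validated accuracy of a model built with params."""
    return (params["max_depth"] or 100) / 100 - 0.01 * params["min_samples_leaf"]

best_params, best_score = grid_search(param_grid, toy_score)
print(best_params)
```

This grid has 2 x 3 x 2 x 2 = 24 combinations; the cost of grid search grows multiplicatively with each added parameter, so grids are usually kept small.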
Precision, Recall and Confusion Matrix
• We can see that the number of FPs has
decreased and the number of TNs has increased.
Overall, the model is still better based on
precision and recall.
• Precision and recall are slightly better,
at approximately 81%.
• Both have similar scores, so our results
are evenly balanced here.
Feature Importance
Based on the feature importances, we can see that
words like love, work, great and disappointed were the most
important words in determining the class of a review.
Conclusion
• Overall, we can predict whether a review is positive or negative with about 80% accuracy.
• Random Forest results were slightly better than Naïve Bayes.
Further Potential Enhancement
• By selecting and using only the important features shown on the previous
slide, model accuracy could be further improved.
References
• https://medium.com/greyatom/an-introduction-to-bag-of-words-in-nlp-ac967d43b428
• https://www.researchgate.net/publication/317173563_Bayesian_Multinomial_Naive_Bayes_Classifier_to_Text_Classification
