This document discusses sentiment analysis of Amazon Alexa reviews using machine learning classifiers. It analyzes a dataset of over 3,000 Alexa product reviews rated 1-5, classifying ratings 1-4 as negative and 5 as positive. Two classifiers are tested: Multinomial Naive Bayes achieves 80% accuracy and an 87% F1 score, while Random Forest scores slightly higher at 81% accuracy and an 87.5% F1 score. Terms such as "love" and "disappointed" are important indicators. Overall, the analysis shows that these techniques can predict review sentiment with about 80% accuracy.
2. Positive or Negative Alexa Reviews
Love my Echo!
Not working
Not good at all!
Amazing product
Focus of the Project: Alexa Reviews: Is this review positive or negative?
4. Sentiment Classification for Alexa Reviews
Amazon Alexa Reviews Classification: Take a list of 3150 Amazon customer
reviews for Alexa Echo, Firestick, Echo Dot, etc. and classify each one as
positive or negative.
Source of Dataset: https://www.kaggle.com/sid321axn/amazon-alexa-reviews/metadata
5. Alexa Reviews Kaggle Dataset
Rating 5
• I love my Echo. It's easy to
operate, loads of fun. It is
everything as advertised. I use it
mainly to play my favorite tunes
and test Alexa's knowledge.
• Being able to add speakers is a
plus. I take it on my deck when I
am outside. Just love it. I have
my big Alexia in my bedroom
Ratings 4-1
• I didn't like that almost every
time i asked Alexa a question she
would say I don't know that, or I
haven't learned that.
• This device does not interact
with my home filled with Apple
devices. How disappointing!
6. Alexa Reviews Dataset Deep Dive
Dataset Snapshot:
Total length of the Data : 3150
Length of different ratings:
Ratings 1, 2, 3 and 4 are combined into the negative sentiment class; Rating 5 is the positive sentiment class.
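The labelling rule above can be sketched as a small function. The function name and the sample ratings are illustrative assumptions, not taken from the original notebook.

```python
def rating_to_sentiment(rating: int) -> str:
    """Map a 1-5 star rating to the binary label used in this project."""
    if not 1 <= rating <= 5:
        raise ValueError(f"rating must be between 1 and 5, got {rating}")
    # Only a 5-star rating counts as positive; 1-4 are all negative.
    return "positive" if rating == 5 else "negative"

# Hypothetical sample of ratings, for illustration only.
ratings = [5, 4, 1, 5, 3]
labels = [rating_to_sentiment(r) for r in ratings]
print(labels)
```

Note that this rule treats 4-star reviews as negative, which makes the negative class broader than a typical "unhappy customer" definition.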
7. Dataset Deep Dive (Word Cloud for Positive and Negative Sentiments)
For positive sentiments (rating 5), we can
see words like love, great, good, easy, etc.
For negative sentiments (ratings 1-4), we can
see words like disappointed, return, need, etc.
8. Most common words in entire dataset
The word "love" occurs 545 times, which makes it one of the most common words in the dataset.
10. Feature Engineering and Baseline Algorithms
1. Tokenization
2. Vectorize
3. Classification using
1. Naïve Bayes Classifier
2. Random Forest Classifier
11. Tokenization
• First remove stop words to get clean reviews
• Then tokenize the cleaned reviews using word_tokenize()
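The two steps above can be sketched with the standard library alone. This is a stand-in, not the project's code: a regular expression replaces NLTK's word_tokenize(), and the stop-word set is a tiny hand-picked assumption (NLTK's English list is much longer).

```python
import re
from typing import List

# Tiny illustrative stop-word list (an assumption, not NLTK's list).
STOP_WORDS = {"i", "my", "it", "is", "the", "a", "to", "and", "of", "as"}

def tokenize(review: str) -> List[str]:
    """Lowercase the review, split it into word tokens, drop stop words."""
    tokens = re.findall(r"[a-z']+", review.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(tokenize("I love my Echo. It is everything as advertised."))
# ['love', 'echo', 'everything', 'advertised']
```

This sketch splits first and filters afterwards, because stop-word removal needs word boundaries either way.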
12. Vectorization: Creating Bag-of-Words model
• Used both Count Vectorizer and TF-IDF Vectorizer to measure the occurrences and
frequency of tokens and build a sparse matrix of documents x tokens
• Count Vectorizer: Counts the occurrences of tokens to build the matrix.
• TF-IDF Vectorizer: Stands for Term Frequency Inverse Document Frequency. It is a
statistical measure used to evaluate how important a word is to a document in
the collection.
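To make the two weightings concrete, here is a minimal pure-Python sketch on three made-up token lists. It uses the textbook definition tf x log(N / df). scikit-learn's TfidfVectorizer applies a smoothed IDF and L2 normalisation by default, so its numbers would differ from these.

```python
import math
from collections import Counter

# Toy tokenized documents (assumed for illustration).
docs = [
    ["love", "echo", "love"],
    ["echo", "not", "working"],
    ["love", "amazing", "product"],
]

# Columns of the documents x tokens matrix, in a fixed order.
vocab = sorted({tok for doc in docs for tok in doc})

def count_vector(doc):
    """Raw occurrence count of every vocabulary term in one document."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]

def idf(term):
    """Inverse document frequency: rarer terms get a larger weight."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tfidf_vector(doc):
    """Term frequency (share of the document) scaled by IDF."""
    counts = Counter(doc)
    return [(counts[term] / len(doc)) * idf(term) for term in vocab]

count_matrix = [count_vector(d) for d in docs]
tfidf_matrix = [tfidf_vector(d) for d in docs]

print(vocab)
print(count_matrix[0])
print([round(x, 3) for x in tfidf_matrix[0]])
```

A real pipeline would store these rows in a sparse matrix, since most entries are zero.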
13. Count and TF-IDF Vectorizer
Finally proceeded with Count Vectorizer, as it
gave better results with the ML models.
For TF-IDF to work better, I could have used
bi-grams and tri-grams, which might
give a more accurate bag-of-words model.
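As a rough illustration of what bi-grams and tri-grams add, the sketch below builds n-grams from a token list. The helper name is an assumption, and the tokens come from the sample review "Not good at all!" on an earlier slide.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["not", "good", "at", "all"]
print(ngrams(tokens, 2))  # ['not good', 'good at', 'at all']
print(ngrams(tokens, 3))  # ['not good at', 'good at all']
```

With unigrams alone, "good" looks like a positive signal; the bi-gram "not good" keeps the negation attached to it.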
14. Multinomial Naïve Bayes Classifier
• To choose the label to assign to a document w =
{w1, w2, …, wn}, the multinomial NB classifier begins by calculating the prior probability
Pr(c) of each label c, which is determined by the frequency of each label
in the training set. The contribution from each word is then combined with Pr(c)
to arrive at a likelihood estimate for each label. It can be defined formally as:
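The formula from the original slide did not survive extraction. A standard form of the multinomial NB decision rule is given here as an assumption rather than as the author's exact notation. The smoothed word estimate with add-one (Laplace) smoothing over a vocabulary V is likewise an assumption.

```latex
\hat{c} = \arg\max_{c} \; \Pr(c) \prod_{i=1}^{n} \Pr(w_i \mid c),
\qquad
\Pr(w_i \mid c) = \frac{\operatorname{count}(w_i, c) + 1}{\sum_{w \in V} \operatorname{count}(w, c) + |V|}
```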
15. Multinomial NB Classifier: Train and Test
Started by training on the training set and
then checking the accuracy on the test
set. The test set was 33% of the
entire dataset.
Accuracy is 80%, and the F-score, which is
the harmonic mean of Precision and
Recall, is 87%.
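The slide reports only the results. For readers who want to see the mechanics, here is a minimal pure-Python sketch of multinomial NB training and prediction. The toy documents, the function names, and the add-one smoothing are all assumptions; this is not the Alexa dataset or the original implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Return log-priors, per-class token counts, and the vocabulary."""
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        token_counts[label].update(tokens)
    vocab = {t for tokens in docs for t in tokens}
    log_prior = {c: math.log(n / len(labels)) for c, n in class_docs.items()}
    return log_prior, token_counts, vocab

def predict_nb(model, tokens):
    """Pick the label with the highest log prior plus summed log likelihoods."""
    log_prior, token_counts, vocab = model
    best_label, best_score = None, -math.inf
    for c in log_prior:
        total = sum(token_counts[c].values())
        score = log_prior[c]
        for t in tokens:
            if t in vocab:  # ignore tokens never seen in training
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((token_counts[c][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = c, score
    return best_label

# Toy training data (assumed for illustration).
train_docs = [["love", "echo"], ["amazing", "product"], ["not", "working"], ["not", "good"]]
train_labels = ["positive", "positive", "negative", "negative"]
model = train_nb(train_docs, train_labels)

print(predict_nb(model, ["love", "product"]))            # positive
print(predict_nb(model, ["not", "working", "product"]))  # negative
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.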
16. Weighted Precision, Recall Confusion Matrix
• Precision is the share of predicted positives that are truly positive:
TP/(TP+FP). It is lowered by false positives. High Precision means
that an algorithm returned more relevant results than irrelevant
ones.
• Recall is the share of actual positives that were retrieved: TP/(TP+FN).
High Recall means that an algorithm returned most of the relevant
results.
• TP = 701, TN = 135, FP= 169, FN = 35
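As a check, the headline scores can be recomputed from the four counts above. This sketch assumes that "positive" refers to the rating-5 class and that the counts describe the held-out test set.

```python
# Confusion-matrix counts from the slide.
tp, tn, fp, fn = 701, 135, 169, 35

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)          # predicted positives that are correct
recall = tp / (tp + fn)             # actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Accuracy comes out at about 0.80 and F1 at about 0.87, consistent with the figures on the previous slide. The four counts sum to 1040, about 33% of the 3150 reviews.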
17. Why weighted?
Used weighted Precision and Recall: the per-label scores are averaged,
weighted by support (the number of true instances for each label). This
alters 'macro' averaging to account for label imbalance. Note that it can
result in an F-score that is not between precision and recall.
Precision is 0.80
Recall is 0.80
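The weighted scores can be reproduced from the slide 16 confusion matrix. This sketch assumes those counts are reported with rating 5 as the positive class; the variable names are illustrative.

```python
tp, tn, fp, fn = 701, 135, 169, 35

# Per-class scores, treating each label in turn as the class of interest.
pos_precision, pos_recall, pos_support = tp / (tp + fp), tp / (tp + fn), tp + fn
neg_precision, neg_recall, neg_support = tn / (tn + fn), tn / (tn + fp), tn + fp

total = pos_support + neg_support

# Weighted average: each class contributes in proportion to its support.
weighted_precision = (pos_precision * pos_support + neg_precision * neg_support) / total
weighted_recall = (pos_recall * pos_support + neg_recall * neg_support) / total

# Macro average for comparison: every class counts equally.
macro_recall = (pos_recall + neg_recall) / 2

print(f"weighted precision={weighted_precision:.2f}")
print(f"weighted recall={weighted_recall:.2f}")
print(f"macro recall={macro_recall:.2f}")
```

Both weighted scores round to 0.80, matching the slide, and weighted recall always equals plain accuracy. Macro recall is noticeably lower (about 0.70) because the smaller negative class is recalled less well.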
18. Random Forest Classifier
• Random forests are generally considered an accurate and robust method
because of the number of decision trees participating in the process.
• They are less prone to overfitting than a single decision tree. The main
reason is that averaging the predictions of many trees, each built with
a random subset of features (“feature bagging”), reduces variance.
19. Grid Search – To get the best estimator
• Used Grid Search to get the
best estimator in terms of
max features, max depth of
the tree, min_samples_split
and min_samples_leaf.
• Predicted on the test set using
the best-estimator Random Forest
model.
• Accuracy is slightly better, at
approx. 81.05%, and the F-score is
also good at 87.5%.
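The idea behind grid search can be sketched without any ML library: try every combination of the candidate values and keep the best-scoring one. The grid values and the scoring function below are placeholders (assumptions), since the slides do not list what was searched. In practice the score would come from training a Random Forest with each combination and cross-validating it, for example with scikit-learn's GridSearchCV.

```python
from itertools import product

# Hypothetical grid; the slides do not give the values that were searched.
param_grid = {
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

def evaluate(params):
    """Placeholder score (an assumption, not a real model metric).
    A real version would train a model with `params` and return a
    cross-validated score."""
    depth = params["max_depth"] or 20  # treat None (unlimited) as 20
    return -abs(depth - 10) - params["min_samples_leaf"]

def grid_search(grid, score_fn):
    """Exhaustively score every parameter combination; return the best."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:  # ties keep the first combination seen
            best_params, best_score = params, score
    return best_params, best_score

best_params, best_score = grid_search(param_grid, evaluate)
print(best_params, best_score)
```

The cost grows multiplicatively with each parameter added: this small grid already needs 3 x 2 x 2 = 12 evaluations.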
20. Precision, Recall and Confusion Matrix
• Compared with Naïve Bayes, the FPs have
decreased and the TNs have increased, so
the model is better on Precision and Recall.
• Precision and Recall are slightly better,
at approximately 81%.
• Both have similar scores, so our results
are evenly balanced here.
21. Feature Importance
Based on the feature importance, we can see that
words like love, work, great and disappointed were the most
important in determining the class of a review.
22. Conclusion
• Overall, we can predict whether a review is positive or negative with about 80% accuracy.
• Random Forest results were slightly better than Naïve Bayes.
Further Potential Enhancement
• By selecting only the important features shown on the previous
slide, model accuracy might be improved further.