Sentiment Analysis
By: Gunjan Srivastava
About
Sentiment analysis is a Natural Language Processing (NLP)
technique that extracts emotions from raw text. It is
typically applied to news, social media posts, and customer
reviews to understand the emotions of readers or customers
and how users feel about the posts they are reading.
With increased competition, customer feedback has become
very important. As the volume of user opinions, reviews,
and feedback grows, automated techniques are needed to
analyse them and to act accordingly.
Techniques
● Lexical analysis
● Machine learning based analysis
● Hybrid/Combined analysis
Lexical Analysis
A tokenizer converts the input text into tokens, and each new token is looked up in the sentiment
lexicon (dictionary). When a match is found, the token's score is added to the running total for
the input text.
Hand-tagged lexicons consisting only of adjectives, which are crucial for deciding the subjectivity
of an evaluative text, can achieve an accuracy of about 80% on single phrases.
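The lookup-and-sum procedure described above can be sketched as follows. This is a minimal illustration, not the presenter's code: the lexicon entries, their scores, and the function names are all hypothetical, and for simplicity the sketch scores every token rather than only the first occurrence of each.

```python
import re

# Hypothetical hand-tagged adjective lexicon: word -> polarity score.
LEXICON = {
    "good": 1.0,
    "clean": 1.0,
    "friendly": 1.0,
    "comfortable": 1.0,
    "rude": -1.0,
    "dirty": -1.0,
    "noisy": -1.0,
    "small": -0.5,
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexicon_score(text, lexicon=LEXICON):
    """Sum the scores of all tokens that have an entry in the lexicon."""
    return sum(lexicon.get(token, 0.0) for token in tokenize(text))


def label(score):
    """Map a total score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, `label(lexicon_score("Rude staff and a small, noisy room"))` sums three negative entries and yields "negative", while a sentence with no lexicon words scores zero and is labelled "neutral".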
Machine Learning Based Analysis
Data Collection → Pre-Processing → Training Data → Classification → Plotting Results
Methodologies
● Web Scraping
● EDA
● Word Cloud
● Train Model
● Sentiment Analysis
● LDA Topic Modelling
Libraries
● NLTK: Python library for NLP techniques
● VADER: lexicon- and rule-based sentiment analyser available through NLTK
● Gensim: Python library used for topic modelling
● Scikit-learn: Python machine learning library
Web Scraping
● Reviews were scraped from Booking.com.
● The hotel chosen is "Hotel Hilton," San Francisco, CA.
● The scraped data includes:
    ● Basic information about the reviewer and the review
    ● Rating score
    ● Reviewer name
    ● Reviewer's nationality
    ● Overall review (contains both positive and negative reviews)
    ● Number of reviews the reviewer has written
    ● Review date
    ● Review tags such as trip type (business trip, leisure trip)
    ● Positive reviews
    ● Negative reviews
Web Scraping
Histogram of the hotel reviews: there are more negative reviews than positive reviews.
Solution
Histogram showing the reviews based on the trip type, for example: couple
trip, solo trip, family, business, etc.
Positive Review Outcome
● The WordCloud suggests that most guests are satisfied with the
location: it is very convenient, comfortable, and close to
Union Square and Chinatown.
● Restaurants and pubs are easy to find nearby, and the staff are
friendly and helpful.
● Guests also mention clean rooms, comfortable beds, and good prices.
WordCloud for Positive Reviews.
Negative Review Outcome
● Words like “breakfast”, “room” and “staff” are mentioned quite often, which
suggests that guests were complaining about rude staff, small rooms, and the
coffee, cereal, and muffins provided at breakfast.
● The air conditioning or the shower system may need improvement, as words like
“hot”, “cold”, “air”, “condition”, “bathroom” and “shower” appear in the
WordCloud.
● The hotel may also need to solve issues related to soundproofing and parking.
WordCloud for Negative Reviews.
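A word cloud is essentially a picture of word frequencies, so the observations above can be checked by counting words directly. The sketch below uses only the standard library. The sample reviews and the short stop-word list are invented for illustration and are not taken from the scraped data.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real run would use a fuller one.
STOP_WORDS = {
    "the", "a", "an", "and", "was", "were", "is", "in",
    "of", "to", "too", "at", "for", "no",
}


def word_frequencies(reviews, stop_words=STOP_WORDS):
    """Count how often each non-stop word occurs across all reviews."""
    counts = Counter()
    for review in reviews:
        tokens = re.findall(r"[a-z]+", review.lower())
        counts.update(token for token in tokens if token not in stop_words)
    return counts


# Hypothetical negative reviews, for illustration only.
sample_reviews = [
    "The room was too small and the shower was cold",
    "Breakfast was poor and the room was hot",
    "No parking and the staff were rude",
]

top_words = word_frequencies(sample_reviews).most_common(3)
```

The most frequent words from such a count are the ones that would be drawn largest in the word cloud.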
Sentiment Analysis Outcome
The green dots that lie on the vertical line are the “neutral” reviews.
The red dots on the left are the “negative” reviews.
The blue dots on the right are the “positive” reviews.
Bigger dots indicate more subjectivity.
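The colour and size rules of the scatter plot can be expressed as a small mapping. The sketch assumes each review has a polarity score in [-1, 1] (negative on the left, zero on the vertical line, positive on the right) and a subjectivity score in [0, 1]. These score ranges, the function names, and the size constants are assumptions and are not stated in the slides.

```python
# Colours as described for the plot: red = negative, green = neutral, blue = positive.
COLORS = {"negative": "red", "neutral": "green", "positive": "blue"}


def sentiment_label(polarity, tolerance=0.0):
    """Label a review by the sign of its polarity score."""
    if polarity > tolerance:
        return "positive"
    if polarity < -tolerance:
        return "negative"
    return "neutral"


def marker_size(subjectivity, base=20.0, scale=200.0):
    """Bigger dots for more subjective reviews (hypothetical size constants)."""
    return base + scale * subjectivity


def plot_attributes(polarity, subjectivity):
    """Collect the x position, label, colour, and size for one review's dot."""
    lab = sentiment_label(polarity)
    return {
        "x": polarity,
        "label": lab,
        "color": COLORS[lab],
        "size": marker_size(subjectivity),
    }
```

The resulting dictionaries could be passed to any plotting library, for instance as the x, colour, and size arguments of a scatter plot.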
Model Training
GradientBoostingClassifier
GradientBoostingClassifier builds trees one at a time, with each new tree helping to correct the
errors made by the previously trained trees. After applying the classifier, the accuracy score was
67%; it varies between 63% and 80% depending on the combination of selected features.
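A minimal sketch of training such a classifier with scikit-learn is shown below. It assumes TF-IDF word features and a tiny invented dataset. The presenter's actual features, labels, and hyperparameters are not given in the slides, so this will not reproduce the reported accuracy.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical labelled reviews, for illustration only.
reviews = [
    "great location and friendly staff",
    "clean room and comfortable bed",
    "helpful staff and good price",
    "very convenient and close to restaurants",
    "rude staff and small room",
    "cold shower and noisy room",
    "poor breakfast and no parking",
    "air conditioning was broken and the room was hot",
]
labels = [
    "positive", "positive", "positive", "positive",
    "negative", "negative", "negative", "negative",
]

# TF-IDF turns each review into a sparse feature vector; the boosted trees
# are then fitted sequentially, each one correcting its predecessors.
model = make_pipeline(
    TfidfVectorizer(),
    GradientBoostingClassifier(n_estimators=50, random_state=0),
)
model.fit(reviews, labels)

predictions = model.predict(["friendly staff and clean room"])
```

On real data the accuracy would be measured on a held-out test split rather than on the training reviews.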
RandomForestClassifier
A random forest consists of a large number of individual decision trees that operate as an
ensemble. Each tree outputs a class prediction, and the class with the most votes becomes the
model’s prediction.
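The voting step can be illustrated with a few hand-written stand-ins for trees. The three rule functions below are hypothetical and far simpler than real decision trees; the point is only how the individual predictions are combined.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the most trees.

    On a tie, the class that appears first in the list wins.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical stand-ins for trained trees: each maps a review to a class.
def tree_a(review):
    return "negative" if "rude" in review else "positive"


def tree_b(review):
    return "negative" if "small" in review else "positive"


def tree_c(review):
    return "positive" if "clean" in review else "negative"


def forest_predict(trees, review):
    """Collect one prediction per tree and return the majority class."""
    return majority_vote([tree(review) for tree in trees])
```

For instance, `forest_predict([tree_a, tree_b, tree_c], "rude staff but clean room")` collects one negative and two positive votes and returns "positive".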
Topic Modeling
An LDA model is used to find the topic distribution of each document and the high-probability words
in each topic. Here, we look specifically at the negative reviews to find out which aspects the hotel
should focus on improving.
Steps to find the optimal LDA model:
Convert the reviews to a document-term matrix
Run GridSearch to tune for the optimal LDA model
Output the optimal LDA model and its parameters
Compare LDA model performance scores
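The four steps above can be sketched with scikit-learn as follows. Using scikit-learn's LatentDirichletAllocation here is an assumption (the slides list Gensim for topic modelling). The sample reviews and the parameter grid are also invented for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Hypothetical negative reviews, for illustration only.
negative_reviews = [
    "small room and noisy street",
    "cold shower and broken air conditioning",
    "poor breakfast with stale muffin",
    "rude staff at the front desk",
    "expensive parking and no free spaces",
    "thin walls and noisy neighbours",
    "breakfast coffee was cold",
    "bathroom shower had no hot water",
]

# Step 1: convert the reviews to a document-term matrix.
vectorizer = CountVectorizer(stop_words="english")
doc_term_matrix = vectorizer.fit_transform(negative_reviews)

# Step 2: grid search over the number of topics and the learning decay.
# GridSearchCV falls back to LDA's own score(), an approximate log-likelihood.
param_grid = {"n_components": [2, 3], "learning_decay": [0.5, 0.7, 0.9]}
lda = LatentDirichletAllocation(learning_method="online", max_iter=10, random_state=0)
search = GridSearchCV(lda, param_grid, cv=2)
search.fit(doc_term_matrix)

# Step 3: output the best model and its parameters.
best_lda = search.best_estimator_
best_params = search.best_params_

# Step 4: compare the mean cross-validated score of every parameter combination.
scores = list(zip(search.cv_results_["params"], search.cv_results_["mean_test_score"]))
```

In scikit-learn, `learning_decay` only affects the online learning method, which is why the sketch sets `learning_method="online"`.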
Topic Modeling
From the graph, the choice of learning decay has little impact on model performance.
A model with 5 topics performs best.
Conclusion
● The training dataset lets the model predict reasonably well whether a hotel review is positive,
negative, very positive, or very negative.
● The prediction accuracy is around 70%, which is considered good.
● The Sentiment Analysis scatter plot shows slightly more positive reviews than negative ones.
● Hotel Hilton definitely needs to improve guest satisfaction.
● The WordCloud reveals some problems for the hotel manager to look into, such as the breakfast.
● The hotel manager should train staff to provide friendlier and better service.
● The hotel may also need to address issues related to soundproofing, air conditioning, the shower
system, and parking.
● The EDA section gives the hotel manager a general idea of the reviews as well as the rating
distribution.
● The pyLDAvis interactive visualization would help the hotel manager understand the most common
topics within the negative reviews and make improvements accordingly.
Thank You
