Sentiment analysis techniques are used to analyze customer reviews and understand sentiment. Lexical analysis uses dictionaries to analyze sentiment while machine learning uses labeled training data. The document describes using these techniques to analyze hotel reviews from Booking.com. Word clouds and scatter plots of reviews are generated, showing mostly negative sentiment around breakfast, staff, rooms and facilities. Topic modeling reveals specific issues to address like soundproofing, air conditioning and parking. The analysis helps the hotel manager understand customer sentiment and priorities for improvement.
2. About
Sentiment analysis is a Natural Language Processing technique that extracts emotions from raw data. It is typically applied to news data, social media posts, customer reviews, etc., to understand how readers or customers feel about the posts they are reading.
With increased competition, customer feedback has become very important. As the volume of user opinions, reviews and feedback grows, automated techniques are required to analyse them and to act accordingly.
4. Lexical Analysis
The tokenizer converts the input text to tokens, and each token is then looked up in the lexicon (the dictionary). When a match is found, the entry's score is added to the running total score for the input text.
Hand-tagged lexicons comprising only adjectives, which are crucial for deciding the subjectivity of an evaluative text, can achieve an accuracy of about 80% on single phrases.
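The lookup-and-sum procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the slides: the `LEXICON` entries and their scores, and the helper names `tokenize`, `lexicon_score` and `label`, are all hypothetical.

```python
import re

# Hypothetical hand-tagged lexicon: adjective -> sentiment score (made up for illustration).
LEXICON = {
    "clean": 1.0,
    "friendly": 1.0,
    "comfortable": 1.0,
    "rude": -1.0,
    "noisy": -1.0,
    "small": -0.5,
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexicon_score(text, lexicon=LEXICON):
    """Add up the lexicon score of every token that has an entry."""
    return sum(lexicon.get(token, 0.0) for token in tokenize(text))


def label(score):
    """Map a total score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, `label(lexicon_score("Rude staff, small room"))` adds the scores for "rude" and "small" and yields a negative label. A real lexicon would be far larger and would usually handle negation and intensifiers as well.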
5. Machine Learning Based Analysis
Data Collection → Pre-Processing → Training Data → Classification → Plotting Results
7. Libraries
● NLTK: Python module for NLP techniques
● VADER: lexicon- and rule-based sentiment analysis tool available through NLTK
● Gensim: Used for topic modelling
● Scikit-learn: Python machine learning library
8. Web Scraping
● Scraping reviews from Booking.com
● The hotel I have chosen is "Hotel Hilton," San Francisco, CA.
● The scraped data includes:
● Basic information about the reviewer and the reviews
● Rating Score
● Reviewer Name
● Reviewer's Nationality
● Overall Review (contains both positive & negative reviews)
● Reviewer Reviewed Times
● Review Date
● Review Tags like trip type, such as business trip, leisure trip
● Positive reviews
● Negative reviews
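One way to hold the scraped fields listed above is a small record type. The sketch below is only an assumption about how such a record might look: the class name `ScrapedReview`, the field names and the field types are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScrapedReview:
    """One scraped review; field names are illustrative, not from the slides."""
    rating_score: float
    reviewer_name: str
    reviewer_nationality: str
    overall_review: str             # contains both positive and negative text
    reviewer_reviewed_times: int
    review_date: str
    review_tags: List[str] = field(default_factory=list)  # e.g. trip type
    positive_review: str = ""
    negative_review: str = ""
```

Keeping the positive and negative review text in separate fields makes it easy to build separate word clouds or to run topic modelling on the negative reviews only.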
11. Histogram representation of hotel reviews. There are more negative reviews than positive reviews.
Solution
12. Histogram showing the reviews based on the trip type, for example: couple
trip, solo trip, family, business, etc.
13. Positive Review Outcome
● From the above plot, we can conclude that most people are probably satisfied with the location, which is very convenient, comfortable and close to Union Square or Chinatown.
● It is easy to find restaurants or pubs nearby, and the staff are friendly and helpful.
● Clean rooms, comfortable beds, good prices, etc.
15. Negative Review Outcome
● Words like “breakfast”, “room” and “staff” are mentioned quite often, which may indicate that people were complaining about rude staff, small rooms, and the coffee, cereal and muffins provided at breakfast.
● The air conditioning or the shower system may need improvements as we see
words like “hot”, “cold”, “air”, “condition”, “bathroom” and “shower” in the
WordCloud.
● The hotel may also need to solve issues related to soundproofing and parking.
17. Sentiment Analysis Outcome
The green dots that lie on the vertical line are the “neutral” reviews.
The red dots on the left are the “negative” reviews.
The blue dots on the right are the “positive” reviews.
Bigger dots indicate more subjectivity.
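The colour and size coding above can be expressed as a small mapping function. This is a sketch under assumptions: it supposes each review already has a polarity score in [-1, 1] and a subjectivity score in [0, 1] (the slides do not say which library produced them), and the name `dot_style` and the `base_size` and `scale` parameters are hypothetical.

```python
def dot_style(polarity, subjectivity, base_size=20.0, scale=200.0):
    """Return (label, colour, marker size) for one review.

    Assumes polarity is in [-1, 1] and subjectivity is in [0, 1].
    """
    if polarity < 0:
        label, colour = "negative", "red"
    elif polarity > 0:
        label, colour = "positive", "blue"
    else:
        label, colour = "neutral", "green"
    # More subjective reviews get bigger dots.
    size = base_size + scale * subjectivity
    return label, colour, size
```

The three return values could then be passed as the colour and size arguments of a scatter plot, with polarity on the horizontal axis so that neutral reviews fall on the vertical line at zero.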
19. Model Training
GradientBoostingClassifier
GradientBoostingClassifier builds trees one at a time, where each new tree helps to correct the errors made by the previously trained trees. After applying the classifier, the accuracy score found is 67%, which can vary from 63% to 80% depending on the combination of selected features.
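The idea of each new tree correcting the errors of the earlier ones can be illustrated with a toy example. The sketch below is not scikit-learn's GradientBoostingClassifier: it is a simplified squared-loss version on a single feature, using one-split trees (stumps), and every function name in it is hypothetical. It needs at least two distinct feature values.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    # Skip the largest value so that both sides of the split are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_trees=20, learning_rate=0.5):
    """Fit stumps one at a time, each on the errors left by the previous ones."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def predict_boosted(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def training_mse(model, xs, ys):
    return sum((y - predict_boosted(model, x)) ** 2
               for x, y in zip(xs, ys)) / len(ys)
```

On a small data set such as `xs = [1, 2, 3, 4, 5, 6]` with labels `ys = [0, 0, 0, 1, 1, 1]`, the training error shrinks as more stumps are added, which is the behaviour the slide describes. A real classifier would optimise a classification loss over many features rather than squared error on one.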
RandomForestClassifier
Random forest consists of a large number of individual decision trees that operate as an
ensemble. Each individual tree in the random forest spits out a class prediction and the class with
the most votes becomes the model’s prediction.
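The voting step can be shown in isolation. The sketch below assumes the individual trees have already produced their class predictions for one review; the function name `majority_vote` is hypothetical, and training the trees themselves is left out.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]
```

For example, if three trees predict "positive", "negative" and "positive" for a review, the ensemble's prediction is "positive".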
20. Topic Modeling
An LDA model is used to find each document's topic distribution and the high-probability words in each topic.
Here, we want to look specifically at the negative reviews to find out which aspects the hotel should focus on improving.
Steps to find the optimal LDA model:
Convert the reviews to a document-term matrix
GridSearch and tune for the optimal LDA model
Output the optimal LDA model and its parameters
Compare LDA Model Performance Scores
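The first step above, converting reviews to a document-term matrix, can be sketched in plain Python. This is only an illustration of what the matrix contains: the function name `build_document_term_matrix` and the simple tokenization are assumptions, and the slides do not show the code actually used.

```python
import re


def build_document_term_matrix(reviews):
    """Return (vocabulary, matrix), where matrix[i][j] counts term j in review i."""
    tokenized = [re.findall(r"[a-z]+", review.lower()) for review in reviews]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    column = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for token in tokens:
            row[column[token]] += 1
        matrix.append(row)
    return vocabulary, matrix
```

Each row is one review and each column is one word, so an LDA model can be fitted to the counts. In practice a library vectorizer such as scikit-learn's `CountVectorizer` would normally do this step and also handle stop words.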
21. Topic Modeling
From the graph, we see that the choice of learning decay has little impact.
5 topics would produce the best model.
22. Conclusion
● The training dataset used to train the model gives good predictions of whether hotel reviews are positive, negative, very positive or very negative.
● The accuracy of the prediction is around 70%, which is considered good.
● From the Sentiment Analysis scatter plot, we see that there are slightly more positive reviews than negative ones.
● Hotel Hilton definitely needs to improve hotel guest satisfaction.
● The WordCloud reveals some problems for the hotel manager to look into, like their breakfast.
● The hotel manager should train staff well to provide friendlier and better services.
● The hotel may also need to work on issues related to soundproofing, air conditioning, the shower system and parking.
● The EDA section could give the hotel manager a general idea of the reviews as well as the rating
distribution.
● The pyLDAvis interactive visualization would help the hotel manager to further understand what the most popular topics within the negative reviews are and make improvements accordingly.