Hotel Review Classification (NLP Classification) PPT

Hotel Rating Classification
Team Members:
• Ashwini Salwadgi
• Anuja Borse
• Shubham Pawar
• Akshay Kumar
• Rudra Shukla
• Devesh Gaonkar
• Pranjalee Bokde

CONTENTS
01 Project Architecture
02 Problem statement
03 Introduction to ML classification
04 Dataset Details
05 Data Preprocessing and EDA
07 Feature engineering
08 Model Selection
09 Deployment

Project Architecture
Datasets
-Import Libraries
-Load Datasets
Data Cleaning
-Missing Value Treatment
-Checking Duplicates
Data Preprocessing/EDA
-Normalization & Lemmatization
-Punctuation Removal & Stopwords Removal
Data Visualization
-Positive Reviews
-Negative Reviews
Model Selection
Model Deployment

Problem statement
The hotel dataset consists of 20,491 reviews, each with a feedback label,
covering different hotels. Our goal is to examine how travelers communicate
their positive and negative experiences of staying at a specific hotel on
online platforms. The main objective is to identify the attributes that
travelers consider when selecting a hotel. With this, managers can understand
which elements of their hotel most influence a positive review and can use
them to improve the hotel's brand image.

Introduction to NLP
• Natural Language Processing (NLP) is a field of computer science and
artificial intelligence focused on the interaction between computers and
human languages.
• It involves programming computers to process, analyze, and derive
meaning from large amounts of natural language data.
• NLP is applied in various areas, including automatic question answering,
text summarization, and language translation.
• Research in NLP spans disciplines such as cognitive science,
linguistics, and psychology.
• One significant application of NLP is text classification, where the goal is
to categorize text into predefined labels based on its content.

NLP classification
Text Classification:
• Text classification is a common NLP task used to solve business problems
in various fields.
• The goal of text classification is to predict the class of unseen text
documents, usually with the help of supervised machine learning.
• It works like a classification algorithm trained on a tabular dataset to
predict a class: both learn from labeled examples.
• The main distinction between the two is that the input is text, which must
first be converted into numeric features.

Dataset Details
• Hotel_Review.csv: the dataset we are using in our project.
• No. of Rows: 20,491
• No. of Columns: 2 (“Review” and “Feedback”)
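A minimal sketch of loading the dataset with pandas and confirming its shape. The inline CSV text is a hypothetical stand-in for Hotel_Review.csv, and the column names "Review" and "Feedback" are taken from the Data Preprocessing slide. With the real file, its path would be passed to the loader instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for Hotel_Review.csv (the real file has 20,491 rows).
SAMPLE_CSV = """Review,Feedback
"nice hotel, expensive parking",positive
"dirty room, rude staff",negative
"great location, friendly staff",positive
"""


def load_reviews(source):
    """Read the review data; `source` may be a file path or a file-like object."""
    return pd.read_csv(source)


df = load_reviews(io.StringIO(SAMPLE_CSV))
n_rows, n_cols = df.shape
print(f"No. of rows: {n_rows}, No. of columns: {n_cols}")
```

On the real dataset the same shape check would be expected to report 20,491 rows and 2 columns.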

Data Preprocessing
• Data Types: Checked data types. Both “Review” and “Feedback” columns are
of object type.
• Null Value Treatment: Checked for missing values. No missing values found.
• Duplicate Value Treatment: Checked for duplicates. No duplicates found.
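The three checks above could be run with pandas roughly as follows. This is a sketch on a few hypothetical rows, not the project's actual notebook code.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from Hotel_Review.csv.
df = pd.DataFrame(
    {
        "Review": ["great stay", "dirty room", "friendly staff"],
        "Feedback": ["positive", "negative", "positive"],
    }
)

# Data types: text columns typically show up as `object`
# (newer pandas versions may report a dedicated string dtype instead).
print(df.dtypes)

# Null value check: number of missing entries per column.
missing_per_column = df.isnull().sum()

# Duplicate check: number of rows that exactly repeat an earlier row.
duplicate_rows = int(df.duplicated().sum())

print(missing_per_column.to_dict(), duplicate_rows)
```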

Text Preprocessing
• Normalization: Converted text to lower case
• Punctuation Removal: Removed unnecessary punctuation
• Lemmatization: Reduced words to their dictionary base forms (lemmas)
• Stopwords Removal: Excluded common stopwords
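A minimal sketch of the four text preprocessing steps in plain Python. The stopword set and the lemma lookup table are tiny illustrative assumptions. A real project would typically use a full stopword list and a dictionary-based lemmatizer (for example from NLTK or spaCy).

```python
import string

# Tiny illustrative resources (assumptions, not the project's actual lists).
STOPWORDS = {"the", "a", "an", "and", "was", "were", "is", "in", "of", "to"}
LEMMAS = {"rooms": "room", "beds": "bed", "stayed": "stay"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text):
    """Apply the four steps in order and return the cleaned tokens."""
    text = text.lower()                          # normalization
    text = text.translate(PUNCT_TABLE)           # punctuation removal
    tokens = text.split()
    tokens = [LEMMAS.get(t, t) for t in tokens]  # lemmatization (table lookup)
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal


print(preprocess("The rooms were CLEAN, and the beds were comfy!"))
# ['room', 'clean', 'bed', 'comfy']
```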

Data Visualization (EDA)
Distribution of Feedback Labels
• Bar Plot of Feedback Counts
• Visual representation of the counts for each class (positive/negative).
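A sketch of how the class counts behind such a bar plot might be computed and drawn. The labels are hypothetical, and the helper name `plot_feedback_counts` and the output file name are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical labels; in the project these come from the "Feedback" column.
feedback = ["positive", "negative", "positive", "positive", "negative"]

counts = Counter(feedback)


def plot_feedback_counts(counts, path="feedback_counts.png"):
    """Save a bar plot of the class counts (requires matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    labels = sorted(counts)
    fig, ax = plt.subplots()
    ax.bar(labels, [counts[label] for label in labels])
    ax.set_xlabel("Feedback")
    ax.set_ylabel("Number of reviews")
    ax.set_title("Distribution of Feedback Labels")
    fig.savefig(path)
    plt.close(fig)


print(dict(counts))
```

Calling `plot_feedback_counts(counts)` writes the chart to disk. A strongly uneven pair of bars would indicate class imbalance worth addressing before training.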

Positive Reviews using Word Cloud
Negative Reviews using Word Cloud
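A word cloud is driven by per-class word frequencies. The sketch below computes them on hypothetical reviews and shows one way to render them with the `wordcloud` package. The helper names are assumptions for illustration.

```python
from collections import Counter

# Hypothetical cleaned reviews with their labels.
reviews = [
    ("great location friendly staff", "positive"),
    ("friendly staff clean room", "positive"),
    ("dirty room rude staff", "negative"),
]


def word_frequencies(reviews, label):
    """Count how often each word appears in reviews carrying `label`."""
    freq = Counter()
    for text, review_label in reviews:
        if review_label == label:
            freq.update(text.split())
    return freq


def save_word_cloud(freq, path):
    """Render the frequencies as an image (requires the `wordcloud` package)."""
    from wordcloud import WordCloud

    WordCloud(background_color="white").generate_from_frequencies(freq).to_file(path)


positive_freq = word_frequencies(reviews, "positive")
negative_freq = word_frequencies(reviews, "negative")
print(positive_freq.most_common(3))
```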

Feature engineering
o Feature engineering in Natural Language Processing (NLP) involves
transforming raw text data into meaningful features that can be used by
machine learning algorithms to make predictions or generate insights.
o Unlike traditional structured data, text data is unstructured, so feature
engineering in NLP often involves a series of pre-processing steps and the
creation of specialized features to capture the nuances of language.
o Feature engineering in NLP is highly dependent on the specific problem
and the type of data being used.
o The goal is to create features that best capture the underlying patterns in the
text, leading to better model performance.
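As an illustration of hand-crafted features, the sketch below turns one review into a few numeric values. The specific features and the two small word lists are assumptions chosen for the example, not the features used in the project.

```python
# Small illustrative lexicons (assumptions, not the project's actual word lists).
POSITIVE_WORDS = {"great", "clean", "friendly", "excellent"}
NEGATIVE_WORDS = {"dirty", "rude", "terrible", "noisy"}


def custom_features(review):
    """Turn one review into a dictionary of simple numeric features."""
    # Lowercase, split on whitespace, and trim punctuation stuck to each token.
    tokens = [t.strip(".,!?") for t in review.lower().split()]
    return {
        "char_count": len(review),
        "word_count": len(tokens),
        "exclamation_count": review.count("!"),
        "positive_word_count": sum(t in POSITIVE_WORDS for t in tokens),
        "negative_word_count": sum(t in NEGATIVE_WORDS for t in tokens),
    }


features = custom_features("Great location, friendly staff!")
print(features)
```

Features like these can be placed alongside vectorized text so the classifier sees both the words and simple summary statistics.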

Sentiment Analysis Features
Custom Features

Model Selection
Logistic regression:
• Logistic regression is a fundamental machine learning algorithm that is
widely used in Natural Language Processing (NLP) tasks, particularly for
binary classification problems.
• Despite its simplicity, it performs well on many NLP tasks when combined
with the right features and data preprocessing techniques.
• Logistic regression is trained using maximum likelihood estimation,
where the model parameters are optimized to best fit the training data.
• Logistic regression remains a powerful tool in NLP, especially when you
need a model that is simple, interpretable, and performs well on a wide
range of binary classification tasks.
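For reference, the maximum likelihood objective can be written out as below. The notation is introduced here as an assumption: x_i is the feature vector of review i, y_i in {0, 1} is its label, w and b are the weights and bias, and sigma is the sigmoid function.

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
p_i = \sigma(w^\top x_i + b)

\ell(w, b) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]
```

Training chooses w and b to maximize the log-likelihood, which is the same as minimizing the cross-entropy loss (its negative). In practice a regularization term is usually added to limit overfitting.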

Applications of logistic regression in NLP:
• Text Classification: Logistic regression can be used for tasks like sentiment
analysis, spam detection, or any other task where text needs to be classified
into two categories.
• Feature Representation:
  - Bag of Words (BoW): Converts text into a vector of word frequencies.
  - TF-IDF: Weights each word by its frequency in a document, scaled down by
    how many documents contain it, so words that are rare across the corpus
    carry more weight.
  - Word Embeddings: Converts words into dense vectors capturing
    semantic meaning (e.g., using Word2Vec, GloVe).
  - n-grams: Captures sequences of words (e.g., bigrams, trigrams) to
    consider word order and context.
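A minimal sketch combining two of these representations (TF-IDF over unigrams and bigrams) with logistic regression in scikit-learn. The toy reviews and labels are hypothetical, and the hyperparameters are defaults rather than the project's tuned settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the cleaned reviews and feedback labels.
texts = [
    "great location friendly staff",
    "clean bathroom lovely breakfast",
    "excellent service comfortable bed",
    "wonderful stay helpful staff",
    "dirty room rude reception",
    "terrible noise broken shower",
    "awful smell poor cleaning",
    "horrible experience rude manager",
]
labels = ["positive"] * 4 + ["negative"] * 4

# TF-IDF over unigrams and bigrams, followed by a binary logistic regression.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

predictions = model.predict(["friendly staff great breakfast", "rude manager dirty room"])
print(list(predictions))
```

Wrapping the vectorizer and classifier in one pipeline means new reviews are transformed with the same vocabulary that was learned during training.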

Why we selected logistic regression?

Deployment
• Deployment is the process by which an ML model is moved from an offline
environment and integrated into an existing production environment, such as
a live application.
• It is a critical step that must be completed for a model to serve its
intended purpose and solve the problems it was designed to address.
• Here, we are using Streamlit to deploy our application.
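A minimal sketch of what such a Streamlit app might look like. The tiny demo model, the function names, and the widget labels are all assumptions for illustration. A real app would load the trained model from disk rather than fitting one at startup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

try:
    import streamlit as st
except ImportError:  # lets classify() be imported and tested without Streamlit
    st = None


def build_demo_model():
    """Fit a tiny stand-in model (hypothetical data, not the project's model)."""
    texts = [
        "great friendly staff",
        "clean lovely room",
        "dirty rude reception",
        "terrible broken shower",
    ]
    labels = ["positive", "positive", "negative", "negative"]
    return make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)


def classify(model, review):
    """Return the predicted feedback label for one review."""
    return model.predict([review])[0]


def run_app():
    """Draw the page: a text box, a button, and the predicted label."""
    st.title("Hotel Review Classification")
    review = st.text_area("Enter a hotel review")
    if st.button("Classify") and review.strip():
        st.write(f"Predicted feedback: {classify(build_demo_model(), review)}")


if st is not None:
    run_app()
```

Saved under a file name such as app.py (a hypothetical name), the app would be started with `streamlit run app.py`. Keeping `classify` separate from the page code makes the prediction logic testable without a browser.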

Challenges in project
• Data Collection and Quality
  - Noise in Data: Hotel reviews often contain spelling errors, slang,
    abbreviations, and grammatical mistakes, which can make text preprocessing
    difficult.
  - Length Variation: Reviews can vary significantly in length, from a few
    words to several paragraphs, which might require different handling during
    preprocessing.
• Text Preprocessing Challenges
  - Handling Stop Words: Deciding whether to remove stop words (common
    words like "and", "the") can be tricky, as some of them, such as
    negations, carry sentiment (e.g., "not" in "not good").
  - Stemming and Lemmatization: Reducing words to their base forms can help
    in generalizing features but might also lose some nuance (e.g., "better"
    being reduced to "good").
Challenges in project (continued)
• Model Selection and Training
  - Choosing the Right Model: Simple models like logistic regression might not
    capture complex relationships in the data, while more advanced models like
    neural networks might require extensive tuning and more computational
    resources.
  - Overfitting: With limited or noisy data, the model might overfit,
    especially when using complex models, leading to poor generalization to
    new reviews.
• Deployment Challenges
  - Real-time Processing: If the model is to be deployed in a real-time system
    (e.g., for live review monitoring), efficiency and speed of processing
    become critical.
  - Scalability: The model needs to scale with an increasing volume of reviews,
    requiring optimization in terms of computational resources and processing
    time.

References
• Pandas documentation: https://pandas.pydata.org/docs/
• Matplotlib documentation: https://matplotlib.org/stable/index.html
• Streamlit documentation: https://docs.streamlit.io/
• Kaggle: https://www.kaggle.com/
