Hotel Rating Classification
Team
Members:
 Ashwini Salwadgi
 Anuja Borse
 Shubham Pawar
 Akshay Kumar
 Rudra Shukla
 Devesh Gaonkar
 Pranjalee Bokde
CONTENTS
01 Project Architecture
02 Problem statement
03 Introduction to ML classification
04 Dataset Details
05 Data Preprocessing and EDA
07 Feature engineering
08 Model Selection
09 Deployment
Project Architecture
Datasets
-Import Libraries
-Load Datasets
Data Cleaning
-Missing Value Treatment
-Checking Duplicates
Data Preprocessing/EDA
-Normalization & Lemmatization
-Punctuation Removal & Stopwords Removal
Data Visualization
-Positive Reviews
-Negative Reviews
Model Selection
Model Deployment
Problem statement
The hotel dataset consists of 20,491 reviews and their feedback labels for different hotels.
Our goal is to examine how travelers communicate their positive and
negative experiences of staying in a specific hotel on online platforms.
The main objective is to identify the attributes that travelers consider
when selecting a hotel. With this, managers can understand which elements of
their hotel contribute most to a positive review and to a better hotel
brand image.
Introduction to NLP
 Natural Language Processing (NLP) is a field of computer science and
artificial intelligence focused on the interaction between computers and
human languages.
 It involves programming computers to process, analyze, and derive
meaning from large amounts of natural language data.
 NLP is applied in various areas, including automatic question answering,
text summarization, and language translation.
 Research in NLP spans across disciplines such as cognitive science,
linguistics, and psychology.
 One significant application of NLP is text classification, where the goal is
to categorize text into predefined labels based on its content.
NLP classification
Text Classification:
 Text classification is a common NLP task used to solve business problems
in various fields.
 The goal of text classification is to categorize or predict a class of unseen
text documents, often with the help of supervised machine learning.
 Like a classifier trained on a tabular dataset to predict a class, a text
classifier learns from labeled examples using supervised machine learning.
 The main distinction between the two is that the input is text, which must
first be converted into numeric features.
Dataset Details
 Hotel_Review.csv: the dataset used in our project.
 No. of Rows: 20,491
 No. of Columns: 2 (“Review” and “Feedback”)
Data Preprocessing
Data Types
• Checked Data Types: Both “Review” and “Feedback” columns are of object type
Null Value Treatment
• Checked for Missing Values: No missing values found
Duplicate Value Treatment
• Checked for Duplicates: No duplicates found
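The checks above can be sketched with pandas. This is a minimal illustration on a tiny in-memory stand-in for Hotel_Review.csv: the sample rows are made up, and only the column names “Review” and “Feedback” come from the slides.

```python
import pandas as pd

# Hypothetical stand-in rows. The real project would load the file instead,
# for example with pd.read_csv("Hotel_Review.csv").
df = pd.DataFrame(
    {
        "Review": [
            "nice hotel great location",
            "dirty room rude staff",
            "lovely stay friendly staff",
        ],
        "Feedback": ["positive", "negative", "positive"],
    }
)

# Rows and columns, then the data type of each column.
print(df.shape)
print(df.dtypes)

# Missing values per column, and the number of fully duplicated rows.
missing_per_column = df.isnull().sum()
duplicate_rows = int(df.duplicated().sum())
print(missing_per_column)
print(duplicate_rows)
```

If either count were non-zero, `df.dropna()` and `df.drop_duplicates()` would be the usual follow-up steps.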
Text Preprocessing
Normalization: Converted text to lower case
Punctuation Removal: Removed unnecessary punctuation
Lemmatization: Reduced words to their dictionary base forms (lemmas)
Stopwords Removal: Excluded common stopwords
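The four steps above can be sketched in plain Python. The stopword set and the lemma lookup table here are tiny illustrative assumptions, not the resources used in the project; a real pipeline would normally take both from an NLP library.

```python
import string

# Illustrative assumptions: a handful of stopwords and a toy lemma lookup.
STOPWORDS = {"the", "a", "an", "and", "be", "in", "at", "we"}
LEMMAS = {"rooms": "room", "were": "be", "was": "be", "stayed": "stay"}


def preprocess(review: str) -> list[str]:
    """Lowercase, strip punctuation, lemmatize, then drop stopwords."""
    text = review.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    # Map each token to its lemma when the toy table knows it.
    tokens = [LEMMAS.get(tok, tok) for tok in tokens]
    return [tok for tok in tokens if tok not in STOPWORDS]


print(preprocess("The rooms were clean, and we stayed in comfort!"))
```

Lemmatizing before stopword removal means inflected forms such as "were" are first mapped to "be" and then filtered out together.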
Data Visualization(EDA)
Distribution of Feedback Labels
 Bar Plot of Feedback Counts
 Visual representation of the number of reviews in each class (positive/negative).
Positive Reviews using Word cloud
Negative Reviews using Word cloud
Top Bigrams
Top Trigrams
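One way to compute top bigrams and trigrams is to count consecutive token windows. This is a sketch on made-up tokenized reviews; the helper name `top_ngrams` is hypothetical.

```python
from collections import Counter


def top_ngrams(token_lists, n, k=3):
    """Return the k most common n-grams across tokenized reviews."""
    counts = Counter()
    for tokens in token_lists:
        # Zipping n shifted copies of the list yields each window of n tokens.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)


# Made-up, already preprocessed reviews.
reviews = [
    ["great", "location", "friendly", "staff"],
    ["friendly", "staff", "great", "location"],
    ["great", "location", "clean", "room"],
]

top_bigrams = top_ngrams(reviews, n=2)
top_trigrams = top_ngrams(reviews, n=3)
print(top_bigrams)
print(top_trigrams)
```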
Feature engineering
o Feature engineering in Natural Language Processing (NLP) involves
transforming raw text data into meaningful features that can be used by
machine learning algorithms to make predictions or generate insights.
o Unlike traditional structured data, text data is unstructured, so feature
engineering in NLP often involves a series of pre-processing steps and the
creation of specialized features to capture the nuances of language.
o Feature engineering in NLP is highly dependent on the specific problem
and the type of data being used.
o The goal is to create features that best capture the underlying patterns in the
text, leading to better model performance.
Sentiment Analysis Features
Custom Features
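The slides do not list which custom features were built, so the ones below are hypothetical examples of simple hand-crafted features that are often derived from review text. The function name and feature names are assumptions for illustration only.

```python
def custom_features(review: str) -> dict:
    """Compute a few simple hand-crafted features from raw review text."""
    words = review.split()
    return {
        "char_count": len(review),
        "word_count": len(words),
        # Guard against empty reviews to avoid division by zero.
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "exclamation_count": review.count("!"),
    }


features = custom_features("Great stay! Loved it!")
print(features)
```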
Model Selection
Logistic regression:
 Logistic regression is a fundamental machine learning algorithm that is
widely used in Natural Language Processing (NLP) tasks, particularly for
binary classification problems.
 Despite its simplicity, it performs well on many NLP tasks when combined
with the right features and data preprocessing techniques.
 Logistic regression is trained by maximum likelihood estimation: the model
parameters are chosen to maximize the likelihood of the training labels.
 Logistic regression remains a powerful tool in NLP, especially when you
need a model that is simple, interpretable, and performs well on a wide
range of binary classification tasks.
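For reference, maximum likelihood estimation for binary logistic regression can be written as below. The notation is ours, not from the slides: $x_i$ is the feature vector of review $i$, $y_i \in \{0, 1\}$ is its label, $w$ and $b$ are the model parameters, and $N$ is the number of training reviews.

```latex
p_i = \sigma(w^\top x_i + b) = \frac{1}{1 + e^{-(w^\top x_i + b)}}

(\hat{w}, \hat{b}) = \arg\max_{w, b} \sum_{i=1}^{N}
  \bigl[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \bigr]
```

Maximizing this log-likelihood is equivalent to minimizing the binary cross-entropy loss; in practice a regularization term is usually added.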
Applications of Logistic regression in NLP:
 Text Classification: Logistic regression can be used for tasks like sentiment
analysis, spam detection, or any other task where text needs to be classified
into two categories.
 Feature Representation:
Bag of Words (BoW): Converts text into a vector of word frequencies.
TF-IDF: Weights words by how often they occur in a document, discounted by
how many documents contain them, so rarer words carry more weight.
Word Embeddings: Converts words into dense vectors capturing
semantic meaning (e.g., using Word2Vec, GloVe).
n-grams: Captures sequences of words (e.g., bigrams, trigrams) to
consider word order and context.
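As an illustration of how these representations feed a logistic regression classifier, here is a minimal scikit-learn sketch. The slides do not say which representation the project used; this example assumes TF-IDF over unigrams and bigrams, and the training reviews are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training examples; labels follow the positive/negative classes.
reviews = [
    "great location friendly staff",
    "clean room lovely breakfast",
    "dirty room rude staff",
    "noisy street terrible breakfast",
]
labels = ["positive", "positive", "negative", "negative"]

# ngram_range=(1, 2) adds bigram features to the unigram TF-IDF features.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
model.fit(reviews, labels)

new_review = ["friendly staff lovely room"]
predictions = model.predict(new_review)
probabilities = model.predict_proba(new_review)
print(predictions, probabilities)
```

Wrapping the vectorizer and the classifier in one pipeline keeps the vocabulary fitted on the training data only, which matters once a train/test split is introduced.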
Why we selected logistic regression?
Deployment
 Deployment is the process by which an ML model is
moved from an offline environment and integrated
into an existing production environment such as a live
application.
 It is a critical step that must be completed for
a model to serve its intended purpose and solve the
problems it was designed for.
 Here, we use Streamlit to deploy our
application.
Challenges in project
 Data Collection and Quality
 Noise in Data: Hotel reviews often contain spelling errors, slang,
abbreviations, and grammatical mistakes, which can make text preprocessing
difficult.
 Length Variation: Reviews can vary significantly in length, from a few words
to several paragraphs, which might require different handling during
preprocessing.
 Text Preprocessing Challenges
 Handling Stop Words: Deciding whether to remove stop words (common
words like "and", "the") can be tricky, because some of them carry sentiment
(e.g., "not" in "not good").
 Stemming and Lemmatization: Reducing words to their base forms can help
in generalizing features but might also lose some context (e.g., "better" being
reduced to "good").
Challenges in project
 Model Selection and Training
 Choosing the Right Model: Simple models like logistic regression might not
capture complex relationships in the data, while more advanced models like
neural networks might require extensive tuning and more computational
resources.
 Overfitting: With limited data or noisy data, the model might overfit, especially
when using complex models, leading to poor generalization to new reviews.
 Deployment Challenges
 Real-time Processing: If the model is to be deployed in a real-time system
(e.g., for live review monitoring), efficiency and speed of processing become
critical.
 Scalability: The model needs to scale with an increasing volume of reviews,
requiring optimization in terms of computational resources and processing
time.
References
 Pandas documentation: https://pandas.pydata.org/docs/
 Matplotlib documentation: https://matplotlib.org/stable/index.html
 Streamlit documentation: https://docs.streamlit.io/
 Kaggle: https://www.kaggle.com/
THANK YOU!
