Analyzing Movie Reviews : Machine learning project

Sentiment Analysis of movie reviews

Introduction
• In an era when the digital landscape is flooded with user-generated
content, understanding the sentiments expressed in movie reviews
provides valuable insight into audience reactions and preferences, and
gives filmmakers feedback on how their work is being received.
• Sentiment analysis is a technique for analyzing a piece of text to
determine the sentiment it expresses. In our case, we have been given
an IMDB movie review dataset that contains about 50k movie reviews,
each labeled as positive or negative.
• Our aim is to study the given dataset and to build and train a model
that can accurately classify a new, unseen review as positive or
negative.

Problem Statement
• In the realm of the ever-expanding digital landscape and the proliferation
of user-generated content, the film industry faces a pressing need to
systematically understand and analyze the sentiments expressed in
movie reviews.
• Moviegoers share their opinions across diverse platforms, including
websites, social media, and online forums, offering a rich tapestry of
sentiments ranging from enthusiastic praise to critical evaluation.
• The problem at hand involves developing an effective sentiment analysis
system tailored specifically for movie reviews. This system must
automatically categorize and interpret the sentiments expressed in
textual content, classifying each review as positive or negative.

How will the ML model help
• Insight Generation:- By accurately classifying sentiments, the model will
generate actionable insights into how audiences perceive and react to
movies. Filmmakers and studios can gain a comprehensive understanding of
the strengths and weaknesses of their films, helping them make informed
decisions for future projects.
• Box Office Predictions:- The model's analysis of sentiments can contribute to
predicting box office performance. Positive sentiments often correlate with
higher audience interest, potentially leading to increased box office revenue.
This predictive capability provides stakeholders with valuable foresight into a
film's commercial success.
• Marketing Strategy Optimization:- The model's outputs can guide the
optimization of marketing and promotional strategies. Positive sentiments can
be leveraged to create compelling promotional content, while addressing
negative sentiments allows for targeted improvements and strategic
communication to manage public perception.

Challenges Faced
• Nuanced language:- Movie reviews often contain nuanced language,
sarcasm, irony, or humor. Capturing these subtleties can be challenging
for sentiment analysis models, as they may misinterpret the intended
sentiment.
• Subjectivity:- Sentiment is inherently subjective, and individuals may
express their opinions in diverse ways. Differentiating between personal
opinions and objective statements poses a challenge, as models need
to navigate the subjective nature of language.
• Contextual understanding:- Understanding the context in which certain
phrases or cultural references are used is crucial. A lack of contextual
understanding may lead to misinterpretation of sentiments, especially
when specific references are involved.

Proposed System
• Data collection:- The dataset should encompass a wide range of
reviews, covering both positive and negative sentiment. With diverse
data, the model can learn patterns that carry over to new reviews.
• EDA:- Exploratory Data Analysis (EDA) plays a crucial role in
understanding the dataset and extracting meaningful insights, which can
aid in predicting the sentiment of the reviews.

• Data preprocessing:-
Removing HTML tags:- The reviews were scraped from the internet and still
contain HTML tags, so we remove them.

Converting everything to lower case:- We convert all words to lower case so
that, for example, "Good" and "good" are treated as the same word.

Removing punctuation:- We remove all punctuation marks from the reviews, since
they carry little information for a bag-of-words model.

Spelling correction:- We correct spelling mistakes in the reviews using the
.correct() function.

Tokenization:- This involves breaking the text down into smaller units called
tokens.

Removing stop words:- Stop words are words such as "and", "or", "the", and
"from". They appear throughout the reviews but contribute little to training
the model, so we remove them.

Stemming:- This reduces related word forms to a common base form. For example,
"playing" and "played" are both converted to "play".
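The preprocessing steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the project's actual code: the stop-word list is a tiny made-up sample, `crude_stem` is a hypothetical stand-in for a real stemming algorithm, tokenization is a plain whitespace split, and the spelling-correction step is left out because it depends on an external library.

```python
import html
import re
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "or", "the", "from", "is", "it",
              "this", "was", "i", "of", "to", "in"}


def strip_html(text):
    """Replace HTML tags such as <br /> with spaces and decode entities."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def crude_stem(token):
    """Hypothetical stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(review):
    """Apply the steps in order and return a list of stemmed tokens."""
    text = strip_html(review).lower()
    # Map every punctuation character to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    tokens = text.split()  # simple whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]


print(preprocess("I LOVED this movie!<br />The acting was great, and it played well."))
```

A suffix stripper this simple over-stems some words (for example, "loved" becomes "lov"), which is one reason a dedicated stemming library is normally preferred in practice.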

• Feature Extraction:-
Bag Of Words:- The "Bag of Words" (BoW) model is a common and
straightforward technique used in natural language processing (NLP) for
representing textual data. The basic idea is to represent a document as
an unordered collection (multiset) of its words: word counts are kept,
but grammar and word order are discarded. The Bag of Words model
converts raw text into a numerical format that machine learning
algorithms can work with.
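As a small illustration of the idea, the sketch below builds a vocabulary from already-tokenized documents and turns each document into a vector of word counts. The function names `build_vocabulary` and `bag_of_words` and the toy documents are made up for this example.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Map every distinct token to a column index (alphabetical order)."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    return {tok: idx for idx, tok in enumerate(vocab)}


def bag_of_words(tokens, vocab):
    """Return one count per vocabulary entry; word order is ignored."""
    counts = Counter(tokens)
    return [counts.get(tok, 0) for tok in vocab]


docs = [["great", "movie", "great", "act"], ["bad", "movie"]]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Each row has one column per vocabulary word, so documents of different lengths all map to vectors of the same size.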

• Model Training:- Model training is a crucial step in machine learning
where a model learns to make predictions or decisions by being
exposed to a labeled dataset. In the context of sentiment analysis, the
training process involves teaching the model to associate features
extracted from text data (such as bag of words or word embeddings)
with corresponding sentiment labels (positive or negative).

• Model Selection:-
Multinomial Naive Bayes:- Multinomial Naive Bayes is often considered
a suitable choice for sentiment analysis of text, including movie reviews.
Sentiment analysis is essentially a text classification task where the goal
is to assign a sentiment label (positive or negative) to a given
document. Multinomial Naive Bayes models each document through its word
counts, which matches the bag-of-words features used here, and it is fast
to train.

Logistic Regression:- Logistic Regression is another commonly used
model for sentiment analysis, including the analysis of movie reviews.
Sentiment analysis is often treated as a binary classification task where
the goal is to predict whether a document (e.g., a movie review)
expresses positive or negative sentiment. Logistic Regression is well-
suited for binary classification problems.

Random Forest:- Random Forest is an ensemble of decision trees. It
combines the predictions of many individual decision trees, each trained
on a random sample of the data, to create a more robust and accurate
model. Ensemble methods often lead to improved generalization
performance. Random Forest
provides a feature importance ranking, indicating the contribution of
each feature (word) to the overall predictive performance. This can be
valuable for understanding which words play a crucial role in
determining a sentiment.
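Of the three candidates above, Multinomial Naive Bayes is compact enough to sketch from scratch. The class below is an illustrative toy, not the implementation used in the project: the name `TinyMultinomialNB`, the smoothing parameter `alpha`, and the four training documents are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Toy Multinomial Naive Bayes over tokenized documents."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive (Laplace) smoothing strength

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {tok for doc in docs for tok in doc}
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        self.total_words = {c: sum(self.word_counts[c].values())
                            for c in self.classes}
        return self

    def _log_score(self, doc, c):
        # log P(class) plus the sum of log P(word | class) over known words.
        score = math.log(self.class_counts[c] / sum(self.class_counts.values()))
        denom = self.total_words[c] + self.alpha * len(self.vocab)
        for tok in doc:
            if tok in self.vocab:  # words never seen in training are skipped
                score += math.log((self.word_counts[c][tok] + self.alpha) / denom)
        return score

    def predict(self, doc):
        return max(self.classes, key=lambda c: self._log_score(doc, c))


train_docs = [["great", "fun"], ["love", "great"],
              ["bad", "boring"], ["awful", "bad"]]
train_labels = ["positive", "positive", "negative", "negative"]
model = TinyMultinomialNB().fit(train_docs, train_labels)
print(model.predict(["great", "movie"]))
print(model.predict(["bad"]))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together for a long review.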

• Model Evaluation:-
Multinomial Naive Bayes:-
This model reached an accuracy of 84%. For negative sentiments, the
precision was 83% and the recall was 87%; for positive sentiments, the
precision was 86% and the recall was 82%.

Logistic Regression:-
This model reached an accuracy of 85%. For negative sentiments, the
precision was 86% and the recall was 84%; for positive sentiments, the
precision was 84% and the recall was 86%.

Random Forest:-
This model reached an accuracy of 83%. For negative sentiments, the
precision was 82% and the recall was 83%; for positive sentiments, the
precision was 83% and the recall was 82%.

From the above figures, we select Logistic Regression as our final model
because it gives the highest accuracy. The precision and recall values
are also similar across all three models.
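For reference, accuracy and the per-class precision and recall quoted above can be computed from true and predicted labels as in the sketch below. The helper name `classification_metrics` and the six toy labels are illustrative only; they are not the project's data.

```python
def classification_metrics(y_true, y_pred, labels=("negative", "positive")):
    """Return overall accuracy and per-class precision and recall."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    per_class = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        per_class[label] = {"precision": precision, "recall": recall}
    return accuracy, per_class


y_true = ["positive", "positive", "positive", "negative", "negative", "negative"]
y_pred = ["positive", "positive", "negative", "negative", "negative", "positive"]
accuracy, per_class = classification_metrics(y_true, y_pred)
print(accuracy, per_class)
```

Precision answers "of the reviews predicted as this class, how many really were?", while recall answers "of the reviews that really were this class, how many did the model find?".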

• Error analysis:-
Multinomial Naive Bayes:-
Multinomial Naive Bayes made 2530 correct predictions and 470 wrong
predictions.

Logistic Regression:-
Logistic Regression made 2549 correct predictions and 451 wrong
predictions.

Random Forest:-
Random Forest made 2504 correct predictions and 496 wrong
predictions.
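As a quick consistency check, the correct and wrong counts above can be converted back into accuracies. The short sketch below uses only the figures reported in this section; the helper name `accuracy_pct` is made up for the example.

```python
# (correct, wrong) prediction counts as reported in the error analysis.
results = {
    "Multinomial Naive Bayes": (2530, 470),
    "Logistic Regression": (2549, 451),
    "Random Forest": (2504, 496),
}


def accuracy_pct(correct, wrong):
    """Accuracy in percent: correct predictions over all predictions."""
    return 100 * correct / (correct + wrong)


for name, (correct, wrong) in results.items():
    total = correct + wrong
    print(f"{name}: {total} predictions, accuracy {accuracy_pct(correct, wrong):.1f}%")
```

Each model's counts add up to 3000 predictions, and the resulting accuracies round to the 84%, 85%, and 83% reported in the model evaluation.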

• Hyperparameter tuning:-
Multinomial Naive Bayes:-
Tuning the hyperparameters increased the accuracy by 1 percentage point,
from 84% to 85%.

Logistic Regression:-
Tuning the hyperparameters of Logistic Regression increased the accuracy
by 1 percentage point, to 86%.

Random Forest:-
Hyperparameter tuning produced no change for Random Forest; the accuracy
remained the same.
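The slides do not say which hyperparameters were tuned or how. One common approach is a cross-validated grid search, sketched below with scikit-learn. Everything specific here is an assumption made for illustration: the toy reviews, the choice of the regularization strength `C` as the tuned parameter, the candidate values, and the two-fold cross-validation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data for illustration only (1 = positive, 0 = negative).
reviews = [
    "great movie loved it", "wonderful acting great story",
    "loved the wonderful cast", "great fun great story",
    "bad movie hated it", "terrible acting boring story",
    "hated the boring plot", "bad plot terrible cast",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorizer and classifier are chained so that each cross-validation fold
# fits its own vocabulary on the training part only.
pipeline = Pipeline([
    ("bow", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Assumed search space: inverse regularization strength of the classifier.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(reviews, labels)
print(search.best_params_, search.best_score_)
```

Putting the vectorizer inside the pipeline keeps the held-out fold from influencing the vocabulary, so the cross-validated score stays an honest estimate.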

• Experimenting with Term Frequency - Inverse Document Frequency (TF-IDF)
Vectorizer:-
Multinomial Naive Bayes:-
Using the TF-IDF vectorizer increased the accuracy by 2 percentage points,
from 84% to 86%.

Logistic Regression:-
Using TF-IDF gave an accuracy 2 percentage points higher, reaching 87%
for Logistic Regression.

Random Forest:-
For Random Forest, the accuracy increased by 1 percentage point when the
TF-IDF vectorizer was used.
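To show what TF-IDF weighting changes compared with plain word counts, here is a minimal sketch in pure Python. It uses one common smoothed definition of inverse document frequency, idf = ln((1 + N) / (1 + df)) + 1, and applies no vector normalization. Which exact variant the project's vectorizer used is not stated, so treat the formula and the function name `tfidf_vectors` as assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return the vocabulary and one TF-IDF vector per document."""
    n_docs = len(tokenized_docs)
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # raw term frequency within this document
        vectors.append([tf[tok] * idf[tok] for tok in vocab])
    return vocab, vectors


docs = [["great", "movie"], ["bad", "movie"]]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print(vectors)
```

In this toy example "movie" appears in every document and receives the lowest weight, while "great" and "bad" each appear in only one document and are weighted higher. That down-weighting of ubiquitous words is the main difference from Bag of Words.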

• Conclusion:- In this project, we assessed the performance of our sentiment
analysis models on a diverse set of movie reviews. Through our investigation, we
gained valuable insight into both the correctly and the incorrectly classified
instances. Our best model, Logistic Regression, achieved an accuracy of 87%.
Positive sentiments were well captured, with a precision of 86% and a recall of
89%. Negative sentiments had a precision of 89% and a recall of 85%.
• Limitations of the model:-
Lack of Context Understanding:- Sentiment analysis models may struggle to
understand the context in which certain words or phrases are used in movie
reviews. For instance, positive words in a sarcastic context might be
misinterpreted.
Overemphasis on Keywords:- Some models may rely heavily on specific
keywords, potentially leading to misclassifications when sentiments are expressed
through less common or synonymous terms.

• Future scope:-
Fine-grained Sentiment Analysis:- Future models can aim for more fine-
grained sentiment analysis, capturing not only positive and negative
sentiments but also nuanced emotions and sentiments across a
spectrum. This could involve incorporating more granular sentiment
categories or intensity levels.
Multimodal Sentiment Analysis:- Integrating information from multiple
modalities, such as text, images, and possibly even video clips from
movie reviews, can provide a more comprehensive understanding of
sentiment. This could enhance the model's accuracy by considering
visual cues and expressions.
Domain-specific Adaptations:- Designing sentiment analysis models
specifically tailored for the domain of movie reviews can lead to improved
accuracy. Considering film-related terminology, genre-specific sentiments,
and understanding cinematic nuances can enhance the model's
performance in this context.
