Predicting Movie Success on IMDb: A Data-Driven Approach

CONFIDENTIAL: The information in this document belongs to Boston Institute of Analytics LLC. Any unauthorized sharing of this
material is prohibited and subject to legal action under breach of IP and confidentiality clauses.
PREDICTING MOVIE SUCCESS

AGENDA
• Introduction
• Domain
• Data preprocessing
• EDA
• Model training
• Feature importance
• ROC AUC
• Questions & answers

INTRODUCTION
🔹 Movie industry data holds significant insights that can be utilized for
predicting box-office performance, helping stakeholders make informed
decisions.
🔹 IMDb scores, influenced by factors such as genre, budget, and
audience preferences, are a key indicator of a movie’s success.
🔹 Machine learning, with its ability to analyze large datasets and detect
patterns, offers a powerful approach for predicting whether a movie will
be a hit, average, or a flop.
🔹 Through a data-driven approach and predictive modeling, this
presentation showcases my project on predicting movie success based on
IMDb scores.

Why use ML models in the movie domain?
• In the movie industry, machine learning (ML) models can be applied across a wide variety of tasks, such as
predicting box office success, recommending content, or understanding audience sentiment. Here are
some common ML models used in the movie domain:
• Linear Regression: Used to predict continuous outcomes, such as movie box office earnings, based on
factors like budget, cast, genre, and runtime.
• Polynomial Regression: An extension of linear regression that captures nonlinear relationships,
which is useful when predicting complex metrics like audience scores or box office revenue.
• Logistic Regression: Used to classify whether a movie will be a "hit" or "flop" based on historical data.
• Random Forest: Commonly used for classification tasks like predicting whether a movie will be successful.
It can also rank feature importance (e.g., how much a feature like actor popularity contributes to a movie’s
success).
• Decision Trees: Helpful for visualizing decision-making processes, decision trees can predict movie success
categories (e.g., "hit", "average", "flop").
• Support Vector Machines (SVM): Effective for classifying movies into categories like hit, average, or flop by
finding the optimal separating boundary between them.
• Naive Bayes: Useful for text-related tasks like sentiment analysis or classification based on movie reviews,
plot summaries, or scripts.
• Gradient Boosting Machines (GBM): Boosted decision trees that can improve prediction accuracy by
combining multiple weak models. These can be used to predict box office earnings, success probability, or
audience engagement.

DATA PREPROCESSING
• Data cleaning:
• Missing values: Only about 2.5% of the values in my dataset were
missing, but gross, a major factor in predicting movie success, had
the most missing values, so I used a linear regression model to fill
them in.
• Feature and Target Variable: X holds the feature columns of the
dataset, and y holds the movie success label.
• Data Splitting: The dataset was split into 80% training data
and 20% testing data to evaluate model performance, using the
train_test_split function from sklearn.
• Scaling: StandardScaler was used to scale the feature values
for better model performance.
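The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the data is synthetic, and the column names (`budget`, `duration`, `gross`, `success`) and the choice of predictors for the imputation model are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the movie data; all column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "budget": rng.uniform(1e6, 2e8, n),
    "duration": rng.uniform(80, 180, n),
})
df["gross"] = 1.5 * df["budget"] + rng.normal(0, 1e7, n)
df["success"] = rng.integers(0, 3, n)  # three class labels: 0, 1, 2
df.loc[rng.choice(n, size=20, replace=False), "gross"] = np.nan

# Fill missing gross values with a linear regression fitted on complete rows.
predictors = ["budget", "duration"]
known = df["gross"].notna()
imputer = LinearRegression().fit(df.loc[known, predictors], df.loc[known, "gross"])
df.loc[~known, "gross"] = imputer.predict(df.loc[~known, predictors])

# X holds the features, y holds the movie success label.
X = df[["budget", "duration", "gross"]]
y = df["success"]

# 80% training data, 20% testing data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on the training split alone keeps test-set statistics out of the model, which is a common design choice rather than something the slides state.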

EDA

MODEL TRAINING
• Training Data: The model was trained on the training data; the
train and test sets were produced with the train_test_split method.
• Hyperparameter Tuning: Using GridSearchCV, we tuned the model
over different hyperparameters, calculated the accuracy score and
ROC/AUC curves for the model, and chose the best parameters.
Accuracy Score: 0.9371980676328503
Classification report:
              precision    recall  f1-score   support
0                  0.80      0.57      0.67        28
1                  0.94      0.99      0.97       375
2                  1.00      0.09      0.17        11
accuracy                               0.94       414
macro avg          0.91      0.55      0.60       414
weighted avg       0.94      0.94      0.92       414
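A hedged sketch of the tuning and evaluation workflow described above. It assumes the tuned model is a Random Forest classifier; the synthetic data and the parameter grid are illustrative assumptions, not the grid used in the project, so the printed numbers will differ from the report above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic three-class data standing in for the movie features (assumption).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Hypothetical grid; the project's actual hyperparameters are not given.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Evaluate the best model found by the search on the held-out test set.
y_pred = search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Best parameters:", search.best_params_)
print("Accuracy Score:", accuracy)
print(report)
```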

FEATURE IMPORTANCE
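Feature importances of the kind this slide refers to can be read from a fitted Random Forest. The following is a minimal sketch on synthetic data; the feature names are placeholders, and the ranking it prints says nothing about the project's actual features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder feature names and synthetic data (assumptions for illustration).
feature_names = [f"feature_{i}" for i in range(6)]
X, y = make_classification(
    n_samples=500,
    n_features=6,
    n_informative=4,
    n_redundant=2,
    n_classes=3,
    random_state=0,
)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances, sorted from most to least important.
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```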

ROC AUC
1) Comparison of model performance across classes:
• The ROC curves compare the classification performance for the
three classes: average, hit, and flop.
• All three classes perform well, with high AUC (Area Under the
Curve) values:
I. Average: 0.92
II. Hit: 0.92
III. Flop: 0.92
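Per-class AUC values like those listed above are typically computed one-vs-rest: each class in turn is treated as the positive class against the other two. A minimal sketch on synthetic data follows; the Random Forest model and the integer class labels 0, 1, 2 are assumptions, and their mapping to average, hit, and flop is not specified in the slides.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic three-class data (assumption for illustration).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predicted class probabilities; columns follow the order of model.classes_.
proba = model.predict_proba(X_test)

# One indicator column per class, so each class gets its own ROC AUC.
y_bin = label_binarize(y_test, classes=model.classes_)
auc_per_class = {
    int(label): roc_auc_score(y_bin[:, i], proba[:, i])
    for i, label in enumerate(model.classes_)
}
print(auc_per_class)
```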
Model Selection Insights:
• High accuracy: Random Forest often achieves higher accuracy
than a single decision tree. By combining the outputs of
multiple decision trees, it reduces overfitting and
improves predictive performance, especially on
complex datasets.
• Handles missing values well: Some Random Forest implementations
can handle datasets with missing values by filling them with the
median (for numeric data) or mode (for categorical
data) during training, which makes the method suitable for
real-world datasets that may have incomplete entries.

Questions ?

Thank You!
