CONFIDENTIAL: The information in this document belongs to Boston Institute of Analytics LLC. Any unauthorized sharing of this
material is prohibited and subject to legal action under breach of IP and confidentiality clauses.
PREDICTING MOVIE SUCCESS
AGENDA
• Introduction
• Domain
• Data preprocessing
• EDA
• Model training
• Feature importance
• ROC AUC
• Questions & answers
INTRODUCTION
🔹 Movie industry data holds significant insights that can be utilized for
predicting box-office performance, helping stakeholders make informed
decisions.
🔹 IMDb scores, influenced by various factors such as genre, budget, and
audience preferences, are a key indicator of a movie’s success.
🔹 Machine learning, with its ability to analyze vast datasets and detect
patterns, offers a powerful approach for predicting whether a movie will
be a hit, average, or a flop.
🔹 Through a data-driven approach and predictive modeling, this presentation
showcases my project on predicting movie success based on IMDb scores.
Why use ML models in the movie domain?
• In the movie industry, machine learning (ML) models can be applied across a wide variety of tasks, such as
predicting box office success, recommending content, or understanding audience sentiment. Here are
some common ML models used in the movie domain:
• Linear Regression: Used to predict continuous outcomes, such as movie box office earnings, based on
factors like budget, cast, genre, and runtime.
• Polynomial Regression: A variant of linear regression, this model can capture nonlinear relationships,
which are useful when predicting complex metrics like audience scores or box office revenue.
• Logistic Regression: Used to classify whether a movie will be a "hit" or "flop" based on historical data.
• Random Forest: Commonly used for classification tasks like predicting whether a movie will be successful.
It can also rank feature importance (e.g., how much a feature like actor popularity contributes to a movie’s
success).
• Decision Trees: Helpful for visualizing decision-making processes, decision trees can predict movie success
categories (e.g., "hit", "average", "flop").
• Support Vector Machines (SVM): Effective for classifying movies into categories like hit, average, or flop by
finding the optimal separating boundary between them.
• Naive Bayes: Useful for text-related tasks like sentiment analysis or classification based on movie reviews,
plot summaries, or scripts.
• Gradient Boosting Machines (GBM): Boosted decision trees that can improve prediction accuracy by
combining multiple weak models. These can be used to predict box office earnings, success probability, or
audience engagement.
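As a rough illustration of how several of the classifiers listed above could be compared, the sketch below runs 5-fold cross-validation with scikit-learn. The data is synthetic (generated with make_classification), and the three classes only stand in for flop / average / hit; none of the numbers it prints come from the project dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a movie table (assumption): 3 classes, 8 features.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, random_state=0,
)

# A few of the model families named in the list above.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate model.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```

Cross-validation is used here so that each model is scored on data it was not trained on; which model comes out ahead depends on the dataset.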
DATA PREPROCESSING
• Data cleaning:
• Missing values: I found that about 2.5% of the values in my
dataset were missing. The gross column had the most missing
values, and gross is a major influencing factor for predicting
movie success, so I used a linear regression model to fill in
the missing gross values.
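A minimal sketch of regression-based imputation for the gross column, assuming pandas and scikit-learn. The column names other than gross, and all values, are hypothetical; the predictors actually used in the project are not stated on the slide.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data; "budget" and "num_voted_users" are illustrative assumptions.
movies = pd.DataFrame({
    "budget": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "num_voted_users": [100.0, 150.0, 300.0, 350.0, 500.0, 550.0],
    "gross": [25.0, 45.0, np.nan, 85.0, np.nan, 125.0],
})

predictors = ["budget", "num_voted_users"]  # must not contain missing values
known = movies["gross"].notna()

# Fit on rows where gross is observed, then predict the rows where it is missing.
gross_model = LinearRegression()
gross_model.fit(movies.loc[known, predictors], movies.loc[known, "gross"])
movies.loc[~known, "gross"] = gross_model.predict(movies.loc[~known, predictors])

print(movies)
```

The regression is fitted only on rows with an observed gross, so the filled values are predictions and carry the model's error with them.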
• Feature and Target Variable: X holds the feature columns of
the dataset, and y holds the target, the movie success label.
• Data Splitting: The dataset was split into 80% training data
and 20% testing data to evaluate model performance, using
the train_test_split function from sklearn.
• Scaling: StandardScaler was used to scale the feature values
for better model performance.
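The split and scaling steps above can be sketched as follows. The data is synthetic, and two details are assumptions rather than something the slide states: the split is stratified by class, and the scaler is fitted on the training set only and then applied to the test set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix X and success label y (assumption).
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4,
    n_classes=3, random_state=42,
)

# 80% training / 20% testing, as on the slide; stratify keeps class ratios similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y,
)

# Learn mean and standard deviation from the training data only,
# then apply the same transformation to the test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training portion only keeps information from the test set out of the preprocessing.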
EDA
MODEL TRAINING
• Training Data: The model was trained on the training set
produced by the train_test_split method.
• Hyperparameter Tuning: Using GridSearchCV, we tuned the
model over different hyperparameter values, calculated the
accuracy score and ROC/AUC curves for the model, and
chose the best parameters.
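A minimal sketch of the tuning step, assuming a RandomForestClassifier and a small hypothetical parameter grid; the grid actually searched in the project is not given on the slide, and the data here is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic 3-class data standing in for the movie dataset (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y,
)

# Hypothetical grid; every combination is scored with 3-fold cross-validation.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid, cv=3, scoring="accuracy",
)
search.fit(X_train, y_train)

# The best combination is refit on the full training set and used for prediction.
y_pred = search.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

print("Best parameters:", search.best_params_)
print("Test accuracy:", test_accuracy)
print(classification_report(y_test, y_pred))
```

The test set is touched only once, after the search, so the reported accuracy is not influenced by the tuning itself.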
Accuracy Score: 0.9371980676328503
Classification report:
              precision   recall   f1-score   support
0                  0.80     0.57       0.67        28
1                  0.94     0.99       0.97       375
2                  1.00     0.09       0.17        11
accuracy                               0.94       414
macro avg          0.91     0.55       0.60       414
weighted avg       0.94     0.94       0.92       414
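The averaged rows of the report can be checked by hand from the per-class rows. The sketch below recomputes the macro averages and the weighted recall from the rounded values shown in the report; because the inputs are already rounded, recomputed weighted values can differ from the report in the last digit.

```python
# Per-class values copied from the classification report above (classes 0, 1, 2).
precision = [0.80, 0.94, 1.00]
recall = [0.57, 0.99, 0.09]
f1 = [0.67, 0.97, 0.17]
support = [28, 375, 11]


def macro_avg(values):
    """Unweighted mean: every class counts equally, however rare it is."""
    return sum(values) / len(values)


def weighted_avg(values, weights):
    """Mean weighted by class support: large classes dominate."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


macro_precision = macro_avg(precision)
macro_recall = macro_avg(recall)
macro_f1 = macro_avg(f1)
weighted_recall = weighted_avg(recall, support)

print(f"macro precision: {macro_precision:.2f}")
print(f"macro recall:    {macro_recall:.2f}")
print(f"macro f1:        {macro_f1:.2f}")
print(f"weighted recall: {weighted_recall:.2f}")
```

The gap between the macro recall (0.55) and the weighted recall (0.94) reflects the class imbalance: class 1 has 375 of the 414 test samples, so it dominates the weighted average and the overall accuracy, while classes 0 and 2 have much lower recall.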
FEATURE IMPORTANCE
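A minimal sketch of how a feature-importance ranking could be produced, assuming a scikit-learn RandomForestClassifier and its feature_importances_ attribute. The feature names and the synthetic data are hypothetical and do not reproduce the project's ranking.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names attached to synthetic data (assumption).
feature_names = ["budget", "gross", "duration", "num_voted_users", "num_critic_reviews"]
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=2,
    n_classes=3, random_state=0,
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = pd.Series(
    model.feature_importances_, index=feature_names,
).sort_values(ascending=False)

print(importances)
```

Impurity-based importances describe how much each feature was used by the fitted trees; they do not by themselves show a causal effect on movie success.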
ROC AUC
1) Comparison of model performance across classes:
• The ROC curves compare the classification performance for
the three classes: average, hit, and flop.
• All three classes reach a high AUC (Area Under the Curve):
I. Average: 0.92
II. Hit: 0.92
III. Flop: 0.92
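Per-class AUC values like the ones above are typically computed one-vs-rest: each class in turn is treated as the positive class and the other two as negative. A minimal sketch, assuming scikit-learn, synthetic data, and a hypothetical mapping of the labels 0, 1, 2 to class names:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

class_names = ["flop", "average", "hit"]  # assumed mapping for labels 0, 1, 2

X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y,
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
proba = model.predict_proba(X_test)  # one probability column per class

# One indicator column per class: 1 if the sample belongs to it, else 0.
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])

# One-vs-rest AUC: score each class's probability against its indicator column.
per_class_auc = {
    name: roc_auc_score(y_test_bin[:, i], proba[:, i])
    for i, name in enumerate(class_names)
}

for name, auc in per_class_auc.items():
    print(f"{name}: {auc:.2f}")
```

AUC is computed from predicted probabilities rather than hard labels, so a class can have a high AUC and still show low recall at the default decision threshold.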
Model Selection Insights:
• Random Forest often achieves high accuracy compared
to a single decision tree. By combining the outputs of
multiple decision trees, it reduces overfitting and
improves predictive performance, especially on
complex datasets.
• Handles missing values well: Some Random Forest
implementations can fill missing values with the
median (for numeric data) or mode (for categorical
data) during training, which makes the method suitable
for real-world datasets with incomplete entries.
Support varies by library, so imputation before
training may still be needed.
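Where the library does not fill missing values itself, the median / mode strategy described above can be applied explicitly before training. A minimal sketch, assuming scikit-learn's SimpleImputer; the two toy columns (a numeric budget and a categorical rating) are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy columns (assumption): one numeric, one categorical, each with a gap.
budget = np.array([[10.0], [np.nan], [30.0], [50.0]])
rating = np.array([["PG"], ["R"], [np.nan], ["R"]], dtype=object)

# Median for numeric data, most frequent value (mode) for categorical data.
numeric_imputer = SimpleImputer(strategy="median")
categorical_imputer = SimpleImputer(strategy="most_frequent")

filled_budget = numeric_imputer.fit_transform(budget)
filled_rating = categorical_imputer.fit_transform(rating)

print(filled_budget.ravel())
print(filled_rating.ravel())
```

As with scaling, the imputers would normally be fitted on the training data and then applied unchanged to the test data.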
Questions?
Thank You!

Predicting Movie Success on IMDb: A Data-Driven Approach
