Movie Breakeven Analysis in the U.S. Market
Liu Jialin | Priyadarshini Majumdar | Zhang Jiexi
Data Analytics Lab Project Challenge from Nov 23rd onwards at a theatre near YOU
INTRODUCTION
METHODOLOGY
What plays the most important role
in making a movie profitable?
Movie technical
Language 4.3/10
Content rating 4/10
Aspect ratio 3/10
Budget 2.5/10
Duration 1.5/10
Colour or B&W 1/10
IMDB website Influence
No of IMDB users who voted 9/10
No of users reviewed 8/10
No of critics for reviews 6/10
IMDB score 5/10
Facebook influence
• Movie Facebook likes 4.5/10
• Actor 3 Facebook likes > Actor 2 > Actor 1
• Cast total Facebook likes 3.5/10
• Director Facebook likes 3.2/10
Poster and Promotional
materials
No of faces in a poster 2.6/10
objectives
Data Processing
Remove repetitive entries in JMP.
Calculate gross profit = (gross − budget) / budget, as a percentage.
Create the binary Profit/Loss target
variable and remove missing
values.
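The JMP steps above can be sketched in plain Python. This is only an illustration, not the JMP workflow itself: the field names (`title`, `gross`, `budget`), the sample records, and the rule that a movie counts as profitable when gross is at least equal to budget are all assumptions.

```python
# Hypothetical records; the field names and values are assumptions for illustration.
movies = [
    {"title": "A", "gross": 150.0, "budget": 100.0},
    {"title": "A", "gross": 150.0, "budget": 100.0},  # repeated entry
    {"title": "B", "gross": 40.0, "budget": 80.0},
    {"title": "C", "gross": None, "budget": 50.0},    # missing gross
]


def prepare(records):
    """Drop repeated and incomplete rows, then add profit % and a binary target."""
    seen = set()
    cleaned = []
    for row in records:
        key = (row["title"], row["gross"], row["budget"])
        if key in seen:
            continue  # repeated entry
        seen.add(key)
        if row["gross"] is None or row["budget"] is None or row["budget"] == 0:
            continue  # profit cannot be computed
        profit_pct = (row["gross"] - row["budget"]) / row["budget"] * 100
        # Assumed rule: profit (1) when gross covers the budget, loss (0) otherwise.
        cleaned.append({**row, "profit_pct": profit_pct, "profit": int(profit_pct >= 0)})
    return cleaned


prepared = prepare(movies)
for row in prepared:
    print(row["title"], row["profit_pct"], row["profit"])
```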
SAS Enterprise Miner:
• Import the JMP file using File
Import and Save Data nodes.
• Change the level for Aspect Ratio
to nominal in the File Import node.
• Conduct text parsing,
text clustering and text filtering on
plot keywords and genres.
• Use Multiplot node to view the
distribution of the variables.
• Recode missing values and
erroneous entries using
Replacement node.
• Sample the data into Training Set
and Validation Set using the Data
Partition node.
Before running the parametric
models, fill in all missing values
using the Impute node and transform
the interval variables with skewed
distributions using the Transform
node.
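The partition, impute and transform steps can be sketched as follows. The 70/30 split, the median fill and the log transform are assumptions chosen for illustration; the poster does not state which settings were used in the Data Partition, Impute and Transform nodes.

```python
import math
import random
import statistics


def partition(rows, train_fraction=0.7, seed=42):
    """Randomly split rows into training and validation sets (assumed 70/30)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def impute_median(values, fill=None):
    """Replace None with the median of the observed values (or a supplied fill)."""
    if fill is None:
        fill = statistics.median(v for v in values if v is not None)
    return [fill if v is None else v for v in values], fill


def log_transform(values):
    """Compress a right-skewed, non-negative variable with log(1 + x)."""
    return [math.log1p(v) for v in values]


train, valid = partition(list(range(10)))
budgets, train_median = impute_median([1.0, None, 3.0])
print(len(train), len(valid), budgets, log_transform(budgets))
```

Reusing the training-set median (`fill=train_median`) on the validation set keeps validation data from influencing the imputation.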
Predictive Model Construction
Decision Tree
As a nonparametric algorithm, a decision tree can fit a large number of
functional forms and map observations to categorical targets.
Model Comparison
Conclusion
Background:
The movie industry is one of the top-grossing industries in the world today;
in the U.S. alone it was a 38-billion-dollar market as of 2016.
Motivation:
IMDB is one of the most visited sites that viewers consult when
deciding whether to watch a movie. Hence it has a direct effect
on whether a movie makes a profit or a loss.
Primary Objective: To develop a model that can predict whether a movie will
break even in the U.S. market or not.
Secondary Objective: To inform promoters who use social media for movie
promotion which factors affect the outcome of a movie.
Confusion Matrix for Model Comparison
Gradient Boosting
A Gradient Boosting model builds a strong learner from a set of weak
learner trees, using a gradient descent algorithm. It is computationally intensive and
performs very well for a moderate number of variables after fine-tuning.
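A minimal sketch of the boosting idea, assuming a single input feature, one-split trees (stumps) as the weak learners, and log-loss as the objective. It is not the SAS Enterprise Miner implementation; the learning rate, round count and toy data are hypothetical.

```python
import math

LEARNING_RATE = 0.5  # assumed shrinkage factor


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, residuals):
    """Pick the single threshold that best fits the residuals in squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, rounds=50):
    """Each round fits a stump to the pseudo-residuals y - p of the current model."""
    scores = [0.0] * len(xs)  # running log-odds per observation
    stumps = []
    for _ in range(rounds):
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        scores = [
            s + LEARNING_RATE * (left_value if x <= threshold else right_value)
            for x, s in zip(xs, scores)
        ]
    return stumps


def predict_proba(stumps, x):
    """Sum the shrunken stump outputs and map the log-odds to a probability."""
    score = sum(LEARNING_RATE * (lv if x <= t else rv) for t, lv, rv in stumps)
    return sigmoid(score)


# Toy data: the target switches from 0 to 1 between x = 3 and x = 4.
stumps = boost([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
print(predict_proba(stumps, 2), predict_proba(stumps, 5))
```

Each stump is weak on its own; the summed, shrunken corrections form the strong learner.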
Logistic Regression
Logistic regression describes the relationship between a categorical target variable and
the independent variables by estimating the probability from a cumulative logistic
distribution.
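As a small illustration, the probability comes from passing a weighted sum of the inputs through the logistic function. The weights and inputs below are hypothetical, not coefficients estimated in this project.

```python
import math


def predict_probability(features, weights, intercept):
    """P(target = 1) from the cumulative logistic distribution of a linear score."""
    linear = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical standardised inputs, e.g. number of voting users and budget.
weights = [1.2, -0.4]
p_low = predict_probability([-1.0, 0.5], weights, intercept=0.0)
p_high = predict_probability([1.0, 0.5], weights, intercept=0.0)
print(p_low, p_high)
```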
Neural Network
A neural network is a parametric model that accommodates a wide variety of nonlinear
relationships. It also keeps in check the curse of dimensionality,
which bedevils attempts to model nonlinear functions with a large number of variables.
Data set
5043 movie titles, 28 variables
The data set was scraped from
IMDB using Python’s Scrapy
library. This resulted in 5043
observations of 28 variables.
Random Forest
A random forest is an ensemble of decision trees. It averages the predicted probabilities of
a large number of overtrained decision trees, and is therefore more robust against overfitting
and generalizes better than a single decision tree.
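The averaging step can be sketched as below. The three "trees" are hypothetical hand-written rules standing in for overtrained decision trees, and the feature names are assumptions.

```python
def forest_probability(trees, features):
    """Average the per-tree predicted probabilities of the positive class."""
    probabilities = [tree(features) for tree in trees]
    return sum(probabilities) / len(probabilities)


# Hypothetical overtrained trees: each returns a hard 0/1 probability.
trees = [
    lambda f: 1.0 if f["num_voted_users"] > 50000 else 0.0,
    lambda f: 1.0 if f["num_user_reviews"] > 200 else 0.0,
    lambda f: 1.0 if f["budget"] < 20_000_000 else 0.0,
]

movie = {"num_voted_users": 80000, "num_user_reviews": 350, "budget": 60_000_000}
probability = forest_probability(trees, movie)
print(probability)
```

Individual trees may be confidently wrong, but their errors tend to average out across the ensemble.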
Most influential factors
2nd most influential factors
3rd most influential factors
Least influential factor
Target percentages show how accurate the models’ predictions are likely to
be on future data. Outcome percentages, on the other hand,
indicate the accuracy of the models’ predictions on the sample data set. For
Gradient Boosting and Neural Network, the Outcome 1/1 percentages
are above 75%, which means the models correctly identified more than
75% of the breakeven movies. The Target 1/1 percentages are above
70%, which means the models’ predictions are reliable. Hence, Gradient
Boosting and Neural Network are the models chosen to predict the
breakeven status of future movies in the U.S. market.
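To make the two percentages concrete, the sketch below computes them from confusion-matrix counts, reading them the way the paragraph above interprets them: Target 1/1 as the share of predicted breakeven movies that actually broke even, and Outcome 1/1 as the share of actual breakeven movies that were predicted correctly. This reading and the counts are assumptions for illustration, not results from the poster.

```python
def target_and_outcome_pct(tp, fp, fn):
    """Return (Target 1/1 %, Outcome 1/1 %) from confusion-matrix counts."""
    target_pct = 100.0 * tp / (tp + fp)   # predicted 1 that are actually 1
    outcome_pct = 100.0 * tp / (tp + fn)  # actual 1 that were predicted 1
    return target_pct, outcome_pct


# Hypothetical counts: 60 true positives, 20 false positives, 15 false negatives.
target_pct, outcome_pct = target_and_outcome_pct(tp=60, fp=20, fn=15)
print(target_pct, outcome_pct)
```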
Misclassification rate takes both the false positives
and the false negatives into consideration. Of
all the models, Gradient Boosting has the
lowest misclassification rate. This is not
surprising given its refined algorithm, which
seeks to minimise the intermediate
pseudo-residuals rather than relying on a single
splitting criterion as in Decision Tree and
Random Forest. Neural Network 2 performs
second best, suggesting that its complex
algorithm, which imitates the human mind, does
have some advantage in building predictive
models.
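As a quick worked example, the misclassification rate is the share of all predictions that are false positives or false negatives. The counts are hypothetical and are not taken from the poster's confusion matrices.

```python
def misclassification_rate(tp, fp, fn, tn):
    """Fraction of observations whose predicted class differs from the actual class."""
    return (fp + fn) / (tp + fp + fn + tn)


# Hypothetical counts for 200 validation movies.
rate = misclassification_rate(tp=60, fp=20, fn=15, tn=105)
print(rate)
```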
The analysis and data set rely heavily on online data, since the data is extracted
from a movie rating website. Online data, however, is not the only defining factor.
• Hence, further analysis on predicting movie successes should also take into
consideration traditional promotional channels such as theatre data.
• Additionally, this data was collected over a period of time, and a movie's
popularity grows over time. Hence, for a more
accurate analysis, time-stamps of the metrics must be collected and taken into
consideration.
• The most important insight from the above predictive analysis is that
online popularity of a movie is the best indicator of its success
• IMDB is a sought after site for movie opinions and hence movie votes,
critic reviews and general public reviews are the greatest influencers
• Among Facebook likes, Actor 3 Facebook likes are a better indicator than
Actor 2 and Actor 1 Facebook likes.
$$\text{gross profit} = \frac{\text{gross} - \text{budget}}{\text{budget}}\ \%$$
future work