An exploration of machine learning models to predict movie profitability - originally created as a term project for TECH-GB 2336 (Technical Data Science for Business) @ NYU Stern, and based on the Kaggle competition [TMDB Box Office Prediction](https://www.kaggle.com/c/tmdb-box-office-prediction).
2. Why are we here?
The film industry has a serious business problem
■ Successful investing in movie production is hard: upwards of 50% of all films lose money for their backers (i.e. financially backing films is no better than random chance)
■ However, there is a lot of public data available: www.themoviedb.org has millions of datapoints on over 600k movies (and 1.9m people in the film industry)
■ Improving decision-making outcomes in this industry is a business problem that is ripe for DATA ANALYTICS
■ We propose a supervised learning approach to generate binary classifications of profitability, using only information known at the pitch stage (i.e. pre-production)
■ Predictive accuracy of > 80% is achievable using our proprietary model*
*Based on model backtesting on a randomly-sampled 20% hold-out set of films released from 1960 to 2017
3. Data
We obtained raw data on ~5000 films released from 1960 to 2017 from public sources.

Detailed descriptive data on Cast, Crew, Genre, Financials, Production Company, Language, Filming Location and more were extracted.

Substantial data pre-processing and cleaning was performed.
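TMDB exports store nested attributes (cast, crew, genres) as stringified JSON, so cleaning typically starts by parsing those columns. This is a minimal sketch with a made-up sample row; the real pipeline's column names and parsing rules may differ:

```python
import ast
import pandas as pd

def parse_json_column(series: pd.Series) -> pd.Series:
    """Parse TMDB-style stringified JSON lists into Python objects;
    missing values become empty lists."""
    return series.apply(lambda s: ast.literal_eval(s) if isinstance(s, str) else [])

# Hypothetical rows mimicking the TMDB 'genres' column format
df = pd.DataFrame({"genres": ['[{"id": 18, "name": "Drama"}]', None]})

# Flatten the parsed objects down to the attribute we care about
df["genre_names"] = parse_json_column(df["genres"]).apply(
    lambda lst: [d["name"] for d in lst]
)
```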
4. EDA + Feature Engineering
We performed extensive exploratory data analysis in order to form hypotheses and identify potentially important features
We then constructed many novel features from the raw data – for example, the average revenue and profitability of previous movies for each key cast and crew member.

The final result was a clean, scaled, binarized dataset of 4803 rows, with 346 predictor columns (mostly one-hots).
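One way to build the "average revenue of previous movies per cast/crew member" feature without leaking a film's own revenue into its predictor is a shifted expanding mean per person. A minimal sketch with made-up numbers; the real column names and join logic are assumptions:

```python
import pandas as pd

# Hypothetical long-format table: one row per (film, cast/crew member)
credits = pd.DataFrame({
    "person":       ["A", "A", "A", "B", "B"],
    "release_date": pd.to_datetime(
        ["2001-01-01", "2005-01-01", "2010-01-01", "2004-01-01", "2009-01-01"]),
    "revenue":      [100, 200, 300, 50, 150],
})

credits = credits.sort_values(["person", "release_date"])
# Expanding mean over *previous* films only: shift(1) excludes the
# current film's own revenue, avoiding target leakage
credits["prior_avg_revenue"] = (
    credits.groupby("person")["revenue"]
    .transform(lambda s: s.shift(1).expanding().mean())
)
```

A person's first film has no history, so the feature is NaN there and needs an imputation rule downstream.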
5. Modelling
We explored the effectiveness of predictive modelling along 2 dimensions:
1) Both binary (positive class = top-quartile RoI) and multi-class classification (“Hit” = top-quartile RoI, “Loss” = RoI < 0, “Neutral” = everything in between) problems
2) Different types of data mining algorithms (Decision Trees, Logistic Regression, Support Vector Machines, K-Nearest Neighbors, Random Forest, Bagged Decision Trees) were evaluated
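The class labels in 1) can be derived from RoI with two threshold rules. A minimal sketch with illustrative RoI values (the actual top-quartile boundary is a property of the real data, not reproduced here):

```python
import numpy as np
import pandas as pd

roi = pd.Series([5.0, -0.2, 1.1, 0.4, 3.2])  # illustrative RoI values
hit_threshold = roi.quantile(0.75)           # top-quartile boundary

# "Hit" = top-quartile RoI, "Loss" = RoI < 0, "Neutral" = the rest
labels = pd.Series(
    np.select(
        [roi >= hit_threshold, roi < 0],
        ["Hit", "Loss"],
        default="Neutral",
    )
)
```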
Grid searching was employed for hyper-parameter tuning, and models were evaluated via k-fold cross-validation on training data – primarily based on classification accuracy and ROC AUC – before final testing on a hold-out data set.

The most effective models were found to be tree-based: Random Forest and Bagged Decision Trees were the highest-performing models.
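The tuning-and-evaluation setup described above (grid search, k-fold cross-validation on the training split, ROC AUC scoring, then a final hold-out test) can be sketched with scikit-learn. The synthetic data and the parameter grid here are placeholders, not the project's actual configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real 4803 x 346 feature matrix
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Grid search over a small illustrative hyper-parameter grid, scored by
# ROC AUC under 5-fold cross-validation on the training split only
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, None]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X_train, y_train)
holdout_auc = grid.score(X_test, y_test)  # final test on the hold-out set
```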
6. Evaluation – Multiclass Classification
Model performance on the multi-class classification problem proved disappointing:
In both of these (best-performing!) models the classifier struggled to identify the majority of the target class (“Hit”), or to differentiate between “Hits”, “Losses” and the majority class (approx. 60% of the data set) of “Neutral”.

We posit that this is because the boundaries between classes are hard and somewhat artificial – i.e. RoI = 2.905 is a “Hit” but RoI = 2.895 is “Neutral” – and as such do not represent natural clusters in the decision space.

This level of performance is unlikely to be useful to the business end user.
(Confusion matrices shown for the Bagged Decision Tree and Random Forest models)
7. Evaluation – Binary Classification
Model performance on the binary classification problem – with the boundary of the positive class set at top-quartile return on investment (RoI) – was, however, highly satisfactory. Random Forest was the top-performing model evaluated.
The tuned RF model scored well on the most important criteria for the business problem:
■ Precision = TP / (TP + FP) = when we predict a movie will be profitable, it is ~74% of the time
■ ROC AUC = degree of separability between predictions of the two classes = the average positive prediction has only ~19% of negative examples scored higher than it
■ Lift = ratio of results obtained with and without the model
For our specific business problem we are less concerned with recall (what proportion of the TPs we identify), as “passing” on a movie that turns out to be profitable carries only an opportunity cost, not a real cost.
These results should be considered in the context of the business problem: e.g. >50% of movies made lose money, and the base rate of “profitability” (as defined here) in our population is only 25%.
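The three metrics above can be computed directly from predicted scores. The values in this toy example are made up, not the project's predictions, and lift is shown here under one common definition (precision among the top-k scored examples divided by the base rate):

```python
import numpy as np
from sklearn.metrics import precision_score, roc_auc_score

# Illustrative ground truth and model scores
y_true  = np.array([1, 0, 1, 0, 1, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.4, 0.1])
y_pred  = (y_score >= 0.5).astype(int)

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
auc = roc_auc_score(y_true, y_score)

# Lift at the top k scored examples: hit rate among them vs. base rate
k = 4
top_k = np.argsort(y_score)[::-1][:k]
lift = y_true[top_k].mean() / y_true.mean()
```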
8. Business Use + Conclusions
Using population averages for film production cost and revenue, we can evaluate the model in an expected value framework to assess potential real-world business impact:
■ Funding the top 15% of movies we see, as ranked by our tuned RF binary classifier, would result in optimal expected return on investment (total expected profit / total expected investment)
■ Funding the top 35% of movies would result in slightly higher total profit, but at a cost of significantly higher capital required for investment
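The expected-value comparison can be framed as a small ranking function. The per-film cost and revenue figures below are placeholders, not the population averages actually used in the analysis:

```python
import numpy as np

def expected_roi(probs, labels, frac, cost=40.0, hit_revenue=200.0, miss_revenue=20.0):
    """Fund the top `frac` of films ranked by model score; return
    (total expected profit, RoI) under fixed per-film economics.
    Cost/revenue figures are illustrative assumptions, not project data."""
    order = np.argsort(probs)[::-1]           # best-scored films first
    n = max(1, int(len(probs) * frac))        # number of films funded
    funded = np.asarray(labels)[order[:n]]
    revenue = np.where(funded == 1, hit_revenue, miss_revenue).sum()
    invested = n * cost
    return revenue - invested, (revenue - invested) / invested
```

Sweeping `frac` over a grid and plotting total profit against RoI reproduces the 15%-vs-35% trade-off described above.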
Analysis of feature importance provides a useful guide to investors as to which attributes are correlated with high profitability:
(Feature importance chart: “Good” features vs. “Bad” features)
Visualisation of sample decision trees can help stakeholders interpret model predictions.
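A fitted tree can be rendered for stakeholders either graphically (`sklearn.tree.plot_tree`) or as plain text; a minimal sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; a shallow tree keeps the rendering readable
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Text rendering of the fitted decision rules, one split per line
print(export_text(tree, feature_names=[f"f{i}" for i in range(4)]))
```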
Deployment Considerations:
■ Accuracy of source data (esp. on budgets, and for older films)
■ Model trained on pre-Covid data
■ Expected value assumes fixed cost and revenue per film
■ Profit curve ignores sequencing -> too many FPs on big-budget movies could lead to bankruptcy!
Opportunities for Enhancement:
■ Obtain more data
■ Refine the multi-class model
■ Additional dimensionality reduction