Building a Movie Success Predictor
Youness Lahdili
TED University
Project Paper for
CMPE542 - Machine Learning
Prof. Venera Adanova
Abstract— Movie making is a multibillion-dollar
industry. In 2018, the global movie business
generated nearly $41.5 billion in box office and even
more in merchandise revenues. But it is not a
guaranteed business: every year we witness big-budget
blockbusters that become either a "hit" or a
"flop". The success of a movie is mainly judged by
the ratio of its gross revenue to its budget, but
some may also call a movie successful if it earned
critical praise and awards, both of which do not necessarily
convert into financial revenue. In this project we take
the point of view of an investor, who largely favours
financial return over any other attribute. To predict
the success of a movie, however, an investor cannot rely on
superficial attributes alone, which is precisely where Machine
Learning (ML) prediction proves useful.
We implement this prediction using two
ML methods studied in the subject
CMPE542, namely Random Forest and Neural
Networks. Both are well suited to discriminating between
classes, and can thus point very effectively
to successful or failed movies after being trained on a
set of 5043 movies whose data were scraped from
IMDB. At the end of the project, we should be able to
determine which method has the highest accuracy, which
movies sell best at the box office and, most
importantly for movie producers, which movie features
are the most decisive in making a movie profitable.
Keywords— Movie Industry, Data Scraping, Machine
Learning, Random Forest, Neural Network
I. INTRODUCTION
A. Overview
More than entertainment, the cinema industry is
becoming vital to the economies of some countries and has
become an indispensable instrument of psychological warfare
and the soft power exerted by some states. It is therefore
imperative to maximize the financial gains from
movies and to keep them as crowd-alluring as possible.
B. Data Extraction and Parsing
To run our ML analysis, we need raw data on movies
that have been judged either successful or failed. To this
end, we turn to IMDB, an online repository of all movies
released to date and even those in the pre-
production phase. We can tap into this database to extract
key information such as the movie budget, gross revenue,
ratings, names of the people taking part in the movie,
year of release, and so on. We use tools
such as BeautifulSoup, which allows us to read data of interest
from HTML webpages and build tabular data out of them;
in this project the result is a .CSV datafile conveniently named
"movie_metadata", ready to be processed by SciKit or
other ML utilities.
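As an illustration only (the paper does not show the scraping pipeline that produced movie_metadata.csv), a minimal BeautifulSoup sketch might look like the following; the URL and the h1 selector are assumptions, not the actual code used.

# Hypothetical scraping sketch; the URL and selector are illustrative,
# not the pipeline that actually produced movie_metadata.csv.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.imdb.com/title/tt0499549/").text
soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1").get_text(strip=True)  # page heading, typically the movie title
print(title)

With the CSV in hand, the analysis proper begins with the following imports: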
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from nltk.util import ngrams
from sklearn.feature_extraction.text import TfidfVectorizer
dataRaw = pd.read_csv("movie_metadata.csv", sep=',')
The dataset is composed of a wide array of attributes on
each film, with separable values and annotations. It has 28
variables for 5043 movies. The dataset we
obtained shows the movie title, the investment placed
to produce the movie, the revenue it earned, and a
lot more about its visual characteristics and the leading
actors/actresses. It should be noted that these
movies originate from 66 countries, but with a clear
prevalence of USA movies.
An important goal of this project is to forecast the
critics' score of a movie using the raw data at our
disposal. It is essential to understand which factors
carry the highest weight in determining the rating of a
movie, so we will present the results in a bar chart so as to
get a better grasp of this analysis.
dataRaw.info()
RangeIndex: 5043 entries, 0 to 5042
Data columns (total 28 columns):
color 5024 non-null object
director_name 4939 non-null object
num_critic_for_reviews 4993 non-null float64
duration 5028 non-null float64
director_facebook_likes 4939 non-null float64
actor_3_facebook_likes 5020 non-null float64
actor_2_name 5030 non-null object
actor_1_facebook_likes 5036 non-null float64
gross 4159 non-null float64
genres 5043 non-null object
actor_1_name 5036 non-null object
movie_title 5043 non-null object
num_voted_users 5043 non-null int64
cast_total_facebook_likes 5043 non-null int64
actor_3_name 5020 non-null object
facenumber_in_poster 5030 non-null float64
plot_keywords 4890 non-null object
movie_imdb_link 5043 non-null object
num_user_for_reviews 5022 non-null float64
language 5031 non-null object
country 5038 non-null object
content_rating 4740 non-null object
budget 4551 non-null float64
title_year 4935 non-null float64
actor_2_facebook_likes 5030 non-null float64
imdb_score 5043 non-null float64
aspect_ratio 4714 non-null float64
movie_facebook_likes 5043 non-null int64
As we can observe from the listing displayed
above, not all features describe all the movies. Most
columns fall short of 5043 entries, which means there is missing data
that would compromise our ML analysis. There is also a
possibility that some rows are redundant. Data
analysts often encounter such mismatches,
which compel them to make up for missing data or
duplicates by standardizing, interpolating, or pruning
their data.
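For example, an interpolation-style fix can be as simple as filling a numeric column with its median; a minimal sketch, assuming pandas and the column names listed above:

# Hypothetical imputation example: fill missing durations with the column median.
dataRaw['duration'] = dataRaw['duration'].fillna(dataRaw['duration'].median())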
We run these commands to get a peek at the number
of redundant rows and the missing values in our dataset.
dataRaw.isnull().sum()
color 19
director_name 104
num_critic_for_reviews 50
duration 15
director_facebook_likes 104
actor_3_facebook_likes 23
actor_2_name 13
actor_1_facebook_likes 7
gross 884
genres 0
actor_1_name 7
movie_title 0
num_voted_users 0
cast_total_facebook_likes 0
actor_3_name 23
facenumber_in_poster 13
plot_keywords 153
movie_imdb_link 0
num_user_for_reviews 21
language 12
country 5
content_rating 303
budget 492
title_year 108
actor_2_facebook_likes 13
imdb_score 0
aspect_ratio 329
movie_facebook_likes 0
dataRaw.duplicated().sum()
45
We see that 45 rows are duplicates and that there is a
fair amount of missing data. It is evident that we can simply erase
the duplicate rows, but it is less obvious how to treat the
missing data: there is a significant amount of it,
and we cannot afford to simply do away with all of it,
lest we risk an underfitted predictor and
lose accuracy that early. Our solution is as follows. We
can easily see that the feature "gross" exhibits the largest
number of missing values at 884. This is
significantly more than the second-most-affected
feature, "budget", at 492, which is not a negligible
number either. Since the missing values in "gross" and
"budget" are so numerous, we resort to
dropping those rows from our dataset altogether, so as to avoid
irregularities and erroneous implications down the
line.
dataRaw = dataRaw.drop_duplicates()
dataRaw = dataRaw.dropna(subset=['gross', 'budget'])
dataRaw.shape
(3857, 28)
After the unwanted data has been discarded, we still end
up with 3857 rows belonging to 28 features, which
is ample for our analysis.
We carried on with cleaning the data further, since
other features are not yet suitable as inputs for our
ML algorithm. Some features like "aspect_ratio"
undergo averaging in order to reduce the intricacy of our
dataset, consolidating highly sparse values
into two or three ranges around one
mean value.
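A minimal sketch of this consolidation step, under the assumption that missing aspect ratios are replaced by the column mean and values are rounded into a few coarse ranges:

# Hypothetical consolidation of the sparse aspect_ratio feature:
# fill gaps with the mean, then round so values collapse into a few ranges.
mean_ar = dataRaw['aspect_ratio'].mean()
dataRaw['aspect_ratio'] = dataRaw['aspect_ratio'].fillna(mean_ar).round(1)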
If we display the "language" column, we
realize that 3644 of the 5043 movies have English as their
language, which suggests that this feature will have little
to no effect on our prediction, so we can go ahead and
eliminate it from our set of features. A similar observation
can be made about the origin of the movies, which is dominated by USA-
made films with a staggering 3025 out of 5043,
far ahead of the UK and France with 316
and 103 respectively; the contribution of other nations is
almost negligible at this scale. This is an opportunity to
reduce the complexity of our data by creating just four
country groups: 'USA', 'UK', 'France' and 'Others'.
To make our data fully interpretable by our ML
algorithms, we need to associate an arbitrary numerical
value with those of our features that are in string form. This
applies to "country", "language" and
"content_rating".
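A minimal sketch of both steps, assuming pandas; the exact codes assigned to each category are arbitrary, as the text says:

# Hypothetical grouping of rare countries, then integer-encoding of string columns.
top_countries = ['USA', 'UK', 'France']
dataRaw['country'] = dataRaw['country'].where(dataRaw['country'].isin(top_countries), 'Others')

for col in ['country', 'language', 'content_rating']:
    dataRaw[col] = dataRaw[col].astype('category').cat.codes  # arbitrary numeric codes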
When we are done with all these preliminary steps of data
cleaning, rearranging and parsing, we obtain a dataset
perfectly suitable for ML processing, and we can then
proceed.
II. DATA VISUALISATION
As one last step before we execute our ML processing on
the final dataset, it is judicious to first understand how the
data are correlated, that is to say, how some features can
outweigh others in making a movie more or less
successful. Although Random Forests and Neural
Networks can seamlessly create linkages between those
features, they cannot identify the semantic
connotation of each feature; rather, they see all features as
equal placeholders with no special meaning. Therefore, we
will make an even better assessment if we help ourselves
by recognizing the key connections that exist.
This can be achieved by visualizing the data. We
begin by plotting how many films have been
produced since the beginning of cinematography.
plt.figure(figsize=(30, 10))
sns.distplot(dataRaw.title_year, kde=False);
There is a sharp increase in movie production
starting from the 80s. This is a direct result of the
cinematographic technical advances that coincided with
this decade, most especially the market being flooded
with VHS cassettes, which popularized home movies.
Another telling connection is the movie score with
regard to genre. The series of plots below
illustrates it, and shows the roughly normal
distribution of scores within each genre.
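The plots below index columns such as Action and Sci-Fi that were not created in the code shown so far; presumably they were derived from the pipe-separated genres field. A hedged sketch of how that could be done:

# Hypothetical derivation of the per-genre indicator columns used below;
# "genres" holds strings such as "Action|Adventure|Sci-Fi".
genre_dummies = dataRaw['genres'].str.get_dummies(sep='|')
dataRaw = pd.concat([dataRaw, genre_dummies], axis=1)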
temp_act = dataRaw.loc[dataRaw.Action == 1][['imdb_score']]
temp_adv = dataRaw.loc[dataRaw.Adventure == 1][['imdb_score']]
temp_fan = dataRaw.loc[dataRaw.Fantasy == 1][['imdb_score']]
temp_sci = dataRaw.loc[dataRaw['Sci-Fi'] == 1][['imdb_score']]
temp_thr = dataRaw.loc[dataRaw.Thriller == 1][['imdb_score']]
temp_rom = dataRaw.loc[dataRaw.Romance == 1][['imdb_score']]
temp_com = dataRaw.loc[dataRaw.Comedy == 1][['imdb_score']]
temp_ani = dataRaw.loc[dataRaw.Animation == 1][['imdb_score']]
temp_fam = dataRaw.loc[dataRaw.Family == 1][['imdb_score']]
temp_hor = dataRaw.loc[dataRaw.Horror == 1][['imdb_score']]
temp_dra = dataRaw.loc[dataRaw.Drama == 1][['imdb_score']]
temp_crime = dataRaw.loc[dataRaw.Crime == 1][['imdb_score']]
sns.set(style="white", palette="muted", color_codes=True)
f, axes = plt.subplots(3, 4, figsize=(20, 20), sharex=True)
sns.despine(left=True)
sns.distplot(temp_act.imdb_score, kde=False, color="blue", ax=axes[0, 0]).set_title('Action Movies')
sns.distplot(temp_adv.imdb_score, kde=False, color="red", ax=axes[0, 1]).set_title('Adventure Movies')
sns.distplot(temp_fan.imdb_score, kde=False, color="green", ax=axes[0, 2]).set_title('Fantasy Movies')
sns.distplot(temp_sci.imdb_score, kde=False, color="orange", ax=axes[0, 3]).set_title('Sci-Fi Movies')
sns.distplot(temp_thr.imdb_score, kde=False, color="blue", ax=axes[1, 0]).set_title('Thriller Movies')
sns.distplot(temp_rom.imdb_score, kde=False, color="red", ax=axes[1, 1]).set_title('Romance Movies')
sns.distplot(temp_com.imdb_score, kde=False, color="green", ax=axes[1, 2]).set_title('Comedy Movies')
sns.distplot(temp_fam.imdb_score, kde=False, color="orange", ax=axes[1, 3]).set_title('Family Movies')
sns.distplot(temp_hor.imdb_score, kde=False, color="blue", ax=axes[2, 0]).set_title('Horror Movies')
sns.distplot(temp_dra.imdb_score, kde=False, color="red", ax=axes[2, 1]).set_title('Drama Movies')
sns.distplot(temp_ani.imdb_score, kde=False, color="green", ax=axes[2, 2]).set_title('Animation Movies')
sns.distplot(temp_crime.imdb_score, kde=False, color="orange", ax=axes[2, 3]).set_title('Crime Movies')
III. IMPLEMENTATION OF THE ML TECHNIQUES
The prediction can now properly start, and we
implement our two ML algorithms as set out in the
introduction. We will compare the
performance of both algorithms and give our evaluation
and interpretation of their usage. In this project, we
build the Random Forest and the Neural Network by
means of the specialized Python frameworks Keras and
TensorFlow. We could also have resorted to SciKit, but we wanted
to explore other tools that are meant for advanced
ML implementation.
At this point, we have to split the data into training and test sets. We
use a hold-out rule that keeps 25% of the data for
testing, while the model is trained on the remaining 75%.
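Note that both classifiers require discrete labels, and the later call to to_categorical(Y, num_classes=5) presupposes that imdb_score has already been binned into five integer classes. The paper does not show this step; a hedged sketch of one plausible binning:

# Hypothetical binning of the continuous IMDb score (0-10) into five classes (0-4);
# the bin edges here are an assumption, not taken from the paper.
dataRaw['imdb_score'] = pd.cut(dataRaw['imdb_score'],
                               bins=[0, 2, 4, 6, 8, 10],
                               labels=[0, 1, 2, 3, 4],
                               include_lowest=True).astype(int)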
Y=dataRaw.imdb_score
X=dataRaw.drop(['imdb_score'], axis=1)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25)
A. Random Forest
Our Random Forest algorithm takes randomly selected
samples of the data and generates a multitude of
decision trees from them. It computes the prediction of
each single tree and aggregates them, typically by majority vote.
Random Forests make for ideal classifiers. The movie
success predictor seems to be a textbook case of
classification, but we must note that among the 28
features, some are not easily separable, and thus a clear cut
cannot easily be achieved by Random Forest models.
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=10)
model = model.fit(x_train, y_train)
feature_imp = pd.Series(model.feature_importances_, index=x_train.columns.values).sort_values(ascending=False)
A fundamental task when performing
supervised learning on a dataset is establishing which
features offer the most predictive power. By focusing on
the relationship between only a few crucial
features and the target label (success, failure), we break
our understanding of the phenomenon down into elements
we are familiar with. For the dataset under
study, our aim is to narrow things down to a handful
of features that impact the success rate of a film.
# Creating a bar plot of feature importances
plt.figure(figsize=(10, 10))
sns.barplot(x=feature_imp, y=feature_imp.index)
# Add labels to the graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()
# predictions
from sklearn.metrics import accuracy_score

pred_train = model.predict(x_train)
rms_train = accuracy_score(y_train, pred_train)
pred_test = model.predict(x_test)
rms_test = accuracy_score(y_test, pred_test)
print('Train Accuracy: {0} Test Accuracy: {1}'.format(rms_train*100, rms_test*100))
Train Accuracy: 98.60813704496788 Test Accuracy: 76.12419700214133
The Random Forest implementation leads to a training
accuracy of over 98%, which may indicate that
overfitting has occurred. This is corroborated
by the fact that the accuracy on the testing data is only
76.12%: the model appears to have memorized the
training data rather than learned patterns that generalize.
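One common mitigation, not explored in this paper, is to regularize the forest by growing more but shallower trees; a minimal sketch with assumed hyperparameters:

# Hypothetical regularized forest: more estimators, bounded depth and leaf size,
# which typically narrows the train/test accuracy gap.
reg_model = RandomForestClassifier(n_estimators=100, max_depth=10, min_samples_leaf=5)
reg_model.fit(x_train, y_train)
print(reg_model.score(x_test, y_test))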
B. Neural Networks
The favourite ML algorithm for sophisticated analysis
is the Neural Network. Neural Networks are reliable and can be
employed both for performing regression on linear data
and for classification of clustered data. Their bio-inspired
nature makes them appealing in certain applications,
but they are also computationally expensive.
We will gauge both Random Forest and Neural
Networks in the light of this movie success predictor
project.
One last step prior to running the Neural Network
algorithm is to render the data compatible with it.
First, we standardize all 'X' features to
zero mean and unit variance. This is an important part of the
normalisation process, which is a prerequisite for ML
techniques such as Neural Networks and PCA. Then the 'Y'
data, which was represented in one column, undergoes a
transformation into five columns, one per class
(one-hot encoding).
from sklearn.preprocessing import StandardScaler
from keras import optimizers
from keras.models import Sequential
from keras.layers import Dense
from keras.utils.np_utils import to_categorical

sc = StandardScaler()
X = sc.fit_transform(X)
Y = to_categorical(Y, num_classes=5)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25)
nn_model = Sequential()
nn_model.add(Dense(10, input_dim=39, activation='relu'))
nn_model.add(Dense(10, activation='relu'))
nn_model.add(Dense(10, activation='relu'))
nn_model.add(Dense(5, activation='softmax'))
adam = optimizers.Adam(lr=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.05)
nn_model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
nn_model.fit(x_train, y_train, epochs=100, batch_size=10)
trainScore, trainAcc = nn_model.evaluate(x_train, y_train)
testScore, testAcc = nn_model.evaluate(x_test, y_test)
print('Train Accuracy: {0} Test Accuracy: {1}'.format(trainAcc*100, testAcc*100))
Train Accuracy: 85.08208422555317 Test Accuracy: 75.2676659528908
The Neural Network model yielded a more realistic 85%
accuracy during training, and a 75.26% accuracy on
the testing data. Compared to the Random Forest algorithm,
which boasts 76.12%, this is slightly
lower, but Neural Networks are more robust and can thus
be trusted on very large datasets, which is not the
case for Random Forest. The neural network used here
relies on a Keras optimizer called Adam, which
previous research has shown to be lighter on
computational power, and thus on power consumption.
This is something to remember if we ever want to
extend this predictor to more disparate features (list
of all actors, timing of the movie release with respect to other
global events, etc.).
The following table summarizes our final results on the
accuracy of both ML approaches:

Accuracy (%)    Random Forest    Neural Networks
Training        98.60            85.08
Testing         76.12            75.26
IV. CONCLUSION
Random Forest exhibits the highest accuracy on training
data (98.6%), which is a far cry from its 76% accuracy on testing
data. Neural Networks have a relatively high training accuracy
of 85%, which is more consistent with their 75% testing
accuracy. This gap is a typical issue with random
forests, which can easily overfit the data.
From all the above, here we list some of the key
takeaways:
- Estimating the success of a film is not as
straightforward as it seems. It does not correlate intimately
with any of the obvious features a movie relies on
(genre, country of origin, shooting quality).
- That being said, certain factors have
more impact than others: the choice of actor/actress is more
decisive than the director's, drama movies are likely
to be more successful, and movies released during the summer
have better chances than movies released in regular
months.
The model built here can be called a minimalistic
model, and there are a number of additions one could
make to improve it. I would suggest:
- Training on a larger dataset. Our study was conducted
only on an IMDB dataset, but one could also look at Rotten
Tomatoes and Box Office Mojo datasets, in which case
some of the rows we removed here during data cleaning
(for missing "gross" and "budget" values) would probably still make
it into the final training set.
- Adding as a feature the keywords that make up a movie
synopsis. This text describing the plot is one of the first
elements an audience consults when choosing which movie to
go watch, along with the movie poster. The poster, however, is
complicated to evaluate as a numerical score, unless we
use image segmentation and learn the different
components of the poster image, in order to compare
posters across movies.
- Other critical features could be inserted as well, such as
the number of theatres screening a particular movie, or the
number of previously successful movies from a particular
director or actor/actress.
The ML implementation designed in this
project could be extended to the Turkish movie industry to
help local production studios understand the parameters
that can boost the commercial success of their films.
V. REFERENCES
CMPE542 Course Notes
www.imdb.com
www.kaggle.com
www.stackoverflow.com
More Related Content

Similar to Building a Movie Success Predictor

Using_The_Predictive_Analytics_For_Effective_Cross_Selling
Using_The_Predictive_Analytics_For_Effective_Cross_SellingUsing_The_Predictive_Analytics_For_Effective_Cross_Selling
Using_The_Predictive_Analytics_For_Effective_Cross_SellingSunil Kakade
 
A Holistic Approach to Property Valuations
A Holistic Approach to Property ValuationsA Holistic Approach to Property Valuations
A Holistic Approach to Property Valuations
Cognizant
 
Explaining the Explainability: ‘Why’ and ‘How’ of Explainability in Research
Explaining the Explainability: ‘Why’ and ‘How’ of Explainability  in ResearchExplaining the Explainability: ‘Why’ and ‘How’ of Explainability  in Research
Explaining the Explainability: ‘Why’ and ‘How’ of Explainability in Research
Melih Bahar
 
R markup code to create Regression Model
R markup code to create Regression ModelR markup code to create Regression Model
R markup code to create Regression Model
Mohit Rajput
 
RETRIEVING FUNDAMENTAL VALUES OF EQUITY
RETRIEVING FUNDAMENTAL VALUES OF EQUITYRETRIEVING FUNDAMENTAL VALUES OF EQUITY
RETRIEVING FUNDAMENTAL VALUES OF EQUITY
IRJET Journal
 
1 Exploratory Data Analysis (EDA) by Melvin Ott, PhD.docx
1 Exploratory Data Analysis (EDA) by Melvin Ott, PhD.docx1 Exploratory Data Analysis (EDA) by Melvin Ott, PhD.docx
1 Exploratory Data Analysis (EDA) by Melvin Ott, PhD.docx
honey725342
 
Explainable AI
Explainable AIExplainable AI
Explainable AI
Dinesh V
 
BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...
BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...
BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...
Big Data Week
 
Module 9: Natural Language Processing Part 2
Module 9:  Natural Language Processing Part 2Module 9:  Natural Language Processing Part 2
Module 9: Natural Language Processing Part 2
Sara Hooker
 
IRJET- Movie Success Prediction using Data Mining and Social Media
IRJET- Movie Success Prediction using Data Mining and Social MediaIRJET- Movie Success Prediction using Data Mining and Social Media
IRJET- Movie Success Prediction using Data Mining and Social Media
IRJET Journal
 
MOVIE SUCCESS PREDICTION AND PERFORMANCE COMPARISON USING VARIOUS STATISTICAL...
MOVIE SUCCESS PREDICTION AND PERFORMANCE COMPARISON USING VARIOUS STATISTICAL...MOVIE SUCCESS PREDICTION AND PERFORMANCE COMPARISON USING VARIOUS STATISTICAL...
MOVIE SUCCESS PREDICTION AND PERFORMANCE COMPARISON USING VARIOUS STATISTICAL...
ijaia
 
ML Drift - How to find issues before they become problems
ML Drift - How to find issues before they become problemsML Drift - How to find issues before they become problems
ML Drift - How to find issues before they become problems
Amy Hodler
 
IST365 - Project Deliverable #3Create the corresponding relation.docx
IST365 - Project Deliverable #3Create the corresponding relation.docxIST365 - Project Deliverable #3Create the corresponding relation.docx
IST365 - Project Deliverable #3Create the corresponding relation.docx
priestmanmable
 
MLSEV. Anatomy of an ML Application
MLSEV. Anatomy of an ML ApplicationMLSEV. Anatomy of an ML Application
MLSEV. Anatomy of an ML Application
BigML, Inc
 
How to use LLMs in synthesizing training data?
How to use LLMs in synthesizing training data?How to use LLMs in synthesizing training data?
How to use LLMs in synthesizing training data?
Benjaminlapid1
 
Top 5 Cloud Analytics Platforms Of 2024 (1).pdf
Top 5 Cloud Analytics Platforms Of 2024 (1).pdfTop 5 Cloud Analytics Platforms Of 2024 (1).pdf
Top 5 Cloud Analytics Platforms Of 2024 (1).pdf
SophiaJohnson39
 
Prospection_Business_Intelligence[1]
Prospection_Business_Intelligence[1]Prospection_Business_Intelligence[1]
Prospection_Business_Intelligence[1]Tuong Do, MBA
 
From DBA to DE: Becoming a Data Engineer
From DBA to DE:  Becoming a Data Engineer From DBA to DE:  Becoming a Data Engineer
From DBA to DE: Becoming a Data Engineer
Jim Czuprynski
 
Chapter 11Data Visualization and Geographic Information System.docx
Chapter 11Data Visualization and Geographic Information System.docxChapter 11Data Visualization and Geographic Information System.docx
Chapter 11Data Visualization and Geographic Information System.docx
cravennichole326
 
Chapter 11Data Visualization and Geographic Information System.docx
Chapter 11Data Visualization and Geographic Information System.docxChapter 11Data Visualization and Geographic Information System.docx
Chapter 11Data Visualization and Geographic Information System.docx
keturahhazelhurst
 

Similar to Building a Movie Success Predictor (20)

Using_The_Predictive_Analytics_For_Effective_Cross_Selling
Using_The_Predictive_Analytics_For_Effective_Cross_SellingUsing_The_Predictive_Analytics_For_Effective_Cross_Selling
Using_The_Predictive_Analytics_For_Effective_Cross_Selling
 
A Holistic Approach to Property Valuations
A Holistic Approach to Property ValuationsA Holistic Approach to Property Valuations
A Holistic Approach to Property Valuations
 
Explaining the Explainability: ‘Why’ and ‘How’ of Explainability in Research
Explaining the Explainability: ‘Why’ and ‘How’ of Explainability  in ResearchExplaining the Explainability: ‘Why’ and ‘How’ of Explainability  in Research
Explaining the Explainability: ‘Why’ and ‘How’ of Explainability in Research
 
R markup code to create Regression Model
R markup code to create Regression ModelR markup code to create Regression Model
R markup code to create Regression Model
 
RETRIEVING FUNDAMENTAL VALUES OF EQUITY
RETRIEVING FUNDAMENTAL VALUES OF EQUITYRETRIEVING FUNDAMENTAL VALUES OF EQUITY
RETRIEVING FUNDAMENTAL VALUES OF EQUITY
 
1 Exploratory Data Analysis (EDA) by Melvin Ott, PhD.docx
1 Exploratory Data Analysis (EDA) by Melvin Ott, PhD.docx1 Exploratory Data Analysis (EDA) by Melvin Ott, PhD.docx
1 Exploratory Data Analysis (EDA) by Melvin Ott, PhD.docx
 
Explainable AI
Explainable AIExplainable AI
Explainable AI
 
BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...
BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...
BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...
 
Module 9: Natural Language Processing Part 2
Module 9:  Natural Language Processing Part 2Module 9:  Natural Language Processing Part 2
Module 9: Natural Language Processing Part 2
 
IRJET- Movie Success Prediction using Data Mining and Social Media
IRJET- Movie Success Prediction using Data Mining and Social MediaIRJET- Movie Success Prediction using Data Mining and Social Media
IRJET- Movie Success Prediction using Data Mining and Social Media
 
MOVIE SUCCESS PREDICTION AND PERFORMANCE COMPARISON USING VARIOUS STATISTICAL...
MOVIE SUCCESS PREDICTION AND PERFORMANCE COMPARISON USING VARIOUS STATISTICAL...MOVIE SUCCESS PREDICTION AND PERFORMANCE COMPARISON USING VARIOUS STATISTICAL...
MOVIE SUCCESS PREDICTION AND PERFORMANCE COMPARISON USING VARIOUS STATISTICAL...
 
ML Drift - How to find issues before they become problems
ML Drift - How to find issues before they become problemsML Drift - How to find issues before they become problems
ML Drift - How to find issues before they become problems
 
IST365 - Project Deliverable #3Create the corresponding relation.docx
IST365 - Project Deliverable #3Create the corresponding relation.docxIST365 - Project Deliverable #3Create the corresponding relation.docx
IST365 - Project Deliverable #3Create the corresponding relation.docx
 
MLSEV. Anatomy of an ML Application
MLSEV. Anatomy of an ML ApplicationMLSEV. Anatomy of an ML Application
MLSEV. Anatomy of an ML Application
 
How to use LLMs in synthesizing training data?
How to use LLMs in synthesizing training data?How to use LLMs in synthesizing training data?
How to use LLMs in synthesizing training data?
 
Top 5 Cloud Analytics Platforms Of 2024 (1).pdf
Top 5 Cloud Analytics Platforms Of 2024 (1).pdfTop 5 Cloud Analytics Platforms Of 2024 (1).pdf
Top 5 Cloud Analytics Platforms Of 2024 (1).pdf
 
Prospection_Business_Intelligence[1]
Prospection_Business_Intelligence[1]Prospection_Business_Intelligence[1]
Prospection_Business_Intelligence[1]
 
From DBA to DE: Becoming a Data Engineer
From DBA to DE:  Becoming a Data Engineer From DBA to DE:  Becoming a Data Engineer
From DBA to DE: Becoming a Data Engineer
 
Chapter 11Data Visualization and Geographic Information System.docx
Chapter 11Data Visualization and Geographic Information System.docxChapter 11Data Visualization and Geographic Information System.docx
Chapter 11Data Visualization and Geographic Information System.docx
 
Chapter 11Data Visualization and Geographic Information System.docx
Chapter 11Data Visualization and Geographic Information System.docxChapter 11Data Visualization and Geographic Information System.docx
Chapter 11Data Visualization and Geographic Information System.docx
 

More from Youness Lahdili

7 [single-page slide] - My attempt at understanding Augmented Reality
7 [single-page slide] - My attempt at understanding Augmented Reality7 [single-page slide] - My attempt at understanding Augmented Reality
7 [single-page slide] - My attempt at understanding Augmented Reality
Youness Lahdili
 
6 [single-page slide] - Conception of an Autonomous UAV using Stereo Vision
6 [single-page slide] - Conception of an Autonomous UAV using Stereo Vision6 [single-page slide] - Conception of an Autonomous UAV using Stereo Vision
6 [single-page slide] - Conception of an Autonomous UAV using Stereo Vision
Youness Lahdili
 
6 [progress report] for this leisurely side-project I was doing in 2016
6 [progress report] for this leisurely side-project I was doing in 20166 [progress report] for this leisurely side-project I was doing in 2016
6 [progress report] for this leisurely side-project I was doing in 2016
Youness Lahdili
 
6 - Conception of an Autonomous UAV using Stereo Vision (presented in an Indo...
6 - Conception of an Autonomous UAV using Stereo Vision (presented in an Indo...6 - Conception of an Autonomous UAV using Stereo Vision (presented in an Indo...
6 - Conception of an Autonomous UAV using Stereo Vision (presented in an Indo...
Youness Lahdili
 
5 - Anthology on the Ethical Issues in Engineering Practice (presented in a M...
5 - Anthology on the Ethical Issues in Engineering Practice (presented in a M...5 - Anthology on the Ethical Issues in Engineering Practice (presented in a M...
5 - Anthology on the Ethical Issues in Engineering Practice (presented in a M...
Youness Lahdili
 
4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...
4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...
4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...
Youness Lahdili
 
3 - A critical review on the usual DCT Implementations (presented in a Malays...
3 - A critical review on the usual DCT Implementations (presented in a Malays...3 - A critical review on the usual DCT Implementations (presented in a Malays...
3 - A critical review on the usual DCT Implementations (presented in a Malays...
Youness Lahdili
 
4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...
4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...
4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...
Youness Lahdili
 
2 - Generation of PSK signal using non linear devices via MATLAB (presented i...
2 - Generation of PSK signal using non linear devices via MATLAB (presented i...2 - Generation of PSK signal using non linear devices via MATLAB (presented i...
2 - Generation of PSK signal using non linear devices via MATLAB (presented i...
Youness Lahdili
 
1 [single-page slide] - My concept of project presented for NI GSDA Award
1 [single-page slide] - My concept of project presented for NI GSDA Award1 [single-page slide] - My concept of project presented for NI GSDA Award
1 [single-page slide] - My concept of project presented for NI GSDA Award
Youness Lahdili
 
1 - My concept of project presented for NI GSDA Award (selected as one of 8 f...
1 - My concept of project presented for NI GSDA Award (selected as one of 8 f...1 - My concept of project presented for NI GSDA Award (selected as one of 8 f...
1 - My concept of project presented for NI GSDA Award (selected as one of 8 f...
Youness Lahdili
 

More from Youness Lahdili (11)

7 [single-page slide] - My attempt at understanding Augmented Reality
7 [single-page slide] - My attempt at understanding Augmented Reality7 [single-page slide] - My attempt at understanding Augmented Reality
7 [single-page slide] - My attempt at understanding Augmented Reality
 
6 [single-page slide] - Conception of an Autonomous UAV using Stereo Vision
6 [single-page slide] - Conception of an Autonomous UAV using Stereo Vision6 [single-page slide] - Conception of an Autonomous UAV using Stereo Vision
6 [single-page slide] - Conception of an Autonomous UAV using Stereo Vision
 
6 [progress report] for this leisurely side-project I was doing in 2016
6 [progress report] for this leisurely side-project I was doing in 20166 [progress report] for this leisurely side-project I was doing in 2016
6 [progress report] for this leisurely side-project I was doing in 2016
 
6 - Conception of an Autonomous UAV using Stereo Vision (presented in an Indo...
6 - Conception of an Autonomous UAV using Stereo Vision (presented in an Indo...6 - Conception of an Autonomous UAV using Stereo Vision (presented in an Indo...
6 - Conception of an Autonomous UAV using Stereo Vision (presented in an Indo...
 
5 - Anthology on the Ethical Issues in Engineering Practice (presented in a M...
5 - Anthology on the Ethical Issues in Engineering Practice (presented in a M...5 - Anthology on the Ethical Issues in Engineering Practice (presented in a M...
5 - Anthology on the Ethical Issues in Engineering Practice (presented in a M...
 
4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...
4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...
4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...
 
3 - A critical review on the usual DCT Implementations (presented in a Malays...
3 - A critical review on the usual DCT Implementations (presented in a Malays...3 - A critical review on the usual DCT Implementations (presented in a Malays...
3 - A critical review on the usual DCT Implementations (presented in a Malays...
 
4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...
4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...
4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...
 
2 - Generation of PSK signal using non linear devices via MATLAB (presented i...
2 - Generation of PSK signal using non linear devices via MATLAB (presented i...2 - Generation of PSK signal using non linear devices via MATLAB (presented i...
2 - Generation of PSK signal using non linear devices via MATLAB (presented i...
 
1 [single-page slide] - My concept of project presented for NI GSDA Award
1 [single-page slide] - My concept of project presented for NI GSDA Award1 [single-page slide] - My concept of project presented for NI GSDA Award
1 [single-page slide] - My concept of project presented for NI GSDA Award
 
1 - My concept of project presented for NI GSDA Award (selected as one of 8 f...
1 - My concept of project presented for NI GSDA Award (selected as one of 8 f...1 - My concept of project presented for NI GSDA Award (selected as one of 8 f...
1 - My concept of project presented for NI GSDA Award (selected as one of 8 f...
 

Recently uploaded

SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
UiPathCommunity
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.
ViralQR
 

Recently uploaded (20)

SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.
 

Building a Movie Success Predictor

  • 1. Building a Movie Success Predictor Youness Lahdili TED University Project Paper for CMPE542 - Machine Learning Prof. Venera Adanova Abstract— The movie making is a multibillion-dollar industry. In 2018, the global movie business has generated nearly $41.5 billion in box office and more than that in merchandise revenues. But it is not a guaranteed business: every year we witness big buster and budget movies that become either a “hit” or a “flop”. The success of a movie is mainly judged by looking at ratio of its gross revenue over its budget, but some may also call a movie successful if it bagged critics praise and awards, both of which do not necessarily convert to financial revenue. In our project we look from an investor point of view, who largely favour financial return over any other attribute. But to predict the success of a movie, an investor can’t only rely on superficial attributes, a typical reason why Machine Learning (ML) prediction will prove to be very useful. We are going to implement this prediction using two ML methods that we have studied during the subject CMPE542, namely Random Forest and Neural Network. These are very adapted for discriminating classes, and can thus help us very effectively in pointing to successful or failed movies after being trained on a set of 5043 movies which data have been scraped from IMDB. At the end of the project, we should be able to know which method has the highest accuracy, what movies sell the best at the box office and most importantly for movies producers, what movie features are the most decisive in making a movie profitable. Keywords— Movie Industry, Data Scraping, Machine Learning, Random Forest, Neural Network I. INTRODUCTION A. Overview More than entertainment, the cinema industry is becoming vital to economies of some countries and has became an indispensable weapon in psychological war and the soft power exerted by some countries. So it is imperative to be able to maximize the financial gains from movies, and keep movies as crowd-alluring as possible. B. Data Extraction and Parsing To run our ML analysis, we need raw data on all movies that have been judged either successful of failed. For this end, we turn to IMDB, an online repository of all movies that have been released to date and even those in pre- production phase. We can tap into this database to extract key information about the movie budget, gross revenue, ratings, names of people who are taking part in the movie, year of release, and so on. We will have to use some tools such as BeautifulSoup which allow to read data of interest from HTML webpages and create tabular data out of it, in this project it is a .CSV datafile conveniently named “movie_metadata” and ready to be treated by SciKit or other ML utilitaries. import numpy as np import pandas as pd import seaborn as sns import matplotlib.pyplot as plt %matplotlib inline from sklearn.model_selection import train_test_split from nltk.corpus import stopwords from nltk.util import ngrams from sklearn.feature_extraction.text import TfidfVectorizer dataRaw = pd.read_csv(movie_metadata.csv",sep=',')
  • 2. The dataset is composed of a wide array of attributes on all films, with separable values and annotations. It has 28 variables belonging to 5043 movies. The dataset we obtained show the movie titles, investment that was placed to produce the movie, the revenue that was earned, and a lot more regarding the visual characteristics, the leading actors/actress in the movie. It should be notes that those movies originate from 66 countries, but with a clear prevalence of USA movies. An important goal of this project is to forecast the critics’ score of a movie using the raw data have at our disposal. It is very essential to understand which factors have the highest weight in determining the rating of a movie. So we will present the results in a bar chart so to have a better grasp of this analysis. dataRaw.info() RangeIndex: 5043 entries, 0 to 5042 Data columns (total 28 columns): color 5024 non-null object director_name 4939 non-null object num_critic_for_reviews 4993 non-null float64 duration 5028 non-null float64 director_facebook_likes 4939 non-null float64 actor_3_facebook_likes 5020 non-null float64 actor_2_name 5030 non-null object actor_1_facebook_likes 5036 non-null float64 gross 4159 non-null float64 genres 5043 non-null object actor_1_name 5036 non-null object movie_title 5043 non-null object num_voted_users 5043 non-null int64 cast_total_facebook_likes 5043 non-null int64 actor_3_name 5020 non-null object facenumber_in_poster 5030 non-null float64 plot_keywords 4890 non-null object movie_imdb_link 5043 non-null object num_user_for_reviews 5022 non-null float64 language 5031 non-null object country 5038 non-null object content_rating 4740 non-null object budget 4551 non-null float64 title_year 4935 non-null float64 actor_2_facebook_likes 5030 non-null float64 imdb_score 5043 non-null float64 aspect_ratio 4714 non-null float64 movie_facebook_likes 5043 non-null int64 As we can observe from the listing of data displayed above, not all features are describe all the movies. Most column are short 5043, which means there is missing data that will compromise our ML analysis. There is a possibility that some values are actually redundant. Data analysist often encounter such situation of data mismatch, which compels them to make up for this missing data or duplicates, by either standardizing, interpolating, pruning or panning their data. We run these commands to have a peek at the number of redundant data and the missing value in our dataset. dataRaw.isnull().sum() color 19 director_name 104 num_critic_for_reviews 50 duration 15 director_facebook_likes 104 actor_3_facebook_likes 23 actor_2_name 13 actor_1_facebook_likes 7 gross 884
  • 3. genres 0 actor_1_name 7 movie_title 0 num_voted_users 0 cast_total_facebook_likes 0 actor_3_name 23 facenumber_in_poster 13 plot_keywords 153 movie_imdb_link 0 num_user_for_reviews 21 language 12 country 5 content_rating 303 budget 492 title_year 108 actor_2_facebook_likes 13 imdb_score 0 aspect_ratio 329 movie_facebook_likes 0 dataRaw.duplicated().sum() 45 We realize that 45 data are duplicates and that there is a handful of missing data. It is evident that we can just erase the duplicate data, but it is not as obvious how to treat the missing data, since there exist a significant number of them and we cannot just afford to just do away with them, lest we run the risk of having un underfitted predictor, and loose its accuracy that early. Our solution is as follow: we can easily see that the feature “gross” exhibit the largest number of missing data at a number of 884. This is significantly larger than the second most feature with missing data “budget” at 492, which is not a negligible number either. Since the void values in “gross” and “budget” are considerably big, we will just resort to ablating those rows from our dataset altogether so to avoid any irregularity down the line, and causing erroneous implications. dataRaw = dataRaw.drop_duplicates() dataRaw = dataRaw.dropna(subset=['gross', 'budget']) dataRaw.shape (3857, 28) After the unwanted data is been discarded, we still end up with 3857 rows of data belonging to 28 features which is amply enough to go ahead with our analysis. We carried on with cleaning the data even further, since other features are still not yet adapted to be inputs for our ML algorithm. Some features like “aspect ratio” will undergo averaging in order to reduce the intricacy of our datasets, and bring a highly sparse data to become consolidated into two or three ranges of data under one mean value. If we were to display the “language” column, we will realize that 3644/5043 of the movies have English as their language, which suggests that this feature will have little to no effect in our prediction, and so we can go ahead and eliminate it from our set of features, like we did with “gross” and “budget”. A similar observation can be as to the origin of the movies, which is dominated by USA made films with a staggering number of 3025 out of 5043 largely in front of UK and France respectively with 316 and 103 respectively, and other nations contribution is almost unaccountable at this scale. This is a opportunity to reduce the complexity of our data, and create just four “countries” group: 'USA', 'UK', 'France' and 'Others'. To make our data fully interpretable by our ML algorithms, we need to associate a numerical arbitrary value to some of our data that are in string form. This applies to “country”, “language” and “content_rating”. When we are done with all this preliminary steps of data cleaning, rearranging and parsing, we get a dataset perfectly suitable for ML processing, and we can then proceed.
  • 4. II. DATA VISUALISATION One last step before we execute our ML processing over the final dataset, it is judicious to understand first how the data is correlated, that is to say how some features can outweigh others in making a movie more or less successful. Despite the fact that Random Trees and Neural Networks can seamlessly create linkages between those features, it will not be able to identify the semantic connotation of each feature, but rather sees all features as equal placeholder with no special meaning. Therefor, we will make an even better assessment if we help ourselves by recognizing the different key connections that exist. This can be achieved by visualization of the data. We shall begin by first laying down how many films have been produced since the beginning of cinematography. plt.figure(figsize=(30, 10)) sns.distplot(dataRaw.title_year, kde=False); There is a clear sharp increase of movie productions starting from the 80s. This is a direct result of the cinematographic technical advance that coincided with this decade, and most specially the market being flooded with VHS cassettes which popularized home movies. Another foretelling connection is the movie score with regard to the genre. Here we can see series of plots that best illustrate that, and clearly show the normal distribution of this connection. temp_act = dataRaw.loc[dataRaw.Action == 1][['imdb_score']] temp_adv = dataRaw.loc[dataRaw.Adventure == 1][['imdb_score']] temp_fan = dataRaw.loc[dataRaw.Fantasy == 1][['imdb_score']] temp_sci = dataRaw.loc[dataRaw['Sci-Fi'] == 1][['imdb_score']] temp_thr = dataRaw.loc[dataRaw.Thriller == 1][['imdb_score']] temp_rom = dataRaw.loc[dataRaw.Romance == 1][['imdb_score']] temp_com = dataRaw.loc[dataRaw.Comedy == 1][['imdb_score']] temp_ani = dataRaw.loc[dataRaw.Animation == 1][['imdb_score']] temp_fam = dataRaw.loc[dataRaw.Family == 1][['imdb_score']] temp_hor = dataRaw.loc[dataRaw.Horror == 1][['imdb_score']] temp_dra = dataRaw.loc[dataRaw.Drama == 1][['imdb_score']] temp_crime = dataRaw.loc[dataRaw.Crime == 1][['imdb_score'] sns.set(style="white", palette="muted", color_codes=True) f, axes = plt.subplots(3, 4, figsize=(20, 20), sharex=True) sns.despine(left=True) sns.distplot(temp_act.imdb_score, kde=False, color="blue", ax=axes[0, 0]).set_title('Action Movies') sns.distplot(temp_adv.imdb_score, kde=False, color="red", ax=axes[0, 1]).set_title('Adventur e Movies') sns.distplot(temp_fan.imdb_score, kde=False, color="green", ax=axes[0, 2]).set_title('Fantas y Movies') sns.distplot(temp_sci.imdb_score, kde=False, color="orange", ax=axes[0, 3]).set_title('Sci-F i Movies')
sns.distplot(temp_thr.imdb_score, kde=False, color="blue", ax=axes[1, 0]).set_title('Thriller Movies')
sns.distplot(temp_rom.imdb_score, kde=False, color="red", ax=axes[1, 1]).set_title('Romance Movies')
sns.distplot(temp_com.imdb_score, kde=False, color="green", ax=axes[1, 2]).set_title('Comedy Movies')
sns.distplot(temp_fam.imdb_score, kde=False, color="orange", ax=axes[1, 3]).set_title('Family Movies')
sns.distplot(temp_hor.imdb_score, kde=False, color="blue", ax=axes[2, 0]).set_title('Horror Movies')
sns.distplot(temp_dra.imdb_score, kde=False, color="red", ax=axes[2, 1]).set_title('Drama Movies')
sns.distplot(temp_ani.imdb_score, kde=False, color="green", ax=axes[2, 2]).set_title('Animation Movies')
sns.distplot(temp_crime.imdb_score, kde=False, color="orange", ax=axes[2, 3]).set_title('Crime Movies')
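Note that the genre selections above index columns like dataRaw.Action, which presupposes that the pipe-separated “genres” string was expanded into one 0/1 indicator column per genre earlier in the pipeline. That expansion is not shown in the paper; a minimal sketch of it, assuming the raw IMDB 'Action|Adventure|Fantasy' format, would be:

# one indicator column per genre, e.g. dataRaw.Action, dataRaw['Sci-Fi'], ...
genre_dummies = dataRaw['genres'].str.get_dummies(sep='|')
dataRaw = pd.concat([dataRaw.drop('genres', axis=1), genre_dummies], axis=1)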
III. IMPLEMENTATION OF THE ML TECHNIQUES

The prediction can now properly start, and we can implement our two ML algorithms as announced in the introduction. We will compare the performance of both algorithms and give our evaluation and interpretation of their usage. In this project, we build the Neural Network by means of the specialized Python frameworks Keras and TensorFlow, while the Random Forest relies on scikit-learn. We could have used scikit-learn throughout, but we wanted to explore tools that are meant for more advanced ML implementations. At this point, we have to split the data into training and test sets. We use a hold-out rule that keeps 25% of the data for testing, while the model is trained on the remaining 75%.

Y = dataRaw.imdb_score
X = dataRaw.drop(['imdb_score'], axis=1)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25)

A. Random Forest

Our Random Forest algorithm takes randomly selected samples of the data and generates a multitude of decision trees from them; the final prediction aggregates the predictions of the single trees by majority vote. This makes Random Forest an ideal classifier. The movie success predictor seems to be a textbook case of classification, but we must note that among the 28 features, some are not easily separable, so a clear cut cannot easily be achieved by Random Forest models.

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=10)
model = model.fit(x_train, y_train)
feature_imp = pd.Series(model.feature_importances_,
                        index=x_train.columns.values).sort_values(ascending=False)

A fundamental task when performing supervised learning on a dataset is establishing which features offer the most predictive power. By extrapolating the relationship between only a few crucial features and the target label (success, failure), we break our understanding of the phenomenon down into elements we are familiar with. For the dataset under study, our aim is to narrow it down to a couple of features that impact the success rate of a film.

# Create a bar plot of the feature importances
plt.figure(figsize=(10, 10))
sns.barplot(x=feature_imp, y=feature_imp.index)
# Add labels to the graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()
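One detail the listing glosses over: RandomForestClassifier expects discrete class labels, while imdb_score is a continuous value between 0 and 10. The later call to to_categorical(Y, num_classes=5) suggests the scores were first grouped into five bands. The exact band edges are not given in the paper; a plausible binning step, assuming five equal-width bands, would be:

# bin the continuous 0-10 score into five classes (0..4);
# presumably applied before the train/test split above; the edges are our assumption
Y = pd.cut(dataRaw.imdb_score, bins=[0, 2, 4, 6, 8, 10], labels=False, include_lowest=True)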
from sklearn.metrics import accuracy_score  # needed for the accuracy computations below

# predictions
pred_train = model.predict(x_train)
acc_train = accuracy_score(y_train, pred_train)
pred_test = model.predict(x_test)
acc_test = accuracy_score(y_test, pred_test)
print('Train Accuracy: {0} Test Accuracy: {1}'.format(acc_train*100, acc_test*100))

Train Accuracy: 98.60813704496788 Test Accuracy: 76.12419700214133

The Random Forest implementation reaches a training accuracy of over 98%, which might indicate that overfitting has occurred. This is further corroborated by the fact that the accuracy on the test data is only 76.12%: the model appears to have memorized the training set rather than learned patterns that generalize to unseen movies.

B. Neural Networks

The favourite ML algorithm for sophisticated analysis is the Neural Network. Neural Networks are reliable and can be employed both for performing a regression on linear data and for classifying clustered data. Their bio-inspired nature makes them easy to interpret in certain applications, but also computationally expensive. We will gauge both Random Forest and Neural Networks in the light of this movie success predictor project. One last step prior to running the Neural Network algorithm is to render the data compatible with it. First, we standardize all 'X' features to zero mean and unit variance. This is an important part of the normalisation process, which is a prerequisite of ML techniques such as Neural Networks and PCA. Then the 'Y' data, which was represented in one column, undergoes a transformation into five one-hot columns, one per class.

from sklearn.preprocessing import StandardScaler
from keras import optimizers
from keras.models import Sequential
from keras.layers import Dense
from keras.utils.np_utils import to_categorical

sc = StandardScaler()
X = sc.fit_transform(X)               # zero mean, unit variance per feature
Y = to_categorical(Y, num_classes=5)  # one-hot encode the five score classes
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25)

nn_model = Sequential()
nn_model.add(Dense(10, input_dim=39, activation='relu'))  # input_dim only needed on the first layer
nn_model.add(Dense(10, activation='relu'))
nn_model.add(Dense(10, activation='relu'))
nn_model.add(Dense(5, activation='softmax'))
adam = optimizers.Adam(lr=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.05)
# pass the configured optimizer object, not the default 'Adam' string
nn_model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
nn_model.fit(x_train, y_train, epochs=100, batch_size=10)

trainScore, trainAcc = nn_model.evaluate(x_train, y_train)
testScore, testAcc = nn_model.evaluate(x_test, y_test)
print('Train Accuracy: {0} Test Accuracy: {1}'.format(trainAcc*100, testAcc*100))

Train Accuracy: 85.08208422555317 Test Accuracy: 75.2676659528908
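Accuracy alone can hide how the errors distribute across the five score bands. As a quick sanity check (our addition, not part of the original analysis), a confusion matrix over the test set can be inspected; since y_test is one-hot encoded, we take the argmax to recover class indices:

from sklearn.metrics import confusion_matrix

# class probabilities from the softmax output, shape (n_samples, 5)
probs = nn_model.predict(x_test)
# collapse one-hot targets and probability vectors back to class indices
cm = confusion_matrix(y_test.argmax(axis=1), probs.argmax(axis=1))
print(cm)  # rows: true class, columns: predicted class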
The Neural Network model yields a realistic 85% accuracy on the training data and 75.26% on the test data. Compared to the Random Forest, whose test accuracy is 76.12%, this is slightly lower, but Neural Networks are more robust and can thus be trusted on datasets of much larger quantities, which is not the case for Random Forest. The neural network used here relies on a Keras optimizer called Adam, which previous research has shown to be more lenient on computational power, and thus on power consumption. This is something to remember if we ever want to extend this predictor to more disparate features (list of all actors, timing of the movie release with respect to other global events, etc.). The following table summarizes our final results on the accuracy of both ML approaches:

Slice of dataset    Random Forest    Neural Networks
Training            98.60            85.08
Testing             76.12            75.26

IV. CONCLUSION

Random Forest exhibits the highest training accuracy at 98.6%, which is a far cry from its 76% accuracy on test data. Neural Networks have a training accuracy of 85%, which is consistent with their 75% accuracy on test data. The gap on the Random Forest side is a typical issue with this algorithm, because it can easily overfit the data. From all the above, here are some of the key takeaways one can make:
- Estimating the success of a film is not as straightforward as it seems. Success does not correlate intimately with any of the obvious features a movie relies on (genre, country of origin, shooting quality).
- That being said, certain factors have more impact than others: the choice of actor/actress is more decisive than the choice of director, drama movies are likely to be more successful, and movies released during summer have better chances than movies released in other months.

The model built here can be considered minimalistic, and there are a number of additions one could make to improve it. I would suggest:
- Training on a larger dataset. Our study was conducted only on the IMDB dataset, but one could also look at the Rotten Tomatoes and Box Office Mojo datasets. In that case, some of the rows we removed during data cleaning (those with missing “gross” and “budget” values) could probably be completed from the other sources and kept in the final training set.
- Adding as a feature the keywords that make up a movie synopsis. This text describing the plot is one of the first elements an audience consults when choosing which movie to watch, along with the movie poster. The poster, however, is complicated to evaluate as a numerical score, unless we use image segmentation to learn the different components of the poster image and compare posters across movies.
- Inserting other critical features as well, such as the number of theatres screening a particular movie, or the number of previously successful movies from a particular director or actor/actress.

The ML implementation designed in this project can be extended to the Turkish movie industry and help local production studios understand the parameters that can boost the commercial success of their films.

V. REFERENCES
CMPE542 Course Notes
www.imdb.com
www.kaggle.com
www.stackoverflow.com