Building a Movie Success Predictor
Youness Lahdili
TED University
Project Paper for
CMPE542 - Machine Learning
Prof. Venera Adanova
Abstract— Movie making is a multibillion-dollar
industry. In 2018, the global movie business
generated nearly $41.5 billion in box office and even
more in merchandise revenues. But it is not a
guaranteed business: every year we witness big-budget
blockbusters that become either a "hit" or a
"flop". The success of a movie is mainly judged by
the ratio of its gross revenue to its budget, but
some may also call a movie successful if it earned
critical praise and awards, both of which do not necessarily
convert into financial revenue. In this project we take
the point of view of an investor, who largely favours
financial return over any other attribute. To predict
the success of a movie, however, an investor cannot rely on
superficial attributes alone, which is precisely where Machine
Learning (ML) prediction proves useful.
We implement this prediction using two
ML methods studied in the subject
CMPE542, namely Random Forest and Neural
Networks. Both are well suited to discriminating between
classes, and can thus point very effectively
to successful or failed movies after being trained on a
set of 5043 movies whose data were scraped from
IMDB. At the end of the project, we should be able to
determine which method has the highest accuracy, which
movies sell best at the box office and, most
importantly for movie producers, which movie features
are the most decisive in making a movie profitable.
Keywords— Movie Industry, Data Scraping, Machine
Learning, Random Forest, Neural Network
I. INTRODUCTION
A. Overview
More than entertainment, the cinema industry is
becoming vital to the economies of some countries and has
become an indispensable instrument of psychological warfare
and the soft power exerted by some states. It is therefore
imperative to maximize the financial gains from
movies and to keep them as crowd-alluring as possible.
B. Data Extraction and Parsing
To run our ML analysis, we need raw data on movies
that have been judged either successful or failed. To this
end, we turn to IMDB, an online repository of all movies
released to date and even those in the pre-
production phase. We can tap into this database to extract
key information such as the movie budget, gross revenue,
ratings, names of the people taking part in the movie,
year of release, and so on. We use tools
such as BeautifulSoup, which allows us to read data of interest
from HTML webpages and build tabular data out of them;
in this project the result is a .CSV datafile conveniently named
"movie_metadata", ready to be processed by SciKit or
other ML utilities.
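As an illustration only (the paper does not show the scraping pipeline that produced movie_metadata.csv), a minimal BeautifulSoup sketch might look like the following; the URL and the h1 selector are assumptions, not the actual code used.

# Hypothetical scraping sketch; the URL and selector are illustrative,
# not the pipeline that actually produced movie_metadata.csv.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.imdb.com/title/tt0499549/").text
soup = BeautifulSoup(html, "html.parser")
title = soup.find("h1").get_text(strip=True)  # page heading, typically the movie title
print(title)

With the CSV in hand, the analysis proper begins with the following imports: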
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from nltk.util import ngrams
from sklearn.feature_extraction.text import TfidfVectorizer
dataRaw = pd.read_csv("movie_metadata.csv", sep=',')
The dataset is composed of a wide array of attributes on
each film, with separable values and annotations. It has 28
variables for 5043 movies. The dataset we
obtained shows the movie title, the investment placed
to produce the movie, the revenue it earned, and a
lot more about its visual characteristics and the leading
actors/actresses. It should be noted that these
movies originate from 66 countries, but with a clear
prevalence of USA movies.
An important goal of this project is to forecast the
critics' score of a movie using the raw data at our
disposal. It is essential to understand which factors
carry the highest weight in determining the rating of a
movie, so we will present the results in a bar chart so as to
get a better grasp of this analysis.
dataRaw.info()
RangeIndex: 5043 entries, 0 to 5042
Data columns (total 28 columns):
color 5024 non-null object
director_name 4939 non-null object
num_critic_for_reviews 4993 non-null float64
duration 5028 non-null float64
director_facebook_likes 4939 non-null float64
actor_3_facebook_likes 5020 non-null float64
actor_2_name 5030 non-null object
actor_1_facebook_likes 5036 non-null float64
gross 4159 non-null float64
genres 5043 non-null object
actor_1_name 5036 non-null object
movie_title 5043 non-null object
num_voted_users 5043 non-null int64
cast_total_facebook_likes 5043 non-null int64
actor_3_name 5020 non-null object
facenumber_in_poster 5030 non-null float64
plot_keywords 4890 non-null object
movie_imdb_link 5043 non-null object
num_user_for_reviews 5022 non-null float64
language 5031 non-null object
country 5038 non-null object
content_rating 4740 non-null object
budget 4551 non-null float64
title_year 4935 non-null float64
actor_2_facebook_likes 5030 non-null float64
imdb_score 5043 non-null float64
aspect_ratio 4714 non-null float64
movie_facebook_likes 5043 non-null int64
As we can observe from the listing displayed
above, not all features describe all the movies. Most
columns fall short of 5043 entries, which means there is missing data
that would compromise our ML analysis. There is also a
possibility that some rows are redundant. Data
analysts often encounter such mismatches,
which compel them to make up for missing data or
duplicates by standardizing, interpolating, or pruning
their data.
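For example, an interpolation-style fix can be as simple as filling a numeric column with its median; a minimal sketch, assuming pandas and the column names listed above:

# Hypothetical imputation example: fill missing durations with the column median.
dataRaw['duration'] = dataRaw['duration'].fillna(dataRaw['duration'].median())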
We run these commands to get a peek at the number
of redundant rows and the missing values in our dataset.
dataRaw.isnull().sum()
color 19
director_name 104
num_critic_for_reviews 50
duration 15
director_facebook_likes 104
actor_3_facebook_likes 23
actor_2_name 13
actor_1_facebook_likes 7
gross 884
genres 0
actor_1_name 7
movie_title 0
num_voted_users 0
cast_total_facebook_likes 0
actor_3_name 23
facenumber_in_poster 13
plot_keywords 153
movie_imdb_link 0
num_user_for_reviews 21
language 12
country 5
content_rating 303
budget 492
title_year 108
actor_2_facebook_likes 13
imdb_score 0
aspect_ratio 329
movie_facebook_likes 0
dataRaw.duplicated().sum()
45
We see that 45 rows are duplicates and that there is a
fair amount of missing data. It is evident that we can simply erase
the duplicate rows, but it is less obvious how to treat the
missing data: there is a significant amount of it,
and we cannot afford to simply do away with all of it,
lest we risk an underfitted predictor and
lose accuracy that early. Our solution is as follows. We
can easily see that the feature "gross" exhibits the largest
number of missing values at 884. This is
significantly more than the second-most-affected
feature, "budget", at 492, which is not a negligible
number either. Since the missing values in "gross" and
"budget" are so numerous, we resort to
dropping those rows from our dataset altogether, so as to avoid
irregularities and erroneous implications down the
line.
dataRaw = dataRaw.drop_duplicates()
dataRaw = dataRaw.dropna(subset=['gross', 'budget'])
dataRaw.shape
(3857, 28)
After the unwanted data has been discarded, we still end
up with 3857 rows belonging to 28 features, which
is ample for our analysis.
We carried on with cleaning the data further, since
other features are not yet suitable as inputs for our
ML algorithm. Some features like "aspect_ratio"
undergo averaging in order to reduce the intricacy of our
dataset, consolidating highly sparse values
into two or three ranges around one
mean value.
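A minimal sketch of this consolidation step, under the assumption that missing aspect ratios are replaced by the column mean and values are rounded into a few coarse ranges:

# Hypothetical consolidation of the sparse aspect_ratio feature:
# fill gaps with the mean, then round so values collapse into a few ranges.
mean_ar = dataRaw['aspect_ratio'].mean()
dataRaw['aspect_ratio'] = dataRaw['aspect_ratio'].fillna(mean_ar).round(1)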
If we display the "language" column, we
realize that 3644 of the 5043 movies have English as their
language, which suggests that this feature will have little
to no effect on our prediction, so we can go ahead and
eliminate it from our set of features. A similar observation
can be made about the origin of the movies, which is dominated by USA-
made films with a staggering 3025 out of 5043,
far ahead of the UK and France with 316
and 103 respectively; the contribution of other nations is
almost negligible at this scale. This is an opportunity to
reduce the complexity of our data by creating just four
country groups: 'USA', 'UK', 'France' and 'Others'.
To make our data fully interpretable by our ML
algorithms, we need to associate an arbitrary numerical
value with those of our features that are in string form. This
applies to "country", "language" and
"content_rating".
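A minimal sketch of both steps, assuming pandas; the exact codes assigned to each category are arbitrary, as the text says:

# Hypothetical grouping of rare countries, then integer-encoding of string columns.
top_countries = ['USA', 'UK', 'France']
dataRaw['country'] = dataRaw['country'].where(dataRaw['country'].isin(top_countries), 'Others')

for col in ['country', 'language', 'content_rating']:
    dataRaw[col] = dataRaw[col].astype('category').cat.codes  # arbitrary numeric codes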
When we are done with all these preliminary steps of data
cleaning, rearranging and parsing, we obtain a dataset
perfectly suitable for ML processing, and we can then
proceed.
II. DATA VISUALISATION
As one last step before we execute our ML processing on
the final dataset, it is judicious to first understand how the
data are correlated, that is to say, how some features can
outweigh others in making a movie more or less
successful. Although Random Forests and Neural
Networks can seamlessly create linkages between those
features, they cannot identify the semantic
connotation of each feature; rather, they see all features as
equal placeholders with no special meaning. Therefore, we
will make an even better assessment if we help ourselves
by recognizing the key connections that exist.
This can be achieved by visualizing the data. We
begin by plotting how many films have been
produced since the beginning of cinematography.
plt.figure(figsize=(30, 10))
sns.distplot(dataRaw.title_year, kde=False);
There is a sharp increase in movie production
starting from the 80s. This is a direct result of the
cinematographic technical advances that coincided with
this decade, most especially the market being flooded
with VHS cassettes, which popularized home movies.
Another telling connection is the movie score with
regard to genre. The series of plots below
illustrates it, and shows the roughly normal
distribution of scores within each genre.
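The plots below index columns such as Action and Sci-Fi that were not created in the code shown so far; presumably they were derived from the pipe-separated genres field. A hedged sketch of how that could be done:

# Hypothetical derivation of the per-genre indicator columns used below;
# "genres" holds strings such as "Action|Adventure|Sci-Fi".
genre_dummies = dataRaw['genres'].str.get_dummies(sep='|')
dataRaw = pd.concat([dataRaw, genre_dummies], axis=1)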
temp_act = dataRaw.loc[dataRaw.Action == 1][['imdb_score']]
temp_adv = dataRaw.loc[dataRaw.Adventure == 1][['imdb_score']]
temp_fan = dataRaw.loc[dataRaw.Fantasy == 1][['imdb_score']]
temp_sci = dataRaw.loc[dataRaw['Sci-Fi'] == 1][['imdb_score']]
temp_thr = dataRaw.loc[dataRaw.Thriller == 1][['imdb_score']]
temp_rom = dataRaw.loc[dataRaw.Romance == 1][['imdb_score']]
temp_com = dataRaw.loc[dataRaw.Comedy == 1][['imdb_score']]
temp_ani = dataRaw.loc[dataRaw.Animation == 1][['imdb_score']]
temp_fam = dataRaw.loc[dataRaw.Family == 1][['imdb_score']]
temp_hor = dataRaw.loc[dataRaw.Horror == 1][['imdb_score']]
temp_dra = dataRaw.loc[dataRaw.Drama == 1][['imdb_score']]
temp_crime = dataRaw.loc[dataRaw.Crime == 1][['imdb_score']]
sns.set(style="white", palette="muted", color_codes=True)
f, axes = plt.subplots(3, 4, figsize=(20, 20), sharex=True)
sns.despine(left=True)
sns.distplot(temp_act.imdb_score, kde=False, color="blue", ax=axes[0, 0]).set_title('Action Movies')
sns.distplot(temp_adv.imdb_score, kde=False, color="red", ax=axes[0, 1]).set_title('Adventure Movies')
sns.distplot(temp_fan.imdb_score, kde=False, color="green", ax=axes[0, 2]).set_title('Fantasy Movies')
sns.distplot(temp_sci.imdb_score, kde=False, color="orange", ax=axes[0, 3]).set_title('Sci-Fi Movies')
sns.distplot(temp_thr.imdb_score, kde=False, color="blue", ax=axes[1, 0]).set_title('Thriller Movies')
sns.distplot(temp_rom.imdb_score, kde=False, color="red", ax=axes[1, 1]).set_title('Romance Movies')
sns.distplot(temp_com.imdb_score, kde=False, color="green", ax=axes[1, 2]).set_title('Comedy Movies')
sns.distplot(temp_fam.imdb_score, kde=False, color="orange", ax=axes[1, 3]).set_title('Family Movies')
sns.distplot(temp_hor.imdb_score, kde=False, color="blue", ax=axes[2, 0]).set_title('Horror Movies')
sns.distplot(temp_dra.imdb_score, kde=False, color="red", ax=axes[2, 1]).set_title('Drama Movies')
sns.distplot(temp_ani.imdb_score, kde=False, color="green", ax=axes[2, 2]).set_title('Animation Movies')
sns.distplot(temp_crime.imdb_score, kde=False, color="orange", ax=axes[2, 3]).set_title('Crime Movies')
III. IMPLEMENTATION OF THE ML TECHNIQUES
The prediction can now properly start, and we
implement our two ML algorithms as set out in the
introduction. We will compare the
performance of both algorithms and give our evaluation
and interpretation of their usage. In this project, we
build the Random Forest and the Neural Network by
means of the specialized Python frameworks Keras and
TensorFlow. We could also have resorted to SciKit, but we wanted
to explore other tools that are meant for advanced
ML implementation.
At this point, we have to split the data into training and test sets. We
use a hold-out rule that keeps 25% of the data for
testing, while the model is trained on the remaining 75%.
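Note that both classifiers require discrete labels, and the later call to to_categorical(Y, num_classes=5) presupposes that imdb_score has already been binned into five integer classes. The paper does not show this step; a hedged sketch of one plausible binning:

# Hypothetical binning of the continuous IMDb score (0-10) into five classes (0-4);
# the bin edges here are an assumption, not taken from the paper.
dataRaw['imdb_score'] = pd.cut(dataRaw['imdb_score'],
                               bins=[0, 2, 4, 6, 8, 10],
                               labels=[0, 1, 2, 3, 4],
                               include_lowest=True).astype(int)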
Y=dataRaw.imdb_score
X=dataRaw.drop(['imdb_score'], axis=1)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25)
A. Random Forest
Our Random Forest algorithm takes randomly selected
samples of the data and generates a multitude of
decision trees from them. It computes the prediction of
each single tree and aggregates them, typically by majority vote.
Random Forests make for ideal classifiers. The movie
success predictor seems to be a textbook case of
classification, but we must note that among the 28
features, some are not easily separable, and thus a clear cut
cannot easily be achieved by Random Forest models.
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=10)
model = model.fit(x_train, y_train)
feature_imp = pd.Series(model.feature_importances_, index=x_train.columns.values).sort_values(ascending=False)
A fundamental task when performing
supervised learning on a dataset is establishing which
features offer the most predictive power. By focusing on
the relationship between only a few crucial
features and the target label (success, failure), we break
our understanding of the phenomenon down into elements
we are familiar with. For the dataset under
study, our aim is to narrow things down to a handful
of features that impact the success rate of a film.
# Creating a bar plot of feature importances
plt.figure(figsize=(10, 10))
sns.barplot(x=feature_imp, y=feature_imp.index)
# Add labels to the graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()
# predictions
from sklearn.metrics import accuracy_score

pred_train = model.predict(x_train)
rms_train = accuracy_score(y_train, pred_train)
pred_test = model.predict(x_test)
rms_test = accuracy_score(y_test, pred_test)
print('Train Accuracy: {0} Test Accuracy: {1}'.format(rms_train*100, rms_test*100))
Train Accuracy: 98.60813704496788 Test Accuracy: 76.12419700214133
The Random Forest implementation leads to a training
accuracy of over 98%, which may indicate that
overfitting has occurred. This is corroborated
by the fact that the accuracy on the testing data is only
76.12%: the model appears to have memorized the
training data rather than learned patterns that generalize.
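One common mitigation, not explored in this paper, is to regularize the forest by growing more but shallower trees; a minimal sketch with assumed hyperparameters:

# Hypothetical regularized forest: more estimators, bounded depth and leaf size,
# which typically narrows the train/test accuracy gap.
reg_model = RandomForestClassifier(n_estimators=100, max_depth=10, min_samples_leaf=5)
reg_model.fit(x_train, y_train)
print(reg_model.score(x_test, y_test))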
B. Neural Networks
The favourite ML algorithm for sophisticated analysis
is the Neural Network. Neural Networks are reliable and can be
employed both for performing regression on linear data
and for classification of clustered data. Their bio-inspired
nature makes them appealing in certain applications,
but they are also computationally expensive.
We will gauge both Random Forest and Neural
Networks in the light of this movie success predictor
project.
One last step prior to running the Neural Network
algorithm is to render the data compatible with it.
First, we standardize all 'X' features to
zero mean and unit variance. This is an important part of the
normalisation process, which is a prerequisite for ML
techniques such as Neural Networks and PCA. Then the 'Y'
data, which was represented in one column, undergoes a
transformation into five columns, one per class
(one-hot encoding).
from sklearn.preprocessing import StandardScaler
from keras import optimizers
from keras.models import Sequential
from keras.layers import Dense
from keras.utils.np_utils import to_categorical

sc = StandardScaler()
X = sc.fit_transform(X)
Y = to_categorical(Y, num_classes=5)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25)
nn_model = Sequential()
nn_model.add(Dense(10, input_dim=39, activation='relu'))
nn_model.add(Dense(10, activation='relu'))
nn_model.add(Dense(10, activation='relu'))
nn_model.add(Dense(5, activation='softmax'))
adam = optimizers.Adam(lr=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.05)
nn_model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
nn_model.fit(x_train, y_train, epochs=100, batch_size=10)
trainScore, trainAcc = nn_model.evaluate(x_train, y_train)
testScore, testAcc = nn_model.evaluate(x_test, y_test)
print('Train Accuracy: {0} Test Accuracy: {1}'.format(trainAcc*100, testAcc*100))
Train Accuracy: 85.08208422555317 Test Accuracy: 75.2676659528908
The Neural Network model yielded a more realistic 85%
accuracy during training, and a 75.26% accuracy on
the testing data. Compared to the Random Forest algorithm,
which boasts 76.12%, this is slightly
lower, but Neural Networks are more robust and can thus
be trusted on very large datasets, which is not the
case for Random Forest. The neural network used here
relies on a Keras optimizer called Adam, which
previous research has shown to be lighter on
computational power, and thus on power consumption.
This is something to remember if we ever want to
extend this predictor to more disparate features (list
of all actors, timing of the movie release with respect to other
global events, etc.).
The following table summarizes our final results on the
accuracy of both ML approaches:

Accuracy (%)    Random Forest    Neural Networks
Training        98.60            85.08
Testing         76.12            75.26
IV. CONCLUSION
Random Forest exhibits the highest accuracy on training
data (98.6%), which is a far cry from its 76% accuracy on testing
data. Neural Networks have a relatively high training accuracy
of 85%, which is more consistent with their 75% testing
accuracy. This gap is a typical issue with random
forests, which can easily overfit the data.
From all the above, here we list some of the key
takeaways:
- Estimating the success of a film is not as
straightforward as it seems. It does not correlate intimately
with any of the obvious features a movie relies on
(genre, country of origin, shooting quality).
- That being said, certain factors have
more impact than others: the choice of actor/actress is more
decisive than the director's, drama movies are likely
to be more successful, and movies released during the summer
have better chances than movies released in regular
months.
The model built here can be called a minimalistic
model, and there are a number of additions one could
make to improve it. I would suggest:
- Training on a larger dataset. Our study was conducted
only on an IMDB dataset, but one could also look at Rotten
Tomatoes and Box Office Mojo datasets, in which case
some of the rows we removed here during data cleaning
(for missing "gross" and "budget" values) would probably still make
it into the final training set.
- Adding as a feature the keywords that make up a movie
synopsis. This text describing the plot is one of the first
elements an audience consults when choosing which movie to
go watch, along with the movie poster. The poster, however, is
complicated to evaluate as a numerical score, unless we
use image segmentation and learn the different
components of the poster image, in order to compare
posters across movies.
- Other critical features could be inserted as well, such as
the number of theatres screening a particular movie, or the
number of previously successful movies from a particular
director or actor/actress.
The ML implementation designed in this
project could be extended to the Turkish movie industry to
help local production studios understand the parameters
that can boost the commercial success of their films.
V. REFERENCES
CMPE542 Course Notes
www.imdb.com
www.kaggle.com
www.stackoverflow.com
More Related Content

Similar to Building a Movie Success Predictor

Using_The_Predictive_Analytics_For_Effective_Cross_Selling
Using_The_Predictive_Analytics_For_Effective_Cross_SellingUsing_The_Predictive_Analytics_For_Effective_Cross_Selling
Using_The_Predictive_Analytics_For_Effective_Cross_SellingSunil Kakade
 
A Holistic Approach to Property Valuations
A Holistic Approach to Property ValuationsA Holistic Approach to Property Valuations
A Holistic Approach to Property Valuations
Cognizant
 
Explaining the Explainability: ‘Why’ and ‘How’ of Explainability in Research
Explaining the Explainability: ‘Why’ and ‘How’ of Explainability  in ResearchExplaining the Explainability: ‘Why’ and ‘How’ of Explainability  in Research
Explaining the Explainability: ‘Why’ and ‘How’ of Explainability in Research
Melih Bahar
 
R markup code to create Regression Model
R markup code to create Regression ModelR markup code to create Regression Model
R markup code to create Regression Model
Mohit Rajput
 
RETRIEVING FUNDAMENTAL VALUES OF EQUITY
RETRIEVING FUNDAMENTAL VALUES OF EQUITYRETRIEVING FUNDAMENTAL VALUES OF EQUITY
RETRIEVING FUNDAMENTAL VALUES OF EQUITY
IRJET Journal
 
1 Exploratory Data Analysis (EDA) by Melvin Ott, PhD.docx
1 Exploratory Data Analysis (EDA) by Melvin Ott, PhD.docx1 Exploratory Data Analysis (EDA) by Melvin Ott, PhD.docx
1 Exploratory Data Analysis (EDA) by Melvin Ott, PhD.docx
honey725342
 
Explainable AI
Explainable AIExplainable AI
Explainable AI
Dinesh V
 
BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...
BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...
BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...
Big Data Week
 
Module 9: Natural Language Processing Part 2
Module 9:  Natural Language Processing Part 2Module 9:  Natural Language Processing Part 2
Module 9: Natural Language Processing Part 2
Sara Hooker
 
IRJET- Movie Success Prediction using Data Mining and Social Media
IRJET- Movie Success Prediction using Data Mining and Social MediaIRJET- Movie Success Prediction using Data Mining and Social Media
IRJET- Movie Success Prediction using Data Mining and Social Media
IRJET Journal
 
MOVIE SUCCESS PREDICTION AND PERFORMANCE COMPARISON USING VARIOUS STATISTICAL...
MOVIE SUCCESS PREDICTION AND PERFORMANCE COMPARISON USING VARIOUS STATISTICAL...MOVIE SUCCESS PREDICTION AND PERFORMANCE COMPARISON USING VARIOUS STATISTICAL...
MOVIE SUCCESS PREDICTION AND PERFORMANCE COMPARISON USING VARIOUS STATISTICAL...
ijaia
 
ML Drift - How to find issues before they become problems
ML Drift - How to find issues before they become problemsML Drift - How to find issues before they become problems
ML Drift - How to find issues before they become problems
Amy Hodler
 
IST365 - Project Deliverable #3Create the corresponding relation.docx
IST365 - Project Deliverable #3Create the corresponding relation.docxIST365 - Project Deliverable #3Create the corresponding relation.docx
IST365 - Project Deliverable #3Create the corresponding relation.docx
priestmanmable
 
MLSEV. Anatomy of an ML Application
MLSEV. Anatomy of an ML ApplicationMLSEV. Anatomy of an ML Application
MLSEV. Anatomy of an ML Application
BigML, Inc
 
How to use LLMs in synthesizing training data?
How to use LLMs in synthesizing training data?How to use LLMs in synthesizing training data?
How to use LLMs in synthesizing training data?
Benjaminlapid1
 
Top 5 Cloud Analytics Platforms Of 2024 (1).pdf
Top 5 Cloud Analytics Platforms Of 2024 (1).pdfTop 5 Cloud Analytics Platforms Of 2024 (1).pdf
Top 5 Cloud Analytics Platforms Of 2024 (1).pdf
SophiaJohnson39
 
Prospection_Business_Intelligence[1]
Prospection_Business_Intelligence[1]Prospection_Business_Intelligence[1]
Prospection_Business_Intelligence[1]Tuong Do, MBA
 
From DBA to DE: Becoming a Data Engineer
From DBA to DE:  Becoming a Data Engineer From DBA to DE:  Becoming a Data Engineer
From DBA to DE: Becoming a Data Engineer
Jim Czuprynski
 
Chapter 11Data Visualization and Geographic Information System.docx
Chapter 11Data Visualization and Geographic Information System.docxChapter 11Data Visualization and Geographic Information System.docx
Chapter 11Data Visualization and Geographic Information System.docx
cravennichole326
 
Chapter 11Data Visualization and Geographic Information System.docx
Chapter 11Data Visualization and Geographic Information System.docxChapter 11Data Visualization and Geographic Information System.docx
Chapter 11Data Visualization and Geographic Information System.docx
keturahhazelhurst
 

Similar to Building a Movie Success Predictor (20)

Using_The_Predictive_Analytics_For_Effective_Cross_Selling
Using_The_Predictive_Analytics_For_Effective_Cross_SellingUsing_The_Predictive_Analytics_For_Effective_Cross_Selling
Using_The_Predictive_Analytics_For_Effective_Cross_Selling
 
A Holistic Approach to Property Valuations
A Holistic Approach to Property ValuationsA Holistic Approach to Property Valuations
A Holistic Approach to Property Valuations
 
Explaining the Explainability: ‘Why’ and ‘How’ of Explainability in Research
Explaining the Explainability: ‘Why’ and ‘How’ of Explainability  in ResearchExplaining the Explainability: ‘Why’ and ‘How’ of Explainability  in Research
Explaining the Explainability: ‘Why’ and ‘How’ of Explainability in Research
 
R markup code to create Regression Model
R markup code to create Regression ModelR markup code to create Regression Model
R markup code to create Regression Model
 
RETRIEVING FUNDAMENTAL VALUES OF EQUITY
RETRIEVING FUNDAMENTAL VALUES OF EQUITYRETRIEVING FUNDAMENTAL VALUES OF EQUITY
RETRIEVING FUNDAMENTAL VALUES OF EQUITY
 
1 Exploratory Data Analysis (EDA) by Melvin Ott, PhD.docx
1 Exploratory Data Analysis (EDA) by Melvin Ott, PhD.docx1 Exploratory Data Analysis (EDA) by Melvin Ott, PhD.docx
1 Exploratory Data Analysis (EDA) by Melvin Ott, PhD.docx
 
Explainable AI
Explainable AIExplainable AI
Explainable AI
 
BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...
BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...
BDW16 London - Scott Krueger, skyscanner - Does More Data Mean Better Decisio...
 
Module 9: Natural Language Processing Part 2
Module 9:  Natural Language Processing Part 2Module 9:  Natural Language Processing Part 2
Module 9: Natural Language Processing Part 2
 
IRJET- Movie Success Prediction using Data Mining and Social Media
IRJET- Movie Success Prediction using Data Mining and Social MediaIRJET- Movie Success Prediction using Data Mining and Social Media
IRJET- Movie Success Prediction using Data Mining and Social Media
 
MOVIE SUCCESS PREDICTION AND PERFORMANCE COMPARISON USING VARIOUS STATISTICAL...
MOVIE SUCCESS PREDICTION AND PERFORMANCE COMPARISON USING VARIOUS STATISTICAL...MOVIE SUCCESS PREDICTION AND PERFORMANCE COMPARISON USING VARIOUS STATISTICAL...
MOVIE SUCCESS PREDICTION AND PERFORMANCE COMPARISON USING VARIOUS STATISTICAL...
 
ML Drift - How to find issues before they become problems
ML Drift - How to find issues before they become problemsML Drift - How to find issues before they become problems
ML Drift - How to find issues before they become problems
 
IST365 - Project Deliverable #3Create the corresponding relation.docx
IST365 - Project Deliverable #3Create the corresponding relation.docxIST365 - Project Deliverable #3Create the corresponding relation.docx
IST365 - Project Deliverable #3Create the corresponding relation.docx
 
MLSEV. Anatomy of an ML Application
MLSEV. Anatomy of an ML ApplicationMLSEV. Anatomy of an ML Application
MLSEV. Anatomy of an ML Application
 
How to use LLMs in synthesizing training data?
How to use LLMs in synthesizing training data?How to use LLMs in synthesizing training data?
How to use LLMs in synthesizing training data?
 
Top 5 Cloud Analytics Platforms Of 2024 (1).pdf
Top 5 Cloud Analytics Platforms Of 2024 (1).pdfTop 5 Cloud Analytics Platforms Of 2024 (1).pdf
Top 5 Cloud Analytics Platforms Of 2024 (1).pdf
 
Prospection_Business_Intelligence[1]
Prospection_Business_Intelligence[1]Prospection_Business_Intelligence[1]
Prospection_Business_Intelligence[1]
 
From DBA to DE: Becoming a Data Engineer
From DBA to DE:  Becoming a Data Engineer From DBA to DE:  Becoming a Data Engineer
From DBA to DE: Becoming a Data Engineer
 
Chapter 11Data Visualization and Geographic Information System.docx
Chapter 11Data Visualization and Geographic Information System.docxChapter 11Data Visualization and Geographic Information System.docx
Chapter 11Data Visualization and Geographic Information System.docx
 
Chapter 11Data Visualization and Geographic Information System.docx
Chapter 11Data Visualization and Geographic Information System.docxChapter 11Data Visualization and Geographic Information System.docx
Chapter 11Data Visualization and Geographic Information System.docx
 

More from Youness Lahdili

7 [single-page slide] - My attempt at understanding Augmented Reality
7 [single-page slide] - My attempt at understanding Augmented Reality7 [single-page slide] - My attempt at understanding Augmented Reality
7 [single-page slide] - My attempt at understanding Augmented Reality
Youness Lahdili
 
6 [single-page slide] - Conception of an Autonomous UAV using Stereo Vision
6 [single-page slide] - Conception of an Autonomous UAV using Stereo Vision6 [single-page slide] - Conception of an Autonomous UAV using Stereo Vision
6 [single-page slide] - Conception of an Autonomous UAV using Stereo Vision
Youness Lahdili
 
6 [progress report] for this leisurely side-project I was doing in 2016
6 [progress report] for this leisurely side-project I was doing in 20166 [progress report] for this leisurely side-project I was doing in 2016
6 [progress report] for this leisurely side-project I was doing in 2016
Youness Lahdili
 
6 - Conception of an Autonomous UAV using Stereo Vision (presented in an Indo...
6 - Conception of an Autonomous UAV using Stereo Vision (presented in an Indo...6 - Conception of an Autonomous UAV using Stereo Vision (presented in an Indo...
6 - Conception of an Autonomous UAV using Stereo Vision (presented in an Indo...
Youness Lahdili
 
5 - Anthology on the Ethical Issues in Engineering Practice (presented in a M...
5 - Anthology on the Ethical Issues in Engineering Practice (presented in a M...5 - Anthology on the Ethical Issues in Engineering Practice (presented in a M...
5 - Anthology on the Ethical Issues in Engineering Practice (presented in a M...
Youness Lahdili
 
4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...
4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...
4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...
Youness Lahdili
 
3 - A critical review on the usual DCT Implementations (presented in a Malays...
3 - A critical review on the usual DCT Implementations (presented in a Malays...3 - A critical review on the usual DCT Implementations (presented in a Malays...
3 - A critical review on the usual DCT Implementations (presented in a Malays...
Youness Lahdili
 
4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...
4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...
4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...
Youness Lahdili
 
2 - Generation of PSK signal using non linear devices via MATLAB (presented i...
2 - Generation of PSK signal using non linear devices via MATLAB (presented i...2 - Generation of PSK signal using non linear devices via MATLAB (presented i...
2 - Generation of PSK signal using non linear devices via MATLAB (presented i...
Youness Lahdili
 
1 [single-page slide] - My concept of project presented for NI GSDA Award
1 [single-page slide] - My concept of project presented for NI GSDA Award1 [single-page slide] - My concept of project presented for NI GSDA Award
1 [single-page slide] - My concept of project presented for NI GSDA Award
Youness Lahdili
 
1 - My concept of project presented for NI GSDA Award (selected as one of 8 f...
1 - My concept of project presented for NI GSDA Award (selected as one of 8 f...1 - My concept of project presented for NI GSDA Award (selected as one of 8 f...
1 - My concept of project presented for NI GSDA Award (selected as one of 8 f...
Youness Lahdili
 

More from Youness Lahdili (11)

7 [single-page slide] - My attempt at understanding Augmented Reality
7 [single-page slide] - My attempt at understanding Augmented Reality7 [single-page slide] - My attempt at understanding Augmented Reality
7 [single-page slide] - My attempt at understanding Augmented Reality
 
6 [single-page slide] - Conception of an Autonomous UAV using Stereo Vision
6 [single-page slide] - Conception of an Autonomous UAV using Stereo Vision6 [single-page slide] - Conception of an Autonomous UAV using Stereo Vision
6 [single-page slide] - Conception of an Autonomous UAV using Stereo Vision
 
6 [progress report] for this leisurely side-project I was doing in 2016
6 [progress report] for this leisurely side-project I was doing in 20166 [progress report] for this leisurely side-project I was doing in 2016
6 [progress report] for this leisurely side-project I was doing in 2016
 
6 - Conception of an Autonomous UAV using Stereo Vision (presented in an Indo...
6 - Conception of an Autonomous UAV using Stereo Vision (presented in an Indo...6 - Conception of an Autonomous UAV using Stereo Vision (presented in an Indo...
6 - Conception of an Autonomous UAV using Stereo Vision (presented in an Indo...
 
5 - Anthology on the Ethical Issues in Engineering Practice (presented in a M...
5 - Anthology on the Ethical Issues in Engineering Practice (presented in a M...5 - Anthology on the Ethical Issues in Engineering Practice (presented in a M...
5 - Anthology on the Ethical Issues in Engineering Practice (presented in a M...
 
4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...
4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...
4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...
 
3 - A critical review on the usual DCT Implementations (presented in a Malays...
3 - A critical review on the usual DCT Implementations (presented in a Malays...3 - A critical review on the usual DCT Implementations (presented in a Malays...
3 - A critical review on the usual DCT Implementations (presented in a Malays...
 
4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...
4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...
4 - Simulation and analysis of different DCT techniques on MATLAB (presented ...
 
2 - Generation of PSK signal using non linear devices via MATLAB (presented i...
2 - Generation of PSK signal using non linear devices via MATLAB (presented i...2 - Generation of PSK signal using non linear devices via MATLAB (presented i...
2 - Generation of PSK signal using non linear devices via MATLAB (presented i...
 
1 [single-page slide] - My concept of project presented for NI GSDA Award
1 [single-page slide] - My concept of project presented for NI GSDA Award1 [single-page slide] - My concept of project presented for NI GSDA Award
1 [single-page slide] - My concept of project presented for NI GSDA Award
 
1 - My concept of project presented for NI GSDA Award (selected as one of 8 f...
1 - My concept of project presented for NI GSDA Award (selected as one of 8 f...1 - My concept of project presented for NI GSDA Award (selected as one of 8 f...
1 - My concept of project presented for NI GSDA Award (selected as one of 8 f...
 

Recently uploaded

SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
UiPathCommunity
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.
ViralQR
 

Recently uploaded (20)

SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Assure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyesAssure Contact Center Experiences for Your Customers With ThousandEyes
Assure Contact Center Experiences for Your Customers With ThousandEyes
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.Welocme to ViralQR, your best QR code generator.
Welocme to ViralQR, your best QR code generator.
 

Building a Movie Success Predictor

  • 1. Building a Movie Success Predictor Youness Lahdili TED University Project Paper for CMPE542 - Machine Learning Prof. Venera Adanova Abstract— The movie making is a multibillion-dollar industry. In 2018, the global movie business has generated nearly $41.5 billion in box office and more than that in merchandise revenues. But it is not a guaranteed business: every year we witness big buster and budget movies that become either a “hit” or a “flop”. The success of a movie is mainly judged by looking at ratio of its gross revenue over its budget, but some may also call a movie successful if it bagged critics praise and awards, both of which do not necessarily convert to financial revenue. In our project we look from an investor point of view, who largely favour financial return over any other attribute. But to predict the success of a movie, an investor can’t only rely on superficial attributes, a typical reason why Machine Learning (ML) prediction will prove to be very useful. We are going to implement this prediction using two ML methods that we have studied during the subject CMPE542, namely Random Forest and Neural Network. These are very adapted for discriminating classes, and can thus help us very effectively in pointing to successful or failed movies after being trained on a set of 5043 movies which data have been scraped from IMDB. At the end of the project, we should be able to know which method has the highest accuracy, what movies sell the best at the box office and most importantly for movies producers, what movie features are the most decisive in making a movie profitable. Keywords— Movie Industry, Data Scraping, Machine Learning, Random Forest, Neural Network I. INTRODUCTION A. Overview More than entertainment, the cinema industry is becoming vital to economies of some countries and has became an indispensable weapon in psychological war and the soft power exerted by some countries. So it is imperative to be able to maximize the financial gains from movies, and keep movies as crowd-alluring as possible. B. Data Extraction and Parsing To run our ML analysis, we need raw data on all movies that have been judged either successful of failed. For this end, we turn to IMDB, an online repository of all movies that have been released to date and even those in pre- production phase. We can tap into this database to extract key information about the movie budget, gross revenue, ratings, names of people who are taking part in the movie, year of release, and so on. We will have to use some tools such as BeautifulSoup which allow to read data of interest from HTML webpages and create tabular data out of it, in this project it is a .CSV datafile conveniently named “movie_metadata” and ready to be treated by SciKit or other ML utilitaries. import numpy as np import pandas as pd import seaborn as sns import matplotlib.pyplot as plt %matplotlib inline from sklearn.model_selection import train_test_split from nltk.corpus import stopwords from nltk.util import ngrams from sklearn.feature_extraction.text import TfidfVectorizer dataRaw = pd.read_csv(movie_metadata.csv",sep=',')
  • 2. The dataset is composed of a wide array of attributes on all films, with separable values and annotations. It has 28 variables belonging to 5043 movies. The dataset we obtained show the movie titles, investment that was placed to produce the movie, the revenue that was earned, and a lot more regarding the visual characteristics, the leading actors/actress in the movie. It should be notes that those movies originate from 66 countries, but with a clear prevalence of USA movies. An important goal of this project is to forecast the critics’ score of a movie using the raw data have at our disposal. It is very essential to understand which factors have the highest weight in determining the rating of a movie. So we will present the results in a bar chart so to have a better grasp of this analysis. dataRaw.info() RangeIndex: 5043 entries, 0 to 5042 Data columns (total 28 columns): color 5024 non-null object director_name 4939 non-null object num_critic_for_reviews 4993 non-null float64 duration 5028 non-null float64 director_facebook_likes 4939 non-null float64 actor_3_facebook_likes 5020 non-null float64 actor_2_name 5030 non-null object actor_1_facebook_likes 5036 non-null float64 gross 4159 non-null float64 genres 5043 non-null object actor_1_name 5036 non-null object movie_title 5043 non-null object num_voted_users 5043 non-null int64 cast_total_facebook_likes 5043 non-null int64 actor_3_name 5020 non-null object facenumber_in_poster 5030 non-null float64 plot_keywords 4890 non-null object movie_imdb_link 5043 non-null object num_user_for_reviews 5022 non-null float64 language 5031 non-null object country 5038 non-null object content_rating 4740 non-null object budget 4551 non-null float64 title_year 4935 non-null float64 actor_2_facebook_likes 5030 non-null float64 imdb_score 5043 non-null float64 aspect_ratio 4714 non-null float64 movie_facebook_likes 5043 non-null int64 As we can observe from the listing of data displayed above, not all features are describe all the movies. Most column are short 5043, which means there is missing data that will compromise our ML analysis. There is a possibility that some values are actually redundant. Data analysist often encounter such situation of data mismatch, which compels them to make up for this missing data or duplicates, by either standardizing, interpolating, pruning or panning their data. We run these commands to have a peek at the number of redundant data and the missing value in our dataset. dataRaw.isnull().sum() color 19 director_name 104 num_critic_for_reviews 50 duration 15 director_facebook_likes 104 actor_3_facebook_likes 23 actor_2_name 13 actor_1_facebook_likes 7 gross 884
  • 3. genres 0 actor_1_name 7 movie_title 0 num_voted_users 0 cast_total_facebook_likes 0 actor_3_name 23 facenumber_in_poster 13 plot_keywords 153 movie_imdb_link 0 num_user_for_reviews 21 language 12 country 5 content_rating 303 budget 492 title_year 108 actor_2_facebook_likes 13 imdb_score 0 aspect_ratio 329 movie_facebook_likes 0 dataRaw.duplicated().sum() 45 We realize that 45 data are duplicates and that there is a handful of missing data. It is evident that we can just erase the duplicate data, but it is not as obvious how to treat the missing data, since there exist a significant number of them and we cannot just afford to just do away with them, lest we run the risk of having un underfitted predictor, and loose its accuracy that early. Our solution is as follow: we can easily see that the feature “gross” exhibit the largest number of missing data at a number of 884. This is significantly larger than the second most feature with missing data “budget” at 492, which is not a negligible number either. Since the void values in “gross” and “budget” are considerably big, we will just resort to ablating those rows from our dataset altogether so to avoid any irregularity down the line, and causing erroneous implications. dataRaw = dataRaw.drop_duplicates() dataRaw = dataRaw.dropna(subset=['gross', 'budget']) dataRaw.shape (3857, 28) After the unwanted data is been discarded, we still end up with 3857 rows of data belonging to 28 features which is amply enough to go ahead with our analysis. We carried on with cleaning the data even further, since other features are still not yet adapted to be inputs for our ML algorithm. Some features like “aspect ratio” will undergo averaging in order to reduce the intricacy of our datasets, and bring a highly sparse data to become consolidated into two or three ranges of data under one mean value. If we were to display the “language” column, we will realize that 3644/5043 of the movies have English as their language, which suggests that this feature will have little to no effect in our prediction, and so we can go ahead and eliminate it from our set of features, like we did with “gross” and “budget”. A similar observation can be as to the origin of the movies, which is dominated by USA made films with a staggering number of 3025 out of 5043 largely in front of UK and France respectively with 316 and 103 respectively, and other nations contribution is almost unaccountable at this scale. This is a opportunity to reduce the complexity of our data, and create just four “countries” group: 'USA', 'UK', 'France' and 'Others'. To make our data fully interpretable by our ML algorithms, we need to associate a numerical arbitrary value to some of our data that are in string form. This applies to “country”, “language” and “content_rating”. When we are done with all this preliminary steps of data cleaning, rearranging and parsing, we get a dataset perfectly suitable for ML processing, and we can then proceed.
  • 4. II. DATA VISUALISATION One last step before we execute our ML processing over the final dataset, it is judicious to understand first how the data is correlated, that is to say how some features can outweigh others in making a movie more or less successful. Despite the fact that Random Trees and Neural Networks can seamlessly create linkages between those features, it will not be able to identify the semantic connotation of each feature, but rather sees all features as equal placeholder with no special meaning. Therefor, we will make an even better assessment if we help ourselves by recognizing the different key connections that exist. This can be achieved by visualization of the data. We shall begin by first laying down how many films have been produced since the beginning of cinematography. plt.figure(figsize=(30, 10)) sns.distplot(dataRaw.title_year, kde=False); There is a clear sharp increase of movie productions starting from the 80s. This is a direct result of the cinematographic technical advance that coincided with this decade, and most specially the market being flooded with VHS cassettes which popularized home movies. Another foretelling connection is the movie score with regard to the genre. Here we can see series of plots that best illustrate that, and clearly show the normal distribution of this connection. temp_act = dataRaw.loc[dataRaw.Action == 1][['imdb_score']] temp_adv = dataRaw.loc[dataRaw.Adventure == 1][['imdb_score']] temp_fan = dataRaw.loc[dataRaw.Fantasy == 1][['imdb_score']] temp_sci = dataRaw.loc[dataRaw['Sci-Fi'] == 1][['imdb_score']] temp_thr = dataRaw.loc[dataRaw.Thriller == 1][['imdb_score']] temp_rom = dataRaw.loc[dataRaw.Romance == 1][['imdb_score']] temp_com = dataRaw.loc[dataRaw.Comedy == 1][['imdb_score']] temp_ani = dataRaw.loc[dataRaw.Animation == 1][['imdb_score']] temp_fam = dataRaw.loc[dataRaw.Family == 1][['imdb_score']] temp_hor = dataRaw.loc[dataRaw.Horror == 1][['imdb_score']] temp_dra = dataRaw.loc[dataRaw.Drama == 1][['imdb_score']] temp_crime = dataRaw.loc[dataRaw.Crime == 1][['imdb_score'] sns.set(style="white", palette="muted", color_codes=True) f, axes = plt.subplots(3, 4, figsize=(20, 20), sharex=True) sns.despine(left=True) sns.distplot(temp_act.imdb_score, kde=False, color="blue", ax=axes[0, 0]).set_title('Action Movies') sns.distplot(temp_adv.imdb_score, kde=False, color="red", ax=axes[0, 1]).set_title('Adventur e Movies') sns.distplot(temp_fan.imdb_score, kde=False, color="green", ax=axes[0, 2]).set_title('Fantas y Movies') sns.distplot(temp_sci.imdb_score, kde=False, color="orange", ax=axes[0, 3]).set_title('Sci-F i Movies')
sns.distplot(temp_thr.imdb_score, kde=False, color="blue", ax=axes[1, 0]).set_title('Thriller Movies')
sns.distplot(temp_rom.imdb_score, kde=False, color="red", ax=axes[1, 1]).set_title('Romance Movies')
sns.distplot(temp_com.imdb_score, kde=False, color="green", ax=axes[1, 2]).set_title('Comedy Movies')
sns.distplot(temp_fam.imdb_score, kde=False, color="orange", ax=axes[1, 3]).set_title('Family Movies')
sns.distplot(temp_hor.imdb_score, kde=False, color="blue", ax=axes[2, 0]).set_title('Horror Movies')
sns.distplot(temp_dra.imdb_score, kde=False, color="red", ax=axes[2, 1]).set_title('Drama Movies')
sns.distplot(temp_ani.imdb_score, kde=False, color="green", ax=axes[2, 2]).set_title('Animation Movies')
sns.distplot(temp_crime.imdb_score, kde=False, color="orange", ax=axes[2, 3]).set_title('Crime Movies')
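Note that the genre selections above index columns like dataRaw.Action, which presupposes that the pipe-separated “genres” string was expanded into one 0/1 indicator column per genre earlier in the pipeline. That expansion is not shown in the paper; a minimal sketch of it, assuming the raw IMDB 'Action|Adventure|Fantasy' format, would be:

# one indicator column per genre, e.g. dataRaw.Action, dataRaw['Sci-Fi'], ...
genre_dummies = dataRaw['genres'].str.get_dummies(sep='|')
dataRaw = pd.concat([dataRaw.drop('genres', axis=1), genre_dummies], axis=1)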
III. IMPLEMENTATION OF THE ML TECHNIQUES

The prediction can now properly start, and we can implement our two ML algorithms as announced in the introduction. We will compare the performance of both algorithms and give our evaluation and interpretation of their usage. In this project, we build the Neural Network by means of the specialized Python frameworks Keras and TensorFlow, while the Random Forest relies on scikit-learn. We could have used scikit-learn throughout, but we wanted to explore tools that are meant for more advanced ML implementations. At this point, we have to split the data into training and test sets. We use a hold-out rule that keeps 25% of the data for testing, while the model is trained on the remaining 75%.

Y = dataRaw.imdb_score
X = dataRaw.drop(['imdb_score'], axis=1)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25)

A. Random Forest

Our Random Forest algorithm takes randomly selected samples of the data and generates a multitude of decision trees from them; the final prediction aggregates the predictions of the single trees by majority vote. This makes Random Forest an ideal classifier. The movie success predictor seems to be a textbook case of classification, but we must note that among the 28 features, some are not easily separable, so a clear cut cannot easily be achieved by Random Forest models.

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=10)
model = model.fit(x_train, y_train)
feature_imp = pd.Series(model.feature_importances_,
                        index=x_train.columns.values).sort_values(ascending=False)

A fundamental task when performing supervised learning on a dataset is establishing which features offer the most predictive power. By extrapolating the relationship between only a few crucial features and the target label (success, failure), we break our understanding of the phenomenon down into elements we are familiar with. For the dataset under study, our aim is to narrow it down to a couple of features that impact the success rate of a film.

# Create a bar plot of the feature importances
plt.figure(figsize=(10, 10))
sns.barplot(x=feature_imp, y=feature_imp.index)
# Add labels to the graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()
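One detail the listing glosses over: RandomForestClassifier expects discrete class labels, while imdb_score is a continuous value between 0 and 10. The later call to to_categorical(Y, num_classes=5) suggests the scores were first grouped into five bands. The exact band edges are not given in the paper; a plausible binning step, assuming five equal-width bands, would be:

# bin the continuous 0-10 score into five classes (0..4);
# presumably applied before the train/test split above; the edges are our assumption
Y = pd.cut(dataRaw.imdb_score, bins=[0, 2, 4, 6, 8, 10], labels=False, include_lowest=True)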
from sklearn.metrics import accuracy_score  # needed for the accuracy computations below

# predictions
pred_train = model.predict(x_train)
acc_train = accuracy_score(y_train, pred_train)
pred_test = model.predict(x_test)
acc_test = accuracy_score(y_test, pred_test)
print('Train Accuracy: {0} Test Accuracy: {1}'.format(acc_train*100, acc_test*100))

Train Accuracy: 98.60813704496788 Test Accuracy: 76.12419700214133

The Random Forest implementation reaches a training accuracy of over 98%, which might indicate that overfitting has occurred. This is further corroborated by the fact that the accuracy on the test data is only 76.12%: the model appears to have memorized the training set rather than learned patterns that generalize to unseen movies.

B. Neural Networks

The favourite ML algorithm for sophisticated analysis is the Neural Network. Neural Networks are reliable and can be employed both for performing a regression on linear data and for classifying clustered data. Their bio-inspired nature makes them easy to interpret in certain applications, but also computationally expensive. We will gauge both Random Forest and Neural Networks in the light of this movie success predictor project. One last step prior to running the Neural Network algorithm is to render the data compatible with it. First, we standardize all 'X' features to zero mean and unit variance. This is an important part of the normalisation process, which is a prerequisite of ML techniques such as Neural Networks and PCA. Then the 'Y' data, which was represented in one column, undergoes a transformation into five one-hot columns, one per class.

from sklearn.preprocessing import StandardScaler
from keras import optimizers
from keras.models import Sequential
from keras.layers import Dense
from keras.utils.np_utils import to_categorical

sc = StandardScaler()
X = sc.fit_transform(X)               # zero mean, unit variance per feature
Y = to_categorical(Y, num_classes=5)  # one-hot encode the five score classes
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25)

nn_model = Sequential()
nn_model.add(Dense(10, input_dim=39, activation='relu'))  # input_dim only needed on the first layer
nn_model.add(Dense(10, activation='relu'))
nn_model.add(Dense(10, activation='relu'))
nn_model.add(Dense(5, activation='softmax'))
adam = optimizers.Adam(lr=0.01, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.05)
# pass the configured optimizer object, not the default 'Adam' string
nn_model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
nn_model.fit(x_train, y_train, epochs=100, batch_size=10)

trainScore, trainAcc = nn_model.evaluate(x_train, y_train)
testScore, testAcc = nn_model.evaluate(x_test, y_test)
print('Train Accuracy: {0} Test Accuracy: {1}'.format(trainAcc*100, testAcc*100))

Train Accuracy: 85.08208422555317 Test Accuracy: 75.2676659528908
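Accuracy alone can hide how the errors distribute across the five score bands. As a quick sanity check (our addition, not part of the original analysis), a confusion matrix over the test set can be inspected; since y_test is one-hot encoded, we take the argmax to recover class indices:

from sklearn.metrics import confusion_matrix

# class probabilities from the softmax output, shape (n_samples, 5)
probs = nn_model.predict(x_test)
# collapse one-hot targets and probability vectors back to class indices
cm = confusion_matrix(y_test.argmax(axis=1), probs.argmax(axis=1))
print(cm)  # rows: true class, columns: predicted class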
The Neural Network model yields a realistic 85% accuracy on the training data and 75.26% on the test data. Compared to the Random Forest, whose test accuracy is 76.12%, this is slightly lower, but Neural Networks are more robust and can thus be trusted on datasets of much larger quantities, which is not the case for Random Forest. The neural network used here relies on a Keras optimizer called Adam, which previous research has shown to be more lenient on computational power, and thus on power consumption. This is something to remember if we ever want to extend this predictor to more disparate features (list of all actors, timing of the movie release with respect to other global events, etc.). The following table summarizes our final results on the accuracy of both ML approaches:

Slice of dataset    Random Forest    Neural Networks
Training            98.60            85.08
Testing             76.12            75.26

IV. CONCLUSION

Random Forest exhibits the highest training accuracy at 98.6%, which is a far cry from its 76% accuracy on test data. Neural Networks have a training accuracy of 85%, which is consistent with their 75% accuracy on test data. The gap on the Random Forest side is a typical issue with this algorithm, because it can easily overfit the data. From all the above, here are some of the key takeaways one can make:
- Estimating the success of a film is not as straightforward as it seems. Success does not correlate intimately with any of the obvious features a movie relies on (genre, country of origin, shooting quality).
- That being said, certain factors have more impact than others: the choice of actor/actress is more decisive than the choice of director, drama movies are likely to be more successful, and movies released during summer have better chances than movies released in other months.

The model built here can be considered minimalistic, and there are a number of additions one could make to improve it. I would suggest:
- Training on a larger dataset. Our study was conducted only on the IMDB dataset, but one could also look at the Rotten Tomatoes and Box Office Mojo datasets. In that case, some of the rows we removed during data cleaning (those with missing “gross” and “budget” values) could probably be completed from the other sources and kept in the final training set.
- Adding as a feature the keywords that make up a movie synopsis. This text describing the plot is one of the first elements an audience consults when choosing which movie to watch, along with the movie poster. The poster, however, is complicated to evaluate as a numerical score, unless we use image segmentation to learn the different components of the poster image and compare posters across movies.
- Inserting other critical features as well, such as the number of theatres screening a particular movie, or the number of previously successful movies from a particular director or actor/actress.

The ML implementation designed in this project can be extended to the Turkish movie industry and help local production studios understand the parameters that can boost the commercial success of their films.

V. REFERENCES
CMPE542 Course Notes
www.imdb.com
www.kaggle.com
www.stackoverflow.com