In this project, we analyze a dataset of about 10,000 movies that was originally generated from the TMDb movie database API and published on Kaggle: https://www.kaggle.com/tmdb/tmdb-movie-metadata. We analyzed the dataset in order to answer several research questions:
- Most popular movies by genre
- Relations between movie popularity and rating and the production budget and revenue
TMDb movie dataset by kaggle
1. Udacity Data Analyst Nanodegree
P2: Investigate [TMDb Movie] dataset
Author: Mouhamadou GUEYE
Date: May 26, 2019
Table of contents
Introduction
Data Wrangling
Exploratory Data Analysis
Conclusions
Introduction
In this project we will analyze a dataset containing information about 10,000 movies collected from the movie
database TMDb. In particular, we are interested in finding trends relating the most popular movies to their genre, and the movie rating and
popularity to the budget and revenue.
Background:
The [Movie Database TMDB](https://www.themoviedb.org/) is a community built movie and TV database.
Every piece of data has been added by our amazing community dating back to 2008. TMDb's strong
international focus and breadth of data is largely unmatched and something we're incredibly proud of. Put
simply, we live and breathe community and that's precisely what makes us different.
The TMDb Advantage:
1. Every year since 2008, the number of contributions to our database has increased. With over 125,000 developers and companies using our platform, TMDb has become a premiere source for metadata.
2. Along with extensive metadata for movies, TV shows and people, we also offer one of the best selections of high resolution posters and fan art. On average, over 1,000 images are added every single day.
3. We're international. While we officially support 39 languages we also have extensive regional data. Every single day TMDb is used in over 180 countries.
4. Our community is second to none. Between our staff and community moderators, we're always here to help. We're passionate about making sure your experience on TMDb is nothing short of amazing.
5. Trusted platform. Every single day our service is used by millions of people while we process over 2 billion requests. We've proven for years that this is a service that can be trusted and relied on.
This organization profile is not owned or maintained by
TMDb: datasets hosted under this organization profile use the TMDb API but are not endorsed or
certified by TMDb[1].
Research Questions for Investigation
1. Which movie genres are the most popular?
2. Which movie genres are the most popular from year to year?
3. Are the movies with the highest revenue more popular?
4. Are the movies with the highest budget more popular?
5. Do the movies with the highest revenue receive a better rating?
6. Do the movies with the highest budget receive a better rating?
Dataset
This dataset contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and
revenue. The dataset used in this project is a cleaned version of the original dataset on Kaggle, where its full description can be
found.
In [1]: # packages import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline
Data Wrangling
General Properties
In this step we will inspect the dataset in order to understand its properties and structure:
The data type of each column
The number of samples in the dataset
The number of columns in the dataset
Duplicate rows, if any, in the dataset
Features with missing values
The number of non-null unique values for each feature in the dataset
What those unique values are, and the count of each
In [2]: movies = pd.read_csv('tmdb-movies.csv')
In [3]: # Printing the first five rows of the dataframe
movies.head()
Columns Data Types
In [4]: movies.dtypes
Number of samples/columns
In [5]: # number of rows for the movie dataset
movies.shape[0]
In [6]: # Number of columns for the movie dataset
movies.shape[1]
Duplicate Rows
In [7]: # Number of duplicate rows in the movies dataset
sum(movies.duplicated())
Deletion of duplicates
In [8]: # Show the duplicate rows in the movies dataset, then drop them
movies[movies.duplicated()]
movies.drop_duplicates(inplace=True)
Missing Values
We notice that there are missing values in the following columns:
homepage
overview
release_date
tagline
runtime
cast
production_companies
director
genres
etc.
In [9]: # information about the dataset
movies.info()
In [10]: # Inspecting rows with missing values
movies[movies.isnull().any(axis='columns')].head()
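A per-column count of missing values can complement the row-level inspection above. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are invented for illustration; they are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the movies dataset
toy = pd.DataFrame({
    'original_title': ['A', 'B', 'C', 'D'],
    'homepage': [np.nan, 'http://example.com', np.nan, np.nan],
    'director': ['X', np.nan, 'Y', 'Z'],
})

# Number of missing values per column, largest first
missing_counts = toy.isnull().sum().sort_values(ascending=False)
# The same information as a share of all rows
missing_share = toy.isnull().mean()

print(missing_counts)
print(missing_share)
```

Applied to the real frame, the same two calls would presumably rank homepage first, since `movies.info()` reports the fewest non-null entries for that column.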
Number of Distinct Observations
In [11]: movies.nunique()
Descriptive Statistics Summary
In [12]: movies.describe()
Data Cleaning
In this step we will clean the dataset: remove columns that are irrelevant to our analysis, convert the release date
column from a string to a datetime object, replace the many zero values in the budget and revenue columns
with the column means, and handle the columns that hold multiple values separated by a pipe (|) by splitting them into separate rows.
In [13]: # movies dataset columns
print(movies.columns)
Dropping Extraneous Columns
These columns will be dropped since they are not relevant to our data analysis.
In [14]: # columns to drop from the movies dataset; these columns are irrelevant to our data analysis
columns = ['homepage', 'tagline', 'overview', 'keywords']
In [15]: movies.drop(labels=columns, axis=1, inplace=True)
In [16]: movies.info()
Convert release_date to a datetime Object
The release_date column is in string format; we will use pandas' to_datetime method to convert the column from string to datetime
dtype.
In [17]: movies['release_date'] = pd.to_datetime(movies['release_date'],format='%m/%d/%y')
In [18]: # check the date format after cleaning
movies['release_date'].dtype
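One thing to watch with the `%y` directive: two-digit years are resolved to the range 1969 to 2068, while `release_year` in this dataset starts at 1960, so dates from 1960 to 1968 may end up in the 2060s. A possible workaround, sketched below on a hypothetical three-row sample (the `sample` frame is invented for illustration), is to rebuild the date from the `release_year` column and the parsed month and day:

```python
import pandas as pd

# Hypothetical sample in the same 'm/d/yy' string format as release_date
sample = pd.DataFrame({
    'release_date': ['6/9/15', '12/25/66', '1/1/00'],
    'release_year': [2015, 1966, 2000],
})

# Same conversion as in the notebook
parsed = pd.to_datetime(sample['release_date'], format='%m/%d/%y')

# '66' is read as 2066, so take the year from release_year instead
# and keep only the month and day from the parsed value
fixed = pd.to_datetime(pd.DataFrame({
    'year': sample['release_year'],
    'month': parsed.dt.month,
    'day': parsed.dt.day,
}))

print(parsed.dt.year.tolist())
print(fixed.dt.year.tolist())
```

This assumes `release_year` is trustworthy for every row; the analysis in this notebook groups by `release_year` rather than `release_date`, so its results would not be affected either way.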
Dealing with Multiple Values Columns
In [19]: movies= (movies.drop('genres', axis=1)
.join(movies['genres'].str.split('|', expand=True)
.stack().reset_index(level=1,drop=True)
.rename('genres'))
.loc[:, movies.columns])
movies.head()
In [20]: # splitting the production_companies column into one row per company
movies= (movies.drop('production_companies', axis=1)
.join(movies['production_companies'].str.split('|', expand=True)
.stack().reset_index(level=1,drop=True)
.rename('production_companies'))
.loc[:, movies.columns])
movies.head()
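As an alternative to the split/stack/join pattern above, recent pandas versions (0.25 and later) provide `DataFrame.explode`. The sketch below shows the idea on a hypothetical two-movie frame; `toy` and `long_form` are illustrative names, not part of the notebook:

```python
import pandas as pd

# Hypothetical frame with two pipe-separated columns
toy = pd.DataFrame({
    'id': [1, 2],
    'genres': ['Action|Adventure', 'Drama'],
    'production_companies': ['Studio A|Studio B', None],
})

# Turn each pipe-separated string into a list, then give every element its own row
long_form = (toy.assign(genres=toy['genres'].str.split('|'))
                .explode('genres')
                .reset_index(drop=True))
long_form = (long_form.assign(
                 production_companies=long_form['production_companies'].str.split('|'))
             .explode('production_companies')
             .reset_index(drop=True))

print(long_form)
```

Like the approach above, this produces one row per (movie, genre, company) combination, and a missing value stays a single missing row.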
Fill zero values in the revenue and budget columns
Here we inspect the revenue, revenue_adj, budget and budget_adj columns, counting the number of rows with a value of 0
before replacing those values with the mean.
In [21]: # number of rows with zero revenue
movies[movies['revenue'] == 0].count()['revenue']
In [22]: # number of rows with zero adjusted revenue
movies[movies['revenue_adj'] == 0].count()['revenue_adj']
In [23]: # number of rows with zero budget
movies[movies['budget'] == 0].count()['budget']
In [24]: # number of rows with zero adjusted budget
movies[movies['budget_adj'] == 0].count()['budget_adj']
In [25]: # fill the columns revenue and budget with their mean value
cols = ['budget', 'budget_adj', 'revenue', 'revenue_adj']
for item in cols:
    print(item, movies[item].mean())
    movies[item] = movies[item].replace({0:movies[item].mean()})
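Note that the mean used above is computed over all rows, zeros included. A possible variant, sketched here on a hypothetical five-row column (the numbers are invented), treats zeros as missing first and then fills with the median of the recorded values, which is less sensitive to a few very large budgets:

```python
import numpy as np
import pandas as pd

# Hypothetical budget column where 0 means "not recorded"
toy = pd.DataFrame({'budget': [0, 10, 20, 0, 1000]})

# Mean over all rows: the zeros pull it down
mean_with_zeros = toy['budget'].mean()

# Treat 0 as missing; NaN is skipped by mean() and median()
recorded = toy['budget'].replace(0, np.nan)
mean_recorded = recorded.mean()
median_recorded = recorded.median()

# Fill the unrecorded budgets with the median of the recorded ones
toy['budget_filled'] = recorded.fillna(median_recorded)

print(mean_with_zeros, mean_recorded, median_recorded)
print(toy)
```

Whether the mean or the median is the better fill value depends on the question asked; this is only one option among several.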
In [26]: # Check whether the columns have been successfully filled
movies[movies['revenue'].notnull()].count()
In [27]: # should return False: no zero value should remain in the column
(movies['revenue'] == 0).any()
In [28]: # should return False
(movies['revenue_adj'] == 0).any()
In [29]: # should return False
(movies['budget'] == 0).any()
In [30]: # should return False
(movies['budget_adj'] == 0).any()
Check Number of samples/columns
In [31]: movies.shape
Visual Trends
In [32]: movies.hist(figsize=(15,10));
Exploratory Data Analysis
1. Which genres are the most popular?
In [33]: # unique genres movies existing in the dataframe
genres = movies['genres'].unique()
print(genres)
In [34]: # grouping movies by genre, selecting the popularity column and computing the mean
movies_by_genres = movies.groupby('genres')['popularity'].mean()
# plotting the bar chart of movies by genre
movies_by_genres.plot(kind='bar', alpha=.7, figsize=(15,6))
plt.xlabel("Genres", fontsize=18);
plt.ylabel("Popularity", fontsize=18);
plt.xticks(fontsize=10)
plt.title('Average movie popularity by genre', fontsize=18);
plt.grid(True)
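Because the frame was also split by production company, each (movie, genre) pair appears once per company, so movies with many production companies weigh more in the average. A possible way to count each movie once per genre is sketched below on a hypothetical toy frame (all values invented):

```python
import pandas as pd

# Toy frame: movie 1 has two production companies, so each of its genre rows appears twice
toy = pd.DataFrame({
    'id': [1, 1, 1, 1, 2],
    'genres': ['Action', 'Action', 'Drama', 'Drama', 'Action'],
    'production_companies': ['A', 'B', 'A', 'B', 'C'],
    'popularity': [4.0, 4.0, 4.0, 4.0, 1.0],
})

# Mean over all rows: movie 1 is counted twice in each genre
weighted = toy.groupby('genres')['popularity'].mean()

# Keep one row per (movie, genre) pair before averaging
one_row_per_pair = toy.drop_duplicates(subset=['id', 'genres'])
unweighted = one_row_per_pair.groupby('genres')['popularity'].mean()

print(weighted)
print(unweighted)
```

In this toy example the Action average drops from 3.0 to 2.5 once movie 1 is counted only once.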
2. Which genres are the most popular from year to year?
In [35]: # plot data
fig, ax = plt.subplots(figsize=(15,7))
# grouping movies by release year and genre, then counting the rows in each group
grouped = (movies.groupby(['release_year', 'genres']).count()['popularity']
.unstack().plot(ax=ax, figsize=(15,6)))
plt.xlabel("release year", fontsize=18);
plt.ylabel("count", fontsize=18);
plt.xticks(fontsize=10)
plt.title('movie popularity year by year', fontsize=18);
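The plot above shows how many rows each genre has per year. If one instead wants the genre with the highest average popularity in each year, a possible sketch is the following; the toy data and the names `mean_pop` and `top_genre` are illustrative assumptions:

```python
import pandas as pd

# Hypothetical toy data: two years, two genres
toy = pd.DataFrame({
    'release_year': [2014, 2014, 2014, 2015, 2015],
    'genres': ['Action', 'Drama', 'Drama', 'Action', 'Drama'],
    'popularity': [2.0, 1.0, 4.0, 5.0, 3.0],
})

# Mean popularity for each (year, genre), with genres as columns
mean_pop = toy.groupby(['release_year', 'genres'])['popularity'].mean().unstack()

# For each year, the genre whose mean popularity is highest
top_genre = mean_pop.idxmax(axis=1)

print(mean_pop)
print(top_genre)
```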
3. Which movie genres receive the highest average rating?
In [36]: # grouping the movies by genres and projecting on the rating column
rating = movies.groupby('genres')['vote_average'].mean()
rating
In [37]: # bar chart of the movies mean rating by genre
rating.plot(kind='bar', alpha=0.7)
plt.xlabel('Movie Genre', fontsize=12)
plt.ylabel('Vote Average', fontsize=12)
plt.title('Average Movie Quality by Genre', fontsize=12)
plt.grid(True)
4. Do movies with high revenue receive the highest rating?
In [38]: plt.scatter(movies['revenue_adj'], movies['vote_average'], linewidth=5)
plt.title('Vote Ratings by Revenue Level', fontsize=15)
plt.xlabel('Revenue Level', fontsize=15)
plt.ylabel('Average Vote Rating', fontsize=15);
plt.show()
In [39]: # mean rating for each revenue level
median_rev = movies['revenue_adj'].median()
low = movies.query('revenue_adj < {}'.format(median_rev))
high = movies.query('revenue_adj >= {}'.format(median_rev))
# filtering to vote_average columns and calculation of the mean
mean_low = low['vote_average'].mean()
mean_high = high['vote_average'].mean()
In [40]: heights = [mean_low, mean_high]
print(heights)
labels = ['low', 'high']
locations = [1,2]
plt.bar(locations, heights, tick_label=labels)
plt.title('Average Vote Ratings by Revenue Level', fontsize=15)
plt.xlabel('Revenue Level', fontsize=12)
plt.ylabel('Average Vote Rating', fontsize=15);
In [41]: # counting the movie revenue unique values
movies.revenue.value_counts().head()
In [42]: # 10 first values
movies.groupby('revenue_adj')['vote_average'].value_counts().head(10)
In [43]: # 10 last values
movies.groupby('revenue_adj')['vote_average'].value_counts().tail(10)
In [44]: # comparison of the median rating of movies with low and high revenue
(movies.query('revenue_adj < {}'.format(median_rev))['vote_average'].median(),
movies.query('revenue_adj > {}'.format(median_rev))['vote_average'].median())
Partial conclusion
It is difficult to say that movies with high revenue have a better rating: in the bar chart, the two bars have
approximately the same height (a mean rating of about 5.97 in both groups). For a closer comparison, the median vote average of
movies with low and high revenue was calculated. We notice that the median vote average is 6.0 for movies with low revenue and
6.3 for movies with high revenue.
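A correlation coefficient is another way to summarize the relation between revenue and rating. The sketch below uses invented numbers; it computes the Pearson correlation with `Series.corr` and a rank-based (Spearman) correlation as the Pearson correlation of the ranks:

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset
toy = pd.DataFrame({
    'revenue_adj': [1e6, 5e6, 2e7, 1e8, 9e8],
    'vote_average': [5.5, 6.1, 5.9, 6.4, 7.0],
})

# Pearson correlation: measures linear association
pearson = toy['revenue_adj'].corr(toy['vote_average'])

# Spearman correlation: Pearson correlation computed on the ranks,
# less sensitive to a few very large revenues
spearman = toy['revenue_adj'].rank().corr(toy['vote_average'].rank())

print(pearson, spearman)
```

On the real data, rows filled with the mean revenue all share the same value, which would weaken any correlation computed this way.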
5. Do movies with high budget get the highest rating?
In [45]: # scatter plot of the budget versus vote rating
plt.scatter(movies['budget_adj'], movies['vote_average'], linewidth=5)
plt.title('Vote Ratings by Budget Level', fontsize=15)
plt.xlabel('Budget Level', fontsize=15)
plt.ylabel('Vote Rating', fontsize=15);
plt.show()
In [46]: # mean rating for each budget level
median_bud = movies['budget_adj'].median()
low = movies.query('budget_adj < {}'.format(median_bud))
high = movies.query('budget_adj >= {}'.format(median_bud))
# filtering to vote_average columns and calculation of the mean
mean_low = low['vote_average'].mean()
mean_high = high['vote_average'].mean()
print([mean_low, mean_high])
In [47]: heights = [mean_low, mean_high]
print(heights)
labels = ['low', 'high']
locations = [1,2]
plt.bar(locations, heights, tick_label=labels)
plt.title('Average Vote Ratings by Budget Level', fontsize=15)
plt.xlabel('Budget Level', fontsize=12)
plt.ylabel('Average Vote Rating', fontsize=15);
In [48]: # counting the movie budget unique values
movies.budget.value_counts().head()
In [49]: # 10 first values
movies.groupby('budget_adj')['vote_average'].value_counts().head(10)
In [50]: # 10 last values
movies.groupby('budget_adj')['vote_average'].value_counts().tail(10)
In [51]: # comparison of the median rating of movies with low and high budget
# note: the threshold used here is median_rev (the revenue_adj median from In [39]), not median_bud
(movies.query('budget_adj < {}'.format(median_rev))['vote_average'].median(),
movies.query('budget_adj > {}'.format(median_rev))['vote_average'].median())
Partial conclusion
It is difficult to say that movies with high budget have a better rating: in the bar chart, the two bars have
approximately the same height (a mean rating of about 5.98 and 5.96). For a closer comparison, the median vote average of
movies with low and high budget was calculated. We notice that the median vote average is 6.0 for movies with low budget and
6.2 for movies with high budget.
6. Do movies with highest revenue have more popularity?
In [52]: plt.scatter(movies['revenue_adj'], movies['popularity'], linewidth=5)
plt.title('Popularity by Revenue Level', fontsize=15)
plt.xlabel('Revenue Level', fontsize=15)
plt.ylabel('Average Popularity', fontsize=15);
plt.show()
In [53]: # mean popularity for each revenue level
median_rev = movies['revenue_adj'].median()
low = movies.query('revenue_adj < {}'.format(median_rev))
high = movies.query('revenue_adj >= {}'.format(median_rev))
# filtering to popularity columns and calculation of the mean
mean_low = low['popularity'].mean()
mean_high = high['popularity'].mean()
In [54]: # mean popularity of the low and high revenue groups for the bar chart
heights = [mean_low, mean_high]
print(heights)
labels = ['low', 'high']
locations = [1,2]
plt.bar(locations, heights, tick_label=labels)
plt.title('Average Popularity by Revenue Level', fontsize=15)
plt.xlabel('Revenue Level', fontsize=12)
plt.ylabel('Average Popularity', fontsize=15);
In [55]: # counting the movie revenue unique values
movies.revenue_adj.value_counts().head()
In [56]: # 10 first values
movies.groupby('revenue_adj')['popularity'].value_counts().head(10)
In [57]: # 10 last values
movies.groupby('revenue_adj')['popularity'].value_counts().tail(10)
In [58]: # comparison of the median popularity of movies with low and high revenue
(movies.query('revenue_adj < {}'.format(median_rev))['popularity'].median(),
movies.query('revenue_adj > {}'.format(median_rev))['popularity'].median())
Partial conclusion
We can see that movies with high revenue seem to be more popular than those with low revenue: the average
popularity is about 0.742 for low-revenue movies and 0.999 for high-revenue movies. Moreover, comparing the median popularity of
the two groups (0.588 versus 1.489), we can clearly see that movies with high revenue are more popular.
7. Do movies with highest budget have more popularity?
In [59]: # scatter plot of the movies budget versus popularity
plt.scatter(movies['budget_adj'], movies['popularity'], linewidth=5)
plt.title('Popularity by Budget Level', fontsize=15)
plt.xlabel('Budget Level', fontsize=15)
plt.ylabel('Average Popularity', fontsize=15);
plt.show()
In [60]: # mean popularity for each budget level (median_rev now holds the budget_adj median)
median_rev = movies['budget_adj'].median()
low = movies.query('budget_adj < {}'.format(median_rev))
high = movies.query('budget_adj >= {}'.format(median_rev))
# filtering to popularity columns and calculation of the mean
mean_low = low['popularity'].mean()
mean_high = high['popularity'].mean()
In [61]: heights = [mean_low, mean_high]
print(heights)
labels = ['low', 'high']
locations = [1,2]
plt.bar(locations, heights, tick_label=labels)
plt.title('Average Popularity by Budget Level', fontsize=15)
plt.xlabel('Budget Level', fontsize=12)
plt.ylabel('Average Popularity', fontsize=15);
In [62]: # counting the movie budget unique values
movies.budget_adj.value_counts().head()
In [63]: # 10 first values
movies.groupby('budget_adj')['popularity'].value_counts().head(10)
In [64]: # 10 last values
movies.groupby('budget_adj')['popularity'].value_counts().tail(10)
In [65]: # comparison of the median popularity of movies with low and high budget
(movies.query('budget_adj < {}'.format(median_rev))['popularity'].median(),
movies.query('budget_adj > {}'.format(median_rev))['popularity'].median())
Partial conclusion
We can see that movies with high budget seem to be more popular than those with low budget: the average popularity
is about 0.756 for low-budget movies and 0.979 for high-budget movies. Moreover, comparing the median popularity of
the two groups (0.534 versus 1.138), we can clearly see that movies with high budget seem more popular.
Conclusions
In this project, we started our analysis by examining the most popular movies by genre, and we noticed that adventure movies are the
most popular genre. We then examined movie popularity year by year. Since there is no correlation
between release_year and movie popularity, the count of movies released each year was used for this analysis. Based on the
relation between genre and vote average, we found that the Documentary genre receives the highest rating. Moreover, we
analyzed the dataset to answer different questions relating movie popularity and rating to revenue and
budget. While movies with high revenue and budget seem to be more popular, we could not find a correlation between
movie budget or revenue and rating.
Limitations
For a better analysis, more details would be useful about the variables popularity and vote_average: how they are
calculated, and which factors or criteria are used in their calculation. The columns of interest
in this analysis (budget, revenue, budget_adj and revenue_adj) contain many missing values (recorded as zeros), which were filled with the
mean. This may not be the best way to fix those columns, since the mean is not always the best measure of center. Another
limitation of this analysis is the use of the median to categorize movies into low and high revenue and budget.
Since some movies have a very large budget or revenue, and since many missing values were filled with the mean,
the median may not be the best choice for categorizing the movies.
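One possible refinement of the low/high split is to bin the variable into quantiles with `pd.qcut`. The following is a minimal sketch on invented data; the four level names are arbitrary labels chosen for illustration:

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset
toy = pd.DataFrame({
    'budget_adj': [1, 2, 3, 4, 5, 6, 7, 8],
    'popularity': [0.2, 0.3, 0.4, 0.5, 0.6, 0.9, 1.5, 3.0],
})

# Four equally sized groups based on the quartiles of budget_adj
toy['budget_level'] = pd.qcut(toy['budget_adj'], q=4,
                              labels=['low', 'medium', 'high', 'very high'])

# Median popularity within each budget level
by_level = toy.groupby('budget_level', observed=True)['popularity'].median()

print(by_level)
```

With heavily tied data, for instance many rows sharing the same filled-in value, several quantile edges can coincide and `pd.qcut` may reject the bins, so this would work best on rows with recorded budgets only.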
References:
[1]: https://www.themoviedb.org/about
Out[3]:
   id      imdb_id    popularity  budget     revenue     original_title                cast                                                home
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World                Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...   http://www.jurassicworld
1  76341   tt1392190  28.419936   150000000  378436354   Mad Max: Fury Road            Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...   http://www.madmaxmovie
2  262500  tt2908446  13.112507   110000000  295238201   Insurgent                     Shailene Woodley|Theo James|Kate Winslet|Ansel...   http://www.thedivergentseries.movie/#insu
3  140607  tt2488496  11.173104   200000000  2068178225  Star Wars: The Force Awakens  Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...   http://www.starwars.com/films/star-w epi
4  168259  tt2820852  9.335014    190000000  1506249360  Furious 7                     Vin Diesel|Paul Walker|Jason Statham|Michelle ...   http://www.furious7
5 rows × 21 columns
Out[4]: id int64
imdb_id object
popularity float64
budget int64
revenue int64
original_title object
cast object
homepage object
director object
tagline object
keywords object
overview object
runtime int64
genres object
production_companies object
release_date object
vote_count int64
vote_average float64
release_year int64
budget_adj float64
revenue_adj float64
dtype: object
Out[5]: 10866
Out[6]: 21
Out[7]: 1
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 0 to 10865
Data columns (total 21 columns):
id 10865 non-null int64
imdb_id 10855 non-null object
popularity 10865 non-null float64
budget 10865 non-null int64
revenue 10865 non-null int64
original_title 10865 non-null object
cast 10789 non-null object
homepage 2936 non-null object
director 10821 non-null object
tagline 8041 non-null object
keywords 9372 non-null object
overview 10861 non-null object
runtime 10865 non-null int64
genres 10842 non-null object
production_companies 9835 non-null object
release_date 10865 non-null object
vote_count 10865 non-null int64
vote_average 10865 non-null float64
release_year 10865 non-null int64
budget_adj 10865 non-null float64
revenue_adj 10865 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.8+ MB
Out[10]:
    id      imdb_id    popularity  budget    revenue    original_title   cast                                                 homepage  director          tagline
18  150689  tt1661199  5.556818    95000000  542351353  Cinderella       Lily James|Cate Blanchett|Richard Madden|Helen...    NaN       Kenneth Branagh   Midnight is just the beginning.
21  307081  tt1798684  5.337064    30000000  91709827   Southpaw         Jake Gyllenhaal|Rachel McAdams|Forest Whitaker...    NaN       Antoine Fuqua     Believe in Hope.
26  214756  tt2637276  4.564549    68000000  215863606  Ted 2            Mark Wahlberg|Seth MacFarlane|Amanda Seyfried|...    NaN       Seth MacFarlane   Ted is Coming, Again.
32  254470  tt2848292  3.877764    29000000  287506194  Pitch Perfect 2  Anna Kendrick|Rebel Wilson|Hailee Steinfeld|Br...   NaN       Elizabeth Banks   We're back pitches
33  296098  tt3682448  3.648210    40000000  162610473  Bridge of Spies  Tom Hanks|Mark Rylance|Amy Ryan|Alan Alda|Seba...     NaN       Steven Spielberg  In the shadow of war, one man showed the world...
5 rows × 21 columns
Out[11]: id 10865
imdb_id 10855
popularity 10814
budget 557
revenue 4702
original_title 10571
cast 10719
homepage 2896
director 5067
tagline 7997
keywords 8804
overview 10847
runtime 247
genres 2039
production_companies 7445
release_date 5909
vote_count 1289
vote_average 72
release_year 56
budget_adj 2614
revenue_adj 4840
dtype: int64
Out[12]:
id popularity budget revenue runtime vote_count vote_average release_year
count 10865.000000 10865.000000 1.086500e+04 1.086500e+04 10865.000000 10865.000000 10865.000000 10865.000000
mean 66066.374413 0.646446 1.462429e+07 3.982690e+07 102.071790 217.399632 5.975012 2001.321859
std 92134.091971 1.000231 3.091428e+07 1.170083e+08 31.382701 575.644627 0.935138 12.813260
min 5.000000 0.000065 0.000000e+00 0.000000e+00 0.000000 10.000000 1.500000 1960.000000
25% 10596.000000 0.207575 0.000000e+00 0.000000e+00 90.000000 17.000000 5.400000 1995.000000
50% 20662.000000 0.383831 0.000000e+00 0.000000e+00 99.000000 38.000000 6.000000 2006.000000
75% 75612.000000 0.713857 1.500000e+07 2.400000e+07 111.000000 146.000000 6.600000 2011.000000
max 417859.000000 32.985763 4.250000e+08 2.781506e+09 900.000000 9767.000000 9.200000 2015.000000
Index(['id', 'imdb_id', 'popularity', 'budget', 'revenue', 'original_title',
'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview',
'runtime', 'genres', 'production_companies', 'release_date',
'vote_count', 'vote_average', 'release_year', 'budget_adj',
'revenue_adj'],
dtype='object')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 0 to 10865
Data columns (total 17 columns):
id 10865 non-null int64
imdb_id 10855 non-null object
popularity 10865 non-null float64
budget 10865 non-null int64
revenue 10865 non-null int64
original_title 10865 non-null object
cast 10789 non-null object
director 10821 non-null object
runtime 10865 non-null int64
genres 10842 non-null object
production_companies 9835 non-null object
release_date 10865 non-null object
vote_count 10865 non-null int64
vote_average 10865 non-null float64
release_year 10865 non-null int64
budget_adj 10865 non-null float64
revenue_adj 10865 non-null float64
dtypes: float64(4), int64(6), object(7)
memory usage: 1.5+ MB
Out[18]: dtype('<M8[ns]')
Out[19]:
   id      imdb_id    popularity  budget     revenue     original_title      cast                                               director         runtime  genres           product
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World      Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Action           Universa Entertain
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World      Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Adventure        Universa Entertain
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World      Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Science Fiction  Universa Entertain
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World      Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Thriller         Universa Entertain
1  76341   tt1392190  28.419936   150000000  378436354   Mad Max: Fury Road  Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...  George Miller    120      Action           Vi Pictures
Out[20]:
   id      imdb_id    popularity  budget     revenue     original_title  cast                                               director         runtime  genres  production_c
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Action  Univers
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Action  Amblin Ent
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Action  Legenda
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Action  Fuji Televisio
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Action
Out[21]: 76851
Out[22]: 76851
Out[23]: 69433
Out[24]: 69433
budget 26775356.47371121
budget_adj 30889712.59859798
revenue 70334023.35863845
revenue_adj 84260108.11800905
Out[26]: id 181294
imdb_id 181249
popularity 181294
budget 181294
revenue 181294
original_title 181294
cast 181023
director 181104
runtime 181294
genres 181270
production_companies 179082
release_date 181294
vote_count 181294
vote_average 181294
release_year 181294
budget_adj 181294
revenue_adj 181294
dtype: int64
Out[27]: False
Out[28]: False
Out[29]: False
Out[30]: False
Out[31]: (181294, 17)
['Action' 'Adventure' 'Science Fiction' 'Thriller' 'Fantasy' 'Crime'
'Western' 'Drama' 'Family' 'Animation' 'Comedy' 'Mystery' 'Romance' 'War'
'History' 'Music' 'Horror' 'Documentary' 'TV Movie' nan 'Foreign']
Out[36]: genres
Action 5.859801
Adventure 5.962865
Animation 6.333965
Comedy 5.917464
Crime 6.112665
Documentary 6.957312
Drama 6.156389
Family 5.973175
Fantasy 5.895793
Foreign 5.892970
History 6.417070
Horror 5.444786
Music 6.302175
Mystery 5.986585
Romance 6.059295
Science Fiction 5.738771
TV Movie 5.651250
Thriller 5.848404
War 6.336557
Western 6.101556
Name: vote_average, dtype: float64
[5.971800334804548, 5.967507674675677]
Out[41]: 7.033402e+07 76851
2.000000e+06 152
2.000000e+07 126
1.200000e+07 126
5.318650e+08 125
Name: revenue, dtype: int64
Out[42]: revenue_adj vote_average
2.370705 6.4 12
2.861934 6.8 12
3.038360 7.7 5
5.926763 6.8 4
6.951084 4.9 8
8.585801 4.5 4
9.056820 6.7 20
9.115080 5.1 2
10.000000 4.2 9
10.296367 6.5 18
Name: vote_average, dtype: int64
Out[43]: revenue_adj vote_average
1.443191e+09 7.3 9
1.574815e+09 6.6 16
1.583050e+09 5.6 25
1.791694e+09 7.2 32
1.902723e+09 7.5 48
1.907006e+09 7.3 18
2.167325e+09 7.2 18
2.506406e+09 7.3 27
2.789712e+09 7.9 18
2.827124e+09 7.1 64
Name: vote_average, dtype: int64
Out[44]: (6.0, 6.3)
[5.981346153846566, 5.963905517657313]
[5.981346153846566, 5.963905517657313]
Out[48]: 2.677536e+07 69433
2.000000e+07 4331
2.500000e+07 4255
3.000000e+07 3902
4.000000e+07 3584
Name: budget, dtype: int64
Out[49]: budget_adj vote_average
0.921091 4.1 3
0.969398 5.3 20
1.012787 6.5 48
1.309053 4.8 8
2.908194 6.5 12
3.000000 7.3 12
4.519285 5.6 9
4.605455 6.0 1
5.006696 5.8 27
8.102293 6.9 45
Name: vote_average, dtype: int64
Out[50]: budget_adj vote_average
2.504192e+08 5.8 16
2.541001e+08 7.3 18
2.575999e+08 7.4 27
2.600000e+08 7.3 8
2.713305e+08 5.8 27
2.716921e+08 7.3 27
2.920507e+08 5.3 64
3.155006e+08 6.8 27
3.683713e+08 6.3 27
4.250000e+08 6.4 25
Name: vote_average, dtype: int64
Out[51]: (6.0, 6.2)
[0.7420684714824547, 0.9989869505300212]
Out[55]: 8.426011e+07 76851
2.358000e+07 125
4.978434e+08 125
2.231273e+07 125
1.934053e+07 125
Name: revenue_adj, dtype: int64
Out[56]: revenue_adj popularity
2.370705 0.462609 12
2.861934 0.552091 12
3.038360 0.352054 5
5.926763 0.208637 4
6.951084 0.578849 8
8.585801 0.183034 4
9.056820 0.450208 20
9.115080 0.113082 2
10.000000 0.559371 9
10.296367 0.222776 18
Name: popularity, dtype: int64
Out[57]: revenue_adj popularity
1.443191e+09 7.637767 9
1.574815e+09 2.631987 16
1.583050e+09 1.136610 25
1.791694e+09 2.900556 32
1.902723e+09 11.173104 48
1.907006e+09 2.563191 18
2.167325e+09 2.010733 18
2.506406e+09 4.355219 27
2.789712e+09 12.037933 18
2.827124e+09 9.432768 64
Name: popularity, dtype: int64
Out[58]: (0.58808, 1.4886709999999999)
[0.7564478409230605, 0.9790179787846799]
Out[62]: 3.088971e+07 69433
2.032801e+07 532
2.103337e+07 421
4.065602e+07 385
2.908194e+07 381
Name: budget_adj, dtype: int64
Out[63]: budget_adj popularity
0.921091 0.177102 3
0.969398 0.520430 20
1.012787 0.472691 48
1.309053 0.090186 8
2.908194 0.228643 12
3.000000 0.028456 12
4.519285 0.464188 9
4.605455 0.002922 1
5.006696 0.317091 27
8.102293 0.626646 45
Name: popularity, dtype: int64
Out[64]: budget_adj popularity
2.504192e+08 1.232098 16
2.541001e+08 5.076472 18
2.575999e+08 5.944927 27
2.600000e+08 2.865684 8
2.713305e+08 2.520912 27
2.716921e+08 4.355219 27
2.920507e+08 1.957331 64
3.155006e+08 4.965391 27
3.683713e+08 4.955130 27
4.250000e+08 0.250540 25
Name: popularity, dtype: int64
Out[65]: (0.534192, 1.138395)