Udacity Data Analyst Nanodegree
P2: Investigate [TMDb Movie] dataset
Author: Mouhamadou GUEYE
Date: May 26, 2019
Table of contents
Introduction
Data Wrangling
Exploratory Data Analysis
Conclusions
Introduction
In this project we analyze a dataset containing information about roughly 10,000 movies collected from The Movie Database
(TMDb). In particular, we are interested in trends relating movie popularity to genre, and in how rating and
popularity vary with budget and revenue.
Background:
The [Movie Database TMDB](https://www.themoviedb.org/) is a community built movie and TV database.
Every piece of data has been added by our amazing community dating back to 2008. TMDb's strong
international focus and breadth of data is largely unmatched and something we're incredibly proud of. Put
simply, we live and breathe community and that's precisely what makes us different.
The TMDb Advantage:
1. Every year since 2008, the number of contributions to our database has increased. With over 125,000
developers and companies using our platform, TMDb has become a premiere source for metadata.
2. Along with extensive metadata for movies, TV shows and people, we also offer one of the best selections
of high resolution posters and fan art. On average, over 1,000 images are added every single day.
3. We're international. While we officially support 39 languages we also have extensive regional data.
Every single day TMDb is used in over 180 countries.
4. Our community is second to none. Between our staff and community moderators, we're always here to help.
We're passionate about making sure your experience on TMDb is nothing short of amazing.
5. Trusted platform. Every single day our service is used by millions of people while we process over
2 billion requests. We've proven for years that this is a service that can be trusted and relied on.
This organization profile is not owned or maintained by TMDb: datasets hosted under this organization
profile use the TMDb API but are not endorsed or certified by TMDb[1].
Research Questions for Investigation
1. Which movie genres are the most popular?
2. Which movie genres are the most popular from year to year?
3. Are movies with the highest revenue more popular?
4. Are movies with the highest budget more popular?
5. Do movies with the highest revenue receive a better rating?
6. Do movies with the highest budget receive a better rating?
Dataset
This dataset contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and
revenue. The dataset used in this project is a cleaned version of the original dataset on Kaggle, where its full description can be
found.
In [1]: # packages import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline
Data Wrangling
General Properties
In this step we inspect the dataset in order to understand its properties and structure:
The data type of each column
The number of samples in the dataset
The number of columns in the dataset
Duplicate rows in the dataset, if any
Features with missing values
The number of non-null unique values for each feature
What those unique values are, and the count of each
In [2]: movies = pd.read_csv('tmdb-movies.csv')
In [3]: # Printing the five first row of the dataframe
movies.head()
Columns Data Types
In [4]: movies.dtypes
Number of samples/columns
In [5]: # number of rows for the movie dataset
movies.shape[0]
In [6]: # Number of columns for the movie dataset
movies.shape[1]
Duplicates Rows
In [7]: # Duplicate rows in the movies dataset
sum(movies.duplicated())
Deletion of duplicates
In [8]: # Display the duplicated rows in the movies dataset, then drop them
movies[movies.duplicated()]
movies.drop_duplicates(inplace=True)
Missing Values
We notice that there are missing values in the following columns:
homepage
overview
release_date
tagline
runtime
cast
production_companies
director
genres
etc.
In [9]: # information about the dataset
movies.info()
In [10]: # Inspecting rows with missing values
movies[movies.isnull().any(axis='columns')].head()
Number of Distinct Observations
In [11]: movies.nunique()
Descriptive Statistics Summary
In [12]: movies.describe()
Data Cleaning
In this step we clean the dataset: we remove columns that are irrelevant to our analysis, convert the release_date
column from a string to a datetime object, replace the many zero values in the budget and revenue columns
with the column means, and split the columns that hold multiple values separated by a pipe (|) into separate rows.
In [13]: # movies dataset columns
print(movies.columns)
Dropping Extraneous Columns
These columns will be dropped since they are not relevant to our data analysis.
In [14]: # columns to drop from the movies dataset; these columns are irrelevant to our data analysis
columns = ['homepage', 'tagline', 'overview', 'keywords']
In [15]: movies.drop(labels=columns, axis=1, inplace=True)
In [16]: movies.info()
Convert release_date to a datetime Object
The release_date column is stored as a string; we use pandas' to_datetime function to convert the column from string to
datetime dtype.
In [17]: movies['release_date'] = pd.to_datetime(movies['release_date'],format='%m/%d/%y')
In [18]: # check the date format after cleaning
movies['release_date'].dtype
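One caveat with the two-digit year format: %y follows the POSIX convention, so years 00 to 68 are read as 2000 to 2068. Since release_year in this dataset goes back to 1960, some early releases could end up with dates in the 2060s. The following is a minimal sketch of one possible repair, which rebuilds the date from the release_year column; the three sample rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up sample rows in the same 'm/d/yy' layout as release_date
sample = pd.DataFrame({
    'release_date': ['6/9/15', '12/25/66', '1/1/00'],
    'release_year': [2015, 1966, 2000],
})

# Same conversion as in the cell above; '66' may be read as 2066 here
parsed = pd.to_datetime(sample['release_date'], format='%m/%d/%y')

# Rebuild each date from the trusted release_year plus the parsed month and day
fixed = pd.to_datetime(pd.DataFrame({
    'year': sample['release_year'],
    'month': parsed.dt.month,
    'day': parsed.dt.day,
}))
print(fixed)
```

This assumes release_year is correct for every row, which seems reasonable here because it is a separate integer column with no missing values.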
Dealing with Multiple Values Columns
In [19]: # splitting the genres column into one row per genre
movies = (movies.drop('genres', axis=1)
          .join(movies['genres'].str.split('|', expand=True)
                .stack().reset_index(level=1, drop=True)
                .rename('genres'))
          .loc[:, movies.columns])
movies.head()
In [20]: # splitting the production_companies column into one row per company
movies = (movies.drop('production_companies', axis=1)
          .join(movies['production_companies'].str.split('|', expand=True)
                .stack().reset_index(level=1, drop=True)
                .rename('production_companies'))
          .loc[:, movies.columns])
movies.head()
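As an alternative to the split/stack/join chain, recent pandas versions (0.25 and later) offer DataFrame.explode. The sketch below uses a made-up two-movie frame, not the real dataset, to show the idea and to make the row multiplication visible: splitting both columns produces one row per (movie, genre, company) combination.

```python
import pandas as pd

# Hypothetical toy frame with the same pipe-separated layout
toy = pd.DataFrame({
    'id': [1, 2],
    'original_title': ['A', 'B'],
    'genres': ['Action|Adventure', 'Drama'],
    'production_companies': ['X|Y', None],
})

# One row per (movie, genre)
by_genre = (toy.assign(genres=toy['genres'].str.split('|'))
            .explode('genres')
            .reset_index(drop=True))

# One row per (movie, genre, company); missing companies stay as a single NaN row
by_both = (by_genre.assign(
               production_companies=by_genre['production_companies'].str.split('|'))
           .explode('production_companies')
           .reset_index(drop=True))

print(len(toy), len(by_genre), len(by_both))
```

Because every statistic computed afterwards is weighted by the number of rows a movie occupies, it may be preferable to explode only the column a given question needs.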
Fill zero values in the revenue and budget columns
Here we inspect the columns revenue, revenue_adj, budget and budget_adj, counting the number of rows with a value of 0
before replacing those values with the column mean.
In [21]: # number of rows with zero revenue
movies[movies['revenue'] == 0].count()['revenue']
In [22]: # number of rows with zero revenue_adj
movies[movies['revenue_adj'] == 0].count()['revenue_adj']
In [23]: # number of rows with zero budget
movies[movies['budget'] == 0].count()['budget']
In [24]: # number of rows with zero budget_adj
movies[movies['budget_adj'] == 0].count()['budget_adj']
In [25]: # fill the columns revenue and budget with their mean value
cols = ['budget', 'budget_adj', 'revenue', 'revenue_adj']
for item in cols:
    print(item, movies[item].mean())
    movies[item] = movies[item].replace({0: movies[item].mean()})
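Note that the mean used above is computed while the zeros are still in the column, so the placeholder zeros pull the fill value down. A possible alternative, sketched here on made-up numbers rather than the real columns, is to treat zeros as missing first and then fill with a statistic of the non-zero values, such as the median.

```python
import numpy as np
import pandas as pd

# Hypothetical budget column where 0 stands for "unknown"
toy = pd.DataFrame({'budget': [0, 10, 20, 0, 90]})

# Mean with the zero placeholders included, as in the loop above
mean_with_zeros = toy['budget'].mean()

# Treat zeros as missing, then summarise only the known values
known = toy['budget'].replace(0, np.nan)
mean_known = known.mean()
median_known = known.median()

# Fill the unknown budgets with the median of the known ones
filled = known.fillna(median_known)
print(mean_with_zeros, mean_known, median_known)
```

Whether the median is the right choice depends on the question being asked; leaving the values as NaN and excluding them from budget and revenue comparisons is another reasonable option.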
In [26]: # Check whether the columns have been successfully filled
movies[movies['revenue'].notnull()].count()
In [27]: # should return False: no zero values remain in revenue
(movies['revenue'] == 0).any()
In [28]: # should return False: no zero values remain in revenue_adj
(movies['revenue_adj'] == 0).any()
In [29]: # should return False: no zero values remain in budget
(movies['budget'] == 0).any()
In [30]: # should return False: no zero values remain in budget_adj
(movies['budget_adj'] == 0).any()
Check Number of samples/columns
In [31]: movies.shape
Visual Trends
In [32]: movies.hist(figsize=(15,10));
Exploratory data analysis
1. Which genre is the most popular?
In [33]: # unique genres movies existing in the dataframe
genres = movies['genres'].unique()
print(genres)
In [34]: # grouping movies by genre, selecting the popularity column and computing its mean
movies_by_genres = movies.groupby('genres')['popularity'].mean()
# plotting the bar chart of mean popularity by genre
movies_by_genres.plot(kind='bar', alpha=.7, figsize=(15,6))
plt.xlabel("Genres", fontsize=18);
plt.ylabel("Popularity", fontsize=18);
plt.xticks(fontsize=10)
plt.title('Average movie popularity by genre', fontsize=18);
plt.grid(True)
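Because the frame now has one row per (movie, genre, production company), a movie with many production companies counts several times in the genre mean. The sketch below, on a made-up frame with hypothetical values, shows how keeping one row per (id, genres) pair before grouping changes the result.

```python
import pandas as pd

# Hypothetical exploded frame: movie 1 has three companies, movie 2 has one
exploded = pd.DataFrame({
    'id': [1, 1, 1, 2],
    'genres': ['Action', 'Action', 'Action', 'Action'],
    'production_companies': ['X', 'Y', 'Z', 'W'],
    'popularity': [10.0, 10.0, 10.0, 1.0],
})

# Mean over all rows: movie 1 is counted three times
row_weighted = exploded.groupby('genres')['popularity'].mean()

# Mean over one row per (movie, genre): each movie is counted once
per_movie = (exploded.drop_duplicates(subset=['id', 'genres'])
             .groupby('genres')['popularity'].mean())

print(row_weighted['Action'], per_movie['Action'])
```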
2. Which genre is the most popular from year to year?
In [35]: # plot data
fig, ax = plt.subplots(figsize=(15,7))
# counting the rows per release year and genre, one line per genre
grouped = (movies.groupby(['release_year', 'genres']).count()['popularity']
           .unstack().plot(ax=ax, figsize=(15,6)))
plt.xlabel("release year", fontsize=18);
plt.ylabel("count", fontsize=18);
plt.xticks(fontsize=10)
plt.title('Number of movies per genre, year by year', fontsize=18);
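The cell above counts rows per genre and year. If the aim is instead to name the most popular genre for each year, one possible sketch is to average popularity per (release_year, genres) and pick the column with the highest value in each row. The frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical rows: one per (movie, genre)
toy = pd.DataFrame({
    'release_year': [2014, 2014, 2014, 2015, 2015],
    'genres': ['Action', 'Drama', 'Drama', 'Action', 'Drama'],
    'popularity': [2.0, 2.0, 4.0, 5.0, 1.0],
})

# Mean popularity per year (rows) and genre (columns)
per_year = toy.groupby(['release_year', 'genres'])['popularity'].mean().unstack()

# Genre with the highest mean popularity in each year
top_genre = per_year.idxmax(axis=1)
print(top_genre)
```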
3. Which movie genre receives the highest average rating?
In [36]: # grouping the movies by genres and projecting on the rating column
rating = movies.groupby('genres')['vote_average'].mean()
rating
In [37]: # bar chart of the movies mean rating by genre
rating.plot(kind='bar', alpha=0.7)
plt.xlabel('Movie Genre', fontsize=12)
plt.ylabel('Vote Average', fontsize=12)
plt.title('Average Movie Quality by Genre', fontsize=12)
plt.grid(True)
4. Do movies with high revenue receive the highest rating?
In [38]: plt.scatter(movies['revenue_adj'], movies['vote_average'], linewidth=5)
plt.title('Vote Ratings by Revenue Level', fontsize=15)
plt.xlabel('Revenue Level', fontsize=15)
plt.ylabel('Average Vote Rating', fontsize=15);
plt.show()
In [39]: # mean rating for each revenue level
median_rev = movies['revenue_adj'].median()
low = movies.query('revenue_adj < {}'.format(median_rev))
high = movies.query('revenue_adj >= {}'.format(median_rev))
# filtering to vote_average columns and calculation of the mean
mean_low = low['vote_average'].mean()
mean_high = high['vote_average'].mean()
In [40]: heights = [mean_low, mean_high]
print(heights)
labels = ['low', 'high']
locations = [1,2]
plt.bar(locations, heights, tick_label=labels)
plt.title('Average Vote Ratings by Revenue Level', fontsize=15)
plt.xlabel('Revenue Level', fontsize=12)
plt.ylabel('Average Vote Rating', fontsize=15);
In [41]: # counting the movie revenue unique values
movies.revenue.value_counts().head()
In [42]: # 10 first values
movies.groupby('revenue_adj')['vote_average'].value_counts().head(10)
In [43]: # 10 last values
movies.groupby('revenue_adj')['vote_average'].value_counts().tail(10)
In [44]: # comparison of the median rating of movies with low and high revenue
(movies.query('revenue_adj < {}'.format(median_rev))['vote_average'].median(),
 movies.query('revenue_adj > {}'.format(median_rev))['vote_average'].median())
Partial conclusion
It is difficult to say that movies with high revenue have a better rating: in the bar chart, the two bars have
approximately the same height. For a deeper comparison, the median vote average of low- and high-revenue movies was
calculated. The median vote average is 6.0 for movies with low revenue and 6.3 for movies with high
revenue.
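A complementary way to look at this question is to compute a rank correlation directly, rather than comparing two groups. The sketch below computes Spearman's coefficient as the Pearson correlation of the ranks, using made-up values under the dataset's column names; it is an illustration and not a result from the real data.

```python
import pandas as pd

# Hypothetical values, chosen only to illustrate the calculation
toy = pd.DataFrame({
    'revenue_adj': [1e6, 5e6, 2e7, 1e8, 9e8],
    'vote_average': [6.1, 5.9, 6.8, 6.5, 7.2],
})

# Spearman correlation = Pearson correlation computed on the ranks
spearman = toy['revenue_adj'].rank().corr(toy['vote_average'].rank())
print(round(spearman, 3))
```

A rank correlation is less sensitive to the few very large revenue values than a Pearson correlation on the raw numbers, which is why it is used in this sketch.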
5. Do movies with high budget get the highest rating?
In [45]: # scatter plot of the budget versus vote rating
plt.scatter(movies['budget_adj'], movies['vote_average'], linewidth=5)
plt.title('Vote Ratings by Budget Level', fontsize=15)
plt.xlabel('Budget Level', fontsize=15)
plt.ylabel('Vote Rating', fontsize=15);
plt.show()
In [46]: # mean rating for each budget level
median_bud = movies['budget_adj'].median()
low = movies.query('budget_adj < {}'.format(median_bud))
high = movies.query('budget_adj >= {}'.format(median_bud))
# filtering to vote_average columns and calculation of the mean
mean_low = low['vote_average'].mean()
mean_high = high['vote_average'].mean()
print([mean_low, mean_high])
In [47]: heights = [mean_low, mean_high]
print(heights)
labels = ['low', 'high']
locations = [1,2]
plt.bar(locations, heights, tick_label=labels)
plt.title('Average Vote Ratings by Budget Level', fontsize=15)
plt.xlabel('Budget Level', fontsize=12)
plt.ylabel('Average Vote Rating', fontsize=15);
In [48]: # counting the movie budget unique values
movies.budget.value_counts().head()
In [49]: # 10 first values
movies.groupby('budget_adj')['vote_average'].value_counts().head(10)
In [50]: # 10 last values
movies.groupby('budget_adj')['vote_average'].value_counts().tail(10)
In [51]: # comparison of the median rating of movies with low and high budget
# (the threshold used here is median_rev, the revenue median from In [39], not median_bud)
(movies.query('budget_adj < {}'.format(median_rev))['vote_average'].median(),
 movies.query('budget_adj > {}'.format(median_rev))['vote_average'].median())
Partial conclusion
It is difficult to say that movies with a high budget have a better rating: in the bar chart, the two bars have
approximately the same height. For a deeper comparison, the median vote average of low- and high-budget movies was
calculated. The median vote average is 6.0 for movies with a low budget and 6.2 for movies with a high
budget.
6. Are movies with the highest revenue more popular?
In [52]: plt.scatter(movies['revenue_adj'], movies['popularity'], linewidth=5)
plt.title('Popularity by Revenue Level', fontsize=15)
plt.xlabel('Revenue Level', fontsize=15)
plt.ylabel('Average Popularity', fontsize=15);
plt.show()
In [53]: # mean popularity for each revenue level
median_rev = movies['revenue_adj'].median()
low = movies.query('revenue_adj < {}'.format(median_rev))
high = movies.query('revenue_adj >= {}'.format(median_rev))
# filtering to popularity columns and calculation of the mean
mean_low = low['popularity'].mean()
mean_high = high['popularity'].mean()
In [54]: # mean popularity of the low- and high-revenue groups, used as bar heights
heights = [mean_low, mean_high]
print(heights)
labels = ['low', 'high']
locations = [1,2]
plt.bar(locations, heights, tick_label=labels)
plt.title('Average Popularity by Revenue Level', fontsize=15)
plt.xlabel('Revenue Level', fontsize=12)
plt.ylabel('Average Popularity', fontsize=15);
In [55]: # counting the movie revenue unique values
movies.revenue_adj.value_counts().head()
In [56]: # 10 first values
movies.groupby('revenue_adj')['popularity'].value_counts().head(10)
In [57]: # 10 last values
movies.groupby('revenue_adj')['popularity'].value_counts().tail(10)
In [58]: # comparison of the median popularity of movies with low and high revenue
(movies.query('revenue_adj < {}'.format(median_rev))['popularity'].median(),
movies.query('revenue_adj > {}'.format(median_rev))['popularity'].median())
Partial conclusion
Movies with high revenue seem to be more popular than those with low revenue: the average popularity is
0.7420684714824547 for low-revenue movies and 0.9989869505300212 for high-revenue movies. Moreover, the median
popularity of the two groups (0.58808 versus 1.4886709999999999) points the same way.
7. Are movies with the highest budget more popular?
In [59]: # scatter plot of the movies budget versus popularity
plt.scatter(movies['budget_adj'], movies['popularity'], linewidth=5)
plt.title('Popularity by Budget Level', fontsize=15)
plt.xlabel('Budget Level', fontsize=15)
plt.ylabel('Average Popularity', fontsize=15);
plt.show()
In [60]: # mean popularity for each budget level (median_rev now holds the budget median)
median_rev = movies['budget_adj'].median()
low = movies.query('budget_adj < {}'.format(median_rev))
high = movies.query('budget_adj >= {}'.format(median_rev))
# filtering to popularity columns and calculation of the mean
mean_low = low['popularity'].mean()
mean_high = high['popularity'].mean()
In [61]: heights = [mean_low, mean_high]
print(heights)
labels = ['low', 'high']
locations = [1,2]
plt.bar(locations, heights, tick_label=labels)
plt.title('Average Popularity by Budget Level', fontsize=15)
plt.xlabel('Budget Level', fontsize=12)
plt.ylabel('Average Popularity', fontsize=15);
In [62]: # counting the movie budget unique values
movies.budget_adj.value_counts().head()
In [63]: # 10 first values
movies.groupby('budget_adj')['popularity'].value_counts().head(10)
In [64]: # 10 last values
movies.groupby('budget_adj')['popularity'].value_counts().tail(10)
In [65]: # comparison of the median popularity of movies with low and high budget
(movies.query('budget_adj < {}'.format(median_rev))['popularity'].median(),
movies.query('budget_adj > {}'.format(median_rev))['popularity'].median())
Partial conclusion
Movies with a high budget seem to be more popular than those with a low budget: the average popularity is
0.7564478409230605 for low-budget movies and 0.9790179787846799 for high-budget movies. Moreover, the median
popularity of the two groups (0.534192 versus 1.138395) points the same way.
Conclusions
In this project, we started our analysis by examining movie popularity by genre, and found that Adventure is the
most popular genre. We then examined the genres year by year. Since there is no correlation
between release_year and movie popularity, the count of movies released each year is used for that analysis. Based on the
relation between genre and vote average, we found that Documentary receives the highest rating. Finally, we
analyzed the dataset to answer different questions relating movie popularity and rating to revenue and
budget. While movies with high revenue and budget seem to be more popular, we could not find a correlation between
movie budget or revenue and rating.
Limitations
For a better analysis, more detail would be useful about the variables popularity and vote_average: how they are
calculated, and which factors or criteria go into them. The columns we are most interested
in for this analysis (budget, revenue, budget_adj and revenue_adj) contain many missing values, recorded as zeros, which were
filled using the mean. This is probably not the best way to fix those columns, since the mean is not always the best measure of
center. Another limitation of this analysis is the use of the median to categorize movies as low or high revenue and budget.
Since some movies have very large budgets and revenues, and since many missing values were filled with the mean,
the median may not be the best threshold for categorizing the movies.
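One way to go beyond a two-way median split is to bin movies into quantile groups with pd.qcut and compare the mean rating across the groups. The sketch below uses eight made-up rows under the dataset's column names; the bin labels are hypothetical.

```python
import pandas as pd

# Hypothetical values, chosen so that each quartile holds two movies
toy = pd.DataFrame({
    'revenue_adj': [1, 2, 3, 4, 5, 6, 7, 8],
    'vote_average': [5.0, 6.0, 6.0, 7.0, 6.0, 6.0, 7.0, 8.0],
})

# Four equally sized revenue groups instead of a low/high split
toy['revenue_level'] = pd.qcut(toy['revenue_adj'], q=4,
                               labels=['low', 'medium', 'high', 'very high'])

# Mean rating within each revenue group
mean_by_level = toy.groupby('revenue_level', observed=True)['vote_average'].mean()
print(mean_by_level)
```

On the real columns, many rows share the same filled-in value, so the quantile edges may coincide; pd.qcut accepts duplicates='drop' for that situation, at the cost of fewer bins.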
References:
[1]: https://www.themoviedb.org/about
Out[3]:
   id      imdb_id    popularity  budget     revenue     original_title                cast                                               homepage
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World                Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  http://www.jurassicworld
1  76341   tt1392190  28.419936   150000000  378436354   Mad Max: Fury Road            Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...  http://www.madmaxmovie
2  262500  tt2908446  13.112507   110000000  295238201   Insurgent                     Shailene Woodley|Theo James|Kate Winslet|Ansel...  http://www.thedivergentseries.movie/#insu
3  140607  tt2488496  11.173104   200000000  2068178225  Star Wars: The Force Awakens  Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...  http://www.starwars.com/films/star-w epi
4  168259  tt2820852  9.335014    190000000  1506249360  Furious 7                     Vin Diesel|Paul Walker|Jason Statham|Michelle ...  http://www.furious7
5 rows × 21 columns
Out[4]: id int64
imdb_id object
popularity float64
budget int64
revenue int64
original_title object
cast object
homepage object
director object
tagline object
keywords object
overview object
runtime int64
genres object
production_companies object
release_date object
vote_count int64
vote_average float64
release_year int64
budget_adj float64
revenue_adj float64
dtype: object
Out[5]: 10866
Out[6]: 21
Out[7]: 1
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 0 to 10865
Data columns (total 21 columns):
id 10865 non-null int64
imdb_id 10855 non-null object
popularity 10865 non-null float64
budget 10865 non-null int64
revenue 10865 non-null int64
original_title 10865 non-null object
cast 10789 non-null object
homepage 2936 non-null object
director 10821 non-null object
tagline 8041 non-null object
keywords 9372 non-null object
overview 10861 non-null object
runtime 10865 non-null int64
genres 10842 non-null object
production_companies 9835 non-null object
release_date 10865 non-null object
vote_count 10865 non-null int64
vote_average 10865 non-null float64
release_year 10865 non-null int64
budget_adj 10865 non-null float64
revenue_adj 10865 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.8+ MB
Out[10]:
    id      imdb_id    popularity  budget    revenue    original_title   cast                                               homepage  director          tagline
18  150689  tt1661199  5.556818    95000000  542351353  Cinderella       Lily James|Cate Blanchett|Richard Madden|Helen...  NaN       Kenneth Branagh   Midnight is just the beginning.
21  307081  tt1798684  5.337064    30000000  91709827   Southpaw         Jake Gyllenhaal|Rachel McAdams|Forest Whitaker...  NaN       Antoine Fuqua     Believe in Hope.
26  214756  tt2637276  4.564549    68000000  215863606  Ted 2            Mark Wahlberg|Seth MacFarlane|Amanda Seyfried|...  NaN       Seth MacFarlane   Ted is Coming, Again.
32  254470  tt2848292  3.877764    29000000  287506194  Pitch Perfect 2  Anna Kendrick|Rebel Wilson|Hailee Steinfeld|Br...  NaN       Elizabeth Banks   We're back pitches
33  296098  tt3682448  3.648210    40000000  162610473  Bridge of Spies  Tom Hanks|Mark Rylance|Amy Ryan|Alan Alda|Seba...   NaN       Steven Spielberg  In the shadow of war, one man showed the world...
5 rows × 21 columns
Out[11]: id 10865
imdb_id 10855
popularity 10814
budget 557
revenue 4702
original_title 10571
cast 10719
homepage 2896
director 5067
tagline 7997
keywords 8804
overview 10847
runtime 247
genres 2039
production_companies 7445
release_date 5909
vote_count 1289
vote_average 72
release_year 56
budget_adj 2614
revenue_adj 4840
dtype: int64
Out[12]:
id popularity budget revenue runtime vote_count vote_average release_year
count 10865.000000 10865.000000 1.086500e+04 1.086500e+04 10865.000000 10865.000000 10865.000000 10865.000000
mean 66066.374413 0.646446 1.462429e+07 3.982690e+07 102.071790 217.399632 5.975012 2001.321859
std 92134.091971 1.000231 3.091428e+07 1.170083e+08 31.382701 575.644627 0.935138 12.813260
min 5.000000 0.000065 0.000000e+00 0.000000e+00 0.000000 10.000000 1.500000 1960.000000
25% 10596.000000 0.207575 0.000000e+00 0.000000e+00 90.000000 17.000000 5.400000 1995.000000
50% 20662.000000 0.383831 0.000000e+00 0.000000e+00 99.000000 38.000000 6.000000 2006.000000
75% 75612.000000 0.713857 1.500000e+07 2.400000e+07 111.000000 146.000000 6.600000 2011.000000
max 417859.000000 32.985763 4.250000e+08 2.781506e+09 900.000000 9767.000000 9.200000 2015.000000
Index(['id', 'imdb_id', 'popularity', 'budget', 'revenue', 'original_title',
'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview',
'runtime', 'genres', 'production_companies', 'release_date',
'vote_count', 'vote_average', 'release_year', 'budget_adj',
'revenue_adj'],
dtype='object')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 0 to 10865
Data columns (total 17 columns):
id 10865 non-null int64
imdb_id 10855 non-null object
popularity 10865 non-null float64
budget 10865 non-null int64
revenue 10865 non-null int64
original_title 10865 non-null object
cast 10789 non-null object
director 10821 non-null object
runtime 10865 non-null int64
genres 10842 non-null object
production_companies 9835 non-null object
release_date 10865 non-null object
vote_count 10865 non-null int64
vote_average 10865 non-null float64
release_year 10865 non-null int64
budget_adj 10865 non-null float64
revenue_adj 10865 non-null float64
dtypes: float64(4), int64(6), object(7)
memory usage: 1.5+ MB
Out[18]: dtype('<M8[ns]')
Out[19]:
   id      imdb_id    popularity  budget     revenue     original_title      cast                                               director         runtime  genres           production_companies
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World      Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Action           Universa Entertain
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World      Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Adventure        Universa Entertain
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World      Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Science Fiction  Universa Entertain
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World      Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Thriller         Universa Entertain
1  76341   tt1392190  28.419936   150000000  378436354   Mad Max: Fury Road  Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...  George Miller    120      Action           Vi Pictures
Out[20]:
   id      imdb_id    popularity  budget     revenue     original_title  cast                                               director         runtime  genres  production_companies
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Action  Univers
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Action  Amblin Ent
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Action  Legenda
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Action  Fuji Televisio
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Action
Out[21]: 76851
Out[22]: 76851
Out[23]: 69433
Out[24]: 69433
budget 26775356.47371121
budget_adj 30889712.59859798
revenue 70334023.35863845
revenue_adj 84260108.11800905
Out[26]: id 181294
imdb_id 181249
popularity 181294
budget 181294
revenue 181294
original_title 181294
cast 181023
director 181104
runtime 181294
genres 181270
production_companies 179082
release_date 181294
vote_count 181294
vote_average 181294
release_year 181294
budget_adj 181294
revenue_adj 181294
dtype: int64
Out[27]: False
Out[28]: False
Out[29]: False
Out[30]: False
Out[31]: (181294, 17)
['Action' 'Adventure' 'Science Fiction' 'Thriller' 'Fantasy' 'Crime'
'Western' 'Drama' 'Family' 'Animation' 'Comedy' 'Mystery' 'Romance' 'War'
'History' 'Music' 'Horror' 'Documentary' 'TV Movie' nan 'Foreign']
Out[36]: genres
Action 5.859801
Adventure 5.962865
Animation 6.333965
Comedy 5.917464
Crime 6.112665
Documentary 6.957312
Drama 6.156389
Family 5.973175
Fantasy 5.895793
Foreign 5.892970
History 6.417070
Horror 5.444786
Music 6.302175
Mystery 5.986585
Romance 6.059295
Science Fiction 5.738771
TV Movie 5.651250
Thriller 5.848404
War 6.336557
Western 6.101556
Name: vote_average, dtype: float64
[5.971800334804548, 5.967507674675677]
Out[41]: 7.033402e+07 76851
2.000000e+06 152
2.000000e+07 126
1.200000e+07 126
5.318650e+08 125
Name: revenue, dtype: int64
Out[42]: revenue_adj vote_average
2.370705 6.4 12
2.861934 6.8 12
3.038360 7.7 5
5.926763 6.8 4
6.951084 4.9 8
8.585801 4.5 4
9.056820 6.7 20
9.115080 5.1 2
10.000000 4.2 9
10.296367 6.5 18
Name: vote_average, dtype: int64
Out[43]: revenue_adj vote_average
1.443191e+09 7.3 9
1.574815e+09 6.6 16
1.583050e+09 5.6 25
1.791694e+09 7.2 32
1.902723e+09 7.5 48
1.907006e+09 7.3 18
2.167325e+09 7.2 18
2.506406e+09 7.3 27
2.789712e+09 7.9 18
2.827124e+09 7.1 64
Name: vote_average, dtype: int64
Out[44]: (6.0, 6.3)
[5.981346153846566, 5.963905517657313]
[5.981346153846566, 5.963905517657313]
Out[48]: 2.677536e+07 69433
2.000000e+07 4331
2.500000e+07 4255
3.000000e+07 3902
4.000000e+07 3584
Name: budget, dtype: int64
Out[49]: budget_adj vote_average
0.921091 4.1 3
0.969398 5.3 20
1.012787 6.5 48
1.309053 4.8 8
2.908194 6.5 12
3.000000 7.3 12
4.519285 5.6 9
4.605455 6.0 1
5.006696 5.8 27
8.102293 6.9 45
Name: vote_average, dtype: int64
Out[50]: budget_adj vote_average
2.504192e+08 5.8 16
2.541001e+08 7.3 18
2.575999e+08 7.4 27
2.600000e+08 7.3 8
2.713305e+08 5.8 27
2.716921e+08 7.3 27
2.920507e+08 5.3 64
3.155006e+08 6.8 27
3.683713e+08 6.3 27
4.250000e+08 6.4 25
Name: vote_average, dtype: int64
Out[51]: (6.0, 6.2)
[0.7420684714824547, 0.9989869505300212]
Out[55]: 8.426011e+07 76851
2.358000e+07 125
4.978434e+08 125
2.231273e+07 125
1.934053e+07 125
Name: revenue_adj, dtype: int64
Out[56]: revenue_adj popularity
2.370705 0.462609 12
2.861934 0.552091 12
3.038360 0.352054 5
5.926763 0.208637 4
6.951084 0.578849 8
8.585801 0.183034 4
9.056820 0.450208 20
9.115080 0.113082 2
10.000000 0.559371 9
10.296367 0.222776 18
Name: popularity, dtype: int64
Out[57]: revenue_adj popularity
1.443191e+09 7.637767 9
1.574815e+09 2.631987 16
1.583050e+09 1.136610 25
1.791694e+09 2.900556 32
1.902723e+09 11.173104 48
1.907006e+09 2.563191 18
2.167325e+09 2.010733 18
2.506406e+09 4.355219 27
2.789712e+09 12.037933 18
2.827124e+09 9.432768 64
Name: popularity, dtype: int64
Out[58]: (0.58808, 1.4886709999999999)
[0.7564478409230605, 0.9790179787846799]
Out[62]: 3.088971e+07 69433
2.032801e+07 532
2.103337e+07 421
4.065602e+07 385
2.908194e+07 381
Name: budget_adj, dtype: int64
Out[63]: budget_adj popularity
0.921091 0.177102 3
0.969398 0.520430 20
1.012787 0.472691 48
1.309053 0.090186 8
2.908194 0.228643 12
3.000000 0.028456 12
4.519285 0.464188 9
4.605455 0.002922 1
5.006696 0.317091 27
8.102293 0.626646 45
Name: popularity, dtype: int64
Out[64]: budget_adj popularity
2.504192e+08 1.232098 16
2.541001e+08 5.076472 18
2.575999e+08 5.944927 27
2.600000e+08 2.865684 8
2.713305e+08 2.520912 27
2.716921e+08 4.355219 27
2.920507e+08 1.957331 64
3.155006e+08 4.965391 27
3.683713e+08 4.955130 27
4.250000e+08 0.250540 25
Name: popularity, dtype: int64
Out[65]: (0.534192, 1.138395)

Textual & Sentiment Analysis of Movie Reviews
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Fundamental Cloud Architectures
Fundamental Cloud ArchitecturesFundamental Cloud Architectures
Fundamental Cloud Architectures
 
Qlik View Corporate Overview Ppt Presentation
Qlik View Corporate Overview Ppt PresentationQlik View Corporate Overview Ppt Presentation
Qlik View Corporate Overview Ppt Presentation
 
How Amazon.com Uses AWS Analytics: Data Analytics Week SF
How Amazon.com Uses AWS Analytics: Data Analytics Week SFHow Amazon.com Uses AWS Analytics: Data Analytics Week SF
How Amazon.com Uses AWS Analytics: Data Analytics Week SF
 
CRM and ERP
CRM and ERPCRM and ERP
CRM and ERP
 
Case study: Implementation of OLAP operations
Case study: Implementation of OLAP operationsCase study: Implementation of OLAP operations
Case study: Implementation of OLAP operations
 
Salesforce PPT.pptx
Salesforce PPT.pptxSalesforce PPT.pptx
Salesforce PPT.pptx
 
AWS-Architecture-Icons-Deck_For-Dark-BG_04282023.pptx
AWS-Architecture-Icons-Deck_For-Dark-BG_04282023.pptxAWS-Architecture-Icons-Deck_For-Dark-BG_04282023.pptx
AWS-Architecture-Icons-Deck_For-Dark-BG_04282023.pptx
 

Similar to TMDb movie dataset by kaggle

movie_notebook.pdf
movie_notebook.pdfmovie_notebook.pdf
movie_notebook.pdfpinstechwork
 
movieRecommendation_FinalReport
movieRecommendation_FinalReportmovieRecommendation_FinalReport
movieRecommendation_FinalReportSohini Sarkar
 
R markup code to create Regression Model
R markup code to create Regression ModelR markup code to create Regression Model
R markup code to create Regression ModelMohit Rajput
 
· You are asked to develop a database system to keep track o.docx
· You are asked to develop a database system to keep track o.docx· You are asked to develop a database system to keep track o.docx
· You are asked to develop a database system to keep track o.docxalinainglis
 
MOVIE RECOMMENDATION SYSTEM.pptx
MOVIE RECOMMENDATION SYSTEM.pptxMOVIE RECOMMENDATION SYSTEM.pptx
MOVIE RECOMMENDATION SYSTEM.pptxAyushkumar417871
 
Se276 enterprise computingassignment
Se276 enterprise computingassignmentSe276 enterprise computingassignment
Se276 enterprise computingassignmentharinathinfotech
 
IRJET- Movie Success Prediction using Data Mining and Social Media
IRJET- Movie Success Prediction using Data Mining and Social MediaIRJET- Movie Success Prediction using Data Mining and Social Media
IRJET- Movie Success Prediction using Data Mining and Social MediaIRJET Journal
 
projectreport
projectreportprojectreport
projectreportWeston Wei
 
MasterSearch_Meetup_AdvancedAnalytics
MasterSearch_Meetup_AdvancedAnalyticsMasterSearch_Meetup_AdvancedAnalytics
MasterSearch_Meetup_AdvancedAnalyticsLonghow Lam
 
Lesson 2 data preprocessing
Lesson 2   data preprocessingLesson 2   data preprocessing
Lesson 2 data preprocessingAbdurRazzaqe1
 
movie recommender system using vectorization and SVD tech
movie recommender system using vectorization and SVD techmovie recommender system using vectorization and SVD tech
movie recommender system using vectorization and SVD techUddeshBhagat
 
Building a Movie Success Predictor
Building a Movie Success PredictorBuilding a Movie Success Predictor
Building a Movie Success PredictorYouness Lahdili
 
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter AnalysisIBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter AnalysisTorsten Steinbach
 
Company segmentation - an approach with R
Company segmentation - an approach with RCompany segmentation - an approach with R
Company segmentation - an approach with RCasper Crause
 
Term 2 CS Practical File 2021-22.pdf
Term 2 CS Practical File 2021-22.pdfTerm 2 CS Practical File 2021-22.pdf
Term 2 CS Practical File 2021-22.pdfKiranKumari204016
 
Web Services Aggregator
Web Services AggregatorWeb Services Aggregator
Web Services AggregatorDhaval Patel
 
Alexandria ACM Student Chapter | Specification & Verification of Data-Centric...
Alexandria ACM Student Chapter | Specification & Verification of Data-Centric...Alexandria ACM Student Chapter | Specification & Verification of Data-Centric...
Alexandria ACM Student Chapter | Specification & Verification of Data-Centric...AlexACMSC
 

Similar to TMDb movie dataset by kaggle (20)

movie_notebook.pdf
movie_notebook.pdfmovie_notebook.pdf
movie_notebook.pdf
 
movieRecommendation_FinalReport
movieRecommendation_FinalReportmovieRecommendation_FinalReport
movieRecommendation_FinalReport
 
R markup code to create Regression Model
R markup code to create Regression ModelR markup code to create Regression Model
R markup code to create Regression Model
 
· You are asked to develop a database system to keep track o.docx
· You are asked to develop a database system to keep track o.docx· You are asked to develop a database system to keep track o.docx
· You are asked to develop a database system to keep track o.docx
 
MOVIE RECOMMENDATION SYSTEM.pptx
MOVIE RECOMMENDATION SYSTEM.pptxMOVIE RECOMMENDATION SYSTEM.pptx
MOVIE RECOMMENDATION SYSTEM.pptx
 
Se276 enterprise computingassignment
Se276 enterprise computingassignmentSe276 enterprise computingassignment
Se276 enterprise computingassignment
 
C++ assignment
C++ assignmentC++ assignment
C++ assignment
 
IRJET- Movie Success Prediction using Data Mining and Social Media
IRJET- Movie Success Prediction using Data Mining and Social MediaIRJET- Movie Success Prediction using Data Mining and Social Media
IRJET- Movie Success Prediction using Data Mining and Social Media
 
projectreport
projectreportprojectreport
projectreport
 
MasterSearch_Meetup_AdvancedAnalytics
MasterSearch_Meetup_AdvancedAnalyticsMasterSearch_Meetup_AdvancedAnalytics
MasterSearch_Meetup_AdvancedAnalytics
 
Foresee your movie revenue
Foresee your movie revenueForesee your movie revenue
Foresee your movie revenue
 
Lesson 2 data preprocessing
Lesson 2   data preprocessingLesson 2   data preprocessing
Lesson 2 data preprocessing
 
movie recommender system using vectorization and SVD tech
movie recommender system using vectorization and SVD techmovie recommender system using vectorization and SVD tech
movie recommender system using vectorization and SVD tech
 
Building a Movie Success Predictor
Building a Movie Success PredictorBuilding a Movie Success Predictor
Building a Movie Success Predictor
 
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter AnalysisIBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis
 
Company segmentation - an approach with R
Company segmentation - an approach with RCompany segmentation - an approach with R
Company segmentation - an approach with R
 
Final project kijtorntham n
Final project kijtorntham nFinal project kijtorntham n
Final project kijtorntham n
 
Term 2 CS Practical File 2021-22.pdf
Term 2 CS Practical File 2021-22.pdfTerm 2 CS Practical File 2021-22.pdf
Term 2 CS Practical File 2021-22.pdf
 
Web Services Aggregator
Web Services AggregatorWeb Services Aggregator
Web Services Aggregator
 
Alexandria ACM Student Chapter | Specification & Verification of Data-Centric...
Alexandria ACM Student Chapter | Specification & Verification of Data-Centric...Alexandria ACM Student Chapter | Specification & Verification of Data-Centric...
Alexandria ACM Student Chapter | Specification & Verification of Data-Centric...
 

TMDb movie dataset by kaggle

Udacity Data Analyst Nanodegree
P2: Investigate [TMDb Movie] dataset
Author: Mouhamadou GUEYE
Date: May 26, 2019

Table of contents
Introduction
Data Wrangling
Exploratory Data Analysis
Conclusions

Introduction
In this project we analyze a dataset containing information about 10,000 movies collected from The Movie Database (TMDb). In particular, we are interested in finding trends relating to the most popular movies by genre, and in how movie rating and popularity vary with budget and revenue.

Background:
The [Movie Database TMDB](https://www.themoviedb.org/) is a community built movie and TV database. Every piece of data has been added by our amazing community dating back to 2008. TMDb's strong international focus and breadth of data is largely unmatched and something we're incredibly proud of. Put simply, we live and breathe community and that's precisely what makes us different.

The TMDb Advantage:
1. Every year since 2008, the number of contributions to our database has increased. With over 125,000 developers and companies using our platform, TMDb has become a premiere source for metadata.
2. Along with extensive metadata for movies, TV shows and people, we also offer one of the best selections of high resolution posters and fan art. On average, over 1,000 images are added every single day.
3. We're international. While we officially support 39 languages we also have extensive regional data. Every single day TMDb is used in over 180 countries.
4. Our community is second to none. Between our staff and community moderators, we're always here to help. We're passionate about making sure your experience on TMDb is nothing short of amazing.
5. Trusted platform. Every single day our service is used by millions of people while we process over 2 billion requests. We've proven for years that this is a service that can be trusted and relied on.
This organization profile is not owned or maintained by TMDb: datasets hosted under this organization profile use the TMDb API but are not endorsed or certified by TMDb[1].

Research Questions for Investigation
1. Which genres are the most popular?
2. Which genres are the most popular from year to year?
3. Are movies with the highest revenue more popular?
4. Are movies with the highest budget more popular?
5. Do movies with the highest revenue receive a better rating?
6. Do movies with the highest budget receive a better rating?

Dataset
This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. The dataset used in this project is a cleaned version of the original dataset on Kaggle, where its full description can be found.

In [1]:
# package imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline

Data Wrangling

General Properties
In this step we inspect the dataset in order to understand its properties and structure:
- The data type of each column
- The number of samples in the dataset
- The number of columns in the dataset
- Duplicate rows, if any
- Features with missing values
- The number of non-null unique values for each feature
- What those unique values are, and the count of each

In [2]:
movies = pd.read_csv('tmdb-movies.csv')

In [3]:
# print the first five rows of the dataframe
movies.head()

Columns Data Types
In [4]:
movies.dtypes

Number of samples/columns
In [5]:
# number of rows in the movies dataset
movies.shape[0]

In [6]:
# number of columns in the movies dataset
movies.shape[1]

Duplicate Rows
In [7]:
# number of duplicate rows in the movies dataset
sum(movies.duplicated())

Deletion of duplicates
In [8]:
# show the duplicate rows in the movies dataset, then drop them
movies[movies.duplicated()]
movies.drop_duplicates(inplace=True)

Missing Values
We notice that there are missing values in the following columns:
homepage, overview, release_date, tagline, runtime, cast, production_companies, director, genres, etc.

In [9]:
# information about the dataset
movies.info()

In [10]:
# inspect rows with missing values
movies[movies.isnull().any(axis='columns')].head()

Number of Distinct Observations
In [11]:
movies.nunique()

Descriptive Statistics Summary
In [12]:
movies.describe()

Data Cleaning
In this step we clean the dataset: we remove columns that are irrelevant to our analysis, convert the release date column from a string to a datetime object, replace the many zero values in the budget and revenue columns with the column means, and handle the columns that hold multiple values separated by a pipe (|) by splitting them into separate rows.

In [13]:
# movies dataset columns
print(movies.columns)

Dropping Extraneous Columns
These columns will be dropped since they are not relevant to our data analysis.

In [14]:
# columns to drop from the movies dataset; these columns are irrelevant to our analysis
columns = ['homepage', 'tagline', 'overview', 'keywords']

In [15]:
movies.drop(labels=columns, axis=1, inplace=True)

In [16]:
movies.info()

Convert release_date to a datetime Object
The release_date column is stored as a string; we use pandas' to_datetime function to convert it from a string to a datetime dtype.
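A caveat on the conversion described above: the dates use a two-digit year, and the `%y` directive maps 00-68 to 2000-2068 and 69-99 to 1969-1999. The sketch below illustrates this with Python's standard `datetime.strptime`, on the assumption that pandas' `to_datetime` follows the same convention for `%y`. The helper name `parse_release_date` is hypothetical, and it assumes the four-digit `release_year` column agrees with the date string.

```python
from datetime import datetime

def parse_release_date(text, release_year):
    """Parse an 'm/d/yy' string and pin the century using release_year.

    Hypothetical helper: assumes release_year is the correct four-digit
    year for the same row.
    """
    parsed = datetime.strptime(text, '%m/%d/%y')
    # %y alone cannot tell 1966 from 2066, so take the year from release_year
    return parsed.replace(year=release_year)

# a film released on 20 December 1966, written with a two-digit year
naive = datetime.strptime('12/20/66', '%m/%d/%y')
fixed = parse_release_date('12/20/66', 1966)
print(naive.year, fixed.year)
```

If the same behavior holds in pandas, rows whose parsed year is later than `release_year` could be corrected in a similar way after the conversion.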
In [17]:
movies['release_date'] = pd.to_datetime(movies['release_date'], format='%m/%d/%y')

In [18]:
# check the date format after cleaning
movies['release_date'].dtype

Dealing with Multiple Values Columns
In [19]:
# split the genres column into one row per genre
movies = (movies.drop('genres', axis=1)
          .join(movies['genres'].str.split('|', expand=True)
                .stack().reset_index(level=1, drop=True)
                .rename('genres'))
          .loc[:, movies.columns])
movies.head()

In [20]:
# split the production_companies column into one row per company
movies = (movies.drop('production_companies', axis=1)
          .join(movies['production_companies'].str.split('|', expand=True)
                .stack().reset_index(level=1, drop=True)
                .rename('production_companies'))
          .loc[:, movies.columns])
movies.head()

Fill zero values in the revenue and budget columns
Here we inspect the columns revenue, revenue_adj, budget and budget_adj, counting the number of rows with a value of 0 before replacing those values with the column mean.

In [21]:
# number of rows with zero revenue
movies[movies['revenue'] == 0].count()['revenue']

In [22]:
# number of rows with zero adjusted revenue
movies[movies['revenue_adj'] == 0].count()['revenue_adj']

In [23]:
# number of rows with zero budget
movies[movies['budget'] == 0].count()['budget']

In [24]:
# number of rows with zero adjusted budget
movies[movies['budget_adj'] == 0].count()['budget_adj']

In [25]:
# replace zeros in the revenue and budget columns with the column mean
# (the mean is taken over all rows, zeros included)
cols = ['budget', 'budget_adj', 'revenue', 'revenue_adj']
for item in cols:
    print(item, movies[item].mean())
    movies[item] = movies[item].replace({0: movies[item].mean()})

In [26]:
# check whether the columns have been successfully filled
movies[movies['revenue'].notnull()].count()

In [27]:
# should return False: no zero values remain
(movies['revenue'] == 0).any()

In [28]:
# should return False: no zero values remain
(movies['revenue_adj'] == 0).any()

In [29]:
# should return False: no zero values remain
(movies['budget'] == 0).any()

In [30]:
# should return False: no zero values remain
(movies['budget_adj'] == 0).any()

Check Number of samples/columns
In [31]:
movies.shape

Visual Trends
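The histograms that follow are drawn after the genres and production_companies columns have been split into rows, so a single movie can contribute several rows. The sketch below uses a hypothetical two-movie frame (not data from the dataset) and `DataFrame.explode` with `ignore_index=True` (available in pandas 1.1 and later) rather than the `stack`/`join` approach used in the notebook. It shows how the split changes the row count and shifts a plain column mean toward movies with many genres and companies.

```python
import pandas as pd

# hypothetical two-movie frame, for illustration only
toy = pd.DataFrame({
    'id': [1, 2],
    'genres': ['Action|Adventure', 'Drama'],
    'production_companies': ['A|B', 'C'],
    'popularity': [2.0, 1.0],
})

# split each pipe-separated column into one row per value
long = toy.copy()
for col in ['genres', 'production_companies']:
    long[col] = long[col].str.split('|')
    long = long.explode(col, ignore_index=True)

rows_before, rows_after = len(toy), len(long)

# mean over the split rows versus mean over one row per movie
mean_per_row = long['popularity'].mean()
mean_per_movie = long.drop_duplicates('id')['popularity'].mean()
print(rows_before, rows_after, mean_per_row, mean_per_movie)
```

Keeping one row per `id` before averaging is one possible way to weight each movie once; whether that is appropriate depends on the question being asked.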
In [32]:
movies.hist(figsize=(15,10));

Exploratory data analysis

1. Which genres are more popular?
In [33]:
# unique genres present in the dataframe
genres = movies['genres'].unique()
print(genres)

In [34]:
# group movies by genre and compute the mean of the popularity column
movies_by_genres = movies.groupby('genres')['popularity'].mean()
# plot the bar chart of mean popularity by genre
movies_by_genres.plot(kind='bar', alpha=.7, figsize=(15,6))
plt.xlabel("Genres", fontsize=18);
plt.ylabel("Popularity", fontsize=18);
plt.xticks(fontsize=10)
plt.title('Average movie popularity by genre', fontsize=18);
plt.grid(True)

2. Which genres are most popular from year to year?
In [35]:
# plot data
fig, ax = plt.subplots(figsize=(15,7))
# count the rows for each genre in each release year
# (this is a count of entries, not the mean of the popularity score)
grouped = (movies.groupby(['release_year', 'genres']).count()['popularity']
           .unstack().plot(ax=ax, figsize=(15,6)))
plt.xlabel("release year", fontsize=18);
plt.ylabel("count", fontsize=18);
plt.xticks(fontsize=10)
plt.title('movie popularity year by year', fontsize=18);

3. Which movie genres receive the highest average rating?
In [36]:
# group the movies by genre and compute the mean of the rating column
rating = movies.groupby('genres')['vote_average'].mean()
rating

In [37]:
# bar chart of the mean movie rating by genre
rating.plot(kind='bar', alpha=0.7)
plt.xlabel('Movie Genre', fontsize=12)
plt.ylabel('Vote Average', fontsize=12)
plt.title('Average Movie Quality by Genre', fontsize=12)
plt.grid(True)

4. Do movies with high revenue receive the highest rating?
In [38]:
# scatter plot of adjusted revenue versus vote rating
plt.scatter(movies['revenue_adj'], movies['vote_average'], linewidth=5)
plt.title('Vote Ratings by Revenue Level', fontsize=15)
plt.xlabel('Revenue Level', fontsize=15)
plt.ylabel('Average Vote Rating', fontsize=15);
plt.show()

In [39]:
# mean rating for each revenue level
median_rev = movies['revenue_adj'].median()
low = movies.query('revenue_adj < {}'.format(median_rev))
high = movies.query('revenue_adj >= {}'.format(median_rev))
# select the vote_average column and compute the mean
mean_low = low['vote_average'].mean()
mean_high = high['vote_average'].mean()

In [40]:
heights = [mean_low, mean_high]
print(heights)
labels = ['low', 'high']
locations = [1,2]
plt.bar(locations, heights, tick_label=labels)
plt.title('Average Vote Ratings by Revenue Level', fontsize=15)
plt.xlabel('Revenue Level', fontsize=12)
plt.ylabel('Average Vote Rating', fontsize=15);

In [41]:
# most frequent revenue values
movies.revenue.value_counts().head()

In [42]:
# first 10 values
movies.groupby('revenue_adj')['vote_average'].value_counts().head(10)

In [43]:
# last 10 values
movies.groupby('revenue_adj')['vote_average'].value_counts().tail(10)

In [44]:
# comparison of the median vote average of movies with low and high revenue
(movies.query('revenue_adj < {}'.format(median_rev))['vote_average'].median(),
 movies.query('revenue_adj > {}'.format(median_rev))['vote_average'].median())

Partial conclusion
It is difficult to say that movies with high revenue have a better rating, since the two bars in the bar chart are approximately the same height. For a closer comparison, the median vote average of movies with low and high revenue is calculated: it is 6.0 for movies with low revenue and 6.3 for movies with high revenue.

5. Do movies with high budget get the highest rating?
In [45]:
# scatter plot of the budget versus vote rating
plt.scatter(movies['budget_adj'], movies['vote_average'], linewidth=5)
plt.title('Vote Ratings by Budget Level', fontsize=15)
plt.xlabel('Budget Level', fontsize=15)
plt.ylabel('Vote Rating', fontsize=15);
plt.show()

In [46]:
# mean rating for each budget level
median_bud = movies['budget_adj'].median()
low = movies.query('budget_adj < {}'.format(median_bud))
high = movies.query('budget_adj >= {}'.format(median_bud))
# filter to the vote_average column and compute the mean
mean_low = low['vote_average'].mean()
mean_high = high['vote_average'].mean()
print([mean_low, mean_high])

In [47]:
heights = [mean_low, mean_high]
print(heights)
labels = ['low', 'high']
locations = [1,2]
plt.bar(locations, heights, tick_label=labels)
plt.title('Average Vote Ratings by Budget Level', fontsize=15)
plt.xlabel('Budget Level', fontsize=12)
plt.ylabel('Average Vote Rating', fontsize=15);

In [48]:
# counting the movie budget unique values
movies.budget.value_counts().head()

In [49]:
# 10 first values
movies.groupby('budget_adj')['vote_average'].value_counts().head(10)

In [50]:
# 10 last values
movies.groupby('budget_adj')['vote_average'].value_counts().tail(10)

In [51]:
# comparison of the median rating of movies with low and high budget
(movies.query('budget_adj < {}'.format(median_bud))['vote_average'].median(),
 movies.query('budget_adj > {}'.format(median_bud))['vote_average'].median())

Partial conclusion
It is difficult to say that movies with high budget have a better rating: in the bar chart, the two bars have approximately the same height. For a deeper comparison, the median vote average of the low-budget and high-budget groups is calculated. The median vote average is 6.0 for movies with low budget and 6.2 for movies with high budget.

6. Do movies with the highest revenue have more popularity?
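A two-level split at the median is one way to look at this question. A finer view, sketched below under stated assumptions, bins the revenue into quartiles with `pd.qcut` and compares the mean popularity per bin. The level names and the toy values are hypothetical and only serve the illustration.

```python
import pandas as pd

# Toy data: hypothetical values, not taken from the TMDb dataset
toy = pd.DataFrame({
    "revenue_adj": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "popularity": [0.2, 0.4, 0.3, 0.5, 0.9, 0.7, 1.5, 2.5],
})

# Cut the revenue into four equally populated bins (quartiles)
toy["revenue_level"] = pd.qcut(
    toy["revenue_adj"], q=4, labels=["low", "medium", "high", "very high"]
)

# Mean popularity within each revenue level
mean_pop = toy.groupby("revenue_level", observed=True)["popularity"].mean()
print(mean_pop)
```

Note that `pd.qcut` raises an error when many rows share the same value and the bin edges are not unique, which could happen on a column where missing values were filled with a single number.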
In [52]:
plt.scatter(movies['revenue_adj'], movies['popularity'], linewidth=5)
plt.title('Popularity by Revenue Level', fontsize=15)
plt.xlabel('Revenue Level', fontsize=15)
plt.ylabel('Average Popularity', fontsize=15);
plt.show()

In [53]:
# mean popularity for each revenue level
median_rev = movies['revenue_adj'].median()
low = movies.query('revenue_adj < {}'.format(median_rev))
high = movies.query('revenue_adj >= {}'.format(median_rev))
# filter to the popularity column and compute the mean
mean_low = low['popularity'].mean()
mean_high = high['popularity'].mean()

In [54]:
# mean popularity of the low and high revenue groups for the bar chart
heights = [mean_low, mean_high]
print(heights)
labels = ['low', 'high']
locations = [1,2]
plt.bar(locations, heights, tick_label=labels)
plt.title('Average Popularity by Revenue Level', fontsize=15)
plt.xlabel('Revenue Level', fontsize=12)
plt.ylabel('Average Popularity', fontsize=15);

In [55]:
# counting the movie revenue unique values
movies.revenue_adj.value_counts().head()

In [56]:
# 10 first values
movies.groupby('revenue_adj')['popularity'].value_counts().head(10)

In [57]:
# 10 last values
movies.groupby('revenue_adj')['popularity'].value_counts().tail(10)

In [58]:
# comparison of the median popularity of movies with low and high revenue
(movies.query('revenue_adj < {}'.format(median_rev))['popularity'].median(),
 movies.query('revenue_adj > {}'.format(median_rev))['popularity'].median())

Partial conclusion
Movies with high revenue seem to be more popular than those with low revenue: the average popularity is about 0.742 for the low-revenue group and about 0.999 for the high-revenue group. Moreover, comparing the median popularity of the two groups (about 0.588 for low revenue versus 1.489 for high revenue) shows the same pattern: movies with high revenue are more popular.

7. Do movies with the highest budget have more popularity?
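The comparison between low-budget and high-budget movies can also be written as a single `groupby` that reports the mean and the median popularity side by side. This is a minimal sketch on made-up numbers, not the notebook's code. The name `budget_level` and the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy data: hypothetical values, not taken from the TMDb dataset
toy = pd.DataFrame({
    "budget_adj": [5.0, 10.0, 15.0, 20.0, 25.0, 30.0],
    "popularity": [0.1, 0.3, 0.2, 0.8, 1.0, 3.0],
})

# Label each row as low or high budget relative to the median budget
median_bud = toy["budget_adj"].median()
budget_level = (
    (toy["budget_adj"] >= median_bud)
    .map({False: "low", True: "high"})
    .rename("budget_level")
)

# Mean and median popularity for each budget level in one table
summary = toy.groupby(budget_level)["popularity"].agg(["mean", "median"])
print(summary)
```

Reporting both statistics together makes it easier to see whether a few very popular movies pull the mean away from the median.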
In [59]:
# scatter plot of the movies budget versus popularity
plt.scatter(movies['budget_adj'], movies['popularity'], linewidth=5)
plt.title('Popularity by Budget Level', fontsize=15)
plt.xlabel('Budget Level', fontsize=15)
plt.ylabel('Average Popularity', fontsize=15);
plt.show()

In [60]:
# mean popularity for each budget level
median_bud = movies['budget_adj'].median()
low = movies.query('budget_adj < {}'.format(median_bud))
high = movies.query('budget_adj >= {}'.format(median_bud))
# filter to the popularity column and compute the mean
mean_low = low['popularity'].mean()
mean_high = high['popularity'].mean()

In [61]:
heights = [mean_low, mean_high]
print(heights)
labels = ['low', 'high']
locations = [1,2]
plt.bar(locations, heights, tick_label=labels)
plt.title('Average Popularity by Budget Level', fontsize=15)
plt.xlabel('Budget Level', fontsize=12)
plt.ylabel('Average Popularity', fontsize=15);

In [62]:
# counting the movie budget unique values
movies.budget_adj.value_counts().head()

In [63]:
# 10 first values
movies.groupby('budget_adj')['popularity'].value_counts().head(10)

In [64]:
# 10 last values
movies.groupby('budget_adj')['popularity'].value_counts().tail(10)

In [65]:
# comparison of the median popularity of movies with low and high budget
(movies.query('budget_adj < {}'.format(median_bud))['popularity'].median(),
 movies.query('budget_adj > {}'.format(median_bud))['popularity'].median())

Partial conclusion
Movies with high budget seem to be more popular than those with low budget: the average popularity is about 0.756 for the low-budget group and about 0.979 for the high-budget group. Moreover, comparing the median popularity of the two groups (about 0.534 for low budget versus 1.138 for high budget) shows the same pattern: movies with high budget seem more popular.

Conclusions
In this project, we started our analysis by examining the most popular movies by genre. We noticed that Adventure movies are the most popular genre.
We then examined movie popularity year by year. Since there is no correlation between release_year and movie popularity, the count of movies released each year is used for this analysis. Based on the relation between genres and vote average, we found that Documentary receives the highest rating. Moreover, we analyzed the dataset to answer different questions relating movie popularity and rating to revenue and budget. While movies with high revenue and budget seem to be more popular, we could not find a correlation between movie budget or revenue and rating.

Limitations
For a better analysis, more details would be useful regarding the variables popularity and vote_average: how they are calculated, and which factors or criteria are used in their calculation. The columns of interest in this analysis (budget, revenue, budget_adj and revenue_adj) contain many missing values, which were filled using the mean. This is probably not the best way to fix those columns, since the mean is not always the best measure of center. Another limitation of this analysis is the use of the median to categorize movies into low and high revenue and budget. Since some movies have huge budgets and revenues, and since many missing values were filled with the mean, the median may not be the best threshold for categorizing the movies.

References:
[1]: https://www.themoviedb.org/about

Out[3]:
    id      imdb_id    popularity  budget     revenue     original_title                cast                                               homepage
0   135397  tt0369610  32.985763   150000000  1513528810  Jurassic World                Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  http://www.jurassicworld...
1   76341   tt1392190  28.419936   150000000  378436354   Mad Max: Fury Road            Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...  http://www.madmaxmovie...
2   262500  tt2908446  13.112507   110000000  295238201   Insurgent                     Shailene Woodley|Theo James|Kate Winslet|Ansel...  http://www.thedivergentseries.movie/#insu...
3   140607  tt2488496  11.173104   200000000  2068178225  Star Wars: The Force Awakens  Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...  http://www.starwars.com/films/star-w...
4   168259  tt2820852  9.335014    190000000  1506249360  Furious 7                     Vin Diesel|Paul Walker|Jason Statham|Michelle ...  http://www.furious7...
5 rows × 21 columns

Out[4]:
id                        int64
imdb_id                  object
popularity              float64
budget                    int64
revenue                   int64
original_title           object
cast                     object
homepage                 object
director                 object
tagline                  object
keywords                 object
overview                 object
runtime                   int64
genres                   object
production_companies     object
release_date             object
vote_count                int64
vote_average            float64
release_year              int64
budget_adj              float64
revenue_adj             float64
dtype: object

Out[5]: 10866
Out[6]: 21
Out[7]: 1

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 0 to 10865
Data columns (total 21 columns):
id                      10865 non-null int64
imdb_id                 10855 non-null object
popularity              10865 non-null float64
budget                  10865 non-null int64
revenue                 10865 non-null int64
original_title          10865 non-null object
cast                    10789 non-null object
homepage                2936 non-null object
director                10821 non-null object
tagline                 8041 non-null object
keywords                9372 non-null object
overview                10861 non-null object
runtime                 10865 non-null int64
genres                  10842 non-null object
production_companies    9835 non-null object
release_date            10865 non-null object
vote_count              10865 non-null int64
vote_average            10865 non-null float64
release_year            10865 non-null int64
budget_adj              10865 non-null float64
revenue_adj             10865 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.8+ MB

Out[10]:
    id      imdb_id    popularity  budget    revenue    original_title   cast                                               homepage  director          tagline
18  150689  tt1661199  5.556818    95000000  542351353  Cinderella       Lily James|Cate Blanchett|Richard Madden|Helen...  NaN       Kenneth Branagh   Midnight is just the beginning.
21  307081  tt1798684  5.337064    30000000  91709827   Southpaw         Jake Gyllenhaal|Rachel McAdams|Forest Whitaker...  NaN       Antoine Fuqua     Believe in Hope.
26  214756  tt2637276  4.564549    68000000  215863606  Ted 2            Mark Wahlberg|Seth MacFarlane|Amanda Seyfried|...  NaN       Seth MacFarlane   Ted is Coming, Again.
32  254470  tt2848292  3.877764    29000000  287506194  Pitch Perfect 2  Anna Kendrick|Rebel Wilson|Hailee Steinfeld|Br...  NaN       Elizabeth Banks   We're back pitches
33  296098  tt3682448  3.648210    40000000  162610473  Bridge of Spies  Tom Hanks|Mark Rylance|Amy Ryan|Alan Alda|Seba...  NaN       Steven Spielberg  In the shadow of war, one man showed the world...
5 rows × 21 columns

Out[11]:
id                      10865
imdb_id                 10855
popularity              10814
budget                    557
revenue                  4702
original_title          10571
cast                    10719
homepage                 2896
director                 5067
tagline                  7997
keywords                 8804
overview                10847
runtime                   247
genres                   2039
production_companies     7445
release_date             5909
vote_count               1289
vote_average               72
release_year               56
budget_adj               2614
revenue_adj              4840
dtype: int64

Out[12]:
       id             popularity    budget        revenue       runtime       vote_count    vote_average  release_year
count  10865.000000   10865.000000  1.086500e+04  1.086500e+04  10865.000000  10865.000000  10865.000000  10865.000000
mean   66066.374413   0.646446      1.462429e+07  3.982690e+07  102.071790    217.399632    5.975012      2001.321859
std    92134.091971   1.000231      3.091428e+07  1.170083e+08  31.382701     575.644627    0.935138      12.813260
min    5.000000       0.000065      0.000000e+00  0.000000e+00  0.000000      10.000000     1.500000      1960.000000
25%    10596.000000   0.207575      0.000000e+00  0.000000e+00  90.000000     17.000000     5.400000      1995.000000
50%    20662.000000   0.383831      0.000000e+00  0.000000e+00  99.000000     38.000000     6.000000      2006.000000
75%    75612.000000   0.713857      1.500000e+07  2.400000e+07  111.000000    146.000000    6.600000      2011.000000
max    417859.000000  32.985763     4.250000e+08  2.781506e+09  900.000000    9767.000000   9.200000      2015.000000

Index(['id', 'imdb_id', 'popularity', 'budget', 'revenue', 'original_title',
       'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview',
       'runtime', 'genres', 'production_companies', 'release_date',
       'vote_count', 'vote_average', 'release_year', 'budget_adj',
       'revenue_adj'],
      dtype='object')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10865 entries, 0 to 10865
Data columns (total 17 columns):
id                      10865 non-null int64
imdb_id                 10855 non-null object
popularity              10865 non-null float64
budget                  10865 non-null int64
revenue                 10865 non-null int64
original_title          10865 non-null object
cast                    10789 non-null object
director                10821 non-null object
runtime                 10865 non-null int64
genres                  10842 non-null object
production_companies    9835 non-null object
release_date            10865 non-null object
vote_count              10865 non-null int64
vote_average            10865 non-null float64
release_year            10865 non-null int64
budget_adj              10865 non-null float64
revenue_adj             10865 non-null float64
dtypes: float64(4), int64(6), object(7)
memory usage: 1.5+ MB

Out[18]: dtype('<M8[ns]')

Out[19]:
   id      imdb_id    popularity  budget     revenue     original_title      cast                                               director         runtime  genres           production_companies
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World      Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Action           Universa Entertain...
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World      Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Adventure        Universa Entertain...
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World      Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Science Fiction  Universa Entertain...
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World      Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Thriller         Universa Entertain...
1  76341   tt1392190  28.419936   150000000  378436354   Mad Max: Fury Road  Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...  George Miller    120      Action           Vi Pictures...

Out[20]:
   id      imdb_id    popularity  budget     revenue     original_title  cast                                               director         runtime  genres  production_companies
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Action  Univers...
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Action  Amblin Ent...
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Action  Legenda...
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Action  Fuji Televisio...
0  135397  tt0369610  32.985763   150000000  1513528810  Jurassic World  Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...  Colin Trevorrow  124      Action

Out[21]: 76851
Out[22]: 76851
Out[23]: 69433
Out[24]: 69433

budget 26775356.47371121
budget_adj 30889712.59859798
revenue 70334023.35863845
revenue_adj 84260108.11800905

Out[26]:
id                      181294
imdb_id                 181249
popularity              181294
budget                  181294
revenue                 181294
original_title          181294
cast                    181023
director                181104
runtime                 181294
genres                  181270
production_companies    179082
release_date            181294
vote_count              181294
vote_average            181294
release_year            181294
budget_adj              181294
revenue_adj             181294
dtype: int64

Out[27]: False
Out[28]: False
Out[29]: False
Out[30]: False
Out[31]: (181294, 17)

['Action' 'Adventure' 'Science Fiction' 'Thriller' 'Fantasy' 'Crime'
 'Western' 'Drama' 'Family' 'Animation' 'Comedy' 'Mystery' 'Romance' 'War'
 'History' 'Music' 'Horror' 'Documentary' 'TV Movie' nan 'Foreign']

Out[36]:
genres
Action             5.859801
Adventure          5.962865
Animation          6.333965
Comedy             5.917464
Crime              6.112665
Documentary        6.957312
Drama              6.156389
Family             5.973175
Fantasy            5.895793
Foreign            5.892970
History            6.417070
Horror             5.444786
Music              6.302175
Mystery            5.986585
Romance            6.059295
Science Fiction    5.738771
TV Movie           5.651250
Thriller           5.848404
War                6.336557
Western            6.101556
Name: vote_average, dtype: float64

[5.971800334804548, 5.967507674675677]

Out[41]:
7.033402e+07    76851
2.000000e+06      152
2.000000e+07      126
1.200000e+07      126
5.318650e+08      125
Name: revenue, dtype: int64

Out[42]:
revenue_adj  vote_average
2.370705     6.4             12
2.861934     6.8             12
3.038360     7.7              5
5.926763     6.8              4
6.951084     4.9              8
8.585801     4.5              4
9.056820     6.7             20
9.115080     5.1              2
10.000000    4.2              9
10.296367    6.5             18
Name: vote_average, dtype: int64

Out[43]:
revenue_adj   vote_average
1.443191e+09  7.3              9
1.574815e+09  6.6             16
1.583050e+09  5.6             25
1.791694e+09  7.2             32
1.902723e+09  7.5             48
1.907006e+09  7.3             18
2.167325e+09  7.2             18
2.506406e+09  7.3             27
2.789712e+09  7.9             18
2.827124e+09  7.1             64
Name: vote_average, dtype: int64

Out[44]: (6.0, 6.3)

[5.981346153846566, 5.963905517657313]
[5.981346153846566, 5.963905517657313]

Out[48]:
2.677536e+07    69433
2.000000e+07     4331
2.500000e+07     4255
3.000000e+07     3902
4.000000e+07     3584
Name: budget, dtype: int64

Out[49]:
budget_adj  vote_average
0.921091    4.1              3
0.969398    5.3             20
1.012787    6.5             48
1.309053    4.8              8
2.908194    6.5             12
3.000000    7.3             12
4.519285    5.6              9
4.605455    6.0              1
5.006696    5.8             27
8.102293    6.9             45
Name: vote_average, dtype: int64

Out[50]:
budget_adj    vote_average
2.504192e+08  5.8             16
2.541001e+08  7.3             18
2.575999e+08  7.4             27
2.600000e+08  7.3              8
2.713305e+08  5.8             27
2.716921e+08  7.3             27
2.920507e+08  5.3             64
3.155006e+08  6.8             27
3.683713e+08  6.3             27
4.250000e+08  6.4             25
Name: vote_average, dtype: int64

Out[51]: (6.0, 6.2)

[0.7420684714824547, 0.9989869505300212]

Out[55]:
8.426011e+07    76851
2.358000e+07      125
4.978434e+08      125
2.231273e+07      125
1.934053e+07      125
Name: revenue_adj, dtype: int64

Out[56]:
revenue_adj  popularity
2.370705     0.462609      12
2.861934     0.552091      12
3.038360     0.352054       5
5.926763     0.208637       4
6.951084     0.578849       8
8.585801     0.183034       4
9.056820     0.450208      20
9.115080     0.113082       2
10.000000    0.559371       9
10.296367    0.222776      18
Name: popularity, dtype: int64

Out[57]:
revenue_adj   popularity
1.443191e+09  7.637767       9
1.574815e+09  2.631987      16
1.583050e+09  1.136610      25
1.791694e+09  2.900556      32
1.902723e+09  11.173104     48
1.907006e+09  2.563191      18
2.167325e+09  2.010733      18
2.506406e+09  4.355219      27
2.789712e+09  12.037933     18
2.827124e+09  9.432768      64
Name: popularity, dtype: int64

Out[58]: (0.58808, 1.4886709999999999)
[0.7564478409230605, 0.9790179787846799]

Out[62]:
3.088971e+07    69433
2.032801e+07      532
2.103337e+07      421
4.065602e+07      385
2.908194e+07      381
Name: budget_adj, dtype: int64

Out[63]:
budget_adj  popularity
0.921091    0.177102       3
0.969398    0.520430      20
1.012787    0.472691      48
1.309053    0.090186       8
2.908194    0.228643      12
3.000000    0.028456      12
4.519285    0.464188       9
4.605455    0.002922       1
5.006696    0.317091      27
8.102293    0.626646      45
Name: popularity, dtype: int64

Out[64]:
budget_adj    popularity
2.504192e+08  1.232098      16
2.541001e+08  5.076472      18
2.575999e+08  5.944927      27
2.600000e+08  2.865684       8
2.713305e+08  2.520912      27
2.716921e+08  4.355219      27
2.920507e+08  1.957331      64
3.155006e+08  4.965391      27
3.683713e+08  4.955130      27
4.250000e+08  0.250540      25
Name: popularity, dtype: int64

Out[65]: (0.534192, 1.138395)
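The limitation about filling missing budget and revenue values with the mean can be illustrated with a small sketch. It assumes that missing values are recorded as 0 (an assumption made here for illustration) and compares a mean fill with a median fill on made-up numbers that include one very large value. This is not the notebook's wrangling code.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical values, with 0 standing in for a missing budget
toy = pd.DataFrame({"budget_adj": [0.0, 0.0, 10.0, 20.0, 300.0]})

# Treat zeros as missing values (assumption for this sketch)
budget = toy["budget_adj"].replace(0, np.nan)

# Fill the gaps with the mean, then with the median, of the known values
filled_mean = budget.fillna(budget.mean())
filled_median = budget.fillna(budget.median())

print(filled_mean.tolist())
print(filled_median.tolist())
```

On skewed data such as this toy column, the mean fill is pulled toward the single large value, while the median fill stays closer to the typical movie.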