SlideShare a Scribd company logo
1 of 31
Download to read offline
Predicting
Opening Weekend
Box Office Performance
From IMDB Search Frequency of Principal Cast Members
Christopher Giler January 26, 2018
Agenda
1. Project Objective
2. Data Overview
3. Key Features
4. Exploratory Data Analysis
5. Feature Engineering
6. Modeling Box Office Sales
7. Conclusion & Next Steps
2
1
Project Objective
Defining Project Goals
▪ How much does casting
affect the hype
surrounding a movie’s
release?
▪ Evaluate the impact of
cast popularity on
opening weekend box
office ticket sales.
The Problem Statement
4
2
Data Overview
Scraping and Cleaning from Data Sources
2014 ~ 2017
4 years of box office data
8
Total features for model
(raw)
994
Movies available w/
box office data
6
▪ IMDB
▫ Release Date
▫ Opening Weekend Gross
▫ Metacritic / IMDB Reviews
▫ Genre
▫ MPAA Rating
▫ Principal Cast
▪ IMDB Pro
▫ STARmeter Data (next slide)
▫ Number Theaters
IMDB Pro’s STARmeter
▪ Cast / Crew ranking
based on IMDB search
frequency
▪ Presented as time
series data
▪ Scraped via Python /
Selenium
7
3
Key Features
From IMDB Pro
▪ Feature defined as minimum
star ranking prior to 12
months before release date.
▪ A film’s “Star Power” is
based on average of top 5
star rankings
Using IMDB Pro’s StarMeter Data
9
Opening Weekend Gross per Theater
10
▪ Number of theaters is highly
correlated with overall
opening weekend gross
4
Data Analysis
Exploring the Data
Continuous Features
▪ All continuous features
should have normal
distribution.
▪ Issues:
▫ Opening Weekend
Gross
▫ Number of Theaters
▫ Star Power
12
Continuous Features
▪ All continuous features
should have normal
distribution.
▪ Logarithmic transform
applied to skewed data.
13
Continuous Features
▪ All continuous features
should have normal
distribution.
▪ Logarithmic transform
applied to skewed data.
▪ Removed all 2018 data
points
14
Categorical Features
▪ Movie Genres
(Action, Drama, Comedy, etc.)
▪ Release Season
(Spring, Summer, Autumn, Winter)
▪ MPAA Rating
(PG, PG-13, R)
▪ Film Release Type
(Wide vs. Limited)
15
5
Feature Engineering
Modifying Model Features
▪ Opening Weekend
Gross vs. Star
Ranking alone does
not show a clear
linear relationship.
Star Power
17
Star Power
18
▪ Opening Weekend
Gross vs. Star
Ranking alone does
not show a clear
linear relationship.
▪ “Limited Release”:
# Theaters < 600
6
Modeling Box Office Sales
Final Model Results
Limited Release - Feature Selection
20
▪ Model trained on all data.
▪ Features selected based on
individual p-values.
▪ Remaining features:
▫ Num_Theaters
▫ Runtime_Minutes
▫ Star_Power
Wide Release - Feature Selection
▪ Model trained on all data.
▪ Features selected based on
individual p-values.
▪ Remaining features:
▫ Num_Theaters
▫ Runtime_Minutes
▫ Star_Power
▫ Release_Year
▫ Rated_R
▫ Release_in_Spring
21
▪ 70-30 random
train-test split
▪ Test data shown
▪ Predicted results
within expected
range
Predicted vs. Actual Opening Weekend Gross
22
▪ Model trained 70% of data, and validated
on remaining 30%.
▪ Cross-Validation score aligns with
observed results.
Residuals for Train/Test Split
23
Regression Model 5-Fold Cross-Validation R^2 Score
All Data (Base) 0.16
Limited Release 0.49
Wide Release 0.64
30% Train-Test Split
▪ 2018 Opening Weekend Box Office Predictions
Applying Predictive Model
24
Movie Title Release Date Predicted Gross Actual Gross Release Type % Error
The Post 2018-01-12 $19,783,540 $19,887,979 wide -0.53 %
Paddington 2 2018-01-12 $18,109,110 $11,001,961 wide 64.60 %
7
Conclusion
And Next Steps
Conclusions
Results
▪ “Star Power” is a factor in
predicting opening
weekend box office
performance.
▪ Splitting data into two
models proved effective.
▪ Genre removed as feature
Next Steps
▪ New features to consider:
▫ Box office competition
▫ Is Sequel?
(Title/Brand recognition)
▫ Other metrics for cast /
crew popularity
(Google Trends, Twitter)
26
27
THANKS!
ANY QUESTIONS?
Scraping IMDB Pro’s
StarMeter
▪ Visualization uses
JavaScript SVG
Element
▪ Scraping using
Selenium
Place your screenshot here
28
All Data - Feature Elimination
29
Limited Release - Feature Elimination
30
Wide Release - Feature Elimination
31

More Related Content

Similar to Predicting Movie Box Office Hype

Sprint backlog specified by example
Sprint backlog specified by exampleSprint backlog specified by example
Sprint backlog specified by example
Agora Group
 

Similar to Predicting Movie Box Office Hype (20)

Continuous Performance Testing
Continuous Performance TestingContinuous Performance Testing
Continuous Performance Testing
 
DataScienceLab2017_Оптимизация гиперпараметров машинного обучения при помощи ...
DataScienceLab2017_Оптимизация гиперпараметров машинного обучения при помощи ...DataScienceLab2017_Оптимизация гиперпараметров машинного обучения при помощи ...
DataScienceLab2017_Оптимизация гиперпараметров машинного обучения при помощи ...
 
Video Recommendation Engines as a Service
Video Recommendation Engines as a ServiceVideo Recommendation Engines as a Service
Video Recommendation Engines as a Service
 
Metis Project 2: Predicting Worldwide Gross - JungleBoogie
Metis Project 2: Predicting Worldwide Gross - JungleBoogieMetis Project 2: Predicting Worldwide Gross - JungleBoogie
Metis Project 2: Predicting Worldwide Gross - JungleBoogie
 
Front end performance on Shopify.com
Front end performance on Shopify.comFront end performance on Shopify.com
Front end performance on Shopify.com
 
Copy of CRICKET MATCH WIN PREDICTOR USING LOGISTIC ...
Copy of CRICKET MATCH WIN PREDICTOR USING LOGISTIC                           ...Copy of CRICKET MATCH WIN PREDICTOR USING LOGISTIC                           ...
Copy of CRICKET MATCH WIN PREDICTOR USING LOGISTIC ...
 
Worst Practices in Data Warehouse Design
Worst Practices in Data Warehouse DesignWorst Practices in Data Warehouse Design
Worst Practices in Data Warehouse Design
 
PASS Summit 2010 Keynote David DeWitt
PASS Summit 2010 Keynote David DeWittPASS Summit 2010 Keynote David DeWitt
PASS Summit 2010 Keynote David DeWitt
 
Building Products Quantitatively
Building Products QuantitativelyBuilding Products Quantitatively
Building Products Quantitatively
 
Planning XP
Planning XPPlanning XP
Planning XP
 
Planning XP - Extreme Programming
Planning XP - Extreme ProgrammingPlanning XP - Extreme Programming
Planning XP - Extreme Programming
 
Test Automation Challenges in the Gaming Industry
Test Automation Challenges in the Gaming IndustryTest Automation Challenges in the Gaming Industry
Test Automation Challenges in the Gaming Industry
 
Agile dwh
Agile dwhAgile dwh
Agile dwh
 
Predictive model
Predictive modelPredictive model
Predictive model
 
SivaRamaKrishna_CV_9.6 yrs Testing
SivaRamaKrishna_CV_9.6 yrs TestingSivaRamaKrishna_CV_9.6 yrs Testing
SivaRamaKrishna_CV_9.6 yrs Testing
 
Advanced topics in Agile: Implementing Scrum in a project-based company
Advanced topics in Agile: Implementing Scrum in a project-based companyAdvanced topics in Agile: Implementing Scrum in a project-based company
Advanced topics in Agile: Implementing Scrum in a project-based company
 
Phils Session cards @ Measurecamp
Phils Session cards @ MeasurecampPhils Session cards @ Measurecamp
Phils Session cards @ Measurecamp
 
presentation.pdf
presentation.pdfpresentation.pdf
presentation.pdf
 
Netflix Recommendations Feature Engineering with Time Travel
Netflix Recommendations Feature Engineering with Time TravelNetflix Recommendations Feature Engineering with Time Travel
Netflix Recommendations Feature Engineering with Time Travel
 
Sprint backlog specified by example
Sprint backlog specified by exampleSprint backlog specified by example
Sprint backlog specified by example
 

Recently uploaded

原件一样伦敦国王学院毕业证成绩单留信学历认证
原件一样伦敦国王学院毕业证成绩单留信学历认证原件一样伦敦国王学院毕业证成绩单留信学历认证
原件一样伦敦国王学院毕业证成绩单留信学历认证
pwgnohujw
 
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
edited gordis ebook sixth edition david d.pdf
edited gordis ebook sixth edition david d.pdfedited gordis ebook sixth edition david d.pdf
edited gordis ebook sixth edition david d.pdf
great91
 
Audience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptxAudience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptx
Stephen266013
 
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
dq9vz1isj
 
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
Amil baba
 
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
fztigerwe
 
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Valters Lauzums
 
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
yulianti213969
 
Displacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second DerivativesDisplacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second Derivatives
23050636
 
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
zifhagzkk
 
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
yulianti213969
 

Recently uploaded (20)

NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam DunksNOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
 
Predictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting TechniquesPredictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting Techniques
 
原件一样伦敦国王学院毕业证成绩单留信学历认证
原件一样伦敦国王学院毕业证成绩单留信学历认证原件一样伦敦国王学院毕业证成绩单留信学历认证
原件一样伦敦国王学院毕业证成绩单留信学历认证
 
Formulas dax para power bI de microsoft.pdf
Formulas dax para power bI de microsoft.pdfFormulas dax para power bI de microsoft.pdf
Formulas dax para power bI de microsoft.pdf
 
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
 
Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024
 
edited gordis ebook sixth edition david d.pdf
edited gordis ebook sixth edition david d.pdfedited gordis ebook sixth edition david d.pdf
edited gordis ebook sixth edition david d.pdf
 
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital AgeCredit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
 
Audience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptxAudience Researchndfhcvnfgvgbhujhgfv.pptx
Audience Researchndfhcvnfgvgbhujhgfv.pptx
 
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
 
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
 
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
 
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarjSCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
SCI8-Q4-MOD11.pdfwrwujrrjfaajerjrajrrarj
 
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
 
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
 
Displacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second DerivativesDisplacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second Derivatives
 
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
如何办理(Dalhousie毕业证书)达尔豪斯大学毕业证成绩单留信学历认证
 
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
 
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
Identify Rules that Predict Patient’s Heart Disease - An Application of Decis...
 
Digital Marketing Demystified: Expert Tips from Samantha Rae Coolbeth
Digital Marketing Demystified: Expert Tips from Samantha Rae CoolbethDigital Marketing Demystified: Expert Tips from Samantha Rae Coolbeth
Digital Marketing Demystified: Expert Tips from Samantha Rae Coolbeth
 

Predicting Movie Box Office Hype

  • 1. Predicting Opening Weekend Box Office Performance From IMDB Search Frequency of Principal Cast Members Christopher Giler January 26, 2018
  • 2. Agenda 1. Project Objective 2. Data Overview 3. Key Features 4. Exploratory Data Analysis 5. Feature Engineering 6. Modeling Box Office Sales 7. Conclusion & Next Steps 2
  • 4. ▪ How much does casting affect the hype surrounding a movie’s release? ▪ Evaluate the impact of cast popularity on opening weekend box office ticket sales. The Problem Statement 4
  • 5. 2 Data Overview Scraping and Cleaning from Data Sources
  • 6. 2014 ~ 2017 4 years of box office data 8 Total features for model (raw) 994 Movies available w/ box office data 6 ▪ IMDB ▫ Release Date ▫ Opening Weekend Gross ▫ Metacritic / IMDB Reviews ▫ Genre ▫ MPAA Rating ▫ Principal Cast ▪ IMDB Pro ▫ STARmeter Data (next slide) ▫ Number Theaters
  • 7. IMDB Pro’s STARmeter ▪ Cast / Crew ranking based on IMDB search frequency ▪ Presented as time series data ▪ Scraped via Python / Selenium 7
  • 9. ▪ Feature defined as minimum star ranking prior to 12 months before release date. ▪ A film’s “Star Power” is based on average of top 5 star rankings Using IMDB Pro’s StarMeter Data 9
  • 10. Opening Weekend Gross per Theater 10 ▪ Number of theaters is highly correlated with overall opening weekend gross
  • 12. Continuous Features ▪ All continuous features should have normal distribution. ▪ Issues: ▫ Opening Weekend Gross ▫ Number of Theaters ▫ Star Power 12
  • 13. Continuous Features ▪ All continuous features should have normal distribution. ▪ Logarithmic transform applied to skewed data. 13
  • 14. Continuous Features ▪ All continuous features should have normal distribution. ▪ Logarithmic transform applied to skewed data. ▪ Removed all 2018 data points 14
  • 15. Categorical Features ▪ Movie Genres (Action, Drama, Comedy, etc.) ▪ Release Season (Spring, Summer, Autumn, Winter) ▪ MPAA Rating (PG, PG-13, R) ▪ Film Release Type (Wide vs. Limited) 15
  • 17. ▪ Opening Weekend Gross vs. Star Ranking alone does not show a clear linear relationship. Star Power 17
  • 18. Star Power 18 ▪ Opening Weekend Gross vs. Star Ranking alone does not show a clear linear relationship. ▪ “Limited Release”: # Theaters < 600
  • 19. 6 Modeling Box Office Sales Final Model Results
  • 20. Limited Release - Feature Selection 20 ▪ Model trained on all data. ▪ Features selected based on individual p-values. ▪ Remaining features: ▫ Num_Theaters ▫ Runtime_Minutes ▫ Star_Power
  • 21. Wide Release - Feature Selection ▪ Model trained on all data. ▪ Features selected based on individual p-values. ▪ Remaining features: ▫ Num_Theaters ▫ Runtime_Minutes ▫ Star_Power ▫ Release_Year ▫ Rated_R ▫ Release_in_Spring 21
  • 22. ▪ 70-30 random train-test split ▪ Test data shown ▪ Predicted results within expected range Predicted vs. Actual Opening Weekend Gross 22
  • 23. ▪ Model trained 70% of data, and validated on remaining 30%. ▪ Cross-Validation score aligns with observed results. Residuals for Train/Test Split 23 Regression Model 5-Fold Cross-Validation R^2 Score All Data (Base) 0.16 Limited Release 0.49 Wide Release 0.64 30% Train-Test Split
  • 24. ▪ 2018 Opening Weekend Box Office Predictions Applying Predictive Model 24 Movie Title Release Date Predicted Gross Actual Gross Release Type % Error The Post 2018-01-12 $19,783,540 $19,887,979 wide -0.53 % Paddington 2 2018-01-12 $18,109,110 $11,001,961 wide 64.60 %
  • 26. Conclusions Results ▪ “Star Power” is a factor in predicting opening weekend box office performance. ▪ Splitting data into two models proved effective. ▪ Genre removed as feature Next Steps ▪ New features to consider: ▫ Box office competition ▫ Is Sequel? (Title/Brand recognition) ▫ Other metrics for cast / crew popularity (Google Trends, Twitter) 26
  • 28. Scraping IMDB Pro’s StarMeter ▪ Visualization uses JavaScript SVG Element ▪ Scraping using Selenium Place your screenshot here 28
  • 29. All Data - Feature Elimination 29
  • 30. Limited Release - Feature Elimination 30
  • 31. Wide Release - Feature Elimination 31