Predicting Future
NFL Hall of Famers
By: Sam Binenfeld
Introduction
• What is the goal of this project?
• To predict which NFL players will be inducted into the Hall of Fame in the near future.
• How do we achieve our goal?
• Acquire data from Kaggle.com and www.pro-football-reference.com
• Explore the data to find out what qualifies players to be in the HOF
• Create a model that can predict the likelihood that a player will be in the HOF
• Run our model on current NFL Players, and players that retired after 2005
Project Restrictions
• We only include players of the following positions:
• Quarterback
• Running Back
• Wide Receiver
• We restrict our training data to players whose careers started after 1960 and ended before 2006.
Data Acquisition
• The core source of our data comes from the Kaggle dataset NFL Statistics
• We collect the three tables we want: Passing, Rushing, and Receiving
• We merge these three tables together, so that we have one table with all of our
information
• We bring in more data from another source, www.pro-football-reference.com
• We get 6 supplemental tables, and merge them with our initial dataset
• Hall of Famers
• Super Bowls
• Game-Winning Drives
• Playoff Game-Winning Drives
• MVPs
• Super Bowl MVPs
Data Cleaning
• Remove Text
• Some of the number fields have text mixed in with them
• Fix Position Column: Remove Nulls and Reformat
• To fill these nulls, we create a function that is able to determine the player’s
position.
• The function looks at passing yards, rushing yards, and receiving yards, and
determines which is the highest.
• It then assigns the player a position of quarterback (0), running back (1), or
wide receiver (2)
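The position-filling rule described above can be sketched as follows. This is a minimal illustration, not the author's actual function: the name `infer_position` and the tie-breaking order (quarterback, then running back, then wide receiver) are assumptions, and the inputs are toy numbers.

```python
# Position codes as given in the slides.
QUARTERBACK, RUNNING_BACK, WIDE_RECEIVER = 0, 1, 2


def infer_position(passing_yards, rushing_yards, receiving_yards):
    """Return the position code whose yardage total is highest.

    Ties resolve in the order QB, RB, WR; that tie rule is an assumption.
    """
    totals = [
        (passing_yards, QUARTERBACK),
        (rushing_yards, RUNNING_BACK),
        (receiving_yards, WIDE_RECEIVER),
    ]
    # max() returns the first maximal element, which gives the tie order above.
    _, position = max(totals, key=lambda pair: pair[0])
    return position


# Toy inputs, not real player statistics.
print(infer_position(3200.0, 150.0, 0.0))   # passing yards dominate
print(infer_position(0.0, 50.0, 800.0))     # receiving yards dominate
```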
Data Pre-Processing
• Create New Columns
• Create columns for a player’s first and last year in the league
• Adjust the following statistics for inflation:
• Passing Yards
• TD Passes
• Rushing Yards
• Rushing TDs
• Receiving Yards
• Receiving TDs
• Create Combined Feature Variables
• SB – Total Super Bowls (Super Bowl Wins + Super Bowl Losses)
• RRYd – Receiving Yards + Rushing Yards
• RRTD – Receiving Touchdowns + Rushing Touchdowns
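The three combined feature variables on this slide are simple sums. A small sketch, assuming each player is held as a dictionary keyed by the column names from the appendix (the function name `add_combined_features` and the sample values are hypothetical):

```python
def add_combined_features(player):
    """Return a copy of a player's stat dict with SB, RRYd and RRTD added."""
    combined = dict(player)
    # SB: total Super Bowl appearances (wins plus losses).
    combined["SB"] = player["SB Win"] + player["SB Loss"]
    # RRYd / RRTD: receiving plus rushing production, using the adjusted columns.
    combined["RRYd"] = player["Receiving Yards adj"] + player["Rushing Yards adj"]
    combined["RRTD"] = player["Receiving TDs adj"] + player["Rushing TDs adj"]
    return combined


# Toy values, not real player statistics.
example = {
    "SB Win": 2.0, "SB Loss": 1.0,
    "Receiving Yards adj": 1500.0, "Rushing Yards adj": 9000.0,
    "Receiving TDs adj": 8.0, "Rushing TDs adj": 70.0,
}
print(add_combined_features(example)["SB"])
```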
Data Pre-Processing
• Transform the Data
• Pivot the data to convert from yearly stats to career stats
• Recalculate Ratio Columns
• Passing Yards Per Game – Passing Yards / Games Played
• Completion Percentage – Passes Completed / Passes Attempted
• Yards Per Carry – Rushing Yards / Rushing Attempts
• Lasso (L1 Regularization)
• We use the Lasso method (at ⍺ = .01) to perform feature selection
• We reduce our number of features from 52 to 14
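The yearly-to-career pivot and the ratio recalculation can be sketched as below. The point of recalculating is that a career ratio should come from career totals rather than from averaging each season's ratio. This is an illustrative sketch only: the function names, the `"qb-1"` identifier, the unadjusted column names, and the zero-denominator convention are assumptions.

```python
from collections import defaultdict

# Counting stats that can be summed across seasons (names follow the slides).
COUNT_COLUMNS = [
    "Games Played", "Passing Yards", "Passes Completed",
    "Passes Attempted", "Rushing Yards", "Rushing Attempts",
]


def safe_ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def career_stats(season_rows):
    """Sum yearly rows per player, then derive ratios from the career totals."""
    totals = defaultdict(lambda: dict.fromkeys(COUNT_COLUMNS, 0.0))
    for row in season_rows:
        for column in COUNT_COLUMNS:
            totals[row["Player Id"]][column] += row.get(column, 0.0)

    careers = {}
    for player_id, t in totals.items():
        career = dict(t)
        career["Passing Yards Per Game"] = safe_ratio(t["Passing Yards"], t["Games Played"])
        career["Completion Percentage"] = safe_ratio(t["Passes Completed"], t["Passes Attempted"])
        career["Yards Per Carry"] = safe_ratio(t["Rushing Yards"], t["Rushing Attempts"])
        careers[player_id] = career
    return careers


# Two toy seasons for one hypothetical player.
seasons = [
    {"Player Id": "qb-1", "Games Played": 16, "Passing Yards": 4000,
     "Passes Completed": 300, "Passes Attempted": 500,
     "Rushing Yards": 100, "Rushing Attempts": 25},
    {"Player Id": "qb-1", "Games Played": 8, "Passing Yards": 1200,
     "Passes Completed": 100, "Passes Attempted": 200,
     "Rushing Yards": 20, "Rushing Attempts": 5},
]
career = career_stats(seasons)["qb-1"]
print(career["Passing Yards Per Game"])
```

In this toy case the career figure is 5200 / 24 yards per game, whereas averaging the two seasonal ratios (250 and 150) would give 200, which is why the ratios are recomputed after the pivot.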
Data Exploration
• Introduction to the Cleaned Dataset
• Our cleaned dataset consists of 2,153 rows and 14 columns
• Of the 2,153 players, only 47 are in the Hall of Fame (2.2%)
• Position Distribution
Data Exploration: Quarterbacks
• For quarterbacks, the variable that has the
highest correlation with Hall of Fame is Playoff
Game-Winning Drives.
• We can look at a histogram of quarterbacks with
at least one Playoff Game-Winning Drive
• All HOF quarterbacks have at least one PGWD.
• All quarterbacks with more than 2 PGWDs are in
the HOF.
Data Exploration: Running Backs
• 13 running backs have won either a League MVP or a Super Bowl MVP (3 have won both).
• 12 of these 13 MVP winners are in the HOF.
Plot legend:
• Red: HOF running backs with at least one MVP (Super Bowl or League MVP)
• Blue: HOF running backs without an MVP award
• Gray: Non-HOF running backs
Data Exploration: Wide Receivers
• In this plot, HOF Wide
Receivers are in red
• Squares represent
someone who has been to
at least one Super Bowl
• Shapes with a gold star in
the middle show a player
who has been to 4 or
more Super Bowls
Modeling
• The algorithms we will test are:
• Random Forest
• Logistic Regression
• Support Vector Classifier
• Gradient Boosting
• The resampling methods we will test are:
• Random Over-Sampling
• SMOTE
• Random Under-Sampling
• We will use F1 score as our Performance Metric
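A short sketch of why F1 is a more informative metric than accuracy here. With 47 Hall of Famers among 2,153 players, a classifier that never predicts HOF is already about 97.8% accurate but has an F1 of zero. The function below uses the standard definitions of the four metrics; applying it to the full dataset counts is only an illustration, since the reported scores in the slides come from the author's own evaluation splits.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Undefined ratios are reported as 0.0 (a common convention, assumed here).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# A classifier that labels every one of the 2,153 players as "not HOF":
# the 47 Hall of Famers become false negatives, the other 2,106 true negatives.
baseline = classification_metrics(tp=0, fp=0, fn=47, tn=2106)
print(round(baseline["accuracy"], 3), baseline["f1"])
```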
Modeling: Variables
• Our response variable is HOF (Hall of Fame)
• Our predictor variables are:
• SB MVP
• RRTD
• Receiving Yards Per Game
• MVP
• Rushing Yards adj
• PGWD
• TD Passes adj
• RRYd
• Passing Yards Per Game
• SB
• SB Win
• Position
Modeling: Initial Testing
Classifier                  Sampler             Accuracy  F1        Precision  Recall
GradientBoostingClassifier  nosampler           0.993321  0.824689  0.87056    0.803636
GradientBoostingClassifier  SMOTE               0.990538  0.812408  0.797711   0.842179
GradientBoostingClassifier  RandomOverSampler   0.991095  0.805656  0.799634   0.82599
RandomForestClassifier      nosampler           0.992022  0.785662  0.905429   0.699957
LogisticRegression          nosampler           0.989239  0.770769  0.844451   0.71984
RandomForestClassifier      RandomOverSampler   0.98961   0.756497  0.780022   0.746907
RandomForestClassifier      SMOTE               0.988312  0.738019  0.791825   0.709014
GradientBoostingClassifier  RandomUnderSampler  0.960297  0.554085  0.401329   0.954762
RandomForestClassifier      RandomUnderSampler  0.956401  0.549563  0.392505   0.993333
LogisticRegression          SMOTE               0.956401  0.486309  0.334287   0.984615
LogisticRegression          RandomOverSampler   0.948237  0.420189  0.269592   1
LogisticRegression          RandomUnderSampler  0.926345  0.358347  0.224078   0.977381
SVC                         SMOTE               0.44026   0.072452  0.037694   1
SVC                         RandomUnderSampler  0.187384  0.050136  0.025757   1
SVC                         RandomOverSampler   0.97885   0         0          0
SVC                         nosampler           0.980519  0         0          0
Modeling: Parameter Tuning
• Gradient Boosting algorithm was our best performer
• The first table shows the values tested for each parameter
• The second table shows the optimal value for each parameter
Hyper-Parameter    Values Tested
max_depth          [5, 6, 7, 8]
max_features       [0.1, 0.2, 0.3, 0.4]
min_samples_leaf   [5, 10, 20, 50]
min_samples_split  [0.05, 0.075, 0.1]
subsample          [0.5, 0.75, 0.9, 1]
Hyper-Parameter    Optimal Value
max_depth          7
max_features       0.2
min_samples_leaf   5
min_samples_split  0.1
subsample          0.9
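To give a sense of the size of this search, the sketch below enumerates every combination of the tested values using only the standard library. The slides do not say which search utility was used, so this is not the author's tuning code; an exhaustive grid search would fit one model per combination (and per cross-validation fold, if folds are used). The helper name `grid_candidates` is hypothetical.

```python
from itertools import product

# Values tested, copied from the slide.
param_grid = {
    "max_depth": [5, 6, 7, 8],
    "max_features": [0.1, 0.2, 0.3, 0.4],
    "min_samples_leaf": [5, 10, 20, 50],
    "min_samples_split": [0.05, 0.075, 0.1],
    "subsample": [0.5, 0.75, 0.9, 1],
}

# Optimal values, copied from the slide.
optimal = {
    "max_depth": 7,
    "max_features": 0.2,
    "min_samples_leaf": 5,
    "min_samples_split": 0.1,
    "subsample": 0.9,
}


def grid_candidates(grid):
    """Yield one parameter dict per combination of the listed values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


candidates = list(grid_candidates(param_grid))
print(len(candidates))           # 4 * 4 * 4 * 3 * 4 = 768
print(optimal in candidates)     # the reported optimum is one of the grid points
```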
Modeling: Optimal Model
• We re-run the Gradient-Boosting algorithm with optimal
parameters
• We find the best re-sampling method is the Random Over Sampler
Classifier                  Sampler               Accuracy  F1        Precision  Recall
GradientBoostingClassifier  Random Over-Sampler   0.994286  0.875867  0.841344   0.922296
GradientBoostingClassifier  nosampler             0.992764  0.833550  0.875636   0.807443
GradientBoostingClassifier  SMOTE                 0.992393  0.832843  0.765702   0.925740
GradientBoostingClassifier  Random Under-Sampler  0.963636  0.560210  0.401982   0.983445
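Random over-sampling, the best-performing resampler above, balances the classes by duplicating minority-class rows at random until every class is as large as the largest one. The sampler names in the tables match class names in the imbalanced-learn library, but the slides do not state which library was used, so the following is a standard-library sketch of the idea rather than the author's code. The function name and the fixed seed are assumptions.

```python
import random


def random_over_sample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Sample with replacement to make up the shortfall for this class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy data: 8 non-HOF rows and 2 HOF rows.
rows = [[float(i)] for i in range(10)]
labels = [0] * 8 + [1] * 2
new_rows, new_labels = random_over_sample(rows, labels)
print(new_labels.count(0), new_labels.count(1))
```

Resampling is normally applied to the training portion only, so that duplicated rows do not leak into the data used for scoring; whether the slides' evaluation did this is not stated.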
Results: Future Hall of Famers
• We take our trained model and make predictions on current NFL players and on players who retired in 2006 or later
• Our model performs well: it correctly predicted all four players who have been inducted into the Hall of Fame (highlighted in green)
Conclusion
• We are happy to have created a model that is able to classify a
test set with 99.4% accuracy, 84% precision, and 92% recall.
• Future study could leverage other data sources to predict Hall of
Famers in the many other NFL positions
• We could improve our model by acquiring other relevant data and
statistics (ex: # of Pro-Bowl Teams)
Appendix: Variables Used in Modeling
Column                    Data Type  Description
RRTD                      float64    Rushing TDs adj + Receiving TDs adj
SB MVP                    float64    Super Bowl MVP awards
Receiving Yards Per Game  float64    Receiving Yards adj / Games Played
Rushing Yards adj         float64    Rushing Yards adjusted for inflation
MVP                       float64    Most Valuable Player awards won
PGWD                      float64    Playoff Game-Winning Drives
TD Passes adj             float64    TD Passes adjusted for inflation
RRYd                      float64    Receiving Yards adj + Rushing Yards adj
Passing Yards Per Game    float64    Passing Yards adj / Games Played
SB                        float64    Super Bowl appearances
Name                      object     Player's name
Player Id                 object     Individual identifier
HOF                       int64      Hall of Fame status: 1 = in Hall of Fame, 0 = not in Hall of Fame
Position                  int64      0 = Quarterback, 1 = Running Back, 2 = Wide Receiver
Appendix: All Variables
Column                             Data Type
Player Id                          object
Name                               object
Position                           int64
Completion Percentage              float64
First Down Receptions              float64
First Year                         int64
Games Played                       float64
HOF                                int64
Int Rate                           float64
Ints                               float64
Last Year                          int64
Pass Attempts Per Game             float64
Passer Rating                      float64
Passes Attempted                   float64
Passes Completed                   float64
Passes Longer than 20 Yards        float64
Passes Longer than 40 Yards        float64
Passing Yards Per Attempt          float64
Passing Yards Per Game             float64
Passing Yards adj                  float64
Percentage of Rushing First Downs  float64
Percentage of TDs per Attempts     float64
Receiving TDs adj                  float64
Receiving Yards Per Game           float64
Receiving Yards adj                float64
Receptions                         float64
Receptions Longer than 20 Yards    float64
Receptions Longer than 40 Yards    float64
Rushing Attempts                   float64
Rushing Attempts Per Game          float64
Rushing First Downs                float64
Rushing More Than 20 Yards         float64
Rushing More Than 40 Yards         float64
Rushing TDs adj                    float64
Rushing Yards Per Game             float64
Rushing Yards adj                  float64
SB Loss                            float64
SB Win                             float64
Sacked Yards Lost                  float64
Sacks                              float64
TD Passes adj                      float64
Yards Per Carry                    float64
Yards Per Reception                float64
Year                               float64
MVP                                float64
SB MVP                             float64
GWD                                float64
PGWD                               float64
SB                                 float64
RRYd                               float64
RRTD                               float64
Appendix: Inflation Calculation
• adj: means that the metric is adjusted for inflation
• Inflation calculation: $\frac{a}{b} = c$
• $a$ = mean of the top 15 players in year 2016
• $b$ = mean of the top 15 players in the current year
• $c$ = inflation multiplier (this is multiplied by the original value)
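The inflation calculation can be sketched in a few lines. This sketch reads "current year" as the season in which the statistic was recorded, which is an interpretation rather than something the slide states; the function names and the toy season data are hypothetical.

```python
def top_n_mean(values, n=15):
    """Mean of the n largest values (all values if fewer than n are given)."""
    top = sorted(values, reverse=True)[:n]
    return sum(top) / len(top)


def inflation_multiplier(stat_2016, stat_season, n=15):
    """c = a / b: top-n mean in 2016 divided by the top-n mean in the given season."""
    return top_n_mean(stat_2016, n) / top_n_mean(stat_season, n)


def adjust(value, multiplier):
    """Apply the multiplier to an original (unadjusted) value."""
    return value * multiplier


# Toy league-wide passing-yard totals for 20 players in each season.
yards_2016 = [4000 + 50 * i for i in range(20)]    # top-15 mean is 4600
yards_season = [3000 + 50 * i for i in range(20)]  # top-15 mean is 3600

c = inflation_multiplier(yards_2016, yards_season)
print(adjust(3000, c))  # a 3,000-yard season scaled to the 2016 environment
```

Under this reading, statistics from lower-output seasons are scaled up (c greater than 1) so that players from different eras are compared on the 2016 scale.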
