Predicting the Outcome of a Basketball Game for
Knowledge and Profit
Jason Kholodnov
Table of Contents
1. Project Purpose
2. Acquiring Data from Internet Sources
3. Determining Strength of Teams
4. 2014 NBA Playoff Predictions
5. 2015 NBA Playoffs Predictions
6. Analyzing the Impact of Money on Performance
7. Analyzing Elo's Failures
8. Determining Players' Performances
9. Using Player Statistics to Categorize Teams
10. What's Next
11. Technologies Used
Project Purpose
The purpose of this project is to create a platform that analyzes the statistics of professional basketball players and teams and, in doing so, predicts the outcome of future games. To this end, a platform with three major components was developed to scrape, generate, and analyze data from ten seasons' worth of NBA games.
Acquiring Data from Internet Sources
To begin this project, a database of game statistics first needed to be generated. ESPN.com was selected for its uniform formatting across teams and games. A Web Scraping1 component was developed in Python3, using BeautifulSoup4 to generate a parse tree for the HTML content. To scrape every season's worth of data, a recursive scraping algorithm was developed, which functions as follows (a sketch appears after the list):
 Scrape ESPN.go.com/nba/standings to acquire the URL of each team.
 Scrape all of the previously acquired team URLs in parallel to acquire the URL of each game.
 Scrape all of the previously acquired game URLs in parallel to acquire the data from each game.
 Store the data acquired by the scraper threads in a relational SQLite3 database.
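A minimal Python sketch of this pipeline is shown below. The entry URL, the link-filtering keywords, and the game-parsing stub are illustrative guesses (the ESPN pages this project targeted have since changed); only the standings-to-teams-to-games structure and the threaded fan-out come from the description above.

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

STANDINGS_URL = "http://espn.go.com/nba/standings"  # assumed entry point

def scrape_links(url, keyword):
    # Return the hrefs on `url` whose path contains `keyword`.
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True) if keyword in a["href"]]

def scrape_game(game_url):
    # Parse one box-score page into a dict of stats (stubbed here).
    soup = BeautifulSoup(requests.get(game_url, timeout=30).text, "html.parser")
    return {"url": game_url, "title": soup.title.string if soup.title else None}

if __name__ == "__main__":
    team_urls = scrape_links(STANDINGS_URL, "/nba/team/")                  # step 1
    with ThreadPoolExecutor(max_workers=16) as pool:                       # steps 2-3 in parallel
        game_urls = [u for urls in pool.map(lambda t: scrape_links(t, "boxscore"), team_urls)
                     for u in urls]
        games = list(pool.map(scrape_game, game_urls))
    print(len(games), "games scraped")                                     # step 4 writes these to SQLite3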
Several problems were encountered while developing this program, the first and foremost being a lack of uniformity in some teams' homepage formatting. The most problematic team was the Charlotte Bobcats, now known as the Charlotte Hornets. Because of the team's name change this past season, the related hyperlinks on ESPN were incorrect and did not appear within the parse tree. To address this, a page-parsing check was added to determine whether either team in the game being parsed was “Charlotte”; if so, several values were hardcoded so that the rest of the scraping component could function correctly.
The second most complicated issue was multi-threaded writes to the SQLite3 database. Because the program launches upwards of 14,000 game-scraping threads at a time (in a pipelined architecture), the database writes experienced heavy contention, causing the database to lock. The solution required only two steps, which work as follows:
 If a thread's database write fails with a DATABASE LOCKED error, the thread recursively calls the same function with the same parameters. If this sequence repeats more than two times, the thread exits and records the game ID in an error message.
 If any threads exited with an error, the program reports how many games could not be scraped. The program must then be run again to attempt to scrape those games.
This method of error handling allows for “Eventual Validity” of the database: although one pass of the program will not produce a fully correct database with every game scraped, the number of games still to be scraped shrinks with every successful run. Scraping one season's worth of games takes a total of 2-3 runs.
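A sketch of the retry-on-lock write, as one scraper thread might perform it, is given below. The table name and schema are hypothetical, since the report does not show them; only the three-attempt recursive retry and the error bookkeeping follow the steps above.

import sqlite3

MAX_ATTEMPTS = 3   # the initial try plus two retries, then give up

def write_game(db_path, game_id, stats, attempt=1, failed_ids=None):
    # Insert one game's stats; on "database is locked", retry recursively.
    failed_ids = failed_ids if failed_ids is not None else []
    try:
        con = sqlite3.connect(db_path, timeout=5)
        with con:
            con.execute("INSERT OR REPLACE INTO games (id, stats) VALUES (?, ?)",
                        (game_id, stats))
        con.close()
    except sqlite3.OperationalError as err:
        if "locked" in str(err) and attempt < MAX_ATTEMPTS:
            write_game(db_path, game_id, stats, attempt + 1, failed_ids)
        else:
            failed_ids.append(game_id)   # reported at the end; a rerun scrapes these
    return failed_ids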
1- A technique of extracting information from websites
Determining Strength of Teams
The second component of this project is a team strength generator. To determine how important the outcome of a previous game was, we need some metric by which to rank teams. For this, an Elo rating system2 was adopted. Elo is a metric developed by Arpad Elo, a Hungarian-American physics professor. The system works well when highly ranked individuals compete against low-ranked ones, because the rating updates scale according to the difference in ratings.
The formulas read as follows:

E_A = 1 / (1 + 10^((R_B − R_A) / 400))

E_B = 1 / (1 + 10^((R_A − R_B) / 400))

where E_X is the expected chance that team X will be the victor, and R_A and R_B denote the current ratings of Team A and Team B respectively.
Starting from an initial rating of R_0 = 1500, the rating of team X after its n-th game is updated as follows:

R_X,n = R_X,n−1 + 32 (W − E_X)

where W is a binary value: 1 for a win, 0 for a loss.
For example:
 1500 rating vs 1500 rating: with K = 32, the victor gains 16 points and the loser loses 16 points.
 1600 rating vs 1400 rating: if the 1400-rated team wins, roughly 24 points change hands, while if the 1600-rated team wins only about 8 points change hands.
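A short Python sketch of these two rules follows (the project's rating generator was written in C++; this version is only illustrative):

def expected_score(r_a, r_b):
    # Expected chance that a team rated r_a beats a team rated r_b.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(r_winner, r_loser, k=32):
    # Post-game ratings after the r_winner team wins (W = 1 for it, 0 for the loser).
    delta = k * (1.0 - expected_score(r_winner, r_loser))
    return r_winner + delta, r_loser - delta

# update_elo(1600, 1400) returns roughly (1608, 1392)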
In order to implement this component, each team was given an initial rating of 1500, and an algorithm
was developed which functioned in the following steps:
 Select all days on which any games were played.
 Select all games which were played on each day.
 For each day on which games were played, spawn an appropriate number of threads so that
each game's information has its own worker thread.
 Update the database to reflect each team's Elo rating.
2- http://en.wikipedia.org/wiki/Elo_rating_system
This implementation method addresses one key problem: when all ten seasons were scraped, upwards of 13,000 games needed Elo ratings generated. Performed sequentially, this would take an expected 200 minutes. By decomposing the problem into the days on which games were played and processing each day's games in parallel, we reduce the problem to smaller parallelizable chunks and cut the runtime to 45 minutes on a quad-core machine.
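The sketch below illustrates this decomposition under stated assumptions (the real generator is written in C++, and the tuple layout is invented for the example). Days are processed strictly in order so the ratings stay sequential, while the games within one day are independent, since no team plays twice on the same day, and can be handed to parallel workers.

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def rate_season(games, k=32):
    # games: iterable of (date, home, away, home_won) tuples.
    ratings = defaultdict(lambda: 1500.0)          # every team starts at 1500
    by_day = defaultdict(list)
    for game in games:
        by_day[game[0]].append(game)

    def rate_game(game):
        _, home, away, home_won = game
        winner, loser = (home, away) if home_won else (away, home)
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta

    with ThreadPoolExecutor() as pool:
        for day in sorted(by_day):                      # sequential across days
            list(pool.map(rate_game, by_day[day]))      # parallel within a day
    return dict(ratings)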
Analyzing the Elo output, we see an interesting trend each season. Throughout the season, each team's maximum Elo rating was recorded, and at the end of each season the sixteen teams that made the playoffs were ranked by it. Here is an example of the output from the 2013-2014 season:
Team Abbreviation Max Elo Achieved
sa 1827.277
okc 1776.423
mia 1755.340
lac 1751.937
ind 1746.647
hou 1739.535
por 1735.329
gsw 1696.646
mem 1686.782
bkn 1686.189
chi 1668.387
dal 1649.538
tor 1647.060
cha 1623.499
was 1588.494
atl 1568.493
2014 NBA Playoff Predictions
By comparing the maximum Elo rating each team achieved throughout the regular season, we are able
to make predictions of the outcomes of the playoff brackets.
RO16:
 SA > DAL correct
 HOU > POR incorrect
 LAC > GSW correct
 OKC > MEM correct
 IND > ATL correct
 CHI > WAS correct
 BKN > TOR correct
 MIA > CHA correct
RO8:
 SA > POR correct
 OKC > LAC correct
 IND > WAS correct
 MIA > BKN correct
RO4:
 MIA > IND correct
 SA > OKC correct
Finals:
 SA > MIA
Using these predictions, we achieve an accuracy of 15/16 series predicted correctly. By using the maximum Elo rating each team achieved throughout the season, we are effectively measuring each team's peak performance. Because the playoffs are such high-stakes series, each team can be expected to perform at or near its peak. Although any team may win an individual game, over a seven-game series the chance that the stronger team wins four games is much higher. The only series that maximum Elo did not predict correctly was Houston versus Portland: although Houston had the higher Elo rating, the difference was negligible, and the series was incredibly close, going six games, with three of them reaching overtime and an average spread of 4.7 points.
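The prediction rule itself is a one-liner; a minimal Python sketch, using a few of the max-Elo values from the table above, looks like this:

def predict_series(team_a, team_b, max_elo):
    # The team with the higher maximum regular-season Elo is predicted to win.
    return team_a if max_elo[team_a] >= max_elo[team_b] else team_b

max_elo_2014 = {"sa": 1827.277, "dal": 1649.538, "hou": 1739.535, "por": 1735.329}
print(predict_series("hou", "por", max_elo_2014))   # "hou" -- the one miss in 2014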
2015 NBA Playoffs Predictions
Comparing this season's maximum Elo ratings with the previous season's, we see a large difference in team strengths.
2014 Playoffs 2015 Playoffs
sa 1827.27 lac 1741.23
okc 1776.423 mem 1729.13
mia 1755.340 atl 1718.02
lac 1751.937 hou 1716.41
ind 1746.647 gs 1702.32
We see that the peak strength of teams in the 2013-2014 season was significantly higher than the peak strength of teams in the 2014-2015 season. This implies that no team was dominant for long stretches of the 2014-2015 season, and that team strengths were more balanced. By comparing the Elo differences in the 2014 playoffs to those we expect in the 2015 playoffs, we can expect a very interesting series between Memphis and the Los Angeles Clippers in the semifinals, as well as a much closer Finals this year.
Analyzing the Impact of Money on Performance
(Data for 2015 season.)
An interesting comparison between teams arises when each team's total salary is compared to the maximum Elo rating that team achieved during the season. We see a tiered distribution of teams, where those that spent less than roughly $65M were not able to reach the peaks that additional money can bring. There are two outliers in the data set, Atlanta and Brooklyn: Atlanta performed exceptionally well for a team with its salary, while Brooklyn performed very poorly for the team with the highest salary in the league. These results help explain the predicted brackets and give us another piece of information about the teams.
In the salary-versus-Elo plot, the teams circled in red are those that made the playoffs. We see that 14 of the 18 teams that spent $70 million or more made the playoffs, while only 3 teams that spent less than $70 million did.
Analyzing Elo's Failures
Beyond its effectiveness over a seven-game series, Elo is also reasonably effective at predicting single games. By comparing the Elo ratings of the two teams in every game with the result of that game, we are able to predict the victor with an accuracy of 65.6%. This is significantly lower than the 15/16 accuracy we achieve over seven-game series, but still a very good result.
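A small sketch of this evaluation, assuming game rows of the form (home Elo, away Elo, home won) pulled from the database after the Elo pass, would look like:

def elo_accuracy(games):
    # Fraction of games in which the higher-rated team was the actual winner.
    correct = sum(1 for elo_home, elo_away, home_won in games
                  if (elo_home > elo_away) == home_won)
    return correct / len(games)

sample = [(1755.3, 1623.5, True), (1739.5, 1735.3, False), (1696.6, 1686.8, True)]
print(f"{elo_accuracy(sample):.1%}")   # over all scraped games the report finds 65.6%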
In order to improve our overall accuracy, we must look at the games in which a prediction using
purely Elo did not prove to be accurate.
We see that, among these mispredicted games, the Elo differential is skewed heavily toward the lower end of the distribution. This means that as the Elo differential between two teams increases, the likelihood that the lower-rated team wins declines rapidly. This was to be expected from the definition of the Elo rating, but the distribution allows us to estimate a team's chance of winning based on the Elo differential.
To analyze why the lower-rated team was able to win, we must move to a lower-level view than team ratings alone. Since a whole is just the sum of its parts, we will look at the performances of individual players in these games to determine what factors led to the underdog coming out on top.
Determining Players' Performances
Now that we have developed a ranking system for each team, we must determine how well each player played in each game, as well as how he performed against the strength of the opposing team. We do so in order to develop a player-based simulation in which a team's performance is the sum of its players' performances. To do this effectively, we gauge a player's performance against his own previous performances as well as against a standardized performance metric. The metrics used to measure player performance are:
 Performance Index Rating (PIR)3
 Normalized Performance Rating (NPR)4
Both of these metrics will use all measurable statistics from all games for which data has been tracked:
 Minutes Played, Field Goals (M/A), Three Pointers (M/A), Free Throws (M/A), Offensive Rebounds, Defensive Rebounds, Assists, Steals, Turnovers, Points, Plus/Minus
To calculate the PIR value for each player in each game, we use the following fairly simple formula:

PIR = (Points + 2 × Rebounds + Assists + Steals + 2 × Blocks + Fouls Drawn)
− (Missed FGs + Missed FTs + Turnovers + Fouls + Shots Blocked)
This gives us a generic way to determine a player's performance in a game, including both his offensive and defensive contributions. The metric is also quite reliable for determining a team's performance, by generating a Team Performance Rating (TPR) with the following formula:
TPR = Σ_{i=1}^{num players} Player_i.PIR
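As a concrete illustration, a Python sketch of both formulas applied to box-score rows might look like the following (the project's generator is written in C++, and the dictionary keys here are assumed names):

def pir(row):
    # Performance Index Rating for one player's box-score line.
    positive = (row["points"] + 2 * row["rebounds"] + row["assists"]
                + row["steals"] + 2 * row["blocks"] + row["fouls_drawn"])
    negative = (row["missed_fg"] + row["missed_ft"] + row["turnovers"]
                + row["fouls"] + row["shots_blocked"])
    return positive - negative

def tpr(box_score_rows):
    # Team Performance Rating: the sum of every player's PIR in that game.
    return sum(pir(row) for row in box_score_rows)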
When comparing the TPRs of the two teams in each game, we are able to predict the winner with an incredible 90.6% accuracy. Although this value is very accurate, the statistics from which it is calculated are intrinsically related to the outcome of the game. If we can predict individual players' PIR with no knowledge of a future game, we will have a very accurate method of determining individual players' performances as well as the outcome of any individual game. We will return to this TPR statistic when we begin to generate simulations for each player.
3- Performance Index Rating (PIR) is a basketball statistical formula that is used in a variety of European Basketball
leagues. It is similar but not identical to the Efficiency rating used by the NBA.
4- Not related to National Public Radio.
To compute the NPR value for each player in each game, we must develop an algorithm which functions in the following way:
 Create a player object for each player in the league.
 Go through each game in order.
◦ For each game, attribute each player's performance to the appropriate player object.
◦ Using the player's μ and σ values up to the game being measured, determine how many standard deviations from his mean the player's performance in each statistic was.
▪ The NPR value for this game is then:

NPR = Σ_{i=1}^{num variables} (VariablePerformance_i − μ_i) / σ_i
By doing so, we can go through each player's game performances sequentially and keep a running mean and standard deviation of each metric as of the time any given game is played. We can then compare the player's performance in that particular game to determine where along his standard curve it falls, using a fairly simple algorithm, sketched below.
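A minimal Python sketch of this computation, under stated assumptions (the real generator is in C++, only a subset of the tracked statistics is shown, and the minutes scaling anticipates the adjustment described further below):

import statistics

STATS = ["points", "rebounds", "assists", "steals", "turnovers"]   # subset for brevity

def npr_for_game(history, game, regulation_minutes=48.0):
    # history: this player's earlier box-score dicts; game: the current one.
    if len(history) < 2:
        return 0.0                       # not enough data for a standard deviation yet
    npr = 0.0
    for stat in STATS:
        past = [g[stat] for g in history]
        mu, sigma = statistics.mean(past), statistics.pstdev(past)
        if sigma > 0:
            npr += (game[stat] - mu) / sigma
    # Scale by the share of minutes played to damp low-minute outliers.
    return npr * (game["minutes"] / regulation_minutes)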
This NPR value tells us how well a player performed relative to his own average: a positive value implies he performed better than an average game for him, while a negative value implies the opposite. We will use this value to estimate a player's momentum. For example, if a player has the following NPR values in his previous 5 games:
• 0.5
• 1.6
• 6.4
• 16.4
• 15.8
We would infer that the player has been “heating up”, and would predict that his hot streak will continue into his next game. It is important to note that some players can generate inflated NPR values simply by having played very few minutes.
• Example: a player averaging 1 minute played and 0 in every statistic, who then plays 5 minutes in one game and manages to score a few points, would achieve astronomical NPR values that might not even be achievable by LeBron James.
We account for this effect by multiplying the NPR value by the percentage of minutes played within the game. This scales a player's performance down to reflect his actual contribution to the game, while still capturing his personal momentum.
On its own, the NPR statistic does not tell us much about the outcome of a game, since it simply measures a player's performance against his own previous performances, but it does allow us to gauge with some certainty how well the player has been performing recently, and thus to generate 5- and 10-game moving momentum measures.
Using Player Statistics to Categorize Teams
Using a team's player-level statistics, we can try to classify whether the team won or lost. We use an SVM5 model of the form:
svm(Win~fga+tpa+fta+oreb+dreb+assist+steal+block+turnover+fouls+npr)
Note that all information about points scored is removed, to prevent it from directly determining the SVM model's output. Using this model, we can make predictions for the teams within our data set and output them as a table showing where the SVM was correct and where it failed.
> library(e1071)
> model <- svm(Win ~ fga + tpa + fta + oreb + dreb + assist + steal + block + turnover + fouls + npr, data = overall)
> pred <- round(predict(model, overall))
> table(pred = pred, actual = overall$Win)
      actual
pred     0    1
   0   776  173
   1   122  725
The diagonal figures in this table are the important output of the SVM. They tell us that our SVM model places 776 of 949 losing teams correctly and 725 of 847 winning teams correctly. This means that if we can develop a model which predicts a team's performance in the tracked variables,
FGA, TPA, FTA, OREB, DREB, ASSIST, STEAL, BLOCK, TURNOVER, FOULS, NPR
then we can place that performance as a point in N-dimensional space, on one side or the other of an N-dimensional separating plane. Based on which side of the plane the point lands, we predict whether the simulated team wins or loses. We can now use this model to predict a team's chances of winning a game based on simulated values for each player.
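As a purely hypothetical illustration of that final step (the report trains its SVM in R with e1071; this scikit-learn sketch with toy numbers only shows the shape of classifying a simulated stat line):

import numpy as np
from sklearn.svm import SVC

FEATURES = ["fga", "tpa", "fta", "oreb", "dreb", "assist",
            "steal", "block", "turnover", "fouls", "npr"]

# Toy historical team lines in the feature order above (not real data),
# labeled 1 for a win and 0 for a loss.
X = np.array([[85, 22, 24, 10, 32, 24, 8, 5, 13, 20, 1.2],
              [88, 25, 18, 12, 29, 19, 6, 3, 17, 24, -0.8]])
y = np.array([1, 0])

model = SVC(kernel="rbf").fit(X, y)

simulated_line = np.array([[86, 23, 21, 11, 31, 22, 7, 4, 14, 21, 0.5]])
print("win" if model.predict(simulated_line)[0] == 1 else "loss")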
5- Support Vector Machines try to classify vectors in N-dimensional space by finding a separating hyperplane based on a dependent (class) variable.
What's Next
- Develop a simulation which produces a predicted value for each team's performance, and use the SVM to classify that performance as a win or a loss.
- Dig deeper into salary vs. performance at the player level.
- Compare predictions against Las Vegas bookmakers' odds and determine the approximate return if $100 were bet on every predicted team.
Technologies Used
SQLite3 : Lightweight relational database with SQL syntax.
Git : Version control.
Docker : Containerized environment bundling all library dependencies, packages, and data needed to run the program.
Python : Web scraping.
C++ : Team Elo, NPR, and PIR generators; game simulations.
R : Statistics on gathered data.