Predicting the Outcome of a Basketball Game for
Knowledge and Profit
Jason Kholodnov
Table of Contents
1. Project Purpose
2. Acquiring Data from Internet Sources
3. Determining Strength of Teams
4. 2014 NBA Playoff Predictions
5. 2015 NBA Playoffs Predictions
6. Analyzing the Impact of Money on Performance
7. Analyzing Elo's Failures
8. Determining Players' Performances
9. Using Player Statistics to Categorize Teams
10. What's Next
11. Technologies Used
Project Purpose
The purpose of this project is to create a platform that analyzes the statistics of professional basketball players and teams and, in doing so, predicts the outcome of future games. To this end, a platform with three major components was developed to scrape, generate, and analyze data from ten seasons' worth of NBA games.
Acquiring Data from Internet Sources
To begin this project, a database of game statistics first needed to be generated. ESPN.com was selected for its uniform formatting across teams and games. A Web Scraping1 component was developed in Python3, using BeautifulSoup4 to generate a parse tree for the HTML content. To scrape every season's worth of data, a recursive scraping algorithm was developed, which functions as follows (a sketch appears after the list):
 Scrape ESPN.go.com/nba/standings to acquire the URL of each team.
 Scrape all of the previously acquired team URLs in parallel to acquire the URL of each game.
 Scrape all of the previously acquired game URLs in parallel to acquire the data from each game.
 Store the data acquired by the scraper threads in a relational SQLite3 database.
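A minimal Python sketch of this pipeline is shown below. The entry URL, the link-filtering keywords, and the game-parsing stub are illustrative guesses (the ESPN pages this project targeted have since changed); only the standings-to-teams-to-games structure and the threaded fan-out come from the description above.

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor

STANDINGS_URL = "http://espn.go.com/nba/standings"  # assumed entry point

def scrape_links(url, keyword):
    # Return the hrefs on `url` whose path contains `keyword`.
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True) if keyword in a["href"]]

def scrape_game(game_url):
    # Parse one box-score page into a dict of stats (stubbed here).
    soup = BeautifulSoup(requests.get(game_url, timeout=30).text, "html.parser")
    return {"url": game_url, "title": soup.title.string if soup.title else None}

if __name__ == "__main__":
    team_urls = scrape_links(STANDINGS_URL, "/nba/team/")                  # step 1
    with ThreadPoolExecutor(max_workers=16) as pool:                       # steps 2-3 in parallel
        game_urls = [u for urls in pool.map(lambda t: scrape_links(t, "boxscore"), team_urls)
                     for u in urls]
        games = list(pool.map(scrape_game, game_urls))
    print(len(games), "games scraped")                                     # step 4 writes these to SQLite3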
Several problems were encountered while developing this program, the first and foremost being a lack of uniformity in some teams' homepage formatting. The most problematic team was the Charlotte Bobcats, now known as the Charlotte Hornets. Because of the team's name change this past season, the related hyperlinks on ESPN were incorrect and did not appear within the parse tree. To address this, a page-parsing check was added to determine whether either team in the game being parsed was “Charlotte”; if so, several values were hardcoded so that the rest of the scraping component could function correctly.
The second most complicated issue was multi-threaded writes to the SQLite3 database. Because the program launches upwards of 14,000 game-scraping threads at a time (in a pipelined architecture), the database writes experienced heavy contention, causing the database to lock. The solution required only two steps, which work as follows:
 If a thread's database write fails with a DATABASE LOCKED error, the thread recursively calls the same function with the same parameters. If this sequence repeats more than two times, the thread exits and records the game ID in an error message.
 If any threads exited with an error, the program reports how many games could not be scraped. The program must then be run again to attempt to scrape those games.
This method of error handling allows for “Eventual Validity” of the database: although one pass of the program will not produce a fully correct database with every game scraped, the number of games still to be scraped shrinks with every successful run. Scraping one season's worth of games takes a total of 2-3 runs.
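A sketch of the retry-on-lock write, as one scraper thread might perform it, is given below. The table name and schema are hypothetical, since the report does not show them; only the three-attempt recursive retry and the error bookkeeping follow the steps above.

import sqlite3

MAX_ATTEMPTS = 3   # the initial try plus two retries, then give up

def write_game(db_path, game_id, stats, attempt=1, failed_ids=None):
    # Insert one game's stats; on "database is locked", retry recursively.
    failed_ids = failed_ids if failed_ids is not None else []
    try:
        con = sqlite3.connect(db_path, timeout=5)
        with con:
            con.execute("INSERT OR REPLACE INTO games (id, stats) VALUES (?, ?)",
                        (game_id, stats))
        con.close()
    except sqlite3.OperationalError as err:
        if "locked" in str(err) and attempt < MAX_ATTEMPTS:
            write_game(db_path, game_id, stats, attempt + 1, failed_ids)
        else:
            failed_ids.append(game_id)   # reported at the end; a rerun scrapes these
    return failed_ids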
1- A technique of extracting information from websites
Determining Strength of Teams
The second component of this project is a team strength generator. To determine how important the outcome of a previous game was, we need some metric by which to rank teams. For this, an Elo rating system2 was adopted. Elo is a metric developed by Arpad Elo, a Hungarian-American physics professor. The system works well when highly ranked individuals compete against low-ranked ones, because the rating updates scale according to the difference in ratings.
The formulas read as follows:

E_A = 1 / (1 + 10^((R_B − R_A) / 400))

E_B = 1 / (1 + 10^((R_A − R_B) / 400))

where E_X is the expected chance that team X will be the victor, and R_A and R_B denote the current ratings of Team A and Team B respectively.
Starting from an initial rating of R_0 = 1500, the rating of team X after its n-th game is updated as follows:

R_X,n = R_X,n−1 + 32 (W − E_X)

where W is a binary value: 1 for a win, 0 for a loss.
For example:
 1500 rating vs 1500 rating: with K = 32, the victor gains 16 points and the loser loses 16 points.
 1600 rating vs 1400 rating: if the 1400-rated team wins, roughly 24 points change hands, while if the 1600-rated team wins only about 8 points change hands.
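A short Python sketch of these two rules follows (the project's rating generator was written in C++; this version is only illustrative):

def expected_score(r_a, r_b):
    # Expected chance that a team rated r_a beats a team rated r_b.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(r_winner, r_loser, k=32):
    # Post-game ratings after the r_winner team wins (W = 1 for it, 0 for the loser).
    delta = k * (1.0 - expected_score(r_winner, r_loser))
    return r_winner + delta, r_loser - delta

# update_elo(1600, 1400) returns roughly (1608, 1392)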
In order to implement this component, each team was given an initial rating of 1500, and an algorithm
was developed which functioned in the following steps:
 Select all days on which any games were played.
 Select all games which were played on each day.
 For each day on which games were played, spawn an appropriate number of threads so that
each game's information has its own worker thread.
 Update the database to reflect each team's Elo rating.
2- http://en.wikipedia.org/wiki/Elo_rating_system
This implementation method addresses one key problem: when all ten seasons were scraped, upwards of 13,000 games needed Elo ratings generated. Performed sequentially, this would take an expected 200 minutes. By decomposing the problem into the days on which games were played and processing each day's games in parallel, we reduce the problem to smaller parallelizable chunks and cut the runtime to 45 minutes on a quad-core machine.
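The sketch below illustrates this decomposition under stated assumptions (the real generator is written in C++, and the tuple layout is invented for the example). Days are processed strictly in order so the ratings stay sequential, while the games within one day are independent, since no team plays twice on the same day, and can be handed to parallel workers.

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def rate_season(games, k=32):
    # games: iterable of (date, home, away, home_won) tuples.
    ratings = defaultdict(lambda: 1500.0)          # every team starts at 1500
    by_day = defaultdict(list)
    for game in games:
        by_day[game[0]].append(game)

    def rate_game(game):
        _, home, away, home_won = game
        winner, loser = (home, away) if home_won else (away, home)
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta

    with ThreadPoolExecutor() as pool:
        for day in sorted(by_day):                      # sequential across days
            list(pool.map(rate_game, by_day[day]))      # parallel within a day
    return dict(ratings)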
Analyzing the Elo output, we see an interesting trend each season. Throughout the season, each team's maximum Elo rating was recorded, and at the end of each season the sixteen teams that made the playoffs were ranked by it. Here is an example of the output from the 2013-2014 season:
Team Abbreviation Max Elo Achieved
sa 1827.277
okc 1776.423
mia 1755.340
lac 1751.937
ind 1746.647
hou 1739.535
por 1735.329
gsw 1696.646
mem 1686.782
bkn 1686.189
chi 1668.387
dal 1649.538
tor 1647.060
cha 1623.499
was 1588.494
atl 1568.493
2014 NBA Playoff Predictions
By comparing the maximum Elo rating each team achieved throughout the regular season, we are able
to make predictions of the outcomes of the playoff brackets.
RO16:
 SA > DAL correct
 HOU > POR incorrect
 LAC > GSW correct
 OKC > MEM correct
 IND > ATL correct
 CHI > WAS correct
 BKN > TOR correct
 MIA > CHA correct
RO8:
 SA > POR correct
 OKC > LAC correct
 IND > WAS correct
 MIA > BKN correct
RO4:
 MIA > IND correct
 SA > OKC correct
Finals:
 SA > MIA
Using these predictions, we achieve an accuracy of 15/16 series predicted correctly. By using the maximum Elo rating each team achieved throughout the season, we are effectively measuring each team's peak performance. Because the playoffs are such high-stakes series, each team can be expected to perform at or near its peak. Although any team may win an individual game, over a seven-game series the chance that the stronger team wins four games is much higher. The only series that maximum Elo did not predict correctly was Houston versus Portland: although Houston had the higher Elo rating, the difference was negligible, and the series was incredibly close, going six games, with three of them reaching overtime and an average spread of 4.7 points.
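The prediction rule itself is a one-liner; a minimal Python sketch, using a few of the max-Elo values from the table above, looks like this:

def predict_series(team_a, team_b, max_elo):
    # The team with the higher maximum regular-season Elo is predicted to win.
    return team_a if max_elo[team_a] >= max_elo[team_b] else team_b

max_elo_2014 = {"sa": 1827.277, "dal": 1649.538, "hou": 1739.535, "por": 1735.329}
print(predict_series("hou", "por", max_elo_2014))   # "hou" -- the one miss in 2014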
2015 NBA Playoffs Predictions
Comparing this season's maximum Elo ratings with the previous season's, we see a large difference in team strengths.
2014 Playoffs 2015 Playoffs
sa 1827.27 lac 1741.23
okc 1776.423 mem 1729.13
mia 1755.340 atl 1718.02
lac 1751.937 hou 1716.41
ind 1746.647 gs 1702.32
We see that the peak strength of teams in the 2013-2014 season was significantly higher than the peak strength of teams in the 2014-2015 season. This implies that no team was dominant for long stretches of the 2014-2015 season, and that team strengths were more balanced. By comparing the Elo differences in the 2014 playoffs to those we expect in the 2015 playoffs, we can expect a very interesting series between Memphis and the Los Angeles Clippers in the semifinals, as well as a much closer Finals this year.
Analyzing the Impact of Money on Performance
(Data for 2015 season.)
An interesting comparison between teams arises when each team's total salary is compared to the maximum Elo rating that team achieved during the season. We see a tiered distribution of teams, where those that spent less than roughly $65M were not able to reach the peaks that additional money can bring. There are two outliers in the data set, Atlanta and Brooklyn: Atlanta performed exceptionally well for a team with its salary, while Brooklyn performed very poorly for the team with the highest salary in the league. These results help explain the predicted brackets and give us another piece of information about the teams.
In the salary-versus-Elo plot, the teams circled in red are those that made the playoffs. We see that 14 of the 18 teams that spent $70 million or more made the playoffs, while only 3 teams that spent less than $70 million did.
Analyzing Elo's Failures
Beyond its effectiveness over a seven-game series, Elo is also reasonably effective at predicting single games. By comparing the Elo ratings of the two teams in every game with the result of that game, we are able to predict the victor with an accuracy of 65.6%. This is significantly lower than the 15/16 accuracy we achieve over seven-game series, but still a very good result.
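A small sketch of this evaluation, assuming game rows of the form (home Elo, away Elo, home won) pulled from the database after the Elo pass, would look like:

def elo_accuracy(games):
    # Fraction of games in which the higher-rated team was the actual winner.
    correct = sum(1 for elo_home, elo_away, home_won in games
                  if (elo_home > elo_away) == home_won)
    return correct / len(games)

sample = [(1755.3, 1623.5, True), (1739.5, 1735.3, False), (1696.6, 1686.8, True)]
print(f"{elo_accuracy(sample):.1%}")   # over all scraped games the report finds 65.6%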
In order to improve our overall accuracy, we must look at the games in which a prediction using
purely Elo did not prove to be accurate.
We see that, among these mispredicted games, the Elo differential is skewed heavily toward the lower end of the distribution. This means that as the Elo differential between two teams increases, the likelihood that the lower-rated team wins declines rapidly. This was to be expected from the definition of the Elo rating, but the distribution allows us to estimate a team's chance of winning based on the Elo differential.
To analyze why the lower-rated team was able to win, we must move to a lower-level view than team ratings alone. Since a whole is just the sum of its parts, we will look at the performances of individual players in these games to determine what factors led to the underdog coming out on top.
Determining Players' Performances
Now that we have developed a ranking system for each team, we must determine how well each player played in each game, as well as how he performed against the strength of the opposing team. We do so in order to develop a player-based simulation in which a team's performance is the sum of its players' performances. To do this effectively, we gauge a player's performance against his own previous performances as well as against a standardized performance metric. The metrics used to measure player performance are:
 Performance Index Rating (PIR)3
 Normalized Performance Rating (NPR)4
Both of these metrics will use all measurable statistics from all games for which data has been tracked:
 Minutes Played, Field Goals (M/A), Three Pointers (M/A), Free Throws (M/A), Offensive Rebounds, Defensive Rebounds, Assists, Steals, Turnovers, Points, Plus/Minus
To calculate the PIR value for each player in each game, we use the following fairly simple formula:

PIR = (Points + 2 × Rebounds + Assists + Steals + 2 × Blocks + Fouls Drawn)
− (Missed FGs + Missed FTs + Turnovers + Fouls + Shots Blocked)
This gives us a generic way to determine a player's performance in a game, including both his offensive and defensive contributions. The metric is also quite reliable for determining a team's performance, by generating a Team Performance Rating (TPR) with the following formula:
TPR = Σ_{i=1}^{num players} Player_i.PIR
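As a concrete illustration, a Python sketch of both formulas applied to box-score rows might look like the following (the project's generator is written in C++, and the dictionary keys here are assumed names):

def pir(row):
    # Performance Index Rating for one player's box-score line.
    positive = (row["points"] + 2 * row["rebounds"] + row["assists"]
                + row["steals"] + 2 * row["blocks"] + row["fouls_drawn"])
    negative = (row["missed_fg"] + row["missed_ft"] + row["turnovers"]
                + row["fouls"] + row["shots_blocked"])
    return positive - negative

def tpr(box_score_rows):
    # Team Performance Rating: the sum of every player's PIR in that game.
    return sum(pir(row) for row in box_score_rows)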
When comparing the TPRs of the two teams in each game, we are able to predict the winner with an incredible 90.6% accuracy. Although this value is very accurate, the statistics from which it is calculated are intrinsically related to the outcome of the game. If we can predict individual players' PIR with no knowledge of a future game, we will have a very accurate method of determining individual players' performances as well as the outcome of any individual game. We will return to this TPR statistic when we begin to generate simulations for each player.
3- Performance Index Rating (PIR) is a basketball statistical formula that is used in a variety of European Basketball
leagues. It is similar but not identical to the Efficiency rating used by the NBA.
4- Not related to National Public Radio.
To compute the NPR value for each player in each game, we must develop an algorithm which functions in the following way:
 Create a player object for each player in the league.
 Go through each game in order.
◦ For each game, attribute each player's performance to the appropriate player object.
◦ Using the player's μ and σ values up to the game being measured, determine how many standard deviations from his mean the player's performance in each statistic was.
▪ The NPR value for this game is then:

NPR = Σ_{i=1}^{num variables} (VariablePerformance_i − μ_i) / σ_i
By doing so, we can go through each player's game performances sequentially and keep a running mean and standard deviation of each metric as of the time any given game is played. We can then compare the player's performance in that particular game to determine where along his standard curve it falls, using a fairly simple algorithm, sketched below.
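A minimal Python sketch of this computation, under stated assumptions (the real generator is in C++, only a subset of the tracked statistics is shown, and the minutes scaling anticipates the adjustment described further below):

import statistics

STATS = ["points", "rebounds", "assists", "steals", "turnovers"]   # subset for brevity

def npr_for_game(history, game, regulation_minutes=48.0):
    # history: this player's earlier box-score dicts; game: the current one.
    if len(history) < 2:
        return 0.0                       # not enough data for a standard deviation yet
    npr = 0.0
    for stat in STATS:
        past = [g[stat] for g in history]
        mu, sigma = statistics.mean(past), statistics.pstdev(past)
        if sigma > 0:
            npr += (game[stat] - mu) / sigma
    # Scale by the share of minutes played to damp low-minute outliers.
    return npr * (game["minutes"] / regulation_minutes)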
This NPR value tells us how well a player performed relative to his own average: a positive value implies he performed better than an average game for him, while a negative value implies the opposite. We will use this value to estimate a player's momentum. For example, if a player has the following NPR values in his previous 5 games:
• 0.5
• 1.6
• 6.4
• 16.4
• 15.8
We would infer that the player has been “heating up”, and would predict that his hot streak will continue into his next game. It is important to note that some players can generate inflated NPR values simply by having played very few minutes.
• Example: a player averaging 1 minute played and 0 in every statistic, who then plays 5 minutes in one game and manages to score a few points, would achieve astronomical NPR values that might not even be achievable by LeBron James.
We account for this effect by multiplying the NPR value by the percentage of minutes played within the game. This scales a player's performance down to reflect his actual contribution to the game, while still capturing his personal momentum.
On its own, the NPR statistic does not tell us much about the outcome of a game, since it simply measures a player's performance against his own previous performances, but it does allow us to gauge with some certainty how well the player has been performing recently, and thus to generate 5- and 10-game moving momentum measures.
Using Player Statistics to Categorize Teams
Using a team's player-level statistics, we can try to classify whether the team won or lost. We use an SVM5 model of the form:
svm(Win~fga+tpa+fta+oreb+dreb+assist+steal+block+turnover+fouls+npr)
Note that all information about points scored is removed, to prevent it from directly determining the SVM model's output. Using this model, we can make predictions for the teams within our data set and output them as a table showing where the SVM was correct and where it failed.
> library(e1071)
> model <- svm(Win ~ fga + tpa + fta + oreb + dreb + assist + steal + block + turnover + fouls + npr, data = overall)
> pred <- round(predict(model, overall))
> table(pred = pred, actual = overall$Win)
      actual
pred     0    1
   0   776  173
   1   122  725
The diagonal figures in this table are the important output of the SVM. They tell us that our SVM model places 776 of 949 losing teams correctly and 725 of 847 winning teams correctly. This means that if we can develop a model which predicts a team's performance in the tracked variables,
FGA, TPA, FTA, OREB, DREB, ASSIST, STEAL, BLOCK, TURNOVER, FOULS, NPR
then we can place that performance as a point in N-dimensional space, on one side or the other of an N-dimensional separating plane. Based on which side of the plane the point lands, we predict whether the simulated team wins or loses. We can now use this model to predict a team's chances of winning a game based on simulated values for each player.
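As a purely hypothetical illustration of that final step (the report trains its SVM in R with e1071; this scikit-learn sketch with toy numbers only shows the shape of classifying a simulated stat line):

import numpy as np
from sklearn.svm import SVC

FEATURES = ["fga", "tpa", "fta", "oreb", "dreb", "assist",
            "steal", "block", "turnover", "fouls", "npr"]

# Toy historical team lines in the feature order above (not real data),
# labeled 1 for a win and 0 for a loss.
X = np.array([[85, 22, 24, 10, 32, 24, 8, 5, 13, 20, 1.2],
              [88, 25, 18, 12, 29, 19, 6, 3, 17, 24, -0.8]])
y = np.array([1, 0])

model = SVC(kernel="rbf").fit(X, y)

simulated_line = np.array([[86, 23, 21, 11, 31, 22, 7, 4, 14, 21, 0.5]])
print("win" if model.predict(simulated_line)[0] == 1 else "loss")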
5- Support Vector Machines try to classify vectors in N-dimensional space by finding a separating hyperplane based on a dependent (class) variable.
What's Next
- Develop a simulation which produces a predicted value for each team's performance, and use the SVM to classify that performance as a win or a loss.
- Dig deeper into salary vs. performance at the player level.
- Compare predictions against Las Vegas bookmakers' odds and determine the approximate return if $100 were bet on every predicted team.
Technologies Used
SQLite3 : Lightweight relational database with SQL syntax.
Git : Version control.
Docker : Containerized environment bundling all library dependencies, packages, and data needed to run the program.
Python : Web scraping.
C++ : Team Elo, NPR, and PIR generators; game simulations.
R : Statistics on gathered data.