SlideShare a Scribd company logo
1 of 6
Download to read offline
The Bank is Open: AI in Sports Gambling
Alexandre Bucquet
bucqueta@stanford.edu
Vishnu Sarukkai
sarukkai@stanford.edu
Abstract
In this paper, we explore the applications of Machine Learning to sports betting
by focusing on predicting the number of total points scored by both teams in an
NBA game. We use Neural Networks as well as recurrent models for this task,
and manage to achieve results that are similar to those of the sports books. On
average, our best models can beat the house 51.5% of the time.
1 Introduction
Last May, the United States Supreme Court legalized sports betting. While it is still up to every state
to pass legislation on the issues, many have already been working on bills. As tracked by ESPN,
eight states have already legalized sports gambling, with two more states having signed bills that will
legalize gambling in the near future. With its ruling, the Supreme Court paved way to the opening of
whole new legal (and thus taxable) market of size $150 billion according to the American Gaming
Association.
In this project, we will explore the applications of Machine Learning in the field of sports betting,
using a case study of the National Basketball Association. More specifically, we wish to design sets
of models that predict betting indicators for NBA matches. Out of the three main indicators (Over-
Under, Money Line and Point Spread), we focus on the Over-Under. In other words, we predict for
every NBA game how many combined points the two teams will score. Ideally, we would be able to
come up with an estimate for this indicator that is more accurate than that of some betting platforms,
and use this to our advantage to place bets.
2 Related Work
The paper “Predicting Margin of Victory in NFL Games” [1] uses a Gaussian process model in
aiming to model NFL point spreads and beat the Las Vegas line. They aim to beat the line through
the incorporation of novel features such as the temperature of the field at kickoff and indicator
variables representing team “strengths.” However, the model achieves a success rate in predicting
games 2% worse than the Vegas line. This may be because high-dimensional Gaussians are not the
most appropriate way to model NFL point totals and point spreads. Moreover, NFL data in very
scarce as each team only takes the field less than 20 times a year.
The paper “Football Match Prediction using Deep Learning” [2] uses recurrent models to predict
the outcome of soccer matches. The architecture developed includes a Long Short Term Memory
Network (LSTM) that processes game data at every time interval. This allows the authors to obtain
a live prediction of who will win the soccer match every minute as the game progresses. The final
model had one LSTM layer with 256 hidden dimensions, and achieved a test accuracy of 88%. This
paper illustrates the relevance of recurrent models to sports data and its competitive results suggest
that more is to be done for other sports using similar model architectures.
3 Dataset and Features
3.1 Data Collection
The data collected so far can be classified into two groups: betting odds data and game data, which
consists of both data describing team performance and player performance.
We found odds data on Sports Book Review Online, a website that compiles betting data for every
NBA games since the 2007-2008 season. The website offers a downloadable excel file for each
season. Using a short script, we were able to retrieve all the necessary target variables and the
corresponding betting odds offered. The betting indicators scraped were the following:
i Over-Under: the total number of points scored in a game.
ii Spread: the number of points by which the home team wins (or loses).
iii Money Line: a number encoding the amount of money won from placing a bet on the
winning team of a game.
In the rest of this project, we focus on the total number of points scored, and we use the Over-Under
data collected to solely evaluate our models, and do not include the above data in our features.
For game data, we retrieved data from Basketball Reference using Fran Goitia’s NBA crawler [3].
We made minor edits to the scraper in order to account for formatting inconsistencies in the data.
This crawler provided a strong base to start collecting data on every team for every game since the
2007-2008 NBA season. The data collected for every game included statistics such as Points Scored,
Rebounds, Assists and Steals for both teams.
3.2 Feature Building
Using the retrieved data from Basketball Reference, we build a featurized dataset. For every matchup
between two teams, we decide to look at both team’s past three games. We included simple features
such as Points Scored, Points Scored Against, or Total Rebounds, and also more complicated metrics
such as Offensive Rating and Plus/Minus. In order to account for opponent strength, we also added
each opponent’s season averages in all above metrics. Finally, to account for player fatigue, we
added the number of days since the last game as well as the distance traveled (see Figure 1).
Figure 1: Overview of the features used for a game between the Golden State Warriors (GSW) and
the Oklahoma City Thunder (OKC)
4 Methods
All the code for this project can be found at https://github.com/abucquet/bank is open. We used
Python libraries Scikit-learn [4] (Random Forest), Surprise [5] (Singlar Value Decomposition), and
PyTorch [6] (Neural Net and LSTM).
2
For the task of predicting the number of points scored in a game, we minimize Mean Squared Error
(MSE) when training all our models. We trained our models on four seasons (2012 − 2013 to
2015 − 2016), used the 2016 − 2017 season as our validation set, and tested our models on the
2017 − 2018 season. This gave us a train set of size 5,069, a validation set of size 1,260 and a test
set of size 1,264.
We trained a Random Forest as a baseline before moving on to more complex models.
4.1 Collaborative Filtering
This problem is similar to the user/item rating prediction, information about the outcome of previous
games between other teams can inform our predicted output for the current matchup. To leverage
this, we used Singular Value Decomposition [7] to build a model similar to collaborative filtering.
In this model, we build a sparse matrix A that contains the number of points scored by two teams if
the teams have faced off in the past few days. In this setting, the rows represent the home team and
the columns the away team. Then, we factorize A into UΣV T
by solving
minU,V,Σ
ij,Aij =0
(Aij − (UΣV T
)ij)2
We hope that the rows of matrices U and V capture information about teams at home and away,
respectively, and we then predict uiΣvT
j total points scored in a game between teams i (home) and
j (away), where bk is the k-th row of matrix B.
4.2 Neural Network
Like the collaborative filtering model, the Neural Network captures information from the outcomes
of previous games between other teams, as during training the network is provided the results of
previous games as input along with the identity of the two teams involved in the game. However, the
Neural Network has an advantage over collaborative filtering in that it is also able to take features of
both teams involved as inputs. Therefore, it can draw on not only the outcomes of previous training
examples but also the offensive rating of each of the teams involved over their past three games, etc.
Passing in the feature set created in 3.2, we train a Neural Network with fully-connected layers and
ReLU activations. The input data is flattened into a size of 1524 features. The ReLU activations are
used for ease of training and to reduce the likelihood of gradient vanishing (see Figure 2).
4.3 Long Short Term Memory Network (LSTM)
Given that games are played sequentially, we decided to use an LSTM to process the past three
games one by one. As described in ”Long Short-Term Memory” [8], an LSTM is a recurrent model
where we repeatedly computed a hidden state by processing a new timestep (in our case one game).
The equations to compute the future hidden state ht from the current state ht−1 and the input xt are
given below:
zt = σ(Wz[ht−1, xt]
rt = σ(Wz[ht−1, xt]
˜ht = tanh(W[rt ∗ ht−1, xt]
ht = (1 − zt) ∗ ht−1 + zt ∗ ˜ht
For our task, we implemented an LSTM model with a fully connected layer at the end to output the
number of points scored by the two teams for the desired game (see Figure 3).
5 Results
After tuning our models on the validation set to minimize validation MSE, we found the following
architectures to perform best:
3
Figure 2: Overview of Neural Network
Architecture Figure 3: Overview of LSTM Architecture
i Collaborative Filtering: We train a model looking at the games around the league for the
past five days, as that threshold minimized the validation MSE (see Figure 4).
ii Neural Network: We use four fully-connected layers, reducing the input of dimension 1524
to size 500, then 100, then 20, then finally a 1-dimensional output that predicts the Over-
Under of the desired game (see Figure 2). We used a weight decay parameter of 1 to
regularize the weights of the network, and a learning rate of 10−6
over 5000 epochs, using
Gradient Descent. This allowed the network to roughly train to convergence (see Figure 5).
iii LSTM: We use one layer with hidden dimension 20, and a fully connected layer at the end
(see Figure 3). We trained our model using Stochastic Gradient Descent with batches of
size 10, learning rate of 0.05 and dropout of 0.2 for 100 epochs (see Figure 6).
We evaluate the performance of our models primarily through the Mean Squared Error (MSE) be-
tween the Over-Under value predicted by the model and the true point total observed in the game.
The Over-Under values predicted by the sports books can serve as a benchmark as calculating the
MSE between the sports books’ predictions and the observed point totals gives us a sense of how
well our models are performing relative to the books. In addition, we can also calculate the percent-
age of the time that our model would have correctly predicted that the true point total was either over
or under the Over-Under number provided by the sports books – we found that the Neural Network
correctly chooses Over or Under around 51.5% of the time, while the Collaborative Filtering beats
the line 51% of the time on average.
Model Train MSE Test MSE
Random Forest (baseline) 955.10 910.33
Collaborative Filtering 2.95 459.85
Neural Network 349.17 369.84
LSTM Network 398.95 426.56
Sports Book Over-Under - 320.696
Table 1: Model Results
Figure 4: Collaborative
Filtering Hyperparameter
Search
Figure 5: Neural Network
Training Curve
Figure 6: LSTM Training
Curve
4
6 Discussion
Due to the rapidly changing nature of the NBA it is difficult to acquire sufficient training data that
reflects the way the game is currently played. This can be seen most prominently in the increasing
frequency with which NBA teams are attempting (and making) three-point shots. According to the
NBA, teams around the league combined for around 15,000 three-pointers made in the 2009 − 2010
season, and broke the 25,000 mark last year (2017-2018).
When we were training the models, we found while while we had data available starting from the
2007 − 2008 season, we achieved the highest levels of validation accuracy when the training dataset
started from the 2012 − 2013 season rather than from the 2007 − 2008 season. This suggests that
the earlier seasons are not very representative of the validation season (which was the 2016 − 2017
season) possibly because of the changing nature of the NBA.
It is worth noting that the Collaborative Filtering model, which doesn’t use any team features, sig-
nificantly outperformed the Random Forest. This shows the high variance in our data as well as the
strong seasonal trends that a model needs to encompass in order to be accurate on this task.
On Figure 6, it seems like the LSTM model doesn’t learn much after the first few epochs of training
and is overfitting to the training data. This is likely due to the fact that our data set is rather small
(around 5,000 example), and LSTMs needs lots of data to achieve successful training, both in terms
of learning and overfitting. Using the validation set, we weren’t able to find appropriate ways to
reduce overfitting without compromising the validation accuracy.
We achieved a test MSE of 369.84 for our best model, the Neural Network. This value is higher
than the MSE of the sports books’ predictions, which is 320.70, but is relatively close to the level
of accuracy of the books. When considering that NBA teams are scoring roughly a combined 200
points per game, MSE values of over 300 from even sports books may seem high. However, this
is simply a reflection of the high-variance nature of NBA games, where anything from injuries to
players to a strong shooting night or the decision of a coach to rest his star players after building
a lead can all lead to huge swings in the overall point totals of a game. The variance of the data
was also reflected in the need for a relatively large weight decay parameter while training the Neural
Network to prevent overfitting.
Encouragingly, when placing bets against the sports books on games in the test dataset using the
Neural Network, the predictions made were correct with respect to the actual outcomes 51.5% of
the time, as mentioned earlier. However, note that in the real world, the fees associated with betting
mean that successful long-term betting patterns need to be correct at least 52-53% of the time. In the
future, it would be interesting to explore if directly predicting whether the true result of the game is
over of under a sports book’s Over-Under prediction would lead to higher levels of accuracy in this
secondary metric.
7 Conclusion and Future Work
Overall, we were able to achieve encouraging results using our models, as our best predictions
rivaled the accuracy of sports books. Indeed, our Neural Network had a MSE of 369.84, compared
to the 320.696 MSE of sports book. On average, our model was able to pick the correct side of the
Over-Under 51.5% of the time.
In the future, there is definitely potential to improve the model through augmenting our data set.
First of all, the odds lines offered by sports books themselves offer a large amount of information
on the potential outcomes of games, including data that indicates the direction the odds lines are
moving in the hours before the game. It is entirely possible that the odds lines would indicate that
the sports books are very good at predicting the outcomes of games involving certain teams, but tend
to skew in some direction when trying to predict the outcomes of games involving other teams. In
addition, while our current models only use team-level data, incorporating player-level data could
add additional layers of nuance while accounting for the impact of injuries, or player fatigue.
Further exploration of model architectures could potentially improve our results as well. Due to
the similarity of our current problem (predicting betting indicators from the home team and away
team) to classical recommendation systems (predicting ratings for a given user and item), we could
definitely explore adapting algorithms used by Netflix, Amazon, and others for recommendation.
5
References
[1] Jim Warner. Predicting margin of victory in nfl games: Machine learning vs. the las vegas line.
Technical report, Technical Report, 2010.
[2] Daniel Pettersson and Robert Nyquist. Football match prediction using deep learning. 2017.
[3] Fran Goitia. Basketball reference scraper. https://github.com/FranGoitia/
basketball_reference, 2017.
[4] Scikit-learn. https://scikit-learn.org/stable/.
[5] Surprise. http://surpriselib.com/.
[6] Pytorch. https://pytorch.org/.
[7] Gene H Golub and Christian Reinsch. Singular value decomposition and least squares solutions.
Numerische mathematik, 14(5):403–420, 1970.
[8] Sepp Hochreiter and J¨urgen Schmidhuber. Long short-term memory. Neural computation,
9:1735–80, 12 1997.
Contributions
In this project, both Vishnu and Alexandre contributed equally to the writing of this report. Alexan-
dre completed around two thirds of the data retrieval/cleaning process, built the Collaborative
Filtering model and the LSTM network. Vishnu completed the remaining third of the data re-
trieval/cleaning process, built the baseline models as well as the Neural Network. The rest was
achieved jointly by both members.
6

More Related Content

Similar to Cs229 final report

Predicting Winner of DOTA2 Game
Predicting Winner of DOTA2 GamePredicting Winner of DOTA2 Game
Predicting Winner of DOTA2 GamePrashanth Raj
 
Comparative Analysis of Machine Learning Models for Cricket Score and Win Pre...
Comparative Analysis of Machine Learning Models for Cricket Score and Win Pre...Comparative Analysis of Machine Learning Models for Cricket Score and Win Pre...
Comparative Analysis of Machine Learning Models for Cricket Score and Win Pre...IRJET Journal
 
CLanctot_DSlavin_JMiron_Stats415_Project
CLanctot_DSlavin_JMiron_Stats415_ProjectCLanctot_DSlavin_JMiron_Stats415_Project
CLanctot_DSlavin_JMiron_Stats415_ProjectDimitry Slavin
 
Learning to Reason in Round-based Games: Multi-task Sequence Generation for P...
Learning to Reason in Round-based Games: Multi-task Sequence Generation for P...Learning to Reason in Round-based Games: Multi-task Sequence Generation for P...
Learning to Reason in Round-based Games: Multi-task Sequence Generation for P...Deren Lei
 
An analysis of minimax search and endgame databases in evolving awale game pl...
An analysis of minimax search and endgame databases in evolving awale game pl...An analysis of minimax search and endgame databases in evolving awale game pl...
An analysis of minimax search and endgame databases in evolving awale game pl...csandit
 
AN ANALYSIS OF MINIMAX SEARCH AND ENDGAME DATABASES IN EVOLVING AWALE GAME PL...
AN ANALYSIS OF MINIMAX SEARCH AND ENDGAME DATABASES IN EVOLVING AWALE GAME PL...AN ANALYSIS OF MINIMAX SEARCH AND ENDGAME DATABASES IN EVOLVING AWALE GAME PL...
AN ANALYSIS OF MINIMAX SEARCH AND ENDGAME DATABASES IN EVOLVING AWALE GAME PL...cscpconf
 
Predictions European Championships 2020
Predictions European Championships 2020Predictions European Championships 2020
Predictions European Championships 2020Ruben Kerkhofs
 
Cricket Score and Winning Prediction
Cricket Score and Winning PredictionCricket Score and Winning Prediction
Cricket Score and Winning PredictionIRJET Journal
 
Application of K-Means Clustering Algorithm for Classification of NBA Guards
Application of K-Means Clustering Algorithm for Classification of NBA GuardsApplication of K-Means Clustering Algorithm for Classification of NBA Guards
Application of K-Means Clustering Algorithm for Classification of NBA GuardsEditor IJCATR
 
Creative component alexzajichek
Creative component alexzajichekCreative component alexzajichek
Creative component alexzajichekAlexZajichek
 
10.1.1.735.795.pdf
10.1.1.735.795.pdf10.1.1.735.795.pdf
10.1.1.735.795.pdfresearchict
 
Project ppt.pptx
Project ppt.pptxProject ppt.pptx
Project ppt.pptxSnehaDR2
 
Tutorial machine learning with python - a tutorial
Tutorial   machine learning with python - a tutorialTutorial   machine learning with python - a tutorial
Tutorial machine learning with python - a tutorialMarcusBBraga
 
Supervised sequential pattern mining for identifying important patterns of pl...
Supervised sequential pattern mining for identifying important patterns of pl...Supervised sequential pattern mining for identifying important patterns of pl...
Supervised sequential pattern mining for identifying important patterns of pl...Rory Bunker
 

Similar to Cs229 final report (20)

MachineLearning_MPI_vs_Spark
MachineLearning_MPI_vs_SparkMachineLearning_MPI_vs_Spark
MachineLearning_MPI_vs_Spark
 
Predicting Winner of DOTA2 Game
Predicting Winner of DOTA2 GamePredicting Winner of DOTA2 Game
Predicting Winner of DOTA2 Game
 
IRJET-V8I11270.pdf
IRJET-V8I11270.pdfIRJET-V8I11270.pdf
IRJET-V8I11270.pdf
 
Cricket predictor
Cricket predictorCricket predictor
Cricket predictor
 
Comparative Analysis of Machine Learning Models for Cricket Score and Win Pre...
Comparative Analysis of Machine Learning Models for Cricket Score and Win Pre...Comparative Analysis of Machine Learning Models for Cricket Score and Win Pre...
Comparative Analysis of Machine Learning Models for Cricket Score and Win Pre...
 
honors_paper
honors_paperhonors_paper
honors_paper
 
CLanctot_DSlavin_JMiron_Stats415_Project
CLanctot_DSlavin_JMiron_Stats415_ProjectCLanctot_DSlavin_JMiron_Stats415_Project
CLanctot_DSlavin_JMiron_Stats415_Project
 
Learning to Reason in Round-based Games: Multi-task Sequence Generation for P...
Learning to Reason in Round-based Games: Multi-task Sequence Generation for P...Learning to Reason in Round-based Games: Multi-task Sequence Generation for P...
Learning to Reason in Round-based Games: Multi-task Sequence Generation for P...
 
An analysis of minimax search and endgame databases in evolving awale game pl...
An analysis of minimax search and endgame databases in evolving awale game pl...An analysis of minimax search and endgame databases in evolving awale game pl...
An analysis of minimax search and endgame databases in evolving awale game pl...
 
AN ANALYSIS OF MINIMAX SEARCH AND ENDGAME DATABASES IN EVOLVING AWALE GAME PL...
AN ANALYSIS OF MINIMAX SEARCH AND ENDGAME DATABASES IN EVOLVING AWALE GAME PL...AN ANALYSIS OF MINIMAX SEARCH AND ENDGAME DATABASES IN EVOLVING AWALE GAME PL...
AN ANALYSIS OF MINIMAX SEARCH AND ENDGAME DATABASES IN EVOLVING AWALE GAME PL...
 
Ai final module (1)
Ai final module (1)Ai final module (1)
Ai final module (1)
 
Predictions European Championships 2020
Predictions European Championships 2020Predictions European Championships 2020
Predictions European Championships 2020
 
Cricket Score and Winning Prediction
Cricket Score and Winning PredictionCricket Score and Winning Prediction
Cricket Score and Winning Prediction
 
Application of K-Means Clustering Algorithm for Classification of NBA Guards
Application of K-Means Clustering Algorithm for Classification of NBA GuardsApplication of K-Means Clustering Algorithm for Classification of NBA Guards
Application of K-Means Clustering Algorithm for Classification of NBA Guards
 
Creative component alexzajichek
Creative component alexzajichekCreative component alexzajichek
Creative component alexzajichek
 
10.1.1.735.795.pdf
10.1.1.735.795.pdf10.1.1.735.795.pdf
10.1.1.735.795.pdf
 
Project ppt.pptx
Project ppt.pptxProject ppt.pptx
Project ppt.pptx
 
Tutorial machine learning with python - a tutorial
Tutorial   machine learning with python - a tutorialTutorial   machine learning with python - a tutorial
Tutorial machine learning with python - a tutorial
 
Supervised sequential pattern mining for identifying important patterns of pl...
Supervised sequential pattern mining for identifying important patterns of pl...Supervised sequential pattern mining for identifying important patterns of pl...
Supervised sequential pattern mining for identifying important patterns of pl...
 
I2 madankarky1 jharibabu
I2 madankarky1 jharibabuI2 madankarky1 jharibabu
I2 madankarky1 jharibabu
 

Recently uploaded

Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 

Recently uploaded (20)

Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 

Cs229 final report

  • 1. The Bank is Open: AI in Sports Gambling Alexandre Bucquet bucqueta@stanford.edu Vishnu Sarukkai sarukkai@stanford.edu Abstract In this paper, we explore the applications of Machine Learning to sports betting by focusing on predicting the number of total points scored by both teams in an NBA game. We use Neural Networks as well as recurrent models for this task, and manage to achieve results that are similar to those of the sports books. On average, our best models can beat the house 51.5% of the time. 1 Introduction Last May, the United States Supreme Court legalized sports betting. While it is still up to every state to pass legislation on the issues, many have already been working on bills. As tracked by ESPN, eight states have already legalized sports gambling, with two more states having signed bills that will legalize gambling in the near future. With its ruling, the Supreme Court paved way to the opening of whole new legal (and thus taxable) market of size $150 billion according to the American Gaming Association. In this project, we will explore the applications of Machine Learning in the field of sports betting, using a case study of the National Basketball Association. More specifically, we wish to design sets of models that predict betting indicators for NBA matches. Out of the three main indicators (Over- Under, Money Line and Point Spread), we focus on the Over-Under. In other words, we predict for every NBA game how many combined points the two teams will score. Ideally, we would be able to come up with an estimate for this indicator that is more accurate than that of some betting platforms, and use this to our advantage to place bets. 2 Related Work The paper “Predicting Margin of Victory in NFL Games” [1] uses a Gaussian process model in aiming to model NFL point spreads and beat the Las Vegas line. They aim to beat the line through the incorporation of novel features such as the temperature of the field at kickoff and indicator variables representing team “strengths.” However, the model achieves a success rate in predicting games 2% worse than the Vegas line. This may be because high-dimensional Gaussians are not the most appropriate way to model NFL point totals and point spreads. Moreover, NFL data in very scarce as each team only takes the field less than 20 times a year. The paper “Football Match Prediction using Deep Learning” [2] uses recurrent models to predict the outcome of soccer matches. The architecture developed includes a Long Short Term Memory Network (LSTM) that processes game data at every time interval. This allows the authors to obtain a live prediction of who will win the soccer match every minute as the game progresses. The final model had one LSTM layer with 256 hidden dimensions, and achieved a test accuracy of 88%. This paper illustrates the relevance of recurrent models to sports data and its competitive results suggest that more is to be done for other sports using similar model architectures.
  • 2. 3 Dataset and Features 3.1 Data Collection The data collected so far can be classified into two groups: betting odds data and game data, which consists of both data describing team performance and player performance. We found odds data on Sports Book Review Online, a website that compiles betting data for every NBA games since the 2007-2008 season. The website offers a downloadable excel file for each season. Using a short script, we were able to retrieve all the necessary target variables and the corresponding betting odds offered. The betting indicators scraped were the following: i Over-Under: the total number of points scored in a game. ii Spread: the number of points by which the home team wins (or loses). iii Money Line: a number encoding the amount of money won from placing a bet on the winning team of a game. In the rest of this project, we focus on the total number of points scored, and we use the Over-Under data collected to solely evaluate our models, and do not include the above data in our features. For game data, we retrieved data from Basketball Reference using Fran Goitia’s NBA crawler [3]. We made minor edits to the scraper in order to account for formatting inconsistencies in the data. This crawler provided a strong base to start collecting data on every team for every game since the 2007-2008 NBA season. The data collected for every game included statistics such as Points Scored, Rebounds, Assists and Steals for both teams. 3.2 Feature Building Using the retrieved data from Basketball Reference, we build a featurized dataset. For every matchup between two teams, we decide to look at both team’s past three games. We included simple features such as Points Scored, Points Scored Against, or Total Rebounds, and also more complicated metrics such as Offensive Rating and Plus/Minus. In order to account for opponent strength, we also added each opponent’s season averages in all above metrics. Finally, to account for player fatigue, we added the number of days since the last game as well as the distance traveled (see Figure 1). Figure 1: Overview of the features used for a game between the Golden State Warriors (GSW) and the Oklahoma City Thunder (OKC) 4 Methods All the code for this project can be found at https://github.com/abucquet/bank is open. We used Python libraries Scikit-learn [4] (Random Forest), Surprise [5] (Singlar Value Decomposition), and PyTorch [6] (Neural Net and LSTM). 2
  • 3. For the task of predicting the number of points scored in a game, we minimize Mean Squared Error (MSE) when training all our models. We trained our models on four seasons (2012 − 2013 to 2015 − 2016), used the 2016 − 2017 season as our validation set, and tested our models on the 2017 − 2018 season. This gave us a train set of size 5,069, a validation set of size 1,260 and a test set of size 1,264. We trained a Random Forest as a baseline before moving on to more complex models. 4.1 Collaborative Filtering This problem is similar to the user/item rating prediction, information about the outcome of previous games between other teams can inform our predicted output for the current matchup. To leverage this, we used Singular Value Decomposition [7] to build a model similar to collaborative filtering. In this model, we build a sparse matrix A that contains the number of points scored by two teams if the teams have faced off in the past few days. In this setting, the rows represent the home team and the columns the away team. Then, we factorize A into UΣV T by solving minU,V,Σ ij,Aij =0 (Aij − (UΣV T )ij)2 We hope that the rows of matrices U and V capture information about teams at home and away, respectively, and we then predict uiΣvT j total points scored in a game between teams i (home) and j (away), where bk is the k-th row of matrix B. 4.2 Neural Network Like the collaborative filtering model, the Neural Network captures information from the outcomes of previous games between other teams, as during training the network is provided the results of previous games as input along with the identity of the two teams involved in the game. However, the Neural Network has an advantage over collaborative filtering in that it is also able to take features of both teams involved as inputs. Therefore, it can draw on not only the outcomes of previous training examples but also the offensive rating of each of the teams involved over their past three games, etc. Passing in the feature set created in 3.2, we train a Neural Network with fully-connected layers and ReLU activations. The input data is flattened into a size of 1524 features. The ReLU activations are used for ease of training and to reduce the likelihood of gradient vanishing (see Figure 2). 4.3 Long Short Term Memory Network (LSTM) Given that games are played sequentially, we decided to use an LSTM to process the past three games one by one. As described in ”Long Short-Term Memory” [8], an LSTM is a recurrent model where we repeatedly computed a hidden state by processing a new timestep (in our case one game). The equations to compute the future hidden state ht from the current state ht−1 and the input xt are given below: zt = σ(Wz[ht−1, xt] rt = σ(Wz[ht−1, xt] ˜ht = tanh(W[rt ∗ ht−1, xt] ht = (1 − zt) ∗ ht−1 + zt ∗ ˜ht For our task, we implemented an LSTM model with a fully connected layer at the end to output the number of points scored by the two teams for the desired game (see Figure 3). 5 Results After tuning our models on the validation set to minimize validation MSE, we found the following architectures to perform best: 3
  • 4. Figure 2: Overview of Neural Network Architecture Figure 3: Overview of LSTM Architecture i Collaborative Filtering: We train a model looking at the games around the league for the past five days, as that threshold minimized the validation MSE (see Figure 4). ii Neural Network: We use four fully-connected layers, reducing the input of dimension 1524 to size 500, then 100, then 20, then finally a 1-dimensional output that predicts the Over- Under of the desired game (see Figure 2). We used a weight decay parameter of 1 to regularize the weights of the network, and a learning rate of 10−6 over 5000 epochs, using Gradient Descent. This allowed the network to roughly train to convergence (see Figure 5). iii LSTM: We use one layer with hidden dimension 20, and a fully connected layer at the end (see Figure 3). We trained our model using Stochastic Gradient Descent with batches of size 10, learning rate of 0.05 and dropout of 0.2 for 100 epochs (see Figure 6). We evaluate the performance of our models primarily through the Mean Squared Error (MSE) be- tween the Over-Under value predicted by the model and the true point total observed in the game. The Over-Under values predicted by the sports books can serve as a benchmark as calculating the MSE between the sports books’ predictions and the observed point totals gives us a sense of how well our models are performing relative to the books. In addition, we can also calculate the percent- age of the time that our model would have correctly predicted that the true point total was either over or under the Over-Under number provided by the sports books – we found that the Neural Network correctly chooses Over or Under around 51.5% of the time, while the Collaborative Filtering beats the line 51% of the time on average. Model Train MSE Test MSE Random Forest (baseline) 955.10 910.33 Collaborative Filtering 2.95 459.85 Neural Network 349.17 369.84 LSTM Network 398.95 426.56 Sports Book Over-Under - 320.696 Table 1: Model Results Figure 4: Collaborative Filtering Hyperparameter Search Figure 5: Neural Network Training Curve Figure 6: LSTM Training Curve 4
  • 5. 6 Discussion Due to the rapidly changing nature of the NBA it is difficult to acquire sufficient training data that reflects the way the game is currently played. This can be seen most prominently in the increasing frequency with which NBA teams are attempting (and making) three-point shots. According to the NBA, teams around the league combined for around 15,000 three-pointers made in the 2009 − 2010 season, and broke the 25,000 mark last year (2017-2018). When we were training the models, we found while while we had data available starting from the 2007 − 2008 season, we achieved the highest levels of validation accuracy when the training dataset started from the 2012 − 2013 season rather than from the 2007 − 2008 season. This suggests that the earlier seasons are not very representative of the validation season (which was the 2016 − 2017 season) possibly because of the changing nature of the NBA. It is worth noting that the Collaborative Filtering model, which doesn’t use any team features, sig- nificantly outperformed the Random Forest. This shows the high variance in our data as well as the strong seasonal trends that a model needs to encompass in order to be accurate on this task. On Figure 6, it seems like the LSTM model doesn’t learn much after the first few epochs of training and is overfitting to the training data. This is likely due to the fact that our data set is rather small (around 5,000 example), and LSTMs needs lots of data to achieve successful training, both in terms of learning and overfitting. Using the validation set, we weren’t able to find appropriate ways to reduce overfitting without compromising the validation accuracy. We achieved a test MSE of 369.84 for our best model, the Neural Network. This value is higher than the MSE of the sports books’ predictions, which is 320.70, but is relatively close to the level of accuracy of the books. When considering that NBA teams are scoring roughly a combined 200 points per game, MSE values of over 300 from even sports books may seem high. However, this is simply a reflection of the high-variance nature of NBA games, where anything from injuries to players to a strong shooting night or the decision of a coach to rest his star players after building a lead can all lead to huge swings in the overall point totals of a game. The variance of the data was also reflected in the need for a relatively large weight decay parameter while training the Neural Network to prevent overfitting. Encouragingly, when placing bets against the sports books on games in the test dataset using the Neural Network, the predictions made were correct with respect to the actual outcomes 51.5% of the time, as mentioned earlier. However, note that in the real world, the fees associated with betting mean that successful long-term betting patterns need to be correct at least 52-53% of the time. In the future, it would be interesting to explore if directly predicting whether the true result of the game is over of under a sports book’s Over-Under prediction would lead to higher levels of accuracy in this secondary metric. 7 Conclusion and Future Work Overall, we were able to achieve encouraging results using our models, as our best predictions rivaled the accuracy of sports books. Indeed, our Neural Network had a MSE of 369.84, compared to the 320.696 MSE of sports book. On average, our model was able to pick the correct side of the Over-Under 51.5% of the time. In the future, there is definitely potential to improve the model through augmenting our data set. First of all, the odds lines offered by sports books themselves offer a large amount of information on the potential outcomes of games, including data that indicates the direction the odds lines are moving in the hours before the game. It is entirely possible that the odds lines would indicate that the sports books are very good at predicting the outcomes of games involving certain teams, but tend to skew in some direction when trying to predict the outcomes of games involving other teams. In addition, while our current models only use team-level data, incorporating player-level data could add additional layers of nuance while accounting for the impact of injuries, or player fatigue. Further exploration of model architectures could potentially improve our results as well. Due to the similarity of our current problem (predicting betting indicators from the home team and away team) to classical recommendation systems (predicting ratings for a given user and item), we could definitely explore adapting algorithms used by Netflix, Amazon, and others for recommendation. 5
  • 6. References [1] Jim Warner. Predicting margin of victory in nfl games: Machine learning vs. the las vegas line. Technical report, Technical Report, 2010. [2] Daniel Pettersson and Robert Nyquist. Football match prediction using deep learning. 2017. [3] Fran Goitia. Basketball reference scraper. https://github.com/FranGoitia/ basketball_reference, 2017. [4] Scikit-learn. https://scikit-learn.org/stable/. [5] Surprise. http://surpriselib.com/. [6] Pytorch. https://pytorch.org/. [7] Gene H Golub and Christian Reinsch. Singular value decomposition and least squares solutions. Numerische mathematik, 14(5):403–420, 1970. [8] Sepp Hochreiter and J¨urgen Schmidhuber. Long short-term memory. Neural computation, 9:1735–80, 12 1997. Contributions In this project, both Vishnu and Alexandre contributed equally to the writing of this report. Alexan- dre completed around two thirds of the data retrieval/cleaning process, built the Collaborative Filtering model and the LSTM network. Vishnu completed the remaining third of the data re- trieval/cleaning process, built the baseline models as well as the Neural Network. The rest was achieved jointly by both members. 6