BACKGROUND AND MOTIVATION
• The motivation for this project was to apply
mathematical, statistical and actuarial
modelling techniques to professional sports.
• Our interest is to provide information on how
such modelling techniques can be used to gain
insights into sports dynamics.
The main aim of this work is to build models for
football data, in particular the scoring rate.
The specific objectives that we set out to achieve were:
• Examining the suitability of the simple Poisson
model to our data.
• Incorporating the time heterogeneity of the
simple Poisson parameter.
• Exploring the effect of time on the scoring rate.
• Data was retrieved from rsssf.com.
• The data used in this project spans 7 football
seasons from the 1996-1997 to the 2002-2003
English premier league seasons.
• There were (380*7)=2660 matches over the 7
seasons with 7001 goals in total, of which 4055
were home goals and 2946 were away goals.
• Of the 7001 goals, 3119 were scored in the first
half with the rest (3882) being scored in the
second half.
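The figures above can be checked with a few lines of arithmetic. The sketch below also computes the average goals per match, which is the usual estimate of a constant Poisson rate; the variable names are illustrative and not taken from the original analysis.

```python
# Totals quoted in the data description.
matches = 380 * 7
home_goals, away_goals = 4055, 2946
first_half_goals, second_half_goals = 3119, 3882

total_goals = home_goals + away_goals

# Average goals per match: the sample-mean estimate of a constant Poisson rate.
rate_home = home_goals / matches
rate_away = away_goals / matches
rate_total = total_goals / matches

print(matches, total_goals)
print(round(rate_home, 4), round(rate_away, 4), round(rate_total, 4))
```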
• Consider the Poisson model as it is conventionally used for
count data and its parameter is defined for positive values.
The same model has been used extensively in previous studies
(e.g. Karlis and Ntzoufras, 2003).
• Test for correlation between home and away scores.
• Fit a Poisson model for the scoring rates in the home case (λh),
the away case (λa) and the total case (λt).
• Conduct goodness of fit test on the simple Poisson model.
• Introduce the Poisson Gamma Mixture as an improvement
on the simple Poisson.
• Conduct goodness of fit test on the mixed Poisson model.
• Introduce Poisson regression to model the relationship
between scoring rates and time in a football match.
Test for Correlation (R Test)
• Ho: ρ = 0 versus H1: ρ ≠ 0
• $r = \dfrac{S_{HA}}{\sqrt{S_{AA} \times S_{HH}}}$
• Test stat $= r \times \sqrt{\dfrac{n-2}{1-r^{2}}} = -1.06593 \sim t_{2658}$
• Degrees of freedom = n - 2 = 2658
• P value = 0.28646
• The coefficient of correlation was found to be -0.02067. The test statistic is
t = -1.06593 and the corresponding p value is 0.28646, so we fail to reject Ho.
This implies that the correlation coefficient is not significantly different
from zero.
• Because the correlation coefficient was found not to be significantly
different from zero, we can proceed treating the home goals and away
goals as independent Poisson events and test whether a Poisson model is
appropriate.
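The test statistic for Ho: ρ = 0 can be recomputed from the reported correlation coefficient and sample size. The sketch below is illustrative only: the function name is hypothetical, and the p value uses the standard normal tail as an approximation to the t distribution, which is an assumption that is reasonable only because the degrees of freedom (2658) are so large.

```python
import math

def correlation_t_test(r, n):
    """Return (t statistic, approximate two-sided p value) for Ho: rho = 0."""
    t_stat = r * math.sqrt((n - 2) / (1.0 - r * r))
    # Assumption: with n - 2 = 2658 degrees of freedom the t distribution is
    # very close to the standard normal, so the normal tail is used here.
    p_value = math.erfc(abs(t_stat) / math.sqrt(2.0))
    return t_stat, p_value

# Reported values: r = -0.02067 from n = 2660 matches.
t_stat, p_value = correlation_t_test(r=-0.02067, n=2660)
print(round(t_stat, 4), round(p_value, 4))
```

A p value well above conventional significance levels is consistent with treating home and away goals as uncorrelated.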
Using the formulas below we obtained the following values
The Poisson Model
• For a Poisson distribution Y…
• Y ~ Po (λ)
• E (Y| λ) = λ
• Var (Y| λ) = λ
• However the Expectation of Y can be derived from the Tower Law
• E(Y) = E (E (Y| λ)) = E (λ) (1)
• The variance can be calculated in the following manner. This is known as
the law of total variance or variance decomposition formula
• Var(Y) = E (Var (Y|λ)) + Var (E (Y|λ)) (2)
• Thus the excess variance is synonymous with the variance of the
parameter λ, since this can be simplified further using E (Y| λ) = λ and
Var (Y| λ) = λ, for a Poisson model
• Var(Y) = E (λ) + Var (λ)
• Therefore, we observe heterogeneity in the scoring rate λ. This suggests
that λ is a random variable, so we need a model that incorporates λ as a
random variable.
• We now consider a Poisson mixture.
• Since λ > 0, it is appropriate to take a Gamma distribution as the
mixing distribution, as it is also defined for positive values and is
flexible enough to fit a wide range of shapes.
• It also has the advantage of having a moment generating function that is
easy to compute, which will be used later in the study.
• Some other distributions which cover the same state space (λ > 0) do not
have easily computable moment generating functions.
• The Gamma distribution is a two-parameter distribution, with α usually
termed the shape parameter and β the rate parameter. The pdf of a gamma
distribution is given by
• $f_\lambda(\lambda) = \dfrac{\beta^{\alpha} \lambda^{\alpha-1} \exp(-\beta\lambda)}{\Gamma(\alpha)}$
Poisson Gamma Model
• This may at first seem difficult, as the model assumes that the scoring rates (home, away and total)
are drawn randomly from a gamma distribution, so λ is a random variable and not fixed. However, we
can integrate out λ by integrating over the whole range available to λ in the gamma distribution and
hence derive a probability function for the number of goals in terms of α and β. This is illustrated below:
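• $f_\lambda(\lambda) = \dfrac{\beta^{\alpha} \lambda^{\alpha-1} \exp(-\beta\lambda)}{\Gamma(\alpha)}$
• $P(X = x \mid \lambda) = \dfrac{\exp(-\lambda)\, \lambda^{x}}{x!}$
• $P(X = x) = \int_0^\infty P(X = x \mid \lambda) \times f_\lambda(\lambda)\, d\lambda = \dfrac{\beta^{\alpha}}{\Gamma(\alpha)\, x!} \int_0^\infty \lambda^{\alpha+x-1} \exp(-(\beta+1)\lambda)\, d\lambda = \dfrac{\beta^{\alpha}}{(\beta+1)^{\alpha+x}} \times \dfrac{\Gamma(\alpha+x)}{\Gamma(\alpha)\, x!}$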
Poisson gamma probabilities
• Hence we have a smart way of determining probabilities for
number of home goals, away goals and total goals. The minor
problem here would be evaluating non-integral gamma functions.
This can easily be overcome by using the recurrence formula
• $P(X = 0) = \left(\dfrac{\beta}{\beta+1}\right)^{\alpha}$
• And $P(X = x) = P(X = x-1) \times \dfrac{\alpha + x - 1}{(\beta+1) \times x}$
• Using the excess variance for all three goal distributions, it is
possible to fit appropriate gamma distributions by the method of
moments, since E(λ) = α/β and Var(λ) = α/β², and using E(X) = E(λ)
and Excess variance = Var(X) – E(X) = Var(λ):
β = (α/β) / (α/β²) = E(X) / (Var(X) – E(X))
α = E(X) × β
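The method-of-moments fit and the recurrence for the Poisson gamma probabilities can be sketched together as follows. This is a minimal illustration, not the code used in the study: the function names are hypothetical, and the mean and variance passed in are made-up values chosen only to show the calculation.

```python
import math

def fit_gamma_by_moments(mean, variance):
    """Method-of-moments estimates (alpha, beta) of the gamma mixing distribution."""
    excess = variance - mean              # Var(X) - E(X) = Var(lambda)
    if excess <= 0:
        raise ValueError("variance must exceed the mean (no overdispersion)")
    beta = mean / excess                  # beta = E(X) / (Var(X) - E(X))
    alpha = mean * beta                   # alpha = E(X) * beta
    return alpha, beta

def poisson_gamma_pmf(alpha, beta, x_max):
    """P(X = 0), ..., P(X = x_max) by recurrence, with no gamma-function calls."""
    probs = [(beta / (beta + 1.0)) ** alpha]
    for x in range(1, x_max + 1):
        probs.append(probs[-1] * (alpha + x - 1.0) / ((beta + 1.0) * x))
    return probs

# Hypothetical sample moments, for illustration only (not the study's values).
mean_goals, var_goals = 1.5, 1.8
alpha, beta = fit_gamma_by_moments(mean_goals, var_goals)
probs = poisson_gamma_pmf(alpha, beta, 60)

print(round(alpha, 3), round(beta, 3))
print([round(p, 4) for p in probs[:5]])
```

The recurrence only multiplies by a simple ratio at each step, so it stays usable when α is not an integer.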
• Ho: Goals (i) follow a Poisson distribution, i = home, away,
total
• H1: Goals (i) do not follow a Poisson distribution, i = home,
away, total
• H2: Goals (i) follow a Poisson distribution with rates drawn
from a Gamma distribution, i = home, away, total
• H3: Goals (i) do not follow a Poisson distribution with rates
drawn from a Gamma distribution, i = home, away, total
• To test the hypotheses above, Chi-square goodness of fit tests
were carried out. The test results are given below:
Simple Poisson goodness of fit test
Clearly the simple Poisson models with assumed constant rate
for all games are not good models with p values all substantially
below 1%. Hence Ho can be rejected in each case. However this
is hardly surprising as the rates are extremely unlikely to be the
same for each home and away side for each match.
Table 4: Goodness of fit test for the simple Poisson model
        Σ(O−E)²/E   Classes   Df lost for total   Df lost for parameters   Total df   p value
Home    21.9963     7         1                   1                        5          0.00052
Away    16.7955     6         1                   1                        4          0.00212
Total   21.5212     9         1                   1                        7          0.00589
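A goodness of fit statistic of this kind can be computed as sketched below. The observed frequencies are hypothetical (the poster does not list the class counts), and the function names are illustrative. One degree of freedom is lost for the fixed total and one for the estimated rate, as in Table 4 (for example 7 − 1 − 1 = 5 in the home case).

```python
import math

def poisson_expected(lam, n_matches, n_classes):
    """Expected frequencies for 0, 1, ..., n_classes - 2 goals plus a pooled tail."""
    probs = [math.exp(-lam) * lam ** k / math.factorial(k)
             for k in range(n_classes - 1)]
    probs.append(1.0 - sum(probs))        # final class pools all higher counts
    return [n_matches * p for p in probs]

def chi_square_statistic(observed, expected):
    """Pearson statistic: sum of (O - E)^2 / E over the classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical observed frequencies for 0, 1, ..., 5 and "6 or more" goals.
observed = [580, 880, 680, 340, 130, 40, 10]
n_matches = sum(observed)
# Rate estimated by the sample mean, treating the pooled class as exactly 6.
lam_hat = sum(k * o for k, o in enumerate(observed)) / n_matches

expected = poisson_expected(lam_hat, n_matches, len(observed))
statistic = chi_square_statistic(observed, expected)
df = len(observed) - 1 - 1                # total fixed, one parameter estimated
print(f"chi-square = {statistic:.3f} on {df} degrees of freedom")
```

The p value is then the upper tail of the chi-square distribution with `df` degrees of freedom, read from tables or, if SciPy is available, from `scipy.stats.chi2.sf(statistic, df)`.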
Poisson gamma goodness of fit test
In the case of the home and away goals scored, the p
values are large (0.4025 and 0.25054), indicating that we
do not reject H2, i.e. the Poisson Gamma model is a good
fit. Here the Poisson Gamma model is a clear improvement
on the simple Poisson model. The same conclusion cannot
be drawn for the total goals, as the p value is so small.
Table 5: Goodness of fit test for the Poisson Gamma Model
        Σ(O−E)²/E   Classes   Df lost for total   Df lost for parameters   Total df   p value
Home    4.02597     7         1                   2                        4          0.4025
Away    4.10313     6         1                   2                        3          0.25054
Total   18.9681     9         1                   2                        6          0.00829
Chart 1: Total Goals vs. Time (x-axis: Time in Minutes)
Chart 2: Number of Home Goals vs. Time (x-axis: Time in Minutes)
Chart 3: Number of Away Goals vs. Time (x-axis: Time in Minutes)
• As is clear from the scatter plots above, noticeably fewer goals are scored in the
1st and 46th minutes, and there is a spike in the scoring rates of the 45th
and 90th minutes.
• The latter could be due to the added (stoppage) time that is associated with these
two minutes. We define the total scoring rate as the total goals scored in a given
minute. The scoring rates between the 2nd and 44th minutes and between the 47th
and 89th minutes steadily increase with time.
• To construct a model it is reasonable to divide the 90 minute interval to reflect this
systematic pattern in the total scoring rates with respect to time.
• We need to obtain the 1st, 45th, 46th and 90th minute rates explicitly, and then derive
an overall rate for the intervals [2,44] and [47,89] inclusive.
• An argument can be made that a linear model can be constructed to capture the
trend in the data. However, a linear regression would imply equal weights assigned
to each minute (the weights are the minutes of exposure in each minute which
=2660), but since the rate increases with time in two largely significant intervals in
the data, this is not a reasonable assumption.
• The rate in these intervals would reasonably be approximated by a model with an
associated Poisson distribution. Since they span only a minute each, the other
explicit scoring rates can be derived by the trivial formula of the total goals scored
per minute divided by the minutes of exposure, which is 2660 in each case. In
addition, to account for the linearity of the rates in the ‘Poisson intervals’, the
Poisson regression seems ideal.
The Poisson regression is a generalized linear model under the Poisson family of distributions and is
expressed as follows:
$\ln \mu(t) = \alpha + \beta t$, where μ(t) denotes the scoring rate for a given minute t.
For the total goals case;
The explicitly derived rates, denoted μ0(t), are obtained as follows:
$\mu_0(t) = \dfrac{\text{total goals in the } i\text{th minute}}{2660}$
For instance, the rates for the total case are as follows:
μ0(t) = 0.0147      0 < t < 1
μ0(t) = 0.0650      44 < t < 45
μ0(t) = 0.018797    45 < t < 46
μ0(t) = 0.1140      89 < t < 90
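As a worked illustration of the explicit rate formula, the sketch below divides a goal count for a single minute by the 2660 minutes of exposure. The count of 50 goals is a hypothetical input chosen for illustration; it happens to reproduce a rate of 0.018797, but the per-minute goal counts themselves are not given in the source.

```python
# One minute of exposure per match, over 380 * 7 matches.
MINUTES_OF_EXPOSURE = 380 * 7

def explicit_rate(goals_in_minute, exposure=MINUTES_OF_EXPOSURE):
    """Scoring rate for a single minute: goals divided by minutes of exposure."""
    return goals_in_minute / exposure

# Hypothetical goal count for one minute (not a figure reported in the study).
rate = explicit_rate(50)
print(round(rate, 6))  # 0.018797
```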
The Poisson regression model is
The parameters α and β were estimated using R, and the following results were
obtained:
Table 6: Estimated parameters α and β
                 Home case   Away case   Total case
First half α     3.653472    3.069511    4.09877
First half β     0.00188     0.009956    0.005133
Second half α    3.312263    3.451026    4.291641
Second half β    0.007898    0.001473    0.005163
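The study estimated these parameters in R. The following is only a Python sketch of how a log-linear Poisson regression with one covariate can be fitted by Newton-Raphson. The function name and the synthetic goal counts are assumptions made for illustration, and the counts are generated from known coefficients rather than taken from the data.

```python
import math

def fit_poisson_regression(t, y, iterations=25):
    """Newton-Raphson fit of ln(mu) = alpha + beta * t for Poisson counts y."""
    alpha, beta = math.log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        mu = [math.exp(alpha + beta * ti) for ti in t]
        # Score vector (gradient of the Poisson log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(ti * (yi - mi) for ti, yi, mi in zip(t, y, mu))
        # Fisher information matrix entries.
        h00 = sum(mu)
        h01 = sum(ti * mi for ti, mi in zip(t, mu))
        h11 = sum(ti * ti * mi for ti, mi in zip(t, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system explicitly.
        alpha += (h11 * g0 - h01 * g1) / det
        beta += (h00 * g1 - h01 * g0) / det
    return alpha, beta

# Synthetic goal counts for minutes 2..44, generated from hypothetical
# coefficients (4.1, 0.005) purely to show that the fit recovers them.
minutes = list(range(2, 45))
goals = [round(math.exp(4.1 + 0.005 * m)) for m in minutes]

alpha_hat, beta_hat = fit_poisson_regression(minutes, goals)
print(round(alpha_hat, 3), round(beta_hat, 4))
```

In R, the equivalent fit is normally obtained with `glm(..., family = poisson)`; the explicit iteration here is just to make the estimation step visible.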
To establish whether the Poisson regression model adheres to the data for the first
and second half scores, a Chi-square goodness of fit test was carried out as shown
below.
Ho (i): The data of goals (i) adheres to the Poisson regression model
H1 (i): The data of goals (i) does not adhere to the Poisson regression model
Where i = home, away, total
           HOME                      AWAY                      TOTAL
           First half  Second half   First half  Second half   First half  Second half
p value    0.012314    1.13E-07      0.00035     5.90E-06      0.137095    0.384263
• According to the results of the test above, the p-values for both the first and
second half for the total number of goals (0.137095 and 0.384263) are above the
5% significance level. However, the p-values for the home and away goals are all
below 5%.
• We hence do not reject the hypothesis that, in the total case, the data adheres to
the Poisson regression model. Nonetheless, we reject the hypothesis Ho (i) where
(i) = home goals or away goals.
• Given this fact, we may conclude that the Poisson regression is a good fit for the
total number of goals. However, the Poisson regression is not a good fit for the
home and away scores given the very low p values.
• Despite the fact the Poisson regression has low p-values for the home and away
scores, it addresses the flaws in the Poisson Gamma model, and the simple
Poisson model by extension.
• The Poisson regression model does not share the flaws that we observed in the
Poisson model. For this reason, we believe that it is a better fit for our data.
• Therefore, given that it avoids the limitations of both the Poisson and the Poisson
Gamma models, we find that the Poisson regression model is the most suitable for
capturing the scoring rate in football matches.
• A major limitation of our project is that the Poisson regression
model has been staggered to cover the different time
intervals where we observed patterns.
• A single model that captures the entire 90 minutes of a
football match would be preferable.
• Similarly, other factors that affect goal scoring were not
incorporated in the study due to time and other constraints.
These included team strengths, injured players, home
advantage, refereeing decisions, coaching techniques and
retaliatory factors, amongst others. Further research is needed.
• Baker, R. D., and McHale, I. G. (2015). Time varying ratings in association football: the all-time greatest team is… Journal of the Royal Statistical Society. Series A (Statistics in Society). Vol. 178 (2), 481-492. Retrieved March 2016 from http://onlinelibrary.wiley.com/doi/10.1111/rssa.12060/full
• Karlis, D., and Ntzoufras, J. (2003). Bayesian and Non-Bayesian Analysis of Soccer Data Using Bivariate Poisson Regression Models. Retrieved 16th May 2016 from
• Maher, M. J. (1982). Modeling Association Football Scores. Statistica Neerlandica. Vol. 36 (3), 109-118. Retrieved 18th May 2016 from http://www.90minut.pl/misc/maher.pdf
• Pena, L. J. (2014). A Markovian model for association football possessions and its outcomes. Retrieved 1st March 2016 from http://arxiv.org/pdf/1403.7993v1.pdf
• Percy, D. F. (2015). Strategy selection and outcome prediction in sport using dynamic learning for stochastic processes. Journal of the Operational Research Society. Vol. 66, 1840-1849. Retrieved 1st March 2016 from
• Pollard, R. (2008). Home Advantage in Football: A Current Review of the Unsolved Puzzle. The Open Sports Sciences Journal. Vol. 1, 12-24. Retrieved 18th May 2016 from
• The Economist. (2014). The World’s Game, Not England’s. Retrieved 16th May 2016 from http://www.economist.com/news/britain/21601540-premier-league-football-clubs-are-destroying-their-