# MODELLING FOOTBALL DATA

1. 1. MODELLING FOOTBALL DATA Shavajai Quentin Franz
2. 2. BACKGROUND AND MOTIVATION • The motivation for this project was to apply mathematical, statistical and actuarial modelling techniques to professional sports. • Our interest is to provide information on how such modelling techniques can be used to gain insights into sports dynamics.
3. 3. OBJECTIVES The main aim of this work is to build models for football data, in particular the scoring rate. The specific objectives that we set out to achieve include: • Examining the suitability of the simple Poisson model to our data. • Incorporating the time heterogeneity of the simple Poisson parameter. • Exploring the effect of time on the scoring rate.
4. 4. DATA • Data was retrieved from rsssf.com. • The data used in this project spans 7 English Premier League seasons, from 1996-1997 to 2002-2003. • There were 380 × 7 = 2660 matches over the 7 seasons with 7001 goals in total, of which 4055 were home goals and 2946 were away goals. • Of the 7001 goals, 3119 were scored in the first half and the remaining 3882 in the second half.
5. 5. Goals frequency
6. 6. METHODOLOGY • Consider the Poisson model, as it is conventionally used for count data and its parameter takes only positive values. The same model has been used extensively in previous studies (e.g. Karlis and Ntzoufras, 2003). • Test for correlation between home and away scores. • Fit a Poisson model for the scoring rates in the home case (λh), away case (λa) and total case (λt). • Conduct a goodness of fit test on the simple Poisson model. • Introduce the Poisson Gamma mixture as an improvement on the simple Poisson. • Conduct a goodness of fit test on the mixed Poisson model. • Introduce Poisson regression to model the relationship between scoring rates and time in a football match.
7. 7. FINDINGS Test for Correlation (R Test)
   • $H_0: \rho = 0$ versus $H_1: \rho \neq 0$
   • $r = \dfrac{S_{HA}}{\sqrt{S_{AA} \times S_{HH}}} = -0.02067$
   • Test statistic: $t = r\sqrt{\dfrac{n-2}{1-r^2}} = -1.06593 \sim t_{2658}$, with degrees of freedom $n - 2 = 2658$
   • P value = 0.28646 (two-sided)
   • The coefficient of correlation was found to be -0.02067. The test statistic is t = -1.06593 and the corresponding p value is 0.28646, so we do not reject H0. This implies that the correlation coefficient is not significantly different from zero.
   • Because the correlation coefficient was found not to be significantly different from zero, we can proceed by treating the home goals and away goals as independent Poisson events and test whether a Poisson model is reasonable.
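The test statistic above can be reproduced from r and n alone. The following is a minimal sketch, not the authors' own calculation: the function name `correlation_t_test` is hypothetical, and the p value uses the standard normal as an approximation to the t distribution, which is assumed to be adequate here because there are 2658 degrees of freedom.

```python
import math

def correlation_t_test(r, n):
    """Return (t, p) for H0: rho = 0, given sample correlation r and sample size n.

    Assumption: the two-sided p value uses the standard normal in place of
    the t distribution with n - 2 degrees of freedom (reasonable for large n).
    """
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    # Two-sided tail area under the standard normal: 2 * (1 - Phi(|t|)).
    p = math.erfc(abs(t) / math.sqrt(2.0))
    return t, p

# Values quoted on the slide: r = -0.02067 over n = 2660 matches.
t_stat, p_value = correlation_t_test(-0.02067, 2660)
print(round(t_stat, 4), round(p_value, 4))
```

With the slide's inputs the statistic comes out near -1.066 and the two-sided p value near 0.286, well above the usual 5% threshold.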
8. 8. Table 4 Using the formulas below we obtained the following values
9. 9. The Poisson Model
   • For a Poisson distribution $Y \sim \mathrm{Po}(\lambda)$, with $Y = 0, 1, 2, 3, \dots$:
   • $E(Y \mid \lambda) = \lambda$
   • $\mathrm{Var}(Y \mid \lambda) = \lambda$
   • However, the expectation of Y can be derived from the Tower Law:
   • $E(Y) = E(E(Y \mid \lambda)) = E(\lambda)$ (1)
   • The variance can be calculated in the following manner. This is known as the law of total variance or variance decomposition formula:
   • $\mathrm{Var}(Y) = E(\mathrm{Var}(Y \mid \lambda)) + \mathrm{Var}(E(Y \mid \lambda))$ (2)
   • Using $E(Y \mid \lambda) = \lambda$ and $\mathrm{Var}(Y \mid \lambda) = \lambda$ for a Poisson model, this simplifies to
   • $\mathrm{Var}(Y) = E(\lambda) + \mathrm{Var}(\lambda)$
   • Thus the excess variance, $\mathrm{Var}(Y) - E(Y)$, is synonymous with the variance of the parameter λ. Whenever the variance of the goal counts exceeds their mean, $\mathrm{Var}(\lambda) > 0$.
   • Therefore, we observe heterogeneity in the scoring rate λ.
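The decomposition Var(Y) = E(λ) + Var(λ) can be checked numerically on a toy case. The sketch below is an illustration only: the two rates (1.0 and 2.0, equally likely) are hypothetical and do not come from the football data, and `mixture_moments` is a helper name introduced here.

```python
import math

def poisson_pmf(y, lam):
    """P(Y = y) for a Poisson variable with rate lam."""
    return math.exp(-lam) * lam ** y / math.factorial(y)

def mixture_moments(rates, weights, y_max=60):
    """Mean and variance of Y when the rate is drawn from a discrete mixing law.

    The support is truncated at y_max, which is assumed to be large enough
    for the tail mass to be negligible at the rates used here.
    """
    pmf = [sum(w * poisson_pmf(y, lam) for lam, w in zip(rates, weights))
           for y in range(y_max + 1)]
    mean = sum(y * p for y, p in enumerate(pmf))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(pmf))
    return mean, var

# Hypothetical mixing law: rate 1.0 or 2.0 with equal probability,
# so E(lambda) = 1.5 and Var(lambda) = 0.25.
mean_y, var_y = mixture_moments([1.0, 2.0], [0.5, 0.5])
print(mean_y, var_y)
```

The mean matches E(λ) = 1.5 and the variance is E(λ) + Var(λ) = 1.75, so the excess variance of 0.25 is exactly the variance of the rate.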
10. 10. What next?
   • We observed heterogeneity in the scoring rate λ. This suggests that λ is a random variable. We need a model that incorporates λ as a random variable.
   • We now consider a Poisson mixture.
   • Since λ > 0, it is appropriate to take a Gamma distribution as the mixing distribution, as it is also defined for positive values and is flexible enough to fit a wide range of shapes.
   • It also has the advantage of having a moment generating function that is easy to compute, which will be used later in the study.
   • Some other distributions which cover the same state space (λ > 0) do not have easily computable moment generating functions.
   • The Gamma distribution is a two-parameter distribution; the parameters are usually termed α, the shape parameter, and β, which in the parameterisation used here is the rate (inverse scale) parameter. The pdf of a gamma distribution is given by
   • $f_\lambda(\lambda) = \dfrac{\beta^{\alpha} \lambda^{\alpha-1} \exp(-\beta\lambda)}{\Gamma(\alpha)}$
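11. 11. Poisson Gamma Model
   • This may at first seem difficult, as the model assumes that the scoring rates (home, away and total) are drawn randomly from a gamma distribution, so λ is a random variable and not fixed. However, we can integrate out λ over its whole range in the gamma distribution and hence derive a probability function for X in terms of α and β. This is illustrated below:
   • $f_\lambda(\lambda) = \dfrac{\beta^{\alpha} \lambda^{\alpha-1} \exp(-\beta\lambda)}{\Gamma(\alpha)}$
   • $P(X = x \mid \lambda) = \dfrac{\lambda^{x} \exp(-\lambda)}{x!}$
   • $P(X = x) = \displaystyle\int_0^\infty P(X = x \mid \lambda) \times f_\lambda(\lambda)\, d\lambda$
   • $= \displaystyle\int_0^\infty \dfrac{\lambda^{x} \exp(-\lambda)}{x!} \times \dfrac{\beta^{\alpha} \lambda^{\alpha-1} \exp(-\beta\lambda)}{\Gamma(\alpha)}\, d\lambda$
   • $= \dfrac{\beta^{\alpha}}{\Gamma(\alpha) \times x!} \times \displaystyle\int_0^\infty \lambda^{\alpha+x-1} \exp(-(\beta+1)\lambda)\, d\lambda$
   • $= \dfrac{\beta^{\alpha}}{(\beta+1)^{\alpha+x}} \times \dfrac{\Gamma(\alpha+x)}{\Gamma(\alpha)} \times \dfrac{1}{x!}$
   • The last step uses the gamma integral $\displaystyle\int_0^\infty \lambda^{\alpha+x-1} \exp(-(\beta+1)\lambda)\, d\lambda = \dfrac{\Gamma(\alpha+x)}{(\beta+1)^{\alpha+x}}$.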
12. 12. Poisson gamma probabilities
   • Hence we have a smart way of determining probabilities for the number of home goals, away goals and total goals. The minor problem here would be evaluating non-integral gamma functions. This can easily be overcome by using the recurrence formula
   • $P(X = 0) = \dfrac{\beta^{\alpha}}{(\beta+1)^{\alpha}}$
   • and $P(X = x) = P(X = x-1) \times \dfrac{\alpha+x-1}{(\beta+1) \times x}$
   • Using the excess variance for all three goal distributions, it is possible to fit appropriate gamma distributions by the method of moments, since $E(\lambda) = \alpha/\beta$ and $\mathrm{Var}(\lambda) = \alpha/\beta^2$, and using $E(X) = E(\lambda)$ and excess variance $= \mathrm{Var}(X) - E(X) = \mathrm{Var}(\lambda)$
   • Hence $\beta = \dfrac{\alpha/\beta}{\alpha/\beta^2} = \dfrac{E(X)}{\mathrm{Var}(X) - E(X)}$ and $\alpha = E(X) \times \beta$
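The method-of-moments fit and the recurrence can be sketched in a few lines. This is a hedged illustration rather than the authors' code: the function names are introduced here, and the sample mean and variance (1.5 and 1.8) are hypothetical placeholders, since the deck's fitted values are not reproduced in the transcript.

```python
def fit_gamma_by_moments(mean, var):
    """Method-of-moments estimates (alpha, beta) for the gamma mixing distribution.

    Uses beta = E(X) / (Var(X) - E(X)) and alpha = E(X) * beta, which
    requires the variance to exceed the mean (positive excess variance).
    """
    if var <= mean:
        raise ValueError("no excess variance: variance must exceed the mean")
    beta = mean / (var - mean)
    alpha = mean * beta
    return alpha, beta

def poisson_gamma_pmf(alpha, beta, x_max):
    """P(X = 0), ..., P(X = x_max) via the recurrence, avoiding gamma functions."""
    probs = [(beta / (beta + 1.0)) ** alpha]
    for x in range(1, x_max + 1):
        probs.append(probs[-1] * (alpha + x - 1.0) / ((beta + 1.0) * x))
    return probs

# Hypothetical sample moments, for illustration only.
alpha_hat, beta_hat = fit_gamma_by_moments(mean=1.5, var=1.8)
pmf = poisson_gamma_pmf(alpha_hat, beta_hat, x_max=60)
fitted_mean = sum(x * p for x, p in enumerate(pmf))
print(alpha_hat, beta_hat, sum(pmf), fitted_mean)
```

Multiplying each probability by the number of matches would give the expected frequencies needed for a goodness of fit test. The fitted probabilities sum to one and reproduce the input mean, which is a quick sanity check on the recurrence.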
13. 13. Hypothesis testing • H0: Goals (i) follow a Poisson distribution, i = home, away, total • H1: Goals (i) do not follow a Poisson distribution, i = home, away, total • And • H2: Goals (i) follow a Poisson distribution with rates drawn from a Gamma distribution, i = home, away, total • H3: Goals (i) do not follow a Poisson distribution with rates drawn from a Gamma distribution, i = home, away, total • To test the hypotheses above, a Chi-square goodness of fit test was carried out. The test results are given below:
14. 14. Simple Poisson goodness of fit test
   Clearly the simple Poisson models, with an assumed constant rate for all games, are not good models, with p values all substantially below 1%. Hence H0 can be rejected in each case. However, this is hardly surprising, as the rates are extremely unlikely to be the same for each home and away side in each match.

   Table 4: Goodness of fit test for the simple Poisson model

   | | Σ(O−E)²/E | Classes | Df lost for total | Parameters estimated | Total df | p value |
   |---|---|---|---|---|---|---|
   | Home | 21.9963 | 7 | 1 | 1 | 5 | 0.00052 |
   | Away | 16.7955 | 6 | 1 | 1 | 4 | 0.00212 |
   | Total | 21.5212 | 9 | 1 | 1 | 7 | 0.00589 |
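A p value in a table of this kind follows from the test statistic and the degrees of freedom (classes, minus one for the fixed total, minus the number of estimated parameters). The sketch below shows one way to compute it with the standard library only; `chi2_sf` and `chi2_statistic` are names introduced here, and the series expansion for the incomplete gamma function is an implementation choice, not something taken from the deck.

```python
import math

def chi2_statistic(observed, expected):
    """Pearson statistic: sum of (O - E)^2 / E over the classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with df degrees of freedom.

    Computed as 1 - P(a, z) with a = df / 2 and z = x / 2, where the
    regularised lower incomplete gamma function P is summed as a power series.
    """
    a, z = df / 2.0, x / 2.0
    if z <= 0.0:
        return 1.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return 1.0 - lower

# Statistics and degrees of freedom quoted for the home and away rows.
home_p = chi2_sf(21.9963, 5)
away_p = chi2_sf(16.7955, 4)
print(round(home_p, 5), round(away_p, 5))
```

For the home and away rows this gives p values of roughly 0.0005 and 0.0021, both far below 1%.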
15. 15. Poisson gamma goodness of fit test
   For home and away goals the p values are large (0.4025 and 0.25054), indicating that we do not reject H2, i.e. the Poisson Gamma model is a good fit. Here the Poisson Gamma model fits better than the simple Poisson model. The same conclusion cannot be drawn for the total goals, as the p value (0.00829) is still below 1%.

   Table 5: Goodness of fit test for the Poisson Gamma model

   | | Σ(O−E)²/E | Classes | Df lost for total | Parameters estimated | Total df | p value |
   |---|---|---|---|---|---|---|
   | Home | 4.02597 | 7 | 1 | 2 | 4 | 0.4025 |
   | Away | 4.10313 | 6 | 1 | 2 | 3 | 0.25054 |
   | Total | 18.9681 | 9 | 1 | 2 | 6 | 0.00829 |
16. 16. Poisson Regression. Chart 1: Total Goals vs. Time (x-axis: time in minutes; y-axis: goals)
17. 17. Chart 2: Number of Home Goals vs. Time (x-axis: time in minutes; y-axis: goals)
18. 18. Chart 3: Number of Away Goals vs. Time (x-axis: time in minutes; y-axis: goals)
19. 19. • As is clear from the scatter plots above, markedly fewer goals are scored in the 1st and 46th minutes, and the scoring rate spikes in the 45th and 90th minutes. • The spikes could be due to the added time that is recorded against these two minutes. We define the total scoring rate as the total goals scored in a given minute. In this case the scoring rates between the 2nd and 44th minutes and between the 47th and 89th minutes increase steadily with time. • To construct a model, it is reasonable to divide the 90-minute interval to reflect this systematic pattern in the total scoring rates with respect to time. • We need to obtain the 1st, 45th, 46th and 90th minute rates explicitly, and then derive an overall rate for each of the intervals [2,44] and [47,89] inclusive. • An argument can be made that a linear model can be constructed to capture the trend in the data. However, a linear regression would assign equal weight to each minute (the weight being the exposure in each minute, which is 2660 match-minutes); since the rate increases with time over the two main intervals, this is not a reasonable assumption. • The rate in these intervals can reasonably be approximated by a model with an associated Poisson distribution. The explicit rates for the 1st, 45th, 46th and 90th minutes each span only a single minute, so they can be derived simply as the total goals scored in that minute divided by the exposure, which is 2660 in each case. In addition, to account for the trend in the rates over the 'Poisson intervals', Poisson regression seems ideal.
20. 20. The Poisson regression is a generalized linear model under the Poisson family of distributions and is expressed as follows: $\ln \mu_t = \alpha + \beta t$, where $\mu_t$ denotes the scoring rate for a given minute $t$.
   For the total goals case, the explicitly derived rates, denoted $\mu_0(t)$, are obtained as follows:
   $\mu_0(t) = \dfrac{\text{total goals in the } t\text{th minute}}{2660}$
   For instance, the rates for the total case are as follows:
   • $\mu_0(t) = 0.0147$ for $0 < t < 1$
   • $\mu_0(t) = 0.0650$ for $44 < t < 45$
   • $\mu_0(t) = 0.018797$ for $45 < t < 46$
   • $\mu_0(t) = 0.1140$ for $89 < t < 90$
   The Poisson regression model is $\mu_t = \exp(\alpha + \beta t)$. The parameters α and β were estimated using R and the following results were obtained:
21. 21. Table 6: Estimated parameters α and β

   | Parameter | Home case | Away case | Total case |
   |---|---|---|---|
   | First half α | 3.653472 | 3.069511 | 4.09877 |
   | First half β | 0.00188 | 0.009956 | 0.005133 |
   | Second half α | 3.312263 | 3.451026 | 4.291641 |
   | Second half β | 0.007898 | 0.001473 | 0.005163 |

   To establish whether the Poisson regression model fits the data for the first and second half scores, a Chi-square goodness of fit test was carried out as shown below:
   • H0 (i): The data of goals (i) adheres to the Poisson regression model
   • H1 (i): The data of goals (i) does not adhere to the Poisson regression model
   where i = home, away, total

   | p value | Home | Away | Total |
   |---|---|---|---|
   | First half | 0.012314 | 0.00035 | 0.137095 |
   | Second half | 1.13E-07 | 5.90E-06 | 0.384263 |
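A log-linear Poisson model of the form μ_t = exp(α + βt) can be fitted by maximum likelihood with a few Newton-Raphson steps. The sketch below is a Python stand-in for illustration, not the R fit used in the study. The function name is hypothetical, and because the per-minute goal counts are not in the transcript, the input is a noise-free series generated from the first-half total-case estimates in Table 6 over minutes 2 to 44, purely to check that the fit recovers them.

```python
import math

def fit_poisson_loglinear(t, y, max_iter=50, tol=1e-12):
    """Maximum-likelihood fit of mu_t = exp(alpha + beta * t) by Newton-Raphson."""
    alpha = math.log(sum(y) / len(y))  # start from a constant-rate model
    beta = 0.0
    for _ in range(max_iter):
        mu = [math.exp(alpha + beta * ti) for ti in t]
        # Score vector (gradient of the Poisson log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(ti * (yi - mi) for ti, yi, mi in zip(t, y, mu))
        # Fisher information matrix entries.
        h00 = sum(mu)
        h01 = sum(ti * mi for ti, mi in zip(t, mu))
        h11 = sum(ti * ti * mi for ti, mi in zip(t, mu))
        det = h00 * h11 - h01 * h01
        d_alpha = (h11 * g0 - h01 * g1) / det
        d_beta = (h00 * g1 - h01 * g0) / det
        alpha += d_alpha
        beta += d_beta
        if abs(d_alpha) + abs(d_beta) < tol:
            break
    return alpha, beta

# Illustrative input only: expected goals per minute implied by the
# first-half total-case estimates, without sampling noise.
minutes = list(range(2, 45))
goals = [math.exp(4.09877 + 0.005133 * m) for m in minutes]
alpha_fit, beta_fit = fit_poisson_loglinear(minutes, goals)
print(alpha_fit, beta_fit)
```

On this synthetic input the fit returns the generating values of α and β. With real per-minute counts the estimates would carry sampling error, and the fitted means would feed the goodness of fit test.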
22. 22. Discussion (Poisson Regression) • According to the results of the test above, the p values for the total number of goals in both the first and second half (0.137 and 0.384) are well above the usual significance levels. However, the p values for the home and away goals are close to zero. • We hence do not reject the hypothesis that, in the total case, the data adheres to the Poisson regression model. Nonetheless, we reject the hypothesis H0 (i) where (i) = home goals or away goals. • Given this, we may conclude that the Poisson regression is a good fit for the total number of goals. However, the Poisson regression is not a good fit for the home and away scores, given the very low p values. • Although the Poisson regression has low p values for the home and away scores, it addresses the flaws in the Poisson Gamma model and, by extension, the simple Poisson model. • The Poisson regression model does not share the flaws that we observed in the Poisson model. For this reason, we believe that it is a better fit for our data. • Therefore, given that it avoids the limitations of both the Poisson and the Poisson Gamma models, we find that the Poisson regression model is the most suitable for capturing the scoring rate in football matches.
23. 23. Limitations • A major limitation of our project is that the Poisson regression model has been staggered to cover the different time intervals where we observed patterns. • A single model that captures the entire 90 minutes of a football match would be preferable. • Similarly, other factors that affect goal scoring were not incorporated in the study due to time and other constraints. These include team strengths, injured players, home advantage, refereeing decisions, coaching techniques and retaliation, amongst others. Further research is needed here.
24. 24. REFERENCES
   • Baker, R. D., and McHale, I. G. (2015). Time varying ratings in association football: the all-time greatest team is… Journal of the Royal Statistical Society: Series A (Statistics in Society). Vol. 178 (2), 481-492. Retrieved 1st March 2016 from http://onlinelibrary.wiley.com/doi/10.1111/rssa.12060/full
   • Karlis, D., and Ntzoufras, J. (2003). Bayesian and Non-Bayesian Analysis of Soccer Data Using Bivariate Poisson Regression Models. Retrieved 16th May 2016 from http://www.stat-athens.aueb.gr/~karlis/Bivariate%20Poisson%20Regression.pdf
   • Maher, M. J. (1982). Modeling Association Football Scores. Statistica Neerlandica. Vol. 36 (3), 109-118. Retrieved 18th May 2016 from http://www.90minut.pl/misc/maher.pdf
   • Pena, L. J. (2014). A Markovian Model for Association Football Possessions and Its Outcomes. Retrieved 1st March 2016 from http://arxiv.org/pdf/1403.7993v1.pdf
   • Percy, D. F. (2015). Strategy selection and outcome prediction in sport using dynamic learning for stochastic processes. Journal of the Operational Research Society. Vol. 66, 1840-1849. Retrieved 1st March 2016 from http://www.palgrave-journals.com/jors/journal/v66/n11/pdf/jors2014137a.pdf
   • Pollard, R. (2008). Home Advantage in Football: A Current Review of the Unsolved Puzzle. The Open Sports Sciences Journal. Vol. 1, 12-24. Retrieved 18th May 2016 from http://benthamopen.com/contents/pdf/TOSSJ/TOSSJ-1-12.pdf
   • The Economist. (2014). The World's Game, Not England's. Retrieved 16th May 2016 from http://www.economist.com/news/britain/21601540-premier-league-football-clubs-are-destroying-their-roots-they-grow-worlds-game-not