The Data Behind Football
An analysis by Apostolos Mourouzis
11 December 2016
Introduction
Sport has always been considered a very emotional endeavour, where instinct and intuition take precedence over
fact. With the recent boom in data availability there are finally resources that enable an objective analysis of
the game. In this project we have chosen one particular dataset and use it to try to derive relationships and
correlations that provide a different perspective on the world of football.
Our data has been obtained from Football-Data, which provides the results of all football games in the top
divisions across Europe.
Our data set also includes betting odds for a range of betting houses. Let’s take a first glimpse at our data
set (betting odds have been filtered out for the sake of clarity):
## Observations: 380
## Variables: 23
## $ Div <fctr> E0, E0, E0, E0, E0, E0, E0, E0, E0, E0, E0, E0, E0, ...
## $ Date <fctr> 08/08/15, 08/08/15, 08/08/15, 08/08/15, 08/08/15, 08...
## $ HomeTeam <fctr> Bournemouth, Chelsea, Everton, Leicester, Man United...
## $ AwayTeam <fctr> Aston Villa, Swansea, Watford, Sunderland, Tottenham...
## $ FTHG <int> 0, 2, 2, 4, 1, 1, 0, 2, 0, 0, 0, 0, 1, 2, 2, 0, 1, 1,...
## $ FTAG <int> 1, 2, 2, 2, 0, 3, 2, 2, 1, 3, 1, 3, 3, 0, 2, 0, 2, 2,...
## $ FTR <fctr> A, D, D, H, H, A, A, D, A, A, A, A, A, H, D, D, A, A...
## $ HTHG <int> 0, 2, 0, 3, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 2, 0, 0, 1,...
## $ HTAG <int> 0, 1, 1, 0, 0, 1, 1, 1, 0, 2, 1, 2, 2, 0, 0, 0, 2, 1,...
## $ HTR <fctr> D, H, A, H, H, A, A, D, D, A, A, A, A, H, H, D, A, D...
## $ Referee <fctr> M Clattenburg, M Oliver, M Jones, L Mason, J Moss, S...
## $ HS <int> 11, 11, 10, 19, 9, 17, 22, 9, 7, 9, 5, 17, 6, 19, 13,...
## $ AS <int> 7, 18, 11, 10, 9, 11, 8, 15, 8, 19, 9, 10, 19, 4, 16,...
## $ HST <int> 2, 3, 5, 8, 1, 6, 6, 4, 1, 2, 1, 4, 2, 6, 7, 5, 3, 4,...
## $ AST <int> 3, 10, 5, 5, 4, 7, 4, 5, 3, 7, 2, 4, 6, 2, 7, 0, 6, 7...
## $ HF <int> 13, 15, 7, 13, 12, 14, 12, 9, 9, 12, 14, 11, 7, 11, 1...
## $ AF <int> 13, 16, 13, 17, 12, 20, 9, 12, 16, 9, 10, 10, 7, 8, 1...
## $ HC <int> 6, 4, 8, 6, 1, 1, 5, 6, 3, 6, 3, 9, 6, 4, 4, 2, 8, 6,...
## $ AC <int> 3, 8, 2, 3, 2, 4, 4, 6, 5, 6, 5, 9, 6, 4, 3, 4, 4, 6,...
## $ HY <int> 3, 1, 1, 2, 2, 1, 1, 2, 2, 4, 2, 4, 1, 2, 2, 1, 1, 1,...
## $ AY <int> 4, 3, 2, 4, 3, 0, 3, 4, 4, 1, 2, 2, 2, 1, 2, 2, 3, 1,...
## $ HR <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ AR <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
There are two things we can immediately see:
• The column names are not very descriptive.
• We have a lot of integer variables, which are useful for regression analysis.
Regarding the first observation, a guide to the column headings is available here or in the appendix. Interpreting
these, we can see that the data available is relatively high level for a football game, encompassing goals, shots,
fouls, corners and cards.
This study is partitioned into two independent investigations, each answering a different question about the
football world.
Investigations
Which is the Dirtiest League?
A topic of much debate for football fans, we can finally do some statistical analysis to find which of the big
leagues is actually the dirtiest. In order to investigate this we have several variables that are applicable:
• Fouls committed in a game
• Yellow cards given in a game
• Red cards given in a game
As red cards are signs of a larger infringement than yellow cards, we will weight them accordingly using the
below formula:
CardScore = 5 × Yellow + 10 × Red
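As a minimal sketch of how the card score could be computed for a single match, the snippet below applies the weights from the formula to the card columns of the dataset. Summing the home and away cards into one per-match score is an assumption (the report does not say how the two teams are combined), and the function name `card_score` is purely illustrative.

```python
# Weights taken from the CardScore formula above.
YELLOW_WEIGHT = 5
RED_WEIGHT = 10


def card_score(home_yellows, away_yellows, home_reds, away_reds):
    """Card score for one match.

    Assumption: the cards of both teams are added together, so the
    score describes the match as a whole rather than a single team.
    """
    yellows = home_yellows + away_yellows
    reds = home_reds + away_reds
    return YELLOW_WEIGHT * yellows + RED_WEIGHT * reds


# Second row of the data preview (Chelsea v Swansea): HY=1, AY=3, HR=1, AR=0.
chelsea_swansea = card_score(home_yellows=1, away_yellows=3, home_reds=1, away_reds=0)
print(chelsea_swansea)
```

Averaging this value over all matches in a league would give an "Average Card Score" of the kind compared across leagues.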
So let’s investigate the fluctuations in fouls and cards across the different leagues in 2015:
[Figure: Average Fouls Committed and Average Card Score by League (Bundesliga, La Liga, Premier, Serie A)]
As expected, there are variations between the leagues. Let's look at some insights we can glean from the chart above:
• As many football fans would expect, the Italian Serie A seems to be the dirtiest - at first glance at least!
• The Bundesliga, although high in fouls committed, seems to receive fewer cards per game. This could
mean that the fouls committed are mostly tactical.
• The Premier League seems to be the ‘cleanest’ league, with the fewest fouls and cards per game.
• Although La Liga has fewer fouls on average than most of the other leagues, it has the highest card score.
When someone in La Liga decides to foul, he fouls hard!
In order to check whether there is a real difference between the leagues, we will perform an analysis of
variance (ANOVA). Our null hypothesis is that there are no differences in mean fouls or mean card score between
the leagues.
The ANOVA test for Fouls committed:
## Df Sum Sq Mean Sq F value Pr(>F)
## Div 3 16335 5445 144.4 <2e-16 ***
## Residuals 1262 47603 38
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
And the ANOVA test for the Card Score:
## Df Sum Sq Mean Sq F value Pr(>F)
## Div 3 34785 11595 90.99 <2e-16 ***
## Residuals 1262 160817 127
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-values for both ANOVAs are below the 0.05 threshold, which means we can reject our null hypothesis
and conclude that there is a statistically significant difference in both fouls and card score between the leagues.
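For readers unfamiliar with the ANOVA tables, the F value is simply the ratio of the between-league mean square to the residual mean square, where each mean square is a sum of squares divided by its degrees of freedom. The sketch below recomputes both F values from the `Sum Sq` and `Df` columns printed above; the function name is illustrative, and small differences from the printed values are expected because the table entries are rounded.

```python
def f_statistic(sum_sq_between, df_between, sum_sq_within, df_within):
    """One-way ANOVA F statistic from sums of squares and degrees of freedom."""
    mean_sq_between = sum_sq_between / df_between
    mean_sq_within = sum_sq_within / df_within
    return mean_sq_between / mean_sq_within


# Values copied from the two ANOVA tables (Div row and Residuals row).
fouls_f = f_statistic(16335, 3, 47603, 1262)
cards_f = f_statistic(34785, 3, 160817, 1262)

print(round(fouls_f, 2), round(cards_f, 2))
```

An F value this far above 1 means the variation between league means is much larger than the variation within leagues, which is what drives the very small p-values.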
However, this conclusion is only applicable for the 2015/2016 season. Let’s investigate whether this is
consistent across the last 5 years.
A two-way ANOVA of fouls across the leagues in the last 5 years is performed below:
## Df Sum Sq Mean Sq F value Pr(>F)
## Div 3 103962 34654 903.109 < 2e-16 ***
## Year 4 704 176 4.589 0.00106 **
## Div:Year 12 2829 236 6.144 7.37e-11 ***
## Residuals 7029 269716 38
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness
The two-way ANOVA indicates that the leagues consistently differ significantly from one another, but that the
year itself also affects the number of fouls committed. Although this is slightly counterintuitive at first,
there is a significant difference in the means between 2011 (27.9 fouls a game) and 2015 (26.6 fouls a game).
Furthermore, there is an interaction effect between league and year, which means that the change in the number
of fouls committed from year to year differs between leagues.
Although we now know that there are differences in the ‘dirtiness’ of the leagues, we can’t yet say with confidence
what the order is. To do so, we can perform a Tukey analysis to see which pairs of leagues differ significantly.
[Figure: Tukey Analysis on Card Score, 95% family-wise confidence level. Pairwise comparisons shown: Serie A-Premier, Serie A-La Liga, Premier-La Liga, Serie A-Bundesliga, Premier-Bundesliga, La Liga-Bundesliga]
[Figure: Tukey Analysis on Fouls, 95% family-wise confidence level. Pairwise comparisons shown: Serie A-Premier, Serie A-La Liga, Premier-La Liga, Serie A-Bundesliga, Premier-Bundesliga, La Liga-Bundesliga]
The derived rankings for Fouls and Card Score are:
Rank  Fouls            Cards
1     Serie A          La Liga
2     Bundesliga       Serie A
3     La Liga          Bundesliga
4     Premier League   Premier League
We can now assert with confidence that:
Serie A is the dirtiest league, having the highest aggregate ranking when combining fouls and cards - confirming
its age-old reputation. The Premier League is the least aggressive league, ranking lowest for both variables.
However, this may be a result of referee leniency as opposed to play style.
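The report does not state how the two rankings are combined into an aggregate. One simple reading, sketched below as an assumption, is to add each league's rank for fouls to its rank for cards and order the leagues by that sum (a lower sum meaning a dirtier league). The equal weighting of the two rankings is also an assumption.

```python
# Ranks copied from the rankings above (1 = dirtiest).
fouls_rank = {"Serie A": 1, "Bundesliga": 2, "La Liga": 3, "Premier League": 4}
cards_rank = {"La Liga": 1, "Serie A": 2, "Bundesliga": 3, "Premier League": 4}

# Assumption: both rankings count equally, so the aggregate is a plain rank sum.
rank_sum = {league: fouls_rank[league] + cards_rank[league] for league in fouls_rank}

# Lowest rank sum first, i.e. dirtiest league first.
ordering = sorted(rank_sum, key=rank_sum.get)

for league in ordering:
    print(league, rank_sum[league])
```

Under this reading Serie A comes first and the Premier League last, matching the conclusion in the text.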
A Logistical Victory
Let’s assume that you can obtain all the variables of a specific game - shots, corners, cards, fouls and betting
odds for both teams. Having these, is it possible to accurately predict whether the home team has won?
In order to answer this question, we will perform a logistic regression using all these variables as input. Due
to the different nature of the European leagues, we will restrict this investigation to the Premier League to remove
variations based on play styles. The binary output of this regression will be 1 if the home team won and 0 if
it did not.
Let’s see what the distribution of home wins is before we start the regression:
[Figure: histogram of HomeWin (x-axis) against count (y-axis)]
We can see that the two outcomes are not equally frequent, so we would expect the predictions of the logistic
regression we perform below to follow a similar distribution.
As we have several years' worth of data, there is no need to partition one data set into training and test sets. We
will use one year to fit the regression (in this particular case the 2015 season) and then test it on a
previous year’s data (the 2014 season).
First, let’s attempt a regression using all the variables:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.31979143 1.47614950 0.8940771 3.712807e-01
## WHH -0.33415258 0.23461233 -1.4242754 1.543668e-01
## WHD 0.29607628 0.63993231 0.4626681 6.436023e-01
## WHA 0.15754079 0.18260871 0.8627233 3.882896e-01
## AR 1.55774944 0.46699806 3.3356657 8.509541e-04
## HR -0.75961151 0.56302408 -1.3491634 1.772845e-01
## HC -0.24655953 0.05295213 -4.6562721 3.219864e-06
## AC 0.06158437 0.06253096 0.9848621 3.246918e-01
## HS -0.12670429 0.03985025 -3.1795102 1.475242e-03
## AS -0.05803611 0.04335286 -1.3386918 1.806710e-01
## HST 0.47069370 0.07674571 6.1331599 8.615050e-10
## AST -0.29635324 0.08560794 -3.4617495 5.366764e-04
## HY -0.04661090 0.12225064 -0.3812732 7.030005e-01
## AY -0.04089751 0.10651777 -0.3839501 7.010154e-01
Our first attempt has identified many variables that are not significant. We will remove these one by one (starting
with the highest p-value) to see how the model changes with each variable removed. Another interesting
observation is that there is no apparent relationship between the betting odds for the game and the result.
This is because the odds are distributed unevenly, with the majority close to 1, similar to a 1/x function. To
linearize them, we simply take the reciprocal of the odds. The charts of the original and linearized odds are
shown below:
[Figure: William Hill Home Odds against William Hill Away Odds, in two panels: Original and Linear]
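The linearization described above amounts to replacing each odds value with its reciprocal, which is what the `I(1/WHH)` term in the later coefficient table does. A minimal sketch is given below; the sample odds are hypothetical values chosen for illustration and are not taken from the dataset.

```python
def reciprocal_odds(decimal_odds):
    """Return 1/odds for a decimal betting odds value.

    For decimal odds this is roughly the bookmaker's implied probability
    of the outcome (ignoring the bookmaker's margin).
    """
    if decimal_odds <= 0:
        raise ValueError("decimal odds must be positive")
    return 1.0 / decimal_odds


# Hypothetical home-win odds, from strong favourite to outsider.
sample_home_odds = [1.25, 2.0, 4.0, 8.0]
transformed = [reciprocal_odds(odds) for odds in sample_home_odds]

for odds, value in zip(sample_home_odds, transformed):
    print(odds, value)
```

After the transformation, the long tail of large odds is compressed into values near 0 and the crowded region of short odds is spread out, which gives a linear model something more even to work with.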
Removing the highest p-value variables one by one until all remaining inputs are significant leaves the
regression coefficients shown below. The variables were removed in the following order:
Home Yellows, Away Yellows, Draw Odds (the Win Odds suddenly became significant at this stage), Lose Odds,
Away Corners, Away Shots, Home Reds
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8155647 0.53610105 -1.521289 1.281873e-01
## I(1/WHH) 3.7902130 0.89015019 4.257948 2.063122e-05
## AR 1.3210913 0.45618447 2.895958 3.780024e-03
## HC -0.2860810 0.04844828 -5.904874 3.529168e-09
## HST 0.3359608 0.05887963 5.705893 1.157349e-08
## AST -0.3508141 0.06979585 -5.026289 5.000619e-07
Testing the variance inflation of the variables reveals that Home Shots (HS) is highly correlated with another
variable. The values for the variables kept in the model are:
## I(1/WHH) AR HC HST AST
## 1.231508 1.012846 1.417507 1.208117 1.084360
This is logical, as Home Shots (HS) are intuitively related to Home Shots on Target (HST). In fact, investigating
the relationship between the two variables shows a 67% correlation. We can therefore remove Home Shots from our
regression model. Our final formula, with coefficients rounded to two decimal places, is:
LogisticRegression = −0.82 + 3.79 × (1 / HomeWinOdds) + 1.32 × AwayReds − 0.29 × HomeCorners + 0.34 ×
HomeShotsOnTarget − 0.35 × AwayShotsOnTarget
And therefore the probability of a home win can be predicted by:
$$\text{MatchOutcome} = \frac{1}{1 + e^{-(-0.82 + 3.79/\text{WHH} + 1.32\,\text{AR} - 0.29\,\text{HC} + 0.34\,\text{HST} - 0.35\,\text{AST})}}$$
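To make the formula concrete, the sketch below evaluates it for a single match using the unrounded coefficients from the final coefficient table, with the odds entering as the reciprocal of the William Hill home odds (the `I(1/WHH)` term). The example match values and the 0.5 classification threshold are assumptions for illustration; the report does not state which threshold was used.

```python
import math

# Coefficients copied from the final coefficient table.
COEF = {
    "intercept": -0.8155647,
    "inv_whh": 3.7902130,   # applies to 1 / WHH
    "AR": 1.3210913,
    "HC": -0.2860810,
    "HST": 0.3359608,
    "AST": -0.3508141,
}


def home_win_probability(whh, ar, hc, hst, ast):
    """Predicted probability of a home win from the fitted logistic model."""
    linear = (
        COEF["intercept"]
        + COEF["inv_whh"] / whh
        + COEF["AR"] * ar
        + COEF["HC"] * hc
        + COEF["HST"] * hst
        + COEF["AST"] * ast
    )
    return 1.0 / (1.0 + math.exp(-linear))


def predict_home_win(whh, ar, hc, hst, ast, threshold=0.5):
    """Classify as 1 (home win) or 0; the 0.5 threshold is an assumption."""
    return int(home_win_probability(whh, ar, hc, hst, ast) >= threshold)


# Hypothetical match: home odds 2.0, no away reds, 5 home corners,
# 6 home shots on target, 3 away shots on target.
p = home_win_probability(whh=2.0, ar=0, hc=5, hst=6, ast=3)
print(round(p, 3), predict_home_win(2.0, 0, 5, 6, 3))
```

The signs behave as the table suggests: more home shots on target or an away red card raise the predicted probability, while more away shots on target lower it.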
In order to evaluate our model, the most important factor is to test the classification correctness. Running
the derived regression formula on our 2014 dataset results in the following ‘predictions’:
##               Predicted Result
## Actual Result    0    1
##             0  168   40
##             1   67  105
Our logistic regression has classified 273 matches correctly out of a total of 380 games. This
gives an accuracy of 71.8%. Testing on several other years gives 75% in 2013, 66% in 2012 and 70% in 2011.
This supports the accuracy of our logistic model, with a mean prediction accuracy of 70.7%.
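The accuracy figures can be reproduced directly from the confusion matrix: the correctly classified matches are the diagonal entries, and the accuracy is their share of all matches. The sketch below recomputes the 2014 accuracy and the mean over the four seasons quoted in the text; the variable names are illustrative.

```python
# Confusion matrix for the 2014 season, keyed by (actual, predicted).
confusion = {
    (0, 0): 168,
    (0, 1): 40,
    (1, 0): 67,
    (1, 1): 105,
}

# Correct predictions are those where actual and predicted agree.
correct = sum(count for (actual, predicted), count in confusion.items() if actual == predicted)
total = sum(confusion.values())
accuracy_2014 = 100.0 * correct / total

# Accuracies (in %) for the other seasons as quoted in the text.
season_accuracy = {2014: accuracy_2014, 2013: 75.0, 2012: 66.0, 2011: 70.0}
mean_accuracy = sum(season_accuracy.values()) / len(season_accuracy)

print(correct, total, round(accuracy_2014, 1), round(mean_accuracy, 1))
```

The same matrix also shows where the model errs: it misses 67 actual home wins but wrongly predicts a home win only 40 times.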
Let’s do one final check by calculating a pseudo R-squared value for the model. The McFadden value ranges
from 0 to 1, with higher values indicating a better fit:
##          llh      llhNull           G2     McFadden         r2ML         r2CU
## -186.6825419 -257.6351796  141.9052754    0.2753996    0.3116342    0.4198202
Although the McFadden value isn’t very high, it is also not close to zero, so we can say that our model does
have some predictive power. Given the consistent accuracy across the several test sets, we conclude that this
model is satisfactory.
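The pseudo R-squared values above all follow from the two log-likelihoods in the output. As a rough sketch, the standard definitions are applied below: McFadden is one minus the ratio of the model log-likelihood to the null log-likelihood, `r2ML` is the maximum-likelihood (Cox and Snell) value, and `r2CU` is the Cragg-Uhler (Nagelkerke) rescaling of it. The sample size of 380 is an assumption based on the number of matches in one season.

```python
import math


def pseudo_r2(llh, llh_null, n):
    """Pseudo R-squared measures from model and null log-likelihoods."""
    g2 = 2.0 * (llh - llh_null)              # likelihood-ratio statistic
    mcfadden = 1.0 - llh / llh_null
    r2_ml = 1.0 - math.exp(-g2 / n)          # maximum likelihood (Cox and Snell)
    r2_cu = r2_ml / (1.0 - math.exp(2.0 * llh_null / n))  # Cragg-Uhler (Nagelkerke)
    return {"G2": g2, "McFadden": mcfadden, "r2ML": r2_ml, "r2CU": r2_cu}


# Log-likelihoods copied from the output above; n = 380 matches is an assumption.
result = pseudo_r2(llh=-186.6825419, llh_null=-257.6351796, n=380)

for name, value in result.items():
    print(name, round(value, 4))
```

Because McFadden compares log-likelihoods rather than explained variance, its values are typically much lower than an ordinary R-squared for a model of similar quality.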
Appendix
Column Name Translations
Column Name   Meaning
FTHG          Full Time Home Goals
FTAG          Full Time Away Goals
FTR           Full Time Result
HTHG          Half Time Home Goals
HTAG          Half Time Away Goals
HTR           Half Time Result
HS            Home Shots
AS            Away Shots
HST           Home Shots on Target
AST           Away Shots on Target
HF            Home Fouls
AF            Away Fouls
HC            Home Corners
AC            Away Corners
HY            Home Yellows
AY            Away Yellows
HR            Home Reds
AR            Away Reds