The Data Behind Football
An analysis by Apostolos Mourouzis
11 December 2016
Introduction
Sport has always been considered a very emotional endeavour, where instinct and intuition take precedence over
fact. With the recent boom in data availability there are finally resources that enable an objective analysis of
the game. In this project we have chosen one particular dataset and use it to try to derive relationships and
correlations that provide a different perspective on the world of football.
Our data has been obtained from Football-Data which gives the results of all football games in the top
divisions across Europe.
Our data set also includes betting odds for a range of betting houses. Let’s take a first glimpse at our data
set (betting odds have been filtered out for the sake of clarity):
## Observations: 380
## Variables: 23
## $ Div <fctr> E0, E0, E0, E0, E0, E0, E0, E0, E0, E0, E0, E0, E0, ...
## $ Date <fctr> 08/08/15, 08/08/15, 08/08/15, 08/08/15, 08/08/15, 08...
## $ HomeTeam <fctr> Bournemouth, Chelsea, Everton, Leicester, Man United...
## $ AwayTeam <fctr> Aston Villa, Swansea, Watford, Sunderland, Tottenham...
## $ FTHG <int> 0, 2, 2, 4, 1, 1, 0, 2, 0, 0, 0, 0, 1, 2, 2, 0, 1, 1,...
## $ FTAG <int> 1, 2, 2, 2, 0, 3, 2, 2, 1, 3, 1, 3, 3, 0, 2, 0, 2, 2,...
## $ FTR <fctr> A, D, D, H, H, A, A, D, A, A, A, A, A, H, D, D, A, A...
## $ HTHG <int> 0, 2, 0, 3, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 2, 0, 0, 1,...
## $ HTAG <int> 0, 1, 1, 0, 0, 1, 1, 1, 0, 2, 1, 2, 2, 0, 0, 0, 2, 1,...
## $ HTR <fctr> D, H, A, H, H, A, A, D, D, A, A, A, A, H, H, D, A, D...
## $ Referee <fctr> M Clattenburg, M Oliver, M Jones, L Mason, J Moss, S...
## $ HS <int> 11, 11, 10, 19, 9, 17, 22, 9, 7, 9, 5, 17, 6, 19, 13,...
## $ AS <int> 7, 18, 11, 10, 9, 11, 8, 15, 8, 19, 9, 10, 19, 4, 16,...
## $ HST <int> 2, 3, 5, 8, 1, 6, 6, 4, 1, 2, 1, 4, 2, 6, 7, 5, 3, 4,...
## $ AST <int> 3, 10, 5, 5, 4, 7, 4, 5, 3, 7, 2, 4, 6, 2, 7, 0, 6, 7...
## $ HF <int> 13, 15, 7, 13, 12, 14, 12, 9, 9, 12, 14, 11, 7, 11, 1...
## $ AF <int> 13, 16, 13, 17, 12, 20, 9, 12, 16, 9, 10, 10, 7, 8, 1...
## $ HC <int> 6, 4, 8, 6, 1, 1, 5, 6, 3, 6, 3, 9, 6, 4, 4, 2, 8, 6,...
## $ AC <int> 3, 8, 2, 3, 2, 4, 4, 6, 5, 6, 5, 9, 6, 4, 3, 4, 4, 6,...
## $ HY <int> 3, 1, 1, 2, 2, 1, 1, 2, 2, 4, 2, 4, 1, 2, 2, 1, 1, 1,...
## $ AY <int> 4, 3, 2, 4, 3, 0, 3, 4, 4, 1, 2, 2, 2, 1, 2, 2, 3, 1,...
## $ HR <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ AR <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
There are two things we can immediately see:
• The column names are not very descriptive.
• We have a lot of integer variables, which are useful for regression analysis.
With regard to the first observation, a guide to the heading titles is available in the appendix. Interpreting
these, we can see that the data available is relatively high level for a football game, encompassing goals, shots,
fouls, corners and cards.
This study is partitioned into two independent investigations, each answering a
different question about the football world.
Investigations
Which is the Dirtiest League?
Which of the big leagues is the dirtiest is a topic of much debate among football fans, and we can finally do some
statistical analysis to find out. To investigate this we have several applicable variables:
• Fouls committed in a game
• Yellow cards given in a game
• Red cards given in a game
As red cards are signs of a larger infringement than yellow cards, we will weight them accordingly using the
below formula:
$$\text{CardScore} = 5 \times \text{Yellow} + 10 \times \text{Red}$$
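The weighting can be sketched in a few lines of Python. The function name `card_score` is hypothetical, and summing the home and away scores into one per-match total is an assumption; the report does not say how the two sides were combined. The card counts are those of the first match in the data glimpse (HY = 3, AY = 4, HR = 0, AR = 0).

```python
def card_score(yellow: int, red: int) -> int:
    """Weighted card score from the formula above: 5 per yellow, 10 per red."""
    return 5 * yellow + 10 * red


# First match in the glimpse: HY=3, AY=4, HR=0, AR=0.
home_score = card_score(yellow=3, red=0)
away_score = card_score(yellow=4, red=0)

# Assumption: the per-match score is the sum over both teams.
match_score = home_score + away_score
print(home_score, away_score, match_score)  # 15 20 35
```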
So let’s investigate the fluctuations in fouls and cards across the different leagues in 2015:
[Figure: average fouls committed per game for each league (Bundesliga, La Liga, Premier, Serie A), with a legend for the average card score.]
As expected, there are variations between the leagues. Let’s look at some insights we can glean from the chart above:
• As many football fans would expect, the Italian Serie A seems to be the dirtiest - at first glance at least!
• The Bundesliga, although high in fouls committed, seems to receive fewer cards per game. This could
mean that the fouls committed are mostly tactical.
• The Premier League seems to be the ‘cleanest’ league, with the fewest fouls and cards per game.
• Although La Liga has fewer fouls on average than most of the other leagues, it has the highest card score.
When someone in La Liga decides to foul, he fouls hard!
In order to check whether there is a real difference between the leagues, we will perform an ANOVA.
Our null hypothesis is that the mean number of fouls (or the mean card score) per game is the same in every
league.
The ANOVA test for Fouls committed:
## Df Sum Sq Mean Sq F value Pr(>F)
## Div 3 16335 5445 144.4 <2e-16 ***
## Residuals 1262 47603 38
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
And the ANOVA test for the Card Score:
## Df Sum Sq Mean Sq F value Pr(>F)
## Div 3 34785 11595 90.99 <2e-16 ***
## Residuals 1262 160817 127
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-values for both ANOVAs are far below the 0.05 threshold, which means we can reject our null hypothesis
and conclude that there is a statistically significant difference in both fouls and card score between the leagues.
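As a sanity check on the two tables, the F value is the ratio of the between-group mean square to the residual mean square, each being a sum of squares divided by its degrees of freedom. The sketch below recomputes it from the printed figures; the function name is hypothetical, and because the printed sums of squares are rounded the results only match to the precision shown.

```python
def f_statistic(ss_between: float, df_between: int,
                ss_within: float, df_within: int) -> float:
    """One-way ANOVA F value: mean square between groups / mean square within."""
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between / ms_within


# Figures copied from the ANOVA tables above (Div row and Residuals row).
f_fouls = f_statistic(16335, 3, 47603, 1262)
f_cards = f_statistic(34785, 3, 160817, 1262)

# Both should agree with the printed F values (144.4 and 90.99) up to rounding.
print(round(f_fouls, 1), round(f_cards, 2))
```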
However, this conclusion is only applicable for the 2015/2016 season. Let’s investigate whether this is
consistent across the last 5 years.
A two-way ANOVA of fouls across the leagues in the last 5 years is performed below:
## Df Sum Sq Mean Sq F value Pr(>F)
## Div 3 103962 34654 903.109 < 2e-16 ***
## Year 4 704 176 4.589 0.00106 **
## Div:Year 12 2829 236 6.144 7.37e-11 ***
## Residuals 7029 269716 38
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness
The resulting two-way ANOVA indicates that the leagues do consistently have significant
differences between them, but the year itself also affects the fouls committed. Although this is initially slightly
counterintuitive, we can see a significant difference in the means between 2011 (27.9 fouls a game) and 2015
(26.6 fouls a game). Furthermore, there is a significant interaction effect between League and Year,
which means that the way the number of fouls committed changes from year to year differs
between leagues.
Now, although we know that there are differences in the ‘dirtiness’ of the leagues, we can’t say with confidence
what the order is. In order to do so we can do a Tukey analysis to see where there are significant differences.
[Figure: Tukey Analysis on Card Score. 95% family-wise confidence intervals for the pairwise differences Serie A−Premier, Serie A−La Liga, Premier−La Liga, Serie A−Bundesliga, Premier−Bundesliga and La Liga−Bundesliga.]
[Figure: Tukey Analysis on Fouls. 95% family-wise confidence intervals for the same six pairwise differences.]
The derived rankings for Fouls and Card Score are:
Rank  Fouls           Card Score
1     Serie A         La Liga
2     Bundesliga      Serie A
3     La Liga         Bundesliga
4     Premier League  Premier League
We can now say with some confidence that:
Serie A is the dirtiest league, having the highest aggregate ranking when combining fouls and cards, which supports the
age-old stereotype. The Premier League is the least aggressive league, ranking lowest on both variables.
However, this may be a result of referee leniency as opposed to play style.
A Logistical Victory
Let’s assume that you can obtain all the variables of a specific game - shots, corners, cards, fouls and betting
odds for both teams. Having these, is it possible to accurately predict whether the home team won?
In order to answer this question, we will perform a logistic regression using all these variables as inputs. Due
to the different nature of the European leagues we will restrict this investigation to the Premier League to remove
variations based on play styles. The binary output for this regression will be 1 if the home team won and 0 if
it did not.
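The binary target can be derived from the FTR (Full Time Result) column. The sketch below is a minimal illustration: the function name `home_win` is hypothetical, and treating draws as 0 alongside away wins is an assumption based on the HomeWin label used in the chart. The sample values are the first 18 FTR entries shown in the data glimpse.

```python
def home_win(ftr: str) -> int:
    """Encode a full-time result ('H', 'D' or 'A') as 1 for a home win, else 0."""
    return 1 if ftr == "H" else 0


# First 18 FTR values from the glimpse of the 2015 Premier League data.
ftr_sample = ["A", "D", "D", "H", "H", "A", "A", "D", "A",
              "A", "A", "A", "A", "H", "D", "D", "A", "A"]

labels = [home_win(result) for result in ftr_sample]
print(sum(labels), len(labels))  # 3 18
```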
Let’s see what the distribution of home wins is before we start the regression:
[Figure: histogram of HomeWin (x-axis, 0 or 1) against count (y-axis).]
We can see that the two outcomes are not equally frequent, so we would expect the logistic regression we perform
below to produce a similarly uneven distribution of predictions.
As we have several years’ worth of data there is no need to partition one data set into training and test sets. We
will use one year to fit the regression (in this particular case the 2015 season) and then test it on a
previous year’s data (the 2014 season).
First, let’s attempt a regression utilising all the variables:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.31979143 1.47614950 0.8940771 3.712807e-01
## WHH -0.33415258 0.23461233 -1.4242754 1.543668e-01
## WHD 0.29607628 0.63993231 0.4626681 6.436023e-01
## WHA 0.15754079 0.18260871 0.8627233 3.882896e-01
## AR 1.55774944 0.46699806 3.3356657 8.509541e-04
## HR -0.75961151 0.56302408 -1.3491634 1.772845e-01
## HC -0.24655953 0.05295213 -4.6562721 3.219864e-06
## AC 0.06158437 0.06253096 0.9848621 3.246918e-01
## HS -0.12670429 0.03985025 -3.1795102 1.475242e-03
## AS -0.05803611 0.04335286 -1.3386918 1.806710e-01
## HST 0.47069370 0.07674571 6.1331599 8.615050e-10
## AST -0.29635324 0.08560794 -3.4617495 5.366764e-04
## HY -0.04661090 0.12225064 -0.3812732 7.030005e-01
## AY -0.04089751 0.10651777 -0.3839501 7.010154e-01
It seems our first attempt has identified many redundant variables, which we will remove one by one (starting
with the highest p-value) to see how the model changes with each variable removed. Another interesting
observation is that there is no apparent relationship between the betting odds for the game and the result.
This is because the odds are distributed unevenly, with the majority closer to 1, similar to a 1/x function. To
linearize them, we simply apply the same operation and take the reciprocal of the odds. The charts of the original
and linearized odds are shown below:
[Figure: William Hill Away Odds against William Hill Home Odds, in two panels: Original and Linear.]
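The linearization amounts to taking the reciprocal of each decimal odds value, which is also the bookmaker's implied probability before any margin is removed. A minimal sketch follows; the function name and the sample odds are hypothetical and are not taken from the dataset.

```python
def reciprocal_odds(decimal_odds: float) -> float:
    """Transform decimal odds into 1/odds, the term used as I(1/WHH) in the model."""
    if decimal_odds <= 0:
        raise ValueError("decimal odds must be positive")
    return 1.0 / decimal_odds


# Hypothetical home odds: the raw values bunch up near the low end,
# while their reciprocals are spread more evenly.
sample_odds = [1.25, 2.0, 4.0, 8.0]
transformed = [reciprocal_odds(o) for o in sample_odds]
print(transformed)  # [0.8, 0.5, 0.25, 0.125]
```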
Removing the highest p-value variables one by one until all remaining inputs are significant leaves the following
regression coefficients. The variables were removed in the following order:
Home Yellows, Away Yellows, Draw Odds (Win Odds suddenly became significant at this stage), Lose Odds,
Away Corners, Away Shots, Home Reds
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8155647 0.53610105 -1.521289 1.281873e-01
## I(1/WHH) 3.7902130 0.89015019 4.257948 2.063122e-05
## AR 1.3210913 0.45618447 2.895958 3.780024e-03
## HC -0.2860810 0.04844828 -5.904874 3.529168e-09
## HST 0.3359608 0.05887963 5.705893 1.157349e-08
## AST -0.3508141 0.06979585 -5.026289 5.000619e-07
Testing the variance inflation of the variables reveals that Home Shots (HS) has a high correlation with another variable. The values for the remaining variables are:
## I(1/WHH) AR HC HST AST
## 1.231508 1.012846 1.417507 1.208117 1.084360
This is logical, as Home Shots (HS) are intuitively related to Home Shots on Target (HST). In fact, investigating
the relationship between the two variables shows a 67% correlation. We can therefore remove HS from our
regression model. Our final formula, using the coefficients from the table above rounded to two decimal places, is:
$$\text{LogisticRegression} = -0.82 + 3.79 \times \frac{1}{\text{WinningOdds}} + 1.32 \times \text{AwayReds} - 0.29 \times \text{HomeCorners} + 0.34 \times \text{HomeShotsOnTarget} - 0.35 \times \text{AwayShotsOnTarget}$$
And therefore the match outcome can be predicted by:
$$\text{MatchOutcome} = \frac{1}{1 + e^{-\left(-0.82 + 3.79/\text{WHH} + 1.32\,\text{AR} - 0.29\,\text{HC} + 0.34\,\text{HST} - 0.35\,\text{AST}\right)}}$$
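To make the formula concrete, the sketch below evaluates it for a single match using the unrounded coefficients from the final regression table. The function name, the example match statistics and the 0.5 classification cutoff are all assumptions; the report does not state which cutoff was used.

```python
import math

# Coefficients copied from the final regression table.
COEF = {
    "intercept": -0.8155647,
    "inv_whh": 3.7902130,   # applies to 1 / WHH (William Hill home odds)
    "AR": 1.3210913,        # away reds
    "HC": -0.2860810,       # home corners
    "HST": 0.3359608,       # home shots on target
    "AST": -0.3508141,      # away shots on target
}


def home_win_probability(whh: float, ar: int, hc: int, hst: int, ast: int) -> float:
    """Logistic model output: estimated probability that the home team won."""
    z = (COEF["intercept"] + COEF["inv_whh"] / whh + COEF["AR"] * ar
         + COEF["HC"] * hc + COEF["HST"] * hst + COEF["AST"] * ast)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical match: home odds 2.0, no away reds, 6 home corners,
# 6 home shots on target, 3 away shots on target.
p = home_win_probability(whh=2.0, ar=0, hc=6, hst=6, ast=3)
predicted = 1 if p >= 0.5 else 0  # assumed cutoff of 0.5
print(round(p, 3), predicted)
```

Note that the odds enter as 1/WHH, so shorter home odds raise the predicted probability.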
In order to evaluate our model, the most important factor is to test the classification correctness. Running
the derived regression formula on our 2014 dataset results in the following ‘predictions’:
## Predicted Result
## Actual Result 0 1
## 0 168 40
## 1 67 105
Our logistic regression has classified 273 matches correctly (168 + 105) out of a total of 380 games. This
gives an accuracy of 71.8%. Testing on several other years gives 75% in 2013, 66% in 2012 and 70% in 2011.
We can therefore validate the accuracy of our logistic model, with a mean prediction accuracy of 70.7%.
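The accuracy figures can be reproduced directly from the confusion matrix. The sketch below uses only the counts and yearly percentages quoted above; the variable names are hypothetical, and the per-class rates are an extra view that the report does not compute.

```python
# Confusion matrix for the 2014 test season, keyed by (actual, predicted).
confusion = {(0, 0): 168, (0, 1): 40, (1, 0): 67, (1, 1): 105}

correct = confusion[(0, 0)] + confusion[(1, 1)]
total = sum(confusion.values())
accuracy = correct / total

# Share of actual home wins, and of actual non-wins, classified correctly.
sensitivity = confusion[(1, 1)] / (confusion[(1, 0)] + confusion[(1, 1)])
specificity = confusion[(0, 0)] / (confusion[(0, 0)] + confusion[(0, 1)])

# Mean over the four test seasons quoted in the text (2014, 2013, 2012, 2011).
yearly_accuracy = [71.8, 75.0, 66.0, 70.0]
mean_accuracy = sum(yearly_accuracy) / len(yearly_accuracy)

print(correct, total, round(accuracy * 100, 1), round(mean_accuracy, 1))  # 273 380 71.8 70.7
```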
Let’s do one final check by calculating a pseudo R-squared value for the model. We are interested in
the McFadden value, which ranges from 0 to 1, with higher values indicating a better fit:
## llh llhNull G2 McFadden r2ML
## -186.6825419 -257.6351796 141.9052754 0.2753996 0.3116342
## r2CU
## 0.4198202
Although the McFadden isn’t too high, it’s also not close to zero so we can safely vouch that our model does
have some predictive power. Given the high accuracy over the several test sets, this model can be concluded
to be satisfactory.
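The pseudo R-squared values can be recomputed from the two log-likelihoods in the output. The sketch below uses the standard definitions of the McFadden, maximum-likelihood (Cox and Snell) and Cragg-Uhler (Nagelkerke) measures; the sample size n = 380 is an assumption (one Premier League season), chosen because it is consistent with the printed r2ML and r2CU values.

```python
import math

llh = -186.6825419        # log-likelihood of the fitted model
llh_null = -257.6351796   # log-likelihood of the intercept-only model
n = 380                   # assumed number of matches in the training season

g2 = -2 * (llh_null - llh)                          # likelihood-ratio statistic
mcfadden = 1 - llh / llh_null                       # McFadden pseudo R-squared
r2_ml = 1 - math.exp(-g2 / n)                       # maximum-likelihood pseudo R-squared
r2_cu = r2_ml / (1 - math.exp(2 * llh_null / n))    # Cragg-Uhler pseudo R-squared

print(round(g2, 4), round(mcfadden, 4), round(r2_ml, 4), round(r2_cu, 4))
```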
Appendix
Column Name Translations
Column Name Meaning
FTHG Full Time Home Goals
FTAG Full Time Away Goals
FTR Full Time Result
HTHG Half Time Home Goals
HTAG Half Time Away Goals
HTR Half Time Result
HS Home Shots
AS Away Shots
HST Home Shots on Target
AST Away Shots on Target
HF Home Fouls
AF Away Fouls
HC Home Corners
AC Away Corners
HY Home Yellows
AY Away Yellows
HR Home Reds
AR Away Reds