The Data Behind Football
An analysis by Apostolos Mourouzis
11 December 2016
Introduction
Sport has always been considered a very emotional endeavour, where instinct and intuition take precedence over
fact. With the recent boom in data availability there are finally resources that enable an objective analysis of
the game. In this project we have chosen one particular dataset and use it to try to derive relationships and
correlations that provide a different perspective on the world of football.
Our data has been obtained from Football-Data which gives the results of all football games in the top
divisions across Europe.
Our data set also includes betting odds for a range of betting houses. Let’s take a first glimpse at our data
set (betting odds have been filtered out for the sake of clarity):
## Observations: 380
## Variables: 23
## $ Div <fctr> E0, E0, E0, E0, E0, E0, E0, E0, E0, E0, E0, E0, E0, ...
## $ Date <fctr> 08/08/15, 08/08/15, 08/08/15, 08/08/15, 08/08/15, 08...
## $ HomeTeam <fctr> Bournemouth, Chelsea, Everton, Leicester, Man United...
## $ AwayTeam <fctr> Aston Villa, Swansea, Watford, Sunderland, Tottenham...
## $ FTHG <int> 0, 2, 2, 4, 1, 1, 0, 2, 0, 0, 0, 0, 1, 2, 2, 0, 1, 1,...
## $ FTAG <int> 1, 2, 2, 2, 0, 3, 2, 2, 1, 3, 1, 3, 3, 0, 2, 0, 2, 2,...
## $ FTR <fctr> A, D, D, H, H, A, A, D, A, A, A, A, A, H, D, D, A, A...
## $ HTHG <int> 0, 2, 0, 3, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 2, 0, 0, 1,...
## $ HTAG <int> 0, 1, 1, 0, 0, 1, 1, 1, 0, 2, 1, 2, 2, 0, 0, 0, 2, 1,...
## $ HTR <fctr> D, H, A, H, H, A, A, D, D, A, A, A, A, H, H, D, A, D...
## $ Referee <fctr> M Clattenburg, M Oliver, M Jones, L Mason, J Moss, S...
## $ HS <int> 11, 11, 10, 19, 9, 17, 22, 9, 7, 9, 5, 17, 6, 19, 13,...
## $ AS <int> 7, 18, 11, 10, 9, 11, 8, 15, 8, 19, 9, 10, 19, 4, 16,...
## $ HST <int> 2, 3, 5, 8, 1, 6, 6, 4, 1, 2, 1, 4, 2, 6, 7, 5, 3, 4,...
## $ AST <int> 3, 10, 5, 5, 4, 7, 4, 5, 3, 7, 2, 4, 6, 2, 7, 0, 6, 7...
## $ HF <int> 13, 15, 7, 13, 12, 14, 12, 9, 9, 12, 14, 11, 7, 11, 1...
## $ AF <int> 13, 16, 13, 17, 12, 20, 9, 12, 16, 9, 10, 10, 7, 8, 1...
## $ HC <int> 6, 4, 8, 6, 1, 1, 5, 6, 3, 6, 3, 9, 6, 4, 4, 2, 8, 6,...
## $ AC <int> 3, 8, 2, 3, 2, 4, 4, 6, 5, 6, 5, 9, 6, 4, 3, 4, 4, 6,...
## $ HY <int> 3, 1, 1, 2, 2, 1, 1, 2, 2, 4, 2, 4, 1, 2, 2, 1, 1, 1,...
## $ AY <int> 4, 3, 2, 4, 3, 0, 3, 4, 4, 1, 2, 2, 2, 1, 2, 2, 3, 1,...
## $ HR <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ AR <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,...
There are two things we can immediately see:
• The column names are not very descriptive.
• We have a lot of integer variables, which are useful for regression analysis.
In regard to the first observation, a guide to the column headings is available in the appendix. Interpreting
these, we can see that the data available is relatively high level for a football game, encompassing goals, shots,
fouls, corners and cards.
This study is partitioned into two independent investigations, each answering a
different question about the football world.
Investigations
Which is the Dirtiest League?
This is a topic of much debate for football fans, and we can finally do some statistical analysis to find which of the big
leagues is actually the dirtiest. To investigate this we have several applicable variables:
• Fouls committed in a game
• Yellow cards given in a game
• Red cards given in a game
As red cards are signs of a larger infringement than yellow cards, we will weight them accordingly using the
formula below:
$$\text{CardScore} = 5 \times \text{Yellow} + 10 \times \text{Red}$$
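As a minimal sketch of how this weighting works, the Python snippet below computes the card score for a single match. The function name `card_score` and its arguments are our own illustrative choices, and we assume that the score sums the cards of both teams (HY + AY yellows, HR + AR reds), which the text does not state explicitly.

```python
def card_score(yellows: int, reds: int) -> int:
    """Weighted card score: 5 points per yellow card, 10 per red card."""
    return 5 * yellows + 10 * reds


# First match in the data glimpse above: HY=3, AY=4, HR=0, AR=0.
# Summing home and away cards is an assumption made for illustration.
first_match = card_score(yellows=3 + 4, reds=0 + 0)
print(first_match)
```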
So let’s investigate the fluctuations in fouls and cards across the different leagues in 2015:
[Figure: Average Fouls Committed and Average Card Score by League (Bundesliga, La Liga, Premier, Serie A)]
As expected, there are variations between the leagues. Let’s analyse some insights we can glean from the chart above:
• As many football fans would expect, the Italian Serie A seems to be the dirtiest - at first glance at least!
• The Bundesliga, although high in fouls committed, seems to receive fewer cards per game. This could
mean that the fouls committed are mostly tactical.
• The Premier League seems to be the ‘cleanest’ league, with the fewest fouls and cards per game.
• Although La Liga has fewer fouls on average than most of the other leagues, it has the highest card score.
When someone in La Liga decides to foul, he fouls hard!
To check whether there is a real difference between the leagues, we will have to perform an analysis of
variance (ANOVA). Our null hypothesis is that there are no differences in fouls or card score between the different
leagues.
The ANOVA test for fouls committed:
## Df Sum Sq Mean Sq F value Pr(>F)
## Div 3 16335 5445 144.4 <2e-16 ***
## Residuals 1262 47603 38
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
And the ANOVA test for the Card Score:
## Df Sum Sq Mean Sq F value Pr(>F)
## Div 3 34785 11595 90.99 <2e-16 ***
## Residuals 1262 160817 127
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-values for both ANOVAs are less than the 0.05 threshold, which means we can reject our null hypothesis
and deduce that there is a statistically significant difference in both fouls and card score between the leagues.
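To make the F value concrete, here is a minimal sketch of a one-way ANOVA in plain Python. The helpers `one_way_anova` and `f_from_table` are hypothetical names written for illustration, and the toy data is made up. The final lines simply re-derive the two F values from the Sum Sq and Df columns printed above.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (df_between, df_within, f_value) for a list of samples."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group (residual) sum of squares: spread inside each group.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return df_between, df_within, f_value


def f_from_table(ss_between, df_between, ss_within, df_within):
    """F value from the Sum Sq and Df columns of an ANOVA table."""
    return (ss_between / df_between) / (ss_within / df_within)


# Toy data (made up): fouls per game in three imaginary leagues.
toy = [[20, 22, 24], [30, 31, 32], [25, 27, 29]]
print(one_way_anova(toy))

# Re-derive the F values from the table entries printed above.
f_fouls = f_from_table(16335, 3, 47603, 1262)
f_cards = f_from_table(34785, 3, 160817, 1262)
print(round(f_fouls, 1), round(f_cards, 2))
```

The F value is the ratio of the variation between leagues to the variation within leagues, so a large value means the league means differ by much more than match-to-match noise would explain.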
However, this conclusion is only applicable for the 2015/2016 season. Let’s investigate whether this is
consistent across the last 5 years.
A two-way ANOVA of fouls across the leagues in the last 5 years is performed below:
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
## Df Sum Sq Mean Sq F value Pr(>F)
## Div 3 103962 34654 903.109 < 2e-16 ***
## Year 4 704 176 4.589 0.00106 **
## Div:Year 12 2829 236 6.144 7.37e-11 ***
## Residuals 7029 269716 38
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness
The resulting two-way ANOVA indicates that the different leagues do consistently have significant
differences between them, but that the year itself also affects the fouls committed. Although this is initially slightly
counterintuitive, we can see a significant difference in the means between 2011 (27.9 fouls a game) and 2015
(26.6 fouls a game). Furthermore, there is a significant interaction effect between league and year,
which means that the change in the number of fouls committed from year to year is not the same
in every league.
Now, although we know that there are differences in the ‘dirtiness’ of the leagues, we can’t say with confidence
what the order is. To do so we can perform a Tukey analysis to see which pairs of leagues differ significantly.
[Figure: Tukey Analysis on Card Score. Pairwise differences in mean card score at the 95% family-wise confidence level for Serie A-Premier, Serie A-La Liga, Premier-La Liga, Serie A-Bundesliga, Premier-Bundesliga and La Liga-Bundesliga]
[Figure: Tukey Analysis on Fouls. Pairwise differences in mean fouls at the 95% family-wise confidence level for Serie A-Premier, Serie A-La Liga, Premier-La Liga, Serie A-Bundesliga, Premier-Bundesliga and La Liga-Bundesliga]
The derived rankings for Fouls and Card Score are:
Rank  Fouls           Cards
1.    Serie A         La Liga
2.    Bundesliga      Serie A
3.    La Liga         Bundesliga
4.    Premier League  Premier League
We can now assert with confidence that:
Serie A is the dirtiest league, having the highest combined ranking across fouls (1st) and cards (2nd), which supports its
age-old reputation. The Premier League is the least aggressive league, ranking lowest for both variables.
However, this may be a result of referee leniency as opposed to play style.
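One simple way to combine the two rankings, sketched below in Python, is to add each league's rank for fouls to its rank for cards, so that the lowest total marks the dirtiest league. The rank-sum approach and the variable names are our own assumptions for illustration, as the text does not specify how the aggregate score was calculated.

```python
# Rankings taken from the table above (1 = dirtiest).
fouls_rank = {"Serie A": 1, "Bundesliga": 2, "La Liga": 3, "Premier League": 4}
cards_rank = {"La Liga": 1, "Serie A": 2, "Bundesliga": 3, "Premier League": 4}

# Rank sum: a lower total means a dirtier league overall.
rank_sum = {league: fouls_rank[league] + cards_rank[league] for league in fouls_rank}

# Order the leagues from dirtiest to cleanest under this assumed scheme.
overall = sorted(rank_sum, key=rank_sum.get)
print(overall)
```

Under this scheme Serie A has the lowest total (3) and the Premier League the highest (8).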
A Logistical Victory
Let’s assume that you can obtain all the variables of a specific game: shots, corners, cards, fouls and betting
odds for both teams. Having these, is it possible to accurately predict whether a team has won?
To answer this question, we will perform a logistic regression using all these variables as inputs. Due
to the different nature of the European leagues, we will restrict this investigation to the Premier League to remove
variations based on play styles. The binary outcome for this regression will be 1 if the home team won and 0 if
it didn’t.
Let’s see what the distribution of home wins is before we start the regression:
[Figure: histogram of HomeWin (0 or 1) showing the count of matches for each outcome]
We can see that the two outcomes are not equally likely, so we would expect the logistic regression we perform
below to generate a similarly uneven distribution of predictions.
As we have several years’ worth of data, there is no need to partition one data set into training and test sets. We
will use one year to fit the regression (in this particular case the 2015 season) and then test it on a
previous year’s data (the 2014 season).
First, let’s attempt a regression utilising all the variables:
## [1] TRUE
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.31979143 1.47614950 0.8940771 3.712807e-01
## WHH -0.33415258 0.23461233 -1.4242754 1.543668e-01
## WHD 0.29607628 0.63993231 0.4626681 6.436023e-01
## WHA 0.15754079 0.18260871 0.8627233 3.882896e-01
## AR 1.55774944 0.46699806 3.3356657 8.509541e-04
## HR -0.75961151 0.56302408 -1.3491634 1.772845e-01
## HC -0.24655953 0.05295213 -4.6562721 3.219864e-06
## AC 0.06158437 0.06253096 0.9848621 3.246918e-01
## HS -0.12670429 0.03985025 -3.1795102 1.475242e-03
## AS -0.05803611 0.04335286 -1.3386918 1.806710e-01
## HST 0.47069370 0.07674571 6.1331599 8.615050e-10
## AST -0.29635324 0.08560794 -3.4617495 5.366764e-04
## HY -0.04661090 0.12225064 -0.3812732 7.030005e-01
## AY -0.04089751 0.10651777 -0.3839501 7.010154e-01
It seems our first attempt has identified many redundant variables, which we will remove one by one (starting
with the highest p-value) to see how the model changes with each variable removed. Another interesting
observation is that there is no apparent correlation between the betting odds for the game and the result.
This is because odds are distributed unevenly, with the majority closer to 1, similar to a 1/x function. To
linearize them, we simply take the reciprocal of the odds (1/odds). The charts of the original and linearized odds are
shown below:
[Figure: William Hill Away Odds against William Hill Home Odds, in two panels: Original and Linear]
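The reciprocal transformation can be sketched in a few lines of Python. The odds values below are invented for illustration and `reciprocal` is a hypothetical helper; we also assume the odds are quoted in decimal format, in which case the reciprocal is commonly read as the bookmaker's implied probability before any margin is removed.

```python
def reciprocal(odds: float) -> float:
    """Transform decimal odds with 1/x."""
    if odds <= 0:
        raise ValueError("odds must be positive")
    return 1.0 / odds


# Made-up William Hill home odds, chosen only to show the transformation.
home_odds = [1.25, 2.0, 4.0, 8.0]
transformed = [reciprocal(o) for o in home_odds]
print(transformed)
```

Short odds close to 1 are spread out towards the top of the 0 to 1 range, while long odds are compressed towards 0, which removes the long tail seen in the raw values.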
Removing the highest p-value variables one by one until all remaining inputs are significant leaves the following
regression coefficients. The variables were removed in the following order:
Home Yellows, Away Yellows, Draw Odds (Win Odds suddenly became significant at this stage), Lose Odds,
Away Corners, Away Shots, Home Reds
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8155647 0.53610105 -1.521289 1.281873e-01
## I(1/WHH) 3.7902130 0.89015019 4.257948 2.063122e-05
## AR 1.3210913 0.45618447 2.895958 3.780024e-03
## HC -0.2860810 0.04844828 -5.904874 3.529168e-09
## HST 0.3359608 0.05887963 5.705893 1.157349e-08
## AST -0.3508141 0.06979585 -5.026289 5.000619e-07
Testing the variance inflation of the variables reveals that Home Shots (HS) have a high correlation with another variable (the values below are for the variables that remain in the model).
## I(1/WHH) AR HC HST AST
## 1.231508 1.012846 1.417507 1.208117 1.084360
This is logical, as Home Shots (HS) are intuitively related to Home Shots on Target (HST). In fact, investigating
the relationship between the two variables shows a 67% correlation. We can therefore remove Home Shots from our
regression model. Our final formula is:
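$$\text{LogisticRegression} = -0.82 + 3.79 \times \frac{1}{\text{WinningOdds}} + 1.32 \times \text{AwayReds} - 0.29 \times \text{HomeCorners} + 0.34 \times \text{HomeShotsOnTarget} - 0.35 \times \text{AwayShotsOnTarget}$$
Here WinningOdds is the William Hill home-win odds (WHH), which enters the model as its reciprocal. The match outcome can therefore be predicted by:
$$\text{MatchOutcome} = \frac{1}{1 + e^{-\left(-0.82 + 3.79/\text{WHH} + 1.32\,\text{AR} - 0.29\,\text{HC} + 0.34\,\text{HST} - 0.35\,\text{AST}\right)}}$$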
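As a sketch of how this formula would be applied to a single match, the Python function below evaluates it using the unrounded estimates from the coefficient table above. The function name, the dictionary layout and the example match statistics are hypothetical; only the coefficients come from the regression output. The odds term uses 1/WHH, following the `I(1/WHH)` row of the table.

```python
import math

# Estimates copied from the final regression table above.
COEFFICIENTS = {
    "intercept": -0.8155647,
    "inv_WHH": 3.7902130,   # applied to 1 / William Hill home odds
    "AR": 1.3210913,        # away red cards
    "HC": -0.2860810,       # home corners
    "HST": 0.3359608,       # home shots on target
    "AST": -0.3508141,      # away shots on target
}


def home_win_probability(whh, ar, hc, hst, ast):
    """Predicted probability of a home win from the fitted logistic model."""
    c = COEFFICIENTS
    linear = (c["intercept"] + c["inv_WHH"] / whh + c["AR"] * ar
              + c["HC"] * hc + c["HST"] * hst + c["AST"] * ast)
    return 1.0 / (1.0 + math.exp(-linear))


# Invented example match: home odds 2.0, no away reds, 5 home corners,
# 5 home shots on target and 4 away shots on target.
p = home_win_probability(whh=2.0, ar=0, hc=5, hst=5, ast=4)
print(round(p, 3), "home win" if p >= 0.5 else "no home win")
```

A probability of 0.5 or more is classified as a home win here. The 0.5 cut-off is an assumption, as the threshold used for the predictions in this study is not stated.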
To evaluate our model, the most important step is to test its classification accuracy. Running
the derived regression formula on our 2014 dataset results in the following ‘predictions’:
## Predicted Result
## Actual Result 0 1
## 0 168 40
## 1 67 105
Our logistic regression has classified 273 matches correctly out of a total of 380 games, which
gives an accuracy of 71.8%. Testing on several other years gives 75% in 2013, 66% in 2012 and 70% in 2011.
Across these four test seasons, our logistic model therefore has a mean prediction accuracy of 70.7%.
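The accuracy figures above can be reproduced directly from the confusion matrix, as the short Python sketch below shows. The variable names are our own, and the sensitivity and specificity are extra summaries that we add for illustration.

```python
from statistics import mean

# Confusion matrix for the 2014 season: rows are actual, columns predicted.
true_neg, false_pos = 168, 40
false_neg, true_pos = 67, 105

total = true_neg + false_pos + false_neg + true_pos
accuracy = (true_neg + true_pos) / total
# Share of actual home wins that were predicted as wins.
sensitivity = true_pos / (true_pos + false_neg)
# Share of actual non-wins that were predicted as non-wins.
specificity = true_neg / (true_neg + false_pos)

# Mean of the accuracies (in %) quoted for 2014, 2013, 2012 and 2011.
mean_accuracy = mean([71.8, 75, 66, 70])

print(total, round(accuracy, 3), round(sensitivity, 3), round(specificity, 3))
print(round(mean_accuracy, 1))
```

On this test season the model is noticeably better at recognising non-wins (about 81% correct) than home wins (about 61% correct).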
Let’s do one final check by calculating a pseudo R-squared value for the model. We want the
McFadden value, which ranges from 0 to 1, to be as high as possible:
## llh llhNull G2 McFadden r2ML
## -186.6825419 -257.6351796 141.9052754 0.2753996 0.3116342
## r2CU
## 0.4198202
Although the McFadden value isn’t too high, it’s also not close to zero, so we can reasonably say that our model does
have some predictive power. Given the consistent accuracy of around 70% over the several test sets, we conclude that
this model is satisfactory.
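The pseudo R-squared values in the output above can be re-derived from the two log-likelihoods, as sketched below in Python using the standard definitions of each measure. We assume n = 380, the number of matches in the 2015 season used to fit the model; the variable names are our own.

```python
import math

# Log-likelihoods from the output above. n is assumed to be the 380
# matches of the 2015 season used to fit the model.
llh = -186.6825419       # fitted model
llh_null = -257.6351796  # intercept-only model
n = 380

# Likelihood-ratio statistic comparing the fitted and null models.
g2 = 2 * (llh - llh_null)
# McFadden: proportional improvement in log-likelihood over the null model.
mcfadden = 1 - llh / llh_null
# Maximum-likelihood (Cox and Snell) pseudo R-squared.
r2_ml = 1 - math.exp(-g2 / n)
# Cragg-Uhler (Nagelkerke): r2_ml rescaled so that its maximum is 1.
r2_cu = r2_ml / (1 - math.exp(2 * llh_null / n))

print(round(g2, 4), round(mcfadden, 4), round(r2_ml, 4), round(r2_cu, 4))
```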
Appendix
Column Name Translations
Column Name Meaning
FTHG Full Time Home Goals
FTAG Full Time Away Goals
FTR Full Time Result
HTHG Half Time Home Goals
HTAG Half Time Away Goals
HTR Half Time Result
HS Home Shots
AS Away Shots
HST Home Shots on Target
AST Away Shots on Target
HF Home Fouls
AF Away Fouls
HC Home Corners
AC Away Corners
HY Home Yellows
AY Away Yellows
HR Home Reds
AR Away Reds
