This document discusses predicting football match results using machine learning models. It outlines the workflow of data collection, feature engineering, model training and evaluation. Several classification models are tested on a dataset of European football matches and odds data. The best models achieve similar prediction accuracy to bookmakers, though draws remain difficult to predict. The author concludes that reaching bookmakers' performance is achievable with more refined data but predicting football matches accurately remains very challenging.
5. How data is entering football
• Acquisitions…
• Sensors to track match and player performance…
• And BIG BANKS trying to make predictions on the World Cup…
8. The leader of the markets for predictions: the bookmakers
ESTIMATES OF MARKET SIZE
• WORLD: 700bn-1 trillion US$ a year*
• Of which, 70% related to football
• IN ITALY: increase of 300% over ten years; total bets collected of 10 billion € per year, for a net revenue of 1.3 billion €**
• IN THE USA: sports betting to become fully legal in 2019
• * Source: “Football betting - the global gambling industry worth billions”, BBC, October 2013
• ** Source: AGIMEG-SOLE24_ORE
9. What bookmakers offer…
Implied probability is 1/odds. In this example, 1/1.39 ≈ 72% chance for Manchester United to win.
The lowest odds mark the bookmaker's favourite result (the bookmaker's prediction)!
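The 1/odds rule can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the draw and away odds are hypothetical (only 1.39 comes from the slide). Raw implied probabilities usually sum to more than 1 because the bookmaker's margin is built into the odds, so the sketch also shows a normalised version.

```python
def implied_probabilities(odds):
    """Turn decimal odds into raw and normalised implied probabilities."""
    # Raw implied probability is simply 1 / odds for each outcome.
    raw = {outcome: 1.0 / odd for outcome, odd in odds.items()}
    # The raw values typically sum to more than 1 (bookmaker margin),
    # so we rescale them to sum to exactly 1.
    total = sum(raw.values())
    normalised = {outcome: p / total for outcome, p in raw.items()}
    return raw, normalised


def bookmaker_favourite(odds):
    """The outcome with the lowest odds is the bookmaker's prediction."""
    return min(odds, key=odds.get)


# Hypothetical odds: 1.39 is the home-win value from the slide,
# the draw ("X") and away ("2") odds are invented for illustration.
example_odds = {"1": 1.39, "X": 4.50, "2": 8.00}
raw, normalised = implied_probabilities(example_odds)

print(round(raw["1"], 3))                 # raw implied probability of a home win
print(bookmaker_favourite(example_odds))  # outcome with the lowest odds
```

Taking the favourite for every match gives the "bookmaker prediction" that is used as a reference point later in the deck.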
12. WORKFLOW
• An overview of the dataset (data description, source)
• Feature engineering
• Pre-processing data
• Train/Test set
• Selection of the models
• Training of the models
• Inspecting results and interpretation
• Test of the models
13. An overview of the dataset
“The European soccer database”
Downloadable from Kaggle, or obtainable by scraping football-data.co.uk
• Seasons from 2008 to 2016
• Top leagues of 11 European countries
• 25,979 matches
✓ Results (goals scored by home and away team)
✓ Betting odds for the two teams from 10 top odds providers

Home team    Away team  Home team goal  Away team goal  Stage  Season     League
Juventus     Milan      3               2               2      2008/2009  Serie A
Lazio        Sassuolo   1               0               3      2008/2009  Serie A
Real Madrid  Valencia   1               1               5      2010/2011  Premier League
…            …          …               …               …      …          …
15. Pre-Processing
Data: ‘Result’
• if home team wins: 1
• if draw: X
• if away team wins: 2
• We face a CLASSIFICATION task: we will classify each match as result “1”, “X” or “2”.
• We will treat those results as unordered factors…
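The labelling rule above can be written as a small function. This is a minimal sketch; the function name and the toy score lines are assumptions for illustration, not the author's code.

```python
def match_result(home_goals, away_goals):
    """Encode a final score as the class label "1", "X" or "2"."""
    if home_goals > away_goals:
        return "1"  # home team wins
    if home_goals < away_goals:
        return "2"  # away team wins
    return "X"      # draw


# Toy scores (home goals, away goals), invented for illustration.
scores = [(3, 2), (1, 0), (1, 1), (0, 2)]
labels = [match_result(h, a) for h, a in scores]
print(labels)
```

The labels are kept as plain strings, which matches the idea of treating the three results as unordered classes rather than as numbers.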
16. Feature Engineering
Why do we need to create features?
• TEMPORAL:
  • We cannot use information about the match itself to predict its result, because we need to make the prediction before the match starts!
• WHICH FEATURES:
  • Without feature engineering, the only features available are the results of the previous matches, team_id's and league_id's… but this might not be enough.
Thus we create…
• Points in the league (Win = 3 points, Draw = 1 point, Loss = 0 points)
• Goals of the teams
• Dummies (season, league, stage…)
• ELO points
• All these features are calculated by team, league, season… cumulatively up to the match day!
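One way to compute a cumulative "points so far" feature without leaking the match result is sketched below. It is an assumption-laden illustration: the input layout (a chronologically sorted list of dicts) and the function name are hypothetical, and the output keys merely borrow the pointsofar_home / pointsofar_away names from the predictor list. The key idea is that the feature is recorded before the totals are updated with the current match.

```python
from collections import defaultdict


def add_points_so_far(matches):
    """Attach league points accumulated BEFORE each match.

    `matches` must be sorted chronologically; each item is a dict with
    keys: season, home, away, home_goals, away_goals.
    """
    totals = defaultdict(int)  # (season, team) -> points earned so far
    enriched = []
    for m in matches:
        home_key = (m["season"], m["home"])
        away_key = (m["season"], m["away"])
        # Record the totals first, so the current result is never used.
        enriched.append(dict(m,
                             pointsofar_home=totals[home_key],
                             pointsofar_away=totals[away_key]))
        # Only now update the totals: win = 3, draw = 1, loss = 0.
        if m["home_goals"] > m["away_goals"]:
            totals[home_key] += 3
        elif m["home_goals"] < m["away_goals"]:
            totals[away_key] += 3
        else:
            totals[home_key] += 1
            totals[away_key] += 1
    return enriched


# Toy fixtures with made-up teams A, B, C, already in date order.
toy = [
    {"season": "2008/2009", "home": "A", "away": "B", "home_goals": 2, "away_goals": 1},
    {"season": "2008/2009", "home": "B", "away": "C", "home_goals": 0, "away_goals": 0},
    {"season": "2008/2009", "home": "C", "away": "A", "home_goals": 1, "away_goals": 3},
]
rows = add_points_so_far(toy)
print([(r["pointsofar_home"], r["pointsofar_away"]) for r in rows])
```

The same pattern (read the running total, then update it) would apply to cumulative goals scored and conceded.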
17. Feature Engineering - The final dataset
Columns: Home team | Away team | H_Points so far | A_Points so far | Goal in season_Home | Goal in season_away | Elo_points_home | Elo_points_away | …
Juve Milan 68 65 23 15 Serie A 2 …
Lazio Sassuolo 58 18 3 3 Serie A 3 …
Real Madrid Valencia 15 23 5 2010/2011 Premier League 5 …
…
Full set of 17 potential predictors…
Dummies: "season", "country_id", "league_id", "stage", "home_team_api_id", "away_team_api_id"
Team specific: "pointsofar_home", "pointsofar_away", "home_H_points_sofar", "away_A_points_sofar", "goalsofar_home_out", "goalsofar_home_in", "goalsofar_home_diff", "goalsofar_away_out", "goalsofar_away_in", "home_H_goal_out_sofar", "away_A_goal_out_sofar", "home_H_goal_in_sofar", "away_H_goal_in_sofar", "goalsofar_away_diff"
ELO POINTS: "home_H_points_elo_sofar", "away_A_points_elo_sofar"
18. Feature Engineering - ELO RATINGS
ELO POINTS
• Named after Arpad Emrick Elo (1903-1992), professor of physics, born in Hungary.
• First applied to calculate the relative strength of chess players.
• The intuition: in the normal setting we assign 3 points for a win in each match, without taking into account the strength of the rival. The Elo rating accounts for the strength of the rival: if you are a low-ranking team and you win against a «high-ranking» team, you will be assigned more than 3 points. The «multiplier» reflects the gap in ranking (difference in points) between the two teams:
$$\text{Elo points}_{A \text{ vs } B} = \frac{\text{Normal points}_{A \text{ vs } B}}{\text{Ranking}_A / \text{Ranking}_B}$$
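The weighting formula on this slide can be sketched as follows. Note that this is the simplified ratio from the slide, not the standard chess Elo update. The function name is hypothetical, and treating each "ranking" as a positive points total is an assumption made so that the ratio is defined.

```python
def elo_weighted_points(normal_points, ranking_a, ranking_b):
    """Points for team A against team B, weighted by the rival's strength.

    Implements normal_points / (ranking_a / ranking_b) from the slide.
    Rankings are assumed to be positive numbers (e.g. points so far).
    """
    if ranking_a <= 0 or ranking_b <= 0:
        raise ValueError("rankings must be positive for the ratio to be defined")
    # Algebraically the same as normal_points / (ranking_a / ranking_b).
    return normal_points * ranking_b / ranking_a


# Made-up rankings: a low-ranked team (20) beats a high-ranked team (60).
print(elo_weighted_points(3, 20, 60))  # more than the usual 3 points
# The reverse case: a high-ranked team beats a low-ranked one.
print(elo_weighted_points(3, 60, 20))  # fewer than the usual 3 points
```

With equal rankings the function returns the normal points unchanged, so the weighting only matters when the two teams differ in strength.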
20. Training of the models
Which model to use for classification?
• Linear Discriminant Analysis (LDA)
• Random Forest + Boosting (RF)
• Support Vector Machine (SVM)
Although we could in theory use LOGIT for multinomial classification, we prefer other methods since there are more than two classes (1, X, 2); we will come back to LOGIT later…
Before looking at the results… LET'S GO BACK AGAIN TO THE DATA
21. Training of the models
What is the benchmark for our task?
• If we always predict class 1, we get it right in 46% of the test cases. That is, our benchmark for prediction accuracy is 0.46!
• Bookmakers never predict “X” (draw) as the favourite! Instead, they predict 1 in most cases (72.5%). This will be important for our analysis of the results later!
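The "always predict the most frequent class" benchmark can be sketched as below. The function names are illustrative and the list of results is toy data, not the deck's dataset, so the printed accuracy is not the 0.46 reported on the slide.

```python
from collections import Counter


def majority_class_baseline(results):
    """Return the most frequent label and the accuracy of always predicting it."""
    label, hits = Counter(results).most_common(1)[0]
    return label, hits / len(results)


def accuracy(predictions, results):
    """Share of matches where the prediction equals the actual result."""
    correct = sum(p == r for p, r in zip(predictions, results))
    return correct / len(results)


# Toy outcomes: five home wins, two draws, three away wins.
toy_results = ["1", "1", "X", "2", "1", "2", "X", "1", "2", "1"]
label, baseline_accuracy = majority_class_baseline(toy_results)
print(label, baseline_accuracy)

# Predicting "1" for every match reproduces the baseline accuracy.
print(accuracy(["1"] * len(toy_results), toy_results))
```

Any model, and the bookmakers' favourite, can be scored with the same accuracy function and compared against this baseline.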
22. Training of the models
What is the benchmark for our task?
If we take the bookmakers' favourite as the prediction, we get a prediction accuracy of 0.53 (53%) on the whole set.
23. Training of the models
LDA: what happens when we add variables
We are capable of reaching the prediction accuracy of the bookmakers with a relatively simple procedure… BUT this is not FAIR… WHY? WE ARE TRAINING AND TESTING ON THE FULL SET.
24. Selection of training and test set
How to choose the TRAINING and TEST sets?
Problem: we have a time constraint to respect! We cannot look into the future, i.e. we cannot use future observations to train a model that will predict past observations.
TIME (total period: 8 years): TRAINING | TEST (2 weeks)
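A chronological split of this kind might look like the sketch below. It assumes each match carries a date field and that the test set is a 14-day window starting at a chosen cut-off; the function name, the field name and the toy dates are all hypothetical.

```python
from datetime import date, timedelta


def time_split(matches, test_start, test_days=14):
    """Train on everything before `test_start`, test on the following window."""
    test_end = test_start + timedelta(days=test_days)
    # Training data: strictly earlier than the cut-off, never the future.
    train = [m for m in matches if m["date"] < test_start]
    # Test data: the window [test_start, test_end).
    test = [m for m in matches if test_start <= m["date"] < test_end]
    return train, test


# Toy match dates, invented for illustration.
toy_matches = [
    {"id": 1, "date": date(2016, 5, 1)},
    {"id": 2, "date": date(2016, 5, 8)},
    {"id": 3, "date": date(2016, 5, 10)},
    {"id": 4, "date": date(2016, 5, 20)},
    {"id": 5, "date": date(2016, 6, 1)},
]
train, test = time_split(toy_matches, test_start=date(2016, 5, 9))
print([m["id"] for m in train], [m["id"] for m in test])
```

Matches after the test window are simply left out, so no future observation can influence either the training or the evaluation.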
25. Training of the models
LDA – TEST/TRAIN*
Apparently we are not as good as the bookmakers, so we try another model.
28. Training of the models
Apply boosting to Random Forest
With boosting, changing the parameters does not have a big effect…
29. Training of the models
Recap of the results of all models
How can we explain these results?
* I do not include the ‘boosted’ RF, as it is calculated on only one test size.
35. Test of the models
Potential evolution of the project
• Look for more and more “refined” predictors (e.g. player stats, team stats, newspapers, etc.)
• Other prediction problems (e.g. number of goals, strategies, etc.)
36. Test of the models
Conclusions
1. The issue seems to lie in the difficulty of finding good predictors for drawn matches… ultimately, predicting football matches is very difficult…
2. …but reaching the bookmakers' performance seems “achievable”.
Running of the project
Running time of the script  Data preparation  Modelling and results  Presentation
45 minutes                  3 days            2 days                 1 day
Total: 6 days