The Sportsbet/CIKM competition (http://sportsbetcikm15.com) is a data mining and machine learning challenge: use data about Australian Football League (AFL) matches already played to predict future ones. These slides are related to the entry I submitted to the competition.
1. The Task My approach Conclusions
Competition for the International Conference on Information and Knowledge
Management (CIKM), hosted by Sportsbet
September 16th 2015
My Entry to the Sportsbet Competition
Simone Romano
simone.romano@unimelb.edu.au
@ialuronico
The Task
Task description
The challenges
My approach
How to Build a Model for Predictions
Evaluation of Prediction Error
Conclusions
Summary
What I would have done if I had more time
Task description
Sportsbet competition: predict the outcome of every match in the 2015 AFL season by giving the probability that Team1 beats Team2.
E.g. Hawthorn (The Hawks) beats Adelaide (The Crows) on the 18th of September with probability 0.75 (75%) [1]
Two phases:
The Leaderboard Phase: prediction of the outcome of each regular-season match in the 2015 AFL season.
(match results are already known)
The Finals Phase: prediction of the outcome of each match in the 2015 AFL Finals Series.
(match results are known after the AFL Grand Final)
I focused on the Leaderboard Phase because its match results are already known, so I could evaluate the performance of my predictions.
[1] Implied by the odds for Hawthorn on Monday the 14th of September on http://www.sportsbet.com.au/betting/australian-rules/afl
Data provided
The following datasets were provided:
Teams: names of the teams that took part in AFL matches between 2000 and 2015.
Players: names of the players who played in at least one match between 2000 and 2015.
Seasons: description, results, and statistics of regular-season (non-finals) matches. E.g. it contains:
  home/away: which team is home and which is away
  venue: venue of the match
  margin: winning margin
Match stats: statistics recorded for each single player in every match (including finals) between 2000 and 2015. E.g. it contains:
  number of kicks performed
  number of goals
Finals: information about the finals matches between 2000 and 2014.
Unplayed: remaining (unplayed) regular-season matches in the 2015 season. (Dataset release: end of July 2015)
The Challenges
Target: predict the outcome of matches in the 2015 season using the data available.
Challenges
Respect the time constraint: when predicting the outcome of a match we can only use information about matches played before it
Obtain a low prediction error
Solution
Build an automated prediction model that incorporates information on matches played between 2000 and 2014. Given 2 teams, Team1 and Team2, the model predicts the probability that Team1 beats Team2.
We want the model to have a low prediction error.
Evaluation of Prediction Error
Since we actually know the results of the 2015 matches, we can compute the logloss error of our predictions. The logloss error is used to score the entries to the competition.
Useful facts about logloss error
logloss = 0: the model outputs only 100% and 0% probabilities, and a team always wins when the model says 100% and always loses when it says 0%.
logloss = LARGE: for even just one match, the model predicts that a team wins with 100% probability but the team actually loses the game.
logloss = 0.693: all predictions are set to 50%.
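These facts can be checked with a minimal sketch of the usual binary logloss, assuming it is the mean of -[y log(p) + (1 - y) log(1 - p)] over matches, where y = 1 when Team1 wins and p is the predicted probability that Team1 wins. The function name `logloss` and the clipping constant `eps` are illustrative assumptions, not taken from the competition's scoring code.

```python
import math

def logloss(outcomes, probs, eps=1e-15):
    """Mean binary log loss; outcomes are 1 (Team1 wins) or 0, probs are P(Team1 wins)."""
    total = 0.0
    for y, p in zip(outcomes, probs):
        # Clip to avoid log(0); eps is an assumed value, not the competition's.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(outcomes)

outcomes = [1, 0, 1, 1]
print(round(logloss(outcomes, [0.5] * 4), 3))   # 0.693
print(logloss(outcomes, [1.0, 0.0, 1.0, 1.0]))  # essentially 0: always right and certain
print(logloss(outcomes, [1.0, 1.0, 1.0, 1.0]))  # large: one certain prediction was wrong
```

The value 0.693 is simply -log(0.5), which is what every match contributes when all predictions are 50%.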
We have to keep in mind that:
Extreme probabilities (e.g. 100% or 0%) should be avoided, because a single wrong prediction can increase the logloss a lot
Just by being conservative (predicting 50% everywhere) we can obtain 0.693
This is not an easy task, and some competitors performed really badly.
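The point about extreme probabilities can be illustrated by capping predictions away from 0 and 1 before scoring them. This is only a sketch: the helper names and the bounds 0.05 and 0.95 are arbitrary assumptions, not values reported in these slides.

```python
import math

def clip_probability(p, low=0.05, high=0.95):
    """Cap a predicted win probability to [low, high]; the bounds are assumed."""
    return min(max(p, low), high)

def match_loss(team1_won, p):
    """Logloss contribution of one match, given P(Team1 wins) = p."""
    return -math.log(p if team1_won else 1.0 - p)

# A confident but wrong prediction: the model said 99.9% and Team1 lost.
raw = 0.999
print(round(match_loss(False, raw), 2))                    # 6.91
print(round(match_loss(False, clip_probability(raw)), 2))  # 3.0
```

Capping costs a little on every correct confident prediction, but it bounds the damage of a single wrong one.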
How to Build a Model for Predictions
Position on the Leaderboard
In two days I managed to finish about halfway up the Leaderboard with a
logloss = 0.640: position 28 out of 52. The smallest error on the Leaderboard
is 0.524.
My Approach
We can build a simple model based on matches between 2000 and 2014 and
the knowledge of:
The teams that are playing
Which team is home and which one is away
Example: Hawthorn (The Hawks) vs Adelaide (The Crows)
Season  Round  Home team  Winner
2011    R01    Adelaide   Adelaide
2012    R03    Hawthorn   Hawthorn
2013    R06    Adelaide   Hawthorn
2014    R17    Adelaide   Hawthorn
2015    R12    Adelaide   ?
Hawthorn won 3 of the 4 previous matches, so we could say that Hawthorn is going to win with probability 3/4 = 75%. Indeed, Hawthorn won.
The model learns from the results of past matches to output this probability according to this rationale.
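The counting rationale behind the 75% figure can be sketched as follows. This is not the actual model, only the naive frequency estimate on the four past matches listed above; the name `win_frequency` is illustrative.

```python
# Past Hawthorn vs Adelaide results from the table above: (season, home team, winner).
past_matches = [
    (2011, "Adelaide", "Adelaide"),
    (2012, "Hawthorn", "Hawthorn"),
    (2013, "Adelaide", "Hawthorn"),
    (2014, "Adelaide", "Hawthorn"),
]

def win_frequency(team, matches):
    """Fraction of past matches won by `team`; a naive probability estimate."""
    wins = sum(1 for _season, _home, winner in matches if winner == team)
    return wins / len(matches)

print(win_frequency("Hawthorn", past_matches))  # 0.75
```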
Adding Features
Feature: measurable information about matches which we can use to predict
the outcome for a match in 2015.
For example, can the “winning margin” in past games help our predictions?
Season  Round  Home team  Winner    Winning margin
2011    R01    Adelaide   Adelaide  20
2012    R03    Hawthorn   Hawthorn  56
2013    R06    Adelaide   Hawthorn  11
2014    R17    Adelaide   Hawthorn  12
2015    R12    Adelaide   ?         ?
To predict the probability of Hawthorn winning in 2015 we can only use statistics about the margin in previous matches. Taking the margin from Hawthorn's point of view (so Adelaide's 20-point win counts as -20):
Mean margin of previous matches (Hawthorn-Adelaide) ⇒ 14.75
Maximum margin of previous matches (Hawthorn-Adelaide) ⇒ 56
Minimum margin of previous matches (Hawthorn-Adelaide) ⇒ -20
But which one is a good predictor?
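As a worked example, the three margin features for this pairing can be computed from the four past results. The variable and key names are illustrative; margins are signed from Hawthorn's point of view, which is the convention that yields the minimum of -20.

```python
from statistics import mean

# Margins from Hawthorn's point of view, 2011-2014 (negative = Hawthorn lost).
hawthorn_margins = [-20, 56, 11, 12]

features = {
    "mean_margin": mean(hawthorn_margins),
    "max_margin": max(hawthorn_margins),
    "min_margin": min(hawthorn_margins),
}
print(features)  # {'mean_margin': 14.75, 'max_margin': 56, 'min_margin': -20}
```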
Is Mean Margin a good predictor of winning?
Distribution of the Mean Margin computed on previous games, for matches 2000-2014: games won (Red) and games lost (Blue).
Mean Margin is a good predictor if the two distributions are well separated.
[Figure: histogram of Mean Margin in Previous Games (x-axis, -200 to 200) against Frequency (y-axis), for Lose and Win]
Insights
If a team has Mean Margin more than 100 it is likely to win
If a team has Mean Margin less than -90 it is likely to lose
Min Margin as predictor of winning
[Figure: histogram of Min Margin in Previous Games (x-axis, -200 to 200) against Frequency (y-axis), for Lose and Win]
Insights
If a team has been defeated in the past by as many as 150 points it is
likely to lose
Max Margin as predictor of winning
[Figure: histogram of Max Margin in Previous Games (x-axis, -200 to 200) against Frequency (y-axis), for Lose and Win]
Insights
If a team has won in the past by as many as 150 points it is likely to win
Other Features
Similarly to the margin of the final score between two teams, we can compute
the margin for other statistics:
Number of Kicks
Number of Inside 50
Number of Disposals
Number of Clearances
Rank of Attributes based on Prediction Errors (Best at the top)
Score [2]  Name
0.0449     Mean Margin Inside 50
0.0408     Mean Margin Score
0.0361     Max Margin Score
0.0325     Mean Margin Disposals
[2] According to Information Gain
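To give an idea of what an Information Gain score measures, here is a minimal sketch that scores a numeric feature by the entropy reduction of a single threshold split. It is a simplification under stated assumptions: the threshold, the function names, and the toy data are made up for illustration, and a tool such as WEKA may discretize numeric attributes differently, so this would not reproduce the scores in the table.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting `labels` on `values >= threshold`."""
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - remainder

# Made-up toy data: mean margin in previous games and whether the team won.
mean_margins = [-40, -15, -5, 10, 25, 60]
won = [0, 0, 1, 0, 1, 1]
print(round(information_gain(mean_margins, won, threshold=0), 3))  # 0.082
```

A feature that separates wins from losses perfectly would score 1 bit here, and a feature unrelated to the outcome would score close to 0.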
Evaluation of Prediction Error
I evaluated the model on the prediction of outcomes for 2015 matches:
logloss = 0.682 without statistics (just knowing the teams that are playing)
logloss = 0.640 with statistics
This is obtained with a black-box model (Random Forest), which is accurate but difficult to interpret.
Can we get a simpler model?
Interestingly, the simplest model obtained automatically from this data is:
(Mean-Margin ≥ -0.25 AND location = home) ⇒ win with probability 63.8%
else win with probability 36.8%
However, this shows a higher error: logloss = 0.689, barely better than the 0.693 of always predicting 50% (it does not take into account the actual teams that are playing)
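Written as code, the simple rule is a two-leaf decision. This sketch transcribes only the rule quoted above; the function and argument names are illustrative assumptions.

```python
def simple_rule(mean_margin, is_home):
    """P(win) for a team under the two-leaf rule quoted above."""
    if mean_margin >= -0.25 and is_home:
        return 0.638
    return 0.368

print(simple_rule(14.75, is_home=True))   # 0.638
print(simple_rule(14.75, is_home=False))  # 0.368
```

Only two distinct probabilities can ever be output, and both are fairly close to 50%, which is consistent with a logloss only slightly below 0.693.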
Remark about Data on Previous matches
We have to be careful about taking into account matches played too long ago.
Indeed, the best prediction (according to our features) is obtained only with
matches from 2014:
[Figure: Error in Prediction. logloss (y-axis) against Least Recent Matches used (x-axis, 2000 to 2014)]
This is probably because 2014 teams are very similar to 2015 teams.
It would be interesting to see which top players moved between teams in
past years.
Summary
It is possible to predict the outcome of future matches with reasonable accuracy in 2 days of work:
Using features obtained from the score margin and from margins based on the number of inside 50s and the number of disposals
Combining these features using a model (Random Forest, logloss = 0.640), while still getting insights from each feature individually
Knowing that data about recent matches is more helpful
A small increase in error can be traded for model simplicity
Technicalities
I performed feature engineering in Python and predictions with WEKA.
What I would have done if I had more time
There are a number of things that could improve my model but that I did
not have the chance to try for lack of time:
Predict the outcome of a match on round X in 2015 based on matches
played in previous rounds in 2015
Use many other statistics: e.g. handballs, tackles
Use data about previously played finals
Introduce player level features: rank all the players based on goals and
count the number of top players a team is going to employ during the
match
Team strategy features (difficult to encode)
Use Sportsbet and other companies’ odds (not fair for my entry but it
would be fair in real practice)
Other interesting things besides predicting match outcomes...
It would be interesting to analyze the data and see:
if there are players that are correlated with winning/losing games
characteristics of Brownlow Medal winners
probabilities of winning after losing the first/second/third quarters
identifying the ‘turning points’ in important matches (which players are
involved in changing the outcome of a match?)
Thank you.
Questions?
Simone Romano
simone.romano@unimelb.edu.au
@ialuronico