Slide deck of a presentation given at the 2015 OptaPro Analytics Forum on a statistical forecasting model that projects performance output of a football player as he transitions between multiple leagues in a career. The objective is to create a soccer equivalent of projection systems such as PECOTA in baseball and SCHOENE in basketball while incorporating machine learning techniques as much as possible. Work on this model began at the beginning of the year, so don't expect a lot of results to be presented. The goal of this talk is to present at a high level the objectives and methodology of the model, obtain feedback from the soccer analytics community, and gauge interest from the broader football industry.
Framework for Forecasting Professional Soccer Player Career Paths
1. 2015 OptaPro Analytics Forum
Framework for a Player Career Forecast Model Between Multiple Leagues
Howard Hamilton
Founder, Soccermetrics Research
2. 2015 OptaPro Analytics Forum
Developed a career statistical forecasting modelling framework for football players,
automated by applying machine-learning techniques.
Inputs
1. Season statistical performance
2. Physical / playing characteristics
Outputs
1. Identify peer group of players with comparable performance
2. Forecast future statistical performance over a limited horizon
3. Translate performance in one domestic league competition to performance in another
Expected Interest
Clubs
Media
Betting
Fantasy
Early Stage: Framework > Results
Main Points
3. 2015 OptaPro Analytics Forum
Baseball
1. Similarity Scores (Bill James, 1980s)
2. Vladimir Forecasting System (Gary Huckabay, 1990s)
3. PECOTA (Nate Silver/Baseball Prospectus, 2003)
PECOTA-inspired forecasting models in other sports
1. SCHOENE (Kevin Pelton/Basketball Prospectus/ESPN, mid 2000s)
2. KUBIAK (Aaron Schatz/Football Outsiders, mid 2000s)
3. VUKOTA (Puck Prospectus, 2010)
Individual / team projection models in football
1. Aaron Nielsen (ENB Sports)
• One-year projection of individual/team performance
2. Pérez Sánchez et al. (2013)
• Estimating goal-scoring performance in Spanish league
Forecasting Statistical Performance in Sport
Prior Art
4. 2015 OptaPro Analytics Forum
Data scarcity
• Range of seasons
• Statistical categories collected
• League variations
Characteristics of domestic leagues
• Differences in aging curves between leagues
• Would a 'universal' aging curve work? Not sure...
• Statistical translations between leagues
• Some leagues are very connected, others less so
Challenges
5. 2015 OptaPro Analytics Forum
Data Source: ENB Soccer Database
• 60,000+ players
• 75 domestic league competitions
• 500+ clubs
Individual season statistics
• 1992-93 to 2011-12 (European)
• 1992 to 2012 (American/Scandinavian/Japanese)
Database Analysis
All players
• Season
• Team
• Competition
• Appearances
• Subs
• Minutes
• Yellows / reds
Field players
• Goals
• Assists
• Shots
• Fouls
Goalkeepers
• Goals allowed
• Clean sheets
• Shots faced
• Wins
• Draws
• Losses
Modeling Components
6. 2015 OptaPro Analytics Forum
Normalize statistical categories
• Convert statistical values of players in same competition and season to “standard score”
• Places statistical performances on one standard distribution
• This is what allows us to compare players
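The normalization step above can be sketched in a few lines of Python. This is a minimal illustration, not the model's actual code: the record layout, the field names, and the use of the population standard deviation are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev

def standard_scores(records, stat):
    """Return z-scores of `stat`, computed within each (competition, season) group."""
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["competition"], rec["season"])].append(rec[stat])
    # Mean and population standard deviation per competition-season (an assumption).
    params = {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}
    scores = []
    for rec in records:
        mu, sigma = params[(rec["competition"], rec["season"])]
        # Guard against groups where every player has the same value.
        scores.append(0.0 if sigma == 0 else (rec[stat] - mu) / sigma)
    return scores

# Hypothetical records; league and player names are placeholders.
records = [
    {"player": "A", "competition": "League X", "season": "2010-11", "goals": 20},
    {"player": "B", "competition": "League X", "season": "2010-11", "goals": 10},
    {"player": "C", "competition": "League X", "season": "2010-11", "goals": 0},
    {"player": "D", "competition": "League Y", "season": "2010-11", "goals": 5},
    {"player": "E", "competition": "League Y", "season": "2010-11", "goals": 5},
]
z = standard_scores(records, "goals")
```

Because each score is computed relative to the player's own competition and season, a value of +1.2 means the same thing in any league: about 1.2 standard deviations above that league's average for that year.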
Identify K comparable players (“nearest neighbors”)
• Consider players of same age and position
• Calculate similarity score between statistical records
• Comparable players: Score about 0.90 - 0.95
• Relax threshold for “unique” players
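A rough sketch of the comparable-player search follows. The slides do not define the similarity score, so the formula used here, 1 / (1 + Euclidean distance) between standard-score vectors, is purely an assumption chosen because it yields values in (0, 1]; the record fields and player names are hypothetical as well.

```python
import math

def similarity(a, b):
    """Hypothetical similarity in (0, 1]: 1 for identical records, falling with distance."""
    return 1.0 / (1.0 + math.dist(a, b))

def comparable_players(target, candidates, k=3, threshold=0.90):
    """Return up to k (name, score) pairs sharing the target's age and position."""
    scored = [
        (c["name"], similarity(target["z"], c["z"]))
        for c in candidates
        if c["age"] == target["age"] and c["position"] == target["position"]
    ]
    # Keep only candidates above the threshold, best match first.
    scored = [s for s in scored if s[1] >= threshold]
    scored.sort(key=lambda s: s[1], reverse=True)
    return scored[:k]

# Placeholder data: "z" holds standard scores for two statistical categories.
target = {"name": "T", "age": 27, "position": "F", "z": [2.0, 1.0]}
candidates = [
    {"name": "P1", "age": 27, "position": "F", "z": [2.0, 1.05]},
    {"name": "P2", "age": 27, "position": "F", "z": [1.9, 1.0]},
    {"name": "P3", "age": 27, "position": "F", "z": [1.0, 0.0]},
    {"name": "P4", "age": 24, "position": "F", "z": [2.0, 1.0]},  # wrong age
    {"name": "P5", "age": 27, "position": "D", "z": [2.0, 1.0]},  # wrong position
]
comps = comparable_players(target, candidates)
relaxed = comparable_players(target, candidates, threshold=0.40)
```

Lowering `threshold` corresponds to relaxing the cutoff for "unique" players: with the default 0.90 only P1 and P2 qualify, while the relaxed call also admits P3.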
Forecast future performance with historical performance of comparable players
• Using regression techniques
• Adjust for aging and regression to mean
• Convert to statistics for league competition of interest
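One simple way to picture the forecasting step is shown below. It is only a sketch under stated assumptions: it swaps the regression techniques mentioned above for a similarity-weighted average, treats aging as implicit in the comparables' own next-season records, and uses a made-up shrinkage factor (`shrink`) to stand in for regression to the mean. The league mean and standard deviation are placeholder numbers.

```python
def forecast_z(comparables, shrink=0.8):
    """Similarity-weighted mean of comparables' next-season z-scores, shrunk toward 0.

    `comparables` is a list of (similarity, next_season_z) pairs. Because the
    league average is 0 in standard-score space, multiplying by `shrink`
    pulls the forecast toward the mean. The value 0.8 is an assumption.
    """
    total_weight = sum(sim for sim, _ in comparables)
    weighted = sum(sim * next_z for sim, next_z in comparables) / total_weight
    return shrink * weighted

def to_league_units(z, league_mean, league_sd):
    """Invert the standard score using the target league's mean and standard deviation."""
    return league_mean + z * league_sd

# Hypothetical comparables: (similarity, z-score in the season after the match).
comparables = [(0.95, 1.0), (0.90, 2.0), (0.90, 1.5)]
z_next = forecast_z(comparables)
goals = to_league_units(z_next, league_mean=4.0, league_sd=5.0)
```

Working in standard-score space until the last step is what makes the league translation a single line: the same forecast can be converted with any target league's mean and standard deviation.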
z = (x − μ)/σ
K-NN
Model Description
7. 2015 OptaPro Analytics Forum
Player | League | Season | Similarity
Osvaldo Val Baiano | Brazil Serie B | 2007 | 0.961
Wayne Rooney | English Premier League | 2011-2012 | 0.957
Oscar Cardozo | Portugal Primeira Liga | 2009-2010 | 0.954
Maciej Zurawski | Poland Ekstraklasa | 2002-2003 | 0.939
Carlos Tevez | English Premier League | 2010-2011 | 0.926
Javi Moreno | Spanish Primera | 2000-2001 | 0.925
Katlego Mphela | South Africa PSL | 2010-2011 | 0.913
Matt Tubbs | England Conference | 2010-2011 | 0.913
Kris Boyd | Scotland Premier League | 2009-2010 | 0.905
Goncalves Jonas | Brazil Serie A | 2010 | 0.904
Rickie Lambert | England League One | 2008-2009 | 0.901
Mario Bermejo | Spanish Segunda | 2004-2005 | 0.897
Alan Shearer | English Premier League | 1996-1997 | 0.877
Kevin Phillips | English Premier League | 1999-2000 | 0.863
Photo by Simon Harriyott
Cristiano Ronaldo: Forward, aged 27 (Spanish Primera 2011/12)
Active Player.
Scored 46 goals in 2011/12 La Liga season.
Nearest Neighbor Results
Nearest Neighbor groups leading goalscorers at Ronaldo's age
0.96 similarity metric – few players had a season as dominant
8. 2015 OptaPro Analytics Forum
Marvin Bejarano: Defender, aged 21 (Bolivia Liga Profesional 2008)
Player | League | Season | Similarity
Fernando Tobio | Argentina Primera | 2009-2010 | 0.996
Charlie Wassmer | England League Two | 2011-2012 | 0.990
Oswaldo Alanis | Mexico Primera | 2009-2010 | 0.985
Jan Vertonghen | Netherlands Eredivisie | 2007-2008 | 0.984
Paul Papp | Romania Liga I | 2009-2010 | 0.957
Santiago Vergini | Paraguay Primera | 2009 | 0.957
Mauricio Casierra | Colombia Primera | 2006 | 0.957
Rafael Delgado | Argentina Nacional B | 2010-2011 | 0.955
Konstantin Engel | Germany 2 Bundesliga | 2008-2009 | 0.954
Jae Sung Lee | South Korea K-League | 2009 | 0.953
Koybasi Ismail | Turkey Super Lig | 2009-2010 | 0.953
Luke O'Brien | England League Two | 2008-2009 | 0.951
Hector Quinones | Colombia Primera | 2012 | 0.950
Mate Ghvinianidze | Germany 2 Bundesliga | 2006-2007 | 0.950
Franz Schiemer | Austria 1 Bundesliga | 2006-2007 | 0.947
Active Player.
Has played for one club over his career.
5 caps for Bolivia.
0.996 similarity metric – very comparable, but limited defensive data
Nearest Neighbor Results
9. 2015 OptaPro Analytics Forum
Iker Casillas: Goalkeeper, aged 26 (Spanish Primera, 2006-2007)
Active Player.
Has played for one club over his career.
450+ appearances at Real Madrid, 160 caps for Spain.
Interesting that Gianluigi Buffon is closest comparable at 26 y/o
Nearest Neighbor Results
Player | League | Season | Similarity
Gianluigi Buffon | Italy Serie A | 2003-2004 | 0.994
Mark Crossley | English Premier League | 1994-1995 | 0.992
Dionissis Chiotis | Greece Super League | 2002-2003 | 0.990
Steve Mandanda | France Ligue 1 | 2010-2011 | 0.989
Marco Wolfli | Switzerland Super League | 2007-2008 | 0.989
Shay Given | English Premier League | 2001-2002 | 0.986
Guillermo Ochoa | Mexico Primera | 2010-2011 | 0.986
Eduardo Martini | Brazil Serie A | 2004 | 0.985
Morgan de Sanctis | Italy Serie A | 2002-2003 | 0.984
Hiroki Iikura | Japan J1-League | 2011 | 0.982
Cesar Lainez | Spanish Segunda | 2002-2003 | 0.981
Marcelo Grohe | Brazil Serie A | 2012 | 0.981
Hitoshi Sogahata | Japan J1-League | 2005 | 0.980
Henri Sillanpaa | Finland Veikkausliiga | 2004 | 0.980
10. 2015 OptaPro Analytics Forum
Projecting career performance is difficult
• Next steps:
● Use nearest neighbors to forecast future performance
● Quantify adjustments for age, league quality, position
● Create multiple career forecast paths with probabilities
• Limited horizons important (2-3 years)
• Probabilistic projections sensible, not necessarily useful
• Accuracy vs. clarity
• Diverse range of statistical categories necessary:
• Attacking and defending contributions and impact
• Advanced metrics
Data normalization is a necessity!
Club projections are a logical step
Need to enforce a “conservation of goals” in the universe of data in our system, i.e.:
Total goals scored == total goals conceded
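The constraint can be illustrated with a small sketch. The slides state the requirement but not how it would be enforced, so the proportional rescaling below is an assumption, as are the club names and projected totals. It also assumes a closed set of clubs that only play each other, since that is the case in which goals scored and conceded must balance.

```python
def enforce_goal_conservation(projections):
    """Rescale projected goals against so that league totals for and against match.

    `projections` maps club -> {"gf": goals for, "ga": goals against}.
    Proportional rescaling of "ga" is one possible choice, not the model's method.
    """
    total_for = sum(p["gf"] for p in projections.values())
    total_against = sum(p["ga"] for p in projections.values())
    scale = total_for / total_against
    return {
        club: {"gf": p["gf"], "ga": p["ga"] * scale}
        for club, p in projections.items()
    }

# Hypothetical club projections that do not balance (130 scored vs. 120 conceded).
projections = {
    "Club A": {"gf": 60.0, "ga": 30.0},
    "Club B": {"gf": 40.0, "ga": 50.0},
    "Club C": {"gf": 30.0, "ga": 40.0},
}
balanced = enforce_goal_conservation(projections)
total_gf = sum(p["gf"] for p in balanced.values())
total_ga = sum(p["ga"] for p in balanced.values())
```

After rescaling, `total_gf` and `total_ga` agree to within floating-point error, while each club's share of goals conceded is preserved.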
Photo by Simon Harriyott
Conclusions
11. 2015 OptaPro Analytics Forum
Customization
• Integrate with financial/medical databases, scouting data
• Greatest utility at football operations/sporting director level
Biggest challenge: Data!
• Not just data on all players in league, but players in all other leagues of interest
• Some statistical categories not available in some leagues
• As always, data collection and analysis problems are non-trivial
Photo by JD Hancock
Knowledge Transfer
12. 2015 OptaPro Analytics Forum
Thank You!
Special Thanks To:
OptaPro (Invitation to Forum)
Aaron Nielsen (ENB Database access)
Simon Harriyott (Presentation at Forum)
For more information contact
Soccermetrics Research
info@soccermetrics.net
www.soccermetrics.net
@soccermetrics