2. 2
Agenda
Sportradar’s key datasets
The Fraud Detection System journey
The Model and the Challenges
From Day After to Real Time
The Inference Streaming Pipeline
3. 3
Sportradar’s key datasets
MTS BOOKMAKER
BET
- Bookmaker
- Channel (online, retail)
- Bet status (accepted, rejected, …)
- Stake
- Profit / Loss
- Sport event(s)
- Market(s) / outcome(s)
- Odds
- User info
COMPUTER VISION
LIVESCORE CRAWLERS
DATA JOURNALISTS
LIVE DATA
- Event details (status, time, score)
- Timeline (e.g. shots, fouls, goals)
- Tracking data (players, ball, …)
ODDS CRAWLERS
ODDS CHANGE
- Bookmaker
- Sport event
- Event details (status, time, score)
- Market
- Odds
600+ bookmakers crawled
~25 billion odds movements per year
200+ bookmakers with bet integration with Sportradar Managed Trading Services
~5 billion bets per year
ANALYTICS PLATFORM
900,000+ sport events per year covered live
9,500+ data journalists globally
70+ sports
DATA LAKE
- Model training
- Batch inferencing
- Performance monitoring
KAFKA
- Real-time inferencing
INTEGRITY ANALYSTS
Over 9,000 matches escalated since 2009
SPORT EVENT ANALYSIS
- Hotlisting data
- Escalation data
4. 4
The Fraud Detection System journey
Alerting system
- SME driven
- Looking at anomalous odds movements or values
- 125,000+ alerts per year
- Fewer than 4% of them found suspicious after manual review
Alert scoring model
- Binary classification model (on alerts)
- Features: Alert data + match properties
- 70% of alerts automatically classified as not suspicious
Match Fixing Detection model – day after
- Binary classification model (on matches)
- Batch inference on matches of the previous day
- Features: alert scores + betting data + match properties
Match Fixing Detection model – real time
- Features: subset of the previous model's features (some features cannot be computed in real time)
- Inference window: from one day before kick-off until the match is over
- In 2023, up to Q4, 814 escalated matches were initially flagged by the model (160 of them uniquely)
Timeline: 2018, 2020, 2022
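The alert scoring step can be pictured as a simple triage on the model's score: alerts scoring below a cut-off are closed automatically, and the rest go to analysts for manual review. The sketch below is only an illustration of that idea. The `Alert` class, the `triage` function and the threshold value are hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    alert_id: str
    score: float  # hypothetical: model's estimated probability that the alert is suspicious


def triage(alerts, threshold):
    """Split alerts into auto-dismissed ones and ones sent to manual review.

    The threshold is an assumed tuning parameter, not a value from the slides.
    """
    auto_dismissed = [a for a in alerts if a.score < threshold]
    # Highest scores first, so analysts see the most suspicious alerts at the top.
    for_review = sorted(
        (a for a in alerts if a.score >= threshold),
        key=lambda a: a.score,
        reverse=True,
    )
    return auto_dismissed, for_review


alerts = [
    Alert("a1", 0.02),
    Alert("a2", 0.40),
    Alert("a3", 0.01),
    Alert("a4", 0.85),
]
auto_dismissed, for_review = triage(alerts, threshold=0.10)
print([a.alert_id for a in auto_dismissed])
print([a.alert_id for a in for_review])
```

In practice the cut-off would be chosen on validation data, trading the share of alerts closed automatically against the risk of dismissing a genuinely suspicious one.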
5. 5
The Model and the Challenges
Binary classification model
Maximising the AUC on the validation set over multiple iterations
Extremely unbalanced dataset: 9,325 escalations out of 4,522,387
- Target variable: Hotlisted
- Weighted by: Escalated
Seasonality of sport events
- Random split on 3 years of historical data
Keeping up with evolving match-fixing trends and unexpected event outcomes
- Human in the loop and continuous retraining
Model explainability: crucial to provide evidence of fixing
- Feature contribution at inference time
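To see why AUC is a more informative metric than accuracy on data this unbalanced, the sketch below uses the two counts from the slide and a small rank-based AUC function. The `roc_auc` helper is written here for illustration only; it is not the presenters' code, and the toy labels and scores are made up.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Rank-based (Mann-Whitney) formulation; tied scores share an average rank.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Counts quoted on the slide.
escalations, total = 9_325, 4_522_387
positive_rate = escalations / total
always_negative_accuracy = 1.0 - positive_rate
print(f"positive rate: {positive_rate:.4%}")
print(f"accuracy of always predicting 'not fixed': {always_negative_accuracy:.4%}")

# Toy data (made up): a constant score carries no ranking information.
toy_labels = [0, 0, 1, 1]
print(roc_auc(toy_labels, [0.0, 0.0, 0.0, 0.0]))
print(roc_auc(toy_labels, [0.1, 0.4, 0.35, 0.8]))
```

A model that never flags anything is almost always "right" by accuracy, yet its AUC is no better than chance, which is one reason a ranking metric fits this problem.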
6. 6
From Day After to Real Time
Current time: 45'
Potential payout: 9,000 €
Est. probability: 1%
Expected profit: 9 €
Current time: 60'
Potential payout: 9,000 €
Est. probability: 25%
Expected profit: -2,175 €
Current time: End of match
Potential payout: 9,000 €
Est. probability: 100%
Expected profit: -9,000 €
Timeline: 45', 60', End of match
• A single bet placed at 45’
• Odds: 90
• Stake: 100 €
• Outcome: Green Team wins
• Changes in the probability for the outcome to occur
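The three expected-profit figures above can be reproduced with a short calculation. The formula used here is inferred from the slide's numbers and is an assumption: the bookmaker keeps the stake with probability 1 - p and pays out the full potential payout with probability p. The function name `expected_profit` is hypothetical.

```python
def expected_profit(stake, odds, win_probability):
    """Bookmaker's expected profit on a single bet.

    Assumed convention (inferred from the slide figures, not stated there):
    (1 - p) * stake - p * potential_payout, with potential_payout = stake * odds.
    """
    potential_payout = stake * odds
    return (1.0 - win_probability) * stake - win_probability * potential_payout


stake, odds = 100.0, 90.0  # the bet from the slide; potential payout is 9,000 €
for label, probability in [("45'", 0.01), ("60'", 0.25), ("End of match", 1.0)]:
    profit = expected_profit(stake, odds, probability)
    print(f"{label}: estimated probability {probability:.0%}, expected profit {profit:,.0f} €")
```

The bet itself never changes after 45'. Only the estimated probability of the outcome moves, which is why features built on it have to be recomputed as the match unfolds rather than once on the following day.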
7. 7
The Inference Streaming Pipeline
• Processes up to 2000 messages per second
• Outputs around 3 predictions per second
• Score, time, alert-based predictions
• Feature construction
• Consumes 3 Kafka topics and outputs to 1
• Modelling resources fetched from S3 on startup
• Aggregated features periodically retrieved from Redshift
• Outputs to S3 via NiFi, with Athena on top for analytics
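The gap between messages consumed and predictions emitted suggests that most messages only update per-match state, while a prediction is produced on specific triggers. The sketch below simulates that pattern in memory, without a Kafka client. It reads "score, time, alert-based predictions" as three trigger types, which is an interpretation. The topic names (`bets`, `alerts`, `live_data`), the message fields, the 15-minute interval and `toy_model` are all hypothetical placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class MatchState:
    total_stake: float = 0.0
    score: tuple = (0, 0)
    alert_count: int = 0
    last_prediction_minute: int = 0


def toy_model(state):
    """Stand-in for the real model: maps the state to a value in [0, 1)."""
    raw = state.alert_count + state.total_stake / 10_000.0
    return raw / (1.0 + raw)


def run_pipeline(messages, time_interval_minutes=15):
    """Update per-match features on every message; predict only on a trigger."""
    states = defaultdict(MatchState)
    predictions = []
    for msg in messages:
        state = states[msg["match_id"]]
        trigger = None
        if msg["topic"] == "bets":
            state.total_stake += msg["stake"]  # feature update only, no prediction
        elif msg["topic"] == "alerts":
            state.alert_count += 1
            trigger = "alert"
        elif msg["topic"] == "live_data":
            if msg["score"] != state.score:
                state.score = msg["score"]
                trigger = "score"
            elif msg["minute"] - state.last_prediction_minute >= time_interval_minutes:
                trigger = "time"
        if trigger is not None:
            state.last_prediction_minute = msg["minute"]
            predictions.append({
                "match_id": msg["match_id"],
                "minute": msg["minute"],
                "trigger": trigger,
                "fixing_score": toy_model(state),
            })
    return predictions


messages = [
    {"topic": "bets", "match_id": "m1", "minute": 10, "stake": 5_000.0},
    {"topic": "live_data", "match_id": "m1", "minute": 20, "score": (0, 0)},
    {"topic": "alerts", "match_id": "m1", "minute": 25},
    {"topic": "live_data", "match_id": "m1", "minute": 30, "score": (1, 0)},
    {"topic": "live_data", "match_id": "m1", "minute": 35, "score": (1, 0)},
]
predictions = run_pipeline(messages)
for prediction in predictions:
    print(prediction)
```

In a deployed version the loop body would sit inside a Kafka consumer poll loop and the predictions would be produced to the output topic. The in-memory list is used here only to keep the sketch runnable on its own.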