Chance detection in football broadcasts

Feature extraction and classification in football
streams using vision and deep learning
Auke vanderSchaar – Stratagem Technologies – London Machine Learning Meetup – 11 December 2017

Stratagem Technologies
● Stratagem Technologies is a machine learning financial technology
company focused on sports betting
– Sports prediction as an alternative financial asset class
● Predictive modeling requires historical data
● Trading based on in-play predictions requires online (real-time) data
● Fortunately, sport broadcast videos are ubiquitous and a rich source
of both historical and real-time information at a reasonable cost
● Challenge: how to exploit this information to improve trading?
– Chance detection (Focus is on detection, not on anticipating chances!)

Why chance detection?
● Chance: shot (attempt) on goal
● Chances occur more frequently than goals
– Chances are therefore a more meaningful statistic and can signify momentum in a football game.
● Human analysts annotate chances in football matches in more than 20 leagues.
– A football match has approx. 20 chances
– Chances are divided into 6 categories, from “poor” to “superb”
– The goal conversion rate for each category is known (next slide)
– Annotations are used as input to prediction models for trading
– However, they are not precise enough for training a computer vision model
● Precision of tagging is approx. 1 second => very poor

Chance types
What is the conversion rate of each chance type to a goal?
● Known and exploited for trading
● E.g. for ‘superb’ (unmissable) chances such as clear open nets, the rate is ~0.8 (8 out of 10 superb chances result in a goal)

Detecting Chances – Challenges for Computer Vision
● Small Field of View (FOV), short scenes, replays and close-ups prevent building up a
consistent view over time and space.
● Unbalanced data
– A chance lasts approx. 1-2 seconds, so at most about 1% of a game is part of a chance.
– One season of one league (approx. 300 games) has only 6000 chances but results in 450 hours (40M frames) of video.
● The difference between “a chance” and “not a chance” is subtle
– Whether or not the ball is touched by an attacking player makes the difference.
● Noisy labels
– Sometimes a replay is tagged
– Some chances are missed
– Chances close to each other are collapsed into one
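
The imbalance figures above can be checked with a few lines of arithmetic. This is only a sketch: the hours and chance counts are the ones quoted on the slide, while the 25 fps frame rate is an assumption.

```python
# Figures quoted on the slide. FPS = 25 is an assumption about the broadcast rate.
VIDEO_HOURS = 450
CHANCES = 6000
FPS = 25

total_seconds = VIDEO_HOURS * 3600
total_frames = total_seconds * FPS


def positive_fraction(chance_seconds):
    """Fraction of the footage that belongs to a chance of the given duration."""
    return CHANCES * chance_seconds / total_seconds


low = positive_fraction(1)   # every chance lasts 1 second
high = positive_fraction(2)  # every chance lasts 2 seconds

print(f"total frames: {total_frames / 1e6:.1f}M")
print(f"positive share of footage: {low:.2%} to {high:.2%}")
```

With these inputs the positive share stays well below 1% of the footage, which is why a classifier that always answers "not a chance" would already look accurate.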

Chances – “negative” samples

Chance - “positive” samples

Datasets
The dataset grows by approx 300 matches (450 hours) per week!
– 600 fixtures accurately annotated (and increasing), used for training and validation/testing. Images from one fixture can only be in one (training/test) set.
– 40 fixtures as a holdout set. These are never used for training or validation by any of the systems.
– 1100 fixtures from last season from four major leagues, processed for downstream evaluation

Unique challenges
● Not only is the data unbalanced, the difference between the positive and negative sample sets is also small
– A limited number of positive samples
● The mapping of the ground truth (annotations) to video frames is not 1:1
– This complicates a data-driven approach, but rule-based approaches are brittle
● 1000s of hours of video limit the amount of processing possible and rule out more sophisticated methods
– Any promising method must be evaluated on numerous videos.
● Only then will the impact on precision become clear
→ exploit the video!
● Raise the signal from the noise

What is the aim of the Computer Vision systems?
Each system takes a video as input and outputs a list of chances, each with the side of the field (attacking team) and the game time at which it occurred.
3 alternatives: why?
● Football vision
– Object detection and camera view to bird's-eye view conversion
– A high-level feature extractor that can be used as input to an ML algorithm trained to detect chances
● ConvNet feature extractor
– Off-the-shelf, state-of-the-art, single-model deep neural network (CNN) with the classification layer removed.
– Train an ML algorithm using the features of the extractor as input
● ConvNet End to End
– Neural net trained end-to-end to do chance detection

System requirements
● Processing a large number of videos and training models is very resource intensive
– Best fast method (best speed/quality trade-off)
– A system should process approx. 20 to 30 fixtures per day
● Must generalize to all leagues
– Robust to different viewing conditions

Examples of viewing conditions

Football vision
Performs object detection and converts the position of each object
(player) to a 2D field position

Vision – object detection
● R-CNN [1]
– Two stage detector
– Region proposals: more precise but slower
– Quoted: ~5 fps @ 600x400
● Football vision Python
– Detector for players based on ResNet50
– Detector for ball and goalpost corner based on VGG16
– 1 fps
● SSD [2]
– Single phase detector
– Quoted: ~45 fps @ 300x300
– 600x600 @ 15 fps
● Ball is small
– Faster but less precise
● Football vision C++
– VGG16
– 15 fps
[1] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
[2] SSD: Single Shot MultiBox Detector
[3] Speed/accuracy trade-offs for modern convolutional object detectors

Vision - Homography
● Two views of a 2D plane (e.g. a football field) taken from different viewpoints are related by a homographic transform.
● The transform matrix H can be calculated by finding at least four corresponding points in the two views.
[1] Multiple View Geometry in Computer Vision, Hartley and Zisserman
[2] OpenCV tutorial
[3] Chess board image credit: 2D projective transformations (homographies), Christiano Gava
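
As an illustration of the four-point statement on this slide, here is a minimal pure-Python sketch that solves for H from exactly four correspondences. It is not the talk's implementation: the function names and the pixel coordinates are hypothetical, the target rectangle uses the standard penalty-area size in metres, and the solver fixes the bottom-right entry of H to 1, which assumes that entry is non-zero.

```python
def solve_linear(a, b):
    """Solve a·x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def find_homography(src, dst):
    """Estimate the 3x3 matrix H mapping four src points onto four dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # Two linear equations per correspondence, with H[2][2] fixed to 1.
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(H, point):
    """Map an image point to field coordinates (homogeneous divide included)."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    u = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
    v = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
    return u, v


# Hypothetical pixel positions of the four penalty-area corners in a camera view.
image_corners = [(100, 300), (500, 300), (580, 400), (20, 400)]
# The same corners in a bird's-eye view, in metres (penalty area: 40.32 m x 16.5 m).
field_corners = [(0.0, 0.0), (40.32, 0.0), (40.32, 16.5), (0.0, 16.5)]

H = find_homography(image_corners, field_corners)
player_on_field = apply_homography(H, (300, 350))  # hypothetical player foot position
print(player_on_field)
```

With more than four correspondences, or noisy ones, a least-squares or robust estimator would be the usual choice; the exact solve above only shows why four points are the minimum.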

Vision – Homography – pitch line detection
● A broadcast offers only one camera view at a time (monocular)
– The camera follows the action by panning and zooming, and sometimes switches to a different viewpoint.
– Pitch lines can serve as reference points
– Use pitch line detection to find key markers in the camera view
– The position of the pitch lines is known in the 2D field (bird's-eye) view
● There exists no ground-truth dataset for pitch line detection
– This excludes an ML (data-driven) approach.
● Use rule-based pitch line detection
– Which can be used to generate data for training a neural net (experimental)

Vision – pitch line detection – Hough transform
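
For readers unfamiliar with the technique named in this slide title, the following is a minimal sketch of a Hough transform for straight lines. It is not the talk's pipeline: the input is assumed to be a list of candidate line pixels (for example white pixels on the pitch), and the function name, bin sizes and sample points are all hypothetical.

```python
import math


def hough_peak(points, theta_steps=180, rho_step=1.0):
    """Return (rho, theta_degrees, votes) of the strongest line through `points`.

    Each point votes for every line x*cos(theta) + y*sin(theta) = rho that
    passes through it; collinear points pile their votes into the same bin.
    """
    votes = {}
    for x, y in points:
        for t in range(theta_steps):
            theta = math.pi * t / theta_steps
            rho = x * math.cos(theta) + y * math.sin(theta)
            key = (round(rho / rho_step), t)
            votes[key] = votes.get(key, 0) + 1
    (rho_bin, t), count = max(votes.items(), key=lambda kv: kv[1])
    return rho_bin * rho_step, 180.0 * t / theta_steps, count


# Hypothetical input: 100 pixels on the horizontal line y = 50, plus three outliers.
line_pixels = [(x, 50) for x in range(100)]
outliers = [(10, 10), (70, 20), (30, 80)]

rho, theta_deg, n_votes = hough_peak(line_pixels + outliers)
print(rho, theta_deg, n_votes)
```

The outliers spread their votes over many bins, so the peak still corresponds to the line. A production system would more likely use an optimized library routine than this dictionary-based accumulator.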

Vision - System

Vision – Homography – DL (experimental)
● Deep Image Homography [1]
– A neural net can learn the relative homographic parameters given two images related by a homography
● Can it also learn when only one image is given together with the homographic parameters?
– The transform parameters are produced by “football vision”.
[1] Deep Image Homography Estimation (arXiv)

Vision – Homography – DL - visualization
● What does the DNN use to find the homographic transform parameters?
– Visualize with Grad-CAM [2]
[2] Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization

Vision – Homography – DL - visualization

Why CNN?
● Data driven
● Fast (faster than real-time, framerate > 25 fps)
● Development is also faster
● Training scales easily to a large number of fixtures
● Simple from a system point of view
● Can combine multiple detectors

Deep Learning - CNN feature extractor
● Take a state-of-the-art single-model CNN architecture with weights trained on ImageNet (ILSVRC) [1], remove the last layer and use it as a feature extractor.
– InceptionV3 (GoogLeNet family) [2] @ 299x299
– Features learned for classification on ImageNet are useful for other related domains [3].
[1] ILSVRC: ImageNet Large Scale Visual Recognition Challenge
[2] Inception: Rethinking the Inception Architecture for Computer Vision
[3] Feature extraction: CNN Features off-the-shelf: an Astounding Baseline for Recognition

Deep Learning - CNN feature extractor – cont
– Feature extraction takes approx. 4 hours per football game
● Results in 500MB of compressed data per football match
– Once extracted, training and classification are relatively fast.
● Facilitates experimentation
● Single frame classifier
● Multi frame classifier: combines multiple frames (order not relevant) per prediction.
● Sequence classifier (order of frames relevant)
– The features of the whole training set (400 fixtures) are large to load.
● PCA reduces the accuracy
● Averaging features over frames seems to have little impact
– Aggregate features per second (25 frames)
● The large number of samples excludes many ML algorithms
– LDA performed best.
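
The per-second aggregation mentioned above can be sketched as a plain average over blocks of 25 frames. This is an illustrative sketch only: the function name is hypothetical, and tiny 3-dimensional toy vectors stand in for the real CNN feature vectors.

```python
def aggregate_per_second(frame_features, fps=25):
    """Average consecutive per-frame feature vectors into one vector per second.

    A trailing block with fewer than `fps` frames is averaged over the frames
    it actually contains.
    """
    seconds = []
    for start in range(0, len(frame_features), fps):
        chunk = frame_features[start:start + fps]
        dim = len(chunk[0])
        seconds.append([sum(f[i] for f in chunk) / len(chunk) for i in range(dim)])
    return seconds


# Toy input: 50 frames (2 seconds at 25 fps) of 3-dimensional features.
frames = [[float(i), 1.0, 0.0] for i in range(50)]
per_second = aggregate_per_second(frames)
print(len(per_second), per_second[0], per_second[1])
```

Averaging cuts the number of samples by a factor of 25 while keeping the feature dimension unchanged, which is what makes loading several hundred fixtures at once more manageable.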

Deep Learning - CNN feature extractor – cont

CNN feature extractor – chance classifier

CNN end-to-end
● Use a state-of-the-art single-model CNN (VGG, ResNet, InceptionVx), remove the classification layer and add your own classifier layer on top
Allows choice in
– parameters (resolution, image style (flow), …)
– number of targets
● left/right detector
– fine-tuning
– domain-specific visualization
but often requires training from scratch

CNN end-to-end
● Required number of pictures for training
– 400 fixtures × 20 chances × 4-second window × 2 images/second × 2 labels (0/1) = ~50-200K
● Hyper-parameter search space is large
– Early pruning
● Training and evaluation take time, and early feedback is required to make progress and avoid wasting a scarce resource
● Start with a small number of images
– All models are evaluated against a fixed set of images while training is ongoing.
– Only the top performers are allowed to the next stage (while optionally keeping the trained weights)
– Likewise, the top performers are evaluated against the ground truth (human analyst annotations) while training is ongoing.
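
The image-count arithmetic and the staged pruning described above can be sketched as follows. The talk does not spell out the exact scheme, so this is an assumption-laden illustration that resembles successive halving: `staged_pruning`, `toy_score`, the stage sizes and the learning-rate grid are all hypothetical, and the toy scorer stands in for real training plus evaluation on a fixed image set.

```python
import math

# Image count from the slide: fixtures x chances x window (s) x images/s x labels.
n_images = 400 * 20 * 4 * 2 * 2
print(f"training images: {n_images}")


def staged_pruning(configs, train_and_score, stage_sizes, keep_fraction=0.5):
    """Train every surviving config on progressively more images.

    After each stage only the best `keep_fraction` of configs (at least one)
    move on to the next, larger stage.
    """
    survivors = list(configs)
    for size in stage_sizes:
        ranked = sorted(survivors, key=lambda c: train_and_score(c, size), reverse=True)
        keep = max(1, int(len(ranked) * keep_fraction))
        survivors = ranked[:keep]
    return survivors


def toy_score(config, size):
    """Stand-in for 'train on `size` images, then score on a fixed image set'.

    Pretends that a learning rate of 1e-2 is ideal; `size` is ignored here.
    """
    return -abs(math.log10(config["lr"]) + 2)


candidates = [{"lr": 1e-1}, {"lr": 1e-2}, {"lr": 1e-3}, {"lr": 1e-4}]
best = staged_pruning(candidates, toy_score, stage_sizes=[1_000, 10_000])
print(best)
```

The point of the design is that weak configurations are discarded after cheap, small-scale runs, so most of the compute budget goes to the candidates that survive.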

CNN end to end - cont

Metric – Precision Recall F1-score
● Performance of the minority class
● Precision: fraction of detected positives (predicted chances) that are true positives
● Recall: fraction of all positives that are detected
● F1 score: harmonic mean of precision and recall
● Not averaged per fixture but calculated after taking all detections from all fixtures into account
● Example: a fixture with 20 chances; the chance detector detects 30 chances, of which 10 are correct
● Precision = 10/30 ≈ 0.33
● Recall = 10/20 = 0.5
● F1 = 2 · 0.33 · 0.5 / (0.33 + 0.5) = 0.4
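
A minimal sketch of the pooled metric computation described above: counts are summed over all fixtures first and the ratios are taken once at the end. The function name and the tuple layout are hypothetical. For the single-fixture example (20 chances, 30 detections, 10 correct) the exact values are precision 1/3, recall 1/2 and F1 0.4.

```python
def pooled_prf(fixtures):
    """Compute precision, recall and F1 from counts pooled over all fixtures.

    `fixtures` is a list of (true_chances, detected_chances, correct_detections).
    """
    total_true = sum(f[0] for f in fixtures)
    total_detected = sum(f[1] for f in fixtures)
    total_correct = sum(f[2] for f in fixtures)

    precision = total_correct / total_detected if total_detected else 0.0
    recall = total_correct / total_true if total_true else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# The example from the slide: 20 chances, 30 detections, 10 of them correct.
precision, recall, f1 = pooled_prf([(20, 30, 10)])
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Pooling the counts means a fixture with many chances weighs more than one with few, which differs from averaging per-fixture scores.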

Results

Results
● Noisy labels
– Train with the largest number of images, for the longest time, with a decaying learning rate schedule and SGD + momentum
● Promising results are achieved with ResNet50 @ 600x400 using ball positions as an extra regression target
● Ball positions are generated by football vision
● Only 200 fixtures with ball positions are available

Discussion
● There is now a chance detector with high recall which detects chances or chance-like situations reasonably precisely in time.
– A chance is not a (long) sequence
– A chance is a very short event where the attacking player purposely pushes the ball towards the goal
● With this detector we can thus create a dataset with only (known) chances and chance-like situations and refine it further

Conclusion
● Classical CV methods produce very general, high-level, easy-to-interpret features that can be used as input to many different types of ML models
– Not a good chance detector!
– Very useful for generating labels which are used to improve the CNN
● The CNN feature extractor is flexible, facilitates experimentation and initially has the upper hand.
● CNN end-to-end results in the best (sharpest), fastest and, from a system POV, simplest classifier
● Chance detection results in a useful signal for trading

The end
