The KLT (Kanade-Lucas-Tomasi) tracker is a classic algorithm for visual object tracking, built on the Lucas-Kanade optical flow method published in 1981. It works by tracking feature points between consecutive video frames. The KLT tracker is still widely used thanks to its computational efficiency and its availability in many computer vision libraries. However, it is best suited to tracking textured objects and may struggle with uniform regions or large displacements between frames.
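As a minimal sketch of the idea (not the implementation found in any particular library), a single Lucas-Kanade update for one feature point can be written with NumPy alone; the window size and the synthetic frames below are illustrative assumptions, and a real KLT tracker iterates this update over an image pyramid to handle larger motions:

```python
import numpy as np

def lucas_kanade_step(prev, curr, pt, win=7):
    """Estimate the displacement of the point `pt` (x, y) between two
    grayscale frames with one Lucas-Kanade least-squares step."""
    x, y = int(pt[0]), int(pt[1])
    r = win // 2
    # spatial gradients (central differences) and the temporal difference
    Ix = (np.roll(prev, -1, axis=1) - np.roll(prev, 1, axis=1)) / 2.0
    Iy = (np.roll(prev, -1, axis=0) - np.roll(prev, 1, axis=0)) / 2.0
    It = curr - prev
    sl = (slice(y - r, y + r + 1), slice(x - r, x + r + 1))
    # least-squares solution of Ix*dx + Iy*dy = -It over the window
    A = np.stack([Ix[sl].ravel(), Iy[sl].ravel()], axis=1)
    b = -It[sl].ravel()
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d  # (dx, dy)
```

Feeding two frames of a Gaussian blob shifted by one pixel in x should recover a displacement of roughly (+1, 0); on uniform regions the gradient matrix becomes degenerate, which is exactly why the tracker prefers textured patches.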
Computer vision has received great attention over the last two decades.
This research field is important not only in security-related software but also in the advanced interface between people and computers, advanced control methods, and many other areas.
Overview of Video Object Tracking System — Editor IJMTER
The goal of a video object tracking system is to segment a region of interest from a video scene and keep track of its motion, position, and occlusion. Such a system has three steps: object detection, object classification, and object tracking. Object detection checks for the existence of objects in the video; detected objects can then be classified into various categories on the basis of their shape, motion, color, and texture; object tracking is performed by monitoring changes in the objects. In this paper we give an overview of different object detection, object classification, and object tracking techniques, and compare the techniques used at the various stages of tracking.
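The three-stage structure described above can be sketched as a minimal pipeline skeleton; the detector, classifier, and nearest-centroid matching used here are illustrative placeholders (assumptions), not a specific published system:

```python
from dataclasses import dataclass, field

@dataclass
class Track:
    track_id: int
    label: str
    centroid: tuple
    history: list = field(default_factory=list)

class TrackingPipeline:
    """Detection -> classification -> tracking, with tracking done by
    matching each detection to the nearest existing track centroid."""
    def __init__(self, detect, classify, max_dist=50.0):
        self.detect = detect      # frame -> list of (x, y) centroids
        self.classify = classify  # detection -> category label
        self.tracks = {}
        self.next_id = 0
        self.max_dist = max_dist

    def step(self, frame):
        for c in self.detect(frame):
            tid = self._match(c)
            if tid is None:  # no nearby track: start a new one
                tid = self.next_id
                self.next_id += 1
                self.tracks[tid] = Track(tid, self.classify(c), c)
            t = self.tracks[tid]
            t.centroid = c
            t.history.append(c)
        return self.tracks

    def _match(self, c):
        best, best_d = None, self.max_dist
        for tid, t in self.tracks.items():
            d = ((t.centroid[0] - c[0]) ** 2 + (t.centroid[1] - c[1]) ** 2) ** 0.5
            if d < best_d:
                best, best_d = tid, d
        return best
```

Real systems replace the nearest-centroid matching with motion models and appearance cues, but the division of labor between the three stages stays the same.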
Slide for Multi Object Tracking by Md. Minhazul Haque, Rajshahi University of Engineering and Technology
* Object
* Object Tracking
* Application
* Background Study
* How it works
* Multi-Object Tracking
* Solution
* Future Works
Multi-object tracking is a computer vision task in which objects belonging to different categories, such as cars, pedestrians, and animals, are tracked by analyzing video.
Object detection is an important computer vision technique with applications in several domains, such as autonomous driving and personal and industrial robotics. The slides below cover the history of object detection from before deep learning through recent research, its future directions, and some guidelines for choosing which type of object detector to use for your own project.
Object detection is a computer technology related to computer vision and image processing that deals with detecting instances of semantic objects of a certain class (such as humans, buildings, or cars) in digital images and videos. Well-researched domains of object detection include face detection and pedestrian detection. Object detection has applications in many areas of computer vision, including image retrieval and video surveillance.
Object Detection using Deep Neural Networks — Usman Qayyum
Recent talk at the PI school covering the following contents:
Object Detection
Recent Architecture of Deep NN for Object Detection
Object Detection on Embedded Computers (or for edge computing)
SqueezeNet for embedded computing
TinySSD (object detection for edge computing)
Slides from the UPC reading group on computer vision about the following paper:
Redmon, Joseph, Santosh Divvala, Ross Girshick, and Ali Farhadi. "You only look once: Unified, real-time object detection." arXiv preprint arXiv:1506.02640 (2015).
You Only Look Once: Unified, Real-Time Object Detection — DADAJONJURAKUZIEV
YOLO is a new approach to object detection: a single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation.
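To make the one-evaluation idea concrete, here is a hedged sketch of decoding a YOLO-style S × S × (B·5 + C) output grid. The tensor layout follows the original paper's formulation (each cell stores B boxes as x, y, w, h, confidence, with x, y relative to the cell, followed by C class probabilities shared by the cell), but the function name and thresholds are assumptions for illustration:

```python
import numpy as np

def decode_yolo_grid(pred, num_boxes=2, num_classes=20, conf_thresh=0.5):
    """Turn an S x S x (B*5 + C) prediction tensor into a list of
    detections (cx, cy, w, h, class_index, score)."""
    S = pred.shape[0]
    detections = []
    for row in range(S):
        for col in range(S):
            cell = pred[row, col]
            class_probs = cell[num_boxes * 5:]
            cls = int(np.argmax(class_probs))
            for b in range(num_boxes):
                x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                score = conf * class_probs[cls]
                if score >= conf_thresh:
                    # convert the cell-relative centre to [0, 1] image coords
                    cx = (col + x) / S
                    cy = (row + y) / S
                    detections.append((cx, cy, w, h, cls, float(score)))
    return detections
```

A full detector would follow this with non-maximum suppression; the point here is only that one forward pass yields every box and class score at once.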
Codetecon #KRK 3 - Object detection with Deep Learning — Matthew Opala
There has been enormous progress in object detection algorithms. From multi-stage methods like R-CNN to end-to-end ones like SSD or YOLO, the accuracy of the methods has improved significantly. Current applications include pedestrian detection for cars and face detection on Facebook.
But that’s just the beginning. I am going to show the algorithms for solving the problem, show what’s currently possible, and what will be possible in the near future.
A Small Helping Hand from me to my Engineering colleagues and my other friends in need of Object Detection
Yinyin Liu presents a model for object detection and localization called Fast R-CNN. She will show how to introduce an ROI pooling layer into neon, and how to add the PASCAL VOC dataset to interface with model training and inference. Lastly, Yinyin will run through a demo of how to apply the trained model to detect new objects.
A Novel Background Subtraction Algorithm for Person Tracking Based on K-NN — csandit
Object tracking can be defined as the process of detecting an object of interest in a video scene and keeping track of its motion, orientation, occlusion, etc., in order to extract useful information. It is a challenging problem and an important task, and many researchers are drawn to computer vision, specifically to object tracking in video surveillance. The main purpose of this paper is to inform the reader of the present state of the art in object tracking and to present the steps involved in background subtraction and its techniques. In the related literature we found three main methods of object tracking: the first is optical flow; the second is background subtraction, which is divided into two types presented in this paper; and the last is temporal differencing. We present a novel approach to background subtraction that compares the current frame with a background model set beforehand, so that each pixel of the image can be classified as a foreground or a background element; then comes the tracking step, which represents our object of interest, a person, by its centroid. The tracking step is divided into two different methods, the surface method and the K-NN method, both explained in the paper. Our proposed method is implemented and evaluated using the CAVIAR database.
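The pixel-classification and centroid steps described in the abstract can be sketched as follows. This is a simplified illustration with a fixed background model and an assumed threshold; the paper's surface and K-NN tracking methods are not reproduced here:

```python
import numpy as np

def foreground_mask(frame, background, thresh=25):
    """Classify each pixel as foreground (True) or background (False)
    by comparing the current frame with a fixed background model."""
    return np.abs(frame.astype(int) - background.astype(int)) > thresh

def centroid(mask):
    """Centroid (row, col) of the foreground pixels, used as the
    tracked position of the person; None if nothing is foreground."""
    ys, xs = np.nonzero(mask)
    if len(ys) == 0:
        return None
    return float(ys.mean()), float(xs.mean())
```

On a frame where a bright region appears against a static background, the mask picks out that region and the centroid gives a single point to track from frame to frame.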
A Novel Method for Person Tracking Based on K-NN: Comparison with SIFT and Mean... — sipij
Object tracking can be defined as the process of detecting an object of interest in a video scene and keeping track of its motion, orientation, occlusion, etc., in order to extract useful information. It is a challenging problem and an important task, and many researchers are drawn to computer vision, specifically to object tracking in video surveillance. The main purpose of this paper is to inform the reader of the present state of the art in object tracking and to present the steps involved in background subtraction and its techniques. In the related literature we found several main methods of object tracking: the first is optical flow; the second is background subtraction, which is divided into two types presented in this paper; then temporal differencing and the SIFT method; and the last is the mean shift method. We present a novel approach to background subtraction that compares the current frame with a background model set beforehand, so that each pixel of the image can be classified as a foreground or a background element; then comes the tracking step, which represents our object of interest, a person, by its centroid. The tracking step is divided into two different methods, the surface method and the K-NN method, both explained in the paper. Our proposed method is implemented and evaluated using the CAVIAR database.
Real-time pedestrian detection, tracking, and distance estimation — omid Asudeh
A combination of the HOG pedestrian detection method and the Lucas-Kanade tracking algorithm to detect and track people in a video stream in real time. A simple method is used for distance estimation with a pinhole camera.
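The pinhole-camera distance estimate follows from similar triangles: the image height of the person relates to their real height as the focal length relates to the distance. This small sketch assumes a known focal length in pixels and an assumed average pedestrian height; the exact values used in the original slides are not given, so these are illustrative:

```python
def pinhole_distance(focal_px, real_height_m, bbox_height_px):
    """Distance from a pinhole camera to a pedestrian:
    h_px / f = H_real / Z  =>  Z = f * H_real / h_px."""
    return focal_px * real_height_m / bbox_height_px
```

For example, with an 800-pixel focal length, an assumed 1.7 m person height, and a 170-pixel-tall detection box, the estimate is 8 m; the accuracy depends entirely on how close the assumed height is to the truth.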
Chen Sagiv, co-founder and co-CEO of SagivTech, gave an introductory talk on computer vision at the She Codes branch at Google Campus TLV.
The talk gave an overview of what computer vision is, where it is used, some basic notions and algorithms, and the AI revolution.
Development of Human Tracking System For Video Surveillance — cscpconf
Visual surveillance of dynamic scenes, especially of humans and certain objects, is one of the most active research areas, and this work makes an attempt at the problem. It has a wide spectrum of promising applications, including human identification for detecting suspicious behavior, crowd flux statistics, and congestion analysis using multiple cameras. This paper deals with the problem of detecting and tracking multiple moving people against a static background. Foreground objects are detected by background subtraction; detected objects are identified and analyzed through different blobs, and tracking is performed by matching corresponding blob features. An algorithm has been developed in this perspective using the Angular Deviation of Center of Gravity (ADCG), which gives satisfying results for the segmentation of human objects.
This paper presents a survey of various video surveillance methods that improve security. Its aim is to review various moving-object detection techniques, focusing on the detection of moving objects in video surveillance systems. Moving-body detection is the first important task for any video surveillance system, and detecting moving objects is challenging. Tracking is required in higher-level applications that need the location and shape of the object in every frame. The survey describes the optical flow method, background subtraction, and frame differencing for detecting moving objects, and also describes a tracking method based on morphology.
Keywords -- Frame separation, pre-processing, object detection using frame difference, optical flow, temporal differencing and background subtraction, object tracking
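Of the detection methods the survey lists, temporal (frame) differencing is the simplest: moving pixels are those that change between consecutive frames. A minimal sketch, with an illustrative threshold:

```python
import numpy as np

def temporal_difference(frames, thresh=20):
    """Detect moving pixels by differencing consecutive frames.
    Returns one boolean motion mask per consecutive frame pair."""
    masks = []
    for prev, curr in zip(frames, frames[1:]):
        diff = np.abs(curr.astype(int) - prev.astype(int))
        masks.append(diff > thresh)
    return masks
```

Unlike background subtraction, this needs no background model, but it only flags pixels where intensity actually changed, so a person who stops moving disappears from the mask.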
Presented at the 2011 IEEE 7th International Conference on Intelligent Computer Communication and Processing (ICCP 2011), August 26th, 2011 in Cluj-Napoca, Romania.
Publication: http://bit.ly/x1OpFL
Abstract:
In this paper we introduce a system for semantic understanding of traffic scenes. The system detects objects in video images captured in real vehicular traffic situations, classifies them, maps them to the OpenCyc ontology, and finally generates descriptions of the traffic scene in CycL or quasi-natural language. We employ meta-classification methods based on the AdaBoost and Random Forest algorithms to identify objects of interest such as cars, pedestrians, and poles in traffic, and we derive a set of annotations for each traffic scene. These annotations are mapped to OpenCyc concepts and predicates, and spatio-temporal rules for object classification and scene understanding are then asserted in the knowledge base. Finally, we show that the system performs well in understanding and summarizing traffic scene situations. The novelty of the approach resides in the combination of stereo-based object detection and recognition methods with logic-based spatio-temporal reasoning.
Computer memory is expensive, and recording the data captured by a webcam consumes memory. To minimize the memory used when recording human motion from the webcam, this algorithm applies motion detection, measuring the change in speed or vector of an object in the field of view. The application only runs when motion is detected, and it automatically saves the captured image to its designated folder.
Visual Object Tracking: review
1. Visual Object Tracking @ Belka
Dmytro Mishkin
Center for Machine Perception
Czech Technical University in Prague
ducha.aiki@gmail.com
Kyiv, Ukraine
2017
2. My background
• PhD student at the Czech Technical University in Prague.
Now working fully on deep learning; recent paper
“All you need is a good init” was added to the
Stanford CS231n course.
• Ex-CTO of Clear Research (clear.sx), 2014-2017.
• Co-founder of the Szkocka Research Group, a Ukrainian open
research community for computer science:
https://www.facebook.com/groups/839064726190162/ .
Currently supervising one project on local feature learning.
• Reviewer for the most impactful computer vision
journals: TPAMI and IJCV.
3. Lecture is heavily based on the tutorial
“Visual Tracking” by Jiri Matas
“…Although tracking itself is by and large a solved problem…”
-- Jianbo Shi & Carlo Tomasi, CVPR 1994
4. Application domains of Visual Tracking
• monitoring, assistance, surveillance,
control, defense
• robotics, autonomous car driving,
rescue
• measurements: medicine, sport,
biology, meteorology
• human computer interaction
• augmented reality
• film production and postproduction:
motion capture, editing
• management of video content: indexing,
search
• action and activity recognition
• image stabilization
• mobile applications
• camera “tracking”
Slide credit: Jiri Matas
8. Model-based Tracking: People and Faces
http://cvlab.epfl.ch/research/completed/realtime_tracking/
http://www.cs.brown.edu/~black/3Dtracking.html
Slide credit: Patrick Perez
9. Tracking is commonly used in practice…
• Tracking has been a popular research topic for decades:
see CVPR, ICCV, ECCV, …
• But there is no online course devoted to tracking…
• nor much coverage in computer vision courses…
• nor is it well covered in textbooks
10. Is it clear, what tracking is?
Video credit: Helmut Grabner. Slide credit: Jiri Matas
11. Tracking: Formulation - Literature
Surprisingly little is said about tracking in standard textbooks.
Limited to optic flow, plus some basic trackers, e.g. Lucas-Kanade.
Definition (0):
[Forsyth and Ponce, Computer Vision: A modern approach, 2003]
“Tracking is the problem of generating an inference about the
motion of an object given a sequence of images.
Good solutions of this problem have a variety of applications…”
Slide credit: Jiri Matas
12. Formulation (1): Tracking
Establishing point-to-point correspondences
in consecutive frames of an image sequence
Notes:
• The concept of an “object” in F&P definition disappeared.
• If an algorithm correctly established such correspondences,
would that be a perfect tracker?
• tracking = motion estimation?
Slide credit: Jiri Matas
15. Tracking is Motion Estimation / Optic Flow?
Motion “pattern” / camera tracking; dense motion field vs. sparse motion field estimate.
http://www.cs.cmu.edu/~saada/Projects/CrowdSegmentation/
http://www.youtube.com/watch?v=ckVQrwYIjAs
Slide credit: Jiri Matas
16. Tracking is Motion Estimation / Optic Flow?
Motion field
• The motion field is the projection of the 3D scene
motion into the image
Slide credit: James Hays
17. Tracking is Motion Estimation / Optic Flow?
Slide credit: Jason Corso
18. Optic Flow
Standard formulation:
• At every pixel, 2D displacement is estimated between consecutive frames
Missing:
• occlusion – disocclusion handling: pixels visible in one image only
- in the standard formulation, “don’t know” is not an answer
• considering the 3D nature of the world
• large displacement handling - only recently addressed (EpicFlow 2015)
Practical issues hindering progress in optic flow:
• is the ground truth ever known?
- learning and performance evaluation problematic (synthetic sequences ..)
• requires generic regularization (smoothing)
• failure (assumption validity) not easy to detect
In certain applications, tracking is motion estimation on a part of the image
with specific constraints: augmented reality, sports analysis.
19. Formulation (1): Tracking
Establishing point-to-point correspondences
in consecutive frames of an image sequence
Notes:
• The concept of an “object” in F&P definition disappeared.
• If an algorithm correctly established such correspondences,
would that be a perfect tracker?
• tracking = motion estimation?
Consider the Bolt sequence:
20. Definition (2): Tracking
Given an initial estimate of its position,
locate X in a sequence of images,
where X may be:
• A (rectangular) region
• An “interest point” and its neighbourhood
• An “object”
This definition is adopted e.g. in a recent book by
Maggio and Cavallaro, Video Tracking, 2011
Smeulders T-PAMI13:
Tracking is the analysis of video sequences for the
purpose of establishing the location of the target
over a sequence of frames (time) starting from
the bounding box given in the first frame.
21. Formulation (3): Tracking as Segmentation
J. Fan et al. Closed-Loop Adaptation for Robust Tracking, ECCV 2010
25. Formulation (4): Tracking
Given an initial estimate of the pose and state of X :
In all images in a sequence, (in a causal manner)
1. estimate the pose and state of X
2. (optionally) update the model of X
• Pose: any geometric parameter (position, scale, …)
• State: appearance, shape/segmentation, visibility, articulations
• Model update: essentially a semi-supervised learning problem
– a priori information (appearance, shape, dynamics, …)
– labeled data (“track this”) + unlabeled data = the sequences
• Causal: for the estimate at time T, use only information from times t ≤ T
34. Other Tracking Problems:
Cell division.
http://www.youtube.com/watch?v=rgLJrvoX_qo
Three rounds of cell division in Drosophila melanogaster.
http://www.youtube.com/watch?v=YFKA647w4Jg
Splitting and merging events…
36. Just link to some computer vision lib?
from cv2 import tracker
or
#include <opencv2/core.hpp>
Or
import org.opencv.core.Core;
import org.opencv.core.Mat;
38. What is in the libraries? OpenCV
• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFLow (2010)
• Boosting (2006)
• MIL (2009)
and KCF (2012) in opencv_contrib
39. What is in the libraries? BoofCV
• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)
https://github.com/lessthanoptimal/BoofCV
Computer vision library
48. KLT Tracker
Slide credit: Tomas Svoboda
Importance in Computer Vision
• First published in 1981 as an image registration method [3].
• Improved many times, most importantly by Carlo Tomasi [5,4].
• Free implementation(s) available¹.
• More than two decades later, a project² at CMU was dedicated to this
single algorithm, with results published in a premium journal [1].
• A building block of a plethora of computer vision algorithms.
¹http://www.ces.clemson.edu/~stb/klt/
²http://www.ri.cmu.edu/projects/project_515.html
49. Image alignment
Slide credit: Tomas Svoboda
The goal is to align a template image T(x) to an input image I(x), where x is a
column vector of image coordinates [x, y]ᵀ. I(x) could also be a small
subwindow within an image.
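This alignment can be sketched in a few lines of NumPy. The following is an illustrative translation-only Lucas-Kanade solver (names like `lk_translation` are hypothetical, not from the KLT reference implementation): it runs Gauss-Newton on the SSD cost between T and a bilinearly warped I.

```python
import numpy as np

def bilinear(I, x, y):
    """Bilinearly sample image I at float coordinates (x, y)."""
    x0 = np.clip(np.floor(x).astype(int), 0, I.shape[1] - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, I.shape[0] - 2)
    ax, ay = x - x0, y - y0
    return ((1 - ay) * ((1 - ax) * I[y0, x0] + ax * I[y0, x0 + 1])
            + ay * ((1 - ax) * I[y0 + 1, x0] + ax * I[y0 + 1, x0 + 1]))

def lk_translation(T, I, p=(0.0, 0.0), iters=50):
    """Estimate translation p = (dx, dy) aligning template T to image I by
    Gauss-Newton minimization of sum((I(x + p) - T(x))**2)."""
    p = np.array(p, dtype=float)
    ys, xs = np.mgrid[0:T.shape[0], 0:T.shape[1]]
    for _ in range(iters):
        Iw = bilinear(I, xs + p[0], ys + p[1])          # warped image
        gy, gx = np.gradient(Iw)                        # image gradients
        J = np.stack([gx.ravel(), gy.ravel()], axis=1)  # Jacobian wrt (dx, dy)
        err = (T - Iw).ravel()                          # residual
        dp = np.linalg.solve(J.T @ J, J.T @ err)        # Gauss-Newton step
        p += dp
        if np.linalg.norm(dp) < 1e-3:
            break
    return p

# Demo: recover a known (dx, dy) = (3, 2) shift of a smooth synthetic image
yy, xx = np.mgrid[0:64, 0:64]
img = np.exp(-((xx - 20.0) ** 2 + (yy - 16.0) ** 2) / 80.0)
templ = img[2:52, 3:53]   # template cut from img at offset (3, 2)
dx, dy = lk_translation(templ, img)
```

The 2×2 system `J.T @ J` is exactly the structure tensor that the later conditioning discussion is about.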
55. KLT Tracker
For good conditioning, patch must be
textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one dimensional
information)
• Corners, blobs and texture: best estimate
[Lucas and Kanade 1981][Tomasi and Shi, CVPR’94]
Slide credit: Patrick Perez
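The three cases above can be made concrete with the Shi-Tomasi “good features” score: the smaller eigenvalue of the 2×2 structure tensor of the patch gradients. A small NumPy sketch (illustrative, not the original implementation):

```python
import numpy as np

def min_eig_structure_tensor(patch):
    """Shi-Tomasi score: the smaller eigenvalue of the 2x2 structure tensor
    (gradient covariance) accumulated over the patch."""
    gy, gx = np.gradient(patch.astype(float))
    M = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    return np.linalg.eigvalsh(M)[0]   # eigvalsh sorts ascending

# Uniform patch: no gradients, both eigenvalues ~ 0 (no information)
flat = np.ones((16, 16))
# Vertical edge: gradient in one direction only -> min eigenvalue ~ 0 (aperture problem)
edge = np.tile((np.arange(16) > 8).astype(float), (16, 1))
# Corner: gradients in both directions -> min eigenvalue clearly positive
yy, xx = np.mgrid[0:16, 0:16]
corner = ((xx > 8) & (yy > 8)).astype(float)
```

Only the corner patch yields a well-conditioned system, matching the slide's ranking.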
57. Multi-resolution Lucas-Kanade
– Arbitrary displacement
• Multi-resolution approach: Gauss-Newton like approximation down image
pyramid
Slide credit: James Hays
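The coarse-to-fine idea can be sketched even without derivatives: estimate a large displacement with a small per-level search radius by matching on a subsampled pyramid, doubling the estimate between levels. This is an illustrative sketch (brute-force SSD matching; real pyramidal KLT runs the Gauss-Newton step at each level instead):

```python
import numpy as np

def ssd_search(T, I, start, radius):
    """Best integer translation (dx, dy) of T inside I within `radius` of `start`."""
    best, best_p = np.inf, start
    h, w = T.shape
    for dy in range(start[1] - radius, start[1] + radius + 1):
        for dx in range(start[0] - radius, start[0] + radius + 1):
            if dx < 0 or dy < 0 or dy + h > I.shape[0] or dx + w > I.shape[1]:
                continue
            ssd = np.sum((I[dy:dy + h, dx:dx + w] - T) ** 2)
            if ssd < best:
                best, best_p = ssd, (dx, dy)
    return best_p

def coarse_to_fine(T, I, radius=2, levels=3):
    """Coarse-to-fine matching: each level refines the doubled estimate from
    the level above, so a small radius covers a large total displacement."""
    px, py = 0, 0
    for lvl in range(levels - 1, -1, -1):
        s = 2 ** lvl                       # subsampling factor of this level
        px, py = ssd_search(T[::s, ::s], I[::s, ::s], (px, py), radius)
        if lvl > 0:
            px, py = 2 * px, 2 * py        # propagate estimate to finer level
    return px, py
```

With radius 2 and 3 levels, displacements of up to roughly a dozen pixels are found while each level scans only a 5×5 neighbourhood.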
58. Monitoring quality
– Translation is usually sufficient for small fragments, but:
• Perspective transforms and occlusions cause drift and loss
– Two complementary options
• Kill tracklets when minimum SSD too large
• Compare as well with initial patch under affine transform (warp) assumption
Slide credit: Patrick Perez
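The first option ("kill tracklets when minimum SSD too large") can be sketched directly; the function name and threshold below are illustrative, not from any library:

```python
import numpy as np

def tracklet_alive(initial_patch, current_patch, max_mean_ssd=0.01):
    """Drift monitor sketch: kill the tracklet when the per-pixel SSD between
    the current patch and the *initial* template exceeds a threshold."""
    ssd = np.mean((current_patch.astype(float) - initial_patch.astype(float)) ** 2)
    return ssd <= max_mean_ssd
```

Comparing against the initial template (rather than only the previous frame) is what catches slow drift.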
59. Characteristics of KLT
• cost function: sum of squared intensity differences
between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex
combination
• attractive properties:
–fast
–easily extended to image-to-image transformations with
multiple parameters
75. Discriminative Tracking (Tracking by Detection)
t = 0: take samples with labels (+1 +1 +1 −1 −1 −1) and train a classifier.
t > 0: classify subwindows to find the target.
76. Connection to Correlation
The Convolution Theorem:
“Cross-correlation is equivalent to an element-wise product in the Fourier domain”
    y = x ⊛ w  ⟺  ŷ = x̂* ⊙ ŵ
• where:
– ŷ = F(y) is the Discrete Fourier Transform (DFT) of y (likewise for x̂ and ŵ)
– ⊙ is the element-wise product
– * is the complex conjugate (i.e., negated imaginary part)
• Note that cross-correlation, and the DFT, are cyclic
(the window wraps at the image edges).
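This equivalence is easy to verify numerically with NumPy (a sanity check, not tracker code):

```python
import numpy as np

# Check: cyclic cross-correlation y = x (star) w equals the inverse DFT of
# conj(fft(x)) * fft(w), computed element-wise in the Fourier domain.
rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w = rng.standard_normal(8)
n = len(x)

# Direct cyclic cross-correlation: y[t] = sum_i x[i] * w[(i + t) mod n]
y_direct = np.array([np.sum(x * np.roll(w, -t)) for t in range(n)])

# Fourier-domain version: one element-wise product instead of O(n^2) sums
y_fft = np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(w)))
```

The wrap-around in `np.roll` is exactly the cyclicity the slide warns about.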
77. Kernelized Correlation Filters
• Circulant matrices are a very general tool that allows replacing standard
operations with fast Fourier operations.
• The same idea can be applied, e.g., to Kernel Ridge Regression,
with kernel matrix K, Kij = κ(xi, xj), and dual-space representation:
    α = (K + λI)⁻¹ y
• For many kernels: circulant data ⇒ circulant kernel matrix K,
K = C(k^xx), where k^xx is the kernel auto-correlation and
the first row of K (small, and easy to compute).
• Diagonalizing with the DFT, learning the classifier yields:
    α̂ = ŷ / (k̂^xx + λ)
Fast solution in O(n log n); typical kernel algorithms are O(n²) or higher!
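A NumPy sanity check of this diagonalization (synthetic data; `kxx` here is just a symmetric vector playing the role of the kernel auto-correlation):

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 16, 0.1
base = rng.standard_normal(n)
# The auto-correlation of a signal with itself is symmetric, so its DFT is
# real -- this is what lets the DFT diagonalize K = C(kxx).
kxx = np.real(np.fft.ifft(np.abs(np.fft.fft(base)) ** 2))
y = rng.standard_normal(n)

# Dense route: build the circulant matrix (row i = kxx cyclically shifted by i)
K = np.stack([np.roll(kxx, i) for i in range(n)])
alpha_direct = np.linalg.solve(K + lam * np.eye(n), y)

# Fourier route: alpha_hat = y_hat / (kxx_hat + lam), O(n log n)
alpha_fft = np.real(np.fft.ifft(np.fft.fft(y) / (np.fft.fft(kxx) + lam)))
```

Both routes give the same dual coefficients; the Fourier route never materializes the n×n matrix.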
78. Kernelized Correlation Filters
• k^xx′ is the kernel correlation of two vectors x and x′.
• For the Gaussian kernel it yields:
    k^xx′ = exp( −(1/σ²) ( ‖x‖² + ‖x′‖² − 2 F⁻¹( x̂* ⊙ x̂′ ) ) )
• Evaluation on subwindows of image z with classifier α and model x:
1. K^z = C(k^xz)
2. f(z) = F⁻¹( k̂^xz ⊙ α̂ )
• Update classifier α and model x by linear interpolation from the
location of maximum response f(z).
• The kernel allows integration of more complex and multi-channel features:
    k_i^xx′ = κ(x′, P^(i−1) x)
multiple channels can be concatenated into the vector x and then
summed over in this term.
79. Kernelized Correlation Filters
KCF Tracker
• very few
hyperparameters
• code fits on one slide
of the presentation!
• Use HoG features
(32 channels)
• ~300 FPS
• Open-Source
(Matlab/Python/Java/C)
function alphaf = train(x, y, sigma, lambda)
k = kernel_correlation(x, x, sigma);
alphaf = fft2(y) ./ (fft2(k) + lambda);
end
function y = detect(alphaf, x, z, sigma)
k = kernel_correlation(z, x, sigma);
y = real(ifft2(alphaf .* fft2(k)));
end
function k = kernel_correlation(x1, x2, sigma)
c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
k = exp(-1 / sigma^2 * abs(d) / numel(d));
end
Training and detection (Matlab); note the sum over the channel dimension
in the kernel computation.
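For readers who prefer Python, here is a close NumPy transliteration of the Matlab above (a sketch: the same math, minus HoG features, cosine windowing, and the model-update interpolation):

```python
import numpy as np

def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel correlation of two H x W x C feature maps: evaluates
    the kernel against all cyclic shifts at once (cf. the Matlab version)."""
    c = np.real(np.fft.ifft2(np.sum(np.conj(np.fft.fft2(x1, axes=(0, 1)))
                                    * np.fft.fft2(x2, axes=(0, 1)), axis=2)))
    d = np.sum(x1 ** 2) + np.sum(x2 ** 2) - 2 * c
    return np.exp(-np.abs(d) / (sigma ** 2 * d.size))

def train(x, y, sigma, lam):
    """Learn dual coefficients alphaf in the Fourier domain."""
    k = kernel_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)

def detect(alphaf, x, z, sigma):
    """Response map over all cyclic shifts of the search window z."""
    k = kernel_correlation(z, x, sigma)
    return np.real(np.fft.ifft2(alphaf * np.fft.fft2(k)))
```

Detecting on a cyclically shifted copy of the training window moves the response peak by the same (wrapped) shift, which is how the tracker localizes the target.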
80. From KCF to Discriminative CF trackers
Basic
• Henriques et al. – CSK
– raw grayscale pixel values as features
• Henriques et al. – KCF
– HoG multi-channel features
Further work
• Danelljan et al. – DSST:
– PCA-HoG + grayscale pixels features
– filters for translation and for scale (in the scale-space pyramid)
• Li et al. – SAMF:
– HoG, color-naming and grayscale pixels features
– quantize the scale space and normalize each scale to one size by bilinear
interpolation → only one filter at the normalized size
81. Discriminative Correlation Filters Trackers
• Danelljan et al. –SRDCF:
– spatial regularization in the learning process
→ limits boundary effect
→ penalize filter coefficients depending on their spatial location
– allows to use much larger search region
– more discriminative to background (more training data)
CNN-based Correlation Trackers
• Danelljan et al. – Deep SRDCF, CCOT (best performance in VOT
2016)
• Ma et al.
– features : VGG-Net pretrained on ImageNet dataset extracted from
third, fourth and fifth convolution layer
– for each feature learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al. – MDNet, T-CNN
87. Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers
         Perf. measures     Dataset size     Target box   Property    Trackers tested
VOT2013  ranks, A, R        16, s. manual    manual       per frame   27
VOT2014  ranks, A, R, EFO   25, s. manual    manual       per frame   38
VOT2015  EAO, A, R, EFO     60, fully auto   manual       per frame   62 VOT, 24 VOT-TIR
VOT2016  EAO, A, R, EFO     60, fully auto   auto         per frame   70 VOT, 24 VOT-TIR
88. Class of trackers tested
• Single-object, single-camera
• Short-term:
–Trackers performing without re-detection
• Causality:
–Tracker is not allowed to use any future frames
• No prior knowledge about the target
–Only a single training example – BBox in the first frame
• Object state encoded by a bounding box
89. Construction (1/3): Sequence candidates
Candidate pool: ALOV (315 seq.) [Smeulders et al., 2013] + PTR (~50 seq.)
[Vojir et al., 2013] + OTB (~50 seq.) [Wu et al., 2013] + >30 new sequences
from the VOT2015 committee = 443 sequences.
Filtered out:
• grayscale sequences
• targets smaller than 400 pixels
• poorly-defined targets
• artificially created sequences
(Examples shown: a poorly defined target; an artificially created sequence.)
The remaining 356 sequences enter the VOT automatic dataset construction
protocol: cluster + sample.
90. Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated
automatically for 356 sequences
(e.g., blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)
Pipeline: 11-dim feature encoding → Affinity Propagation [Frey, Dueck 2007]
→ clusters of similar sequences.
91. Construction (3/3): Sampling
• Requirements:
– diverse visual attributes
– a challenging subset
• Global visual attributes: computed automatically
• Tracking difficulty attribute: estimated by applying the FoT, ASMS and KCF trackers
• Developed a sampling strategy that samples challenging sequences
while keeping the global attributes diverse.
93. Object annotation
• Automatic bounding box placement
1. Segment the target (semi-automatic)
2. Automatically fit a bounding box by optimizing a cost function
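The fitting step can be sketched as follows. The actual VOT cost function is not reproduced here; this simple variant (hypothetical `fit_bbox`) trims each side to a quantile of the segmented pixels, which makes the box robust to stray mask pixels:

```python
import numpy as np

def fit_bbox(mask, coverage=0.95):
    """Fit an axis-aligned box (x0, y0, x1, y1) to a boolean segmentation
    mask by trimming each side to the given quantile of mask pixels."""
    ys, xs = np.nonzero(mask)
    lo = (1 - coverage) / 2
    x0, x1 = np.quantile(xs, [lo, 1 - lo]).astype(int)
    y0, y1 = np.quantile(ys, [lo, 1 - lo]).astype(int)
    return x0, y0, x1, y1
```

With `coverage=1.0` this reduces to the tight bounding box of the mask; lower values discard outlier pixels before fitting.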
95. Main novelty – better ground truth.
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation
96. VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering, use of basic features
2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: BDF (~300 fps), ASMS (~172 fps), FoT (~190 fps)
97. VOT 2016: Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
(Figures: overlap curves and AR-raw accuracy/robustness plot.)
98. VOT 2016: Tracking speed
• Top-performers slowest
• Plausible cause: CNN
• Real-time bound: Staple+
• Decent accuracy and decent robustness
Note: the speed in some
Matlab trackers has been
significantly underestimated
by the toolkit since it was
measuring also the Matlab
restart time. The EFOs of
Matlab trackers are in fact
higher than stated in this
figure.
99. VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: Submitted trackers code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
100. Summary
• “Visual Tracking” may refer to quite different problems.
• The area is just starting to be affected by CNNs.
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
– target learning (modelling, “template update”)
– integration of detection and temporal smoothness assumptions
– representation of the image and target
• Be careful when evaluating tracking results
103. Center for Machine Perception
Department of Cybernetics,
Faculty of Electrical Engineering
Czech Technical University, Prague
established 1707
Area of Expertise:
Computer Vision, Image Processing, Pattern Recognition
People:
12 Academics: 3 full, 2 associate, 7 assistant professors.
15 Researchers.
15 PhD students, 15 MSc and BSc students
Impact & Excellence:
significant share of funding (>75%) from the EU and private companies
collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler,
Samsung, Boeing, Hitachi, Honeywell), numerous science prizes, hundreds of
ISI citations to our publications per year, competitive results in contests.
Visual Recognition Group (head: Jiri Matas)