Motion and Tracking
Eng-Jon Ong
University of Surrey
e.ong@surrey.ac.uk
Introduction
 There have been many objects that have been
tracked in the past.
 Whole objects: Cars, bicycles, human bodies.
Source:
Youtube: Intelligent
Traffic Surveillance
What objects have been
tracked?
 There have been many objects that have been
tracked in the past.
 Medium level features: heads, hands, small
objects, etc.
What objects have been
tracked?
 There have been many objects that have been
tracked in the past.
 Fine level features: Facial feature points, finger
positions, etc...
Overview
 The task of visual tracking involves locating the
position of a tracked target by a combination of
features and motion models.
 There is a strong relationship between the task of
object detection and tracking.
Visual
model +
Detector
Motion
Model
Overview
 One can think of tracking as a motion-model
constrained detection.
 Detection on the whole image tends to be expensive
Visual
model +
Detector
Motion
Model
Overview
 Introduction
 Object models
 Simple search strategies
 Using linear dynamics
 Optimisation search
strategies
 Summary
Object Models and Evaluation
Representation of Tracked
Objects
 The first question: How do we computationally
represent an object we want to track?
 Image template
 Combination of low level information (e.g. lines)
 Contour information
Evaluation of different models
“fitness”
 We need a measure of model fitness on an image
given a set of parameters (e.g. position + scale).
 For images, we have template matching using
different scores:
 Normalised cross correlation is the most basic;
the sum of squared pixel differences (SSD) is
another common score.
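As a concrete illustration of these scores (a minimal NumPy sketch; the function names `ssd_score`, `ncc_score` and `match_template` are my own, not from the lecture):

```python
import numpy as np

def ssd_score(patch, template):
    """Sum of squared pixel differences: lower means a better match."""
    return float(np.sum((patch.astype(float) - template.astype(float)) ** 2))

def ncc_score(patch, template):
    """Normalised cross correlation in [-1, 1]: higher means a better match."""
    p = patch.astype(float) - patch.mean()
    t = template.astype(float) - template.mean()
    return float(np.sum(p * t) / (np.linalg.norm(p) * np.linalg.norm(t)))

def match_template(image, template):
    """Exhaustively slide the template over the image; return the best (row, col) by NCC."""
    h, w = template.shape
    best, best_pos = -2.0, (0, 0)
    for r in range(image.shape[0] - h + 1):
        for c in range(image.shape[1] - w + 1):
            s = ncc_score(image[r:r + h, c:c + w], template)
            if s > best:
                best, best_pos = s, (r, c)
    return best_pos
```

Note that this exhaustive whole-image search is exactly the expensive detection the earlier slides warn about; the rest of the lecture is about constraining it.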
Evaluation of different models
“fitness”
 There are more sophisticated methods for
matching a template to an image:
 Boosted detectors are a popular choice.
 Boosting is a method that combines a set of very
simple object detectors together to yield a strong
detector.
Boosted Cascade
Cascade Layer 1
90% Rejected
10% pass . . . .
Cascade Layer 2 Cascade Layer 3
10% pass
90% Rejected 90% Rejected 90% Rejected
Face
detected
Cascade Layer n
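The early-rejection idea of the cascade can be sketched as follows; the layer structure and the weak-classifier functions here are arbitrary stand-ins for illustration, not the actual boosted features:

```python
def cascade_detect(window, layers):
    """Pass a candidate window through the cascade layers in order.

    Each layer is (weak_classifiers, threshold): the weak scores are summed
    and compared against the layer threshold. Most windows are rejected by
    the cheap early layers, so the expensive later layers rarely run.
    """
    for weak_classifiers, threshold in layers:
        score = sum(f(window) for f in weak_classifiers)
        if score < threshold:
            return False  # rejected early: no further layers run
    return True           # survived every layer: report a detection
```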
Boosted Cascade Layer 1
2 Classifiers
Layer 2
5 Classifiers
Layer 3
5 Classifiers
Layer 4
20 Classifiers
Layer 5
50 Classifiers
Layer 6
50 Classifiers
Layer 7
128 Classifiers
Layer 8
132 Classifiers
Layer 9
100 Classifiers
Detecting and Tracking Humans
in Images
Constrained Detection: Simple
Search Strategies
Simple Tracking Strategies
 Detection/Global Search
 Goal: Where to place the
contour on the image?
Simple Tracking Strategies
[Figure: a contour with sample points (x1,y1), (x2,y2), (x3,y3), (x4,y4) and unit normals n̂1, n̂2, n̂3, n̂4; the image intensity I and its gradient dI/dn are sampled along each normal n]
 Contours and Costs
– Search along contour normal for edges
– Move contour x,y,scale & rotation
Evaluation of different models
“fitness”
 For lines and contours, we can use distances to
nearest edges.
 But, different configurations of contour searches
can have different results.
 Run demos:
 3tracescanline.exe
 4tracescanlinelong.exe
[Figure: contour sample points (x1,y1)...(x4,y4) with unit normals n̂1...n̂4, and the intensity gradient dI/dn sampled along each normal]
Simple Tracking Strategies
 Global Search
– If the parameter space of
the search is low in
dimensionality then a
simple global search of the
image is sufficient
Simple Tracking Strategies
 Global Search
– If the parameter space of
the search is low in
dimensionality then a
simple global search of the
image is sufficient
– Not practical for most
applications
Detecting and Tracking
Humans in Images
 We can track just using
global search if the
detectors are fast enough
Iterative Tracking
 Most tracking schemes work on the
assumption that an object will make small
iterative movements between frames
 Using this assumption only a local search
is required to update model parameters
 Tracking is typically posed as a 2 step
process:
– Initialisation (Global/Detection)
– Iteration (Local)
Iterative Tracking Example 1
 Assume the initial
position is known
 Assume the object won't move far
 Search locally to find
movement that
maximises some
fitness function
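This local-search update can be sketched in a few lines, assuming an SSD fitness and a small search radius (the function name and interface are illustrative only):

```python
import numpy as np

def local_search(image, template, pos, radius=3):
    """Search a (2*radius+1)^2 neighbourhood of the previous position for the
    offset minimising the SSD against the template. This relies on the
    small inter-frame movement assumption: the target is near where it was."""
    h, w = template.shape
    best, best_pos = np.inf, pos
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            r, c = pos[0] + dr, pos[1] + dc
            if r < 0 or c < 0 or r + h > image.shape[0] or c + w > image.shape[1]:
                continue  # candidate window falls outside the image
            cost = np.sum((image[r:r + h, c:c + w] - template) ** 2)
            if cost < best:
                best, best_pos = cost, (r, c)
    return best_pos
```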
Iterative Tracking Example 2
 Again:
– requires good initialisation
– relies on small inter-frame movements
Iterative Tracking Example 2
 Example of contour tracking failing
due to indistinct edges
 A better example of tracking but
highly susceptible to initialisation
 Increasing the local search
provides better initialisation but
decreases tracking performance
1BadContour.exe
2BetterContour.exe
4TraceScanLineLong.exe
Constrained Detection:
Optimisation Search Strategies
Tracking as an Optimisation
Problem
 Tracking can be thought of as an
optimisation where some cost function
represents how well a model fits an
image.
 Model fitting is done by attempting to find
the model parameters that
minimise/maximise this cost function
 This can be done at each frame to track
objects through a video sequence
Using Gradient Descent
 The previous approaches of iteratively
refining a model given a local search is
effectively a gradient descent optimisation
 This will only work if the
initial pose of the model is
very close to the ideal
position as energy surfaces
typically have many local
minima
[Figure: cost vs. parameter, an energy surface with many local minima]
Using Gradient Descent
 Energy surfaces are typically very complex and
impossible to visualise due to high dimensionality
 In the figure there is one global minimum, but many local
minima that are almost as good
 Unless our model is very close
to the ideal location, a gradient
descent approach will converge
on a local minimum and get
trapped
 We've already seen this in
action on the contour tracker
[Figure: cost vs. parameter, showing one global minimum among many local minima]
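To make the local-minimum trap concrete, here is a toy one-parameter gradient descent on a double-well cost (an illustrative function, not one from the demos): starting on the wrong side of the hump lands in the local minimum.

```python
def gradient_descent(dfdx, x0, step=0.01, iters=500):
    """Follow the negative gradient of a 1-D cost from x0.
    Converges to whichever minimum the starting point's basin leads to."""
    x = x0
    for _ in range(iters):
        x -= step * dfdx(x)
    return x

# Double-well cost f(x) = x^4 - 3x^2 + x, with gradient:
dfdx = lambda x: 4 * x**3 - 6 * x + 1
```

Started at x0 = -2 this reaches the global minimum near -1.30; started at x0 = 2 it gets trapped in the local minimum near 1.13.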
Choosing a cost function
 Returning to the contour example, let's
formulate a cost function as the
Euclidean distance between a model and
the strongest features in the image
 We can visualise the cost surface across
a single parameter
 Notice the surface has a global minimum
but it is not distinct
3TraceScanLine.exe
Choosing a cost function
 We can do the same after increasing the
local search (by extending our search
along normals) to see how this affects
the cost surface
 Note it makes the minimum more distinct,
but this image has no background clutter.
Additional clutter would further
complicate the surface
TraceScanLineLong.exe
Choosing a cost function
 Let's choose a different cost function
 This time we will take the edge strength
supporting the model pose
 Notice the surface has inverted and we
now seek to find the maximum
 It has a very clear maximum which
corresponds to the global solution which
SHOULD be easy to find!!!
5cost2TraceScanLine.exe
Lucas-Kanade Tracking
 Remember gradient descent
[Figure: cost vs. parameter]
 Well, if we know more about the surface we
can speed things up:
– If we assume the cost
surface is a parabola then,
given a position and
a gradient, we can
move to the minimum in
one move
Lucas-Kanade Tracking
 Newton-Raphson convergence:

$v_{n+1} = v_n - \frac{f'(v_n)}{f''(v_n)}$

 In the multi-dimensional case, $f'$ becomes the Jacobian and $f''$ the Hessian
• Two differences:
• LK uses the Sum of Squared Differences across the entire image.
• v is a multi-dimensional warp parameter.
[Figure: a parabolic cost f(v), where one Newton step from any position reaches the minimum]
Lucas-Kanade Tracking
 The cost is the sum of squared differences (SSD) between the template $T$ and the image $I$ warped by $w(x, v)$ with parameters $v$:

$d_{ssd}(v) = \sum_x \left[ I(w(x, v)) - T(x) \right]^2$

 Differentiating with respect to the warp parameters gives

$\frac{\partial d_{ssd}}{\partial v} = 2 \sum_x \left[ I(w(x, v)) - T(x) \right] \frac{\partial I}{\partial w} \frac{\partial w}{\partial v}$

where $\partial w / \partial v$ is the Jacobian of the warp.
Lucas-Kanade Tracking
 Taking the SSD cost

$d_{ssd}(v) = \sum_x \left[ I(w(x, v)) - T(x) \right]^2$

to second order, the Hessian is

$\frac{\partial^2 d_{ssd}}{\partial v^2} = 2 \sum_x \left( \frac{\partial I}{\partial w} \frac{\partial w}{\partial v} \right)^T \left( \frac{\partial I}{\partial w} \frac{\partial w}{\partial v} \right) + O(d)$

 Dropping the $O(d)$ term (which is proportional to the residual) gives the Gauss-Newton approximation: the Hessian is built from the same first-derivative Jacobian terms, so no second derivatives of the image are needed.
Lucas-Kanade Tracking
Youtube: vision: optical flow detection
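The derivation above can be sketched in one dimension, where the warp is a pure translation v. This is an illustrative scalar analogue of LK (Gauss-Newton on the SSD cost), not the full multi-parameter algorithm:

```python
import numpy as np

def lucas_kanade_translation(I, T, iters=20):
    """Estimate a 1-D translation v aligning signal I to template T by
    Gauss-Newton on the SSD cost: v += sum(g * (T - I(x+v))) / sum(g * g),
    where g = dI/dx plays the role of the Jacobian and sum(g*g) the
    Gauss-Newton Hessian."""
    x = np.arange(len(T), dtype=float)
    v = 0.0
    for _ in range(iters):
        Iw = np.interp(x + v, np.arange(len(I), dtype=float), I)  # warped image I(x+v)
        g = np.gradient(Iw)                                        # image gradient (Jacobian)
        H = np.sum(g * g)                                          # Gauss-Newton Hessian
        if H == 0:
            break
        v += np.sum(g * (T - Iw)) / H                              # one Newton-style step
    return v
```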
Mean-shift
 We can look for local maxima in object
detector outputs using mean-shift
Mean-shift
 We can look for local maxima in object
detector outputs using mean-shift
Mean shift
 Example of simple mean-shift tracking
 The object “detector” is the distance to an RGB histogram
Youtube: Mean shift tracking
of a red ball, normalised RGB
and 64-bin histogram
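A minimal sketch of the mean-shift hill climb on a 2-D map of per-pixel detector scores (e.g. histogram back-projection); the flat circular kernel and the names here are illustrative choices:

```python
import numpy as np

def mean_shift(weights, start, radius=5, iters=20):
    """Climb to a local maximum of a 2-D weight map by repeatedly moving
    to the weighted mean of a circular window around the current position."""
    pos = np.array(start, dtype=float)
    ys, xs = np.indices(weights.shape)
    for _ in range(iters):
        mask = (ys - pos[0]) ** 2 + (xs - pos[1]) ** 2 <= radius ** 2
        w = weights * mask                 # scores inside the window only
        total = w.sum()
        if total == 0:
            break                          # no support: stay put
        new = np.array([(ys * w).sum() / total, (xs * w).sum() / total])
        if np.allclose(new, pos):
            break                          # converged on a mode
        pos = new
    return pos
```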
Regression-based Tracking
Regression-based Tracking
 Up till now, tracking is seen as a constrained
detection problem. Essentially template
matching, searching a parameter space to
minimise a matching fitness function.
 Another approach is to pose the problem
as a regression problem: Given template
difference, predict the translational offset
to the correct position. (no explicit search
needed!)
Linear Predictors
(Robust Facial Feature Tracking using Shape Constrained Multi Resolution Selected
Linear Predictors, Ong et al)
[Figure: reference point Y with support pixels a, b, c]

$P = \left[\, I_a - I'_a,\; I_b - I'_b,\; I_c - I'_c \,\right]$

$X = H P$
 Reference Point + Support Pixels (a,b,c)
 Linear mapping (H) from support pixel
intensity difference to translation vector
 Linear Predictor “Bunches”
– Single LPs are not stable enough for tracking image
features
– Use a set (“bunch”) of
LPs instead
– Final prediction =
consensus of the most
common predicted
translation
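The mapping H can be learned offline by least squares from synthetically displaced training samples. This sketch (with hypothetical names) recovers H from pairs of intensity-difference vectors and the known displacements that produced them:

```python
import numpy as np

def train_linear_predictor(P, X):
    """Learn the linear mapping H with X = H P by ordinary least squares.
    P: (n_pixels, n_samples) support-pixel intensity differences,
    X: (2, n_samples) the known translations used to generate them."""
    # H = X P^T (P P^T)^{-1}
    return X @ P.T @ np.linalg.pinv(P @ P.T)

def predict_offset(H, p):
    """Predict the translation back to the true position from one
    support-pixel difference vector p."""
    return H @ p
```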
Linear Predictors
 “Tracking context” is very important.
 We only want to use surrounding visual
information if it helps the tracking
Linear Predictors
We want to track this point
BUT, we should
use visual information
around here for tracking
it! Other regions have too
much variation.
 We can find the tracking context by evaluating the
accuracy of trackers using local patches, and
gradually removing the bad ones
Linear Predictors
 Cascaded linear predictors:
– Linear predictors trained to overcome large offsets are not
accurate but robust
– LPs trained to overcome small offsets are accurate but not robust.
– Solution: cascade them. Use big-offset LPs, then pass the results
to smaller ones for refinement.
Linear Predictors
Errors of “large” LP predicting
from an offset position
(blue is medium prediction error)
Errors of “small” LP predicting
from an offset position
(white is small prediction error)
Non-Linear Predictors
(Non-linear Predictors for Facial feature Tracking, FG2013, Sheerman-Chase et al.)
[Figure: reference point Y with support pixels a, b, c]

$P = \left[\, I_a - I'_a,\; I_b - I'_b,\; I_c - I'_c \,\right]$

$X = H(P)$
 Replace linear mapping with the non-linear
mapping of regression trees
 Input still support pixel differences, output
still offsets
Non-Linear Predictors
 Replace linear mapping with the non-linear
mapping of regression trees
 Input still support pixel differences, output
still offsets
[Figure: a small regression tree, e.g. test S1 < 0.4 to predict dy = 23, otherwise test S50 < 0.1 to choose between dy = -10 and dy = 32]
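A tiny sketch of how such a tree predicts an offset; the thresholds and leaf values echo the slide's figure, but their exact arrangement is my guess and the representation is purely illustrative:

```python
def tree_predict(x, node):
    """Walk a regression tree: each internal node tests one support-pixel
    difference against a threshold; leaves hold predicted offsets (dy)."""
    while isinstance(node, tuple):        # internal node: (index, threshold, left, right)
        idx, thr, left, right = node
        node = left if x[idx] < thr else right
    return node                           # leaf: the predicted offset

# Hypothetical tree: test S1 < 0.4 first, then S50 < 0.1
tree = (1, 0.4, 23, (50, 0.1, -10, 32))
```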
Non-Linear Predictors
 Results: More robust tracking able to handle
larger amounts of pose and expression
variations.
Non-Linear Predictors
 Allows us to do freaky things like this:
Background to template update problem
 No update
– Misrepresentation Error
– Catastrophic
 Naïve update
– Drift Error
– Slow accumulation
[Figure: error over time (frames 1 to 5) for three cases: true feature with old appearance, true feature with new appearance, and a false feature]
Background template update
(Mutual information for Lucas Kanade tracking (MILK): An inverse compositional
formulation, Dowson et al, PAMI 08)
Building a Model of Templates
[Figure: appearance space of templates, comparing LP and SMAT]
Incorporating Motion Models
for Tracking
Temporal Consistency
 This sequence shows a surveillance application
tracking subjects as they move.
The technique uses a per pixel mixture
of Gaussians to model background colour
distributions and perform dynamic
background subtraction.
Tracking with Motion Models
 The task of visual tracking involves locating the
position of a tracked target by a combination of
features and motion models.
 There is a strong relationship between the task of
object detection and tracking.
Visual
model +
Detector
Motion
Model
Using Motion
 Objects often exhibit consistent motion
Kalman Filter
 To exploit this motion consistency, many
authors model it with simple dynamics
in what is called the Kalman filter
 A Kalman filter is simply an optimal
recursive data processing algorithm.
 It makes predictions based on previous
estimates and current observations
Kalman Filter
 Suppose we have some hidden information to recover (i.e. not
directly observable) that takes the form of a state vector
 E.g. X = [x,y,v]: the position and velocity of a tracked object
 This object has a true position at time t, Xt, which we do not know
 But suppose we think this object's dynamics work in a linear
fashion like: Xt = FXt-1
 BUT this may not be exactly the case, it might be slightly off, thus
we have Xt = FXt-1 + wt, where wt ~ N(0,Q)
Kalman Filter
 Suppose we have some sensors that can provide some
measurements about the tracked object in the form of a state
vector: Z = [a,b]
 These sensor measurements originate from the hidden state
vector X with the form: Zt = HXt
 BUT, in reality this sensor can be imperfect, noisy etc...
 We deal with this by saying Zt = HXt + v, where v ~ N(0,R)
 R is called the sensor’s error covariance
Kalman Filter
 We want to recover some hidden information about a tracked
object: X = [x,y,v]
 We can predict its movements “blindly” using: X’t|t-1 = FX’t-1|t-1
+ wt
 But this model is inaccurate in a Gaussian sense: wt ~ N(0,Q)
 We have some sensors that provide observations to indirectly tell
us how accurate our predictions are Zt – HX’t|t-1
 BUT, need to take this with a pinch of salt, since our sensors are
inaccurate as well (Zt has Gaussian noise with covariance R)
Kalman Filter
 So, task at hand: how do we best combine our prediction of a
tracked object state with the sensor observations, given that both
have Gaussian noise?
 That is what a Kalman filter does in an optimal sense (provided your
noise IS Gaussian and your dynamics ARE linear)
 Xt|t = X’t|t-1 + K( Zt – HX’t|t-1 )
 K is called the “Kalman gain”
 Essentially, if sensor noise is small and prediction noise is large, K
becomes H⁻¹, meaning: trust the observations.
 Conversely, if sensor noise is large,
K becomes 0: trust the prediction
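One predict/update cycle using the equations above can be sketched in NumPy; the matrix names follow the slides (F, H, Q, R), while the function name and interface are illustrative:

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict + update cycle of a linear Kalman filter.
    x: state estimate, P: state covariance, z: new measurement."""
    # Predict: propagate the state and covariance through the linear dynamics F
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: weigh the innovation (z - H x_pred) by the Kalman gain K
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```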
Kalman Filter Operation
From: Kalman filter for dummies
Using a Kalman Filter to Track
 How prediction overcomes occlusion
issues
Youtube: kalman Filter result on real aircraft & Result of Kalman Filter on a Moving Aircraft
Extended Kalman Filter-EKF
 The Kalman filter addresses the
problem of dynamics estimation by
linear equations
 Most problems are non-linear
 EKF attempts to address this by making
the state prediction Xt = F( Xt-1 ) + w
 F can be any non-linear function
See www.cs.unc.edu/~welch for introductory tutorials and sample code
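The EKF prediction step can be sketched as follows, assuming the Jacobian of F is supplied; the function and argument names are illustrative, and only the predict half is shown (the update mirrors the linear case):

```python
import numpy as np

def ekf_predict(x, P, f, F_jac, Q):
    """EKF prediction: push the state through a non-linear f, and
    propagate the covariance using the Jacobian of f evaluated at x."""
    x_pred = f(x)               # non-linear state prediction
    F = F_jac(x)                # local linearisation of the dynamics
    P_pred = F @ P @ F.T + Q    # covariance propagated through the linearisation
    return x_pred, P_pred
```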
Exploring a parameter space for
the global solution
 We could try every single model configuration to find
the lowest cost solution but this can be unfeasible
(640x480x100x360=11,059,200,000)
 We could just randomly pick model configurations in
the hope that we find a low cost solution but this
does not guarantee that we will find it and as the
dimensionality and complexity increase so must the
number of random samples
 These are common problems and hence standard
optimisation techniques can be employed
– e.g. Simulated Annealing, Genetic Algorithms
7RandomSample.exe
Tracking as an Optimisation
Problem
 In simulated annealing we try and use some simple
heuristic to reduce the number of samples we need to
test
 In Genetic Algorithms we try and guide our random
search through observation to again reduce the
complexity of the search
 However, these are blind optimisations and we often
know much more about the problem we are trying to
solve such as the nature of observations or the
dynamics we are expecting (remember the Kalman
Filter)
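Simulated annealing's accept/reject heuristic can be sketched on a one-parameter cost; this is a generic textbook sketch with illustrative defaults, not the scheme used in the demos:

```python
import math
import random

def simulated_annealing(cost, x0, step=0.5, t0=1.0, cooling=0.99, iters=2000, seed=0):
    """Randomly perturb the current configuration; always accept improvements,
    accept worse moves with probability exp(-delta / T), and cool T each step.
    Early high-temperature moves let the search escape local minima."""
    rng = random.Random(seed)
    x, best, t = x0, x0, t0
    for _ in range(iters):
        cand = x + rng.uniform(-step, step)
        delta = cost(cand) - cost(x)
        if delta < 0 or rng.random() < math.exp(-delta / t):
            x = cand                      # move accepted
            if cost(x) < cost(best):
                best = x                  # remember the best configuration seen
        t *= cooling                      # cool the temperature
    return best
```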
Tracking as an Optimisation
Problem
 Example of using simulated annealing for tracking the
body pose
N. Lehment, M. Kaiser, D. Arsic, and G. Rigoll.
Cue-Independent Extending Inverse Kinematics for Robust Pose Estimation in 3D Point Clouds.
Proc. IEEE Intern. Conf. on Image Processing (ICIP 2010)
Factored Sampling
 We have seen how the KF uses a simple
Gaussian to model observations but what
happens if observations are non-Gaussian?
 Factored Sampling can be used to search a
static image in these cases
 We want to calculate the posterior probability
that an object X exists in an image given the
observed data obj
– P(X |obj)
Factored Sampling
 This is difficult to achieve for continuous complex
non-Gaussian distributions
 Luckily Bayes’ formula says that the posterior
density can be obtained as a product of a prior
density P0(X ) and an observation density P(obj|
X )
– P(X |obj) ≈ P(obj|X ) P0(X )
 Factored sampling estimates the posterior by
generating samples from the prior and weighting
them according to the observation density
Factored Sampling
 A set of n points s(n), the centres of the blobs in the figure,
are sampled randomly from the prior density P0(X)
 Each sample is then assigned a weight (depicted by blob
area) based upon the observation density P(obj|X = s(n))
 If n is sufficiently large then the weighted set represents
the posterior density P(X|obj)
[Figure: probability over state X, showing the posterior density curve above the weighted sample set]
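Factored sampling amounts to importance sampling: draw from the prior, weight by the observation density. A minimal sketch under illustrative Gaussian choices (the prior, observation density, and names here are my own stand-ins):

```python
import numpy as np

def factored_sampling(sample_prior, observation_density, n=2000, seed=0):
    """Draw n samples from the prior P0(X) and weight each by the observation
    density P(obj|X); the weighted set approximates the posterior P(X|obj)."""
    rng = np.random.default_rng(seed)
    samples = np.array([sample_prior(rng) for _ in range(n)])
    weights = np.array([observation_density(s) for s in samples])
    weights = weights / weights.sum()    # normalise so the set sums to 1
    return samples, weights
```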
CONDENSATION and Particle Filtering
 CONDitional DENsity propagATION also known
as particle filtering is the natural extension of
the KF to factored sampling
 Basically:
– Randomly generate a distribution from the prior pdf
and apply a model of dynamics (i.e. predict)
– Fit each sample to the image (i.e. measure)
– Weight samples accordingly to generate a new
posterior pdf that will serve as the prior for the next
iteration
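The three steps above can be sketched for a one-dimensional state; Gaussian drift stands in for the dynamics model, and `measure` stands in for fitting a sample to the image:

```python
import numpy as np

def condensation_step(particles, weights, measure, drift_std=1.0, seed=0):
    """One CONDENSATION cycle on a 1-D state: resample from the weighted set,
    predict with Gaussian drift, then re-weight by the measurement density."""
    rng = np.random.default_rng(seed)
    # Resample proportionally to the old weights (the posterior becomes the prior)
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    resampled = particles[idx]
    # Predict: apply the (here trivial) dynamics plus process noise
    predicted = resampled + rng.normal(0.0, drift_std, size=len(particles))
    # Measure: weight each particle by how well it fits the observation
    new_weights = np.array([measure(p) for p in predicted])
    new_weights = new_weights / new_weights.sum()
    return predicted, new_weights
```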
CONDENSATION and Particle Filtering
[Figure: one predict/measure cycle of the sample set]
CONDENSATION and Particle Filtering
The animation shows a few cycles of the
algorithm applied to a one-dimensional system.
The green spheres correspond to the members
of the sample set, where the size of the sphere
is an indication of the sample weight. The red
line is the measurement density function.
This animation shows a short sequence of the
CONDENSATION filter tracking a leaf
exhibiting non-linear motion with occlusion
and clutter.
Movie sequences taken from
http://www.dai.ed.ac.uk/CVonline/LOCAL_COPIES/ISARD1/condensation.html
CONDENSATION and Particle Filtering
 We can extend our random sampler to a
simple PF using Gaussian noise as our
dynamics/drift term
 Notice how the population quickly homes in
on the area of highest probability as we saw
in the random sampling
 It quickly converges on incorrect local
solutions, increasing the noise term helps
explore the space further but the global
maximum is at the bottom of the image
8ParticleFilter.exe
CONDENSATION and Particle Filtering
 We can further try to change the model to
better fit the head and ensure the global
maximum is at the correct position
 Tracking is better but easily lost to other
maxima
 As the population size is increased we start to
see multiple hypothesis tracking
 By combining both the PF and a gradient
descent method we can get the best results
for the lowest population, but our cost
function is still flawed
9Particle filter.exe
10ParticleFilter.exe
CONDENSATION and Particle Filtering
 Advantages
– Allows complex non-Gaussian systems
– Easy to add non-linear dynamics
– Provides support for multiple hypotheses (!!!)
 Disadvantages
– Large numbers of samples make the techniques
extremely slow for high parameter spaces
– Not a global optimisation so has the tendency to
converge upon good observations at the cost of other
observations
 There are many schemes for overcoming these
problems, but they are beyond the scope of this lecture
Interesting Applications of
Motion Tracking
Lip-Reading
 Facial features of a subject are tracked, specifically the
mouth regions.
 Mouth texture and shape are extracted and
used to build discriminative patterns called
sequential patterns
Lip-Reading
 Results:
Sign Language Recognition
 Tracking required for extracting the motions of the
hands and head.
 Movement features of the hands and hand
shapes are extracted
 Again, discriminative movement patterns
uniquely identifying a sign is extracted
 These patterns will be used to detect whether a sign is
present in a video sequence or not
Sign Language Recognition
 Results:
Group Behaviour Profiling
 Even when tracking is not very accurate or robust, it
can still be used to do useful things!
 Example: Use simple trackers (e.g. Lucas
Kanade trackers) to “track” people in a crowd
 These will only last a short while, but can form
short trajectories.
 The analysis of these trajectories can be used
to profile crowd behaviours.
Group Behaviour Profiling
 Results:
Summary
We have looked at a variety of tracking strategies from
very simple schemes to those which can learn and
predict complex non-linear motion in cluttered
environments. This talk is not exhaustive but should
give you a basic understanding of the types of
techniques used in modern computer vision systems.
For more details on many of the examples see my
website http://www.surrey.ac.uk/personal/e.ong
For a good introduction on the temporal mechanics of
tracking I would recommend reading
“Active Contours” by Isard and Blake
Things to remember!!!
 When tracking:
– Tracking is only as good as your model and data
 A bad metric will give bad results
 The larger the parameter space the more difficult things
become
– Make things as simple as possible
 Constrain your environment
 Use appropriate techniques and dynamics
– e.g. if you're tracking someone jumping up and down, don't
use a Kalman filter
– Don't try to reinvent the wheel
 But if you're going to use black-box techniques, ensure you
know what they will and won't do for you

Motion and tracking

  • 1.
    Motion and Tracking Eng-JonOng University of Surrey e.ong@surrey.ac.uk
  • 2.
    Introduction  There havebeen many objects that have been tracked in the past.  Whole objects: Cars, bicycles, human bodies. Source: Youtube: Intelligent Traffic Surveillance
  • 3.
    What objects havebeen tracked? There have been many objects that have been tracked in the past.  Medium level features: Heads, Hands, small objects, etc..
  • 4.
    What objects havebeen tracked?  There have been many objects that have been tracked in the past.  Fine level features: Facial feature points, finger positions, etc...
  • 5.
    Overview  The taskof visual tracking involves locating the position of a tracked target by a combination of features and motion models.  There is a strong relationship between the task of object detection and tracking. Visual model + Detector Motion Model
  • 6.
    Overview  One canthink of tracking as a motion-model constrained detection.  Detection on the whole image tends to be expensive Visual model + Detector Motion Model
  • 7.
    Overview  Introduction  Objectmodels  Simple search strategies  Using linear dynamics  Optimisation search strategies  Summary
  • 8.
  • 9.
    Representation of Tracked Objects The first question: How do we computationally represent an object we want to track?  Image template  Combination of low level information (e.g. Lines)  Contour information
  • 10.
    Evaluation of differentmodels “fitness”  We need a measure of model fitness on an image given a set of parameters (e.g. Position + scale).  For images, we have template matching using different scores:  Normalised cross correlation is the most basic (i.e. Sum of squares of pixel differences)
  • 11.
    Evaluation of differentmodels “fitness”  There are more sophisticated methods for matching a template to an image:  Boosted detectors are a popular choice.  Boosting is a method that combines a set of very simple object detectors together to yield a strong detector.
  • 12.
    Boosted Cascade Cascade Layer1 90% Rejected 10% pass . . . . Cascade Layer 2 Cascade Layer 3 10% pass 90% Rejected 90% Rejected 90% Rejected Face detected Cascade Layer n
  • 13.
    Boosted Cascade Layer1 2 Classifiers Layer 2 5 Classifiers Layer 3 5 Classifiers Layer 4 20 Classifiers Layer 5 50 Classifiers Layer 6 50 Classifiers Layer 7 128 Classifiers Layer 8 132 Classifiers Layer 9 100 Classifiers
  • 14.
    Detecting and TrackingHumans in Images
  • 15.
  • 16.
    Simple Tracking Strategies Detection/Global Search  Goal: Where to place the contour on the image?
  • 17.
    Simple Tracking Strategies n dI dn I n (x1,y1) (x2,y2) (x3,y3) (x4,y4) ^ n1 ^ n2 ^ n3 ^ n4 Contours and Costs – Search along contour normal for edges – Move contour x,y,scale & rotation
  • 18.
    Evaluation of differentmodels “fitness”  For lines and contours, we can use distances to nearest edges.  But, different configurations of contour searches can have different results.  Run demos:  3tracescanline.exe  4tracescanlinelong.exe n dI dn I n (x1,y1) (x2,y2) (x3,y3) (x4,y4) ^ n 1 ^ n 2 ^ n 3 ^ n 4
  • 19.
    Simple Tracking Strategies Global Search – If the parameter space of the search is low in dimensionality then a simple global search of the image is sufficient
  • 20.
    Simple Tracking Strategies Global Search – If the parameter space of the search is low in dimensionality then a simple global search of the image is sufficient
  • 21.
    Simple Tracking Strategies Global Search – If the parameter space of the search is low in dimensionality then a simple global search of the image is sufficient – Not practical for most applications
  • 22.
    Detecting and Tracking Humansin Images  We can track just using global search if the detectors are fast enough
  • 23.
    Iterative Tracking  Mosttracking schemes work on the assumption that an object will make small iterative movements between frames  Using this assumption only a local search is required to update model parameters  Tracking is typically posed as a 2 step process: – Initialisation (Global/Detection) – Iteration (Local)
  • 24.
    Iterative Tracking Example1  Assume the initial position is known  Assume object wont move far  Search locally to find movement that maximises some fitness function
  • 25.
    Iterative Tracking Example1  Assume the initial position is known  Assume object wont move far  Search locally to find movement that maximises some fitness function
  • 26.
    Iterative Tracking Example2  Again: – requires good initialisation – relies on small inter-frame movements
  • 27.
    Iterative Tracking Example2  Example of contour tracking failing due to indistinct edges  A better example of tracking but highly susceptible to initialisation  Increasing the local search provides better initialisation but decreases tracking performance 1BadContour.exe 2BetterContour.exe 4TraceScanLineLong.exe
  • 28.
  • 29.
    Tracking as anOptimisation Problem  Tracking can be thought of as an optimisation where some cost function represents how well a model fits an image.  Model fitting is done by attempt to find the model parameters that minimise/maximise this cost function  This can be done at each frame to track objects through a video sequence
  • 30.
    Using Gradient Descent The previous approaches of iteratively refining a model given a local search is effectively a gradient descent optimisation  This will only work if the initial pose of the model is very close to the ideal position as energy surfaces typically have many local minima Cost Parame
  • 31.
    Using Gradient Descent Energy surfaces are typically very complex and impossible to visualise due to high dimensionality  In the figure there is one global minimum but many local minima that are almost as good  Unless our model is very close to the ideal location a gradient descent approach will converge on a local minima and get trapped  We've already seen this in action on the contour tracker Cost Parameter
  • 32.
    Choosing a costfunction  Returning to the contour example lets formulate a cost function as the Euclidean distance between a model and the strongest features in the image  We can visualise the cost surface across a single parameter  Notice the surface has a global minimum but it is not distinct 3TraceScanLine.exe
  • 33.
    Choosing a costfunction  We can do the same after increasing the local search (by extending our search along normals) to see how this affects the cost surface  Note it makes the minima more distinct but this image has no background clutter. Additional clutter would result in further complicating the surface TraceScanLineLong.exe
  • 34.
    Choosing a costfunction  Lets choose a different cost function  This time we will take the edge strength supporting the model pose  Notice the surface has inverted and we now seek to find the maximum  It has a very clear maximum which corresponds to the global solution which SHOULD be easy to find!!! 5cost2TraceScanLine.exe
  • 35.
    Lucas-Kanade Tracking  RememberGradient Descent Cost Parame  Well if we know more about the surface we can speed things up: – If we assume the cost surface is a parabola then given a position and a gradient we can move to the minimum in one move
  • 36.
    Lucas-Kanade Tracking  Newton-Raphson convergence vn+1=vn− f n ' f n ''  Jacobian  Hessian • Two differences • LK uses the Sum of Squared differences across the entire image. • x is a multi-dimensional warp parameter. v f(v)
  • 37.
    Lucas-Kanade Tracking       x ssd Tv,wI=d 2 xx      xx Tv,wI v w I=d v x ssd        2 - = { }* ∑ y w I    Jacobian ?)(?,   ssdd v x w I   
  • 38.
    Lucas-Kanade Tracking       x ssd Tv,wI=d 2 xx      xx Tv,wI v w I=d v x ssd        2  2 2 2 2 dO+ v w I v w I= v d x T ssd                     ∑ y w I    Jacobian Hessian x w I    y w I             ?? ?? 2 2 v dssd x w I   
  • 39.
  • 40.
  • 41.
    Mean-shift  We canlook for local maxima in object detector outputs using mean-shift
  • 42.
    Mean-shift  We canlook for local maxima in object detector outputs using mean-shift
  • 43.
    Mean shift  Exampleof simple mean-shift tracking  Object “Detector” is distance to RGB histogram Youtube: Mean shift tracking of red bal, normalised RGB and 64 bin histogram
  • 44.
  • 45.
    Regression-based Tracking  Uptill now, tracking is seen as a constrained detection problem. Essentially template matching, searching a parameter space to minimise a matching fitness function.  Another approach is to pose the problem as a regression problem: Given template difference, predict the translational offset to the correct position. (no explicit search needed!)
  • 46.
    Linear Predictors (Robust FacialFeature Tracking using Shape Constrained Multi Resolution Selected Linear Predictors, Ong et al) a c b Y P= [ Ia – I'a, Ib – I'b, lc – I'c ] X = HP  Reference Point + Support Pixels (a,b,c)  Linear mapping (H) from support pixel intensity difference to translation vector
  • 47.
     Linear Predictor“Bunches” – Single LPs are not stable enough for tracking image features – Use a set (“bunch”) of LPs instead – Final prediction = consensus of the most common predicted translation Linear Predictors
    Linear Predictors  “Tracking context” is very important: we only want to use surrounding visual information if it helps the tracking.  We may want to track one point, but we should use visual information from a nearby stable region to track it; other regions have too much variation.
    Linear Predictors  We can find the tracking context by evaluating the accuracy of trackers using local patches and gradually removing the bad ones.
    Linear Predictors  Cascaded linear predictors: – Linear predictors trained to overcome large offsets are not accurate, but are robust – LPs trained to overcome small offsets are accurate, but not robust – Solution: cascade them. Use big-offset LPs first, then pass the results to smaller ones for refinement.  (Figures: errors of a “large” LP predicting from an offset position, where blue is medium prediction error; errors of a “small” LP predicting from an offset position, where white is small prediction error.)
    Non-Linear Predictors (Non-linear Predictors for Facial Feature Tracking, FG2013, Sheerman-Chase et al.)  P = [ Ia – I'a, Ib – I'b, Ic – I'c ], X = H( P )  Replace the linear mapping with the non-linear mapping of regression trees  The input is still the support-pixel differences; the output is still an offset.
    Non-Linear Predictors  Replacelinear mapping with the non-linear mapping of regression trees  Input still support pixel differences, output still offsets S1<0.4 dy = 23 S50<0.1 Dy = 32dy = -10
    Non-Linear Predictors  Results:More robust tracking able to handle larger amounts of pose and expression variations.
    Non-Linear Predictors  Allowsus to do freaky things like this:
    Background to the template update problem  No update: misrepresentation error – catastrophic  Naïve update: drift error – slow accumulation  (Figure: error over frame time for the true feature with its old appearance, the true feature with its new appearance, and a false feature.)
    Background: template update (Mutual information for Lucas-Kanade tracking (MILK): an inverse compositional formulation, Dowson et al., PAMI 08)
    Building a Model of Templates  Appearance space
    Temporal Consistency  Thissequence shows a surveillance application tracking subjects as they move. The technique uses a per pixel mixture of Gaussians to model background colour distributions and perform dynamic background subtraction.
    Tracking with Motion Models  The task of visual tracking involves locating the position of a tracked target by a combination of features and motion models.  There is a strong relationship between the task of object detection and tracking. Visual model + Detector Motion Model
    Using Motion  Objectsoften exhibit consistent motion
    Kalman Filter  Toexploit this motion consistency, many authors model it with simple dynamics in the what is called the Kalman filter  A Kalman filter is simply an optimal recursive data processing algorithm.  It makes predictions based on previous estimates and current observations
    Kalman Filter  Supposewe have some hidden information to recover (i.e. Not directly observable) and takes the form of a state vector  E.g. X = [x,y,v] position, velocity of a tracked object  This object has a true position at time t, Xt, which we do not know  But suppose we think this object’s dynamics works in a linear fashion like: Xt = FXt-1  BUT this may not be exactly the case, it might be slightly off, thus we have Xt = FXt-1 + wt, where wt ~ N(0,Q) Xt
    Kalman Filter  Supposewe have some sensors that can provide some measurements about the tracked object in the form of a state vector: Z = [a,b]  This sensor measurements is originates from the hidden state vector X with the form: Zt = HXt  BUT, in reality this sensor can be imperfect, noisy etc...  We deal with this by saying Zt = HXt + v, where v ~ N(0,R)  R is called the sensor’s error covariance
    Kalman Filter  Wewant to recover some hidden information about a tracked object: X = [x,y,v]  We can predict it’s movements “blindly” using: X’t|t-1 = FX’t-1|t-1 + wt  But this model is inaccurate in a Gaussian sense: wt ~ N(0,Q)  We have some sensors that provide observations to indirectly tell us how accurate our predictions are Zt – HX’t|t-1  BUT, need to take this with a pinch of salt, since our sensors are inaccurate as well (Zt has Gaussian noise with covariance R)
    Kalman Filter  So,task at hand: how do we best combine our prediction of a tracked object state with the sensor observations, given that both have Gaussian noise?  That is what a Kalman filter does in a optimal sense (provide your noise IS Gaussian and your dynamics IS linear)  Xt|t = X’t|t-1 + K( Zt – HX’t|t-1 )  K is called the “Kalman gain”  Essentially, if sensor noise is small and prediction noise large, K becomes H-1, meaning trust the observations.  Conversely, if sensor noise is large, K becomes 0, trust prediction
    Kalman Filter Operation  (Diagram from: Kalman filter for dummies)
    Using a Kalman Filter to Track  How prediction overcomes occlusion issues  Youtube: Kalman filter result on real aircraft & result of Kalman filter on a moving aircraft
    Extended Kalman Filter (EKF)  The Kalman filter addresses the problem of dynamics estimation with linear equations  Most problems are non-linear  The EKF attempts to address this by making the state prediction Xt = F( Xt-1 ) + w  F can be any non-linear (differentiable) function  See www.cs.unc.edu/~welch for introductory tutorials and samples.
    Exploring a parameter space for the global solution  We could try every single model configuration to find the lowest-cost solution, but this can be infeasible (640 × 480 × 100 × 360 = 11,059,200,000 configurations)  We could just randomly pick model configurations in the hope of finding a low-cost solution, but this does not guarantee that we will find it, and as the dimensionality and complexity increase so must the number of random samples  These are common problems, and hence standard optimisation techniques can be employed, e.g. simulated annealing, genetic algorithms  Demo: 7RandomSample.exe
    Tracking as anOptimisation Problem  In simulated annealing we try and use some simple heuristic to reduce the number of samples we need to test  In Genetic Algorithms we try and guide our random search through observation to again reduce the complexity of the search  However, these are blind optimisations and we often know much more about the problem we are trying to solve such as the nature of observations or the dynamics we are expecting (remember the Kalman Filter)
    Tracking as anOptimisation Problem  Example of using simulated annealing for tracking the body pose N. Lehment, M. Kaiser, D. Arsic, and G. Rigoll. Cue-Independent Extending Inverse Kinematics For Robust Pose Estimation in 3D Point Clouds. Proc. IEEE Intern. Conf.on Image Processing (ICIP2010)
    Factored Sampling  Wehave seen how the KF uses a simple Gaussian to model observations but what happens if observations are non-Gaussian?  Factored Sampling can be used to search a static image in these cases  We want to calculate the posterior probability that an object X exists in an image given the observed data obj – P(X |obj)
    Factored Sampling  Thisis difficult to achieve for continuous complex non-Gaussian distributions  Luckily Bayes’ formula says that the posterior density can be obtained as a product of a prior density P0(X ) and an observation density P(obj| X ) – P(X |obj) ≈ P(obj|X ) P0(X )  Factored sampling estimates the posterior by generating samples from the prior and weighting them according to the observation density
    Factored Sampling  Aset of n points s (n), the centres of the blobs in the figure are sampled randomly from the prior density P(X )  Each sample is then assigned a weight (depicted by blob area) based upon the observation density P(obj|X = s (n) )  If n is sufficiently large then the weighted set represents the posterior density P(X |obj) State X Probability posterior density weighted sample
    CONDENSATION and ParticleFiltering  CONDitional DENsity propagATION also known as particle filtering is the natural extension of the KF to factored sampling  Basically: – Randomly generate a distribution from the prior pdf and apply a model of dynamics (i.e. predict) – Fit each sample to the image (i.e. measure) – Weight samples accordingly to generate a new posterior pdf that will serve as the prior for the next iteration
    CONDENSATION and Particle Filtering  (Figure: the predict and measure steps.)
    CONDENSATION and Particle Filtering  The animation shows a few cycles of the algorithm applied to a one-dimensional system. The green spheres correspond to the members of the sample set, where the size of a sphere indicates the sample weight. The red line is the measurement density function. A second animation shows a short sequence of the CONDENSATION filter tracking a leaf exhibiting non-linear motion with occlusion and clutter. Movie sequences taken from http://www.dai.ed.ac.uk/CVonline/LOCAL_COPIES/ISARD1/condensation.html
    CONDENSATION and ParticleFiltering  We can extend our random sampler to a simple PF using gaussian noise as our dynamics/drift term  Notice how the population quickly homes in on the area of highest probability as we saw in the random sampling  It quickly converges on incorrect local solutions, increasing the noise term helps explore the space further but the global maximum is at the bottom of the image 8ParticleFilter.exe
    CONDENSATION and ParticleFiltering  We can further try to change the model to better fit the head and ensure the global is at the correct position  Tracking is better but easily lost to other maxima  As the population size is increased we start to see multiple hypothesis tracking  By combining both the PF and a gradient decent method we can get the best results for the lowest population, but our cost function is still flawed 9Particle filter.exe 10ParticleFilter.exe
    CONDENSATION and ParticleFiltering  Advantages – Allows complex non-Gaussian systems – Easy to add non-linear dynamics – Provides support for multiple hypotheses (!!!)  Disadvantages – Large numbers of samples make the techniques extremely slow for high parameter spaces – Not a global optimisation so has the tendency to converge upon good observations at the cost of other observations  There are many schemes for overcoming these problems but are beyond the scope of this lecture
    Lip-Reading  Facial featuresof a subject are tracked, specifically the mouth regions.  Mouth texture and shape are extracted and used to build discriminative patterns called sequential patterns
    Sign Language Recognition Tracking required for extracting the motions of the hands and head.  Movement features of the hands and hand shapes are extracted  Again, discriminative movement patterns uniquely identifying a sign is extracted  These patterns will be used to detect whether a sign is present in a video sequence or not
    Group Behaviour Profiling Even when tracking is not very accurate or robust, it can still be used to do useful things!  Example: Use simple trackers (e.g. Lucas Kanade trackers) to “track” people in a crowd  These will only last a short while, but can form short trajectories.  The analysis of these trajectories can be used to do profile crowd behaviours.
    Summary  We have looked at a variety of tracking strategies, from very simple schemes to those which can learn and predict complex non-linear motion in cluttered environments.  This talk is not exhaustive, but should give you a basic understanding of the types of techniques used in modern computer vision systems.  For more details on many of the examples see my website http://www.surrey.ac.uk/personal/e.ong  For a good introduction to the temporal mechanics of tracking I would recommend reading “Active Contours” by Isard and Blake.
    Things to remember!!! When tracking: – Tracking is only as good as your model and data  A bad metric will give bad results  The larger the parameter space the more difficult things become – Make things as simple as possible  Constrain your environment  Use appropriate techniques and dynamics – e.g. if your tracking someone jumping up and down don’t use a kalman filter – Don’t try to reinvent the wheel  But if your going to use black box techniques ensure you know what they will and wont do for you