Motion and Tracking
Eng-Jon Ong
University of Surrey
e.ong@surrey.ac.uk
Introduction
 There have been many objects that have been
tracked in the past.
 Whole objects: Cars, bicycles, human bodies.
Source:
Youtube: Intelligent
Traffic Surveillance
What objects have been
tracked?
 There have been many objects that have been
tracked in the past.
 Medium level features: heads, hands, small
objects, etc.
What objects have been
tracked?
 There have been many objects that have been
tracked in the past.
 Fine level features: Facial feature points, finger
positions, etc...
Overview
 The task of visual tracking involves locating the
position of a tracked target by a combination of
features and motion models.
 There is a strong relationship between the task of
object detection and tracking.
Visual
model +
Detector
Motion
Model
Overview
 One can think of tracking as a motion-model
constrained detection.
 Detection on the whole image tends to be expensive
Visual
model +
Detector
Motion
Model
Overview
 Introduction
 Object models
 Simple search strategies
 Using linear dynamics
 Optimisation search
strategies
 Summary
Object Models and Evaluation
Representation of Tracked
Objects
 The first question: How do we computationally
represent an object we want to track?
 Image template
 Combination of low level information (e.g. lines)
 Contour information
Evaluation of different models
“fitness”
 We need a measure of model fitness on an image
given a set of parameters (e.g. position + scale).
 For images, we have template matching using
different scores:
 Normalised cross correlation is the most basic;
the sum of squared pixel differences (SSD) is
another common score.
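As a concrete illustration of these scores (a minimal NumPy sketch; the function names `ssd_score`, `ncc_score` and `match_template` are my own, not from the lecture):

```python
import numpy as np

def ssd_score(patch, template):
    """Sum of squared pixel differences: lower means a better match."""
    return float(np.sum((patch.astype(float) - template.astype(float)) ** 2))

def ncc_score(patch, template):
    """Normalised cross correlation in [-1, 1]: higher means a better match."""
    p = patch.astype(float) - patch.mean()
    t = template.astype(float) - template.mean()
    return float(np.sum(p * t) / (np.linalg.norm(p) * np.linalg.norm(t)))

def match_template(image, template):
    """Exhaustively slide the template over the image; return the best (row, col) by NCC."""
    h, w = template.shape
    best, best_pos = -2.0, (0, 0)
    for r in range(image.shape[0] - h + 1):
        for c in range(image.shape[1] - w + 1):
            s = ncc_score(image[r:r + h, c:c + w], template)
            if s > best:
                best, best_pos = s, (r, c)
    return best_pos
```

Note that this exhaustive whole-image search is exactly the expensive detection the earlier slides warn about; the rest of the lecture is about constraining it.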
Evaluation of different models
“fitness”
 There are more sophisticated methods for
matching a template to an image:
 Boosted detectors are a popular choice.
 Boosting is a method that combines a set of very
simple object detectors together to yield a strong
detector.
Boosted Cascade
Cascade Layer 1
90% Rejected
10% pass . . . .
Cascade Layer 2 Cascade Layer 3
10% pass
90% Rejected 90% Rejected 90% Rejected
Face
detected
Cascade Layer n
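The early-rejection idea of the cascade can be sketched as follows; the layer structure and the weak-classifier functions here are arbitrary stand-ins for illustration, not the actual boosted features:

```python
def cascade_detect(window, layers):
    """Pass a candidate window through the cascade layers in order.

    Each layer is (weak_classifiers, threshold): the weak scores are summed
    and compared against the layer threshold. Most windows are rejected by
    the cheap early layers, so the expensive later layers rarely run.
    """
    for weak_classifiers, threshold in layers:
        score = sum(f(window) for f in weak_classifiers)
        if score < threshold:
            return False  # rejected early: no further layers run
    return True           # survived every layer: report a detection
```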
Boosted Cascade Layer 1
2 Classifiers
Layer 2
5 Classifiers
Layer 3
5 Classifiers
Layer 4
20 Classifiers
Layer 5
50 Classifiers
Layer 6
50 Classifiers
Layer 7
128 Classifiers
Layer 8
132 Classifiers
Layer 9
100 Classifiers
Detecting and Tracking Humans
in Images
Constrained Detection: Simple
Search Strategies
Simple Tracking Strategies
 Detection/Global Search
 Goal: Where to place the
contour on the image?
Simple Tracking Strategies
[Figure: a contour with sample points (x1,y1), (x2,y2), (x3,y3), (x4,y4) and unit normals n̂1, n̂2, n̂3, n̂4; the image intensity I and its gradient dI/dn are sampled along each normal n]
 Contours and Costs
– Search along contour normal for edges
– Move contour x,y,scale & rotation
Evaluation of different models
“fitness”
 For lines and contours, we can use distances to
nearest edges.
 But, different configurations of contour searches
can have different results.
 Run demos:
 3tracescanline.exe
 4tracescanlinelong.exe
[Figure: contour sample points (x1,y1)...(x4,y4) with unit normals n̂1...n̂4, and the intensity gradient dI/dn sampled along each normal]
Simple Tracking Strategies
 Global Search
– If the parameter space of
the search is low in
dimensionality then a
simple global search of the
image is sufficient
Simple Tracking Strategies
 Global Search
– If the parameter space of
the search is low in
dimensionality then a
simple global search of the
image is sufficient
– Not practical for most
applications
Detecting and Tracking
Humans in Images
 We can track just using
global search if the
detectors are fast enough
Iterative Tracking
 Most tracking schemes work on the
assumption that an object will make small
iterative movements between frames
 Using this assumption only a local search
is required to update model parameters
 Tracking is typically posed as a 2 step
process:
– Initialisation (Global/Detection)
– Iteration (Local)
Iterative Tracking Example 1
 Assume the initial
position is known
 Assume the object won't move far
 Search locally to find
movement that
maximises some
fitness function
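This local-search update can be sketched in a few lines, assuming an SSD fitness and a small search radius (the function name and interface are illustrative only):

```python
import numpy as np

def local_search(image, template, pos, radius=3):
    """Search a (2*radius+1)^2 neighbourhood of the previous position for the
    offset minimising the SSD against the template. This relies on the
    small inter-frame movement assumption: the target is near where it was."""
    h, w = template.shape
    best, best_pos = np.inf, pos
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            r, c = pos[0] + dr, pos[1] + dc
            if r < 0 or c < 0 or r + h > image.shape[0] or c + w > image.shape[1]:
                continue  # candidate window falls outside the image
            cost = np.sum((image[r:r + h, c:c + w] - template) ** 2)
            if cost < best:
                best, best_pos = cost, (r, c)
    return best_pos
```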
Iterative Tracking Example 2
 Again:
– requires good initialisation
– relies on small inter-frame movements
Iterative Tracking Example 2
 Example of contour tracking failing
due to indistinct edges
 A better example of tracking but
highly susceptible to initialisation
 Increasing the local search
provides better initialisation but
decreases tracking performance
1BadContour.exe
2BetterContour.exe
4TraceScanLineLong.exe
Constrained Detection:
Optimisation Search Strategies
Tracking as an Optimisation
Problem
 Tracking can be thought of as an
optimisation where some cost function
represents how well a model fits an
image.
 Model fitting is done by attempting to find
the model parameters that
minimise/maximise this cost function
 This can be done at each frame to track
objects through a video sequence
Using Gradient Descent
 The previous approaches of iteratively
refining a model given a local search is
effectively a gradient descent optimisation
 This will only work if the
initial pose of the model is
very close to the ideal
position as energy surfaces
typically have many local
minima
[Figure: cost vs. parameter, an energy surface with many local minima]
Using Gradient Descent
 Energy surfaces are typically very complex and
impossible to visualise due to high dimensionality
 In the figure there is one global minimum, but many local
minima that are almost as good
 Unless our model is very close
to the ideal location, a gradient
descent approach will converge
on a local minimum and get
trapped
 We've already seen this in
action on the contour tracker
[Figure: cost vs. parameter, showing one global minimum among many local minima]
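To make the local-minimum trap concrete, here is a toy one-parameter gradient descent on a double-well cost (an illustrative function, not one from the demos): starting on the wrong side of the hump lands in the local minimum.

```python
def gradient_descent(dfdx, x0, step=0.01, iters=500):
    """Follow the negative gradient of a 1-D cost from x0.
    Converges to whichever minimum the starting point's basin leads to."""
    x = x0
    for _ in range(iters):
        x -= step * dfdx(x)
    return x

# Double-well cost f(x) = x^4 - 3x^2 + x, with gradient:
dfdx = lambda x: 4 * x**3 - 6 * x + 1
```

Started at x0 = -2 this reaches the global minimum near -1.30; started at x0 = 2 it gets trapped in the local minimum near 1.13.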
Choosing a cost function
 Returning to the contour example, let's
formulate a cost function as the
Euclidean distance between a model and
the strongest features in the image
 We can visualise the cost surface across
a single parameter
 Notice the surface has a global minimum
but it is not distinct
3TraceScanLine.exe
Choosing a cost function
 We can do the same after increasing the
local search (by extending our search
along normals) to see how this affects
the cost surface
 Note it makes the minimum more distinct,
but this image has no background clutter.
Additional clutter would further
complicate the surface
TraceScanLineLong.exe
Choosing a cost function
 Let's choose a different cost function
 This time we will take the edge strength
supporting the model pose
 Notice the surface has inverted and we
now seek to find the maximum
 It has a very clear maximum which
corresponds to the global solution which
SHOULD be easy to find!!!
5cost2TraceScanLine.exe
Lucas-Kanade Tracking
 Remember gradient descent
[Figure: cost vs. parameter]
 Well, if we know more about the surface we
can speed things up:
– If we assume the cost
surface is a parabola then,
given a position and
a gradient, we can
move to the minimum in
one move
Lucas-Kanade Tracking
 Newton-Raphson convergence:

$v_{n+1} = v_n - \frac{f'(v_n)}{f''(v_n)}$

 In the multi-dimensional case, $f'$ becomes the Jacobian and $f''$ the Hessian
• Two differences:
• LK uses the Sum of Squared Differences across the entire image.
• v is a multi-dimensional warp parameter.
[Figure: a parabolic cost f(v), where one Newton step from any position reaches the minimum]
Lucas-Kanade Tracking
 The cost is the sum of squared differences (SSD) between the template $T$ and the image $I$ warped by $w(x, v)$ with parameters $v$:

$d_{ssd}(v) = \sum_x \left[ I(w(x, v)) - T(x) \right]^2$

 Differentiating with respect to the warp parameters gives

$\frac{\partial d_{ssd}}{\partial v} = 2 \sum_x \left[ I(w(x, v)) - T(x) \right] \frac{\partial I}{\partial w} \frac{\partial w}{\partial v}$

where $\partial w / \partial v$ is the Jacobian of the warp.
Lucas-Kanade Tracking
 Taking the SSD cost

$d_{ssd}(v) = \sum_x \left[ I(w(x, v)) - T(x) \right]^2$

to second order, the Hessian is

$\frac{\partial^2 d_{ssd}}{\partial v^2} = 2 \sum_x \left( \frac{\partial I}{\partial w} \frac{\partial w}{\partial v} \right)^T \left( \frac{\partial I}{\partial w} \frac{\partial w}{\partial v} \right) + O(d)$

 Dropping the $O(d)$ term (which is proportional to the residual) gives the Gauss-Newton approximation: the Hessian is built from the same first-derivative Jacobian terms, so no second derivatives of the image are needed.
Lucas-Kanade Tracking
Youtube: vision: optical flow detection
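The derivation above can be sketched in one dimension, where the warp is a pure translation v. This is an illustrative scalar analogue of LK (Gauss-Newton on the SSD cost), not the full multi-parameter algorithm:

```python
import numpy as np

def lucas_kanade_translation(I, T, iters=20):
    """Estimate a 1-D translation v aligning signal I to template T by
    Gauss-Newton on the SSD cost: v += sum(g * (T - I(x+v))) / sum(g * g),
    where g = dI/dx plays the role of the Jacobian and sum(g*g) the
    Gauss-Newton Hessian."""
    x = np.arange(len(T), dtype=float)
    v = 0.0
    for _ in range(iters):
        Iw = np.interp(x + v, np.arange(len(I), dtype=float), I)  # warped image I(x+v)
        g = np.gradient(Iw)                                        # image gradient (Jacobian)
        H = np.sum(g * g)                                          # Gauss-Newton Hessian
        if H == 0:
            break
        v += np.sum(g * (T - Iw)) / H                              # one Newton-style step
    return v
```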
Mean-shift
 We can look for local maxima in object
detector outputs using mean-shift
Mean-shift
 We can look for local maxima in object
detector outputs using mean-shift
Mean shift
 Example of simple mean-shift tracking
 The object “detector” is the distance to an RGB histogram
Youtube: Mean shift tracking
of a red ball, normalised RGB
and 64-bin histogram
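A minimal sketch of the mean-shift hill climb on a 2-D map of per-pixel detector scores (e.g. histogram back-projection); the flat circular kernel and the names here are illustrative choices:

```python
import numpy as np

def mean_shift(weights, start, radius=5, iters=20):
    """Climb to a local maximum of a 2-D weight map by repeatedly moving
    to the weighted mean of a circular window around the current position."""
    pos = np.array(start, dtype=float)
    ys, xs = np.indices(weights.shape)
    for _ in range(iters):
        mask = (ys - pos[0]) ** 2 + (xs - pos[1]) ** 2 <= radius ** 2
        w = weights * mask                 # scores inside the window only
        total = w.sum()
        if total == 0:
            break                          # no support: stay put
        new = np.array([(ys * w).sum() / total, (xs * w).sum() / total])
        if np.allclose(new, pos):
            break                          # converged on a mode
        pos = new
    return pos
```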
Regression-based Tracking
Regression-based Tracking
 Up till now, tracking is seen as a constrained
detection problem. Essentially template
matching, searching a parameter space to
minimise a matching fitness function.
 Another approach is to pose the problem
as a regression problem: Given template
difference, predict the translational offset
to the correct position. (no explicit search
needed!)
Linear Predictors
(Robust Facial Feature Tracking using Shape Constrained Multi Resolution Selected
Linear Predictors, Ong et al)
[Figure: reference point Y with support pixels a, b, c]

$P = \left[\, I_a - I'_a,\; I_b - I'_b,\; I_c - I'_c \,\right]$

$X = H P$
 Reference Point + Support Pixels (a,b,c)
 Linear mapping (H) from support pixel
intensity difference to translation vector
 Linear Predictor “Bunches”
– Single LPs are not stable enough for tracking image
features
– Use a set (“bunch”) of
LPs instead
– Final prediction =
consensus of the most
common predicted
translation
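The mapping H can be learned offline by least squares from synthetically displaced training samples. This sketch (with hypothetical names) recovers H from pairs of intensity-difference vectors and the known displacements that produced them:

```python
import numpy as np

def train_linear_predictor(P, X):
    """Learn the linear mapping H with X = H P by ordinary least squares.
    P: (n_pixels, n_samples) support-pixel intensity differences,
    X: (2, n_samples) the known translations used to generate them."""
    # H = X P^T (P P^T)^{-1}
    return X @ P.T @ np.linalg.pinv(P @ P.T)

def predict_offset(H, p):
    """Predict the translation back to the true position from one
    support-pixel difference vector p."""
    return H @ p
```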
Linear Predictors
 “Tracking context” is very important.
 We only want to use surrounding visual
information if it helps the tracking
Linear Predictors
We want to track this point
BUT, we should
use visual information
around here for tracking
it! Other regions have too
much variation.
 We can find the tracking context by evaluating the
accuracy of trackers using local patches, and
gradually removing the bad ones
Linear Predictors
 Cascaded linear predictors:
– Linear predictors trained to overcome large offsets are not
accurate but robust
– LPs trained to overcome small offsets are accurate but not robust.
– Solution: cascade them. Use big-offset LPs, then pass the results
to smaller ones for refinement.
Linear Predictors
Errors of “large” LP predicting
from an offset position
(blue is medium prediction error)
Errors of “small” LP predicting
from an offset position
(white is small prediction error)
Non-Linear Predictors
(Non-linear Predictors for Facial feature Tracking, FG2013, Sheerman-Chase et al.)
[Figure: reference point Y with support pixels a, b, c]

$P = \left[\, I_a - I'_a,\; I_b - I'_b,\; I_c - I'_c \,\right]$

$X = H(P)$
 Replace linear mapping with the non-linear
mapping of regression trees
 Input still support pixel differences, output
still offsets
Non-Linear Predictors
 Replace linear mapping with the non-linear
mapping of regression trees
 Input still support pixel differences, output
still offsets
[Figure: a small regression tree, e.g. test S1 < 0.4 to predict dy = 23, otherwise test S50 < 0.1 to choose between dy = -10 and dy = 32]
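A tiny sketch of how such a tree predicts an offset; the thresholds and leaf values echo the slide's figure, but their exact arrangement is my guess and the representation is purely illustrative:

```python
def tree_predict(x, node):
    """Walk a regression tree: each internal node tests one support-pixel
    difference against a threshold; leaves hold predicted offsets (dy)."""
    while isinstance(node, tuple):        # internal node: (index, threshold, left, right)
        idx, thr, left, right = node
        node = left if x[idx] < thr else right
    return node                           # leaf: the predicted offset

# Hypothetical tree: test S1 < 0.4 first, then S50 < 0.1
tree = (1, 0.4, 23, (50, 0.1, -10, 32))
```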
Non-Linear Predictors
 Results: More robust tracking able to handle
larger amounts of pose and expression
variations.
Non-Linear Predictors
 Allows us to do freaky things like this:
Background to template update problem
 No update
– Misrepresentation Error
– Catastrophic
 Naïve update
– Drift Error
– Slow accumulation
[Figure: error over time (frames 1 to 5) for three cases: true feature with old appearance, true feature with new appearance, and a false feature]
Background template update
(Mutual information for Lucas Kanade tracking (MILK): An inverse compositional
formulation, Dowson et al, PAMI 08)
Building a Model of Templates
[Figure: appearance space of templates, comparing LP and SMAT]
Incorporating Motion Models
for Tracking
Temporal Consistency
 This sequence shows a surveillance application
tracking subjects as they move.
The technique uses a per pixel mixture
of Gaussians to model background colour
distributions and perform dynamic
background subtraction.
Tracking with Motion Models
 The task of visual tracking involves locating the
position of a tracked target by a combination of
features and motion models.
 There is a strong relationship between the task of
object detection and tracking.
Visual
model +
Detector
Motion
Model
Using Motion
 Objects often exhibit consistent motion
Kalman Filter
 To exploit this motion consistency, many
authors model it with simple dynamics
in what is called the Kalman filter
 A Kalman filter is simply an optimal
recursive data processing algorithm.
 It makes predictions based on previous
estimates and current observations
Kalman Filter
 Suppose we have some hidden information to recover (i.e. not
directly observable) that takes the form of a state vector
 E.g. X = [x,y,v]: the position and velocity of a tracked object
 This object has a true position at time t, Xt, which we do not know
 But suppose we think this object's dynamics work in a linear
fashion like: Xt = FXt-1
 BUT this may not be exactly the case, it might be slightly off, thus
we have Xt = FXt-1 + wt, where wt ~ N(0,Q)
Kalman Filter
 Suppose we have some sensors that can provide some
measurements about the tracked object in the form of a state
vector: Z = [a,b]
 These sensor measurements originate from the hidden state
vector X with the form: Zt = HXt
 BUT, in reality this sensor can be imperfect, noisy etc...
 We deal with this by saying Zt = HXt + v, where v ~ N(0,R)
 R is called the sensor’s error covariance
Kalman Filter
 We want to recover some hidden information about a tracked
object: X = [x,y,v]
 We can predict its movements “blindly” using: X’t|t-1 = FX’t-1|t-1
+ wt
 But this model is inaccurate in a Gaussian sense: wt ~ N(0,Q)
 We have some sensors that provide observations to indirectly tell
us how accurate our predictions are Zt – HX’t|t-1
 BUT, need to take this with a pinch of salt, since our sensors are
inaccurate as well (Zt has Gaussian noise with covariance R)
Kalman Filter
 So, task at hand: how do we best combine our prediction of a
tracked object state with the sensor observations, given that both
have Gaussian noise?
 That is what a Kalman filter does in an optimal sense (provided your
noise IS Gaussian and your dynamics ARE linear)
 Xt|t = X’t|t-1 + K( Zt – HX’t|t-1 )
 K is called the “Kalman gain”
 Essentially, if sensor noise is small and prediction noise is large, K
becomes H⁻¹, meaning: trust the observations.
 Conversely, if sensor noise is large,
K becomes 0: trust the prediction
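One predict/update cycle using the equations above can be sketched in NumPy; the matrix names follow the slides (F, H, Q, R), while the function name and interface are illustrative:

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict + update cycle of a linear Kalman filter.
    x: state estimate, P: state covariance, z: new measurement."""
    # Predict: propagate the state and covariance through the linear dynamics F
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: weigh the innovation (z - H x_pred) by the Kalman gain K
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```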
Kalman Filter Operation
From: Kalman filter for dummies
Using a Kalman Filter to Track
 How prediction overcomes occlusion
issues
Youtube: kalman Filter result on real aircraft & Result of Kalman Filter on a Moving Aircraft
Extended Kalman Filter-EKF
 The Kalman filter addresses the
problem of dynamics estimation by
linear equations
 Most problems are non-linear
 EKF attempts to address this by making
the state prediction Xt = F( Xt-1 ) + w
 F can be any non-linear function
See www.cs.unc.edu/~welch for introductory tutorials and sample code
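The EKF prediction step can be sketched as follows, assuming the Jacobian of F is supplied; the function and argument names are illustrative, and only the predict half is shown (the update mirrors the linear case):

```python
import numpy as np

def ekf_predict(x, P, f, F_jac, Q):
    """EKF prediction: push the state through a non-linear f, and
    propagate the covariance using the Jacobian of f evaluated at x."""
    x_pred = f(x)               # non-linear state prediction
    F = F_jac(x)                # local linearisation of the dynamics
    P_pred = F @ P @ F.T + Q    # covariance propagated through the linearisation
    return x_pred, P_pred
```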
Exploring a parameter space for
the global solution
 We could try every single model configuration to find
the lowest cost solution but this can be unfeasible
(640x480x100x360=11,059,200,000)
 We could just randomly pick model configurations in
the hope that we find a low cost solution but this
does not guarantee that we will find it and as the
dimensionality and complexity increase so must the
number of random samples
 These are common problems and hence standard
optimisation techniques can be employed
– e.g. Simulated Annealing, Genetic Algorithms
7RandomSample.exe
Tracking as an Optimisation
Problem
 In simulated annealing we try and use some simple
heuristic to reduce the number of samples we need to
test
 In Genetic Algorithms we try and guide our random
search through observation to again reduce the
complexity of the search
 However, these are blind optimisations and we often
know much more about the problem we are trying to
solve such as the nature of observations or the
dynamics we are expecting (remember the Kalman
Filter)
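Simulated annealing's accept/reject heuristic can be sketched on a one-parameter cost; this is a generic textbook sketch with illustrative defaults, not the scheme used in the demos:

```python
import math
import random

def simulated_annealing(cost, x0, step=0.5, t0=1.0, cooling=0.99, iters=2000, seed=0):
    """Randomly perturb the current configuration; always accept improvements,
    accept worse moves with probability exp(-delta / T), and cool T each step.
    Early high-temperature moves let the search escape local minima."""
    rng = random.Random(seed)
    x, best, t = x0, x0, t0
    for _ in range(iters):
        cand = x + rng.uniform(-step, step)
        delta = cost(cand) - cost(x)
        if delta < 0 or rng.random() < math.exp(-delta / t):
            x = cand                      # move accepted
            if cost(x) < cost(best):
                best = x                  # remember the best configuration seen
        t *= cooling                      # cool the temperature
    return best
```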
Tracking as an Optimisation
Problem
 Example of using simulated annealing for tracking the
body pose
N. Lehment, M. Kaiser, D. Arsic, and G. Rigoll.
Cue-Independent Extending Inverse Kinematics for Robust Pose Estimation in 3D Point Clouds.
Proc. IEEE Intern. Conf. on Image Processing (ICIP 2010)
Factored Sampling
 We have seen how the KF uses a simple
Gaussian to model observations but what
happens if observations are non-Gaussian?
 Factored Sampling can be used to search a
static image in these cases
 We want to calculate the posterior probability
that an object X exists in an image given the
observed data obj
– P(X |obj)
Factored Sampling
 This is difficult to achieve for continuous complex
non-Gaussian distributions
 Luckily Bayes’ formula says that the posterior
density can be obtained as a product of a prior
density P0(X ) and an observation density P(obj|
X )
– P(X |obj) ≈ P(obj|X ) P0(X )
 Factored sampling estimates the posterior by
generating samples from the prior and weighting
them according to the observation density
Factored Sampling
 A set of n points s(n), the centres of the blobs in the figure,
are sampled randomly from the prior density P0(X)
 Each sample is then assigned a weight (depicted by blob
area) based upon the observation density P(obj|X = s(n))
 If n is sufficiently large then the weighted set represents
the posterior density P(X|obj)
[Figure: probability over state X, showing the posterior density curve above the weighted sample set]
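Factored sampling amounts to importance sampling: draw from the prior, weight by the observation density. A minimal sketch under illustrative Gaussian choices (the prior, observation density, and names here are my own stand-ins):

```python
import numpy as np

def factored_sampling(sample_prior, observation_density, n=2000, seed=0):
    """Draw n samples from the prior P0(X) and weight each by the observation
    density P(obj|X); the weighted set approximates the posterior P(X|obj)."""
    rng = np.random.default_rng(seed)
    samples = np.array([sample_prior(rng) for _ in range(n)])
    weights = np.array([observation_density(s) for s in samples])
    weights = weights / weights.sum()    # normalise so the set sums to 1
    return samples, weights
```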
CONDENSATION and Particle Filtering
 CONDitional DENsity propagATION also known
as particle filtering is the natural extension of
the KF to factored sampling
 Basically:
– Randomly generate a distribution from the prior pdf
and apply a model of dynamics (i.e. predict)
– Fit each sample to the image (i.e. measure)
– Weight samples accordingly to generate a new
posterior pdf that will serve as the prior for the next
iteration
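The three steps above can be sketched for a one-dimensional state; Gaussian drift stands in for the dynamics model, and `measure` stands in for fitting a sample to the image:

```python
import numpy as np

def condensation_step(particles, weights, measure, drift_std=1.0, seed=0):
    """One CONDENSATION cycle on a 1-D state: resample from the weighted set,
    predict with Gaussian drift, then re-weight by the measurement density."""
    rng = np.random.default_rng(seed)
    # Resample proportionally to the old weights (the posterior becomes the prior)
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    resampled = particles[idx]
    # Predict: apply the (here trivial) dynamics plus process noise
    predicted = resampled + rng.normal(0.0, drift_std, size=len(particles))
    # Measure: weight each particle by how well it fits the observation
    new_weights = np.array([measure(p) for p in predicted])
    new_weights = new_weights / new_weights.sum()
    return predicted, new_weights
```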
CONDENSATION and Particle Filtering
[Figure: one predict/measure cycle of the sample set]
CONDENSATION and Particle Filtering
The animation shows a few cycles of the
algorithm applied to a one-dimensional system.
The green spheres correspond to the members
of the sample set, where the size of the sphere
is an indication of the sample weight. The red
line is the measurement density function.
This animation shows a short sequence of the
CONDENSATION filter tracking a leaf
exhibiting non-linear motion with occlusion
and clutter.
Movie sequences taken from
http://www.dai.ed.ac.uk/CVonline/LOCAL_COPIES/ISARD1/condensation.html
CONDENSATION and Particle Filtering
 We can extend our random sampler to a
simple PF using Gaussian noise as our
dynamics/drift term
 Notice how the population quickly homes in
on the area of highest probability as we saw
in the random sampling
 It quickly converges on incorrect local
solutions, increasing the noise term helps
explore the space further but the global
maximum is at the bottom of the image
8ParticleFilter.exe
CONDENSATION and Particle Filtering
 We can further try to change the model to
better fit the head and ensure the global
maximum is at the correct position
 Tracking is better but easily lost to other
maxima
 As the population size is increased we start to
see multiple hypothesis tracking
 By combining both the PF and a gradient
descent method we can get the best results
for the lowest population, but our cost
function is still flawed
9Particle filter.exe
10ParticleFilter.exe
CONDENSATION and Particle Filtering
 Advantages
– Allows complex non-Gaussian systems
– Easy to add non-linear dynamics
– Provides support for multiple hypotheses (!!!)
 Disadvantages
– Large numbers of samples make the techniques
extremely slow for high parameter spaces
– Not a global optimisation so has the tendency to
converge upon good observations at the cost of other
observations
 There are many schemes for overcoming these
problems, but they are beyond the scope of this lecture
Interesting Applications of
Motion Tracking
Lip-Reading
 Facial features of a subject are tracked, specifically the
mouth regions.
 Mouth texture and shape are extracted and
used to build discriminative patterns called
sequential patterns
Lip-Reading
 Results:
Sign Language Recognition
 Tracking required for extracting the motions of the
hands and head.
 Movement features of the hands and hand
shapes are extracted
 Again, discriminative movement patterns
uniquely identifying a sign is extracted
 These patterns will be used to detect whether a sign is
present in a video sequence or not
Sign Language Recognition
 Results:
Group Behaviour Profiling
 Even when tracking is not very accurate or robust, it
can still be used to do useful things!
 Example: Use simple trackers (e.g. Lucas
Kanade trackers) to “track” people in a crowd
 These will only last a short while, but can form
short trajectories.
 The analysis of these trajectories can be used
to profile crowd behaviours.
Group Behaviour Profiling
 Results:
Summary
We have looked at a variety of tracking strategies from
very simple schemes to those which can learn and
predict complex non-linear motion in cluttered
environments. This talk is not exhaustive but should
give you a basic understanding of the types of
techniques used in modern computer vision systems.
For more details on many of the examples see my
website http://www.surrey.ac.uk/personal/e.ong
For a good introduction on the temporal mechanics of
tracking I would recommend reading
“Active Contours” by Isard and Blake
Things to remember!!!
 When tracking:
– Tracking is only as good as your model and data
 A bad metric will give bad results
 The larger the parameter space the more difficult things
become
– Make things as simple as possible
 Constrain your environment
 Use appropriate techniques and dynamics
– e.g. if you're tracking someone jumping up and down, don't
use a Kalman filter
– Don't try to reinvent the wheel
 But if you're going to use black-box techniques, ensure you
know what they will and won't do for you

Motion and tracking

  • 1.
    Motion and Tracking Eng-JonOng University of Surrey e.ong@surrey.ac.uk
  • 2.
    Introduction  There havebeen many objects that have been tracked in the past.  Whole objects: Cars, bicycles, human bodies. Source: Youtube: Intelligent Traffic Surveillance
  • 3.
    What objects havebeen tracked? There have been many objects that have been tracked in the past.  Medium level features: Heads, Hands, small objects, etc..
  • 4.
    What objects havebeen tracked?  There have been many objects that have been tracked in the past.  Fine level features: Facial feature points, finger positions, etc...
  • 5.
    Overview  The taskof visual tracking involves locating the position of a tracked target by a combination of features and motion models.  There is a strong relationship between the task of object detection and tracking. Visual model + Detector Motion Model
  • 6.
    Overview  One canthink of tracking as a motion-model constrained detection.  Detection on the whole image tends to be expensive Visual model + Detector Motion Model
  • 7.
    Overview  Introduction  Objectmodels  Simple search strategies  Using linear dynamics  Optimisation search strategies  Summary
  • 8.
  • 9.
    Representation of Tracked Objects The first question: How do we computationally represent an object we want to track?  Image template  Combination of low level information (e.g. Lines)  Contour information
  • 10.
    Evaluation of differentmodels “fitness”  We need a measure of model fitness on an image given a set of parameters (e.g. Position + scale).  For images, we have template matching using different scores:  Normalised cross correlation is the most basic (i.e. Sum of squares of pixel differences)
  • 11.
    Evaluation of differentmodels “fitness”  There are more sophisticated methods for matching a template to an image:  Boosted detectors are a popular choice.  Boosting is a method that combines a set of very simple object detectors together to yield a strong detector.
  • 12.
    Boosted Cascade Cascade Layer1 90% Rejected 10% pass . . . . Cascade Layer 2 Cascade Layer 3 10% pass 90% Rejected 90% Rejected 90% Rejected Face detected Cascade Layer n
  • 13.
    Boosted Cascade Layer1 2 Classifiers Layer 2 5 Classifiers Layer 3 5 Classifiers Layer 4 20 Classifiers Layer 5 50 Classifiers Layer 6 50 Classifiers Layer 7 128 Classifiers Layer 8 132 Classifiers Layer 9 100 Classifiers
  • 14.
    Detecting and TrackingHumans in Images
  • 15.
  • 16.
    Simple Tracking Strategies Detection/Global Search  Goal: Where to place the contour on the image?
  • 17.
    Simple Tracking Strategies n dI dn I n (x1,y1) (x2,y2) (x3,y3) (x4,y4) ^ n1 ^ n2 ^ n3 ^ n4 Contours and Costs – Search along contour normal for edges – Move contour x,y,scale & rotation
  • 18.
    Evaluation of differentmodels “fitness”  For lines and contours, we can use distances to nearest edges.  But, different configurations of contour searches can have different results.  Run demos:  3tracescanline.exe  4tracescanlinelong.exe n dI dn I n (x1,y1) (x2,y2) (x3,y3) (x4,y4) ^ n 1 ^ n 2 ^ n 3 ^ n 4
  • 19.
    Simple Tracking Strategies Global Search – If the parameter space of the search is low in dimensionality then a simple global search of the image is sufficient
  • 20.
    Simple Tracking Strategies Global Search – If the parameter space of the search is low in dimensionality then a simple global search of the image is sufficient
  • 21.
    Simple Tracking Strategies Global Search – If the parameter space of the search is low in dimensionality then a simple global search of the image is sufficient – Not practical for most applications
  • 22.
    Detecting and Tracking Humansin Images  We can track just using global search if the detectors are fast enough
  • 23.
    Iterative Tracking  Mosttracking schemes work on the assumption that an object will make small iterative movements between frames  Using this assumption only a local search is required to update model parameters  Tracking is typically posed as a 2 step process: – Initialisation (Global/Detection) – Iteration (Local)
  • 24.
    Iterative Tracking Example1  Assume the initial position is known  Assume object wont move far  Search locally to find movement that maximises some fitness function
  • 25.
    Iterative Tracking Example1  Assume the initial position is known  Assume object wont move far  Search locally to find movement that maximises some fitness function
  • 26.
    Iterative Tracking Example2  Again: – requires good initialisation – relies on small inter-frame movements
  • 27.
    Iterative Tracking Example2  Example of contour tracking failing due to indistinct edges  A better example of tracking but highly susceptible to initialisation  Increasing the local search provides better initialisation but decreases tracking performance 1BadContour.exe 2BetterContour.exe 4TraceScanLineLong.exe
  • 28.
  • 29.
    Tracking as anOptimisation Problem  Tracking can be thought of as an optimisation where some cost function represents how well a model fits an image.  Model fitting is done by attempt to find the model parameters that minimise/maximise this cost function  This can be done at each frame to track objects through a video sequence
  • 30.
    Using Gradient Descent The previous approaches of iteratively refining a model given a local search is effectively a gradient descent optimisation  This will only work if the initial pose of the model is very close to the ideal position as energy surfaces typically have many local minima Cost Parame
  • 31.
    Using Gradient Descent Energy surfaces are typically very complex and impossible to visualise due to high dimensionality  In the figure there is one global minimum but many local minima that are almost as good  Unless our model is very close to the ideal location a gradient descent approach will converge on a local minima and get trapped  We've already seen this in action on the contour tracker Cost Parameter
  • 32.
    Choosing a costfunction  Returning to the contour example lets formulate a cost function as the Euclidean distance between a model and the strongest features in the image  We can visualise the cost surface across a single parameter  Notice the surface has a global minimum but it is not distinct 3TraceScanLine.exe
  • 33.
    Choosing a costfunction  We can do the same after increasing the local search (by extending our search along normals) to see how this affects the cost surface  Note it makes the minima more distinct but this image has no background clutter. Additional clutter would result in further complicating the surface TraceScanLineLong.exe
  • 34.
    Choosing a costfunction  Lets choose a different cost function  This time we will take the edge strength supporting the model pose  Notice the surface has inverted and we now seek to find the maximum  It has a very clear maximum which corresponds to the global solution which SHOULD be easy to find!!! 5cost2TraceScanLine.exe
  • 35.
    Lucas-Kanade Tracking  RememberGradient Descent Cost Parame  Well if we know more about the surface we can speed things up: – If we assume the cost surface is a parabola then given a position and a gradient we can move to the minimum in one move
  • 36.
    Lucas-Kanade Tracking  Newton-Raphson convergence vn+1=vn− f n ' f n ''  Jacobian  Hessian • Two differences • LK uses the Sum of Squared differences across the entire image. • x is a multi-dimensional warp parameter. v f(v)
  • 37.
    Lucas-Kanade Tracking       x ssd Tv,wI=d 2 xx      xx Tv,wI v w I=d v x ssd        2 - = { }* ∑ y w I    Jacobian ?)(?,   ssdd v x w I   
  • 38.
    Lucas-Kanade Tracking       x ssd Tv,wI=d 2 xx      xx Tv,wI v w I=d v x ssd        2  2 2 2 2 dO+ v w I v w I= v d x T ssd                     ∑ y w I    Jacobian Hessian x w I    y w I             ?? ?? 2 2 v dssd x w I   
  • 39.
  • 40.
  • 41.
    Mean-shift  We canlook for local maxima in object detector outputs using mean-shift
  • 42.
    Mean-shift  We canlook for local maxima in object detector outputs using mean-shift
  • 43.
    Mean shift  Exampleof simple mean-shift tracking  Object “Detector” is distance to RGB histogram Youtube: Mean shift tracking of red bal, normalised RGB and 64 bin histogram
  • 44.
  • 45.
    Regression-based Tracking  Uptill now, tracking is seen as a constrained detection problem. Essentially template matching, searching a parameter space to minimise a matching fitness function.  Another approach is to pose the problem as a regression problem: Given template difference, predict the translational offset to the correct position. (no explicit search needed!)
  • 46.
    Linear Predictors (Robust FacialFeature Tracking using Shape Constrained Multi Resolution Selected Linear Predictors, Ong et al) a c b Y P= [ Ia – I'a, Ib – I'b, lc – I'c ] X = HP  Reference Point + Support Pixels (a,b,c)  Linear mapping (H) from support pixel intensity difference to translation vector
  • 47.
     Linear Predictor“Bunches” – Single LPs are not stable enough for tracking image features – Use a set (“bunch”) of LPs instead – Final prediction = consensus of the most common predicted translation Linear Predictors
    Linear Predictors  “Tracking context” is very important: we only want to use surrounding visual information if it helps the tracking.  We may want to track one point, but we should use visual information from a nearby stable region to track it; other regions have too much variation.
    Linear Predictors  We can find the tracking context by evaluating the accuracy of trackers using local patches and gradually removing the bad ones.
    Linear Predictors  Cascaded linear predictors: – Linear predictors trained to overcome large offsets are not accurate, but are robust – LPs trained to overcome small offsets are accurate, but not robust – Solution: cascade them. Use big-offset LPs first, then pass the results to smaller ones for refinement.  (Figures: errors of a “large” LP predicting from an offset position, where blue is medium prediction error; errors of a “small” LP predicting from an offset position, where white is small prediction error.)
    Non-Linear Predictors (Non-linear Predictors for Facial Feature Tracking, FG2013, Sheerman-Chase et al.)  P = [ Ia – I'a, Ib – I'b, Ic – I'c ], X = H( P )  Replace the linear mapping with the non-linear mapping of regression trees  The input is still the support-pixel differences; the output is still an offset.
    Non-Linear Predictors  Replacelinear mapping with the non-linear mapping of regression trees  Input still support pixel differences, output still offsets S1<0.4 dy = 23 S50<0.1 Dy = 32dy = -10
    Non-Linear Predictors  Results:More robust tracking able to handle larger amounts of pose and expression variations.
    Non-Linear Predictors  Allowsus to do freaky things like this:
    Background to the template update problem  No update: misrepresentation error – catastrophic  Naïve update: drift error – slow accumulation  (Figure: error over frame time for the true feature with its old appearance, the true feature with its new appearance, and a false feature.)
    Background: template update (Mutual information for Lucas-Kanade tracking (MILK): an inverse compositional formulation, Dowson et al., PAMI 08)
    Building a Model of Templates  Appearance space
    Temporal Consistency  Thissequence shows a surveillance application tracking subjects as they move. The technique uses a per pixel mixture of Gaussians to model background colour distributions and perform dynamic background subtraction.
    Tracking with Motion Models  The task of visual tracking involves locating the position of a tracked target by a combination of features and motion models.  There is a strong relationship between the task of object detection and tracking. Visual model + Detector Motion Model
    Using Motion  Objectsoften exhibit consistent motion
    Kalman Filter  Toexploit this motion consistency, many authors model it with simple dynamics in the what is called the Kalman filter  A Kalman filter is simply an optimal recursive data processing algorithm.  It makes predictions based on previous estimates and current observations
    Kalman Filter  Supposewe have some hidden information to recover (i.e. Not directly observable) and takes the form of a state vector  E.g. X = [x,y,v] position, velocity of a tracked object  This object has a true position at time t, Xt, which we do not know  But suppose we think this object’s dynamics works in a linear fashion like: Xt = FXt-1  BUT this may not be exactly the case, it might be slightly off, thus we have Xt = FXt-1 + wt, where wt ~ N(0,Q) Xt
    Kalman Filter  Supposewe have some sensors that can provide some measurements about the tracked object in the form of a state vector: Z = [a,b]  This sensor measurements is originates from the hidden state vector X with the form: Zt = HXt  BUT, in reality this sensor can be imperfect, noisy etc...  We deal with this by saying Zt = HXt + v, where v ~ N(0,R)  R is called the sensor’s error covariance
    Kalman Filter  Wewant to recover some hidden information about a tracked object: X = [x,y,v]  We can predict it’s movements “blindly” using: X’t|t-1 = FX’t-1|t-1 + wt  But this model is inaccurate in a Gaussian sense: wt ~ N(0,Q)  We have some sensors that provide observations to indirectly tell us how accurate our predictions are Zt – HX’t|t-1  BUT, need to take this with a pinch of salt, since our sensors are inaccurate as well (Zt has Gaussian noise with covariance R)
    Kalman Filter  So,task at hand: how do we best combine our prediction of a tracked object state with the sensor observations, given that both have Gaussian noise?  That is what a Kalman filter does in a optimal sense (provide your noise IS Gaussian and your dynamics IS linear)  Xt|t = X’t|t-1 + K( Zt – HX’t|t-1 )  K is called the “Kalman gain”  Essentially, if sensor noise is small and prediction noise large, K becomes H-1, meaning trust the observations.  Conversely, if sensor noise is large, K becomes 0, trust prediction
    Kalman Filter Operation  (Diagram from: Kalman filter for dummies)
    Using a Kalman Filter to Track  How prediction overcomes occlusion issues  Youtube: Kalman filter result on real aircraft & result of Kalman filter on a moving aircraft
    Extended Kalman Filter (EKF)  The Kalman filter addresses the problem of dynamics estimation with linear equations  Most problems are non-linear  The EKF attempts to address this by making the state prediction Xt = F( Xt-1 ) + w  F can be any non-linear (differentiable) function  See www.cs.unc.edu/~welch for introductory tutorials and samples.
    Exploring a parameter space for the global solution  We could try every single model configuration to find the lowest-cost solution, but this can be infeasible (640 × 480 × 100 × 360 = 11,059,200,000 configurations)  We could just randomly pick model configurations in the hope of finding a low-cost solution, but this does not guarantee that we will find it, and as the dimensionality and complexity increase so must the number of random samples  These are common problems, and hence standard optimisation techniques can be employed, e.g. simulated annealing, genetic algorithms  Demo: 7RandomSample.exe
    Tracking as anOptimisation Problem  In simulated annealing we try and use some simple heuristic to reduce the number of samples we need to test  In Genetic Algorithms we try and guide our random search through observation to again reduce the complexity of the search  However, these are blind optimisations and we often know much more about the problem we are trying to solve such as the nature of observations or the dynamics we are expecting (remember the Kalman Filter)
    Tracking as anOptimisation Problem  Example of using simulated annealing for tracking the body pose N. Lehment, M. Kaiser, D. Arsic, and G. Rigoll. Cue-Independent Extending Inverse Kinematics For Robust Pose Estimation in 3D Point Clouds. Proc. IEEE Intern. Conf.on Image Processing (ICIP2010)
    Factored Sampling  Wehave seen how the KF uses a simple Gaussian to model observations but what happens if observations are non-Gaussian?  Factored Sampling can be used to search a static image in these cases  We want to calculate the posterior probability that an object X exists in an image given the observed data obj – P(X |obj)
    Factored Sampling  Thisis difficult to achieve for continuous complex non-Gaussian distributions  Luckily Bayes’ formula says that the posterior density can be obtained as a product of a prior density P0(X ) and an observation density P(obj| X ) – P(X |obj) ≈ P(obj|X ) P0(X )  Factored sampling estimates the posterior by generating samples from the prior and weighting them according to the observation density
    Factored Sampling  Aset of n points s (n), the centres of the blobs in the figure are sampled randomly from the prior density P(X )  Each sample is then assigned a weight (depicted by blob area) based upon the observation density P(obj|X = s (n) )  If n is sufficiently large then the weighted set represents the posterior density P(X |obj) State X Probability posterior density weighted sample
    CONDENSATION and ParticleFiltering  CONDitional DENsity propagATION also known as particle filtering is the natural extension of the KF to factored sampling  Basically: – Randomly generate a distribution from the prior pdf and apply a model of dynamics (i.e. predict) – Fit each sample to the image (i.e. measure) – Weight samples accordingly to generate a new posterior pdf that will serve as the prior for the next iteration
    CONDENSATION and Particle Filtering  (Figure: the predict and measure steps.)
    CONDENSATION and Particle Filtering  The animation shows a few cycles of the algorithm applied to a one-dimensional system. The green spheres correspond to the members of the sample set, where the size of a sphere indicates the sample weight. The red line is the measurement density function. A second animation shows a short sequence of the CONDENSATION filter tracking a leaf exhibiting non-linear motion with occlusion and clutter. Movie sequences taken from http://www.dai.ed.ac.uk/CVonline/LOCAL_COPIES/ISARD1/condensation.html
    CONDENSATION and ParticleFiltering  We can extend our random sampler to a simple PF using gaussian noise as our dynamics/drift term  Notice how the population quickly homes in on the area of highest probability as we saw in the random sampling  It quickly converges on incorrect local solutions, increasing the noise term helps explore the space further but the global maximum is at the bottom of the image 8ParticleFilter.exe
    CONDENSATION and ParticleFiltering  We can further try to change the model to better fit the head and ensure the global is at the correct position  Tracking is better but easily lost to other maxima  As the population size is increased we start to see multiple hypothesis tracking  By combining both the PF and a gradient decent method we can get the best results for the lowest population, but our cost function is still flawed 9Particle filter.exe 10ParticleFilter.exe
    CONDENSATION and ParticleFiltering  Advantages – Allows complex non-Gaussian systems – Easy to add non-linear dynamics – Provides support for multiple hypotheses (!!!)  Disadvantages – Large numbers of samples make the techniques extremely slow for high parameter spaces – Not a global optimisation so has the tendency to converge upon good observations at the cost of other observations  There are many schemes for overcoming these problems but are beyond the scope of this lecture
    Lip-Reading  Facial featuresof a subject are tracked, specifically the mouth regions.  Mouth texture and shape are extracted and used to build discriminative patterns called sequential patterns
    Sign Language Recognition Tracking required for extracting the motions of the hands and head.  Movement features of the hands and hand shapes are extracted  Again, discriminative movement patterns uniquely identifying a sign is extracted  These patterns will be used to detect whether a sign is present in a video sequence or not
    Group Behaviour Profiling Even when tracking is not very accurate or robust, it can still be used to do useful things!  Example: Use simple trackers (e.g. Lucas Kanade trackers) to “track” people in a crowd  These will only last a short while, but can form short trajectories.  The analysis of these trajectories can be used to do profile crowd behaviours.
    Summary  We have looked at a variety of tracking strategies, from very simple schemes to those which can learn and predict complex non-linear motion in cluttered environments.  This talk is not exhaustive, but should give you a basic understanding of the types of techniques used in modern computer vision systems.  For more details on many of the examples see my website http://www.surrey.ac.uk/personal/e.ong  For a good introduction to the temporal mechanics of tracking I would recommend reading “Active Contours” by Isard and Blake.
    Things to remember!!! When tracking: – Tracking is only as good as your model and data  A bad metric will give bad results  The larger the parameter space the more difficult things become – Make things as simple as possible  Constrain your environment  Use appropriate techniques and dynamics – e.g. if your tracking someone jumping up and down don’t use a kalman filter – Don’t try to reinvent the wheel  But if your going to use black box techniques ensure you know what they will and wont do for you