Object Tracking: A Survey
1. Object Tracking and Detection
By Alper Yilmaz, Omar Javed, and Mubarak Shah
Compiled by
Haseeb Hassan
haseeb@ahu.edu.cn
Anhui University Hefei,China
2. The three authors discuss articles published between 1979 and 2006.
The paper presents the field in depth and cites approximately 162 references.
It is difficult to understand everything in the paper, but this summary tries to establish some basic concepts.
Our survey is focused on methodologies for tracking objects in general and
not on trackers tailored for specific objects, for example, person trackers that
use human kinematics as the basis of their implementation.
About this Review Paper
3. Preface
This is an extensive survey of object tracking methods that also gives a brief review of related
topics.
• Tracking methods are divided into three categories based on the object representation used: methods establishing point correspondence, methods using primitive geometric models, and methods using contour evolution.
• Point trackers require detection in every frame, whereas geometric region or contour-based trackers require detection only when the object first appears in the scene.
• Some discussion of object detection is also included.
• Summaries of object trackers, object representations, and motion models are provided.
• We believe that this survey of object tracking, with its rich bibliography, can give valuable insight into this important research topic and encourage new research.
4. 1.What is Object Tracking
Object tracking estimates the trajectory of an object in the image plane as it moves around a scene, by locating its position in every frame.
It is an important task within the field of computer vision.
There are three key steps in video analysis:
Detection of interesting moving objects
Tracking of such objects from frame to frame
Analysis of object tracks to recognize their behavior
5. 1.2-Difficulties in Tracking
Difficulties in tracking objects can arise due to:
Abrupt object motion
Changing appearance patterns of both the object and the scene
Non-rigid object structures
Object-to-object and object-to-scene occlusions
Camera motion
6. 1.3-Object Tracking Applications
Motion-Based Recognition, that is, human identification based on gait, automatic
object detection, etc.;
Automated Surveillance, that is, monitoring to detect suspicious activities
Video Indexing, that is, automatic annotation and retrieval of the videos in
multimedia databases
Human-Computer Interaction, that is, gesture recognition, eye gaze tracking
for data input to computers, etc.;
Traffic monitoring, that is, real-time gathering of traffic statistics to direct
traffic flow.
Vehicle Navigation, that is, video-based path planning and obstacle avoidance
capabilities.
7. Different Approaches Proposed
Numerous approaches for object tracking have been proposed. They differ primarily in how they answer the following
questions:
A. Which object representation is suitable?
B. Which image features should be used?
C. How should the motion, appearance, and shape of the object be
modeled?
The answers depend on the context/environment in which the tracking is performed.
A large number of tracking methods have been proposed which attempt to
answer these questions for a variety of scenarios.
8. 2.Object Representation
In a tracking scenario, an object is
anything that is of interest for
further analysis. For instance, boats
on the sea, fish inside an aquarium,
vehicles on a road, planes in the air,
people walking on a road, or
bubbles in the water are a set of
objects that may be important to
track in a specific domain.
Objects can be represented by
shapes and appearances.
Points
Primitive geometric shapes
Object silhouette and
contour
Articulated shape models
Skeletal models
10. Continued…
—Probability densities of object appearance
These densities can be either parametric, such as a Gaussian or a mixture of
Gaussians, or nonparametric, such as histograms. The probability densities of object appearance features
(color, texture) can be computed from the image regions specified by
the shape models (interior region of an ellipse or a contour).
—Templates
o Templates are formed using simple geometric shapes or silhouettes
[Fieguth and Terzopoulos 1997].
o They carry both spatial and appearance information.
o They are only suitable for tracking objects whose poses do not vary considerably during tracking.
11. Models
—Active appearance models. Generated by simultaneously modeling the object
shape and appearance [Edwards et al. 1998]. The object shape is defined by a set of
landmarks; for each landmark, an appearance vector is stored in the form of color, texture, or gradient magnitude.
—Multiview appearance models.
These models encode different views of an object.
One approach to representing the different object views is to generate a
subspace from the given views. Subspace approaches, such as Principal Component Analysis (PCA) and
Independent Component Analysis (ICA), have been used for both shape and
appearance representation [Moghaddam and Pentland 1997; Black and
Jepson 1998].
Another approach to learn the different views of an object is by training a set
of classifiers, for example, the support vector machines [Avidan 2001] or
Bayesian networks [Park and Aggarwal 2004].
13. 4. Object Detection
Every tracking method requires an object detection mechanism.
A common approach for detection is to use information in a single frame.
Some object detection methods use temporal information computed from a
sequence of frames to reduce the number of false detections.
This temporal information is usually in the form of frame differencing,
which highlights changing regions in consecutive frames.
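Frame differencing can be sketched in a few lines. The example below is only an illustration on synthetic grayscale frames; the function name `frame_difference_mask` and the threshold of 25 intensity levels are assumptions made for this sketch, not values from the survey.

```python
import numpy as np

def frame_difference_mask(prev_frame, curr_frame, threshold=25):
    """Return a boolean mask of pixels whose intensity changed by more than `threshold`."""
    # Cast to a signed type so subtracting uint8 images cannot wrap around.
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return diff > threshold

# Synthetic 8-bit frames: a bright 2x2 block moves one pixel to the right.
prev_frame = np.zeros((6, 6), dtype=np.uint8)
curr_frame = np.zeros((6, 6), dtype=np.uint8)
prev_frame[2:4, 1:3] = 200
curr_frame[2:4, 2:4] = 200

mask = frame_difference_mask(prev_frame, curr_frame)
changed_pixels = int(mask.sum())
```

Only the pixels the block left and the pixels it entered are flagged; the overlapping column does not change, which illustrates why frame differencing tends to highlight the edges of a moving region rather than the whole object.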
16. 4.2-Background Subtraction
Object detection can be achieved by building a representation of the scene
called the background model.
Significant change in an image region from the background model signifies
a moving object.
The pixels constituting the regions undergoing change are marked for
further processing.
Background subtraction became popular following
the work of Wren et al. [1997].
An alternate approach for background subtraction is to represent the intensity variations of
a pixel in an image sequence as discrete states corresponding to events in the environment.
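As a rough illustration of the background model idea, the sketch below keeps a running average of the scene and marks pixels that deviate strongly from it. This is a deliberately simpler stand-in for the per-pixel Gaussian and mixture-of-Gaussians models discussed in the survey; the function names, the learning rate `alpha`, and the threshold are assumptions chosen for the example.

```python
import numpy as np

def update_background(background, frame, alpha=0.05):
    """Blend the new frame into the running-average background model."""
    return (1.0 - alpha) * background + alpha * frame

def foreground_mask(background, frame, threshold=30.0):
    """Mark pixels that differ significantly from the background model."""
    return np.abs(frame.astype(np.float64) - background) > threshold

# Learn the background from 20 noisy frames of an empty scene (intensity about 100).
rng = np.random.default_rng(0)
background = np.full((8, 8), 100.0)
for _ in range(20):
    frame = 100.0 + rng.normal(0.0, 2.0, size=(8, 8))
    background = update_background(background, frame)

# A new frame containing a bright 2x2 object.
frame_with_object = np.full((8, 8), 100.0)
frame_with_object[3:5, 3:5] = 220.0
mask = foreground_mask(background, frame_with_object)
```

The marked pixels would then be passed on for further processing, for example grouping into connected regions.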
17. Background Subtraction
Mixture of Gaussians modeling for background subtraction.
Most state-of-the-art tracking methods for fixed cameras, for example, Haritaoglu et al. [2000] and Collins et al. [2001], use background subtraction methods to detect regions of interest.
The most important limitation of background subtraction is the requirement of stationary cameras.
These methods can still be applied to video acquired by mobile cameras when the motion between successive frames is small.
18. 5-Segmentation
The aim of image segmentation algorithms is to partition the image into perceptually similar regions.
Every segmentation algorithm addresses two problems: the criteria for a good partition and the method for achieving efficient partitioning [Shi and Malik 2000].
1. Mean Shift Clustering
For the image segmentation problem, Comaniciu and Meer [2002] propose the mean-shift approach to find clusters in the joint spatial+color space, [l, u, v, x, y], where [l, u, v] represents the color and [x, y] represents the spatial location.
20. Continued…
Mean-shift clustering is scalable to various other applications such as edge detection, image regularization [Comaniciu and Meer 2002], and tracking [Comaniciu et al. 2003].
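The core mean-shift step, moving a point to the average of its neighbors until it stops moving, can be sketched as follows. This is a minimal illustration with a flat (uniform) kernel on 2D points; in the segmentation setting each point would instead be a 5D [l, u, v, x, y] vector. The function name `mean_shift_mode`, the bandwidth, and the synthetic data are assumptions for this sketch.

```python
import numpy as np

def mean_shift_mode(points, start, bandwidth=1.0, max_iter=100, tol=1e-6):
    """Move `start` to a local density mode using a flat (uniform) kernel."""
    center = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        # Points within `bandwidth` of the current center form the window.
        distances = np.linalg.norm(points - center, axis=1)
        window = points[distances <= bandwidth]
        if len(window) == 0:
            break  # empty window: nothing to average
        new_center = window.mean(axis=0)
        shift = np.linalg.norm(new_center - center)
        center = new_center
        if shift < tol:
            break  # converged: the window mean no longer moves
    return center

# Two synthetic clusters centered at (0, 0) and (5, 5).
rng = np.random.default_rng(1)
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.2, size=(100, 2))
cluster_b = rng.normal(loc=[5.0, 5.0], scale=0.2, size=(100, 2))
points = np.vstack([cluster_a, cluster_b])

mode_a = mean_shift_mode(points, start=[0.5, 0.5])
mode_b = mean_shift_mode(points, start=[4.5, 4.5])
```

Starting points that converge to the same mode belong to the same cluster, which is how the procedure yields a segmentation without fixing the number of clusters in advance.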
21. 5.2-Image Segmentation Using Graph-Cuts
• A cut in a graph is a set of edges whose
removal disconnects the graph.
• Image segmentation can also be formulated as a graph partitioning problem, where the vertices (pixels), $V = \{u, v, \ldots\}$, of a graph (image), $G$, are partitioned into $N$ disjoint subgraphs (regions), $A_i$, such that $\bigcup_{i=1}^{N} A_i = V$ and $A_i \cap A_j = \emptyset$ for $i \neq j$.
• A limitation of the minimum cut is its bias toward oversegmenting the image.
• Shi and Malik [2000] propose the normalized cut to overcome the oversegmentation problem.
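The difference between the two criteria can be seen on a toy graph. The sketch below evaluates the plain cut and the normalized cut of Shi and Malik, Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V), for two candidate bipartitions. The seven-node graph, its edge weights, and the helper names are invented for illustration, and the code only scores given partitions; it does not search for the optimal one.

```python
import numpy as np

def cut_value(weights, in_a):
    """Total weight of edges running between partition A and its complement."""
    in_b = ~in_a
    return weights[np.ix_(in_a, in_b)].sum()

def normalized_cut(weights, in_a):
    """Cut cost normalized by each side's total connection to the whole graph."""
    in_b = ~in_a
    cut = cut_value(weights, in_a)
    assoc_a = weights[in_a, :].sum()
    assoc_b = weights[in_b, :].sum()
    return cut / assoc_a + cut / assoc_b

# Two tight groups {0, 1, 2} and {3, 4, 5}, weakly linked, plus an outlier node 6.
weights = np.zeros((7, 7))
def connect(i, j, w):
    weights[i, j] = weights[j, i] = w

for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    connect(i, j, 1.0)
connect(2, 3, 0.3)  # weak link between the two groups
connect(5, 6, 0.2)  # even weaker link to the outlier

isolate_outlier = np.array([False] * 6 + [True])
balanced = np.array([True] * 3 + [False] * 4)

cut_isolate = cut_value(weights, isolate_outlier)
cut_balanced = cut_value(weights, balanced)
ncut_isolate = normalized_cut(weights, isolate_outlier)
ncut_balanced = normalized_cut(weights, balanced)
```

The plain cut is smaller when the single outlier node is split off, while the normalized cut is smaller for the balanced partition, which matches the bias toward small segments described above.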
22. Active Contours
o Object segmentation is achieved by evolving a closed contour to the object's boundary, such that the contour tightly encloses the object region.
o The concept of active contour models was first introduced in 1987.
o Active contour models are also called snakes.
o Snakes do not solve the entire problem of finding contours in images, since the method requires knowledge of the desired contour shape beforehand.
23. 6.Supervised Learning
In supervised learning, we are given a data set for which the correct output is already known, so we have an idea of the relationship between the input and the output.
Supervised learning methods generate a function that maps inputs to desired outputs.
Learning different object views waives the requirement of storing a complete set of templates.
Supervised learning methods require a large collection of manually labeled samples from each object class.
A possible approach for reducing the amount of labeled data is to combine cotraining with supervised learning [Blum and Mitchell 1998].
The workflow is: build the model, train the model, and test the model.
As an analogy, suppose a student wants to learn machine learning.
1 - The student plays the role of the model.
2 - The teacher teaches machine learning using some resources. This is the training process, in which the model is trained with past/current data.
3 - At the end of the course, the teacher may test the student's knowledge to check how well the student has done.
24. Cotraining Means
In the case of web-page classification, you build one model on the URL features of a web page and a different model on its text features. The idea is that these models are complementary and can help “correct” each other, since they are each likely to make different mistakes. Generally, this process is run iteratively until some convergence criterion is met. It works well if certain assumptions hold, such as that the two views are independent but each is sufficient for learning the class targets.
25. 6.1-Adaptive Boosting(Classifiers)
Boosting is an iterative method of finding a very accurate classifier by combining many base
classifiers.
In each iteration, the boosting mechanism selects the base classifier that gives the least error.
The algorithm then encourages the selection of classifiers that perform
better on the misclassified data in the next iteration.
In 2003, Viola et al. used the AdaBoost framework to detect pedestrians. In their
approach, perceptrons were chosen as the weak classifiers.
The individual learners can be weak, but as long as the performance of each one is
slightly better than random guessing, the final model can be proven to converge to a
strong learner.
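The reweighting loop described above can be sketched on a toy problem. The example uses threshold "stumps" as weak classifiers instead of the perceptrons used by Viola et al., purely to keep the code short; the data set (positives form an interval on a line, so no single stump can separate them) and all function names are assumptions for this illustration.

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """Weak classifier: +1 on one side of the threshold, -1 on the other."""
    return np.where(polarity * (X[:, feature] - threshold) >= 0, 1, -1)

def train_stump(X, y, sample_weights):
    """Pick the stump with the least weighted error on the current weights."""
    best = None
    for feature in range(X.shape[1]):
        values = X[:, feature]
        # Include one threshold above the maximum so a constant stump is possible.
        for threshold in np.append(np.unique(values), values.max() + 1):
            for polarity in (1, -1):
                pred = stump_predict(X, feature, threshold, polarity)
                error = sample_weights[pred != y].sum()
                if best is None or error < best[0]:
                    best = (error, feature, threshold, polarity)
    return best

def adaboost_train(X, y, rounds):
    weights = np.full(len(y), 1.0 / len(y))
    ensemble = []
    for _ in range(rounds):
        error, feature, threshold, polarity = train_stump(X, y, weights)
        error = np.clip(error, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - error) / error)  # vote of this weak classifier
        pred = stump_predict(X, feature, threshold, polarity)
        # Increase the weight of misclassified samples, decrease the rest.
        weights = weights * np.exp(-alpha * y * pred)
        weights /= weights.sum()
        ensemble.append((alpha, feature, threshold, polarity))
    return ensemble

def adaboost_predict(X, ensemble):
    score = sum(a * stump_predict(X, f, t, p) for a, f, t, p in ensemble)
    return np.where(score >= 0, 1, -1)

X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.array([-1, -1, -1, 1, 1, 1, 1, -1, -1, -1])

single_accuracy = float((adaboost_predict(X, adaboost_train(X, y, rounds=1)) == y).mean())
boosted_accuracy = float((adaboost_predict(X, adaboost_train(X, y, rounds=3)) == y).mean())
```

A single stump gets three of the ten samples wrong, while the weighted vote of three stumps classifies the training set correctly on this toy data.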
26. 6.2-Support Vector Machines
An SVM is a classifier that separates data into two classes by finding the maximum
marginal hyperplane between one class and the other [Boser et al.
1992].
In the context of object detection, Papageorgiou et al. [1998] use SVMs for
detecting pedestrians and faces in images.
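To make the idea of a margin-maximizing hyperplane concrete, the sketch below trains a linear soft-margin SVM by batch subgradient descent on the regularized hinge loss. This is a simplified substitute for the quadratic-programming formulation of Boser et al., and the synthetic 2D data, hyperparameters, and function name `train_linear_svm` are assumptions for illustration only.

```python
import numpy as np

def train_linear_svm(X, y, reg=0.01, lr=0.1, epochs=200):
    """Minimize reg/2 * ||w||^2 + mean hinge loss by batch subgradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        violating = margins < 1  # samples inside the margin or misclassified
        grad_w = reg * w - (y[violating][:, None] * X[violating]).sum(axis=0) / n
        grad_b = -y[violating].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Two well-separated synthetic classes in 2D.
rng = np.random.default_rng(4)
negatives = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))
positives = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))
X = np.vstack([negatives, positives])
y = np.concatenate([-np.ones(50), np.ones(50)])

w, b = train_linear_svm(X, y)
predictions = np.sign(X @ w + b)
accuracy = float((predictions == y).mean())
```

In a detector, `X` would hold feature vectors extracted from image windows and the sign of `X @ w + b` would decide object versus non-object.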
30. 6-1.2--Deterministic Methods
Deterministic methods define a cost function made up of constraints such as maximum velocity, common motion, and rigidity.
This cost function must then be minimized for tracking.
A greedy algorithm that iteratively optimizes the point correspondences can be used for this [26].
This algorithm is based on the one used in a paper by Sethi and Jain.
The algorithm is modified in [26] to preserve more motion information so that point measurements are not missed.
The constraints are:
Proximity assumes that the location of the object does not change notably from one frame to the next.
Maximum velocity defines an upper bound on the object velocity and limits the possible correspondences to a circular neighborhood around the object.
Small velocity change (smooth motion) assumes that the direction and speed of the object do not change drastically.
Common motion constrains the velocity of objects in a small neighborhood to be similar. This constraint is suitable for objects represented by multiple points.
31. Continued…
Rigidity assumes that objects in the 3D world are rigid; therefore, the distance between
any two points on the actual object will remain unchanged (see Figure 10(e)).
Proximal uniformity is a combination of the proximity and the small velocity change
constraints.
Note that these constraints are not specific to the deterministic methods;
they can also be used in the context of point tracking using statistical methods.
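A small sketch may help show how two of these constraints, proximity and maximum velocity, turn into a correspondence procedure. The code below greedily pairs points between two frames in order of increasing distance and rejects pairs that would exceed a maximum displacement per frame. It is a simplified illustration, not the algorithm of Sethi and Jain or of [26]; the function name and the sample coordinates are hypothetical.

```python
import numpy as np

def greedy_correspondence(prev_points, curr_points, max_velocity):
    """Greedily match points between frames by increasing distance (proximity),
    ignoring pairs farther apart than `max_velocity` pixels per frame."""
    prev_points = np.asarray(prev_points, dtype=float)
    curr_points = np.asarray(curr_points, dtype=float)
    # Pairwise distances: rows index previous points, columns index current points.
    dists = np.linalg.norm(prev_points[:, None, :] - curr_points[None, :, :], axis=2)
    candidates = [
        (float(dists[i, j]), i, j)
        for i in range(len(prev_points))
        for j in range(len(curr_points))
        if dists[i, j] <= max_velocity  # maximum velocity constraint
    ]
    candidates.sort()  # cheapest (closest) pairs first
    matches, used_prev, used_curr = {}, set(), set()
    for _, i, j in candidates:
        if i not in used_prev and j not in used_curr:
            matches[i] = j
            used_prev.add(i)
            used_curr.add(j)
    return matches

prev_points = [(10, 10), (50, 50), (90, 20)]
curr_points = [(52, 51), (12, 11), (200, 200)]
matches = greedy_correspondence(prev_points, curr_points, max_velocity=5.0)
```

The third previous point finds no partner within the velocity bound, so it stays unmatched, which is how such a scheme can signal an object exit or a missed detection.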
32. 7.Statistical Methods
o Statistical methods model uncertainties to handle noise in an image. A well-known method for statistical point tracking is multiple hypothesis tracking (MHT). A set of hypotheses is maintained for an object, and a prediction of the object's position is made for each hypothesis. The hypothesis with the highest likelihood is chosen for tracking.
o MHT is used in [Fieguth and Terzopoulos 1997] in order to overcome occlusion.
o Common methods for tracking single objects are the Kalman filter and particle filters. The Kalman filter is limited to linear systems and uses prediction and correction steps to estimate an object's motion.
o In the study in [18], the particle filter was initialized using an algorithm based on Support Vector Machines. The results showed that using color distributions along with particle filtering is very effective in tracking fast-moving, non-rigid objects.
o These methods have been used extensively for tracking contours [Isard and Blake 1998], activity recognition [Vaswani et al. 2003], object identification [Zhou et al. 2003], and structure from motion [Matthies et al. 1989].
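The prediction and correction steps of the Kalman filter can be sketched for a one-dimensional constant-velocity model. The state is [position, velocity] and only a noisy position is measured. The noise covariances, the initial state, and the simulated trajectory below are assumptions picked for this example.

```python
import numpy as np

def kalman_predict(x, P, F, Q):
    """Prediction: propagate the state and its covariance through the motion model."""
    x = F @ x
    P = F @ P @ F.T + Q
    return x, P

def kalman_correct(x, P, z, H, R):
    """Correction: blend the prediction with the measurement z."""
    innovation = z - H @ x
    S = H @ P @ H.T + R               # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)    # Kalman gain
    x = x + K @ innovation
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P

dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])  # constant-velocity motion model
H = np.array([[1.0, 0.0]])             # only the position is observed
Q = 1e-4 * np.eye(2)                   # process noise covariance (assumed)
R = np.array([[1.0]])                  # measurement noise covariance (assumed)

x = np.array([0.0, 0.0])               # initial guess: at the origin, not moving
P = 100.0 * np.eye(2)                  # large initial uncertainty

rng = np.random.default_rng(2)
true_velocity = 2.0
for step in range(1, 31):
    true_position = true_velocity * step
    z = np.array([true_position + rng.normal(0.0, 1.0)])
    x, P = kalman_predict(x, P, F, Q)
    x, P = kalman_correct(x, P, z, H, R)

estimated_position, estimated_velocity = x
```

Although velocity is never measured directly, the filter recovers it from the sequence of positions. During an occlusion, one option is to run only `kalman_predict` until measurements become available again.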
33. 8.Kernel Tracking
o Kernel tracking represents the object as a geometric shape, called a kernel, and estimates the motion of this kernel in consecutive frames.
o Kernel tracking is commonly used to track a single object. A basic approach uses a brute force search of the image for a region that matches the template from the previous image [28].
o The brute force search makes this method computationally expensive, but the cost can be reduced by optimizations such as limiting the search to a certain region.
o Mean shift can be used for template matching, which eliminates the need for a brute force search. Mean shift was first introduced in 1975 by Fukunaga and Hostetler. It is an iterative algorithm that shifts a point towards the average of the other points in its neighborhood.
o A limitation of kernel tracking is that parts of the background may appear inside the kernel, but this can be overcome by placing the kernel inside the object instead of around it.
o We divide these tracking methods into two subcategories based on the appearance representation used.
34. 8.1 Tracking single objects Approaches
• Template matching is a common approach; it is a brute force method of searching the image for a region similar to the object template.
• A limitation of template matching is its high computation cost due to the brute force search.
• Other object representations can be used for tracking; for example, color histograms or mixture models can be computed from the appearance of pixels inside rectangular or ellipsoidal regions.
• Fieguth and Terzopoulos [1997] generate object models by finding the mean color of the pixels inside the rectangular object region. To reduce computational complexity, they search for the object in eight neighboring locations.
• Comaniciu and Meer [2003] use a weighted histogram computed from a circular region to represent the object. Instead of performing a brute force search, they use the mean-shift procedure to locate the object.
• Jepson et al. [2003] propose an object tracker that tracks an object as a three-component mixture, consisting of the stable appearance features, transient features, and a noise process.
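Brute force template matching, and the common optimization of restricting the search to a neighborhood of the previous position, can be sketched as follows. The similarity measure used here is the sum of squared differences (SSD), one of several possible choices; the function name, the `center`/`radius` parameters, and the random test image are assumptions for this illustration.

```python
import numpy as np

def match_template_ssd(image, template, center=None, radius=None):
    """Return the top-left position with the smallest SSD and that SSD value.
    If `center` and `radius` are given, only positions within `radius` pixels
    of `center` (in each axis) are tested."""
    image_h, image_w = image.shape
    templ_h, templ_w = template.shape
    row_lo, row_hi = 0, image_h - templ_h
    col_lo, col_hi = 0, image_w - templ_w
    if center is not None and radius is not None:
        row_lo, row_hi = max(row_lo, center[0] - radius), min(row_hi, center[0] + radius)
        col_lo, col_hi = max(col_lo, center[1] - radius), min(col_hi, center[1] + radius)
    best_pos, best_score = None, np.inf
    for r in range(row_lo, row_hi + 1):
        for c in range(col_lo, col_hi + 1):
            patch = image[r:r + templ_h, c:c + templ_w]
            score = float(np.sum((patch - template) ** 2))
            if score < best_score:
                best_pos, best_score = (r, c), score
    return best_pos, best_score

rng = np.random.default_rng(3)
image = rng.random((20, 20))
template = image[7:11, 5:9].copy()  # the "object" as seen in the previous frame

full_pos, full_score = match_template_ssd(image, template)
local_pos, local_score = match_template_ssd(image, template, center=(6, 4), radius=2)
```

The full search tests 17 x 17 positions, while the restricted search tests only 5 x 5, yet both find the same location as long as the object has moved less than the search radius.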
35. Examples
• In 1994, Shi and Tomasi proposed the KLT tracker.
Results of the robust online tracking method by Jepson et al. [2003].
Tracking features using the KLT tracker.
36. 8.2 Tracking Multiple Objects
One proposed method is based on modeling the whole image, $I^t$, as a set of layers. This representation includes a single background layer and one layer for each object. Each layer consists of a shape prior (ellipse), a motion model (translation and rotation), and a layer appearance, A (intensity modeled using a single Gaussian).
Isard and MacCormick [2001] propose joint modeling of the background and foreground regions for tracking. The background appearance is represented by a mixture of Gaussians. The appearance of all foreground objects is also modeled by a mixture of Gaussians.
Kernel trackers can be compared based on whether they track single or multiple objects, their ability to handle occlusion, whether they require training, and the type of motion model.
37. 9. Silhouette Tracking
Objects may have complex shapes (for example, hands, head, and shoulders) that cannot be well described by simple geometric shapes. Silhouette-based methods provide an accurate shape description for these objects.
Silhouette trackers find the object region in each frame by means of an object model generated using the previous frames. This model can be in the form of a color histogram, object edges, or the object contour. We divide silhouette trackers into two categories: shape matching and contour tracking.
Shape matching can be performed similarly to tracking based on template matching, where an object silhouette and its associated model are searched for in the current frame. The search is performed by computing the similarity of the object with the model generated from the hypothesized object silhouette based on the previous frame.
In 1993, Huttenlocher et al. performed shape matching using an edge-based representation.
Another approach to matching shapes is to find corresponding silhouettes detected in two consecutive frames. Establishing silhouette correspondence, or silhouette matching for short, can be considered similar to the point matching discussed earlier.
38. Silhouette Tracking Categories
Contour tracking methods, in contrast to shape matching methods, iteratively evolve an initial contour in the previous frame to its new position in the current frame. This contour evolution requires that some part of the object in the current frame overlap with the object region in the previous frame. Silhouette tracking is employed when tracking of the complete region of an object is required.
39. 10.Resolving Occlusion
o Occlusion falls into three categories: self occlusion, interobject occlusion, and occlusion by the background scene structure.
o Self occlusion occurs when one part of the object occludes another. This situation most frequently arises while tracking articulated objects.
o For interobject occlusion, multiobject trackers such as MacCormick and Blake [2000] and Elgammal et al. [2002] can exploit knowledge of the positions of the objects involved.
o A common approach to handling complete occlusion is to model the object motion by linear dynamic models or by nonlinear dynamics.
o A nonlinear dynamic model is used in Isard and MacCormick [2001], and a particle filter is employed for state estimation.
o Other features, for example, silhouette projections and optical flow, have also been utilized to resolve occlusion.
o Yilmaz et al. [2004] build online shape priors using a mixture model based on the level set contour representation. Their approach is able to handle complete object occlusion.
40. 11.Future Direction
o Significant progress has been made in the last few years, and many trackers have been developed.
o This survey shows that common assumptions, such as smoothness of motion, a minimal amount of occlusion, illumination constancy, and high contrast with respect to the background, are violated in many realistic scenarios, so trackers that do not rely on them are needed.
The associated problems of feature selection, object representation, dynamic shape, and motion estimation are very active areas of research, and new solutions are continuously being proposed.
Challenges:
1: One challenge is to develop algorithms for tracking objects in unconstrained videos, such as broadcast and home videos, which are noisy, compressed, and acquired from moving cameras from multiple views.
2: In formal and informal meetings, many people share a small field of view, so severe occlusion occurs. One solution is to employ audio in addition to video for tracking.
An important issue in developing tracking algorithms is the integration of contextual information. In a vehicle tracking application, the location of vehicles should be constrained to paths on the ground as opposed to vertical walls or the sky. Recent work in the area of object recognition [Torralba 2003; Kumar and Hebert 2003] has shown that exploiting contextual information is helpful.
41. Future Direction
• In addition, advances in classifiers [Friedman et al. 2000; Tipping 2001] have made accurate detection of scene context possible. A tracker that takes advantage of contextual information performs better.
• The feature set used for tracking also affects performance, since good features discriminate between multiple objects and between the objects and the background.
• A wide range of feature selection algorithms has been investigated, but these algorithms require offline training information about the target. Collins and Liu [2003] have done some work on online feature selection, but the problem of selecting feature sets remains unsolved.
• One interesting direction that has largely been unexplored is the use of
semisupervised learning techniques for modeling objects.
• Kalman Filters [Bar-Shalom and Foreman 1988], JPDAFs [Cox 1993], HMMs [Rabiner
1989], and Dynamic Bayesian Networks (DBNs) [Jensen 2001] have been extensively
used to estimate object motion parameters.
• Overall, we believe that additional sources of information, in particular prior and
contextual information, should be exploited.
Editor's Notes
Additionally, depending on the tracking domain, a tracker can also provide object-centric information, such as orientation, area, or shape of an object. Tracking objects can be complex due to:
-- loss of information caused by projection of the 3D world on a 2D image,
—noise in images,
—complex object motion,
—nonrigid or articulated nature of objects,
—partial and full object occlusions,
—complex object shapes,
—scene illumination changes, and
—real-time processing requirements.
For tracking objects, which appear very small in an image, point representation is usually appropriate.
For the objects whose shapes can be approximated by rectangles or ellipses, primitive geometric shape representations are more appropriate. Comaniciu et al. [2003] used.
For tracking objects with complex shapes, for example, humans, a contour or a silhouette-based representation is appropriate. Haritaoglu et al. [2000] use silhouettes for object tracking in a surveillance application.
Shape representations can also be combined with the appearance representations [Cootes et al. 2001] for tracking. Some common appearance representations in the context of object tracking are:
One limitation of multiview appearance models is that the appearances in all views are required ahead of time.
Among all features, color is one of the most widely used features for tracking. Comaniciu et al. [2003] use a color histogram to represent the object appearance.
Where the color feature is not applicable, Cremers et al. [2003] use optical flow as a feature for contour tracking. Jepson et al. [2003] use steerable filter responses for tracking.
An alternate approach for background subtraction is to represent the intensity variations of a pixel in an image sequence as discrete states corresponding to the events in the environment. In practice, background subtraction provides incomplete object regions in many instances.
In summary, most state-of-the-art tracking methods for fixed cameras, for example, Haritaoglu et al. [2000] and Collins et al. [2001] use background subtraction methods to detect regions of interest.
A supervised learning method approximates the behavior of a function by generating an output in the form of either a continuous value, which is called regression, or a class label, which is called classification.
Co-training (which is a special case of the more general multi-view learning) is when two different views of the data are used to build a pair of models/classifiers.
In the case of web-page classification, you build one model on the URL features of your website and build a different model on the text features of the website.
Reference 10:
Fieguth, P., and Terzopoulos, D. Color-based tracking of heads and other mobile objects at video frame rates. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (1997), pp. 21-27.
Reference 18:
Nummiaro, K., Koller-Meier, E., and Gool, L. V. Color features for tracking non-rigid objects. Special Issue on Visual Surveillance, Chinese Journal of Automation 29 (May 2003), 345-355.
1: The most important advantage of tracking silhouettes is their flexibility to handle a large variety of object shapes.
2: Occlusion handling is another important aspect of silhouette tracking methods.
The chance of occlusion can be reduced by an appropriate selection of camera positions. However, oblique view cameras are likely to encounter multiple object occlusions and require occlusion handling mechanisms.
Multiple cameras viewing the same scene can also be used to resolve object occlusions during tracking [Dockstader and Tekalp 2001a; Mittal and Davis 2003].
Multi-camera tracking methods like Dockstader and Tekalp [2001a] and Mittal and Davis [2003] have demonstrated superior tracking results as compared to single camera trackers in the case of persistent occlusion between the objects.