Real-Time Object Localization and Tracking from
Image Sequences
Yuanwei Wu, Yao Sui, Arjan Gupta and Guanghui Wang
Abstract—To address the problem of autonomous sense and
avoidance for unmanned aerial vehicle navigation via a vision-based method, in this letter, we propose a real-time object
localization and tracking strategy from monocular image se-
quences. The proposed approach effectively integrates the tech-
niques of object detection, localization, and tracking into a
dynamic model. At the detection stage, the object of interest
is automatically detected and localized from a saliency map
computed via the boundary connectivity cue of the frame; at the tracking
stage, a Kalman filter is employed to provide a coarse prediction
of the object position and size, which is further refined via a
local detector using the image boundary connectivity cue and context
information between consecutive frames. Compared to existing
methods, the proposed technique does not require any manual
initialization, runs much faster than the state-of-the-art trackers
of its kind, and achieves comparable tracking performance.
Extensive comparative experiments demonstrate the effectiveness
and better performance of the proposed approach.
Index Terms—Salient object detection; visual tracking;
Kalman filter; object localization; real-time tracking;
I. INTRODUCTION
VISUAL object tracking has played important roles in
many computer vision applications, such as human-
computer interaction, surveillance, and video understanding
[1]. Due to emerging real-world applications, like deliver-
ing packages using small unmanned aerial vehicles (UAVs)
[2], there is a huge demand for vision-based autonomous
navigation for UAVs. First of all, the vision-based methods
are robust to electromagnetic interference compared to conventional sensor-based methods, e.g., the global positioning system (GPS) [3]. Second, vision-based methods are needed due to the strict size and limited power constraints of UAVs. Based
on this background, in this letter, we address autonomous sense
and avoidance of obstacles for UAVs during flight via the
integration of object detection and tracking.
Tracking-by-detection methods have become increasingly popular for real-time visual tracking applications [4]. Correlation filter-based trackers have attracted increasing attention in recent years due to their high-speed performance [5]. However,
those conventional tracking methods [4, 6–13] require manual
initialization with the ground truth at the first frame. Moreover,
they are sensitive to initialization variations caused by scale and position errors, and would return useless information once tracking fails [14].

This work is partly supported by the National Aeronautics and Space Administration (NASA) LEARN II program under grant number NNX15AN94N. The authors are with the Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, KS 66045 USA. The source code and dataset will be available on the authors' homepage http://www.ittc.ku.edu/~ghwang/. (email: wuyuanwei2010@gmail.com, suiyao@gmail.com, arjangupta@ku.edu, ghwang@ku.edu)
Combining a detector with a tracker is a feasible solution
for automatic initialization [15]. The detector, however, needs
to be trained with a large amount of training samples, while
the prior information about the object of interest is usually
not available in advance. In [16], Mahadevan et al. proposed
a saliency-based discriminative tracker with automatic initial-
ization, which builds the motion saliency map using optical
flow. This technique, however, is computationally intensive and
not suitable for real-time applications.
Some recent techniques on salient object detection and
visual tracking [17, 18] have achieved superior performance
by using deep learning. However, these methods need a large amount of samples for training. The methods of object co-
localization in videos [19, 20] are originally designed to handle
objects of the same class across a set of distinct images or
videos, while for target tracking, we typically focus on a
salient object in a video sequence.
Several recent approaches exploit boundary connectivity
[21, 22] for natural images, which have been shown to be
effective for salient object detection. Since the saliency map effectively discovers the spatial information of the target, it enables
us to improve the target localization accuracy. Inspired by the
salient object detection approach [21], which achieves high
detection speed on individual images, we develop an efficient
method by integrating two complementary processes: salient
object detection and tracking. A Kalman filter is employed to
predict a coarse location of the target object, and the detector
is used to refine the solution.
In summary, our contributions are threefold: 1) the proposed algorithm integrates the saliency map into a dynamic model and adopts the target-specific saliency map as the observation for tracking; 2) we develop a tracker with automatic initialization for real-world applications; and 3) the proposed technique achieves better performance than state-of-the-art competing trackers in extensive real experiments.
II. THE PROPOSED APPROACH
The proposed fast object localization and tracking (FOLT)
algorithm can automatically and quickly localize the salient
object in the scene and track it across the sequence. In this
letter, the object of interest is the salient object in the view, so
the tracking problem is formulated as unsupervised salient object detection, where the object can be automatically obtained from the saliency map computed from each frame [21]. In the following,
we will present a detailed elaboration of the approach.
Fig. 1: A flow-chart of the proposed approach.
A. Overview of the proposed approach
In most tracking scenarios, the linear Gaussian motion
model has been demonstrated to be an effective representation
for the motion behavior of a salient object in natural image
sequences [23, 24]. Therefore, an optimal estimator, the Kalman filter [25], is used to estimate the motion attributes, e.g., the velocity, position, and scale of the object. A flow chart of the proposed approach is shown in Fig. 1. The bounding box
of the object is initialized from the saliency map of the entire
image [21]. A dynamic model is established to predict the
object position and size at the next frame. Under the constraint of natural motion, this predicted bounding box provides the tracking algorithm with a coarse solution, which is not far away
from the ground truth [23]. Thus, a reasonable search region
can be automatically attained by expanding the predicted
object window by a fixed percentage. Then, the location and size of the object are refined by computing the saliency within
the search region. Next, the refined bounding box, as a new
observation, is fed to the Kalman filter to update the dynamic
model in the correction phase. Through this process, the object
in the image sequence is automatically detected and tracked
relying on recursive prediction, observation, and correction.
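As an illustration of the automatic initialization step, the following C++/OpenCV sketch shows one plausible way to turn a saliency map into an initial bounding box: threshold the map and take the bounding rectangle of the largest connected component. The paper does not specify this extraction rule, so the function name boxFromSaliency, the Otsu threshold, and the largest-component heuristic are assumptions made here for illustration only.

#include <opencv2/imgproc.hpp>
#include <opencv2/core.hpp>
#include <vector>

// Hypothetical helper: extract an initial box from an 8-bit saliency map.
cv::Rect boxFromSaliency(const cv::Mat& saliency /* CV_8U, values in [0, 255] */)
{
    // Binarize the saliency map (Otsu's threshold is an assumption here).
    cv::Mat mask;
    cv::threshold(saliency, mask, 0, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);

    // Keep the largest salient region and return its bounding rectangle.
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(mask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    cv::Rect best;
    int bestArea = 0;
    for (const auto& c : contours) {
        cv::Rect r = cv::boundingRect(c);
        if (r.area() > bestArea) { bestArea = r.area(); best = r; }
    }
    return best;  // empty rectangle if no salient region was found
}

The returned rectangle would then serve as the first observation fed to the dynamic model described next.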
B. Motion model
In the dynamic model, the salient object in a frame is
defined by a motion state variable S with six components, S =
{x, y, u, v, w, h}, where (x, y) denotes the center coordinates,
(u, v) denotes the velocities, and (w, h) denotes the width and
height of the minimum bounding box. In the t-th frame, the
predicted state $\hat{S}^{-}_{t}$ is evolved from the prior state $\hat{S}_{t-1}$ in frame t−1, given knowledge of the process prior to time t−1, according to the following linear stochastic equation

$\hat{S}^{-}_{t} = F \hat{S}_{t-1} + w_{t-1}$,   (1)
where the variable wt−1 represents the additive, white Gaus-
sian noise with zero mean and known covariance, and F
denotes the state transition matrix. We use the notation St ∼
N(µ, Σ) to denote that state St is a random variable with a
normal probability distribution with mean µ and covariance
Σ in frame t. The covariance is a diagonal matrix composed of the variances of x, y, u, v, w, and h, respectively. Let us assume that z_t encodes the position and dimensions of
the minimum bounding box of the observation in frame t. The
observation z_t is the output of the fast salient object detector, which is represented by z_t = {x, y, w, h}.

Fig. 2: Illustration of updating the search region ROI using (a) raster scanning and (b) inverse-raster scanning.

The posterior state
of the object in frame t given observation zt is finally updated
by incorporating the observation and the dynamic model via
$S_t = \hat{S}^{-}_{t} + K_t (z_t - H \hat{S}^{-}_{t})$,   (2)
where K_t denotes the Kalman gain in frame t, which is used to obtain the posterior state estimate S_t. The minimum mean-square error estimate is obtained by weighting the difference between the prediction and the observation.
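A minimal sketch of this motion model using OpenCV's cv::KalmanFilter is given below; the matrices F and H follow Eqs. (1) and (2), while the noise variances are illustrative placeholders rather than the values used in the paper.

#include <opencv2/video/tracking.hpp>
#include <opencv2/core.hpp>

// Build a 6-state / 4-measurement Kalman filter for S = {x, y, u, v, w, h}
// and z = {x, y, w, h}, initialized from the first detected bounding box.
cv::KalmanFilter createMotionModel(const cv::Rect2f& init)
{
    cv::KalmanFilter kf(6, 4, 0, CV_32F);

    // State transition matrix F: constant velocity for (x, y),
    // identity for the velocities (u, v) and the box size (w, h).
    kf.transitionMatrix = (cv::Mat_<float>(6, 6) <<
        1, 0, 1, 0, 0, 0,
        0, 1, 0, 1, 0, 0,
        0, 0, 1, 0, 0, 0,
        0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 1);

    // Measurement matrix H: the detector observes position and size only.
    kf.measurementMatrix = (cv::Mat_<float>(4, 6) <<
        1, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 1);

    // Diagonal covariances (placeholder magnitudes, not the paper's values).
    cv::setIdentity(kf.processNoiseCov, cv::Scalar::all(1e-2));
    cv::setIdentity(kf.measurementNoiseCov, cv::Scalar::all(1e-1));
    cv::setIdentity(kf.errorCovPost, cv::Scalar::all(1.0));

    // Initialize the posterior state from the first detection.
    kf.statePost = (cv::Mat_<float>(6, 1) <<
        init.x + 0.5f * init.width, init.y + 0.5f * init.height,
        0.f, 0.f, init.width, init.height);
    return kf;
}

At run time, kf.predict() yields the coarse prediction of Eq. (1), and kf.correct(z) applies the update of Eq. (2) once the refined box z is available.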
C. Salient object detection
It has been shown that the cue of image boundary connectiv-
ity is effective for salient object detection [21, 22]. In natural
images, it is safe to assume that the object regions are much
less connected to the image boundaries.
In this letter, the salient object detection is formulated as
finding the shortest path from pixel w_ij to the seed set B on the image boundary, considering all possible paths in the
image. Each pixel in the 2D digital image I is denoted as a
vertex. The neighboring pixels are connected by edges. In this
work, we consider 4-adjacent neighbors, i.e., the neighbors of $w_{ij}$ are $w_{i-1,j}$, $w_{i+1,j}$, $w_{i,j-1}$, and $w_{i,j+1}$, as shown in Fig. 2. The path $p = v(0), v(1), \cdots, v(k)$ on image I denotes a sequence of consecutive neighboring pixels. Given a loss function L(p), the problem of finding the salient object in frame t is defined as

$I_t(w_{ij}) = \arg\min_{p \in P_{B, w_{ij}}} L(p)$,

where $P_{B, w_{ij}}$ denotes all possible paths connecting the seed set B and the pixel $w_{ij}$ in image $I_t$. Similar to the work in [21], we formulate the loss function at frame t as

$L_{I_t}(p) = \max_{i=0}^{k} I_t(p(i)) - \min_{i=0}^{k} I_t(p(i))$,

where $L_{I_t}(p)$ is the difference between the maximum and the minimum pixel intensity values along the path p. Let $E(w_{ij}, v)$ denote the edge connecting the vertices $w_{ij}$ and $v$, and let $Q(w_{ij})$ denote the current path connecting the pixel $w_{ij}$ with the image boundary set B. We define $C_{I_t}(Q(w_{ij}), E(w_{ij}, v))$ as the cost of the new path that connects the vertex v to the image boundary set B by appending the edge E to $Q(w_{ij})$; it can be calculated as

$C_{I_t}(Q(w_{ij}), E(w_{ij}, v)) = \max\{U(w_{ij}), I_t(v)\} - \min\{L(w_{ij}), I_t(v)\}$,

where $U(w_{ij})$ and $L(w_{ij})$ denote the maximum and the minimum pixel intensity values on the path $Q(w_{ij})$. A raster scanning method [21] can be used to calculate the cost $C_{I_t}(Q(w_{ij}), E(w_{ij}, v))$. The details will be discussed in Section II-D.
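As a minimal, self-contained illustration (not the authors' implementation), the cost defined above can be computed as follows for a vertex with intensity iv appended to a path whose running maximum and minimum intensities are U and L:

#include <algorithm>

// Running statistics of the current path Q(w_ij).
struct PathStats {
    float U;  // maximum pixel intensity along the path
    float L;  // minimum pixel intensity along the path
};

// Barrier cost of the extended path: max{U, I_t(v)} - min{L, I_t(v)}.
inline float barrierCost(const PathStats& q, float iv)
{
    return std::max(q.U, iv) - std::min(q.L, iv);
}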
Algorithm 1: Fast Object Localization Tracking (FOLT)
Input: image I_{t+1}, saliency map D_t, search region ROI_t, number of passes N
Output: saliency map D_{t+1}
Auxiliaries: U_{t+1}, L_{t+1}
Inside the search region ROI, set D_t to ∞; outside the search region, keep the values of D_t
Set L_{t+1} ← I_{t+1} and U_{t+1} ← I_{t+1}
for each frame do
    Prediction using Eq. (1)
    Observation as follows:
    for i = 1 : N do
        if mod(i, 2) = 1 then
            Raster scanning using Eqs. (3), (4), (5)
        else
            Inverse-raster scanning using Eqs. (3), (4), (5)
        end
    end
    Correction using Eq. (2)
    Update the complete D_t every ten frames
end
D. Fast object localization tracking
In [21], Zhang et al. provided a solution for individual im-
ages using the minimum barrier distance detection method. In
order to improve the accuracy and speed in image sequences,
we explore the integration of the image boundary connec-
tivity cue with the temporal context information between
consecutive frames. Therefore, we propose a fast salient object
detection and tracking framework as shown in Fig. 1. During
the observation stage, two fast scanning procedures, raster
scanning and inverse-raster scanning, are implemented to find
the location of the salient object between two consecutive
frames. As shown in Fig. 2, the inner window of the target
object is coarsely predicted using the dynamic model. The
search region is obtained by expanding the inner window
with a fixed percentage. The raster scanning and inverse-raster
scanning are used to update the pixel values in the search
region of image It. In the proposed approach, the search region
is dynamically determined based on the predicted position of
the salient object. As shown in Fig. 2(a), the raster scanning updates the intensities from the top-left pixel to the bottom-right pixel, where each pixel is updated using its two already-visited adjacent neighbors w_{i,j−1} and w_{i−1,j}. Similarly, the inverse-raster scanning traverses the search region in the reverse order and updates each pixel using the adjacent neighbors w_{i+1,j} and w_{i,j+1}, as shown in Fig. 2(b). The values outside of the search region are not updated since they contribute less to the detection. As a trade-off between accuracy and efficiency, a complete
saliency map of the entire image is updated every ten frames.
The updating strategy in the search region is given by

$I_t(w_{ij}) \leftarrow \min\big(I_t(w_{ij}),\, Q_{w_{ij}}(v)\big)$,   (3)
$U(w_{ij}) \leftarrow \max\big(U(v),\, I_t(w_{ij})\big)$,   (4)
$L(w_{ij}) \leftarrow \min\big(L(v),\, I_t(w_{ij})\big)$.   (5)
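A hedged sketch of one raster pass and one inverse-raster pass over the search region is given below, following Eqs. (3)-(5). Here D is the saliency (distance) map, and U and L hold the running maximum/minimum intensities of the best path to each pixel; the relaxation rule, variable names, and ROI handling are assumptions made for illustration rather than the authors' code.

#include <opencv2/core.hpp>
#include <algorithm>

// Relax pixel (y, x) using an already-scanned neighbor (ny, nx).
static void relax(const cv::Mat& I, cv::Mat& D, cv::Mat& U, cv::Mat& L,
                  int y, int x, int ny, int nx)
{
    float iv = I.at<float>(y, x);
    float cost = std::max(U.at<float>(ny, nx), iv) -
                 std::min(L.at<float>(ny, nx), iv);
    if (cost < D.at<float>(y, x)) {                              // Eq. (3)
        D.at<float>(y, x) = cost;
        U.at<float>(y, x) = std::max(U.at<float>(ny, nx), iv);   // Eq. (4)
        L.at<float>(y, x) = std::min(L.at<float>(ny, nx), iv);   // Eq. (5)
    }
}

// One raster pass and one inverse-raster pass restricted to the search region.
// I, D, U, L are single-channel CV_32F maps of the same size.
void scanSearchRegion(const cv::Mat& I, cv::Mat& D, cv::Mat& U, cv::Mat& L,
                      const cv::Rect& roi)
{
    // Raster pass: top-left to bottom-right, using the left and upper neighbors.
    for (int y = roi.y; y < roi.y + roi.height; ++y)
        for (int x = roi.x; x < roi.x + roi.width; ++x) {
            if (x > 0) relax(I, D, U, L, y, x, y, x - 1);
            if (y > 0) relax(I, D, U, L, y, x, y - 1, x);
        }
    // Inverse-raster pass: bottom-right to top-left, using the right and lower neighbors.
    for (int y = roi.y + roi.height - 1; y >= roi.y; --y)
        for (int x = roi.x + roi.width - 1; x >= roi.x; --x) {
            if (x + 1 < I.cols) relax(I, D, U, L, y, x, y, x + 1);
            if (y + 1 < I.rows) relax(I, D, U, L, y, x, y + 1, x);
        }
}

Alternating these two passes N times, as in Algorithm 1, propagates the boundary-connectivity cost through the search region in both directions.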
Fig. 3: Tracking results in representative frames of the pro-
posed and the 7 competing trackers on three challenging
sequences. First row: illumination variation (Skyjumping ce);
Second row: in-plane and out-of-plane rotations (big 2); Third
row: scale variation (motorcycle 006). (best viewed in color)
The implementation details of the above detection and tracking
algorithm are described in Algorithm 1, where the algorithm is initialized based on the detection result of the first frame, and the saliency map D_t of the previous frame is fed to the algorithm.
III. EXPERIMENTAL EVALUATIONS
The proposed approach is implemented in C++ with
OpenCV 3.0.0 on a PC with an Intel Xeon W3250 2.67
GHz CPU and 8 GB RAM. The datasets and source code
of the proposed approach will be available on the authors' homepage. The proposed tracker is evaluated on 15 popular
video sequences selected from [14, 26–28], each containing a salient object in the field of view. In each frame of these
video sequences, the target is labeled manually in a bounding
box, which is used as the ground truth in the quantitative
evaluations.
In our implementation, input images are first resized so that
the maximum dimension is 300 pixels. Three experiments are
designed to evaluate trackers as discussed in [14]: one pass
evaluation (OPE), temporal robustness evaluation (TRE), and
spatial robustness evaluation (SRE). For TRE, we randomly select the
starting frame and run a tracker to the end of the sequence.
Spatial robustness evaluation initializes the bounding box in
the first frame by shifting or scaling. As discussed in Section
II, the proposed method manages to automatically initialize
the tracker and is not sensitive to spatial fluctuations; therefore, the spatial robustness evaluation is omitted. We use the same temporal randomization as in [14] for TRE, and refer
readers to [14] for more details.
A. Speed performance
In the detection stage, the state-of-the-art fast detector MB+ [21] attains a speed of 49 frames per second (fps) on individual images; in contrast, the proposed method achieves a speed of 149 fps with accurate performance on image sequences, which is about three times faster than MB+. The average speed
comparison of the proposed and the seven state-of-the-art
competing trackers is provided in Table I. The average speed of
our tracker is 141 fps, which is at the same level as the fastest tracker, KCF [11]; however, KCF adopts a fixed tracking box,
TABLE I: Quantitative evaluations of the proposed and the 7
competing trackers on the 15 sequences. The best and second
best results are highlighted in bold-face and underline fonts,
respectively.
Ours CT [4] STC [6] CN [7] SAMF [8] DSST [9] CCT [10] KCF [11]
Precision of TRE 0.79 0.51 0.59 0.64 0.65 0.65 0.66 0.60
Success rate of TRE 0.61 0.45 0.46 0.54 0.58 0.56 0.57 0.52
Precision of OPE 0.83 0.44 0.48 0.44 0.59 0.48 0.66 0.48
Success rate of OPE 0.66 0.34 0.41 0.42 0.52 0.44 0.53 0.38
CLE (in pixels) 14.5 74.4 38.0 55.0 40.8 55.7 23.2 45.6
Average speed (in fps) 141.3 12.0 73.6 87.1 12.9 20.8 21.3 144.8
which cannot reflect the scale changes of the object. On
average, our method is more than ten times faster than CT [4]
and SAMF [8], more than five times faster than DSST [9] and CCT [10], and about twice as fast as STC [6] and CN [7].
B. Comparison with the state-of-the-art trackers
The performance of our approach is quantitatively validated
following the metrics used in [14]. We present the results using
precision, center location error (CLE), and success rate (SR).
The CLE is defined as the Euclidean distance between the
centers of the tracking and the ground-truth bounding boxes.
The precision is computed from the percentage of frames
where the CLEs are smaller than a threshold. Following [14],
a threshold value of 20 pixels is used for the precision in
our evaluations. A tracking result in a frame is considered
successful if the overlap ratio $\frac{|a_t \cap a_g|}{|a_t \cup a_g|} > \theta$ for a threshold $\theta \in [0, 1]$, where $a_t$ and $a_g$ denote the bounding boxes of the tracking result and the ground truth, respectively. Thus, SR is defined as the
percentage of frames where the overlap rates are greater than a
threshold θ. Normally, the threshold θ is set to 0.5. We evaluate
the proposed method by comparing to the seven state-of-the-
art trackers: CT, STC, CN, SAMF, DSST, CCT, and KCF.
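For reference, the two metrics can be computed from a pair of bounding boxes as in the following sketch; the helper names are ours, but the operations are the standard CLE and intersection-over-union overlap described above.

#include <opencv2/core.hpp>
#include <cmath>

// Center location error: Euclidean distance between box centers (in pixels).
double centerLocationError(const cv::Rect& track, const cv::Rect& gt)
{
    double dx = (track.x + 0.5 * track.width)  - (gt.x + 0.5 * gt.width);
    double dy = (track.y + 0.5 * track.height) - (gt.y + 0.5 * gt.height);
    return std::sqrt(dx * dx + dy * dy);
}

// Overlap ratio: intersection area over union area of the two boxes.
double overlapRatio(const cv::Rect& track, const cv::Rect& gt)
{
    double inter = (track & gt).area();
    double uni = track.area() + gt.area() - inter;
    return uni > 0.0 ? inter / uni : 0.0;
}

A frame counts toward the precision score if the CLE is below 20 pixels, and toward the success rate if the overlap ratio exceeds the threshold θ (0.5 by default).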
The comparison results on the 15 sequences are shown in
Table I. We present the results under one-pass evaluation and
temporal robustness evaluation using the average precision,
success rate, and CLE over all sequences. As shown in the
table, the proposed method outperforms all seven competing
trackers. It is evident that, in the one pass evaluation, the
proposed tracker obtains the best performance in the CLE
(14.5 pixels) and the precision (0.83), which are 8.7 pixels and 17 percentage points better than the second best tracker, CCT (23.2 pixels in CLE and 0.66 in precision). Meanwhile, in
the success rate, the proposed tracker achieves the best result (0.66), a 13-percentage-point improvement over the second best tracker, CCT (0.53). Please note that, for the 7 competing trackers,
the average performance in TRE is higher than that in OPE;
while for the proposed tracker, the precision and success scores
in TRE are lower than those in OPE. This is because the
proposed tracker tends to perform well in longer sequences,
while the 7 competing trackers work well in shorter sequences
[14]. In addition, Fig. 4 shows the precision and success plots
in the one pass evaluation and temporal robustness evaluation
over all 15 sequences. In the two evaluations, according to both
the precision and the success rate, our approach significantly outperforms the seven competing trackers.

Fig. 4: Precision and success rate plots over the 15 sequences in (top) one pass evaluation (OPE) and (bottom) temporal robustness evaluation (TRE). (best viewed in color)

In summary, the
precision plot demonstrates that our approach is superior in
robustness compared to its counterparts in the experiments;
the success rate shows that our method estimates the scale
changes of the target more accurately.
C. Qualitative evaluation
In this section, we present some qualitative comparisons of
our approach with respect to the 7 competing trackers. Fig. 3
(first row) illustrates a sequence with significant illumination
variations as well as gradual out-of-plane rotations. Both CT
and STC can deal with illumination changes very well, but fail
in the presence of pose variations and out-of-plane rotations,
as shown in frames #365 and #666. In contrast, our tracker
accurately estimates both the scale and position of the target.
Fig. 3 (second row) shows the results on a sequence with
significant in-plane and out-of-plane rotations. Our approach
obtains the best performance in these cases. In this sequence, our approach temporarily tracks only part of the target due to out-of-plane
rotation, but it accurately reacquires the target in the following
frames, as shown in frames #319 and #369.
Fig. 3 (third row) illustrates the results on a sequence
with large scale variations. STC, SAMF, DSST, and CCT are capable of handling scale changes, but they fail in this sequence, as shown in frames #110, #145, and #170. The competing trackers fail to handle the significant appearance changes caused by rotating motions and fast scale variations. In
contrast, our tracker is robust to large and fast scale variations.
IV. CONCLUSIONS
In this paper, we have proposed an effective and efficient
approach for real-time visual object localization and tracking,
which can be applied to UAV navigation tasks such as obstacle sense and avoidance. Our method integrates a fast salient object detector within a Kalman filtering framework. Compared to the state-of-the-art trackers, our approach not only initializes automatically but also achieves very high speed and better performance than the competing trackers.
REFERENCES
[1] A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A
survey,” ACM Comput. Surv., vol. 38, no. 4, pp. 1–45,
2006.
[2] Amazon, “Amazon prime air,” https://www.youtube.com/
watch?v=98BIu9dpwHU, 2013.
[3] M. Fraiwan, A. Alsaleem, H. Abandeh, and O. Aljarrah,
“Obstacle avoidance and navigation in robotic systems:
A land and aerial robots study,” in 5th Int. Conf. Inf.
Commu. Systems (ICICS). IEEE, 2014, pp. 1–5.
[4] K. Zhang, L. Zhang, and M. Yang, “Real-time compres-
sive tracking,” in Eur. Conf. Computer Vision (ECCV). Springer, 2012, pp. 864–877.
[5] Z. Chen, Z. Hong, and D. Tao, “An experimental survey
on correlation filter-based tracking,” arXiv preprint, pp.
1–13, 2015.
[6] K. Zhang, L. Zhang, Q. Liu, D. Zhang, and M. Yang,
“Fast visual tracking via dense spatio-temporal context
learning,” in Eur. Conf. Computer Vision (ECCV). Springer, 2014, pp. 127–141.
[7] M. Danelljan, F. Khan, M. Felsberg, and J. Weijer,
“Adaptive color attributes for real-time visual tracking,”
in IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1090–1097.
[8] Y. Li and J. Zhu, “A scale adaptive kernel correlation
filter tracker with feature integration,” in Eur. Conf.
Computer Vision (ECCV) Workshops. Springer, 2014, pp.
254–265.
[9] M. Danelljan, G. Häger, F. Khan, and M. Felsberg, “Accurate scale estimation for robust visual tracking,” in Proc. Br. Mach. Vis. Conf. (BMVC), 2014, pp. 1–11.
[10] G. Zhu, J. Wang, Y. Wu, and H. Lu, “Collaborative
correlation tracking,” in Proc. Br. Mach. Vis. Conf. (BMVC),
2015, pp. 1–12.
[11] J. Henriques, R. Caseiro, P. Martins, and J. Batista,
“High-speed tracking with kernelized correlation filters,”
IEEE Trans. Patt. Anal. Mach. Intell., vol. 37, no. 3, pp.
583–596, 2015.
[12] Y. Sui, Z. Zhang, G. Wang, Y. Tang, and L. Zhang,
“Real-time visual tracking: Promoting the robustness of
correlation filter learning,” in Eur. Conf. Computer Vision
(ECCV). Springer, 2016.
[13] Y. Sui and L. Zhang, “Visual tracking via locally
structured Gaussian process regression,” IEEE Signal
Process. Lett., vol. 22, no. 9, pp. 1331–1335, 2015.
[14] Y. Wu, J. Lim, and M. Yang, “Online object tracking:
A benchmark,” in IEEE Computer Soc. Conf. Computer
Vision and Pattern Recognition (CVPR), 2013, pp. 2411–
2418.
[15] M. Andriluka, S. Roth, and B. Schiele, “People-
tracking-by-detection and people-detection-by-tracking,”
in IEEE Computer Soc. Conf. Computer Vision and
Pattern Recognition (CVPR). IEEE, 2008, pp. 1–8.
[16] V. Mahadevan and N. Vasconcelos, “Saliency-based
discriminant tracking,” in IEEE Computer Soc. Conf.
Computer Vision and Pattern Recognition (CVPR). IEEE,
2009, pp. 1007–1013.
[17] C. Ma, J. Huang, X. Yang, and M. Yang, “Hierarchical
convolutional features for visual tracking,” in IEEE Int.
Conf. Computer Vision (ICCV), 2015, pp. 3074–3082.
[18] S. Hong, T. You, S. Kwak, and B. Han, “Online
tracking by learning discriminative saliency map with
convolutional neural network,” arXiv preprint, 2015.
[19] S. Gidaris and N. Komodakis, “LocNet: Improving lo-
calization accuracy for object detection,” arXiv preprint,
2015.
[20] K. Tang, A. Joulin, L. Li, and F. Li, “Co-localization
in real-world images,” in IEEE Computer Soc. Conf.
Computer Vision and Pattern Recognition (CVPR). IEEE,
2014, pp. 1464–1471.
[21] J. Zhang, S. Sclaroff, Z. Lin, X. Shen, B. Price, and
R. Mech, “Minimum barrier salient object detection at
80 fps,” in IEEE Int. Conf. Computer Vision (ICCV),
2015, pp. 1404–1412.
[22] W. Zhu, S. Liang, Y. Wei, and J. Sun, “Saliency
optimization from robust background detection,” in IEEE
Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR), 2014, pp. 2814–2821.
[23] S. Yin, J. Na, J. Choi, and S. Oh, “Hierarchical Kalman-
particle filter with adaptation to motion changes for
object tracking,” Comput. Vis. Image Underst., vol. 115,
no. 6, pp. 885–900, 2011.
[24] S. Weng, C. Kuo, and S. Tu, “Video object tracking
using adaptive Kalman filter,” J. Vis. Commun. Image R.,
vol. 17, no. 6, pp. 1190–1208, 2006.
[25] G. Welch and G. Bishop, “An introduction to the Kalman filter,” Technical report, University of North Carolina at Chapel Hill, NC, USA, 2006, pp. 1–16.
[26] A. Li, M. Lin, Y. Wu, M. Yang, and S. Yan, “NUS-PRO: A
new visual tracking challenge,” IEEE Trans. Patt. Anal.
Mach. Intell., vol. 38, no. 2, pp. 335–349, 2016.
[27] P. Liang, E. Blasch, and H. Ling, “Encoding color infor-
mation for visual tracking: Algorithms and benchmark,”
IEEE Trans. Image Process., vol. 24, no. 12, pp. 5630–
5644, 2015.
[28] M. Kristan et al., “The visual object tracking VOT2014
challenge results,” in Eur. Conf. Computer Vision (ECCV)
Workshop, 2014, pp. 191–217.

More Related Content

What's hot

CLUSTERING HYPERSPECTRAL DATA
CLUSTERING HYPERSPECTRAL DATACLUSTERING HYPERSPECTRAL DATA
CLUSTERING HYPERSPECTRAL DATA
csandit
 
Deep VO and SLAM IV
Deep VO and SLAM IVDeep VO and SLAM IV
Deep VO and SLAM IV
Yu Huang
 
Unsupervised/Self-supervvised visual object tracking
Unsupervised/Self-supervvised visual object trackingUnsupervised/Self-supervvised visual object tracking
Unsupervised/Self-supervvised visual object tracking
Yu Huang
 
Paper id 26201483
Paper id 26201483Paper id 26201483
Paper id 26201483
IJRAT
 
Deep Learning’s Application in Radar Signal Data II
Deep Learning’s Application in Radar Signal Data IIDeep Learning’s Application in Radar Signal Data II
Deep Learning’s Application in Radar Signal Data II
Yu Huang
 
Pedestrian behavior/intention modeling for autonomous driving V
Pedestrian behavior/intention modeling for autonomous driving VPedestrian behavior/intention modeling for autonomous driving V
Pedestrian behavior/intention modeling for autonomous driving V
Yu Huang
 
I0341042048
I0341042048I0341042048
I0341042048
inventionjournals
 
Robust foreground modelling to segment and detect multiple moving objects in ...
Robust foreground modelling to segment and detect multiple moving objects in ...Robust foreground modelling to segment and detect multiple moving objects in ...
Robust foreground modelling to segment and detect multiple moving objects in ...
IJECEIAES
 
3-d interpretation from single 2-d image for autonomous driving II
3-d interpretation from single 2-d image for autonomous driving II3-d interpretation from single 2-d image for autonomous driving II
3-d interpretation from single 2-d image for autonomous driving II
Yu Huang
 
Iee egold2010 presentazione_finale_veracini
Iee egold2010 presentazione_finale_veraciniIee egold2010 presentazione_finale_veracini
Iee egold2010 presentazione_finale_veracinigrssieee
 
Fusion of Multispectral And Full Polarimetric SAR Images In NSST Domain
Fusion of Multispectral And Full Polarimetric SAR Images In NSST DomainFusion of Multispectral And Full Polarimetric SAR Images In NSST Domain
Fusion of Multispectral And Full Polarimetric SAR Images In NSST Domain
CSCJournals
 
System for Prediction of Non Stationary Time Series based on the Wavelet Radi...
System for Prediction of Non Stationary Time Series based on the Wavelet Radi...System for Prediction of Non Stationary Time Series based on the Wavelet Radi...
System for Prediction of Non Stationary Time Series based on the Wavelet Radi...
IJECEIAES
 
Mj upjs
Mj upjsMj upjs
Multi sensor calibration by deep learning
Multi sensor calibration by deep learningMulti sensor calibration by deep learning
Multi sensor calibration by deep learning
Yu Huang
 
Motion Detection and Clustering Using PCA and NN in Color Image Sequence
Motion Detection and Clustering Using PCA and NN in Color Image SequenceMotion Detection and Clustering Using PCA and NN in Color Image Sequence
Motion Detection and Clustering Using PCA and NN in Color Image Sequence
TELKOMNIKA JOURNAL
 
Latest 2016 IEEE Projects | 2016 Final Year Project Titles - 1 Crore Projects
Latest 2016 IEEE Projects | 2016 Final Year Project Titles - 1 Crore ProjectsLatest 2016 IEEE Projects | 2016 Final Year Project Titles - 1 Crore Projects
Latest 2016 IEEE Projects | 2016 Final Year Project Titles - 1 Crore Projects
1crore projects
 
3-d interpretation from single 2-d image III
3-d interpretation from single 2-d image III3-d interpretation from single 2-d image III
3-d interpretation from single 2-d image III
Yu Huang
 
An adaptive gmm approach to background subtraction for application in real ti...
An adaptive gmm approach to background subtraction for application in real ti...An adaptive gmm approach to background subtraction for application in real ti...
An adaptive gmm approach to background subtraction for application in real ti...
eSAT Publishing House
 

What's hot (18)

CLUSTERING HYPERSPECTRAL DATA
CLUSTERING HYPERSPECTRAL DATACLUSTERING HYPERSPECTRAL DATA
CLUSTERING HYPERSPECTRAL DATA
 
Deep VO and SLAM IV
Deep VO and SLAM IVDeep VO and SLAM IV
Deep VO and SLAM IV
 
Unsupervised/Self-supervvised visual object tracking
Unsupervised/Self-supervvised visual object trackingUnsupervised/Self-supervvised visual object tracking
Unsupervised/Self-supervvised visual object tracking
 
Paper id 26201483
Paper id 26201483Paper id 26201483
Paper id 26201483
 
Deep Learning’s Application in Radar Signal Data II
Deep Learning’s Application in Radar Signal Data IIDeep Learning’s Application in Radar Signal Data II
Deep Learning’s Application in Radar Signal Data II
 
Pedestrian behavior/intention modeling for autonomous driving V
Pedestrian behavior/intention modeling for autonomous driving VPedestrian behavior/intention modeling for autonomous driving V
Pedestrian behavior/intention modeling for autonomous driving V
 
I0341042048
I0341042048I0341042048
I0341042048
 
Robust foreground modelling to segment and detect multiple moving objects in ...
Robust foreground modelling to segment and detect multiple moving objects in ...Robust foreground modelling to segment and detect multiple moving objects in ...
Robust foreground modelling to segment and detect multiple moving objects in ...
 
3-d interpretation from single 2-d image for autonomous driving II
3-d interpretation from single 2-d image for autonomous driving II3-d interpretation from single 2-d image for autonomous driving II
3-d interpretation from single 2-d image for autonomous driving II
 
Iee egold2010 presentazione_finale_veracini
Iee egold2010 presentazione_finale_veraciniIee egold2010 presentazione_finale_veracini
Iee egold2010 presentazione_finale_veracini
 
Fusion of Multispectral And Full Polarimetric SAR Images In NSST Domain
Fusion of Multispectral And Full Polarimetric SAR Images In NSST DomainFusion of Multispectral And Full Polarimetric SAR Images In NSST Domain
Fusion of Multispectral And Full Polarimetric SAR Images In NSST Domain
 
System for Prediction of Non Stationary Time Series based on the Wavelet Radi...
System for Prediction of Non Stationary Time Series based on the Wavelet Radi...System for Prediction of Non Stationary Time Series based on the Wavelet Radi...
System for Prediction of Non Stationary Time Series based on the Wavelet Radi...
 
Mj upjs
Mj upjsMj upjs
Mj upjs
 
Multi sensor calibration by deep learning
Multi sensor calibration by deep learningMulti sensor calibration by deep learning
Multi sensor calibration by deep learning
 
Motion Detection and Clustering Using PCA and NN in Color Image Sequence
Motion Detection and Clustering Using PCA and NN in Color Image SequenceMotion Detection and Clustering Using PCA and NN in Color Image Sequence
Motion Detection and Clustering Using PCA and NN in Color Image Sequence
 
Latest 2016 IEEE Projects | 2016 Final Year Project Titles - 1 Crore Projects
Latest 2016 IEEE Projects | 2016 Final Year Project Titles - 1 Crore ProjectsLatest 2016 IEEE Projects | 2016 Final Year Project Titles - 1 Crore Projects
Latest 2016 IEEE Projects | 2016 Final Year Project Titles - 1 Crore Projects
 
3-d interpretation from single 2-d image III
3-d interpretation from single 2-d image III3-d interpretation from single 2-d image III
3-d interpretation from single 2-d image III
 
An adaptive gmm approach to background subtraction for application in real ti...
An adaptive gmm approach to background subtraction for application in real ti...An adaptive gmm approach to background subtraction for application in real ti...
An adaptive gmm approach to background subtraction for application in real ti...
 

Similar to real-time-object

International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 
AN EFFICIENT IMPLEMENTATION OF TRACKING USING KALMAN FILTER FOR UNDERWATER RO...
AN EFFICIENT IMPLEMENTATION OF TRACKING USING KALMAN FILTER FOR UNDERWATER RO...AN EFFICIENT IMPLEMENTATION OF TRACKING USING KALMAN FILTER FOR UNDERWATER RO...
AN EFFICIENT IMPLEMENTATION OF TRACKING USING KALMAN FILTER FOR UNDERWATER RO...
IJCSEIT Journal
 
I0343065072
I0343065072I0343065072
I0343065072
ijceronline
 
F1063337
F1063337F1063337
F1063337
IJERD Editor
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
IJERD Editor
 
D018112429
D018112429D018112429
D018112429
IOSR Journals
 
Adaptive and online one class support vector machine-based outlier detection
Adaptive and online one class support vector machine-based outlier detectionAdaptive and online one class support vector machine-based outlier detection
Adaptive and online one class support vector machine-based outlier detection
Nguyen Duong
 
A New Algorithm for Tracking Objects in Videos of Cluttered Scenes
A New Algorithm for Tracking Objects in Videos of Cluttered ScenesA New Algorithm for Tracking Objects in Videos of Cluttered Scenes
A New Algorithm for Tracking Objects in Videos of Cluttered Scenes
Zac Darcy
 
Object tracking with SURF: ARM-Based platform Implementation
Object tracking with SURF: ARM-Based platform ImplementationObject tracking with SURF: ARM-Based platform Implementation
Object tracking with SURF: ARM-Based platform Implementation
Editor IJCATR
 
G04743943
G04743943G04743943
G04743943
IOSR-JEN
 
E0333021025
E0333021025E0333021025
E0333021025
theijes
 
Motion Human Detection & Tracking Based On Background Subtraction
Motion Human Detection & Tracking Based On Background SubtractionMotion Human Detection & Tracking Based On Background Subtraction
Motion Human Detection & Tracking Based On Background Subtraction
International Journal of Engineering Inventions www.ijeijournal.com
 
A Novel Approach for Moving Object Detection from Dynamic Background
A Novel Approach for Moving Object Detection from Dynamic BackgroundA Novel Approach for Moving Object Detection from Dynamic Background
A Novel Approach for Moving Object Detection from Dynamic Background
IJERA Editor
 
Object Distance Detection using a Joint Transform Correlator
Object Distance Detection using a Joint Transform CorrelatorObject Distance Detection using a Joint Transform Correlator
Object Distance Detection using a Joint Transform CorrelatorAlexander Layton
 
A STOCHASTIC STATISTICAL APPROACH FOR TRACKING HUMAN ACTIVITY
A STOCHASTIC STATISTICAL APPROACH FOR TRACKING HUMAN ACTIVITYA STOCHASTIC STATISTICAL APPROACH FOR TRACKING HUMAN ACTIVITY
A STOCHASTIC STATISTICAL APPROACH FOR TRACKING HUMAN ACTIVITY
Zac Darcy
 
Hybrid Quantum Convolutional Neural Network for Tuberculosis Prediction Using...
Hybrid Quantum Convolutional Neural Network for Tuberculosis Prediction Using...Hybrid Quantum Convolutional Neural Network for Tuberculosis Prediction Using...
Hybrid Quantum Convolutional Neural Network for Tuberculosis Prediction Using...
IRJET Journal
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
inventionjournals
 
Distance Metric Based Multi-Attribute Seismic Facies Classification to Identi...
Distance Metric Based Multi-Attribute Seismic Facies Classification to Identi...Distance Metric Based Multi-Attribute Seismic Facies Classification to Identi...
Distance Metric Based Multi-Attribute Seismic Facies Classification to Identi...
Pioneer Natural Resources
 

Similar to real-time-object (20)

International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
AN EFFICIENT IMPLEMENTATION OF TRACKING USING KALMAN FILTER FOR UNDERWATER RO...
AN EFFICIENT IMPLEMENTATION OF TRACKING USING KALMAN FILTER FOR UNDERWATER RO...AN EFFICIENT IMPLEMENTATION OF TRACKING USING KALMAN FILTER FOR UNDERWATER RO...
AN EFFICIENT IMPLEMENTATION OF TRACKING USING KALMAN FILTER FOR UNDERWATER RO...
 
I0343065072
I0343065072I0343065072
I0343065072
 
F1063337
F1063337F1063337
F1063337
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
D018112429
D018112429D018112429
D018112429
 
Ijctt v7 p104
Ijctt v7 p104Ijctt v7 p104
Ijctt v7 p104
 
Adaptive and online one class support vector machine-based outlier detection
Adaptive and online one class support vector machine-based outlier detectionAdaptive and online one class support vector machine-based outlier detection
Adaptive and online one class support vector machine-based outlier detection
 
A New Algorithm for Tracking Objects in Videos of Cluttered Scenes
A New Algorithm for Tracking Objects in Videos of Cluttered ScenesA New Algorithm for Tracking Objects in Videos of Cluttered Scenes
A New Algorithm for Tracking Objects in Videos of Cluttered Scenes
 
Object tracking with SURF: ARM-Based platform Implementation
Object tracking with SURF: ARM-Based platform ImplementationObject tracking with SURF: ARM-Based platform Implementation
Object tracking with SURF: ARM-Based platform Implementation
 
G04743943
G04743943G04743943
G04743943
 
E0333021025
E0333021025E0333021025
E0333021025
 
Motion Human Detection & Tracking Based On Background Subtraction
Motion Human Detection & Tracking Based On Background SubtractionMotion Human Detection & Tracking Based On Background Subtraction
Motion Human Detection & Tracking Based On Background Subtraction
 
A Novel Approach for Moving Object Detection from Dynamic Background
A Novel Approach for Moving Object Detection from Dynamic BackgroundA Novel Approach for Moving Object Detection from Dynamic Background
A Novel Approach for Moving Object Detection from Dynamic Background
 
Object Distance Detection using a Joint Transform Correlator
Object Distance Detection using a Joint Transform CorrelatorObject Distance Detection using a Joint Transform Correlator
Object Distance Detection using a Joint Transform Correlator
 
A STOCHASTIC STATISTICAL APPROACH FOR TRACKING HUMAN ACTIVITY
A STOCHASTIC STATISTICAL APPROACH FOR TRACKING HUMAN ACTIVITYA STOCHASTIC STATISTICAL APPROACH FOR TRACKING HUMAN ACTIVITY
A STOCHASTIC STATISTICAL APPROACH FOR TRACKING HUMAN ACTIVITY
 
Hybrid Quantum Convolutional Neural Network for Tuberculosis Prediction Using...
Hybrid Quantum Convolutional Neural Network for Tuberculosis Prediction Using...Hybrid Quantum Convolutional Neural Network for Tuberculosis Prediction Using...
Hybrid Quantum Convolutional Neural Network for Tuberculosis Prediction Using...
 
Space Tug Rendezvous
Space Tug RendezvousSpace Tug Rendezvous
Space Tug Rendezvous
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
Distance Metric Based Multi-Attribute Seismic Facies Classification to Identi...
Distance Metric Based Multi-Attribute Seismic Facies Classification to Identi...Distance Metric Based Multi-Attribute Seismic Facies Classification to Identi...
Distance Metric Based Multi-Attribute Seismic Facies Classification to Identi...
 

real-time-object

  • 1. 1 Real-Time Object Localization and Tracking from Image Sequences Yuanwei Wu, Yao Sui, Arjan Gupta and Guanghui Wang Abstract—To address the problem of autonomous sense and avoidance for unmanned aerial vehicle navigation via vision- based method, in this letter, we propose a real-time object localization and tracking strategy from monocular image se- quences. The proposed approach effectively integrates the tech- niques of object detection, localization, and tracking into a dynamic model. At the detection stage, the object of interest is automatically detected and localized from a saliency map computed via connectivity cue of the frame; at the tracking stage, a Kalman filter is employed to provide a coarse prediction of the object position and size, which is further refined via a local detector using image boundary connectivity cue and context information between consecutive frames. Compared to existing methods, the proposed technique does not require any manual initialization, runs much faster than the state-of-the-art trackers of its kind, and achieves comparative tracking performance. Extensive comparative experiments demonstrate the effectiveness and better performance of the proposed approach. Index Terms—Salient object detection; visual tracking; Kalman filter; object localization; real-time tracking; I. INTRODUCTION VISUAL object tracking has played important roles in many computer vision applications, such as human- computer interaction, surveillance, and video understanding [1]. Due to emerging real-world applications, like deliver- ing packages using small unmanned aerial vehicles (UAVs) [2], there is a huge demand for vision-based autonomous navigation for UAVs. First of all, the vision-based methods are robust to electromagnetic interference compared to con- ventional sensor-based method, e.g. global positional system (GPS) [3]. Second, vision-based methods are needed due to strict small size and insufficient power supply of UAVs. Based on this background, in this letter, we address autonomous sense and avoidance of obstacles for UAVs during flight via the integration of object detection and tracking. The tracking-by-detection methods have become increas- ingly popular for real-time applications [4] in visual tracking. The correlation filter-based trackers attract more attention in recent years due to its high speed performance [5]. However, those conventional tracking methods [4, 6–13] require manual initialization with the ground truth at the first frame. Moreover, they are sensitive to the initialization variation caused by scales This work is partly supported by the National Aeronautics and Space Ad- ministration (NASA) LEARN II program under grant number NNX15AN94N. The authors are with the Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, KS 66045 USA. The source code and dataset will be available on the authors home- page http://www.ittc.ku.edu/∼ghwang/. (email: wuyuanwei2010@gmail.com, suiyao@gmail.com, arjangupta@ku.edu, ghwang@ku.edu) and position errors, and would return useless information once failed during tracking [14]. Combining a detector with a tracker is a feasible solution for automatic initialization [15]. The detector, however, needs to be trained with large amount of training samples, while the prior information about the object of interest is usually not available in advance. In [16], Mahadevan et al. 
proposed a saliency-based discriminative tracker with automatic initial- ization, which builds the motion saliency map using optical flow. This technique, however, is computational intensive and not suitable for real-time applications. Some recent techniques on salient object detection and visual tracking [17, 18] have achieved superior performance by using deep learning. However, these methods need large amount of samples for training. The methods of object co- localization in videos [19, 20] are originally designed to handle objects of the same class across a set of distinct images or videos, while for target tracking, we typically focus on a salient object in a video sequence. Several recent approaches exploit boundary connectivity [21, 22] for natural images, which have been shown to be effective for salient object detection. Since the saliency map ef- fectively discovers the spatial information of target , it enables us to improve the target localization accuracy. Inspired by the salient object detection approach [21], which achieves high detection speed on individual images, we develop an efficient method by integrating two complementary processes: salient object detection and tracking. A Kalman filter is employed to predict a coarse location of the target object, and the detector is used to refine the solution. In summary, our contributions are threefold: 1) The pro- posed algorithm integrates saliency map into a dynamic model and adopts the target-specific saliency map as the observation for tracking; 2) We develop a tracker with automatic initializa- tion for real-world applications and 3) the proposed technique achieves better performance than state-of-the-art competing trackers from extensive real experiments. II. THE PROPOSED APPROACH The proposed fast object localization and tracking (FOLT) algorithm can automatically and quickly localize the salient object in the scene and track it across the sequence. In this letter, the object of interest is the salient object in the view, so the tracking problem is formulated as an unsupervised salient object detection, which can be automatically obtained from the saliency map computed from the frame [21]. In the following, we will present a detailed elaboration of the approach.
  • 2. 2 Fig. 1: A flow-chart of the proposed approach. A. Overview of the proposed approach In most tracking scenarios, the linear Gaussian motion model has been demonstrated to be an effective representation for the motion behavior of salient object in natural image sequences [23, 24]. Therefore, an optimal estimator, Kalman filter [25], has been used to estimate the motion attributes, e.g. the velocity, position and scale of the object. A flow chart of the proposed approach is shown in Fig. 1, the bounding box of the object is initialized from the saliency map of the entire image [21]. A dynamic model is established to predict the object position and size at the next frame. Under the constrain of natural motion, this predicted bounding box provides the tracking algorithm a coarse solution, which is not far away from the ground truth [23]. Thus, a reasonable search region can be automatically attained by expanding the predicted object window with a fixed percentage. Then, the location and size of the object is refined by computing the saliency within the search region. Next, the refined bounding box, as a new observation, is fed to the Kalman filter to update the dynamic model in the correction phase. Through this process, the object in the image sequence is automatically detected and tracked relying on recursively prediction, observation, and correction. B. Motion model In the dynamic model, the salient object in a frame is defined by a motion state variable S with six variables S = {x, y, u, v, w, h}, where (x, y) denotes the center coordinates, (u, v) denotes the velocities, and (w, h) denotes the width and height of the minimum bounding box. In the t-th frame, the predicted state ˆS− t is evolved from the prior state ˆSt−1 in frame t−1 given knowledge of the process prior to time t−1 according to the following linear stochastic equation ˆS− t = F ˆSt−1 + wt−1, (1) where the variable wt−1 represents the additive, white Gaus- sian noise with zero mean and known covariance, and F denotes the state transition matrix. We use the notation St ∼ N(µ, Σ) to denote that state St is a random variable with a normal probability distribution with mean µ and covariance Σ in frame t. The covariance is a diagonal matrix, which is composed by the variances of x, y, u, v, w, and h, respectively. Let us assume that zt encodes the positions and dimensions of the minimum bounding box of the observation in frame t. The observation zt is the output of the fast salient object detector, Fig. 2: Illustration of updating the search region ROI using (a) raster scanning and (b) inverse-raster scanning. which is represented by zt = {x, y, w, h}. The posterior state of the object in frame t given observation zt is finally updated by incorporating the observation and the dynamic model via St = ˆS− t + Kt(zt − H ˆS− t ), (2) where Kt denotes the Kalman gain in frame t with the leverage of obtaining a posterior state estimation St. The estimation with minimum mean-square error is obtained by weighting the difference between the prediction and observation. C. Salient object detection It has been shown that the cue of image boundary connectiv- ity is effective for salient object detection [21, 22]. In natural images, it is safe to assume that the object regions are much less connected to the image boundaries. In this letter, the salient object detection is formulated as finding the shortest path from pixel wij to the seed set B from the image boundary, considering all possible paths in the image. 
Each pixel in the 2D digital image I is denoted as a vertex. The neighboring pixels are connected by edges. In this work, we consider 4-adjacent neighbors, e.g. the neighbors of wij are wi−1,j, wi+1,j, wi,j−1, and wi,j+1, as shown in Fig. 2. The path p = v(0), v(1), · · · , v(k) on image I denotes a sequence of consecutive neighboring pixels. Given a loss func- tion L(p), the problem of finding the salient object in the frame t is defined as It(wij) = arg minp∈PB,wij L(p), where PB,wij denotes all possible paths connecting the seed set B and the pixel wij in image It. Similar to the work in [21], we formulate the loss function at the frame t as LIt (p) = maxn j=0(p(i)) − minn j=0(p(j)), where LIt (p) calculates the pixel intensity difference between the maximum and the minimum values among all possible paths. Let E(wij, v) denotes the edge connecting the vertex wij and v, Q(wij) denotes the current path connecting the pixel wij with the image boundary set B. We define CIt (Q(wij), E(wij, v)) as the cost of a new path connecting the vertex v to the image boundary set B by adding the edge E to Q(wij). CIt (Q(wij), E(wij, v)) can be calcu- lated from CIt (Q(wij), E(wij, v)) = max{U(wij), It(v)} − min{L(wij), It(v)}, where U(wij) and L(wij) denote the maximum and the minimum pixel intensity values on the path Q(wij). A raster scanning method [21] could be used to calculate the cost CIt (Q(wij), E(wij, v)). The details will be discussed in sect. II-D.
  • 3. 3 Algorithm 1: Fast Object Localization Tracking (FOLT) Input: image It+1, saliency map Dt, search region ROIt, number of passes N Output: saliency map Dt+1 Auxiliaries: Ut+1, Lt+1 Inside the search region ROI, set Dt to ∞ Outside the search region, keep the values Dt Set Lt+1 ← It+1 and Ut+1 ← It+1 for each frame do Prediction using Eq. (1) Observation as following: for i = 1 : N do if mod(i, 2) = 1 then Raster Scanning using Eq. (3), (4), (5) end else Inverse-Raster Scanning using Eq.(3), (4), (5) end end Correction using Eq. (2) Update the complete Dt every ten frames end D. Fast object localization tracking In [21], Zhang et al. provided a solution for individual im- ages using the minimum barrier distance detection method. In order to improve the accuracy and speed in image sequences, we explore the integration of the image boundary connec- tivity cue with the temporal context information between consecutive frames. Therefore, we propose a fast salient object detection and tracking framework as shown in Fig. 1. During the observation stage, two fast scanning procedures, raster scanning and inverse-raster scanning, are implemented to find the location of the salient object between two consecutive frames. As shown in Fig. 2, the inner window of the target object is coarsely predicted using the dynamic model. The search region is obtained by expanding the inner window with a fixed percentage. The raster scanning and inverse-raster scanning are used to update the pixel values in the search region of image It. In the proposed approach, the search region is dynamically determined based on the predicted position of the salient object. As shown in Fig. 2 (a), the raster scanning is used to update all the intensities from the top-left pixel to the bottom-right pixel, which simultaneously updates two adjacent neighbors wi,j−1 and wi−1,j. Similarly, in the inverse raster scanning, the intensities of the two adjacent neighbors wi+1,j and wi,j+1 in the search region are reversely updated, as shown in Fig.2 (b). The values outside of the search region are not updated since they have less contribution to the detection. As a trade-off between the accuracy and efficiency, a complete saliency map of the entire image is updated every ten frames. The updating strategy in the search region is given by It(wij) ← min(It(wij), Qwij (v)) (3) U(wij) ← max(U(v), It(wij)) (4) L(wij) ← min(L(v), It(wij)) (5) Fig. 3: Tracking results in representative frames of the pro- posed and the 7 competing trackers on three challenging sequences. First row: illumination variation (Skyjumping ce); Second row: in-plane and out-of-plane rotations (big 2); Third row: scale variation (motorcycle 006). (best viewed in color) The implementation details of the above detection and tracking algorithm is described in Algorithm 1, where the algorithm is initialized based on the detection result of the first frame, and the saliency map of the last frame t is fed to the algorithm. III. EXPERIMENTAL EVALUATIONS The proposed approach is implemented in C++ with OpenCV 3.0.0 on a PC with an Intel Xeon W3250 2.67 GHz CPU and 8 GB RAM. The datasets and source code of the proposed approach will be available on the authors homepage. The proposed tracker is evaluated on 15 popular video sequences selected from [14, 26–28] regarding the salient object in the field of view. In each frame of these video sequences, the target is labeled manually in a bounding box, which is used as the ground truth in the quantitative evaluations. 
In our implementation, input images are first resized so that the maximum dimension is 300 pixels. Three experiments are designed to evaluate trackers as discussed in [14]: one pass evaluation (OPE), temporal robustness evaluation (TRE), and spatial robustness evaluation. For TRE, we randomly select the starting frame and run a tracker to the end of the sequence. Spatial robustness evaluation initializes the bounding box in the first frame by shifting or scaling. As discussed in Section II, the proposed method manages to automatically initialize the tracker and is not sensitive to spatial fluctuation. Therefore, we use the same temporal randomization as in [14], and refer readers to [14] for more details. A. Speed performance In the detection stage, for individual images, the most up- to-date fast detector MB+ [21] attains a speed of 49 frame-per- second (fps), in contrast, the proposed method achieves a speed of 149 fps and accurate performance on image sequences, which is three times faster than MB+. The average speed comparison of the proposed and the seven state-of-the-art competing trackers is provided in Table I. The average speed of our tracker is 141 fps, which is at the same level as the fastest tracker KCF [11], however, KCF adopts a fixed tracking box,
A. Speed performance

In the detection stage, for individual images, the most recent fast detector MB+ [21] attains a speed of 49 frames per second (fps); in contrast, the proposed method achieves 149 fps with accurate performance on image sequences, about three times faster than MB+. The average speed comparison of the proposed and the seven state-of-the-art competing trackers is provided in Table I. The average speed of our tracker is 141 fps, which is at the same level as the fastest tracker, KCF [11]; however, KCF adopts a fixed tracking box, which cannot reflect the scale changes of the object. On average, our method is more than ten times faster than CT [4] and SAMF [8], five times faster than DSST [9] and CCT [10], and about two times faster than STC [6] and CN [7].

TABLE I: Quantitative evaluations of the proposed and the 7 competing trackers on the 15 sequences. The best and second best results are highlighted in bold-face and underlined fonts, respectively.

                            Ours    CT [4]   STC [6]   CN [7]   SAMF [8]   DSST [9]   CCT [10]   KCF [11]
    Precision of TRE        0.79    0.51     0.59      0.64     0.65       0.65       0.66       0.60
    Success rate of TRE     0.61    0.45     0.46      0.54     0.58       0.56       0.57       0.52
    Precision of OPE        0.83    0.44     0.48      0.44     0.59       0.48       0.66       0.48
    Success rate of OPE     0.66    0.34     0.41      0.42     0.52       0.44       0.53       0.38
    CLE (in pixels)         14.5    74.4     38.0      55.0     40.8       55.7       23.2       45.6
    Average speed (in fps)  141.3   12.0     73.6      87.1     12.9       20.8       21.3       144.8

B. Comparison with the state-of-the-art trackers

The performance of our approach is quantitatively validated following the metrics used in [14]. We report the precision, the center location error (CLE), and the success rate (SR). The CLE is defined as the Euclidean distance between the centers of the tracking and ground-truth bounding boxes. The precision is the percentage of frames in which the CLE is smaller than a threshold; following [14], a threshold of 20 pixels is used in our evaluations. A tracking result in a frame is considered successful if the overlap rate |a_t ∩ a_g| / |a_t ∪ a_g| > θ for a threshold θ ∈ [0, 1], where a_t and a_g denote the bounding boxes of the tracking result and the ground truth, respectively, and |·| denotes the area of a region. The SR is then defined as the percentage of frames in which the overlap rate is greater than θ; normally, θ is set to 0.5.

We evaluate the proposed method against seven state-of-the-art trackers: CT, STC, CN, SAMF, DSST, CCT, and KCF. The comparison results on the 15 sequences are shown in Table I, where we report the average precision, success rate, and CLE over all sequences under the one-pass and temporal robustness evaluations. As shown in the table, the proposed method outperforms all seven competing trackers. It is evident that, in the one-pass evaluation, the proposed tracker obtains the best performance in CLE (14.5 pixels) and precision (0.83), which are 8.7 pixels and 17% better than the second best tracker, CCT (23.2 pixels in CLE and 0.66 in precision). Meanwhile, in the success rate, the proposed tracker achieves the best result, a 13% improvement over the second best tracker, SAMF. Please note that, for the 7 competing trackers, the average performance in TRE is higher than that in OPE, while for the proposed tracker the precision and success scores in TRE are lower than those in OPE. This is because the proposed tracker tends to perform well in longer sequences, whereas the 7 competing trackers work well in shorter sequences [14].

In addition, Fig. 4 plots the precision and success rate curves under the one-pass evaluation and the temporal robustness evaluation over all 15 sequences. In both evaluations, according to both the precision and the success rate, our approach significantly outperforms the seven competing trackers.

Fig. 4: Precision and success rate plots over the 15 sequences in (top) one-pass evaluation (OPE) and (bottom) temporal robustness evaluation (TRE). (Best viewed in color.)

In summary, the precision results demonstrate that our approach is more robust than its counterparts in the experiments, and the success rate results show that our method estimates the scale changes of the target more accurately.
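For reference, the following sketch (our own illustration, not the authors' evaluation code) shows how the three metrics defined above can be computed from per-frame bounding boxes: the center location error, the precision at a 20-pixel threshold, and the success rate at an overlap threshold of 0.5. The struct and function names are hypothetical.

    #include <opencv2/core.hpp>
    #include <cmath>
    #include <vector>

    // Per-sequence averages of the three metrics (hypothetical helper).
    struct TrackingMetrics { double cle, precision, successRate; };

    // tracked and groundTruth are assumed to hold one box per frame, aligned by index.
    TrackingMetrics evaluate(const std::vector<cv::Rect2d>& tracked,
                             const std::vector<cv::Rect2d>& groundTruth,
                             double cleThresh = 20.0, double overlapThresh = 0.5)
    {
        const int n = static_cast<int>(tracked.size());
        if (n == 0) return {0.0, 0.0, 0.0};
        double cleSum = 0.0;
        int precOk = 0, succOk = 0;
        for (int i = 0; i < n; ++i) {
            const cv::Rect2d& t = tracked[i];
            const cv::Rect2d& g = groundTruth[i];
            // Center location error: Euclidean distance between box centers.
            const double dxc = (t.x + t.width  / 2.0) - (g.x + g.width  / 2.0);
            const double dyc = (t.y + t.height / 2.0) - (g.y + g.height / 2.0);
            const double cle = std::sqrt(dxc * dxc + dyc * dyc);
            cleSum += cle;
            if (cle < cleThresh) ++precOk;
            // Overlap rate: area of intersection over area of union.
            const double inter = (t & g).area();
            const double uni   = t.area() + g.area() - inter;
            if (uni > 0.0 && inter / uni > overlapThresh) ++succOk;
        }
        return { cleSum / n,
                 static_cast<double>(precOk) / n,
                 static_cast<double>(succOk) / n };
    }

Sweeping the two thresholds over a range of values, as in [14], yields the precision and success plots of Fig. 4.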
C. Qualitative evaluation

In this section, we present qualitative comparisons of our approach with the 7 competing trackers. Fig. 3 (first row) illustrates a sequence with significant illumination variations as well as gradual out-of-plane rotations. Both CT and STC deal with the illumination changes very well, but fail in the presence of pose variations and out-of-plane rotations, as shown in frames #365 and #666. In contrast, our tracker accurately estimates both the scale and the position of the target. Fig. 3 (second row) shows the results on a sequence with significant in-plane and out-of-plane rotations, where our approach obtains the best performance. On this sequence, our approach temporarily tracks only part of the target due to an out-of-plane rotation, but it accurately reacquires the target in the following frames, as shown in frames #319 and #369. Fig. 3 (third row) illustrates the results on a sequence with large scale variations. Although STC, SAMF, DSST, and CCT are designed to handle scale changes, they fail in this sequence, as shown in frames #110, #145, and #170: the competing trackers cannot cope with the significant appearance changes caused by the rotating motion and fast scale variations. In contrast, our tracker is robust to large and fast scale variations.

IV. CONCLUSIONS

In this paper, we have proposed an effective and efficient approach for real-time visual object localization and tracking, which can be applied to UAV navigation tasks such as obstacle sense and avoidance. Our method integrates a fast salient object detector within a Kalman filtering framework. Compared to the state-of-the-art trackers, our approach not only initializes automatically, but also runs at the fastest speed and achieves better tracking performance than the competing trackers.
REFERENCES

[1] A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,” ACM Comput. Surv., vol. 38, no. 4, pp. 1–45, 2006.
[2] Amazon, “Amazon Prime Air,” https://www.youtube.com/watch?v=98BIu9dpwHU, 2013.
[3] M. Fraiwan, A. Alsaleem, H. Abandeh, and O. Aljarrah, “Obstacle avoidance and navigation in robotic systems: A land and aerial robots study,” in 5th Int. Conf. Inf. Commun. Systems (ICICS). IEEE, 2014, pp. 1–5.
[4] K. Zhang, L. Zhang, and M. Yang, “Real-time compressive tracking,” in Eur. Conf. Computer Vision (ECCV). Springer, 2012, pp. 864–877.
[5] Z. Chen, Z. Hong, and D. Tao, “An experimental survey on correlation filter-based tracking,” arXiv preprint, pp. 1–13, 2015.
[6] K. Zhang, L. Zhang, Q. Liu, D. Zhang, and M. Yang, “Fast visual tracking via dense spatio-temporal context learning,” in Eur. Conf. Computer Vision (ECCV). Springer, 2014, pp. 127–141.
[7] M. Danelljan, F. Khan, M. Felsberg, and J. Weijer, “Adaptive color attributes for real-time visual tracking,” in IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1090–1097.
[8] Y. Li and J. Zhu, “A scale adaptive kernel correlation filter tracker with feature integration,” in Eur. Conf. Computer Vision (ECCV) Workshops. Springer, 2014, pp. 254–265.
[9] M. Danelljan, G. Häger, F. Khan, and M. Felsberg, “Accurate scale estimation for robust visual tracking,” in Proc. Brit. Mach. Vis. Conf. (BMVC), 2014, pp. 1–11.
[10] G. Zhu, J. Wang, Y. Wu, and H. Lu, “Collaborative correlation tracking,” in Proc. Brit. Mach. Vis. Conf. (BMVC), 2015, pp. 1–12.
[11] J. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed tracking with kernelized correlation filters,” IEEE Trans. Patt. Anal. Mach. Intell., vol. 37, no. 3, pp. 583–596, 2015.
[12] Y. Sui, Z. Zhang, G. Wang, Y. Tang, and L. Zhang, “Real-time visual tracking: Promoting the robustness of correlation filter learning,” in Eur. Conf. Computer Vision (ECCV). Springer, 2016.
[13] Y. Sui and L. Zhang, “Visual tracking via locally structured Gaussian process regression,” IEEE Signal Process. Lett., vol. 22, no. 9, pp. 1331–1335, 2015.
[14] Y. Wu, J. Lim, and M. Yang, “Online object tracking: A benchmark,” in IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR), 2013, pp. 2411–2418.
[15] M. Andriluka, S. Roth, and B. Schiele, “People-tracking-by-detection and people-detection-by-tracking,” in IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR). IEEE, 2008, pp. 1–8.
[16] V. Mahadevan and N. Vasconcelos, “Saliency-based discriminant tracking,” in IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR). IEEE, 2009, pp. 1007–1013.
[17] C. Ma, J. Huang, X. Yang, and M. Yang, “Hierarchical convolutional features for visual tracking,” in IEEE Int. Conf. Computer Vision (ICCV), 2015, pp. 3074–3082.
[18] S. Hong, T. You, S. Kwak, and B. Han, “Online tracking by learning discriminative saliency map with convolutional neural network,” arXiv preprint, 2015.
[19] S. Gidaris and N. Komodakis, “LocNet: Improving localization accuracy for object detection,” arXiv preprint, 2015.
[20] K. Tang, A. Joulin, L. Li, and F. Li, “Co-localization in real-world images,” in IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR). IEEE, 2014, pp. 1464–1471.
[21] J. Zhang, S. Sclaroff, Z. Lin, X. Shen, B. Price, and R. Mech, “Minimum barrier salient object detection at 80 fps,” in IEEE Int. Conf. Computer Vision (ICCV), 2015, pp. 1404–1412.
[22] W. Zhu, S. Liang, Y. Wei, and J. Sun, “Saliency optimization from robust background detection,” in IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR), 2014, pp. 2814–2821.
[23] S. Yin, J. Na, J. Choi, and S. Oh, “Hierarchical Kalman-particle filter with adaptation to motion changes for object tracking,” Comput. Vis. Image Underst., vol. 115, no. 6, pp. 885–900, 2011.
[24] S. Weng, C. Kuo, and S. Tu, “Video object tracking using adaptive Kalman filter,” J. Vis. Commun. Image R., vol. 17, no. 6, pp. 1190–1208, 2006.
[25] G. Welch and G. Bishop, “An introduction to the Kalman filter,” University of North Carolina at Chapel Hill, NC, USA, Tech. Rep., 2006, pp. 1–16.
[26] A. Li, M. Lin, Y. Wu, M. Yang, and S. Yan, “NUS-PRO: A new visual tracking challenge,” IEEE Trans. Patt. Anal. Mach. Intell., vol. 38, no. 2, pp. 335–349, 2016.
[27] P. Liang, E. Blasch, and H. Ling, “Encoding color information for visual tracking: Algorithms and benchmark,” IEEE Trans. Image Process., vol. 24, no. 12, pp. 5630–5644, 2015.
[28] M. Kristan et al., “The visual object tracking VOT2014 challenge results,” in Eur. Conf. Computer Vision (ECCV) Workshops, 2014, pp. 191–217.