Real-Time Object Localization and Tracking from
Image Sequences
Yuanwei Wu, Yao Sui, Arjan Gupta and Guanghui Wang
Abstract—To address the problem of autonomous sense and
avoidance for unmanned aerial vehicle navigation via a vision-based method, in this letter, we propose a real-time object
localization and tracking strategy from monocular image se-
quences. The proposed approach effectively integrates the tech-
niques of object detection, localization, and tracking into a
dynamic model. At the detection stage, the object of interest
is automatically detected and localized from a saliency map
computed via the boundary connectivity cue of the frame; at the tracking
stage, a Kalman filter is employed to provide a coarse prediction
of the object position and size, which is further refined via a
local detector using the image boundary connectivity cue and context
information between consecutive frames. Compared to existing
methods, the proposed technique does not require any manual
initialization, runs much faster than the state-of-the-art trackers
of its kind, and achieves comparable tracking performance.
Extensive comparative experiments demonstrate the effectiveness
and better performance of the proposed approach.
Index Terms—Salient object detection; visual tracking;
Kalman filter; object localization; real-time tracking;
I. INTRODUCTION
VISUAL object tracking has played important roles in
many computer vision applications, such as human-
computer interaction, surveillance, and video understanding
[1]. Due to emerging real-world applications, like deliver-
ing packages using small unmanned aerial vehicles (UAVs)
[2], there is a huge demand for vision-based autonomous
navigation for UAVs. First of all, the vision-based methods
are robust to electromagnetic interference compared to conventional sensor-based methods, e.g., the global positioning system (GPS) [3]. Second, vision-based methods are needed due to the strict size and limited power constraints of UAVs. Based
on this background, in this letter, we address autonomous sense
and avoidance of obstacles for UAVs during flight via the
integration of object detection and tracking.
Tracking-by-detection methods have become increasingly popular for real-time visual tracking applications [4]. Correlation filter-based trackers have attracted increasing attention in recent years due to their high-speed performance [5]. However,
those conventional tracking methods [4, 6–13] require manual
initialization with the ground truth at the first frame. Moreover,
they are sensitive to initialization variations caused by scale and position errors, and would return useless information once tracking fails [14].

This work is partly supported by the National Aeronautics and Space Administration (NASA) LEARN II program under grant number NNX15AN94N. The authors are with the Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, KS 66045 USA. The source code and dataset will be available on the authors' homepage http://www.ittc.ku.edu/~ghwang/. (email: wuyuanwei2010@gmail.com, suiyao@gmail.com, arjangupta@ku.edu, ghwang@ku.edu)
Combining a detector with a tracker is a feasible solution
for automatic initialization [15]. The detector, however, needs
to be trained with a large amount of training samples, while
the prior information about the object of interest is usually
not available in advance. In [16], Mahadevan et al. proposed
a saliency-based discriminative tracker with automatic initial-
ization, which builds the motion saliency map using optical
flow. This technique, however, is computationally intensive and
not suitable for real-time applications.
Some recent techniques on salient object detection and
visual tracking [17, 18] have achieved superior performance
by using deep learning. However, these methods need a large amount of samples for training. The methods of object co-
localization in videos [19, 20] are originally designed to handle
objects of the same class across a set of distinct images or
videos, while for target tracking, we typically focus on a
salient object in a video sequence.
Several recent approaches exploit boundary connectivity
[21, 22] for natural images, which have been shown to be
effective for salient object detection. Since the saliency map effectively discovers the spatial information of the target, it enables
us to improve the target localization accuracy. Inspired by the
salient object detection approach [21], which achieves high
detection speed on individual images, we develop an efficient
method by integrating two complementary processes: salient
object detection and tracking. A Kalman filter is employed to
predict a coarse location of the target object, and the detector
is used to refine the solution.
In summary, our contributions are threefold: 1) the proposed algorithm integrates the saliency map into a dynamic model and adopts the target-specific saliency map as the observation for tracking; 2) we develop a tracker with automatic initialization for real-world applications; and 3) the proposed technique achieves better performance than state-of-the-art competing trackers in extensive real experiments.
II. THE PROPOSED APPROACH
The proposed fast object localization and tracking (FOLT)
algorithm can automatically and quickly localize the salient
object in the scene and track it across the sequence. In this
letter, the object of interest is the salient object in the view, so
the tracking problem is formulated as unsupervised salient object detection, where the object can be automatically obtained from the saliency map computed from each frame [21]. In the following,
we will present a detailed elaboration of the approach.
Fig. 1: A flow-chart of the proposed approach.
A. Overview of the proposed approach
In most tracking scenarios, the linear Gaussian motion
model has been demonstrated to be an effective representation
for the motion behavior of a salient object in natural image
sequences [23, 24]. Therefore, an optimal estimator, the Kalman filter [25], is used to estimate the motion attributes, e.g., the velocity, position, and scale of the object. A flow chart of the proposed approach is shown in Fig. 1. The bounding box
of the object is initialized from the saliency map of the entire
image [21]. A dynamic model is established to predict the
object position and size at the next frame. Under the constraint of natural motion, this predicted bounding box provides the tracking algorithm with a coarse solution, which is not far away
from the ground truth [23]. Thus, a reasonable search region
can be automatically attained by expanding the predicted
object window by a fixed percentage. Then, the location and size of the object are refined by computing the saliency within
the search region. Next, the refined bounding box, as a new
observation, is fed to the Kalman filter to update the dynamic
model in the correction phase. Through this process, the object
in the image sequence is automatically detected and tracked
relying on recursive prediction, observation, and correction.
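As an illustration of the automatic initialization step, the following C++/OpenCV sketch shows one plausible way to turn a saliency map into an initial bounding box: threshold the map and take the bounding rectangle of the largest connected component. The paper does not specify this extraction rule, so the function name boxFromSaliency, the Otsu threshold, and the largest-component heuristic are assumptions made here for illustration only.

#include <opencv2/imgproc.hpp>
#include <opencv2/core.hpp>
#include <vector>

// Hypothetical helper: extract an initial box from an 8-bit saliency map.
cv::Rect boxFromSaliency(const cv::Mat& saliency /* CV_8U, values in [0, 255] */)
{
    // Binarize the saliency map (Otsu's threshold is an assumption here).
    cv::Mat mask;
    cv::threshold(saliency, mask, 0, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);

    // Keep the largest salient region and return its bounding rectangle.
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(mask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    cv::Rect best;
    int bestArea = 0;
    for (const auto& c : contours) {
        cv::Rect r = cv::boundingRect(c);
        if (r.area() > bestArea) { bestArea = r.area(); best = r; }
    }
    return best;  // empty rectangle if no salient region was found
}

The returned rectangle would then serve as the first observation fed to the dynamic model described next.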
B. Motion model
In the dynamic model, the salient object in a frame is
defined by a motion state variable S with six components, S =
{x, y, u, v, w, h}, where (x, y) denotes the center coordinates,
(u, v) denotes the velocities, and (w, h) denotes the width and
height of the minimum bounding box. In the t-th frame, the
predicted state $\hat{S}^{-}_{t}$ is evolved from the prior state $\hat{S}_{t-1}$ in frame t−1, given knowledge of the process prior to time t−1, according to the following linear stochastic equation

$\hat{S}^{-}_{t} = F \hat{S}_{t-1} + w_{t-1}$,   (1)
where the variable wt−1 represents the additive, white Gaus-
sian noise with zero mean and known covariance, and F
denotes the state transition matrix. We use the notation St ∼
N(µ, Σ) to denote that state St is a random variable with a
normal probability distribution with mean µ and covariance
Σ in frame t. The covariance is a diagonal matrix composed of the variances of x, y, u, v, w, and h, respectively. Let us assume that z_t encodes the position and dimensions of
the minimum bounding box of the observation in frame t. The
observation z_t is the output of the fast salient object detector, which is represented by z_t = {x, y, w, h}.

Fig. 2: Illustration of updating the search region ROI using (a) raster scanning and (b) inverse-raster scanning.

The posterior state
of the object in frame t given observation zt is finally updated
by incorporating the observation and the dynamic model via
$S_t = \hat{S}^{-}_{t} + K_t (z_t - H \hat{S}^{-}_{t})$,   (2)
where K_t denotes the Kalman gain in frame t, which is used to obtain the posterior state estimate S_t. The minimum mean-square error estimate is obtained by weighting the difference between the prediction and the observation.
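A minimal sketch of this motion model using OpenCV's cv::KalmanFilter is given below; the matrices F and H follow Eqs. (1) and (2), while the noise variances are illustrative placeholders rather than the values used in the paper.

#include <opencv2/video/tracking.hpp>
#include <opencv2/core.hpp>

// Build a 6-state / 4-measurement Kalman filter for S = {x, y, u, v, w, h}
// and z = {x, y, w, h}, initialized from the first detected bounding box.
cv::KalmanFilter createMotionModel(const cv::Rect2f& init)
{
    cv::KalmanFilter kf(6, 4, 0, CV_32F);

    // State transition matrix F: constant velocity for (x, y),
    // identity for the velocities (u, v) and the box size (w, h).
    kf.transitionMatrix = (cv::Mat_<float>(6, 6) <<
        1, 0, 1, 0, 0, 0,
        0, 1, 0, 1, 0, 0,
        0, 0, 1, 0, 0, 0,
        0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 1);

    // Measurement matrix H: the detector observes position and size only.
    kf.measurementMatrix = (cv::Mat_<float>(4, 6) <<
        1, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 1);

    // Diagonal covariances (placeholder magnitudes, not the paper's values).
    cv::setIdentity(kf.processNoiseCov, cv::Scalar::all(1e-2));
    cv::setIdentity(kf.measurementNoiseCov, cv::Scalar::all(1e-1));
    cv::setIdentity(kf.errorCovPost, cv::Scalar::all(1.0));

    // Initialize the posterior state from the first detection.
    kf.statePost = (cv::Mat_<float>(6, 1) <<
        init.x + 0.5f * init.width, init.y + 0.5f * init.height,
        0.f, 0.f, init.width, init.height);
    return kf;
}

At run time, kf.predict() yields the coarse prediction of Eq. (1), and kf.correct(z) applies the update of Eq. (2) once the refined box z is available.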
C. Salient object detection
It has been shown that the cue of image boundary connectiv-
ity is effective for salient object detection [21, 22]. In natural
images, it is safe to assume that the object regions are much
less connected to the image boundaries.
In this letter, the salient object detection is formulated as
finding the shortest path from pixel w_ij to the seed set B on the image boundary, considering all possible paths in the
image. Each pixel in the 2D digital image I is denoted as a
vertex. The neighboring pixels are connected by edges. In this
work, we consider 4-adjacent neighbors, i.e., the neighbors of $w_{ij}$ are $w_{i-1,j}$, $w_{i+1,j}$, $w_{i,j-1}$, and $w_{i,j+1}$, as shown in Fig. 2. The path $p = v(0), v(1), \cdots, v(k)$ on image I denotes a sequence of consecutive neighboring pixels. Given a loss function L(p), the problem of finding the salient object in frame t is defined as

$I_t(w_{ij}) = \arg\min_{p \in P_{B, w_{ij}}} L(p)$,

where $P_{B, w_{ij}}$ denotes all possible paths connecting the seed set B and the pixel $w_{ij}$ in image $I_t$. Similar to the work in [21], we formulate the loss function at frame t as

$L_{I_t}(p) = \max_{i=0}^{k} I_t(p(i)) - \min_{i=0}^{k} I_t(p(i))$,

where $L_{I_t}(p)$ is the difference between the maximum and the minimum pixel intensity values along the path p. Let $E(w_{ij}, v)$ denote the edge connecting the vertices $w_{ij}$ and $v$, and let $Q(w_{ij})$ denote the current path connecting the pixel $w_{ij}$ with the image boundary set B. We define $C_{I_t}(Q(w_{ij}), E(w_{ij}, v))$ as the cost of the new path that connects the vertex v to the image boundary set B by appending the edge E to $Q(w_{ij})$; it can be calculated as

$C_{I_t}(Q(w_{ij}), E(w_{ij}, v)) = \max\{U(w_{ij}), I_t(v)\} - \min\{L(w_{ij}), I_t(v)\}$,

where $U(w_{ij})$ and $L(w_{ij})$ denote the maximum and the minimum pixel intensity values on the path $Q(w_{ij})$. A raster scanning method [21] can be used to calculate the cost $C_{I_t}(Q(w_{ij}), E(w_{ij}, v))$. The details will be discussed in Section II-D.
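As a minimal, self-contained illustration (not the authors' implementation), the cost defined above can be computed as follows for a vertex with intensity iv appended to a path whose running maximum and minimum intensities are U and L:

#include <algorithm>

// Running statistics of the current path Q(w_ij).
struct PathStats {
    float U;  // maximum pixel intensity along the path
    float L;  // minimum pixel intensity along the path
};

// Barrier cost of the extended path: max{U, I_t(v)} - min{L, I_t(v)}.
inline float barrierCost(const PathStats& q, float iv)
{
    return std::max(q.U, iv) - std::min(q.L, iv);
}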
Algorithm 1: Fast Object Localization Tracking (FOLT)
Input: image I_{t+1}, saliency map D_t, search region ROI_t, number of passes N
Output: saliency map D_{t+1}
Auxiliaries: U_{t+1}, L_{t+1}
Inside the search region ROI, set D_t to ∞; outside the search region, keep the values of D_t
Set L_{t+1} ← I_{t+1} and U_{t+1} ← I_{t+1}
for each frame do
    Prediction using Eq. (1)
    Observation as follows:
    for i = 1 : N do
        if mod(i, 2) = 1 then
            Raster scanning using Eqs. (3), (4), (5)
        else
            Inverse-raster scanning using Eqs. (3), (4), (5)
        end
    end
    Correction using Eq. (2)
    Update the complete D_t every ten frames
end
D. Fast object localization tracking
In [21], Zhang et al. provided a solution for individual im-
ages using the minimum barrier distance detection method. In
order to improve the accuracy and speed in image sequences,
we explore the integration of the image boundary connec-
tivity cue with the temporal context information between
consecutive frames. Therefore, we propose a fast salient object
detection and tracking framework as shown in Fig. 1. During
the observation stage, two fast scanning procedures, raster
scanning and inverse-raster scanning, are implemented to find
the location of the salient object between two consecutive
frames. As shown in Fig. 2, the inner window of the target
object is coarsely predicted using the dynamic model. The
search region is obtained by expanding the inner window
with a fixed percentage. The raster scanning and inverse-raster
scanning are used to update the pixel values in the search
region of image It. In the proposed approach, the search region
is dynamically determined based on the predicted position of
the salient object. As shown in Fig. 2(a), the raster scanning updates the intensities from the top-left pixel to the bottom-right pixel, where each pixel is updated using its two already-visited adjacent neighbors w_{i,j−1} and w_{i−1,j}. Similarly, the inverse-raster scanning traverses the search region in the reverse order and updates each pixel using the adjacent neighbors w_{i+1,j} and w_{i,j+1}, as shown in Fig. 2(b). The values outside of the search region are not updated since they contribute less to the detection. As a trade-off between accuracy and efficiency, a complete
saliency map of the entire image is updated every ten frames.
The updating strategy in the search region is given by

$I_t(w_{ij}) \leftarrow \min\big(I_t(w_{ij}),\, Q_{w_{ij}}(v)\big)$,   (3)
$U(w_{ij}) \leftarrow \max\big(U(v),\, I_t(w_{ij})\big)$,   (4)
$L(w_{ij}) \leftarrow \min\big(L(v),\, I_t(w_{ij})\big)$.   (5)
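A hedged sketch of one raster pass and one inverse-raster pass over the search region is given below, following Eqs. (3)-(5). Here D is the saliency (distance) map, and U and L hold the running maximum/minimum intensities of the best path to each pixel; the relaxation rule, variable names, and ROI handling are assumptions made for illustration rather than the authors' code.

#include <opencv2/core.hpp>
#include <algorithm>

// Relax pixel (y, x) using an already-scanned neighbor (ny, nx).
static void relax(const cv::Mat& I, cv::Mat& D, cv::Mat& U, cv::Mat& L,
                  int y, int x, int ny, int nx)
{
    float iv = I.at<float>(y, x);
    float cost = std::max(U.at<float>(ny, nx), iv) -
                 std::min(L.at<float>(ny, nx), iv);
    if (cost < D.at<float>(y, x)) {                              // Eq. (3)
        D.at<float>(y, x) = cost;
        U.at<float>(y, x) = std::max(U.at<float>(ny, nx), iv);   // Eq. (4)
        L.at<float>(y, x) = std::min(L.at<float>(ny, nx), iv);   // Eq. (5)
    }
}

// One raster pass and one inverse-raster pass restricted to the search region.
// I, D, U, L are single-channel CV_32F maps of the same size.
void scanSearchRegion(const cv::Mat& I, cv::Mat& D, cv::Mat& U, cv::Mat& L,
                      const cv::Rect& roi)
{
    // Raster pass: top-left to bottom-right, using the left and upper neighbors.
    for (int y = roi.y; y < roi.y + roi.height; ++y)
        for (int x = roi.x; x < roi.x + roi.width; ++x) {
            if (x > 0) relax(I, D, U, L, y, x, y, x - 1);
            if (y > 0) relax(I, D, U, L, y, x, y - 1, x);
        }
    // Inverse-raster pass: bottom-right to top-left, using the right and lower neighbors.
    for (int y = roi.y + roi.height - 1; y >= roi.y; --y)
        for (int x = roi.x + roi.width - 1; x >= roi.x; --x) {
            if (x + 1 < I.cols) relax(I, D, U, L, y, x, y, x + 1);
            if (y + 1 < I.rows) relax(I, D, U, L, y, x, y + 1, x);
        }
}

Alternating these two passes N times, as in Algorithm 1, propagates the boundary-connectivity cost through the search region in both directions.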
Fig. 3: Tracking results in representative frames of the pro-
posed and the 7 competing trackers on three challenging
sequences. First row: illumination variation (Skyjumping ce);
Second row: in-plane and out-of-plane rotations (big 2); Third
row: scale variation (motorcycle 006). (best viewed in color)
The implementation details of the above detection and tracking
algorithm are described in Algorithm 1, where the algorithm is initialized based on the detection result of the first frame, and the saliency map D_t of the previous frame is fed to the algorithm.
III. EXPERIMENTAL EVALUATIONS
The proposed approach is implemented in C++ with
OpenCV 3.0.0 on a PC with an Intel Xeon W3250 2.67
GHz CPU and 8 GB RAM. The datasets and source code
of the proposed approach will be available on the authors' homepage. The proposed tracker is evaluated on 15 popular
video sequences selected from [14, 26–28], each containing a salient object in the field of view. In each frame of these
video sequences, the target is labeled manually in a bounding
box, which is used as the ground truth in the quantitative
evaluations.
In our implementation, input images are first resized so that
the maximum dimension is 300 pixels. Three experiments are
designed to evaluate trackers as discussed in [14]: one pass
evaluation (OPE), temporal robustness evaluation (TRE), and
spatial robustness evaluation (SRE). For TRE, we randomly select the
starting frame and run a tracker to the end of the sequence.
Spatial robustness evaluation initializes the bounding box in
the first frame by shifting or scaling. As discussed in Section
II, the proposed method manages to automatically initialize
the tracker and is not sensitive to spatial fluctuations; therefore, the spatial robustness evaluation is omitted. We use the same temporal randomization as in [14] for TRE, and refer
readers to [14] for more details.
A. Speed performance
In the detection stage, the state-of-the-art fast detector MB+ [21] attains a speed of 49 frames per second (fps) on individual images; in contrast, the proposed method achieves a speed of 149 fps with accurate performance on image sequences, which is about three times faster than MB+. The average speed
comparison of the proposed and the seven state-of-the-art
competing trackers is provided in Table I. The average speed of
our tracker is 141 fps, which is at the same level as the fastest tracker, KCF [11]; however, KCF adopts a fixed tracking box,
TABLE I: Quantitative evaluations of the proposed and the 7
competing trackers on the 15 sequences. The best and second
best results are highlighted in bold-face and underline fonts,
respectively.
Ours CT [4] STC [6] CN [7] SAMF [8] DSST [9] CCT [10] KCF [11]
Precision of TRE 0.79 0.51 0.59 0.64 0.65 0.65 0.66 0.60
Success rate of TRE 0.61 0.45 0.46 0.54 0.58 0.56 0.57 0.52
Precision of OPE 0.83 0.44 0.48 0.44 0.59 0.48 0.66 0.48
Success rate of OPE 0.66 0.34 0.41 0.42 0.52 0.44 0.53 0.38
CLE (in pixels) 14.5 74.4 38.0 55.0 40.8 55.7 23.2 45.6
Average speed (in fps) 141.3 12.0 73.6 87.1 12.9 20.8 21.3 144.8
which cannot reflect the scale changes of the object. On
average, our method is more than ten times faster than CT [4]
and SAMF [8], more than five times faster than DSST [9] and CCT [10], and about twice as fast as STC [6] and CN [7].
B. Comparison with the state-of-the-art trackers
The performance of our approach is quantitatively validated
following the metrics used in [14]. We present the results using
precision, center location error (CLE), and success rate (SR).
The CLE is defined as the Euclidean distance between the
centers of the tracking and the ground-truth bounding boxes.
The precision is computed from the percentage of frames
where the CLEs are smaller than a threshold. Following [14],
a threshold value of 20 pixels is used for the precision in
our evaluations. A tracking result in a frame is considered
successful if the overlap ratio $\frac{|a_t \cap a_g|}{|a_t \cup a_g|} > \theta$ for a threshold $\theta \in [0, 1]$, where $a_t$ and $a_g$ denote the bounding boxes of the tracking result and the ground truth, respectively. Thus, SR is defined as the
percentage of frames where the overlap rates are greater than a
threshold θ. Normally, the threshold θ is set to 0.5. We evaluate
the proposed method by comparing to the seven state-of-the-
art trackers: CT, STC, CN, SAMF, DSST, CCT, and KCF.
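For reference, the two metrics can be computed from a pair of bounding boxes as in the following sketch; the helper names are ours, but the operations are the standard CLE and intersection-over-union overlap described above.

#include <opencv2/core.hpp>
#include <cmath>

// Center location error: Euclidean distance between box centers (in pixels).
double centerLocationError(const cv::Rect& track, const cv::Rect& gt)
{
    double dx = (track.x + 0.5 * track.width)  - (gt.x + 0.5 * gt.width);
    double dy = (track.y + 0.5 * track.height) - (gt.y + 0.5 * gt.height);
    return std::sqrt(dx * dx + dy * dy);
}

// Overlap ratio: intersection area over union area of the two boxes.
double overlapRatio(const cv::Rect& track, const cv::Rect& gt)
{
    double inter = (track & gt).area();
    double uni = track.area() + gt.area() - inter;
    return uni > 0.0 ? inter / uni : 0.0;
}

A frame counts toward the precision score if the CLE is below 20 pixels, and toward the success rate if the overlap ratio exceeds the threshold θ (0.5 by default).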
The comparison results on the 15 sequences are shown in
Table I. We present the results under one-pass evaluation and
temporal robustness evaluation using the average precision,
success rate, and CLE over all sequences. As shown in the
table, the proposed method outperforms all seven competing
trackers. It is evident that, in the one pass evaluation, the
proposed tracker obtains the best performance in the CLE
(14.5 pixels) and the precision (0.83), which are 8.7 pixels and 17 percentage points better than the second best tracker, CCT (23.2 pixels in CLE and 0.66 in precision). Meanwhile, in
the success rate, the proposed tracker achieves the best result (0.66), a 13-percentage-point improvement over the second best tracker, CCT (0.53). Please note that, for the 7 competing trackers,
the average performance in TRE is higher than that in OPE;
while for the proposed tracker, the precision and success scores
in TRE are lower than those in OPE. This is because the
proposed tracker tends to perform well in longer sequences,
while the 7 competing trackers work well in shorter sequences
[14]. In addition, Fig. 4 shows the precision and success plots
in the one pass evaluation and temporal robustness evaluation
over all 15 sequences. In the two evaluations, according to both
the precision and the success rate, our approach significantly outperforms the seven competing trackers.

Fig. 4: Precision and success rate plots over the 15 sequences in (top) one pass evaluation (OPE) and (bottom) temporal robustness evaluation (TRE). (best viewed in color)

In summary, the
precision plot demonstrates that our approach is superior in
robustness compared to its counterparts in the experiments;
the success rate shows that our method estimates the scale
changes of the target more accurately.
C. Qualitative evaluation
In this section, we present some qualitative comparisons of
our approach with respect to the 7 competing trackers. Fig. 3
(first row) illustrates a sequence with significant illumination
variations as well as gradual out-of-plane rotations. Both CT
and STC can deal with illumination changes very well, but fail
in the presence of pose variations and out-of-plane rotations,
as shown in frames #365 and #666. In contrast, our tracker
accurately estimates both the scale and position of the target.
Fig. 3 (second row) shows the results on a sequence with
significant in-plane and out-of-plane rotations. Our approach
obtains the best performance in these cases. In this sequence, our approach temporarily tracks only part of the target due to out-of-plane
rotation, but it accurately reacquires the target in the following
frames, as shown in frames #319 and #369.
Fig. 3 (third row) illustrates the results on a sequence
with large scale variations. STC, SAMF, DSST, and CCT are capable of handling scale changes, but they fail in this sequence, as shown in frames #110, #145, and #170. The competing trackers fail to handle the significant appearance changes caused by rotating motions and fast scale variations. In
contrast, our tracker is robust to large and fast scale variations.
IV. CONCLUSIONS
In this paper, we have proposed an effective and efficient
approach for real-time visual object localization and tracking,
which can be applied to UAV navigation tasks such as obstacle sense and avoidance. Our method integrates a fast salient object detector within a Kalman filtering framework. Compared to the state-of-the-art trackers, our approach not only initializes automatically but also achieves very high speed and better performance than the competing trackers.
REFERENCES
[1] A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A
survey,” ACM Comput. Surv., vol. 38, no. 4, pp. 1–45,
2006.
[2] Amazon, “Amazon prime air,” https://www.youtube.com/
watch?v=98BIu9dpwHU, 2013.
[3] M. Fraiwan, A. Alsaleem, H. Abandeh, and O. Aljarrah,
“Obstacle avoidance and navigation in robotic systems:
A land and aerial robots study,” in 5th Int. Conf. Inf.
Commu. Systems (ICICS). IEEE, 2014, pp. 1–5.
[4] K. Zhang, L. Zhang, and M. Yang, “Real-time compres-
sive tracking,” in Eur. Conf. Computer Vision (ECCV). Springer, 2012, pp. 864–877.
[5] Z. Chen, Z. Hong, and D. Tao, “An experimental survey
on correlation filter-based tracking,” arXiv preprint, pp.
1–13, 2015.
[6] K. Zhang, L. Zhang, Q. Liu, D. Zhang, and M. Yang,
“Fast visual tracking via dense spatio-temporal context
learning,” in Eur. Conf. Computer Vision (ECCV). Springer, 2014, pp. 127–141.
[7] M. Danelljan, F. Khan, M. Felsberg, and J. Weijer,
“Adaptive color attributes for real-time visual tracking,”
in IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1090–1097.
[8] Y. Li and J. Zhu, “A scale adaptive kernel correlation
filter tracker with feature integration,” in Eur. Conf.
Computer Vision (ECCV) Workshops. Springer, 2014, pp.
254–265.
[9] M. Danelljan, G. Häger, F. Khan, and M. Felsberg, “Accurate scale estimation for robust visual tracking,” in Proc. Br. Mach. Vis. Conf. (BMVC), 2014, pp. 1–11.
[10] G. Zhu, J. Wang, Y. Wu, and H. Lu, “Collaborative
correlation tracking,” in Proc. Br. Mach. Vis. Conf. (BMVC),
2015, pp. 1–12.
[11] J. Henriques, R. Caseiro, P. Martins, and J. Batista,
“High-speed tracking with kernelized correlation filters,”
IEEE Trans. Patt. Anal. Mach. Intell., vol. 37, no. 3, pp.
583–596, 2015.
[12] Y. Sui, Z. Zhang, G. Wang, Y. Tang, and L. Zhang,
“Real-time visual tracking: Promoting the robustness of
correlation filter learning,” in Eur. Conf. Computer Vision
(ECCV). Springer, 2016.
[13] Y. Sui and L. Zhang, “Visual tracking via locally
structured Gaussian process regression,” IEEE Signal
Process. Lett., vol. 22, no. 9, pp. 1331–1335, 2015.
[14] Y. Wu, J. Lim, and M. Yang, “Online object tracking:
A benchmark,” in IEEE Computer Soc. Conf. Computer
Vision and Pattern Recognition (CVPR), 2013, pp. 2411–
2418.
[15] M. Andriluka, S. Roth, and B. Schiele, “People-
tracking-by-detection and people-detection-by-tracking,”
in IEEE Computer Soc. Conf. Computer Vision and
Pattern Recognition (CVPR). IEEE, 2008, pp. 1–8.
[16] V. Mahadevan and N. Vasconcelos, “Saliency-based
discriminant tracking,” in IEEE Computer Soc. Conf.
Computer Vision and Pattern Recognition (CVPR). IEEE,
2009, pp. 1007–1013.
[17] C. Ma, J. Huang, X. Yang, and M. Yang, “Hierarchical
convolutional features for visual tracking,” in IEEE Int.
Conf. Computer Vision (ICCV), 2015, pp. 3074–3082.
[18] S. Hong, T. You, S. Kwak, and B. Han, “Online
tracking by learning discriminative saliency map with
convolutional neural network,” arXiv preprint, 2015.
[19] S. Gidaris and N. Komodakis, “LocNet: Improving lo-
calization accuracy for object detection,” arXiv preprint,
2015.
[20] K. Tang, A. Joulin, L. Li, and F. Li, “Co-localization
in real-world images,” in IEEE Computer Soc. Conf.
Computer Vision and Pattern Recognition (CVPR). IEEE,
2014, pp. 1464–1471.
[21] J. Zhang, S. Sclaroff, Z. Lin, X. Shen, B. Price, and
R. Mech, “Minimum barrier salient object detection at
80 fps,” in IEEE Int. Conf. Computer Vision (ICCV),
2015, pp. 1404–1412.
[22] W. Zhu, S. Liang, Y. Wei, and J. Sun, “Saliency
optimization from robust background detection,” in IEEE
Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR), 2014, pp. 2814–2821.
[23] S. Yin, J. Na, J. Choi, and S. Oh, “Hierarchical Kalman-
particle filter with adaptation to motion changes for
object tracking,” Comput. Vis. Image Underst., vol. 115,
no. 6, pp. 885–900, 2011.
[24] S. Weng, C. Kuo, and S. Tu, “Video object tracking
using adaptive Kalman filter,” J. Vis. Commun. Image R.,
vol. 17, no. 6, pp. 1190–1208, 2006.
[25] G. Welch and G. Bishop, “An introduction to the Kalman filter,” Technical report, University of North Carolina at Chapel Hill, NC, USA, 2006, pp. 1–16.
[26] A. Li, M. Lin, Y. Wu, M. Yang, and S. Yan, “NUS-PRO: A
new visual tracking challenge,” IEEE Trans. Patt. Anal.
Mach. Intell., vol. 38, no. 2, pp. 335–349, 2016.
[27] P. Liang, E. Blasch, and H. Ling, “Encoding color infor-
mation for visual tracking: Algorithms and benchmark,”
IEEE Trans. Image Process., vol. 24, no. 12, pp. 5630–
5644, 2015.
[28] M. Kristan et al., “The visual object tracking VOT2014
challenge results,” in Eur. Conf. Computer Vision (ECCV)
Workshop, 2014, pp. 191–217.

More Related Content

What's hot

CLUSTERING HYPERSPECTRAL DATA
CLUSTERING HYPERSPECTRAL DATACLUSTERING HYPERSPECTRAL DATA
CLUSTERING HYPERSPECTRAL DATA
csandit
 
Deep VO and SLAM IV
Deep VO and SLAM IVDeep VO and SLAM IV
Deep VO and SLAM IV
Yu Huang
 
Unsupervised/Self-supervvised visual object tracking
Unsupervised/Self-supervvised visual object trackingUnsupervised/Self-supervvised visual object tracking
Unsupervised/Self-supervvised visual object tracking
Yu Huang
 
Paper id 26201483
Paper id 26201483Paper id 26201483
Paper id 26201483
IJRAT
 
Deep Learning’s Application in Radar Signal Data II
Deep Learning’s Application in Radar Signal Data IIDeep Learning’s Application in Radar Signal Data II
Deep Learning’s Application in Radar Signal Data II
Yu Huang
 
Pedestrian behavior/intention modeling for autonomous driving V
Pedestrian behavior/intention modeling for autonomous driving VPedestrian behavior/intention modeling for autonomous driving V
Pedestrian behavior/intention modeling for autonomous driving V
Yu Huang
 
I0341042048
I0341042048I0341042048
I0341042048
inventionjournals
 
Robust foreground modelling to segment and detect multiple moving objects in ...
Robust foreground modelling to segment and detect multiple moving objects in ...Robust foreground modelling to segment and detect multiple moving objects in ...
Robust foreground modelling to segment and detect multiple moving objects in ...
IJECEIAES
 
3-d interpretation from single 2-d image for autonomous driving II
3-d interpretation from single 2-d image for autonomous driving II3-d interpretation from single 2-d image for autonomous driving II
3-d interpretation from single 2-d image for autonomous driving II
Yu Huang
 
Iee egold2010 presentazione_finale_veracini
Iee egold2010 presentazione_finale_veraciniIee egold2010 presentazione_finale_veracini
Iee egold2010 presentazione_finale_veracinigrssieee
 
Fusion of Multispectral And Full Polarimetric SAR Images In NSST Domain
Fusion of Multispectral And Full Polarimetric SAR Images In NSST DomainFusion of Multispectral And Full Polarimetric SAR Images In NSST Domain
Fusion of Multispectral And Full Polarimetric SAR Images In NSST Domain
CSCJournals
 
System for Prediction of Non Stationary Time Series based on the Wavelet Radi...
System for Prediction of Non Stationary Time Series based on the Wavelet Radi...System for Prediction of Non Stationary Time Series based on the Wavelet Radi...
System for Prediction of Non Stationary Time Series based on the Wavelet Radi...
IJECEIAES
 
Mj upjs
Mj upjsMj upjs
Multi sensor calibration by deep learning
Multi sensor calibration by deep learningMulti sensor calibration by deep learning
Multi sensor calibration by deep learning
Yu Huang
 
Motion Detection and Clustering Using PCA and NN in Color Image Sequence
Motion Detection and Clustering Using PCA and NN in Color Image SequenceMotion Detection and Clustering Using PCA and NN in Color Image Sequence
Motion Detection and Clustering Using PCA and NN in Color Image Sequence
TELKOMNIKA JOURNAL
 
Latest 2016 IEEE Projects | 2016 Final Year Project Titles - 1 Crore Projects
Latest 2016 IEEE Projects | 2016 Final Year Project Titles - 1 Crore ProjectsLatest 2016 IEEE Projects | 2016 Final Year Project Titles - 1 Crore Projects
Latest 2016 IEEE Projects | 2016 Final Year Project Titles - 1 Crore Projects
1crore projects
 
3-d interpretation from single 2-d image III
3-d interpretation from single 2-d image III3-d interpretation from single 2-d image III
3-d interpretation from single 2-d image III
Yu Huang
 
An adaptive gmm approach to background subtraction for application in real ti...
An adaptive gmm approach to background subtraction for application in real ti...An adaptive gmm approach to background subtraction for application in real ti...
An adaptive gmm approach to background subtraction for application in real ti...
eSAT Publishing House
 

What's hot (18)

CLUSTERING HYPERSPECTRAL DATA
CLUSTERING HYPERSPECTRAL DATACLUSTERING HYPERSPECTRAL DATA
CLUSTERING HYPERSPECTRAL DATA
 
Deep VO and SLAM IV
Deep VO and SLAM IVDeep VO and SLAM IV
Deep VO and SLAM IV
 
Unsupervised/Self-supervvised visual object tracking
Unsupervised/Self-supervvised visual object trackingUnsupervised/Self-supervvised visual object tracking
Unsupervised/Self-supervvised visual object tracking
 
Paper id 26201483
Paper id 26201483Paper id 26201483
Paper id 26201483
 
Deep Learning’s Application in Radar Signal Data II
Deep Learning’s Application in Radar Signal Data IIDeep Learning’s Application in Radar Signal Data II
Deep Learning’s Application in Radar Signal Data II
 
Pedestrian behavior/intention modeling for autonomous driving V
Pedestrian behavior/intention modeling for autonomous driving VPedestrian behavior/intention modeling for autonomous driving V
Pedestrian behavior/intention modeling for autonomous driving V
 
I0341042048
I0341042048I0341042048
I0341042048
 
Robust foreground modelling to segment and detect multiple moving objects in ...
Robust foreground modelling to segment and detect multiple moving objects in ...Robust foreground modelling to segment and detect multiple moving objects in ...
Robust foreground modelling to segment and detect multiple moving objects in ...
 
3-d interpretation from single 2-d image for autonomous driving II
3-d interpretation from single 2-d image for autonomous driving II3-d interpretation from single 2-d image for autonomous driving II
3-d interpretation from single 2-d image for autonomous driving II
 
Iee egold2010 presentazione_finale_veracini
Iee egold2010 presentazione_finale_veraciniIee egold2010 presentazione_finale_veracini
Iee egold2010 presentazione_finale_veracini
 
Fusion of Multispectral And Full Polarimetric SAR Images In NSST Domain
Fusion of Multispectral And Full Polarimetric SAR Images In NSST DomainFusion of Multispectral And Full Polarimetric SAR Images In NSST Domain
Fusion of Multispectral And Full Polarimetric SAR Images In NSST Domain
 
System for Prediction of Non Stationary Time Series based on the Wavelet Radi...
System for Prediction of Non Stationary Time Series based on the Wavelet Radi...System for Prediction of Non Stationary Time Series based on the Wavelet Radi...
System for Prediction of Non Stationary Time Series based on the Wavelet Radi...
 
Mj upjs
Mj upjsMj upjs
Mj upjs
 
Multi sensor calibration by deep learning
Multi sensor calibration by deep learningMulti sensor calibration by deep learning
Multi sensor calibration by deep learning
 
Motion Detection and Clustering Using PCA and NN in Color Image Sequence
Motion Detection and Clustering Using PCA and NN in Color Image SequenceMotion Detection and Clustering Using PCA and NN in Color Image Sequence
Motion Detection and Clustering Using PCA and NN in Color Image Sequence
 
Latest 2016 IEEE Projects | 2016 Final Year Project Titles - 1 Crore Projects
Latest 2016 IEEE Projects | 2016 Final Year Project Titles - 1 Crore ProjectsLatest 2016 IEEE Projects | 2016 Final Year Project Titles - 1 Crore Projects
Latest 2016 IEEE Projects | 2016 Final Year Project Titles - 1 Crore Projects
 
3-d interpretation from single 2-d image III
3-d interpretation from single 2-d image III3-d interpretation from single 2-d image III
3-d interpretation from single 2-d image III
 
An adaptive gmm approach to background subtraction for application in real ti...
An adaptive gmm approach to background subtraction for application in real ti...An adaptive gmm approach to background subtraction for application in real ti...
An adaptive gmm approach to background subtraction for application in real ti...
 

Similar to real-time-object

International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
IJERD Editor
 
AN EFFICIENT IMPLEMENTATION OF TRACKING USING KALMAN FILTER FOR UNDERWATER RO...
AN EFFICIENT IMPLEMENTATION OF TRACKING USING KALMAN FILTER FOR UNDERWATER RO...AN EFFICIENT IMPLEMENTATION OF TRACKING USING KALMAN FILTER FOR UNDERWATER RO...
AN EFFICIENT IMPLEMENTATION OF TRACKING USING KALMAN FILTER FOR UNDERWATER RO...
IJCSEIT Journal
 
I0343065072
I0343065072I0343065072
I0343065072
ijceronline
 
F1063337
F1063337F1063337
F1063337
IJERD Editor
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
IJERD Editor
 
D018112429
D018112429D018112429
D018112429
IOSR Journals
 
Adaptive and online one class support vector machine-based outlier detection
Adaptive and online one class support vector machine-based outlier detectionAdaptive and online one class support vector machine-based outlier detection
Adaptive and online one class support vector machine-based outlier detection
Nguyen Duong
 
A New Algorithm for Tracking Objects in Videos of Cluttered Scenes
A New Algorithm for Tracking Objects in Videos of Cluttered ScenesA New Algorithm for Tracking Objects in Videos of Cluttered Scenes
A New Algorithm for Tracking Objects in Videos of Cluttered Scenes
Zac Darcy
 
Object tracking with SURF: ARM-Based platform Implementation
Object tracking with SURF: ARM-Based platform ImplementationObject tracking with SURF: ARM-Based platform Implementation
Object tracking with SURF: ARM-Based platform Implementation
Editor IJCATR
 
G04743943
G04743943G04743943
G04743943
IOSR-JEN
 
E0333021025
E0333021025E0333021025
E0333021025
theijes
 
Motion Human Detection & Tracking Based On Background Subtraction
Motion Human Detection & Tracking Based On Background SubtractionMotion Human Detection & Tracking Based On Background Subtraction
Motion Human Detection & Tracking Based On Background Subtraction
International Journal of Engineering Inventions www.ijeijournal.com
 
A Novel Approach for Moving Object Detection from Dynamic Background
A Novel Approach for Moving Object Detection from Dynamic BackgroundA Novel Approach for Moving Object Detection from Dynamic Background
A Novel Approach for Moving Object Detection from Dynamic Background
IJERA Editor
 
Object Distance Detection using a Joint Transform Correlator
Object Distance Detection using a Joint Transform CorrelatorObject Distance Detection using a Joint Transform Correlator
Object Distance Detection using a Joint Transform CorrelatorAlexander Layton
 
A STOCHASTIC STATISTICAL APPROACH FOR TRACKING HUMAN ACTIVITY
A STOCHASTIC STATISTICAL APPROACH FOR TRACKING HUMAN ACTIVITYA STOCHASTIC STATISTICAL APPROACH FOR TRACKING HUMAN ACTIVITY
A STOCHASTIC STATISTICAL APPROACH FOR TRACKING HUMAN ACTIVITY
Zac Darcy
 
Hybrid Quantum Convolutional Neural Network for Tuberculosis Prediction Using...
Hybrid Quantum Convolutional Neural Network for Tuberculosis Prediction Using...Hybrid Quantum Convolutional Neural Network for Tuberculosis Prediction Using...
Hybrid Quantum Convolutional Neural Network for Tuberculosis Prediction Using...
IRJET Journal
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
inventionjournals
 
Distance Metric Based Multi-Attribute Seismic Facies Classification to Identi...
Distance Metric Based Multi-Attribute Seismic Facies Classification to Identi...Distance Metric Based Multi-Attribute Seismic Facies Classification to Identi...
Distance Metric Based Multi-Attribute Seismic Facies Classification to Identi...
Pioneer Natural Resources
 

Similar to real-time-object (20)

International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 
AN EFFICIENT IMPLEMENTATION OF TRACKING USING KALMAN FILTER FOR UNDERWATER RO...
AN EFFICIENT IMPLEMENTATION OF TRACKING USING KALMAN FILTER FOR UNDERWATER RO...AN EFFICIENT IMPLEMENTATION OF TRACKING USING KALMAN FILTER FOR UNDERWATER RO...
AN EFFICIENT IMPLEMENTATION OF TRACKING USING KALMAN FILTER FOR UNDERWATER RO...
 
I0343065072
I0343065072I0343065072
I0343065072
 
F1063337
F1063337F1063337
F1063337
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
D018112429
D018112429D018112429
D018112429
 
Ijctt v7 p104
Ijctt v7 p104Ijctt v7 p104
Ijctt v7 p104
 
Adaptive and online one class support vector machine-based outlier detection
Adaptive and online one class support vector machine-based outlier detectionAdaptive and online one class support vector machine-based outlier detection
Adaptive and online one class support vector machine-based outlier detection
 
A New Algorithm for Tracking Objects in Videos of Cluttered Scenes
A New Algorithm for Tracking Objects in Videos of Cluttered ScenesA New Algorithm for Tracking Objects in Videos of Cluttered Scenes
A New Algorithm for Tracking Objects in Videos of Cluttered Scenes
 
Object tracking with SURF: ARM-Based platform Implementation
Object tracking with SURF: ARM-Based platform ImplementationObject tracking with SURF: ARM-Based platform Implementation
Object tracking with SURF: ARM-Based platform Implementation
 
G04743943
G04743943G04743943
G04743943
 
E0333021025
E0333021025E0333021025
E0333021025
 
Motion Human Detection & Tracking Based On Background Subtraction
Motion Human Detection & Tracking Based On Background SubtractionMotion Human Detection & Tracking Based On Background Subtraction
Motion Human Detection & Tracking Based On Background Subtraction
 
A Novel Approach for Moving Object Detection from Dynamic Background
A Novel Approach for Moving Object Detection from Dynamic BackgroundA Novel Approach for Moving Object Detection from Dynamic Background
A Novel Approach for Moving Object Detection from Dynamic Background
 
Object Distance Detection using a Joint Transform Correlator
Object Distance Detection using a Joint Transform CorrelatorObject Distance Detection using a Joint Transform Correlator
Object Distance Detection using a Joint Transform Correlator
 
A STOCHASTIC STATISTICAL APPROACH FOR TRACKING HUMAN ACTIVITY
A STOCHASTIC STATISTICAL APPROACH FOR TRACKING HUMAN ACTIVITYA STOCHASTIC STATISTICAL APPROACH FOR TRACKING HUMAN ACTIVITY
A STOCHASTIC STATISTICAL APPROACH FOR TRACKING HUMAN ACTIVITY
 
Hybrid Quantum Convolutional Neural Network for Tuberculosis Prediction Using...
Hybrid Quantum Convolutional Neural Network for Tuberculosis Prediction Using...Hybrid Quantum Convolutional Neural Network for Tuberculosis Prediction Using...
Hybrid Quantum Convolutional Neural Network for Tuberculosis Prediction Using...
 
Space Tug Rendezvous
Space Tug RendezvousSpace Tug Rendezvous
Space Tug Rendezvous
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
Distance Metric Based Multi-Attribute Seismic Facies Classification to Identi...
Distance Metric Based Multi-Attribute Seismic Facies Classification to Identi...Distance Metric Based Multi-Attribute Seismic Facies Classification to Identi...
Distance Metric Based Multi-Attribute Seismic Facies Classification to Identi...
 

real-time-object

  • 1. 1 Real-Time Object Localization and Tracking from Image Sequences Yuanwei Wu, Yao Sui, Arjan Gupta and Guanghui Wang Abstract—To address the problem of autonomous sense and avoidance for unmanned aerial vehicle navigation via vision- based method, in this letter, we propose a real-time object localization and tracking strategy from monocular image se- quences. The proposed approach effectively integrates the tech- niques of object detection, localization, and tracking into a dynamic model. At the detection stage, the object of interest is automatically detected and localized from a saliency map computed via connectivity cue of the frame; at the tracking stage, a Kalman filter is employed to provide a coarse prediction of the object position and size, which is further refined via a local detector using image boundary connectivity cue and context information between consecutive frames. Compared to existing methods, the proposed technique does not require any manual initialization, runs much faster than the state-of-the-art trackers of its kind, and achieves comparative tracking performance. Extensive comparative experiments demonstrate the effectiveness and better performance of the proposed approach. Index Terms—Salient object detection; visual tracking; Kalman filter; object localization; real-time tracking; I. INTRODUCTION VISUAL object tracking has played important roles in many computer vision applications, such as human- computer interaction, surveillance, and video understanding [1]. Due to emerging real-world applications, like deliver- ing packages using small unmanned aerial vehicles (UAVs) [2], there is a huge demand for vision-based autonomous navigation for UAVs. First of all, the vision-based methods are robust to electromagnetic interference compared to con- ventional sensor-based method, e.g. global positional system (GPS) [3]. Second, vision-based methods are needed due to strict small size and insufficient power supply of UAVs. Based on this background, in this letter, we address autonomous sense and avoidance of obstacles for UAVs during flight via the integration of object detection and tracking. The tracking-by-detection methods have become increas- ingly popular for real-time applications [4] in visual tracking. The correlation filter-based trackers attract more attention in recent years due to its high speed performance [5]. However, those conventional tracking methods [4, 6–13] require manual initialization with the ground truth at the first frame. Moreover, they are sensitive to the initialization variation caused by scales This work is partly supported by the National Aeronautics and Space Ad- ministration (NASA) LEARN II program under grant number NNX15AN94N. The authors are with the Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, KS 66045 USA. The source code and dataset will be available on the authors home- page http://www.ittc.ku.edu/∼ghwang/. (email: wuyuanwei2010@gmail.com, suiyao@gmail.com, arjangupta@ku.edu, ghwang@ku.edu) and position errors, and would return useless information once failed during tracking [14]. Combining a detector with a tracker is a feasible solution for automatic initialization [15]. The detector, however, needs to be trained with large amount of training samples, while the prior information about the object of interest is usually not available in advance. In [16], Mahadevan et al. 
proposed a saliency-based discriminative tracker with automatic initial- ization, which builds the motion saliency map using optical flow. This technique, however, is computational intensive and not suitable for real-time applications. Some recent techniques on salient object detection and visual tracking [17, 18] have achieved superior performance by using deep learning. However, these methods need large amount of samples for training. The methods of object co- localization in videos [19, 20] are originally designed to handle objects of the same class across a set of distinct images or videos, while for target tracking, we typically focus on a salient object in a video sequence. Several recent approaches exploit boundary connectivity [21, 22] for natural images, which have been shown to be effective for salient object detection. Since the saliency map ef- fectively discovers the spatial information of target , it enables us to improve the target localization accuracy. Inspired by the salient object detection approach [21], which achieves high detection speed on individual images, we develop an efficient method by integrating two complementary processes: salient object detection and tracking. A Kalman filter is employed to predict a coarse location of the target object, and the detector is used to refine the solution. In summary, our contributions are threefold: 1) The pro- posed algorithm integrates saliency map into a dynamic model and adopts the target-specific saliency map as the observation for tracking; 2) We develop a tracker with automatic initializa- tion for real-world applications and 3) the proposed technique achieves better performance than state-of-the-art competing trackers from extensive real experiments. II. THE PROPOSED APPROACH The proposed fast object localization and tracking (FOLT) algorithm can automatically and quickly localize the salient object in the scene and track it across the sequence. In this letter, the object of interest is the salient object in the view, so the tracking problem is formulated as an unsupervised salient object detection, which can be automatically obtained from the saliency map computed from the frame [21]. In the following, we will present a detailed elaboration of the approach.
  • 2. 2 Fig. 1: A flow-chart of the proposed approach. A. Overview of the proposed approach In most tracking scenarios, the linear Gaussian motion model has been demonstrated to be an effective representation for the motion behavior of salient object in natural image sequences [23, 24]. Therefore, an optimal estimator, Kalman filter [25], has been used to estimate the motion attributes, e.g. the velocity, position and scale of the object. A flow chart of the proposed approach is shown in Fig. 1, the bounding box of the object is initialized from the saliency map of the entire image [21]. A dynamic model is established to predict the object position and size at the next frame. Under the constrain of natural motion, this predicted bounding box provides the tracking algorithm a coarse solution, which is not far away from the ground truth [23]. Thus, a reasonable search region can be automatically attained by expanding the predicted object window with a fixed percentage. Then, the location and size of the object is refined by computing the saliency within the search region. Next, the refined bounding box, as a new observation, is fed to the Kalman filter to update the dynamic model in the correction phase. Through this process, the object in the image sequence is automatically detected and tracked relying on recursively prediction, observation, and correction. B. Motion model In the dynamic model, the salient object in a frame is defined by a motion state variable S with six variables S = {x, y, u, v, w, h}, where (x, y) denotes the center coordinates, (u, v) denotes the velocities, and (w, h) denotes the width and height of the minimum bounding box. In the t-th frame, the predicted state ˆS− t is evolved from the prior state ˆSt−1 in frame t−1 given knowledge of the process prior to time t−1 according to the following linear stochastic equation ˆS− t = F ˆSt−1 + wt−1, (1) where the variable wt−1 represents the additive, white Gaus- sian noise with zero mean and known covariance, and F denotes the state transition matrix. We use the notation St ∼ N(µ, Σ) to denote that state St is a random variable with a normal probability distribution with mean µ and covariance Σ in frame t. The covariance is a diagonal matrix, which is composed by the variances of x, y, u, v, w, and h, respectively. Let us assume that zt encodes the positions and dimensions of the minimum bounding box of the observation in frame t. The observation zt is the output of the fast salient object detector, Fig. 2: Illustration of updating the search region ROI using (a) raster scanning and (b) inverse-raster scanning. which is represented by zt = {x, y, w, h}. The posterior state of the object in frame t given observation zt is finally updated by incorporating the observation and the dynamic model via St = ˆS− t + Kt(zt − H ˆS− t ), (2) where Kt denotes the Kalman gain in frame t with the leverage of obtaining a posterior state estimation St. The estimation with minimum mean-square error is obtained by weighting the difference between the prediction and observation. C. Salient object detection It has been shown that the cue of image boundary connectiv- ity is effective for salient object detection [21, 22]. In natural images, it is safe to assume that the object regions are much less connected to the image boundaries. In this letter, the salient object detection is formulated as finding the shortest path from pixel wij to the seed set B from the image boundary, considering all possible paths in the image. 
Each pixel in the 2D digital image I is denoted as a vertex. The neighboring pixels are connected by edges. In this work, we consider 4-adjacent neighbors, e.g. the neighbors of wij are wi−1,j, wi+1,j, wi,j−1, and wi,j+1, as shown in Fig. 2. The path p = v(0), v(1), · · · , v(k) on image I denotes a sequence of consecutive neighboring pixels. Given a loss func- tion L(p), the problem of finding the salient object in the frame t is defined as It(wij) = arg minp∈PB,wij L(p), where PB,wij denotes all possible paths connecting the seed set B and the pixel wij in image It. Similar to the work in [21], we formulate the loss function at the frame t as LIt (p) = maxn j=0(p(i)) − minn j=0(p(j)), where LIt (p) calculates the pixel intensity difference between the maximum and the minimum values among all possible paths. Let E(wij, v) denotes the edge connecting the vertex wij and v, Q(wij) denotes the current path connecting the pixel wij with the image boundary set B. We define CIt (Q(wij), E(wij, v)) as the cost of a new path connecting the vertex v to the image boundary set B by adding the edge E to Q(wij). CIt (Q(wij), E(wij, v)) can be calcu- lated from CIt (Q(wij), E(wij, v)) = max{U(wij), It(v)} − min{L(wij), It(v)}, where U(wij) and L(wij) denote the maximum and the minimum pixel intensity values on the path Q(wij). A raster scanning method [21] could be used to calculate the cost CIt (Q(wij), E(wij, v)). The details will be discussed in sect. II-D.
  • 3. 3 Algorithm 1: Fast Object Localization Tracking (FOLT) Input: image It+1, saliency map Dt, search region ROIt, number of passes N Output: saliency map Dt+1 Auxiliaries: Ut+1, Lt+1 Inside the search region ROI, set Dt to ∞ Outside the search region, keep the values Dt Set Lt+1 ← It+1 and Ut+1 ← It+1 for each frame do Prediction using Eq. (1) Observation as following: for i = 1 : N do if mod(i, 2) = 1 then Raster Scanning using Eq. (3), (4), (5) end else Inverse-Raster Scanning using Eq.(3), (4), (5) end end Correction using Eq. (2) Update the complete Dt every ten frames end D. Fast object localization tracking In [21], Zhang et al. provided a solution for individual im- ages using the minimum barrier distance detection method. In order to improve the accuracy and speed in image sequences, we explore the integration of the image boundary connec- tivity cue with the temporal context information between consecutive frames. Therefore, we propose a fast salient object detection and tracking framework as shown in Fig. 1. During the observation stage, two fast scanning procedures, raster scanning and inverse-raster scanning, are implemented to find the location of the salient object between two consecutive frames. As shown in Fig. 2, the inner window of the target object is coarsely predicted using the dynamic model. The search region is obtained by expanding the inner window with a fixed percentage. The raster scanning and inverse-raster scanning are used to update the pixel values in the search region of image It. In the proposed approach, the search region is dynamically determined based on the predicted position of the salient object. As shown in Fig. 2 (a), the raster scanning is used to update all the intensities from the top-left pixel to the bottom-right pixel, which simultaneously updates two adjacent neighbors wi,j−1 and wi−1,j. Similarly, in the inverse raster scanning, the intensities of the two adjacent neighbors wi+1,j and wi,j+1 in the search region are reversely updated, as shown in Fig.2 (b). The values outside of the search region are not updated since they have less contribution to the detection. As a trade-off between the accuracy and efficiency, a complete saliency map of the entire image is updated every ten frames. The updating strategy in the search region is given by It(wij) ← min(It(wij), Qwij (v)) (3) U(wij) ← max(U(v), It(wij)) (4) L(wij) ← min(L(v), It(wij)) (5) Fig. 3: Tracking results in representative frames of the pro- posed and the 7 competing trackers on three challenging sequences. First row: illumination variation (Skyjumping ce); Second row: in-plane and out-of-plane rotations (big 2); Third row: scale variation (motorcycle 006). (best viewed in color) The implementation details of the above detection and tracking algorithm is described in Algorithm 1, where the algorithm is initialized based on the detection result of the first frame, and the saliency map of the last frame t is fed to the algorithm. III. EXPERIMENTAL EVALUATIONS The proposed approach is implemented in C++ with OpenCV 3.0.0 on a PC with an Intel Xeon W3250 2.67 GHz CPU and 8 GB RAM. The datasets and source code of the proposed approach will be available on the authors homepage. The proposed tracker is evaluated on 15 popular video sequences selected from [14, 26–28] regarding the salient object in the field of view. In each frame of these video sequences, the target is labeled manually in a bounding box, which is used as the ground truth in the quantitative evaluations. 
In our implementation, input images are first resized so that the maximum dimension is 300 pixels. Three experiments are designed to evaluate trackers as discussed in [14]: one pass evaluation (OPE), temporal robustness evaluation (TRE), and spatial robustness evaluation. For TRE, we randomly select the starting frame and run a tracker to the end of the sequence. Spatial robustness evaluation initializes the bounding box in the first frame by shifting or scaling. As discussed in Section II, the proposed method manages to automatically initialize the tracker and is not sensitive to spatial fluctuation. Therefore, we use the same temporal randomization as in [14], and refer readers to [14] for more details. A. Speed performance In the detection stage, for individual images, the most up- to-date fast detector MB+ [21] attains a speed of 49 frame-per- second (fps), in contrast, the proposed method achieves a speed of 149 fps and accurate performance on image sequences, which is three times faster than MB+. The average speed comparison of the proposed and the seven state-of-the-art competing trackers is provided in Table I. The average speed of our tracker is 141 fps, which is at the same level as the fastest tracker KCF [11], however, KCF adopts a fixed tracking box,
A. Speed performance

In the detection stage, for individual images, the most recent fast detector MB+ [21] attains a speed of 49 frames per second (fps); in contrast, the proposed method achieves 149 fps with accurate performance on image sequences, about three times faster than MB+. The average speed comparison of the proposed and the seven state-of-the-art competing trackers is provided in Table I. The average speed of our tracker is 141 fps, which is at the same level as the fastest tracker, KCF [11]; however, KCF adopts a fixed tracking box, which cannot reflect the scale changes of the object. On average, our method is more than ten times faster than CT [4] and SAMF [8], five times faster than DSST [9] and CCT [10], and about two times faster than STC [6] and CN [7].

TABLE I: Quantitative evaluations of the proposed and the 7 competing trackers on the 15 sequences. The best and second best results are highlighted in bold-face and underlined fonts, respectively.

                            Ours    CT [4]   STC [6]   CN [7]   SAMF [8]   DSST [9]   CCT [10]   KCF [11]
    Precision of TRE        0.79    0.51     0.59      0.64     0.65       0.65       0.66       0.60
    Success rate of TRE     0.61    0.45     0.46      0.54     0.58       0.56       0.57       0.52
    Precision of OPE        0.83    0.44     0.48      0.44     0.59       0.48       0.66       0.48
    Success rate of OPE     0.66    0.34     0.41      0.42     0.52       0.44       0.53       0.38
    CLE (in pixels)         14.5    74.4     38.0      55.0     40.8       55.7       23.2       45.6
    Average speed (in fps)  141.3   12.0     73.6      87.1     12.9       20.8       21.3       144.8

B. Comparison with the state-of-the-art trackers

The performance of our approach is quantitatively validated following the metrics used in [14]. We report the precision, the center location error (CLE), and the success rate (SR). The CLE is defined as the Euclidean distance between the centers of the tracking and ground-truth bounding boxes. The precision is the percentage of frames in which the CLE is smaller than a threshold; following [14], a threshold of 20 pixels is used in our evaluations. A tracking result in a frame is considered successful if the overlap rate |a_t ∩ a_g| / |a_t ∪ a_g| > θ for a threshold θ ∈ [0, 1], where a_t and a_g denote the bounding boxes of the tracking result and the ground truth, respectively, and |·| denotes the area of a region. The SR is then defined as the percentage of frames in which the overlap rate is greater than θ; normally, θ is set to 0.5.

We evaluate the proposed method against seven state-of-the-art trackers: CT, STC, CN, SAMF, DSST, CCT, and KCF. The comparison results on the 15 sequences are shown in Table I, where we report the average precision, success rate, and CLE over all sequences under the one-pass and temporal robustness evaluations. As shown in the table, the proposed method outperforms all seven competing trackers. It is evident that, in the one-pass evaluation, the proposed tracker obtains the best performance in CLE (14.5 pixels) and precision (0.83), which are 8.7 pixels and 17% better than the second best tracker, CCT (23.2 pixels in CLE and 0.66 in precision). Meanwhile, in the success rate, the proposed tracker achieves the best result, a 13% improvement over the second best tracker, SAMF. Please note that, for the 7 competing trackers, the average performance in TRE is higher than that in OPE, while for the proposed tracker the precision and success scores in TRE are lower than those in OPE. This is because the proposed tracker tends to perform well in longer sequences, whereas the 7 competing trackers work well in shorter sequences [14].

In addition, Fig. 4 plots the precision and success rate curves under the one-pass evaluation and the temporal robustness evaluation over all 15 sequences. In both evaluations, according to both the precision and the success rate, our approach significantly outperforms the seven competing trackers.

Fig. 4: Precision and success rate plots over the 15 sequences in (top) one-pass evaluation (OPE) and (bottom) temporal robustness evaluation (TRE). (Best viewed in color.)

In summary, the precision results demonstrate that our approach is more robust than its counterparts in the experiments, and the success rate results show that our method estimates the scale changes of the target more accurately.
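For reference, the following sketch (our own illustration, not the authors' evaluation code) shows how the three metrics defined above can be computed from per-frame bounding boxes: the center location error, the precision at a 20-pixel threshold, and the success rate at an overlap threshold of 0.5. The struct and function names are hypothetical.

    #include <opencv2/core.hpp>
    #include <cmath>
    #include <vector>

    // Per-sequence averages of the three metrics (hypothetical helper).
    struct TrackingMetrics { double cle, precision, successRate; };

    // tracked and groundTruth are assumed to hold one box per frame, aligned by index.
    TrackingMetrics evaluate(const std::vector<cv::Rect2d>& tracked,
                             const std::vector<cv::Rect2d>& groundTruth,
                             double cleThresh = 20.0, double overlapThresh = 0.5)
    {
        const int n = static_cast<int>(tracked.size());
        if (n == 0) return {0.0, 0.0, 0.0};
        double cleSum = 0.0;
        int precOk = 0, succOk = 0;
        for (int i = 0; i < n; ++i) {
            const cv::Rect2d& t = tracked[i];
            const cv::Rect2d& g = groundTruth[i];
            // Center location error: Euclidean distance between box centers.
            const double dxc = (t.x + t.width  / 2.0) - (g.x + g.width  / 2.0);
            const double dyc = (t.y + t.height / 2.0) - (g.y + g.height / 2.0);
            const double cle = std::sqrt(dxc * dxc + dyc * dyc);
            cleSum += cle;
            if (cle < cleThresh) ++precOk;
            // Overlap rate: area of intersection over area of union.
            const double inter = (t & g).area();
            const double uni   = t.area() + g.area() - inter;
            if (uni > 0.0 && inter / uni > overlapThresh) ++succOk;
        }
        return { cleSum / n,
                 static_cast<double>(precOk) / n,
                 static_cast<double>(succOk) / n };
    }

Sweeping the two thresholds over a range of values, as in [14], yields the precision and success plots of Fig. 4.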
C. Qualitative evaluation

In this section, we present qualitative comparisons of our approach with the 7 competing trackers. Fig. 3 (first row) illustrates a sequence with significant illumination variations as well as gradual out-of-plane rotations. Both CT and STC deal with the illumination changes very well, but fail in the presence of pose variations and out-of-plane rotations, as shown in frames #365 and #666. In contrast, our tracker accurately estimates both the scale and the position of the target. Fig. 3 (second row) shows the results on a sequence with significant in-plane and out-of-plane rotations, where our approach obtains the best performance. On this sequence, our approach temporarily tracks only part of the target due to an out-of-plane rotation, but it accurately reacquires the target in the following frames, as shown in frames #319 and #369. Fig. 3 (third row) illustrates the results on a sequence with large scale variations. Although STC, SAMF, DSST, and CCT are designed to handle scale changes, they fail in this sequence, as shown in frames #110, #145, and #170: the competing trackers cannot cope with the significant appearance changes caused by the rotating motion and fast scale variations. In contrast, our tracker is robust to large and fast scale variations.

IV. CONCLUSIONS

In this paper, we have proposed an effective and efficient approach for real-time visual object localization and tracking, which can be applied to UAV navigation tasks such as obstacle sense and avoidance. Our method integrates a fast salient object detector within a Kalman filtering framework. Compared to the state-of-the-art trackers, our approach not only initializes automatically, but also runs at the fastest speed and achieves better tracking performance than the competing trackers.
REFERENCES

[1] A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,” ACM Comput. Surv., vol. 38, no. 4, pp. 1–45, 2006.
[2] Amazon, “Amazon Prime Air,” https://www.youtube.com/watch?v=98BIu9dpwHU, 2013.
[3] M. Fraiwan, A. Alsaleem, H. Abandeh, and O. Aljarrah, “Obstacle avoidance and navigation in robotic systems: A land and aerial robots study,” in 5th Int. Conf. Inf. Commun. Systems (ICICS). IEEE, 2014, pp. 1–5.
[4] K. Zhang, L. Zhang, and M. Yang, “Real-time compressive tracking,” in Eur. Conf. Computer Vision (ECCV). Springer, 2012, pp. 864–877.
[5] Z. Chen, Z. Hong, and D. Tao, “An experimental survey on correlation filter-based tracking,” arXiv preprint, pp. 1–13, 2015.
[6] K. Zhang, L. Zhang, Q. Liu, D. Zhang, and M. Yang, “Fast visual tracking via dense spatio-temporal context learning,” in Eur. Conf. Computer Vision (ECCV). Springer, 2014, pp. 127–141.
[7] M. Danelljan, F. Khan, M. Felsberg, and J. Weijer, “Adaptive color attributes for real-time visual tracking,” in IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1090–1097.
[8] Y. Li and J. Zhu, “A scale adaptive kernel correlation filter tracker with feature integration,” in Eur. Conf. Computer Vision (ECCV) Workshops. Springer, 2014, pp. 254–265.
[9] M. Danelljan, G. Häger, F. Khan, and M. Felsberg, “Accurate scale estimation for robust visual tracking,” in Proc. Brit. Mach. Vis. Conf. (BMVC), 2014, pp. 1–11.
[10] G. Zhu, J. Wang, Y. Wu, and H. Lu, “Collaborative correlation tracking,” in Proc. Brit. Mach. Vis. Conf. (BMVC), 2015, pp. 1–12.
[11] J. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed tracking with kernelized correlation filters,” IEEE Trans. Patt. Anal. Mach. Intell., vol. 37, no. 3, pp. 583–596, 2015.
[12] Y. Sui, Z. Zhang, G. Wang, Y. Tang, and L. Zhang, “Real-time visual tracking: Promoting the robustness of correlation filter learning,” in Eur. Conf. Computer Vision (ECCV). Springer, 2016.
[13] Y. Sui and L. Zhang, “Visual tracking via locally structured Gaussian process regression,” IEEE Signal Process. Lett., vol. 22, no. 9, pp. 1331–1335, 2015.
[14] Y. Wu, J. Lim, and M. Yang, “Online object tracking: A benchmark,” in IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR), 2013, pp. 2411–2418.
[15] M. Andriluka, S. Roth, and B. Schiele, “People-tracking-by-detection and people-detection-by-tracking,” in IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR). IEEE, 2008, pp. 1–8.
[16] V. Mahadevan and N. Vasconcelos, “Saliency-based discriminant tracking,” in IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR). IEEE, 2009, pp. 1007–1013.
[17] C. Ma, J. Huang, X. Yang, and M. Yang, “Hierarchical convolutional features for visual tracking,” in IEEE Int. Conf. Computer Vision (ICCV), 2015, pp. 3074–3082.
[18] S. Hong, T. You, S. Kwak, and B. Han, “Online tracking by learning discriminative saliency map with convolutional neural network,” arXiv preprint, 2015.
[19] S. Gidaris and N. Komodakis, “LocNet: Improving localization accuracy for object detection,” arXiv preprint, 2015.
[20] K. Tang, A. Joulin, L. Li, and F. Li, “Co-localization in real-world images,” in IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR). IEEE, 2014, pp. 1464–1471.
[21] J. Zhang, S. Sclaroff, Z. Lin, X. Shen, B. Price, and R. Mech, “Minimum barrier salient object detection at 80 fps,” in IEEE Int. Conf. Computer Vision (ICCV), 2015, pp. 1404–1412.
[22] W. Zhu, S. Liang, Y. Wei, and J. Sun, “Saliency optimization from robust background detection,” in IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR), 2014, pp. 2814–2821.
[23] S. Yin, J. Na, J. Choi, and S. Oh, “Hierarchical Kalman-particle filter with adaptation to motion changes for object tracking,” Comput. Vis. Image Underst., vol. 115, no. 6, pp. 885–900, 2011.
[24] S. Weng, C. Kuo, and S. Tu, “Video object tracking using adaptive Kalman filter,” J. Vis. Commun. Image R., vol. 17, no. 6, pp. 1190–1208, 2006.
[25] G. Welch and G. Bishop, “An introduction to the Kalman filter,” University of North Carolina at Chapel Hill, NC, USA, Tech. Rep., 2006, pp. 1–16.
[26] A. Li, M. Lin, Y. Wu, M. Yang, and S. Yan, “NUS-PRO: A new visual tracking challenge,” IEEE Trans. Patt. Anal. Mach. Intell., vol. 38, no. 2, pp. 335–349, 2016.
[27] P. Liang, E. Blasch, and H. Ling, “Encoding color information for visual tracking: Algorithms and benchmark,” IEEE Trans. Image Process., vol. 24, no. 12, pp. 5630–5644, 2015.
[28] M. Kristan et al., “The visual object tracking VOT2014 challenge results,” in Eur. Conf. Computer Vision (ECCV) Workshops, 2014, pp. 191–217.