Human Detection and Tracking Using Apparent Features under Multi-cameras with Non-overlapping Views
Lu Tian, Shengjin Wang, Xiaoqing Ding
Tsinghua University, Department of Electronic Engineering, Beijing, China
tianlu@ocrserv.ee.tsinghua.edu.cn
Abstract
This paper describes a human detection and tracking system for multiple cameras with non-overlapping views that uses apparent features only. Our system first detects people and then performs object matching. In a distributed intelligent surveillance system, computers must detect pedestrians automatically across multiple cameras, possibly with non-overlapping views, to provide steady and continuous tracking of pedestrian targets. In this paper, we combine Histograms of Oriented Gradients (HOG) and Local Binary Patterns (LBP) to detect humans, and segment the human body from the background using the GrabCut algorithm. We also study pedestrian feature extraction and appearance-based object matching. We connect all the modules above in series to obtain a complete system and test it on samples collected over three cameras with non-overlapping views to demonstrate its effectiveness. We believe our system will aid the development of public security systems.
1. Introduction
Nowadays cameras are all around us, in buildings and on the streets. Using them to advance public security systems is highly meaningful. We expect computers to detect and track pedestrians automatically across multiple cameras with non-overlapping views. This task is complex and challenging because human detection, segmentation, feature extraction and object matching are all required. Not only do people have variable appearance and a wide range of poses, but scale, viewing angle and illumination also change considerably across cameras.
In research on object detection and tracking, the difficult problems can be summarized as follows: (a) Changes of viewing angle: the image of an object in the image plane changes with the projection transformation matrix, which varies with the angle between objects in the real scene and the camera's optical axis. (b) Scale changes: the size of an object varies as it moves between cameras. (c) Illumination: images captured by cameras depend on illumination direction, intensity and target surface reflectivity in the real scene, and this dependence is difficult to describe and model. (d) Deformation of objects: during tracking, the target of interest is always moving, e.g. a pedestrian walking, running or jumping, so its shape changes over time. All these factors lead to time-varying observations of the target we detect and track, especially when the target passes through multiple cameras with non-overlapping views.
Our system is composed of four parts: human detection, pedestrian segmentation, apparent feature extraction and object matching. An overview of our system is shown in Figure 1.
First, we study pedestrian detection in video [1,2,3,5,6]. Combining Histograms of Oriented Gradients (HOG) and Local Binary Patterns (LBP) as the feature set, we build a cascade human detector trained with the AdaBoost algorithm. The results show that the pedestrian detector based on HOG features has a high hit rate; by adding LBP we reduce the false alarm rate while keeping the hit rate high.
Second, we use the output rectangles of the human detector, which contain people, as the input of object segmentation in order to filter out background information. We segment objects with the GrabCut algorithm [7,8], creating a prior mask to initialize the segmentation. Experiments show that when the edge of the input rectangle contains rich background information, we obtain good object segmentation results.
Third, we extract the apparent features of the foreground obtained by object segmentation. The features include color histograms in different color spaces and an LBP descriptor histogram, describing both the color and texture characteristics of the pedestrians.

978-1-4673-0174-9/12/$31.00 ©2012 IEEE ICALIP2012

Figure 1. An overview of our system. (a) The system is composed of four parts. (b) The actual implementation steps of the system.
Then we perform pedestrian matching, using the apparent features we extract, to achieve tracking. When matching people, we calculate the correlation coefficients between two histogram descriptors and match targets by threshold control and by selecting the maximum correlation coefficient.
Finally, all the modules above are connected in series to form a complete system that detects and tracks humans across different non-overlapping camera views. We test the system on samples collected over three cameras with non-overlapping views and demonstrate its effectiveness.
2. Human Detection
Dalal and Triggs proposed using grids of HOG descriptors for human detection and obtained good results [1]. In recent years, HOG descriptors have been widely used and proven effective [4]. HOG features focus on edge orientations, which may lose other useful information such as texture. We therefore combine HOG and LBP to describe both the edge and texture information of humans in videos, so that appearance is better captured. Some work has been done on detecting humans by combining HOG and LBP [2]. It is difficult to balance the weights of different features when concatenating HOG and LBP features into an augmented vector to train the detector, so in this paper we train the HOG classifier and the LBP classifier separately and cascade them to obtain our final detector.

Figure 2. The flow chart of human detection in our system. Our detector is a cascade of a HOG classifier and an LBP classifier.
We extract HOG features following the procedure in [1]. Choosing 2×2-cell blocks and 9 orientation bins, we obtain 36-dimensional HOG descriptors.
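As a concrete illustration, the per-cell orientation binning and 2×2-block normalization can be sketched as follows. This is a simplified version (central-difference gradients, no bilinear interpolation, no gamma correction), not the authors' exact implementation:

```python
import math

def cell_hog(cell):
    """9-bin unsigned-orientation histogram of one cell (2D list of intensities).
    Simplified: central-difference gradients, magnitude-weighted votes."""
    bins = [0.0] * 9
    h, w = len(cell), len(cell[0])
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            ang = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned orientation
            bins[min(int(ang / 20.0), 8)] += math.hypot(gx, gy)  # vote by magnitude
    return bins

def block_descriptor(cells):
    """Concatenate the histograms of a 2x2 block of cells and L2-normalize,
    giving the 36-dimensional block descriptor (4 cells x 9 bins)."""
    v = [b for cell in cells for b in cell_hog(cell)]
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]
```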
To capture texture features we extract an LBP-based region descriptor. In this paper an LBP histogram descriptor similar to S-LBP (Semantic-LBP) [5] is used, but we make the calculation simpler. We use the notation $LBP_{P,R}$ from [4], where $P$ is the number of sampling points and $R$ is the radius of the circle to be sampled; $LBP_{P,R}$ for a pixel $(x, y)$ is

$$LBP_{P,R} = [b_{P-1}, \ldots, b_1, b_0], \quad b_i \in \{0, 1\} \qquad (1)$$

The $LBP_{8,1}$ operator labels the pixels of an image by thresholding the 3×3 neighbourhood of each pixel against the center value and interpreting the result as a binary number. LBP is often used after decimal coding:

$$\mathrm{DecimalCode}\{LBP_{P,R}\} = \sum_{i=0}^{P-1} b_i 2^i \qquad (2)$$

so $LBP_{P,R}$ takes $2^P$ different values. To represent the LBP features of a region more simply, instead of decimal coding we treat an LBP operator as circular data and calculate the histogram of the region based on the following interpretation: a run of consecutive "1"s in an operator can be compactly represented by its length and starting point. We count the occurrences of the different runs of consecutive "1"s and organize the statistics into a histogram that expresses texture features. For example, the operator "10011011" has two runs of "1"s: one starts at the fourth bit with length 2, while the other starts at the seventh bit with length 3 (wrapping around). For an 8-bit binary operator there are 7×8+2 = 58 different kinds of runs, including the two patterns in which all eight bits are "0" or all are "1". So the histogram of $LBP_{8,1}$ has 58 bins, and we finally obtain 58-dimensional LBP descriptors.

Figure 3. Examples of false alarms passing the HOG classifier and then removed by the LBP classifier.
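The run-based binning described above can be sketched in Python as follows. The exact bin layout (runs indexed by start × 7 + length − 1, with the last two bins for the all-zero and all-one patterns) is our assumed ordering, not necessarily the authors':

```python
def circular_runs(bits):
    """Runs of consecutive 1s in a circular bit pattern, as (start, length) pairs.
    Assumes the pattern contains at least one 0 and one 1."""
    n = len(bits)
    anchor = bits.index(0)  # start scanning at a 0 so no run is split by wrap-around
    runs, k = [], 0
    while k < n:
        idx = (anchor + k) % n
        if bits[idx] == 1:
            start, length = idx, 0
            while k < n and bits[(anchor + k) % n] == 1:
                length += 1
                k += 1
            runs.append((start, length))
        else:
            k += 1
    return runs

def lbp58_bins(bits):
    """Histogram bin indices for one 8-bit LBP pattern under the 58-bin scheme:
    56 run bins (8 starts x 7 lengths) plus all-zero (56) and all-one (57)."""
    if 1 not in bits:
        return [56]
    if 0 not in bits:
        return [57]
    return [start * 7 + (length - 1) for start, length in circular_runs(bits)]

def lbp58_histogram(patterns):
    """Accumulate the 58-bin texture histogram over a region's LBP patterns."""
    hist = [0] * 58
    for bits in patterns:
        for b in lbp58_bins(bits):
            hist[b] += 1
    return hist
```

On the paper's example "10011011", `circular_runs` yields the runs (start 3, length 2) and (start 6, length 3) in 0-based indexing, matching the fourth and seventh bits described in the text.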
After obtaining the histogram descriptors it is easy to calculate the integral image of the features. We train our human detector using Fisher's linear discriminant as the weak classifier and the AdaBoost algorithm to generate a strong cascade classifier. After training the HOG classifier and the LBP classifier separately, we cascade them to obtain our final detector. We use a sliding window of variable size to detect humans in videos; the detailed process is shown in Figure 2. Experiments on our video samples show that the feature-fusion detector reduces false positive detections by about fifty percent, with only a slight decline in hit rate, compared to using the HOG detector alone. Some examples are presented in Figure 3.
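The variable-size sliding-window scan and the two-stage cascade can be sketched as follows. The base window size, scale step and stride are illustrative choices, and the two classifier arguments are placeholders for the trained AdaBoost stages:

```python
def sliding_windows(img_w, img_h, base=(64, 128), scale=1.2, stride=8, n_scales=4):
    """Yield candidate detection windows (x, y, w, h) at several scales."""
    w, h = float(base[0]), float(base[1])
    for _ in range(n_scales):
        step = max(1, int(stride * w / base[0]))  # stride grows with the window
        for y in range(0, img_h - int(h) + 1, step):
            for x in range(0, img_w - int(w) + 1, step):
                yield (x, y, int(w), int(h))
        w, h = w * scale, h * scale

def cascade_detect(windows, hog_stage, lbp_stage):
    """Keep only windows that pass the HOG stage and then the LBP stage.
    Both stages are callables returning True for 'human'."""
    return [win for win in windows if hog_stage(win) and lbp_stage(win)]
```

Because the second stage only runs on windows the first stage accepts, the (cheaper, high-hit-rate) HOG stage prunes most candidates before the LBP stage is evaluated.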
3. Pedestrian Segmentation
After human detection we obtain rectangles containing pedestrians. We want to cut the pedestrians out of the images so that features are extracted only on the body we are interested in, filtering out the background information.

Figure 4. Left: the mask we use for initialization; right: a simple rectangle initialization. Red represents foreground, black represents background, and the blue part is the unknown region.
In our system we use the GrabCut algorithm introduced in [7] to segment pedestrians. GrabCut is an interactive foreground extraction method using iterated graph cuts; it models the input image with Gaussian Mixture Models (GMMs) in RGB color space to separate foreground people from the background.

Accurate object segmentation against a complex background is difficult, since computers do not know which part of an image is foreground and which is background. Good segmentation algorithms tend to rely on full or semi-supervision [5]; for example, when we use Photoshop to segment an object we must manually draw the contour to get a better result. Initialization is therefore quite important for segmenting objects automatically. Knowing that the middle part of an input rectangle is a pedestrian, whose contour resembles an oval, and that the edge part is background, we initialize the input rectangles with the prior mask in Figure 4, which turns out to perform better than initializing with a simple rectangle. Figure 5 shows that segmentation initialized with our mask retains the whole body well.
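A minimal sketch of building such a prior mask, assuming OpenCV-style GrabCut labels (0 sure background, 2 probable background, 3 probable foreground) and illustrative ellipse and border proportions not taken from the paper:

```python
def prior_mask(h, w, border=0.08):
    """Prior mask for a detection rectangle: an inscribed oval is marked
    probable foreground, an edge strip sure background, and the rest is
    left for GrabCut to decide (probable background). Proportions are
    illustrative assumptions."""
    BG, PR_BG, PR_FG = 0, 2, 3
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ry, rx = h * 0.42, w * 0.32          # assumed oval radii for a standing person
    bh, bw = max(1, int(h * border)), max(1, int(w * border))
    mask = [[PR_BG] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if y < bh or y >= h - bh or x < bw or x >= w - bw:
                mask[y][x] = BG          # edge strip: background
            elif ((y - cy) / ry) ** 2 + ((x - cx) / rx) ** 2 <= 1.0:
                mask[y][x] = PR_FG       # central oval: the pedestrian
    return mask
```

Such a mask could then be passed to an implementation like OpenCV's `cv2.grabCut` with the mask-initialization mode instead of a plain rectangle.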
We can take advantage of the segmentation results to remove some false positive detections by analyzing the size and shape of the foreground. If the number of foreground pixels is less than a threshold, or the proportions of the object's contour are unreasonable, the detection is judged a false alarm.
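A sketch of this check, with illustrative threshold values (the paper does not give its exact numbers):

```python
def is_false_alarm(mask, min_pixels=800, min_hw_ratio=1.2, max_hw_ratio=4.0):
    """Flag a detection whose segmented foreground is too small, or whose
    bounding-box height/width ratio is implausible for a standing pedestrian.
    mask: 2D 0/1 foreground mask; all thresholds are illustrative."""
    pts = [(y, x) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if len(pts) < min_pixels:
        return True                      # too little foreground: spurious detection
    ys = [y for y, _ in pts]
    xs = [x for _, x in pts]
    ratio = (max(ys) - min(ys) + 1) / (max(xs) - min(xs) + 1)
    return not (min_hw_ratio <= ratio <= max_hw_ratio)
```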
4. Feature Extraction
As we want to track humans across multiple cameras with non-overlapping views, the motion information of a person is no longer continuous; hence we extract apparent features that adapt to changes of viewing angle, scale, illumination, etc. Consequently we concentrate on the color and texture features of the body.

Figure 5. Pedestrian segmentation. (a) Input images of segmentation. (b) GMM images. (c) Results of rectangle initialization. (d) Results of mask initialization.
The color histogram is a color characteristic widely used in image and video processing. It describes the proportions of different colors in an image without regard to their spatial locations, so it adapts to object deformation and scale changes to a certain extent. Color histograms compute color statistics of the image data and organize them into a series of predefined bins representing the color distribution.

Color histograms can be calculated in different color spaces such as RGB and HSV. We experimented with extracting color histograms in both RGB and HSV space for tracking. Extraction in HSV performs better, because HSV gives a more direct description of color than RGB. As illumination has a great influence on the V-channel (value), we calculate the histogram over only the H-channel (hue) and S-channel (saturation) to adapt to variable illumination. We equally divide the H-channel into 16 bins and the S-channel into 8 bins, obtaining a 128-dimensional histogram that describes the color characteristics.

Figure 6. Examples of feature extraction. The first column shows the body parts we segment and the second column the apparent feature histograms extracted. The left part of each histogram represents color features; the right part, in blue, represents texture features.
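The joint H-S binning can be sketched as follows, assuming hue in [0, 360) and saturation in [0, 1]:

```python
def hs_histogram(pixels):
    """Joint 16x8 hue-saturation histogram (128 bins), ignoring the V-channel.
    pixels: iterable of (h, s) pairs with h in [0, 360) and s in [0, 1]."""
    hist = [0] * 128
    for h, s in pixels:
        h_bin = min(int(h / 360.0 * 16), 15)   # 16 equal hue bins
        s_bin = min(int(s * 8), 7)             # 8 equal saturation bins
        hist[h_bin * 8 + s_bin] += 1
    return hist
```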
As for texture features, we again use the 58-dimensional LBP histogram descriptor introduced in Section 2.

To capture only the information of the body we are interested in, we extract the color histogram and the LBP histogram of the foreground region after pedestrian segmentation. We combine color and texture features by concatenating the two histograms after normalizing each individually, finally obtaining 186-dimensional feature vectors. Part-based features may seem to describe a person more accurately [9], but they do not perform better than features extracted from the whole body when tracking across multiple cameras: part-based features are not stable or robust under the object deformation that occurs between cameras, as our experiments also confirm. We therefore use the apparent features of the whole foreground region to describe a person; some results are shown in Figure 6.

Figure 7. Correlation coefficients between objects under multiple cameras. The top row shows some objects after segmentation; the correlation coefficients between them are listed in the table below. Each triangular gray area marks the same person under different cameras.
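Normalizing each histogram and concatenating them can be sketched as follows; L1 normalization is our assumption, since the paper only says the histograms are normalized individually:

```python
def combine_features(color_hist, lbp_hist):
    """L1-normalize the 128-D color histogram and the 58-D LBP histogram
    individually, then concatenate them into a 186-D apparent-feature vector."""
    def l1(h):
        total = float(sum(h)) or 1.0   # guard against an empty histogram
        return [v / total for v in h]
    return l1(color_hist) + l1(lbp_hist)
```

Normalizing each histogram before concatenation keeps the color and texture parts on comparable scales regardless of region size.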
5. Object Matching
We track humans across multiple cameras using only the apparent features we extract, without the help of motion information, which is discontinuous across cameras. With the color and texture histogram features extracted in Section 4, we perform object matching by calculating correlation coefficients between every two histograms and selecting the maximum. Two histograms with the maximum correlation coefficient are determined to be the same person, provided the maximum correlation coefficient exceeds a predetermined matching threshold. The correlation coefficient between histograms $H_1$ and $H_2$ is calculated by the following formulas, and some examples are shown in Figure 7. The figure shows that, despite changes of illumination and variable postures, the correlation coefficients between images of the same person remain greater than those between different people.
$$H'_k(i) = H_k(i) - \frac{1}{N}\sum_j H_k(j) \qquad (3)$$

$$d_{\mathrm{correl}}(H_1, H_2) = \frac{\sum_i H'_1(i)\, H'_2(i)}{\sqrt{\sum_i H'^2_1(i)\, \sum_i H'^2_2(i)}} \qquad (4)$$
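Equations (3) and (4) translate directly into code: mean-center each histogram, then compute the normalized dot product.

```python
import math

def correlation(h1, h2):
    """Correlation coefficient between two histograms, as in eqs. (3)-(4)."""
    n = len(h1)
    def centered(h):
        mean = sum(h) / n              # eq. (3): subtract the histogram mean
        return [v - mean for v in h]
    a, b = centered(h1), centered(h2)
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den else 0.0   # eq. (4), guarding a zero denominator
```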
When a new person appears, his histogram feature is recorded in a database. As we track pedestrians across different videos with non-overlapping views, we need to update the recorded information, since the features of the same person keep changing. We set two thresholds to control matching and updating respectively. The lower threshold, called the matching threshold, determines whether two histograms match and describe the same person. The higher threshold is called the updating threshold: if the correlation of two histograms exceeds it, we can be certain they were extracted from the same person, and we update the person's recorded information with the average of the two histograms. This information update improves matching effectively, and performs especially well for stable tracking of pedestrians under the same camera, since it captures the slow changes of pedestrian features.
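The two-threshold matching-and-updating loop can be sketched as follows; the threshold values are illustrative, not the paper's, and `correlation` implements eqs. (3)-(4):

```python
import math

MATCH_T = 0.6    # matching threshold (illustrative value)
UPDATE_T = 0.85  # updating threshold (illustrative value)

def correlation(h1, h2):
    """Correlation coefficient between two histograms (eqs. (3)-(4))."""
    n = len(h1)
    a = [v - sum(h1) / n for v in h1]
    b = [v - sum(h2) / n for v in h2]
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / den if den else 0.0

def match_and_update(feature, database):
    """database: dict label -> stored histogram. Returns the label assigned
    to this feature, adding a new entry when nothing matches."""
    best_label, best_corr = None, -1.0
    for label, stored in database.items():
        c = correlation(feature, stored)
        if c > best_corr:
            best_label, best_corr = label, c
    if best_label is not None and best_corr >= MATCH_T:
        if best_corr >= UPDATE_T:      # confident match: average into the record
            database[best_label] = [(a + b) / 2.0
                                    for a, b in zip(database[best_label], feature)]
        return best_label
    new_label = len(database)          # unseen person: record under a fresh label
    database[new_label] = list(feature)
    return new_label
```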
Figure 8. Sample collection under three cameras with non-overlapping views. (a) Top view of the sample collection place, showing the positional relationship of the cameras. (b)(c)(d) Scenes of channels A, B and C.
6. Experiment Results
Our test samples were collected over three cameras with non-overlapping views, recording the different channels at the same time. Figure 8 shows the position, height and angle of the sample-acquisition cameras. Running our system on these samples in the time sequence of the videos, we process three frames from different cameras at the same time. Every time a new person appears, we give the target a new label. Some experimental results are shown in Figure 9.

Figure 9. Results of human detection and tracking under three cameras. Green rectangles show outputs of the human detector. Red rectangles are bounding boxes after segmentation. Blue numbers are the labels of people identified by object matching.
7. Conclusions and Future Work
We implement a public security system composed of four parts: human detection, pedestrian segmentation, apparent feature extraction and object matching. Experiments on the samples we collected show the effectiveness of detecting and tracking humans under three cameras with non-overlapping views. In the future we can add information between adjacent frames for stable tracking under a single camera, and consider spatial and temporal information for multi-camera tracking.
Acknowledgments
This work is supported by the National High Tech-
nology Research and Development Program of China
(863 program) under Grant No. 2011AA110402 and
the National Natural Science Foundation of China un-
der Grant No. 61071135.
References
[1] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. CVPR 2005, volume 1, pp. 886-893, 2005.
[2] Xiaoyu Wang, Tony X. Han, and Shuicheng Yan. An HOG-LBP human detector with partial occlusion handling. ICCV 2009, pp. 32-39, 2009.
[3] William Robson Schwartz, Aniruddha Kembhavi, David Harwood, and Larry S. Davis. Human detection using partial least squares analysis. ICCV 2009, pp. 24-31, 2009.
[4] T. Ahonen, A. Hadid, and M. Pietikäinen. Face recognition with local binary patterns. ECCV 2004, volume 3021/2004, pp. 469-481, 2004.
[5] Y. Mu, S. Yan, Y. Liu, T. Huang, and B. Zhou. Discriminative local binary patterns for human detection in personal album. CVPR 2008, 2008.
[6] Pedro Felzenszwalb, David McAllester, and Deva Ramanan. A discriminatively trained, multiscale, deformable part model. CVPR 2008, pp. 1-8, 2008.
[7] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. "GrabCut" - interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics, volume 23, No. 3, 2004.
[8] Sara Vicente, Vladimir Kolmogorov, and Carsten Rother. Graph cut based image segmentation with connectivity priors. CVPR 2008, pp. 1-8, 2008.
[9] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani. Person re-identification by symmetry-driven accumulation of local features. CVPR 2010, pp. 2360-2367, 2010.