An FPGA based human detection system with embedded platform
Pei-Yung Hsiao a,*, Shih-Yu Lin a, Shih-Shinh Huang b
a Department of Electrical Engineering, National University of Kaohsiung, Kaohsiung, Taiwan, ROC
b Department of Computer and Communication Engineering, National Kaohsiung First University of Science and Technology, Taiwan, ROC
* Corresponding author. E-mail address: pyhsiao@nuk.edu.tw (P.-Y. Hsiao).
Microelectronic Engineering 138 (2015) 42–46. doi:10.1016/j.mee.2015.01.018
Article info
Article history: Received 27 August 2014; Received in revised form 23 December 2014; Accepted 17 January 2015; Available online 29 January 2015.
Keywords: FPGA circuit design; Real-time embedded system; Human detection; HOG; SVM; Adaboost

Abstract
Focusing on the computing speed of a practical machine learning based human detection system at the testing (detecting) stage, so as to reach the real-time requirement on an embedded platform, the idea of iteratively computing HOG with an FPGA circuit design is proposed. The completed HOG accelerator contains a gradient calculation circuit module and a histogram accumulation circuit module. The linear SVM classification algorithm, which produces the necessary weak classifiers, is combined with the Adaboost algorithm to establish a strong classifier. The human detection is successfully implemented on a portable embedded platform to reduce the system cost and size. Experimental results show that the accuracy error is merely about 0.1–0.4% in the comparison between the presented FPGA based HW/SW co-design and the PC based pure software. Meanwhile, the computing speed achieves the requirement of a real-time embedded system, 15 fps.
© 2015 Elsevier B.V. All rights reserved.
1. Introduction
Human and pedestrian detection technologies have attracted wide attention in the fields of intelligent transportation systems, computer vision, and perceptive surveillance systems in the past years [1–5]. As is well known, various machine learning algorithms have been proposed to solve the problem of human detection, among which the local feature vector of histograms of oriented gradients (HOG), proposed by Dalal et al. [1] in 2005, has become the most cited human descriptive feature [2–6].
Although many researchers have aimed at designing FPGA based circuits for image processing, detection, and other related applications [7–8], only a few works in recent years have focused on developing hardware circuits for HOG [4–6]. Yet those few works did not present a whole system built on an embedded platform to achieve a real-time HW/SW co-design system for human detection. Besides, detailed comparisons of computation speeds and detection error rates among an FPGA accelerator, an embedded platform, and a personal computer have not appeared in the literature.
In this study, the human detection is successfully implemented on a portable embedded platform to reduce the system cost and size. Moreover, the computing speed of the testing stage achieves about 15 frames per second, which fully matches the requirement of a real-time embedded system [7].
2. Principles and FPGA design
2.1. Human detection algorithm
The human detection algorithm covers a training stage and a testing (also called detecting) stage, as shown in Fig. 1. Both the SVM [9–10] and Adaboost [3] algorithms are required at the training stage, whereas only the SVM algorithm is used in the testing stage. In our experiments, still image pictures and videos are utilized as input at the two stages, separately. Two public image datasets are collected and used in both stages, while two public videos and one additional video are chosen for the testing stage. The additional video was shot by us with the scene set up in our laboratory.
Before entering the training stage, scene images need to be manually collected as positive samples (human) and negative samples (non-human) with the resolution 64 × 128 of a detecting window [1]. To reduce the scanning range of the detecting window, the input image frame first undergoes foreground segmentation at the testing stage in order to acquire the region of interest (ROI) and diminish the computation time. As the human objects in the image frame change size with the distance between camera and object, detecting windows of different sizes need to be scaled up/down, as sketched below.
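As an illustration of the scanning and scaling step, a hypothetical sliding-window generator is sketched below; the stride and scale factors are our assumptions, since the paper does not specify them.

```python
def detecting_windows(roi_w, roi_h, scales=(1.0, 1.25, 1.5), stride=8):
    """Yield (x, y, w, h) detecting windows over a ROI at several scales.
    The base window is 64 x 128 [1]; scales and stride are assumptions."""
    for s in scales:
        w, h = int(64 * s), int(128 * s)
        for y in range(0, roi_h - h + 1, stride):
            for x in range(0, roi_w - w + 1, stride):
                yield x, y, w, h
```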
The system will judge whether there is a human or not in each
detecting window by the One Detecting Window Strong Classifier
Module. The module contains two computing steps. First, all HOG
vectors, which correspond to all weak classifiers built
into the strong classifier, are calculated. For instance, a strong classifier with 40 weak classifiers needs to perform 40 HOG calculations. Second, all weak classifiers perform the SVM prediction once, using the SVM model file acquired from the training stage, in order to identify whether the object inside the detecting window is a human or a non-human.
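As a rough illustration of these two steps, consider the hedged sketch below. The weak classifier structure, the per-block HOG helper `hog_36d`, and the bias term are hypothetical stand-ins of ours; the paper does not describe the actual SVM model file format.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class WeakClassifier:
    block_x: int          # block position this weak classifier evaluates
    block_y: int
    weights: np.ndarray   # 36 linear SVM weights for that block (assumed layout)

def classify_window(window, weak_classifiers, bias, hog_36d):
    """Step 1: one HOG computation per weak classifier (iterative HOG reuse).
    Step 2: one linear SVM prediction over all weak classifier responses."""
    score = bias
    for wc in weak_classifiers:                        # e.g. 40 weak classifiers
        v = hog_36d(window, wc.block_x, wc.block_y)    # 36D normalized vector
        score += float(np.dot(wc.weights, v))
    return score >= 0.0                                # "If Result >= 0" in Fig. 1
```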
2.2. Histograms of oriented gradients
As is well known, HOG, proposed by Dalal et al. [1], is the most cited human local feature. The idea is to use a 36D vector as the descriptive feature in human detection, representing the contour and appearance information of an object in an image block or in a detecting window. For the hardware design of HOG using FPGA, based on the design principles of simplicity, regularity, and modularity, our circuit architecture comprises the four arithmetic modules described below.
2.2.1. Gradient components and gradient magnitude
In order to acquire the vertical and horizontal gradient components, the vertical and horizontal differential operations are first applied to the target block. In other words, the mask $[-1, 0, 1]$ or $[-1, 0, 1]^T$ is used for the convolution operation, through Eq. (1), to calculate $G_x$ and $G_y$, whose values range between −255 and +255.

$$G_x(x, y) = f(x+1, y) - f(x-1, y), \qquad G_y(x, y) = f(x, y+1) - f(x, y-1) \tag{1}$$
The above $G_x$ and $G_y$ are used for calculating the gradient magnitude with Eq. (2), i.e., the square root of the sum of squares. The resultant values lie in 0–357.

$$\nabla f(x, y) = \sqrt{G_x(x, y)^2 + G_y(x, y)^2} \tag{2}$$
2.2.2. Gradient orientation
The above $G_x$ and $G_y$ are again used for calculating the gradient orientation with Eq. (3). The results are converted to angles ranging between 0° and 180°, i.e., the unsigned gradient orientation.

$$\theta(x, y) = \tan^{-1}\left(\frac{G_y(x, y)}{G_x(x, y)}\right) \tag{3}$$
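For reference, a minimal NumPy sketch of Eqs. (1)–(3) on an 8-bit grayscale block is given below. It is our illustration, not the circuit itself, and it simply leaves the one-pixel boundary at zero rather than reproducing the hardware's shrinking manipulation (mentioned in Section 3.2).

```python
import numpy as np

def gradients(f):
    """Eqs. (1)-(3) on an 8-bit grayscale block f (2D array, rows = y)."""
    f = f.astype(np.int32)
    gx = np.zeros_like(f)
    gy = np.zeros_like(f)
    gx[:, 1:-1] = f[:, 2:] - f[:, :-2]   # [-1, 0, 1] mask: values in -255..255
    gy[1:-1, :] = f[2:, :] - f[:-2, :]   # transposed mask for the y direction
    mag = np.sqrt(gx ** 2 + gy ** 2)                  # Eq. (2)
    theta = np.degrees(np.arctan2(gy, gx)) % 180.0    # Eq. (3), unsigned 0-180
    return gx, gy, mag, theta
```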
2.2.3. Accumulated histogram
After obtaining the gradient magnitude and gradient orientation, the four cells equally divided from a block are separately accumulated to produce four 9D vectors, which are combined into a 36D vector $v = (v_1, v_2, \ldots, v_{36})$ as shown in Fig. 2. The gradient orientation is segmented with 20° per bin in a cell, so nine bins in total are acquired.
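The accumulation can be sketched as follows. Note one ambiguity: the hardware description in Section 2.3 speaks of counting pixels per bin, while Fig. 4 accumulates the 9-bit magnitude; this sketch follows Dalal's magnitude-weighted voting, which matches Fig. 4, and should be read as our assumption.

```python
import numpy as np

def block_histogram(mag, theta):
    """Accumulate a 16x16 block into four 9-bin cell histograms (36D vector).
    Vote weight is the gradient magnitude; bin index is floor(theta / 20)."""
    v = np.zeros(36)
    h, w = mag.shape                      # 16 x 16 block -> four 8x8 cells
    for y in range(h):
        for x in range(w):
            cell = (y // (h // 2)) * 2 + (x // (w // 2))   # cell index 0..3
            b = min(int(theta[y, x] // 20.0), 8)           # nine 20-degree bins
            v[cell * 9 + b] += mag[y, x]
    return v
```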
2.2.4. L2 normalization
The acquired 36D vector undergoes L2 normalization by Eq. (4), so that each component value lies in 0–1, where $\epsilon$ is a small constant.

$$v_i = \frac{v_i}{\sqrt{\|v\|_2^2 + \epsilon^2}}, \quad i = 1, 2, \ldots, 36 \tag{4}$$
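Eq. (4) in NumPy; the value of the small constant ε is our choice for illustration, as the paper does not give it:

```python
import numpy as np

def l2_normalize(v, eps=1e-2):
    # Eq. (4): divide by sqrt(||v||^2 + eps^2); the eps value is an assumption
    return v / np.sqrt(np.dot(v, v) + eps ** 2)
```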
2.3. FPGA modular circuit design
The block diagram of the developed FPGA circuits is shown in Fig. 3. The HOG vector generator module is illustrated in the lower right sub-block of the whole block diagram, and four circuit sub-modules are contained inside it. The full buffering scheme shown in the upper left part of the HOG vector generator module is designed for adjusting the flow of input block pixels. The gradient module is used for calculating the gradient components, gradient magnitude, and gradient orientation. The histogram module is used for accumulating the histograms of the four cells. In addition, the histogram PISO in the lower left part is designed for buffering the output of the 36D HOG accelerator. Moreover, the testing and communication circuits are placed in the left part outside the HOG vector generator module in Fig. 3, so as to control signals and share data with the embedded ARM CPU.
The gradient module covers three sub-modules: GradientComponents for calculating the gradient components, ComponentsToMagnitude for producing the gradient magnitude, and ComponentsToOrientation for generating the gradient orientation. Before inputting the image data into these sub-modules, a data buffering scheme is required, and an appropriate computation for simplifying floating point manipulation is necessary before designing the ComponentsToOrientation sub-module circuit. The gradient orientation comes in two types, signed and unsigned, of which the latter is used in this study. When accumulating the orientation, the range of 20° is used as the partition basis.
Fig. 1. Machine learning based human detection system: the training stage (positive/negative samples, SVM and AdaBoost training, producing the SVM model file and strong classifier file) and the testing stage (frame input and grey scale, foreground segmentation, detecting window scaling and scanning, one detecting window strong classifier module, "If Result >= 0" decision).
Fig. 2. Combining four 9D histograms into an extracted HOG feature, a 36D histogram (a feature block is divided into four cells, each producing a 9D histogram).
By applying Eq. (5) with just one multiplier, following the simple and regular hardware implementation principle, each bin value is defined as the accumulated number of pixels in a cell whose orientations fall within the same 20° range. Therefore, there are nine angle bins in a cell. To effectively reduce the proportion of accumulation deflection, a 32-bit register is utilized for the 2^20 magnification.
$$G_x(x, y)\tan(\theta_i) \le G_y(x, y) < G_x(x, y)\tan(\theta_{i+1}) \tag{5}$$
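To make the multiplier-only test of Eq. (5) concrete, the sketch below compares the gradient components against tan() values of the bin edges, pre-scaled by the 2^20 magnification, so no division or arctangent is needed. The folding of signed components into the unsigned 0–180° range is our reconstruction of the scheme.

```python
import math

SCALE = 1 << 20  # the 2^20 magnification held in a 32-bit register

# tan() of the first-quadrant bin edges 20..80 degrees, pre-scaled to integers;
# tan(90) is infinite and handled by falling through to the middle bin.
TAN_EDGES = [round(math.tan(math.radians(a)) * SCALE) for a in (20, 40, 60, 80)]

def orientation_bin(gx, gy):
    """20-degree bin index (0..8) of the unsigned gradient orientation,
    using only integer multiplies and compares as in Eq. (5)."""
    if gx < 0:                      # unsigned orientation: (gx,gy) ~ (-gx,-gy)
        gx, gy = -gx, -gy
    upper = gy >= 0                 # True: angle in [0,90], False: in (90,180)
    ay = abs(gy)
    j = 4                           # default: the bin containing 90 degrees
    for i, t in enumerate(TAN_EDGES):
        # Eq. (5) test against the scaled tan edge (boundary flips when mirrored)
        hit = ay * SCALE < gx * t if upper else ay * SCALE <= gx * t
        if hit:
            j = i
            break
    return j if upper else 8 - j    # mirror (90,180) onto bins 8..4
```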
Besides the three sub-modules in the gradient module, two types of sub-modules, BlockTo4Cells and Vote9DVector, are contained in the histogram module. The former sub-module judges which of the four cells each set of orientation bin and magnitude belongs to. It then combines the two signals, OrientationBin (4-bit) and Magnitude (9-bit), as illustrated in the uppermost part of Fig. 4, into a 13-bit signal. Finally, the former sub-module delivers the combined 13-bit signal to four parallel same-type latter sub-modules, Vote9Dvector1–Vote9Dvector4 as shown in Fig. 3, in order according to the four cells, with a 1-to-4 demultiplexer. Consequently, each Vote9Dvector sub-module accumulates each set of orientation bin and magnitude into a target cell. The circuit design of each Vote9Dvector sub-module is shown in Fig. 4.
3. Experiment results
3.1. Public image datasets
Two public image datasets, CBCL [11] and CVC [12], are selected for the training and testing stages of the system. The image frames used for the training stage are not used for the testing stage, in order to guarantee the effectiveness and persuasiveness of the test results, i.e., the detection rate and accuracy.
The two datasets naturally present distinct characteristics. After manual treatment, several hundred to a few thousand samples are selected for training, and other, different samples are used one by one for testing. The numbers of selected positive and negative samples are shown in Table 1. In total, 924 positive samples are selected from CBCL for the first experiment, listed in the CBCL rows of Table 1. Half of them are used in the training stage, and the other half in the testing stage. Because no negative samples exist in CBCL, 1038 negative samples are taken from CVC and combined with the first half of the 924 positive samples for training. Similarly, 1359 negative samples are taken from CVC and combined with the second half of the 924 positive samples for testing.
Besides, there are in total 3356 positive samples in CVC. For the second experiment, listed in the CVC rows of Table 1, 571 positive samples are picked out for training and another set of 571 different positive samples for testing. For preparing the negative samples from CVC, a total of 4096 negative samples are segmented out, from which 1326 negative samples are picked for the training stage, while 2048 different negative samples are selected for the testing stage.
3.2. HOG accelerator
The completed HOG accelerator, implemented as a Xilinx FPGA circuit module, reaches a peak frequency of 192 MHz. The number of cycles for computing one HOG can be formulated as #cycles = BlockWidth + BlockWidth * BlockHeight + 58, as checked below.
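Plugging the 16 × 16 block of Table 2 into this formula reproduces the reported latency, as the following check (our arithmetic) shows:

```python
def hog_cycles(block_w, block_h):
    # #cycles = BlockWidth + BlockWidth * BlockHeight + 58
    return block_w + block_w * block_h + 58

cycles = hog_cycles(16, 16)        # 16 + 256 + 58 = 330 cycles
t_ms = cycles / 192e6 * 1e3        # at the 192 MHz peak clock
print(cycles, round(t_ms, 6))      # 330, 0.001719 ms (Table 2: 0.001718 ms)
```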
In the gradient component module, the boundary of the block image for gradient convolution is processed with a shrinking manipulation. Moreover, the circuit computation in the ComponentsToMagnitude module uses integer rather than floating point values. These two factors account for the difference between the detection rates reached by the FPGA based HW/SW co-design on the embedded platform and by the execution of pure software on a PC.
Fig. 3. Block diagram for the architecture of our modular circuits for human detection: on the SMIMS MaCube embedded platform, the Colibri T20 (Linux software) side communicates over the Tegra 2 generic memory interface (GMI) data bus with the XC6SLX150T FPGA (hardware) side, which hosts the one-HOG vector generator module (full buffering, gradient module with GradientComponents/ComponentsToMagnitude/ComponentsToOrientation, histogram module with BlockTo4Cells and four Vote9DVector sub-modules, histogram PISO), dual-port read/write buffers (4096 x 16b), the system signal controller, and the SMIMS engine that controls and programs the FPGA over USB.
Fig. 4. One of the four parallel Vote9DVector sub-module circuits: a demultiplexer, selected by OrientationBin[3:0], routes the Magnitude[8:0] input to one of nine adder/register pairs, each accumulating one 1D-HOG component (Output1–Output9).
Table 1
The numbers of positive or negative samples selected from two public image datasets.

DataSet  Training/testing  # Pos. samples  # Neg. samples  Total #
CBCL     Training          462             1038 (CVC)      1500
CBCL     Testing           462             1359 (CVC)      1821
CVC      Training          571             1326            1897
CVC      Testing           571             2048            2619
Table 2
Comparison of computation time for one HOG.

Computation basis     Spec.                          OS     Speed for one HOG (rate)
PC (SW)               i7-3770/3.4 GHz, DDR3/8 GB     Win7   0.035393 ms (20.601)
ARM Colibri T20 (SW)  Tegra 2/1.0 GHz, DDR2/512 MB   Linux  0.122800 ms (71.478)
FPGA (HW)             Xilinx XC6SLX150T/192 MHz      None   0.001718 ms (1)
Based on the accuracy observations in Table 3, the experimental results reveal that the error of detection rate is kept within 0.1–0.4% for the CBCL and CVC datasets, respectively. The accuracy decreases of 0.1% (98.5% to 98.4%) and 0.4% (97.4% to 97.0%) are obtained from the last row of Table 3. The detailed detection rate analyses and statistics are described in the next sub-section. Before that, the comparison of the computing speed for one HOG is given in Table 2. The computation time of one HOG for a block of 16 * 16 pixels with the designed FPGA hardware circuit takes merely 0.001718 ms. That is, the proposed one-HOG hardware circuit is 20 times faster than the software computation on a PC and 71 times faster than the ARM CPU based software on the embedded platform. The computation time of the entire human detection system is further described in the following sub-section.
3.3. Detection rate and computation speed
Various measures can be utilized for calculating the detection rate. In this experiment, three statistical measures are applied to the experiments on the CBCL and CVC public datasets above: positive predictive value (PPV, or precision), true positive rate (TPR, or recall), and accuracy. PPV stands for the probability that a labeled (detected) human is a real human; TPR represents the proportion of all human images identified as human, i.e., the so-called detection rate (or recall); accuracy refers to the proportion of all humans and non-humans being correctly classified.
To compare executing pure software on the embedded HW/SW co-design platform with replacing the HOG module by the FPGA hardware circuit, the above detection rate measures are evaluated in our experiments. The experimental results are shown in Table 3, where the errors of PPV, TPR, and accuracy between the FPGA accelerated HOG computation and pure software run on the embedded platform or on a PC stay below 0.9%, 0.5%, and 0.4%, respectively. This shows that our HOG hardware computation gives high precision and introduces only small errors in the various detection rate measures of the whole system.
Furthermore, the computing time required for processing one detecting window at the testing stage is measured and compared, as shown in Table 4. The time for 22–33 iterations of HOG calculation is much larger than that for the SVM module. Apparently, designing iteratively used HOG hardware for machine learning based human detection is of higher importance and necessity than SVM hardware [10]. Besides, the time for human detection with iterative computation by the HOG hardware at the testing stage, for one detecting window, is effectively reduced to below 1 ms, namely 0.922 ms or 0.882 ms in Table 4. In other words, our FPGA based human detection system can deal with about 1080 detecting windows per second, as the arithmetic below shows. When 72 detecting windows are applied to each image frame, the system successfully achieves the real-time embedded system requirement of about 15 frames per second. It further turns out that the computing speed comparison per detecting window shown in Table 4 demonstrates our outcome and value more convincingly than the comparison per single HOG computation in Table 2.
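The window and frame rates quoted above follow directly from Table 4; a quick check (our arithmetic):

```python
window_ms = 0.922                    # total per detecting window (Table 4, CBCL)
windows_per_s = 1000.0 / window_ms   # ~1084, "about 1080" in the text
fps = windows_per_s / 72             # 72 detecting windows per frame
print(round(windows_per_s), round(fps, 1))   # 1084 windows/s, ~15.1 fps
```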
In comparison with Dalal's detector [1], based on the same CBCL and CVC datasets as shown in Table 1, our detector shows a slight performance lift. On average, the PPV, TPR, and accuracy of this system are 0.3%, 1.4%, and 0.4% better than Dalal's detector, respectively, as listed in Table 5.
3.4. Implementation on a real-time embedded platform
In this research, the real-time embedded development platform MaCube, consisting of an ARM module (Colibri T20) carrying an NVIDIA Tegra-2 dual-core Cortex-A9 microprocessor running at 2 × 1.0 GHz, is employed with the Linux OS. Meanwhile, the HW/SW integration environment is established using the Tegra-2 generic memory interface (GMI) data bus and a Xilinx Spartan-6 LX150T FPGA chip.

An interface engine IC is allocated between the ARM SoC and the FPGA; it is in charge of the procedure control between ARM and FPGA and able to download our designed circuit files to the FPGA. To accelerate the data access speed, DMA is used by the MaCube platform for moving data to/from the FPGA chip, which makes the HW/SW integration more efficient.
Table 3
Detection rate and accuracy for our embedded HW/SW co-design and pure software human detection systems.

                    CBCL                             CVC
H/S co-design       Pure SW   Embedded SW            Pure SW   Embedded SW
vs. SW                        with HOG/FPGA                    with HOG/FPGA
# Weak classifiers  33        23                     24        22
TP                  436       435                    512       509
TN                  1359      1358                   2039      2034
FP                  0         1                      9         14
FN                  26        27                     59        62
PPV                 100%      99.7%                  98.2%     97.3%
TPR                 94.3%     94.1%                  89.6%     89.1%
Accuracy            98.5%     98.4%                  97.4%     97.0%
Table 4
Comparison of computation efficiency per one detecting window.

DataSet  Configuration              # Weak classifiers  HOG (ms)  SVM (ms)  Total (ms)
CBCL     Embedded SW                33                  4.104     0.047     4.151
CBCL     Embedded SW with HOG/FPGA  23                  0.922     0.029     0.951
CVC      Embedded SW                24                  2.363     0.030     2.393
CVC      Embedded SW with HOG/FPGA  22                  0.882     0.028     0.910
Table 5
Detection rate comparison between Dalal's and our detector.

Detector  Measure   CBCL (SW) (%)  CVC (SW) (%)
Dalal     PPV       99.5           98.1
Dalal     TPR       93.1           88.0
Dalal     Accuracy  98.2           96.9
Ours      PPV       100            98.2
Ours      TPR       94.3           89.6
Ours      Accuracy  98.5           97.4
Fig. 5. Video demonstration of our FPGA based human detection system running on a real-time embedded platform. (a) and (b) Caviar video; (c) and (d) AVSS 2007 video; (e) and (f) our video.
Memory mapping is applied in the user program to correspond with the source end and the target end of the DMA data. The DMA transmission is 4096 × 16 bits at a time. The user program writes the data into the write buffer at the source end, calls the driver to start the DMA, and then transmits the data into the FPGA. Control is returned to the user program after the successful transmission. For the reverse direction of data transmission, i.e., reading data from the FPGA, the driver is first called to start the DMA so as to read the FPGA data and write them into the read buffer.
To demonstrate the final results of this study, videos of three different scenes, from the public Caviar video [13], the public AVSS 2007 video [14], and the self-shot video from our laboratory, are processed as the dynamic dataset experiments, as shown in Fig. 5, where an object inside a detecting window automatically detected as a human by the system is labeled with a red rectangle.
4. Conclusion
In order to leave a PC-based computing environment and move to an embedded platform, as well as to speed up the computation of a bottleneck module, the FPGA based HOG vector generator is successfully accomplished as a modular circuit design. The completed HOG accelerator contains two circuit modules, for gradient calculation and histogram accumulation. With the iteratively used HOG hardware, the accuracy change of the FPGA based human detection system in comparison with pure software stays merely within 0.4%, which shows that the accelerating design has a very small effect on the detection rate. Meanwhile, the completed FPGA based human detection system can process about 1075 detecting windows per second. In other words, it successfully achieves the requirement of a real-time embedded system of about 15 fps.
Acknowledgments
This research is partially sponsored under the projects MOST
103-2221-E-390-028-MY2 and NSC102-2221-E-390-026.
References
[1] N. Dalal, B. Triggs, Proc. IEEE Conf. Comput. Vision Pattern Recognit. 1 (2005) 886–893.
[2] D. Gerónimo, A.M. López, A.D. Sappa, T. Graf, IEEE Trans. Pattern Anal. Mach. Intell. 32 (2010) 1239–1258.
[3] Q. Zhu, M.-C. Yeh, K.-T. Cheng, S. Avidan, Proc. IEEE Conf. Comput. Vision Pattern Recognit. 2 (2006) 1491–1498.
[4] P.Y. Chen, C.C. Huang, C.Y. Lien, Y.H. Tsai, IEEE Trans. Intell. Transp. Syst. 15 (2) (2014) 656–662.
[5] S. Bauer, U. Brunsmann, S. Schlotterbeck-Macht, MPC Workshop (2009) 49–58.
[6] R. Kadota, H. Sugano, M. Hiromoto, H. Ochi, R. Miyamoto, Y. Nakamura, Proc. IIH-MSP, IEEE (2009) 1330–1333.
[7] P.Y. Hsiao, C.H. Chen, H. Wen, S.J. Chen, IEE Proc. Comput. Digit. Tech. 153 (4) (2006) 1871–1874.
[8] K.G. Gokhan, S. Afsar, Microprocess. Microsyst. 37 (3) (2013) 270–286.
[9] R.E. Fan, K.W. Chang, C.J. Hsieh, X.R. Wang, C.J. Lin, J. Machine Learning Research 9 (2008) 1871–1874, <http://www.csie.ntu.edu.tw/~cjlin/liblinear>.
[10] D. Anguita, A. Boni, S. Ridella, IEEE Trans. Neural Networks 14 (5) (2003) 993–1009.
[11] CBCL Pedestrian DB, <http://cbcl.mit.edu/software-datasets>.
[12] CVC Virtual Dataset, <http://www.cvc.uab.es/adas/databases>.
[13] CAVIAR (2001), <http://homepages.inf.ed.ac.uk/rbf/CAVIAR>.
[14] AVSS, <http://www.eecs.qmul.ac.uk/~andrea/avss2007_d.html>.