On comprehensive analysis of learning algorithms on pedestrian detection using shape features

Journal of Intelligent & Fuzzy Systems 35 (2018) 4807–4820
DOI:10.3233/JIFS-18491
IOS Press
4807
On comprehensive analysis of learning
algorithms on pedestrian detection
using shape features
Igi Ardiyanto∗
, Teguh Bharata Adji and Dika Akilla Asmaraman
Department of Electrical Engineering and Information Technology, Universitas Gadjah Mada,
Yogyakarta, Indonesia
Abstract. Despite the surge of deep learning, deploying the deep learning-based pedestrian detection into the real system
faces hurdles, mainly due to the huge resource usages. The classical feature-based detection system still becomes feasible
option. There have been many efforts to improve the performance of pedestrian detection system. Among many feature set,
Histogram of Oriented Gradient seems to be very effective for person detection. In this research, various machine learning
algorithms are investigated for person detection. Different machine learning algorithms are evaluated to obtain the optimal
accuracy and speed of the system.
Keywords: Pedestrian detection, machine learning, Histogram of Oriented Gradient, shape features
1. Introduction
Person detection system has been continuously
developed to get better performance. The system
plays an important role in the latest technological
developments, such as in Advanced Driving Assistant
System. The success rate of this system is determined
by its working accuracy and speed. Machine learning
and feature extaction are the main factor behind the
performance of the system. Many methods have been
applied, both using conventional and modern meth-
ods to achieve the best performance of the system.
The conventional detection method using supervised
machine learning and sliding window is preferred to
build a low cost system.
For many years, researchers have looked new ways
to improve the performance of person detection sys-
tem,bothonthefeaturedescriptorandonthemachine
∗Corresponding author. Igi Ardiyanto, Department of Electri-
cal Engineering and Information Technology, Universitas Gadjah
Mada, Jl. Grafika No. 2, Yogyakarta, Indonesia. E-mail: igi@ugm.
ac.id.
learning. For instance, Dalal et al. [1] had intro-
duced new feature set known as HOG (Histogram of
Oriented Gradient) that outperforms existing feature
sets, such as Haar wavelet, Shape Context and PCA-
SIFT (Principal Component Analysis-Scale Invariant
Feature Transform) for linear SVM (Support Vector
Machine) based person detection. The HOG feature
set then becomes very popular because of its superior
performance and is widely used for many detection
problems, including person detection.
Beside HOG, many feature sets, i.e. Haar wavelet
and SIFT also have been used for person detection.
For instance, [2, 3] use SVM classifier for person
detection based on Haar wavelet. In [4], AdaBoost
algorithm is used to learn a cascade of classifier based
on spatial temporal Haar wavelet. In [5], the authors
use recognition-by-component approach based on
Haar wavelets and SVM training. The authors in [6]
use two SVM classifiers and gradient magnitude fea-
tures to detect person in front, rear, and side views. In
[7], recognition-by-component approach is also used
for detection based on SIFT features and AdaBoost
1064-1246/18/$35.00 © 2018 – IOS Press and the authors. All rights reserved

4808 I. Ardiyanto et al. / On comprehensive analysis of learning algorithms on pedestrian detection
classifier. The authors in [8] use a neural network and
the gradient magnitude feature for person detection.
Some researchers also modified and combined
the existing methods to improve the performance of
person detection. The authors in [9] combine Haar
wavelets and Edge Orientation Histogram features
and classified them with AdaBoost classifier. In [10]
the authors use random forest to classify DOT (Dom-
inant Orientation Template) features, binary version
of HOG descriptor. The authors in [11] use AdaBoost
algorithm and cascading methods to segment pedes-
trian candidates and make recognizing with SVM
classifier. In [12] the authors proposed a modified
method for HOG, C-HOGC (Combination of HOG
Channels) to perform faster person detection. In very
recent research, such classical gradient features is
still employed with variant of machine learning other
than SVM, e.g. Adaboost [13, 14]. And lastly, Garcia
et al. [15] takes the combination of different sensors
into their approach.
New person detection methods continue to emerge
until now. This encourages some researchers to
perform performance analysis of various existing
methods to know which method works best for person
detection. For instance, the authors in [16] make com-
parison of SVR (Support Vector Regressor) adapted
to binary classification, SVM, and k-NN (k-Nearest
Neighbor) in person detection. In [17], SVM, k-NN,
and decision tree algorithm are used for overlapping
and non-overlapping person detection. There also
exists very recent approach for pedestrian detection,
i.e. using deep learning (e.g. [18]). The problem is,
deep learning method tends to eat huge amount of
resources, making them less deployable on the small
system. In this case, we now focus on the classical
approach which utilizes features and classifiers as its
main core algorithm.
In many researches, the performance of machine
learning is only observed by its accuracy without con-
sidering its processing speed to be deployed in the
real system. Therefore, this research aims to provide
better comparison of machine learning performance
both in terms of accuracy and processing speed. Our
contribution is crystal clear, making a comprehensive
analysis toward learning algorithm used for pedes-
trian detection.
In this research, basic Linear SVM [19], random
forest [20], ERT (Extremely Randomized Tree) [21],
AdaBoost [22, 23], and K-NN are used to classify
HOG features for person detection. Accuracy and
processing speed of the machine learning classifier
are measured and used to calculate the performance
score. The performance score of the machine learn-
ing classifiers are then compared to find out which
machine learning has the best performance in the per-
son detection. The research also focuses on obtaining
optimal value of machine learning parameters. Some
tests and measurement using dataset and stream-
ing frames are observed. The results of the studies
show the effect of varying the value of machine
learning parameters on the performance and provide
performance comparison of several commonly used
machine learning.
2. Basic theory
2.1. Histogram of Oriented Gradient (HOG)
Histogram of Oriented Gradient [1] is descriptor
that extracts image features from calculation of gra-
dient vector of pixels and distributes them to gradient
histograms. This descriptor introduced by Navneet
Dalal and Bill Triggs has exceled performance espe-
cially for person detection compared by preceding
descriptors based on edge and gradient of image. In
general, this descriptor works by dividing images
window into some small spatial regions, called as
“cell”, for each cell accumulating a local 1-D his-
togram of gradient directions or edge orientations
over the pixels of the cell. The combined histogram
then entries form representation. For better invariance
to illumination, shadowing, etc., contrast- normalized
is done by normalize histogram features for every
larger spatial regions, called as “block cell”.
In this research, the histogram contains 9 bins
corresponding to angles 0, 20, 40, . . . , 160. Each
block cell contains 4 cells that each has 9 values of
histogram as features. Each block cell contain 36 fea-
tures vector that will be normalized. In 48 × 96 pixels
window, there are 3 × 6 block cell (no overlapping)
that each contains 36 features, so the output of the
HOG descriptor in this research is 648 features. The
process of each cell in this descriptor is illustrated by
Fig. 1.
2.2. Machine Learning Algorithm for Pedestrian
Detection
Machine learning is a subfield of computer sci-
ence that gives computer ability to learn without
being explicitly programmed [20]. Machine learn-
ing focus on the development of computer programs
to learn data and then provide processing function

I. Ardiyanto et al. / On comprehensive analysis of learning algorithms on pedestrian detection 4809
Fig. 1. HOG Descriptor Cell Processing.
to new data. Machine learning algorithms are often
categorized as being supervised or unsupervised.
Supervised machine learning learns by knowing first
the desired input and output data and then creat-
ing some mathematical functions from the data. This
research uses 5 commonly used supervised machine
learning algorithms to classify HOG features for per-
son detection.
2.2.1. Support Vector Machine (SVM)
The original Support Vector Machine algorithm
[19] was invented by Vladimir N. Vapnik and Alexey
Ya. Chervonenkis in 1963. The algorithm is based on
finding hyperplane that gives the largest minimum
distance to the training examples The notation used
to define linear hyperplane is shown in eq. 1, where
variable w, x, and b are weight vector, input vector,
and weight bias respectively.
f(x) = wT
x + b. (1)
By using the result of geometry, the orthogonal
distance between vector x and hyperplane can be
derived equal to |f(x)|
w . If the training examples clos-
est to hyperplane, called as support vectors, are set to
have f(x) equal to 1, the distance between two oppo-
site support vectors orthogonal to the hyperplane,
or called as hyperplane margin, can be calculated
equal to 2
w . The problem of maximizing mar-
gin is equivalent to the problem of minimizing
a function L(w) subject to some constraints. The
constraints model for the hyperplane to classify
all training examples correctly can be formulated
by eq. 2, where variable yi is label of ith train-
ing examples. The problem of this optimization can
be solved using Lagrange multipliers to obtain the
weight vector w and bias b of the optimal hyper-
plane.
min
w,b
L(w) =
1
2
w 2
, s.t. yif(xi) ≥ 1, ∀i. (2)
If SVM algorithm allows some misclassifications
in training, known as soft margin SVM, the con-
straints model of hyperplane can be formulated by eq.
3, where C is regularization parameter and ζi is dis-
tanceerrortothehyperplaneandequalto1 − yif(xi).
The minimum value of the model is also can be
found by Lagrangian multiplier with set value of
C parameter. C parameter will determine the mar-
gin of SVM. SVM with higher C value tends to
make narrower margin. In the process of finding
the hyperplane, the value of C will be multiplied
by training examples distance error ζi. SVM with
lower C value is more indifferent to misclassificar-
tion error and tends to make hyperplane with wider
margin.
min
w,b
L(w) =
1
2
w 2
+ C
N
i=0
ζi, s.t. yif(xi) ≥ 1 − ζi, ∀i.
(3)

2.2.2. Random forest
Random forest algorithm [20] has been introduced
by Leo Breiman. Random Forest is a collection
(ensemble) of tree predictor or decision tree. In gen-
eral, the random forest algorithm works by taking the
input feature vectors at set random values, classifying
it with every tree. The output of the classifier is the
class label that received the majority of votes from all
trees. In training, random forest uses bootstrap pro-
cedure to avoid over-fitting so all the trees are trained
on different training set. Each tree samples N training
examples randomly from N training examples of the
training set with replacement. From the N training
examples with M input variables, each tree only uses
m input variables selected randomly. The N training
examples with m input variables are then used for
growing the tree to the largest extent possible. Each
node of the tree finds the best split based on gini index
of all possible splits of all features. For example, if
the node a is split into two child nodes, node b and
node c, and let variable nx, nxpos, nxneg be the num-
ber of training examples, number of positive training
examples, and number of negative training examples
in node x respectively, gini index of split Ig in the
node a can be calculated by eq. 4. Higher gini index
of split indicates better split.
I(x) = 1 −
nxpos
nx
−
nxneg
nx
Ig(a) = I(a) −
nb
na
I(b) −
nc
na
I(c)
(4)
The accuracy of random forest is determined by the
two things, i.e. the correlation between any two trees
and the strength of each individual tree. Increasing
the number of random selected features will increase
the correlation between any two trees and the strength
of each individual tree. Increasing the correlation will
increase the random forest error rate, while increasing
individual tree strength will decrease the random for-
est error rate. The optimal number of random selected
features can be obtained by performance test
2.2.3. AdaBoost
AdaBoost or Adaptive Boosting [22] is introduced
in 1995 by Yoav Freund and Robert Schaphire. It is
a boosting technique that combines multiple weak
classifiers into a single strong classifier. AdaBoost
works by assigning weight to each training examples.
After training a classifier, AdaBoost will increase the
weight on the misclassified examples and reduce the
weight on classified example, so the next classifier
will classify them differently. After each classifier is
trained, the classifier weight is calculated based on its
accuracy; accurate classifier are given more weight,
classifier with 50% accuracy is given a weight of zero,
and classifier with less than 50% accuracy is given
negative weight. The final output of AdaBoost, H(x)
is the signum function of linear combination of all
the weak classifier, shown in eq. 5, where variable T,
αt, ht(x), and x are number of weak classifier, weight
applied to tth classifier, output of tth weak classifier,
and vector of new example respectively.
H(x) = sign
T
t=1
αtht(x) (5)
First classifier (t = 1) is trained with equal training
example weight (= 1
m ), for all m training examples.
The weight of the classifier then can be computed by
formula shown in eq. 6, where t is the total weights
of misclassification example over the total weights of
the training set divided by the training set size.
αt =
1
2
ln
1 − t
t
(6)
After computing αt, the training examples weight
are updated using formula shown in eq. 7, where yi
denotes the desired output and Zt is the normalization
constant so the sum of the new weights of training
example is equal to 1.
Dt+1(i) =
Dt(i) exp(−αtyiht(xi))
Zt
(7)
2.2.4. Extremely randomized trees
Extremely randomized trees [21] have been intro-
duced by Pierre Geurts, Damien Ernst, and Louis
Wehenkel in 2006. The algorithm is almost similar
to random forest algorithm. Unlike random forest
algorithm, this algorithm does not apply bootstrap
procedure, so every tree is trained with the same train-
ing set. The other difference is the algorithm pick a
node split randomly. In random forest, gini index of
every possible split in every variable is evaluated. The
split with the highest score is chosen for splitting the
node. In ERT, split value is a random value between
the smallest and the highest value in each variable.
The score of random splits in every variable are then
evaluated based on gini index also. The split with
highest score is chosen for splitting the node.

Fig. 2. Proposed block diagram for pedestrian detection.
2.2.5. k-Nearest Neighbor (k-NN)
k-NNisoneofthesimplestalgorithmsavailablefor
supervised learning. The idea is to search for closest
match of the test data in feature space. The training
examples features are stored for later use to calcu-
late feature space distance to the new example. The
research uses Euclidean distance formula to calcu-
late the feature space distance of new example to
each training examples. The output is the majority
class label of the training examples with k shortest
distance to the new example.
2.3. Sliding Windows for Detection
Slidingwindowisaconventionalmethodforobject
detection that works by moving detection window
horizontally and vertically along the processed frame.
Image in detection window size is cropped from the
processed frame at every window position and pro-
cessed one by one. The detection window must have
the same size with the training images. The HOG fea-
ture set is then extracted from each cropped region
and classified by the machine learning classifier. If
the classifier gives positive result, bounding box at
window position will be displayed on the processed
frame.
To perform multi scale detection, the processed
frame must be resized multiple times, so the detection
window can classify small and large objects. In multi
scale detection, many bounding boxes will appear
overlap each other. NMS (Non-Maxima Suppression)
algorithm must be applied to solve the problem. The
algorithm works by choosing one bounding box that
has highest classifier confidence score from among
the multiple overlapping boxes. The two bounding
boxes will be considered as overlapping box if the
overlapping area is more than half of the area of one
of the overlapping boxes.
3. Methodology
Our work on the pedestrian detection is divided
intotwoparts,trainingandtesting.Theblockdiagram
of this process is shown in Fig. 2. In training part,
person images (Fig. 3a) and non-person images (Fig.
3b) are first resized to 48 × 96 pixels size.
After pre-processing, HOG feature set is extracted
from each image. The feature sets and the ground
truth values of the training images are then used to
train the machine learning approach.
In dataset testing, new images are also resized
to 48 × 96 pixels size. The HOG feature set of the
imagesarethenextracted.Thefeaturesetoftheimage
is classified by the machine learning classifier and
compared to the image ground truth. In video testing,
the frames are resized multiple times. HOG feature
extraction and detection are done while applying slid-
ing window method. The 48 × 96 pixels image are
cropped at every position of detection window and
the HOG feature set are extracted. NMS algorithm
is then applied after sliding window process is com-
plete. The bounding boxes are then displayed on the
processed frame and visually observed.
4. Results and discussions
In this research, system performance is observed
in several detection tests. First, the classifier per-
formance with various parameter values in dataset
testing are observed to get the optimal value of

Fig. 3. Pedestrian dataset: (a) positive images, and (b) negative images.
parameter. The classifier performance with the opti-
mal value of parameters is then compared in 5-cross
validation. The precision and recall of machine lean-
ings are also observed in video test. The results show
the superiority of SVM performance in person detec-
tion using HOG features compared with the other
algorithms.
4.1. Tools and Materials
Tools used in the research are as follows:
1. Computer with specification Windows 7 oper-
ating system, processor Intel(R) Core(TM) i3
CPU M370 @2.4 GHz memory 2,00 GB.
2. OpenCV Library 2.4.9.
Material in the form of person and non-person
images is obtained from INRIAPerson, MIT, Penn-
sylvania dataset, and some videos of ours. The image
numbers of each dataset will be explained in each test
below.
It is also important to be noted that the provided
computation time for the performance measurements
in all experiments are restricted relatively to the above
system setup. Any other system may yield different
timing.
4.2. Effect of Varying the Value of Parameters on
Machine Learning Classifier Performance
To show the effect of varying the value of machine
learning parameters on the performance, the accu-
racy, processing time, and performance score of
machine learning classifier are measured. Accuracy is
measured by area under ROC curve, time by average
processing time per image, and performance score
by subtraction of min-max normalized accuracy and
speed value. Accuracy are min-max normalized from
[0.9, 1] to [0, 1] and speed value from [0, 0.05] to
[0, 1]. In this test, there are two cases with differ-
ent training and testing data composition; the case A
Table 1
Effect of Varying Parameter on Linear SVM Performance
Case A
C AUC time (s) score
0.01 0.9949 9.89e-07 0.9488
0.1 0.9951 9.73e-07 0.9508
1 0.9928 9.71e-07 0.9278
Case B
C AUC time (s) score
0.01 0.9953 9.84e-07 0.9528
0.1 0.9955 9.71e-07 0.9551
1 0.9915 9.76e-07 0.9148
uses training set containing 8K/16K positive/negative
images and testing set containing 1500/3K posi-
tive/negative images. The case B uses training set
containing8K/32Kpositive/negativeimagesandtest-
ing set containing 1500/6K positive/negative images.
4.2.1. Support Vector Machine
In this test, only the value of parameter regulariza-
tion (C) is varied. The value of C is set to 0.01, 0.1,
and 1. SVM does not use any kernel in this research
because linear SVM has been able to do classifica-
tion well for this case. The result of the test for case A
and case B are displayed on Table 1. The result of the
test shows that SVM gives very fast processing time
and very high accuracy. The time here is the aver-
age of processing time per image in second. SVM
has very simple computation resulting very fast pro-
cessing time. Table 1. shows no major difference of
SVM performance for case A and case B. The result
also shows that SVM with C = 0.1 give the best per-
formance in the accuracy and the performance score.
The C value determines the width of SVM hyper-
plane margin. The higher the C value, the wider the
margin. SVM that has too narrow hyperplane margin
will give many misclassifications in high variance
testing data, whereas too wide margin SVM will
misclassify many training examples as error. The lat-
ter will also give many misclassifications in testing
data if the set contains many examples similar to the

Fig. 4. Effect of Varying The Value of Parameters on Random Forest Performance.
misclassified training examples. The good C value
must be obtained by varying and observing the C
value in some range or one can calculate it adap-
tively. In our case, the chosen C value is 0.1, based
on the observation to its range we have mentioned
in the beginning. Such C value may be different for
different classification case.
4.2.2. Random Forest
In this test, the number of random selected fea-
tures, number of trees, and split depth level of random
forest classifier are varied. The results are shown in
Fig. 4. In the graphs, the blue lines, the red lines,
and the green lines represent random forest with 50,
100, and 200 random selected features respectively.
The dotted lines are test results for case A and the
continuous lines are test results for case B. Figure
4a shows that increasing the number of trees will
increase accuracy to the convergence value. In this
test, the split depth level of the tree is set to maximum
number. The combination of more weak classifiers
tends to make stronger classifier. In a certain num-
ber of trees, the correlation between the trees reach
optimal value, so increasing the number of trees will
no more increase accuracy. In case A and case B,
The result shows that random forest with 100 random
selected features give the highest accuracy (in the
range 0.986 to 0.988). Random forest with too high
number of random selected features will have too cor-
related trees. Many trees will work the same way
in classification, so random forest becomes ineffi-
cient and will have bad performance. Random forest
with too small number of random selected features
will have very weak trees. More number of random
selected features in random forest will give each tree
higher probability to make better split in each node
resulting a stronger tree.
Figure 4b shows the effect of increasing num-
ber of trees to the average processing time per
image. Time scale here is in 10 base logarithmic
scales. The graph shows that increasing the num-
ber of the trees will increase the processing time.
Each tree in random forest will take different time
for classification, so increasing the number of trees
will increase time needed to finish all tree classi-
fications. The result also shows that random forest
with more number of selected features has faster
processing time. Random forest with more random
selected features tends to give tree better split to
separate the two classes completely. The number of
nodes of the tree will be smaller and result faster
classification.

Figure 4c shows the performance score of random
forest classifier in the effect of varying the number of
trees. The result shows the optimal number of tree is
in the range 80 to 140. Increasing the number of tree
at the certain number will increase accuracy very lit-
tle or no more, but the time keeps increasing in fixed
amounts, so the performance score will decrease. Fig-
ure 4d and 4e shows the effect of split depth level on
the random forest accuracy and processing time. In
this test number of trees is set to 300. Split depth
level here is a limitation of split level of the trees.
The result shows that increasing the split depth level
tends to increase accuracy and processing time to the
convergence value. Higher level node contains more
classified training examples, so the split in that level
become less important and less influential. Decision
tree with high split depth level tends to overfit the
training examples, but it is hard to see the effect on
the random forest performance. Figure 4f shows the
performance score of random forest in the effect of
changing the split depth level. The result shows that
the effective split level of random forest is in the range
10 to 12. This parameter is highly dependent on the
tree structures of random forest so different random
forest may result different optimal value.
4.2.3. Extremely Randomized Tree (ERT)
In this test, the number of random selected fea-
tures, number of trees, and split depth level of ERT
classifier are varied. The results are shown in Fig. 5.
In the graphs, the blue lines, the red lines, and the
green lines represent ERT with 50, 100, and 200 ran-
dom selected features respectively. The dotted lines
are the test results for case A and the continuous lines
are the test results for case B. Figure 5a shows that
increasing number of trees will increase ERT accu-
racy to the convergence value. In this test, the split
depth level of the tree is set to maximum number.
The graph shows almost the same result with Fig-
ure 4a. Increasing number of trees will also make
ERT stronger classifier. The result shows that ERT
with 200 random selected features give the high-
est accuracy up to 0.988. Correlation between the
trees in ERT is lower than correlation between the
trees in random forest. Two trees in ERT with simi-
lar random selected features may still have different
classification pattern because the split value in ERT is
chosen randomly. This makes ERT with high number
of random selected features can still have good accu-
racy. Nevertheless, increasing the number of random
selected features too much will also make ERT trees
too correlated resulting bad performance.
Figure 5b shows the effect of increasing number of
trees on the average ERT processing time per image.
Time scale here is in 10 base logarithmic scales. The
graph shows almost the same results with Figure 4b.
Increasing the number of the trees will increase the
processing time. ERT with more number of selected
features have faster processing time. This is the same
case with random forest. ERT also chooses the best
split among the random split of all features, so ERT
with more number of random selected features tends
to make better split in each node and results faster
classification. In ERT, there is a considerable amount
of time difference between case A and case B. ERT
takes longer time than random forest to separate train-
ing examples classes completely. The split quality in
ERT is much lower than random forest split. ERT just
pick random value for every feature and does not eval-
uate all possible splits, so increasing the number of
training examples will increase the processing time
more significantly. Figure 5c shows that ERT with
200 features has the highest performance score. Fig-
ure 5d and 5e shows the effect of split depth level on
the accuracy and processing time of ERT. In this test
number of trees is set to 300. Split depth level here
is a limitation of split level of all trees. The result
shows that increasing the split depth level tends to
increase accuracy and processing time to the conver-
gence value. Figure 5f shows the performance score
of ERT in the effect of changing the split depth level.
The result shows different optimal split level in dif-
ferent case and different parameter value.
4.2.4. AdaBoost
In this test, number of trees and split depth of
AdaBoost classifier are varied. The results are shown
in Fig. 6. In the graphs, the blue lines, the red lines,
and the green lines represent AdaBoost with 2, 5, and
10 split depth levels respectively. The dotted lines are
the test results for case A and the continuous lines are
the test results for case B.
Figure 6a shows that increasing the number of
trees in AdaBoost will also increase accuracy to the
convergence value. This is the same case with RF
and ERT that the more combination of weak classi-
fier will result stronger classifier. The result shows
that AdaBoost with 5 split depth level has the best
accuracy (in the range 0.986 to 0.988). All trees in
AdaBoost except the first, will classify the train-
ing examples that are weighted adaptively based on
previous tree misclassification. The tree classifiers
then are also weighted based on their each perfor-
mance. The more accurate tree will be assigned a

Fig. 5. Effect of Varying the Value of Parameter on Extremely Randomized Tree Performance.
Fig. 6. Effect of Varying the Value of Parameters on AdaBoost Performance.
higher weight. AdaBoost with too low split level
may result too many misclassiﬁcation error, so the
training examples weights will be changed too fast.
In high variance training examples, AdaBoost with
very low split depth level may have most of the
trees having very low accuracy and assigned very
low weights. This combination of poor trees will
result the bad performance of AdaBoost. In con-
trary, AdaBoost with too high split level will result
too few misclassiﬁcation error, so the training exam-
ples weights will be changed too slow. This will also
result highly correlated trees and bad performance in
AdaBoost.
Figure 6b shows the effect of increasing the
number of trees on the average AdaBoost processing
time per image. Time scale here is in 10 base
logarithmic scales. The graph shows almost the
same result with random forest and ERT. Increasing
the number of trees will increase computation time.
The result also shows that AdaBoost with more split
depth level take more processing time. This is also
the same case with the random forest and ERT that

Fig. 7. Effect of Varying the Value of Parameters on k-NN Performance.
increasing the split depth level will increase the
processing time. The processing time of AdaBoost
for case A and case B are the same because the trees
still have the same tree depth split level although
they have different number of training examples.
Figure 6c shows that AdaBoost with 5 split depth
level has the highest performance score.
4.2.5. k-Nearest Neighbor
In the test, number of features used to calcu-
late the distance and number of training examples
in k-NN are varied. The results are shown in Fig.
7. In the graphs, the dark blue lines, the magenta
lines, the green lines, the purple lines, the light
blue lines, and the brown lines represent the k-NN
using only 50, 100, 150, 200, 250, and 300 fea-
tures respectively. In this test k-NN does not use all
HOG features for computing the distance, but only
take some features that have highest gini index of
split. The k value is set to square root of the num-
ber of training examples. Figures 7a and 7d show
the effect of varying the number of training example
on k-NN accuracy for case A and case B respec-
tively. In the test, k-NN training examples are scale
down by the factor of 2 from 1, 1/2, 1/4, . . . , 1/128
of the number of training examples. The results
show that reducing the number of training examples
and will reduce the k-NN accuracy. Reducing the
number of training examples will reduce the vari-
ance of training examples and result less accurate
classification. The results also show that k-NN clas-
sifiers that use 50 and 100 features give the highest
accuracy.
Figure 7b and 7e show the effect of varying the
number of training example on the k-NN processing
speed. The results show that reducing the train-
ing examples will reduce considerable amount of
processing time value. k-NN works by measuring
distance between the new example and all train-
ing examples. k-NN with more number of training
examples will take longer time in classification. The
results also show that k-NN that uses fewer fea-
tures have faster processing speed. Figure 7c and 7f
show that k-NN that uses 50 features has the best
performance for both case A and case B. Reduc-
ing the number of less important feature in k-NN
reduces the processing time significantly resulting
better performance score. The optimal number of

Table 2
Cross Validation Test
Fold Measured Machine Learning Classifier
SVM RF ERT ADA k-NN
k=1 AUC 0.9880 0.9783 0.9789 0,9897 0,9851
time (s) 1.20e-06 9.10e-05 9.00e-05 1.60e-05 1.50e-04
score 0.8796 0.7646 0.7705 0.8805 0.8218
k=2 AUC 0.9927 0.9849 0.9858 0.9945 0.9818
time (s) 1.00e-06 6.70e-05 7.70e-05 1.30e-05 3.80e-04
score 0.9262 0.8361 0.8425 0.9322 0.7431
k=3 AUC 0.9965 0.9876 0.9893 0.9969 0.9771
time (s) 1.00e-06 4.90e-05 7.10e-05 9.20e-06 3.80e-04
score 0.9643 0.8658 0.8784 0.9602 0.6964
k=4 AUC 0.9964 0.9893 0.9915 0.9969 0.9799
time (s) 9.80e-07 6.10e-05 9.20e-05 1.00e-05 3.80e-04
score 0.963 0.8812 0.8961 0.9591 0.7232
k=5 AUC 0.9955 0.9864 0.9889 0.9964 0.9776
time (s) 1.00e-06 6.00e-05 9.30e-05 1.30e-05 3.80e-04
score 0.9542 0.8518 0.8707 0.9506 0.7005
Mean AUC 0.9938 0.9853 0.9869 0.9949 0.9803
time (s) 1.00e-06 6.60e-05 8.50e-05 1.20e-05 3.30e-04
score 0.9375 0.8399 0.8516 0.9365 0.737
training examples here is 1/64 of the original number
of training examples. Reducing this parameter will
reduce the accuracy slightly, but reduce the process-
ing time significantly.
4.3. Performance Comparison of Machine
Learning in Dataset Test
Performance of machine learning in this sec-
tion is measured in 5-cross validation test using
8K/80K positive/negative training examples. In this
test, parameter of machine learning classifiers is set
to the optimal value obtained in previous test. The C
parameter of SVM is set to 0.1. Random forest is set
to have 100 random selected features. ERT is set to
have 200 random selected features. AdaBoost is set
to have 5 level of split depth. k-NN is set to only use
50 features for measuring the distance. Other param-
eters are varied and selected the best. The result of
the test is shown in Table 2.
The result shows that linear SVM has the best
performance among the other classifiers. SVM has
advantages in high-dimensional data classification.
HOG is very effective feature set for person detec-
tion. It gives almost linear separable features that
SVM can classify the two classes very well (up to
98% accuracy in training examples). SVM also has
very fast processing time The second classifier that
has the best performance is AdaBoost. AdaBoost has
accuracy as high as SVM accuracy, but with higher
processing time. AdaBoost consists of many adaptive
learning trees combined into very strong classifier.
Fig. 8. Performance of machine learning classifications on video
test.
The third and the forth are ERT and random forest.
ERT has higher accuracy but also higher processing
time than random forest. ERT trees have lower cor-
relation than random forest trees, so the combination
of trees is better and gives higher accuracy. Random
forest has better node split than ERT resulting faster
classification time. Random forest and ERT may have
superior performance if they are used for classifi-
cation of very large low-dimensional data. Random
forest and ERT also can be used for more than two
classes classification that single SVM and AdaBoost
cannot. k-NN has the worst performance among the
classifiers in this case. k-NN is very slow for pro-
cessing the large data. Reducing the training data in
k-NN will reduce the processing time but also reduce
the accuracy. k-NN may be suitable for multi-class

Fig. 9. Pedestrian detection performance on frame test.
classification that does not necessarily require fast
processing.
4.4. Performance Comparison of Machine
Learning in Video Test
In this test, machine learning classifiers are used
to perform person detection on video containing high
variance and very unbalanced-class data. The video
consists of 260 frames in size of 640x480 pixels.
The frames are resized into 7 scales to make multi-
scale detection. The machine learnings in this test
are trained by 9500/75K positive/negative training
examples. In the test, C parameter of SVM is set
to to 0.1. Random forest is set to have 100 ran-
dom selected features, 150 trees, and maximum split
depth level. ERT is set to have 200 random selected
features, 150 trees and maximum split depth level.
AdaBoost is set to have 5 level of split depth and
300 trees. k-NN is set to use 50 features and 1/128
of the number of training example. The performance
of machine learning classifier is plot in the precision-

recall graph shown in Fig. 8. The graph shows that
SVM and AdaBoost have the highest accuracy in this
test. SVM separates the features of the two classes
with quite big margin resulting robust performance
in various data condition. The processing speed of
SVM in this test varies from 10.49 FPS to 11.11
FPS.
The performance of tree-based classifier usually
decreases when dealing with very unbalanced-class
data. Tree classifier tends to overfit the training data.
Training data with more unbalanced class will make
tree classifier better in classifying the major class
but worse in classifying the minor class. Although
AdaBoost is combination of many tree classifiers,
the performance is not significantly affected by the
classunbalance.AdaBoosttreesaregrownwithadap-
tively weighted data and limited split level, so they
rarely overfit the data. AdaBoost classifier with 300
trees in this test runs at 8.44 FPS to 8.90 FPS. ERT
and random forest performance also does not change
significantly in this test. The classifiers in this test
are trained with training data that have very unbal-
anced class as well so their performance does not
significantly change dealing with the condition. ERT
runs at 3.56 to 4.15 FPS and random forest runs
faster at 5.28 to 5.82 FPS. k-NN has the lowest
accuracy in this test. k-NN in this test use only 50
features and 1/128 of the number of training exam-
ples to support its speed. This make the training data
become less variance resulting low accuracy perfor-
mance. Even with these limitations, k-NN still runs
slow at 3.23 to 3.70 FPS. Classification result of
some machine learning in the frame test is shown in
Fig. 9.
5. Conclusions
Among the compared machine learning algo-
rithms, SVM has the best performance on person
detection using HOG features. HOG feature set gives
almost linear separable features that linear SVM can
work best. The optimal value of machine learning
parameters can be obtained by varying the value of
parameters and observing the performance in dataset
test. In detection with high-spec computer, extracting
HOG features with larger window size, i.e. 64 × 128
pixels and overlapping block cells may increase the
overall performance of the classifiers. Larger and
more variance training data may be required for better
detection. Tracking algorithm may also be required
to make the detection more stable.
References
[1] N. Dalal and B. Triggs, Histograms of oriented gradi-
ents for human detection, in 2005 IEEE Computer Society
Conference on Computer Vision and Pattern Recognition
(CVPR’05), vol. 1, 2005, pp. 886–893.
[2] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna and T. Poggio,
Pedestrian detection using wavelet templates, in Proceed-
ings of IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, 1997, pp. 193–199.
[3] C. Papageorgiou and T. Poggio, Trainable pedestrian detec-
tion, in Proceedings 1999 International Conference on
Image Processing (Cat. 99CH36348), vol. 4, 1999, pp.
35–39.
[4] P. Viola, M.J. Jones and D. Snow, Detecting pedestrians
using patterns of motion and appearance, in Proceedings
Ninth IEEE International Conference on Computer Vision,
vol. 2, 2003, pp. 734–741.
[5] A. Mohan, C. Papageorgiou and T. Poggio, Example based
object detection in images by components, IEEE Trans Pat-
tern Anal and Machine Intell 23 (2001), 349–361.
[6] G. Grubb, A. Zelinsky, L. Nilsson and M. Rilbe, 3d vision
sensing for improved pedestrian safety, in IEEE Intelligent
Vehicles Symposium, 2004, pp. 19–24.
[7] A. Shashua, Y. Gdalyahu and G. Hayun, Pedestrian
detection for driving assistance systems: Single-frame clas-
sification and system level performance, in IEEE Intelligent
Vehicles Symposium, 2004, pp. 1–6.
[8] L. Zhao and C.E. Thorpe, Stereo- and neural network-based
pedestrian detection, IEEE Transactions on Intelligent
Transportation Systems, vol. 1, 2000, pp. 148–154.
[9] D. Geronimo, A.D. Shappa, A. Lopez and D. Ponsa, Pedes-
trian detection using adaboost leaning of features and
vehicle pitch estimation, in The Sixth IASTED Interntional
Conference, Visualization, Imaging, and Image Processing,
2006, pp. 400–405.
[10] D. Tang, Y. Liu and T. Kim, Fast pedestrian detection by cas-
caded random forest with dominant orientation templates,
in Proceeding of the British Machine Vision Conference
(BMCV2012), 2012, pp. 1–11.
[11] R.A. Kharjul, V.K. Tungar, Y.P. Kulkarni, S.K. Upadhyay
and R. Shirsath, Real-time pedestrian detection using svm
and adaboost, in 2015 International Conference on Energy
Systems and Applications, 2015, pp. 740–743.
[12] L. Weixing, S. Haijun, P. Feng, G. Qi and Q. Bin, A fast
pedestrian detection via modified hog feature, in 2015 34th
Chinese Control Conference (CCC), 2015, pp. 3870–3873.
[13] L. Guo, P.-S. Ge, M.-H. Zhang, L.-H. Li and Y.-B. Zhao,
Pedestrian detection for intelligent transportation systems
combining adaboost algorithm and support vector machine,
Expert Systems with Applications 39(4) (2012), 4274 –4286.
[14] V.-D. Hoang, M.-H. Le and K.-H. Jo, Hybrid cascade
boosting machine using variant scale blocks based hog fea-
tures for pedestrian detection, Neurocomputing 135 (2014),
357 –366.
[15] F. Garcia, J. Garcia, A. Ponz, A. de la Escalera and
J.M. Armingol, Context aided pedestrian detection for
danger estimation based on laser scanner and computer
vision, Expert Systems with Applications 41(15) (2014),
6646–6661.
[16] M. Errami and M. Rziza, Improving pedestrian detection
using support vector regression, in 2016 13th International
Conference on Computer Graphics, Imaging and Visualiza-
tion (CGiV), 2016, pp. 156–160.

[17] B. Amirgaliyev, K. Perizat and C. Kenshimov, Pedestrian
detection algorithm for overlapping and non-overlapping
conditions, in 2015 Twelve International Conference on
Electronics Computer and Computation (ICECCO), 2015,
pp. 1–4.
[18] S. Ren, K. He, R. Girshick and J. Sun, Faster r-cnn: Towards
real-time object detection with region proposal networks,
in Advances in Neural Information Processing Systems 28
(C. Cortes, N.D. Lawrence, D.D. Lee, M. Sugiyama and R.
Garnett, eds.), Curran Associates, Inc., 2015, pp. 91–99.
[19] V.N. Vapnik, Statistical Learning Theory. Wiley-Inter-
science, 1998.
[20] L. Breiman, Random forests, Machine Learning 45 (2001),
5–32.
[21] P. Geurts, D. Ernst and L. Wehenkel, Extremely randomized
trees, Machine Learning 63 (2006), 3–42.
[22] Y. Freund and R.E. Schapire, A decision-theoretic general-
ization of on-line learning and an application to boosting,
Journal of Computer and System Sciences 55(1) (1997),
119–139.
[23] R.E. Schapire, The Boosting Approach to Machine Learn-
ing: An Overview, New York, NY: Springer New York,
2003, pp. 149–171.

On comprehensive analysis of learning algorithms on pedestrian detection using shape features

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to On comprehensive analysis of learning algorithms on pedestrian detection using shape features

Similar to On comprehensive analysis of learning algorithms on pedestrian detection using shape features (20)

More from UniversitasGadjahMada

More from UniversitasGadjahMada (20)

Recently uploaded

Recently uploaded (20)

On comprehensive analysis of learning algorithms on pedestrian detection using shape features