A parsimonious SVM model selection criterion for classiﬁcation of real-world data sets via an adaptive population-based algorithm

ORIGINAL ARTICLE
A parsimonious SVM model selection criterion for classification
of real-world data sets via an adaptive population-based algorithm
Omid Naghash Almasi1 • Mohammad Hassan Khooban2
Received: 30 November 2016 / Accepted: 1 March 2017
Ó The Natural Computing Applications Forum 2017
Abstract This paper proposes and optimizes a two-term
cost function consisting of a sparseness term and a gener-
alized v-fold cross-validation term by a new adaptive par-
ticle swarm optimization (APSO). APSO updates its
parameters adaptively based on a dynamic feedback from
the success rate of the each particle’s personal best. Since
the proposed cost function is based on the choosing fewer
numbers of support vectors, the complexity of SVM model
is decreased while the accuracy remains in an accept-
able range. Therefore, the testing time decreases and makes
SVM more applicable for practical applications in real data
sets. A comparative study on data sets of UCI database is
performed between the proposed cost function and con-
ventional cost function to demonstrate the effectiveness of
the proposed cost function.
Keywords Parameter selection Á Model complexity Á
Support vector machines Á Adaptive particle swarm
optimization Á Classification Á Real-world data sets
1 Introduction
Support vector machines (SVMs) are proposed by Vapnik
[1]. SVM is based on statistical learning theory imple-
menting the structural risk minimization principle. There-
fore, SVM is proven as a powerful machine learning
method that attracts a great deal of research in the fields of
classification, function estimation problems, and distribu-
tion estimation [2].
The generalization ability of SVM depends on the
proper choosing of a set of two adjustable parameters
which is called SVM model selection problem [3–5].
Another important feature of SVM is its sparseness prop-
erty which allows only a small part of training data named
support vectors (SVs) contributes in construction of final
hyper-plane. This causes that SVM model has a small size,
and hence, less time is consumed in testing phase in
comparison with a model which built up with contribution
of all training data.
The solution of a model selection problem not only
controls the generalization performance, but also affects
on the SVM model size. Large problems generate large
data sets, and consequently in these data sets, the SVM
model size (number of SVs) will increase. It is antici-
pated from SVM as a sparse machine learning method to
deal with this problem, but the model reduction is not as
much as expected for real-world application and the
number of support vectors increases with the size of data
set.
Generally, two crucial problems arise in SVM applica-
tions. The first is the lack of a certain method for tuning
SVM parameters, and the other one is the size of model in
large data set. In fact, the model selection problem plays an
important role in SVM generalization performance for
either small or large size of data sets, but for large real-
world data sets, the model selection complexity dramati-
cally increases.
Various model selection methods have proposed by
considering different criteria such as Jaakkola–Haussler
bound [6], Opper–Winther bound [7], span bound [8],
radius/margin bound [9], distance between two classes
& Mohammad Hassan Khooban
khooban@sutech.ac.ir
1
Young Researchers and Elite Club, Mashhad Branch, Islamic
Azad University, Mashhad, Iran
2
Department of Electrical Engineering, Shiraz University of
Technology, Shiraz, Iran
123
Neural Comput & Applic
DOI 10.1007/s00521-017-2930-y

[10], and v-fold cross-validation [11]. Generally, gradient
decent method-based algorithms are used to optimize dif-
ferentiable criteria. Although these methods are fast, the
algorithm may stuck in local minima and therefore not
applicable for all aforesaid criteria [4, 9, 10, 12, 13]. To
overcome these drawbacks, global optimization methods
such as PSO [14–16], simulated annealing [17], ant colony
[18], GA [19, 20] have introduced for non-differentiable
and non-smooth cost function optimization problems. More
recently, a PSO-based method has proposed which uses
PSO method to tune SVM parameters and evolve more
artificial instances to make imbalanced data sets balanced
[21].
Many researchers have used v-fold cross-validation
instead of conventional validation in their research works
to evaluate the generalization performance, because in
some cases there are not enough data available to partition
it into separate training and test sets without losing sig-
nificant modeling or testing capability [4, 9, 11, 22–26].
Moreover, the aim in v-fold cross-validation is to
ensure that every datum from the original data set has
the same chance of being in both the training and the
testing sets.
The main contribution of this paper is summarized as
follows: (1) a new criterion is proposed for model selection
problem to consider both tuning a SVM parameters and
model size reduction at once. Building a parsimonious
model and efficient tuning of SVM parameters play
important roles to reduce the testing time and increase the
generalization performance of a SVM, respectively. To
concurrently achieve these goals, a two-term cost function
consisting of sparseness and generalization performance
measures of SVM is proposed. (2) To achieve the global
optimal solution of the proposed cost function, a new
adaptive particle swarm optimization (APSO) is also pro-
posed to solve optimization problem. APSO uses a success
rate feedback to update inertia weight, and also, its cog-
nitive and social weights are adaptively changed during the
optimization process to improve the APSO performance.
The efficiency of APSO is evaluated by comparing to
standard PSO in optimizing static benchmark test func-
tions. Finally, the effectiveness of proposed cost function is
assessed in comparison with one-term cost function con-
sisting of generalization performance criterion on nine data
sets.
This rest of this paper is organized as follows. The SVM
formulation for binary classification is reviewed in Sect. 2.
In Sect. 3.1, generalized v-fold cross-validation formula-
tion is stated, then in Sect. 3.2, new APSO is introduced,
and finally, in Sect. 3.3, the proposed model selection is
proposed. Section 4 begins with stating the experiment
conditions, and then, the experimental results are dis-
cussed. Finally, conclusions are drawn in Sect. 5.
2 Support vector machine
Assume a given two-class labeled data set as follows
X = (xi, yi). Each data point xi 2 Rn
belongs to either of
two classes as determined by a corresponding label
yi 2 { -1, 1} for i = 1, …, n. The optimal hyper-plane is
obtained by solving a quadratic optimization problem
Eq. (1).
Min u w; nð Þ ¼
1
2
wT
w þ C
Xn
i¼1
ni
s:t:
yi wT
:xi þ bð Þ ! 1 À ni; i ¼ 1; 2; . . .; n
ni ! 0; i ¼ 1; 2; . . .; n
ð1Þ
where ni is a slack variable that represents the violation
of pattern separation condition for each of the data and
C is a penalty factor called regularization parameter for
controlling the SVM model complexity. This is one of
the model selection parameters in the SVM formulation.
For nonlinear separable data, a kernel trick is utilized to
map the input space into a high-dimensional space
named feature space. Then, the optimal hyper-plane is
obtained in the feature space. The primal optimal prob-
lem Eq. (1) is transformed into its dual form written as
below:
Max Q að Þ ¼
1
2
Xn
i¼1
Xn
j¼1
aiajyiyjk xi; xj
À Á
À
Xn
j¼1
aj
s:t:
Pn
j¼1
aiyi ¼ 0
0 ai C; i ¼ 1; . . .; n
ð2Þ
where k(., .) is a kernel function. Some of the conventional
kernel functions are listed in Table 1. Kernel parameter
highly affects on generalization performance as well as the
model complexity of SVM. Therefore, kernel parameters
are considered as the other model selection parameter.
Furthermore, in Eq. (2), a = (a1, …, an) is the vector of
non-negative Lagrange multipliers [1]. The solution vector
a = (a1, …, an) is sparse, i.e., ai ¼ 0 for most indices of
training data. This is the so-called SVM sparseness prop-
erty. The points xi that correspond to nonzero ai are called
Table 1 Conventional kernel functions
Name Kernel function expression
Linear kernel k(xi, xj) = xi
T
xj
Polynomial kernel kðxi; xjÞ ¼ ðtÃ
þ xT
i xjÞdÃ
RBF kernel kðxi; xjÞ ¼ expðÀxi À x2
j =r2Ã
Þ
MLP kernel kðxi; xjÞ ¼ tan hðbÃ
0xT
i xj þ bÃ
1Þ
* Kernel parameter
Neural Comput & Applic
123

support vectors. Therefore, the points xi that correspond to
ai = 0 have no contribution in construction of the optimal
hyper-plane and only a part of training data, support vec-
tors, constructs the optimal hyper-plane. Let v be the index
set of support vectors; then, the optimal hyper-plane is
f xð Þ ¼
X#sv
i2m
aiyik xi; xj
À Á
þ b ¼ 0 ð3Þ
and the resulting classifier is
y xð Þ ¼ sgn
X#sv
i2m
aiyik xi; xj
À Á
þ b
" #
ð4Þ
where b is shown the bias parameter and determined by
Karush–Kuhn–Tucker (KKT) conditions [1].
3 Proposed model selection
3.1 Generalized v-fold cross-validation criterion
Generalized v-fold cross-validation (CV) criterion was first
introduced by Craven et al. [27]. Consider a given training
set of n data points as follows fðxk; ykÞjk ¼ 1; 2; . . .; ng.
The following definition is assumed to formulate general-
ized v-fold CV estimator.
Definition 3.1 (Linear smoother) An estimator ^f of f is
called a linear smoother if, for each x 2 Rd
, there exists a
vector LðxÞ ¼ ðl1ðxÞ; . . .; lnðxÞÞT
2 Rn
such that
^f xð Þ ¼
Xn
k¼1
lk xð ÞYi: ð5Þ
In matrix form, this can be written as ^f ¼ LY, with L 2
RnÂn
and L is called the smoother matrix. Craven et al. [27]
demonstrated that the deleted residuals Yk À ^fðÀkÞ
ðXk; hÞ
can be written in terms of Yk À ^fðXk; hÞ and the trace of the
smoother matrix L. Moreover, the smoother matrix depends
on tunable parameter h ¼ ðc; rÞ. The generalized v-fold
CV criterion satisfies
Generalized vÀfold CV hð Þ ¼
1
n
Xn
k¼1
YkÀ^f Xk; hð Þ
1 À nÀ1tr L hð Þ½ Š

2
:
ð6Þ
The generalized v-fold CV estimate of h can be obtained by
minimizing (6). For more details, see [27, 28]. Li [29] and
Cao et al. [30] investigated the effectiveness of generalized
v-fold CV, finding that generalized v-fold CV was a robust
criterion, and regardless of the magnitude of noise, always
the same h is obtained.
3.2 Adaptive particle swarm optimization
PSO is one of the modern population-based optimization
algorithms first introduced by Kennedy and Eberhart [31]. It
uses swarm of particles to find the global optimum solution in
a search space. Each particle represents a candidate solution
for the cost function, and it has its own position and velocity.
Assume particle swarms are in D-dimensional search space.
Let the ith particle in a D-dimensional space represented as
xi = (xi1, …, xid, …, xiD). The best previous position of the
ith particle is recorded and represented as pbi = (pbi1, …, -
pbid, …, pbiD), which is called Pbest and given the best value
in the cost function. General best position, gbest, is denoted
by pgb and shown the best value of the Pbest among all the
particles in the cost function.
The velocity of the ith particle is represented as
vi = (vi1, …, vid, …, viD). In each of the iterations, the
velocity and the position of each particle are updated
according to Eqs. (7) and (8), respectively.
vid ¼ wvid þ C1r1 pbid À xidð Þ þ C2r2 pgb À xid
À Á
ð7Þ
xid ¼ xid þ vid ð8Þ
where w is an inertia weight and it is typically selected
within an interval of [0, 1]. C1 is a cognition weight factor,
C2 is a social weight factor, r1 and r2 are generated ran-
domly within an interval of [0, 1]. Standard PSO has some
shortcomings. It converges to local minima in multimodal
optimization problem and also has some parameters which
should be tuned to have an acceptable exploration and
exploitation properties [32, 33]. In [34], by considering the
stability condition and an adaptive inertia weight, the
acceleration parameters of PSO are adaptively determined.
A simple adaptive nonlinear strategy is introduced. This
strategy mainly depends on each particle’s performance
and determines each particle’s performance by calculating
the absolute distance between each particle’s personal best
(Pbest) and the global best position (gbest) among all
particles in each iterations of algorithm [35]. In [36], the
inertia weight is given by a function of evolution speed
factor and aggregation degree factor, and the value of
inertia weight is dynamically adjusted according to the
evolution speed and aggregation degree. In order to
improve the performance of standard PSO, the inertia, the
cognition, and the social weight factors should be modified.
In this paper, the main idea of modifying the inertia weight
is inspired from 1
5 success rate introduced by Schwefel
[37, 38] in evolution algorithms. Herein, in each iteration,
the success rate of each particle is meant that a better cost
function value is achieved by the Pbest in that each itera-
tion in comparison with its previous iteration. The success
rate is formulated in Eq. (9). Then, the percentage of
success rate is calculated by using Eq. (10).
Neural Comput Applic
123

SucessRate ¼
1 if CostFcn Pbestiter
i
À Á
CostFcn PbestiterÀ1
i
À Á
0 Otherwise

ð9Þ
PSucc ¼
Pn
i¼1 SucessRate i; tð Þ
n
ð10Þ
where n is the number of particles. Now the value of PSucc
can vary within an interval of [0, 1]. It is transparent that
while PSucc is high for a particle, Pbest for that particle is
far from the optimum point of cost function and vice versa.
Therefore, the inertia weight should be correlated with
PSucc. Because of frequent use of linear form for presenting
the inertia weight, we formulate the function of the inertia
weight as a linear function of PSucc as follows:
w iterð Þ ¼ wmax À wminð ÞPsucc þ wmin ð11Þ
The range of the inertia weight [wmin, wmax] is selected to
be [0.2, 0.9]. In order to control the trade-off between
exploitation and exploration properties of PSO algorithm at
the beginning of the optimization process, a large value for
the cognitive weight and a small value for the social weight
should be chosen. Therefore, the exploration property of
PSO is enhanced. In contrast, close to ending stages of PSO
algorithm, a small cognitive weight and a large social
weight should be assigned in such a way to improve the
algorithm convergence to the global optimum point [39].
Therefore, it is necessary to change the cognitive weight
and social weight during the optimization process adap-
tively. To this end, the following formula for APSO is
utilized [32, 33, 38]:
If C1
final
C1
initial
,
C1 ¼ Cfinal
1 À Cinitial
1
À Á iter
itermax

þ Cinitial
1 ð12Þ
If C2
final
[ C2
initial
,
C2 ¼ Cfinal
2 À Cinitial
2
À Á iter
itermax

þ Cinitial
2 ð13Þ
where the superscripts ‘‘initial’’ and ‘‘final’’ indicate the
initial and final values of the cognition weight and the
social weight factor, respectively.
To proof the superior performance of APSO, it is
compared with a standard PSO in optimizing three com-
mon static benchmark test functions. Finally, APSO is used
to solve the model selection problem in SVM. The test
functions are used to investigate the convergence speed and
solution quality of PSO and APSO. Table 2 provides a
detailed description of these functions. All the test func-
tions are a minimization problem. The first function
(Rosenbrock) is a unimodal function while the rest of the
functions (Rastrigin and Ackly) are multimodal optimiza-
tion problems.
The termination criterion of both PSO and APSO is
determined by reaching the maximum iteration number.
In this study, the maximum number of iterations and the
number of particles for both algorithms are selected to be
50 and 30, respectively. The dimension of the search
space (D) is 30. For all test function, xÃ
is the best
solution of test function and fðxÃ
Þ represents the best
achievable fitness for that functions. Figure 1 shows the
comparison results of PSO and APSO based on the final
accuracy and the convergence speed over 100 iterations.
These results demonstrate that APSO has a considerable
higher performance in both unimodal and multimodal
optimization problems.
In solving model selection problem of SVM, APSO is
used to optimize the proposed cost function; after the
maximum number of iteration reached, global best particle
represents an optimal solution consisting of the best regu-
lation parameter and the best kernel parameter for SVM
model.
3.3 Proposed cost function for model selection
problem
A successful selection of SVM model is based on two
important parameters affecting both generalization perfor-
mance and model size of SVM. As we discussed earlier,
those two parameters are regularization and kernel
parameters.
In non-separable problems, noisy training data will
introduce slack variables to measure their violation of the
margin. Therefore, a penalty factor C is considered in
SVM formulation to control the amount of margin vio-
lation. In other words, the penalty factor C is defined to
determine the trade-off between minimizing empirical
error and structural risk error and also to guarantee the
accuracy of classifier outcome in the presence of noisy
training data. Selecting a large value for C causes the
margin to be hard, and the cost of violation becomes too
high, so the separating model surface over-fits the training
data. In contrast, choosing a small value for C allows the
margin to be soft, which results in under-fitting separating
model surface. In both cases, the generalization perfor-
mance of classifier is unsatisfactory, so it makes the SVM
model useless [40].
Kernel parameter(s) are implicitly characterizing the
geometric structure of data in high-dimensional space
named feature space. In feature space, the data become
linearly separable in such a way that the maximal margin
of separation between two classes is achieved. The
selection of kernel parameter(s) will change the shape of
the separating surface in input space. Selecting improp-
erly large or small value for the kernel parameter results
123

over-fitting or under-fitting problem in the SVM model,
so the model is unable to accurately classify data set
[13, 41].
Therefore, we define a model selection problem as an
optimization problem by proposing a cost function which
can concurrently boost up both generalization performance
and sparseness property of a SVM. Although only con-
sidering generalization performance error obtained from
the generalized v-fold CV method as the model selection
criterion guarantees high generalization performance of the
model, no avoidance from over-/under-fitting problem and
also no steering toward improving the sparseness property
of SVM are observed which are more possible in real data
sets, because of large number of SVs. The one-term cost
function consisting of a generalized v-fold CV error is
defined as follows:
One-Term Cost Fun ¼ Generalized v-fold CV Error
ð14Þ
a modification needs to be applied to overcome men-
tioned drawbacks of one-term cost function. Finally, the
proposed two-term cost function is formulated as
follows:
Two-Term Cost Fun ¼ a1 Á Generalized v-fold CV Error
þ a2 Á Sparseness
ð15Þ
where a1 = 0.8 and a2 ¼ 0:2 are the coefficients showing
the significant of Generalized v-fold CV Error and
Sparseness in the cost function, respectively. Sparseness
term is obtained by dividing total number of SVs by the
total number of training data. The proposed cost function is
the weighted sum of the generalized v-fold cross-validation
error and a sparseness property of SVM. By considering
the SVM sparseness as the second term of the cost func-
tion, the over-/under-fitting problem is controlled. There-
fore, the sparsity of the solution is improved and the model
size as well as testing time is decreased.
4 Computational experiments
4.1 Experimental configuration
To evaluate the performance of the proposed cost function,
a PC with configuration of Dual-Core E2160@1.8 GHz
CPU and 1 GB RAM is utilized. Nine commonly used data
sets of UCI database in the literature c used to assess the
effectiveness of the proposed cost function in comparison
with one-term cost function in solving model selection
problem. The v value in generalized v-fold CV is consid-
ered to be 10 in this study. Data sets descriptions are
presented in Table 3. Although the proposed method could
Table 2 Benchmark test functions [34]
Function name Test function #Dim Search space xÃ
f(xÃ
)
Rosenbrock f(x) =
P
i=1
D-1
[100(xi
2
- xi?1)2
? (xi - 1)2
30 [-5, 10]D
[0,…,0] 0
Rastrigin f(x) =
P
i=1
D
(xi
2
- 10 cos(2pxi) ? 10 30 [-5.12, 5.12]D
[0,…,0] 0
Ackly
fðxÞ ¼ À20 expðÀ0:2
ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
1
30
PD
i¼1 x2
i
q
Þ À expð1
D
PD
i¼1 cos 2pxiÞ þ 20 þ e
30 [-32, 32]D
[0,…,0] 0
(a) (b) (c)
0 10 20 30 40 50 60 70 80 90 100
0
5
10
15
x 10
5
Iteration
Rosenbrock Function
PSO
APSO
0 10 20 30 40 50 60 70 80 90 100
100
150
200
250
300
350
400
450
500
Iteration
Rastregin Function
PSO
APSO
0 10 20 30 40 50 60 70 80 90 100
1.68
1.7
1.72
1.74
1.76
1.78
1.8
Iteration
Ackely Function
PSO
APSO
Fig. 1 Comparison results between PSO algorithm and new APSO algorithm on three benchmark test functions, a Rosenbrock, b Rastrigin,
c Ackley
123

be applied to every kernel functions, all experiments
reported here are implemented by using RBF kernel for the
following reasons: The RBF kernel non-linearly maps data
sets into the feature space so it can handle the data sets
when the relation between desired output and input attri-
butes is nonlinear. The second reason is less number of
hyper-parameters which inﬂuences on the complexity of
model selection problem. Finally, RBF kernel has less
numerical difﬁculties [10, 13, 41]. As a result, the model
selection parameters are regularization parameter (C) and
RBF kernel parameter (r). The search space for C and r,
model selection range, is set to be [1, 1000] and [0.01,
100], respectively. The performance of SVM model is
achieved by averaging over 1000 optimal models made out
of optimal parameters.
4.2 Experimental results and discussion
For each data set of Table 3, a comparative study between
the optimal models obtained by the proposed two-term cost
function and one-term cost function is performed. In the
comparative study, the generalization performance accu-
racy, the model size, and the testing time are discussed.
The results of the comparative study for data sets are
presented in Table 4.
Table 4 is shown that the parsimonious model
obtained from the two-term cost function has a
remarkable effect on reducing the model size in com-
parison with the model obtained from the one-term cost
function; consequently, the testing time is considerably
reduced. Overall, all data sets are shown on average 46%
reduction in model size and on average 37% reduction in
testing time. For instance, for smallest data set of
experiment (Wine) and the largest data set (DNA),
model size reduction is 58 and 64%, respectively, and
the testing time reduction is about 26.51, 66.00% in
comparison with one-term cost function.
Table 3 Description of data sets
Data set name #Data #Feature
Wine 178 13
Ionosphere 351 35
Breast cancer 699 10
German 1000 20
Splice 2991 60
Waveform 5000 21
Two norm 7400 20
Banana 10,000 2
DNA 10,372 181
Table 4 Results of comparative study for one-term and two-term cost functions on nine data sets
Data set Cost function Accuracy Model size Testing time
% (±SD) Reduction (%) #SVs (±SD) Reduction (%) (s) Reduction (%)
Wine One-term 99.62 ± 0.57 -0.78 28.53 ± 2.59 58.67 2.49 26.51
Two-term 98.84 ± 0.21 11.79 ± 1.65 1.83
Ionosphere One-term 91.86 ± 1.87 -0.86 117.11 ± 4.10 43.28 3.90 25.38
Two-term 91.07 ± 1.19 66.42 ± 4.35 2.91
Breast cancer One-term 97.07 ± 0.68 -0.42 60.45 ± 4.03 55.99 3.41 24.34
Two-term 96.66 ± 0.83 26.60 ± 4.38 2.58
German One-term 72.78 ± 0.52 -0.74 409.26 ± 8.43 36.70 29.92 35.53
Two-term 72.24 ± 0.43 259.03 ± 8.07 19.29
Splice One-term 90.04 ± 0.69 -0.99 1029.73 ± 16.05 42.67 209.19 49.72
Two-term 89.16 ± 0.70 590.23 ± 18.38 105.17
Waveform One-term 90.32 ± 0.49 -0.12 722.60 ± 17.85 38.84 234.61 33.10
Two-term 90.20 ± 0.47 441.94 ± 19.16 156.94
Two norm One-term 97.78 ± 0.19 -0.06 398.20 ± 11.51 43.21 190.50 49.85
Two-term 97.72 ± 0.16 226.13 ± 12.38 95.52
Banana One-term 96.28 ± 0.20 -0.23 705.12 ± 43.70 32.12 485.70 28.58
Two-term 96.05 ± 0.21 478.6 ± 34.28 346.84
DNA One-term 95.60 ± 1.80 -1.07 1180.11 ± 154.85 64.59 565.97 66.00
Two-term 94.57 ± 1.24 417.79 ± 38.12 192.39
123

One-term Two-term
One-term Two-term
0
0.5
1
1.5
2
2.5
3
-2
-1
0
1
2
0.25
0.3
0.35
0.4
log10
(c)
log10
(σ)
0
0.5
1
1.5
2
2.5
3
-2
-1
0
1
2
0.3
0.32
0.34
0.36
0.38
0.4
log10
(c)
log10
(σ)
0
0.5
1
1.5
2
2.5
3
-2
-1
0
1
2
0
0.05
0.1
0.15
0.2
log10
(c)
log10
(σ)
0
0.5
1
1.5
2
2.5
3
-2
-1
0
1
2
0
0.05
0.1
0.15
0.2
log10
(c)
log10
(σ)
(a)
(b)
Fig. 2 Two examples of model selection problem with one-term and two-term cost functions for data sets described in Table 3, a German,
b Banana
Two norm Splice DNA
One-term Two-term
0
10
20
30
40
50
60
70
80
90
100
Accuracy (%)
One-term Two-term
0
50
100
150
200
250
300
350
400
Model Size (#SVs)
One-term Two-term
0
20
40
60
80
100
120
140
160
180
200
Testing Time (Sec.)
One-term Two-term
0
10
20
30
40
50
60
70
80
90
100
Accuracy (%)
One-term Two-term
0
200
400
600
800
1000
1200
Model Size (#SVs)
One-term Two-term
0
50
100
150
200
250
Testing Time (Sec.)
One-term Two-term
0
10
20
30
40
50
60
70
80
90
100
Accuracy (%)
One-term Two-term
0
200
400
600
800
1000
1200
Model Size (#SVs)
One-term Two-term
0
100
200
300
400
500
600
Testing Time (Sec.)
Fig. 3 Three visual examples of one-term cost function (blue) and two-term cost function (green) extracted from Table 4. Accuracy (left bars),
model size (middle bars), and testing time (right bars) (colour ﬁgure online)
123

Although it is expected that by reducing model size, the
generalization performance considerably decreased,
experimental results show that only a slight drop happened
in the generalization performance for all data sets. The
accuracies reduction ranges are below 0.58% on average
for all data sets. By considering the importance of time
consuming, a slight decrease in generalization performance
of SVM is acceptable.
The parameters of the optimal model selection process
obtained by APSO are shown in ‘‘Appendix’’. In Fig. 2,
two examples of one-term and two-term cost functions
surfaces are plotted versus two model selection parameters
to present the difference between one-term and proposed
two-term cost functions. In addition, three examples of
obtained results listed in Table 4 are visualized in Fig. 3, to
show the efficiency of proposed two-term cost function
over one-term cost function.
5 Conclusion
A new two-term cost function based on the generalized
v-fold generalization performance and the sparseness
property of SVM proposed for the SVM model selection
problem. In addition, a new APSO introduced to solve
the non-convex and multimodal optimization problem.
The feasibility of this cost function in comparison with
one-term cost function evaluated on nine data sets. The
proposed cost function shows an acceptable loss in
generalization performance while providing a parsimo-
nious model and avoiding SVM model from over-/under-
fitting problem. The experimental results demonstrated
that the parsimonious model has a lower model size on
average 46% and less time consuming on average 37%
in SVM testing phase in comparison with model
obtained by the one-term cost function.
Compliance with ethical standards
Conflict of interest The authors declare that there is no conflict of
interests regarding the publication of this paper.
Appendix
The optimal model selection parameters for all experiments
data sets are presented in Table 5.
References
1. Vapnik VN (1998) Statistical learning theory. Wiley, New York
2. Almasi ON, Rouhani M (2016) Fast and de-noise support vector
machine training method based on fuzzy clustering method for
large real world datasets. Turk J Electr Eng Comput 241:219–233
3. Peng X, Wang Y (2009) A geometric method for model selection
in support vector machine. Expert Syst Appl 36:5745–5749
4. Wang S, Meng B (2011) Parameter selection algorithm for sup-
port vector machine. Environ Sci Conf Proc 11:538–544
5. Chapelle O, Vapnik VN, Bousquet O, Mukherjee S (2002)
Choosing multiple parameters for support vector machines. Mach
Learn 461:131–159
6. Jaakkola T, Haussler D (1999) Probabilistic kernel regression
models. Artif Int Stat 126:1–4
7. Opper M, Winther O (2000) Gaussian processes and SVM: mean
field and leave-one-out estimator. In: Smola A, Bartlett P,
Scholkopf B, Schuurmans D (eds) Advances in large margin
classifiers. MIT Press, Cambridge, MA
8. Vapnik V, Chapelle O (2000) Bounds on error expectation for
support vector machines. Neural Comput 12(9):2013–2016
9. Keerthi SS (2002) Efficient tuning of SVM hyperparameters
using radius/margin bound and iterative algorithms. IEEE Trans
Neural Netw 135:1225–1229
10. Sun J, Zheng C, Li X, Zhou Y (2010) Analysis of the distance
between two classes for tuning SVM hyperparameters. IEEE
Trans Neural Netw 212:305–318
Table 5 Optimal model selection parameters
Data set Cost function C r
Wine One-term 49.60 2.58
Two-term 855.06 13.08
Ionosphere One-term 31.08 2.86
Two-term 354.26 4.90
Breast cancer One-term 19.56 34.08
Two-term 997.48 26.32
German One-term 7.91 2.01
Two-term 24.34 5.73
Splice One-term 3.36 4.80
Two-term 636.01 25.82
Waveform One-term 1.01 2.73
Two-term 9.10 7.93
Two norm One-term 1.03 6.87
Two-term 992.21 53.36
Banana One-term 9.28 0.27
Two-term 25.50 0.30
DNA One-term 348.60 8.56
Two-term 870.91 52.83
123

11. Guo XC, Yang JH, Wu CG, Wang CY, Liang YC (2008) A novel
LS-SVMs hyper-parameter selection based on particle swarm
optimization. Neurocomputing 71:3211–3215
12. Glasmachers T, Igel C (2005) Gradient-based adaptation of
general Gaussian kernels. Neural Comput 1710:2099–2105
13. Lin KM, Lin CJ (2003) A study on reduced support vector
machines. IEEE Trans Neural Netw 146:1449–1459
14. Wang S, Meng B (2010) PSO algorithm for support vector machine.
In: Electronic commerce and security conference, pp 377–380
15. Lei P, Lou Y (2010) Parameter selection of support vector
machine using an improved PSO algorithm. In: Intelligent
human–machine systems and cybernetics conference, pp 196–199
16. Lin SW, Ying KC, Chen SC, Lee ZJ (2008) Particle swarm
optimization for parameter determination and feature selection of
support vector machines. Expert Syst Appl 354:1817–1824
17. Zhang W, Niu P (2011) LS-SVM based on chaotic particle swarm
optimization with simulated annealing and application. In:
Intelligent control and information processing, 2011 2nd inter-
national conference, vol 2, pp 931–935
18. Blondin J, Saad A (2010) Metaheuristic techniques for support
vector machine model selection. In: Hybrid intelligent systems,
2010 10th international conference, pp 197–200
19. Almasi ON, Akhtarshenas E, Rouhani M (2014) An efficient
model selection for SVM in real-world datasets using BGA and
RGA. Neural Netw World 24(5):501
20. Lihu A, Holban S (2012) Real-valued genetic algorithms with
disagreements. Stud Comp Intell 4(4):317–325
21. Cervantes J, Garcia-Lamont F, Rodriguez L, Lopez A, Castilla
JR, Trueba A (2017) PSO-based method for SVM classification
on skewed data sets. Neurocomputing 228:187–197
22. Williams P, Li S, Feng J, Wu S (2007) A geometrical method to
improve performance of the support vector machine. IEEE Trans
Neural Netw 183:942–947
23. An S, Liu W, Venkatesh S (2007) Fast cross-validation algo-
rithms for least squares support vector machine and kernel ridge
regression. Pattern Recognit 408:2154–2162
24. Huang CM, Lee YJ, Lin DK, Huang SY (2007) Model selection
for support vector machines via uniform design. Comput Stat
Data Anal 521:335–346
25. Almasi ON, Rouhani M (2016) A new fuzzy membership
assignment and model selection approach based on dynamic class
centers for fuzzy SVM family using the firefly algorithm. Turk J
Electr Eng Comput 243:1797–1814
26. Almasi BN, Almasi ON, Kavousi M, Sharifinia A (2013) Com-
puter-aided diagnosis of diabetes using least square support
vector machine. J Adv Computer Sci Technol 22:68–76
27. Craven P, Wahba G (1978) Smoothing noisy data with spline
functions. Numer Math 314:377–403
28. Efron B (1986) How biased is the apparent error rate of a pre-
diction rule? J Am Stat Assoc 81394:461–470
29. Li KC (1987) Asymptotic optimality for Cp, CL, cross-validation
and generalized cross-validation: discrete index set. Ann Stat
15(3):958–975
30. Cao Y, Golubev Y (2006) On oracle inequalities related to
smoothing splines. Math Methods Stat 154:398–414
31. Kennedy J, Eberhart RC (2001) Swarm intelligence. Academic
Press, USA
32. Beyer HG, Schwefel HP (2002) Evolution strategies: a compre-
hensive introduction. Nat Comput 352:2002
33. Yuan X, Wang L, Yuan Y (2008) Application of enhanced PSO
approach to optimal scheduling of hydro system. Energy Convers
Manag 49:2966–2972
34. Taherkhani M, Safabakhsh R (2016) A novel stability-based
adaptive inertia weight for particle swarm optimization. Appl
Soft Comput 31:281–295
35. Chauhan P, Deep K, Pant M (2013) Novel inertia weight strate-
gies for particle swarm optimization. Memet Comput 5:229–251
36. Yang X, Yuan J, Yuan J, Mao H (2007) A modified particle
swarm optimizer with dynamic adaptation. Appl Math Comput
189:1205–1213
37. Schwefel HPP (1993) Evolution and optimum seeking: the sixth
generation. John Wiley Sons, Inc
38. Almasi ON, Naghedi AA, Tadayoni E, Zare A (2014) Optimal
design of T-S fuzzy controller for a nonlinear system using a new
adaptive particle swarm optimization algorithm. J Adv Comput
Sci Technol 31:37–47
39. Wang Y, Li B, Weise T, Wang J, Yuan B, Tian Q (2011) Self-
adaptive learning based particle swarm optimization. Inf Sci
181:4515–4538
40. Keerthi SS, Lin CJ (2003) Asymptotic behavior of support vector
machines with gaussian kernel. Neural Comput 157:1667–1689
41. Bordes A, Ertekin S, Weston J, Bottou L (2005) Fast kernel
classifiers with online and active learning. J Mach Learn Res
6:1579–1619
123

A parsimonious SVM model selection criterion for classiﬁcation of real-world data sets via an adaptive population-based algorithm

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to A parsimonious SVM model selection criterion for classiﬁcation of real-world data sets via an adaptive population-based algorithm

Similar to A parsimonious SVM model selection criterion for classiﬁcation of real-world data sets via an adaptive population-based algorithm (20)

Recently uploaded

Recently uploaded (20)

A parsimonious SVM model selection criterion for classiﬁcation of real-world data sets via an adaptive population-based algorithm