SlideShare a Scribd company logo
ORIGINAL ARTICLE
A parsimonious SVM model selection criterion for classification
of real-world data sets via an adaptive population-based algorithm
Omid Naghash Almasi1 • Mohammad Hassan Khooban2
Received: 30 November 2016 / Accepted: 1 March 2017
Ó The Natural Computing Applications Forum 2017
Abstract This paper proposes and optimizes a two-term
cost function consisting of a sparseness term and a gener-
alized v-fold cross-validation term by a new adaptive par-
ticle swarm optimization (APSO). APSO updates its
parameters adaptively based on a dynamic feedback from
the success rate of the each particle’s personal best. Since
the proposed cost function is based on the choosing fewer
numbers of support vectors, the complexity of SVM model
is decreased while the accuracy remains in an accept-
able range. Therefore, the testing time decreases and makes
SVM more applicable for practical applications in real data
sets. A comparative study on data sets of UCI database is
performed between the proposed cost function and con-
ventional cost function to demonstrate the effectiveness of
the proposed cost function.
Keywords Parameter selection Á Model complexity Á
Support vector machines Á Adaptive particle swarm
optimization Á Classification Á Real-world data sets
1 Introduction
Support vector machines (SVMs) are proposed by Vapnik
[1]. SVM is based on statistical learning theory imple-
menting the structural risk minimization principle. There-
fore, SVM is proven as a powerful machine learning
method that attracts a great deal of research in the fields of
classification, function estimation problems, and distribu-
tion estimation [2].
The generalization ability of SVM depends on the
proper choosing of a set of two adjustable parameters
which is called SVM model selection problem [3–5].
Another important feature of SVM is its sparseness prop-
erty which allows only a small part of training data named
support vectors (SVs) contributes in construction of final
hyper-plane. This causes that SVM model has a small size,
and hence, less time is consumed in testing phase in
comparison with a model which built up with contribution
of all training data.
The solution of a model selection problem not only
controls the generalization performance, but also affects
on the SVM model size. Large problems generate large
data sets, and consequently in these data sets, the SVM
model size (number of SVs) will increase. It is antici-
pated from SVM as a sparse machine learning method to
deal with this problem, but the model reduction is not as
much as expected for real-world application and the
number of support vectors increases with the size of data
set.
Generally, two crucial problems arise in SVM applica-
tions. The first is the lack of a certain method for tuning
SVM parameters, and the other one is the size of model in
large data set. In fact, the model selection problem plays an
important role in SVM generalization performance for
either small or large size of data sets, but for large real-
world data sets, the model selection complexity dramati-
cally increases.
Various model selection methods have proposed by
considering different criteria such as Jaakkola–Haussler
bound [6], Opper–Winther bound [7], span bound [8],
radius/margin bound [9], distance between two classes
& Mohammad Hassan Khooban
khooban@sutech.ac.ir
1
Young Researchers and Elite Club, Mashhad Branch, Islamic
Azad University, Mashhad, Iran
2
Department of Electrical Engineering, Shiraz University of
Technology, Shiraz, Iran
123
Neural Comput & Applic
DOI 10.1007/s00521-017-2930-y
[10], and v-fold cross-validation [11]. Generally, gradient
decent method-based algorithms are used to optimize dif-
ferentiable criteria. Although these methods are fast, the
algorithm may stuck in local minima and therefore not
applicable for all aforesaid criteria [4, 9, 10, 12, 13]. To
overcome these drawbacks, global optimization methods
such as PSO [14–16], simulated annealing [17], ant colony
[18], GA [19, 20] have introduced for non-differentiable
and non-smooth cost function optimization problems. More
recently, a PSO-based method has proposed which uses
PSO method to tune SVM parameters and evolve more
artificial instances to make imbalanced data sets balanced
[21].
Many researchers have used v-fold cross-validation
instead of conventional validation in their research works
to evaluate the generalization performance, because in
some cases there are not enough data available to partition
it into separate training and test sets without losing sig-
nificant modeling or testing capability [4, 9, 11, 22–26].
Moreover, the aim in v-fold cross-validation is to
ensure that every datum from the original data set has
the same chance of being in both the training and the
testing sets.
The main contribution of this paper is summarized as
follows: (1) a new criterion is proposed for model selection
problem to consider both tuning a SVM parameters and
model size reduction at once. Building a parsimonious
model and efficient tuning of SVM parameters play
important roles to reduce the testing time and increase the
generalization performance of a SVM, respectively. To
concurrently achieve these goals, a two-term cost function
consisting of sparseness and generalization performance
measures of SVM is proposed. (2) To achieve the global
optimal solution of the proposed cost function, a new
adaptive particle swarm optimization (APSO) is also pro-
posed to solve optimization problem. APSO uses a success
rate feedback to update inertia weight, and also, its cog-
nitive and social weights are adaptively changed during the
optimization process to improve the APSO performance.
The efficiency of APSO is evaluated by comparing to
standard PSO in optimizing static benchmark test func-
tions. Finally, the effectiveness of proposed cost function is
assessed in comparison with one-term cost function con-
sisting of generalization performance criterion on nine data
sets.
This rest of this paper is organized as follows. The SVM
formulation for binary classification is reviewed in Sect. 2.
In Sect. 3.1, generalized v-fold cross-validation formula-
tion is stated, then in Sect. 3.2, new APSO is introduced,
and finally, in Sect. 3.3, the proposed model selection is
proposed. Section 4 begins with stating the experiment
conditions, and then, the experimental results are dis-
cussed. Finally, conclusions are drawn in Sect. 5.
2 Support vector machine
Assume a given two-class labeled data set as follows
X = (xi, yi). Each data point xi 2 Rn
belongs to either of
two classes as determined by a corresponding label
yi 2 { -1, 1} for i = 1, …, n. The optimal hyper-plane is
obtained by solving a quadratic optimization problem
Eq. (1).
Min u w; nð Þ ¼
1
2
wT
w þ C
Xn
i¼1
ni
s:t:
yi wT
:xi þ bð Þ ! 1 À ni; i ¼ 1; 2; . . .; n
ni ! 0; i ¼ 1; 2; . . .; n
ð1Þ
where ni is a slack variable that represents the violation
of pattern separation condition for each of the data and
C is a penalty factor called regularization parameter for
controlling the SVM model complexity. This is one of
the model selection parameters in the SVM formulation.
For nonlinear separable data, a kernel trick is utilized to
map the input space into a high-dimensional space
named feature space. Then, the optimal hyper-plane is
obtained in the feature space. The primal optimal prob-
lem Eq. (1) is transformed into its dual form written as
below:
Max Q að Þ ¼
1
2
Xn
i¼1
Xn
j¼1
aiajyiyjk xi; xj
À Á
À
Xn
j¼1
aj
s:t:
Pn
j¼1
aiyi ¼ 0
0 ai C; i ¼ 1; . . .; n
ð2Þ
where k(., .) is a kernel function. Some of the conventional
kernel functions are listed in Table 1. Kernel parameter
highly affects on generalization performance as well as the
model complexity of SVM. Therefore, kernel parameters
are considered as the other model selection parameter.
Furthermore, in Eq. (2), a = (a1, …, an) is the vector of
non-negative Lagrange multipliers [1]. The solution vector
a = (a1, …, an) is sparse, i.e., ai ¼ 0 for most indices of
training data. This is the so-called SVM sparseness prop-
erty. The points xi that correspond to nonzero ai are called
Table 1 Conventional kernel functions
Name Kernel function expression
Linear kernel k(xi, xj) = xi
T
xj
Polynomial kernel kðxi; xjÞ ¼ ðtÃ
þ xT
i xjÞdÃ
RBF kernel kðxi; xjÞ ¼ expðÀxi À x2
j =r2Ã
Þ
MLP kernel kðxi; xjÞ ¼ tan hðbÃ
0xT
i xj þ bÃ
1Þ
* Kernel parameter
Neural Comput & Applic
123
support vectors. Therefore, the points xi that correspond to
ai = 0 have no contribution in construction of the optimal
hyper-plane and only a part of training data, support vec-
tors, constructs the optimal hyper-plane. Let v be the index
set of support vectors; then, the optimal hyper-plane is
f xð Þ ¼
X#sv
i2m
aiyik xi; xj
À Á
þ b ¼ 0 ð3Þ
and the resulting classifier is
y xð Þ ¼ sgn
X#sv
i2m
aiyik xi; xj
À Á
þ b
" #
ð4Þ
where b is shown the bias parameter and determined by
Karush–Kuhn–Tucker (KKT) conditions [1].
3 Proposed model selection
3.1 Generalized v-fold cross-validation criterion
Generalized v-fold cross-validation (CV) criterion was first
introduced by Craven et al. [27]. Consider a given training
set of n data points as follows fðxk; ykÞjk ¼ 1; 2; . . .; ng.
The following definition is assumed to formulate general-
ized v-fold CV estimator.
Definition 3.1 (Linear smoother) An estimator ^f of f is
called a linear smoother if, for each x 2 Rd
, there exists a
vector LðxÞ ¼ ðl1ðxÞ; . . .; lnðxÞÞT
2 Rn
such that
^f xð Þ ¼
Xn
k¼1
lk xð ÞYi: ð5Þ
In matrix form, this can be written as ^f ¼ LY, with L 2
RnÂn
and L is called the smoother matrix. Craven et al. [27]
demonstrated that the deleted residuals Yk À ^fðÀkÞ
ðXk; hÞ
can be written in terms of Yk À ^fðXk; hÞ and the trace of the
smoother matrix L. Moreover, the smoother matrix depends
on tunable parameter h ¼ ðc; rÞ. The generalized v-fold
CV criterion satisfies
Generalized vÀfold CV hð Þ ¼
1
n
Xn
k¼1
YkÀ^f Xk; hð Þ
1 À nÀ1tr L hð Þ½ Š










2
:
ð6Þ
The generalized v-fold CV estimate of h can be obtained by
minimizing (6). For more details, see [27, 28]. Li [29] and
Cao et al. [30] investigated the effectiveness of generalized
v-fold CV, finding that generalized v-fold CV was a robust
criterion, and regardless of the magnitude of noise, always
the same h is obtained.
3.2 Adaptive particle swarm optimization
PSO is one of the modern population-based optimization
algorithms first introduced by Kennedy and Eberhart [31]. It
uses swarm of particles to find the global optimum solution in
a search space. Each particle represents a candidate solution
for the cost function, and it has its own position and velocity.
Assume particle swarms are in D-dimensional search space.
Let the ith particle in a D-dimensional space represented as
xi = (xi1, …, xid, …, xiD). The best previous position of the
ith particle is recorded and represented as pbi = (pbi1, …, -
pbid, …, pbiD), which is called Pbest and given the best value
in the cost function. General best position, gbest, is denoted
by pgb and shown the best value of the Pbest among all the
particles in the cost function.
The velocity of the ith particle is represented as
vi = (vi1, …, vid, …, viD). In each of the iterations, the
velocity and the position of each particle are updated
according to Eqs. (7) and (8), respectively.
vid ¼ wvid þ C1r1 pbid À xidð Þ þ C2r2 pgb À xid
À Á
ð7Þ
xid ¼ xid þ vid ð8Þ
where w is an inertia weight and it is typically selected
within an interval of [0, 1]. C1 is a cognition weight factor,
C2 is a social weight factor, r1 and r2 are generated ran-
domly within an interval of [0, 1]. Standard PSO has some
shortcomings. It converges to local minima in multimodal
optimization problem and also has some parameters which
should be tuned to have an acceptable exploration and
exploitation properties [32, 33]. In [34], by considering the
stability condition and an adaptive inertia weight, the
acceleration parameters of PSO are adaptively determined.
A simple adaptive nonlinear strategy is introduced. This
strategy mainly depends on each particle’s performance
and determines each particle’s performance by calculating
the absolute distance between each particle’s personal best
(Pbest) and the global best position (gbest) among all
particles in each iterations of algorithm [35]. In [36], the
inertia weight is given by a function of evolution speed
factor and aggregation degree factor, and the value of
inertia weight is dynamically adjusted according to the
evolution speed and aggregation degree. In order to
improve the performance of standard PSO, the inertia, the
cognition, and the social weight factors should be modified.
In this paper, the main idea of modifying the inertia weight
is inspired from 1
5 success rate introduced by Schwefel
[37, 38] in evolution algorithms. Herein, in each iteration,
the success rate of each particle is meant that a better cost
function value is achieved by the Pbest in that each itera-
tion in comparison with its previous iteration. The success
rate is formulated in Eq. (9). Then, the percentage of
success rate is calculated by using Eq. (10).
Neural Comput  Applic
123
SucessRate ¼
1 if CostFcn Pbestiter
i
À Á
CostFcn PbestiterÀ1
i
À Á
0 Otherwise

ð9Þ
PSucc ¼
Pn
i¼1 SucessRate i; tð Þ
n
ð10Þ
where n is the number of particles. Now the value of PSucc
can vary within an interval of [0, 1]. It is transparent that
while PSucc is high for a particle, Pbest for that particle is
far from the optimum point of cost function and vice versa.
Therefore, the inertia weight should be correlated with
PSucc. Because of frequent use of linear form for presenting
the inertia weight, we formulate the function of the inertia
weight as a linear function of PSucc as follows:
w iterð Þ ¼ wmax À wminð ÞPsucc þ wmin ð11Þ
The range of the inertia weight [wmin, wmax] is selected to
be [0.2, 0.9]. In order to control the trade-off between
exploitation and exploration properties of PSO algorithm at
the beginning of the optimization process, a large value for
the cognitive weight and a small value for the social weight
should be chosen. Therefore, the exploration property of
PSO is enhanced. In contrast, close to ending stages of PSO
algorithm, a small cognitive weight and a large social
weight should be assigned in such a way to improve the
algorithm convergence to the global optimum point [39].
Therefore, it is necessary to change the cognitive weight
and social weight during the optimization process adap-
tively. To this end, the following formula for APSO is
utilized [32, 33, 38]:
If C1
final
 C1
initial
,
C1 ¼ Cfinal
1 À Cinitial
1
À Á iter
itermax
 
þ Cinitial
1 ð12Þ
If C2
final
[ C2
initial
,
C2 ¼ Cfinal
2 À Cinitial
2
À Á iter
itermax
 
þ Cinitial
2 ð13Þ
where the superscripts ‘‘initial’’ and ‘‘final’’ indicate the
initial and final values of the cognition weight and the
social weight factor, respectively.
To proof the superior performance of APSO, it is
compared with a standard PSO in optimizing three com-
mon static benchmark test functions. Finally, APSO is used
to solve the model selection problem in SVM. The test
functions are used to investigate the convergence speed and
solution quality of PSO and APSO. Table 2 provides a
detailed description of these functions. All the test func-
tions are a minimization problem. The first function
(Rosenbrock) is a unimodal function while the rest of the
functions (Rastrigin and Ackly) are multimodal optimiza-
tion problems.
The termination criterion of both PSO and APSO is
determined by reaching the maximum iteration number.
In this study, the maximum number of iterations and the
number of particles for both algorithms are selected to be
50 and 30, respectively. The dimension of the search
space (D) is 30. For all test function, xÃ
is the best
solution of test function and fðxÃ
Þ represents the best
achievable fitness for that functions. Figure 1 shows the
comparison results of PSO and APSO based on the final
accuracy and the convergence speed over 100 iterations.
These results demonstrate that APSO has a considerable
higher performance in both unimodal and multimodal
optimization problems.
In solving model selection problem of SVM, APSO is
used to optimize the proposed cost function; after the
maximum number of iteration reached, global best particle
represents an optimal solution consisting of the best regu-
lation parameter and the best kernel parameter for SVM
model.
3.3 Proposed cost function for model selection
problem
A successful selection of SVM model is based on two
important parameters affecting both generalization perfor-
mance and model size of SVM. As we discussed earlier,
those two parameters are regularization and kernel
parameters.
In non-separable problems, noisy training data will
introduce slack variables to measure their violation of the
margin. Therefore, a penalty factor C is considered in
SVM formulation to control the amount of margin vio-
lation. In other words, the penalty factor C is defined to
determine the trade-off between minimizing empirical
error and structural risk error and also to guarantee the
accuracy of classifier outcome in the presence of noisy
training data. Selecting a large value for C causes the
margin to be hard, and the cost of violation becomes too
high, so the separating model surface over-fits the training
data. In contrast, choosing a small value for C allows the
margin to be soft, which results in under-fitting separating
model surface. In both cases, the generalization perfor-
mance of classifier is unsatisfactory, so it makes the SVM
model useless [40].
Kernel parameter(s) are implicitly characterizing the
geometric structure of data in high-dimensional space
named feature space. In feature space, the data become
linearly separable in such a way that the maximal margin
of separation between two classes is achieved. The
selection of kernel parameter(s) will change the shape of
the separating surface in input space. Selecting improp-
erly large or small value for the kernel parameter results
Neural Comput  Applic
123
over-fitting or under-fitting problem in the SVM model,
so the model is unable to accurately classify data set
[13, 41].
Therefore, we define a model selection problem as an
optimization problem by proposing a cost function which
can concurrently boost up both generalization performance
and sparseness property of a SVM. Although only con-
sidering generalization performance error obtained from
the generalized v-fold CV method as the model selection
criterion guarantees high generalization performance of the
model, no avoidance from over-/under-fitting problem and
also no steering toward improving the sparseness property
of SVM are observed which are more possible in real data
sets, because of large number of SVs. The one-term cost
function consisting of a generalized v-fold CV error is
defined as follows:
One-Term Cost Fun ¼ Generalized v-fold CV Error
ð14Þ
a modification needs to be applied to overcome men-
tioned drawbacks of one-term cost function. Finally, the
proposed two-term cost function is formulated as
follows:
Two-Term Cost Fun ¼ a1 Á Generalized v-fold CV Error
þ a2 Á Sparseness
ð15Þ
where a1 = 0.8 and a2 ¼ 0:2 are the coefficients showing
the significant of Generalized v-fold CV Error and
Sparseness in the cost function, respectively. Sparseness
term is obtained by dividing total number of SVs by the
total number of training data. The proposed cost function is
the weighted sum of the generalized v-fold cross-validation
error and a sparseness property of SVM. By considering
the SVM sparseness as the second term of the cost func-
tion, the over-/under-fitting problem is controlled. There-
fore, the sparsity of the solution is improved and the model
size as well as testing time is decreased.
4 Computational experiments
4.1 Experimental configuration
To evaluate the performance of the proposed cost function,
a PC with configuration of Dual-Core E2160@1.8 GHz
CPU and 1 GB RAM is utilized. Nine commonly used data
sets of UCI database in the literature c used to assess the
effectiveness of the proposed cost function in comparison
with one-term cost function in solving model selection
problem. The v value in generalized v-fold CV is consid-
ered to be 10 in this study. Data sets descriptions are
presented in Table 3. Although the proposed method could
Table 2 Benchmark test functions [34]
Function name Test function #Dim Search space xÃ
f(xÃ
)
Rosenbrock f(x) =
P
i=1
D-1
[100(xi
2
- xi?1)2
? (xi - 1)2
30 [-5, 10]D
[0,…,0] 0
Rastrigin f(x) =
P
i=1
D
(xi
2
- 10 cos(2pxi) ? 10 30 [-5.12, 5.12]D
[0,…,0] 0
Ackly
fðxÞ ¼ À20 expðÀ0:2
ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
1
30
PD
i¼1 x2
i
q
Þ À expð1
D
PD
i¼1 cos 2pxiÞ þ 20 þ e
30 [-32, 32]D
[0,…,0] 0
(a) (b) (c)
0 10 20 30 40 50 60 70 80 90 100
0
5
10
15
x 10
5
Iteration
Rosenbrock Function
PSO
APSO
0 10 20 30 40 50 60 70 80 90 100
100
150
200
250
300
350
400
450
500
Iteration
Rastregin Function
PSO
APSO
0 10 20 30 40 50 60 70 80 90 100
1.68
1.7
1.72
1.74
1.76
1.78
1.8
Iteration
Ackely Function
PSO
APSO
Fig. 1 Comparison results between PSO algorithm and new APSO algorithm on three benchmark test functions, a Rosenbrock, b Rastrigin,
c Ackley
Neural Comput  Applic
123
be applied to every kernel functions, all experiments
reported here are implemented by using RBF kernel for the
following reasons: The RBF kernel non-linearly maps data
sets into the feature space so it can handle the data sets
when the relation between desired output and input attri-
butes is nonlinear. The second reason is less number of
hyper-parameters which influences on the complexity of
model selection problem. Finally, RBF kernel has less
numerical difficulties [10, 13, 41]. As a result, the model
selection parameters are regularization parameter (C) and
RBF kernel parameter (r). The search space for C and r,
model selection range, is set to be [1, 1000] and [0.01,
100], respectively. The performance of SVM model is
achieved by averaging over 1000 optimal models made out
of optimal parameters.
4.2 Experimental results and discussion
For each data set of Table 3, a comparative study between
the optimal models obtained by the proposed two-term cost
function and one-term cost function is performed. In the
comparative study, the generalization performance accu-
racy, the model size, and the testing time are discussed.
The results of the comparative study for data sets are
presented in Table 4.
Table 4 is shown that the parsimonious model
obtained from the two-term cost function has a
remarkable effect on reducing the model size in com-
parison with the model obtained from the one-term cost
function; consequently, the testing time is considerably
reduced. Overall, all data sets are shown on average 46%
reduction in model size and on average 37% reduction in
testing time. For instance, for smallest data set of
experiment (Wine) and the largest data set (DNA),
model size reduction is 58 and 64%, respectively, and
the testing time reduction is about 26.51, 66.00% in
comparison with one-term cost function.
Table 3 Description of data sets
Data set name #Data #Feature
Wine 178 13
Ionosphere 351 35
Breast cancer 699 10
German 1000 20
Splice 2991 60
Waveform 5000 21
Two norm 7400 20
Banana 10,000 2
DNA 10,372 181
Table 4 Results of comparative study for one-term and two-term cost functions on nine data sets
Data set Cost function Accuracy Model size Testing time
% (±SD) Reduction (%) #SVs (±SD) Reduction (%) (s) Reduction (%)
Wine One-term 99.62 ± 0.57 -0.78 28.53 ± 2.59 58.67 2.49 26.51
Two-term 98.84 ± 0.21 11.79 ± 1.65 1.83
Ionosphere One-term 91.86 ± 1.87 -0.86 117.11 ± 4.10 43.28 3.90 25.38
Two-term 91.07 ± 1.19 66.42 ± 4.35 2.91
Breast cancer One-term 97.07 ± 0.68 -0.42 60.45 ± 4.03 55.99 3.41 24.34
Two-term 96.66 ± 0.83 26.60 ± 4.38 2.58
German One-term 72.78 ± 0.52 -0.74 409.26 ± 8.43 36.70 29.92 35.53
Two-term 72.24 ± 0.43 259.03 ± 8.07 19.29
Splice One-term 90.04 ± 0.69 -0.99 1029.73 ± 16.05 42.67 209.19 49.72
Two-term 89.16 ± 0.70 590.23 ± 18.38 105.17
Waveform One-term 90.32 ± 0.49 -0.12 722.60 ± 17.85 38.84 234.61 33.10
Two-term 90.20 ± 0.47 441.94 ± 19.16 156.94
Two norm One-term 97.78 ± 0.19 -0.06 398.20 ± 11.51 43.21 190.50 49.85
Two-term 97.72 ± 0.16 226.13 ± 12.38 95.52
Banana One-term 96.28 ± 0.20 -0.23 705.12 ± 43.70 32.12 485.70 28.58
Two-term 96.05 ± 0.21 478.6 ± 34.28 346.84
DNA One-term 95.60 ± 1.80 -1.07 1180.11 ± 154.85 64.59 565.97 66.00
Two-term 94.57 ± 1.24 417.79 ± 38.12 192.39
Neural Comput  Applic
123
One-term Two-term
One-term Two-term
0
0.5
1
1.5
2
2.5
3
-2
-1
0
1
2
0.25
0.3
0.35
0.4
log10
(c)
log10
(σ)
0
0.5
1
1.5
2
2.5
3
-2
-1
0
1
2
0.3
0.32
0.34
0.36
0.38
0.4
log10
(c)
log10
(σ)
0
0.5
1
1.5
2
2.5
3
-2
-1
0
1
2
0
0.05
0.1
0.15
0.2
log10
(c)
log10
(σ)
0
0.5
1
1.5
2
2.5
3
-2
-1
0
1
2
0
0.05
0.1
0.15
0.2
log10
(c)
log10
(σ)
(a)
(b)
Fig. 2 Two examples of model selection problem with one-term and two-term cost functions for data sets described in Table 3, a German,
b Banana
Two norm Splice DNA
One-term Two-term
0
10
20
30
40
50
60
70
80
90
100
Accuracy (%)
One-term Two-term
0
50
100
150
200
250
300
350
400
Model Size (#SVs)
One-term Two-term
0
20
40
60
80
100
120
140
160
180
200
Testing Time (Sec.)
One-term Two-term
0
10
20
30
40
50
60
70
80
90
100
Accuracy (%)
One-term Two-term
0
200
400
600
800
1000
1200
Model Size (#SVs)
One-term Two-term
0
50
100
150
200
250
Testing Time (Sec.)
One-term Two-term
0
10
20
30
40
50
60
70
80
90
100
Accuracy (%)
One-term Two-term
0
200
400
600
800
1000
1200
Model Size (#SVs)
One-term Two-term
0
100
200
300
400
500
600
Testing Time (Sec.)
Fig. 3 Three visual examples of one-term cost function (blue) and two-term cost function (green) extracted from Table 4. Accuracy (left bars),
model size (middle bars), and testing time (right bars) (colour figure online)
Neural Comput  Applic
123
Although it is expected that by reducing model size, the
generalization performance considerably decreased,
experimental results show that only a slight drop happened
in the generalization performance for all data sets. The
accuracies reduction ranges are below 0.58% on average
for all data sets. By considering the importance of time
consuming, a slight decrease in generalization performance
of SVM is acceptable.
The parameters of the optimal model selection process
obtained by APSO are shown in ‘‘Appendix’’. In Fig. 2,
two examples of one-term and two-term cost functions
surfaces are plotted versus two model selection parameters
to present the difference between one-term and proposed
two-term cost functions. In addition, three examples of
obtained results listed in Table 4 are visualized in Fig. 3, to
show the efficiency of proposed two-term cost function
over one-term cost function.
5 Conclusion
A new two-term cost function based on the generalized
v-fold generalization performance and the sparseness
property of SVM proposed for the SVM model selection
problem. In addition, a new APSO introduced to solve
the non-convex and multimodal optimization problem.
The feasibility of this cost function in comparison with
one-term cost function evaluated on nine data sets. The
proposed cost function shows an acceptable loss in
generalization performance while providing a parsimo-
nious model and avoiding SVM model from over-/under-
fitting problem. The experimental results demonstrated
that the parsimonious model has a lower model size on
average 46% and less time consuming on average 37%
in SVM testing phase in comparison with model
obtained by the one-term cost function.
Compliance with ethical standards
Conflict of interest The authors declare that there is no conflict of
interests regarding the publication of this paper.
Appendix
The optimal model selection parameters for all experiments
data sets are presented in Table 5.
References
1. Vapnik VN (1998) Statistical learning theory. Wiley, New York
2. Almasi ON, Rouhani M (2016) Fast and de-noise support vector
machine training method based on fuzzy clustering method for
large real world datasets. Turk J Electr Eng Comput 241:219–233
3. Peng X, Wang Y (2009) A geometric method for model selection
in support vector machine. Expert Syst Appl 36:5745–5749
4. Wang S, Meng B (2011) Parameter selection algorithm for sup-
port vector machine. Environ Sci Conf Proc 11:538–544
5. Chapelle O, Vapnik VN, Bousquet O, Mukherjee S (2002)
Choosing multiple parameters for support vector machines. Mach
Learn 461:131–159
6. Jaakkola T, Haussler D (1999) Probabilistic kernel regression
models. Artif Int Stat 126:1–4
7. Opper M, Winther O (2000) Gaussian processes and SVM: mean
field and leave-one-out estimator. In: Smola A, Bartlett P,
Scholkopf B, Schuurmans D (eds) Advances in large margin
classifiers. MIT Press, Cambridge, MA
8. Vapnik V, Chapelle O (2000) Bounds on error expectation for
support vector machines. Neural Comput 12(9):2013–2016
9. Keerthi SS (2002) Efficient tuning of SVM hyperparameters
using radius/margin bound and iterative algorithms. IEEE Trans
Neural Netw 135:1225–1229
10. Sun J, Zheng C, Li X, Zhou Y (2010) Analysis of the distance
between two classes for tuning SVM hyperparameters. IEEE
Trans Neural Netw 212:305–318
Table 5 Optimal model selection parameters
Data set Cost function C r
Wine One-term 49.60 2.58
Two-term 855.06 13.08
Ionosphere One-term 31.08 2.86
Two-term 354.26 4.90
Breast cancer One-term 19.56 34.08
Two-term 997.48 26.32
German One-term 7.91 2.01
Two-term 24.34 5.73
Splice One-term 3.36 4.80
Two-term 636.01 25.82
Waveform One-term 1.01 2.73
Two-term 9.10 7.93
Two norm One-term 1.03 6.87
Two-term 992.21 53.36
Banana One-term 9.28 0.27
Two-term 25.50 0.30
DNA One-term 348.60 8.56
Two-term 870.91 52.83
Neural Comput  Applic
123
11. Guo XC, Yang JH, Wu CG, Wang CY, Liang YC (2008) A novel
LS-SVMs hyper-parameter selection based on particle swarm
optimization. Neurocomputing 71:3211–3215
12. Glasmachers T, Igel C (2005) Gradient-based adaptation of
general Gaussian kernels. Neural Comput 1710:2099–2105
13. Lin KM, Lin CJ (2003) A study on reduced support vector
machines. IEEE Trans Neural Netw 146:1449–1459
14. Wang S, Meng B (2010) PSO algorithm for support vector machine.
In: Electronic commerce and security conference, pp 377–380
15. Lei P, Lou Y (2010) Parameter selection of support vector
machine using an improved PSO algorithm. In: Intelligent
human–machine systems and cybernetics conference, pp 196–199
16. Lin SW, Ying KC, Chen SC, Lee ZJ (2008) Particle swarm
optimization for parameter determination and feature selection of
support vector machines. Expert Syst Appl 354:1817–1824
17. Zhang W, Niu P (2011) LS-SVM based on chaotic particle swarm
optimization with simulated annealing and application. In:
Intelligent control and information processing, 2011 2nd inter-
national conference, vol 2, pp 931–935
18. Blondin J, Saad A (2010) Metaheuristic techniques for support
vector machine model selection. In: Hybrid intelligent systems,
2010 10th international conference, pp 197–200
19. Almasi ON, Akhtarshenas E, Rouhani M (2014) An efficient
model selection for SVM in real-world datasets using BGA and
RGA. Neural Netw World 24(5):501
20. Lihu A, Holban S (2012) Real-valued genetic algorithms with
disagreements. Stud Comp Intell 4(4):317–325
21. Cervantes J, Garcia-Lamont F, Rodriguez L, Lopez A, Castilla
JR, Trueba A (2017) PSO-based method for SVM classification
on skewed data sets. Neurocomputing 228:187–197
22. Williams P, Li S, Feng J, Wu S (2007) A geometrical method to
improve performance of the support vector machine. IEEE Trans
Neural Netw 183:942–947
23. An S, Liu W, Venkatesh S (2007) Fast cross-validation algo-
rithms for least squares support vector machine and kernel ridge
regression. Pattern Recognit 408:2154–2162
24. Huang CM, Lee YJ, Lin DK, Huang SY (2007) Model selection
for support vector machines via uniform design. Comput Stat
Data Anal 521:335–346
25. Almasi ON, Rouhani M (2016) A new fuzzy membership
assignment and model selection approach based on dynamic class
centers for fuzzy SVM family using the firefly algorithm. Turk J
Electr Eng Comput 243:1797–1814
26. Almasi BN, Almasi ON, Kavousi M, Sharifinia A (2013) Com-
puter-aided diagnosis of diabetes using least square support
vector machine. J Adv Computer Sci Technol 22:68–76
27. Craven P, Wahba G (1978) Smoothing noisy data with spline
functions. Numer Math 314:377–403
28. Efron B (1986) How biased is the apparent error rate of a pre-
diction rule? J Am Stat Assoc 81394:461–470
29. Li KC (1987) Asymptotic optimality for Cp, CL, cross-validation
and generalized cross-validation: discrete index set. Ann Stat
15(3):958–975
30. Cao Y, Golubev Y (2006) On oracle inequalities related to
smoothing splines. Math Methods Stat 154:398–414
31. Kennedy J, Eberhart RC (2001) Swarm intelligence. Academic
Press, USA
32. Beyer HG, Schwefel HP (2002) Evolution strategies: a compre-
hensive introduction. Nat Comput 352:2002
33. Yuan X, Wang L, Yuan Y (2008) Application of enhanced PSO
approach to optimal scheduling of hydro system. Energy Convers
Manag 49:2966–2972
34. Taherkhani M, Safabakhsh R (2016) A novel stability-based
adaptive inertia weight for particle swarm optimization. Appl
Soft Comput 31:281–295
35. Chauhan P, Deep K, Pant M (2013) Novel inertia weight strate-
gies for particle swarm optimization. Memet Comput 5:229–251
36. Yang X, Yuan J, Yuan J, Mao H (2007) A modified particle
swarm optimizer with dynamic adaptation. Appl Math Comput
189:1205–1213
37. Schwefel HPP (1993) Evolution and optimum seeking: the sixth
generation. John Wiley  Sons, Inc
38. Almasi ON, Naghedi AA, Tadayoni E, Zare A (2014) Optimal
design of T-S fuzzy controller for a nonlinear system using a new
adaptive particle swarm optimization algorithm. J Adv Comput
Sci Technol 31:37–47
39. Wang Y, Li B, Weise T, Wang J, Yuan B, Tian Q (2011) Self-
adaptive learning based particle swarm optimization. Inf Sci
181:4515–4538
40. Keerthi SS, Lin CJ (2003) Asymptotic behavior of support vector
machines with gaussian kernel. Neural Comput 157:1667–1689
41. Bordes A, Ertekin S, Weston J, Bottou L (2005) Fast kernel
classifiers with online and active learning. J Mach Learn Res
6:1579–1619
Neural Comput  Applic
123

More Related Content

What's hot

ASS_SDM2012_Ali
ASS_SDM2012_AliASS_SDM2012_Ali
ASS_SDM2012_Ali
MDO_Lab
 
PEMF-1-MAO2012-Ali
PEMF-1-MAO2012-AliPEMF-1-MAO2012-Ali
PEMF-1-MAO2012-Ali
MDO_Lab
 
AIAA-SciTech-ModelSelection-2014-Mehmani
AIAA-SciTech-ModelSelection-2014-MehmaniAIAA-SciTech-ModelSelection-2014-Mehmani
AIAA-SciTech-ModelSelection-2014-Mehmani
OptiModel
 
AMS_Aviation_2014_Ali
AMS_Aviation_2014_AliAMS_Aviation_2014_Ali
AMS_Aviation_2014_Ali
MDO_Lab
 
Computational intelligence systems in industrial engineering
Computational intelligence systems in industrial engineeringComputational intelligence systems in industrial engineering
Computational intelligence systems in industrial engineering
Springer
 
Adaptive response surface by kriging using pilot points for structural reliab...
Adaptive response surface by kriging using pilot points for structural reliab...Adaptive response surface by kriging using pilot points for structural reliab...
Adaptive response surface by kriging using pilot points for structural reliab...
IOSR Journals
 
Lecture7 cross validation
Lecture7 cross validationLecture7 cross validation
Lecture7 cross validation
Stéphane Canu
 

What's hot (20)

ASS_SDM2012_Ali
ASS_SDM2012_AliASS_SDM2012_Ali
ASS_SDM2012_Ali
 
master-thesis
master-thesismaster-thesis
master-thesis
 
PEMF-1-MAO2012-Ali
PEMF-1-MAO2012-AliPEMF-1-MAO2012-Ali
PEMF-1-MAO2012-Ali
 
AIAA-SciTech-ModelSelection-2014-Mehmani
AIAA-SciTech-ModelSelection-2014-MehmaniAIAA-SciTech-ModelSelection-2014-Mehmani
AIAA-SciTech-ModelSelection-2014-Mehmani
 
A REVIEW ON OPTIMIZATION OF LEAST SQUARES SUPPORT VECTOR MACHINE FOR TIME SER...
A REVIEW ON OPTIMIZATION OF LEAST SQUARES SUPPORT VECTOR MACHINE FOR TIME SER...A REVIEW ON OPTIMIZATION OF LEAST SQUARES SUPPORT VECTOR MACHINE FOR TIME SER...
A REVIEW ON OPTIMIZATION OF LEAST SQUARES SUPPORT VECTOR MACHINE FOR TIME SER...
 
Multi criteria decision making
Multi criteria decision makingMulti criteria decision making
Multi criteria decision making
 
Modified hotelling’s 푻 ퟐ control charts using modified mahalanobis distance
Modified hotelling’s 푻 ퟐ  control charts using modified mahalanobis distance Modified hotelling’s 푻 ퟐ  control charts using modified mahalanobis distance
Modified hotelling’s 푻 ퟐ control charts using modified mahalanobis distance
 
multi criteria decision making
multi criteria decision makingmulti criteria decision making
multi criteria decision making
 
AMS_Aviation_2014_Ali
AMS_Aviation_2014_AliAMS_Aviation_2014_Ali
AMS_Aviation_2014_Ali
 
Computational intelligence systems in industrial engineering
Computational intelligence systems in industrial engineeringComputational intelligence systems in industrial engineering
Computational intelligence systems in industrial engineering
 
PARTICLE SWARM OPTIMIZATION FOR MULTIDIMENSIONAL CLUSTERING OF NATURAL LANGUA...
PARTICLE SWARM OPTIMIZATION FOR MULTIDIMENSIONAL CLUSTERING OF NATURAL LANGUA...PARTICLE SWARM OPTIMIZATION FOR MULTIDIMENSIONAL CLUSTERING OF NATURAL LANGUA...
PARTICLE SWARM OPTIMIZATION FOR MULTIDIMENSIONAL CLUSTERING OF NATURAL LANGUA...
 
Adaptive response surface by kriging using pilot points for structural reliab...
Adaptive response surface by kriging using pilot points for structural reliab...Adaptive response surface by kriging using pilot points for structural reliab...
Adaptive response surface by kriging using pilot points for structural reliab...
 
9 coldengine
9 coldengine9 coldengine
9 coldengine
 
080924 Measurement System Analysis Re Sampling
080924 Measurement System Analysis Re Sampling080924 Measurement System Analysis Re Sampling
080924 Measurement System Analysis Re Sampling
 
FCMPE2015
FCMPE2015FCMPE2015
FCMPE2015
 
Lecture7 cross validation
Lecture7 cross validationLecture7 cross validation
Lecture7 cross validation
 
Time Truncated Chain Sampling Plans for Inverse Rayleigh Distribution
Time Truncated Chain Sampling Plans for Inverse Rayleigh  Distribution Time Truncated Chain Sampling Plans for Inverse Rayleigh  Distribution
Time Truncated Chain Sampling Plans for Inverse Rayleigh Distribution
 
S4101116121
S4101116121S4101116121
S4101116121
 
C070409013
C070409013C070409013
C070409013
 
Online learning in estimation of distribution algorithms for dynamic environm...
Online learning in estimation of distribution algorithms for dynamic environm...Online learning in estimation of distribution algorithms for dynamic environm...
Online learning in estimation of distribution algorithms for dynamic environm...
 

Similar to A parsimonious SVM model selection criterion for classification of real-world data sets via an adaptive population-based algorithm

On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...
On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...
On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...
Amir Ziai
 
LNCS 5050 - Bilevel Optimization and Machine Learning
LNCS 5050 - Bilevel Optimization and Machine LearningLNCS 5050 - Bilevel Optimization and Machine Learning
LNCS 5050 - Bilevel Optimization and Machine Learning
butest
 
Parameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point DetectionParameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point Detection
Dario Panada
 
RBHF_SDM_2011_Jie
RBHF_SDM_2011_JieRBHF_SDM_2011_Jie
RBHF_SDM_2011_Jie
MDO_Lab
 

Similar to A parsimonious SVM model selection criterion for classification of real-world data sets via an adaptive population-based algorithm (20)

Sequential estimation of_discrete_choice_models__copy_-4
Sequential estimation of_discrete_choice_models__copy_-4Sequential estimation of_discrete_choice_models__copy_-4
Sequential estimation of_discrete_choice_models__copy_-4
 
On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...
On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...
On the Performance of the Pareto Set Pursuing (PSP) Method for Mixed-Variable...
 
LNCS 5050 - Bilevel Optimization and Machine Learning
LNCS 5050 - Bilevel Optimization and Machine LearningLNCS 5050 - Bilevel Optimization and Machine Learning
LNCS 5050 - Bilevel Optimization and Machine Learning
 
FIDUCIAL POINTS DETECTION USING SVM LINEAR CLASSIFIERS
FIDUCIAL POINTS DETECTION USING SVM LINEAR CLASSIFIERSFIDUCIAL POINTS DETECTION USING SVM LINEAR CLASSIFIERS
FIDUCIAL POINTS DETECTION USING SVM LINEAR CLASSIFIERS
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
Guide
GuideGuide
Guide
 
Application of combined support vector machines in process fault diagnosis
Application of combined support vector machines in process fault diagnosisApplication of combined support vector machines in process fault diagnosis
Application of combined support vector machines in process fault diagnosis
 
Adapted Branch-and-Bound Algorithm Using SVM With Model Selection
Adapted Branch-and-Bound Algorithm Using SVM With Model SelectionAdapted Branch-and-Bound Algorithm Using SVM With Model Selection
Adapted Branch-and-Bound Algorithm Using SVM With Model Selection
 
Abrigo and love_2015_
Abrigo and love_2015_Abrigo and love_2015_
Abrigo and love_2015_
 
Kitamura1992
Kitamura1992Kitamura1992
Kitamura1992
 
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
 
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
A BI-OBJECTIVE MODEL FOR SVM WITH AN INTERACTIVE PROCEDURE TO IDENTIFY THE BE...
 
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
PROGRAM TEST DATA GENERATION FOR BRANCH COVERAGE WITH GENETIC ALGORITHM: COMP...
 
Generalization of linear and non-linear support vector machine in multiple fi...
Generalization of linear and non-linear support vector machine in multiple fi...Generalization of linear and non-linear support vector machine in multiple fi...
Generalization of linear and non-linear support vector machine in multiple fi...
 
Configuration Navigation Analysis Model for Regression Test Case Prioritization
Configuration Navigation Analysis Model for Regression Test Case PrioritizationConfiguration Navigation Analysis Model for Regression Test Case Prioritization
Configuration Navigation Analysis Model for Regression Test Case Prioritization
 
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature SetOptimal Feature Selection from VMware ESXi 5.1 Feature Set
Optimal Feature Selection from VMware ESXi 5.1 Feature Set
 
Optimal feature selection from v mware esxi 5.1 feature set
Optimal feature selection from v mware esxi 5.1 feature setOptimal feature selection from v mware esxi 5.1 feature set
Optimal feature selection from v mware esxi 5.1 feature set
 
Parameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point DetectionParameter Optimisation for Automated Feature Point Detection
Parameter Optimisation for Automated Feature Point Detection
 
RBHF_SDM_2011_Jie
RBHF_SDM_2011_JieRBHF_SDM_2011_Jie
RBHF_SDM_2011_Jie
 
Transient stability analysis of power system
Transient stability analysis of power systemTransient stability analysis of power system
Transient stability analysis of power system
 

Recently uploaded

Automobile Management System Project Report.pdf
Automobile Management System Project Report.pdfAutomobile Management System Project Report.pdf
Automobile Management System Project Report.pdf
Kamal Acharya
 
ONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdf
ONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdfONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdf
ONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdf
Kamal Acharya
 
Laundry management system project report.pdf
Laundry management system project report.pdfLaundry management system project report.pdf
Laundry management system project report.pdf
Kamal Acharya
 
Fruit shop management system project report.pdf
Fruit shop management system project report.pdfFruit shop management system project report.pdf
Fruit shop management system project report.pdf
Kamal Acharya
 
Hall booking system project report .pdf
Hall booking system project report  .pdfHall booking system project report  .pdf
Hall booking system project report .pdf
Kamal Acharya
 

Recently uploaded (20)

Automobile Management System Project Report.pdf
Automobile Management System Project Report.pdfAutomobile Management System Project Report.pdf
Automobile Management System Project Report.pdf
 
Pharmacy management system project report..pdf
Pharmacy management system project report..pdfPharmacy management system project report..pdf
Pharmacy management system project report..pdf
 
Furniture showroom management system project.pdf
Furniture showroom management system project.pdfFurniture showroom management system project.pdf
Furniture showroom management system project.pdf
 
ONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdf
ONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdfONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdf
ONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdf
 
Online resume builder management system project report.pdf
Online resume builder management system project report.pdfOnline resume builder management system project report.pdf
Online resume builder management system project report.pdf
 
HYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generationHYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generation
 
Courier management system project report.pdf
Courier management system project report.pdfCourier management system project report.pdf
Courier management system project report.pdf
 
ENERGY STORAGE DEVICES INTRODUCTION UNIT-I
ENERGY STORAGE DEVICES  INTRODUCTION UNIT-IENERGY STORAGE DEVICES  INTRODUCTION UNIT-I
ENERGY STORAGE DEVICES INTRODUCTION UNIT-I
 
AI for workflow automation Use cases applications benefits and development.pdf
AI for workflow automation Use cases applications benefits and development.pdfAI for workflow automation Use cases applications benefits and development.pdf
AI for workflow automation Use cases applications benefits and development.pdf
 
2024 DevOps Pro Europe - Growing at the edge
2024 DevOps Pro Europe - Growing at the edge2024 DevOps Pro Europe - Growing at the edge
2024 DevOps Pro Europe - Growing at the edge
 
Cloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptx
Cloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptxCloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptx
Cloud-Computing_CSE311_Computer-Networking CSE GUB BD - Shahidul.pptx
 
Democratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek AryaDemocratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek Arya
 
Laundry management system project report.pdf
Laundry management system project report.pdfLaundry management system project report.pdf
Laundry management system project report.pdf
 
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
 
Fruit shop management system project report.pdf
Fruit shop management system project report.pdfFruit shop management system project report.pdf
Fruit shop management system project report.pdf
 
Hall booking system project report .pdf
Hall booking system project report  .pdfHall booking system project report  .pdf
Hall booking system project report .pdf
 
Top 13 Famous Civil Engineering Scientist
Top 13 Famous Civil Engineering ScientistTop 13 Famous Civil Engineering Scientist
Top 13 Famous Civil Engineering Scientist
 
RESORT MANAGEMENT AND RESERVATION SYSTEM PROJECT REPORT.pdf
RESORT MANAGEMENT AND RESERVATION SYSTEM PROJECT REPORT.pdfRESORT MANAGEMENT AND RESERVATION SYSTEM PROJECT REPORT.pdf
RESORT MANAGEMENT AND RESERVATION SYSTEM PROJECT REPORT.pdf
 
Peek implant persentation - Copy (1).pdf
Peek implant persentation - Copy (1).pdfPeek implant persentation - Copy (1).pdf
Peek implant persentation - Copy (1).pdf
 
Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.
 

A parsimonious SVM model selection criterion for classification of real-world data sets via an adaptive population-based algorithm

  • 1. ORIGINAL ARTICLE A parsimonious SVM model selection criterion for classification of real-world data sets via an adaptive population-based algorithm Omid Naghash Almasi1 • Mohammad Hassan Khooban2 Received: 30 November 2016 / Accepted: 1 March 2017 Ó The Natural Computing Applications Forum 2017 Abstract This paper proposes and optimizes a two-term cost function consisting of a sparseness term and a gener- alized v-fold cross-validation term by a new adaptive par- ticle swarm optimization (APSO). APSO updates its parameters adaptively based on a dynamic feedback from the success rate of the each particle’s personal best. Since the proposed cost function is based on the choosing fewer numbers of support vectors, the complexity of SVM model is decreased while the accuracy remains in an accept- able range. Therefore, the testing time decreases and makes SVM more applicable for practical applications in real data sets. A comparative study on data sets of UCI database is performed between the proposed cost function and con- ventional cost function to demonstrate the effectiveness of the proposed cost function. Keywords Parameter selection Á Model complexity Á Support vector machines Á Adaptive particle swarm optimization Á Classification Á Real-world data sets 1 Introduction Support vector machines (SVMs) are proposed by Vapnik [1]. SVM is based on statistical learning theory imple- menting the structural risk minimization principle. There- fore, SVM is proven as a powerful machine learning method that attracts a great deal of research in the fields of classification, function estimation problems, and distribu- tion estimation [2]. The generalization ability of SVM depends on the proper choosing of a set of two adjustable parameters which is called SVM model selection problem [3–5]. Another important feature of SVM is its sparseness prop- erty which allows only a small part of training data named support vectors (SVs) contributes in construction of final hyper-plane. This causes that SVM model has a small size, and hence, less time is consumed in testing phase in comparison with a model which built up with contribution of all training data. The solution of a model selection problem not only controls the generalization performance, but also affects on the SVM model size. Large problems generate large data sets, and consequently in these data sets, the SVM model size (number of SVs) will increase. It is antici- pated from SVM as a sparse machine learning method to deal with this problem, but the model reduction is not as much as expected for real-world application and the number of support vectors increases with the size of data set. Generally, two crucial problems arise in SVM applica- tions. The first is the lack of a certain method for tuning SVM parameters, and the other one is the size of model in large data set. In fact, the model selection problem plays an important role in SVM generalization performance for either small or large size of data sets, but for large real- world data sets, the model selection complexity dramati- cally increases. Various model selection methods have proposed by considering different criteria such as Jaakkola–Haussler bound [6], Opper–Winther bound [7], span bound [8], radius/margin bound [9], distance between two classes & Mohammad Hassan Khooban khooban@sutech.ac.ir 1 Young Researchers and Elite Club, Mashhad Branch, Islamic Azad University, Mashhad, Iran 2 Department of Electrical Engineering, Shiraz University of Technology, Shiraz, Iran 123 Neural Comput & Applic DOI 10.1007/s00521-017-2930-y
  • 2. [10], and v-fold cross-validation [11]. Generally, gradient decent method-based algorithms are used to optimize dif- ferentiable criteria. Although these methods are fast, the algorithm may stuck in local minima and therefore not applicable for all aforesaid criteria [4, 9, 10, 12, 13]. To overcome these drawbacks, global optimization methods such as PSO [14–16], simulated annealing [17], ant colony [18], GA [19, 20] have introduced for non-differentiable and non-smooth cost function optimization problems. More recently, a PSO-based method has proposed which uses PSO method to tune SVM parameters and evolve more artificial instances to make imbalanced data sets balanced [21]. Many researchers have used v-fold cross-validation instead of conventional validation in their research works to evaluate the generalization performance, because in some cases there are not enough data available to partition it into separate training and test sets without losing sig- nificant modeling or testing capability [4, 9, 11, 22–26]. Moreover, the aim in v-fold cross-validation is to ensure that every datum from the original data set has the same chance of being in both the training and the testing sets. The main contribution of this paper is summarized as follows: (1) a new criterion is proposed for model selection problem to consider both tuning a SVM parameters and model size reduction at once. Building a parsimonious model and efficient tuning of SVM parameters play important roles to reduce the testing time and increase the generalization performance of a SVM, respectively. To concurrently achieve these goals, a two-term cost function consisting of sparseness and generalization performance measures of SVM is proposed. (2) To achieve the global optimal solution of the proposed cost function, a new adaptive particle swarm optimization (APSO) is also pro- posed to solve optimization problem. APSO uses a success rate feedback to update inertia weight, and also, its cog- nitive and social weights are adaptively changed during the optimization process to improve the APSO performance. The efficiency of APSO is evaluated by comparing to standard PSO in optimizing static benchmark test func- tions. Finally, the effectiveness of proposed cost function is assessed in comparison with one-term cost function con- sisting of generalization performance criterion on nine data sets. This rest of this paper is organized as follows. The SVM formulation for binary classification is reviewed in Sect. 2. In Sect. 3.1, generalized v-fold cross-validation formula- tion is stated, then in Sect. 3.2, new APSO is introduced, and finally, in Sect. 3.3, the proposed model selection is proposed. Section 4 begins with stating the experiment conditions, and then, the experimental results are dis- cussed. Finally, conclusions are drawn in Sect. 5. 2 Support vector machine Assume a given two-class labeled data set as follows X = (xi, yi). Each data point xi 2 Rn belongs to either of two classes as determined by a corresponding label yi 2 { -1, 1} for i = 1, …, n. The optimal hyper-plane is obtained by solving a quadratic optimization problem Eq. (1). Min u w; nð Þ ¼ 1 2 wT w þ C Xn i¼1 ni s:t: yi wT :xi þ bð Þ ! 1 À ni; i ¼ 1; 2; . . .; n ni ! 0; i ¼ 1; 2; . . .; n ð1Þ where ni is a slack variable that represents the violation of pattern separation condition for each of the data and C is a penalty factor called regularization parameter for controlling the SVM model complexity. This is one of the model selection parameters in the SVM formulation. For nonlinear separable data, a kernel trick is utilized to map the input space into a high-dimensional space named feature space. Then, the optimal hyper-plane is obtained in the feature space. The primal optimal prob- lem Eq. (1) is transformed into its dual form written as below: Max Q að Þ ¼ 1 2 Xn i¼1 Xn j¼1 aiajyiyjk xi; xj À Á À Xn j¼1 aj s:t: Pn j¼1 aiyi ¼ 0 0 ai C; i ¼ 1; . . .; n ð2Þ where k(., .) is a kernel function. Some of the conventional kernel functions are listed in Table 1. Kernel parameter highly affects on generalization performance as well as the model complexity of SVM. Therefore, kernel parameters are considered as the other model selection parameter. Furthermore, in Eq. (2), a = (a1, …, an) is the vector of non-negative Lagrange multipliers [1]. The solution vector a = (a1, …, an) is sparse, i.e., ai ¼ 0 for most indices of training data. This is the so-called SVM sparseness prop- erty. The points xi that correspond to nonzero ai are called Table 1 Conventional kernel functions Name Kernel function expression Linear kernel k(xi, xj) = xi T xj Polynomial kernel kðxi; xjÞ ¼ ðtà þ xT i xjÞdà RBF kernel kðxi; xjÞ ¼ expðÀxi À x2 j =r2Ã Þ MLP kernel kðxi; xjÞ ¼ tan hðbà 0xT i xj þ bà 1Þ * Kernel parameter Neural Comput & Applic 123
  • 3. support vectors. Therefore, the points xi that correspond to ai = 0 have no contribution in construction of the optimal hyper-plane and only a part of training data, support vec- tors, constructs the optimal hyper-plane. Let v be the index set of support vectors; then, the optimal hyper-plane is f xð Þ ¼ X#sv i2m aiyik xi; xj À Á þ b ¼ 0 ð3Þ and the resulting classifier is y xð Þ ¼ sgn X#sv i2m aiyik xi; xj À Á þ b " # ð4Þ where b is shown the bias parameter and determined by Karush–Kuhn–Tucker (KKT) conditions [1]. 3 Proposed model selection 3.1 Generalized v-fold cross-validation criterion Generalized v-fold cross-validation (CV) criterion was first introduced by Craven et al. [27]. Consider a given training set of n data points as follows fðxk; ykÞjk ¼ 1; 2; . . .; ng. The following definition is assumed to formulate general- ized v-fold CV estimator. Definition 3.1 (Linear smoother) An estimator ^f of f is called a linear smoother if, for each x 2 Rd , there exists a vector LðxÞ ¼ ðl1ðxÞ; . . .; lnðxÞÞT 2 Rn such that ^f xð Þ ¼ Xn k¼1 lk xð ÞYi: ð5Þ In matrix form, this can be written as ^f ¼ LY, with L 2 RnÂn and L is called the smoother matrix. Craven et al. [27] demonstrated that the deleted residuals Yk À ^fðÀkÞ ðXk; hÞ can be written in terms of Yk À ^fðXk; hÞ and the trace of the smoother matrix L. Moreover, the smoother matrix depends on tunable parameter h ¼ ðc; rÞ. The generalized v-fold CV criterion satisfies Generalized vÀfold CV hð Þ ¼ 1 n Xn k¼1 YkÀ^f Xk; hð Þ 1 À nÀ1tr L hð Þ½ Š 2 : ð6Þ The generalized v-fold CV estimate of h can be obtained by minimizing (6). For more details, see [27, 28]. Li [29] and Cao et al. [30] investigated the effectiveness of generalized v-fold CV, finding that generalized v-fold CV was a robust criterion, and regardless of the magnitude of noise, always the same h is obtained. 3.2 Adaptive particle swarm optimization PSO is one of the modern population-based optimization algorithms first introduced by Kennedy and Eberhart [31]. It uses swarm of particles to find the global optimum solution in a search space. Each particle represents a candidate solution for the cost function, and it has its own position and velocity. Assume particle swarms are in D-dimensional search space. Let the ith particle in a D-dimensional space represented as xi = (xi1, …, xid, …, xiD). The best previous position of the ith particle is recorded and represented as pbi = (pbi1, …, - pbid, …, pbiD), which is called Pbest and given the best value in the cost function. General best position, gbest, is denoted by pgb and shown the best value of the Pbest among all the particles in the cost function. The velocity of the ith particle is represented as vi = (vi1, …, vid, …, viD). In each of the iterations, the velocity and the position of each particle are updated according to Eqs. (7) and (8), respectively. vid ¼ wvid þ C1r1 pbid À xidð Þ þ C2r2 pgb À xid À Á ð7Þ xid ¼ xid þ vid ð8Þ where w is an inertia weight and it is typically selected within an interval of [0, 1]. C1 is a cognition weight factor, C2 is a social weight factor, r1 and r2 are generated ran- domly within an interval of [0, 1]. Standard PSO has some shortcomings. It converges to local minima in multimodal optimization problem and also has some parameters which should be tuned to have an acceptable exploration and exploitation properties [32, 33]. In [34], by considering the stability condition and an adaptive inertia weight, the acceleration parameters of PSO are adaptively determined. A simple adaptive nonlinear strategy is introduced. This strategy mainly depends on each particle’s performance and determines each particle’s performance by calculating the absolute distance between each particle’s personal best (Pbest) and the global best position (gbest) among all particles in each iterations of algorithm [35]. In [36], the inertia weight is given by a function of evolution speed factor and aggregation degree factor, and the value of inertia weight is dynamically adjusted according to the evolution speed and aggregation degree. In order to improve the performance of standard PSO, the inertia, the cognition, and the social weight factors should be modified. In this paper, the main idea of modifying the inertia weight is inspired from 1 5 success rate introduced by Schwefel [37, 38] in evolution algorithms. Herein, in each iteration, the success rate of each particle is meant that a better cost function value is achieved by the Pbest in that each itera- tion in comparison with its previous iteration. The success rate is formulated in Eq. (9). Then, the percentage of success rate is calculated by using Eq. (10). Neural Comput Applic 123
  • 4. SucessRate ¼ 1 if CostFcn Pbestiter i À Á CostFcn PbestiterÀ1 i À Á 0 Otherwise ð9Þ PSucc ¼ Pn i¼1 SucessRate i; tð Þ n ð10Þ where n is the number of particles. Now the value of PSucc can vary within an interval of [0, 1]. It is transparent that while PSucc is high for a particle, Pbest for that particle is far from the optimum point of cost function and vice versa. Therefore, the inertia weight should be correlated with PSucc. Because of frequent use of linear form for presenting the inertia weight, we formulate the function of the inertia weight as a linear function of PSucc as follows: w iterð Þ ¼ wmax À wminð ÞPsucc þ wmin ð11Þ The range of the inertia weight [wmin, wmax] is selected to be [0.2, 0.9]. In order to control the trade-off between exploitation and exploration properties of PSO algorithm at the beginning of the optimization process, a large value for the cognitive weight and a small value for the social weight should be chosen. Therefore, the exploration property of PSO is enhanced. In contrast, close to ending stages of PSO algorithm, a small cognitive weight and a large social weight should be assigned in such a way to improve the algorithm convergence to the global optimum point [39]. Therefore, it is necessary to change the cognitive weight and social weight during the optimization process adap- tively. To this end, the following formula for APSO is utilized [32, 33, 38]: If C1 final C1 initial , C1 ¼ Cfinal 1 À Cinitial 1 À Á iter itermax þ Cinitial 1 ð12Þ If C2 final [ C2 initial , C2 ¼ Cfinal 2 À Cinitial 2 À Á iter itermax þ Cinitial 2 ð13Þ where the superscripts ‘‘initial’’ and ‘‘final’’ indicate the initial and final values of the cognition weight and the social weight factor, respectively. To proof the superior performance of APSO, it is compared with a standard PSO in optimizing three com- mon static benchmark test functions. Finally, APSO is used to solve the model selection problem in SVM. The test functions are used to investigate the convergence speed and solution quality of PSO and APSO. Table 2 provides a detailed description of these functions. All the test func- tions are a minimization problem. The first function (Rosenbrock) is a unimodal function while the rest of the functions (Rastrigin and Ackly) are multimodal optimiza- tion problems. The termination criterion of both PSO and APSO is determined by reaching the maximum iteration number. In this study, the maximum number of iterations and the number of particles for both algorithms are selected to be 50 and 30, respectively. The dimension of the search space (D) is 30. For all test function, xà is the best solution of test function and fðxÃ Þ represents the best achievable fitness for that functions. Figure 1 shows the comparison results of PSO and APSO based on the final accuracy and the convergence speed over 100 iterations. These results demonstrate that APSO has a considerable higher performance in both unimodal and multimodal optimization problems. In solving model selection problem of SVM, APSO is used to optimize the proposed cost function; after the maximum number of iteration reached, global best particle represents an optimal solution consisting of the best regu- lation parameter and the best kernel parameter for SVM model. 3.3 Proposed cost function for model selection problem A successful selection of SVM model is based on two important parameters affecting both generalization perfor- mance and model size of SVM. As we discussed earlier, those two parameters are regularization and kernel parameters. In non-separable problems, noisy training data will introduce slack variables to measure their violation of the margin. Therefore, a penalty factor C is considered in SVM formulation to control the amount of margin vio- lation. In other words, the penalty factor C is defined to determine the trade-off between minimizing empirical error and structural risk error and also to guarantee the accuracy of classifier outcome in the presence of noisy training data. Selecting a large value for C causes the margin to be hard, and the cost of violation becomes too high, so the separating model surface over-fits the training data. In contrast, choosing a small value for C allows the margin to be soft, which results in under-fitting separating model surface. In both cases, the generalization perfor- mance of classifier is unsatisfactory, so it makes the SVM model useless [40]. Kernel parameter(s) are implicitly characterizing the geometric structure of data in high-dimensional space named feature space. In feature space, the data become linearly separable in such a way that the maximal margin of separation between two classes is achieved. The selection of kernel parameter(s) will change the shape of the separating surface in input space. Selecting improp- erly large or small value for the kernel parameter results Neural Comput Applic 123
  • 5. over-fitting or under-fitting problem in the SVM model, so the model is unable to accurately classify data set [13, 41]. Therefore, we define a model selection problem as an optimization problem by proposing a cost function which can concurrently boost up both generalization performance and sparseness property of a SVM. Although only con- sidering generalization performance error obtained from the generalized v-fold CV method as the model selection criterion guarantees high generalization performance of the model, no avoidance from over-/under-fitting problem and also no steering toward improving the sparseness property of SVM are observed which are more possible in real data sets, because of large number of SVs. The one-term cost function consisting of a generalized v-fold CV error is defined as follows: One-Term Cost Fun ¼ Generalized v-fold CV Error ð14Þ a modification needs to be applied to overcome men- tioned drawbacks of one-term cost function. Finally, the proposed two-term cost function is formulated as follows: Two-Term Cost Fun ¼ a1 Á Generalized v-fold CV Error þ a2 Á Sparseness ð15Þ where a1 = 0.8 and a2 ¼ 0:2 are the coefficients showing the significant of Generalized v-fold CV Error and Sparseness in the cost function, respectively. Sparseness term is obtained by dividing total number of SVs by the total number of training data. The proposed cost function is the weighted sum of the generalized v-fold cross-validation error and a sparseness property of SVM. By considering the SVM sparseness as the second term of the cost func- tion, the over-/under-fitting problem is controlled. There- fore, the sparsity of the solution is improved and the model size as well as testing time is decreased. 4 Computational experiments 4.1 Experimental configuration To evaluate the performance of the proposed cost function, a PC with configuration of Dual-Core E2160@1.8 GHz CPU and 1 GB RAM is utilized. Nine commonly used data sets of UCI database in the literature c used to assess the effectiveness of the proposed cost function in comparison with one-term cost function in solving model selection problem. The v value in generalized v-fold CV is consid- ered to be 10 in this study. Data sets descriptions are presented in Table 3. Although the proposed method could Table 2 Benchmark test functions [34] Function name Test function #Dim Search space xà f(xà ) Rosenbrock f(x) = P i=1 D-1 [100(xi 2 - xi?1)2 ? (xi - 1)2 30 [-5, 10]D [0,…,0] 0 Rastrigin f(x) = P i=1 D (xi 2 - 10 cos(2pxi) ? 10 30 [-5.12, 5.12]D [0,…,0] 0 Ackly fðxÞ ¼ À20 expðÀ0:2 ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 30 PD i¼1 x2 i q Þ À expð1 D PD i¼1 cos 2pxiÞ þ 20 þ e 30 [-32, 32]D [0,…,0] 0 (a) (b) (c) 0 10 20 30 40 50 60 70 80 90 100 0 5 10 15 x 10 5 Iteration Rosenbrock Function PSO APSO 0 10 20 30 40 50 60 70 80 90 100 100 150 200 250 300 350 400 450 500 Iteration Rastregin Function PSO APSO 0 10 20 30 40 50 60 70 80 90 100 1.68 1.7 1.72 1.74 1.76 1.78 1.8 Iteration Ackely Function PSO APSO Fig. 1 Comparison results between PSO algorithm and new APSO algorithm on three benchmark test functions, a Rosenbrock, b Rastrigin, c Ackley Neural Comput Applic 123
  • 6. be applied to every kernel functions, all experiments reported here are implemented by using RBF kernel for the following reasons: The RBF kernel non-linearly maps data sets into the feature space so it can handle the data sets when the relation between desired output and input attri- butes is nonlinear. The second reason is less number of hyper-parameters which influences on the complexity of model selection problem. Finally, RBF kernel has less numerical difficulties [10, 13, 41]. As a result, the model selection parameters are regularization parameter (C) and RBF kernel parameter (r). The search space for C and r, model selection range, is set to be [1, 1000] and [0.01, 100], respectively. The performance of SVM model is achieved by averaging over 1000 optimal models made out of optimal parameters. 4.2 Experimental results and discussion For each data set of Table 3, a comparative study between the optimal models obtained by the proposed two-term cost function and one-term cost function is performed. In the comparative study, the generalization performance accu- racy, the model size, and the testing time are discussed. The results of the comparative study for data sets are presented in Table 4. Table 4 is shown that the parsimonious model obtained from the two-term cost function has a remarkable effect on reducing the model size in com- parison with the model obtained from the one-term cost function; consequently, the testing time is considerably reduced. Overall, all data sets are shown on average 46% reduction in model size and on average 37% reduction in testing time. For instance, for smallest data set of experiment (Wine) and the largest data set (DNA), model size reduction is 58 and 64%, respectively, and the testing time reduction is about 26.51, 66.00% in comparison with one-term cost function. Table 3 Description of data sets Data set name #Data #Feature Wine 178 13 Ionosphere 351 35 Breast cancer 699 10 German 1000 20 Splice 2991 60 Waveform 5000 21 Two norm 7400 20 Banana 10,000 2 DNA 10,372 181 Table 4 Results of comparative study for one-term and two-term cost functions on nine data sets Data set Cost function Accuracy Model size Testing time % (±SD) Reduction (%) #SVs (±SD) Reduction (%) (s) Reduction (%) Wine One-term 99.62 ± 0.57 -0.78 28.53 ± 2.59 58.67 2.49 26.51 Two-term 98.84 ± 0.21 11.79 ± 1.65 1.83 Ionosphere One-term 91.86 ± 1.87 -0.86 117.11 ± 4.10 43.28 3.90 25.38 Two-term 91.07 ± 1.19 66.42 ± 4.35 2.91 Breast cancer One-term 97.07 ± 0.68 -0.42 60.45 ± 4.03 55.99 3.41 24.34 Two-term 96.66 ± 0.83 26.60 ± 4.38 2.58 German One-term 72.78 ± 0.52 -0.74 409.26 ± 8.43 36.70 29.92 35.53 Two-term 72.24 ± 0.43 259.03 ± 8.07 19.29 Splice One-term 90.04 ± 0.69 -0.99 1029.73 ± 16.05 42.67 209.19 49.72 Two-term 89.16 ± 0.70 590.23 ± 18.38 105.17 Waveform One-term 90.32 ± 0.49 -0.12 722.60 ± 17.85 38.84 234.61 33.10 Two-term 90.20 ± 0.47 441.94 ± 19.16 156.94 Two norm One-term 97.78 ± 0.19 -0.06 398.20 ± 11.51 43.21 190.50 49.85 Two-term 97.72 ± 0.16 226.13 ± 12.38 95.52 Banana One-term 96.28 ± 0.20 -0.23 705.12 ± 43.70 32.12 485.70 28.58 Two-term 96.05 ± 0.21 478.6 ± 34.28 346.84 DNA One-term 95.60 ± 1.80 -1.07 1180.11 ± 154.85 64.59 565.97 66.00 Two-term 94.57 ± 1.24 417.79 ± 38.12 192.39 Neural Comput Applic 123
  • 7. One-term Two-term One-term Two-term 0 0.5 1 1.5 2 2.5 3 -2 -1 0 1 2 0.25 0.3 0.35 0.4 log10 (c) log10 (σ) 0 0.5 1 1.5 2 2.5 3 -2 -1 0 1 2 0.3 0.32 0.34 0.36 0.38 0.4 log10 (c) log10 (σ) 0 0.5 1 1.5 2 2.5 3 -2 -1 0 1 2 0 0.05 0.1 0.15 0.2 log10 (c) log10 (σ) 0 0.5 1 1.5 2 2.5 3 -2 -1 0 1 2 0 0.05 0.1 0.15 0.2 log10 (c) log10 (σ) (a) (b) Fig. 2 Two examples of model selection problem with one-term and two-term cost functions for data sets described in Table 3, a German, b Banana Two norm Splice DNA One-term Two-term 0 10 20 30 40 50 60 70 80 90 100 Accuracy (%) One-term Two-term 0 50 100 150 200 250 300 350 400 Model Size (#SVs) One-term Two-term 0 20 40 60 80 100 120 140 160 180 200 Testing Time (Sec.) One-term Two-term 0 10 20 30 40 50 60 70 80 90 100 Accuracy (%) One-term Two-term 0 200 400 600 800 1000 1200 Model Size (#SVs) One-term Two-term 0 50 100 150 200 250 Testing Time (Sec.) One-term Two-term 0 10 20 30 40 50 60 70 80 90 100 Accuracy (%) One-term Two-term 0 200 400 600 800 1000 1200 Model Size (#SVs) One-term Two-term 0 100 200 300 400 500 600 Testing Time (Sec.) Fig. 3 Three visual examples of one-term cost function (blue) and two-term cost function (green) extracted from Table 4. Accuracy (left bars), model size (middle bars), and testing time (right bars) (colour figure online) Neural Comput Applic 123
  • 8. Although it is expected that by reducing model size, the generalization performance considerably decreased, experimental results show that only a slight drop happened in the generalization performance for all data sets. The accuracies reduction ranges are below 0.58% on average for all data sets. By considering the importance of time consuming, a slight decrease in generalization performance of SVM is acceptable. The parameters of the optimal model selection process obtained by APSO are shown in ‘‘Appendix’’. In Fig. 2, two examples of one-term and two-term cost functions surfaces are plotted versus two model selection parameters to present the difference between one-term and proposed two-term cost functions. In addition, three examples of obtained results listed in Table 4 are visualized in Fig. 3, to show the efficiency of proposed two-term cost function over one-term cost function. 5 Conclusion A new two-term cost function based on the generalized v-fold generalization performance and the sparseness property of SVM proposed for the SVM model selection problem. In addition, a new APSO introduced to solve the non-convex and multimodal optimization problem. The feasibility of this cost function in comparison with one-term cost function evaluated on nine data sets. The proposed cost function shows an acceptable loss in generalization performance while providing a parsimo- nious model and avoiding SVM model from over-/under- fitting problem. The experimental results demonstrated that the parsimonious model has a lower model size on average 46% and less time consuming on average 37% in SVM testing phase in comparison with model obtained by the one-term cost function. Compliance with ethical standards Conflict of interest The authors declare that there is no conflict of interests regarding the publication of this paper. Appendix The optimal model selection parameters for all experiments data sets are presented in Table 5. References 1. Vapnik VN (1998) Statistical learning theory. Wiley, New York 2. Almasi ON, Rouhani M (2016) Fast and de-noise support vector machine training method based on fuzzy clustering method for large real world datasets. Turk J Electr Eng Comput 241:219–233 3. Peng X, Wang Y (2009) A geometric method for model selection in support vector machine. Expert Syst Appl 36:5745–5749 4. Wang S, Meng B (2011) Parameter selection algorithm for sup- port vector machine. Environ Sci Conf Proc 11:538–544 5. Chapelle O, Vapnik VN, Bousquet O, Mukherjee S (2002) Choosing multiple parameters for support vector machines. Mach Learn 461:131–159 6. Jaakkola T, Haussler D (1999) Probabilistic kernel regression models. Artif Int Stat 126:1–4 7. Opper M, Winther O (2000) Gaussian processes and SVM: mean field and leave-one-out estimator. In: Smola A, Bartlett P, Scholkopf B, Schuurmans D (eds) Advances in large margin classifiers. MIT Press, Cambridge, MA 8. Vapnik V, Chapelle O (2000) Bounds on error expectation for support vector machines. Neural Comput 12(9):2013–2016 9. Keerthi SS (2002) Efficient tuning of SVM hyperparameters using radius/margin bound and iterative algorithms. IEEE Trans Neural Netw 135:1225–1229 10. Sun J, Zheng C, Li X, Zhou Y (2010) Analysis of the distance between two classes for tuning SVM hyperparameters. IEEE Trans Neural Netw 212:305–318 Table 5 Optimal model selection parameters Data set Cost function C r Wine One-term 49.60 2.58 Two-term 855.06 13.08 Ionosphere One-term 31.08 2.86 Two-term 354.26 4.90 Breast cancer One-term 19.56 34.08 Two-term 997.48 26.32 German One-term 7.91 2.01 Two-term 24.34 5.73 Splice One-term 3.36 4.80 Two-term 636.01 25.82 Waveform One-term 1.01 2.73 Two-term 9.10 7.93 Two norm One-term 1.03 6.87 Two-term 992.21 53.36 Banana One-term 9.28 0.27 Two-term 25.50 0.30 DNA One-term 348.60 8.56 Two-term 870.91 52.83 Neural Comput Applic 123
  • 9. 11. Guo XC, Yang JH, Wu CG, Wang CY, Liang YC (2008) A novel LS-SVMs hyper-parameter selection based on particle swarm optimization. Neurocomputing 71:3211–3215 12. Glasmachers T, Igel C (2005) Gradient-based adaptation of general Gaussian kernels. Neural Comput 1710:2099–2105 13. Lin KM, Lin CJ (2003) A study on reduced support vector machines. IEEE Trans Neural Netw 146:1449–1459 14. Wang S, Meng B (2010) PSO algorithm for support vector machine. In: Electronic commerce and security conference, pp 377–380 15. Lei P, Lou Y (2010) Parameter selection of support vector machine using an improved PSO algorithm. In: Intelligent human–machine systems and cybernetics conference, pp 196–199 16. Lin SW, Ying KC, Chen SC, Lee ZJ (2008) Particle swarm optimization for parameter determination and feature selection of support vector machines. Expert Syst Appl 354:1817–1824 17. Zhang W, Niu P (2011) LS-SVM based on chaotic particle swarm optimization with simulated annealing and application. In: Intelligent control and information processing, 2011 2nd inter- national conference, vol 2, pp 931–935 18. Blondin J, Saad A (2010) Metaheuristic techniques for support vector machine model selection. In: Hybrid intelligent systems, 2010 10th international conference, pp 197–200 19. Almasi ON, Akhtarshenas E, Rouhani M (2014) An efficient model selection for SVM in real-world datasets using BGA and RGA. Neural Netw World 24(5):501 20. Lihu A, Holban S (2012) Real-valued genetic algorithms with disagreements. Stud Comp Intell 4(4):317–325 21. Cervantes J, Garcia-Lamont F, Rodriguez L, Lopez A, Castilla JR, Trueba A (2017) PSO-based method for SVM classification on skewed data sets. Neurocomputing 228:187–197 22. Williams P, Li S, Feng J, Wu S (2007) A geometrical method to improve performance of the support vector machine. IEEE Trans Neural Netw 183:942–947 23. An S, Liu W, Venkatesh S (2007) Fast cross-validation algo- rithms for least squares support vector machine and kernel ridge regression. Pattern Recognit 408:2154–2162 24. Huang CM, Lee YJ, Lin DK, Huang SY (2007) Model selection for support vector machines via uniform design. Comput Stat Data Anal 521:335–346 25. Almasi ON, Rouhani M (2016) A new fuzzy membership assignment and model selection approach based on dynamic class centers for fuzzy SVM family using the firefly algorithm. Turk J Electr Eng Comput 243:1797–1814 26. Almasi BN, Almasi ON, Kavousi M, Sharifinia A (2013) Com- puter-aided diagnosis of diabetes using least square support vector machine. J Adv Computer Sci Technol 22:68–76 27. Craven P, Wahba G (1978) Smoothing noisy data with spline functions. Numer Math 314:377–403 28. Efron B (1986) How biased is the apparent error rate of a pre- diction rule? J Am Stat Assoc 81394:461–470 29. Li KC (1987) Asymptotic optimality for Cp, CL, cross-validation and generalized cross-validation: discrete index set. Ann Stat 15(3):958–975 30. Cao Y, Golubev Y (2006) On oracle inequalities related to smoothing splines. Math Methods Stat 154:398–414 31. Kennedy J, Eberhart RC (2001) Swarm intelligence. Academic Press, USA 32. Beyer HG, Schwefel HP (2002) Evolution strategies: a compre- hensive introduction. Nat Comput 352:2002 33. Yuan X, Wang L, Yuan Y (2008) Application of enhanced PSO approach to optimal scheduling of hydro system. Energy Convers Manag 49:2966–2972 34. Taherkhani M, Safabakhsh R (2016) A novel stability-based adaptive inertia weight for particle swarm optimization. Appl Soft Comput 31:281–295 35. Chauhan P, Deep K, Pant M (2013) Novel inertia weight strate- gies for particle swarm optimization. Memet Comput 5:229–251 36. Yang X, Yuan J, Yuan J, Mao H (2007) A modified particle swarm optimizer with dynamic adaptation. Appl Math Comput 189:1205–1213 37. Schwefel HPP (1993) Evolution and optimum seeking: the sixth generation. John Wiley Sons, Inc 38. Almasi ON, Naghedi AA, Tadayoni E, Zare A (2014) Optimal design of T-S fuzzy controller for a nonlinear system using a new adaptive particle swarm optimization algorithm. J Adv Comput Sci Technol 31:37–47 39. Wang Y, Li B, Weise T, Wang J, Yuan B, Tian Q (2011) Self- adaptive learning based particle swarm optimization. Inf Sci 181:4515–4538 40. Keerthi SS, Lin CJ (2003) Asymptotic behavior of support vector machines with gaussian kernel. Neural Comput 157:1667–1689 41. Bordes A, Ertekin S, Weston J, Bottou L (2005) Fast kernel classifiers with online and active learning. J Mach Learn Res 6:1579–1619 Neural Comput Applic 123