Direct Kernel Least-Squares Support Vector Machines with Heuristic Regularization
Mark J. Embrechts
Department of Decision Sciences and Engineering Systems
Rensselaer Polytechnic Institute, Troy, NY 12180
E-mail:embrem@rpi.edu
Abstract – This paper introduces least-squares support vector machines as a direct kernel method, where the kernel is considered as a data pre-processing step. A heuristic formula for the regularization parameter is proposed based on preliminary scaling experiments.

I. INTRODUCTION

A. One-Layered Neural Networks for Regression

A standard (predictive) data mining problem is defined as a regression problem for predicting the response from descriptive features. In order to do so, we will first build a predictive model based on training data, evaluate the performance of this predictive model based on validation data, and finally use this predictive model to make actual predictions on test data for which we generally do not know (or pretend not to know) the response value.

It is customary to denote the data matrix as X_nm and the response vector as y_n. In this case, there are n data points and m descriptive features in the dataset. We would like to infer y_n from X_nm by induction, denoted as X_nm ⇒ y_n, in such a way that our inference model works not only for the training data, but also does a good job on the out-of-sample data (i.e., validation data and test data). In other words, we aim to build a linear predictive model of the type:

  ŷ_n = X_nm w_m        (1)

The hat symbol indicates that we are making predictions that are not perfect (especially for the validation and test data). Equation (1) is the answer to the question "wouldn't it be nice if we could apply wisdom to the data, and pop comes out the answer?" The vector w_m is that wisdom vector, and it is usually called the weight vector in machine learning.

There are many different ways to build such predictive regression models. Just to mention a few possibilities here, the regression model could be a linear statistical model, a neural network-based model (NN), or a Support Vector Machine (SVM)-based model.[1-3] Examples of linear statistical models are Principal Component Regression models (PCR) and Partial Least-Squares models (PLS). Popular examples of neural network-based models include feedforward neural networks (trained with one of the many popular learning methods), Self-Organizing Maps (SOMs), and Radial Basis Function Networks (RBFN). Examples of support vector machine algorithms include the perceptron-like support vector machines (SVMs) and Least-Squares Support Vector Machines (LS-SVM), also known as kernel ridge regression. A straightforward way to estimate the weights is outlined in Equation (2):

  X_mn^T X_nm w_m = X_mn^T y_n
  (X_mn^T X_nm)^{-1} (X_mn^T X_nm) w_m = (X_mn^T X_nm)^{-1} X_mn^T y_n        (2)
  w_m = (X_mn^T X_nm)^{-1} X_mn^T y_n

Predictions for the training set can now be made for y by substituting (2) into (1):

  ŷ_n = X_nm (X_mn^T X_nm)^{-1} X_mn^T y_n        (3)

Before applying this formula for a general prediction, proper data preprocessing is required. A common procedure in data mining is to center all the descriptors and to bring them to unit variance. The same process is then applied to the response. This procedure of centering and variance normalization is known as Mahalanobis scaling. While Mahalanobis scaling is not the only way to pre-process the data, it is probably the most general and the most robust way to do pre-processing that applies well across the board. If we represent a feature vector as z, Mahalanobis scaling will result in a rescaled feature vector z′ and can be summarized as:

  z′ = (z − z̄) / std(z)        (4)

where z̄ represents the average value and std(z) represents the standard deviation of attribute z.

Making a test model proceeds in a very similar way as for training: the "wisdom vector" or weight vector will now be applied to the test data to make predictions according to:

  ŷ_k^test = X_km^test w_m        (5)

In the above expression it was assumed that there are k test data, and the superscript "test" is used to explicitly indicate that the weight vector will be applied to a set of k test data with m attributes or descriptors.
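The training recipe of Eqs. (2)-(5) is short enough to sketch directly in code. The following is an illustrative NumPy sketch on synthetic data; the function names and the toy dataset are invented for this example and are not part of the paper or its software:

```python
import numpy as np

def mahalanobis_scale(X, mean=None, std=None):
    """Mahalanobis scaling, Eq. (4): center each column and bring it to
    unit variance; training statistics can be reused for test data."""
    if mean is None:
        mean, std = X.mean(axis=0), X.std(axis=0)
    return (X - mean) / std, mean, std

def fit_weights(X, y):
    """Left-hand pseudo-inverse solution of Eq. (2):
    w_m = (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data: n = 50 points, m = 3 descriptors, known linear response
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=50)

Xs, x_mean, x_std = mahalanobis_scale(X)
ys, y_mean, y_std = mahalanobis_scale(y)   # the response is scaled too
w = fit_weights(Xs, ys)                    # "wisdom" vector of Eq. (2)
y_hat = Xs @ w                             # training predictions, Eq. (3)
```

For the k test points of Eq. (5), one would scale the test matrix with the stored training statistics, `mahalanobis_scale(X_test, x_mean, x_std)`, before applying the same weight vector.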
If one considers testing for one sample data point at a time, Eq. (5) can be represented as a simple neural network with an input layer and just a single neuron, as shown in Fig. 1. The neuron produces the weighted sum of the average input features. Note that the transfer function, commonly found in neural networks, is not present here. Note also that the number of weights for this one-layer neural network equals the number of input descriptors or attributes.

Fig. 1. Neural network representation for regression

B. The Machine Learning Dilemma

Equations (2) and (3) contain the inverse of the feature kernel, K_F, defined as:

  K_F = X_mn^T X_nm        (9)

The feature kernel is an m × m symmetric matrix where each entry represents the similarity between features. Obviously, if there were two features that were completely redundant, the feature matrix would contain two columns and two rows that are (exactly) identical, and the inverse would not exist. One can argue that all is still well, and that in order to make the simple regression method work one would just make sure that the same descriptor or attribute is not included twice. By the same argument, highly correlated descriptors (i.e., "cousin features" in data mining lingo) should be eliminated as well. While this argument sounds plausible, the truth of the matter is more subtle. Let us repeat Eq. (2) and go just one step further, as shown below:

  X_mn^T X_nm w_m = X_mn^T y_n
  (X_mn^T X_nm)^{-1} (X_mn^T X_nm) w_m = (X_mn^T X_nm)^{-1} X_mn^T y_n        (10)
  w_m = (X_mn^T X_nm)^{-1} X_mn^T y_n
  w_m = X_mn^T (X_nm X_mn^T)^{-1} y_n

Eq. (10) is the derivation of an equivalent linear formulation to Eq. (2), based on the so-called right-hand pseudo-inverse or Penrose inverse, rather than the more common left-hand pseudo-inverse. It was not shown here how the last line follows from the previous equation; the proof is straightforward and left as an exercise to the reader. Note that the inverse is now needed for a different entity matrix, which has an n × n dimensionality and is called the data kernel, K_D, as defined by:

  K_D = X_nm X_mn^T        (11)

The right-hand pseudo-inverse formulation is less frequently cited in the literature, because it can only be non-rank-deficient when there are more descriptive attributes than data points, which is not the usual case for data mining problems (except for data strip mining[17] cases). The data kernel matrix is a symmetric matrix that contains entries representing similarities between data points. The solution to this problem seems to be straightforward. We will first try to explain here what seems to be an obvious solution, and then actually show why this won't work.

Looking at Eqs. (10) and (11) it can be concluded that, except for rare cases where there are as many data records as there are features, either the feature kernel is rank deficient (in case m > n, i.e., there are more attributes than data), or the data kernel is rank deficient (in case n > m, i.e., there are more data than attributes). It can now be argued that for the m < n case one can proceed with the usual left-hand pseudo-inverse method of Eq. (2), and that for the m > n case one should proceed with the right-hand pseudo-inverse, or Penrose inverse, following Eq. (10).

While the approach just proposed seems reasonable, it will not work well in practice. Learning occurs by discovering patterns in data through redundancies present in the data. Data redundancies imply that there are data present that seem to be very similar to each other (and that have similar values for the response as well). An extreme example of data redundancy would be a dataset that contains the same data point twice. Obviously, in that case, the data matrix is ill-conditioned and the inverse does not exist. This type of redundancy, where data just repeat themselves, will be called here a "hard redundancy." However, for any dataset that one can possibly learn from, there have to be many "soft redundancies" as well. While these soft redundancies will not necessarily make the data matrix ill-conditioned in the sense that the inverse does not exist because the determinant of the data kernel is zero, in practice this determinant will be very small. In other words, regardless of whether one proceeds with a left-hand or a right-hand inverse, if the data contain information that can be learnt from, there have to be soft or hard redundancies in the data. Unfortunately, Eqs. (2) and (10) can't be solved for the weight vector in that case, because the kernel will either be rank deficient (i.e., ill-conditioned) or poorly conditioned, i.e., calculating the inverse will be numerically unstable. We call this phenomenon "the machine learning dilemma:" (i) machine learning from data can only occur when data contain redundancies; (ii) but, in that case, the kernel inverse in Eq. (2) or Eq. (10) is either not defined or numerically unstable because of poor conditioning. Taking the inverse of a poorly conditioned matrix is possible, but the inverse is not "sharply defined" and most numerical methods, with the exception of methods based on singular value decomposition (SVD), will run into numerical instabilities. The data mining dilemma seems to have some similarity with the uncertainty principle in physics, but we will not try to draw that parallel too far.

Statisticians have been aware of the data mining dilemma for a long time, and have devised various methods around this paradox. In the next sections, we will propose several methods to deal with the data mining dilemma and obtain efficient and robust prediction models in the process.

C. Regression Models Based on the Data Kernel

Reconsider the data kernel formulation of Eq. (10) for predictive modeling. There are several well-known methods for dealing with the data mining dilemma by using techniques that ensure that the kernel matrix will not be rank deficient anymore. Two well-known methods are principal component regression and ridge regression.[5] In order to keep the mathematical diversions to a bare minimum, only ridge regression will be discussed.

Ridge regression is a very straightforward way to ensure that the kernel matrix is positive definite (or well-conditioned) before inverting the data kernel. In ridge regression, a small positive value, λ, is added to each element on the main diagonal of the data matrix. Usually the same value for λ is used for each entry. Obviously, we are not solving the same problem anymore. In order not to deviate too much from the original problem, the value for λ will be kept as small as we reasonably can tolerate. A good choice for λ is a small value that will make the newly defined data kernel matrix barely positive definite, so that the inverse exists and is mathematically stable. In data kernel space, the solution for the weight vector that will be used in the ridge regression prediction model now becomes:

  w_m = X_mn^T (X_nm X_mn^T + λI)^{-1} y_n        (12)

and predictions for y can now be made according to:

  ŷ = X_nm X_mn^T (X_nm X_mn^T + λI)^{-1} y_n
    = K_D (K_D + λI)^{-1} y_n        (13)
    = K_D w_n

where a very different weight vector was introduced: w_n. This weight vector is applied directly to the data kernel matrix (rather than the training data matrix) and has the same dimensionality as the number of training data. To make a prediction on the test set, one proceeds in a similar way, but applies the weight vector to the data kernel for the test data, which is generally a rectangular matrix, and projects the test data onto the training data according to:

  K_D^test = X_km^test (X_mn^train)^T        (14)

where it is assumed that there are k data points in the test set.

II. THE KERNEL TRANSFORMATION

The kernel transformation is an elegant way to make a regression model nonlinear. The kernel transformation goes back at least to the early 1900's, when Hilbert addressed kernels in the mathematical literature. A kernel is a matrix containing similarity measures for a dataset: either between the data of the dataset itself, or with other data (e.g., support vectors[1,3]). A classical use of a kernel is the correlation matrix used for determining the principal components in principal component analysis, where the feature kernel contains linear similarity measures between (centered) attributes. In support vector machines, the kernel entries are similarity measures between data rather than between features, and these similarity measures are usually nonlinear, unlike the dot product similarity measure that we used before to define a kernel. There are many possible nonlinear similarity measures, but in order to be mathematically tractable the kernel has to satisfy certain conditions, the so-called Mercer conditions:[1]

          | k_11  k_12  ...  k_1n |
  K_nn =  | k_21  k_22  ...  k_2n |        (15)
          | ...   ...   ...  ...  |
          | k_n1  k_n2  ...  k_nn |

The expression above introduces the general structure for the data kernel matrix, K_nn, for n data. The kernel matrix is a symmetric matrix where each entry contains a (linear or nonlinear) similarity between two data vectors. There are many different possibilities for defining similarity metrics, such as the dot product, which is a linear similarity measure, and the Radial Basis Function kernel or RBF kernel, which is a nonlinear similarity measure.
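In code, the ridge solution in data kernel space (Eqs. (11)-(14)) amounts to a single regularized linear solve. The following is a minimal NumPy sketch with a linear data kernel; the toy data and the small λ value are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.normal(size=(40, 5))   # n = 40 training data, m = 5 attributes
y_train = rng.normal(size=40)
X_test = rng.normal(size=(10, 5))    # k = 10 test data
lam = 0.05                           # small, arbitrary ridge parameter

# Data kernel, Eq. (11): n x n matrix of (linear) similarities
K_D = X_train @ X_train.T

# Weight vector in data kernel space, Eq. (13): w_n = (K_D + lam I)^{-1} y_n
w_n = np.linalg.solve(K_D + lam * np.eye(K_D.shape[0]), y_train)

# Training predictions: y_hat = K_D w_n
y_hat_train = K_D @ w_n

# Test kernel, Eq. (14): k x n projection of the test data on the training data
K_D_test = X_test @ X_train.T
y_hat_test = K_D_test @ w_n
```

Note that only an n × n system is solved, and the same n-dimensional weight vector w_n serves both the square training kernel and the rectangular test kernel.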
The RBF kernel is the most widely used nonlinear kernel, and its entries are defined by

  k_jl ≡ exp( −‖x_j − x_l‖₂² / (2σ²) )        (16)

Note that in the kernel definition above, the kernel entry contains the square of the Euclidean distance (or two-norm) between data points, which is a dissimilarity measure (rather than a similarity), in a negative exponential. The negative exponential also contains a free parameter, σ, which is the Parzen window width for the RBF kernel. The proper choice of the Parzen window is usually determined by additional tuning, also called hyper-tuning, on an external validation set. The precise choice for σ is not crucial; there usually is a relatively broad range of choices for σ for which the model quality should be stable.

Different learning methods distinguish themselves in the way by which the weights are determined. Obviously, the model in Eqs. (12)-(14) to produce estimates or predictions for y is linear. Such a linear model has a handicap in the sense that it cannot capture inherent nonlinearities in the data. This handicap can easily be overcome by applying the kernel transformation directly as a data transformation. We will therefore not operate directly on the data, but on a nonlinear transform of the data, in this case the nonlinear data kernel. This is very similar to what is done in principal component analysis, where the data are substituted by their principal components before building a model. A similar procedure will be applied here, but rather than substituting data by their principal components, the data will be substituted by their kernel transform (either linear or nonlinear) before building a predictive model.

The kernel transformation is applied here as a data transformation in a separate pre-processing stage. We actually replace the data by a nonlinear data kernel and apply a traditional linear predictive model. Methods where a traditional linear algorithm is used on a nonlinear kernel transform of the data are introduced here as "direct kernel methods." The elegance and advantage of such a direct kernel method is that the nonlinear aspects of the problem are captured entirely in the kernel and are transparent to the applied algorithm. If a linear algorithm was used before introducing the kernel transformation, the required mathematical operations remain linear. It is now clear how linear methods such as principal component regression, ridge regression, and partial least squares can be turned into nonlinear direct kernel methods, by using exactly the same algorithm and code: only the data are different, and we operate on the kernel transformation of the data rather than on the data themselves.

In order to make out-of-sample predictions on true test data, a similar kernel transformation needs to be applied to the test data, as shown in Eq. (14). The idea of direct kernel methods is illustrated in Fig. 2, by showing how any regression model can be applied to kernel-transformed data. One could also represent the kernel transformation in a neural network type of flow diagram; the first hidden layer would now yield the kernel-transformed data, and the weights in the first layer would be just the descriptors of the training data. The second layer contains the weights that can be calculated with a hard computing method, such as kernel ridge regression. When a radial basis function kernel is used, this type of neural network would look very similar to a radial basis function neural network, except that the weights in the second layer are calculated differently.

Fig. 2. Direct kernels as a data pre-processing step

A. Dealing with Bias: Centering the Kernel

There is still one important detail that was overlooked so far, and that is necessary to make direct kernel methods work. Looking at the prediction equations in which the weight vector is applied to data, as in Eq. (1), there is no constant offset term or bias. It turns out that for data that are centered this offset term is always zero and does not have to be included explicitly. In machine learning lingo the proper name for this offset term is the bias, and rather than applying Eq. (1), a more general predictive model that includes this bias can be written as:

  ŷ_n = X_nm w_m + b        (17)

where b is the bias term. Because we made it a practice in data mining to center the data first by Mahalanobis scaling, this bias term is zero and can be ignored.

When dealing with kernels, the situation is more complex, as they need some type of bias as well. We will give only a recipe here that works well in practice, and refer the reader to the literature for a more detailed explanation.[3, 6]
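The RBF kernel entries of Eq. (16), on which the direct kernel methods above typically rely, can be computed for whole matrices at once. Below is an illustrative NumPy sketch; the data and the choice σ = 2 are invented for the example:

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """RBF kernel of Eq. (16): entries exp(-||a_j - b_l||^2 / (2 sigma^2))
    for all pairs of rows of A (n_a x m) and B (n_b x m)."""
    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * (A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 3))
K = rbf_kernel(X, X, sigma=2.0)   # square 6 x 6 training kernel
```

The same function yields the rectangular test kernel of Eq. (14) when called as `rbf_kernel(X_test, X_train, sigma)`, projecting test data on the training data.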
Even when the data were Mahalanobis-scaled before applying a kernel transform, the kernel still needs some type of centering to be able to omit the bias term in the prediction model. A straightforward way for kernel centering is to subtract the average from each column of the training data kernel, and store this average for later recall, when centering the test kernel. A second step for centering the kernel is going through the newly obtained vertically centered kernel again, this time row by row, and subtracting the row average from each horizontal row.

The kernel of the test data needs to be centered in a consistent way, following a similar procedure. In this case, the stored column centers from the kernel of the training data will be used for the vertical centering of the kernel of the test data. This vertically centered test kernel is then centered horizontally, i.e., for each row, the average of the vertically centered test kernel is calculated, and each horizontal entry of the vertically centered test kernel is substituted by that entry minus the row average.

Mathematical formulations for centering square kernels are explained in the literature.[3, 6] The advantage of the kernel-centering algorithm introduced in this section (and described above in words) is that it also applies to rectangular data kernels. The flow chart for pre-processing the data, applying a kernel transform on these data, and centering the kernel for the training data, validation data, and test data is shown in Fig. 3.

Fig. 3. Data pre-processing with kernel centering

B. Direct Kernel Ridge Regression

So far, the argument was made that by applying the kernel transformation in Eqs. (13) and (14), many traditional linear regression models can be transformed into a nonlinear direct kernel method. The kernel transformation and kernel centering proceed as data pre-processing steps (Fig. 2). In order to make the predictive model inherently nonlinear, the radial basis function kernel will be applied, rather than the (linear) dot product kernel used in Eqs. (2) and (10). There are actually several alternate choices for the kernel,[1-3] but the RBF kernel is the most widely applied kernel. In order to overcome the machine learning dilemma, a ridge can be applied to the main diagonal of the data kernel matrix. Since the kernel transformation is applied directly on the data, before applying ridge regression, this method is called direct-kernel ridge regression.

Kernel ridge regression and (direct) kernel ridge regression are not new. The roots of ridge regression can be traced back to the statistics literature.[5] Methods equivalent to kernel ridge regression were recently introduced under different names in the machine learning literature (e.g., proximal SVMs were introduced by Mangasarian et al.,[7] kernel ridge regression was introduced by Poggio et al.,[8] and Least-Squares Support Vector Machines were introduced by Suykens et al.[9-10]). In these works, kernel ridge regression is usually introduced as a regularization method that solves a convex optimization problem in a Lagrangian formulation for the dual problem that is very similar to traditional SVMs. The equivalence with ridge regression techniques then appears after a series of mathematical manipulations. By contrast, we introduced kernel ridge regression here with few mathematical diversions, in the context of the machine learning dilemma and direct kernel methods. For all practical purposes, kernel ridge regression is similar to support vector machines, works in the same feature space as support vector machines, and was therefore named least-squares support vector machines by Suykens et al.

Note that kernel ridge regression still requires the computation of an inverse for an n × n matrix, which can be quite large. This task is computationally demanding for large datasets, as is the case in a typical data mining problem. Since the kernel matrix now scales with the number of data squared, this method can also become prohibitive from a practical computer implementation point of view, because both memory and processing requirements can be very demanding. Krylov space-based methods[11] and conjugate gradient methods[1, 10] are relatively efficient ways to speed up the matrix inverse transformation of large matrices, where the computation time now scales as n², rather than n³. The Analyze/Stripminer code[12] developed by the author applies Møller's scaled conjugate gradient method to calculate the matrix inverse.[13]

The issue of dealing with large datasets is even more profound. There are several potential solutions that will not be discussed in detail. One approach would be to use a rectangular kernel, where not all the data are used as bases to calculate the kernel, but a good subset of "support vectors" is estimated by chunking[1] or other techniques such as sensitivity analysis. More efficient ways for inverting large matrices are based on piece-wise inversion.
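Putting the pieces of Sections II.A and II.B together, direct kernel ridge regression reduces to a few steps: build the RBF kernels, center them with the stored training column averages, and solve one regularized linear system in the spirit of Eq. (13). The sketch below is an illustrative composition under invented data and arbitrary σ and λ settings, not the Analyze/Stripminer implementation:

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """RBF kernel entries of Eq. (16) for all row pairs of A and B."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * (A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def center_kernel(K, col_avg=None):
    """Two-step centering: subtract column averages (the stored training
    ones, if supplied), then each row's own average; K may be rectangular."""
    if col_avg is None:
        col_avg = K.mean(axis=0)
    Kc = K - col_avg[None, :]
    return Kc - Kc.mean(axis=1, keepdims=True), col_avg

rng = np.random.default_rng(3)
X_train = rng.normal(size=(30, 4))   # assumed already Mahalanobis-scaled
y_train = rng.normal(size=30)
X_test = rng.normal(size=(8, 4))
sigma, lam = 2.0, 0.05               # arbitrary example settings

Kc_train, col_avg = center_kernel(rbf_kernel(X_train, X_train, sigma))
Kc_test, _ = center_kernel(rbf_kernel(X_test, X_train, sigma), col_avg)

# Direct kernel ridge regression: w_n = (K + lam I)^{-1} y, as in Eq. (13)
w_n = np.linalg.solve(Kc_train + lam * np.eye(len(y_train)), y_train)
y_hat_test = Kc_test @ w_n           # out-of-sample predictions
```

For large n, the `np.linalg.solve` call is the step one would replace with a conjugate gradient or Krylov-space solver, as discussed above.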
Alternatively, the matrix inversion may be avoided altogether by adhering to the support vector machine formulation of kernel ridge regression, solving the dual Lagrangian optimization problem, and applying the sequential minimal optimization (SMO) algorithm.[16]

III. HEURISTIC REGULARIZATION FOR λ

It has been shown that kernel ridge regression can be expressed as an optimization method,[10-15] where rather than minimizing the residual error on the training set, according to:

  Σ_{i=1}^{n_train} ‖ŷ_i − y_i‖₂²        (18)

we now minimize:

  Σ_{i=1}^{n_train} ‖ŷ_i − y_i‖₂² + (λ/2) ‖w‖₂²        (19)

The above equation is a form of Tikhonov regularization[14] that has been explained in detail by Cherkassky and Mulier[4] in the context of empirical versus structural risk minimization. Minimizing the norm of the weight vector is in a sense similar to an error penalization for prediction models with a large number of free parameters. An obvious question in this context relates to the proper choice for the regularization parameter or ridge parameter λ.

In machine learning, it is common to tune the hyper-parameter λ using a tuning/validation set. This tuning procedure can be quite time consuming for large datasets, especially in consideration that a simultaneous tuning for the RBF kernel width must proceed in a similar manner. We therefore propose a heuristic formula for the proper choice of the ridge parameter, which has proven to be close to optimal in numerous practical cases.[36] If the data were originally Mahalanobis scaled, it was found by scaling experiments that a near-optimal choice for λ is

  λ = min{ 1; 0.05 (n/200)^{3/2} }        (20)

where n is the number of data in the training set. Note that in order to apply the above heuristic, the data have to be Mahalanobis scaled first. Eq. (20) was validated on a variety of standard benchmark datasets from the UCI data repository, and provided results that are nearly identical to those for an optimally tuned λ on a tuning/validation set. In any case, the heuristic formula for λ should be an excellent starting choice for the tuning process for λ. The above formula also proved to be useful for the initial choice of the regularization parameter C of SVMs, where C is now taken as 1/λ.

ACKNOWLEDGEMENT

The author acknowledges the National Science Foundation's support of this work (IIS-9979860). The discussions with Robert Bress, Kristin Bennett, Karsten Sternickel, Boleslaw Szymanski and Seppo Ovaska were extremely helpful in preparing this paper.

REFERENCES

[1] Nello Cristianini and John Shawe-Taylor [2000] Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press.
[2] Vladimir Vapnik [1998] Statistical Learning Theory, John Wiley & Sons.
[3] Bernhard Schölkopf and Alexander J. Smola [2002] Learning with Kernels, MIT Press.
[4] Vladimir Cherkassky and Filip Mulier [1998] Learning from Data: Concepts, Theory, and Methods, John Wiley & Sons, Inc.
[5] A. E. Hoerl and R. W. Kennard [1970] "Ridge Regression: Biased Estimation for Non-Orthogonal Problems," Technometrics, Vol. 12, pp. 69-82.
[6] B. Schölkopf, A. Smola, and K.-R. Müller [1998] "Nonlinear Component Analysis as a Kernel Eigenvalue Problem," Neural Computation, Vol. 10, pp. 1299-1319.
[7] Glenn Fung and Olvi L. Mangasarian [2001] "Proximal Support Vector Machine Classifiers," in Proceedings KDD 2001, San Francisco, CA.
[8] Evgeniou, T., Pontil, M., and Poggio, T. [2000] "Statistical Learning Theory: A Primer," International Journal of Computer Vision, Vol. 38(1), pp. 9-13.
[9] Suykens, J. A. K. and Vandewalle, J. [1999] "Least-Squares Support Vector Machine Classifiers," Neural Processing Letters, Vol. 9(3), pp. 293-300.
[10] Suykens, J. A. K., Van Gestel, T., De Brabanter, J., De Moor, B., and Vandewalle, J. [2003] Least Squares Support Vector Machines, World Scientific Pub Co, Singapore.
[11] Ilse C. F. Ipsen and Carl D. Meyer [1998] "The Idea behind Krylov Methods," American Mathematical Monthly, Vol. 105, pp. 889-899.
[12] The Analyze/StripMiner code is available on request for academic use, or can be downloaded from www.drugmining.com.
[13] Møller, M. F. [1993] "A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning," Neural Networks, Vol. 6, pp. 525-534.
[14] A. N. Tikhonov and V. Y. Arsenin [1977] Solutions of Ill-Posed Problems, W. H. Winston, Washington, D.C.
[15] Bennett, K. P. and Embrechts, M. J. [2003] "An Optimization Perspective on Kernel Partial Least Squares Regression," Chapter 11 in Advances in Learning Theory: Methods, Models and Applications, Suykens, J. A. K. et al., Eds., NATO-ASI Series in Computer and System Sciences, IOS Press, Amsterdam, The Netherlands.
[16] Keerthi, S. S. and Shevade, S. K. [2003] "SMO Algorithm for Least Squares SVM Formulations," Neural Computation, Vol. 15, pp. 487-507.
[17] Robert H. Kewley and Mark J. Embrechts [2000] "Data Strip Mining for the Virtual Design of Pharmaceuticals with Neural Networks," IEEE Transactions on Neural Networks, Vol. 11(3), pp. 668-679.
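As a closing illustration, the heuristic of Eq. (20) is a one-liner in code. The sketch below only restates the formula; the function name is invented, and the comments merely evaluate the formula rather than report benchmark results:

```python
def heuristic_lambda(n):
    """Heuristic ridge parameter of Eq. (20) for n Mahalanobis-scaled
    training data: lambda = min(1, 0.05 * (n / 200)^(3/2))."""
    return min(1.0, 0.05 * (n / 200.0) ** 1.5)

# The ridge grows with the training set size and is capped at 1;
# the corresponding SVM starting choice would be C = 1 / lambda.
```

For n = 200 the heuristic gives λ = 0.05, smaller training sets get a smaller ridge, and very large n saturates at λ = 1.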