Direct Kernel Least-Squares Support Vector Machines with Heuristic Regularization

Mark J. Embrechts
Department of Decision Sciences and Engineering Systems
Rensselaer Polytechnic Institute, Troy, NY 12180
E-mail: embrem@rpi.edu

Abstract – This paper introduces least-squares support vector machines as a direct kernel method, where the kernel is considered as a data pre-processing step. A heuristic formula for the regularization parameter is proposed based on preliminary scaling experiments.

I. INTRODUCTION

A. One-Layered Neural Networks for Regression

A standard (predictive) data mining problem is defined as a regression problem for predicting the response from descriptive features. In order to do so, we will first build a predictive model based on training data, evaluate the performance of this predictive model based on validation data, and finally use this predictive model to make actual predictions on test data for which we generally do not know (or pretend not to know) the response value.

It is customary to denote the data matrix as X_nm and the response vector as y_n. In this case, there are n data points and m descriptive features in the dataset. We would like to infer y_n from X_nm by induction, denoted as X_nm => y_n, in such a way that our inference model works not only for the training data, but also does a good job on the out-of-sample data (i.e., validation data and test data). In other words, we aim to build a linear predictive model of the type:

    ŷ_n = X_nm w_m                                                (1)

The hat symbol indicates that we are making predictions that are not perfect (especially for the validation and test data). Equation (1) is the answer to the question "wouldn't it be nice if we could apply wisdom to the data, and pop comes out the answer?" The vector w_m is that wisdom vector and is usually called the weight vector in machine learning.

There are many different ways to build such predictive regression models. To mention just a few possibilities, the regression model could be a linear statistical model, a neural network-based model (NN), or a Support Vector Machine (SVM) based model.[1-3] Popular examples of linear statistical models are Principal Component Regression (PCR) and Partial Least-Squares (PLS) models. Popular examples of neural network-based models include feedforward neural networks (trained with one of the many popular learning methods), Self-Organizing Maps (SOMs), and Radial Basis Function Networks (RBFN). Examples of Support Vector Machine algorithms include the perceptron-like support vector machines (SVMs) and Least-Squares Support Vector Machines (LS-SVM), also known as kernel ridge regression.

A straightforward way to estimate the weights is outlined in Equation (2), where X_mn = X_nm^T denotes the transpose of the data matrix:

    X_mn X_nm w_m = X_mn y_n
    (X_mn X_nm)^{-1} (X_mn X_nm) w_m = (X_mn X_nm)^{-1} X_mn y_n
    w_m = (X_mn X_nm)^{-1} X_mn y_n                               (2)

Predictions for the training set can now be made for y by substituting (2) in (1):

    ŷ_n = X_nm (X_mn X_nm)^{-1} X_mn y_n                          (3)

Before applying this formula for a general prediction, proper data pre-processing is required. A common procedure in data mining is to center all the descriptors and to bring them to unit variance. The same process is then applied to the response. This procedure of centering and variance normalization is known as Mahalanobis scaling. While Mahalanobis scaling is not the only way to pre-process the data, it is probably the most general and most robust way to do pre-processing that applies well across the board. If we represent a feature vector as z, Mahalanobis scaling will result in a rescaled feature vector z' and can be summarized as:

    z' = (z - z̄) / std(z)                                         (4)

where z̄ represents the average value and std(z) represents the standard deviation of attribute z.
Making a test model proceeds in a very similar way as for training: the "wisdom vector," or weight vector, will now be applied to the test data to make predictions according to:

    ŷ_k^test = X_km^test w_m                                      (5)

In the above expression it was assumed that there are k test data, and the superscript "test" is used to explicitly indicate that the weight vector will be applied to a set of k test data with m attributes or descriptors. If one considers testing for one sample data point at a time, Eq. (5) can be represented as a simple neural network with an input layer and just a single neuron, as shown in Fig. 1. The neuron produces the weighted sum of the input features. Note that the transfer function, commonly found in neural networks, is not present here. Note also that the number of weights for this one-layer neural network equals the number of input descriptors or attributes.

Fig. 1. Neural network representation for regression
B. The Machine Learning Dilemma

Equations (2) and (3) contain the inverse of the feature kernel, K_F, defined as:

    K_F = X_mn X_nm                                               (9)

The feature kernel is an m × m symmetric matrix where each entry represents the similarity between features. Obviously, if there were two features that were completely redundant, the feature matrix would contain two columns and two rows that are (exactly) identical, and the inverse would not exist. One can argue that all is still well, and that in order to make the simple regression method work one would just have to make sure that the same descriptor or attribute is not included twice. By the same argument, highly correlated descriptors (i.e., "cousin features" in data mining lingo) should be eliminated as well. While this argument sounds plausible, the truth of the matter is more subtle. Let us repeat Eq. (2) and go just one step further, as shown below:

    X_mn X_nm w_m = X_mn y_n
    (X_mn X_nm)^{-1} (X_mn X_nm) w_m = (X_mn X_nm)^{-1} X_mn y_n
    w_m = (X_mn X_nm)^{-1} X_mn y_n
    w_m = X_mn (X_nm X_mn)^{-1} y_n                               (10)

Eq. (10) is the derivation of an equivalent linear formulation to Eq. (2), based on the so-called right-hand pseudo-inverse, or Penrose inverse, rather than the more common left-hand pseudo-inverse. It was not shown here how the last line follows from the previous equation; the proof is straightforward and left as an exercise to the reader. Note that the inverse is now needed for a different entity matrix, which has an n × n dimensionality and is called the data kernel, K_D, defined by:

    K_D = X_nm X_mn                                               (11)

The right-hand pseudo-inverse formulation is less frequently cited in the literature, because it can only be non-rank-deficient when there are more descriptive attributes than data points, which is not the usual case for data mining problems (except for data strip mining[17] cases). The data kernel matrix is a symmetric matrix whose entries represent similarities between data points.

The solution to this problem seems to be straightforward. We will first explain what seems to be an obvious solution, and then actually show why it won't work. Looking at Eqs. (9) and (11) it can be concluded that, except for rare cases where there are as many data records as there are features, either the feature kernel is rank deficient (in case m > n, i.e., there are more attributes than data), or the data kernel is rank deficient (in case n > m, i.e., there are more data than attributes). It can now be argued that for the m < n case one can proceed with the usual left-hand pseudo-inverse method of Eq. (2), and that for the m > n case one should proceed with the right-hand pseudo-inverse, or Penrose inverse, following Eq. (10).

While the approach just proposed seems reasonable, it will not work well in practice. Learning occurs by discovering patterns in data through redundancies present in the data. Data redundancies imply that there are data present that seem to be very similar to each other (and that have similar values for the response as well). An extreme example of data redundancy would be a dataset that contains the same data point twice. Obviously, in that case, the data matrix is ill-conditioned and the inverse does not exist. This type of redundancy, where data just repeat themselves, will be called here a "hard redundancy." However, for any dataset that one can possibly learn from, there have to be many "soft redundancies" as well. While these soft redundancies will not necessarily make the data matrix ill-conditioned in the sense that the inverse does not exist because the determinant of the data kernel is zero, in practice this determinant will be very small. In other words, regardless of whether one proceeds with a left-hand or a right-hand inverse, if the data contain information that can be learnt from, there have to be soft or hard redundancies in the data. Unfortunately, Eqs. (2) and (10) cannot then be solved for the weight vector, because the kernel will either be rank deficient (i.e., ill-conditioned) or poorly conditioned, i.e., calculating the inverse will be numerically unstable. We call this phenomenon "the machine learning dilemma": (i) machine learning from data can only occur when data contain redundancies; (ii) but in that case the kernel inverse in Eq. (2) or Eq. (10) is either not defined or numerically unstable because of poor conditioning. Taking the inverse of a poorly conditioned matrix is possible, but the inverse is not "sharply defined," and most numerical methods, with the exception of methods based on singular value decomposition (SVD), will run into numerical instabilities. The machine learning dilemma seems to have some similarity with the uncertainty principle in physics, but we will not try to draw that parallel too far.

Statisticians have been aware of the machine learning dilemma for a long time, and have devised various methods around this paradox. In the next sections, we will propose several methods to deal with the dilemma and obtain efficient and robust prediction models in the process.
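The dilemma is easy to reproduce numerically. The following sketch (toy data of our own choosing) shows a "hard redundancy" making the data kernel of Eq. (11) rank deficient, and a "soft redundancy" making it poorly conditioned:

```python
import numpy as np

# "Hard redundancy": the same data point appears twice.
X_hard = np.array([[1.0, 2.0, 3.0],
                   [1.0, 2.0, 3.0],   # exact repeat of the first point
                   [2.0, 0.0, 1.0]])
K_hard = X_hard @ X_hard.T            # data kernel, Eq. (11)
# Two identical rows/columns: the kernel is rank deficient, so no inverse.
rank = np.linalg.matrix_rank(K_hard)  # 2 instead of 3

# "Soft redundancy": two points that are merely very similar.
X_soft = np.array([[1.0, 2.0, 3.0],
                   [1.0, 2.001, 3.0],  # nearly the same point
                   [2.0, 0.0, 1.0]])
K_soft = X_soft @ X_soft.T
# The inverse now exists, but the kernel is poorly conditioned, so
# computing it is numerically unstable.
condition = np.linalg.cond(K_soft)
```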
C. Regression Models Based on the Data Kernel

Reconsider the data kernel formulation of Eq. (10) for predictive modeling. There are several well-known methods for dealing with the machine learning dilemma that use techniques to ensure that the kernel matrix is no longer rank deficient. Two well-known methods are principal component regression and ridge regression.[5] In order to keep the mathematical diversions to a bare minimum, only ridge regression will be discussed.

Ridge regression is a very straightforward way to ensure that the kernel matrix is positive definite (or well-conditioned) before inverting the data kernel. In ridge regression, a small positive value, λ, is added to each element on the main diagonal of the data kernel. Usually the same value for λ is used for each entry. Obviously, we are no longer solving exactly the same problem. In order not to deviate too much from the original problem, the value for λ will be kept as small as we reasonably can tolerate. A good choice for λ is a small value that will make the newly defined data kernel matrix barely positive definite, so that the inverse exists and is mathematically stable. In data kernel space, the solution for the weight vector that will be used in the ridge regression prediction model now becomes:

    w_m = X_mn (X_nm X_mn + λI)^{-1} y_n                          (12)

and predictions for y can now be made according to:

    ŷ_n = X_nm X_mn (X_nm X_mn + λI)^{-1} y_n
        = K_D (K_D + λI)^{-1} y_n                                 (13)
        = K_D w_n

where a very different weight vector, w_n, was introduced. This weight vector is applied directly to the data kernel matrix (rather than to the training data matrix) and has the same dimensionality as the number of training data. To make a prediction on the test set, one proceeds in a similar way, but applies the weight vector to the data kernel for the test data, which is generally a rectangular matrix that projects the test data on the training data according to:

    K_D^test = X_km^test (X_nm^train)^T                           (14)

where it is assumed that there are k data points in the test set.
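A minimal NumPy sketch of Eqs. (12)-(14), using the linear (dot product) data kernel and synthetic data of our own choosing, might read:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training and test data (assumed already centered/scaled).
X_train = rng.standard_normal((20, 3))
w_true = np.array([1.0, -2.0, 0.5])    # hypothetical generating weights
y_train = X_train @ w_true
X_test = rng.standard_normal((5, 3))

lam = 0.05                             # ridge parameter, lambda

# Linear data kernel, Eq. (11), with a ridge on the main diagonal.
K_D = X_train @ X_train.T
w_n = np.linalg.solve(K_D + lam * np.eye(len(K_D)), y_train)

# Training predictions, Eq. (13): y_hat = K_D (K_D + lambda I)^{-1} y.
y_hat_train = K_D @ w_n

# Test predictions, Eq. (14): project the test data on the training data
# and apply the same weight vector w_n.
K_test = X_test @ X_train.T
y_hat_test = K_test @ w_n
```

Because λ is small, the ridge predictions stay close to the noise-free responses while the inversion remains numerically stable.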
II. THE KERNEL TRANSFORMATION

The kernel transformation is an elegant way to make a regression model nonlinear. The kernel transformation goes back at least to the early 1900s, when Hilbert addressed kernels in the mathematical literature. A kernel is a matrix containing similarity measures for a dataset: either between the data of the dataset itself, or with other data (e.g., support vectors[1,3]). A classical use of a kernel is the correlation matrix used for determining the principal components in principal component analysis, where the feature kernel contains linear similarity measures between (centered) attributes. In support vector machines, the kernel entries are similarity measures between data rather than features, and these similarity measures are usually nonlinear, unlike the dot product similarity measure that we used before to define a kernel. There are many possible nonlinear similarity measures, but in order to be mathematically tractable the kernel has to satisfy certain conditions, the so-called Mercer conditions.[1]

           | k_11  k_12  ...  k_1n |
    K_nn = | k_21  k_22  ...  k_2n |                              (15)
           | ...   ...   ...  ... |
           | k_n1  k_n2  ...  k_nn |

The expression above introduces the general structure for the data kernel matrix, K_nn, for n data. The kernel matrix is a symmetric matrix where each entry contains a (linear or nonlinear) similarity between two data vectors. There are many different possibilities for defining similarity metrics, such as the dot product, which is a linear similarity measure, and the Radial Basis Function kernel, or RBF kernel, which is a nonlinear similarity measure. The RBF kernel is the most widely used nonlinear kernel, and its entries are defined by:

    k_ij ≡ exp( -‖x_i - x_j‖² / (2σ²) )                           (16)

Note that in the kernel definition above, the kernel entry contains the square of the Euclidean distance (or two-norm) between data points, which is a dissimilarity measure (rather than a similarity), in a negative exponential. The negative exponential also contains a free parameter, σ, which is the Parzen window width for the RBF kernel. The proper choice of the Parzen window is usually determined by additional tuning, also called hyper-tuning, on an external validation set. The precise choice for σ is not crucial; there usually is a relatively broad range of choices for σ over which the model quality is stable.

Different learning methods distinguish themselves in the way by which the weights are determined. Obviously, the model in Eqs. (12)-(14) used to produce estimates or predictions for y is linear. Such a linear model has a handicap in the sense that it cannot capture inherent nonlinearities in the data. This handicap can easily be overcome by applying the kernel transformation directly as a data transformation. We will therefore not operate directly on the data, but on a nonlinear transform of the data, in this case the nonlinear data kernel. This is very similar to what is done in principal component analysis, where the data are substituted by their principal components before building a model. A similar procedure will be applied here, but rather than substituting the data by their principal components, the data will be substituted by their kernel transform (either linear or nonlinear) before building a predictive model.

The kernel transformation is applied here as a data transformation in a separate pre-processing stage. We actually replace the data by a nonlinear data kernel and apply a traditional linear predictive model. Methods where a traditional linear algorithm is used on a nonlinear kernel transform of the data are introduced here as "direct kernel methods." The elegance and advantage of such a direct kernel method is that the nonlinear aspects of the problem are captured entirely in the kernel and are transparent to the applied algorithm. If a linear algorithm was used before introducing the kernel transformation, the required mathematical operations remain linear. It is now clear how linear methods such as principal component regression, ridge regression, and partial least squares can be turned into nonlinear direct kernel methods by using exactly the same algorithm and code: only the data are different, and we operate on the kernel transformation of the data rather than the data themselves.

In order to make out-of-sample predictions on true test data, a similar kernel transformation needs to be applied to the test data, as shown in Eq. (14). The idea of direct kernel methods is illustrated in Fig. 2, by showing how any regression model can be applied to kernel-transformed data. One could also represent the kernel transformation in a neural network type of flow diagram, where the first hidden layer would yield the kernel-transformed data and the weights in the first layer would be just the descriptors of the training data. The second layer contains the weights that can be calculated with a hard computing method, such as kernel ridge regression. When a radial basis function kernel is used, this type of neural network would look very similar to a radial basis function neural network, except that the weights in the second layer are calculated differently.

Fig. 2. Direct kernels as a data pre-processing step
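As a concrete illustration of Eq. (16), the RBF kernel between two sets of data points can be computed in a few lines (the helper name and toy data are our own):

```python
import numpy as np

def rbf_kernel(X_rows, X_cols, sigma):
    """RBF kernel of Eq. (16): k_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    # Squared Euclidean distances between all pairs of data points.
    d2 = ((X_rows[:, None, :] - X_cols[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 2.0]])
K = rbf_kernel(X, X, sigma=1.0)
# The diagonal is 1 (each point is maximally similar to itself), and the
# entries shrink toward 0 as points grow farther apart.
```

Passing two different data matrices to the same helper yields the rectangular test kernel of Eq. (14).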
A. Dealing with Bias: Centering the Kernel

There is still one important detail that was overlooked so far and that is necessary to make direct kernel methods work. Looking at the prediction equations in which the weight vector is applied to data, as in Eq. (1), there is no constant offset term, or bias. It turns out that for data that are centered this offset term is always zero and does not have to be included explicitly. In machine learning lingo the proper name for this offset term is the bias, and rather than applying Eq. (1), a more general predictive model that includes this bias can be written as:

    ŷ_n = X_nm w_m + b                                            (17)

where b is the bias term. Because we made it a practice in data mining to center the data first by Mahalanobis scaling, this bias term is zero and can be ignored.

When dealing with kernels, the situation is more complex, as they need some type of bias as well. We will give only a recipe here that works well in practice, and refer the reader to the literature for a more detailed explanation.[3, 6]
Even when the data were Mahalanobis-scaled before applying the kernel transform, the kernel still needs some type of centering to be able to omit the bias term in the prediction model. A straightforward way of kernel centering is to subtract the average from each column of the training data kernel, and to store this average for later recall, when centering the test kernel. A second step for centering the kernel is going through the newly obtained vertically centered kernel again, this time row by row, and subtracting the row average from each horizontal row.

The kernel of the test data needs to be centered in a consistent way, following a similar procedure. In this case, the stored column centers from the kernel of the training data will be used for the vertical centering of the kernel of the test data. This vertically centered test kernel is then centered horizontally, i.e., for each row, the average of the vertically centered test kernel is calculated, and each horizontal entry of the vertically centered test kernel is substituted by that entry minus the row average.

Mathematical formulations for centering square kernels are explained in the literature.[3, 6] The advantage of the kernel-centering algorithm introduced in this section (and described above in words) is that it also applies to rectangular data kernels. The flow chart for pre-processing the data, applying a kernel transform on the data, and centering the kernel for the training data, validation data, and test data is shown in Fig. 3.

Fig. 3. Data pre-processing with kernel centering

B. Direct Kernel Ridge Regression

So far, the argument was made that by applying the kernel transformation in Eqs. (13) and (14), many traditional linear regression models can be transformed into a nonlinear direct kernel method. The kernel transformation and kernel centering proceed as data pre-processing steps (Fig. 2). In order to make the predictive model inherently nonlinear, the radial basis function kernel will be applied, rather than the (linear) dot product kernel used in Eqs. (2) and (10). There are actually several alternate choices for the kernel,[1-3] but the RBF kernel is the most widely applied kernel. In order to overcome the machine learning dilemma, a ridge can be applied to the main diagonal of the data kernel matrix. Since the kernel transformation is applied directly to the data, before applying ridge regression, this method is called direct-kernel ridge regression.

Kernel ridge regression and (direct) kernel ridge regression are not new. The roots of ridge regression can be traced back to the statistics literature.[5] Methods equivalent to kernel ridge regression were recently introduced under different names in the machine learning literature (e.g., proximal SVMs were introduced by Mangasarian et al.,[7] kernel ridge regression was introduced by Poggio et al.,[8] and Least-Squares Support Vector Machines were introduced by Suykens et al.[9-10]). In these works, kernel ridge regression is usually introduced as a regularization method that solves a convex optimization problem in a Lagrangian formulation for the dual problem that is very similar to traditional SVMs. The equivalence with ridge regression techniques then appears after a series of mathematical manipulations. By contrast, we introduced kernel ridge regression with few mathematical diversions in the context of the machine learning dilemma and direct kernel methods. For all practical purposes, kernel ridge regression is similar to support vector machines, works in the same feature space as support vector machines, and was therefore named least-squares support vector machines by Suykens et al.

Note that kernel ridge regression still requires the computation of an inverse of an n × n matrix, which can be quite large. This task is computationally demanding for large datasets, as is the case in a typical data mining problem. Since the kernel matrix now scales with the number of data squared, this method can also become prohibitive from a practical computer implementation point of view, because both memory and processing requirements can be very demanding. Krylov space-based methods[11] and conjugate gradient methods[1, 10] are relatively efficient ways to speed up the inversion of large matrices, where the computation time now scales as n², rather than n³. The Analyze/Stripminer code[12] developed by the author applies Møller's scaled conjugate gradient method to calculate the matrix inverse.[13]

The issue of dealing with large datasets is even more profound. There are several potential solutions that will not be discussed in detail. One approach would be to use a rectangular kernel, where not all the data are used as bases to calculate the kernel, but a good subset of "support vectors" is estimated by chunking[1] or other techniques such as sensitivity analysis. More efficient ways for inverting large matrices are based on piece-wise inversion.
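Putting the pieces together, a compact sketch of direct-kernel ridge regression with an RBF kernel and the centering recipe described above might look as follows (toy data and helper names are our own; the response is re-centered by hand rather than Mahalanobis scaled, to keep the sketch short):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data, assumed already scaled.
X_train = rng.standard_normal((30, 2))
y_train = np.sin(X_train[:, 0]) + 0.1 * X_train[:, 1]
X_test = rng.standard_normal((8, 2))

def rbf(A, B, sigma=1.0):
    # RBF kernel entries, Eq. (16).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

# Kernel transform: square training kernel (n x n) and rectangular
# test kernel (k x n), Eqs. (15) and (14).
K_train = rbf(X_train, X_train)
K_test = rbf(X_test, X_train)

# Kernel centering recipe: subtract the stored training column averages
# (vertical centering), then subtract each row average (horizontal centering).
col_means = K_train.mean(axis=0)
Kc_train = K_train - col_means
Kc_train = Kc_train - Kc_train.mean(axis=1, keepdims=True)
Kc_test = K_test - col_means
Kc_test = Kc_test - Kc_test.mean(axis=1, keepdims=True)

# Direct-kernel ridge regression: ridge on the main diagonal, then solve
# (K + lambda I) w = y as in Eq. (13).
lam = 0.05
w = np.linalg.solve(Kc_train + lam * np.eye(len(Kc_train)),
                    y_train - y_train.mean())
y_hat_test = Kc_test @ w + y_train.mean()   # restore the response offset
```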
Alternatively, the matrix inversion may be avoided altogether by adhering to the support vector machine formulation of kernel ridge regression, solving the dual Lagrangian optimization problem, and applying sequential minimal optimization (SMO).[16]

III. HEURISTIC REGULARIZATION FOR λ

It has been shown that kernel ridge regression can be expressed as an optimization method,[10-15] where rather than minimizing the residual error on the training set, according to:

    Σ_{i=1}^{n_train} (ŷ_i - y_i)²                                (18)

we now minimize:

    Σ_{i=1}^{n_train} (ŷ_i - y_i)² + (λ/2) ‖w‖²₂                  (19)

The above equation is a form of Tikhonov regularization[14] that has been explained in detail by Cherkassky and Mulier[4] in the context of empirical versus structural risk minimization. Minimizing the norm of the weight vector is, in a sense, similar to an error penalization for prediction models with a large number of free parameters. An obvious question in this context relates to the proper choice of the regularization parameter, or ridge parameter, λ.

In machine learning, it is common to tune the hyper-parameter λ using a tuning/validation set. This tuning procedure can be quite time consuming for large datasets, especially considering that a simultaneous tuning for the RBF kernel width must proceed in a similar manner. We therefore propose a heuristic formula for the proper choice of the ridge parameter, which has proven to be close to optimal in numerous practical cases. If the data were originally Mahalanobis scaled, it was found by scaling experiments that a near-optimal choice for λ is:

    λ = min{ 1, 0.05 (n/200)^{3/2} }                              (20)

where n is the number of data in the training set. Note that in order to apply the above heuristic, the data have to be Mahalanobis scaled first. Eq. (20) was validated on a variety of standard benchmark datasets from the UCI data repository, and provided results that are nearly identical to those of an optimally tuned λ on a tuning/validation set. In any case, the heuristic formula for λ should be an excellent starting choice for the tuning process for λ. The above formula also proved to be useful for the initial choice of the regularization parameter C of SVMs, where C is now taken as 1/λ.
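The heuristic of Eq. (20) is a one-liner in code; for instance (function name is our own):

```python
def heuristic_lambda(n_train):
    """Heuristic ridge parameter of Eq. (20) for Mahalanobis-scaled data:
    lambda = min{1, 0.05 (n/200)^(3/2)}."""
    return min(1.0, 0.05 * (n_train / 200.0) ** 1.5)

# For example: n = 200 gives lambda = 0.05, n = 800 gives lambda = 0.4,
# and very large n saturates at lambda = 1.
```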
ACKNOWLEDGEMENT

The author acknowledges the National Science Foundation support of this work (IIS-9979860). The discussions with Robert Bress, Kristin Bennett, Karsten Sternickel, Boleslaw Szymanski and Seppo Ovaska were extremely helpful in preparing this paper.

REFERENCES

[1] Nello Cristianini and John Shawe-Taylor [2000] Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press.
[2] Vladimir Vapnik [1998] Statistical Learning Theory, John Wiley & Sons.
[3] Bernhard Schölkopf and Alexander J. Smola [2002] Learning with Kernels, MIT Press.
[4] Vladimir Cherkassky and Filip Mulier [1998] Learning from Data: Concepts, Theory, and Methods, John Wiley & Sons, Inc.
[5] A. E. Hoerl and R. W. Kennard [1970] "Ridge Regression: Biased Estimation for Non-Orthogonal Problems," Technometrics, Vol. 12, pp. 69-82.
[6] B. Schölkopf, A. Smola, and K.-R. Müller [1998] "Nonlinear Component Analysis as a Kernel Eigenvalue Problem," Neural Computation, Vol. 10, pp. 1299-1319.
[7] Glenn Fung and Olvi L. Mangasarian [2001] "Proximal Support Vector Machine Classifiers," in Proceedings KDD 2001, San Francisco, CA.
[8] Evgeniou, T., Pontil, M., and Poggio, T. [2000] "Statistical Learning Theory: A Primer," International Journal of Computer Vision, Vol. 38(1), pp. 9-13.
[9] Suykens, J. A. K. and Vandewalle, J. [1999] "Least-Squares Support Vector Machine Classifiers," Neural Processing Letters, Vol. 9(3), pp. 293-300.
[10] Suykens, J. A. K., Van Gestel, T., De Brabanter, J., De Moor, B., and Vandewalle, J. [2003] Least Squares Support Vector Machines, World Scientific Pub Co, Singapore.
[11] Ilse C. F. Ipsen and Carl D. Meyer [1998] "The Idea behind Krylov Methods," American Mathematical Monthly, Vol. 105, pp. 889-899.
[12] The Analyze/StripMiner code is available on request for academic use, or can be downloaded from www.drugmining.com.
[13] Møller, M. F. [1993] "A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning," Neural Networks, Vol. 6, pp. 525-534.
[14] A. N. Tikhonov and V. Y. Arsenin [1977] Solutions of Ill-Posed Problems, W. H. Winston, Washington D.C.
[15] Bennett, K. P., and Embrechts, M. J. [2003] "An Optimization Perspective on Kernel Partial Least Squares Regression," Chapter 11 in Advances in Learning Theory: Methods, Models and Applications, Suykens J. A. K. et al., Eds., NATO-ASI Series in Computer and System Sciences, IOS Press, Amsterdam, The Netherlands.
[16] Keerthi, S. S., and Shevade, S. K. [2003] "SMO Algorithm for Least Squares SVM Formulations," Neural Computation, Vol. 15, pp. 487-507.
[17] Robert H. Kewley and Mark J. Embrechts [2000] "Data Strip Mining for the Virtual Design of Pharmaceuticals with Neural Networks," IEEE Transactions on Neural Networks, Vol. 11(3), pp. 668-679.