BALANCING BOARD MACHINES

Frederic Maire, School of Software Engineering and Data Communication, Queensland University of Technology, Box 2434, Brisbane, Qld 4001, Australia

Abstract

The support vector machine solution corresponds to the center of the largest sphere inscribed in version space. Alternative approaches like Bayesian Point Machine (BPM) and Analytic Center Machine have suggested that the generalization performance can be further enhanced by considering other possible centers of version space like the centroid (center of mass). We present an algorithm to compute exactly the centroid of higher dimensional polyhedra, then derive approximation algorithms to build a new learning machine whose performance is comparable to BPM.

Key Words

Kernel machines, Bayesian Point, Centroid.

1. Introduction

Kernel classifiers are non-linear decision functions for binary classification. In the Kernel Machine framework (Müller & Mika & Rätsch & Tsuda & Schölkopf, [1]; Schölkopf & Smola, [2]), a feature mapping $x \mapsto \phi(x)$ from an input space to a feature space is given (generally, implicitly via a kernel function), as well as a training set of input vectors $\{x_1, \ldots, x_m\}$ with the corresponding class labels $\{y_1, \ldots, y_m\}$ where $y_i \in \{-1, +1\}$. The learning problem is formulated as a search problem for a linear classifier hypothesis (a weight vector $w$) belonging to a subset of the feature space called version space:

$$\{\, w \mid \forall i \in [1, m],\; y_i \langle w, \phi(x_i) \rangle > 0 \,\}.$$

In other words, version space is the set of weight vectors $w$ that are consistent with the training set. Because only the direction of $w$ matters for classification purposes, without loss of generality, we can restrict the search for $w$ to the unit sphere in feature space. The training algorithm of a Support Vector Machine (SVM) returns the weight vector $w$ that has the smallest maximum angle between $w$ and the $y_i \phi(x_i)$'s.

The kernel trick is that for certain feature spaces and mappings $\phi$, there exist easily computable kernel functions $k$ defined on the input space such that $k(x, y) = \langle \phi(x), \phi(y) \rangle$. With such a kernel function $k$, the computation of inner products $\langle \phi(x), \phi(y) \rangle$ does not require the explicit knowledge of $\phi$. In fact, for a given kernel function $k$, there may exist many different mappings $\phi$.

Geometrically, each training example $(x_i, y_i)$ defines a half-space in feature space through the constraint $y_i \langle w, \phi(x_i) \rangle > 0$ on $w$. It is easy to see that version space is a polyhedral cone of feature space. Figure 1 shows a bird's-eye view (slice of the polyhedral cone) of version space.

Figure 1: An elongated version space. The SVM point is the centre of the sphere.

The SVM solution point $w_{SVM}$ is the centre of the largest sphere whose centre is a unit vector and is contained in the polyhedral cone.

Bayes Point Machines (BPM) are a well-founded improvement which approximates the Bayes-optimal decision by the centroid (also known as the centre of mass or barycentre) of version space. It happens that the Bayes point is very close to the centroid of version space in high dimensional spaces. The Bayes point achieves better generalization performance in comparison to SVMs
(Opper & Haussler, [3]; Shawe-Taylor & Williamson, [4]; Graepel & Herbrich & Campbell, [5]; Watkin, [6]). An intuitive way to see why the centroid is a good choice is to view version space as a committee of experts who all agree on the training set. A new input vector $x$ corresponds to a hyperplane in feature space that may cut version space in two parts. In the example of Figure 1, the experts on the right of the hyperplane normal to $\phi(x)$ classify $x$ positively, whereas the experts on the left classify $x$ negatively. It is reasonable to use the opinion of the majority of the experts that successfully classified the training set to predict the class label of $x$. The expert that agrees the most with the majority vote on new inputs is precisely the Bayesian point. In a standard committee machine, for each new input we seek the opinions of a finite number of experts, then take a majority vote, whereas in a BPM, the expert that most often agrees with the majority vote of the infinite committee (version space) is delegated the task of classifying the new input.

Following Rujan [7], Herbrich and Graepel [8] introduced two algorithms to stochastically approximate the centroid of version space: a billiard sampling algorithm and a sampling algorithm based on the well known perceptron algorithm.

In this paper, we present an algorithm to compute exactly the centroid of a polyhedron in a high dimensional space. From this exact algorithm, we derive an algorithm to approximate a centroid position in a polyhedral cone. We show empirically that the corresponding machine presents better generalization capability than SVMs on a number of benchmark data sets.

In section 2, we introduce an algorithm to compute exactly the centroid of higher dimensional polyhedra. In section 3, we show how to use this algorithm to approximate the centroid of version space. In section 4, some implementation issues are considered and some experimental results are presented.

2. Exact Computation of the Centroid of a Higher Dimensional Polyhedron

A polyhedron $P$ is the intersection of a finite number of half-spaces. It is best represented by a system of non-redundant linear inequalities $P = \{ x \mid Ax \le b \}$. Recall that the 1-volume is the length, the 2-volume is the area and the 3-volume is the everyday volume. The algorithm that we introduce for computing the centroid of an $n$-dimensional polyhedron is an extension of the work by Lasserre [10], who showed that the $n$-dimensional volume $V(n, A, b)$ of a polyhedron $P = \{ x \mid Ax \le b \}$ is related to the $(n-1)$-dimensional volumes of its facets and the row vectors of its matrix $A$ by the following formula:

$$V(n, A, b) = \frac{1}{n} \sum_i \frac{b_i}{\|a_i\|} \times V_i(n-1, A, b)$$

where $a_i$ denotes the $i$th row of $A$ and $V_i(n-1, A, b)$ denotes the $(n-1)$-dimensional volume of the $i$th facet. The computation of the centroid and the $(n-1)$-volume of a facet is done by variable elimination. Geometrically, this amounts to projecting the facet onto an axis-parallel hyperplane, then computing the volume and the centroid of this projection recursively in a lower dimensional space. From the volume and centroid of the projected facet, we can derive the centroid and volume of the original facet.

The formulae below are obtained by considering the $n$-fold integral defining the $n$-dimensional volume and decomposing the polyhedron into cones. The centroid of a polyhedron can be computed recursively in the following manner:

• Compute recursively the centroids $G_{F_i}$ and the $(n-1)$-volumes $V_{F_i}$ of each facet (face of dimension $n-1$) $F_i$ of $P$. Each facet $F_i$ corresponds to the intersection of $P$ with the hyperplane defined by the $i$th row of the system $Ax \le b$.

• Compute $G_E = \sum_i \frac{V_{F_i}}{\sum_j V_{F_j}} \times G_{F_i}$, the centroid of the envelope of $P$ (the union of the facets $F_i$).

• Compute the centroids $G_{C_i}$ and the $n$-volumes $V_{C_i}$ of the cones $C_i = \mathrm{cone}(G_E, F_i)$ rooted at $G_E$. If $h_i$ is the distance from $G_E$ to the hyperplane containing $F_i$, then $V_{C_i} = \frac{h_i}{n} \times V_{F_i}$ and $\overrightarrow{G_E G_{C_i}} = \frac{n}{n+1} \times \overrightarrow{G_E G_{F_i}}$.
• Compute $G$, the centroid of $P$, as the weighted sum $G = \sum_i \frac{V_{C_i}}{\sum_j V_{C_j}} \times G_{C_i}$.

It is useful to observe that the computation of the volume and the centroid of an $(n-1)$-dimensional polyhedron in an $n$-dimensional space is identical to the computation of the volume and the centroid of a facet of an $n$-dimensional polyhedron. For further details, see the Matlab source code at

3. Balancing Board Machines

3.1 A Mechanical Point of View

The point of contact of a board posed in equilibrium on a sphere (assumed to be the only source of gravity) is the centroid of the board. This observation is the basis of our “balancing board algorithm”. In the rest of this paper, the term “board” will refer to the intersection of the polyhedral cone of version space with a hyperplane normal to a unit vector $w$ of version space. This definition implies that if the polyhedral cone is $n$-dimensional then a board will be an $(n-1)$-dimensional polyhedron tangent to the unit sphere.

In the algorithm we propose, the approximation $w$ of the centroid direction of the cone is refined by computing the centroid of the board normal to $w$, and then rotating $w$ towards the centroid of the board (stopping at a local minimum of the volume of the board in this line search).

Figure 2 illustrates the balancing process of a board. Notice that Figure 2 is simply an illustration, as in dimension 2 the line search would succeed in just one line-search iteration!

Figure 2: Top left, initial board. Top right, after one iteration. Bottom left, after two iterations. Bottom right, after three iterations.

3.2 Exploring the Polyhedral Cone

Statistical learning theory (Schölkopf & Smola, [2]) tells us that the Bayes point $w$ belongs to the vector subspace $V$ generated by the family of vectors $\{\phi(x_1), \ldots, \phi(x_m)\}$, that is, $w$ is of the form

$$w = \sum_j \alpha_j \phi(x_j).$$

Once we know an orthonormal basis of $V$ (the orthonormality is with respect to the inner product in feature space corresponding to the kernel function in the input space), we can express the polyhedral cone inequalities with respect to this orthonormal basis. Then we can apply the formulae of section 2 to compute the centroid of any polyhedron expressed in this orthonormal basis. The kernel PCA basis is an orthonormal basis $B$ of $V$. Its basis vectors are the eigenvectors of the symmetric matrix $K = \{ k(x_i, x_j) \}_{i,j}$. By expressing the polyhedral cone defined by the training examples in $B$, we will be able to approximate a centroid direction with the board balancing algorithm sketched in section 3.1 and detailed below.

The complexity of the algorithm of section 2 to compute exactly the centroid is unfortunately exponential. The computational cost of the exact calculation of the centroid is too high even for medium size data sets. However, the recursive formulae allow us to derive an approximation of the volume and the centroid of a polyhedron once we have approximations for the volumes and the centroids of its facets.
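To make the recursive formulae of section 2 concrete, here is a minimal sketch of the cone decomposition in dimension $n = 2$, where the polyhedron is a convex polygon, each facet is an edge, the facet volume is the edge length and the facet centroid is the edge midpoint. The function name `polygon_centroid_by_cones` and the vertex-list input are assumptions made for this illustration; this is not the author's Matlab code, and the recursion over lower dimensional faces is not shown because it bottoms out immediately in the plane.

```python
import math


def polygon_centroid_by_cones(vertices):
    """Area and centroid of a convex polygon via the cone decomposition (n = 2).

    `vertices` lists the corners in order around the boundary
    (either orientation). Hypothetical helper, for illustration only.
    """
    n = 2  # dimension of the polyhedron
    m = len(vertices)

    # Step 1: each facet is an edge; its 1-volume V_F is its length and
    # its centroid G_F is its midpoint.
    facets = []
    for i in range(m):
        p, q = vertices[i], vertices[(i + 1) % m]
        length = math.hypot(q[0] - p[0], q[1] - p[1])
        mid = ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)
        facets.append((p, q, length, mid))

    # Step 2: centroid G_E of the envelope, weighted by facet volume.
    total = sum(f[2] for f in facets)
    ge = (sum(f[2] * f[3][0] for f in facets) / total,
          sum(f[2] * f[3][1] for f in facets) / total)

    # Step 3: cones rooted at G_E, with V_C = (h / n) * V_F and
    # G_C = G_E + n / (n + 1) * (G_F - G_E).
    cones = []
    for p, q, length, mid in facets:
        cross = (q[0] - p[0]) * (ge[1] - p[1]) - (q[1] - p[1]) * (ge[0] - p[0])
        h = abs(cross) / length  # distance from G_E to the line through the edge
        vc = h / n * length
        gc = (ge[0] + n / (n + 1) * (mid[0] - ge[0]),
              ge[1] + n / (n + 1) * (mid[1] - ge[1]))
        cones.append((vc, gc))

    # Step 4: centroid G of P as the volume-weighted sum of cone centroids.
    volume = sum(vc for vc, _ in cones)
    g = (sum(vc * gc[0] for vc, gc in cones) / volume,
         sum(vc * gc[1] for vc, gc in cones) / volume)
    return volume, g


# Right triangle with legs of length 3.
area, g = polygon_centroid_by_cones([(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)])
print(area, g)
```

Because the last step only needs a volume and a centroid per facet, replacing the exact facet values with estimates yields an estimate for the whole polyhedron, which is the property the approximation scheme relies on.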
Because the balancing board algorithm requires several board centroid estimations, it is desirable to recycle intermediate results as much as possible to achieve a significant reduction in computation time. Because the intersection of a hyperplane and a spheric cone is an ellipsoid, we estimate the volume and the centroid of the intersection of the board and a facet of the polyhedral cone (this intersection is $(n-2)$-dimensional) with the volume and the centroid of the intersection of the board and the largest spheric cone contained in the facet (this spheric cone is $(n-1)$-dimensional). The computation of these largest spheric cones is done only once. The centre of the ellipsoid and its quadratic matrix are easily derived from the centre and radius of the spheric cone. These derivations are explained in the next sub-sections.

To simplify the computations, we have restricted our study to non-singular kernel matrices (like those obtained from Gaussian kernels).

3.2.1 Change of Basis

Let $w_B$ be the coordinates of $w$ with respect to $B = \{\phi(x_1), \ldots, \phi(x_m)\}$. Recall that the kernel PCA basis is made of the eigenvectors of $K$. Let $w_U$ be the coordinates of $w$ with respect to the kernel PCA orthonormal basis $\{u_1, \ldots, u_m\}$. We have $w_B = U w_U$. Let $M \stackrel{\mathrm{def}}{=} K + \lambda I$, where $\lambda$ is a non-negative regularization parameter as in (Herbrich et al., [9]). We are looking for $w_B$ such that $-\mathrm{diag}(y)\, M w_B \le 0$ with $\langle w, w \rangle = 1$ and $w$ near the centroid direction of the polyhedral cone. As we have $\langle w, w \rangle = (w_U)^T w_U$, in practice, we look for the centroid direction of the cone

$$\begin{cases} -\mathrm{diag}(y)\, M U w_U \le 0 \\ (w_U)^T w_U = 1 \end{cases} \qquad (2)$$

3.2.2 Computing the Spheric Centre of a Polyhedral Cone Derived from a Non Singular Mercer Kernel Matrix

Let $Ax \le 0$ be the non-empty polyhedral cone derived from the kernel matrix. The matrix $A$ is square ($m = n$). Without loss of generality, we assume that its rows are normalized. That is, each row is a vector of norm 1. Recall that the spheric centre $w_s$ of the cone (direction of the centre of the largest sphere contained in the polyhedral cone and whose centre is at distance one from zero) corresponds to the SVM solution. Because $A$ is square and non-singular, each facet of the polyhedral cone touches the largest sphere. If each facet is moved by a distance of one in the direction of its normal vector, the new cone obtained is a translation of the original cone in the direction of $w_s$. That is, $w_s$ can be obtained by solving $Ax = -\vec{1}$.

Once the direction $u = w_s / \|w_s\|$ of the spheric centre of the polyhedral cone is determined, the radius $r$ of the largest sphere centered at $u$ can be computed. Here the radius of the spheric cone is defined as the radius of the largest $(n-1)$-sphere contained in the intersection of the cone and the hyperplane $u^T x = 1$. We use this definition to avoid geodesics.

3.2.3 Computation of r

We write $A(k,:)$ to denote the $k$th row of matrix $A$. For each $i$, let $\alpha_i = \pi/2 - \mathrm{acos}(-A(i,:)\,u)$. The scalar $r$ is the minimum over all $\tan(\alpha_i)$. If we are interested in the attributes of the cone contained in the facet $A(k,:)\,x = 0$, we simply solve the system

$$\begin{cases} A([1:k-1,\ k+1:],:)\,x = -\vec{1} \\ A(k,:)\,x = 0 \end{cases}$$

3.2.4 Spheric Cone Equation

Given the characteristic attributes $u$ and $r$ of a spheric cone, we can derive a simple equation for the cone. Let $z = (u^T x)\,u$ and $y = x - z$. The cone equation is $y^T y = r^2 z^T z$. An alternative equation (derived from Pythagoras' theorem) is $x^T x = (1 + r^2)(u^T x)^2$.

Our estimation of the volume and centroid of the board requires the estimation of the volume and centroid of the intersection of a cone and two hyperplanes (namely a facet and the hyperplane containing the board).

3.2.5 Intersection of a Spheric Cone and Two Hyperplanes

Consider the cone $x^T x = (1 + r^2)(u^T x)^2$ contained in the $k$th facet (that is, $A(k,:)\,u = 0$). Let us compute the
ellipsoid defined by the intersection of this cone and the hyperplane $w^T x = 1$ ($w$ is normal to the board).

Let $Q = [q_1, \ldots, q_{n-2}] = \mathrm{null}\begin{pmatrix} w^T \\ A(k,:) \end{pmatrix}$. Let $h$ be the intersection of the ray defined by $u$ and the hyperplane $w^T x = 1$. Let us make the change of variables $x = h + Qz$. One can easily check that

$$\forall z \in \mathbb{R}^{n-2}, \quad w^T (h + Qz) = 1 \quad \text{and} \quad A(k,:)\,(h + Qz) = 0.$$

We now derive the equation of the ellipsoid with respect to $z$. From $x^T x = (1 + r^2)(u^T x)^2$, we obtain

$$(h + Qz)^T (h + Qz) = (1 + r^2)(u^T h + u^T Q z)^2.$$

After developing the expression, we get

$$h^T h + 2 h^T Q z + z^T Q^T Q z = (1 + r^2)(u^T h)^2 + (1 + r^2)(z^T Q^T u u^T Q z) + (1 + r^2)(2\, u^T h\, u^T Q z).$$

After regrouping, we have

$$z^T Q^T \left( I_n - (1 + r^2)\, u u^T \right) Q z + 2 \left( h^T - (1 + r^2)(u^T h)\, u^T \right) Q z + h^T h - (1 + r^2)(u^T h)^2 = 0.$$

From this expression we can derive an expression of the form $(z - c)^T D (z - c) = b$ that will tell us the $(n-2)$-volume of the ellipsoid and its centre. It is easy to check that $h + Qc$ is the centre of the ellipsoid in $\mathbb{R}^n$.

In the next subsection, we show how to compute the volume of the ellipsoid.

3.2.6 Volume of an Ellipsoid $(x - c)^T D (x - c) = b$

Without loss of generality, we assume that $c = 0$ and $b = 1$. The matrix $D$ is symmetric non-negative, therefore there exists a decomposition $D = P \Delta P^T$ where $P$ is orthogonal and $\Delta$ is non-negative and diagonal. Let $y = P^T x$; the equation of the ellipsoid becomes $\sum_i \delta_i y_i^2 = 1$. Let $z_i = \sqrt{\delta_i}\, y_i$. We can check that if $y$ is on the ellipsoid then $z$ is on the unit sphere, and conversely. That is, the ellipsoid is obtained from the unit sphere by the linear transformation of matrix $M$, where

$$M = \begin{pmatrix} \frac{1}{\sqrt{\delta_1}} & 0 & \cdots & 0 \\ 0 & \ddots & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \frac{1}{\sqrt{\delta_n}} \end{pmatrix}$$

Recall that if $f(x) = Mx$ is a linear transformation and $S$ is a subset of the vector space, then we have $\mathrm{vol}(f(S)) = \mathrm{abs}(\det(M)) \times \mathrm{vol}(S)$.

The volume of the ellipsoid is therefore $\prod_{i=1}^{n} \frac{1}{\sqrt{\delta_i}}$ times the $n$-volume of the $n$-sphere.

For completeness, let us mention that the volume of an $n$-sphere of radius $r$ is $\frac{2 r^n}{n} \times \frac{\pi^{n/2}}{\Gamma(n/2)}$, and the volume of the $n$-rectangle containing the ellipsoid is $2^n \times \prod_{i=1}^{n} \frac{1}{\sqrt{\delta_i}}$.

3.2.7 Distance from w to a Facet

To compute the $(n-1)$-volume of the intersection of the board $w^T x = 1$ and the polyhedral cone $P$, we need to find for each facet $A(k,:)\,x \le 0$ the point $x$ in the plane generated by $w$ and $A(k,:)^T$ that belongs to this intersection. That is the orthogonal projection of $w$ on the hinge defined as the intersection of the board and the $k$th cone facet. The point $x = \alpha\, A(k,:)^T + \beta\, w$ must satisfy

$$\begin{cases} w^T x = 1 \\ A(k,:)\,x = 0 \end{cases}$$

That is,

$$\begin{cases} \gamma \alpha + \beta = 1 \\ \alpha + \gamma \beta = 0 \end{cases} \quad \text{where } \gamma = A(k,:)\,w.$$

Therefore

$$x = \left[ A(k,:)^T \;\; w \right] \times \begin{pmatrix} \gamma & 1 \\ 1 & \gamma \end{pmatrix}^{-1} \begin{pmatrix} 1 \\ 0 \end{pmatrix}.$$

4. Implementation Issues and Experimental Results
We have implemented the exact computation of the centroid and the volume in Matlab. A direct recursive implementation of Lasserre's formula would be very inefficient, as faces of dimension $k$ share faces of dimension $k - 1$. Our implementation caches the volumes and centroids of the lower dimensional faces in a hash-table.

Our algorithm has been validated by comparing the values it returns with those of a Monte-Carlo method.

As Lasserre's formula is valid only if the polyhedron is represented as a system of non-redundant linear inequalities, redundancy must be detected and eliminated by using linear optimization.

The kernel matrix of a Gaussian kernel can only be singular when identical input vectors occur more than once in the training set. We remove repeated occurrences of the same input vector and assign the most common label for this input vector to the occurrence that we leave in the training set.

The table which follows summarises the generalization performance (percentage of correct predictions on test sets) of the Balancing Board Machine (BBM) on 6 standard benchmarking data sets from the UCI Repository, comparing results for illustrative purposes with equivalent hard margin support vector machines. In each case the data was randomly partitioned into 20 training and test sets in the ratio 60%:40%.

Data set        SVM (%)   BBM (%)
heart disease   58.36     58.40
thyroid         94.34     95.23
diabetes        66.89     67.68
waveform        83.50     83.50
sonar           85.06     85.78
ionosphere      86.79     86.86

The results obtained with a BBM are comparable to those obtained with a BPM, but the improvement is not always as dramatic as that reported in (Herbrich et al., [9]). We observed that the improvement was generally better for smaller data sets. We suspect that this is due to the fact that the volumes considered become very small in high dimensional spaces. In fact, on a PC, unit spheres “vanish” when their dimension exceeds 340. The volume of a unit sphere of dimension 340 is $4.5 \times 10^{-223}$. This is why we consider the logarithm of the volume in our programs.

5. Conclusion

The exact computation algorithm can be useful as a benchmark for people developing new centroid approximation algorithms. We do not claim that our BBM approach is superior to any other, given that the computational cost is in the order of $m$ times the cost of an SVM computation (where $m$ is the number of training examples).

Replacing the ellipsoids with a more accurate estimation would probably give better results, but deriving the volume and the centroid of the intersection of a facet and a board from the volume and the centroid of the intersection of the same facet with another board seems to be a hard problem.

The computation of the SVM point presented in section 3.2.2 provides an efficient learning algorithm for Gaussian kernels.

6. Acknowledgement

I would like to thank Professor Tom Downs and Professor Peter Bartlett for their valuable comments on a previous version of the BBM algorithm. This work was partially supported by an ATN grant.

References

[1] Müller, K., Mika, S., Rätsch, G., Tsuda, K., and Schölkopf, B., An Introduction To Kernel-Based Learning Algorithms, IEEE Trans. on NN, vol 12, no 2, 2001, pp 181-201.

[2] Schölkopf, B., Smola, A., Learning with Kernels,

[3] M. Opper and D. Haussler, Generalization performance of Bayes optimal classification algorithm for learning a perceptron, Phys. Rev. Lett., vol. 66, p. 2677, 1991.

[4] J. Shawe-Taylor and R. C. Williamson, A PAC analysis of a Bayesian estimator, Royal Holloway, Univ. London, Tech. Rep. NC2-TR-1997-013, 1997.

[5] T. Graepel, R. Herbrich, and C. Campbell, Bayes point machines: Estimating the Bayes point in kernel space, in Proc. IJCAI Workshop Support Vector Machines, 1999, pp. 23-27.

[6] T. Watkin, Optimal learning with a neural network, Europhys. Lett., vol. 21, pp. 871-877, 1993.

[7] P. Ruján, Playing billiard in version space, Neural Comput., vol. 9, pp. 197-238, 1996.

[8] R. Herbrich and T. Graepel, Large scale Bayes point machines, Advances in Neural Information Processing Systems 13, 2001.
[9] R. Herbrich, T. Graepel, and C. Campbell, Bayes Point Machines, Journal of Machine Learning Research, 1 (2001) 245-279.

[10] Lasserre, J., An analytical Expression and an Algorithm for the volume of a Convex Polyhedron in $\mathbb{R}^n$, Journal of Optimization Theory and Applications, Vol 39, No 3, 1983.

Schrijver, A., Theory of Linear and Integer Programming, Wiley-Interscience Publication (1990).

Theodore B. Trafalis, Alexander M. Malyscheff, An Analytic Center Machine, Machine Learning, Volume 46, pp. 203-223, 2002.