These are my slides about SVMs. They are based on the chapter on SVMs in Simon Haykin's book Neural Networks, 2nd edition. They have been augmented using the idea that Lagrange was a physicist, which gave him the first intuition about dual problems.
1. Machine Learning for Data Mining
Introduction to Support Vector Machines
Andres Mendez-Vazquez
June 21, 2015
2. Outline
1 History
2 Separable Classes
Separable Classes
Hyperplanes
3 Support Vectors
Support Vectors
Quadratic Optimization
Lagrange Multipliers
Method
Karush-Kuhn-Tucker Conditions
Primal-Dual Problem for Lagrangian
Properties
4 Kernel
Kernel Idea
Higher Dimensional Space
Examples
Now, How to select a Kernel?
3. History
Invented by Vladimir Vapnik and Alexey Ya. Chervonenkis in 1963
At the Institute of Control Sciences, Moscow.
In the paper “Estimation of dependencies based on empirical data.”
Corinna Cortes and Vladimir Vapnik in 1995
They invented the current incarnation, soft margins, at AT&T Labs.
By the way, Corinna Cortes
A Danish computer scientist known for her contributions to the field of machine learning.
She is currently the Head of Google Research, New York.
Cortes is a recipient of the ACM Paris Kanellakis Theory and Practice Award for her work on the theoretical foundations of support vector machines.
10. In addition
Alexey Yakovlevich Chervonenkis
He was a Soviet and Russian mathematician and, with Vladimir Vapnik, one of the main developers of the Vapnik–Chervonenkis theory, also known as the “fundamental theory of learning,” an important part of computational learning theory.
He died on 22 September 2014 at Losiny Ostrov National Park.
12. Applications
Partial List
1 Predictive Control
Control of chaotic systems.
2 Inverse Geosounding Problem
Used to understand the internal structure of our planet.
3 Environmental Sciences
Spatio-temporal environmental data analysis and modeling.
4 Protein Fold and Remote Homology Detection
Recognizing whether two different species contain similar genes.
5 Facial Expression Classification
6 Texture Classification
7 E-Learning
8 Handwriting Recognition
9 And counting...
22. Separable Classes
Given
x_i, i = 1, \cdots, N
A set of samples belonging to two classes \omega_1, \omega_2.
Objective
We want to obtain the decision function
g(x) = w^T x + w_0
24. Such that we can do the following
A linear separation function g(x) = w^T x + w_0
[Figure: a linear decision boundary separating the two classes]
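To make the decision rule concrete, here is a minimal sketch in Python (NumPy assumed; the weights are made-up numbers for illustration, not values from the slides): it evaluates g(x) = w^T x + w_0 and assigns a class by the sign.

```python
import numpy as np

# Hypothetical, hand-picked parameters of a separating hyperplane.
w = np.array([2.0, -1.0])   # normal vector w
w0 = -1.0                   # bias w_0

def g(x):
    """Linear decision function g(x) = w^T x + w_0."""
    return w @ x + w0

# Points on either side of the hyperplane get opposite signs.
x = np.array([3.0, 1.0])
label = 1 if g(x) >= 0 else -1   # class omega_1 if +1, omega_2 if -1
print(g(x), label)               # 4.0  1
```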
26. In other words ...
We have the following samples
x_1, \cdots, x_m \in C_1
x_1, \cdots, x_n \in C_2
We want the following decision surface
w^T x_i + w_0 \geq 0 for d_i = +1 if x_i \in C_1
w^T x_i + w_0 \leq 0 for d_i = -1 if x_i \in C_2
30. What do we want?
Our goal is to search for a direction w that gives the maximum possible margin.
[Figure: two candidate separating directions and their margins]
32. A Little of Geometry
[Figure: a right triangle with legs A and B and hypotenuse C on the separating line; d is the distance from the origin to the line and r the distance from a point x to the line]
Then
d = |w_0| / \sqrt{w_1^2 + w_2^2}, r = |g(x)| / \sqrt{w_1^2 + w_2^2} (1)
34. First, d = |w_0| / \sqrt{w_1^2 + w_2^2}
We can use the following rule in a triangle with a 90° angle:
Area = (1/2) C d (2)
In addition, the area can also be calculated as
Area = (1/2) A B (3)
Thus
d = AB / C
Remark: Can you get the rest of the values?
37. What about r = |g(x)| / \sqrt{w_1^2 + w_2^2}?
First, remember
g(x_p) = 0 and x = x_p + r (w / \|w\|) (4)
Thus, we have
g(x) = w^T (x_p + r (w / \|w\|)) + w_0 = w^T x_p + w_0 + r (w^T w / \|w\|) = g(x_p) + r \|w\|
Then
r = g(x) / \|w\|
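A quick sanity check with made-up numbers: take w = (3, 4)^T and w_0 = -5, so \|w\| = \sqrt{3^2 + 4^2} = 5. For x = (2, 3)^T we get g(x) = 3 · 2 + 4 · 3 - 5 = 13, hence r = g(x) / \|w\| = 13/5. The sign of g(x) tells us on which side of the hyperplane x lies.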
42. This has the following interpretation
The distance r from x to its projection x_p onto the hyperplane
[Figure: x, its projection x_p, and the distance r]
43. Now
We know that the straight line we are looking for has the form
w^T x + w_0 = 0 (5)
What about something like this?
w^T x + w_0 = \delta (6)
Clearly
This will be above or below the initial line w^T x + w_0 = 0.
46. Come back to the hyperplanes
We then have a specific bias for each border support line!
[Figure: the two border support lines and their support vectors]
47. Then, normalize by \delta
The new margin functions
\hat{w}^T x + w_{10} = 1
\hat{w}^T x + w_{01} = -1
where \hat{w} = w / \delta, w_{10} = w_0 / \delta, and w_{01} = w_0 / \delta.
Now, we come back to the middle separator hyperplane, but with the normalized terms
w^T x_i + w_0 \geq \hat{w}^T x + w_{10} for d_i = +1
w^T x_i + w_0 \leq \hat{w}^T x + w_{01} for d_i = -1
where w_0 is the bias of that central hyperplane, and \hat{w} is the normalized direction of w.
53. Come back to the hyperplanes
The meaning of what I am saying!
[Figure: the central hyperplane between the two normalized margin hyperplanes]
55. A little about Support Vectors
They are the vectors
x_i such that w^T x_i + w_0 = 1 or w^T x_i + w_0 = -1
Properties
They are the vectors nearest to the decision surface and the most difficult to classify.
Because of that, we have the name “Support Vector Machines.”
58. Now, we can summarize the decision rule for the hyperplane
For the support vectors
g(x_i) = w^T x_i + w_0 = \pm 1 for d_i = \pm 1 (7)
This implies
The distance to the support vectors is:
r = g(x_i) / \|w\| = 1 / \|w\| if d_i = +1, and -1 / \|w\| if d_i = -1
60. Therefore ...
We want the optimum value of the margin of separation,
\rho = 1 / \|w\| + 1 / \|w\| = 2 / \|w\| (8)
And the support vectors define the value of \rho.
[Figure: the margin \rho between the two support hyperplanes]
63. Quadratic Optimization
Then, we have the samples with labels
T = \{(x_i, d_i)\}_{i=1}^{N}
Then we can put the decision rule as
d_i(w^T x_i + w_0) \geq 1, i = 1, \cdots, N
65. Then, we have the optimization problem
The optimization problem
\min_w \Phi(w) = (1/2) w^T w
s.t. d_i(w^T x_i + w_0) \geq 1, i = 1, \cdots, N
Observations
The cost function \Phi(w) is convex.
The constraints are linear with respect to w.
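Because the problem is a convex quadratic program, a generic solver can already handle it directly. Below is a minimal sketch (SciPy assumed; the four training points are made up for illustration) that minimizes (1/2) w^T w subject to the margin constraints; the slides that follow solve the same problem through Lagrange multipliers and the dual instead.

```python
import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])

# Decision variables z = (w_1, w_2, w_0); minimize (1/2) w^T w.
def objective(z):
    w = z[:2]
    return 0.5 * w @ w

# SLSQP expects inequality constraints in the form fun(z) >= 0,
# so we encode d_i (w^T x_i + w_0) - 1 >= 0.
constraints = [
    {"type": "ineq", "fun": lambda z, i=i: d[i] * (X[i] @ z[:2] + z[2]) - 1.0}
    for i in range(len(d))
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, w0 = res.x[:2], res.x[2]
print("w =", w, "w0 =", w0, "margin =", 2.0 / np.linalg.norm(w))
```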
69. Lagrange Multipliers
The method of Lagrange multipliers
Gives a set of necessary conditions to identify optimal points of equality-constrained optimization problems.
This is done by converting a constrained problem to an equivalent unconstrained problem with the help of certain unspecified parameters known as Lagrange multipliers.
71. Lagrange Multipliers
The classical problem formulation
\min f(x_1, x_2, ..., x_n)
s.t. h_1(x_1, x_2, ..., x_n) = 0
It can be converted into
\min L(x_1, x_2, ..., x_n, \lambda) = \min \{f(x_1, x_2, ..., x_n) - \lambda h_1(x_1, x_2, ..., x_n)\} (9)
where
L(x, \lambda) is the Lagrangian function.
\lambda is an unspecified positive or negative constant called the Lagrange multiplier.
74. Finding an Optimum using Lagrange Multipliers
New problem
\min L(x_1, ..., x_n, \lambda) = \min \{f(x_1, ..., x_n) - \lambda h_1(x_1, ..., x_n)\}
We want an optimal \lambda = \lambda^*
If the minimum of L(x_1, ..., x_n, \lambda^*) occurs at (x_1, ..., x_n)^T = (x_1^*, ..., x_n^*)^T, and (x_1^*, ..., x_n^*)^T satisfies h_1(x_1, ..., x_n) = 0, then (x_1^*, ..., x_n^*)^T minimizes:
\min f(x_1, ..., x_n)
s.t. h_1(x_1, ..., x_n) = 0
Trick
It is to find the appropriate value of the Lagrange multiplier \lambda.
80. Lagrange was a Physicist
He was thinking of the following formula
A system in equilibrium satisfies the equation:
F_1 + F_2 + ... + F_K = 0 (10)
But functions do not have forces?
Are you sure?
Think about the following
The gradient of a surface.
83. Gradient to a Surface
After all, a gradient is a measure of the maximal change.
For example, the gradient of a function of three variables:
\nabla f(x) = i \partial f(x)/\partial x + j \partial f(x)/\partial y + k \partial f(x)/\partial z (11)
where i, j and k are unit vectors in the directions x, y and z.
86. Now, Think about this
Yes, we can use the gradient
However, we need to do some scaling of the forces by using parameters \lambda.
Thus, we have
F_0 + \lambda_1 F_1 + ... + \lambda_K F_K = 0 (12)
where F_0 is the gradient of the principal cost function and the F_i, for i = 1, 2, ..., K, are the gradients of the constraints.
88. Thus
If we have the following optimization:
\min f(x)
s.t. g_1(x) = 0
g_2(x) = 0
89. Geometric interpretation in the case of minimization
What is wrong? The gradients are going in the other direction; we can fix this by simply multiplying by -1.
Here the cost function is f(x, y) = x exp\{-x^2 - y^2\}, and we want to minimize it.
[Figure: level sets of f(x) with the constraint curves g_1(x) and g_2(x)]
-\nabla f(x) + \lambda_1 \nabla g_1(x) + \lambda_2 \nabla g_2(x) = 0
Nevertheless, it is equivalent to \nabla f(x) - \lambda_1 \nabla g_1(x) - \lambda_2 \nabla g_2(x) = 0.
92. Method
Steps
1 The original problem is rewritten as: minimize L(x, \lambda) = f(x) - \lambda h_1(x).
2 Take the derivatives of L(x, \lambda) with respect to x_i and set them equal to zero.
3 Express all x_i in terms of the Lagrange multiplier \lambda.
4 Plug x, in terms of \lambda, into the constraint h_1(x) = 0 and solve for \lambda.
5 Calculate x by using the value just found for \lambda.
From step 2
If there are n variables (i.e., x_1, \cdots, x_n), then you will get n equations with n + 1 unknowns (i.e., the n variables x_i and one Lagrange multiplier \lambda).
94. Example
We can apply this to the following problem
\min f(x, y) = x^2 - 8x + y^2 - 12y + 48
s.t. x + y = 8
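Carrying out the steps above on this example (a worked solution added here for completeness): form L(x, y, \lambda) = x^2 - 8x + y^2 - 12y + 48 - \lambda(x + y - 8). Setting the derivatives to zero gives 2x - 8 - \lambda = 0 and 2y - 12 - \lambda = 0, so x = (8 + \lambda)/2 and y = (12 + \lambda)/2. Plugging these into x + y = 8 yields 10 + \lambda = 8, i.e. \lambda = -2, hence x = 3, y = 5, and f(3, 5) = -2.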
95. Then, Rewriting The Optimization Problem
The optimization with equality constraints
\min_w \Phi(w) = (1/2) w^T w
s.t. d_i(w^T x_i + w_0) - 1 = 0, i = 1, \cdots, N
96. Then, for our problem
Using the Lagrange multipliers (we will call them \alpha_i)
We obtain the following cost function
J(w, w_0, \alpha) = (1/2) w^T w - \sum_{i=1}^{N} \alpha_i [d_i(w^T x_i + w_0) - 1]
Observation
Minimize with respect to w and w_0.
Maximize with respect to \alpha because it dominates
-\sum_{i=1}^{N} \alpha_i [d_i(w^T x_i + w_0) - 1]. (13)
101. Karush-Kuhn-Tucker Conditions
First, an Inequality Constrained Problem P
\min f(x)
s.t. g_1(x) \leq 0
\vdots
g_N(x) \leq 0
A really minimal version!!! Hey, it is a patchwork!!!
A point x is a local minimum of the constrained problem P only if a set of non-negative \alpha_j's may be found such that:
\nabla L(x, \alpha) = \nabla f(x) - \sum_{i=1}^{N} \alpha_i \nabla g_i(x) = 0
103. Karush-Kuhn-Tucker Conditions
Important
Think about this: each constraint corresponds to a sample in one of the two classes, thus
The corresponding \alpha_i's are going to be zero after optimization if a constraint is not active, i.e. if d_i(w^T x_i + w_0) - 1 > 0 (remember the maximization).
Again the Support Vectors
This actually defines the idea of support vectors!!!
Thus
Only the \alpha_i's with active constraints (the support vectors) will be different from zero, i.e. those with d_i(w^T x_i + w_0) - 1 = 0.
106. A small deviation from the SVM's for the sake of Vox Populi
Theorem (Karush-Kuhn-Tucker Necessary Conditions)
Let X be a non-empty open set in R^n, and let f : R^n \to R and g_i : R^n \to R for i = 1, ..., m. Consider the problem P to minimize f(x) subject to x \in X and g_i(x) \leq 0, i = 1, ..., m. Let \bar{x} be a feasible solution, and denote I = \{i \,|\, g_i(\bar{x}) = 0\}. Suppose that f and the g_i for i \in I are differentiable at \bar{x} and that the g_i for i \notin I are continuous at \bar{x}. Furthermore, suppose that the \nabla g_i(\bar{x}) for i \in I are linearly independent. If \bar{x} solves problem P locally, there exist scalars u_i for i \in I such that
\nabla f(\bar{x}) + \sum_{i \in I} u_i \nabla g_i(\bar{x}) = 0
u_i \geq 0 for i \in I
107. It is more...
In addition to the above assumptions
If each g_i for i \notin I is also differentiable at \bar{x}, the previous conditions can be written in the following equivalent form:
\nabla f(\bar{x}) + \sum_{i=1}^{m} u_i \nabla g_i(\bar{x}) = 0
u_i g_i(\bar{x}) = 0 for i = 1, ..., m
u_i \geq 0 for i = 1, ..., m
108. The necessary conditions for optimality
We use the previous theorem on
J(w, w_0, \alpha) = (1/2) w^T w - \sum_{i=1}^{N} \alpha_i [d_i(w^T x_i + w_0) - 1] (14)
Condition 1
\partial J(w, w_0, \alpha) / \partial w = 0
Condition 2
\partial J(w, w_0, \alpha) / \partial w_0 = 0
111. Using the conditions
We have, for the first condition,
\partial J(w, w_0, \alpha) / \partial w = \partial [(1/2) w^T w] / \partial w - \partial [\sum_{i=1}^{N} \alpha_i (d_i(w^T x_i + w_0) - 1)] / \partial w = 0
\partial J(w, w_0, \alpha) / \partial w = (1/2)(w + w) - \sum_{i=1}^{N} \alpha_i d_i x_i
Thus
w = \sum_{i=1}^{N} \alpha_i d_i x_i (15)
114. In a similar way ...
We have, by the second optimality condition,
\sum_{i=1}^{N} \alpha_i d_i = 0
Note
\alpha_i [d_i(w^T x_i + w_0) - 1] = 0
because the constraint vanishes at the optimal solution, i.e. \alpha_i = 0 or d_i(w^T x_i + w_0) - 1 = 0.
116. Thus
We need something extra
Our classic trick of transforming a problem into another problem
In this case
We use the Primal-Dual Problem for the Lagrangian
Where
We move from a minimization to a maximization!!!
120. Lagrangian Dual Problem
Consider the following nonlinear programming problem
Primal Problem P
\min f(x)
s.t. g_i(x) \leq 0 for i = 1, ..., m
h_i(x) = 0 for i = 1, ..., l
x \in X
Lagrange Dual Problem D
\max \Theta(u, v)
s.t. u \geq 0
where \Theta(u, v) = \inf_x \{f(x) + \sum_{i=1}^{m} u_i g_i(x) + \sum_{i=1}^{l} v_i h_i(x) \,|\, x \in X\}
122. What does this mean?
Assume that the equality constraints do not exist
We have then
\min f(x)
s.t. g_i(x) \leq 0 for i = 1, ..., m
x \in X
Now assume that we finish with only one constraint
We have then
\min f(x)
s.t. g(x) \leq 0
x \in X
124. What does this mean?
[Figure: the image set G of points (g(x), f(x)) in the (y, z) plane, with supporting lines of slope -u]
125. What does this mean?
Thus, in the (y, z) plane you have
G = \{(y, z) \,|\, y = g(x), z = f(x) for some x \in X\} (16)
Thus
Given u \geq 0, we need to minimize f(x) + u g(x) to find \theta(u), which is equivalent to minimizing z + u y over the points (y, z) \in G.
127. What does this mean?
Thus, in the (y, z) plane we have
z + u y = \alpha (17)
a line with slope -u.
Then, to minimize z + u y = \alpha
We need to move the line z + u y = \alpha parallel to itself, as far down as possible along its negative gradient, while keeping it in contact with G.
129. In other words
Move the line parallel to itself until it supports G.
[Figure: the supporting line of slope -u touching the set G from below]
Note: the set G lies above the line and touches it.
130. Thus
The problem is then to find the slope of the supporting hyperplane for G.
Its intersection with the z-axis
gives \theta(u).
133. Thus
The dual problem is equivalent to
Finding the slope of the supporting hyperplane such that its intercept on the z-axis is maximal.
134. Or
Such a hyperplane has slope -u and supports G at (y, z)
[Figure: the optimal supporting line of the set G]
Remark: the optimal solution is u and the optimal dual objective is z.
136. For more on this, Please!!!
Look at this book
“Nonlinear Programming: Theory and Algorithms” by Mokhtar S. Bazaraa and C. M. Shetty. Wiley, New York (2006), at page 260.
139. Solution
Differentiate with respect to x_1 and x_2
We have two cases to take into account: u \geq 0 and u < 0.
The first case is clear
What about when u < 0?
We have that
\theta(u) = -(1/2) u^2 + 4u if u \geq 0, and 4u if u < 0 (18)
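The example slides this solution refers to are missing from this extract; the \theta(u) above is consistent with the classic example from Bazaraa and Shetty: minimize x_1^2 + x_2^2 subject to -x_1 - x_2 + 4 \leq 0 with X = \{x : x_1, x_2 \geq 0\}. For u \geq 0 the Lagrangian x_1^2 + x_2^2 + u(-x_1 - x_2 + 4) is minimized at x_1 = x_2 = u/2, giving \theta(u) = -(1/2)u^2 + 4u; for u < 0 the minimum over X is at x_1 = x_2 = 0, giving \theta(u) = 4u. Maximizing \theta(u) over u \geq 0 gives u = 4 with \theta(4) = 8, which matches the primal optimum f(2, 2) = 8.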
143. Duality Theorem
First Property
If the primal has an optimal solution, so does the dual.
Thus
In order for w^* and \alpha^* to be optimal solutions of the primal and dual problems respectively, it is necessary and sufficient that w^*:
is feasible for the primal problem, and
\Phi(w^*) = J(w^*, w_0^*, \alpha^*) = \min_w J(w, w_0^*, \alpha^*)
145. Reformulate our Equations
We have then
J(w, w_0, \alpha) = (1/2) w^T w - \sum_{i=1}^{N} \alpha_i d_i w^T x_i - w_0 \sum_{i=1}^{N} \alpha_i d_i + \sum_{i=1}^{N} \alpha_i
Now, by our 2nd optimality condition,
J(w, w_0, \alpha) = (1/2) w^T w - \sum_{i=1}^{N} \alpha_i d_i w^T x_i + \sum_{i=1}^{N} \alpha_i
147. We have finally, for the 1st Optimality Condition:
First
w^T w = \sum_{i=1}^{N} \alpha_i d_i w^T x_i = \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j x_j^T x_i
Second, setting J(w, w_0, \alpha) = Q(\alpha),
Q(\alpha) = \sum_{i=1}^{N} \alpha_i - (1/2) \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j x_j^T x_i
149. From here, we have the problem
This is the problem that we really solve
Given the training sample \{(x_i, d_i)\}_{i=1}^{N}, find the Lagrange multipliers \{\alpha_i\}_{i=1}^{N} that maximize the objective function
Q(\alpha) = \sum_{i=1}^{N} \alpha_i - (1/2) \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j x_j^T x_i
subject to the constraints
\sum_{i=1}^{N} \alpha_i d_i = 0 (19)
\alpha_i \geq 0 for i = 1, \cdots, N (20)
Note
In the primal, we were trying to minimize the cost function; for this it is necessary to maximize over \alpha. That is the reason why we are maximizing Q(\alpha).
151. Solving for \alpha
We can compute w^* once we get the optimal \alpha_i^* by using (Eq. 15)
w^* = \sum_{i=1}^{N} \alpha_i^* d_i x_i
In addition, we can compute the optimal bias w_0^* using the optimal weight w^*
For this, we use the positive margin equation:
g(x^{(s)}) = (w^*)^T x^{(s)} + w_0 = 1
corresponding to a positive support vector x^{(s)}.
Then
w_0 = 1 - (w^*)^T x^{(s)} for d^{(s)} = 1 (21)
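Putting the dual problem (19)-(20), Eq. (15), and Eq. (21) together, here is a minimal sketch in Python (SciPy assumed; the toy points are made up) that maximizes Q(\alpha) numerically and then recovers w^* and w_0.

```python
import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable data (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
d = np.array([1.0, 1.0, -1.0, -1.0])
N = len(d)

# H_ij = d_i d_j x_i^T x_j, so Q(alpha) = sum(alpha) - 0.5 * alpha^T H alpha.
H = (d[:, None] * X) @ (d[:, None] * X).T

def neg_Q(a):
    # Minimize -Q(alpha) to maximize Q(alpha).
    return 0.5 * a @ H @ a - a.sum()

res = minimize(
    neg_Q,
    x0=np.zeros(N),
    method="SLSQP",
    bounds=[(0.0, None)] * N,                              # alpha_i >= 0        (20)
    constraints=[{"type": "eq", "fun": lambda a: a @ d}],  # sum alpha_i d_i = 0 (19)
)
alpha = res.x

# Recover w* from Eq. (15) and w0 from a positive support vector, Eq. (21).
w = (alpha * d) @ X
s = np.argmax(alpha * (d > 0))   # index of a positive support vector
w0 = 1.0 - w @ X[s]
print("alpha =", alpha.round(3), "w =", w, "w0 =", w0)
```

Note that the dual depends on the data only through the inner products x_j^T x_i in H, which is exactly what the kernel section below exploits.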
155. What do we need?
Until now, we only have a maximal margin algorithm
All of this works fine when the classes are separable.
Problem: what happens when they are not separable?
What can we do?
159. Map to a higher Dimensional Space
Assume that there exists a mapping
x \in R^l \to y \in R^k
Then, it is possible to define the following mapping.
161. Define a map to a higher Dimension
Nonlinear transformations
Given a series of nonlinear transformations
\{\phi_i(x)\}_{i=1}^{m}
from the input space to the feature space,
we can define the decision surface as
\sum_{i=1}^{m} w_i \phi_i(x) + w_0 = 0
163. This allows us to define
The following vector
\phi(x) = (\phi_0(x), \phi_1(x), \cdots, \phi_m(x))^T
that represents the mapping.
From this mapping
we can define the following kernel function
K : X \times X \to R
K(x_i, x_j) = \phi(x_i)^T \phi(x_j)
166. Example
Assume
x \in R^2 \to y = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)^T
We can show that
y_i^T y_j = (x_i^T x_j)^2
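A quick numeric check of this identity (a sketch; NumPy assumed, and the two points are made up):

```python
import numpy as np

def phi(x):
    """Explicit feature map x -> (x1^2, sqrt(2) x1 x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

xi = np.array([1.0, 2.0])
xj = np.array([3.0, 0.5])

lhs = phi(xi) @ phi(xj)   # inner product in feature space
rhs = (xi @ xj) ** 2      # squared inner product in input space
print(lhs, rhs)           # both 16.0
```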
168. Example of Kernels
Polynomials
k(x, z) = (x^T z + 1)^q, q > 0
Radial Basis Functions
k(x, z) = exp\{-\|x - z\|^2 / \sigma^2\}
Hyperbolic Tangents
k(x, z) = tanh(\beta x^T z + \gamma)
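The same three kernels as plain Python functions (a sketch; the default parameter values are arbitrary choices, not values from the slides):

```python
import numpy as np

def polynomial_kernel(x, z, q=2):
    """k(x, z) = (x^T z + 1)^q with q > 0."""
    return (x @ z + 1.0) ** q

def rbf_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / sigma^2)."""
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def tanh_kernel(x, z, beta=1.0, gamma=0.0):
    """k(x, z) = tanh(beta * x^T z + gamma)."""
    return np.tanh(beta * (x @ z) + gamma)
```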
172. Now, How to select a Kernel?
We have a problem
Selecting a specific kernel and its parameters is usually done in a try-and-see manner.
Thus
In general, the Radial Basis Function kernel is a reasonable first choice.
Then
If this fails, we can try the other possible kernels.
175. Thus, we have something like this
Step 1
Normalize the data.
Step 2
Use cross-validation to adjust the parameters of the selected kernel.
Step 3
Train against the entire dataset.
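These three steps map directly onto standard tooling. Here is a minimal sketch using scikit-learn (an assumption, since the slides do not name a library), with an RBF kernel and an illustrative parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)  # placeholder data

# Step 1: normalization, folded into the pipeline.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Step 2: cross-validate over an illustrative grid of C and gamma.
grid = GridSearchCV(
    pipe,
    param_grid={"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]},
    cv=5,
)
grid.fit(X, y)

# Step 3: GridSearchCV refits the best model on the entire dataset by default.
print(grid.best_params_, grid.best_score_)
```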