This document introduces linear classifiers for machine learning and data mining: parametric models whose decision surface is a hyperplane that splits the feature space into classes. It develops the geometric properties of that hyperplane (normal vector, distances from points to the surface) and presents two procedures for obtaining an initial solution: gradient descent and minimum squared error.
04 Machine Learning - Supervised Linear Classifier
1. Machine Learning for Data Mining
Linear Classifiers
Andres Mendez-Vazquez
May 23, 2016
2. Outline
1 Introduction
The Simplest Functions
Splitting the Space
The Decision Surface
2 Developing an Initial Solution
Gradient Descent Procedure
The Geometry of a Two-Category Linearly-Separable Case
Basic Method
Minimum Squared Error Procedure
The Error Idea
The Final Error Equation
The Data Matrix
Multi-Class Solution
Issues with Least Squares!!!
What about Numerical Stability?
4. What is it?
First of all, we have a parametric model!!!
Here, the model is a hyperplane:

g(x) = w^T x + w_0  (1)

In the case of R^2
We have the following function:

g(x) = w_1 x_1 + w_2 x_2 + w_0  (2)
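As a quick illustration, a minimal sketch of evaluating this discriminant in R^2 (NumPy assumed; the weights and the test point below are hypothetical):

```python
import numpy as np

# Hypothetical parameters of a hyperplane (a line in R^2).
w = np.array([2.0, -1.0])   # (w_1, w_2)
w0 = 0.5                    # threshold / bias

def g(x):
    """Linear discriminant g(x) = w^T x + w_0."""
    return w @ x + w0

x = np.array([1.0, 3.0])
print(g(x))   # the sign of g(x) tells on which side of the line x lies
```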
9. Defining a Decision Surface
The equation g(x) = 0 defines a decision surface
It separates the elements into the classes ω1 and ω2.
When g(x) is linear, the decision surface is a hyperplane.
Given that x_1 and x_2 are both on the decision surface:

w^T x_1 + w_0 = 0
w^T x_2 + w_0 = 0

Thus

w^T x_1 + w_0 = w^T x_2 + w_0  (3)
12. Defining a Decision Surface
Thus

w^T (x_1 - x_2) = 0  (4)

Remark: Any vector lying in the hyperplane is perpendicular to w, i.e. w is normal to the hyperplane.
15. Therefore
The space is split into two regions (example in R^3) by the hyperplane H.
[Figure: the hyperplane H dividing the space into the regions R_1 and R_2]
16. Some Properties of the Hyperplane
Given that g(x) > 0 if x ∈ R_1.
17. It is more
We can say the following
Any x ∈ R_1 is on the positive side of H.
Any x ∈ R_2 is on the negative side of H.
In addition, g(x) gives us a way to obtain the distance from x to the hyperplane H.
First, we express any x as follows:

x = x_p + r \frac{w}{\|w\|}

Where
x_p is the normal projection of x onto H.
r is the desired distance:
Positive, if x is on the positive side.
Negative, if x is on the negative side.
25. Now
Since g(x_p) = 0
We have that

g(x) = g\left(x_p + r \frac{w}{\|w\|}\right)
     = w^T \left(x_p + r \frac{w}{\|w\|}\right) + w_0
     = w^T x_p + w_0 + r \frac{w^T w}{\|w\|}
     = g(x_p) + r \frac{\|w\|^2}{\|w\|}
     = r \|w\|

Then, we have

r = \frac{g(x)}{\|w\|}  (5)
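A short sketch of Equation (5), continuing the hypothetical w, w_0 and x from above: the signed distance r recovers the projection x_p, with g(x_p) = 0.

```python
import numpy as np

w = np.array([2.0, -1.0])
w0 = 0.5

def signed_distance(x):
    """r = g(x) / ||w||: positive on the positive side of H, negative on the other."""
    return (w @ x + w0) / np.linalg.norm(w)

x = np.array([1.0, 3.0])
r = signed_distance(x)
x_p = x - r * w / np.linalg.norm(w)   # normal projection of x onto H
print(r, w @ x_p + w0)                # second value is ~0, since x_p lies on H
```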
31. In particular
The distance from the origin to H:

r = \frac{g(0)}{\|w\|} = \frac{w^T 0 + w_0}{\|w\|} = \frac{w_0}{\|w\|}  (6)

Remarks
If w_0 > 0, the origin is on the positive side of H.
If w_0 < 0, the origin is on the negative side of H.
If w_0 = 0, g has the homogeneous form w^T x and the hyperplane passes through the origin.
35. In addition...
If we do the following

g(x) = w_0 + \sum_{i=1}^{d} w_i x_i = \sum_{i=0}^{d} w_i x_i  (7)

By making

x_0 = 1 and y = (1, x_1, ..., x_d)^T = \begin{pmatrix} 1 \\ x \end{pmatrix}

Where
y is called an augmented feature vector.
38. In a similar way
We have the augmented weight vector

w_{aug} = (w_0, w_1, ..., w_d)^T = \begin{pmatrix} w_0 \\ w \end{pmatrix}

Remarks
The addition of a constant component to x preserves all the distance relationships between samples.
The resulting y vectors all lie in a d-dimensional subspace, which is the x-space itself.
41. More Remarks
In addition
The hyperplane decision surface H defined by w_{aug}^T y = 0 passes through the origin in y-space,
even though the corresponding hyperplane H can be in any position of the x-space.
The distance from y to H is |w_{aug}^T y| / \|w_{aug}\|, or equivalently |g(x)| / \|w_{aug}\|.
Since \|w_{aug}\| \geq \|w\|
This distance is less than or equal to the distance from x to H.
This mapping is quite useful
Because we only need to find one weight vector w_{aug}, instead of finding both the weight vector w and the threshold w_0.
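A small sketch of the augmentation and the norm inequality above (all numbers hypothetical):

```python
import numpy as np

w = np.array([2.0, -1.0])
w0 = 0.5
x = np.array([1.0, 3.0])

y = np.concatenate(([1.0], x))      # augmented feature vector y = (1, x)
w_aug = np.concatenate(([w0], w))   # augmented weight vector (w_0, w)

# g(x) = w_aug^T y, and ||w_aug|| >= ||w|| makes |g(x)|/||w_aug|| <= |g(x)|/||w||.
print(np.isclose(w_aug @ y, w @ x + w0))
print(np.linalg.norm(w_aug) >= np.linalg.norm(w))
```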
48. Initial Supposition
Suppose we have
n samples x_1, x_2, ..., x_n, some labeled ω1 and some labeled ω2.
We want a weight vector w such that
w^T x_i > 0, if x_i ∈ ω1.
w^T x_i < 0, if x_i ∈ ω2.
We suggest the following normalization
We replace all the samples x_i ∈ ω2 by their negative vectors!!!
52. The Usefulness of the Normalization
Once the normalization is done
We only need to look for a weight vector w such that w^T x_i > 0 for all the samples.
The name of this weight vector
It is called a separating vector or solution vector.
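A toy sketch of the normalization trick (the data, class centers, and candidate w below are hypothetical): after negating the ω2 samples, a separating vector satisfies Yw > 0 row-wise.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy, linearly separable data: omega_1 around (2, 2), omega_2 around (-2, -2).
X1 = rng.normal(loc=[2, 2], scale=0.5, size=(10, 2))
X2 = rng.normal(loc=[-2, -2], scale=0.5, size=(10, 2))

Y1 = np.hstack([np.ones((10, 1)), X1])    # augment with a leading 1...
Y2 = -np.hstack([np.ones((10, 1)), X2])   # ...and negate the omega_2 samples

Y = np.vstack([Y1, Y2])
w = np.array([0.0, 1.0, 1.0])             # a candidate solution vector
print(np.all(Y @ w > 0))                  # True iff w separates the classes
```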
54. Here, we have the solution region for w
Do not confuse this region with the decision region!!!
[Figure: the separating plane and the solution space]
Remark: w is not unique!!! We can have different w's solving the problem.
56. Here, we have the solution region for w under normalization
Do not confuse this region with the decision region!!!
[Figure: the "separating" plane and the solution space]
Remark: w is not unique!!!
58. How do we get this w?
In order to be able to do this
We need to impose constraints on the problem.
Possible constraints!!!
To find a unit-length weight vector that maximizes the minimum distance from the samples to the separating plane.
To find the minimum-length weight vector satisfying w^T x_i ≥ b for all i, where b is a positive constant called the margin.
Here, the solution region resulting from the intersection of the half-spaces with w^T x_i ≥ b > 0 lies within the previous solution region!!!
62. We have then
A new boundary, moved in by a distance b / \|x_i\|
[Figure: the margin b shrinks the solution region]
64. Gradient Descent
For this, we will define a criterion function J(w)
A classic optimization.
The basic procedure is as follows
1 Start with a random weight vector w(1).
2 Compute the gradient vector \nabla J(w(1)).
3 Obtain the value w(2) by moving from w(1) in the direction of steepest descent:
1 i.e. along the negative of the gradient.
2 By using the following equation:

w(k+1) = w(k) - \eta(k) \nabla J(w(k))  (8)
70. What is η(k)?
Here
η(k) is a positive scale factor or learning rate!!!
The basic algorithm looks like this
Algorithm 1 (Basic gradient descent)
1 begin initialize w, criterion θ, η(·), k = 0
2 do k = k + 1
3 w = w − η(k) \nabla J(w)
4 until \|η(k) \nabla J(w)\| < θ
5 return w
Problem!!! How to choose the learning rate?
If η(k) is too small, convergence is quite slow!!!
If η(k) is too large, the correction will overshoot and can even diverge!!!
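A minimal sketch of Algorithm 1 with a fixed learning rate (the quadratic criterion below is a hypothetical stand-in for J):

```python
import numpy as np

def gradient_descent(grad_J, w, eta=0.1, theta=1e-6, max_iter=10_000):
    """Basic gradient descent: stop once the update eta * grad J(w) drops below theta."""
    w = np.asarray(w, dtype=float)
    for _ in range(max_iter):
        step = eta * grad_J(w)
        w = w - step
        if np.linalg.norm(step) < theta:
            break
    return w

# Toy criterion J(w) = ||w - c||^2, whose gradient is 2 (w - c).
c = np.array([1.0, -2.0])
print(gradient_descent(lambda w: 2 * (w - c), np.zeros(2)))   # approaches c
```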
79. Using the Taylor second-order expansion around the value w(k)
We do the following

J(w) \simeq J(w(k)) + \nabla J^T (w - w(k)) + \frac{1}{2} (w - w(k))^T H (w - w(k))  (9)

Remark: This is known as the Taylor second-order expansion!!!
Here, we have
\nabla J is the vector of partial derivatives \partial J / \partial w_i evaluated at w(k).
H is the Hessian matrix of second partial derivatives \partial^2 J / \partial w_i \partial w_j evaluated at w(k).
83. Then
We substitute (Eq. 8) into (Eq. 9)

w(k+1) - w(k) = -\eta(k) \nabla J(w(k))  (10)

We have then

J(w(k+1)) \simeq J(w(k)) + \nabla J^T (-\eta(k) \nabla J(w(k))) + \frac{1}{2} (-\eta(k) \nabla J(w(k)))^T H (-\eta(k) \nabla J(w(k)))

Finally, we have

J(w(k+1)) \simeq J(w(k)) - \eta(k) \|\nabla J\|^2 + \frac{1}{2} \eta^2(k) \nabla J^T H \nabla J  (11)
86. Differentiate with respect to η(k) and set the result equal to zero
We have then

-\|\nabla J\|^2 + \eta(k) \nabla J^T H \nabla J = 0  (12)

Finally

\eta(k) = \frac{\|\nabla J\|^2}{\nabla J^T H \nabla J}  (13)

Remark: This is the optimal step size!!!
Problem!!!
Calculating H can be quite expensive!!!
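A sketch of Equation (13) on a hypothetical quadratic criterion with constant Hessian H:

```python
import numpy as np

def optimal_step(grad, H):
    """eta(k) = ||grad J||^2 / (grad J^T H grad J), minimizing the second-order model."""
    return (grad @ grad) / (grad @ H @ grad)

# Hypothetical quadratic J(w) = 0.5 w^T H w, so grad J = H w.
H = np.array([[3.0, 0.0],
              [0.0, 1.0]])
w = np.array([1.0, 2.0])
grad = H @ w
eta = optimal_step(grad, H)
print(eta, w - eta * grad)   # one optimally sized step along the negative gradient
```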
89. We can have an adaptive line search!!!
We can use the idea of keeping everything fixed except η(k)
Then, we minimize the following function over η(k):

f(η(k)) = J(w(k) - η(k) \nabla J(w(k)))

We can optimize it using line search methods
Line Search Methods
Backtracking line search
Bisection method
Golden ratio
Etc.
95. Example: Golden Ratio
Imagine that you have a function f : L → R along the search line L
Where: choose a and b such that \frac{a+b}{a} = \frac{a}{b} (the Golden Ratio).
97. The process is as follows
Given f_1, f_2, f_3, where
f_1 = f(x_1)
f_2 = f(x_2)
f_3 = f(x_3)
We have then
If f_2 is smaller than f_1 and f_3, then the minimum lies in [x_1, x_3].
Now, we generate x_4 with f_4 = f(x_4)
In the largest subinterval!!! Here, [x_2, x_3].
100. Finally
Two cases
If f_4 > f_2 (case a), then the minimum lies between x_1 and x_4, and the new triplet is x_1, x_2, x_4.
If f_4 < f_2 (case b), then the minimum lies between x_2 and x_3, and the new triplet is x_2, x_4, x_3.
Then
Repeat the procedure!!!
For more, please read the paper
"Sequential Minimax Search for a Maximum" by J. Kiefer.
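A compact sketch of the golden-section line search (the interval endpoints and the test function f below are hypothetical):

```python
import math

def golden_section_search(f, a, b, tol=1e-6):
    """Minimize a unimodal f on [a, b] by golden-section interval reduction."""
    inv_phi = (math.sqrt(5) - 1) / 2          # 1/phi, about 0.618
    x1 = b - inv_phi * (b - a)
    x2 = a + inv_phi * (b - a)
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 < f2:                           # minimum lies in [a, x2]
            b, x2, f2 = x2, x1, f1
            x1 = b - inv_phi * (b - a)
            f1 = f(x1)
        else:                                 # minimum lies in [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + inv_phi * (b - a)
            f2 = f(x2)
    return (a + b) / 2

# Line search over the step size: f(eta) = J(w - eta * grad J), here a toy parabola.
print(golden_section_search(lambda eta: (eta - 0.3) ** 2, 0.0, 1.0))   # ~0.3
```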
104. We have another method...
Differentiate the second-order Taylor expansion with respect to w

J(w) \simeq J(w(k)) + \nabla J^T (w - w(k)) + \frac{1}{2} (w - w(k))^T H (w - w(k))

We get

\nabla J + H w - H w(k) = 0  (14)

Thus

H w = H w(k) - \nabla J
H^{-1} H w = H^{-1} H w(k) - H^{-1} \nabla J
w = w(k) - H^{-1} \nabla J
107. The Newton-Raphson Algorithm
We have the following algorithm
Algorithm 2 (Newton descent)
1 begin initialize w, criterion θ
2 do k = k + 1
3 w = w − H^{-1} \nabla J(w)
4 until \|H^{-1} \nabla J(w)\| < θ
5 return w
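A sketch of Algorithm 2, again on a hypothetical quadratic criterion; np.linalg.solve stands in for the explicit inverse H^{-1}:

```python
import numpy as np

def newton_descent(grad_J, hess_J, w, theta=1e-8, max_iter=100):
    """Newton-Raphson descent: w <- w - H^{-1} grad J(w), stopping once the step is tiny."""
    w = np.asarray(w, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess_J(w), grad_J(w))   # solve H * step = grad J
        w = w - step
        if np.linalg.norm(step) < theta:
            break
    return w

# On a quadratic J(w) = 0.5 (w - c)^T H (w - c), Newton converges in one step.
H = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([1.0, -1.0])
print(newton_descent(lambda w: H @ (w - c), lambda w: H, np.zeros(2)))   # ~c
```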
115. Initial Setup
Important
We move away from our initial normalization of the samples!!!
Now, we are going to use the method known as
Minimum Squared Error.
117. Now, assume the following
Imagine that your problem has two classes ω1 and ω2 in R^2
1 They are linearly separable!!!
2 You are required to label them.
We have a problem!!!
Which is the problem?
We do not know the hyperplane!!!
Thus, what distance does each point have to the hyperplane?
121. A Simple Solution For Our Quandary
Label the Classes
ω1 =⇒ +1
ω2 =⇒ −1
We produce the following labels
1 If x ∈ ω1, then y_{ideal} = g_{ideal}(x) = +1.
2 If x ∈ ω2, then y_{ideal} = g_{ideal}(x) = −1.
Remark: We have a problem with these labels!!!
126. Now, What?
Assume the true function is given by

y_{noise} = g_{noise}(x) = w^T x + w_0 + \epsilon  (15)

Where the noise \epsilon
has a distribution \epsilon \sim N(\mu, \sigma^2).
Thus, we can do the following

y_{noise} = g_{noise}(x) = g_{ideal}(x) + \epsilon  (16)
129. Thus, we have
What to do?

\epsilon = y_{noise} - g_{ideal}(x)  (17)

Graphically
[Figure: the error \epsilon as the vertical offset between the noisy observation and the ideal line]
132. Sum Over All Errors
We can do the following

J(w) = \sum_{i=1}^{N} \epsilon_i^2 = \sum_{i=1}^{N} (y_i - g_{ideal}(x_i))^2  (18)

Remark: Known as least squares (fitting the vertical offset!!!)
Generalize
If
The dimensionality of each sample (data point) is d,
You can extend each sample vector to x^T = (1, x'), where x' is the original sample,
We have:

\sum_{i=1}^{N} (y_i - x_i^T w)^2 = (y - Xw)^T (y - Xw) = \|y - Xw\|_2^2  (19)
137. What is X?
It is the Data Matrix

X = \begin{pmatrix}
1 & (x_1)_1 & \cdots & (x_1)_j & \cdots & (x_1)_d \\
\vdots & \vdots & & \vdots & & \vdots \\
1 & (x_i)_1 & & (x_i)_j & & (x_i)_d \\
\vdots & \vdots & & \vdots & & \vdots \\
1 & (x_N)_1 & \cdots & (x_N)_j & \cdots & (x_N)_d
\end{pmatrix}  (20)

We know the following

\frac{d\,x^T A x}{dx} = Ax + A^T x, \qquad \frac{d\,Ax}{dx} = A  (21)
140. We can expand our quadratic formula!!!
Thus

(y - Xw)^T (y - Xw) = y^T y - w^T X^T y - y^T X w + w^T X^T X w  (23)

Differentiating with respect to w, setting the result to zero, and assuming that X^T X is invertible, we obtain

\hat{w} = (X^T X)^{-1} X^T y  (24)

Note: X^T X is always positive semi-definite. If it is also invertible, it is positive definite.
Thus, how do we get the discriminant function?
Any Ideas?
143. The Final Discriminant Function
Very Simple!!!

g(x) = x^T \hat{w} = x^T (X^T X)^{-1} X^T y  (25)
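A sketch of Equations (24)-(25) on toy two-class data (the class centers and targets below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal([2, 2], 0.7, size=(20, 2))     # toy omega_1 samples
X2 = rng.normal([-2, -2], 0.7, size=(20, 2))   # toy omega_2 samples
X = np.vstack([np.hstack([np.ones((20, 1)), X1]),
               np.hstack([np.ones((20, 1)), X2])])   # data matrix with a bias column
y = np.concatenate([np.ones(20), -np.ones(20)])      # targets +1 / -1

# Normal equations: w_hat = (X^T X)^{-1} X^T y, solved without forming the inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

def g(x):
    """Discriminant g(x) = x_aug^T w_hat; classify by the sign."""
    return np.concatenate(([1.0], x)) @ w_hat

print(g(np.array([2.0, 2.0])) > 0, g(np.array([-2.0, -2.0])) > 0)
```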
144. Pseudo-inverse of a Matrix
Definition
Suppose that A ∈ R^{m×n} and rank(A) = n (full column rank, so A^T A is invertible). We call the matrix

A^+ = (A^T A)^{-1} A^T

the pseudo-inverse of A.
Reason
A^+ inverts A on its image.
What?
If w ∈ image(A), then there is some v ∈ R^n such that w = Av. Hence:

A^+ w = A^+ A v = (A^T A)^{-1} A^T A v = v
147. What lives where?
We have
X ∈ R^{N×(d+1)}
Image(X) = span{X_1^{col}, ..., X_{d+1}^{col}}
x_i ∈ R^d
w ∈ R^{d+1}
X_i^{col}, y ∈ R^N
Basically, y, the vector of desired outputs, is being projected into

span{X_1^{col}, ..., X_{d+1}^{col}}  (26)

by the projection operator X (X^T X)^{-1} X^T.
153. Geometric Interpretation
We have
1 The image of the mapping w ↦ Xw is a linear subspace of R^N.
2 As w runs through all points of R^{d+1}, the function value Xw runs through all points in the image space image(X) = span{X_1^{col}, ..., X_{d+1}^{col}}.
3 Each w defines one point Xw = \sum_{j=0}^{d} w_j X_j^{col}.
4 \hat{w} is the weight vector whose image X\hat{w} minimizes the distance d(y, image(X)).
159. Multi-Class Solution
What to do?
1 We might reduce the problem to c − 1 two-class problems.
2 We might use c(c−1)/2 linear discriminants, one for every pair of classes.
However
[Figure: both constructions leave regions of the space where the classification is ambiguous]
162. What to do?
Define c linear discriminant functions

g_i(x) = w_i^T x + w_{i0} for i = 1, ..., c  (27)

This is known as a linear machine
Rule: if g_k(x) > g_j(x) for all j ≠ k =⇒ x ∈ ω_k
Nice Properties (It can be proved!!!)
1 Decision Regions are Singly Connected.
2 Decision Regions are Convex.
167. Proof of Properties
Take y = λx_A + (1 − λ)x_B with 0 ≤ λ ≤ 1, where x_A and x_B both lie in region k. We know that

g_k(y) = w_k^T (λx_A + (1 − λ)x_B) + w_{k0}
       = λ w_k^T x_A + λ w_{k0} + (1 − λ) w_k^T x_B + (1 − λ) w_{k0}
       = λ g_k(x_A) + (1 − λ) g_k(x_B)
       > λ g_j(x_A) + (1 − λ) g_j(x_B)
       = g_j(λx_A + (1 − λ)x_B)
       = g_j(y)

For all j ≠ k
Or...
y belongs to the region k defined by the rule!!!
This region is convex and singly connected by the definition of y.
175. However!!!
Not-so-nice property!!!
The convexity of the decision regions limits the classification power of the linear machine.
176. How do we train this Linear Machine?
We know that each class ω_k is described by

g_k(x) = w_k^T x + w_{k0}, where k = 1, ..., c

We then design a single machine

g(x) = W^T x  (28)
178. Where
We have the following (each row holds the bias w_{k0} followed by the weights of class k, matching the augmented x):

W^T = \begin{pmatrix}
w_{10} & w_{11} & w_{12} & \cdots & w_{1d} \\
w_{20} & w_{21} & w_{22} & \cdots & w_{2d} \\
w_{30} & w_{31} & w_{32} & \cdots & w_{3d} \\
\vdots & \vdots & \vdots & & \vdots \\
w_{c0} & w_{c1} & w_{c2} & \cdots & w_{cd}
\end{pmatrix}  (29)

What about the labels?
OK, we know what to do with 2 classes. What about many classes?
180. How do we train this Linear Machine?
Use a vector t_i of dimensionality c to identify the class of each element
We have then the following dataset
{x_i, t_i} for i = 1, 2, ..., N
We build the following matrix of target vectors

T = \begin{pmatrix} t_1^T \\ t_2^T \\ \vdots \\ t_N^T \end{pmatrix}  (30)
182. Thus, we create the following Matrix
A matrix containing all the required information

XW − T  (31)

Where we have the following vector

(x_i^T w_1, x_i^T w_2, x_i^T w_3, ..., x_i^T w_c)  (32)

Remark: It is row i of XW, i.e. the result of multiplying row i of X by W.
That row is compared to the vector t_i^T in T by using the subtraction of vectors:

\epsilon_i = (x_i^T w_1, x_i^T w_2, x_i^T w_3, ..., x_i^T w_c) − t_i^T  (33)
185. What do we want?
We want the quadratic error

\frac{1}{2} \|\epsilon_i\|^2

These squared errors are accumulated along the diagonal of the matrix

(XW − T)^T (XW − T)

So we can use the trace function to generate the desired total error

J(\cdot) = \frac{1}{2} \sum_{i=1}^{N} \|\epsilon_i\|^2  (34)
188. Then
The trace allows us to express the total error

J(W) = \frac{1}{2} \mathrm{Trace}\left[ (XW − T)^T (XW − T) \right]  (35)

Thus, we have, by the same derivative method,

W = (X^T X)^{-1} X^T T = X^+ T  (36)
190. How do we train this Linear Machine?
Thus, we obtain the discriminant

g(x) = W^T x = T^T (X^+)^T x  (37)
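A sketch of the multi-class machine of Equations (30), (36) and (37), on hypothetical three-class data with one-hot target vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
c, d, n = 3, 2, 30                            # classes, features, samples per class
means = np.array([[3, 0], [-3, 0], [0, 3]])   # hypothetical class centers

X_raw = np.vstack([rng.normal(means[k], 0.6, size=(n, d)) for k in range(c)])
X = np.hstack([np.ones((c * n, 1)), X_raw])   # augmented data matrix
labels = np.repeat(np.arange(c), n)
T = np.eye(c)[labels]                         # matrix of one-hot target vectors t_i

W = np.linalg.pinv(X) @ T                     # W = X^+ T

def predict(x):
    """g(x) = W^T x_aug; assign x to the class with the largest discriminant."""
    return int(np.argmax(W.T @ np.concatenate(([1.0], x))))

print(predict(np.array([3.0, 0.0])), predict(np.array([0.0, 3.0])))   # expected: 0 and 2
```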
192. Issues with Least Squares
Robustness
1 Least squares works only if X has full column rank, i.e. if X^T X is invertible.
2 If X^T X is close to being non-invertible, least squares is numerically unstable.
1 Statistical consequence: high variance of the predictions.
Not suited for high-dimensional data
1 Modern problems: many dimensions/features/predictors (possibly thousands).
2 Only a few of these may be important:
1 It needs some form of feature selection.
2 Possibly some type of regularization.
Why?
1 It treats all dimensions equally.
2 Relevant dimensions are averaged with irrelevant ones.
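A sketch of the stability issue on hypothetical nearly-collinear data: np.linalg.lstsq solves the same least-squares problem through a stable factorization instead of inverting X^T X explicitly.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100)
# Two nearly identical columns make X^T X nearly singular.
X = np.column_stack([np.ones(100), x, x + 1e-9 * rng.normal(size=100)])
y = 2 * x + rng.normal(scale=0.1, size=100)

print(np.linalg.cond(X.T @ X))             # a huge condition number signals trouble
w_stable, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_stable)                            # stable least-squares solution
```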
201. Issues with Least Squares
Problem with Outliers
[Figure: two panels comparing least-squares fits, "No Outliers" vs. "Outliers"]
72 / 85
202. Issues with Least Squares
What about the Linear Machine?
Please, run the algorithm and tell me...
73 / 85
203. What to Do About Numerical Stability?
Regularity
A matrix which is not invertible is also called a singular matrix. A matrix which is invertible (not singular) is called regular.
In computations
Intuitions:
1 A singular matrix maps an entire linear subspace into a single point.
2 If a matrix maps points far away from each other to points very close to each other, it almost behaves like a singular matrix.
This mapping behavior is governed by the eigenvalues!!!
Large positive eigenvalues ⇒ the matrix stretches vectors strongly!!!
Small positive eigenvalues ⇒ the matrix shrinks vectors, almost like a singular matrix!!!
74 / 85
208. Outline
1 Introduction
The Simplest Functions
Splitting the Space
The Decision Surface
2 Developing an Initial Solution
Gradient Descent Procedure
The Geometry of a Two-Category Linearly-Separable Case
Basic Method
Minimum Squared Error Procedure
The Error Idea
The Final Error Equation
The Data Matrix
Multi-Class Solution
Issues with Least Squares!!!
What about Numerical Stability?
75 / 85
209. What to Do About Numerical Stability?
All this comes from the following statement
A positive semi-definite matrix A is singular ⇐⇒ its smallest eigenvalue is 0
Consequence for Statistics
If a statistical prediction involves the inverse of an almost-singular matrix, the predictions become unreliable (high variance).
76 / 85
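A small experiment (synthetic data of our own) illustrating that consequence: with two nearly collinear columns, X^T X is almost singular, and the least-squares weights swing wildly across noise draws:

import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(size=50)
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=50)])   # almost collinear columns
for _ in range(3):
    y = x1 + 0.01 * rng.normal(size=50)
    w = np.linalg.solve(X.T @ X, X.T @ y)   # inverse of an almost-singular matrix
    print(w)                                # large, unstable coefficients each draw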
211. What can be done?
Ridge Regression
Ridge regression is a modification of least squares. It tries to make least squares more robust when X^T X is almost singular.
The solution
w_Ridge = (X^T X + λI)^{-1} X^T y (38)
where λ is a tuning parameter
Thus, since X^T X is symmetric positive semi-definite, we can do the following
Assume that ξ_1, ξ_2, ..., ξ_{d+1} are eigenvectors of X^T X with eigenvalues λ_1, λ_2, ..., λ_{d+1}:
(X^T X + λI) ξ_i = (λ_i + λ) ξ_i (39)
i.e. λ_i + λ is an eigenvalue of X^T X + λI
77 / 85
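A minimal sketch of equation (38); the function name and toy data are ours, and np.linalg.solve is used instead of forming the inverse explicitly:

import numpy as np

def ridge_fit(X, y, lam):
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)     # shifts every eigenvalue λ_i to λ_i + λ, per eq. (39)
    return np.linalg.solve(A, X.T @ y)

# The same almost-collinear design as before: ridge keeps the weights tame.
rng = np.random.default_rng(2)
x1 = rng.normal(size=50)
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=50)])
y = x1 + 0.01 * rng.normal(size=50)
print(ridge_fit(X, y, lam=1e-3))      # both weights near 0.5, no blow-up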
214. What does this mean?
Something Notable
You can control near-singularity by monitoring the smallest eigenvalue.
Thus
We add an appropriate tuning value λ.
78 / 85
216. Thus, what do we need to do?
Process
1 Find the eigenvalues of X^T X.
2 If all of them are bigger than zero, we are fine!!!
3 Find the smallest one, then tune λ if necessary.
4 Build w_Ridge = (X^T X + λI)^{-1} X^T y.
79 / 85
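The process above can be sketched directly; the eigenvalue floor and the heuristic for picking λ are our own illustration:

import numpy as np

def choose_lambda(X, floor=1e-6):
    eigvals = np.linalg.eigvalsh(X.T @ X)   # eigenvalues in ascending order
    smallest = eigvals[0]
    if smallest > floor:
        return 0.0                          # step 2: all eigenvalues positive, we are fine
    return floor - smallest                 # steps 3-4: shift the spectrum up to the floor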
220. What about Thousands of Features?
There is a technique for that
The Least Absolute Shrinkage and Selection Operator (LASSO), invented by Robert Tibshirani, uses the penalty L1 = Σ_{i=1}^d |w_i|.
The Least Squared Error takes the form of
Σ_{i=1}^N (y_i − x_i^T w)^2 + λ Σ_{i=1}^d |w_i| (40)
However
You have other regularizations, such as L2 = Σ_{i=1}^d |w_i|^2
80 / 85
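As a hedged illustration of the L1 penalty's selection effect — synthetic data, with scikit-learn's Lasso standing in for equation (40) and its alpha playing the role of λ:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))            # 50 features...
w_true = np.zeros(50)
w_true[:3] = [2.0, -3.0, 1.5]             # ...only 3 of them relevant
y = X @ w_true + 0.1 * rng.normal(size=200)

model = Lasso(alpha=0.1).fit(X, y)
print(np.flatnonzero(model.coef_))        # typically just [0 1 2]: the rest are zeroed out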
225. The seminal paper by Robert Tibshirani
An initial study of this regularization can be seen in
“Regression Shrinkage and Selection via the LASSO” by Robert Tibshirani (1996)
83 / 85
226. This is out of the scope of this class
However, it is worth noticing that the most efficient method for solving LASSO problems is
“Pathwise Coordinate Optimization” by Jerome Friedman, Trevor Hastie, Holger Höfling and Robert Tibshirani
Nevertheless
It would make a great seminar paper!!!
84 / 85
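A minimal coordinate-descent sketch in the spirit of that paper — our simplified version with cyclic updates and soft thresholding, not the authors' full pathwise algorithm:

import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * max(abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    # Minimizes (1/2)||y - Xw||^2 + lam * ||w||_1 by cycling over coordinates.
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(d):
            r = y - X @ w + X[:, j] * w[j]            # partial residual, excluding feature j
            w[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]
    return w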
228. Exercises
Duda and Hart
Chapter 5: 1, 3, 4, 7, 13, 17
Bishop
Chapter 4: 4.1, 4.4, 4.7
Theodoridis
Chapter 3 - Problems: 3.6, using Python
Chapter 3 - Computer Experiments: 3.1 using Python; 3.2 using Python and Newton's method
85 / 85