What do we want?
What
Given a set of measurements, the goal is to discover compact and informative representations of the obtained data.
Our Approach
We want to “squeeze” the data into a relatively small number of features, leading to a reduction of the necessary feature-space dimension.
Properties
This removes information redundancies, which are usually produced by the measurement process.
Which methods will we see?
Fisher Linear Discriminant
1 Squeezing to the maximum.
2 From many dimensions to one.
Principal Component Analysis
1 Not so much squeezing.
2 You are willing to lose some information.
Singular Value Decomposition
1 Decompose an m × n data matrix A as A = USV^T, where U and V are orthonormal matrices and S contains the singular values.
2 You can read more about it in “Singular Value Decomposition - Dimensionality Reduction.”
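A minimal sketch of this factorization in NumPy; the matrix A below is made-up illustrative data, not taken from the lecture:

```python
# Sketch of the SVD factorization A = U S V^T using NumPy.
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [1.0, 1.0]])                        # an m x n data matrix (m=3, n=2)

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD: A = U S V^T
S = np.diag(s)                                    # singular values on the diagonal

print(np.allclose(A, U @ S @ Vt))                 # True: the factors reconstruct A
```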
Rotation
Projecting
Projecting well-separated samples onto an arbitrary line usually produces a confused mixture of samples from all of the classes, and thus produces poor recognition performance.
Something Notable
However, moving and rotating the line around might result in an orientation for which the projected samples are well separated.
Fisher Linear Discriminant (FLD)
It is a discriminant analysis that seeks directions that are efficient for discriminating between the classes in a binary classification problem.
This is actually coming from...
A Classifier as
A machine for dimensionality reduction.
Initial Setup
We have:
N d-dimensional samples x_1, x_2, ..., x_N.
N_i is the number of samples in class C_i, for i = 1, 2.
Then, we ask for the projection of each x_i onto the line by means of

y_i = w^T x_i    (1)
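A small sketch of Eq. (1) in NumPy; the data and the direction w are arbitrary illustrations, not values from the lecture:

```python
# Each sample x_i in R^d is mapped to the scalar y_i = w^T x_i.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))    # N=10 samples, d=3
w = np.array([1.0, 0.5, -0.2])  # an arbitrary projection direction

y = X @ w                       # y_i = w^T x_i for every sample at once
print(y.shape)                  # (10,) -- one scalar per sample
```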
Use the Mean of each Class
Then
Select w such that class separation is maximized.
We then define the mean sample for each class:
1 C_1 \Rightarrow m_1 = \frac{1}{N_1} \sum_{i=1}^{N_1} x_i
2 C_2 \Rightarrow m_2 = \frac{1}{N_2} \sum_{i=1}^{N_2} x_i
Ok!!! This is giving us a measure of distance
Thus, we want to maximize the distance between the projected means:

\tilde{m}_1 - \tilde{m}_2 = w^T (m_1 - m_2)    (2)

where \tilde{m}_k = w^T m_k for k = 1, 2.
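A sketch of Eq. (2) in NumPy, using synthetic class data; it checks that the separation of the projected means equals the projection of the mean difference:

```python
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal(loc=0.0, size=(20, 2))   # class C1 samples (synthetic)
X2 = rng.normal(loc=3.0, size=(25, 2))   # class C2 samples (synthetic)
w = np.array([1.0, 1.0])

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)      # class means
m1_t, m2_t = w @ m1, w @ m2                    # projected means
print(np.isclose(m1_t - m2_t, w @ (m1 - m2)))  # True, as in Eq. (2)
```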
Fixing the Problem
To obtain good separation of the projected data
The difference between the means should be large relative to some measure of the standard deviation of each class.
We define a SCATTER measure (based on the sample variance)

s_k^2 = \sum_{x_i \in C_k} \left( w^T x_i - \tilde{m}_k \right)^2 = \sum_{y_i = w^T x_i \in C_k} (y_i - \tilde{m}_k)^2    (3)

We then define the within-class variance for the whole data set as

s_1^2 + s_2^2    (4)
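A sketch of the scatter in Eq. (3) for one class, again on made-up data: the sum of squared deviations of the projected samples around their projected mean.

```python
import numpy as np

def scatter(Xk, w):
    """s_k^2 = sum over class k of (w^T x_i - m~_k)^2, as in Eq. (3)."""
    y = Xk @ w                      # projected samples of class k
    return np.sum((y - y.mean())**2)

rng = np.random.default_rng(2)
X1 = rng.normal(size=(20, 2))       # synthetic class C1
w = np.array([1.0, 0.0])
print(scatter(X1, w))               # within-class scatter of C1 along w
```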
Finally, a Cost Function
The between-class variance

(\tilde{m}_1 - \tilde{m}_2)^2    (5)

The Fisher criterion

\frac{\text{between-class variance}}{\text{within-class variance}}    (6)

Finally

J(w) = \frac{(\tilde{m}_1 - \tilde{m}_2)^2}{s_1^2 + s_2^2}    (7)
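A sketch of the Fisher criterion of Eq. (7), combining the projected-mean separation with the within-class scatters; the data is synthetic:

```python
import numpy as np

def fisher_criterion(X1, X2, w):
    """J(w) = (m~_1 - m~_2)^2 / (s_1^2 + s_2^2), Eq. (7)."""
    y1, y2 = X1 @ w, X2 @ w
    between = (y1.mean() - y2.mean())**2                            # Eq. (5)
    within = np.sum((y1 - y1.mean())**2) + np.sum((y2 - y2.mean())**2)
    return between / within

rng = np.random.default_rng(3)
X1 = rng.normal(loc=0.0, size=(20, 2))
X2 = rng.normal(loc=3.0, size=(25, 2))
print(fisher_criterion(X1, X2, np.array([1.0, 1.0])))
```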
We use a transformation to simplify our life
First

J(w) = \frac{\left( w^T m_1 - w^T m_2 \right)^2}{\sum_{y_i = w^T x_i \in C_1} (y_i - \tilde{m}_1)^2 + \sum_{y_i = w^T x_i \in C_2} (y_i - \tilde{m}_2)^2}    (8)

Second

J(w) = \frac{\left( w^T m_1 - w^T m_2 \right) \left( w^T m_1 - w^T m_2 \right)^T}{\sum_{x_i \in C_1} \left( w^T x_i - \tilde{m}_1 \right) \left( w^T x_i - \tilde{m}_1 \right)^T + \sum_{x_i \in C_2} \left( w^T x_i - \tilde{m}_2 \right) \left( w^T x_i - \tilde{m}_2 \right)^T}    (9)

Third

J(w) = \frac{w^T (m_1 - m_2) \left( w^T (m_1 - m_2) \right)^T}{\sum_{x_i \in C_1} w^T (x_i - m_1) \left( w^T (x_i - m_1) \right)^T + \sum_{x_i \in C_2} w^T (x_i - m_2) \left( w^T (x_i - m_2) \right)^T}    (10)
Transformation
Fourth

J(w) = \frac{w^T (m_1 - m_2)(m_1 - m_2)^T w}{\sum_{x_i \in C_1} w^T (x_i - m_1)(x_i - m_1)^T w + \sum_{x_i \in C_2} w^T (x_i - m_2)(x_i - m_2)^T w}    (11)

Fifth

J(w) = \frac{w^T (m_1 - m_2)(m_1 - m_2)^T w}{w^T \left[ \sum_{x_i \in C_1} (x_i - m_1)(x_i - m_1)^T + \sum_{x_i \in C_2} (x_i - m_2)(x_i - m_2)^T \right] w}    (12)

Now Rename

J(w) = \frac{w^T S_B w}{w^T S_W w}    (13)

where S_B = (m_1 - m_2)(m_1 - m_2)^T is the between-class scatter matrix and S_W is the bracketed sum in Eq. (12), the within-class scatter matrix.
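A sketch of the renaming in Eq. (13): build S_B and S_W from synthetic class samples and evaluate J(w) in its ratio-of-quadratic-forms shape.

```python
import numpy as np

def scatter_matrices(X1, X2):
    """S_B = (m1-m2)(m1-m2)^T and S_W = sum of class scatter matrices."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    d = m1 - m2
    S_B = np.outer(d, d)                                      # between-class
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class
    return S_B, S_W

rng = np.random.default_rng(4)
X1 = rng.normal(loc=0.0, size=(20, 2))
X2 = rng.normal(loc=3.0, size=(25, 2))
S_B, S_W = scatter_matrices(X1, X2)
w = np.array([1.0, 1.0])
print((w @ S_B @ w) / (w @ S_W @ w))    # J(w) in the form of Eq. (13)
```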
Derive with respect to w
Thus

\frac{dJ(w)}{dw} = \frac{d \left[ \left( w^T S_B w \right) \left( w^T S_W w \right)^{-1} \right]}{dw} = 0    (14)

Then

\frac{dJ(w)}{dw} = \left( S_B w + S_B^T w \right) \left( w^T S_W w \right)^{-1} - \left( w^T S_B w \right) \left( w^T S_W w \right)^{-2} \left( S_W w + S_W^T w \right) = 0    (15)

Now, because of the symmetry of S_B and S_W,

\frac{dJ(w)}{dw} = \frac{S_B w}{w^T S_W w} - \frac{\left( w^T S_B w \right) S_W w}{\left( w^T S_W w \right)^2} = 0    (16)
Now, Several Tricks!!!
First

S_B w = (m_1 - m_2)(m_1 - m_2)^T w = \alpha (m_1 - m_2)    (19)

where \alpha = (m_1 - m_2)^T w is a simple scalar constant.
It means that S_B w always lies in the direction of m_1 - m_2!!!
In addition
w^T S_W w and w^T S_B w are scalar constants.
Now, Several Tricks!!!
Finally

S_W w \propto (m_1 - m_2) \Rightarrow w \propto S_W^{-1} (m_1 - m_2)    (20)

Once the data is transformed into y_i
Use a threshold y_0 ⇒ x ∈ C_1 iff y(x) ≥ y_0, or x ∈ C_2 iff y(x) < y_0.
Alternatively, maximum likelihood with a Gaussian can be used to classify the transformed data with Naive Bayes (by the Central Limit Theorem, y = w^T x is a sum of random variables and is thus approximately Gaussian).
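A minimal end-to-end sketch of Eq. (20) on synthetic Gaussian classes. The midpoint of the projected means is used as the threshold y_0; that is a simple assumed choice for illustration, not a rule derived in the lecture.

```python
import numpy as np

rng = np.random.default_rng(5)
X1 = rng.normal(loc=0.0, size=(50, 2))   # class C1 (synthetic)
X2 = rng.normal(loc=3.0, size=(50, 2))   # class C2 (synthetic)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
w = np.linalg.solve(S_W, m1 - m2)        # w ∝ S_W^{-1}(m1 - m2), Eq. (20)

y0 = 0.5 * (w @ m1 + w @ m2)             # assumed threshold: midpoint of projected means
x_new = np.array([0.5, 0.2])
print("C1" if w @ x_new >= y0 else "C2") # classify with the threshold rule above
```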
Also Known as the Karhunen-Loève Transform
Setup
Consider a data set of observations {x_n} with n = 1, 2, ..., N and x_n ∈ R^d.
Goal
Project the data onto a space with dimensionality m < d (we assume m is given).
Use the 1-Dimensional Variance
Remember the sample variance

VAR(X) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})}{N - 1}    (21)

You can do the same in the case of two variables X and Y

COV(X, Y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}    (22)
Now, Define
Given the data

x_1, x_2, ..., x_N    (23)

where each x_i is a column vector.
Construct the sample mean

\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i    (24)

Build the new, centered data

x_1 - \bar{x}, x_2 - \bar{x}, ..., x_N - \bar{x}    (25)
Build the Sample Covariance
The Covariance Matrix

S = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T    (26)

Properties
1 The ij-th entry of S is the covariance σ_ij between features i and j.
2 The ii-th entry of S is the variance σ_i^2 of feature i.
3 What else? Look at the data in a plane: centering and rotating!!!
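A sketch of Eq. (26) on made-up data; note that NumPy's np.cov with rowvar=False uses the same 1/(N-1) normalization, so the two agree.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))            # N=100 synthetic samples in R^3

Xc = X - X.mean(axis=0)                  # centered data, as in Eq. (25)
S = (Xc.T @ Xc) / (X.shape[0] - 1)       # sample covariance, Eq. (26)

print(np.allclose(S, np.cov(X, rowvar=False)))   # True
```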
Using S to Project Data
As in Fisher
We want to project the data onto a line...
For this we use a direction u_1 with u_1^T u_1 = 1.
Question
What is the sample variance of the projected data?
Thus we have
The variance of the projected data

\frac{1}{N - 1} \sum_{i=1}^{N} \left( u_1^T x_i - u_1^T \bar{x} \right)^2 = u_1^T S u_1    (27)

Use Lagrange multipliers to maximize

u_1^T S u_1 + \lambda_1 \left( 1 - u_1^T u_1 \right)    (28)
Derive with respect to u_1
We get

S u_1 = \lambda_1 u_1    (29)

Then
u_1 is an eigenvector of S.
If we left-multiply by u_1^T,

u_1^T S u_1 = \lambda_1    (30)
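A quick numeric check of Eqs. (29)-(30) on synthetic data: for an eigenvector u_1 of S, the projected variance u_1^T S u_1 equals its eigenvalue λ_1.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 3))
S = np.cov(X, rowvar=False)              # sample covariance of made-up data

eigvals, eigvecs = np.linalg.eigh(S)     # eigh: S is symmetric, eigenvalues ascending
u1, lam1 = eigvecs[:, -1], eigvals[-1]   # top eigenpair

print(np.isclose(u1 @ S @ u1, lam1))     # True, matching Eq. (30)
```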
Thus
The variance is maximized when

u_1^T S u_1 = \lambda_1    (31)

is set to the largest eigenvalue; u_1 is also known as the First Principal Component.
By Induction
For a projection onto an M-dimensional space, it is possible to define M eigenvectors u_1, u_2, ..., u_M of the data covariance S, corresponding to the M largest eigenvalues λ_1, λ_2, ..., λ_M, that maximize the variance of the projected data.
Computational Cost
1 Full eigenvector decomposition: O(d^3)
2 Power method: O(Md^2) (Golub and Van Loan, 1996)
3 Use the Expectation Maximization algorithm.
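A sketch contrasting the first two options on made-up data: PCA by full eigendecomposition of S, next to the power method for just the leading principal component.

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 5))
S = np.cov(X, rowvar=False)

# Full decomposition, O(d^3): keep the eigenvectors of the M largest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(S)      # eigenvalues in ascending order
M = 2
U = eigvecs[:, ::-1][:, :M]               # top-M principal directions
Z = (X - X.mean(axis=0)) @ U              # projected data, shape (N, M)

# Power method for the first component: repeated multiplication by S
# converges to the eigenvector of the largest eigenvalue.
u = rng.normal(size=S.shape[0])
for _ in range(200):
    u = S @ u
    u /= np.linalg.norm(u)

print(np.isclose(abs(u @ U[:, 0]), 1.0))  # aligned with the first component
```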