1. Machine Learning for Data Mining
Fisher Linear Discriminant
Andres Mendez-Vazquez
May 26, 2015
2. Outline
1 Introduction
History
The Rotation Idea
Classifiers as Machines for dimensionality reduction
2 Solution
Use the mean of each Class
Scatter measure
The Cost Function
A Transformation for simplification and defining the cost function
3 Where is this used?
Applications
4 Relation with Least Squared Error
What?
5. Rotation
Projecting
Projecting well-separated samples onto an arbitrary line usually produces a
confused mixture of samples from all of the classes and thus produces poor
recognition performance.
Something Notable
However, moving and rotating the line around might result in an
orientation for which the projected samples are well separated.
Fisher linear discriminant (FLD)
It is a discriminant analysis that seeks directions which are efficient for
discriminating between the two classes of a binary classification problem.
10. This is actually coming from...
Classifier as
A machine for dimensionality reduction.
Initial Setup
We have:
$N$ $d$-dimensional samples $x_1, x_2, \ldots, x_N$.
$N_i$ is the number of samples in class $C_i$ for $i = 1, 2$.
Then, we ask for the projection of each $x_i$ onto the line by means of
$$y_i = w^T x_i \quad (1)$$
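To make Equation (1) concrete, here is a minimal sketch (a toy NumPy example with made-up samples and an arbitrary direction $w$; it is used as a running illustration in the snippets below, not part of the original slides):

```python
import numpy as np

# Toy data: N = 6 samples in d = 2 dimensions (illustrative values only)
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],    # three samples from class C1
              [4.0, 4.5], [4.2, 4.0], [3.8, 4.3]])   # three samples from class C2

w = np.array([1.0, 1.0])   # an arbitrary projection direction (not yet optimal)

y = X @ w                  # Equation (1): y_i = w^T x_i for every sample
print(y)                   # one scalar per sample: the 1-D projection
```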
16. Use the mean of each Class
Then
Select w such that class separation is maximized
We then define the mean sample for each class:
1. $C_1 \Rightarrow \mathbf{m}_1 = \frac{1}{N_1}\sum_{i=1}^{N_1} x_i$
2. $C_2 \Rightarrow \mathbf{m}_2 = \frac{1}{N_2}\sum_{i=1}^{N_2} x_i$
Ok!!! This is giving us a measure of distance
Thus, we want to maximize the distance between the projected means:
$$m_1 - m_2 = w^T\left(\mathbf{m}_1 - \mathbf{m}_2\right) \quad (2)$$
where $m_k = w^T \mathbf{m}_k$ for $k = 1, 2$.
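Continuing the toy setup from the previous snippet (an assumption of these notes), the class means and the separation of the projected means in Equation (2) are:

```python
m1 = X[:3].mean(axis=0)    # vector mean of class C1
m2 = X[3:].mean(axis=0)    # vector mean of class C2

proj_sep = w @ (m1 - m2)   # Equation (2): m_1 - m_2 = w^T(m1 - m2)
print(m1, m2, proj_sep)
```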
21. However
We could simply seek
$$\max_{w} \; w^T\left(\mathbf{m}_1 - \mathbf{m}_2\right) \quad \text{s.t.} \quad \sum_{i=1}^{d} w_i = 1$$
After all
We do not care about the magnitude of $w$.
25. Fixing the Problem
To obtain good separation of the projected data
The difference between the means should be large relative to some
measure of the standard deviations for each class.
We define a SCATTER measure (based on the sample variance)
$$s_k^2 = \sum_{x_i \in C_k}\left(w^T x_i - m_k\right)^2 = \sum_{y_i = w^T x_i \in C_k}\left(y_i - m_k\right)^2 \quad (3)$$
We then define the within-class variance for the whole data as
$$s_1^2 + s_2^2 \quad (4)$$
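Again continuing the toy setup, Equations (3) and (4) can be evaluated directly as a sanity check:

```python
y1, y2 = X[:3] @ w, X[3:] @ w     # projections of the C1 and C2 samples
mu1, mu2 = y1.mean(), y2.mean()   # projected means m_1 and m_2

s1_sq = np.sum((y1 - mu1) ** 2)   # scatter of class C1, Equation (3)
s2_sq = np.sum((y2 - mu2) ** 2)   # scatter of class C2
within = s1_sq + s2_sq            # within-class variance, Equation (4)
```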
29. Finally, a Cost Function
The between-class variance
$$\left(m_1 - m_2\right)^2 \quad (5)$$
The Fisher criterion
$$\frac{\text{between-class variance}}{\text{within-class variance}} \quad (6)$$
Finally
$$J(w) = \frac{\left(m_1 - m_2\right)^2}{s_1^2 + s_2^2} \quad (7)$$
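Equation (7) is easy to wrap as a function; the following sketch (still on the toy data, an assumption of this illustration) evaluates the Fisher criterion for any candidate direction:

```python
def fisher_criterion(w, X1, X2):
    """Equation (7): J(w) = (m_1 - m_2)^2 / (s_1^2 + s_2^2)."""
    y1, y2 = X1 @ w, X2 @ w
    between = (y1.mean() - y2.mean()) ** 2
    within = np.sum((y1 - y1.mean()) ** 2) + np.sum((y2 - y2.mean()) ** 2)
    return between / within

print(fisher_criterion(w, X[:3], X[3:]))   # J for the arbitrary direction above
```

A better projection direction should yield a larger value of $J(w)$.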
33. We use a transformation to simplify our life
First
$$J(w) = \frac{\left(w^T \mathbf{m}_1 - w^T \mathbf{m}_2\right)^2}{\sum_{y_i = w^T x_i \in C_1}\left(y_i - m_1\right)^2 + \sum_{y_i = w^T x_i \in C_2}\left(y_i - m_2\right)^2} \quad (8)$$
Second
$$J(w) = \frac{\left(w^T \mathbf{m}_1 - w^T \mathbf{m}_2\right)\left(w^T \mathbf{m}_1 - w^T \mathbf{m}_2\right)^T}{\sum_{y_i \in C_1}\left(w^T x_i - m_1\right)\left(w^T x_i - m_1\right)^T + \sum_{y_i \in C_2}\left(w^T x_i - m_2\right)\left(w^T x_i - m_2\right)^T} \quad (9)$$
Third
$$J(w) = \frac{w^T\left(\mathbf{m}_1 - \mathbf{m}_2\right)\left(w^T\left(\mathbf{m}_1 - \mathbf{m}_2\right)\right)^T}{\sum_{x_i \in C_1} w^T\left(x_i - \mathbf{m}_1\right)\left(w^T\left(x_i - \mathbf{m}_1\right)\right)^T + \sum_{x_i \in C_2} w^T\left(x_i - \mathbf{m}_2\right)\left(w^T\left(x_i - \mathbf{m}_2\right)\right)^T} \quad (10)$$
36. Transformation
Fourth
$$J(w) = \frac{w^T\left(\mathbf{m}_1 - \mathbf{m}_2\right)\left(\mathbf{m}_1 - \mathbf{m}_2\right)^T w}{\sum_{x_i \in C_1} w^T\left(x_i - \mathbf{m}_1\right)\left(x_i - \mathbf{m}_1\right)^T w + \sum_{x_i \in C_2} w^T\left(x_i - \mathbf{m}_2\right)\left(x_i - \mathbf{m}_2\right)^T w} \quad (11)$$
Fifth
$$J(w) = \frac{w^T\left(\mathbf{m}_1 - \mathbf{m}_2\right)\left(\mathbf{m}_1 - \mathbf{m}_2\right)^T w}{w^T\left[\sum_{x_i \in C_1}\left(x_i - \mathbf{m}_1\right)\left(x_i - \mathbf{m}_1\right)^T + \sum_{x_i \in C_2}\left(x_i - \mathbf{m}_2\right)\left(x_i - \mathbf{m}_2\right)^T\right] w} \quad (12)$$
Now Rename
$$J(w) = \frac{w^T S_B w}{w^T S_W w} \quad (13)$$
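To see that Equation (13) is just a rewriting of Equation (7), one can build $S_B$ and $S_W$ explicitly (continuing the toy setup; the variable names are assumptions of this sketch):

```python
diff = (m1 - m2).reshape(-1, 1)   # column vector m1 - m2
S_B = diff @ diff.T               # between-class scatter matrix

S_W = sum(np.outer(x - m1, x - m1) for x in X[:3]) + \
      sum(np.outer(x - m2, x - m2) for x in X[3:])   # within-class scatter matrix

J = (w @ S_B @ w) / (w @ S_W @ w)  # Equation (13), same value as Equation (7)
print(J)
```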
39. Derive with respect to w
Thus
$$\frac{dJ(w)}{dw} = \frac{d\left[\left(w^T S_B w\right)\left(w^T S_W w\right)^{-1}\right]}{dw} = 0 \quad (14)$$
Then
$$\frac{dJ(w)}{dw} = \left(S_B w + S_B^T w\right)\left(w^T S_W w\right)^{-1} - \left(w^T S_B w\right)\left(w^T S_W w\right)^{-2}\left(S_W w + S_W^T w\right) = 0 \quad (15)$$
Now, because of the symmetry of $S_B$ and $S_W$,
$$\frac{dJ(w)}{dw} = \frac{S_B w}{w^T S_W w} - \frac{\left(w^T S_B w\right) S_W w}{\left(w^T S_W w\right)^2} = 0 \quad (16)$$
42. Derive with respect to w
Thus
$$\frac{dJ(w)}{dw} = \frac{S_B w}{w^T S_W w} - \frac{\left(w^T S_B w\right) S_W w}{\left(w^T S_W w\right)^2} = 0 \quad (17)$$
Then
$$\left(w^T S_W w\right) S_B w = \left(w^T S_B w\right) S_W w \quad (18)$$
44. Now, Several Tricks!!!
First
$$S_B w = \left(\mathbf{m}_1 - \mathbf{m}_2\right)\left(\mathbf{m}_1 - \mathbf{m}_2\right)^T w = \alpha\left(\mathbf{m}_1 - \mathbf{m}_2\right) \quad (19)$$
where $\alpha = \left(\mathbf{m}_1 - \mathbf{m}_2\right)^T w$ is a simple (scalar) constant.
It means that $S_B w$ is always in the direction of $\mathbf{m}_1 - \mathbf{m}_2$!!!
In addition
$w^T S_W w$ and $w^T S_B w$ are constants.
47. Now, Several Tricks!!!
Finally
$$S_W w \propto \left(\mathbf{m}_1 - \mathbf{m}_2\right) \Rightarrow w \propto S_W^{-1}\left(\mathbf{m}_1 - \mathbf{m}_2\right) \quad (20)$$
Once the data is transformed into $y_i$
Use a threshold $y_0 \Rightarrow x \in C_1$ iff $y(x) \geq y_0$, or $x \in C_2$ iff $y(x) < y_0$.
Or maximum likelihood with a Gaussian can be used to classify the transformed
data using Naive Bayes (by the Central Limit Theorem, $y = w^T x$ is a sum of
random variables, hence approximately Gaussian).
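Putting Equations (19) and (20) together with the threshold rule above, a minimal sketch of the whole procedure might look as follows (the midpoint threshold and the toy samples are assumptions of this illustration, not something fixed by the derivation):

```python
def fisher_fit(X1, X2):
    """Return the FLD direction w ∝ S_W^{-1}(m1 - m2) and a midpoint threshold y0."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter
    w = np.linalg.solve(S_W, m1 - m2)                        # Equation (20)
    y0 = 0.5 * (w @ m1 + w @ m2)                             # threshold halfway between projected means
    return w, y0

def fisher_predict(w, y0, Xnew):
    """Assign C1 when y(x) = w^T x >= y0, otherwise C2."""
    return np.where(Xnew @ w >= y0, 1, 2)

w_fld, y0 = fisher_fit(X[:3], X[3:])
print(fisher_predict(w_fld, y0, X))   # should recover the class labels of the toy samples
```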
51. Applications
Something Notable
Bankruptcy prediction
In bankruptcy prediction based on accounting ratios and other financial
variables, linear discriminant analysis was the first statistical method
applied to systematically explain which firms entered bankruptcy vs.
survived.
Face recognition
In computerized face recognition, each face is represented by a large
number of pixel values.
The linear combinations obtained using Fisher’s linear discriminant are
called Fisher faces.
Marketing
In marketing, discriminant analysis was once often used to determine
the factors which distinguish different types of customers and/or
products on the basis of surveys or other forms of collected data.
61. Please
Your reading material, about the multi-class case:
Section 4.1.6, “Fisher’s discriminant for multiple classes,” in “Pattern Recognition and Machine Learning”
by Bishop.
63. Relation with Least Squared Error
First
The least-squares approach to the determination of a linear discriminant
was based on the goal of making the model predictions as close as
possible to a set of target values.
Second
The Fisher criterion was derived by requiring maximum class separation in
the output space.
65. How do we do this?
First
We have $N$ samples.
We have $N_1$ samples for class $C_1$.
We have $N_2$ samples for class $C_2$.
So, we decide the following for the targets on each class:
The target for class $C_1$ is $\frac{N}{N_1}$.
The target for class $C_2$ is $-\frac{N}{N_2}$.
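As a small sketch of this target coding (reusing the toy classes from the earlier snippets; the variable names are assumptions of this illustration):

```python
X1, X2 = X[:3], X[3:]          # the two toy classes from the first snippet
N1, N2 = len(X1), len(X2)
N = N1 + N2

# Target t_n = N/N1 for samples of C1 and t_n = -N/N2 for samples of C2
t = np.concatenate([np.full(N1, N / N1), np.full(N2, -N / N2)])
print(t)
```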
69. Thus
The new cost function
$$E = \frac{1}{2}\sum_{n=1}^{N}\left(w^T x_n + w_0 - t_n\right)^2 \quad (21)$$
Deriving with respect to $w$
$$\sum_{n=1}^{N}\left(w^T x_n + w_0 - t_n\right) x_n = 0 \quad (22)$$
Deriving with respect to $w_0$
$$\sum_{n=1}^{N}\left(w^T x_n + w_0 - t_n\right) = 0 \quad (23)$$
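Setting the derivatives (22) and (23) to zero is just the normal-equation system for the augmented design matrix, so the minimizer of Equation (21) can be computed directly (a sketch continuing the previous snippets):

```python
# Augment each sample with a leading 1 so w0 is estimated together with w
Xa = np.column_stack([np.ones(N), np.vstack([X1, X2])])

# Equations (22)-(23) combined: Xa^T Xa [w0, w]^T = Xa^T t
sol = np.linalg.solve(Xa.T @ Xa, Xa.T @ t)
w0_ls, w_ls = sol[0], sol[1:]
print(w0_ls, w_ls)
```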
72. Then
We have that
$$\sum_{n=1}^{N}\left(w^T x_n + w_0 - t_n\right) = \sum_{n=1}^{N}\left(w^T x_n + w_0\right) - \sum_{n=1}^{N} t_n = \sum_{n=1}^{N}\left(w^T x_n + w_0\right) - \left[N_1\frac{N}{N_1} - N_2\frac{N}{N_2}\right] = \sum_{n=1}^{N}\left(w^T x_n + w_0\right)$$
Then
$$\sum_{n=1}^{N} w^T x_n + N w_0 = 0$$
74. Then
We have that
$$w_0 = -w^T \frac{1}{N}\sum_{n=1}^{N} x_n$$
We rename $\frac{1}{N}\sum_{n=1}^{N} x_n = \mathbf{m}$:
$$\mathbf{m} = \frac{1}{N}\sum_{n=1}^{N} x_n = \frac{1}{N}\left[N_1\mathbf{m}_1 + N_2\mathbf{m}_2\right]$$
Finally
$$w_0 = -w^T \mathbf{m}$$
77. Now
In a similar way
$$\sum_{n=1}^{N}\left(w^T x_n + w_0\right) x_n - \sum_{n=1}^{N} t_n x_n = 0$$
78. Thus, we have
Something Notable
$$\sum_{n=1}^{N}\left(w^T x_n + w_0\right) x_n - \frac{N}{N_1}\sum_{n=1}^{N_1} x_n + \frac{N}{N_2}\sum_{n=1}^{N_2} x_n = 0$$
Thus
$$\sum_{n=1}^{N}\left(w^T x_n + w_0\right) x_n - N\left(\mathbf{m}_1 - \mathbf{m}_2\right) = 0$$
80. Next
Then, substituting $w_0 = -w^T \mathbf{m}$,
$$\sum_{n=1}^{N}\left(w^T x_n - w^T \mathbf{m}\right) x_n = N\left(\mathbf{m}_1 - \mathbf{m}_2\right)$$
Now
What? Please do this as a take-home exercise for the next class.
83. Now, Do you have the solution?
You have a version in Duda and Hart, Section 5.8:
$$\hat{w} = \left(X^T X\right)^{-1} X^T y$$
Thus
$$X^T X \hat{w} = X^T y$$
Now, we rewrite the data matrix using the normalization discussed earlier:
$$X = \begin{bmatrix} \mathbf{1}_1 & X_1 \\ -\mathbf{1}_2 & -X_2 \end{bmatrix}$$
86. In addition
Our old augmented $w$
$$w = \begin{bmatrix} w_0 \\ w \end{bmatrix}$$
And our new $y$
$$y = \begin{bmatrix} \frac{N}{N_1}\mathbf{1}_1 \\ \frac{N}{N_2}\mathbf{1}_2 \end{bmatrix} \quad (24)$$
88. Thus, we have
Something Notable
$$\begin{bmatrix} \mathbf{1}_1^T & -\mathbf{1}_2^T \\ X_1^T & -X_2^T \end{bmatrix}\begin{bmatrix} \mathbf{1}_1 & X_1 \\ -\mathbf{1}_2 & -X_2 \end{bmatrix}\begin{bmatrix} w_0 \\ w \end{bmatrix} = \begin{bmatrix} \mathbf{1}_1^T & -\mathbf{1}_2^T \\ X_1^T & -X_2^T \end{bmatrix}\begin{bmatrix} \frac{N}{N_1}\mathbf{1}_1 \\ \frac{N}{N_2}\mathbf{1}_2 \end{bmatrix}$$
Thus, if we use the following definitions
$$\mathbf{m}_i = \frac{1}{N_i}\sum_{x \in C_i} x$$
$$S_W = \sum_{x_i \in C_1}\left(x_i - \mathbf{m}_1\right)\left(x_i - \mathbf{m}_1\right)^T + \sum_{x_i \in C_2}\left(x_i - \mathbf{m}_2\right)\left(x_i - \mathbf{m}_2\right)^T$$
92. Then
We have the following
$$\begin{bmatrix} N & \left(N_1\mathbf{m}_1 + N_2\mathbf{m}_2\right)^T \\ N_1\mathbf{m}_1 + N_2\mathbf{m}_2 & S_W + N_1\mathbf{m}_1\mathbf{m}_1^T + N_2\mathbf{m}_2\mathbf{m}_2^T \end{bmatrix}\begin{bmatrix} w_0 \\ w \end{bmatrix} = \begin{bmatrix} 0 \\ N\left(\mathbf{m}_1 - \mathbf{m}_2\right) \end{bmatrix}$$
Then
$$\begin{bmatrix} N w_0 + \left(N_1\mathbf{m}_1 + N_2\mathbf{m}_2\right)^T w \\ \left(N_1\mathbf{m}_1 + N_2\mathbf{m}_2\right) w_0 + \left(S_W + N_1\mathbf{m}_1\mathbf{m}_1^T + N_2\mathbf{m}_2\mathbf{m}_2^T\right) w \end{bmatrix} = \begin{bmatrix} 0 \\ N\left(\mathbf{m}_1 - \mathbf{m}_2\right) \end{bmatrix}$$
94. Thus
We have that
$$w_0 = -w^T \mathbf{m}$$
$$\left[\frac{1}{N} S_W + \frac{N_1 N_2}{N^2}\left(\mathbf{m}_1 - \mathbf{m}_2\right)\left(\mathbf{m}_1 - \mathbf{m}_2\right)^T\right] w = \mathbf{m}_1 - \mathbf{m}_2$$
Thus
Since the vector $\left(\mathbf{m}_1 - \mathbf{m}_2\right)\left(\mathbf{m}_1 - \mathbf{m}_2\right)^T w$ is in the direction of $\mathbf{m}_1 - \mathbf{m}_2$,
we have that
$$\frac{N_1 N_2}{N^2}\left(\mathbf{m}_1 - \mathbf{m}_2\right)\left(\mathbf{m}_1 - \mathbf{m}_2\right)^T w = \left(1 - \alpha\right)\left(\mathbf{m}_1 - \mathbf{m}_2\right)$$