1. Machine Learning for Data Mining
Fisher Linear Discriminant
Andres Mendez-Vazquez
May 26, 2015
2. Outline
1 Introduction
History
The Rotation Idea
Classifiers as Machines for dimensionality reduction
2 Solution
Use the mean of each Class
Scatter measure
The Cost Function
A Transformation for simplification and defining the cost function
3 Where is this used?
Applications
4 Relation with Least Squared Error
What?
5. Rotation
Projecting
Projecting well-separated samples onto an arbitrary line usually produces a
confused mixture of samples from all of the classes and thus produces poor
recognition performance.
Something Notable
However, moving and rotating the line around might result in an
orientation for which the projected samples are well separated.
Fisher linear discriminant (FLD)
It is a discriminant analysis that seeks directions which are efficient for
discriminating between the two classes of a binary classification problem.
10. This is actually coming from...
Classifier as
A machine for dimensionality reduction.
Initial Setup
We have:
$N$ $d$-dimensional samples $x_1, x_2, \ldots, x_N$.
$N_i$ is the number of samples in class $C_i$ for $i = 1, 2$.
Then, we ask for the projection of each $x_i$ onto the line by means of
$$y_i = w^T x_i \quad (1)$$
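To make Equation (1) concrete, here is a minimal sketch (a toy NumPy example with made-up samples and an arbitrary direction $w$; it is used as a running illustration in the snippets below, not part of the original slides):

```python
import numpy as np

# Toy data: N = 6 samples in d = 2 dimensions (illustrative values only)
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],    # three samples from class C1
              [4.0, 4.5], [4.2, 4.0], [3.8, 4.3]])   # three samples from class C2

w = np.array([1.0, 1.0])   # an arbitrary projection direction (not yet optimal)

y = X @ w                  # Equation (1): y_i = w^T x_i for every sample
print(y)                   # one scalar per sample: the 1-D projection
```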
16. Use the mean of each Class
Then
Select w such that class separation is maximized
We then define the mean sample for each class:
1. $C_1 \Rightarrow \mathbf{m}_1 = \frac{1}{N_1}\sum_{i=1}^{N_1} x_i$
2. $C_2 \Rightarrow \mathbf{m}_2 = \frac{1}{N_2}\sum_{i=1}^{N_2} x_i$
Ok!!! This is giving us a measure of distance
Thus, we want to maximize the distance between the projected means:
$$m_1 - m_2 = w^T\left(\mathbf{m}_1 - \mathbf{m}_2\right) \quad (2)$$
where $m_k = w^T \mathbf{m}_k$ for $k = 1, 2$.
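Continuing the toy setup from the previous snippet (an assumption of these notes), the class means and the separation of the projected means in Equation (2) are:

```python
m1 = X[:3].mean(axis=0)    # vector mean of class C1
m2 = X[3:].mean(axis=0)    # vector mean of class C2

proj_sep = w @ (m1 - m2)   # Equation (2): m_1 - m_2 = w^T(m1 - m2)
print(m1, m2, proj_sep)
```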
21. However
We could simply seek
$$\max_{w} \; w^T\left(\mathbf{m}_1 - \mathbf{m}_2\right) \quad \text{s.t.} \quad \sum_{i=1}^{d} w_i = 1$$
After all
We do not care about the magnitude of $w$.
25. Fixing the Problem
To obtain good separation of the projected data
The difference between the means should be large relative to some
measure of the standard deviations for each class.
We define a SCATTER measure (based on the sample variance)
$$s_k^2 = \sum_{x_i \in C_k}\left(w^T x_i - m_k\right)^2 = \sum_{y_i = w^T x_i \in C_k}\left(y_i - m_k\right)^2 \quad (3)$$
We then define the within-class variance for the whole data as
$$s_1^2 + s_2^2 \quad (4)$$
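Again continuing the toy setup, Equations (3) and (4) can be evaluated directly as a sanity check:

```python
y1, y2 = X[:3] @ w, X[3:] @ w     # projections of the C1 and C2 samples
mu1, mu2 = y1.mean(), y2.mean()   # projected means m_1 and m_2

s1_sq = np.sum((y1 - mu1) ** 2)   # scatter of class C1, Equation (3)
s2_sq = np.sum((y2 - mu2) ** 2)   # scatter of class C2
within = s1_sq + s2_sq            # within-class variance, Equation (4)
```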
29. Finally, a Cost Function
The between-class variance
$$\left(m_1 - m_2\right)^2 \quad (5)$$
The Fisher criterion
$$\frac{\text{between-class variance}}{\text{within-class variance}} \quad (6)$$
Finally
$$J(w) = \frac{\left(m_1 - m_2\right)^2}{s_1^2 + s_2^2} \quad (7)$$
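Equation (7) is easy to wrap as a function; the following sketch (still on the toy data, an assumption of this illustration) evaluates the Fisher criterion for any candidate direction:

```python
def fisher_criterion(w, X1, X2):
    """Equation (7): J(w) = (m_1 - m_2)^2 / (s_1^2 + s_2^2)."""
    y1, y2 = X1 @ w, X2 @ w
    between = (y1.mean() - y2.mean()) ** 2
    within = np.sum((y1 - y1.mean()) ** 2) + np.sum((y2 - y2.mean()) ** 2)
    return between / within

print(fisher_criterion(w, X[:3], X[3:]))   # J for the arbitrary direction above
```

A better projection direction should yield a larger value of $J(w)$.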
33. We use a transformation to simplify our life
First
$$J(w) = \frac{\left(w^T \mathbf{m}_1 - w^T \mathbf{m}_2\right)^2}{\sum_{y_i = w^T x_i \in C_1}\left(y_i - m_1\right)^2 + \sum_{y_i = w^T x_i \in C_2}\left(y_i - m_2\right)^2} \quad (8)$$
Second
$$J(w) = \frac{\left(w^T \mathbf{m}_1 - w^T \mathbf{m}_2\right)\left(w^T \mathbf{m}_1 - w^T \mathbf{m}_2\right)^T}{\sum_{y_i \in C_1}\left(w^T x_i - m_1\right)\left(w^T x_i - m_1\right)^T + \sum_{y_i \in C_2}\left(w^T x_i - m_2\right)\left(w^T x_i - m_2\right)^T} \quad (9)$$
Third
$$J(w) = \frac{w^T\left(\mathbf{m}_1 - \mathbf{m}_2\right)\left(w^T\left(\mathbf{m}_1 - \mathbf{m}_2\right)\right)^T}{\sum_{x_i \in C_1} w^T\left(x_i - \mathbf{m}_1\right)\left(w^T\left(x_i - \mathbf{m}_1\right)\right)^T + \sum_{x_i \in C_2} w^T\left(x_i - \mathbf{m}_2\right)\left(w^T\left(x_i - \mathbf{m}_2\right)\right)^T} \quad (10)$$
36. Transformation
Fourth
$$J(w) = \frac{w^T\left(\mathbf{m}_1 - \mathbf{m}_2\right)\left(\mathbf{m}_1 - \mathbf{m}_2\right)^T w}{\sum_{x_i \in C_1} w^T\left(x_i - \mathbf{m}_1\right)\left(x_i - \mathbf{m}_1\right)^T w + \sum_{x_i \in C_2} w^T\left(x_i - \mathbf{m}_2\right)\left(x_i - \mathbf{m}_2\right)^T w} \quad (11)$$
Fifth
$$J(w) = \frac{w^T\left(\mathbf{m}_1 - \mathbf{m}_2\right)\left(\mathbf{m}_1 - \mathbf{m}_2\right)^T w}{w^T\left[\sum_{x_i \in C_1}\left(x_i - \mathbf{m}_1\right)\left(x_i - \mathbf{m}_1\right)^T + \sum_{x_i \in C_2}\left(x_i - \mathbf{m}_2\right)\left(x_i - \mathbf{m}_2\right)^T\right] w} \quad (12)$$
Now Rename
$$J(w) = \frac{w^T S_B w}{w^T S_W w} \quad (13)$$
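To see that Equation (13) is just a rewriting of Equation (7), one can build $S_B$ and $S_W$ explicitly (continuing the toy setup; the variable names are assumptions of this sketch):

```python
diff = (m1 - m2).reshape(-1, 1)   # column vector m1 - m2
S_B = diff @ diff.T               # between-class scatter matrix

S_W = sum(np.outer(x - m1, x - m1) for x in X[:3]) + \
      sum(np.outer(x - m2, x - m2) for x in X[3:])   # within-class scatter matrix

J = (w @ S_B @ w) / (w @ S_W @ w)  # Equation (13), same value as Equation (7)
print(J)
```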
39. Derive with respect to w
Thus
$$\frac{dJ(w)}{dw} = \frac{d\left[\left(w^T S_B w\right)\left(w^T S_W w\right)^{-1}\right]}{dw} = 0 \quad (14)$$
Then
$$\frac{dJ(w)}{dw} = \left(S_B w + S_B^T w\right)\left(w^T S_W w\right)^{-1} - \left(w^T S_B w\right)\left(w^T S_W w\right)^{-2}\left(S_W w + S_W^T w\right) = 0 \quad (15)$$
Now, because of the symmetry of $S_B$ and $S_W$,
$$\frac{dJ(w)}{dw} = \frac{S_B w}{w^T S_W w} - \frac{\left(w^T S_B w\right) S_W w}{\left(w^T S_W w\right)^2} = 0 \quad (16)$$
42. Derive with respect to w
Thus
$$\frac{dJ(w)}{dw} = \frac{S_B w}{w^T S_W w} - \frac{\left(w^T S_B w\right) S_W w}{\left(w^T S_W w\right)^2} = 0 \quad (17)$$
Then
$$\left(w^T S_W w\right) S_B w = \left(w^T S_B w\right) S_W w \quad (18)$$
44. Now, Several Tricks!!!
First
$$S_B w = \left(\mathbf{m}_1 - \mathbf{m}_2\right)\left(\mathbf{m}_1 - \mathbf{m}_2\right)^T w = \alpha\left(\mathbf{m}_1 - \mathbf{m}_2\right) \quad (19)$$
where $\alpha = \left(\mathbf{m}_1 - \mathbf{m}_2\right)^T w$ is a simple (scalar) constant.
It means that $S_B w$ is always in the direction of $\mathbf{m}_1 - \mathbf{m}_2$!!!
In addition
$w^T S_W w$ and $w^T S_B w$ are constants.
47. Now, Several Tricks!!!
Finally
$$S_W w \propto \left(\mathbf{m}_1 - \mathbf{m}_2\right) \Rightarrow w \propto S_W^{-1}\left(\mathbf{m}_1 - \mathbf{m}_2\right) \quad (20)$$
Once the data is transformed into $y_i$
Use a threshold $y_0 \Rightarrow x \in C_1$ iff $y(x) \geq y_0$, or $x \in C_2$ iff $y(x) < y_0$.
Or maximum likelihood with a Gaussian can be used to classify the transformed
data using Naive Bayes (by the Central Limit Theorem, $y = w^T x$ is a sum of
random variables, hence approximately Gaussian).
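Putting Equations (19) and (20) together with the threshold rule above, a minimal sketch of the whole procedure might look as follows (the midpoint threshold and the toy samples are assumptions of this illustration, not something fixed by the derivation):

```python
def fisher_fit(X1, X2):
    """Return the FLD direction w ∝ S_W^{-1}(m1 - m2) and a midpoint threshold y0."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter
    w = np.linalg.solve(S_W, m1 - m2)                        # Equation (20)
    y0 = 0.5 * (w @ m1 + w @ m2)                             # threshold halfway between projected means
    return w, y0

def fisher_predict(w, y0, Xnew):
    """Assign C1 when y(x) = w^T x >= y0, otherwise C2."""
    return np.where(Xnew @ w >= y0, 1, 2)

w_fld, y0 = fisher_fit(X[:3], X[3:])
print(fisher_predict(w_fld, y0, X))   # should recover the class labels of the toy samples
```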
51. Applications
Something Notable
Bankruptcy prediction
In bankruptcy prediction based on accounting ratios and other financial
variables, linear discriminant analysis was the first statistical method
applied to systematically explain which firms entered bankruptcy vs.
survived.
Face recognition
In computerized face recognition, each face is represented by a large
number of pixel values.
The linear combinations obtained using Fisher’s linear discriminant are
called Fisher faces.
Marketing
In marketing, discriminant analysis was once often used to determine
the factors which distinguish different types of customers and/or
products on the basis of surveys or other forms of collected data.
61. Please
Your reading material, about the multi-class case:
Section 4.1.6, “Fisher’s discriminant for multiple classes,” in “Pattern Recognition and Machine Learning”
by Bishop.
63. Relation with Least Squared Error
First
The least-squares approach to the determination of a linear discriminant
was based on the goal of making the model predictions as close as
possible to a set of target values.
Second
The Fisher criterion was derived by requiring maximum class separation in
the output space.
65. How do we do this?
First
We have $N$ samples.
We have $N_1$ samples for class $C_1$.
We have $N_2$ samples for class $C_2$.
So, we decide the following for the targets on each class:
The target for class $C_1$ is $\frac{N}{N_1}$.
The target for class $C_2$ is $-\frac{N}{N_2}$.
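As a small sketch of this target coding (reusing the toy classes from the earlier snippets; the variable names are assumptions of this illustration):

```python
X1, X2 = X[:3], X[3:]          # the two toy classes from the first snippet
N1, N2 = len(X1), len(X2)
N = N1 + N2

# Target t_n = N/N1 for samples of C1 and t_n = -N/N2 for samples of C2
t = np.concatenate([np.full(N1, N / N1), np.full(N2, -N / N2)])
print(t)
```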
69. Thus
The new cost function
$$E = \frac{1}{2}\sum_{n=1}^{N}\left(w^T x_n + w_0 - t_n\right)^2 \quad (21)$$
Deriving with respect to $w$
$$\sum_{n=1}^{N}\left(w^T x_n + w_0 - t_n\right) x_n = 0 \quad (22)$$
Deriving with respect to $w_0$
$$\sum_{n=1}^{N}\left(w^T x_n + w_0 - t_n\right) = 0 \quad (23)$$
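Setting the derivatives (22) and (23) to zero is just the normal-equation system for the augmented design matrix, so the minimizer of Equation (21) can be computed directly (a sketch continuing the previous snippets):

```python
# Augment each sample with a leading 1 so w0 is estimated together with w
Xa = np.column_stack([np.ones(N), np.vstack([X1, X2])])

# Equations (22)-(23) combined: Xa^T Xa [w0, w]^T = Xa^T t
sol = np.linalg.solve(Xa.T @ Xa, Xa.T @ t)
w0_ls, w_ls = sol[0], sol[1:]
print(w0_ls, w_ls)
```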
72. Then
We have that
$$\sum_{n=1}^{N}\left(w^T x_n + w_0 - t_n\right) = \sum_{n=1}^{N}\left(w^T x_n + w_0\right) - \sum_{n=1}^{N} t_n = \sum_{n=1}^{N}\left(w^T x_n + w_0\right) - \left[N_1\frac{N}{N_1} - N_2\frac{N}{N_2}\right] = \sum_{n=1}^{N}\left(w^T x_n + w_0\right)$$
Then
$$\sum_{n=1}^{N} w^T x_n + N w_0 = 0$$
74. Then
We have that
$$w_0 = -w^T \frac{1}{N}\sum_{n=1}^{N} x_n$$
We rename $\frac{1}{N}\sum_{n=1}^{N} x_n = \mathbf{m}$:
$$\mathbf{m} = \frac{1}{N}\sum_{n=1}^{N} x_n = \frac{1}{N}\left[N_1\mathbf{m}_1 + N_2\mathbf{m}_2\right]$$
Finally
$$w_0 = -w^T \mathbf{m}$$
77. Now
In a similar way
$$\sum_{n=1}^{N}\left(w^T x_n + w_0\right) x_n - \sum_{n=1}^{N} t_n x_n = 0$$
78. Thus, we have
Something Notable
$$\sum_{n=1}^{N}\left(w^T x_n + w_0\right) x_n - \frac{N}{N_1}\sum_{n=1}^{N_1} x_n + \frac{N}{N_2}\sum_{n=1}^{N_2} x_n = 0$$
Thus
$$\sum_{n=1}^{N}\left(w^T x_n + w_0\right) x_n - N\left(\mathbf{m}_1 - \mathbf{m}_2\right) = 0$$
80. Next
Then, substituting $w_0 = -w^T \mathbf{m}$,
$$\sum_{n=1}^{N}\left(w^T x_n - w^T \mathbf{m}\right) x_n = N\left(\mathbf{m}_1 - \mathbf{m}_2\right)$$
Now
What? Please do this as a take-home exercise for the next class.
83. Now, Do you have the solution?
You have a version in Duda and Hart, Section 5.8:
$$\hat{w} = \left(X^T X\right)^{-1} X^T y$$
Thus
$$X^T X \hat{w} = X^T y$$
Now, we rewrite the data matrix using the normalization discussed earlier:
$$X = \begin{bmatrix} \mathbf{1}_1 & X_1 \\ -\mathbf{1}_2 & -X_2 \end{bmatrix}$$
86. In addition
Our old augmented $w$
$$w = \begin{bmatrix} w_0 \\ w \end{bmatrix}$$
And our new $y$
$$y = \begin{bmatrix} \frac{N}{N_1}\mathbf{1}_1 \\ \frac{N}{N_2}\mathbf{1}_2 \end{bmatrix} \quad (24)$$
88. Thus, we have
Something Notable
$$\begin{bmatrix} \mathbf{1}_1^T & -\mathbf{1}_2^T \\ X_1^T & -X_2^T \end{bmatrix}\begin{bmatrix} \mathbf{1}_1 & X_1 \\ -\mathbf{1}_2 & -X_2 \end{bmatrix}\begin{bmatrix} w_0 \\ w \end{bmatrix} = \begin{bmatrix} \mathbf{1}_1^T & -\mathbf{1}_2^T \\ X_1^T & -X_2^T \end{bmatrix}\begin{bmatrix} \frac{N}{N_1}\mathbf{1}_1 \\ \frac{N}{N_2}\mathbf{1}_2 \end{bmatrix}$$
Thus, if we use the following definitions
$$\mathbf{m}_i = \frac{1}{N_i}\sum_{x \in C_i} x$$
$$S_W = \sum_{x_i \in C_1}\left(x_i - \mathbf{m}_1\right)\left(x_i - \mathbf{m}_1\right)^T + \sum_{x_i \in C_2}\left(x_i - \mathbf{m}_2\right)\left(x_i - \mathbf{m}_2\right)^T$$
92. Then
We have the following
$$\begin{bmatrix} N & \left(N_1\mathbf{m}_1 + N_2\mathbf{m}_2\right)^T \\ N_1\mathbf{m}_1 + N_2\mathbf{m}_2 & S_W + N_1\mathbf{m}_1\mathbf{m}_1^T + N_2\mathbf{m}_2\mathbf{m}_2^T \end{bmatrix}\begin{bmatrix} w_0 \\ w \end{bmatrix} = \begin{bmatrix} 0 \\ N\left(\mathbf{m}_1 - \mathbf{m}_2\right) \end{bmatrix}$$
Then
$$\begin{bmatrix} N w_0 + \left(N_1\mathbf{m}_1 + N_2\mathbf{m}_2\right)^T w \\ \left(N_1\mathbf{m}_1 + N_2\mathbf{m}_2\right) w_0 + \left(S_W + N_1\mathbf{m}_1\mathbf{m}_1^T + N_2\mathbf{m}_2\mathbf{m}_2^T\right) w \end{bmatrix} = \begin{bmatrix} 0 \\ N\left(\mathbf{m}_1 - \mathbf{m}_2\right) \end{bmatrix}$$
94. Thus
We have that
$$w_0 = -w^T \mathbf{m}$$
$$\left[\frac{1}{N} S_W + \frac{N_1 N_2}{N^2}\left(\mathbf{m}_1 - \mathbf{m}_2\right)\left(\mathbf{m}_1 - \mathbf{m}_2\right)^T\right] w = \mathbf{m}_1 - \mathbf{m}_2$$
Thus
Since the vector $\left(\mathbf{m}_1 - \mathbf{m}_2\right)\left(\mathbf{m}_1 - \mathbf{m}_2\right)^T w$ is in the direction of $\mathbf{m}_1 - \mathbf{m}_2$,
we have that
$$\frac{N_1 N_2}{N^2}\left(\mathbf{m}_1 - \mathbf{m}_2\right)\left(\mathbf{m}_1 - \mathbf{m}_2\right)^T w = \left(1 - \alpha\right)\left(\mathbf{m}_1 - \mathbf{m}_2\right)$$