Machine Learning for Data Mining
Linear Classifiers
Andres Mendez-Vazquez
May 23, 2016
Outline
1 Introduction
The Simplest Functions
Splitting the Space
The Decision Surface
2 Developing an Initial Solution
Gradient Descent Procedure
The Geometry of a Two-Category Linearly-Separable Case
Basic Method
Minimum Squared Error Procedure
The Error Idea
The Final Error Equation
The Data Matrix
Multi-Class Solution
Issues with Least Squares!!!
What about Numerical Stability?
What is it?
First of all, we have a parametric model!!!
Here, we have a hyperplane as a model:
g(x) = w^T x + w_0 (1)
In the case of R^2
We have the following function:
g(x) = w_1 x_1 + w_2 x_2 + w_0 (2)
Splitting The Space R^2
Using a simple straight line
[Figure: a straight line in R^2 separating the two classes]
Defining a Decision Surface
The equation g(x) = 0 defines a decision surface
Separating the elements into the classes ω1 and ω2.
When g(x) is linear, the decision surface is a hyperplane
Given that x_1 and x_2 are both on the decision surface:
w^T x_1 + w_0 = 0
w^T x_2 + w_0 = 0
Thus
w^T x_1 + w_0 = w^T x_2 + w_0 (3)
Defining a Decision Surface
Thus
w^T (x_1 − x_2) = 0 (4)
Remark: Any vector lying in the hyperplane is perpendicular to w, i.e. w is normal to the hyperplane.
Therefore
The space is split in two regions (example in R^3) by the hyperplane H
[Figure: the hyperplane H dividing R^3 into the two regions]
Some Properties of the Hyperplane
Given that g(x) > 0 if x ∈ R_1
[Figure: g(x) > 0 on the R_1 side of the hyperplane H]
It is more
We can say the following
Any x ∈ R_1 is on the positive side of H.
Any x ∈ R_2 is on the negative side of H.
In addition, g(x) gives us a way to obtain the distance from x to the hyperplane H
First, we express any x as follows
x = x_p + r (w / ‖w‖)
Where
x_p is the normal projection of x onto H.
r is the desired distance:
Positive, if x is on the positive side.
Negative, if x is on the negative side.
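A minimal Python sketch (not part of the original slides) of the signed distance r = g(x)/‖w‖; the weight vector w, bias w_0 and the test point are illustrative assumptions only.

import numpy as np

w = np.array([2.0, 1.0])   # assumed weight vector (illustrative)
w0 = -4.0                  # assumed bias/threshold (illustrative)

def g(x):
    """Linear discriminant g(x) = w^T x + w0."""
    return w @ x + w0

def signed_distance(x):
    """Signed distance from x to the hyperplane H: positive on the w side, negative on the other."""
    return g(x) / np.linalg.norm(w)

x = np.array([3.0, 2.0])
print(g(x), signed_distance(x))   # the sign tells on which side of H the point lies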
We have something like this
[Figure: the decomposition x = x_p + r (w / ‖w‖) with respect to the hyperplane H]
Now
Since g(x_p) = 0
We have that
g(x) = g(x_p + r (w / ‖w‖))
     = w^T (x_p + r (w / ‖w‖)) + w_0
     = w^T x_p + w_0 + r (w^T w / ‖w‖)
     = g(x_p) + r (‖w‖^2 / ‖w‖)
     = r ‖w‖
Then, we have
r = g(x) / ‖w‖ (5)
In particular
The distance from the origin to H
r = g(0) / ‖w‖ = (w^T 0 + w_0) / ‖w‖ = w_0 / ‖w‖ (6)
Remarks
If w_0 > 0, the origin is on the positive side of H.
If w_0 < 0, the origin is on the negative side of H.
If w_0 = 0, the discriminant has the homogeneous form w^T x and the hyperplane passes through the origin.
In addition...
If we do the following
g(x) = w_0 + Σ_{i=1}^{d} w_i x_i = Σ_{i=0}^{d} w_i x_i (7)
By making
x_0 = 1 and y = (1, x_1, ..., x_d)^T = (1, x^T)^T
Where
y is called an augmented feature vector.
In a similar way
We have the augmented weight vector
w_aug = (w_0, w_1, ..., w_d)^T = (w_0, w^T)^T
Remarks
The addition of a constant component to x preserves all the distance relationships between samples.
The resulting y vectors all lie in a d-dimensional subspace, which is the x-space itself.
More Remarks
In addition
The hyperplane decision surface Ĥ defined by w_aug^T y = 0 passes through the origin in y-space.
Even though the corresponding hyperplane H can be in any position of the x-space.
The distance from y to Ĥ is |w_aug^T y| / ‖w_aug‖, or |g(x)| / ‖w_aug‖.
Since ‖w_aug‖ ≥ ‖w‖
This distance is less than or equal to the distance from x to H.
This mapping is quite useful
Because we only need to find a single weight vector w_aug instead of finding both the weight vector w and the threshold w_0.
Initial Supposition
Suppose we have
n samples x_1, x_2, ..., x_n, some labeled ω1 and some labeled ω2.
We want a weight vector w such that
w^T x_i > 0, if x_i ∈ ω1.
w^T x_i < 0, if x_i ∈ ω2.
We suggest the following normalization
We replace all the samples x_i ∈ ω2 by their negative vectors!!!
The Usefulness of the Normalization
Once the normalization is done
We only need a weight vector w such that w^T x_i > 0 for all the samples.
The name of this weight vector
It is called a separating vector or solution vector.
Here, we have the solution region for w
Do not confuse this region with the decision region!!!
[Figure: the separating plane and the solution space for w]
Remark: w is not unique!!! We can have different w’s solving the problem.
Here, we have the solution region for w under normalization
Do not confuse this region with the decision region!!!
[Figure: the "separating" plane and the solution space after normalization]
Remark: w is not unique!!!
How do we get this w?
In order to be able to do this
We need to impose constraints on the problem.
Possible constraints!!!
To find a unit-length weight vector that maximizes the minimum distance from the samples to the separating plane.
To find the minimum-length weight vector satisfying w^T x_i ≥ b for all i, where b is a positive constant called the margin.
Here the solution space resulting from the intersection of the half-spaces w^T x_i ≥ b > 0 lies within the previous solution space!!!
We have then
A new boundary at a distance b
[Figure: the margin b shrinks the solution region for w]
Gradient Descent
For this, we will define a criterion function J(w)
A classic optimization
The basic procedure is as follows
1 Start with a random weight vector w(1).
2 Compute the gradient vector ∇J(w(1)).
3 Obtain the value w(2) by moving from w(1) in the direction of steepest descent:
1 i.e. along the negative of the gradient.
2 By using the following equation:
w(k + 1) = w(k) − η(k) ∇J(w(k)) (8)
What is η(k)?
Here
η(k) is a positive scale factor or learning rate!!!
The basic algorithm looks like this (see the sketch after this slide)
Algorithm 1 (Basic gradient descent)
1 begin initialize w, criterion θ, η(·), k = 0
2 do k = k + 1
3 w = w − η(k) ∇J(w)
4 until ‖η(k) ∇J(w)‖ < θ
5 return w
Problem!!! How to choose the learning rate?
If η(k) is too small, convergence is quite slow!!!
If η(k) is too large, the correction will overshoot and can even diverge!!!
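A minimal Python sketch of Algorithm 1, assuming the user supplies the gradient ∇J; the quadratic criterion used to exercise it is purely illustrative.

import numpy as np

def gradient_descent(grad_J, w, eta=0.1, theta=1e-6, max_iter=10_000):
    """Repeat w <- w - eta * grad_J(w) until the update is smaller than theta."""
    for _ in range(max_iter):
        step = eta * grad_J(w)
        w = w - step
        if np.linalg.norm(step) < theta:
            break
    return w

# Illustrative criterion J(w) = ||w - [1, 2]||^2, whose gradient is 2 (w - [1, 2]).
grad_J = lambda w: 2.0 * (w - np.array([1.0, 2.0]))
print(gradient_descent(grad_J, np.zeros(2)))   # converges near [1, 2]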
Using the Taylor second-order expansion around the value w(k)
We do the following
J(w) ≈ J(w(k)) + ∇J^T (w − w(k)) + (1/2) (w − w(k))^T H (w − w(k)) (9)
Remark: This is known as the second-order Taylor expansion!!!
Here, we have
∇J is the vector of partial derivatives ∂J/∂w_i evaluated at w(k).
H is the Hessian matrix of second partial derivatives ∂²J/(∂w_i ∂w_j) evaluated at w(k).
Then
We substitute (Eq. 8) into (Eq. 9)
w(k + 1) − w(k) = −η(k) ∇J(w(k)) (10)
We have then
J(w(k + 1)) ≅ J(w(k)) + ∇J^T (−η(k) ∇J(w(k))) + ...
             (1/2) (−η(k) ∇J(w(k)))^T H (−η(k) ∇J(w(k)))
Finally, we have
J(w(k + 1)) ≅ J(w(k)) − η(k) ‖∇J‖² + (1/2) η²(k) ∇J^T H ∇J (11)
Differentiate with respect to η(k) and set the result equal to zero
We have then
−‖∇J‖² + η(k) ∇J^T H ∇J = 0 (12)
Finally
η(k) = ‖∇J‖² / (∇J^T H ∇J) (13)
Remark: This is the optimal step size!!!
Problem!!!
Calculating H can be quite expensive!!!
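A quick numerical check (not from the slides) of the optimal step size in Eq. (13); the gradient and Hessian values below are illustrative assumptions.

import numpy as np

H = np.array([[4.0, 1.0], [1.0, 3.0]])     # assumed Hessian at w(k) (positive definite, illustrative)
grad = np.array([2.0, -1.0])               # assumed gradient at w(k) (illustrative)

eta = (grad @ grad) / (grad @ H @ grad)    # Eq. (13): ||grad J||^2 / (grad J^T H grad J)
print(eta)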
We can have an adaptive line search!!!
We can use the idea of keeping everything fixed except η(k)
Then, we can minimize the following function of η(k)
f(η(k)) = J(w(k) − η(k) ∇J(w(k)))
We can optimize it using line search methods
Line Search Methods
Backtracking line search
Bisection method
Golden ratio
Etc.
Example: Golden Ratio
Imagine that you have a unimodal function f : L → R on an interval L
Where: Choose a and b such that (a + b)/a = a/b (the Golden Ratio).
The process is as follows
Given f_1, f_2, f_3, where
f_1 = f(x_1)
f_2 = f(x_2)
f_3 = f(x_3)
We have then
If f_2 is smaller than f_1 and f_3, then the minimum lies in [x_1, x_3]
Now, we generate x_4 with f_4 = f(x_4)
In the largest subinterval!!! Here, [x_2, x_3]
Finally
Two cases
If f_4a > f_2, then the minimum lies between x_1 and x_4, and the new triplet is x_1, x_2 and x_4.
If f_4b < f_2, then the minimum lies between x_2 and x_3, and the new triplet is x_2, x_4 and x_3.
Then
Repeat the procedure!!! A sketch of the resulting search is given after this slide.
For more, please read the paper
“Sequential Minimax Search for a Maximum” by J. Kiefer
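A Python sketch of the resulting golden-section (golden-ratio) line search, assuming a unimodal objective on an interval; the interval and test function are illustrative only.

import math

def golden_section_min(f, a, b, tol=1e-6):
    """Shrink [a, b] while keeping the golden-ratio spacing until it is shorter than tol."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0           # 1/phi ~ 0.618
    x1, x2 = b - inv_phi * (b - a), a + inv_phi * (b - a)
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 < f2:                                  # minimum lies in [a, x2]
            b, x2, f2 = x2, x1, f1
            x1 = b - inv_phi * (b - a)
            f1 = f(x1)
        else:                                        # minimum lies in [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + inv_phi * (b - a)
            f2 = f(x2)
    return (a + b) / 2.0

print(golden_section_min(lambda eta: (eta - 0.3) ** 2, 0.0, 1.0))   # ~0.3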
We have another method...
Differentiate the second-order Taylor expansion with respect to w
J(w) ≈ J(w(k)) + ∇J^T (w − w(k)) + (1/2) (w − w(k))^T H (w − w(k))
We get
∇J + Hw − Hw(k) = 0 (14)
Thus
Hw = Hw(k) − ∇J
H⁻¹ Hw = H⁻¹ Hw(k) − H⁻¹ ∇J
w = w(k) − H⁻¹ ∇J
The Newton-Raphson Algorithm
We have the following algorithm (a runnable sketch follows this slide)
Algorithm 2 (Newton descent)
1 begin initialize w, criterion θ
2 do k = k + 1
3 w = w − H⁻¹ ∇J(w)
4 until ‖H⁻¹ ∇J(w)‖ < θ
5 return w
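A minimal Python sketch of Algorithm 2, assuming callables for ∇J and H; the quadratic criterion below is only an illustration, and H is "inverted" implicitly via a linear solve.

import numpy as np

def newton_descent(grad_J, hess_J, w, theta=1e-8, max_iter=100):
    """Repeat w <- w - H^{-1} grad_J(w) until the Newton step is smaller than theta."""
    for _ in range(max_iter):
        step = np.linalg.solve(hess_J(w), grad_J(w))   # solve H step = grad instead of forming H^{-1}
        w = w - step
        if np.linalg.norm(step) < theta:
            break
    return w

# Illustrative quadratic: J(w) = 0.5 w^T A w - b^T w, so grad J = A w - b and H = A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
w_star = newton_descent(lambda w: A @ w - b, lambda w: A, np.zeros(2))
print(w_star, np.linalg.solve(A, b))   # both give the minimizer A^{-1} b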
Initial Setup
Important
We now move away from our initial normalization of the samples!!!
Now, we are going to use the method known as
Minimum Squared Error
Now, assume the following
Imagine that your problem has two classes, ω1 and ω2, in R^2
1 They are linearly separable!!!
2 You are required to label them.
We have a problem!!!
What is the problem?
We do not know the hyperplane!!!
Thus, what distance does each point have to the hyperplane?
A Simple Solution For Our Quandary
Label the Classes
ω1 =⇒ +1
ω2 =⇒ −1
We produce the following labels
1 if x ∈ ω1 then y_ideal = g_ideal(x) = +1.
2 if x ∈ ω2 then y_ideal = g_ideal(x) = −1.
Remark: We have a problem with these labels!!!
Now, What?
Assume the true function is given by
y_noise = g_noise(x) = w^T x + w_0 + ε (15)
Where the noise ε
It has a distribution ε ∼ N(µ, σ²)
Thus, we can do the following
y_noise = g_noise(x) = g_ideal(x) + ε (16)
Thus, we have
What to do?
ε = y_noise − g_ideal(x) (17)
Graphically
[Figure: the error ε as the vertical offset between y_noise and g_ideal(x)]
Sum Over All Errors
We can do the following
J(w) = Σ_{i=1}^{N} ε_i² = Σ_{i=1}^{N} (y_i − g_ideal(x_i))² (18)
Remark: Known as least squares (fitting the vertical offset!!!)
Generalize
If
The dimensionality of each sample (data point) is d,
You can extend each sample vector to be x^T = (1, x'^T),
We have:
Σ_{i=1}^{N} (y_i − x_i^T w)² = (y − Xw)^T (y − Xw) = ‖y − Xw‖²_2 (19)
What is X
It is the Data Matrix
X = [ 1  (x_1)_1  · · ·  (x_1)_j  · · ·  (x_1)_d
      ...
      1  (x_i)_1         (x_i)_j         (x_i)_d
      ...
      1  (x_N)_1  · · ·  (x_N)_j  · · ·  (x_N)_d ] (20)
We know the following
d(x^T A x)/dx = Ax + A^T x,   d(Ax)/dx = A (21)
Note about other representations
We could have x^T = (x_1, x_2, ..., x_d, 1), thus
X = [ (x_1)_1  · · ·  (x_1)_j  · · ·  (x_1)_d  1
      ...
      (x_i)_1         (x_i)_j         (x_i)_d  1
      ...
      (x_N)_1  · · ·  (x_N)_j  · · ·  (x_N)_d  1 ] (22)
We can expand our quadratic formula!!!
Thus
(y − Xw)^T (y − Xw) = y^T y − w^T X^T y − y^T Xw + w^T X^T Xw (23)
Making it possible, by differentiating with respect to w and assuming that X^T X is invertible, to obtain
ŵ = (X^T X)⁻¹ X^T y (24)
Note: X^T X is always positive semi-definite. If it is also invertible, it is positive definite.
Thus, how do we get the discriminant function? (A worked sketch follows this slide.)
Any Ideas?
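A Python sketch of Eq. (24) on synthetic two-class data with ±1 labels; everything below is illustrative, and np.linalg.lstsq is used instead of forming (X^T X)⁻¹ explicitly, which is the numerically safer route.

import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))    # class omega_1 (illustrative)
X2 = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))  # class omega_2 (illustrative)
X = np.hstack([np.ones((100, 1)), np.vstack([X1, X2])])     # augmented data matrix with rows (1, x^T)
y = np.hstack([np.ones(50), -np.ones(50)])                  # ideal labels +1 / -1

# lstsq solves min ||y - X w||^2, i.e. the same w_hat as (X^T X)^{-1} X^T y when X has full column rank.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
g = X @ w_hat                                               # g(x_i) = x_i^T w_hat for every sample
print(w_hat, np.mean(np.sign(g) == y))                      # weights and training accuracy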
The Final Discriminant Function
Very Simple!!!
g(x) = x^T ŵ = x^T (X^T X)⁻¹ X^T y (25)
Pseudo-inverse of a Matrix
Definition
Suppose that A ∈ R^{m×n} has full column rank, i.e. rank(A) = n. We call the matrix
A⁺ = (A^T A)⁻¹ A^T
the pseudo-inverse of A.
Reason
A⁺ inverts A on its image
What?
If w ∈ image(A), then there is some v ∈ R^n such that w = Av. Hence:
A⁺ w = A⁺ Av = (A^T A)⁻¹ A^T Av = v
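A small numerical check (not from the slides) that A⁺ = (A^T A)⁻¹ A^T inverts A on its image; the random matrix and vector are illustrative.

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 3))                      # tall matrix; generically of full column rank
A_plus = np.linalg.inv(A.T @ A) @ A.T            # pseudo-inverse as defined on the slide
v = rng.normal(size=3)

print(np.allclose(A_plus @ (A @ v), v))          # True: A+ (A v) = v
print(np.allclose(A_plus, np.linalg.pinv(A)))    # matches NumPy's pinv in the full-column-rank case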
What lives where?
We have
X ∈ R^{N×(d+1)}
Image(X) = span(X^col_1, ..., X^col_{d+1})
x_i ∈ R^d
w ∈ R^{d+1}
X^col_i, y ∈ R^N
Basically, y, the list of desired outputs, is being projected onto
span(X^col_1, ..., X^col_{d+1}) (26)
by the projection operator X (X^T X)⁻¹ X^T.
Geometric Interpretation
We have
1 The image of the mapping w ↦ Xw is a linear subspace of R^N.
2 As w runs through all points of R^{d+1}, the value Xw runs through all points in the image space image(X) = span(X^col_1, ..., X^col_{d+1}).
3 Each w defines one point Xw = Σ_{j=0}^{d} w_j X^col_j.
4 ŵ is the weight vector for which Xŵ minimizes the distance d(y, image(X)).
Geometrically
Ahhhh!!!
[Figure: y projected onto image(X); Xŵ is the closest point in the column space]
Multi-Class Solution
What to do?
1 We might reduce the problem to c − 1 two-class problems.
2 We might use c(c − 1)/2 linear discriminants, one for every pair of classes.
However
[Figure: both constructions can leave regions of ambiguous classification]
What to do?
Define c linear discriminant functions
g_i(x) = w_i^T x + w_{i0} for i = 1, ..., c (27)
This is known as a linear machine
Rule: if g_k(x) > g_j(x) for all j ≠ k =⇒ x ∈ ω_k
Nice Properties (It can be proved!!!)
1 Decision regions are singly connected.
2 Decision regions are convex.
Proof of Properties
Proof
Actually quite simple
Given x_A and x_B in the decision region of ω_k, consider
y = λx_A + (1 − λ) x_B
with λ ∈ (0, 1).
Proof of Properties
We know that
g_k(y) = w_k^T (λx_A + (1 − λ) x_B) + w_{k0}
       = λw_k^T x_A + λw_{k0} + (1 − λ) w_k^T x_B + (1 − λ) w_{k0}
       = λg_k(x_A) + (1 − λ) g_k(x_B)
       > λg_j(x_A) + (1 − λ) g_j(x_B)
       = g_j(λx_A + (1 − λ) x_B)
       = g_j(y)
For all j ≠ k
Or...
y belongs to the region k defined by the rule!!!
This region is convex and singly connected because of the definition of y.
However!!!
Not-so-nice properties!!!
They limit the power of the classifier: convex, singly connected decision regions cannot fit more complex (e.g. multimodal) classes.
How do we train this Linear Machine?
We know that each class ω_k is described by
g_k(x) = w_k^T x + w_{k0}, where k = 1, ..., c
We then design a single machine
g(x) = W^T x (28)
Where
We have the following, with row k holding the bias and weights of class ω_k
W^T = [ w_{10}  w_{11}  w_{12}  · · ·  w_{1d}
        w_{20}  w_{21}  w_{22}  · · ·  w_{2d}
        w_{30}  w_{31}  w_{32}  · · ·  w_{3d}
        ...
        w_{c0}  w_{c1}  w_{c2}  · · ·  w_{cd} ] (29)
What about the labels?
OK, we know how to do it with 2 classes. What about many classes?
How do we train this Linear Machine?
Use a target vector t_i of dimensionality c to identify the class of each element
We have then the following dataset
{x_i, t_i} for i = 1, 2, ..., N
We build the following matrix of target vectors
T = [ t_1^T
      t_2^T
      ...
      t_N^T ] (30)
Thus, we create the following Matrix
A matrix containing all the required information
XW − T (31)
Where we have the following vector
(x_i^T w_1, x_i^T w_2, x_i^T w_3, ..., x_i^T w_c) (32)
Remark: It is the vector resulting from multiplying row i of X against W in XW.
That is compared with the vector t_i^T in T by using the subtraction of vectors
ε_i = (x_i^T w_1, x_i^T w_2, x_i^T w_3, ..., x_i^T w_c) − t_i^T (33)
What do we want?
We want the quadratic error
(1/2) ‖ε_i‖²
These quadratic errors are the diagonal entries of the matrix
(XW − T)(XW − T)^T
whose trace equals that of (XW − T)^T (XW − T).
We can use the trace function to generate the desired total error
J(·) = (1/2) Σ_{i=1}^{N} ‖ε_i‖² (34)
Then
The trace allows us to express the total error
J(W) = (1/2) Trace[(XW − T)^T (XW − T)] (35)
Thus, we have, by the same derivative method,
W = (X^T X)⁻¹ X^T T = X⁺ T (36)
How do we train this Linear Machine?
Thus, we obtain the discriminant (a sketch follows this slide)
g(x) = W^T x = T^T (X⁺)^T x (37)
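A Python sketch of the full multi-class recipe, W = X⁺T followed by the argmax rule of the linear machine; the three Gaussian classes below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
means = np.array([[0.0, 4.0], [4.0, -2.0], [-4.0, -2.0]])            # illustrative class centers
X_raw = np.vstack([rng.normal(m, 0.8, size=(40, 2)) for m in means])
labels = np.repeat(np.arange(3), 40)

X = np.hstack([np.ones((120, 1)), X_raw])        # augmented data matrix with rows (1, x_i^T)
T = np.eye(3)[labels]                            # one-hot target rows t_i^T

W = np.linalg.pinv(X) @ T                        # W = X^+ T, Eq. (36)
g = X @ W                                        # row i holds (g_1(x_i), ..., g_c(x_i))
pred = np.argmax(g, axis=1)                      # linear-machine rule: pick the largest g_k
print(np.mean(pred == labels))                   # training accuracy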
Issues with Least Squares
Robustness
1 Least squares works only if X has full column rank, i.e. if X^T X is invertible.
2 If X^T X is nearly singular, least squares is numerically unstable.
1 Statistical consequence: high variance of predictions.
Not suited for high-dimensional data
1 Modern problems: many dimensions/features/predictors (possibly thousands).
2 Only a few of these may be important:
1 It needs some form of feature selection.
2 Possibly some type of regularization.
Why?
1 It treats all dimensions equally.
2 Relevant dimensions are averaged with irrelevant ones.
(A small numerical illustration follows this slide.)
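A small numerical illustration (not from the slides) of the instability when X^T X is nearly singular, together with ridge regularization as one possible form of the regularization hinted at above; the data and the strength λ are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
X = np.column_stack([np.ones(200), x1, x1 + 1e-6 * rng.normal(size=200)])  # near-duplicate column
y = x1 + rng.normal(scale=0.1, size=200)

print(np.linalg.cond(X.T @ X))                   # huge condition number -> plain least squares is unstable

lam = 1e-2                                       # assumed regularization strength (illustrative)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(w_ridge)                                   # coefficients stay at a reasonable scale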
Issues with Least Squares
Problem with Outliers
[Figure: least-squares fits on data without outliers vs. with outliers]
72 / 85
Issues with Least Squares
What about the Linear Machine?
Please, run the algorithm and tell me...
73 / 85
What to Do About Numerical Stability?
Regularity
A matrix which is not invertible is called a singular matrix. A matrix which is invertible (not singular) is called regular.
In computations
Intuitions:
1 A singular matrix maps an entire linear subspace into a single point.
2 If a matrix maps points that are far away from each other to points very close to each other, it almost behaves like a singular matrix.
The mapping is related to the eigenvalues!!!
Large positive eigenvalues ⇒ the mapping stretches that direction strongly!!!
Small positive eigenvalues ⇒ the mapping almost collapses that direction!!!
74 / 85
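A short sketch (illustrative assumptions only: toy matrices and a helper name invented here) that quantifies this intuition by inspecting the eigenvalues of X^T X:

```python
import numpy as np

def eigenvalue_report(X):
    """Smallest and largest eigenvalue of X^T X; a tiny smallest eigenvalue
    means some direction is almost collapsed (near-singularity)."""
    eig = np.linalg.eigvalsh(X.T @ X)        # X^T X is symmetric, so eigvalsh applies
    return eig.min(), eig.max()

rng = np.random.default_rng(3)
A = rng.normal(size=(100, 5))
# Append a near-duplicate of the first column: the smallest eigenvalue drops toward 0.
B = np.column_stack([A, A[:, :1] + 1e-8 * rng.normal(size=(100, 1))])
print(eigenvalue_report(A))
print(eigenvalue_report(B))
```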
Outline
1 Introduction
The Simplest Functions
Splitting the Space
The Decision Surface
2 Developing an Initial Solution
Gradient Descent Procedure
The Geometry of a Two-Category Linearly-Separable Case
Basic Method
Minimum Squared Error Procedure
The Error Idea
The Final Error Equation
The Data Matrix
Multi-Class Solution
Issues with Least Squares!!!
What about Numerical Stability?
75 / 85
What to Do About Numerical Stability?
All this comes from the following statement
A positive semi-definite matrix A is singular ⇐⇒ its smallest eigenvalue is 0
Consequence for Statistics
If a statistical prediction involves the inverse of an almost-singular matrix, the predictions become unreliable (high variance).
76 / 85
What can be done?
Ridge Regression
Ridge regression is a modification of least squares. It tries to make least squares more robust when X^T X is almost singular.
The solution
w_Ridge = (X^T X + λI)^{-1} X^T y (38)
where λ is a tuning parameter.
Thus, we can do the following, given that X^T X is symmetric positive semi-definite
Assume that ξ1, ξ2, ..., ξd+1 are eigenvectors of X^T X with eigenvalues λ1, λ2, ..., λd+1:
(X^T X + λI) ξi = (λi + λ) ξi (39)
i.e. λi + λ is an eigenvalue of X^T X + λI.
77 / 85
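A brief numerical sketch (illustrative toy data and a hand-picked λ, both assumptions) of the closed form (38) and the eigenvalue shift (39):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)
lam = 0.5                                    # the tuning parameter lambda (chosen arbitrarily here)
d1 = X.shape[1]

# Eq. (38): ridge closed form (solve is preferred over forming the inverse explicitly).
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d1), X.T @ y)
print("w_ridge =", np.round(w_ridge, 3))

# Eq. (39): an eigenvector xi of X^T X is also an eigenvector of X^T X + lam*I,
# with its eigenvalue shifted from lam_i to lam_i + lam.
lam_i, V = np.linalg.eigh(X.T @ X)
xi = V[:, 0]
assert np.allclose((X.T @ X + lam * np.eye(d1)) @ xi, (lam_i[0] + lam) * xi)
```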
What does this mean?
Something Notable
You can control the singularity by detecting the smallest eigenvalue.
Thus
We add an appropriate tuning value λ.
78 / 85
Thus, what do we need to do?
Process
1 Find the eigenvalues of X^T X.
2 If all of them are bigger than zero, we are fine!!!
3 Otherwise, find the smallest one and tune λ if necessary.
4 Build w_Ridge = (X^T X + λI)^{-1} X^T y.
79 / 85
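The same process written as a small function (a sketch only; the threshold target_min_eig and the rule for picking λ are assumptions, not prescribed by the slides):

```python
import numpy as np

def ridge_by_process(X, y, target_min_eig=1e-3):
    """Steps 1-4: inspect the eigenvalues of X^T X, pick lambda only if the
    smallest eigenvalue is too small, then build w_Ridge."""
    G = X.T @ X
    eigvals = np.linalg.eigvalsh(G)                              # step 1
    smallest = eigvals.min()                                     # steps 2-3
    lam = 0.0 if smallest > target_min_eig else target_min_eig - smallest
    w = np.linalg.solve(G + lam * np.eye(G.shape[0]), X.T @ y)   # step 4
    return w, lam
```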
What about Thousands of Features?
There is a technique for that
Least Absolute Shrinkage and Selection Operator (LASSO), invented by Robert Tibshirani, which uses the penalty L1 = Σ_{i=1}^d |w_i|.
The Least Squared Error then takes the form
Σ_{i=1}^N (y_i − x_i^T w)^2 + λ Σ_{i=1}^d |w_i| (40)
However
You have other regularizations, such as L2 = Σ_{i=1}^d |w_i|^2.
80 / 85
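A hedged usage sketch: assuming scikit-learn is available, its Lasso and Ridge estimators minimize (up to scaling constants) the L1- and L2-penalized least-squares objectives above; the toy data below is invented for illustration and shows the sparsity effect of the L1 penalty:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge   # assumes scikit-learn is installed

rng = np.random.default_rng(5)
N, d = 100, 50
X = rng.normal(size=(N, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -3.0, 1.5]                   # only 3 of the 50 features actually matter
y = X @ w_true + 0.1 * rng.normal(size=N)

lasso = Lasso(alpha=0.1).fit(X, y)              # L1 penalty -> many coefficients exactly zero
ridge = Ridge(alpha=0.1).fit(X, y)              # L2 penalty -> small but dense coefficients
print("non-zero LASSO coefficients:", int(np.sum(lasso.coef_ != 0)))
print("non-zero ridge coefficients:", int(np.sum(ridge.coef_ != 0)))
```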
Graphically
The first area corresponds to the L1 regularization; and the second one?
81 / 85
Graphically
Yes, the circle defined by L2 = Σ_{i=1}^d |w_i|^2
82 / 85
The seminal paper by Robert Tibshirani
An initial study of this regularization can be seen in
"Regression Shrinkage and Selection via the LASSO" by Robert Tibshirani, 1996
83 / 85
This is out of the scope of this class
However, it is worth noticing that the most efficient method for solving LASSO problems is
"Pathwise Coordinate Optimization" by Jerome Friedman, Trevor Hastie, Holger Höfling and Robert Tibshirani
Nevertheless
It will be a great seminar paper!!!
84 / 85
Exercises
Duda and Hart
Chapter 5: 1, 3, 4, 7, 13, 17
Bishop
Chapter 4: 4.1, 4.4, 4.7
Theodoridis
Chapter 3 - Problems: 3.6 (using Python)
Chapter 3 - Computer Experiments: 3.1 (using Python), 3.2 (using Python and Newton's method)
85 / 85
More Related Content

What's hot

Project in Calcu
Project in CalcuProject in Calcu
Project in Calcupatrickpaz
 
Lesson 15: Exponential Growth and Decay (slides)
Lesson 15: Exponential Growth and Decay (slides)Lesson 15: Exponential Growth and Decay (slides)
Lesson 15: Exponential Growth and Decay (slides)Matthew Leingang
 
Roots equations
Roots equationsRoots equations
Roots equationsoscar
 
Rasterisation of a circle by the bresenham algorithm
Rasterisation of a circle by the bresenham algorithmRasterisation of a circle by the bresenham algorithm
Rasterisation of a circle by the bresenham algorithmKALAIRANJANI21
 
Appendix to MLPI Lecture 2 - Monte Carlo Methods (Basics)
Appendix to MLPI Lecture 2 - Monte Carlo Methods (Basics)Appendix to MLPI Lecture 2 - Monte Carlo Methods (Basics)
Appendix to MLPI Lecture 2 - Monte Carlo Methods (Basics)Dahua Lin
 
Presentation on stochastic control problem with financial applications (Merto...
Presentation on stochastic control problem with financial applications (Merto...Presentation on stochastic control problem with financial applications (Merto...
Presentation on stochastic control problem with financial applications (Merto...Asma Ben Slimene
 
23 improper integrals send-x
23 improper integrals send-x23 improper integrals send-x
23 improper integrals send-xmath266
 
Mark Girolami's Read Paper 2010
Mark Girolami's Read Paper 2010Mark Girolami's Read Paper 2010
Mark Girolami's Read Paper 2010Christian Robert
 
Murphy: Machine learning A probabilistic perspective: Ch.9
Murphy: Machine learning A probabilistic perspective: Ch.9Murphy: Machine learning A probabilistic perspective: Ch.9
Murphy: Machine learning A probabilistic perspective: Ch.9Daisuke Yoneoka
 
Taylor Polynomials and Series
Taylor Polynomials and SeriesTaylor Polynomials and Series
Taylor Polynomials and SeriesMatthew Leingang
 

What's hot (17)

Project in Calcu
Project in CalcuProject in Calcu
Project in Calcu
 
Lesson 15: Exponential Growth and Decay (slides)
Lesson 15: Exponential Growth and Decay (slides)Lesson 15: Exponential Growth and Decay (slides)
Lesson 15: Exponential Growth and Decay (slides)
 
Analysis Solutions CVI
Analysis Solutions CVIAnalysis Solutions CVI
Analysis Solutions CVI
 
Ch04
Ch04Ch04
Ch04
 
Roots equations
Roots equationsRoots equations
Roots equations
 
Rasterisation of a circle by the bresenham algorithm
Rasterisation of a circle by the bresenham algorithmRasterisation of a circle by the bresenham algorithm
Rasterisation of a circle by the bresenham algorithm
 
Stochastic Assignment Help
Stochastic Assignment Help Stochastic Assignment Help
Stochastic Assignment Help
 
Appendix to MLPI Lecture 2 - Monte Carlo Methods (Basics)
Appendix to MLPI Lecture 2 - Monte Carlo Methods (Basics)Appendix to MLPI Lecture 2 - Monte Carlo Methods (Basics)
Appendix to MLPI Lecture 2 - Monte Carlo Methods (Basics)
 
Nested sampling
Nested samplingNested sampling
Nested sampling
 
Presentation on stochastic control problem with financial applications (Merto...
Presentation on stochastic control problem with financial applications (Merto...Presentation on stochastic control problem with financial applications (Merto...
Presentation on stochastic control problem with financial applications (Merto...
 
23 improper integrals send-x
23 improper integrals send-x23 improper integrals send-x
23 improper integrals send-x
 
Mark Girolami's Read Paper 2010
Mark Girolami's Read Paper 2010Mark Girolami's Read Paper 2010
Mark Girolami's Read Paper 2010
 
Fougeres Besancon Archimax
Fougeres Besancon ArchimaxFougeres Besancon Archimax
Fougeres Besancon Archimax
 
Murphy: Machine learning A probabilistic perspective: Ch.9
Murphy: Machine learning A probabilistic perspective: Ch.9Murphy: Machine learning A probabilistic perspective: Ch.9
Murphy: Machine learning A probabilistic perspective: Ch.9
 
Taylor series
Taylor seriesTaylor series
Taylor series
 
Taylor Polynomials and Series
Taylor Polynomials and SeriesTaylor Polynomials and Series
Taylor Polynomials and Series
 
Analysis Solutions CIV
Analysis Solutions CIVAnalysis Solutions CIV
Analysis Solutions CIV
 

Viewers also liked

Introduction to Machine Learning Aristotelis Tsirigos
Introduction to Machine Learning Aristotelis Tsirigos Introduction to Machine Learning Aristotelis Tsirigos
Introduction to Machine Learning Aristotelis Tsirigos butest
 
Mncs 16-10-1주-변승규-introduction to the machine learning #2
Mncs 16-10-1주-변승규-introduction to the machine learning #2Mncs 16-10-1주-변승규-introduction to the machine learning #2
Mncs 16-10-1주-변승규-introduction to the machine learning #2Seung-gyu Byeon
 
Introduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersIntroduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersFunctional Imperative
 
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015Turi, Inc.
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningLior Rokach
 

Viewers also liked (7)

Introduction to Machine Learning Aristotelis Tsirigos
Introduction to Machine Learning Aristotelis Tsirigos Introduction to Machine Learning Aristotelis Tsirigos
Introduction to Machine Learning Aristotelis Tsirigos
 
Mncs 16-10-1주-변승규-introduction to the machine learning #2
Mncs 16-10-1주-변승규-introduction to the machine learning #2Mncs 16-10-1주-변승규-introduction to the machine learning #2
Mncs 16-10-1주-변승규-introduction to the machine learning #2
 
Introduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersIntroduction to Machine Learning Classifiers
Introduction to Machine Learning Classifiers
 
CSC446: Pattern Recognition (LN8)
CSC446: Pattern Recognition (LN8)CSC446: Pattern Recognition (LN8)
CSC446: Pattern Recognition (LN8)
 
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
Introduction to Deep Learning for Image Analysis at Strata NYC, Sep 2015
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 

Similar to 04 Machine Learning - Supervised Linear Classifier

Applicationofpartialderivativeswithtwovariables 140225070102-phpapp01 (1)
Applicationofpartialderivativeswithtwovariables 140225070102-phpapp01 (1)Applicationofpartialderivativeswithtwovariables 140225070102-phpapp01 (1)
Applicationofpartialderivativeswithtwovariables 140225070102-phpapp01 (1)shreemadghodasra
 
Application of partial derivatives with two variables
Application of partial derivatives with two variablesApplication of partial derivatives with two variables
Application of partial derivatives with two variablesSagar Patel
 
Parts of quadratic function and transforming to general form to vertex form a...
Parts of quadratic function and transforming to general form to vertex form a...Parts of quadratic function and transforming to general form to vertex form a...
Parts of quadratic function and transforming to general form to vertex form a...rowenaCARINO
 
Conctructing Polytopes via a Vertex Oracle
Conctructing Polytopes via a Vertex OracleConctructing Polytopes via a Vertex Oracle
Conctructing Polytopes via a Vertex OracleVissarion Fisikopoulos
 
Applications of partial differentiation
Applications of partial differentiationApplications of partial differentiation
Applications of partial differentiationVaibhav Tandel
 
differentiate free
differentiate freedifferentiate free
differentiate freelydmilaroy
 
An algorithm for computing resultant polytopes
An algorithm for computing resultant polytopesAn algorithm for computing resultant polytopes
An algorithm for computing resultant polytopesVissarion Fisikopoulos
 
2.4 defintion of derivative
2.4 defintion of derivative2.4 defintion of derivative
2.4 defintion of derivativemath265
 
Single Variable Calculus Assignment Help
Single Variable Calculus Assignment HelpSingle Variable Calculus Assignment Help
Single Variable Calculus Assignment HelpMath Homework Solver
 
IVR - Chapter 1 - Introduction
IVR - Chapter 1 - IntroductionIVR - Chapter 1 - Introduction
IVR - Chapter 1 - IntroductionCharles Deledalle
 
Calculus Final Review Joshua Conyers
Calculus Final Review Joshua ConyersCalculus Final Review Joshua Conyers
Calculus Final Review Joshua Conyersjcon44
 
Class 10 Maths Ch Polynomial PPT
Class 10 Maths Ch Polynomial PPTClass 10 Maths Ch Polynomial PPT
Class 10 Maths Ch Polynomial PPTSanjayraj Balasara
 
Theoryofcomp science
Theoryofcomp scienceTheoryofcomp science
Theoryofcomp scienceRaghu nath
 
Gremlin's Graph Traversal Machinery
Gremlin's Graph Traversal MachineryGremlin's Graph Traversal Machinery
Gremlin's Graph Traversal MachineryMarko Rodriguez
 
DataStax | Graph Computing with Apache TinkerPop (Marko Rodriguez) | Cassandr...
DataStax | Graph Computing with Apache TinkerPop (Marko Rodriguez) | Cassandr...DataStax | Graph Computing with Apache TinkerPop (Marko Rodriguez) | Cassandr...
DataStax | Graph Computing with Apache TinkerPop (Marko Rodriguez) | Cassandr...DataStax
 
Physical Chemistry Assignment Help
Physical Chemistry Assignment HelpPhysical Chemistry Assignment Help
Physical Chemistry Assignment HelpEdu Assignment Help
 
"An output-sensitive algorithm for computing projections of resultant polytop...
"An output-sensitive algorithm for computing projections of resultant polytop..."An output-sensitive algorithm for computing projections of resultant polytop...
"An output-sensitive algorithm for computing projections of resultant polytop...Vissarion Fisikopoulos
 
Truth, deduction, computation lecture e
Truth, deduction, computation   lecture eTruth, deduction, computation   lecture e
Truth, deduction, computation lecture eVlad Patryshev
 

Similar to 04 Machine Learning - Supervised Linear Classifier (20)

Applicationofpartialderivativeswithtwovariables 140225070102-phpapp01 (1)
Applicationofpartialderivativeswithtwovariables 140225070102-phpapp01 (1)Applicationofpartialderivativeswithtwovariables 140225070102-phpapp01 (1)
Applicationofpartialderivativeswithtwovariables 140225070102-phpapp01 (1)
 
Application of partial derivatives with two variables
Application of partial derivatives with two variablesApplication of partial derivatives with two variables
Application of partial derivatives with two variables
 
Parts of quadratic function and transforming to general form to vertex form a...
Parts of quadratic function and transforming to general form to vertex form a...Parts of quadratic function and transforming to general form to vertex form a...
Parts of quadratic function and transforming to general form to vertex form a...
 
Conctructing Polytopes via a Vertex Oracle
Conctructing Polytopes via a Vertex OracleConctructing Polytopes via a Vertex Oracle
Conctructing Polytopes via a Vertex Oracle
 
Sol78
Sol78Sol78
Sol78
 
Sol78
Sol78Sol78
Sol78
 
Applications of partial differentiation
Applications of partial differentiationApplications of partial differentiation
Applications of partial differentiation
 
differentiate free
differentiate freedifferentiate free
differentiate free
 
An algorithm for computing resultant polytopes
An algorithm for computing resultant polytopesAn algorithm for computing resultant polytopes
An algorithm for computing resultant polytopes
 
2.4 defintion of derivative
2.4 defintion of derivative2.4 defintion of derivative
2.4 defintion of derivative
 
Single Variable Calculus Assignment Help
Single Variable Calculus Assignment HelpSingle Variable Calculus Assignment Help
Single Variable Calculus Assignment Help
 
IVR - Chapter 1 - Introduction
IVR - Chapter 1 - IntroductionIVR - Chapter 1 - Introduction
IVR - Chapter 1 - Introduction
 
Calculus Final Review Joshua Conyers
Calculus Final Review Joshua ConyersCalculus Final Review Joshua Conyers
Calculus Final Review Joshua Conyers
 
Class 10 Maths Ch Polynomial PPT
Class 10 Maths Ch Polynomial PPTClass 10 Maths Ch Polynomial PPT
Class 10 Maths Ch Polynomial PPT
 
Theoryofcomp science
Theoryofcomp scienceTheoryofcomp science
Theoryofcomp science
 
Gremlin's Graph Traversal Machinery
Gremlin's Graph Traversal MachineryGremlin's Graph Traversal Machinery
Gremlin's Graph Traversal Machinery
 
DataStax | Graph Computing with Apache TinkerPop (Marko Rodriguez) | Cassandr...
DataStax | Graph Computing with Apache TinkerPop (Marko Rodriguez) | Cassandr...DataStax | Graph Computing with Apache TinkerPop (Marko Rodriguez) | Cassandr...
DataStax | Graph Computing with Apache TinkerPop (Marko Rodriguez) | Cassandr...
 
Physical Chemistry Assignment Help
Physical Chemistry Assignment HelpPhysical Chemistry Assignment Help
Physical Chemistry Assignment Help
 
"An output-sensitive algorithm for computing projections of resultant polytop...
"An output-sensitive algorithm for computing projections of resultant polytop..."An output-sensitive algorithm for computing projections of resultant polytop...
"An output-sensitive algorithm for computing projections of resultant polytop...
 
Truth, deduction, computation lecture e
Truth, deduction, computation   lecture eTruth, deduction, computation   lecture e
Truth, deduction, computation lecture e
 

More from Andres Mendez-Vazquez

01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectorsAndres Mendez-Vazquez
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issuesAndres Mendez-Vazquez
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiationAndres Mendez-Vazquez
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep LearningAndres Mendez-Vazquez
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learningAndres Mendez-Vazquez
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusAndres Mendez-Vazquez
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusAndres Mendez-Vazquez
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesAndres Mendez-Vazquez
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variationsAndres Mendez-Vazquez
 

More from Andres Mendez-Vazquez (20)

2.03 bayesian estimation
2.03 bayesian estimation2.03 bayesian estimation
2.03 bayesian estimation
 
05 linear transformations
05 linear transformations05 linear transformations
05 linear transformations
 
01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues
 
01.02 linear equations
01.02 linear equations01.02 linear equations
01.02 linear equations
 
01.01 vector spaces
01.01 vector spaces01.01 vector spaces
01.01 vector spaces
 
06 recurrent neural_networks
06 recurrent neural_networks06 recurrent neural_networks
06 recurrent neural_networks
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation
 
Zetta global
Zetta globalZetta global
Zetta global
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learning
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning Syllabus
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabus
 
Ideas 09 22_2018
Ideas 09 22_2018Ideas 09 22_2018
Ideas 09 22_2018
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data Sciences
 
Analysis of Algorithms Syllabus
Analysis of Algorithms  SyllabusAnalysis of Algorithms  Syllabus
Analysis of Algorithms Syllabus
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
 
17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension
 
A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
 

Recently uploaded

IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncssuser2ae721
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...Chandu841456
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxwendy cai
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxDeepakSakkari2
 
An introduction to Semiconductor and its types.pptx
An introduction to Semiconductor and its types.pptxAn introduction to Semiconductor and its types.pptx
An introduction to Semiconductor and its types.pptxPurva Nikam
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx959SahilShah
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitterShivangiSharma879191
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)dollysharma2066
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxk795866
 

Recently uploaded (20)

POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...
 
What are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptxWhat are the advantages and disadvantages of membrane structures.pptx
What are the advantages and disadvantages of membrane structures.pptx
 
Biology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptxBiology for Computer Engineers Course Handout.pptx
Biology for Computer Engineers Course Handout.pptx
 
An introduction to Semiconductor and its types.pptx
An introduction to Semiconductor and its types.pptxAn introduction to Semiconductor and its types.pptx
An introduction to Semiconductor and its types.pptx
 
Application of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptxApplication of Residue Theorem to evaluate real integrations.pptx
Application of Residue Theorem to evaluate real integrations.pptx
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptx
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 

04 Machine Learning - Supervised Linear Classifier

  • 1. Machine Learning for Data Mining Linear Classifiers Andres Mendez-Vazquez May 23, 2016 1 / 85
  • 2. Outline 1 Introduction The Simplest Functions Splitting the Space The Decision Surface 2 Developing an Initial Solution Gradient Descent Procedure The Geometry of a Two-Category Linearly-Separable Case Basic Method Minimum Squared Error Procedure The Error Idea The Final Error Equation The Data Matrix Multi-Class Solution Issues with Least Squares!!! What about Numerical Stability? 2 / 85
  • 3. Outline 1 Introduction The Simplest Functions Splitting the Space The Decision Surface 2 Developing an Initial Solution Gradient Descent Procedure The Geometry of a Two-Category Linearly-Separable Case Basic Method Minimum Squared Error Procedure The Error Idea The Final Error Equation The Data Matrix Multi-Class Solution Issues with Least Squares!!! What about Numerical Stability? 3 / 85
  • 4. What is it? First than anything, we have a parametric model!!! Here, we have an hyperplane as a model: g(x) = wT x + w0 (1) In the case of R2 We have the following function: g (x) = w1x1 + w2x2 + w0 (2) 4 / 85
  • 5. What is it? First than anything, we have a parametric model!!! Here, we have an hyperplane as a model: g(x) = wT x + w0 (1) In the case of R2 We have the following function: g (x) = w1x1 + w2x2 + w0 (2) 4 / 85
  • 6. Outline 1 Introduction The Simplest Functions Splitting the Space The Decision Surface 2 Developing an Initial Solution Gradient Descent Procedure The Geometry of a Two-Category Linearly-Separable Case Basic Method Minimum Squared Error Procedure The Error Idea The Final Error Equation The Data Matrix Multi-Class Solution Issues with Least Squares!!! What about Numerical Stability? 5 / 85
  • 7. Splitting The Space R2 Using a simple straight line Class Class 6 / 85
  • 8. Outline 1 Introduction The Simplest Functions Splitting the Space The Decision Surface 2 Developing an Initial Solution Gradient Descent Procedure The Geometry of a Two-Category Linearly-Separable Case Basic Method Minimum Squared Error Procedure The Error Idea The Final Error Equation The Data Matrix Multi-Class Solution Issues with Least Squares!!! What about Numerical Stability? 7 / 85
  • 9. Defining a Decision Surface The equation g (x) = 0 defines a decision surface Separating the elements in classes, ω1 and ω2. When g (x) is linear the decision surface is an hyperplane Given x1 and x2 are both on the decision surface: wT x1 + w0 = 0 wT x2 + w0 = 0 Thus wT x1 + w0 = wT x2 + w0 (3) 8 / 85
  • 10. Defining a Decision Surface The equation g (x) = 0 defines a decision surface Separating the elements in classes, ω1 and ω2. When g (x) is linear the decision surface is an hyperplane Given x1 and x2 are both on the decision surface: wT x1 + w0 = 0 wT x2 + w0 = 0 Thus wT x1 + w0 = wT x2 + w0 (3) 8 / 85
  • 11. Defining a Decision Surface The equation g (x) = 0 defines a decision surface Separating the elements in classes, ω1 and ω2. When g (x) is linear the decision surface is an hyperplane Given x1 and x2 are both on the decision surface: wT x1 + w0 = 0 wT x2 + w0 = 0 Thus wT x1 + w0 = wT x2 + w0 (3) 8 / 85
  • 12. Defining a Decision Surface Thus wT (x1 − x2) = 0 (4) Remark: Any vector in the hyperplane is perpendicular to wT i.e. wT is normal to the hyperplane. Something Notable Properties 9 / 85
  • 13. Defining a Decision Surface Thus wT (x1 − x2) = 0 (4) Remark: Any vector in the hyperplane is perpendicular to wT i.e. wT is normal to the hyperplane. Something Notable Properties 9 / 85
  • 14. Defining a Decision Surface Thus wT (x1 − x2) = 0 (4) Remark: Any vector in the hyperplane is perpendicular to wT i.e. wT is normal to the hyperplane. Something Notable Properties 9 / 85
  • 15. Therefore The space is split in two regions (Example in R3 ) by the hyperplane H 10 / 85
  • 16. Some Properties of the Hyperplane Given that g (x) > 0 if x ∈ R1 11 / 85
  • 17. It is more We can say the following Any x ∈ R1 is on the positive side of H. Any x ∈ R2 is on the negative side of H. In addition, g (x) can give us a way to obtain the distance from x to the hyperplane H First, we express any x as follows x = xp + r w w Where xp is the normal projection of x onto H. r is the desired distance Positive, if x is in the positive side Negative, if x is in the negative side 12 / 85
  • 18. It is more We can say the following Any x ∈ R1 is on the positive side of H. Any x ∈ R2 is on the negative side of H. In addition, g (x) can give us a way to obtain the distance from x to the hyperplane H First, we express any x as follows x = xp + r w w Where xp is the normal projection of x onto H. r is the desired distance Positive, if x is in the positive side Negative, if x is in the negative side 12 / 85
  • 19. It is more We can say the following Any x ∈ R1 is on the positive side of H. Any x ∈ R2 is on the negative side of H. In addition, g (x) can give us a way to obtain the distance from x to the hyperplane H First, we express any x as follows x = xp + r w w Where xp is the normal projection of x onto H. r is the desired distance Positive, if x is in the positive side Negative, if x is in the negative side 12 / 85
  • 20. It is more We can say the following Any x ∈ R1 is on the positive side of H. Any x ∈ R2 is on the negative side of H. In addition, g (x) can give us a way to obtain the distance from x to the hyperplane H First, we express any x as follows x = xp + r w w Where xp is the normal projection of x onto H. r is the desired distance Positive, if x is in the positive side Negative, if x is in the negative side 12 / 85
  • 21. It is more We can say the following Any x ∈ R1 is on the positive side of H. Any x ∈ R2 is on the negative side of H. In addition, g (x) can give us a way to obtain the distance from x to the hyperplane H First, we express any x as follows x = xp + r w w Where xp is the normal projection of x onto H. r is the desired distance Positive, if x is in the positive side Negative, if x is in the negative side 12 / 85
  • 22. It is more We can say the following Any x ∈ R1 is on the positive side of H. Any x ∈ R2 is on the negative side of H. In addition, g (x) can give us a way to obtain the distance from x to the hyperplane H First, we express any x as follows x = xp + r w w Where xp is the normal projection of x onto H. r is the desired distance Positive, if x is in the positive side Negative, if x is in the negative side 12 / 85
  • 23. It is more We can say the following Any x ∈ R1 is on the positive side of H. Any x ∈ R2 is on the negative side of H. In addition, g (x) can give us a way to obtain the distance from x to the hyperplane H First, we express any x as follows x = xp + r w w Where xp is the normal projection of x onto H. r is the desired distance Positive, if x is in the positive side Negative, if x is in the negative side 12 / 85
  • 24. We have something like this We have then 13 / 85
  • 25. Now Since g (xp) = 0 We have that g (x) = g xp + r w w = wT xp + r w w + w0 = wT xp + w0 + r wT w w = g (xp) + r w 2 w = r w Then, we have r = g (x) w (5) 14 / 85
  • 26. Now Since g (xp) = 0 We have that g (x) = g xp + r w w = wT xp + r w w + w0 = wT xp + w0 + r wT w w = g (xp) + r w 2 w = r w Then, we have r = g (x) w (5) 14 / 85
  • 27. Now Since g (xp) = 0 We have that g (x) = g xp + r w w = wT xp + r w w + w0 = wT xp + w0 + r wT w w = g (xp) + r w 2 w = r w Then, we have r = g (x) w (5) 14 / 85
  • 28. Now Since g (xp) = 0 We have that g (x) = g xp + r w w = wT xp + r w w + w0 = wT xp + w0 + r wT w w = g (xp) + r w 2 w = r w Then, we have r = g (x) w (5) 14 / 85
  • 29. Now Since g (xp) = 0 We have that g (x) = g xp + r w w = wT xp + r w w + w0 = wT xp + w0 + r wT w w = g (xp) + r w 2 w = r w Then, we have r = g (x) w (5) 14 / 85
  • 30. Now Since g (xp) = 0 We have that g (x) = g xp + r w w = wT xp + r w w + w0 = wT xp + w0 + r wT w w = g (xp) + r w 2 w = r w Then, we have r = g (x) w (5) 14 / 85
  • 31. In particular The distance from the origin to H r = g (0) w = wT (0) + w0 w = w0 w (6) Remarks If w0 > 0, the origin is on the positive side of H. If w0 < 0, the origin is on the negative side of H. If w0 = 0, the hyperplane has the homogeneous form wT x and hyperplane passes through the origin. 15 / 85
  • 32. In particular The distance from the origin to H r = g (0) w = wT (0) + w0 w = w0 w (6) Remarks If w0 > 0, the origin is on the positive side of H. If w0 < 0, the origin is on the negative side of H. If w0 = 0, the hyperplane has the homogeneous form wT x and hyperplane passes through the origin. 15 / 85
  • 33. In particular The distance from the origin to H r = g (0) w = wT (0) + w0 w = w0 w (6) Remarks If w0 > 0, the origin is on the positive side of H. If w0 < 0, the origin is on the negative side of H. If w0 = 0, the hyperplane has the homogeneous form wT x and hyperplane passes through the origin. 15 / 85
  • 34. In particular The distance from the origin to H r = g (0) w = wT (0) + w0 w = w0 w (6) Remarks If w0 > 0, the origin is on the positive side of H. If w0 < 0, the origin is on the negative side of H. If w0 = 0, the hyperplane has the homogeneous form wT x and hyperplane passes through the origin. 15 / 85
  • 35. In addition... If we do the following g (x) = w0 + d i=1 wixi = d i=0 wixi (7) By making x0 = 1 and y =       1 x1 ... xd       =      1 x      Where y is called an augmented feature vector. 16 / 85
  • 36. In addition... If we do the following g (x) = w0 + d i=1 wixi = d i=0 wixi (7) By making x0 = 1 and y =       1 x1 ... xd       =      1 x      Where y is called an augmented feature vector. 16 / 85
  • 37. In addition... If we do the following g (x) = w0 + d i=1 wixi = d i=0 wixi (7) By making x0 = 1 and y =       1 x1 ... xd       =      1 x      Where y is called an augmented feature vector. 16 / 85
  • 38. In a similar way We have the augmented weight vector waug =       w0 w1 ... wd       =      w0 w      Remarks The addition of a constant component to x preserves all the distance relationship between samples. The resulting y vectors, all lie in a d-dimensional subspace which is the x-space itself. 17 / 85
  • 39. In a similar way We have the augmented weight vector waug =       w0 w1 ... wd       =      w0 w      Remarks The addition of a constant component to x preserves all the distance relationship between samples. The resulting y vectors, all lie in a d-dimensional subspace which is the x-space itself. 17 / 85
  • 40. In a similar way We have the augmented weight vector waug =       w0 w1 ... wd       =      w0 w      Remarks The addition of a constant component to x preserves all the distance relationship between samples. The resulting y vectors, all lie in a d-dimensional subspace which is the x-space itself. 17 / 85
  • 41. More Remarks In addition The hyperplane decision surface H defined by wT augy = 0 passes through the origin in y-space. Even though the corresponding hyperplane H can be in any position of the x-space. The distance from y to H is |wT augy| waug or |g(x)| waug . Since waug > w This distance is less or at least equal to the distance from x to H. This mapping is quite useful Because we only need to find a weight vector waug instead of finding the weight vector w and the threshold w0. 18 / 85
  • 42. More Remarks In addition The hyperplane decision surface H defined by wT augy = 0 passes through the origin in y-space. Even though the corresponding hyperplane H can be in any position of the x-space. The distance from y to H is |wT augy| waug or |g(x)| waug . Since waug > w This distance is less or at least equal to the distance from x to H. This mapping is quite useful Because we only need to find a weight vector waug instead of finding the weight vector w and the threshold w0. 18 / 85
  • 43. More Remarks In addition The hyperplane decision surface H defined by wT augy = 0 passes through the origin in y-space. Even though the corresponding hyperplane H can be in any position of the x-space. The distance from y to H is |wT augy| waug or |g(x)| waug . Since waug > w This distance is less or at least equal to the distance from x to H. This mapping is quite useful Because we only need to find a weight vector waug instead of finding the weight vector w and the threshold w0. 18 / 85
  • 44. More Remarks In addition The hyperplane decision surface H defined by wT augy = 0 passes through the origin in y-space. Even though the corresponding hyperplane H can be in any position of the x-space. The distance from y to H is |wT augy| waug or |g(x)| waug . Since waug > w This distance is less or at least equal to the distance from x to H. This mapping is quite useful Because we only need to find a weight vector waug instead of finding the weight vector w and the threshold w0. 18 / 85
  • 45. More Remarks In addition The hyperplane decision surface H defined by wT augy = 0 passes through the origin in y-space. Even though the corresponding hyperplane H can be in any position of the x-space. The distance from y to H is |wT augy| waug or |g(x)| waug . Since waug > w This distance is less or at least equal to the distance from x to H. This mapping is quite useful Because we only need to find a weight vector waug instead of finding the weight vector w and the threshold w0. 18 / 85
  • 46. Outline 1 Introduction The Simplest Functions Splitting the Space The Decision Surface 2 Developing an Initial Solution Gradient Descent Procedure The Geometry of a Two-Category Linearly-Separable Case Basic Method Minimum Squared Error Procedure The Error Idea The Final Error Equation The Data Matrix Multi-Class Solution Issues with Least Squares!!! What about Numerical Stability? 19 / 85
  • 47. Outline 1 Introduction The Simplest Functions Splitting the Space The Decision Surface 2 Developing an Initial Solution Gradient Descent Procedure The Geometry of a Two-Category Linearly-Separable Case Basic Method Minimum Squared Error Procedure The Error Idea The Final Error Equation The Data Matrix Multi-Class Solution Issues with Least Squares!!! What about Numerical Stability? 20 / 85
  • 48. Initial Supposition Suppose, we have n samples x1, x2, ..., xn some labeled ω1 and some labeled ω2. We want a vector weight w such that wT xi > 0, if xi ∈ ω1. wT xi < 0, if xi ∈ ω2. We suggest the following normalization We replace all the samples xi ∈ ω2 by their negative vectors!!! 21 / 85
  • 49. Initial Supposition Suppose, we have n samples x1, x2, ..., xn some labeled ω1 and some labeled ω2. We want a vector weight w such that wT xi > 0, if xi ∈ ω1. wT xi < 0, if xi ∈ ω2. We suggest the following normalization We replace all the samples xi ∈ ω2 by their negative vectors!!! 21 / 85
  • 50. Initial Supposition Suppose, we have n samples x1, x2, ..., xn some labeled ω1 and some labeled ω2. We want a vector weight w such that wT xi > 0, if xi ∈ ω1. wT xi < 0, if xi ∈ ω2. We suggest the following normalization We replace all the samples xi ∈ ω2 by their negative vectors!!! 21 / 85
  • 51. Initial Supposition Suppose, we have n samples x1, x2, ..., xn some labeled ω1 and some labeled ω2. We want a vector weight w such that wT xi > 0, if xi ∈ ω1. wT xi < 0, if xi ∈ ω2. We suggest the following normalization We replace all the samples xi ∈ ω2 by their negative vectors!!! 21 / 85
  • 52. The Usefulness of the Normalization Once the normalization is done We only need for a weight vector w such that wT xi > 0 for all the samples. The name of this weight vector It is called a separating vector or solution vector. 22 / 85
  • 54. Here, we have the solution region for w Do not confuse this region with the decision region!!! (Figure: the separating plane and the solution space.) Remark: w is not unique!!! We can have different w’s solving the problem 23 / 85
  • 56. Here, we have the solution region for w under normalization Do not confuse this region with the decision region!!! (Figure: the "separating" plane and the solution space after normalization.) Remark: w is not unique!!! 24 / 85
  • 58. How do we get this w? In order to be able to do this We need to impose constraints on the problem. Possible constraints!!! To find a unit-length weight vector that maximizes the minimum distance from the samples to the separating plane. To find the minimum-length weight vector satisfying wT xi ≥ b for all i, where b is a positive constant called the margin. Here the solution region resulting from the intersection of the half-spaces wT xi ≥ b > 0 lies within the previous solution region!!! 25 / 85
  • 62. We have then A new boundary, shifted into the solution region by a distance b/||xi|| (figure: the shrunken solution region with margin b). 26 / 85
  • 63. Outline 1 Introduction The Simplest Functions Splitting the Space The Decision Surface 2 Developing an Initial Solution Gradient Descent Procedure The Geometry of a Two-Category Linearly-Separable Case Basic Method Minimum Squared Error Procedure The Error Idea The Final Error Equation The Data Matrix Multi-Class Solution Issues with Least Squares!!! What about Numerical Stability? 27 / 85
  • 64. Gradient Descent For this, we will define a criterion function J (w) A classic optimization The basic procedure is as follows 1 Start with a random weight vector w (1). 2 Compute the gradient vector ∇J (w (1)). 3 Obtain the value w (2) by moving from w (1) in the direction of steepest descent: 1 i.e. along the negative of the gradient. 2 By using the following equation: w (k + 1) = w (k) − η (k) ∇J (w (k)) (8) 28 / 85
  • 70. What is η (k)? Here η (k) is a positive scale factor or learning rate!!! The basic algorithm looks like this Algorithm 1 (Basic gradient descent) 1 begin initialize w, criterion θ, η (·), k = 0 2 do k = k + 1 3 w = w − η (k) ∇J (w) 4 until ||η (k) ∇J (w)|| < θ 5 return w Problem!!! How to choose the learning rate? If η (k) is too small, convergence is quite slow!!! If η (k) is too large, the correction will overshoot and can even diverge!!! 29 / 85
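To make the procedure concrete, here is a minimal Python sketch of Algorithm 1 (basic gradient descent) with a fixed learning rate, applied to the least-squares criterion J(w) = ||Xw − y||² used later in these slides; the toy data, the learning rate eta and the threshold theta are illustrative choices, not part of the original slides.

```python
import numpy as np

def gradient_descent(grad_J, w0, eta=0.001, theta=1e-6, max_iter=10000):
    """Basic gradient descent: w <- w - eta * grad_J(w) until the step is small."""
    w = w0.astype(float)
    for k in range(max_iter):
        step = eta * grad_J(w)
        w = w - step
        if np.linalg.norm(step) < theta:   # stopping criterion ||eta * grad J(w)|| < theta
            break
    return w

# Toy criterion J(w) = ||X w - y||^2, whose gradient is 2 X^T (X w - y)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
grad = lambda w: 2 * X.T @ (X @ w - y)

w_hat = gradient_descent(grad, w0=np.zeros(3))
print(w_hat)   # close to (1, -2, 0.5)
```

With a much larger eta the iterates overshoot and diverge; with a much smaller one many more iterations are needed, exactly as the slide warns.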
  • 79. Using the Taylor second-order expansion around the value w (k) We do the following J (w) ∼= J (w (k)) + ∇JT (w − w (k)) + 1/2 (w − w (k))T H (w − w (k)) (9) Remark: This is known as the Taylor second-order expansion!!! Here, we have ∇J is the vector of partial derivatives ∂J/∂wi evaluated at w (k). H is the Hessian matrix of second partial derivatives ∂2J/∂wi∂wj evaluated at w (k). 30 / 85
  • 83. Then We substitute (Eq. 8) into (Eq. 9) w (k + 1) − w (k) = −η (k) ∇J (w (k)) (10) We have then J (w (k + 1)) ∼= J (w (k)) + ∇JT (−η (k) ∇J (w (k))) + 1/2 (−η (k) ∇J (w (k)))T H (−η (k) ∇J (w (k))) Finally, we have J (w (k + 1)) ∼= J (w (k)) − η (k) ||∇J||2 + 1/2 η2 (k) ∇JT H ∇J (11) 31 / 85
  • 86. Differentiate with respect to η (k) and set the result equal to zero We have then − ||∇J||2 + η (k) ∇JT H ∇J = 0 (12) Finally η (k) = ||∇J||2 / (∇JT H ∇J) (13) Remark This is the optimal step size!!! Problem!!! Calculating H can be quite expensive!!! 32 / 85
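A short sketch of the optimal step size of Eq. 13, under the assumption that J is the quadratic least-squares criterion from the previous sketch (so the Hessian H = 2XᵀX is constant and cheap to form); all names and data are illustrative.

```python
import numpy as np

def optimal_step(grad, H):
    """Optimal eta for a quadratic criterion: ||grad||^2 / (grad^T H grad)."""
    return float(grad @ grad) / float(grad @ H @ grad)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
H = 2 * X.T @ X                        # Hessian of J(w) = ||Xw - y||^2

w = np.zeros(3)
for k in range(50):                    # steepest descent with the optimal step size
    g = 2 * X.T @ (X @ w - y)
    if np.linalg.norm(g) < 1e-10:
        break
    w = w - optimal_step(g, H) * g
print(w)                               # close to (1, -2, 0.5)
```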
  • 89. We can have an adaptive line search!!! We can use the idea of keeping everything fixed except η (k) Then, we can minimize the following function of the step size f (η (k)) = J (w (k) − η (k) ∇J (w (k))) We can optimize it using line search methods Line Search Methods Backtracking line search Bisection method Golden ratio Etc. 33 / 85
  • 95. Example: Golden Ratio Imagine that you have a function f : L → R defined on a line segment L Where: Choose a and b such that (a + b)/a = a/b (The Golden Ratio). 34 / 85
  • 97. The process is as follows Given f1, f2, f3, where f1 = f (x1), f2 = f (x2), f3 = f (x3) We have then: if f2 is smaller than both f1 and f3, then the minimum lies in [x1, x3] Now, we generate x4 with f4 = f (x4) in the largest subinterval, e.g. [x2, x3] 35 / 85
  • 100. Finally Two cases If f4 > f2 (case a), then the minimum lies between x1 and x4 and the new triplet is x1, x2 and x4. If f4 < f2 (case b), then the minimum lies between x2 and x3 and the new triplet is x2, x4 and x3. Then Repeat the procedure!!! For more, please read the paper “Sequential Minimax Search for a Maximum” by J. Kiefer 36 / 85
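For concreteness, a minimal golden-section line search sketch in Python; the bracket [a, b], the tolerance and the example function are illustrative, and in the descent setting f(η) would be J(w(k) − η ∇J(w(k))).

```python
import math

def golden_section_min(f, a, b, tol=1e-6):
    """Golden-section search for the minimum of a unimodal f on [a, b]."""
    inv_phi = (math.sqrt(5) - 1) / 2           # 1/phi, about 0.618
    x1 = b - inv_phi * (b - a)
    x2 = a + inv_phi * (b - a)
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 < f2:                            # minimum lies in [a, x2]: shrink from the right
            b, x2, f2 = x2, x1, f1
            x1 = b - inv_phi * (b - a)
            f1 = f(x1)
        else:                                  # minimum lies in [x1, b]: shrink from the left
            a, x1, f1 = x1, x2, f2
            x2 = a + inv_phi * (b - a)
            f2 = f(x2)
    return (a + b) / 2

# Example: a simple convex function of the step size, minimized at 0.3
eta_star = golden_section_min(lambda eta: (eta - 0.3) ** 2 + 1.0, 0.0, 1.0)
print(eta_star)
```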
  • 104. We have another method... Differentiate the second-order Taylor expansion with respect to w J (w) ∼= J (w (k)) + ∇JT (w − w (k)) + 1/2 (w − w (k))T H (w − w (k)) Setting the gradient to zero, we get ∇J + Hw − Hw (k) = 0 (14) Thus Hw = Hw (k) − ∇J H−1 Hw = H−1 Hw (k) − H−1 ∇J w = w (k) − H−1 ∇J 37 / 85
  • 107. The Newton-Raphson Algorithm We have the following algorithm Algorithm 2 (Newton descent) 1 begin initialize w, criterion θ 2 do k = k + 1 3 w = w − H−1 ∇J (w) 4 until ||H−1 ∇J (w)|| < θ 5 return w 38 / 85
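A minimal sketch of Algorithm 2 (Newton descent) in Python, again using the quadratic least-squares criterion as a stand-in for J (for which a single Newton step already reaches the minimizer); the data are made up. Solving H·step = ∇J with np.linalg.solve avoids forming H⁻¹ explicitly.

```python
import numpy as np

def newton_descent(grad_J, hess_J, w0, theta=1e-8, max_iter=100):
    """Newton descent: w <- w - H^{-1} grad J(w) until the Newton step is small."""
    w = w0.astype(float)
    for k in range(max_iter):
        step = np.linalg.solve(hess_J(w), grad_J(w))   # solve H * step = grad J
        w = w - step
        if np.linalg.norm(step) < theta:
            break
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, 0.0, -1.0])
grad = lambda w: 2 * X.T @ (X @ w - y)
hess = lambda w: 2 * X.T @ X

print(newton_descent(grad, hess, np.zeros(3)))   # close to (2, 0, -1)
```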
  • 114. Outline 1 Introduction The Simplest Functions Splitting the Space The Decision Surface 2 Developing an Initial Solution Gradient Descent Procedure The Geometry of a Two-Category Linearly-Separable Case Basic Method Minimum Squared Error Procedure The Error Idea The Final Error Equation The Data Matrix Multi-Class Solution Issues with Least Squares!!! What about Numerical Stability? 39 / 85
  • 115. Initial Setup Important We move away from our initial normalization of the samples!!! Now, we are going to use the method known as Minimum Squared Error 40 / 85
  • 117. Now, assume the following Imagine that your problem has two classes ω1 and ω2 in R2 1 They are linearly separable!!! 2 You are required to label them. We have a problem!!! Which is the problem? We do not know the hyperplane!!! Thus, what distance does each point have to the hyperplane? 41 / 85
  • 121. A Simple Solution For Our Quandary Label the Classes ω1 =⇒ +1 ω2 =⇒ −1 We produce the following labels 1 if x ∈ ω1 then yideal = gideal (x) = +1. 2 if x ∈ ω2 then yideal = gideal (x) = −1. Remark: We have a problem with these labels!!! 42 / 85
  • 125. Outline 1 Introduction The Simplest Functions Splitting the Space The Decision Surface 2 Developing an Initial Solution Gradient Descent Procedure The Geometry of a Two-Category Linearly-Separable Case Basic Method Minimum Squared Error Procedure The Error Idea The Final Error Equation The Data Matrix Multi-Class Solution Issues with Least Squares!!! What about Numerical Stability? 43 / 85
  • 126. Now, What? Assume the true function is given by ynoise = gnoise (x) = wT x + w0 + ε (15) Where the noise ε has a distribution ε ∼ N(µ, σ2) Thus, we can do the following ynoise = gnoise (x) = gideal (x) + ε (16) 44 / 85
  • 129. Thus, we have What to do? ε = ynoise − gideal (x) (17) Graphically (figure: the vertical offsets from the line) 45 / 85
  • 131. Outline 1 Introduction The Simplest Functions Splitting the Space The Decision Surface 2 Developing an Initial Solution Gradient Descent Procedure The Geometry of a Two-Category Linearly-Separable Case Basic Method Minimum Squared Error Procedure The Error Idea The Final Error Equation The Data Matrix Multi-Class Solution Issues with Least Squares!!! What about Numerical Stability? 46 / 85
  • 132. Sum Over All Errors We can do the following J (w) = Σ_{i=1}^{N} ε_i^2 = Σ_{i=1}^{N} (yi − gideal (xi))2 (18) Remark: Known as least squares (fitting the vertical offset!!!) Generalize If the dimensionality of each sample (data point) is d, you can extend each sample vector to xT = (1, x'), We have: Σ_{i=1}^{N} (yi − x_i^T w)2 = (y − Xw)T (y − Xw) = ||y − Xw||_2^2 (19) 47 / 85
  • 136. Outline 1 Introduction The Simplest Functions Splitting the Space The Decision Surface 2 Developing an Initial Solution Gradient Descent Procedure The Geometry of a Two-Category Linearly-Separable Case Basic Method Minimum Squared Error Procedure The Error Idea The Final Error Equation The Data Matrix Multi-Class Solution Issues with Least Squares!!! What about Numerical Stability? 48 / 85
  • 137. What is X It is the Data Matrix, the N × (d + 1) matrix whose i-th row is the augmented sample (1, (xi)1, ..., (xi)j, ..., (xi)d): X = [ 1 (x1)1 · · · (x1)j · · · (x1)d ; ... ; 1 (xi)1 · · · (xi)j · · · (xi)d ; ... ; 1 (xN)1 · · · (xN)j · · · (xN)d ] (20) We know the following matrix-calculus identities d(xT Ax)/dx = Ax + AT x, d(Ax)/dx = A (21) 49 / 85
  • 139. Note about other representations We could have xT = (x1, x2, ..., xd, 1), thus X = [ (x1)1 · · · (x1)j · · · (x1)d 1 ; ... ; (xi)1 · · · (xi)j · · · (xi)d 1 ; ... ; (xN)1 · · · (xN)j · · · (xN)d 1 ] (22), i.e. the column of ones moves to the last position. 50 / 85
  • 140. We can expand our quadratic formula!!! Thus (y − Xw)T (y − Xw) = yT y − wT XT y − yT Xw + wT XT Xw (23) Making it possible, by differentiating with respect to w and assuming that XT X is invertible, to obtain ˆw = (XT X)−1 XT y (24) (The derivative step is spelled out below.) Note: XT X is always positive semi-definite. If it is also invertible, it is positive definite. Thus, how do we get the discriminant function? Any Ideas? 51 / 85
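Spelling out the omitted derivative step, using the identities of Eq. 21 and the symmetry of XᵀX:

```latex
\nabla_{w}\,(y - Xw)^{T}(y - Xw)
  = -2\,X^{T}y + 2\,X^{T}Xw = 0
\quad\Longrightarrow\quad
X^{T}X\,w = X^{T}y
\quad\Longrightarrow\quad
\hat{w} = \bigl(X^{T}X\bigr)^{-1}X^{T}y .
```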
  • 143. The Final Discriminant Function Very Simple!!! g(x) = xT ˆw = xT (XT X)−1 XT y (25) 52 / 85
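A minimal NumPy sketch of this closed-form solution for the two-class case with labels ±1; the Gaussian toy data are made up, and np.linalg.pinv is used instead of forming (XᵀX)⁻¹ explicitly.

```python
import numpy as np

rng = np.random.default_rng(3)
# Two made-up Gaussian classes in R^2, labeled +1 and -1
X1 = rng.normal(loc=[2, 2], size=(50, 2))
X2 = rng.normal(loc=[-2, -2], size=(50, 2))
X = np.vstack([X1, X2])
y = np.concatenate([np.ones(50), -np.ones(50)])

Xa = np.hstack([np.ones((X.shape[0], 1)), X])   # augment with a leading 1 (bias column)
w_hat = np.linalg.pinv(Xa) @ y                  # w = (X^T X)^{-1} X^T y via the pseudo-inverse

g = Xa @ w_hat                                  # discriminant values g(x) = x^T w
pred = np.where(g > 0, 1, -1)                   # the sign of g(x) gives the predicted class
print("training accuracy:", np.mean(pred == y))
```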
  • 144. Pseudo-inverse of a Matrix Definition Suppose that A ∈ Rm×n has full column rank, rank (A) = n. We call the matrix A+ = (AT A)−1 AT the pseudo-inverse of A. Reason A+ inverts A on its image What? If w ∈ image (A), then there is some v ∈ Rn such that w = Av. Hence: A+ w = A+ Av = (AT A)−1 AT Av = v 53 / 85
  • 147. What lives where? We have X ∈ RN×(d+1) Image (X) = span(Xcol 1, ..., Xcol d+1) xi ∈ Rd w ∈ Rd+1 Xcol i, y ∈ RN Basically y, the list of desired outputs, is being projected onto span(Xcol 1, ..., Xcol d+1) (26) by the projection operator X (XT X)−1 XT . 54 / 85
  • 153. Geometric Interpretation We have 1 The image of the mapping w → Xw is a linear subspace of RN . 2 As w runs through all points of Rd+1, the function value Xw runs through all points in the image space image (X) = span(Xcol 1, ..., Xcol d+1). 3 Each w defines one point Xw = Σ_{j=0}^{d} wj Xcol j . 4 ˆw is the point which minimizes the distance d (y, image (X)), i.e. X ˆw is the closest point of image (X) to y. 55 / 85
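To see the geometric interpretation numerically, a short check (with made-up data) that the residual y − Xŵ is orthogonal to every column of X, i.e. that Xŵ is the orthogonal projection of y onto image(X).

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])   # N = 20, d + 1 = 3
y = rng.normal(size=20)

w_hat = np.linalg.pinv(X) @ y          # least-squares coefficients
residual = y - X @ w_hat               # y minus its projection onto image(X)

print(X.T @ residual)                  # essentially zero: residual is orthogonal to all columns of X
```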
  • 158. Outline 1 Introduction The Simplest Functions Splitting the Space The Decision Surface 2 Developing an Initial Solution Gradient Descent Procedure The Geometry of a Two-Category Linearly-Separable Case Basic Method Minimum Squared Error Procedure The Error Idea The Final Error Equation The Data Matrix Multi-Class Solution Issues with Least Squares!!! What about Numerical Stability? 57 / 85
  • 159. Multi-Class Solution What to do? 1 We might reduce the problem to c − 1 two-class problems. 2 We might use c(c − 1)/2 linear discriminants, one for every pair of classes. However 58 / 85
  • 162. What to do? Define c linear discriminant functions gi (x) = wT i x + wi0 for i = 1, ..., c (27) This is known as a linear machine Rule: if gk (x) > gj (x) for all j ≠ k =⇒ x ∈ ωk Nice Properties (They can be proved!!!) 1 Decision Regions are Singly Connected. 2 Decision Regions are Convex. 59 / 85
  • 165. Proof of Properties Proof Actually quite simple Given two points xA, xB in decision region k, consider y = λxA + (1 − λ) xB with λ ∈ (0, 1). 60 / 85
  • 167. Proof of Properties We know that gk (y) = wT k (λxA + (1 − λ) xB) + wk0 = λwT k xA + λwk0 + (1 − λ) wT k xB + (1 − λ) wk0 = λgk (xA) + (1 − λ) gk (xB) > λgj (xA) + (1 − λ) gj (xB) = gj (λxA + (1 − λ) xB) = gj (y) For all j ≠ k Or... y belongs to the region k defined by the rule!!! This region is Convex and Singly Connected because y is an arbitrary convex combination of points in the region. 61 / 85
  • 175. However!!! A not-so-nice property!!! It limits the expressive power of the classifier when the classes are not linearly separable. 62 / 85
  • 176. How do we train this Linear Machine? We know that each class ωk is described by gk (x) = wT k x + wk0 where k = 1, ..., c We then design a single machine g (x) = W T x (28) 63 / 85
  • 178. Where We have the following, with row k holding the bias and weights of gk: W T = [ w10 w11 w12 · · · w1d ; w20 w21 w22 · · · w2d ; w30 w31 w32 · · · w3d ; ... ; wc0 wc1 wc2 · · · wcd ] (29) What about the labels? OK, we know what to do with 2 classes, What about many classes? 64 / 85
  • 180. How do we train this Linear Machine? Use a vector ti with dimensionality c to identify the class of each element We have then the following dataset {xi, ti} for i = 1, 2, ..., N We build the following Matrix of Target Vectors T = [ tT 1 ; tT 2 ; ... ; tT N ] (30) 65 / 85
  • 182. Thus, we create the following Matrix A Matrix containing all the required information XW − T (31) Where row i of XW is the following vector (xT i w1, xT i w2, xT i w3, ..., xT i wc) (32) Remark: It is the vector resulting from multiplying row i of X against W in XW . That row is compared to the vector tT i of T by using the subtraction of vectors εi = (xT i w1, xT i w2, xT i w3, ..., xT i wc) − tT i (33) 66 / 85
  • 185. What do we want? We want the quadratic error 1/2 ||εi||2 for each sample These specific quadratic errors are on the diagonal of the matrix (XW − T )T (XW − T ) We can use the trace function to generate the desired total error J (·) = 1/2 Σ_{i=1}^{N} ||εi||2 (34) 67 / 85
  • 188. Then The trace allows us to express the total error J (W ) = 1/2 Trace[(XW − T )T (XW − T )] (35) Thus, we have, by the same derivative method, W = (XT X)−1 XT T = X+ T (36) 68 / 85
  • 190. How do we train this Linear Machine? Thus, we obtain the discriminant g (x) = W T x = T T (X+)T x (37) 69 / 85
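A minimal sketch of the linear machine trained by least squares (Eq. 36 and 37), using one-hot target vectors ti and made-up three-class Gaussian data; prediction follows the rule gk(x) > gj(x), i.e. an argmax over the c discriminants. All names and parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
c, n_per_class = 3, 40
centers = np.array([[3, 0], [-3, 0], [0, 3]])

# Made-up data: 3 Gaussian blobs with labels 0, 1, 2
X = np.vstack([rng.normal(loc=centers[k], size=(n_per_class, 2)) for k in range(c)])
labels = np.repeat(np.arange(c), n_per_class)

Xa = np.hstack([np.ones((X.shape[0], 1)), X])   # augmented data matrix, N x (d+1)
T = np.eye(c)[labels]                           # one-hot target matrix, N x c

W = np.linalg.pinv(Xa) @ T                      # W = X^+ T, shape (d+1) x c

g = Xa @ W                                      # g(x) = W^T x for every sample, N x c
pred = np.argmax(g, axis=1)                     # assign x to the class with the largest g_k(x)
print("training accuracy:", np.mean(pred == labels))
```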
  • 191. Outline 1 Introduction The Simplest Functions Splitting the Space The Decision Surface 2 Developing an Initial Solution Gradient Descent Procedure The Geometry of a Two-Category Linearly-Separable Case Basic Method Minimum Squared Error Procedure The Error Idea The Final Error Equation The Data Matrix Multi-Class Solution Issues with Least Squares!!! What about Numerical Stability? 70 / 85
  • 192. Issues with Least Squares Robustness 1 Least squares works only if X has full column rank, i.e. if XT X is invertible. 2 If XT X is almost singular, least squares is numerically unstable. 1 Statistical consequence: High variance of the predictions. Not suited for high-dimensional data 1 Modern problems: Many dimensions/features/predictors (possibly thousands). 2 Only a few of these may be important: 1 It needs some form of feature selection. 2 Possibly some type of regularization. Why? 1 It treats all dimensions equally 2 Relevant dimensions are averaged with irrelevant ones 71 / 85
  • 201. Issues with Least Squares Problem with Outliers (Figure: the fitted line with no outliers vs. with outliers.) 72 / 85
  • 202. Issues with Least Squares What about the Linear Machine? Please, run the algorithm and tell me... 73 / 85
  • 203. What to Do About Numerical Stability? Regularity A matrix which is not invertible is also called a singular matrix. A matrix which is invertible (not singular) is called regular. In computations Intuitions: 1 A singular matrix maps an entire linear subspace into a single point. 2 If a matrix maps points far away from each other to points very close to each other, it almost behaves like a singular matrix. The mapping is related to the eigenvalues!!! Large positive eigenvalues ⇒ the mapping stretches distances!!! Small positive eigenvalues ⇒ the mapping shrinks distances, almost collapsing points together!!! 74 / 85
  • 208. Outline 1 Introduction The Simplest Functions Splitting the Space The Decision Surface 2 Developing an Initial Solution Gradient Descent Procedure The Geometry of a Two-Category Linearly-Separable Case Basic Method Minimum Squared Error Procedure The Error Idea The Final Error Equation The Data Matrix Multi-Class Solution Issues with Least Squares!!! What about Numerical Stability? 75 / 85
  • 209. What to Do About Numerical Stability? All this comes from the following statement A positive semi-definite matrix A is singular ⇐⇒ smallest eigenvalue is 0 Consequence for Statistics If a statistical prediction involves the inverse of an almost-singular matrix, the predictions become unreliable (high variance). 76 / 85
  • 211. What can be done? Ridge Regression Ridge regression is a modification of least squares. It tries to make least squares more robust when XT X is almost singular. The solution wRidge = (XT X + λI)−1 XT y (38) where λ is a tuning parameter Thus, we can do the following, given that XT X is positive semi-definite Assume that ξ1, ξ2, ..., ξd+1 are eigenvectors of XT X with eigenvalues λ1, λ2, ..., λd+1: (XT X + λI) ξi = (λi + λ) ξi (39) i.e. λi + λ is an eigenvalue of XT X + λI 77 / 85
  • 214. What does this mean? Something Notable You can control the near-singularity by detecting the smallest eigenvalue. Thus We add an appropriate tuning value λ. 78 / 85
  • 216. Thus, what do we need to do? Process 1 Find the eigenvalues of XT X 2 If all of them are bigger than zero we are fine!!! 3 Find the smallest one, then tune λ if necessary. 4 Build wRidge = (XT X + λI)−1 XT y. (A sketch of this process follows below.) 79 / 85
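A minimal sketch of this process: inspect the eigenvalues of XᵀX and, if the smallest one is (near) zero, add λI before solving. The toy data, which include a nearly duplicated feature column to force near-singularity, and the choice of λ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 100
x1 = rng.normal(size=N)
# Data matrix with bias column, x1, and an almost identical copy of x1 (nearly collinear)
X = np.column_stack([np.ones(N), x1, x1 + 1e-8 * rng.normal(size=N)])
y = 2 * x1 + rng.normal(scale=0.1, size=N)

G = X.T @ X
eigvals = np.linalg.eigvalsh(G)                  # eigenvalues of the symmetric matrix X^T X
print("smallest eigenvalue:", eigvals.min())

lam = 1e-2 if eigvals.min() < 1e-6 else 0.0      # add a ridge term only if nearly singular
w_ridge = np.linalg.solve(G + lam * np.eye(G.shape[0]), X.T @ y)
print(w_ridge)
```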
  • 220. What about Thousands of Features? There is a technique for that Least Absolute Shrinkage and Selection Operator (LASSO), introduced by Robert Tibshirani, that uses the penalty L1 = Σ_{i=1}^{d} |wi|. The Least Squared Error then takes the form (with a tuning parameter λ on the penalty) Σ_{i=1}^{N} (yi − x_i^T w)2 + λ Σ_{i=1}^{d} |wi| (40) However You have other regularizations, such as L2 = Σ_{i=1}^{d} |wi|2 80 / 85
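As a hedged illustration of the L1-penalized objective in Eq. 40, a tiny coordinate-descent LASSO sketch using the standard soft-thresholding update; the data, λ and iteration count are illustrative, and an efficient implementation would follow the pathwise coordinate optimization paper mentioned at the end of these slides.

```python
import numpy as np

def soft_threshold(a, t):
    """Soft-thresholding operator S(a, t) = sign(a) * max(|a| - t, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2)||y - Xw||^2 + lam * ||w||_1."""
    N, d = X.shape
    w = np.zeros(d)
    z = (X ** 2).sum(axis=0)                  # x_j^T x_j for each column j
    for _ in range(n_iter):
        for j in range(d):
            r_j = y - X @ w + X[:, j] * w[j]  # partial residual excluding coordinate j
            rho = X[:, j] @ r_j
            w[j] = soft_threshold(rho, lam) / z[j]
    return w

rng = np.random.default_rng(7)
X = rng.normal(size=(80, 10))
true_w = np.zeros(10)
true_w[:3] = [3.0, -2.0, 1.5]                 # only 3 of the 10 features are relevant
y = X @ true_w + 0.1 * rng.normal(size=80)

print(lasso_cd(X, y, lam=5.0))    # most irrelevant coefficients shrink to exactly 0
```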
  • 223. Graphically The first area corresponds to the L1 regularization ball, and the second one? 81 / 85
  • 224. Graphically Yes, the circle defined by L2 = Σ_{i=1}^{d} |wi|2 82 / 85
  • 225. The seminal paper by Robert Tibshirani An initial study of this regularization can be seen in “Regression Shrinkage and Selection via the LASSO” by Robert Tibshirani - 1996 83 / 85
  • 226. This is out of the scope of this class However, it is worth noticing that the most efficient method for solving LASSO problems is “Pathwise Coordinate Optimization” By Jerome Friedman, Trevor Hastie, Holger Höfling and Robert Tibshirani Nevertheless It would be a great seminar paper!!! 84 / 85
  • 228. Exercises Duda and Hart Chapter 5: 1, 3, 4, 7, 13, 17 Bishop Chapter 4: 4.1, 4.4, 4.7 Theodoridis Chapter 3 - Problems (using Python): 3.6 Chapter 3 - Computer Experiments (using Python): 3.1; (using Python and Newton): 3.2 85 / 85