Neural Networks
Introduction Single-Layer Perceptron
Andres Mendez-Vazquez
July 3, 2016
Outline
1 Introduction
History
2 Adaptive Filtering Problem
Definition
Description of the Behavior of the System
3 Unconstrained Optimization
Introduction
Method of Steepest Descent
Newton’s Method
Gauss-Newton Method
4 Linear Least-Squares Filter
Introduction
Least-Mean-Square (LMS) Algorithm
Convergence of the LMS
5 Perceptron
Objective
Perceptron: Local Field of a Neuron
Perceptron: One Neuron Structure
Deriving the Algorithm
Under Linear Separability - Convergence happens!!!
Proof
Algorithm Using Error-Correcting
Final Perceptron Algorithm (One Version)
Other Algorithms for the Perceptron
History
At the beginning of Neural Networks (1943 - 1958)
McCulloch and Pitts (1943) for introducing the idea of neural
networks as computing machines.
Hebb (1949) for postulating the first rule for self-organized learning.
Rosenblatt (1958) for proposing the perceptron as the first model for
learning with a teacher (i.e., supervised learning).
In this chapter, we are interested in the perceptron
The perceptron is the simplest form of a neural network used for the
classification of patterns said to be linearly separable (i.e., patterns that
lie on opposite sides of a hyperplane).
In addition
Something Notable
The single neuron also forms the basis of an adaptive filter.
A functional block that is basic to the ever-expanding subject of
signal processing.
Furthermore
The development of adaptive filtering owes much to the classic paper of
Widrow and Hoff (1960) for pioneering the so-called least-mean-square
(LMS) algorithm, also known as the delta rule.
Adaptive Filtering Problem
Consider a dynamical system
[Figure: an unknown dynamical system driven by a set of inputs and producing a single output]
Signal-Flow Graph of Adaptive Model
We have the following equivalence
[Figure: signal-flow graph of the adaptive model built around a single linear neuron, whose output is compared with the desired response to form the error]
Description of the Behavior of the System
We have the data set
T = {(x(i), d(i)) | i = 1, 2, ..., n, ...}   (1)
Where
x(i) = (x1(i), x2(i), ..., xm(i))^T   (2)
The Stimulus x(i)
The stimulus x(i) can arise in two ways:
The m elements of x(i) originate at different points in space (spatial).
The m elements of x(i) represent the set of present and (m − 1) past values of some excitation that are uniformly spaced in time (temporal).
Problem
Quite important
How do we design a multiple-input, single-output model of the unknown dynamical system?
Moreover
We want to build this around a single neuron!!!
Thus, we have the following...
We need an algorithm to control the weight adjustment of the neuron
[Figure: the single neuron with a control algorithm adjusting its synaptic weights]
Which steps do you need for the algorithm?
First
The algorithm starts from an arbitrary setting of the neuron's synaptic weights.
Second
Adjustments, with respect to changes in the environment, are made on a continuous basis.
Time is incorporated into the algorithm.
Third
Computation of adjustments to the synaptic weights is completed inside a time interval that is one sampling period long.
Thus, This Neural Model ≈ Adaptive Filter with two continuous processes
Filtering processes
1 An output, denoted by y(i), that is produced in response to the m
elements of the stimulus vector x (i).
2 An error signal, e (i), that is obtained by comparing the output y(i)
to the corresponding desired output d(i) produced by the unknown
system.
Adaptive Process
It involves the automatic adjustment of the synaptic weights of the neuron
in accordance with the error signal e(i)
Remark
The combination of these two processes working together constitutes a
feedback loop acting around the neuron.
Thus
The output y(i) is exactly the same as the induced local field v(i)
y(i) = v(i) = Σ_{k=1}^{m} w_k(i) x_k(i)   (3)
In matrix form, we have - remember we only have one neuron, so we do not need a neuron index k
y(i) = x^T(i) w(i)   (4)
Error
e(i) = d(i) − y(i)   (5)
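A minimal code sketch of these two relations (added here for illustration; the use of NumPy and the function name are my own assumptions, not part of the original slides):

import numpy as np

def filter_step(w, x, d):
    # Output / induced local field, Eq. (4): y = x^T w
    y = np.dot(w, x)
    # Error signal, Eq. (5): e = d - y
    e = d - y
    return y, e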
Consider
A continuously differentiable function J(w)
We want to find an optimal solution w* such that
J(w*) ≤ J(w), ∀w   (6)
We want to
Minimize the cost function J(w) with respect to the weight vector w.
For this, the necessary condition for optimality is
∇J(w*) = 0   (7)
Where
∇ is the gradient operator
∇ = [∂/∂w1, ∂/∂w2, ..., ∂/∂wm]^T   (8)
Thus
∇J(w) = [∂J(w)/∂w1, ∂J(w)/∂w2, ..., ∂J(w)/∂wm]^T   (9)
Thus
Starting with an initial guess denoted by w(0),
Then, generate a sequence of weight vectors w (1) , w (2) , ...
Such that you can reduce J (w) at each iteration
J (w (n + 1)) < J (w (n)) (10)
Where: w (n) is the old value and w (n + 1) is the new value.
The Three Main Methods for Unconstrained Optimization
We will look at
1 Steepest Descent
2 Newton's Method
3 Gauss-Newton Method
Steepest Descent
In the method of steepest descent, we have a cost function J(w) and the update rule
w(n + 1) = w(n) − η ∇J(w(n))
How do we prove that J(w(n + 1)) < J(w(n))?
We use the first-order Taylor series expansion around w(n)
J(w(n + 1)) ≈ J(w(n)) + ∇J^T(w(n)) ∆w(n)   (11)
Remark: This is a good approximation when the step size is small. In addition, ∆w(n) = w(n + 1) − w(n)
Why? Look at the case in R
The equation of the tangent line to the curve y = J(w) at the point w(n)
L(w(n + 1)) = J'(w(n)) [w(n + 1) − w(n)] + J(w(n))   (12)
Example
[Figure: the curve J(w) and its tangent line at w(n)]
Thus, we have that in R
Remember something quite classic
tan θ = [J(w(n + 1)) − J(w(n))] / [w(n + 1) − w(n)]
tan θ · (w(n + 1) − w(n)) = J(w(n + 1)) − J(w(n))
J'(w(n)) · (w(n + 1) − w(n)) = J(w(n + 1)) − J(w(n))
Thus, we have that
Using the first-order Taylor expansion
J(w(n + 1)) ≈ J(w(n)) + J'(w(n)) [w(n + 1) − w(n)]   (13)
Now, for Many Variables
A hyperplane in R^n is a set of the form
H = {x | a^T x = b}   (14)
Given x ∈ H and x0 ∈ H
b = a^T x = a^T x0
Thus, we have that
H = {x | a^T (x − x0) = 0}
Thus, we have the following definition
Definition (Differentiability)
Assume that J is defined in a disk D containing w(n). We say that J is differentiable at w(n) if:
1 ∂J(w(n))/∂wi exists for all i = 1, ..., m.
2 J is locally linear at w(n).
Thus, given J(w(n))
We know that we have the following operator
∇ = (∂/∂w1, ∂/∂w2, ..., ∂/∂wm)   (15)
Thus, we have
∇J(w(n)) = (∂J(w(n))/∂w1, ∂J(w(n))/∂w2, ..., ∂J(w(n))/∂wm) = Σ_{i=1}^{m} ŵi ∂J(w(n))/∂wi
Where ŵi is the i-th unit coordinate vector, e.g., ŵ1^T = (1, 0, ..., 0) ∈ R^m
Now
Given a curve r(t) that lies on the level set J(w) = c
[Figure: a level surface of J in R^3 with a curve r(t) lying on it]
Level Set
Definition
{(w1, w2, ..., wm) ∈ R^m | J(w1, w2, ..., wm) = c}   (16)
Remark: In a standard Calculus course we would use x and f instead of w and J.
Where
Any curve has the following parametrization
r : [a, b] → R^m,  r(t) = (w1(t), ..., wm(t))
With r(n + 1) = (w1(n + 1), ..., wm(n + 1))
We can write the parametrized version of it
z(t) = J(w1(t), w2(t), ..., wm(t)) = c   (17)
Differentiating with respect to t and using the chain rule for multiple variables
dz(t)/dt = Σ_{i=1}^{m} [∂J(w(t))/∂wi] · [dwi(t)/dt] = 0   (18)
Note
First
Given y = f(u) = (f1(u), ..., fl(u)) and u = g(x) = (g1(x), ..., gm(x)).
We have then that
∂(f1, f2, ..., fl)/∂(x1, x2, ..., xk) = [∂(f1, f2, ..., fl)/∂(g1, g2, ..., gm)] · [∂(g1, g2, ..., gm)/∂(x1, x2, ..., xk)]   (19)
Thus
∂(f1, f2, ..., fl)/∂xi = [∂(f1, f2, ..., fl)/∂(g1, g2, ..., gm)] · [∂(g1, g2, ..., gm)/∂xi] = Σ_{k=1}^{m} [∂(f1, f2, ..., fl)/∂gk] · [∂gk/∂xi]
Thus
Evaluating at t = n
Σ_{i=1}^{m} [∂J(w(n))/∂wi] · [dwi(n)/dt] = 0
We have that
∇J(w(n)) · r'(n) = 0   (20)
This proves that, on every level set, the gradient is perpendicular to the tangent of any curve that lies on the level set
In particular, at the point w(n).
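A quick worked example (added for illustration, not in the original): take J(w) = w1² + w2². The level set J(w) = c is a circle, parametrized by r(t) = (√c cos t, √c sin t), so r'(t) = (−√c sin t, √c cos t), while ∇J(r(t)) = (2√c cos t, 2√c sin t). Their dot product is −2c cos t sin t + 2c sin t cos t = 0, so the gradient is indeed perpendicular to the tangent of the level curve.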
Now the tangent plane to the surface can be described generally
Thus
L(w(n + 1)) = J(w(n)) + ∇J^T(w(n)) [w(n + 1) − w(n)]   (21)
This looks like
[Figure: the tangent plane to the surface of J at the point w(n)]
Proving the fact about the Steepest Descent
We want the following
J(w(n + 1)) < J(w(n))
Using the first-order Taylor approximation
J(w(n + 1)) − J(w(n)) ≈ ∇J^T(w(n)) ∆w(n)
So, we choose the following update
∆w(n) = −η ∇J(w(n)) with η > 0
Then
We have that
J(w(n + 1)) − J(w(n)) ≈ −η ∇J^T(w(n)) ∇J(w(n)) = −η ||∇J(w(n))||²
Thus
J(w(n + 1)) − J(w(n)) < 0
Or
J(w(n + 1)) < J(w(n))
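A minimal sketch of the steepest-descent iteration just justified (added for illustration; the callable grad_J, the step size, and the iteration count are assumptions of this example, not part of the original slides):

import numpy as np

def steepest_descent(grad_J, w0, eta=0.1, n_iters=100):
    # Iterate w(n+1) = w(n) - eta * grad_J(w(n))
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iters):
        w = w - eta * grad_J(w)  # move against the gradient
    return w

# Example: J(w) = ||w||^2 has gradient 2w and minimum at the origin
w_star = steepest_descent(lambda w: 2.0 * w, w0=[3.0, -2.0], eta=0.1)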
Newton's Method
Here
The basic idea of Newton's method is to minimize the quadratic approximation of the cost function J(w) around the current point w(n).
Using a second-order Taylor series expansion of the cost function around the point w(n)
∆J(w(n)) = J(w(n + 1)) − J(w(n)) ≈ ∇J^T(w(n)) ∆w(n) + (1/2) ∆w^T(n) H(n) ∆w(n)
Where, given that w(n) is a vector with dimension m, H is the m × m Hessian matrix
H = ∇²J(w) =
[ ∂²J(w)/∂w1²     ∂²J(w)/∂w1∂w2   ...   ∂²J(w)/∂w1∂wm ]
[ ∂²J(w)/∂w2∂w1   ∂²J(w)/∂w2²     ...   ∂²J(w)/∂w2∂wm ]
[ ...             ...             ...   ...            ]
[ ∂²J(w)/∂wm∂w1   ∂²J(w)/∂wm∂w2   ...   ∂²J(w)/∂wm²    ]
Now, we want to minimize J(w(n + 1))
Do you have any idea?
Look again
J(w(n)) + ∇J^T(w(n)) ∆w(n) + (1/2) ∆w^T(n) H(n) ∆w(n)   (22)
Differentiate with respect to ∆w(n) and set the result to zero
∇J(w(n)) + H(n) ∆w(n) = 0   (23)
Thus
∆w(n) = −H^{-1}(n) ∇J(w(n))
The Final Method
Using ∆w(n) = w(n + 1) − w(n), the update satisfies
w(n + 1) − w(n) = −H^{-1}(n) ∇J(w(n))
Then
w(n + 1) = w(n) − H^{-1}(n) ∇J(w(n))
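A minimal sketch of this Newton iteration (added for illustration; it assumes the gradient and Hessian of the cost are available as Python callables and that the Hessian is nonsingular):

import numpy as np

def newton_method(grad_J, hess_J, w0, n_iters=20):
    # Iterate w(n+1) = w(n) - H^{-1}(n) grad_J(w(n))
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iters):
        g = grad_J(w)                  # gradient at w(n)
        H = hess_J(w)                  # Hessian at w(n)
        w = w - np.linalg.solve(H, g)  # solve H * dw = g rather than inverting H
    return w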
We have then an error
Something Notable
J(w) = (1/2) Σ_{i=1}^{n} e²(i)
Thus, using the first-order Taylor expansion, we linearize the error around w(n)
e_l(i, w) = e(i) + [∂e(i)/∂w]^T [w − w(n)]
In matrix form
e_l(n, w) = e(n) + J(n) [w − w(n)]
Where
The error vector is equal to
e(n) = [e(1), e(2), ..., e(n)]^T   (24)
Thus, we get the famous Jacobian once we take the derivatives ∂e(i)/∂w
J(n) =
[ ∂e(1)/∂w1   ∂e(1)/∂w2   ...   ∂e(1)/∂wm ]
[ ∂e(2)/∂w1   ∂e(2)/∂w2   ...   ∂e(2)/∂wm ]
[ ...         ...         ...   ...        ]
[ ∂e(n)/∂w1   ∂e(n)/∂w2   ...   ∂e(n)/∂wm ]
Where
We want the following
w(n + 1) = arg min_w (1/2) ||e_l(n, w)||²
Ideas
What if we expand out the equation?
Expanded Version
We get
(1/2) ||e_l(n, w)||² = (1/2) ||e(n)||² + e^T(n) J(n) (w − w(n)) + (1/2) (w − w(n))^T J^T(n) J(n) (w − w(n))
Now, Differentiating, we get
Differentiating the expanded equation with respect to w and setting it to zero
J^T(n) e(n) + J^T(n) J(n) [w − w(n)] = 0
We finally get the Gauss-Newton update
w(n + 1) = w(n) − [J^T(n) J(n)]^{-1} J^T(n) e(n)   (25)
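A minimal sketch of the Gauss-Newton update of Eq. (25) (added for illustration; residual_and_jacobian is a hypothetical helper assumed to return the error vector e(n) and its Jacobian J(n) with respect to w):

import numpy as np

def gauss_newton(residual_and_jacobian, w0, n_iters=20):
    # Iterate w(n+1) = w(n) - (J^T J)^{-1} J^T e
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iters):
        e, J = residual_and_jacobian(w)           # error vector and its Jacobian
        step = np.linalg.solve(J.T @ J, J.T @ e)  # requires J^T J to be nonsingular
        w = w - step
    return w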
Remarks
We have that
Newton's method requires knowledge of the Hessian matrix of the cost function.
The Gauss-Newton method only requires the Jacobian matrix of the error vector e(n).
However
For the Gauss-Newton iteration to be computable, the matrix product J^T(n) J(n) must be nonsingular!!!
Introduction
A linear least-squares filter has two distinctive characteristics
First, the single neuron around which it is built is linear.
Second, the cost function J(w) used to design the filter consists of the sum of error squares.
Thus, expressing the error
e(n) = d(n) − [x(1), ..., x(n)]^T w(n)
Short version - the error is linear in the weight vector w(n)
e(n) = d(n) − X(n) w(n)
Where d(n) is an n × 1 desired response vector.
Where X(n) is the n × m data matrix.
Now, differentiate e(n) with respect to w(n)
Thus
∇e(n) = −X^T(n)
Correspondingly, the Jacobian of e(n) is
J(n) = −X(n)
Let us use the Gauss-Newton update
w(n + 1) = w(n) − [J^T(n) J(n)]^{-1} J^T(n) e(n)
Thus
We have the following
w(n + 1) = w(n) − [(−X^T(n)) · (−X(n))]^{-1} · (−X^T(n)) [d(n) − X(n) w(n)]
We have then
w(n + 1) = w(n) + [X^T(n) X(n)]^{-1} X^T(n) d(n) − [X^T(n) X(n)]^{-1} X^T(n) X(n) w(n)
Thus, we have
w(n + 1) = w(n) + [X^T(n) X(n)]^{-1} X^T(n) d(n) − w(n) = [X^T(n) X(n)]^{-1} X^T(n) d(n)
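A minimal sketch of this closed-form solution (added for illustration; X is the n × m data matrix and d the n × 1 desired-response vector defined above):

import numpy as np

def linear_least_squares(X, d):
    # w = (X^T X)^{-1} X^T d, i.e., the pseudo-inverse of X applied to d
    return np.linalg.solve(X.T @ X, X.T @ d)

# Note: np.linalg.lstsq(X, d, rcond=None)[0] computes the same solution in a numerically safer way.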
Again Our Error Cost Function
We have
J(w) = (1/2) e²(n)
where e(n) is the error signal measured at time n.
Again, differentiating with respect to the vector w
∂J(w)/∂w = e(n) ∂e(n)/∂w
The LMS algorithm operates with a linear neuron, so we may express the error signal as
e(n) = d(n) − x^T(n) w(n)   (26)
We have
Something Notable
∂e(n)/∂w = −x(n)
Then
∂J(w)/∂w = −x(n) e(n)
Using this as an estimate of the gradient vector, the gradient descent becomes
w(n + 1) = w(n) + η x(n) e(n)   (27)
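A minimal sketch of the LMS update of Eq. (27) (added for illustration; the stream of (x(n), d(n)) pairs, the dimensionality m, and the learning rate eta are assumptions of this example):

import numpy as np

def lms(samples, m, eta=0.01):
    # Run w(n+1) = w(n) + eta * x(n) * e(n) over a stream of (x, d) pairs
    w = np.zeros(m)
    for x, d in samples:
        x = np.asarray(x, dtype=float)
        e = d - np.dot(w, x)  # error signal e(n) = d(n) - x^T(n) w(n), Eq. (26)
        w = w + eta * x * e   # LMS weight update, Eq. (27)
    return w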
Remarks
The feedback loop around the weight vector
It behaves like a low-pass filter.
It passes the low-frequency components of the error signal and attenuates its high-frequency components.
[Figure: an RC low-pass filter circuit, with input Vin and output Vout, as an analogy]
Actually
Thus
The average time constant of this filtering action is inversely proportional to the learning-rate parameter η.
Thus
By assigning a small value to η, the adaptive process progresses slowly.
Thus
More of the past data are remembered by the LMS algorithm.
Thus, the LMS algorithm becomes a more accurate filter.
Convergence of the LMS
This convergence depends on the following points
The statistical characteristics of the input vector x(n).
The learning-rate parameter η.
Something Notable
However, instead of using convergence of E[w(n)] as n → ∞, we use convergence in the mean square: E[e²(n)] → constant as n → ∞
To make this analysis practical
We take the following assumptions
The successive input vectors x(1), x(2), ... are statistically independent of each other.
At time step n, the input vector x(n) is statistically independent of all previous samples of the desired response, namely d(1), d(2), ..., d(n − 1).
At time step n, the desired response d(n) is dependent on x(n), but statistically independent of all previous values of the desired response.
The input vector x(n) and desired response d(n) are drawn from Gaussian-distributed populations.
We get the following
The LMS is convergent in the mean square provided that η satisfies
0 < η < 2/λmax   (28)
Where λmax is the largest eigenvalue of the sample correlation matrix Rx
This can be difficult to obtain in practice... then we use the trace instead
0 < η < 2/trace[Rx]   (29)
However, each diagonal element of Rx is equal to the mean-squared value of the corresponding sensor input
We can re-state the previous condition as
0 < η < 2/(sum of the mean-square values of the sensor inputs)   (30)
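A small sketch of how the bound in Eqs. (29)-(30) could be estimated from data (added for illustration; X is assumed to hold one input vector per row):

import numpy as np

def lms_eta_upper_bound(X):
    # Estimate Rx as the sample correlation matrix of the inputs
    Rx = (X.T @ X) / X.shape[0]
    # 2 / trace(Rx) equals 2 / (sum of mean-square values of the sensor inputs)
    return 2.0 / np.trace(Rx)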
Virtues and Limitations of the LMS Algorithm
Virtues
An important virtue of the LMS algorithm is its simplicity.
The algorithm is model-independent and robust to disturbances (small disturbances lead to small estimation errors).
Not only that, the LMS algorithm is optimal in accordance with the minimax criterion:
If you do not know what you are up against, plan for the worst and optimize.
Primary Limitation
The slow rate of convergence and sensitivity to variations in the eigenstructure of the input.
The LMS algorithm typically requires a number of iterations equal to about 10 times the dimensionality of the input space to reach convergence.
More of this in...
Simon Haykin, Adaptive Filter Theory (3rd Edition)
Exercises
From Neural Networks by Haykin:
3.1, 3.2, 3.3, 3.4, 3.5, 3.7 and 3.8
Objective
Goal
Correctly classify a series of samples (externally applied stimuli) x1, x2, x3, ..., xm into one of two classes, C1 and C2.
Output for each input
1 Class C1: output y = +1.
2 Class C2: output y = −1.
History
Frank Rosenblatt
The perceptron algorithm was invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt.
Something Notable
Frank Rosenblatt was a psychologist!!! Working at a military R&D laboratory!!!
Frank Rosenblatt
He helped to develop the Mark I Perceptron - a new machine based on the connectivity of neural networks!!!
Some problems with it
The most important is that a perceptron with a single neuron cannot solve the XOR problem
Perceptron: Local Field of a Neuron
Signal-Flow
[Figure: signal-flow graph of the perceptron, with inputs x1, ..., xm, synaptic weights w1, ..., wm, and a bias b feeding the summing node]
Induced local field of a neuron
v = Σ_{i=1}^{m} wi xi + b   (31)
Perceptron: One Neuron Structure
Based on the previous induced local field
In the simplest form of the perceptron there are two decision regions separated by a hyperplane:
Σ_{i=1}^{m} wi xi + b = 0   (32)
Example with two signals
[Figure: decision boundary separating class C1 from class C2 in the plane of the two input signals]
Deriving the Algorithm
First, you put the signals together (with the bias absorbed as the fixed input x0 = 1)
x(n) = [1, x1(n), x2(n), ..., xm(n)]^T   (33)
Weights
v(n) = Σ_{i=0}^{m} wi(n) xi(n) = w^T(n) x(n)   (34)
Note IMPORTANT - The perceptron works only if C1 and C2 are linearly separable
[Figure: decision boundary separating the two linearly separable classes]
Rule for Linearly Separable Classes
There must exist a vector w such that
1 w^T x > 0 for every input vector x belonging to class C1.
2 w^T x ≤ 0 for every input vector x belonging to class C2.
What is the derivative dv(n)/dw?
dv(n)/dw = x(n)   (35)
Finally
No correction is necessary
1 w(n + 1) = w(n) if w^T(n) x(n) > 0 and x(n) belongs to class C1.
2 w(n + 1) = w(n) if w^T(n) x(n) ≤ 0 and x(n) belongs to class C2.
Correction is necessary
1 w(n + 1) = w(n) − η(n) x(n) if w^T(n) x(n) > 0 and x(n) belongs to class C2.
2 w(n + 1) = w(n) + η(n) x(n) if w^T(n) x(n) ≤ 0 and x(n) belongs to class C1.
Where η(n) is a learning-rate parameter that adjusts the size of the correction.
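A minimal sketch of this error-correcting rule (added for illustration; it assumes X holds one augmented input [1, x1, ..., xm] per row and y holds the labels, +1 for class C1 and −1 for class C2):

import numpy as np

def perceptron_train(X, y, eta=1.0, n_epochs=100):
    # Apply the correction rules until there are no misclassifications (or the epoch limit is hit)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        errors = 0
        for x_n, y_n in zip(X, y):
            v = np.dot(w, x_n)         # induced local field w^T(n) x(n)
            if y_n == +1 and v <= 0:   # x(n) in C1 but classified as C2
                w = w + eta * x_n
                errors += 1
            elif y_n == -1 and v > 0:  # x(n) in C2 but classified as C1
                w = w - eta * x_n
                errors += 1
        if errors == 0:                # converged (the linearly separable case)
            break
    return w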
A Little Bit on the Geometry
For example, w(n + 1) = w(n) − η(n) x(n)
[Figure: geometric illustration of this correction, showing w(n), x(n), and the updated weight vector w(n + 1)]
Under Linear Separability - Convergence happens!!!
If we assume
Linear separability for the classes C1 and C2.
Rosenblatt - 1962
Let the subsets of training vectors C1 and C2 be linearly separable.
Let the inputs presented to the perceptron originate from these two subsets.
The perceptron converges after some n0 iterations, in the sense that
w(n0) = w(n0 + 1) = w(n0 + 2) = ...   (36)
is a solution vector for n0 ≤ nmax
Proof I - First, a Lower Bound for ‖w(n + 1)‖^2
Initialization
w(0) = 0 (37)
Now assume that for time n = 1, 2, 3, ...
w^T(n)x(n) < 0 (38)
with x(n) belonging to class C1.
THE PERCEPTRON INCORRECTLY CLASSIFIES THE VECTORS x(1), x(2), ...
Apply the correction formula (with η(n) = 1)
w(n + 1) = w(n) + x(n) (39)
82 / 101
Proof II
Apply the correction iteratively
w(n + 1) = x(1) + x(2) + ... + x(n) (40)
We know that there is a solution w0 (Linear Separability)
α = min_{x(n)∈C1} w0^T x(n) (41)
Then, we have
w0^T w(n + 1) = w0^T x(1) + w0^T x(2) + ... + w0^T x(n) (42)
83 / 101
Proof IV
Thus, using the definition of α,
w0^T w(n + 1) ≥ nα (46)
Now, by the Cauchy-Schwarz inequality
‖w0‖^2 ‖w(n + 1)‖^2 ≥ [w0^T w(n + 1)]^2 (47)
where ‖·‖ is the Euclidean norm.
Thus
‖w0‖^2 ‖w(n + 1)‖^2 ≥ n^2 α^2
‖w(n + 1)‖^2 ≥ n^2 α^2 / ‖w0‖^2
85 / 101
Proof V - Now an Upper Bound for ‖w(n + 1)‖^2
Now rewrite equation 39 as
w(k + 1) = w(k) + x(k) (48)
for k = 1, 2, ..., n and x(k) ∈ C1.
Squaring the Euclidean norm of both sides
‖w(k + 1)‖^2 = ‖w(k)‖^2 + ‖x(k)‖^2 + 2w^T(k)x(k) (49)
Now, using the fact that w^T(k)x(k) < 0,
‖w(k + 1)‖^2 ≤ ‖w(k)‖^2 + ‖x(k)‖^2
‖w(k + 1)‖^2 − ‖w(k)‖^2 ≤ ‖x(k)‖^2
86 / 101
Proof VI
Use the telescopic sum
Σ_{k=0}^{n} [‖w(k + 1)‖^2 − ‖w(k)‖^2] ≤ Σ_{k=0}^{n} ‖x(k)‖^2 (50)
Assume
w(0) = 0
x(0) = 0
Thus
‖w(n + 1)‖^2 ≤ Σ_{k=1}^{n} ‖x(k)‖^2
87 / 101
Proof VII
Then, we can define a positive number
β = max_{x(k)∈C1} ‖x(k)‖^2 (51)
Thus
‖w(n + 1)‖^2 ≤ Σ_{k=1}^{n} ‖x(k)‖^2 ≤ nβ
The lower bound grows like n^2 while the upper bound grows only like n, so both can hold at the same time only up to some nmax (our "sandwich" argument), where equality gives
nmax^2 α^2 / ‖w0‖^2 = nmax β (52)
88 / 101
Proof VIII
Solving for nmax
nmax = β ‖w0‖^2 / α^2 (53)
Thus
For η(n) = 1 for all n, w(0) = 0, and a solution vector w0:
The rule for adapting the synaptic weights of the perceptron must terminate after at most nmax steps.
In addition
Since w0 is only one possible separating vector, the solution is not unique.
89 / 101
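As a sanity check of the bound (53), here is a small Python (NumPy) experiment on synthetic linearly separable data: it counts the corrections actually performed with η(n) = 1 and w(0) = 0 and compares them with β‖w0‖^2/α^2. The data generator, the margin 0.2, and the use of the generating vector w_true in the role of the solution w0 are assumptions made only for this illustration.

import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable data in 2D: labels come from a known separating vector.
w_true = np.array([2.0, -1.0])
X = rng.uniform(-1, 1, size=(200, 2))
X = X[np.abs(X @ w_true) > 0.2]          # keep a margin so alpha > 0
d = np.sign(X @ w_true)                  # +1 for C1, -1 for C2

# Fold the labels into the samples so every correction is w <- w + x, as in the proof.
Z = X * d[:, None]

# Run the perceptron (eta = 1, w(0) = 0) and count corrections.
w, corrections = np.zeros(2), 0
changed = True
while changed:
    changed = False
    for z in Z:
        if w @ z <= 0:                   # misclassified: apply the correction formula
            w += z
            corrections += 1
            changed = True

# Bound (53) evaluated with w_true playing the role of the solution vector w0.
alpha = np.min(Z @ w_true)
beta = np.max(np.sum(Z ** 2, axis=1))
n_max = beta * np.dot(w_true, w_true) / alpha ** 2
print(corrections, "corrections; bound n_max =", n_max)   # corrections <= n_max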
Algorithm Using Error-Correcting
Now, if we use the cost (1/2) e_k(n)^2
We can actually simplify the rules and the final algorithm!!!
Thus, we have the following Delta Value
∆w(n) = η (dj − yj(n)) x(n) (54)
91 / 101
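As a sketch of how (54) collapses the two correction cases into a single rule, consider the following Python (NumPy) fragment; the function name, the default η, and the assumption that d and y take values in {+1, −1} are illustrative, not part of the slides.

import numpy as np

def delta_update(w, x, d, y, eta=0.1):
    # Equation (54): one rule covers both correction cases,
    # since (d - y) is 0, +2 or -2 for a +/-1 threshold output.
    return w + eta * (d - y) * x

With d, y ∈ {+1, −1}, the factor (d − y) is 0 for correct classifications and ±2 for mistakes, so this reproduces the earlier ±η(n)x(n) corrections with the factor of 2 absorbed into η.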
We could use the previous Rule
In order to generate an algorithm
However, you need classes that are linearly separable!!!
Therefore, we can use a more general Gradient Descent Rule
To obtain an algorithm for the best separating hyperplane!!!
93 / 101
Gradient Descent Algorithm
Algorithm - Off-line/Batch Learning
1 Set n = 0.
2 Set dj = +1 if xj ∈ Class 1 and dj = −1 if xj ∈ Class 2, for all j = 1, 2, ..., m.
3 Initialize the weights, w^T = (w1(n), w2(n), ..., wn(n)).
Weights may be initialized to 0 or to a small random value.
4 Initialize dummy outputs y^T = (y1(n), y2(n), ..., ym(n)) so you can enter the loop.
5 Initialize the stopping error ε > 0.
6 Initialize the learning rate η.
7 While (1/m) Σ_{j=1}^{m} |dj − yj(n)| > ε:
For each sample (xj, dj), j = 1, ..., m:
Calculate the output yj = ϕ(w^T(n) · xj)
Update the weights wi(n + 1) = wi(n) + η (dj − yj(n)) xij
n = n + 1
94 / 101
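A possible rendering of this batch loop in Python (NumPy) is sketched below; the sign activation for ϕ, the small random initialization, and the max_epochs safeguard are assumptions not spelled out on the slide.

import numpy as np

def train_perceptron_gd(X, d, eta=0.1, eps=1e-3, max_epochs=1000, rng=None):
    """Sketch of the loop above: X is (m, p), d holds +/-1 labels."""
    rng = np.random.default_rng(rng)
    m, p = X.shape
    w = 0.01 * rng.standard_normal(p)            # small random initialization
    y = np.zeros(m)                              # dummy outputs to enter the loop
    n = 0
    while np.mean(np.abs(d - y)) > eps and n < max_epochs:
        for j in range(m):
            y[j] = np.sign(w @ X[j])             # y_j = phi(w^T x_j)
            w = w + eta * (d[j] - y[j]) * X[j]   # w_i <- w_i + eta (d_j - y_j) x_ij
        n += 1
    return w, n

Note that for data that are not linearly separable the mean error may never fall below ε, which is why the max_epochs guard (an addition not in the slides) is included here.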
Nevertheless
We have the following problem: choosing
ε > 0 (55)
Thus...
Convergence to the best linear separation is a tweaking business!!!
95 / 101
However, if we limit our features!!!
The Winnow Algorithm!!!
It converges even without linear separability.
Feature Vector
Boolean-valued features, X = {0, 1}^d
Weight Vector w
1 w^T = (w1, w2, ..., wp) with wi ∈ R
2 For all i, wi ≥ 0.
97 / 101
Classification Scheme
We use a specific threshold θ
1 w^T x ≥ θ ⇒ positive classification (x assigned to Class 1)
2 w^T x < θ ⇒ negative classification (x assigned to Class 2)
Rule
We use two possible Rules for training!!! With a learning rate of α > 1.
Rule 1
When misclassifying a positive training example x ∈ Class 1, i.e. w^T x < θ:
∀xi = 1 : wi ← αwi (56)
98 / 101
Classification Scheme
Rule 2
When misclassifying a negative training example x ∈ Class 2, i.e. w^T x ≥ θ:
∀xi = 1 : wi ← wi / α (57)
Rule 3
If samples are correctly classified, do nothing!!!
99 / 101
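Putting the threshold test and Rules 1-3 together, a sketch of Winnow in Python (NumPy) could look as follows; initializing w to all ones and setting θ equal to the number of features are common choices assumed here, not taken from the slides.

import numpy as np

def winnow_train(X, labels, alpha=2.0, theta=None, epochs=20):
    """Winnow sketch: X is an (m, d) 0/1 matrix, labels are +1 (Class 1) / -1 (Class 2)."""
    m, d = X.shape
    theta = float(d) if theta is None else theta
    w = np.ones(d)                              # assumed initialization
    for _ in range(epochs):
        for x, label in zip(X, labels):
            predicted_positive = (w @ x >= theta)
            if label == +1 and not predicted_positive:
                w[x == 1] *= alpha              # Rule 1: promote active weights
            elif label == -1 and predicted_positive:
                w[x == 1] /= alpha              # Rule 2: demote active weights
            # Rule 3: correctly classified -> no change
    return w

The multiplicative updates on only the active features are what give Winnow its robustness to many irrelevant variables, the property mentioned on the next slide.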
Properties of Winnow
Property
If there are many irrelevant variables, Winnow performs better than the Perceptron.
Drawback
It is sensitive to the learning rate α.
100 / 101
The Pocket Algorithm
A variant of the Perceptron Algorithm
It was suggested by Gallant in "Perceptron-based learning algorithms," IEEE Transactions on Neural Networks, Vol. 1(2), pp. 179-191, 1990.
It converges to an optimal solution even if linear separability is not fulfilled.
It consists of the following steps
1 Initialize the weight vector w(0) in a random way.
2 Define a storage (pocket) vector ws and a history counter hs, both initialized to zero.
3 At the nth iteration step compute the update w(n + 1) using the Perceptron rule.
4 Use the updated weights to find the number h of samples correctly classified.
5 If at any moment h > hs, replace ws with w(n + 1) and hs with h.
101 / 101
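A sketch of the pocket idea in Python (NumPy) is given below; the fixed number of epochs and the re-evaluation of h over the whole training set after every correction are assumptions about details the slide leaves open.

import numpy as np

def pocket_train(X, d, eta=1.0, epochs=50, rng=None):
    """Pocket sketch: run the ordinary Perceptron rule, but keep in the 'pocket'
    (ws, hs) the weights that classified the most samples so far.
    X is (m, p), d holds +/-1 labels; epochs is an assumed stopping criterion."""
    rng = np.random.default_rng(rng)
    m, p = X.shape
    w = rng.standard_normal(p)                    # step 1: random initialization
    ws, hs = np.zeros(p), 0                       # step 2: pocket vector and counter
    for _ in range(epochs):
        for j in range(m):
            if d[j] * (w @ X[j]) <= 0:            # step 3: Perceptron correction on a mistake
                w = w + eta * d[j] * X[j]
                h = np.sum(np.sign(X @ w) == d)   # step 4: samples correctly classified
                if h > hs:                        # step 5: keep the best weights seen so far
                    ws, hs = w.copy(), h
    return ws, hs

Because ws is only ever replaced when the count of correctly classified samples improves, the returned weights are the best hyperplane encountered during training, even when the data are not linearly separable.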
The Pocket Algorithm
A variant of the Perceptron Algorithm
It was suggested by Geman et al. in
“Perceptron based learning algorithms,” IEEE Transactions on Neural
Networks,Vol. 1(2), pp. 179–191, 1990.
It converges to an optimal solution even if the linear separability is
not fulfilled.
It consists of the following steps
1 Initialize the weight vector w (0) in a random way.
2 Define a storage pocket vector ws and a history counter hs to zero
for the same pocket vector.
3 At the ith iteration step compute the update w (n + 1) using the
Perceptron rule.
4 Use the update weight to find the number h of samples correctly
classified.
5 If at any moment h > hs replace ws with w (n + 1) and hs with h101 / 101
The Pocket Algorithm
A variant of the Perceptron Algorithm
It was suggested by Geman et al. in
“Perceptron based learning algorithms,” IEEE Transactions on Neural
Networks,Vol. 1(2), pp. 179–191, 1990.
It converges to an optimal solution even if the linear separability is
not fulfilled.
It consists of the following steps
1 Initialize the weight vector w (0) in a random way.
2 Define a storage pocket vector ws and a history counter hs to zero
for the same pocket vector.
3 At the ith iteration step compute the update w (n + 1) using the
Perceptron rule.
4 Use the update weight to find the number h of samples correctly
classified.
5 If at any moment h > hs replace ws with w (n + 1) and hs with h101 / 101
The Pocket Algorithm
A variant of the Perceptron Algorithm
It was suggested by Geman et al. in
“Perceptron based learning algorithms,” IEEE Transactions on Neural
Networks,Vol. 1(2), pp. 179–191, 1990.
It converges to an optimal solution even if the linear separability is
not fulfilled.
It consists of the following steps
1 Initialize the weight vector w (0) in a random way.
2 Define a storage pocket vector ws and a history counter hs to zero
for the same pocket vector.
3 At the ith iteration step compute the update w (n + 1) using the
Perceptron rule.
4 Use the update weight to find the number h of samples correctly
classified.
5 If at any moment h > hs replace ws with w (n + 1) and hs with h101 / 101
The Pocket Algorithm
A variant of the Perceptron Algorithm
It was suggested by Geman et al. in
“Perceptron based learning algorithms,” IEEE Transactions on Neural
Networks,Vol. 1(2), pp. 179–191, 1990.
It converges to an optimal solution even if the linear separability is
not fulfilled.
It consists of the following steps
1 Initialize the weight vector w (0) in a random way.
2 Define a storage pocket vector ws and a history counter hs to zero
for the same pocket vector.
3 At the ith iteration step compute the update w (n + 1) using the
Perceptron rule.
4 Use the update weight to find the number h of samples correctly
classified.
5 If at any moment h > hs replace ws with w (n + 1) and hs with h101 / 101
The Pocket Algorithm
A variant of the Perceptron Algorithm
It was suggested by Geman et al. in
“Perceptron based learning algorithms,” IEEE Transactions on Neural
Networks,Vol. 1(2), pp. 179–191, 1990.
It converges to an optimal solution even if the linear separability is
not fulfilled.
It consists of the following steps
1 Initialize the weight vector w (0) in a random way.
2 Define a storage pocket vector ws and a history counter hs to zero
for the same pocket vector.
3 At the ith iteration step compute the update w (n + 1) using the
Perceptron rule.
4 Use the update weight to find the number h of samples correctly
classified.
5 If at any moment h > hs replace ws with w (n + 1) and hs with h101 / 101
The Pocket Algorithm
A variant of the Perceptron Algorithm
It was suggested by Geman et al. in
“Perceptron based learning algorithms,” IEEE Transactions on Neural
Networks,Vol. 1(2), pp. 179–191, 1990.
It converges to an optimal solution even if the linear separability is
not fulfilled.
It consists of the following steps
1 Initialize the weight vector w (0) in a random way.
2 Define a storage pocket vector ws and a history counter hs to zero
for the same pocket vector.
3 At the ith iteration step compute the update w (n + 1) using the
Perceptron rule.
4 Use the update weight to find the number h of samples correctly
classified.
5 If at any moment h > hs replace ws with w (n + 1) and hs with h101 / 101
The Pocket Algorithm
A variant of the Perceptron Algorithm
It was suggested by Geman et al. in
“Perceptron based learning algorithms,” IEEE Transactions on Neural
Networks,Vol. 1(2), pp. 179–191, 1990.
It converges to an optimal solution even if the linear separability is
not fulfilled.
It consists of the following steps
1 Initialize the weight vector w (0) in a random way.
2 Define a storage pocket vector ws and a history counter hs to zero
for the same pocket vector.
3 At the ith iteration step compute the update w (n + 1) using the
Perceptron rule.
4 Use the update weight to find the number h of samples correctly
classified.
5 If at any moment h > hs replace ws with w (n + 1) and hs with h101 / 101
The Pocket Algorithm
A variant of the Perceptron Algorithm
It was suggested by Geman et al. in
“Perceptron based learning algorithms,” IEEE Transactions on Neural
Networks,Vol. 1(2), pp. 179–191, 1990.
It converges to an optimal solution even if the linear separability is
not fulfilled.
It consists of the following steps
1 Initialize the weight vector w (0) in a random way.
2 Define a storage pocket vector ws and a history counter hs to zero
for the same pocket vector.
3 At the ith iteration step compute the update w (n + 1) using the
Perceptron rule.
4 Use the update weight to find the number h of samples correctly
classified.
5 If at any moment h > hs replace ws with w (n + 1) and hs with h101 / 101
The Pocket Algorithm
A variant of the Perceptron Algorithm
It was suggested by Geman et al. in
“Perceptron based learning algorithms,” IEEE Transactions on Neural
Networks,Vol. 1(2), pp. 179–191, 1990.
It converges to an optimal solution even if the linear separability is
not fulfilled.
It consists of the following steps
1 Initialize the weight vector w (0) in a random way.
2 Define a storage pocket vector ws and a history counter hs to zero
for the same pocket vector.
3 At the ith iteration step compute the update w (n + 1) using the
Perceptron rule.
4 Use the update weight to find the number h of samples correctly
classified.
5 If at any moment h > hs replace ws with w (n + 1) and hs with h101 / 101

More Related Content

What's hot

Training Neural Networks
Training Neural NetworksTraining Neural Networks
Training Neural NetworksDatabricks
 
Neural network & its applications
Neural network & its applications Neural network & its applications
Neural network & its applications Ahmed_hashmi
 
Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Gaurav Mittal
 
Lecture 4 neural networks
Lecture 4 neural networksLecture 4 neural networks
Lecture 4 neural networksParveenMalik18
 
Introduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkIntroduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkKnoldus Inc.
 
Radial basis function network ppt bySheetal,Samreen and Dhanashri
Radial basis function network ppt bySheetal,Samreen and DhanashriRadial basis function network ppt bySheetal,Samreen and Dhanashri
Radial basis function network ppt bySheetal,Samreen and Dhanashrisheetal katkar
 
Neural Networks: Rosenblatt's Perceptron
Neural Networks: Rosenblatt's PerceptronNeural Networks: Rosenblatt's Perceptron
Neural Networks: Rosenblatt's PerceptronMostafa G. M. Mostafa
 
Neural Networks and Deep Learning Basics
Neural Networks and Deep Learning BasicsNeural Networks and Deep Learning Basics
Neural Networks and Deep Learning BasicsJon Lederman
 
Learning by analogy
Learning by analogyLearning by analogy
Learning by analogyNitesh Singh
 
Neural Networks: Radial Bases Functions (RBF)
Neural Networks: Radial Bases Functions (RBF)Neural Networks: Radial Bases Functions (RBF)
Neural Networks: Radial Bases Functions (RBF)Mostafa G. M. Mostafa
 
Feature Extraction
Feature ExtractionFeature Extraction
Feature Extractionskylian
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networksSi Haem
 
Soft Computing
Soft ComputingSoft Computing
Soft ComputingMANISH T I
 
Deep Learning With Neural Networks
Deep Learning With Neural NetworksDeep Learning With Neural Networks
Deep Learning With Neural NetworksAniket Maurya
 
Artificial Intelligence techniques
Artificial Intelligence techniquesArtificial Intelligence techniques
Artificial Intelligence techniquesPavan Kumar Talla
 
HML: Historical View and Trends of Deep Learning
HML: Historical View and Trends of Deep LearningHML: Historical View and Trends of Deep Learning
HML: Historical View and Trends of Deep LearningYan Xu
 

What's hot (20)

Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural network
 
Training Neural Networks
Training Neural NetworksTraining Neural Networks
Training Neural Networks
 
intelligent system
intelligent systemintelligent system
intelligent system
 
Neural network & its applications
Neural network & its applications Neural network & its applications
Neural network & its applications
 
Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)
 
Lecture 4 neural networks
Lecture 4 neural networksLecture 4 neural networks
Lecture 4 neural networks
 
Introduction to Recurrent Neural Network
Introduction to Recurrent Neural NetworkIntroduction to Recurrent Neural Network
Introduction to Recurrent Neural Network
 
Radial basis function network ppt bySheetal,Samreen and Dhanashri
Radial basis function network ppt bySheetal,Samreen and DhanashriRadial basis function network ppt bySheetal,Samreen and Dhanashri
Radial basis function network ppt bySheetal,Samreen and Dhanashri
 
Neural Networks
Neural NetworksNeural Networks
Neural Networks
 
Neural Networks: Rosenblatt's Perceptron
Neural Networks: Rosenblatt's PerceptronNeural Networks: Rosenblatt's Perceptron
Neural Networks: Rosenblatt's Perceptron
 
Neural Networks and Deep Learning Basics
Neural Networks and Deep Learning BasicsNeural Networks and Deep Learning Basics
Neural Networks and Deep Learning Basics
 
Neural networks introduction
Neural networks introductionNeural networks introduction
Neural networks introduction
 
Learning by analogy
Learning by analogyLearning by analogy
Learning by analogy
 
Neural Networks: Radial Bases Functions (RBF)
Neural Networks: Radial Bases Functions (RBF)Neural Networks: Radial Bases Functions (RBF)
Neural Networks: Radial Bases Functions (RBF)
 
Feature Extraction
Feature ExtractionFeature Extraction
Feature Extraction
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networks
 
Soft Computing
Soft ComputingSoft Computing
Soft Computing
 
Deep Learning With Neural Networks
Deep Learning With Neural NetworksDeep Learning With Neural Networks
Deep Learning With Neural Networks
 
Artificial Intelligence techniques
Artificial Intelligence techniquesArtificial Intelligence techniques
Artificial Intelligence techniques
 
HML: Historical View and Trends of Deep Learning
HML: Historical View and Trends of Deep LearningHML: Historical View and Trends of Deep Learning
HML: Historical View and Trends of Deep Learning
 

Viewers also liked

Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers
Artificial Neural Network Lect4 : Single Layer Perceptron ClassifiersArtificial Neural Network Lect4 : Single Layer Perceptron Classifiers
Artificial Neural Network Lect4 : Single Layer Perceptron ClassifiersMohammed Bennamoun
 
Perceptron Slides
Perceptron SlidesPerceptron Slides
Perceptron SlidesESCOM
 
Perceptron (neural network)
Perceptron (neural network)Perceptron (neural network)
Perceptron (neural network)EdutechLearners
 
Artificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & BackpropagationArtificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & BackpropagationMohammed Bennamoun
 
Artificial neural networks
Artificial neural networksArtificial neural networks
Artificial neural networksstellajoseph
 
Artificial Neural Networks Lect8: Neural networks for constrained optimization
Artificial Neural Networks Lect8: Neural networks for constrained optimizationArtificial Neural Networks Lect8: Neural networks for constrained optimization
Artificial Neural Networks Lect8: Neural networks for constrained optimizationMohammed Bennamoun
 
MPerceptron
MPerceptronMPerceptron
MPerceptronbutest
 
30 分鐘學會實作 Python Feature Selection
30 分鐘學會實作 Python Feature Selection30 分鐘學會實作 Python Feature Selection
30 分鐘學會實作 Python Feature SelectionJames Huang
 
Pengenalan pola sederhana dg perceptron
Pengenalan pola sederhana dg perceptronPengenalan pola sederhana dg perceptron
Pengenalan pola sederhana dg perceptronArief Fatchul Huda
 
Backlog, Deferred Maintenance and its use in Planning
Backlog, Deferred Maintenance and its use in PlanningBacklog, Deferred Maintenance and its use in Planning
Backlog, Deferred Maintenance and its use in PlanningSightlines
 
Pg connects bangalore re engagment advertising a dummies guide
Pg connects bangalore re engagment advertising a dummies guidePg connects bangalore re engagment advertising a dummies guide
Pg connects bangalore re engagment advertising a dummies guideFiksu
 
Take Control of Your Facilities: Explore the Tools for Aligning Space, Capita...
Take Control of Your Facilities: Explore the Tools for Aligning Space, Capita...Take Control of Your Facilities: Explore the Tools for Aligning Space, Capita...
Take Control of Your Facilities: Explore the Tools for Aligning Space, Capita...Sightlines
 
Enhancing the Value of Streets for Pedestrians, Bicyclists, and Motorists
Enhancing the Value of Streets for Pedestrians, Bicyclists, and MotoristsEnhancing the Value of Streets for Pedestrians, Bicyclists, and Motorists
Enhancing the Value of Streets for Pedestrians, Bicyclists, and MotoristsAndy Boenau
 
Selenium: What Is It Good For
Selenium: What Is It Good ForSelenium: What Is It Good For
Selenium: What Is It Good ForAllan Chappell
 
Kakamega County Audit Report 2014/2015
Kakamega County Audit Report 2014/2015Kakamega County Audit Report 2014/2015
Kakamega County Audit Report 2014/2015The Star Newspaper
 
Ann chapter-3-single layerperceptron20021031
Ann chapter-3-single layerperceptron20021031Ann chapter-3-single layerperceptron20021031
Ann chapter-3-single layerperceptron20021031frdos
 

Viewers also liked (19)

Perceptron
PerceptronPerceptron
Perceptron
 
Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers
Artificial Neural Network Lect4 : Single Layer Perceptron ClassifiersArtificial Neural Network Lect4 : Single Layer Perceptron Classifiers
Artificial Neural Network Lect4 : Single Layer Perceptron Classifiers
 
Perceptron Slides
Perceptron SlidesPerceptron Slides
Perceptron Slides
 
Perceptron (neural network)
Perceptron (neural network)Perceptron (neural network)
Perceptron (neural network)
 
Artificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & BackpropagationArtificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
Artificial Neural Networks Lect5: Multi-Layer Perceptron & Backpropagation
 
Artificial neural networks
Artificial neural networksArtificial neural networks
Artificial neural networks
 
Artificial Neural Networks Lect8: Neural networks for constrained optimization
Artificial Neural Networks Lect8: Neural networks for constrained optimizationArtificial Neural Networks Lect8: Neural networks for constrained optimization
Artificial Neural Networks Lect8: Neural networks for constrained optimization
 
MPerceptron
MPerceptronMPerceptron
MPerceptron
 
Aprendizaje Redes Neuronales
Aprendizaje Redes NeuronalesAprendizaje Redes Neuronales
Aprendizaje Redes Neuronales
 
30 分鐘學會實作 Python Feature Selection
30 分鐘學會實作 Python Feature Selection30 分鐘學會實作 Python Feature Selection
30 分鐘學會實作 Python Feature Selection
 
Pengenalan pola sederhana dg perceptron
Pengenalan pola sederhana dg perceptronPengenalan pola sederhana dg perceptron
Pengenalan pola sederhana dg perceptron
 
More than a score - Using Ratings and Reviews in Real Estate
More than a score - Using Ratings and Reviews in Real EstateMore than a score - Using Ratings and Reviews in Real Estate
More than a score - Using Ratings and Reviews in Real Estate
 
Backlog, Deferred Maintenance and its use in Planning
Backlog, Deferred Maintenance and its use in PlanningBacklog, Deferred Maintenance and its use in Planning
Backlog, Deferred Maintenance and its use in Planning
 
Pg connects bangalore re engagment advertising a dummies guide
Pg connects bangalore re engagment advertising a dummies guidePg connects bangalore re engagment advertising a dummies guide
Pg connects bangalore re engagment advertising a dummies guide
 
Take Control of Your Facilities: Explore the Tools for Aligning Space, Capita...
Take Control of Your Facilities: Explore the Tools for Aligning Space, Capita...Take Control of Your Facilities: Explore the Tools for Aligning Space, Capita...
Take Control of Your Facilities: Explore the Tools for Aligning Space, Capita...
 
Enhancing the Value of Streets for Pedestrians, Bicyclists, and Motorists
Enhancing the Value of Streets for Pedestrians, Bicyclists, and MotoristsEnhancing the Value of Streets for Pedestrians, Bicyclists, and Motorists
Enhancing the Value of Streets for Pedestrians, Bicyclists, and Motorists
 
Selenium: What Is It Good For
Selenium: What Is It Good ForSelenium: What Is It Good For
Selenium: What Is It Good For
 
Kakamega County Audit Report 2014/2015
Kakamega County Audit Report 2014/2015Kakamega County Audit Report 2014/2015
Kakamega County Audit Report 2014/2015
 
Ann chapter-3-single layerperceptron20021031
Ann chapter-3-single layerperceptron20021031Ann chapter-3-single layerperceptron20021031
Ann chapter-3-single layerperceptron20021031
 

Similar to 14 Machine Learning Single Layer Perceptron

Adaptive equalization
Adaptive equalizationAdaptive equalization
Adaptive equalizationKamal Bhatt
 
Neural networks
Neural networksNeural networks
Neural networksBasil John
 
BACKPROPOGATION ALGO.pdfLECTURE NOTES WITH SOLVED EXAMPLE AND FEED FORWARD NE...
BACKPROPOGATION ALGO.pdfLECTURE NOTES WITH SOLVED EXAMPLE AND FEED FORWARD NE...BACKPROPOGATION ALGO.pdfLECTURE NOTES WITH SOLVED EXAMPLE AND FEED FORWARD NE...
BACKPROPOGATION ALGO.pdfLECTURE NOTES WITH SOLVED EXAMPLE AND FEED FORWARD NE...DurgadeviParamasivam
 
TFFN: Two Hidden Layer Feed Forward Network using the randomness of Extreme L...
TFFN: Two Hidden Layer Feed Forward Network using the randomness of Extreme L...TFFN: Two Hidden Layer Feed Forward Network using the randomness of Extreme L...
TFFN: Two Hidden Layer Feed Forward Network using the randomness of Extreme L...Nimai Chand Das Adhikari
 
NeurSciACone
NeurSciAConeNeurSciACone
NeurSciAConeAdam Cone
 
Artificial Neural Networks Lect1: Introduction & neural computation
Artificial Neural Networks Lect1: Introduction & neural computationArtificial Neural Networks Lect1: Introduction & neural computation
Artificial Neural Networks Lect1: Introduction & neural computationMohammed Bennamoun
 
JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience
JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience
JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience hirokazutanaka
 
Reservoir Computing Overview (with emphasis on Liquid State Machines)
Reservoir Computing Overview (with emphasis on Liquid State Machines)Reservoir Computing Overview (with emphasis on Liquid State Machines)
Reservoir Computing Overview (with emphasis on Liquid State Machines)Alex Klibisz
 
Analysis of intelligent system design by neuro adaptive control
Analysis of intelligent system design by neuro adaptive controlAnalysis of intelligent system design by neuro adaptive control
Analysis of intelligent system design by neuro adaptive controliaemedu
 
Analysis of intelligent system design by neuro adaptive control no restriction
Analysis of intelligent system design by neuro adaptive control no restrictionAnalysis of intelligent system design by neuro adaptive control no restriction
Analysis of intelligent system design by neuro adaptive control no restrictioniaemedu
 
ACUMENS ON NEURAL NET AKG 20 7 23.pptx
ACUMENS ON NEURAL NET AKG 20 7 23.pptxACUMENS ON NEURAL NET AKG 20 7 23.pptx
ACUMENS ON NEURAL NET AKG 20 7 23.pptxgnans Kgnanshek
 
NEURAL NETWORKS
NEURAL NETWORKSNEURAL NETWORKS
NEURAL NETWORKSESCOM
 
Word Recognition in Continuous Speech and Speaker Independent by Means of Rec...
Word Recognition in Continuous Speech and Speaker Independent by Means of Rec...Word Recognition in Continuous Speech and Speaker Independent by Means of Rec...
Word Recognition in Continuous Speech and Speaker Independent by Means of Rec...CSCJournals
 
Supervised Learning
Supervised LearningSupervised Learning
Supervised Learningbutest
 

Similar to 14 Machine Learning Single Layer Perceptron (20)

ANN.ppt
ANN.pptANN.ppt
ANN.ppt
 
Adaptive equalization
Adaptive equalizationAdaptive equalization
Adaptive equalization
 
Neural networks
Neural networksNeural networks
Neural networks
 
BACKPROPOGATION ALGO.pdfLECTURE NOTES WITH SOLVED EXAMPLE AND FEED FORWARD NE...
BACKPROPOGATION ALGO.pdfLECTURE NOTES WITH SOLVED EXAMPLE AND FEED FORWARD NE...BACKPROPOGATION ALGO.pdfLECTURE NOTES WITH SOLVED EXAMPLE AND FEED FORWARD NE...
BACKPROPOGATION ALGO.pdfLECTURE NOTES WITH SOLVED EXAMPLE AND FEED FORWARD NE...
 
TFFN: Two Hidden Layer Feed Forward Network using the randomness of Extreme L...
TFFN: Two Hidden Layer Feed Forward Network using the randomness of Extreme L...TFFN: Two Hidden Layer Feed Forward Network using the randomness of Extreme L...
TFFN: Two Hidden Layer Feed Forward Network using the randomness of Extreme L...
 
JACT 5-3_Christakis
JACT 5-3_ChristakisJACT 5-3_Christakis
JACT 5-3_Christakis
 
NeurSciACone
NeurSciAConeNeurSciACone
NeurSciACone
 
Artificial Neural Networks Lect1: Introduction & neural computation
Artificial Neural Networks Lect1: Introduction & neural computationArtificial Neural Networks Lect1: Introduction & neural computation
Artificial Neural Networks Lect1: Introduction & neural computation
 
tsopze2011
tsopze2011tsopze2011
tsopze2011
 
Soft computing BY:- Dr. Rakesh Kumar Maurya
Soft computing BY:- Dr. Rakesh Kumar MauryaSoft computing BY:- Dr. Rakesh Kumar Maurya
Soft computing BY:- Dr. Rakesh Kumar Maurya
 
SoftComputing5
SoftComputing5SoftComputing5
SoftComputing5
 
JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience
JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience
JAISTサマースクール2016「脳を知るための理論」講義04 Neural Networks and Neuroscience
 
Reservoir Computing Overview (with emphasis on Liquid State Machines)
Reservoir Computing Overview (with emphasis on Liquid State Machines)Reservoir Computing Overview (with emphasis on Liquid State Machines)
Reservoir Computing Overview (with emphasis on Liquid State Machines)
 
Analysis of intelligent system design by neuro adaptive control
Analysis of intelligent system design by neuro adaptive controlAnalysis of intelligent system design by neuro adaptive control
Analysis of intelligent system design by neuro adaptive control
 
Analysis of intelligent system design by neuro adaptive control no restriction
Analysis of intelligent system design by neuro adaptive control no restrictionAnalysis of intelligent system design by neuro adaptive control no restriction
Analysis of intelligent system design by neuro adaptive control no restriction
 
ACUMENS ON NEURAL NET AKG 20 7 23.pptx
ACUMENS ON NEURAL NET AKG 20 7 23.pptxACUMENS ON NEURAL NET AKG 20 7 23.pptx
ACUMENS ON NEURAL NET AKG 20 7 23.pptx
 
NEURAL NETWORKS
NEURAL NETWORKSNEURAL NETWORKS
NEURAL NETWORKS
 
neural-networks (1)
neural-networks (1)neural-networks (1)
neural-networks (1)
 
Word Recognition in Continuous Speech and Speaker Independent by Means of Rec...
Word Recognition in Continuous Speech and Speaker Independent by Means of Rec...Word Recognition in Continuous Speech and Speaker Independent by Means of Rec...
Word Recognition in Continuous Speech and Speaker Independent by Means of Rec...
 
Supervised Learning
Supervised LearningSupervised Learning
Supervised Learning
 

More from Andres Mendez-Vazquez

01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectorsAndres Mendez-Vazquez
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issuesAndres Mendez-Vazquez
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiationAndres Mendez-Vazquez
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep LearningAndres Mendez-Vazquez
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learningAndres Mendez-Vazquez
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusAndres Mendez-Vazquez
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusAndres Mendez-Vazquez
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesAndres Mendez-Vazquez
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variationsAndres Mendez-Vazquez
 

More from Andres Mendez-Vazquez (20)

2.03 bayesian estimation
2.03 bayesian estimation2.03 bayesian estimation
2.03 bayesian estimation
 
05 linear transformations
05 linear transformations05 linear transformations
05 linear transformations
 
01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues
 
01.02 linear equations
01.02 linear equations01.02 linear equations
01.02 linear equations
 
01.01 vector spaces
01.01 vector spaces01.01 vector spaces
01.01 vector spaces
 
06 recurrent neural_networks
06 recurrent neural_networks06 recurrent neural_networks
06 recurrent neural_networks
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation
 
Zetta global
Zetta globalZetta global
Zetta global
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learning
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning Syllabus
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabus
 
Ideas 09 22_2018
Ideas 09 22_2018Ideas 09 22_2018
Ideas 09 22_2018
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data Sciences
 
Analysis of Algorithms Syllabus
Analysis of Algorithms  SyllabusAnalysis of Algorithms  Syllabus
Analysis of Algorithms Syllabus
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
 
17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension
 
A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
 

Recently uploaded

Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AIabhishek36461
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleAlluxio, Inc.
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxKartikeyaDwivedi3
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfAsst.prof M.Gokilavani
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncssuser2ae721
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)dollysharma2066
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidNikhilNagaraju
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxJoão Esperancinha
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxk795866
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...asadnawaz62
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.eptoze12
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionDr.Costas Sachpazis
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile servicerehmti665
 

Recently uploaded (20)

Past, Present and Future of Generative AI
Past, Present and Future of Generative AIPast, Present and Future of Generative AI
Past, Present and Future of Generative AI
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
 
Concrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptxConcrete Mix Design - IS 10262-2019 - .pptx
Concrete Mix Design - IS 10262-2019 - .pptx
 
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdfCCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
CCS355 Neural Network & Deep Learning UNIT III notes and Question bank .pdf
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
Design and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdfDesign and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdf
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
Introduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptxIntroduction-To-Agricultural-Surveillance-Rover.pptx
Introduction-To-Agricultural-Surveillance-Rover.pptx
 
complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...complete construction, environmental and economics information of biomass com...
complete construction, environmental and economics information of biomass com...
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
 
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective IntroductionSachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
Sachpazis Costas: Geotechnical Engineering: A student's Perspective Introduction
 
Call Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile serviceCall Girls Delhi {Jodhpur} 9711199012 high profile service
Call Girls Delhi {Jodhpur} 9711199012 high profile service
 

14 Machine Learning Single Layer Perceptron

  • 1. Neural Networks Introduction Single-Layer Perceptron Andres Mendez-Vazquez July 3, 2016 1 / 101
  • 2. Outline 1 Introduction History 2 Adapting Filtering Problem Definition Description of the Behavior of the System 3 Unconstrained Optimization Introduction Method of Steepest Descent Newton’s Method Gauss-Newton Method 4 Linear Least-Squares Filter Introduction Least-Mean-Square (LMS) Algorithm Convergence of the LMS 5 Perceptron Objective Perceptron: Local Field of a Neuron Perceptron: One Neuron Structure Deriving the Algorithm Under Linear Separability - Convergence happens!!! Proof Algorithm Using Error-Correcting Final Perceptron Algorithm (One Version) Other Algorithms for the Perceptron 2 / 101
  • 3. Outline 1 Introduction History 2 Adapting Filtering Problem Definition Description of the Behavior of the System 3 Unconstrained Optimization Introduction Method of Steepest Descent Newton’s Method Gauss-Newton Method 4 Linear Least-Squares Filter Introduction Least-Mean-Square (LMS) Algorithm Convergence of the LMS 5 Perceptron Objective Perceptron: Local Field of a Neuron Perceptron: One Neuron Structure Deriving the Algorithm Under Linear Separability - Convergence happens!!! Proof Algorithm Using Error-Correcting Final Perceptron Algorithm (One Version) Other Algorithms for the Perceptron 3 / 101
  • 4. History At the beginning of Neural Networks (1943 - 1958) McCulloch and Pitts (1943) for introducing the idea of neural networks as computing machines. Hebb (1949) for postulating the first rule for self-organized learning. Rosenblatt (1958) for proposing the perceptron as the first model for learning with a teacher (i.e., supervised learning). In this chapter, we are interested in the perceptron The perceptron is the simplest form of a neural network used for the classification of patterns said to be linearly separable (i.e., patterns that lie on opposite sides of a hyperplane). 4 / 101
  • 8. In addition Something Notable The single neuron also forms the basis of an adaptive filter. A functional block that is basic to the ever-expanding subject of signal processing. Furthermore The development of adaptive filtering owes much to the classic paper of Widrow and Hoff (1960) for pioneering the so-called least-mean-square (LMS) algorithm, also known as the delta rule. 5 / 101
  • 11. Outline 1 Introduction History 2 Adapting Filtering Problem Definition Description of the Behavior of the System 3 Unconstrained Optimization Introduction Method of Steepest Descent Newton’s Method Gauss-Newton Method 4 Linear Least-Squares Filter Introduction Least-Mean-Square (LMS) Algorithm Convergence of the LMS 5 Perceptron Objective Perceptron: Local Field of a Neuron Perceptron: One Neuron Structure Deriving the Algorithm Under Linear Separability - Convergence happens!!! Proof Algorithm Using Error-Correcting Final Perceptron Algorithm (One Version) Other Algorithms for the Perceptron 6 / 101
  • 12. Adapting Filtering Problem Consider a dynamical system [Figure: an unknown dynamical system with multiple inputs and a single output] 7 / 101
  • 13. Signal-Flow Graph of Adaptive Model We have the following equivalence [Figure: signal-flow graph of the adaptive model] 8 / 101
  • 14. Outline 1 Introduction History 2 Adapting Filtering Problem Definition Description of the Behavior of the System 3 Unconstrained Optimization Introduction Method of Steepest Descent Newton’s Method Gauss-Newton Method 4 Linear Least-Squares Filter Introduction Least-Mean-Square (LMS) Algorithm Convergence of the LMS 5 Perceptron Objective Perceptron: Local Field of a Neuron Perceptron: One Neuron Structure Deriving the Algorithm Under Linear Separability - Convergence happens!!! Proof Algorithm Using Error-Correcting Final Perceptron Algorithm (One Version) Other Algorithms for the Perceptron 9 / 101
  • 15. Description of the Behavior of the System We have the data set T = {(x(i), d(i)) | i = 1, 2, ..., n, ...} (1) Where x(i) = (x_1(i), x_2(i), ..., x_m(i))^T (2) 10 / 101
  • 17. The Stimulus x (i) The stimulus x(i) can arise from The m elements of x(i) originate at different points in space (spatial) 11 / 101
  • 18. The Stimulus x (i) The stimulus x(i) can arise from The m elements of x(i) represent the set of present and (m − 1) past values of some excitation that are uniformly spaced in time (temporal). Time 12 / 101
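To make the temporal stimulus concrete, here is a minimal NumPy sketch (the excitation signal s and the chosen m and i below are hypothetical, only for illustration): it builds x(i) from the present value and the (m − 1) past values of a scalar excitation.

import numpy as np

def tapped_delay_line(s, m, i):
    # Temporal stimulus: x(i) = [s(i), s(i-1), ..., s(i-m+1)]^T,
    # i.e. the present value plus the (m - 1) past values of the excitation.
    # Assumes i >= m - 1 so that all required past samples exist.
    return np.array([s[i - k] for k in range(m)])

s = np.array([0.1, 0.4, -0.2, 0.7, 0.3, -0.5])   # toy excitation signal (made up)
print(tapped_delay_line(s, m=3, i=4))            # -> [ 0.3  0.7 -0.2]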
  • 19. Problem Quite important How do we design a multiple input-single output model of the unknown dynamical system? It is more We want to build this around a single neuron!!! 13 / 101
  • 21. Thus, we have the following... We need an algorithm to control the weight adjustment of the neuron Control Algorithm 14 / 101
  • 22. Which steps do you need for the algorithm? First The algorithm starts from an arbitrary setting of the neuron’s synaptic weights. Second Adjustments, with respect to changes in the environment, are made on a continuous basis. Time is incorporated into the algorithm. Third Computation of adjustments to synaptic weights is completed inside a time interval that is one sampling period long. 15 / 101
  • 25. Signal-Flow Graph of Adaptive Model We have the following equivalence [Figure: signal-flow graph of the adaptive model] 16 / 101
  • 26. Thus, This Neural Model ≈ Adaptive Filter with two continuous processes Filtering process 1 An output, denoted by y(i), that is produced in response to the m elements of the stimulus vector x(i). 2 An error signal, e(i), that is obtained by comparing the output y(i) to the corresponding desired output d(i) produced by the unknown system. Adaptive Process It involves the automatic adjustment of the synaptic weights of the neuron in accordance with the error signal e(i) Remark The combination of these two processes working together constitutes a feedback loop acting around the neuron. 17 / 101
  • 29. Thus The output y(i) is exactly the same as the induced local field v(i): y(i) = v(i) = Σ_{k=1}^{m} w_k(i) x_k(i) (3) In matrix form, we have (remember we only have one neuron, so no neuron index is needed): y(i) = x^T(i) w(i) (4) Error e(i) = d(i) − y(i) (5) 18 / 101
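A minimal sketch of the filtering step in Eqs. (3)-(5), assuming hypothetical values for the weight vector w, the stimulus x, and the desired response d (all three are made-up numbers for illustration):

import numpy as np

def filter_step(w, x, d):
    # Output of the linear neuron and the error signal:
    # y(i) = x(i)^T w(i),  e(i) = d(i) - y(i)
    y = x @ w
    e = d - y
    return y, e

w = np.array([0.5, -0.3, 0.8])     # hypothetical synaptic weights
x = np.array([1.0, 2.0, -1.0])     # hypothetical stimulus vector
d = 0.2                            # hypothetical desired response
print(filter_step(w, x, d))        # y = -0.9, e = 1.1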
  • 32. Outline 1 Introduction History 2 Adapting Filtering Problem Definition Description of the Behavior of the System 3 Unconstrained Optimization Introduction Method of Steepest Descent Newton’s Method Gauss-Newton Method 4 Linear Least-Squares Filter Introduction Least-Mean-Square (LMS) Algorithm Convergence of the LMS 5 Perceptron Objective Perceptron: Local Field of a Neuron Perceptron: One Neuron Structure Deriving the Algorithm Under Linear Separability - Convergence happens!!! Proof Algorithm Using Error-Correcting Final Perceptron Algorithm (One Version) Other Algorithms for the Perceptron 19 / 101
  • 33. Consider A continuously differentiable function J(w) We want to find an optimal solution w∗ such that J(w∗) ≤ J(w), ∀w (6) We want to Minimize the cost function J(w) with respect to the weight vector w. For this, the necessary condition is ∇J(w∗) = 0 (7) 20 / 101
  • 36. Where ∇ is the gradient operator ∇ = [∂/∂w_1, ∂/∂w_2, ..., ∂/∂w_m]^T (8) Thus ∇J(w) = [∂J(w)/∂w_1, ∂J(w)/∂w_2, ..., ∂J(w)/∂w_m]^T (9) 21 / 101
  • 38. Thus Starting with an initial guess denoted by w(0), Then, generate a sequence of weight vectors w (1) , w (2) , ... Such that you can reduce J (w) at each iteration J (w (n + 1)) < J (w (n)) (10) Where: w (n) is the old value and w (n + 1) is the new value. 22 / 101
  • 40. The Three Main Methods for Unconstrained Optimization We will look at 1 Steepest Descent. 2 Newton’s Method 3 Gauss-Newton Method 23 / 101
  • 41. Outline 1 Introduction History 2 Adapting Filtering Problem Definition Description of the Behavior of the System 3 Unconstrained Optimization Introduction Method of Steepest Descent Newton’s Method Gauss-Newton Method 4 Linear Least-Squares Filter Introduction Least-Mean-Square (LMS) Algorithm Convergence of the LMS 5 Perceptron Objective Perceptron: Local Field of a Neuron Perceptron: One Neuron Structure Deriving the Algorithm Under Linear Separability - Convergence happens!!! Proof Algorithm Using Error-Correcting Final Perceptron Algorithm (One Version) Other Algorithms for the Perceptron 24 / 101
  • 42. Steepest Descent In the method of steepest descent, we have a cost function J(w) where w(n + 1) = w(n) − η∇J(w(n)) How do we prove that J(w(n + 1)) < J(w(n))? We use the first-order Taylor series expansion around w(n) J(w(n + 1)) ≈ J(w(n)) + ∇J^T(w(n)) ∆w(n) (11) Remark: This is quite accurate when the step size is small!!! In addition, ∆w(n) = w(n + 1) − w(n) 25 / 101
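As a quick illustration, here is a minimal sketch of the steepest-descent iteration w(n + 1) = w(n) − η∇J(w(n)); the quadratic cost, the matrix A, and the vector b below are made-up examples, chosen only so the exact minimizer is easy to check.

import numpy as np

def steepest_descent(grad, w0, eta=0.1, n_iter=200):
    # Iterate w(n+1) = w(n) - eta * grad J(w(n)).
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iter):
        w = w - eta * grad(w)
    return w

# Toy quadratic cost J(w) = 0.5 w^T A w - b^T w, whose gradient is A w - b
# and whose minimizer solves A w = b (both A and b are hypothetical).
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -1.0])
w_hat = steepest_descent(lambda w: A @ w - b, w0=[0.0, 0.0])
print(w_hat, np.linalg.solve(A, b))   # the two should nearly coincide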
  • 44. Why? Look at the case in R The equation of the tangent line to the curve y = J(w) at w(n) is L(w(n + 1)) = J′(w(n)) [w(n + 1) − w(n)] + J(w(n)) (12) Example 26 / 101
  • 46. Thus, we have that in R Remember Something quite Classic tan θ = [J(w(n + 1)) − J(w(n))] / [w(n + 1) − w(n)] tan θ · (w(n + 1) − w(n)) = J(w(n + 1)) − J(w(n)) J′(w(n)) (w(n + 1) − w(n)) = J(w(n + 1)) − J(w(n)) 27 / 101
  • 49. Thus, we have that Using the first-order Taylor expansion J(w(n + 1)) ≈ J(w(n)) + J′(w(n)) [w(n + 1) − w(n)] (13) 28 / 101
  • 50. Now, for Many Variables A hyperplane in R^n is a set of the form H = {x | a^T x = b} (14) Given x ∈ H and x0 ∈ H b = a^T x = a^T x0 Thus, we have that H = {x | a^T (x − x0) = 0} 29 / 101
  • 53. Thus, we have the following definition Definition (Differentiability) Assume that J is defined in a disk D containing w (n). We say that J is differentiable at w (n) if: 1 ∂J(w(n)) ∂wi exist for all i = 1, ..., n. 2 J is locally linear at w (n). 30 / 101
  • 56. Thus, given ∇J(w(n)) We know that we have the following operator ∇ = [∂/∂w_1, ∂/∂w_2, ..., ∂/∂w_m] (15) Thus, we have ∇J(w(n)) = [∂J(w(n))/∂w_1, ∂J(w(n))/∂w_2, ..., ∂J(w(n))/∂w_m] = Σ_{i=1}^{m} ŵ_i ∂J(w(n))/∂w_i Where ŵ_i is the i-th standard basis vector of R^m, e.g., ŵ_1^T = (1, 0, ..., 0) 31 / 101
  • 58. Now Given a curve r(t) that lies on the level set J(w(n)) = c (illustrated here in R^3) 32 / 101
  • 59. Level Set Definition {(w1, w2, ..., wm) ∈ Rm |J (w1, w2, ..., wm) = c} (16) Remark: In a normal Calculus course we will use x and f instead of w and J. 33 / 101
  • 60. Where Any curve has the following parametrization r : [a, b] → R^m, r(t) = (w_1(t), ..., w_m(t)) With r(n + 1) = (w_1(n + 1), ..., w_m(n + 1)) We can write the parametrized version of it z(t) = J(w_1(t), w_2(t), ..., w_m(t)) = c (17) Differentiating with respect to t and using the chain rule for multiple variables dz(t)/dt = Σ_{i=1}^{m} [∂J(w(t))/∂w_i] · [dw_i(t)/dt] = 0 (18) 34 / 101
  • 63. Note First Given y = f (u) = (f1 (u) , ..., fl (u)) and u = g (x) = (g1 (x) , ..., gm (x)). We have then that ∂ (f1, f2, ..., fl) ∂ (x1, x2, ..., xk) = ∂ (f1, f2, ..., fl) ∂ (g1, g2, ..., gm) · ∂ (g1, g2, ..., gm) ∂ (x1, x2, ..., xk) (19) Thus ∂ (f1, f2, ..., fl) ∂xi = ∂ (f1, f2, ..., fl) ∂ (g1, g2, ..., gm) · ∂ (g1, g2, ..., gm) ∂xi = m k=1 ∂ (f1, f2, ..., fl) ∂gk ∂gk ∂xi 35 / 101
  • 66. Thus Evaluating at t = n Σ_{i=1}^{m} [∂J(w(n))/∂w_i] · [dw_i(n)/dt] = 0 We have that ∇J(w(n)) · r′(n) = 0 (20) This proves that for every level set the gradient is perpendicular to the tangent of any curve that lies on the level set In particular at the point w(n). 36 / 101
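A quick numerical check of Eq. (20), assuming the simple made-up cost J(w) = w_1^2 + w_2^2, whose level set J = c^2 is a circle:

import numpy as np

# Level set J(w) = c^2 is the circle r(t) = (c cos t, c sin t);
# the gradient grad J(r(t)) = 2 r(t) should be orthogonal to the tangent r'(t).
c, t = 2.0, 0.7
point = np.array([c * np.cos(t), c * np.sin(t)])       # a point on the level set
grad_J = 2.0 * point                                   # gradient of J at that point
tangent = np.array([-c * np.sin(t), c * np.cos(t)])    # tangent vector r'(t)
print(np.dot(grad_J, tangent))                         # ~0 up to round-off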
  • 69. Now the tangent plane to the surface can be described generally Thus L(w(n + 1)) = J(w(n)) + ∇J^T(w(n)) [w(n + 1) − w(n)] (21) This looks like 37 / 101
  • 71. Proving the fact about the Steepest Descent We want the following J(w(n + 1)) < J(w(n)) Using the first-order Taylor approximation J(w(n + 1)) − J(w(n)) ≈ ∇J^T(w(n)) ∆w(n) So, we choose the following ∆w(n) = −η∇J(w(n)) with η > 0 38 / 101
  • 74. Then We have that J(w(n + 1)) − J(w(n)) ≈ −η∇J^T(w(n)) ∇J(w(n)) = −η ‖∇J(w(n))‖^2 Thus J(w(n + 1)) − J(w(n)) < 0 Or J(w(n + 1)) < J(w(n)) 39 / 101
  • 77. Outline 1 Introduction History 2 Adapting Filtering Problem Definition Description of the Behavior of the System 3 Unconstrained Optimization Introduction Method of Steepest Descent Newton’s Method Gauss-Newton Method 4 Linear Least-Squares Filter Introduction Least-Mean-Square (LMS) Algorithm Convergence of the LMS 5 Perceptron Objective Perceptron: Local Field of a Neuron Perceptron: One Neuron Structure Deriving the Algorithm Under Linear Separability - Convergence happens!!! Proof Algorithm Using Error-Correcting Final Perceptron Algorithm (One Version) Other Algorithms for the Perceptron 40 / 101
  • 78. Newton’s Method Here The basic idea of Newton’s method is to minimize the quadratic approximation of the cost function J(w) around the current point w(n). Using a second-order Taylor series expansion of the cost function around the point w(n) ∆J(w(n)) = J(w(n + 1)) − J(w(n)) ≈ ∇J^T(w(n)) ∆w(n) + (1/2) ∆w^T(n) H(n) ∆w(n) Where, given that w(n) is a vector with dimension m, H = ∇^2 J(w) is the m × m Hessian matrix with entries [H]_{ij} = ∂^2 J(w) / ∂w_i ∂w_j 41 / 101
  • 81. Now, we want to minimize J(w(n + 1)) Do you have any idea? Look again J(w(n)) + ∇J^T(w(n)) ∆w(n) + (1/2) ∆w^T(n) H(n) ∆w(n) (22) Differentiate with respect to ∆w(n) and set to zero ∇J(w(n)) + H(n) ∆w(n) = 0 (23) Thus ∆w(n) = −H^{-1}(n) ∇J(w(n)) 42 / 101
  • 84. The Final Method Define the following w(n + 1) − w(n) = ∆w(n) = −H^{-1}(n) ∇J(w(n)) Then w(n + 1) = w(n) − H^{-1}(n) ∇J(w(n)) 43 / 101
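A minimal sketch of the Newton update w(n + 1) = w(n) − H^{-1}(n)∇J(w(n)); the quadratic cost below is a made-up example for which a single step reaches the minimizer, because the Hessian is constant.

import numpy as np

def newton_step(grad, hess, w):
    # One Newton update: w <- w - H(w)^{-1} grad J(w),
    # implemented with a linear solve instead of an explicit inverse.
    return w - np.linalg.solve(hess(w), grad(w))

A = np.array([[3.0, 0.5], [0.5, 1.0]])   # hypothetical (constant) Hessian of a quadratic cost
b = np.array([1.0, -1.0])
w = newton_step(lambda w: A @ w - b, lambda w: A, np.zeros(2))
print(w, np.linalg.solve(A, b))          # identical for a quadratic cost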
  • 86. Outline 1 Introduction History 2 Adapting Filtering Problem Definition Description of the Behavior of the System 3 Unconstrained Optimization Introduction Method of Steepest Descent Newton’s Method Gauss-Newton Method 4 Linear Least-Squares Filter Introduction Least-Mean-Square (LMS) Algorithm Convergence of the LMS 5 Perceptron Objective Perceptron: Local Field of a Neuron Perceptron: One Neuron Structure Deriving the Algorithm Under Linear Separability - Convergence happens!!! Proof Algorithm Using Error-Correcting Final Perceptron Algorithm (One Version) Other Algorithms for the Perceptron 44 / 101
  • 87. We have then an error Something Notable J(w) = (1/2) Σ_{i=1}^{n} e^2(i) Thus, using the first-order Taylor expansion, we linearize e_l(i, w) = e(i) + [∂e(i)/∂w]^T [w − w(n)] In matrix form e_l(n, w) = e(n) + J(n) [w − w(n)] 45 / 101
  • 90. Where The error vector is equal to e(n) = [e(1), e(2), ..., e(n)]^T (24) Thus, once we differentiate ∂e(i)/∂w, we get the famous Jacobian: J(n) is the n × m matrix with entries [J(n)]_{ij} = ∂e(i)/∂w_j 46 / 101
  • 92. Where We want the following w(n + 1) = arg min_w (1/2) ‖e_l(n, w)‖^2 Ideas What if we expand out the equation? 47 / 101
  • 94. Expanded Version We get (1/2) ‖e_l(n, w)‖^2 = (1/2) ‖e(n)‖^2 + e^T(n) J(n) (w − w(n)) + (1/2) (w − w(n))^T J^T(n) J(n) (w − w(n)) 48 / 101
  • 95. Now, doing the differentiation, we get Differentiating the equation with respect to w J^T(n) e(n) + J^T(n) J(n) [w − w(n)] = 0 We finally get w(n + 1) = w(n) − [J^T(n) J(n)]^{-1} J^T(n) e(n) (25) 49 / 101
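Eq. (25) in code form, as a small sketch: the residual vector e(n) and the Jacobian J(n) below are hypothetical numbers, assumed to be evaluated at the current w(n), and J^T J is assumed nonsingular as the remark on the next slide requires.

import numpy as np

def gauss_newton_step(w, e, J):
    # w(n+1) = w(n) - (J^T J)^{-1} J^T e(n), computed via a linear solve.
    return w - np.linalg.solve(J.T @ J, J.T @ e)

w = np.array([0.5, -0.2])                              # current weights (made up)
J = np.array([[1.0, 2.0], [0.0, 1.0], [1.0, -1.0]])    # Jacobian of e(n) (made up)
e = np.array([0.3, -0.1, 0.2])                         # residual vector (made up)
print(gauss_newton_step(w, e, J))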
  • 97. Remarks We have that Newton’s method requires knowledge of the Hessian matrix of the cost function. The Gauss-Newton method only requires the Jacobian matrix of the error vector e(n). However For the Gauss-Newton iteration to be computable, the matrix product J^T(n) J(n) must be nonsingular!!! 50 / 101
  • 99. Outline 1 Introduction History 2 Adapting Filtering Problem Definition Description of the Behavior of the System 3 Unconstrained Optimization Introduction Method of Steepest Descent Newton’s Method Gauss-Newton Method 4 Linear Least-Squares Filter Introduction Least-Mean-Square (LMS) Algorithm Convergence of the LMS 5 Perceptron Objective Perceptron: Local Field of a Neuron Perceptron: One Neuron Structure Deriving the Algorithm Under Linear Separability - Convergence happens!!! Proof Algorithm Using Error-Correcting Final Perceptron Algorithm (One Version) Other Algorithms for the Perceptron 51 / 101
  • 100. Introduction A linear least-squares filter has two distinctive characteristics First, the single neuron around which it is built is linear. Second, the cost function J(w) used to design the filter consists of the sum of error squares. Thus, expressing the error e(n) = d(n) − [x(1), ..., x(n)]^T w(n) Short Version - the error is linear in the weight vector w(n) e(n) = d(n) − X(n) w(n) Where d(n) is an n × 1 desired response vector. Where X(n) is the n × m data matrix. 52 / 101
  • 103. Now, differentiate e(n) with respect to w(n) Thus ∇e(n) = −X^T(n) Correspondingly, the Jacobian of e(n) is J(n) = −X(n) Let us use the Gauss-Newton update w(n + 1) = w(n) − [J^T(n) J(n)]^{-1} J^T(n) e(n) 53 / 101
  • 106. Thus We have the following w(n + 1) = w(n) − [(−X^T(n)) · (−X(n))]^{-1} · (−X^T(n)) [d(n) − X(n) w(n)] We have then w(n + 1) = w(n) + [X^T(n) X(n)]^{-1} X^T(n) d(n) − [X^T(n) X(n)]^{-1} X^T(n) X(n) w(n) Thus, we have w(n + 1) = w(n) + [X^T(n) X(n)]^{-1} X^T(n) d(n) − w(n) = [X^T(n) X(n)]^{-1} X^T(n) d(n) 54 / 101
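The closed-form solution w(n + 1) = [X^T(n) X(n)]^{-1} X^T(n) d(n) can be checked numerically; everything in this sketch (the random data matrix, the "true" weights, the noise level) is made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                      # n x m data matrix X(n)
w_true = np.array([1.0, -2.0, 0.5])               # hypothetical underlying weights
d = X @ w_true + 0.01 * rng.normal(size=50)       # desired response d(n)

w_ls = np.linalg.solve(X.T @ X, X.T @ d)          # normal-equations solution
print(w_ls)                                       # close to w_true
print(np.linalg.pinv(X) @ d)                      # same result via the pseudo-inverse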
  • 109. Outline 1 Introduction History 2 Adapting Filtering Problem Definition Description of the Behavior of the System 3 Unconstrained Optimization Introduction Method of Steepest Descent Newton’s Method Gauss-Newton Method 4 Linear Least-Squares Filter Introduction Least-Mean-Square (LMS) Algorithm Convergence of the LMS 5 Perceptron Objective Perceptron: Local Field of a Neuron Perceptron: One Neuron Structure Deriving the Algorithm Under Linear Separability - Convergence happens!!! Proof Algorithm Using Error-Correcting Final Perceptron Algorithm (One Version) Other Algorithms for the Perceptron 55 / 101
  • 110. Again Our Error Cost function We have J(w) = (1/2) e^2(n) where e(n) is the error signal measured at time n. Again, differentiating with respect to the vector w ∂J(w)/∂w = e(n) ∂e(n)/∂w The LMS algorithm operates with a linear neuron, so we may express the error signal as e(n) = d(n) − x^T(n) w(n) (26) 56 / 101
  • 113. We have Something Notable ∂e (n) ∂w = −x (n) Then ∂J (w) ∂w = −x (n) e (n) Using this as an estimate for the gradient vector, we have for the gradient descent w (n + 1) = w (n) + ηx (n) e (n) (27) 57 / 101
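A minimal LMS sketch implementing e(n) = d(n) − x^T(n)w(n) and w(n + 1) = w(n) + ηx(n)e(n); the data-generating setup below is the same kind of made-up example as before, assumed only so the estimate has something to converge toward.

import numpy as np

def lms(X, d, eta=0.05):
    # Least-mean-square (delta rule): one pass over the samples,
    # e(n) = d(n) - x(n)^T w(n),  w(n+1) = w(n) + eta * x(n) * e(n).
    w = np.zeros(X.shape[1])
    for x, dn in zip(X, d):
        e = dn - x @ w
        w = w + eta * x * e
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))                     # hypothetical input vectors
w_true = np.array([1.0, -2.0, 0.5])               # hypothetical underlying weights
d = X @ w_true + 0.01 * rng.normal(size=500)
print(lms(X, d))                                  # approaches w_true over the pass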
  • 116. Remarks The feedback loop around the weight vector behaves like a low-pass filter It passes the low-frequency component of the error signal and attenuates its high-frequency component. Low-Pass filter [Figure: an RC low-pass circuit with input Vin and output Vout] 58 / 101
  • 119. Actually Thus The average time constant of this filtering action is inversely proportional to the learning-rate parameter η. Thus Assigning a small value to η, the adaptive process progresses slowly. Thus More of the past data are remembered by the LMS algorithm. Thus, LMS is a more accurate filter. 59 / 101
  • 123. Outline 1 Introduction History 2 Adapting Filtering Problem Definition Description of the Behavior of the System 3 Unconstrained Optimization Introduction Method of Steepest Descent Newton’s Method Gauss-Newton Method 4 Linear Least-Squares Filter Introduction Least-Mean-Square (LMS) Algorithm Convergence of the LMS 5 Perceptron Objective Perceptron: Local Field of a Neuron Perceptron: One Neuron Structure Deriving the Algorithm Under Linear Separability - Convergence happens!!! Proof Algorithm Using Error-Correcting Final Perceptron Algorithm (One Version) Other Algorithms for the Perceptron 60 / 101
  • 124. Convergence of the LMS This convergence depends on the following points The statistical characteristics of the input vector x(n). The learning-rate parameter η. Something Notable However, instead of using E[w(n)] as n → ∞, we use E[e^2(n)] → constant as n → ∞ 61 / 101
  • 126. To make this analysis practical We take the following assumptions The successive input vectors x(1), x(2), ... are statistically independent of each other. At time step n, the input vector x(n) is statistically independent of all previous samples of the desired response, namely d(1), d(2), ..., d(n − 1). At time step n, the desired response d(n) is dependent on x(n), but statistically independent of all previous values of the desired response. The input vector x(n) and desired response d(n) are drawn from Gaussian-distributed populations. 62 / 101
  • 130. We get the following The LMS is convergent in the mean square provided that η satisfies 0 < η < 2/λ_max (28) where λ_max is the largest eigenvalue of the sample correlation matrix R_x This can be difficult to obtain in practice... so we use the trace instead 0 < η < 2/trace[R_x] (29) However, each diagonal element of R_x equals the mean-square value of the corresponding sensor input We can re-state the previous condition as 0 < η < 2 / (sum of the mean-square values of the sensor inputs) (30) 63 / 101
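The practical bound of Eqs. (29)-(30) can be estimated from data, as in this sketch; the sensor inputs below are made up, and the sample correlation matrix stands in for R_x.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(scale=[1.0, 2.0, 0.5], size=(1000, 3))   # hypothetical sensor inputs
Rx = (X.T @ X) / X.shape[0]                             # sample correlation matrix R_x
eta_trace = 2.0 / np.trace(Rx)                          # bound from Eqs. (29)-(30)
eta_eig = 2.0 / np.linalg.eigvalsh(Rx).max()            # bound from Eq. (28)
print(eta_trace, eta_eig)     # the trace bound is the more conservative of the two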
  • 133. Virtues and Limitations of the LMS Algorithm Virtues An important virtue of the LMS algorithm is its simplicity. It is model independent and therefore robust (small disturbances lead to small estimation errors). Not only that, the LMS algorithm is optimal in accordance with the minimax criterion If you do not know what you are up against, plan for the worst and optimize. Primary Limitation The slow rate of convergence and sensitivity to variations in the eigenstructure of the input. The LMS algorithm typically requires a number of iterations of about 10 times the dimensionality of the input space for convergence. 64 / 101
  • 138. More of this in... Simon Haykin Simon Haykin - Adaptive Filter Theory (3rd Edition) 65 / 101
  • 139. Exercises We have from NN by Haykin 3.1, 3.2, 3.3, 3.4, 3.5, 3.7 and 3.8 66 / 101
  • 140. Outline 1 Introduction History 2 Adapting Filtering Problem Definition Description of the Behavior of the System 3 Unconstrained Optimization Introduction Method of Steepest Descent Newton’s Method Gauss-Newton Method 4 Linear Least-Squares Filter Introduction Least-Mean-Square (LMS) Algorithm Convergence of the LMS 5 Perceptron Objective Perceptron: Local Field of a Neuron Perceptron: One Neuron Structure Deriving the Algorithm Under Linear Separability - Convergence happens!!! Proof Algorithm Using Error-Correcting Final Perceptron Algorithm (One Version) Other Algorithms for the Perceptron 67 / 101
  • 141. Objective Goal Correctly classify a series of samples (externally applied stimuli) x1, x2, x3, ..., xm into one of two classes, C1 and C2. Output for each input 1 Class C1: output y = +1. 2 Class C2: output y = −1. 68 / 101
  • 143. History Frank Rosenblatt The perceptron algorithm was invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt. Something Notable Frank Rosenblatt was a Psychologist!!! Working at a military R&D lab!!! Frank Rosenblatt He helped to develop the Mark I Perceptron - a new machine based on the connectivity of neural networks!!! Some problems with it The most important is the impossibility of using the perceptron with a single neuron to solve the XOR problem 69 / 101
  • 147. Outline 1 Introduction History 2 Adapting Filtering Problem Definition Description of the Behavior of the System 3 Unconstrained Optimization Introduction Method of Steepest Descent Newton’s Method Gauss-Newton Method 4 Linear Least-Squares Filter Introduction Least-Mean-Square (LMS) Algorithm Convergence of the LMS 5 Perceptron Objective Perceptron: Local Field of a Neuron Perceptron: One Neuron Structure Deriving the Algorithm Under Linear Separability - Convergence happens!!! Proof Algorithm Using Error-Correcting Final Perceptron Algorithm (One Version) Other Algorithms for the Perceptron 70 / 101
  • 148. Perceptron: Local Field of a Neuron [Figure: signal-flow graph of the neuron with inputs, weights, and a bias] Induced local field of a neuron v = Σ_{i=1}^{m} w_i x_i + b (31) 71 / 101
  • 150. Outline 1 Introduction History 2 Adapting Filtering Problem Definition Description of the Behavior of the System 3 Unconstrained Optimization Introduction Method of Steepest Descent Newton’s Method Gauss-Newton Method 4 Linear Least-Squares Filter Introduction Least-Mean-Square (LMS) Algorithm Convergence of the LMS 5 Perceptron Objective Perceptron: Local Field of a Neuron Perceptron: One Neuron Structure Deriving the Algorithm Under Linear Separability - Convergence happens!!! Proof Algorithm Using Error-Correcting Final Perceptron Algorithm (One Version) Other Algorithms for the Perceptron 72 / 101
  • 151. Perceptron: One Neuron Structure Based on the previous induced local field In the simplest form of the perceptron there are two decision regions separated by a hyperplane: Σ_{i=1}^{m} w_i x_i + b = 0 (32) Example with two signals [Figure: decision boundary separating class C1 from class C2] 73 / 101
  • 153. Outline 1 Introduction History 2 Adapting Filtering Problem Definition Description of the Behavior of the System 3 Unconstrained Optimization Introduction Method of Steepest Descent Newton’s Method Gauss-Newton Method 4 Linear Least-Squares Filter Introduction Least-Mean-Square (LMS) Algorithm Convergence of the LMS 5 Perceptron Objective Perceptron: Local Field of a Neuron Perceptron: One Neuron Structure Deriving the Algorithm Under Linear Separability - Convergence happens!!! Proof Algorithm Using Error-Correcting Final Perceptron Algorithm (One Version) Other Algorithms for the Perceptron 74 / 101
  • 154. Deriving the Algorithm First, you put the signals together x(n) = [1, x_1(n), x_2(n), ..., x_m(n)]^T (33) Weights v(n) = Σ_{i=0}^{m} w_i(n) x_i(n) = w^T(n) x(n) (34) Note IMPORTANT - the Perceptron works only if C1 and C2 are linearly separable 75 / 101
  • 157. Rule for Linearly Separable Classes There must exist a vector w 1 w^T x > 0 for every input vector x belonging to class C1. 2 w^T x ≤ 0 for every input vector x belonging to class C2. What is the derivative dv(n)/dw? dv(n)/dw = x(n) (35) 76 / 101
  • 160. Finally No correction is necessary 1 w(n + 1) = w(n) if w^T x(n) > 0 and x(n) belongs to class C1. 2 w(n + 1) = w(n) if w^T x(n) ≤ 0 and x(n) belongs to class C2. Correction is necessary 1 w(n + 1) = w(n) − η(n) x(n) if w^T(n) x(n) > 0 and x(n) belongs to class C2. 2 w(n + 1) = w(n) + η(n) x(n) if w^T(n) x(n) ≤ 0 and x(n) belongs to class C1. Where η(n) is a learning parameter adjusting the learning rate. 77 / 101
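One way to code the error-correcting rules above, as a sketch under the assumptions of a fixed η, labels coded as +1 for C1 and −1 for C2, and a bias folded into the augmented input; the toy points are made up and linearly separable.

import numpy as np

def train_perceptron(X, labels, eta=1.0, epochs=100):
    # Augment each input as [1, x1, ..., xm]^T so the bias is part of w.
    Xa = np.hstack([np.ones((len(X), 1)), X])
    w = np.zeros(Xa.shape[1])
    for _ in range(epochs):
        errors = 0
        for x, y in zip(Xa, labels):
            v = w @ x
            if y == +1 and v <= 0:      # C1 sample on the wrong side: add eta * x
                w = w + eta * x
                errors += 1
            elif y == -1 and v > 0:     # C2 sample on the wrong side: subtract eta * x
                w = w - eta * x
                errors += 1
        if errors == 0:                 # no corrections in a full pass: converged
            break
    return w

X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 0.5], [1.0, -1.0]])   # toy samples (made up)
labels = np.array([+1, +1, -1, -1])                               # C1 = +1, C2 = -1
print(train_perceptron(X, labels))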
  • 165. A little bit on the Geometry For Example, w(n + 1) = w(n) − η (n) x (n) 78 / 101
  • 166. Outline 1 Introduction History 2 Adapting Filtering Problem Definition Description of the Behavior of the System 3 Unconstrained Optimization Introduction Method of Steepest Descent Newton’s Method Gauss-Newton Method 4 Linear Least-Squares Filter Introduction Least-Mean-Square (LMS) Algorithm Convergence of the LMS 5 Perceptron Objective Perceptron: Local Field of a Neuron Perceptron: One Neuron Structure Deriving the Algorithm Under Linear Separability - Convergence happens!!! Proof Algorithm Using Error-Correcting Final Perceptron Algorithm (One Version) Other Algorithms for the Perceptron 79 / 101
  • 167. Under Linear Separability - Convergence happens!!! If we assume Linear Separability for the classes C1 and C2. Rosenblatt - 1962 Let the subsets of training vectors C1 and C2 be linearly separable. Let the inputs presented to the perceptron originate from these two subsets. The perceptron converges after some n0 iterations, in the sense that w(n0) = w(n0 + 1) = w(n0 + 2) = ... (36) and w(n0) is a solution vector for n0 ≤ nmax 80 / 101
  • 171. Outline 1 Introduction History 2 Adapting Filtering Problem Definition Description of the Behavior of the System 3 Unconstrained Optimization Introduction Method of Steepest Descent Newton’s Method Gauss-Newton Method 4 Linear Least-Squares Filter Introduction Least-Mean-Square (LMS) Algorithm Convergence of the LMS 5 Perceptron Objective Perceptron: Local Field of a Neuron Perceptron: One Neuron Structure Deriving the Algorithm Under Linear Separability - Convergence happens!!! Proof Algorithm Using Error-Correcting Final Perceptron Algorithm (One Version) Other Algorithms for the Perceptron 81 / 101
  • 172. Proof I - First a Lower Bound for ‖w(n + 1)‖^2 Initialization w(0) = 0 (37) Now assume for time n = 1, 2, 3, ... w^T(n) x(n) < 0 (38) with x(n) belonging to class C1. THE PERCEPTRON INCORRECTLY CLASSIFIES THE VECTORS x(1), x(2), ... Apply the correction formula w(n + 1) = w(n) + x(n) (39) 82 / 101
  • 177. Proof II Apply the correction iteratively: w(n + 1) = x(1) + x(2) + ... + x(n) (40). Since the classes are linearly separable, there is a solution w_0; define α = min_{x(n) ∈ C1} w_0^T x(n) (41). Then we have w_0^T w(n + 1) = w_0^T x(1) + w_0^T x(2) + ... + w_0^T x(n) (42) 83 / 101
  • 183. Proof IV Using the definition of α: w_0^T w(n + 1) ≥ nα (46). By the Cauchy-Schwarz inequality, ||w_0||^2 ||w(n + 1)||^2 ≥ (w_0^T w(n + 1))^2 (47), where ||·|| denotes the Euclidean norm. Thus ||w_0||^2 ||w(n + 1)||^2 ≥ n^2 α^2, and therefore ||w(n + 1)||^2 ≥ n^2 α^2 / ||w_0||^2 85 / 101
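For readability, the lower-bound argument of Proofs II and IV can be collected into a single chain (same α and w_0 as above):

```latex
\[
 w_0^{T} w(n+1) = \sum_{k=1}^{n} w_0^{T} x(k) \;\ge\; n\alpha,
 \qquad
 \|w_0\|^{2}\,\|w(n+1)\|^{2} \;\ge\; \bigl(w_0^{T} w(n+1)\bigr)^{2} \;\ge\; n^{2}\alpha^{2}
 \;\Longrightarrow\;
 \|w(n+1)\|^{2} \;\ge\; \frac{n^{2}\alpha^{2}}{\|w_0\|^{2}}.
\]
```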
  • 186. Proof V - Now an Upper Bound for ||w(n + 1)||^2 Rewrite equation 39 as w(k + 1) = w(k) + x(k) (48) for k = 1, 2, ..., n and x(k) ∈ C1. Taking the squared Euclidean norm of both sides: ||w(k + 1)||^2 = ||w(k)||^2 + ||x(k)||^2 + 2 w^T(k) x(k) (49). Since w^T(k) x(k) < 0, ||w(k + 1)||^2 ≤ ||w(k)||^2 + ||x(k)||^2, i.e., ||w(k + 1)||^2 − ||w(k)||^2 ≤ ||x(k)||^2 86 / 101
  • 189. Proof VI Use the telescoping sum: Σ_{k=0}^{n} (||w(k + 1)||^2 − ||w(k)||^2) ≤ Σ_{k=0}^{n} ||x(k)||^2 (50). Assuming w(0) = 0 and x(0) = 0, this gives ||w(n + 1)||^2 ≤ Σ_{k=1}^{n} ||x(k)||^2 87 / 101
  • 192. Proof VII Then, we can define a positive number β = max_{x(k) ∈ C1} ||x(k)||^2 (51). Thus ||w(n + 1)||^2 ≤ Σ_{k=1}^{n} ||x(k)||^2 ≤ nβ. The lower and upper bounds can therefore hold simultaneously only up to some n_max (using our sandwich), defined by n_max^2 α^2 / ||w_0||^2 = n_max β (52) 88 / 101
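Combining the lower bound of Proof IV with the upper bound above gives the "sandwich" that forces termination:

```latex
\[
 \frac{n^{2}\alpha^{2}}{\|w_0\|^{2}} \;\le\; \|w(n+1)\|^{2} \;\le\; n\beta
 \quad\Longrightarrow\quad
 n \;\le\; \frac{\beta\,\|w_0\|^{2}}{\alpha^{2}},
\]
```

so the number of corrections cannot grow without bound, and equality of the two sides defines n_max as in equation 52.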
  • 195. Proof VIII Solving for n_max: n_max = β ||w_0||^2 / α^2 (53). Thus, for η(n) = 1 for all n, w(0) = 0, and a solution vector w_0: the rule for adapting the synaptic weights of the perceptron must terminate after at most n_max steps. In addition, since the solution w_0 is not unique, the value of n_max is not unique either. 89 / 101
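As a sanity check of the bound (not part of the slides), one can evaluate n_max for a toy dataset; the samples and the separating vector w0 below are assumptions chosen for illustration.

```python
import numpy as np

# Toy samples, written in the proof's convention: vectors from class C2 are
# negated, so a solution w0 must give w0^T x > 0 for every row below.
X = np.array([[ 2.0,  1.0,  1.0],
              [ 1.0,  3.0,  1.0],
              [ 1.0,  2.0, -1.0],    # a negated class-C2 sample
              [ 2.0, -0.5, -1.0]])   # a negated class-C2 sample
w0 = np.array([1.0, 1.0, -1.0])      # any vector that separates the toy data

margins = X @ w0
assert np.all(margins > 0), "w0 must be a solution vector"

alpha = margins.min()                          # alpha = min w0^T x(n)
beta = (np.linalg.norm(X, axis=1) ** 2).max()  # beta  = max ||x(k)||^2
n_max = beta * np.linalg.norm(w0) ** 2 / alpha ** 2

print(f"alpha = {alpha}, beta = {beta}, at most n_max = {n_max:.2f} corrections")
```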
  • 199. Algorithm Using Error-Correcting Now, if we use the cost (1/2) e_k(n)^2, we can actually simplify the rules and the final algorithm!!! Thus, we have the following delta value: ∆w(n) = η (d_j − y_j(n)) x(n) (54) 91 / 101
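A minimal sketch of this error-correcting update, assuming a hard-limiter (signum) activation and an illustrative learning rate η = 0.1; the helper name delta_update is not from the slides.

```python
import numpy as np

def delta_update(w, x, d, eta=0.1):
    """One error-correcting step: w <- w + eta * (d - y) * x,
    where y = sign(w^T x) is the perceptron output."""
    y = 1.0 if w @ x >= 0.0 else -1.0   # hard-limiter activation
    return w + eta * (d - y) * x        # no change when d == y

w = np.zeros(3)
w = delta_update(w, np.array([-1.0, -2.0, 1.0]), d=-1.0)  # misclassified C2 sample
print(w)   # [ 0.2  0.4 -0.2]: pushed away from the negative example
```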
  • 202. We could use the previous Rule In order to generate an algorithm. However, you need classes that are linearly separable!!! Therefore, we can use a more general Gradient Descent Rule To obtain an algorithm that finds the best separating hyperplane!!! 93 / 101
  • 204. Gradient Descent Algorithm Algorithm - Off-line/Batch Learning 1 Set n = 0. 2 Set d_j = +1 if x_j ∈ Class 1, −1 if x_j ∈ Class 2, for all j = 1, 2, ..., m. 3 Initialize the weights w^T(n) = (w_1(n), w_2(n), ..., w_p(n)), where p is the input dimension; weights may be initialized to 0 or to a small random value. 4 Initialize dummy outputs y^T = (y_1(n), y_2(n), ..., y_m(n)) so you can enter the loop. 5 Initialize the stopping error ε > 0. 6 Initialize the learning rate η. 7 While (1/m) Σ_{j=1}^{m} |d_j − y_j(n)| > ε: For each sample (x_j, d_j), j = 1, ..., m: Calculate the output y_j = ϕ(w^T(n) · x_j). Update the weights w_i(n + 1) = w_i(n) + η (d_j − y_j(n)) x_{ij}. Set n = n + 1. 94 / 101
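One possible reading of the loop above as runnable code; the hard-limiter ϕ, the toy data, and the values of ε and η are illustrative assumptions.

```python
import numpy as np

def train_perceptron_gd(X, d, eta=0.1, eps=1e-3, max_iter=1000):
    """Loop of the slide: sweep the samples, update
    w <- w + eta * (d_j - y_j) * x_j, until the mean absolute error
    (1/m) * sum_j |d_j - y_j| drops to eps or below."""
    m, p = X.shape
    w = np.zeros(p)                          # weights initialized to 0
    phi = lambda v: 1.0 if v >= 0 else -1.0  # hard-limiter activation
    y = np.zeros(m)                          # dummy outputs to enter the loop
    n = 0
    while np.mean(np.abs(d - y)) > eps and n < max_iter:
        for j in range(m):
            y[j] = phi(w @ X[j])                 # y_j = phi(w^T(n) x_j)
            w = w + eta * (d[j] - y[j]) * X[j]   # update rule of the slide
        n += 1
    return w, n

# Toy data: last column is the bias input (+1).
X = np.array([[ 2.0,  1.0, 1.0], [ 1.0, 3.0, 1.0],
              [-1.0, -2.0, 1.0], [-2.0, 0.5, 1.0]])
d = np.array([+1.0, +1.0, -1.0, -1.0])
w, n = train_perceptron_gd(X, d)
print("weights:", w, "after", n, "sweeps")
```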
  • 206. Nevertheless We have the following problem: the stopping error must satisfy ε > 0 (55). Thus... Convergence to the best linear separation is a tweaking business!!! 95 / 101
  • 209. However, if we limit our features!!! The Winnow Algorithm!!! It converges even when the classes are not linearly separable. Feature Vector: Boolean-valued features, X = {0, 1}^d. Weight Vector w: 1 w^T = (w_1, w_2, ..., w_d) with w_i ∈ R. 2 For all i, w_i ≥ 0. 97 / 101
  • 212. Classification Scheme We use a specific threshold θ: 1 w^T x ≥ θ ⇒ positive classification (x is assigned to Class 1). 2 w^T x < θ ⇒ negative classification (x is assigned to Class 2). Rules: we use two update rules for training, with a learning rate α > 1. Rule 1: when misclassifying a positive training example x ∈ Class 1 (i.e., w^T x < θ): ∀ x_i = 1 : w_i ← α w_i (56) 98 / 101
  • 215. Classification Scheme Rule 2: when misclassifying a negative training example x ∈ Class 2 (i.e., w^T x ≥ θ): ∀ x_i = 1 : w_i ← w_i / α (57). Rule 3: if a sample is correctly classified, do nothing!!! 99 / 101
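A minimal sketch of the three Winnow rules above, assuming the common choices θ = d (the number of features) and α = 2; the toy target concept and the helper name winnow_train are illustrative assumptions.

```python
import numpy as np

def winnow_train(X, labels, alpha=2.0, theta=None, epochs=20):
    """Winnow with the three rules of the slides: promote (w_i <- alpha*w_i)
    on a missed positive, demote (w_i <- w_i/alpha) on a missed negative,
    only for coordinates with x_i = 1; otherwise leave w unchanged."""
    d = X.shape[1]
    theta = d if theta is None else theta   # common choice: theta = number of features
    w = np.ones(d)                          # non-negative weights, initialized to 1
    for _ in range(epochs):
        for x, positive in zip(X, labels):
            predicted_positive = (w @ x >= theta)
            if positive and not predicted_positive:
                w[x == 1] *= alpha          # Rule 1: promotion
            elif not positive and predicted_positive:
                w[x == 1] /= alpha          # Rule 2: demotion
            # Rule 3: correctly classified -> do nothing
    return w

# Boolean features; target concept: x1 OR x3 (an illustrative assumption).
X = np.array([[1, 0, 0, 0], [0, 0, 1, 1], [0, 1, 0, 0], [0, 0, 0, 1]])
labels = np.array([True, True, False, False])
print(winnow_train(X, labels))
```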
  • 217. Properties of Winnow Property: if there are many irrelevant variables, Winnow is better than the Perceptron. Drawback: it is sensitive to the learning rate α. 100 / 101
  • 219. The Pocket Algorithm A variant of the Perceptron Algorithm It was suggested by Gallant in "Perceptron-based learning algorithms," IEEE Transactions on Neural Networks, Vol. 1(2), pp. 179-191, 1990. It converges to an optimal solution even if linear separability is not fulfilled. It consists of the following steps: 1 Initialize the weight vector w(0) in a random way. 2 Define a stored pocket vector w_s and set its history counter h_s to zero. 3 At the nth iteration step, compute the update w(n + 1) using the Perceptron rule. 4 Use the updated weights to find the number h of samples correctly classified. 5 If at any moment h > h_s, replace w_s with w(n + 1) and h_s with h. 101 / 101
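A minimal sketch of the five steps above; presenting samples in random order, the fixed iteration budget, and the toy non-separable data are assumptions not specified on the slides.

```python
import numpy as np

def pocket_train(X, d, n_iter=1000, seed=0):
    """Pocket algorithm: run the perceptron rule, but keep in the 'pocket'
    the weight vector that has classified the most samples correctly so far."""
    rng = np.random.default_rng(seed)
    m, p = X.shape
    w = rng.normal(size=p)             # step 1: random initialization
    w_s, h_s = w.copy(), 0             # step 2: pocket vector and its counter
    for n in range(n_iter):
        j = rng.integers(m)            # present a randomly chosen sample
        if np.sign(w @ X[j]) != d[j]:  # step 3: perceptron correction rule
            w = w + d[j] * X[j]
        h = np.sum(np.sign(X @ w) == d)   # step 4: samples correctly classified
        if h > h_s:                       # step 5: ratchet the pocket
            w_s, h_s = w.copy(), h
    return w_s, h_s

# Toy data that is NOT linearly separable (XOR-like, with a bias column).
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
d = np.array([-1, +1, +1, -1])
w_s, h_s = pocket_train(X, d)
print("pocket weights:", w_s, "- correct on", h_s, "of", len(d), "samples")
```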