Neural Networks
Multilayer Perceptron
Andres Mendez-Vazquez
December 12, 2015
1 / 94
Outline
1 Introduction
The XOR Problem
2 Multi-Layer Perceptron
Architecture
Back-propagation
Gradient Descent
Hidden–to-Output Weights
Input-to-Hidden Weights
Total Training Error
About Stopping Criteria
Final Basic Batch Algorithm
3 Using Matrix Operations to Simplify
Using Matrix Operations to Simplify the Pseudo-Code
Generating the Output zk
Generating zk
Generating the Weights from Hidden to Output Layer
Generating the Weights from Input to Hidden Layer
Activation Functions
4 Heuristic for Multilayer Perceptron
Maximizing information content
Activation Function
Target Values
Normalizing the inputs
Virtues and limitations of Back-Propagation Layer
2 / 94
Do you remember?
The Perceptron has the following problem
Given that the perceptron is a linear classifier
It is clear that
It will never be able to classify data that is not linearly separable
4 / 94
Example: XOR Problem
The Problem
[Figure: the four XOR input points in the plane, labeled Class 1 and Class 2; no single line separates the two classes]
5 / 94
The Perceptron cannot solve it
Because
The perceptron is a linear classifier!!!
Thus
Something needs to be done!!!
Maybe
Add an extra layer!!!
6 / 94
A little bit of history
It was first cited by Vapnik
Vapnik cites (Bryson, A.E.; W.F. Denham; S.E. Dreyfus. Optimal
programming problems with inequality constraints. I: Necessary conditions
for extremal solutions. AIAA J. 1, 11 (1963) 2544-2550) as the first
publication of the backpropagation algorithm in his book "Support Vector
Machines."
It was first used by
Arthur E. Bryson and Yu-Chi Ho described it as a multi-stage dynamic
system optimization method in 1969.
However
It was not until 1974 and later, when applied in the context of neural
networks and through the work of Paul Werbos, David E. Rumelhart,
Geoffrey E. Hinton and Ronald J. Williams that it gained recognition.
7 / 94
Then
Something Notable
It led to a “renaissance” in the field of artificial neural network research.
Nevertheless
During the 2000s it fell out of favour but has returned again in the 2010s,
now able to train much larger networks using huge modern computing
power such as GPUs.
8 / 94
Multi-Layer Perceptron (MLP)
Multi-Layer Architecture
[Figure: multi-layer architecture with input, hidden, and output layers and a target; a sigmoid activation function and an identity activation function are indicated]
10 / 94
Information Flow
We have the following information flow
[Figure: function signals flow forward through the network; error signals flow backward]
11 / 94
Explanation
Problems with Hidden Layers
1 Increased training complexity.
2 It is necessary to think about a “Long and Narrow” network vs. a “Short and Fat” network.
Intuition for a One Hidden Layer
1 For every input case or region, that region can be delimited by hyperplanes on all sides using hidden units on the first hidden layer.
2 A hidden unit in the second layer then ANDs them together to bound the region.
Advantages
It has been proven that an MLP with one hidden layer (given enough hidden units) can approximate any continuous nonlinear function of the input.
12 / 94
The Process
We have something like this
[Figure: two-layer construction; first-layer units define hyperplanes and a second-layer unit combines them to bound the region]
13 / 94
Remember!!! The Quadratic Learning Error function
Cost Function: our well-known error at pattern m
J(m) = \frac{1}{2} e_k^2(m)   (1)
Delta Rule or Widrow-Hoff Rule
\Delta w_{kj}(m) = -\eta e_k(m) x_j(m)   (2)
Actually, this is known as Gradient Descent
w_{kj}(m + 1) = w_{kj}(m) + \Delta w_{kj}(m)   (3)
15 / 94
Back-propagation
Setup
Let t_k be the k-th target (or desired) output and z_k be the k-th computed output, with k = 1, . . . , c, and let w represent all the weights of the network.
Training Error for a single Pattern or Sample!!!
J(w) = \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 = \frac{1}{2} \|t - z\|^2   (4)
16 / 94
Gradient Descent
Gradient Descent
The back-propagation learning rule is based on gradient descent.
Reducing the Error
The weights are initialized with pseudo-random values and are changed in a direction that will reduce the error:
\Delta w = -\eta \frac{\partial J}{\partial w}   (5)
Where
\eta is the learning rate, which indicates the relative size of the change in weights:
w(m + 1) = w(m) + \Delta w(m)   (6)
where m is the m-th pattern presented.
18 / 94
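To make Eqs. (5)-(6) concrete, here is a minimal NumPy sketch of repeated gradient-descent steps on a toy quadratic error; the function names and the toy J are illustrative additions, not part of the original slides.

```python
import numpy as np

# Minimal sketch of Eqs. (5)-(6): Delta w = -eta * dJ/dw, then w(m+1) = w(m) + Delta w(m).
# The quadratic J below is only a toy example; in the MLP the gradient comes
# from back-propagation instead.
def gradient_descent_step(w, grad_J, eta=0.1):
    return w - eta * grad_J(w)

w_star = np.array([1.0, -2.0])        # minimizer of the toy error
grad_J = lambda w: w - w_star         # gradient of J(w) = 0.5 * ||w - w_star||^2

w = np.zeros(2)
for m in range(100):                  # one step per presented "pattern"
    w = gradient_descent_step(w, grad_J)
print(w)                              # approaches w_star
```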
Multilayer Architecture
Multilayer Architecture: hidden–to-output weights
[Figure: multilayer architecture with input, hidden, and output layers and a target; the hidden-to-output weights are highlighted]
20 / 94
Observation about the activation function
Hidden Output is equal to
y_j = f\left( \sum_{i=1}^{d} w_{ji} x_i \right)
Output is equal to
z_k = f\left( \sum_{j=1}^{n_H} w_{kj} y_j \right)
21 / 94
Hidden–to-Output Weights
Error on the hidden–to-output weights
\frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial net_k} \cdot \frac{\partial net_k}{\partial w_{kj}} = -\delta_k \cdot \frac{\partial net_k}{\partial w_{kj}}   (7)
net_k
It describes how the overall error changes with the activation of the unit’s net:
net_k = \sum_{j=1}^{n_H} w_{kj} y_j = w_k^T \cdot y   (8)
Now
\delta_k = -\frac{\partial J}{\partial net_k} = -\frac{\partial J}{\partial z_k} \cdot \frac{\partial z_k}{\partial net_k} = (t_k - z_k) f'(net_k)   (9)
22 / 94
Hidden–to-Output Weights
Why?
z_k = f(net_k)   (10)
Thus
\frac{\partial z_k}{\partial net_k} = f'(net_k)   (11)
Since net_k = w_k^T \cdot y, therefore:
\frac{\partial net_k}{\partial w_{kj}} = y_j   (12)
23 / 94
Finally
The weight update (or learning rule) for the hidden-to-output weights is:
\Delta w_{kj} = \eta \delta_k y_j = \eta (t_k - z_k) f'(net_k) y_j   (13)
24 / 94
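As a sketch of how Eq. (13) can be applied in code, assuming sigmoid output units (so that f'(net_k) = z_k(1 - z_k)); the names W_ho, y, t and eta are illustrative, not from the slides.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def update_hidden_to_output(W_ho, y, t, eta=0.5):
    """W_ho: (c, n_H) weights, y: (n_H,) hidden outputs, t: (c,) targets."""
    net_k = W_ho @ y                          # net_k = w_k^T . y for every output unit
    z = sigmoid(net_k)                        # z_k = f(net_k)
    delta_k = (t - z) * z * (1.0 - z)         # delta_k = (t_k - z_k) f'(net_k), Eq. (9)
    W_ho = W_ho + eta * np.outer(delta_k, y)  # Delta w_kj = eta * delta_k * y_j, Eq. (13)
    return W_ho, delta_k
```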
Multi-Layer Architecture
Multi-Layer Architecture: Input–to-Hidden weights
[Figure: multilayer architecture with input, hidden, and output layers and a target; the input-to-hidden weights are highlighted]
26 / 94
Input–to-Hidden Weights
Error on the Input–to-Hidden weights
\frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial y_j} \cdot \frac{\partial y_j}{\partial net_j} \cdot \frac{\partial net_j}{\partial w_{ji}}   (14)
Thus
\frac{\partial J}{\partial y_j} = \frac{\partial}{\partial y_j} \left[ \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2 \right] = -\sum_{k=1}^{c} (t_k - z_k) \frac{\partial z_k}{\partial y_j} = -\sum_{k=1}^{c} (t_k - z_k) \frac{\partial z_k}{\partial net_k} \cdot \frac{\partial net_k}{\partial y_j} = -\sum_{k=1}^{c} (t_k - z_k) \frac{\partial f(net_k)}{\partial net_k} \cdot w_{kj}
27 / 94
Input–to-Hidden Weights
Finally
\frac{\partial J}{\partial y_j} = -\sum_{k=1}^{c} (t_k - z_k) f'(net_k) \cdot w_{kj}   (15)
Remember
\delta_k = -\frac{\partial J}{\partial net_k} = (t_k - z_k) f'(net_k)   (16)
28 / 94
What is \frac{\partial y_j}{\partial net_j}?
First
net_j = \sum_{i=1}^{d} w_{ji} x_i = w_j^T \cdot x   (17)
Then
y_j = f(net_j)
Then
\frac{\partial y_j}{\partial net_j} = \frac{\partial f(net_j)}{\partial net_j} = f'(net_j)
29 / 94
Then, we can define δj
By defining the sensitivity for a hidden unit:
\delta_j = f'(net_j) \sum_{k=1}^{c} w_{kj} \delta_k   (18)
Which means that:
“The sensitivity at a hidden unit is simply the sum of the individual sensitivities at the output units weighted by the hidden-to-output weights w_{kj}, all multiplied by f'(net_j).”
30 / 94
What about \frac{\partial net_j}{\partial w_{ji}}?
We have that
\frac{\partial net_j}{\partial w_{ji}} = \frac{\partial (w_j^T \cdot x)}{\partial w_{ji}} = \frac{\partial \sum_{i=1}^{d} w_{ji} x_i}{\partial w_{ji}} = x_i
31 / 94
Finally
The learning rule for the input-to-hidden weights is:
\Delta w_{ji} = \eta x_i \delta_j = \eta \left[ \sum_{k=1}^{c} w_{kj} \delta_k \right] f'(net_j) x_i   (19)
32 / 94
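A matching sketch for Eqs. (18)-(19), again assuming sigmoid hidden units so that f'(net_j) = y_j(1 - y_j); delta_k comes from the output-layer sketch above and all names are illustrative.

```python
import numpy as np

def update_input_to_hidden(W_ih, W_ho, x, y, delta_k, eta=0.5):
    """W_ih: (n_H, d), W_ho: (c, n_H), x: (d,), y: (n_H,) hidden outputs, delta_k: (c,)."""
    # delta_j = f'(net_j) * sum_k w_kj delta_k, with f'(net_j) = y_j (1 - y_j), Eq. (18)
    delta_j = y * (1.0 - y) * (W_ho.T @ delta_k)
    W_ih = W_ih + eta * np.outer(delta_j, x)    # Delta w_ji = eta * delta_j * x_i, Eq. (19)
    return W_ih
```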
Basically, the entire training process has the following steps
Initialization
Assuming that no prior information is available, pick the synaptic weights
and thresholds
Forward Computation
Compute the induced function signals of the network by proceeding
forward through the network, layer by layer.
Backward Computation
Compute the local gradients of the network.
Finally
Adjust the weights!!!
33 / 94
Now, Calculating Total Change
We have for that
The Total Training Error is the sum over the errors of the N individual patterns.
The Total Training Error
J = \sum_{p=1}^{N} J_p = \frac{1}{2} \sum_{p=1}^{N} \sum_{k=1}^{c} (t_k^p - z_k^p)^2 = \frac{1}{2} \sum_{p=1}^{N} \|t^p - z^p\|^2   (20)
35 / 94
About the Total Training Error
Remarks
A weight update may reduce the error on the single pattern being
presented but can increase the error on the full training set.
However, given a large number of such individual updates, the total
error of equation (20) decreases.
36 / 94
Now, we want the training to stop
Therefore
It is necessary to have a way to stop when the change in the weights is small enough!!!
A simple way to stop the training
The algorithm terminates when the change in the criterion function J(w) is smaller than some preset value Θ:
\Delta J(w) = |J(w(t + 1)) - J(w(t))| < \Theta   (21)
There are other stopping criteria that lead to better performance than this one.
38 / 94
Other Stopping Criteria
Norm of the Gradient
The back-propagation algorithm is considered to have converged when the Euclidean norm of the gradient vector reaches a sufficiently small gradient threshold:
\|\nabla_w J(m)\| < \Theta   (22)
Rate of change in the average error per epoch
The back-propagation algorithm is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small:
\frac{1}{N} \sum_{p=1}^{N} J_p < \Theta   (23)
39 / 94
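A small sketch of how the two tests in Eqs. (22)-(23) could be checked in code; grad_J (the full gradient vector) and J_per_pattern (the per-pattern errors of the current epoch) are assumed to be computed elsewhere, and the threshold values are arbitrary examples.

```python
import numpy as np

def should_stop(grad_J, J_per_pattern, theta_grad=1e-3, theta_err=1e-4):
    """Return True when either stopping criterion is met."""
    small_gradient = np.linalg.norm(grad_J) < theta_grad      # Eq. (22)
    small_avg_error = np.mean(J_per_pattern) < theta_err      # Eq. (23)
    return small_gradient or small_avg_error
```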
About the Stopping Criteria
Observations
1 Before training starts, the error on the training set is high. Through the learning process, the error becomes smaller.
2 The error per pattern depends on the amount of training data and the expressive power (such as the number of weights) of the network.
3 The average error on an independent test set is always higher than on the training set, and it can decrease as well as increase.
4 A validation set is used in order to decide when to stop training: we do not want to over-fit the network and reduce the generalization power of the classifier, so “we stop training at a minimum of the error on the validation set”.
40 / 94
Some More Terminology
Epoch
As with other types of backpropagation, ’learning’ is a supervised process
that occurs with each cycle or ’epoch’ through a forward activation flow of
outputs, and the backwards error propagation of weight adjustments.
In our case
I am using the batch sum of all correcting weights to define that epoch.
41 / 94
Final Basic Batch Algorithm
Perceptron(X)
1   Initialize: random w, number of hidden units n_H, number of outputs z, stopping criterion Θ, learning rate η, epoch m = 0
2   do
3       m = m + 1
4       for s = 1 to N
5           x(m) = X(:, s)
6           for k = 1 to c
7               δ_k = (t_k − z_k) f'(w_k^T · y)
8               for j = 1 to n_H
9                   net_j = w_j^T · x; y_j = f(net_j)
10                  w_kj(m) = w_kj(m) + η δ_k y_j(m)
11          for j = 1 to n_H
12              δ_j = f'(net_j) Σ_{k=1}^{c} w_kj δ_k
13              for i = 1 to d
14                  w_ji(m) = w_ji(m) + η δ_j x_i(m)
15  until ‖∇_w J(m)‖ < Θ
16  return w(m)
43 / 94
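The following NumPy sketch runs one epoch of the per-pattern updates listed above for a network with d inputs, n_H sigmoid hidden units and c sigmoid outputs; it computes the hidden outputs before the output sensitivities, and all names (X, T, W_ih, W_ho, eta) are illustrative rather than taken from the slides.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def train_epoch(X, T, W_ih, W_ho, eta=0.5):
    """X: (d, N) inputs, T: (c, N) targets, W_ih: (n_H, d), W_ho: (c, n_H)."""
    N = X.shape[1]
    for s in range(N):
        x, t = X[:, s], T[:, s]
        # Forward pass
        y = sigmoid(W_ih @ x)                          # hidden outputs y_j = f(w_j^T x)
        z = sigmoid(W_ho @ y)                          # network outputs z_k
        # Backward pass (sensitivities)
        delta_k = (t - z) * z * (1.0 - z)              # output deltas, Eq. (9)
        delta_j = y * (1.0 - y) * (W_ho.T @ delta_k)   # hidden deltas, Eq. (18)
        # Weight updates, Eqs. (13) and (19)
        W_ho += eta * np.outer(delta_k, y)
        W_ih += eta * np.outer(delta_j, x)
    return W_ih, W_ho
```

Calling train_epoch repeatedly until one of the stopping criteria of Eqs. (21)-(23) holds reproduces the outer loop of the pseudo-code.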
Example of Architecture to be used
Given the following Architecture and assuming N samples
45 / 94
Generating the output z_k
Given the input
X = \left[ x_1 \; x_2 \; \cdots \; x_N \right]   (24)
Where
x_i is a vector of features:
x_i = \begin{pmatrix} x_{1i} \\ x_{2i} \\ \vdots \\ x_{di} \end{pmatrix}   (25)
47 / 94
Therefore
We must have the following matrix for the input-to-hidden weights
W_{IH} = \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1d} \\ w_{21} & w_{22} & \cdots & w_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n_H 1} & w_{n_H 2} & \cdots & w_{n_H d} \end{pmatrix} = \begin{pmatrix} w_1^T \\ w_2^T \\ \vdots \\ w_{n_H}^T \end{pmatrix}   (26)
given that w_j = \begin{pmatrix} w_{j1} \\ w_{j2} \\ \vdots \\ w_{jd} \end{pmatrix}
Thus
We can create the net_j for all the inputs by simply
net_j = W_{IH} X = \begin{pmatrix} w_1^T x_1 & w_1^T x_2 & \cdots & w_1^T x_N \\ w_2^T x_1 & w_2^T x_2 & \cdots & w_2^T x_N \\ \vdots & \vdots & \ddots & \vdots \\ w_{n_H}^T x_1 & w_{n_H}^T x_2 & \cdots & w_{n_H}^T x_N \end{pmatrix}   (27)
48 / 94
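A short NumPy sketch of Eqs. (26)-(27): with X stored as a d × N matrix and W_IH as an n_H × d matrix, a single matrix product produces every net_j at once, and the element-wise activation of the next slide is then applied to the whole matrix. The shapes and names here are illustrative.

```python
import numpy as np
from scipy.special import expit   # numerically stable logistic function

d, n_H, N = 4, 3, 10
rng = np.random.default_rng(0)
X = rng.normal(size=(d, N))        # columns x_1, ..., x_N
W_IH = rng.normal(size=(n_H, d))   # rows w_1^T, ..., w_{n_H}^T

net_j = W_IH @ X                   # (n_H, N) matrix of w_j^T x_i, Eq. (27)
Y = expit(net_j)                   # element-wise f(net_j), as in Eq. (28)
```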
Now, we need to generate the y_k
We apply the activation function element by element in net_j:
y_1 = \begin{pmatrix} f(w_1^T x_1) & f(w_1^T x_2) & \cdots & f(w_1^T x_N) \\ f(w_2^T x_1) & f(w_2^T x_2) & \cdots & f(w_2^T x_N) \\ \vdots & \vdots & \ddots & \vdots \\ f(w_{n_H}^T x_1) & f(w_{n_H}^T x_2) & \cdots & f(w_{n_H}^T x_N) \end{pmatrix}   (28)
IMPORTANT about overflows!!!
Be careful about the numeric stability of the activation function.
In the case of Python, we can use the ones provided by scipy.special.
49 / 94
However, we can create a Sigmoid function
It is possible to use the following pseudo-code (a try-and-catch is used to catch the overflow):
Sigmoid(x)
1   try
2       return 1.0 / (1.0 + exp{−αx})
3   catch {OVERFLOW}
4       if x < 0
5           return 0
6       else
7           return 1
Here 1.0 refers to the floating point number (rationals trying to represent the reals).
50 / 94
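In Python, the overflow handled by the pseudo-code above can be avoided either with scipy.special.expit (the stable logistic the slide refers to) or by clipping the exponent by hand; α is the slope parameter and the clipping bound is an illustrative choice.

```python
import numpy as np
from scipy.special import expit

def sigmoid_scipy(x, alpha=1.0):
    # scipy.special.expit(v) computes 1 / (1 + exp(-v)) in a numerically stable way.
    return expit(alpha * x)

def sigmoid_clipped(x, alpha=1.0):
    # np.exp overflows for arguments above roughly 709 in double precision,
    # so clipping the exponent keeps the computation finite without try/except.
    v = np.clip(alpha * x, -500.0, 500.0)
    return 1.0 / (1.0 + np.exp(-v))
```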
For this, we get net_k
For this, we obtain the W_{HO}
W_{HO} = \left[ w_{11}^o \; w_{12}^o \; \cdots \; w_{1 n_H}^o \right] = w_o^T   (29)
Thus
net_k = \left[ w_{11}^o \; w_{12}^o \; \cdots \; w_{1 n_H}^o \right] \begin{pmatrix} f(w_1^T x_1) & f(w_1^T x_2) & \cdots & f(w_1^T x_N) \\ f(w_2^T x_1) & f(w_2^T x_2) & \cdots & f(w_2^T x_N) \\ \vdots & \vdots & \ddots & \vdots \\ f(w_{n_H}^T x_1) & f(w_{n_H}^T x_2) & \cdots & f(w_{n_H}^T x_N) \end{pmatrix}   (30)
where the columns of the matrix of hidden outputs are denoted y_{k1}, y_{k2}, . . . , y_{kN}.
In matrix notation
net_k = \left[ w_o^T y_{k1} \; w_o^T y_{k2} \; \cdots \; w_o^T y_{kN} \right]   (31)
52 / 94
Now, we have
Thus, we have z_k (in our case k = 1, but it could be a range of values)
z_k = \left[ f(w_o^T y_{k1}) \; f(w_o^T y_{k2}) \; \cdots \; f(w_o^T y_{kN}) \right]   (32)
Thus, we generate a vector of differences
d = t - z_k = \left[ t_1 - f(w_o^T y_{k1}) \;\; t_2 - f(w_o^T y_{k2}) \; \cdots \; t_N - f(w_o^T y_{kN}) \right]   (33)
where t = \left[ t_1 \; t_2 \; \cdots \; t_N \right] is a row vector of desired outputs for each sample.
54 / 94
Now, we multiply element wise
We have the following vector of derivatives of net:
D_f = \left[ \eta f'(w_o^T y_{k1}) \;\; \eta f'(w_o^T y_{k2}) \; \cdots \; \eta f'(w_o^T y_{kN}) \right]   (34)
where η is the step rate.
Finally, by element wise multiplication (Hadamard Product)
d = \left[ \eta \left( t_1 - f(w_o^T y_{k1}) \right) f'(w_o^T y_{k1}) \;\; \eta \left( t_2 - f(w_o^T y_{k2}) \right) f'(w_o^T y_{k2}) \; \cdots \; \eta \left( t_N - f(w_o^T y_{kN}) \right) f'(w_o^T y_{kN}) \right]
55 / 94
Tile d
Tile downward
d_{tile} = \begin{pmatrix} d \\ d \\ \vdots \\ d \end{pmatrix} \; (n_H \text{ rows})   (35)
Finally, we multiply element wise against y_1 (Hadamard Product)
\Delta w_{1j}^{temp} = y_1 \circ d_{tile}   (36)
56 / 94
We obtain the total ∆w_{1j}
We sum along the rows of \Delta w_{1j}^{temp}:
\Delta w_{1j} = \begin{pmatrix} \eta \left( t_1 - f(w_o^T y_{k1}) \right) f'(w_o^T y_{k1}) y_{11} + \cdots + \eta \left( t_N - f(w_o^T y_{kN}) \right) f'(w_o^T y_{kN}) y_{1N} \\ \vdots \\ \eta \left( t_1 - f(w_o^T y_{k1}) \right) f'(w_o^T y_{k1}) y_{n_H 1} + \cdots + \eta \left( t_N - f(w_o^T y_{kN}) \right) f'(w_o^T y_{kN}) y_{n_H N} \end{pmatrix}   (37)
where y_{hm} = f(w_h^T x_m) with h = 1, 2, ..., n_H and m = 1, 2, ..., N.
57 / 94
Finally, we update the first weights
We have then
W_{HO}(t + 1) = W_{HO}(t) + \Delta w_{1j}^T(t)   (38)
58 / 94
First
We multiply element wise the W_{HO} and ∆w_{1j}
T = \Delta w_{1j}^T \circ W_{HO}^T   (39)
Now, we obtain the element wise derivative of net_j
D_{net_j} = \begin{pmatrix} f'(w_1^T x_1) & f'(w_1^T x_2) & \cdots & f'(w_1^T x_N) \\ f'(w_2^T x_1) & f'(w_2^T x_2) & \cdots & f'(w_2^T x_N) \\ \vdots & \vdots & \ddots & \vdots \\ f'(w_{n_H}^T x_1) & f'(w_{n_H}^T x_2) & \cdots & f'(w_{n_H}^T x_N) \end{pmatrix}   (40)
60 / 94
Thus
We tile T to the right (N columns):
T_{tile} = \left[ T \; T \; \cdots \; T \right]   (41)
Now, we multiply element wise, together with η:
P_t = \eta \left( D_{net_j} \circ T_{tile} \right)   (42)
where η is a constant multiplied against the result of the Hadamard product (the result is an n_H × N matrix).
61 / 94
Finally
We use the transpose of X, which is an N × d matrix:
X^T = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix}   (43)
Finally, we get an n_H × d matrix
\Delta w_{ij} = P_t X^T   (44)
Thus, given W_{IH},
W_{IH}(t + 1) = W_{IH}(t) + \Delta w_{ij}(t)   (45)
62 / 94
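The whole matrix pipeline of the last slides can be written in a few NumPy lines. This sketch follows the standard batch gradient (η applied once per update, with broadcasting instead of explicit tiling), assumes sigmoid activations in both layers and a single output unit, and uses illustrative names: X is d × N, t is 1 × N, W_IH is n_H × d, W_HO is 1 × n_H.

```python
import numpy as np
from scipy.special import expit

def batch_update(X, t, W_IH, W_HO, eta=0.5):
    Y = expit(W_IH @ X)                                   # (n_H, N) hidden outputs, Eq. (28)
    Z = expit(W_HO @ Y)                                   # (1, N) network outputs, Eq. (32)
    D = (t - Z) * Z * (1.0 - Z)                           # output sensitivities, Eqs. (33)-(34)
    delta_W_HO = eta * (Y * D).sum(axis=1)[None, :]       # (1, n_H), sums of Eq. (37)
    back = W_HO.T * D                                     # (n_H, N): w_kj * delta_k per sample
    Pt = (Y * (1.0 - Y)) * back                           # hidden sensitivities, Eqs. (40)-(42)
    delta_W_IH = eta * (Pt @ X.T)                         # (n_H, d), Eq. (44)
    return W_HO + delta_W_HO, W_IH + delta_W_IH           # Eqs. (38) and (45)
```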
We have different activation functions
The two most important
1 Sigmoid function.
2 Hyperbolic tangent function
64 / 94
Logistic Function
This non-linear function has the following definition for a neuron j:
f_j(v_j(n)) = \frac{1}{1 + \exp\{-a v_j(n)\}}, \quad a > 0, \; -\infty < v_j(n) < \infty   (46)
Example
[Figure: plot of the logistic function]
65 / 94
The differential of the sigmoid function
Now, if we differentiate, we have:
f'_j(v_j(n)) = a \left[ \frac{1}{1 + \exp\{-a v_j(n)\}} \right] \left[ 1 - \frac{1}{1 + \exp\{-a v_j(n)\}} \right] = \frac{a \exp\{-a v_j(n)\}}{\left( 1 + \exp\{-a v_j(n)\} \right)^2}
66 / 94
The outputs finish as
For the output neurons
\delta_k = (t_k - z_k) f'(net_k) = a \left( t_k - f_k(v_k(n)) \right) f_k(v_k(n)) \left( 1 - f_k(v_k(n)) \right)
For the hidden neurons
\delta_j = a f_j(v_j(n)) \left( 1 - f_j(v_j(n)) \right) \sum_{k=1}^{c} w_{kj} \delta_k
67 / 94
Hyperbolic tangent function
Another commonly used form of sigmoidal nonlinearity is the hyperbolic tangent function:
f_j(v_j(n)) = a \tanh(b v_j(n))   (47)
Example
[Figure: plot of the hyperbolic tangent function]
68 / 94
The differential of the hyperbolic tangent
We have
f'_j(v_j(n)) = a b \, \mathrm{sech}^2(b v_j(n)) = a b \left( 1 - \tanh^2(b v_j(n)) \right)
BTW
I leave it to you to figure out the outputs.
69 / 94
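Both activation functions and their derivatives, as they appear in the δ formulas above, can be coded directly; the default a and b for the hyperbolic tangent are values often recommended in the literature (e.g. by LeCun), which is an assumption here, not something stated in the slides.

```python
import numpy as np

def logistic(v, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * v))

def logistic_prime(v, a=1.0):
    f = logistic(v, a)
    return a * f * (1.0 - f)                    # a f(v)(1 - f(v))

def tanh_act(v, a=1.7159, b=2.0 / 3.0):         # a, b: commonly suggested values
    return a * np.tanh(b * v)

def tanh_act_prime(v, a=1.7159, b=2.0 / 3.0):
    return a * b * (1.0 - np.tanh(b * v) ** 2)  # ab(1 - tanh^2(bv)), Eq. above
```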
Maximizing information content
Two ways of achieving this, LeCun 1993
The use of an example that results in the largest training error.
The use of an example that is radically different from all those previously used.
For this
Randomize the samples presented to the multilayer perceptron when not doing batch training.
Or use an emphasizing scheme
Use the error to identify the difficult vs. easy patterns, and use them to train the neural network.
71 / 94
However
Be careful with the emphasizing scheme
The distribution of examples within an epoch presented to the
network is distorted.
The presence of an outlier or a mislabeled example can have a
catastrophic consequence on the performance of the algorithm.
Definition of Outlier
An outlier is an observation that lies outside the overall pattern of a
distribution (Moore and McCabe 1999).
72 / 94
Activation Function
We say that
An activation function f (v) is antisymmetric if f (−v) = −f (v)
It seems to be
That the multilayer perceptron learns faster using an antisymmetric
function.
Example: The hyperbolic tangent
74 / 94
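A quick numerical check of the antisymmetry property, contrasting the hyperbolic tangent with the logistic sigmoid; the helper and its probe points are my own.

```python
import numpy as np

def is_antisymmetric(f, n_probes=1000, tol=1e-9):
    """Check f(-v) == -f(v) at random probe points."""
    v = np.random.default_rng(1).normal(size=n_probes)
    return np.allclose(f(-v), -f(v), atol=tol)

print(is_antisymmetric(np.tanh))                          # True
print(is_antisymmetric(lambda v: 1 / (1 + np.exp(-v))))   # False: the logistic sigmoid is not antisymmetric
```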
Target Values
Important
It is important that the target values be chosen within the range of the
sigmoid activation function.
Specifically
The desired response for each neuron in the output layer of the multilayer
perceptron should be offset by some amount ε away from the limiting value of the sigmoid.
76 / 94
For example
Given a limiting value ±a of the activation function
We have then
If we have a limiting value +a, we set t = a − ε.
If we have a limiting value −a, we set t = −a + ε.
77 / 94
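A small sketch of this offsetting, assuming ±1 class labels and limiting values ±a of the scaled tanh; the choice ε = 0.7159 (so the targets become roughly ±1) is one common convention, not something fixed by the slide.

```python
def offset_targets(labels, a=1.7159, eps=0.7159):
    """Map +/-1 labels to targets t = a - eps (positive class) or
    t = -a + eps (negative class), keeping them inside (-a, a)."""
    return [a - eps if y > 0 else -a + eps for y in labels]

print(offset_targets([+1, -1, +1]))   # approximately [1.0, -1.0, 1.0]
```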
Normalizing the inputs
Something Important (LeCun, 1993)
Each input variable should be preprocessed so that:
The mean value, averaged over the entire training set, is close to zero.
Or it is small compared to its standard deviation.
Example
79 / 94
The normalization must include two other measures
Uncorrelated
We can use principal component analysis
Example
80 / 94
In addition
Quite interesting
The decorrelated input variables should be scaled so that their covariances
are approximately equal.
Why
Ensuring that the different synaptic weights in the
network learn at approximately the same speed.
81 / 94
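Taken together, the three recommendations (mean close to zero, uncorrelated inputs, approximately equal covariances) amount to a whitening transform; the sketch below follows that interpretation, with all names my own.

```python
import numpy as np

def preprocess_inputs(X, eps=1e-12):
    """Zero-mean, PCA-decorrelate and equalize the covariances of X.

    X : (n_samples, n_features) raw training inputs.
    """
    Xc = X - X.mean(axis=0)                    # 1. mean close to zero
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    Xd = Xc @ eigvecs                          # 2. uncorrelated (PCA)
    return Xd / np.sqrt(eigvals + eps)         # 3. covariances ~ equal

# Correlated toy inputs end up with (approximately) the identity covariance.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])
print(np.cov(preprocess_inputs(X), rowvar=False).round(2))
```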
There are other heuristics
As
Initialization
Learning from hints
Learning rates
etc.
82 / 94
In addition
In section 4.15, Simon Haykin
We have the following techniques:
Network growing
You start with a small network and add neurons and layers to
accomplish the learning task.
Network pruning
Start with a large network, then prune weights that are not necessary in
an orderly fashion.
83 / 94
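As one illustration of the pruning idea, here is a naive magnitude-based sketch; Haykin's section discusses more principled criteria (e.g., complexity regularization, optimal brain damage/surgeon), so this simple rule is only a stand-in.

```python
import numpy as np

def prune_smallest(W, fraction=0.25):
    """Zero out the given fraction of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(W), fraction)
    return np.where(np.abs(W) < threshold, 0.0, W)

W = np.array([[0.01, -2.0], [0.5, -0.02]])
print(prune_smallest(W, fraction=0.5))   # the two smallest weights are zeroed
```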
Virtues and limitations of Back-Propagation Layer
Something Notable
The back-propagation algorithm has emerged as the most popular
algorithm for the training of multilayer perceptrons.
It has two distinct properties
It is simple to compute locally.
It performs stochastic gradient descent in weight space when doing
pattern-by-pattern training
85 / 94
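The second property, pattern-by-pattern training as stochastic gradient descent, can be sketched as the loop below; grad_J is a hypothetical single-pattern gradient function, not something defined in these slides.

```python
def sgd_epoch(w, patterns, targets, grad_J, eta=0.1):
    """One epoch of pattern-by-pattern (stochastic) gradient descent.

    Unlike batch training, the weights move after every single pattern,
    so the trajectory in weight space is stochastic.
    w and grad_J's return value are assumed to be NumPy arrays (or similar).
    """
    for x, t in zip(patterns, targets):
        w = w - eta * grad_J(w, x, t)   # local, per-pattern update
    return w
```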
Connectionism
Back-propagation
It is an example of a connectionist paradigm that relies on local
computations to discover the processing capabilities of neural networks.
This form of restriction
It is known as the locality constraint
86 / 94
Why this is advocated in Artificial Neural Networks
First
Artificial neural networks that perform local computations are often held
up as metaphors for biological neural networks.
Second
The use of local computations permits a graceful degradation in
performance due to hardware errors, and therefore provides the basis for a
fault-tolerant network design.
Third
Local computations favor the use of parallel architectures as an efficient
method for the implementation of artificial neural networks.
87 / 94
However, all this has been seriously questioned on the
following grounds (Shepherd, 1990b; Crick, 1989; Stork,
1989)
First
The reciprocal synaptic connections between the neurons of a
multilayer perceptron may assume weights that are excitatory or
inhibitory.
In the real nervous system, neurons usually appear to be the one or
the other.
Second
In a multilayer perceptron, hormonal and other types of global
communications are ignored.
88 / 94
However, all this has been seriously questioned on the
following grounds (Shepherd, 1990b; Crick, 1989; Stork,
1989)
Third
In back-propagation learning, a synaptic weight is modified by a
presynaptic activity and an error (learning) signal independent of
postsynaptic activity.
There is evidence from neurobiology to suggest otherwise.
Fourth
In a neurobiological sense, the implementation of back-propagation
learning requires the rapid transmission of information backward along
an axon.
It appears highly unlikely that such an operation actually takes place
in the brain.
89 / 94
However, all this has been seriously questioned on the
following grounds (Shepherd, 1990b; Crick, 1989; Stork,
1989)
Fifth
Back-propagation learning implies the existence of a "teacher," which
in the context of the brain would presumably be another set of
neurons with novel properties.
The existence of such neurons is biologically implausible.
90 / 94
Computational Efficiency
Something Notable
The computational complexity of an algorithm is usually measured in
terms of the number of multiplications, additions, and storage involved in
its implementation.
This is the electrical engineering approach!!!
Taking into account the total number of synapses W, including biases
We have $\Delta w_{kj} = \eta\,\delta_k\,y_j = \eta\,(t_k - z_k)\,f'(net_k)\,y_j$ (Backward Pass)
We have that for this step
1 We need to calculate net_k, which is linear in the number of weights.
2 We need to calculate y_j = f(net_j), which is linear in the number of
weights.
91 / 94
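A sketch of this hidden-to-output step in NumPy: the update is a single outer product, so the work is linear in the number of hidden-to-output weights. Names and shapes are my own.

```python
import numpy as np

def update_hidden_to_output(W_out, y, t, z, net_k, f_prime, eta=0.1):
    """Delta w_kj = eta (t_k - z_k) f'(net_k) y_j for all k, j at once.

    W_out : (n_out, n_hidden), y : (n_hidden,), t, z, net_k : (n_out,)
    """
    delta_k = (t - z) * f_prime(net_k)          # output-layer deltas
    return W_out + eta * np.outer(delta_k, y)   # one pass over the weights
```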
Computational Efficiency
Now the Forward Pass
$\Delta w_{ji} = \eta\,x_i\,\delta_j = \eta\,f'(net_j)\left[\sum_{k=1}^{c} w_{kj}\,\delta_k\right] x_i$
We have that for this step
Computing $\sum_{k=1}^{c} w_{kj}\,\delta_k$ takes, thanks to the previously calculated δk’s, time linear
in the number of weights.
Clearly all this requires memory
In addition, the derivatives of the activation functions must be calculated, but we
assume this takes constant time.
92 / 94
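And the corresponding sketch for the input-to-hidden weights: reusing the already computed δk’s keeps the whole sweep linear in the number of weights. Again the names are assumptions.

```python
import numpy as np

def update_input_to_hidden(W_hid, W_out, x, net_j, delta_k, f_prime, eta=0.1):
    """Delta w_ji = eta f'(net_j) (sum_k w_kj delta_k) x_i for all j, i.

    W_hid : (n_hidden, n_in), W_out : (n_out, n_hidden),
    x : (n_in,), net_j : (n_hidden,), delta_k : (n_out,)
    """
    delta_j = f_prime(net_j) * (W_out.T @ delta_k)   # reuse the output deltas
    return W_hid + eta * np.outer(delta_j, x)        # linear in the weights
```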
We have that
The complexity of the multi-layer perceptron is
O(W)
93 / 94
Exercises
We have from NN by Haykin
4.2, 4.3, 4.6, 4.8, 4.16, 4.17, 3.7
94 / 94

More Related Content

What's hot

Introduction to Deep Learning
Introduction to Deep LearningIntroduction to Deep Learning
Introduction to Deep LearningOswald Campesato
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural networkmustafa aadel
 
Introduction to Deep learning
Introduction to Deep learningIntroduction to Deep learning
Introduction to Deep learningleopauly
 
Deep Learning - CNN and RNN
Deep Learning - CNN and RNNDeep Learning - CNN and RNN
Deep Learning - CNN and RNNAshray Bhandare
 
Chap 8. Optimization for training deep models
Chap 8. Optimization for training deep modelsChap 8. Optimization for training deep models
Chap 8. Optimization for training deep modelsYoung-Geun Choi
 
Neural networks...
Neural networks...Neural networks...
Neural networks...Molly Chugh
 
Perceptron (neural network)
Perceptron (neural network)Perceptron (neural network)
Perceptron (neural network)EdutechLearners
 
Image Classification And Support Vector Machine
Image Classification And Support Vector MachineImage Classification And Support Vector Machine
Image Classification And Support Vector MachineShao-Chuan Wang
 
Convolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep LearningConvolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep LearningMohamed Loey
 
Image classification using CNN
Image classification using CNNImage classification using CNN
Image classification using CNNNoura Hussein
 
14 Machine Learning Single Layer Perceptron
14 Machine Learning Single Layer Perceptron14 Machine Learning Single Layer Perceptron
14 Machine Learning Single Layer PerceptronAndres Mendez-Vazquez
 
Deep Learning: Application & Opportunity
Deep Learning: Application & OpportunityDeep Learning: Application & Opportunity
Deep Learning: Application & OpportunityiTrain
 
Feed forward ,back propagation,gradient descent
Feed forward ,back propagation,gradient descentFeed forward ,back propagation,gradient descent
Feed forward ,back propagation,gradient descentMuhammad Rasel
 
Convolutional Neural Networks
Convolutional Neural NetworksConvolutional Neural Networks
Convolutional Neural NetworksAshray Bhandare
 

What's hot (20)

Introduction to Deep Learning
Introduction to Deep LearningIntroduction to Deep Learning
Introduction to Deep Learning
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural network
 
Introduction to Deep learning
Introduction to Deep learningIntroduction to Deep learning
Introduction to Deep learning
 
Introduction to Deep learning
Introduction to Deep learningIntroduction to Deep learning
Introduction to Deep learning
 
Deep learning
Deep learningDeep learning
Deep learning
 
Deep Learning - CNN and RNN
Deep Learning - CNN and RNNDeep Learning - CNN and RNN
Deep Learning - CNN and RNN
 
Chap 8. Optimization for training deep models
Chap 8. Optimization for training deep modelsChap 8. Optimization for training deep models
Chap 8. Optimization for training deep models
 
Deep learning ppt
Deep learning pptDeep learning ppt
Deep learning ppt
 
Neural networks...
Neural networks...Neural networks...
Neural networks...
 
Perceptron (neural network)
Perceptron (neural network)Perceptron (neural network)
Perceptron (neural network)
 
Perceptron
PerceptronPerceptron
Perceptron
 
Perceptron & Neural Networks
Perceptron & Neural NetworksPerceptron & Neural Networks
Perceptron & Neural Networks
 
Image Classification And Support Vector Machine
Image Classification And Support Vector MachineImage Classification And Support Vector Machine
Image Classification And Support Vector Machine
 
Convolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep LearningConvolutional Neural Network Models - Deep Learning
Convolutional Neural Network Models - Deep Learning
 
Deep learning
Deep learningDeep learning
Deep learning
 
Image classification using CNN
Image classification using CNNImage classification using CNN
Image classification using CNN
 
14 Machine Learning Single Layer Perceptron
14 Machine Learning Single Layer Perceptron14 Machine Learning Single Layer Perceptron
14 Machine Learning Single Layer Perceptron
 
Deep Learning: Application & Opportunity
Deep Learning: Application & OpportunityDeep Learning: Application & Opportunity
Deep Learning: Application & Opportunity
 
Feed forward ,back propagation,gradient descent
Feed forward ,back propagation,gradient descentFeed forward ,back propagation,gradient descent
Feed forward ,back propagation,gradient descent
 
Convolutional Neural Networks
Convolutional Neural NetworksConvolutional Neural Networks
Convolutional Neural Networks
 

Viewers also liked

Neural network and GA approaches for dwelling fire occurrence prediction
Neural network and GA approaches for dwelling fire occurrence predictionNeural network and GA approaches for dwelling fire occurrence prediction
Neural network and GA approaches for dwelling fire occurrence predictionPhisan Shukkhi
 
นาย จตุรพัฒน์ ภัควนิตย์
นาย จตุรพัฒน์ ภัควนิตย์นาย จตุรพัฒน์ ภัควนิตย์
นาย จตุรพัฒน์ ภัควนิตย์Pornchanida Ansietà
 
Time series data mining
Time series data miningTime series data mining
Time series data miningO-Ta Kung
 
The multilayer perceptron
The multilayer perceptronThe multilayer perceptron
The multilayer perceptronESCOM
 
03 neural network
03 neural network03 neural network
03 neural networkTianlu Wang
 
01 introduction to data mining
01 introduction to data mining01 introduction to data mining
01 introduction to data miningphakhwan22
 
โครงงานคอมพิวเตอร์ เรื่อง สมุนไพรรางจืด
โครงงานคอมพิวเตอร์ เรื่อง สมุนไพรรางจืดโครงงานคอมพิวเตอร์ เรื่อง สมุนไพรรางจืด
โครงงานคอมพิวเตอร์ เรื่อง สมุนไพรรางจืดRungnaree Uun
 
Multilayer perceptron
Multilayer perceptronMultilayer perceptron
Multilayer perceptronsmitamm
 
02.03 Artificial Intelligence: Search by Optimization
02.03 Artificial Intelligence: Search by Optimization02.03 Artificial Intelligence: Search by Optimization
02.03 Artificial Intelligence: Search by OptimizationAndres Mendez-Vazquez
 
Optimization Methods in Finance
Optimization Methods in FinanceOptimization Methods in Finance
Optimization Methods in Financethilankm
 
Perceptron Slides
Perceptron SlidesPerceptron Slides
Perceptron SlidesESCOM
 
Using Gradient Descent for Optimization and Learning
Using Gradient Descent for Optimization and LearningUsing Gradient Descent for Optimization and Learning
Using Gradient Descent for Optimization and LearningDr. Volkan OBAN
 

Viewers also liked (20)

Neural network and GA approaches for dwelling fire occurrence prediction
Neural network and GA approaches for dwelling fire occurrence predictionNeural network and GA approaches for dwelling fire occurrence prediction
Neural network and GA approaches for dwelling fire occurrence prediction
 
Thai Glossary Repository
Thai Glossary RepositoryThai Glossary Repository
Thai Glossary Repository
 
นาย จตุรพัฒน์ ภัควนิตย์
นาย จตุรพัฒน์ ภัควนิตย์นาย จตุรพัฒน์ ภัควนิตย์
นาย จตุรพัฒน์ ภัควนิตย์
 
Time series data mining
Time series data miningTime series data mining
Time series data mining
 
The multilayer perceptron
The multilayer perceptronThe multilayer perceptron
The multilayer perceptron
 
My topic
My topicMy topic
My topic
 
Microsoft Azure day 1
Microsoft Azure day 1Microsoft Azure day 1
Microsoft Azure day 1
 
03 neural network
03 neural network03 neural network
03 neural network
 
Neural network
Neural networkNeural network
Neural network
 
01 introduction to data mining
01 introduction to data mining01 introduction to data mining
01 introduction to data mining
 
โครงงานคอมพิวเตอร์ เรื่อง สมุนไพรรางจืด
โครงงานคอมพิวเตอร์ เรื่อง สมุนไพรรางจืดโครงงานคอมพิวเตอร์ เรื่อง สมุนไพรรางจืด
โครงงานคอมพิวเตอร์ เรื่อง สมุนไพรรางจืด
 
Multilayer perceptron
Multilayer perceptronMultilayer perceptron
Multilayer perceptron
 
02.03 Artificial Intelligence: Search by Optimization
02.03 Artificial Intelligence: Search by Optimization02.03 Artificial Intelligence: Search by Optimization
02.03 Artificial Intelligence: Search by Optimization
 
Optimization Methods in Finance
Optimization Methods in FinanceOptimization Methods in Finance
Optimization Methods in Finance
 
Media&tech2learn 001-Part 1
Media&tech2learn 001-Part 1Media&tech2learn 001-Part 1
Media&tech2learn 001-Part 1
 
Perceptron Slides
Perceptron SlidesPerceptron Slides
Perceptron Slides
 
07 classification 3 neural network
07 classification 3 neural network07 classification 3 neural network
07 classification 3 neural network
 
Nervous system
Nervous systemNervous system
Nervous system
 
Using Gradient Descent for Optimization and Learning
Using Gradient Descent for Optimization and LearningUsing Gradient Descent for Optimization and Learning
Using Gradient Descent for Optimization and Learning
 
Data mining
Data   miningData   mining
Data mining
 

Similar to 15 Machine Learning Multilayer Perceptron

Intro to Deep learning - Autoencoders
Intro to Deep learning - Autoencoders Intro to Deep learning - Autoencoders
Intro to Deep learning - Autoencoders Akash Goel
 
Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Gaurav Mittal
 
MPerceptron
MPerceptronMPerceptron
MPerceptronbutest
 
Deep learning from a novice perspective
Deep learning from a novice perspectiveDeep learning from a novice perspective
Deep learning from a novice perspectiveAnirban Santara
 
Introduction to deep learning
Introduction to deep learningIntroduction to deep learning
Introduction to deep learningJunaid Bhat
 
ML Module 3 Non Linear Learning.pptx
ML Module 3 Non Linear Learning.pptxML Module 3 Non Linear Learning.pptx
ML Module 3 Non Linear Learning.pptxDebabrataPain1
 
Backpropagation
BackpropagationBackpropagation
Backpropagationariffast
 
Basic Learning Algorithms of ANN
Basic Learning Algorithms of ANNBasic Learning Algorithms of ANN
Basic Learning Algorithms of ANNwaseem khan
 
2017 (albawi-alkabi)image-net classification with deep convolutional neural n...
2017 (albawi-alkabi)image-net classification with deep convolutional neural n...2017 (albawi-alkabi)image-net classification with deep convolutional neural n...
2017 (albawi-alkabi)image-net classification with deep convolutional neural n...ali hassan
 
Fundamental of deep learning
Fundamental of deep learningFundamental of deep learning
Fundamental of deep learningStanley Wang
 
Web Spam Classification Using Supervised Artificial Neural Network Algorithms
Web Spam Classification Using Supervised Artificial Neural Network AlgorithmsWeb Spam Classification Using Supervised Artificial Neural Network Algorithms
Web Spam Classification Using Supervised Artificial Neural Network Algorithmsaciijournal
 
Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...
Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...
Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...Simplilearn
 
Web spam classification using supervised artificial neural network algorithms
Web spam classification using supervised artificial neural network algorithmsWeb spam classification using supervised artificial neural network algorithms
Web spam classification using supervised artificial neural network algorithmsaciijournal
 
A Survey of Deep Learning Algorithms for Malware Detection
A Survey of Deep Learning Algorithms for Malware DetectionA Survey of Deep Learning Algorithms for Malware Detection
A Survey of Deep Learning Algorithms for Malware DetectionIJCSIS Research Publications
 

Similar to 15 Machine Learning Multilayer Perceptron (20)

MNN
MNNMNN
MNN
 
Intro to Deep learning - Autoencoders
Intro to Deep learning - Autoencoders Intro to Deep learning - Autoencoders
Intro to Deep learning - Autoencoders
 
Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)
 
MPerceptron
MPerceptronMPerceptron
MPerceptron
 
Deep learning from a novice perspective
Deep learning from a novice perspectiveDeep learning from a novice perspective
Deep learning from a novice perspective
 
Introduction to deep learning
Introduction to deep learningIntroduction to deep learning
Introduction to deep learning
 
ML Module 3 Non Linear Learning.pptx
ML Module 3 Non Linear Learning.pptxML Module 3 Non Linear Learning.pptx
ML Module 3 Non Linear Learning.pptx
 
Backpropagation
BackpropagationBackpropagation
Backpropagation
 
Basic Learning Algorithms of ANN
Basic Learning Algorithms of ANNBasic Learning Algorithms of ANN
Basic Learning Algorithms of ANN
 
Ffnn
FfnnFfnn
Ffnn
 
2017 (albawi-alkabi)image-net classification with deep convolutional neural n...
2017 (albawi-alkabi)image-net classification with deep convolutional neural n...2017 (albawi-alkabi)image-net classification with deep convolutional neural n...
2017 (albawi-alkabi)image-net classification with deep convolutional neural n...
 
Fundamental of deep learning
Fundamental of deep learningFundamental of deep learning
Fundamental of deep learning
 
Deep Learning
Deep LearningDeep Learning
Deep Learning
 
Web Spam Classification Using Supervised Artificial Neural Network Algorithms
Web Spam Classification Using Supervised Artificial Neural Network AlgorithmsWeb Spam Classification Using Supervised Artificial Neural Network Algorithms
Web Spam Classification Using Supervised Artificial Neural Network Algorithms
 
Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...
Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...
Deep Learning Interview Questions And Answers | AI & Deep Learning Interview ...
 
AINL 2016: Filchenkov
AINL 2016: FilchenkovAINL 2016: Filchenkov
AINL 2016: Filchenkov
 
Lec 6-bp
Lec 6-bpLec 6-bp
Lec 6-bp
 
Java and Deep Learning
Java and Deep LearningJava and Deep Learning
Java and Deep Learning
 
Web spam classification using supervised artificial neural network algorithms
Web spam classification using supervised artificial neural network algorithmsWeb spam classification using supervised artificial neural network algorithms
Web spam classification using supervised artificial neural network algorithms
 
A Survey of Deep Learning Algorithms for Malware Detection
A Survey of Deep Learning Algorithms for Malware DetectionA Survey of Deep Learning Algorithms for Malware Detection
A Survey of Deep Learning Algorithms for Malware Detection
 

More from Andres Mendez-Vazquez

01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectorsAndres Mendez-Vazquez
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issuesAndres Mendez-Vazquez
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiationAndres Mendez-Vazquez
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep LearningAndres Mendez-Vazquez
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learningAndres Mendez-Vazquez
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusAndres Mendez-Vazquez
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusAndres Mendez-Vazquez
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesAndres Mendez-Vazquez
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variationsAndres Mendez-Vazquez
 

More from Andres Mendez-Vazquez (20)

2.03 bayesian estimation
2.03 bayesian estimation2.03 bayesian estimation
2.03 bayesian estimation
 
05 linear transformations
05 linear transformations05 linear transformations
05 linear transformations
 
01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues
 
01.02 linear equations
01.02 linear equations01.02 linear equations
01.02 linear equations
 
01.01 vector spaces
01.01 vector spaces01.01 vector spaces
01.01 vector spaces
 
06 recurrent neural_networks
06 recurrent neural_networks06 recurrent neural_networks
06 recurrent neural_networks
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation
 
Zetta global
Zetta globalZetta global
Zetta global
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learning
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning Syllabus
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabus
 
Ideas 09 22_2018
Ideas 09 22_2018Ideas 09 22_2018
Ideas 09 22_2018
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data Sciences
 
Analysis of Algorithms Syllabus
Analysis of Algorithms  SyllabusAnalysis of Algorithms  Syllabus
Analysis of Algorithms Syllabus
 
20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations20 k-means, k-center, k-meoids and variations
20 k-means, k-center, k-meoids and variations
 
18.1 combining models
18.1 combining models18.1 combining models
18.1 combining models
 
17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension17 vapnik chervonenkis dimension
17 vapnik chervonenkis dimension
 
A basic introduction to learning
A basic introduction to learningA basic introduction to learning
A basic introduction to learning
 

Recently uploaded

the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSRajkumarAkumalla
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝soniya singh
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130Suhani Kapoor
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 

Recently uploaded (20)

the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
Model Call Girl in Narela Delhi reach out to us at 🔝8264348440🔝
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
VIP Call Girls Service Hitech City Hyderabad Call +91-8250192130
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 

15 Machine Learning Multilayer Perceptron

  • 1. Neural Networks Multilayer Perceptron Andres Mendez-Vazquez December 12, 2015 1 / 94
  • 2. Outline 1 Introduction The XOR Problem 2 Multi-Layer Perceptron Architecture Back-propagation Gradient Descent Hidden–to-Output Weights Input-to-Hidden Weights Total Training Error About Stopping Criteria Final Basic Batch Algorithm 3 Using Matrix Operations to Simplify Using Matrix Operations to Simplify the Pseudo-Code Generating the Output zk Generating zk Generating the Weights from Hidden to Output Layer Generating the Weights from Input to Hidden Layer Activation Functions 4 Heuristic for Multilayer Perceptron Maximizing information content Activation Function Target Values Normalizing the inputs Virtues and limitations of Back-Propagation Layer 2 / 94
  • 3. Outline 1 Introduction The XOR Problem 2 Multi-Layer Perceptron Architecture Back-propagation Gradient Descent Hidden–to-Output Weights Input-to-Hidden Weights Total Training Error About Stopping Criteria Final Basic Batch Algorithm 3 Using Matrix Operations to Simplify Using Matrix Operations to Simplify the Pseudo-Code Generating the Output zk Generating zk Generating the Weights from Hidden to Output Layer Generating the Weights from Input to Hidden Layer Activation Functions 4 Heuristic for Multilayer Perceptron Maximizing information content Activation Function Target Values Normalizing the inputs Virtues and limitations of Back-Propagation Layer 3 / 94
  • 4. Do you remember? The Perceptron has the following problem Given that the perceptron is a linear classifier It is clear that It will never be able to classify stuff that is not linearly separable 4 / 94
  • 5. Do you remember? The Perceptron has the following problem Given that the perceptron is a linear classifier It is clear that It will never be able to classify stuff that is not linearly separable 4 / 94
  • 6. Example: XOR Problem The Problem 0 1 1 Class 1 Class 2 5 / 94
  • 7. The Perceptron cannot solve it Because The perceptron is a linear classifier!!! Thus Something needs to be done!!! Maybe Add an extra layer!!! 6 / 94
  • 8. The Perceptron cannot solve it Because The perceptron is a linear classifier!!! Thus Something needs to be done!!! Maybe Add an extra layer!!! 6 / 94
  • 9. The Perceptron cannot solve it Because The perceptron is a linear classifier!!! Thus Something needs to be done!!! Maybe Add an extra layer!!! 6 / 94
  • 10. A little bit of history It was first cited by Vapnik Vapnik cites (Bryson, A.E.; W.F. Denham; S.E. Dreyfus. Optimal programming problems with inequality constraints. I: Necessary conditions for extremal solutions. AIAA J. 1, 11 (1963) 2544-2550) as the first publication of the backpropagation algorithm in his book "Support Vector Machines." It was first used by Arthur E. Bryson and Yu-Chi Ho described it as a multi-stage dynamic system optimization method in 1969. However It was not until 1974 and later, when applied in the context of neural networks and through the work of Paul Werbos, David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams that it gained recognition. 7 / 94
  • 11. A little bit of history It was first cited by Vapnik Vapnik cites (Bryson, A.E.; W.F. Denham; S.E. Dreyfus. Optimal programming problems with inequality constraints. I: Necessary conditions for extremal solutions. AIAA J. 1, 11 (1963) 2544-2550) as the first publication of the backpropagation algorithm in his book "Support Vector Machines." It was first used by Arthur E. Bryson and Yu-Chi Ho described it as a multi-stage dynamic system optimization method in 1969. However It was not until 1974 and later, when applied in the context of neural networks and through the work of Paul Werbos, David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams that it gained recognition. 7 / 94
  • 12. A little bit of history It was first cited by Vapnik Vapnik cites (Bryson, A.E.; W.F. Denham; S.E. Dreyfus. Optimal programming problems with inequality constraints. I: Necessary conditions for extremal solutions. AIAA J. 1, 11 (1963) 2544-2550) as the first publication of the backpropagation algorithm in his book "Support Vector Machines." It was first used by Arthur E. Bryson and Yu-Chi Ho described it as a multi-stage dynamic system optimization method in 1969. However It was not until 1974 and later, when applied in the context of neural networks and through the work of Paul Werbos, David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams that it gained recognition. 7 / 94
  • 13. Then Something Notable It led to a “renaissance” in the field of artificial neural network research. Nevertheless During the 2000s it fell out of favour but has returned again in the 2010s, now able to train much larger networks using huge modern computing power such as GPUs. 8 / 94
  • 14. Then Something Notable It led to a “renaissance” in the field of artificial neural network research. Nevertheless During the 2000s it fell out of favour but has returned again in the 2010s, now able to train much larger networks using huge modern computing power such as GPUs. 8 / 94
  • 15. Outline 1 Introduction The XOR Problem 2 Multi-Layer Perceptron Architecture Back-propagation Gradient Descent Hidden–to-Output Weights Input-to-Hidden Weights Total Training Error About Stopping Criteria Final Basic Batch Algorithm 3 Using Matrix Operations to Simplify Using Matrix Operations to Simplify the Pseudo-Code Generating the Output zk Generating zk Generating the Weights from Hidden to Output Layer Generating the Weights from Input to Hidden Layer Activation Functions 4 Heuristic for Multilayer Perceptron Maximizing information content Activation Function Target Values Normalizing the inputs Virtues and limitations of Back-Propagation Layer 9 / 94
  • 16. Multi-Layer Perceptron (MLP) Multi-Layer Architecture· Output Input Target Hidden Sigmoid Activation function Identity Activation function 10 / 94
  • 17. Information Flow We have the following information flow Function Signals Error Signals 11 / 94
  • 18. Explanation Problems with Hidden Layers 1 Increase complexity of Training 2 It is necessary to think about “Long and Narrow” network vs “Short and Fat” network. Intuition for a One Hidden Layer 1 For every input case of region, that region can be delimited by hyperplanes on all sides using hidden units on the first hidden layer. 2 A hidden unit in the second layer than ANDs them together to bound the region. Advantages It has been proven that an MLP with one hidden layer can learn any nonlinear function of the input. 12 / 94
  • 19. Explanation Problems with Hidden Layers 1 Increase complexity of Training 2 It is necessary to think about “Long and Narrow” network vs “Short and Fat” network. Intuition for a One Hidden Layer 1 For every input case of region, that region can be delimited by hyperplanes on all sides using hidden units on the first hidden layer. 2 A hidden unit in the second layer than ANDs them together to bound the region. Advantages It has been proven that an MLP with one hidden layer can learn any nonlinear function of the input. 12 / 94
  • 20. Explanation Problems with Hidden Layers 1 Increase complexity of Training 2 It is necessary to think about “Long and Narrow” network vs “Short and Fat” network. Intuition for a One Hidden Layer 1 For every input case of region, that region can be delimited by hyperplanes on all sides using hidden units on the first hidden layer. 2 A hidden unit in the second layer than ANDs them together to bound the region. Advantages It has been proven that an MLP with one hidden layer can learn any nonlinear function of the input. 12 / 94
  • 21. Explanation Problems with Hidden Layers 1 Increase complexity of Training 2 It is necessary to think about “Long and Narrow” network vs “Short and Fat” network. Intuition for a One Hidden Layer 1 For every input case of region, that region can be delimited by hyperplanes on all sides using hidden units on the first hidden layer. 2 A hidden unit in the second layer than ANDs them together to bound the region. Advantages It has been proven that an MLP with one hidden layer can learn any nonlinear function of the input. 12 / 94
  • 22. Explanation Problems with Hidden Layers 1 Increase complexity of Training 2 It is necessary to think about “Long and Narrow” network vs “Short and Fat” network. Intuition for a One Hidden Layer 1 For every input case of region, that region can be delimited by hyperplanes on all sides using hidden units on the first hidden layer. 2 A hidden unit in the second layer than ANDs them together to bound the region. Advantages It has been proven that an MLP with one hidden layer can learn any nonlinear function of the input. 12 / 94
  • 23. The Process We have something like this [Figure: regions delimited by hyperplanes in Layer 1, labeled (0,0,1), (1,0,0), (1,0,0), which are then combined in Layer 2] 13 / 94
  • 24. Outline 1 Introduction The XOR Problem 2 Multi-Layer Perceptron Architecture Back-propagation Gradient Descent Hidden–to-Output Weights Input-to-Hidden Weights Total Training Error About Stopping Criteria Final Basic Batch Algorithm 3 Using Matrix Operations to Simplify Using Matrix Operations to Simplify the Pseudo-Code Generating the Output zk Generating zk Generating the Weights from Hidden to Output Layer Generating the Weights from Input to Hidden Layer Activation Functions 4 Heuristic for Multilayer Perceptron Maximizing information content Activation Function Target Values Normalizing the inputs Virtues and limitations of Back-Propagation Layer 14 / 94
  • 25. Remember!!! The Quadratic Learning Error function Cost Function: our well-known error at pattern m, J(m) = (1/2) e_k^2(m) (1) Delta Rule or Widrow-Hoff Rule ∆w_kj(m) = −η e_k(m) x_j(m) (2) Actually, this is known as Gradient Descent: w_kj(m + 1) = w_kj(m) + ∆w_kj(m) (3) 15 / 94
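As an illustration of Eqs. (1)–(3), here is a minimal NumPy sketch of one delta-rule step for a single linear output unit. The names (w, x, t, eta) and the convention e = t − y are assumptions made for the example; with the error defined as y − t the update carries the minus sign shown in Eq. (2).

```python
import numpy as np

# One delta-rule (Widrow-Hoff) step for a single linear unit.
def delta_rule_step(w, x, t, eta=0.1):
    y = w @ x               # linear output for pattern m
    e = t - y               # error e_k(m), here defined as target minus output
    return w + eta * e * x  # w_kj(m + 1) = w_kj(m) + Delta w_kj(m)

w = np.zeros(3)
w = delta_rule_step(w, np.array([1.0, 0.5, -0.2]), t=1.0)
```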
  • 28. Back-propagation Setup Let t_k be the k-th target (or desired) output and z_k be the k-th computed output, with k = 1, . . . , c, and let w represent all the weights of the network. Training Error for a single Pattern or Sample!!! J(w) = (1/2) Σ_{k=1}^{c} (t_k − z_k)^2 = (1/2) ||t − z||^2 (4) 16 / 94
  • 30. Outline 1 Introduction The XOR Problem 2 Multi-Layer Perceptron Architecture Back-propagation Gradient Descent Hidden–to-Output Weights Input-to-Hidden Weights Total Training Error About Stopping Criteria Final Basic Batch Algorithm 3 Using Matrix Operations to Simplify Using Matrix Operations to Simplify the Pseudo-Code Generating the Output zk Generating zk Generating the Weights from Hidden to Output Layer Generating the Weights from Input to Hidden Layer Activation Functions 4 Heuristic for Multilayer Perceptron Maximizing information content Activation Function Target Values Normalizing the inputs Virtues and limitations of Back-Propagation Layer 17 / 94
  • 31. Gradient Descent Gradient Descent The back-propagation learning rule is based on gradient descent. Reducing the Error The weights are initialized with pseudo-random values and are changed in a direction that will reduce the error: ∆w = −η ∂J/∂w (5) where η is the learning rate, which indicates the relative size of the change in weights: w(m + 1) = w(m) + ∆w(m) (6) where m indexes the m-th pattern presented 18 / 94
  • 34. Outline 1 Introduction The XOR Problem 2 Multi-Layer Perceptron Architecture Back-propagation Gradient Descent Hidden–to-Output Weights Input-to-Hidden Weights Total Training Error About Stopping Criteria Final Basic Batch Algorithm 3 Using Matrix Operations to Simplify Using Matrix Operations to Simplify the Pseudo-Code Generating the Output zk Generating zk Generating the Weights from Hidden to Output Layer Generating the Weights from Input to Hidden Layer Activation Functions 4 Heuristic for Multilayer Perceptron Maximizing information content Activation Function Target Values Normalizing the inputs Virtues and limitations of Back-Propagation Layer 19 / 94
  • 35. Multilayer Architecture Multilayer Architecture: hidden–to-output weights [Figure: the network with Input, Hidden, and Output layers; the hidden-to-output weights are highlighted, with the Target at the output] 20 / 94
  • 36. Observation about the activation function Hidden Output is equal to y_j = f( Σ_{i=1}^{d} w_ji x_i ) Output is equal to z_k = f( Σ_{j=1}^{n_H} w_kj y_j ) 21 / 94
  • 38. Hidden–to-Output Weights Error on the hidden–to-output weights ∂J/∂w_kj = (∂J/∂net_k) · (∂net_k/∂w_kj) = −δ_k · (∂net_k/∂w_kj) (7) net_k It describes how the overall error changes with the activation of the unit’s net: net_k = Σ_{j=1}^{n_H} w_kj y_j = w_k^T · y (8) Now δ_k = −∂J/∂net_k = −(∂J/∂z_k) · (∂z_k/∂net_k) = (t_k − z_k) f′(net_k) (9) 22 / 94
  • 41. Hidden–to-Output Weights Why? z_k = f(net_k) (10) Thus ∂z_k/∂net_k = f′(net_k) (11) Since net_k = w_k^T · y, therefore: ∂net_k/∂w_kj = y_j (12) 23 / 94
  • 44. Finally The weight update (or learning rule) for the hidden-to-output weights is: ∆w_kj = η δ_k y_j = η (t_k − z_k) f′(net_k) y_j (13) 24 / 94
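A minimal sketch of Eq. (13) for one pattern, assuming a logistic activation at the output so that f′(net_k) = z_k(1 − z_k); the array names (W_ko, y, t) are illustrative, not from the slides.

```python
import numpy as np
from scipy.special import expit  # numerically stable logistic sigmoid

# Hidden-to-output update of Eq. (13) for a single pattern.
# y: hidden activations (n_H,), W_ko: output weights (c, n_H), t: targets (c,).
def hidden_to_output_update(W_ko, y, t, eta=0.1):
    net_k = W_ko @ y                          # net_k = w_k^T y
    z = expit(net_k)                          # z_k = f(net_k)
    delta_k = (t - z) * z * (1.0 - z)         # (t_k - z_k) f'(net_k) for the logistic f
    W_ko = W_ko + eta * np.outer(delta_k, y)  # Delta w_kj = eta * delta_k * y_j
    return W_ko, delta_k
```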
  • 45. Outline 1 Introduction The XOR Problem 2 Multi-Layer Perceptron Architecture Back-propagation Gradient Descent Hidden–to-Output Weights Input-to-Hidden Weights Total Training Error About Stopping Criteria Final Basic Batch Algorithm 3 Using Matrix Operations to Simplify Using Matrix Operations to Simplify the Pseudo-Code Generating the Output zk Generating zk Generating the Weights from Hidden to Output Layer Generating the Weights from Input to Hidden Layer Activation Functions 4 Heuristic for Multilayer Perceptron Maximizing information content Activation Function Target Values Normalizing the inputs Virtues and limitations of Back-Propagation Layer 25 / 94
  • 46. Multi-Layer Architecture Multi-Layer Architecture: Input–to-Hidden weights [Figure: the network with Input, Hidden, and Output layers; the input-to-hidden weights are highlighted, with the Target at the output] 26 / 94
  • 47. Input–to-Hidden Weights Error on the Input–to-Hidden weights ∂J/∂w_ji = (∂J/∂y_j) · (∂y_j/∂net_j) · (∂net_j/∂w_ji) (14) Thus ∂J/∂y_j = ∂/∂y_j [ (1/2) Σ_{k=1}^{c} (t_k − z_k)^2 ] = −Σ_{k=1}^{c} (t_k − z_k) ∂z_k/∂y_j = −Σ_{k=1}^{c} (t_k − z_k) (∂z_k/∂net_k) · (∂net_k/∂y_j) = −Σ_{k=1}^{c} (t_k − z_k) (∂f(net_k)/∂net_k) · w_kj 27 / 94
  • 49. Input–to-Hidden Weights Finally ∂J/∂y_j = −Σ_{k=1}^{c} (t_k − z_k) f′(net_k) · w_kj (15) Remember δ_k = −∂J/∂net_k = (t_k − z_k) f′(net_k) (16) 28 / 94
  • 50. What is ∂y_j/∂net_j? First net_j = Σ_{i=1}^{d} w_ji x_i = w_j^T · x (17) Then y_j = f(net_j) Then ∂y_j/∂net_j = ∂f(net_j)/∂net_j = f′(net_j) 29 / 94
  • 53. Then, we can define δ_j By defining the sensitivity for a hidden unit: δ_j = f′(net_j) Σ_{k=1}^{c} w_kj δ_k (18) Which means that: “The sensitivity at a hidden unit is simply the sum of the individual sensitivities at the output units weighted by the hidden-to-output weights w_kj, all multiplied by f′(net_j)” 30 / 94
  • 56. What about ∂net_j/∂w_ji? We have that ∂net_j/∂w_ji = ∂(w_j^T · x)/∂w_ji = ∂( Σ_{i=1}^{d} w_ji x_i )/∂w_ji = x_i 31 / 94
  • 57. Finally The learning rule for the input-to-hidden weights is: ∆w_ji = η x_i δ_j = η [ Σ_{k=1}^{c} w_kj δ_k ] f′(net_j) x_i (19) 32 / 94
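And the companion sketch of Eq. (19) for one pattern, again assuming logistic hidden units so that f′(net_j) = y_j(1 − y_j); delta_k is assumed to come from an output-layer step such as the one above.

```python
import numpy as np
from scipy.special import expit

# Input-to-hidden update of Eq. (19) for a single pattern.
# x: input (d,), W_ih: (n_H, d), W_ko: (c, n_H), delta_k: output sensitivities (c,).
def input_to_hidden_update(W_ih, W_ko, x, delta_k, eta=0.1):
    net_j = W_ih @ x                              # net_j = w_j^T x
    y = expit(net_j)                              # y_j = f(net_j)
    delta_j = y * (1.0 - y) * (W_ko.T @ delta_k)  # f'(net_j) * sum_k w_kj delta_k
    return W_ih + eta * np.outer(delta_j, x)      # Delta w_ji = eta * delta_j * x_i
```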
  • 58. Basically, the entire training process has the following steps Initialization Assuming that no prior information is available, pick the synaptic weights and thresholds Forward Computation Compute the induced function signals of the network by proceeding forward through the network, layer by layer. Backward Computation Compute the local gradients of the network. Finally Adjust the weights!!! 33 / 94
  • 62. Outline 1 Introduction The XOR Problem 2 Multi-Layer Perceptron Architecture Back-propagation Gradient Descent Hidden–to-Output Weights Input-to-Hidden Weights Total Training Error About Stopping Criteria Final Basic Batch Algorithm 3 Using Matrix Operations to Simplify Using Matrix Operations to Simplify the Pseudo-Code Generating the Output zk Generating zk Generating the Weights from Hidden to Output Layer Generating the Weights from Input to Hidden Layer Activation Functions 4 Heuristic for Multilayer Perceptron Maximizing information content Activation Function Target Values Normalizing the inputs Virtues and limitations of Back-Propagation Layer 34 / 94
  • 63. Now, Calculating Total Change We have that the Total Training Error is the sum over the errors of the N individual patterns The Total Training Error J = Σ_{p=1}^{N} J_p = (1/2) Σ_{p=1}^{N} Σ_{k=1}^{c} (t_k^p − z_k^p)^2 = (1/2) Σ_{p=1}^{N} ||t^p − z^p||^2 (20) 35 / 94
  • 65. About the Total Training Error Remarks A weight update may reduce the error on the single pattern being presented but can increase the error on the full training set. However, given a large number of such individual updates, the total error of equation (20) decreases. 36 / 94
  • 67. Outline 1 Introduction The XOR Problem 2 Multi-Layer Perceptron Architecture Back-propagation Gradient Descent Hidden–to-Output Weights Input-to-Hidden Weights Total Training Error About Stopping Criteria Final Basic Batch Algorithm 3 Using Matrix Operations to Simplify Using Matrix Operations to Simplify the Pseudo-Code Generating the Output zk Generating zk Generating the Weights from Hidden to Output Layer Generating the Weights from Input to Hidden Layer Activation Functions 4 Heuristic for Multilayer Perceptron Maximizing information content Activation Function Target Values Normalizing the inputs Virtues and limitations of Back-Propagation Layer 37 / 94
  • 68. Now, we want the training to stop Therefore It is necessary to have a way to stop when the change in the weights is small enough!!! A simple way to stop the training The algorithm terminates when the change in the criterion function J(w) is smaller than some preset value Θ: ∆J(w) = |J(w(t + 1)) − J(w(t))| < Θ (21) There are other stopping criteria that lead to better performance than this one. 38 / 94
  • 72. Other Stopping Criteria Norm of the Gradient The back-propagation algorithm is considered to have converged when the Euclidean norm of the gradient vector reaches a sufficiently small gradient threshold: ||∇_w J(m)|| < Θ (22) Rate of change in the average error per epoch The back-propagation algorithm is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small: (1/N) Σ_{p=1}^{N} J_p < Θ (23) 39 / 94
  • 74. About the Stopping Criteria Observations 1 Before training starts, the error on the training set is high. Through the learning process, the error becomes smaller. 2 The error per pattern depends on the amount of training data and the expressive power (such as the number of weights) of the network. 3 The average error on an independent test set is always higher than on the training set, and it can decrease as well as increase. 4 A validation set is used in order to decide when to stop training. We do not want to over-fit the network and decrease the generalization power of the classifier: “we stop training at a minimum of the error on the validation set” 40 / 94
  • 80. Some More Terminology Epoch As with other types of backpropagation, ‘learning’ is a supervised process that occurs with each cycle or ‘epoch’: a forward activation flow producing the outputs, followed by the backward propagation of errors and weight adjustments. In our case, the batch sum of all weight corrections defines one epoch. 41 / 94
  • 82. Outline 1 Introduction The XOR Problem 2 Multi-Layer Perceptron Architecture Back-propagation Gradient Descent Hidden–to-Output Weights Input-to-Hidden Weights Total Training Error About Stopping Criteria Final Basic Batch Algorithm 3 Using Matrix Operations to Simplify Using Matrix Operations to Simplify the Pseudo-Code Generating the Output zk Generating zk Generating the Weights from Hidden to Output Layer Generating the Weights from Input to Hidden Layer Activation Functions 4 Heuristic for Multilayer Perceptron Maximizing information content Activation Function Target Values Normalizing the inputs Virtues and limitations of Back-Propagation Layer 42 / 94
  • 83. Final Basic Batch Algorithm

Perceptron(X)
    Initialize random w, number of hidden units n_H, number of outputs c,
        stopping criterion Θ, learning rate η, epoch m = 0
    do
        m = m + 1
        for s = 1 to N
            x = X(:, s)
            for j = 1 to n_H                        // forward pass through the hidden layer
                net_j = w_j^T · x;  y_j = f(net_j)
            for k = 1 to c                          // output sensitivities and hidden-to-output update
                z_k = f(w_k^T · y);  δ_k = (t_k − z_k) f′(w_k^T · y)
                for j = 1 to n_H
                    w_kj(m) = w_kj(m) + η δ_k y_j
            for j = 1 to n_H                        // hidden sensitivities and input-to-hidden update
                δ_j = f′(net_j) Σ_{k=1}^{c} w_kj δ_k
                for i = 1 to d
                    w_ji(m) = w_ji(m) + η δ_j x_i
    until ||∇_w J(m)|| < Θ
    return w(m)

43 / 94
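Below is a hedged Python sketch of the listing above for a one-hidden-layer MLP with logistic activations in both layers. Everything not in the listing (the data layout X, T with one pattern per column, the max_epochs safeguard, the approximate gradient-norm test) is an assumption made for the example.

```python
import numpy as np
from scipy.special import expit

def train_mlp(X, T, n_H=4, eta=0.5, theta=1e-4, max_epochs=5000, seed=0):
    # X: d x N inputs (one column per pattern), T: c x N targets.
    d, N = X.shape
    c = T.shape[0]
    rng = np.random.default_rng(seed)
    W_ih = rng.normal(scale=0.1, size=(n_H, d))   # input-to-hidden weights
    W_ho = rng.normal(scale=0.1, size=(c, n_H))   # hidden-to-output weights

    for epoch in range(max_epochs):
        grad_sq = 0.0
        for s in range(N):                         # pattern-by-pattern presentation
            x, t = X[:, s], T[:, s]
            y = expit(W_ih @ x)                    # forward pass: hidden layer
            z = expit(W_ho @ y)                    # forward pass: output layer
            delta_k = (t - z) * z * (1.0 - z)      # output sensitivities
            delta_j = y * (1.0 - y) * (W_ho.T @ delta_k)   # hidden sensitivities
            dW_ho = np.outer(delta_k, y)
            dW_ih = np.outer(delta_j, x)
            W_ho += eta * dW_ho
            W_ih += eta * dW_ih
            grad_sq += np.sum(dW_ho ** 2) + np.sum(dW_ih ** 2)
        if np.sqrt(grad_sq) < theta:               # rough version of ||grad J|| < Theta
            break
    return W_ih, W_ho

# Example usage on the XOR data from the introduction.
X = np.array([[0.0, 0.0, 1.0, 1.0], [0.0, 1.0, 0.0, 1.0]])
T = np.array([[0.0, 1.0, 1.0, 0.0]])
W_ih, W_ho = train_mlp(X, T)
```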
  • 101. Outline 1 Introduction The XOR Problem 2 Multi-Layer Perceptron Architecture Back-propagation Gradient Descent Hidden–to-Output Weights Input-to-Hidden Weights Total Training Error About Stopping Criteria Final Basic Batch Algorithm 3 Using Matrix Operations to Simplify Using Matrix Operations to Simplify the Pseudo-Code Generating the Output zk Generating zk Generating the Weights from Hidden to Output Layer Generating the Weights from Input to Hidden Layer Activation Functions 4 Heuristic for Multilayer Perceptron Maximizing information content Activation Function Target Values Normalizing the inputs Virtues and limitations of Back-Propagation Layer 44 / 94
  • 102. Example of Architecture to be used Given the following Architecture and assuming N samples 45 / 94
  • 103. Outline 1 Introduction The XOR Problem 2 Multi-Layer Perceptron Architecture Back-propagation Gradient Descent Hidden–to-Output Weights Input-to-Hidden Weights Total Training Error About Stopping Criteria Final Basic Batch Algorithm 3 Using Matrix Operations to Simplify Using Matrix Operations to Simplify the Pseudo-Code Generating the Output zk Generating zk Generating the Weights from Hidden to Output Layer Generating the Weights from Input to Hidden Layer Activation Functions 4 Heuristic for Multilayer Perceptron Maximizing information content Activation Function Target Values Normalizing the inputs Virtues and limitations of Back-Propagation Layer 46 / 94
  • 104. Generating the output z_k Given the input X = [x_1 x_2 · · · x_N] (24) where x_i is a vector of features, x_i = (x_1i, x_2i, . . . , x_di)^T (25) 47 / 94
  • 106. Therefore We must have the following matrix for the input-to-hidden weights: W_IH is the n_H × d matrix whose rows are w_1^T, w_2^T, . . . , w_nH^T, with entries w_ji (26), given that w_j = (w_j1, w_j2, . . . , w_jd)^T. Thus We can create net_j for all the inputs by simply net_j = W_IH X, the n_H × N matrix whose (j, i) entry is w_j^T x_i (27) 48 / 94
  • 108. Now, we need to generate the y_k We apply the activation function element by element to net_j: y_1 = f(net_j), the n_H × N matrix whose (j, i) entry is f(w_j^T x_i) (28) IMPORTANT about overflows!!! Be careful about the numeric stability of the activation function. In the case of Python, we can use the implementations provided by scipy.special 49 / 94
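A small NumPy sketch of Eqs. (26)–(28): with one pattern per column, the hidden-layer nets and activations for all N samples come out of a single matrix product; scipy.special.expit is used as the overflow-safe logistic mentioned above, and the sizes d, N, n_H are arbitrary choices for the example.

```python
import numpy as np
from scipy.special import expit   # overflow-safe logistic sigmoid

rng = np.random.default_rng(0)
d, N, n_H = 3, 5, 4                           # arbitrary sizes for the example
X = rng.normal(size=(d, N))                   # one pattern per column, Eq. (24)
W_IH = rng.normal(scale=0.1, size=(n_H, d))   # rows are w_j^T, Eq. (26)
net_j = W_IH @ X                              # (j, i) entry is w_j^T x_i, Eq. (27)
y1 = expit(net_j)                             # activation applied element by element, Eq. (28)
```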
  • 111. However, We can create a Sigmoid function It is possible to use the following pseudo-code, where a try-and-catch is used to catch the overflow:

Sigmoid(x)
    try
        return 1.0 / (1.0 + exp{−αx})
    catch OVERFLOW
        if x < 0
            return 0
        else
            return 1

Here 1.0 refers to the floating point value (rationals trying to represent reals) 50 / 94
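A direct Python rendering of the pseudo-code above; alpha is the slope parameter of the sigmoid, and the saturation values 0 and 1 follow the pseudo-code. In vectorized code, scipy.special.expit achieves the same effect without the explicit try/except.

```python
import math

def sigmoid(x, alpha=1.0):
    try:
        return 1.0 / (1.0 + math.exp(-alpha * x))
    except OverflowError:              # exp overflowed: saturate the output
        return 0.0 if x < 0 else 1.0
```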
  • 115. Outline 1 Introduction The XOR Problem 2 Multi-Layer Perceptron Architecture Back-propagation Gradient Descent Hidden–to-Output Weights Input-to-Hidden Weights Total Training Error About Stopping Criteria Final Basic Batch Algorithm 3 Using Matrix Operations to Simplify Using Matrix Operations to Simplify the Pseudo-Code Generating the Output zk Generating zk Generating the Weights from Hidden to Output Layer Generating the Weights from Input to Hidden Layer Activation Functions 4 Heuristic for Multilayer Perceptron Maximizing information content Activation Function Target Values Normalizing the inputs Virtues and limitations of Back-Propagation Layer 51 / 94
  • 116. For this, we get net_k For this, we obtain W_HO W_HO = (w^o_11 w^o_12 · · · w^o_1nH) = w_o^T (29) Thus net_k = W_HO · y_1, i.e. the 1 × n_H row multiplied by the n_H × N matrix of hidden outputs f(w_j^T x_i), whose columns we denote y_k1, y_k2, · · · , y_kN (30) In matrix notation net_k = (w_o^T y_k1, w_o^T y_k2, · · · , w_o^T y_kN) (31) 52 / 94
  • 119. Outline 1 Introduction The XOR Problem 2 Multi-Layer Perceptron Architecture Back-propagation Gradient Descent Hidden–to-Output Weights Input-to-Hidden Weights Total Training Error About Stopping Criteria Final Basic Batch Algorithm 3 Using Matrix Operations to Simplify Using Matrix Operations to Simplify the Pseudo-Code Generating the Output zk Generating zk Generating the Weights from Hidden to Output Layer Generating the Weights from Input to Hidden Layer Activation Functions 4 Heuristic for Multilayer Perceptron Maximizing information content Activation Function Target Values Normalizing the inputs Virtues and limitations of Back-Propagation Layer 53 / 94
  • 120. Now, we have Thus, we have z_k (in our case k = 1, but it could be a range of values): z_k = (f(w_o^T y_k1), f(w_o^T y_k2), · · · , f(w_o^T y_kN)) (32) Thus, we generate a vector of differences d = t − z_k = (t_1 − f(w_o^T y_k1), t_2 − f(w_o^T y_k2), · · · , t_N − f(w_o^T y_kN)) (33) where t = (t_1, t_2, · · · , t_N) is a row vector of desired outputs, one per sample. 54 / 94
  • 122. Now, we multiply element wise We have the following vector of derivatives of net: Df = (η f′(w_o^T y_k1), η f′(w_o^T y_k2), · · · , η f′(w_o^T y_kN)) (34) where η is the learning rate (step size). Finally, by element-wise multiplication (Hadamard Product): d = (η (t_1 − f(w_o^T y_k1)) f′(w_o^T y_k1), η (t_2 − f(w_o^T y_k2)) f′(w_o^T y_k2), · · · , η (t_N − f(w_o^T y_kN)) f′(w_o^T y_kN)) 55 / 94
  • 124. Tile d Tile downward: d_tile is the n_H × N matrix obtained by stacking n_H copies of the row d (35) Finally, we multiply element wise against y_1 (Hadamard Product): ∆w^temp_1j = y_1 ◦ d_tile (36) 56 / 94
  • 126. We obtain the total ∆w_1j We sum along the rows of ∆w^temp_1j: the h-th entry of ∆w_1j is Σ_{m=1}^{N} η (t_m − f(w_o^T y_km)) f′(w_o^T y_km) y_hm (37) where y_hm = f(w_h^T x_m), with h = 1, 2, ..., n_H and m = 1, 2, ..., N. 57 / 94
  • 127. Finally, we update the first weights We have then W_HO(t + 1) = W_HO(t) + ∆w_1j^T(t) (38) 58 / 94
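A NumPy sketch of the whole hidden-to-output batch update, Eqs. (29)–(38), for a single output unit with a logistic activation; NumPy broadcasting plays the role of the explicit tiling of Eq. (35). The array names (W_HO as a 1 × n_H row, Y as the n_H × N matrix of hidden activations, t as a 1 × N row of targets) are assumptions for the example.

```python
import numpy as np
from scipy.special import expit

def output_layer_batch_update(W_HO, Y, t, eta=0.1):
    net_k = W_HO @ Y                    # 1 x N row of w_o^T y_i, Eq. (31)
    z_k = expit(net_k)                  # outputs, Eq. (32)
    diff = t - z_k                      # vector of differences, Eq. (33)
    Df = eta * z_k * (1.0 - z_k)        # eta * f'(net_k) for the logistic f, Eq. (34)
    d = diff * Df                       # element-wise (Hadamard) product
    dW_temp = Y * d                     # broadcasting replaces the tiling of Eqs. (35)-(36)
    dW = dW_temp.sum(axis=1)            # sum along the rows, Eq. (37)
    return W_HO + dW[np.newaxis, :], d  # Eq. (38), plus d for the next layer's update
```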
  • 128. Outline 1 Introduction The XOR Problem 2 Multi-Layer Perceptron Architecture Back-propagation Gradient Descent Hidden–to-Output Weights Input-to-Hidden Weights Total Training Error About Stopping Criteria Final Basic Batch Algorithm 3 Using Matrix Operations to Simplify Using Matrix Operations to Simplify the Pseudo-Code Generating the Output zk Generating zk Generating the Weights from Hidden to Output Layer Generating the Weights from Input to Hidden Layer Activation Functions 4 Heuristic for Multilayer Perceptron Maximizing information content Activation Function Target Values Normalizing the inputs Virtues and limitations of Back-Propagation Layer 59 / 94
  • 129. First We multiply element wise W_HO and ∆w_1j: T = ∆w_1j^T ◦ W_HO^T (39) Now, we obtain the element-wise derivative of net_j: Dnet_j is the n_H × N matrix whose (j, i) entry is f′(w_j^T x_i) (40) 60 / 94
  • 131. Thus We tile T to the right: T_tile = (T T · · · T), N columns (41) Now, we multiply element wise, together with η: Pt = η (Dnet_j ◦ T_tile) (42) where η is a constant multiplying the result of the Hadamard product (the result is an n_H × N matrix) 61 / 94
  • 133. Finally We use the transpose of X, which is an N × d matrix whose rows are x_1^T, x_2^T, . . . , x_N^T (43) Finally, we get an n_H × d matrix: ∆w_ij = Pt X^T (44) Thus, given W_IH: W_IH(t + 1) = W_IH(t) + ∆w_ij(t) (45) 62 / 94
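A companion NumPy sketch for the input-to-hidden side. Rather than reusing the summed update ∆w_1j as in the tiling recipe, it back-propagates the per-pattern quantities d (one column per sample, with η already folded in), which is the standard vectorized form of the per-pattern rule in Eq. (19); the array names are again assumptions.

```python
import numpy as np
from scipy.special import expit

def input_layer_batch_update(W_IH, W_HO, X, d):
    # X: d x N inputs, W_IH: n_H x d, W_HO: 1 x n_H, d: 1 x N from the output-layer step.
    Yh = expit(W_IH @ X)            # hidden activations, n_H x N
    Dnet_j = Yh * (1.0 - Yh)        # element-wise f'(net_j), Eq. (40)
    Pt = Dnet_j * (W_HO.T @ d)      # back-propagated sensitivities (eta is already in d)
    dW_IH = Pt @ X.T                # Eq. (44): an n_H x d matrix
    return W_IH + dW_IH             # Eq. (45)
```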
  • 136. Outline 1 Introduction The XOR Problem 2 Multi-Layer Perceptron Architecture Back-propagation Gradient Descent Hidden–to-Output Weights Input-to-Hidden Weights Total Training Error About Stopping Criteria Final Basic Batch Algorithm 3 Using Matrix Operations to Simplify Using Matrix Operations to Simplify the Pseudo-Code Generating the Output zk Generating zk Generating the Weights from Hidden to Output Layer Generating the Weights from Input to Hidden Layer Activation Functions 4 Heuristic for Multilayer Perceptron Maximizing information content Activation Function Target Values Normalizing the inputs Virtues and limitations of Back-Propagation Layer 63 / 94
  • 137. We have different activation functions The two most important 1 Sigmoid function. 2 Hyperbolic tangent function 64 / 94
  • 139. Logistic Function This non-linear function has the following definition for a neuron j: f_j(v_j(n)) = 1 / (1 + exp{−a v_j(n)}), with a > 0 and −∞ < v_j(n) < ∞ (46) [Figure: example of the logistic function] 65 / 94
  • 141. The differential of the sigmoid function Now, if we differentiate, we have f′_j(v_j(n)) = a · [1 / (1 + exp{−a v_j(n)})] · [1 − 1 / (1 + exp{−a v_j(n)})] = a exp{−a v_j(n)} / (1 + exp{−a v_j(n)})^2 66 / 94
  • 143. The outputs finish as For the output neurons δ_k = (t_k − z_k) f′(net_k) = a (t_k − z_k) f_k(v_k(n)) (1 − f_k(v_k(n))) For the hidden neurons δ_j = a f_j(v_j(n)) (1 − f_j(v_j(n))) Σ_{k=1}^{c} w_kj δ_k 67 / 94
  • 146. Hyperbolic tangent function Another commonly used form of sigmoidal nonlinearity is the hyperbolic tangent function: f_j(v_j(n)) = a tanh(b v_j(n)) (47) [Figure: example of the hyperbolic tangent function] 68 / 94
  • 148. The differential of the hyperbolic tangent We have f′_j(v_j(n)) = a b sech^2(b v_j(n)) = a b (1 − tanh^2(b v_j(n))) BTW, I leave it to you to work out the corresponding output and hidden sensitivities. 69 / 94
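A small sketch of both activations and their derivatives, Eqs. (46)–(47). The default constants a = 1.7159 and b = 2/3 for the hyperbolic tangent are values commonly quoted in the literature, used here only as an assumption.

```python
import numpy as np

def logistic(v, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * v))

def logistic_prime(v, a=1.0):
    f = logistic(v, a)
    return a * f * (1.0 - f)                     # a f(v) (1 - f(v))

def tanh_act(v, a=1.7159, b=2.0 / 3.0):
    return a * np.tanh(b * v)

def tanh_act_prime(v, a=1.7159, b=2.0 / 3.0):
    return a * b * (1.0 - np.tanh(b * v) ** 2)   # a b sech^2(b v)
```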
  • 151. Outline 1 Introduction The XOR Problem 2 Multi-Layer Perceptron Architecture Back-propagation Gradient Descent Hidden–to-Output Weights Input-to-Hidden Weights Total Training Error About Stopping Criteria Final Basic Batch Algorithm 3 Using Matrix Operations to Simplify Using Matrix Operations to Simplify the Pseudo-Code Generating the Output zk Generating zk Generating the Weights from Hidden to Output Layer Generating the Weights from Input to Hidden Layer Activation Functions 4 Heuristic for Multilayer Perceptron Maximizing information content Activation Function Target Values Normalizing the inputs Virtues and limitations of Back-Propagation Layer 70 / 94
  • 152. Maximizing information content Two ways of achieving this, LeCun 1993 The use of an example that results in the largest training error. The use of an example that is radically different from all those previously used. For this Randomize the samples presented to the multilayer perceptron when not doing batch training. Or use an emphasizing scheme By using the error, identify the difficult vs. easy patterns and use them to train the neural network 71 / 94
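A tiny sketch of the randomization heuristic: reshuffle the order in which the N patterns are presented at every epoch of sequential training (the toy data and the epoch count are arbitrary assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 8
X = rng.normal(size=(d, N))                          # toy inputs, one pattern per column
T = rng.integers(0, 2, size=(1, N)).astype(float)    # toy targets
for epoch in range(10):
    order = rng.permutation(N)                       # a fresh random presentation order every epoch
    for s in order:
        x, t = X[:, s], T[:, s]
        # ... forward pass, back-propagation and weight update for pattern s ...
```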
  • 157. However Be careful with the emphasizing scheme The distribution of examples within an epoch presented to the network is distorted. The presence of an outlier or a mislabeled example can have a catastrophic consequence on the performance of the algorithm. Definition of Outlier An outlier is an observation that lies outside the overall pattern of a distribution (Moore and McCabe 1999). 72 / 94
  • 160. Outline 1 Introduction The XOR Problem 2 Multi-Layer Perceptron Architecture Back-propagation Gradient Descent Hidden–to-Output Weights Input-to-Hidden Weights Total Training Error About Stopping Criteria Final Basic Batch Algorithm 3 Using Matrix Operations to Simplify Using Matrix Operations to Simplify the Pseudo-Code Generating the Output zk Generating zk Generating the Weights from Hidden to Output Layer Generating the Weights from Input to Hidden Layer Activation Functions 4 Heuristic for Multilayer Perceptron Maximizing information content Activation Function Target Values Normalizing the inputs Virtues and limitations of Back-Propagation Layer 73 / 94
  • 161. Activation Function We say that An activation function f(v) is antisymmetric if f(−v) = −f(v) It appears that the multilayer perceptron learns faster when using an antisymmetric function. Example: the hyperbolic tangent 74 / 94
  • 164. Outline 1 Introduction The XOR Problem 2 Multi-Layer Perceptron Architecture Back-propagation Gradient Descent Hidden–to-Output Weights Input-to-Hidden Weights Total Training Error About Stopping Criteria Final Basic Batch Algorithm 3 Using Matrix Operations to Simplify Using Matrix Operations to Simplify the Pseudo-Code Generating the Output zk Generating zk Generating the Weights from Hidden to Output Layer Generating the Weights from Input to Hidden Layer Activation Functions 4 Heuristic for Multilayer Perceptron Maximizing information content Activation Function Target Values Normalizing the inputs Virtues and limitations of Back-Propagation Layer 75 / 94
  • 165. Target Values Important It is important that the target values be chosen within the range of the sigmoid activation function. Specifically The desired response for a neuron in the output layer of the multilayer perceptron should be offset by some amount ε from the limiting value of the activation function 76 / 94
  • 167. For example Given a limiting value ±a We have then If we have a limiting value +a, we set t = a − ε. If we have a limiting value −a, we set t = −a + ε. 77 / 94
  • 170. Outline 1 Introduction The XOR Problem 2 Multi-Layer Perceptron Architecture Back-propagation Gradient Descent Hidden–to-Output Weights Input-to-Hidden Weights Total Training Error About Stopping Criteria Final Basic Batch Algorithm 3 Using Matrix Operations to Simplify Using Matrix Operations to Simplify the Pseudo-Code Generating the Output zk Generating zk Generating the Weights from Hidden to Output Layer Generating the Weights from Input to Hidden Layer Activation Functions 4 Heuristic for Multilayer Perceptron Maximizing information content Activation Function Target Values Normalizing the inputs Virtues and limitations of Back-Propagation Layer 78 / 94
  • 171. Normalizing the inputs Something Important (LeCun, 1993) Each input variable should be preprocessed so that: The mean value, averaged over the entire training set, is close to zero. Or it is small compared to its standard deviation. [Figure: example of shifting a data cloud so its mean value is at the origin] 79 / 94
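A short NumPy sketch of this preprocessing step: shift each input variable to (approximately) zero mean and rescale by its standard deviation over the training set (the toy data is arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(4, 100))   # d x N training inputs
mean = X.mean(axis=1, keepdims=True)                # per-variable mean over the training set
std = X.std(axis=1, keepdims=True) + 1e-12          # guard against constant variables
X_normalized = (X - mean) / std                     # near-zero mean, unit spread
```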
  • 175. The normalization must include two other measures Uncorrelated We can use principal component analysis. [Figure: example of decorrelating the input variables] 80 / 94
  • 177. In addition Quite interesting The decorrelated input variables should be scaled so that their covariances are approximately equal. Why This ensures that the different synaptic weights in the network learn at approximately the same speed. 81 / 94
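A sketch of the remaining two measures, assuming the inputs have already been centered: decorrelate with PCA and then rescale the components so their variances are approximately equal (i.e., whitening).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 200))
X_centered = X - X.mean(axis=1, keepdims=True)
cov = np.cov(X_centered)                        # d x d covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)          # principal component analysis
X_decorrelated = eigvecs.T @ X_centered         # uncorrelated variables
X_whitened = X_decorrelated / np.sqrt(eigvals[:, None] + 1e-12)  # approximately equal covariances
```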
  • 179. There are other heuristics As Initialization Learning from hints Learning rates etc. 82 / 94
  • 183. In addition In section 4.15, Simon Haykin We have the following techniques: Network growing You start with a small network and add neurons and layers to accomplish the learning task. Network pruning Start with a large network, then prune weights that are not necessary in an orderly fashion. 83 / 94
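As a rough illustration of the pruning idea only, here is a magnitude-based sketch; Haykin's section actually develops more principled, Hessian-based criteria (e.g., optimal brain damage), which this simple cutoff does not implement. The function name and the keep_fraction parameter are assumptions.

```python
import numpy as np

def prune_by_magnitude(W, keep_fraction=0.5):
    """Zero out the smallest-magnitude weights, keeping only a fraction of them.

    A crude stand-in for 'network pruning': weights below the magnitude
    threshold are removed (set to zero) in one orderly sweep.
    """
    threshold = np.quantile(np.abs(W).ravel(), 1.0 - keep_fraction)
    return np.where(np.abs(W) >= threshold, W, 0.0)

W = np.random.default_rng(1).normal(size=(4, 4))
print(prune_by_magnitude(W, keep_fraction=0.25))  # only the largest 25% survive
```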
  • 186. Outline 1 Introduction The XOR Problem 2 Multi-Layer Perceptron Architecture Back-propagation Gradient Descent Hidden–to-Output Weights Input-to-Hidden Weights Total Training Error About Stopping Criteria Final Basic Batch Algorithm 3 Using Matrix Operations to Simplify Using Matrix Operations to Simplify the Pseudo-Code Generating the Output zk Generating zk Generating the Weights from Hidden to Output Layer Generating the Weights from Input to Hidden Layer Activation Functions 4 Heuristic for Multilayer Perceptron Maximizing information content Activation Function Target Values Normalizing the inputs Virtues and limitations of Back-Propagation Layer 84 / 94
  • 187. Virtues and limitations of Back-Propagation Layer Something Notable The back-propagation algorithm has emerged as the most popular algorithm for the training of multilayer perceptrons. It has two distinct properties It is simple to compute locally. It performs stochastic gradient descent in weight space when doing pattern-by-pattern training 85 / 94
  • 190. Connectionism Back-propagation is an example of a connectionist paradigm that relies on local computations to discover the processing capabilities of neural networks. This form of restriction is known as the locality constraint. 86 / 94
  • 192. Why this is advocated in Artificial Neural Networks First Artificial neural networks that perform local computations are often held up as metaphors for biological neural networks. Second The use of local computations permits a graceful degradation in performance due to hardware errors, and therefore provides the basis for a fault-tolerant network design. Third Local computations favor the use of parallel architectures as an efficient method for the implementation of artificial neural networks. 87 / 94
  • 195. However, all this has been seriously questioned on the following grounds (Shepherd, 1990b; Crick, 1989; Stork, 1989) First The reciprocal synaptic connections between the neurons of a multilayer perceptron may assume weights that are excitatory or inhibitory. In the real nervous system, neurons usually appear to be one or the other. Second In a multilayer perceptron, hormonal and other types of global communications are ignored. 88 / 94
  • 198. However, all this has been seriously questioned on the following grounds (Shepherd, 1990b; Crick, 1989; Stork, 1989) Third In back-propagation learning, a synaptic weight is modified by a presynaptic activity and an error (learning) signal independent of postsynaptic activity. There is evidence from neurobiology to suggest otherwise. Fourth In a neurobiological sense, the implementation of back-propagation learning requires the rapid transmission of information backward along an axon. It appears highly unlikely that such an operation actually takes place in the brain. 89 / 94
  • 202. However, all this has been seriously questioned on the following grounds (Shepherd, 1990b; Crick, 1989; Stork, 1989) Fifth Back-propagation learning implies the existence of a "teacher," which in the context of the brain would presumably be another set of neurons with novel properties. The existence of such neurons is biologically implausible. 90 / 94
  • 204. Computational Efficiency Something Notable The computational complexity of an algorithm is usually measured in terms of the number of multiplications, additions, and storage involved in its implementation. This is the electrical engineering approach!!! Taking into account the total number of synapses W, including biases We have ∆wkj = ηδkyj = η (tk − zk) f′ (netk) yj (Backward Pass) We have that for this step 1 We need to calculate netk, which is linear in the number of weights. 2 We need to calculate yj = f (netj), which is linear in the number of weights. 91 / 94
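A minimal numpy sketch (an assumption, not code from the slides) of this hidden-to-output update, taking f = tanh so that f′(net) = 1 − f(net)²; every array name (W_kj, y, t, eta) is illustrative. Each quantity is obtained with a constant number of passes over the weight matrix, so the cost is linear in the number of weights.

```python
import numpy as np

def output_layer_update(W_kj, y, t, eta=0.1):
    """One hidden-to-output update: Delta w_kj = eta * (t_k - z_k) * f'(net_k) * y_j."""
    net_k = W_kj @ y                    # net activations of the output units
    z = np.tanh(net_k)                  # outputs, with f = tanh
    delta_k = (t - z) * (1.0 - z**2)    # (t_k - z_k) * f'(net_k)
    return eta * np.outer(delta_k, y)   # Delta w_kj, one entry per weight

y = np.array([0.2, -0.5, 0.9])          # hidden-unit outputs y_j
t = np.array([1.0, -1.0])               # targets t_k
W_kj = np.zeros((2, 3))                 # hidden-to-output weights
print(output_layer_update(W_kj, y, t))
```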
  • 207. Computational Efficiency Now the Input-to-Hidden Weights ∆wji = ηxiδj = ηf′ (netj) [ Σ_{k=1}^{c} wkjδk ] xi We have that for this step the sum Σ_{k=1}^{c} wkjδk takes, because of the previous calculation of the δk’s, time linear in the number of weights. Clearly, all of this requires storing the previously computed quantities, in addition to the calculation of the derivatives of the activation functions, which we assume takes constant time. 92 / 94
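Continuing the previous sketch (again an assumption rather than code from the slides), the input-to-hidden update reuses the δk already computed for the output layer, which is why the additional cost stays linear in the number of weights; all names are illustrative.

```python
import numpy as np

def hidden_layer_update(W_ji, W_kj, x, delta_k, eta=0.1):
    """One input-to-hidden update: Delta w_ji = eta * f'(net_j) * (sum_k w_kj delta_k) * x_i."""
    net_j = W_ji @ x                        # net activations of the hidden units
    y = np.tanh(net_j)                      # hidden outputs, with f = tanh
    back = W_kj.T @ delta_k                 # sum_k w_kj * delta_k, reuses the output deltas
    delta_j = (1.0 - y**2) * back           # f'(net_j) * backpropagated error
    return eta * np.outer(delta_j, x)       # Delta w_ji, one entry per weight

x = np.array([0.5, -1.0])                   # inputs x_i
W_ji = np.zeros((3, 2))                     # input-to-hidden weights
W_kj = np.ones((2, 3))                      # hidden-to-output weights
delta_k = np.array([0.3, -0.2])             # output deltas from the previous step
print(hidden_layer_update(W_ji, W_kj, x, delta_k))
```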
  • 210. We have that The complexity of back-propagation in the multilayer perceptron is O (W), i.e., linear in the total number of weights W. 93 / 94
  • 211. Exercises We have from NN by Haykin 4.2, 4.3, 4.6, 4.8, 4.16, 4.17, 3.7 94 / 94