This document discusses backpropagation, an algorithm used to train feedforward neural networks. It begins by explaining gradient descent and how it is used to minimize error in the network by adjusting weights. It then describes how backpropagation specifically works to calculate the gradient of the error with respect to the weights in each layer by propagating error backwards from the output layer through the hidden layers. The general backpropagation rule is provided to update weights based on this error gradient calculation.
2. Pocket Algorithm
• Algorithm developed in 1985 – essentially uses the
Perceptron Training Algorithm (PTA)
• Basic Idea:
Always preserve the best weights obtained so far
in the "pocket"
Replace the pocketed weights whenever changed weights
are found to be better (i.e. they result in reduced error).
3. XOR using 2 layers
x_1 \oplus x_2 = OR\big( AND(x_1, NOT\,x_2),\; AND(NOT\,x_1, x_2) \big)
• A non-linearly separable (non-LS) function expressed as a linearly
separable function of individual linearly separable functions.
4. Example - XOR
x1   x2   x̄1x2
0    0    0
0    1    1
1    0    0
1    1    0
[Diagram: Calculation of x̄1x2 – a threshold unit with inputs x1, x2, weights w1 = -1, w2 = 1.5 and threshold θ = 1.]
[Diagram: Calculation of XOR – a threshold unit taking the hidden outputs x1x̄2 and x̄1x2 as inputs, with weights w1 = 1, w2 = 1 and threshold θ = 0.5.]
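A quick check of the units above, assuming the usual convention that a threshold unit outputs 1 when w1·x1 + w2·x2 ≥ θ; the helper name threshold_unit and the symmetric x1x̄2 unit (weights swapped) are illustrative assumptions.

```python
def threshold_unit(w1, w2, theta):
    """Two-input threshold unit: fires (outputs 1) when w1*a + w2*b >= theta."""
    return lambda a, b: int(w1 * a + w2 * b >= theta)

not_x1_and_x2 = threshold_unit(-1.0, 1.5, 1.0)   # x̄1x2, the unit from the table above
x1_and_not_x2 = threshold_unit(1.5, -1.0, 1.0)   # x1x̄2, weights swapped by symmetry (assumed)
xor_output    = threshold_unit(1.0, 1.0, 0.5)    # OR of the two hidden outputs

for x1 in (0, 1):
    for x2 in (0, 1):
        h1 = x1_and_not_x2(x1, x2)
        h2 = not_x1_and_x2(x1, x2)
        print(x1, x2, xor_output(h1, h2))         # reproduces the XOR truth table
```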
6. Some Terminology
• A multilayer feedforward neural network has
– Input layer
– Output layer
– Hidden layer (assists computation)
Output units and hidden units are called
computation units.
7. Training of the MLP
• Multilayer Perceptron (MLP)
• Question: how to find weights for the hidden
layers when no target output is available for them?
• Credit assignment problem – to be solved by
“Gradient Descent”
8. Gradient Descent Technique
• Let E be the error at the output layer
• ti = target output; oi = observed output
• i is the index going over n neurons in the outermost
layer
• j is the index going over the p patterns (1 to p)
• Ex: XOR – p = 4 and n = 1
E = \frac{1}{2} \sum_{j=1}^{p} \sum_{i=1}^{n} \left( t_i^j - o_i^j \right)^2
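A small numerical illustration of this error for the XOR case (p = 4 patterns, n = 1 output neuron); the "observed" outputs below are made-up values for an untrained network.

```python
import numpy as np

# XOR targets t_i^j for the p = 4 patterns, n = 1 output neuron
targets = np.array([[0.0], [1.0], [1.0], [0.0]])
# Hypothetical observed outputs o_i^j (illustrative only)
observed = np.array([[0.5], [0.4], [0.6], [0.5]])

# E = 1/2 * sum over patterns j and output neurons i of (t_i^j - o_i^j)^2
E = 0.5 * np.sum((targets - observed) ** 2)
print(E)   # 0.51 for these made-up outputs
```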
9. Weights in a ff NN
• wmn is the weight of the
connection from the nth neuron to
the mth neuron
• The E vs. W surface is a complex surface in the space
defined by the weights wmn
• −∂E/∂wmn gives the direction in which a movement of the
operating point in the wmn co-ordinate space will result in
maximum decrease in error
[Diagram: neuron n connected to neuron m by weight wmn.]
\Delta w_{mn} \propto -\frac{\partial E}{\partial w_{mn}}
10. Sigmoid neurons
• Gradient Descent needs a derivative computation
– not possible in the perceptron because of the discontinuous
step function it uses!
Sigmoid neurons, with easy-to-compute derivatives, are used instead.
• Computing power comes from the non-linearity of the sigmoid
function.
y = \frac{1}{1 + e^{-x}}; \qquad y \to 0 \text{ as } x \to -\infty, \qquad y \to 1 \text{ as } x \to \infty
11. Derivative of Sigmoid function
y = \frac{1}{1 + e^{-x}}

\frac{dy}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2}
= \frac{1}{1 + e^{-x}} \left( 1 - \frac{1}{1 + e^{-x}} \right)
= y\,(1 - y)
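A quick numerical check of the identity dy/dx = y(1 − y), purely as a sketch (the test points are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)
y = sigmoid(x)

h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # central-difference derivative
closed = y * (1 - y)                                     # closed form derived above

print(np.max(np.abs(numeric - closed)))                  # tiny (~1e-10): the two agree
```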
12. Training algorithm
• Initialize weights to random values.
• For input x = <xn,xn-1,…,x0>, modify weights as follows
Target output = t, Observed output = o
• Iterate until E < some chosen threshold
\Delta w_i = -\eta \frac{\partial E}{\partial w_i}, \qquad E = \frac{1}{2}(t - o)^2
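A minimal sketch of this loop for a single sigmoid neuron; the AND targets, learning rate, threshold and iteration cap are illustrative assumptions (a single neuron cannot learn XOR, which is what motivates the hidden layers treated later).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
# Patterns x = <x2, x1, x0> with x0 = 1 acting as the bias input
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
t = np.array([0.0, 0.0, 0.0, 1.0])        # AND targets (learnable by a single neuron)
w = rng.normal(size=3)                     # initialize weights to random values
eta, threshold = 1.0, 1e-3

for _ in range(100_000):
    o = sigmoid(X @ w)                     # observed outputs for all patterns
    E = 0.5 * np.sum((t - o) ** 2)         # E = 1/2 * sum (t - o)^2
    if E < threshold:                      # iterate until E < threshold
        break
    # dE/dw_i summed over patterns; delta w_i = -eta * dE/dw_i
    grad = -X.T @ ((t - o) * o * (1 - o))
    w = w - eta * grad

print(E, w)
```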
14. Observations
Does the training technique support our
intuition?
• The larger the input xi, the larger the weight change ∆wi
– The error burden is borne by the weights
corresponding to large input values
18. Backpropagation – for outermost layer
net_j = \text{input at the } j\text{-th layer}, \qquad E = \frac{1}{2} \sum_{p=1}^{m} (t_p - o_p)^2

\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial net_j} \cdot \frac{\partial net_j}{\partial w_{ji}}
= \frac{\partial E}{\partial o_j} \cdot \frac{\partial o_j}{\partial net_j} \cdot o_i

\frac{\partial E}{\partial o_j} = -(t_j - o_j), \qquad \frac{\partial o_j}{\partial net_j} = o_j (1 - o_j)

Hence, \qquad \Delta w_{ji} = \eta\, (t_j - o_j)\, o_j (1 - o_j)\, o_i
19. Backpropagation for hidden layers
[Diagram: feedforward network with an input layer (n i/p neurons), hidden layers, and an output layer (m o/p neurons); neuron i feeds neuron j, which feeds neuron k in the next layer.]
δk is propagated backwards to find the value of δj.
20. Backpropagation – for hidden layers
\frac{\partial E}{\partial net_j} = \frac{\partial E}{\partial o_j} \cdot \frac{\partial o_j}{\partial net_j}
= \frac{\partial E}{\partial o_j}\; o_j (1 - o_j)

\frac{\partial E}{\partial o_j} = \sum_{k \in \text{next layer}} \frac{\partial E}{\partial net_k} \cdot \frac{\partial net_k}{\partial o_j}
= \sum_{k \in \text{next layer}} \frac{\partial E}{\partial net_k}\; w_{kj}

Hence, \qquad \Delta w_{ji} = \eta\, \delta_j\, o_i, \qquad
\delta_j = o_j (1 - o_j) \sum_{k \in \text{next layer}} \delta_k\, w_{kj}
21. General Backpropagation Rule
• General weight updating rule:
\Delta w_{ji} = \eta\, \delta_j\, o_i
• Where
\delta_j = (t_j - o_j)\, o_j (1 - o_j) \quad for outermost layer
\delta_j = o_j (1 - o_j) \sum_{k \in \text{next layer}} \delta_k\, w_{kj} \quad for hidden layers
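A compact sketch of these rules on XOR with one hidden layer of two sigmoid units; the network size, learning rate, iteration count and random seed are illustrative assumptions, and an unlucky initialisation can land in a local minimum (re-running with a different seed usually fixes this).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0.0], [1.0], [1.0], [0.0]])             # XOR targets

# 2 inputs -> 2 hidden sigmoid units -> 1 output sigmoid unit (with biases)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=(2, 1)), np.zeros(1)
eta = 1.0

for _ in range(20_000):
    # forward pass: input propagates forward
    o_h = sigmoid(X @ W1 + b1)                          # hidden outputs o_j
    o   = sigmoid(o_h @ W2 + b2)                        # network outputs
    # backward pass: error propagates backward
    delta_out = (t - o) * o * (1 - o)                   # delta_j = (t_j - o_j) o_j (1 - o_j)
    delta_hid = o_h * (1 - o_h) * (delta_out @ W2.T)    # delta_j = o_j(1-o_j) sum_k delta_k w_kj
    # weight updates: delta w_ji = eta * delta_j * o_i
    W2 += eta * o_h.T @ delta_out
    b2 += eta * delta_out.sum(axis=0)
    W1 += eta * X.T @ delta_hid
    b1 += eta * delta_hid.sum(axis=0)

print(np.round(o.ravel(), 2))                           # typically close to [0, 1, 1, 0]
```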
22. How does it work?
• Inputs propagate forward and errors propagate
backward (e.g. XOR)
[Diagram: the two-layer XOR network – hidden units x1x̄2 and x̄1x2 fed by inputs x1 and x2 (direct weight 1.5, cross weight −1), feeding an output unit with w1 = 1, w2 = 1 and θ = 0.5.]