2. Pocket Algorithm
• Algorithm evolved in 1985 – essentially uses
PTA
• Basic Idea:
Always preserve the best weight obtained so far
in the “pocket”
Change weights, if found better (i.e. changed
weights result in reduced error).
3. XOR using 2 layers
)))
),
(
(
)),
(
,
(
( 2
1
2
1
2
1
2
1
2
1
x
x
NOT
AND
x
NOT
x
AND
OR
x
x
x
x
x
x
• Non-LS function expressed as a linearly separable
function of individual linearly separable functions.
4. Example - XOR
x1 x2 x1x
2
0 0 0
0 1 1
1 0 0
1 1 0
w2=1.5
w1=-1
θ = 1
x1 x2
2
1
1
2
0
w
w
w
w
Calculation of XOR
Calculation of x1x2
w2=1
w1=1
θ = 0.5
x1x2 x1x2
6. Some Terminology
• A multilayer feedforward neural network has
– Input layer
– Output layer
– Hidden layer (asserts computation)
Output units and hidden units are called
computation units.
7. Training of the MLP
• Multilayer Perceptron (MLP)
• Question:- How to find weights for the hidden
layers when no target output is available?
• Credit assignment problem – to be solved by
“Gradient Descent”
8. Gradient Descent Technique
• Let E be the error at the output layer
• ti = target output; oi = observed output
• i is the index going over n neurons in the outermost
layer
• j is the index going over the p patterns (1 to p)
• Ex: XOR:– p=4 and n=1
p
j
n
i
j
i
i o
t
E
1 1
2
)
(
2
1
9. Weights in a ff NN
• wmn is the weight of the
connection from the nth neuron to
the mth neuron
• E vs surface is a complex
surface in the space defined by the
weights wij
• gives the direction in which a
movement of the operating point
in the wmn co-ordinate space will
result in maximum decrease in
error
W
m
n
wmn
mn
w
E
mn
mn
w
E
w
10. Sigmoid neurons
• Gradient Descent needs a derivative computation
- not possible in perceptron due to the discontinuous step
function used!
Sigmoid neurons with easy-to-compute derivatives used!
• Computing power comes from non-linearity of sigmoid
function.
x
y
x
y
as
0
as
1
11. Derivative of Sigmoid function
)
1
(
1
1
1
1
1
)
1
(
)
(
)
1
(
1
1
1
2
2
y
y
e
e
e
e
e
e
dx
dy
e
y
x
x
x
x
x
x
x
12. Training algorithm
• Initialize weights to random values.
• For input x = <xn,xn-1,…,x0>, modify weights as follows
Target output = t, Observed output = o
• Iterate until E < (threshold)
i
i
w
E
w
2
)
(
2
1
o
t
E
14. Observations
Does the training technique support our
intuition?
• The larger the xi, larger is ∆wi
– Error burden is borne by the weight values
corresponding to large input values
18. Backpropagation – for outermost
layer
i
j
j
j
j
ji
j
j
j
j
m
p
p
p
th
j
j
j
j
j
o
o
o
o
t
w
o
o
o
t
j
o
t
E
net
net
o
o
E
net
E
j
)
1
(
)
(
))
1
(
)
(
(
Hence,
)
(
2
1
)
layer
j
at the
input
(
1
2
19. Backpropagation for hidden layers
Hidden layers
Input layer
(n i/p neurons)
Output layer
(m o/p neurons)
j
i
….
….
….
….
k
k is propagated backwards to find value of j
20. Backpropagation – for hidden
layers
i
j
j
k
k
kj
j
j
k
kj
k
j
j
j
k j
k
j
j
j
j
j
j
j
i
ji
o
o
o
w
o
o
w
o
o
o
netk
net
E
o
o
o
E
net
o
o
E
net
E
j
jo
w
)
1
(
)
(
)
1
(
)
(
Hence,
)
1
(
)
(
)
1
(
layer
next
layer
next
layer
next
21. General Backpropagation Rule
i
j
j
k
k
kj o
o
o
w )
1
(
)
(
layer
next
)
1
(
)
( j
j
j
j
j o
o
o
t
i
ji jo
w
• General weight updating rule:
• Where
for outermost layer
for hidden layers
22. How does it work?
• Input propagation forward and error
propagation backward (e.g. XOR)
w2=1
w1=1
θ = 0.5
x1x2 x1x2
-1
x1 x2
-1
1.5
1.5
1 1