Lecture on optimization techniques for deep learning. It explains the feed-forward network model and the backpropagation algorithm in detail, and covers gradient-based optimizers such as gradient descent with momentum, Nesterov momentum, Adagrad, RMSProp, and Adam.
1. EE658 Optimization Techniques
Lecture 8
• Optimization Techniques for Deep Learning
Reference: "Pattern Recognition and Machine Learning" by Christopher M. Bishop
Kuntal Deka, IIT Guwahati
2. Deep Forward Networks
• First we construct M linear combinations of the input variables x1, . . . , xD
in the form
$$a_j = \sum_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0}$$
The quantities aj, j = 1, . . . , M are known as activations.
• Each aj is then transformed using a differentiable, nonlinear activation
function h(·) to give
zj = h (aj)
3. Deep Forward Networks contd..
• zj values are again linearly combined to give output unit activations
$$a_k = \sum_{j=1}^{M} w^{(2)}_{kj} z_j + w^{(2)}_{k0}$$
where k = 1, · · · , K, and K is the total number of outputs.
4. Deep Forward Networks contd..
• Each output unit activation is transformed using a logistic sigmoid
function so that
$$y_k = \sigma(a_k), \quad \text{where } \sigma(a) = \frac{1}{1 + \exp(-a)}$$
• Combining these stages gives the overall network function; for sigmoidal output-unit activation functions,
$$y_k(\mathbf{x}, \mathbf{w}) = \sigma\!\left( \sum_{j=1}^{M} w^{(2)}_{kj}\, h\!\left( \sum_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0} \right) + w^{(2)}_{k0} \right)$$
$$\mathbf{y}(\mathbf{x}, \mathbf{w}) = \left[ y_1(\mathbf{x}, \mathbf{w}),\, y_2(\mathbf{x}, \mathbf{w}),\, \ldots,\, y_K(\mathbf{x}, \mathbf{w}) \right]^{\mathsf{T}}$$
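As a concrete illustration (not from the slides), here is a minimal NumPy sketch of this two-layer network function; the names (`forward`, `W1`, `b1`, ...) are my own, with tanh chosen as the hidden activation h(·):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2):
    """Evaluate y_k = sigma( sum_j w2_kj h( sum_i w1_ji x_i + w1_j0 ) + w2_k0 )."""
    a_hidden = W1 @ x + b1      # first-layer activations a_j
    z = np.tanh(a_hidden)       # hidden outputs z_j = h(a_j)
    a_out = W2 @ z + b2         # output activations a_k
    return sigmoid(a_out)       # y_k = sigma(a_k)

# Demo: D = 3 inputs, M = 4 hidden units, K = 2 outputs.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
print(forward(np.array([0.5, -1.0, 2.0]), W1, b1, W2, b2))
```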
5. Network Training
• Given
• A training set comprising a set of input vectors {xn}, where n = 1, ..., N
• Target vectors {tn}.
The objective is to minimize the error function
$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} \left\| \mathbf{y}(\mathbf{x}_n, \mathbf{w}) - \mathbf{t}_n \right\|^2$$
Our goal is to find a vector $\mathbf{w}$ at which $E(\mathbf{w})$ takes its smallest value. The first-order necessary condition (FONC) is
$$\nabla E(\mathbf{w}) = \mathbf{0}$$
Gradient Descent Optimization:
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E\!\left(\mathbf{w}^{(\tau)}\right)$$
where the step-size parameter $\eta$ is known as the learning rate.
(Figure: error surface over weight space, where $w_A$ is a local minimum and $w_B$ is the global minimum.)
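A minimal sketch of this update rule, run on an illustrative quadratic error rather than a neural-network loss (the objective, step size, and iteration count are assumptions for the demo):

```python
import numpy as np

def grad_E(w):
    # Toy error E(w) = ||w - w_star||^2 / 2, whose gradient is w - w_star.
    return w - np.array([1.0, -2.0])

eta = 0.1                     # learning rate
w = np.zeros(2)               # initial weight vector w^(0)
for _ in range(100):
    w = w - eta * grad_E(w)   # w^(tau+1) = w^(tau) - eta * grad E(w^(tau))
print(w)                      # approaches the minimizer [1, -2]
```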
6. Error Backpropagation or Backprop
• Error Backpropagation or Backprop is an efficient technique for evaluating
the gradient of an error function E(w).
• The error function comprises a sum of terms, one for each data point in the training set:
$$E(\mathbf{w}) = \sum_{n=1}^{N} E_n(\mathbf{w})$$
• $E_n = \frac{1}{2}\sum_k (y_{nk} - t_{nk})^2$
• Each unit computes a weighted sum of its inputs:
$$a_j = \sum_i w_{ji} z_i$$
(Figure: network diagram showing the forward direction of signal flow and the backward direction of error propagation.)
7. Error Backpropagation contd..
• After activation, we get
zj = h (aj)
where h(·) is a nonlinear activation function.
Table: List of activation functions

Name      h(a_j)                          Range
Linear    a_j                             (−∞, ∞)
Sigmoid   1 / (1 + exp(−a_j))             (0, 1)
Softmax   exp(a_j) / Σ_{j′} exp(a_{j′})   (0, 1)
ReLU      max(0, a_j)                     [0, ∞)
tanh      tanh(a_j)                       (−1, 1)
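These activation functions can be written directly in NumPy; a small sketch (the helper names are mine), with a max-subtraction in the softmax for numerical stability:

```python
import numpy as np

def linear(a):  return a                          # range (-inf, inf)
def sigmoid(a): return 1.0 / (1.0 + np.exp(-a))   # range (0, 1)
def relu(a):    return np.maximum(0.0, a)         # range [0, inf)

def softmax(a):
    # Subtract the max for numerical stability; outputs lie in (0, 1) and sum to 1.
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([-1.0, 0.0, 2.0])
for name, h in [("linear", linear), ("sigmoid", sigmoid),
                ("softmax", softmax), ("ReLU", relu), ("tanh", np.tanh)]:
    print(name, h(a))
```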
8. Error Backpropagation contd..
• By using the chain rule for partial derivatives, we get
$$\frac{\partial E_n}{\partial w_{ji}} = \underbrace{\frac{\partial E_n}{\partial a_j}}_{\delta_j}\, \underbrace{\frac{\partial a_j}{\partial w_{ji}}}_{z_i} = \delta_j z_i$$
$$\delta_j = \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j} = h'(a_j) \sum_k \delta_k w_{kj}$$
[ Since $z_j = h(a_j)$ and $a_k = \sum_j w_{kj} z_j = \sum_j w_{kj} h(a_j)$, we have $\partial a_k / \partial a_j = w_{kj}\, h'(a_j)$. ]
For the sigmoid activation function, we have
$$h'(a_j) = \frac{d}{da_j}\left[\frac{1}{1 + \exp(-a_j)}\right] = \frac{1}{1 + \exp(-a_j)} \times \left(1 - \frac{1}{1 + \exp(-a_j)}\right) = z_j (1 - z_j)$$
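A quick numerical sanity check of the identity $h'(a_j) = z_j(1 - z_j)$, comparing the closed form against a central finite difference (a sketch, not part of the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = 0.755                          # e.g. the a_1 value from the worked example below
z = sigmoid(a)
closed_form = z * (1.0 - z)        # h'(a) = z(1 - z)
eps = 1e-6
finite_diff = (sigmoid(a + eps) - sigmoid(a - eps)) / (2.0 * eps)
print(closed_form, finite_diff)    # the two values agree closely
```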
10. Steps for Error Backpropagation
Error Backpropagation
1. Apply an input vector xn to the network and forward propagate through
the network using the following equations to find the activations of all
the hidden and output units:
$$a_j = \sum_i w_{ji} z_i \quad \text{and} \quad z_j = h(a_j)$$
2. Evaluate $\delta_k$ for all the output units using
$$\delta_k = y_k (1 - y_k)(y_k - t_k)$$
(this form holds for sigmoid output units with the sum-of-squares error).
3. Backpropagate the δs using the following equation to obtain δj for each
hidden unit in the network:
$$\delta_j = h'(a_j) \sum_k \delta_k w_{kj}$$
4. Evaluate the required derivatives:
$$\frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i$$
5. The derivatives are used in gradient descent:
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E\!\left(\mathbf{w}^{(\tau)}\right)$$
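Steps 1–5 can be collected into a single update routine for a one-hidden-layer network with sigmoid units throughout; a minimal sketch with my own names (`backprop_step`, `W1`, `W2`), omitting biases to match the worked example that follows:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, W1, W2, eta):
    # Step 1: forward propagate to get all hidden and output activations.
    z = sigmoid(W1 @ x)                           # hidden outputs z_j
    y = sigmoid(W2 @ z)                           # network outputs y_k
    # Step 2: deltas at the output units (sigmoid + sum-of-squares error).
    delta_out = y * (1 - y) * (y - t)             # delta_k = y_k(1 - y_k)(y_k - t_k)
    # Step 3: backpropagate to the hidden units.
    delta_hid = z * (1 - z) * (W2.T @ delta_out)  # delta_j = h'(a_j) sum_k delta_k w_kj
    # Step 4: derivatives dE_n/dw_ji = delta_j * z_i (here z_i = x_i for layer 1).
    grad_W1 = np.outer(delta_hid, x)
    grad_W2 = np.outer(delta_out, z)
    # Step 5: gradient-descent update.
    return W1 - eta * grad_W1, W2 - eta * grad_W2

# Demo with D = 2 inputs, M = 3 hidden units, K = 1 output.
rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
W1, W2 = backprop_step(np.array([0.35, 0.9]), np.array([0.5]), W1, W2, eta=0.5)
```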
11. Example for BackProp
Forward Pass
$$a_1 = w^{(1)}_{11} x_1 + w^{(1)}_{12} x_2 = 0.1 \times 0.35 + 0.8 \times 0.9 = 0.755$$
$$a_2 = w^{(1)}_{21} x_1 + w^{(1)}_{22} x_2 = 0.4 \times 0.35 + 0.6 \times 0.9 = 0.68$$
$$z_1 = \frac{1}{1 + \exp(-a_1)} = 0.68$$
$$z_2 = \frac{1}{1 + \exp(-a_2)} = 0.6637$$
$$y_1 = \sigma\!\left(w^{(2)}_{11} z_1 + w^{(2)}_{12} z_2\right) = \sigma(0.801) = \frac{1}{1 + \exp(-0.801)} = 0.69$$
(Figure: two-layer network with inputs $x_1 = 0.35$, $x_2 = 0.9$; weights $w^{(1)}_{11} = 0.1$, $w^{(1)}_{12} = 0.8$, $w^{(1)}_{21} = 0.4$, $w^{(1)}_{22} = 0.6$, $w^{(2)}_{11} = 0.3$, $w^{(2)}_{12} = 0.9$; activations $a_1 = 0.76$, $a_2 = 0.68$; hidden outputs $z_1 = 0.68$, $z_2 = 0.6637$; output $y_1 = 0.69$; target $t_1 = 0.5$.)
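These forward-pass numbers can be reproduced in a few lines of NumPy (a sketch; the slides contain no code):

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

x  = np.array([0.35, 0.9])
W1 = np.array([[0.1, 0.8],        # w(1)11, w(1)12
               [0.4, 0.6]])       # w(1)21, w(1)22
W2 = np.array([[0.3, 0.9]])       # w(2)11, w(2)12

a = W1 @ x                        # [0.755, 0.68]
z = sigmoid(a)                    # [0.6803, 0.6637]
y = sigmoid(W2 @ z)               # [0.6903], i.e. the 0.69 on the slide
print(a, z, y)
```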
14. Example for BackProp
Backward Pass (Layer 1)
Using gradient descent with learning rate $\eta = 1$ (the value implied by the updated weights), the layer-1 updates are
$$w^{(1)}_{11} \leftarrow w^{(1)}_{11} - \eta\frac{\partial E}{\partial w^{(1)}_{11}} = 0.0990725$$
$$w^{(1)}_{12} \leftarrow w^{(1)}_{12} - \eta\frac{\partial E}{\partial w^{(1)}_{12}} = 0.797612$$
$$w^{(1)}_{21} \leftarrow w^{(1)}_{21} - \eta\frac{\partial E}{\partial w^{(1)}_{21}} = 0.39713$$
$$w^{(1)}_{22} \leftarrow w^{(1)}_{22} - \eta\frac{\partial E}{\partial w^{(1)}_{22}} = 0.59262$$
(Figure: the same network as before, annotated with the computed deltas: $\delta^{\text{out}}_1 = 0.0406$, $\delta^{\text{hid}}_1 = 0.00265$, $\delta^{\text{hid}}_2 = 0.0082$.)
15. Example for BackProp
Updated Network
Error after this iteration:
$$E = \frac{1}{2}(0.6820 - 0.5)^2 = 0.0166$$
Note that before this iteration, the error was
$$E = \frac{1}{2}(0.69 - 0.5)^2 = 0.0180$$
(Figure: the updated network with $w^{(1)}_{11} = 0.0990725$, $w^{(1)}_{12} = 0.7976$, $w^{(1)}_{21} = 0.397$, $w^{(1)}_{22} = 0.59262$, $w^{(2)}_{11} = 0.27239$, $w^{(2)}_{12} = 0.8731$; recomputed $a_1 = 0.75$, $a_2 = 0.67$, $z_1 = 0.6797$, $z_2 = 0.6620$, $y_1 = 0.682$; target $t_1 = 0.5$.)
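The full iteration, including the deltas, the weight updates, and the error drop from 0.0180 to 0.0166, can be checked in NumPy; the learning rate η = 1 is inferred from the slide's numbers, and the variable names are mine:

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

x, t = np.array([0.35, 0.9]), np.array([0.5])
W1 = np.array([[0.1, 0.8], [0.4, 0.6]])
W2 = np.array([[0.3, 0.9]])

z = sigmoid(W1 @ x)
y = sigmoid(W2 @ z)                               # ~0.69
delta_out = y * (1 - y) * (y - t)                 # ~0.0406
delta_hid = z * (1 - z) * (W2.T @ delta_out)      # ~[0.00265, 0.0082]
W2 = W2 - 1.0 * np.outer(delta_out, z)            # eta = 1
W1 = W1 - 1.0 * np.outer(delta_hid, x)
print(W1)                                         # ~[[0.0991, 0.7976], [0.3971, 0.5926]]
print(W2)                                         # ~[[0.2723, 0.8730]]

y_new = sigmoid(W2 @ sigmoid(W1 @ x))             # ~0.682
print(0.5 * np.sum((y - t) ** 2))                 # ~0.0180 before the update
print(0.5 * np.sum((y_new - t) ** 2))             # ~0.0166 after the update
```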
16. Momentum
• Gradient descent will take a long time to traverse a nearly flat surface as
shown in the following figure.
Regions that are nearly flat have gradients with small magnitudes and can thus require many iterations of gradient descent to traverse. (Figure: a long, nearly flat region with a small gradient.)
• The idea is to introduce memory:
$$\mathbf{v}^{(k+1)} = \beta \mathbf{v}^{(k)} - \alpha \nabla f\!\left(\mathbf{x}^{(k)}\right)$$
$$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \mathbf{v}^{(k+1)}$$
• Observe that for β = 0, we have gradient descent.
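A sketch of the momentum update on the same illustrative quadratic as before (β and α are arbitrary demo values):

```python
import numpy as np

def grad_f(x):
    # Same illustrative quadratic: f(x) = ||x - x_star||^2 / 2.
    return x - np.array([1.0, -2.0])

alpha, beta = 0.1, 0.9
x, v = np.zeros(2), np.zeros(2)
for _ in range(200):
    v = beta * v - alpha * grad_f(x)   # v^(k+1) = beta v^(k) - alpha grad f(x^(k))
    x = x + v                          # x^(k+1) = x^(k) + v^(k+1)
print(x)                               # approaches [1, -2]; beta = 0 gives plain gradient descent
```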
17. Nesterov Momentum
• Nesterov Momentum uses the gradient at the projected future position
$$\mathbf{v}^{(k+1)} = \beta \mathbf{v}^{(k)} - \alpha \nabla f\big(\underbrace{\mathbf{x}^{(k)} + \beta \mathbf{v}^{(k)}}_{\text{future position}}\big)$$
$$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \mathbf{v}^{(k+1)}$$
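The only change from the momentum sketch is where the gradient is evaluated; again on an illustrative quadratic:

```python
import numpy as np

def grad_f(x):
    return x - np.array([1.0, -2.0])   # same illustrative quadratic

alpha, beta = 0.1, 0.9
x, v = np.zeros(2), np.zeros(2)
for _ in range(200):
    # Gradient is evaluated at the projected future position x + beta*v.
    v = beta * v - alpha * grad_f(x + beta * v)
    x = x + v
print(x)
```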
18. Adaptive Subgradient (Adagrad)
• Momentum and Nesterov momentum update all components of x with the
same learning rate.
• The adaptive subgradient method, or Adagrad, adapts a learning rate for each component of $\mathbf{x}$:
$$x_i^{(k+1)} = x_i^{(k)} - \frac{\alpha}{\varepsilon + \sqrt{s_i^{(k)}}}\, g_i^{(k)}$$
[$g_i$ is the $i$th component of the gradient]
where $s^{(k)}$ is a vector whose $i$th entry is the sum of the squares of the partials with respect to $x_i$ up to time step $k$:
$$s_i^{(k)} = \sum_{j=1}^{k} \left(g_i^{(j)}\right)^2$$
• The components of $\mathbf{s}$ accumulate the squared partials, which causes the effective learning rate to decrease during training.
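A minimal Adagrad sketch on the same toy objective (α and ε are demo values), showing the accumulating sum s:

```python
import numpy as np

def grad_f(x):
    return x - np.array([1.0, -2.0])   # illustrative quadratic

alpha, eps = 0.5, 1e-8
x = np.zeros(2)
s = np.zeros(2)                        # running sum of squared partials
for _ in range(500):
    g = grad_f(x)
    s = s + g * g                      # s_i^(k) = sum_j (g_i^(j))^2
    x = x - alpha / (eps + np.sqrt(s)) * g
print(x)                               # the effective step shrinks as s grows
```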
19. RMSProp
• RMSProp extends Adagrad to avoid the effect of a monotonically
decreasing learning rate.
$$\hat{\mathbf{s}}^{(k+1)} = \gamma \hat{\mathbf{s}}^{(k)} + (1 - \gamma)\left(\mathbf{g}^{(k)} \odot \mathbf{g}^{(k)}\right)$$
where the decay γ ∈ [0, 1] is typically close to 0.9.
• RMSProp’s update equation:
$$x_i^{(k+1)} = x_i^{(k)} - \frac{\alpha}{\varepsilon + \sqrt{\hat{s}_i^{(k)}}}\, g_i^{(k)} = x_i^{(k)} - \frac{\alpha}{\varepsilon + \text{RMS}(g_i)}\, g_i^{(k)}$$
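The corresponding RMSProp sketch; compared with Adagrad, only the accumulation rule for the squared gradients changes:

```python
import numpy as np

def grad_f(x):
    return x - np.array([1.0, -2.0])   # illustrative quadratic

alpha, gamma, eps = 0.1, 0.9, 1e-8
x = np.zeros(2)
s_hat = np.zeros(2)                    # decaying average of squared gradients
for _ in range(500):
    g = grad_f(x)
    s_hat = gamma * s_hat + (1 - gamma) * g * g   # elementwise g (.) g
    x = x - alpha / (eps + np.sqrt(s_hat)) * g
print(x)
```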
20. Adam
• Adam is a combination of RMSProp and momentum.
• It stores both
• an exponentially decaying squared gradient like RMSProp.
• an exponentially decaying gradient like momentum.
• The update equations are
• Biased decaying momentum:
$$\mathbf{v}^{(k+1)} = \gamma_v \mathbf{v}^{(k)} + (1 - \gamma_v)\, \mathbf{g}^{(k)}$$
• Biased decaying sq. gradient:
$$\mathbf{s}^{(k+1)} = \gamma_s \mathbf{s}^{(k)} + (1 - \gamma_s)\left(\mathbf{g}^{(k)} \odot \mathbf{g}^{(k)}\right)$$
• Corrected decaying momentum:
$$\hat{\mathbf{v}}^{(k+1)} = \mathbf{v}^{(k+1)} / (1 - \gamma_v^k)$$
• Corrected decaying sq. gradient:
$$\hat{\mathbf{s}}^{(k+1)} = \mathbf{s}^{(k+1)} / (1 - \gamma_s^k)$$
• Next iterate:
$$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} - \alpha\, \hat{\mathbf{v}}^{(k+1)} \Big/ \left(\varepsilon + \sqrt{\hat{\mathbf{s}}^{(k+1)}}\right)$$
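A minimal Adam sketch combining the two running estimates with bias correction (γ_v, γ_s, α, ε are common demo values, not prescribed by the slides):

```python
import numpy as np

def grad_f(x):
    return x - np.array([1.0, -2.0])   # illustrative quadratic

alpha, gv, gs, eps = 0.1, 0.9, 0.999, 1e-8
x = np.zeros(2)
v = np.zeros(2)                        # biased decaying momentum
s = np.zeros(2)                        # biased decaying squared gradient
for k in range(1, 501):
    g = grad_f(x)
    v = gv * v + (1 - gv) * g
    s = gs * s + (1 - gs) * g * g
    v_hat = v / (1 - gv ** k)          # bias-corrected momentum
    s_hat = s / (1 - gs ** k)          # bias-corrected squared gradient
    x = x - alpha * v_hat / (eps + np.sqrt(s_hat))
print(x)
```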