Lecture on optimization techniques for deep learning. It explains the feed-forward network model and the backpropagation algorithm in detail, and covers gradient-based optimizers such as gradient descent with momentum, Nesterov momentum, Adagrad, RMSProp, and Adam.
1. EE658 Optimization Techniques
Lecture 8
• Optimization Techniques for Deep Learning
Reference: "Pattern Recognition and Machine Learning" by Christopher M. Bishop
Kuntal Deka, IIT Guwahati
2. Deep Forward Networks
• First we construct M linear combinations of the input variables x1, . . . , xD
in the form
$$a_j = \sum_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0}$$
The quantities aj, j = 1, . . . , M are known as activations.
• Each aj is then transformed using a differentiable, nonlinear activation
function h(·) to give
zj = h (aj)
3. Deep Forward Networks contd..
• zj values are again linearly combined to give output unit activations
$$a_k = \sum_{j=1}^{M} w^{(2)}_{kj} z_j + w^{(2)}_{k0}$$
where k = 1, · · · , K, and K is the total number of outputs.
4. Deep Forward Networks contd..
• Each output unit activation is transformed using a logistic sigmoid
function so that
$$y_k = \sigma(a_k), \quad \text{where } \sigma(a) = \frac{1}{1 + \exp(-a)}$$
• Combining these stages gives the overall network function; for sigmoidal output-unit activation functions,
$$y_k(\mathbf{x}, \mathbf{w}) = \sigma\!\left( \sum_{j=1}^{M} w^{(2)}_{kj}\, h\!\left( \sum_{i=1}^{D} w^{(1)}_{ji} x_i + w^{(1)}_{j0} \right) + w^{(2)}_{k0} \right)$$
$$\mathbf{y}(\mathbf{x}, \mathbf{w}) = \left[ y_1(\mathbf{x}, \mathbf{w}),\, y_2(\mathbf{x}, \mathbf{w}),\, \ldots,\, y_K(\mathbf{x}, \mathbf{w}) \right]^{\mathsf{T}}$$
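As a concrete illustration (not from the slides), here is a minimal NumPy sketch of this two-layer network function; the names (`forward`, `W1`, `b1`, ...) are my own, with tanh chosen as the hidden activation h(·):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2):
    """Evaluate y_k = sigma( sum_j w2_kj h( sum_i w1_ji x_i + w1_j0 ) + w2_k0 )."""
    a_hidden = W1 @ x + b1      # first-layer activations a_j
    z = np.tanh(a_hidden)       # hidden outputs z_j = h(a_j)
    a_out = W2 @ z + b2         # output activations a_k
    return sigmoid(a_out)       # y_k = sigma(a_k)

# Demo: D = 3 inputs, M = 4 hidden units, K = 2 outputs.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
print(forward(np.array([0.5, -1.0, 2.0]), W1, b1, W2, b2))
```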
5. Network Training
• Given
• A training set comprising a set of input vectors {xn}, where n = 1, ..., N
• Target vectors {tn}.
The objective is to minimize the error function
$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N} \left\| \mathbf{y}(\mathbf{x}_n, \mathbf{w}) - \mathbf{t}_n \right\|^2$$
Our goal is to find a vector $\mathbf{w}$ at which $E(\mathbf{w})$ takes its smallest value. The first-order necessary condition (FONC) is
$$\nabla E(\mathbf{w}) = \mathbf{0}$$
Gradient Descent Optimization:
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E\!\left(\mathbf{w}^{(\tau)}\right)$$
where the step-size parameter $\eta$ is known as the learning rate.
(Figure: error surface over weight space, where $w_A$ is a local minimum and $w_B$ is the global minimum.)
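A minimal sketch of this update rule, run on an illustrative quadratic error rather than a neural-network loss (the objective, step size, and iteration count are assumptions for the demo):

```python
import numpy as np

def grad_E(w):
    # Toy error E(w) = ||w - w_star||^2 / 2, whose gradient is w - w_star.
    return w - np.array([1.0, -2.0])

eta = 0.1                     # learning rate
w = np.zeros(2)               # initial weight vector w^(0)
for _ in range(100):
    w = w - eta * grad_E(w)   # w^(tau+1) = w^(tau) - eta * grad E(w^(tau))
print(w)                      # approaches the minimizer [1, -2]
```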
6. Error Backpropagation or Backprop
• Error Backpropagation or Backprop is an efficient technique for evaluating
the gradient of an error function E(w).
• The error function comprises a sum of terms, one for each data point in the training set:
$$E(\mathbf{w}) = \sum_{n=1}^{N} E_n(\mathbf{w})$$
• $E_n = \frac{1}{2}\sum_k (y_{nk} - t_{nk})^2$
• Each unit computes a weighted sum of its inputs:
$$a_j = \sum_i w_{ji} z_i$$
(Figure: network diagram showing the forward direction of signal flow and the backward direction of error propagation.)
7. Error Backpropagation contd..
• After activation, we get
zj = h (aj)
where h(·) is a nonlinear activation function.
Table: List of activation functions

Name      h(a_j)                          Range
Linear    a_j                             (−∞, ∞)
Sigmoid   1 / (1 + exp(−a_j))             (0, 1)
Softmax   exp(a_j) / Σ_{j′} exp(a_{j′})   (0, 1)
ReLU      max(0, a_j)                     [0, ∞)
tanh      tanh(a_j)                       (−1, 1)
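These activation functions can be written directly in NumPy; a small sketch (the helper names are mine), with a max-subtraction in the softmax for numerical stability:

```python
import numpy as np

def linear(a):  return a                          # range (-inf, inf)
def sigmoid(a): return 1.0 / (1.0 + np.exp(-a))   # range (0, 1)
def relu(a):    return np.maximum(0.0, a)         # range [0, inf)

def softmax(a):
    # Subtract the max for numerical stability; outputs lie in (0, 1) and sum to 1.
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([-1.0, 0.0, 2.0])
for name, h in [("linear", linear), ("sigmoid", sigmoid),
                ("softmax", softmax), ("ReLU", relu), ("tanh", np.tanh)]:
    print(name, h(a))
```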
8. Error Backpropagation contd..
• By using the chain rule for partial derivatives, we get
$$\frac{\partial E_n}{\partial w_{ji}} = \underbrace{\frac{\partial E_n}{\partial a_j}}_{\delta_j}\, \underbrace{\frac{\partial a_j}{\partial w_{ji}}}_{z_i} = \delta_j z_i$$
$$\delta_j = \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j} = h'(a_j) \sum_k \delta_k w_{kj}$$
[ Since $z_j = h(a_j)$ and $a_k = \sum_j w_{kj} z_j = \sum_j w_{kj} h(a_j)$, we have $\partial a_k / \partial a_j = w_{kj}\, h'(a_j)$. ]
For the sigmoid activation function, we have
$$h'(a_j) = \frac{d}{da_j}\left[\frac{1}{1 + \exp(-a_j)}\right] = \frac{1}{1 + \exp(-a_j)} \times \left(1 - \frac{1}{1 + \exp(-a_j)}\right) = z_j (1 - z_j)$$
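A quick numerical sanity check of the identity $h'(a_j) = z_j(1 - z_j)$, comparing the closed form against a central finite difference (a sketch, not part of the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = 0.755                          # e.g. the a_1 value from the worked example below
z = sigmoid(a)
closed_form = z * (1.0 - z)        # h'(a) = z(1 - z)
eps = 1e-6
finite_diff = (sigmoid(a + eps) - sigmoid(a - eps)) / (2.0 * eps)
print(closed_form, finite_diff)    # the two values agree closely
```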
10. Steps for Error Backpropagation
Error Backpropagation
1. Apply an input vector xn to the network and forward propagate through
the network using the following equations to find the activations of all
the hidden and output units:
$$a_j = \sum_i w_{ji} z_i \quad \text{and} \quad z_j = h(a_j)$$
2. Evaluate $\delta_k$ for all the output units using
$$\delta_k = y_k (1 - y_k)(y_k - t_k)$$
(this form holds for sigmoid output units with the sum-of-squares error).
3. Backpropagate the δs using the following equation to obtain δj for each
hidden unit in the network:
$$\delta_j = h'(a_j) \sum_k \delta_k w_{kj}$$
4. Evaluate the required derivatives:
$$\frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i$$
5. The derivatives are used in gradient descent:
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E\!\left(\mathbf{w}^{(\tau)}\right)$$
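Steps 1–5 can be collected into a single update routine for a one-hidden-layer network with sigmoid units throughout; a minimal sketch with my own names (`backprop_step`, `W1`, `W2`), omitting biases to match the worked example that follows:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, W1, W2, eta):
    # Step 1: forward propagate to get all hidden and output activations.
    z = sigmoid(W1 @ x)                           # hidden outputs z_j
    y = sigmoid(W2 @ z)                           # network outputs y_k
    # Step 2: deltas at the output units (sigmoid + sum-of-squares error).
    delta_out = y * (1 - y) * (y - t)             # delta_k = y_k(1 - y_k)(y_k - t_k)
    # Step 3: backpropagate to the hidden units.
    delta_hid = z * (1 - z) * (W2.T @ delta_out)  # delta_j = h'(a_j) sum_k delta_k w_kj
    # Step 4: derivatives dE_n/dw_ji = delta_j * z_i (here z_i = x_i for layer 1).
    grad_W1 = np.outer(delta_hid, x)
    grad_W2 = np.outer(delta_out, z)
    # Step 5: gradient-descent update.
    return W1 - eta * grad_W1, W2 - eta * grad_W2

# Demo with D = 2 inputs, M = 3 hidden units, K = 1 output.
rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
W1, W2 = backprop_step(np.array([0.35, 0.9]), np.array([0.5]), W1, W2, eta=0.5)
```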
11. Example for BackProp
Forward Pass
$$a_1 = w^{(1)}_{11} x_1 + w^{(1)}_{12} x_2 = 0.1 \times 0.35 + 0.8 \times 0.9 = 0.755$$
$$a_2 = w^{(1)}_{21} x_1 + w^{(1)}_{22} x_2 = 0.4 \times 0.35 + 0.6 \times 0.9 = 0.68$$
$$z_1 = \frac{1}{1 + \exp(-a_1)} = 0.68$$
$$z_2 = \frac{1}{1 + \exp(-a_2)} = 0.6637$$
$$y_1 = \sigma\!\left(w^{(2)}_{11} z_1 + w^{(2)}_{12} z_2\right) = \sigma(0.801) = \frac{1}{1 + \exp(-0.801)} = 0.69$$
(Figure: two-layer network with inputs $x_1 = 0.35$, $x_2 = 0.9$; weights $w^{(1)}_{11} = 0.1$, $w^{(1)}_{12} = 0.8$, $w^{(1)}_{21} = 0.4$, $w^{(1)}_{22} = 0.6$, $w^{(2)}_{11} = 0.3$, $w^{(2)}_{12} = 0.9$; activations $a_1 = 0.76$, $a_2 = 0.68$; hidden outputs $z_1 = 0.68$, $z_2 = 0.6637$; output $y_1 = 0.69$; target $t_1 = 0.5$.)
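These forward-pass numbers can be reproduced in a few lines of NumPy (a sketch; the slides contain no code):

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

x  = np.array([0.35, 0.9])
W1 = np.array([[0.1, 0.8],        # w(1)11, w(1)12
               [0.4, 0.6]])       # w(1)21, w(1)22
W2 = np.array([[0.3, 0.9]])       # w(2)11, w(2)12

a = W1 @ x                        # [0.755, 0.68]
z = sigmoid(a)                    # [0.6803, 0.6637]
y = sigmoid(W2 @ z)               # [0.6903], i.e. the 0.69 on the slide
print(a, z, y)
```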
14. Example for BackProp
Backward Pass (Layer 1)
Using gradient descent with learning rate $\eta = 1$ (the value implied by the updated weights), the layer-1 updates are
$$w^{(1)}_{11} \leftarrow w^{(1)}_{11} - \eta\frac{\partial E}{\partial w^{(1)}_{11}} = 0.0990725$$
$$w^{(1)}_{12} \leftarrow w^{(1)}_{12} - \eta\frac{\partial E}{\partial w^{(1)}_{12}} = 0.797612$$
$$w^{(1)}_{21} \leftarrow w^{(1)}_{21} - \eta\frac{\partial E}{\partial w^{(1)}_{21}} = 0.39713$$
$$w^{(1)}_{22} \leftarrow w^{(1)}_{22} - \eta\frac{\partial E}{\partial w^{(1)}_{22}} = 0.59262$$
(Figure: the same network as before, annotated with the computed deltas: $\delta^{\text{out}}_1 = 0.0406$, $\delta^{\text{hid}}_1 = 0.00265$, $\delta^{\text{hid}}_2 = 0.0082$.)
15. Example for BackProp
Updated Network
Error after this iteration:
$$E = \frac{1}{2}(0.6820 - 0.5)^2 = 0.0166$$
Note that before this iteration, the error was
$$E = \frac{1}{2}(0.69 - 0.5)^2 = 0.0180$$
(Figure: the updated network with $w^{(1)}_{11} = 0.0990725$, $w^{(1)}_{12} = 0.7976$, $w^{(1)}_{21} = 0.397$, $w^{(1)}_{22} = 0.59262$, $w^{(2)}_{11} = 0.27239$, $w^{(2)}_{12} = 0.8731$; recomputed $a_1 = 0.75$, $a_2 = 0.67$, $z_1 = 0.6797$, $z_2 = 0.6620$, $y_1 = 0.682$; target $t_1 = 0.5$.)
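The full iteration, including the deltas, the weight updates, and the error drop from 0.0180 to 0.0166, can be checked in NumPy; the learning rate η = 1 is inferred from the slide's numbers, and the variable names are mine:

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

x, t = np.array([0.35, 0.9]), np.array([0.5])
W1 = np.array([[0.1, 0.8], [0.4, 0.6]])
W2 = np.array([[0.3, 0.9]])

z = sigmoid(W1 @ x)
y = sigmoid(W2 @ z)                               # ~0.69
delta_out = y * (1 - y) * (y - t)                 # ~0.0406
delta_hid = z * (1 - z) * (W2.T @ delta_out)      # ~[0.00265, 0.0082]
W2 = W2 - 1.0 * np.outer(delta_out, z)            # eta = 1
W1 = W1 - 1.0 * np.outer(delta_hid, x)
print(W1)                                         # ~[[0.0991, 0.7976], [0.3971, 0.5926]]
print(W2)                                         # ~[[0.2723, 0.8730]]

y_new = sigmoid(W2 @ sigmoid(W1 @ x))             # ~0.682
print(0.5 * np.sum((y - t) ** 2))                 # ~0.0180 before the update
print(0.5 * np.sum((y_new - t) ** 2))             # ~0.0166 after the update
```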
16. Momentum
• Gradient descent will take a long time to traverse a nearly flat surface as
shown in the following figure.
Regions that are nearly flat have gradients with small magnitudes and can thus require many iterations of gradient descent to traverse. (Figure: a long, nearly flat region with a small gradient.)
• The idea is to introduce memory:
$$\mathbf{v}^{(k+1)} = \beta \mathbf{v}^{(k)} - \alpha \nabla f\!\left(\mathbf{x}^{(k)}\right)$$
$$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \mathbf{v}^{(k+1)}$$
• Observe that for β = 0, we have gradient descent.
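A sketch of the momentum update on the same illustrative quadratic as before (β and α are arbitrary demo values):

```python
import numpy as np

def grad_f(x):
    # Same illustrative quadratic: f(x) = ||x - x_star||^2 / 2.
    return x - np.array([1.0, -2.0])

alpha, beta = 0.1, 0.9
x, v = np.zeros(2), np.zeros(2)
for _ in range(200):
    v = beta * v - alpha * grad_f(x)   # v^(k+1) = beta v^(k) - alpha grad f(x^(k))
    x = x + v                          # x^(k+1) = x^(k) + v^(k+1)
print(x)                               # approaches [1, -2]; beta = 0 gives plain gradient descent
```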
17. Nesterov Momentum
• Nesterov Momentum uses the gradient at the projected future position
$$\mathbf{v}^{(k+1)} = \beta \mathbf{v}^{(k)} - \alpha \nabla f\big(\underbrace{\mathbf{x}^{(k)} + \beta \mathbf{v}^{(k)}}_{\text{future position}}\big)$$
$$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \mathbf{v}^{(k+1)}$$
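The only change from the momentum sketch is where the gradient is evaluated; again on an illustrative quadratic:

```python
import numpy as np

def grad_f(x):
    return x - np.array([1.0, -2.0])   # same illustrative quadratic

alpha, beta = 0.1, 0.9
x, v = np.zeros(2), np.zeros(2)
for _ in range(200):
    # Gradient is evaluated at the projected future position x + beta*v.
    v = beta * v - alpha * grad_f(x + beta * v)
    x = x + v
print(x)
```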
18. Adaptive Subgradient (Adagrad)
• Momentum and Nesterov momentum update all components of x with the
same learning rate.
• The adaptive subgradient method, or Adagrad, adapts a learning rate for each component of $\mathbf{x}$:
$$x_i^{(k+1)} = x_i^{(k)} - \frac{\alpha}{\varepsilon + \sqrt{s_i^{(k)}}}\, g_i^{(k)}$$
[$g_i$ is the $i$th component of the gradient]
where $s^{(k)}$ is a vector whose $i$th entry is the sum of the squares of the partials with respect to $x_i$ up to time step $k$:
$$s_i^{(k)} = \sum_{j=1}^{k} \left(g_i^{(j)}\right)^2$$
• The components of $\mathbf{s}$ accumulate the squared partials, which causes the effective learning rate to decrease during training.
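A minimal Adagrad sketch on the same toy objective (α and ε are demo values), showing the accumulating sum s:

```python
import numpy as np

def grad_f(x):
    return x - np.array([1.0, -2.0])   # illustrative quadratic

alpha, eps = 0.5, 1e-8
x = np.zeros(2)
s = np.zeros(2)                        # running sum of squared partials
for _ in range(500):
    g = grad_f(x)
    s = s + g * g                      # s_i^(k) = sum_j (g_i^(j))^2
    x = x - alpha / (eps + np.sqrt(s)) * g
print(x)                               # the effective step shrinks as s grows
```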
19. RMSProp
• RMSProp extends Adagrad to avoid the effect of a monotonically
decreasing learning rate.
$$\hat{\mathbf{s}}^{(k+1)} = \gamma \hat{\mathbf{s}}^{(k)} + (1 - \gamma)\left(\mathbf{g}^{(k)} \odot \mathbf{g}^{(k)}\right)$$
where the decay γ ∈ [0, 1] is typically close to 0.9.
• RMSProp’s update equation:
$$x_i^{(k+1)} = x_i^{(k)} - \frac{\alpha}{\varepsilon + \sqrt{\hat{s}_i^{(k)}}}\, g_i^{(k)} = x_i^{(k)} - \frac{\alpha}{\varepsilon + \text{RMS}(g_i)}\, g_i^{(k)}$$
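The corresponding RMSProp sketch; compared with Adagrad, only the accumulation rule for the squared gradients changes:

```python
import numpy as np

def grad_f(x):
    return x - np.array([1.0, -2.0])   # illustrative quadratic

alpha, gamma, eps = 0.1, 0.9, 1e-8
x = np.zeros(2)
s_hat = np.zeros(2)                    # decaying average of squared gradients
for _ in range(500):
    g = grad_f(x)
    s_hat = gamma * s_hat + (1 - gamma) * g * g   # elementwise g (.) g
    x = x - alpha / (eps + np.sqrt(s_hat)) * g
print(x)
```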
20. Adam
• Adam is a combination of RMSProp and momentum.
• It stores both
• an exponentially decaying squared gradient like RMSProp.
• an exponentially decaying gradient like momentum.
• The update equations are
• Biased decaying momentum:
$$\mathbf{v}^{(k+1)} = \gamma_v \mathbf{v}^{(k)} + (1 - \gamma_v)\, \mathbf{g}^{(k)}$$
• Biased decaying sq. gradient:
$$\mathbf{s}^{(k+1)} = \gamma_s \mathbf{s}^{(k)} + (1 - \gamma_s)\left(\mathbf{g}^{(k)} \odot \mathbf{g}^{(k)}\right)$$
• Corrected decaying momentum:
$$\hat{\mathbf{v}}^{(k+1)} = \mathbf{v}^{(k+1)} / (1 - \gamma_v^k)$$
• Corrected decaying sq. gradient:
$$\hat{\mathbf{s}}^{(k+1)} = \mathbf{s}^{(k+1)} / (1 - \gamma_s^k)$$
• Next iterate:
$$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} - \alpha\, \hat{\mathbf{v}}^{(k+1)} \Big/ \left(\varepsilon + \sqrt{\hat{\mathbf{s}}^{(k+1)}}\right)$$
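A minimal Adam sketch combining the two running estimates with bias correction (γ_v, γ_s, α, ε are common demo values, not prescribed by the slides):

```python
import numpy as np

def grad_f(x):
    return x - np.array([1.0, -2.0])   # illustrative quadratic

alpha, gv, gs, eps = 0.1, 0.9, 0.999, 1e-8
x = np.zeros(2)
v = np.zeros(2)                        # biased decaying momentum
s = np.zeros(2)                        # biased decaying squared gradient
for k in range(1, 501):
    g = grad_f(x)
    v = gv * v + (1 - gv) * g
    s = gs * s + (1 - gs) * g * g
    v_hat = v / (1 - gv ** k)          # bias-corrected momentum
    s_hat = s / (1 - gs ** k)          # bias-corrected squared gradient
    x = x - alpha * v_hat / (eps + np.sqrt(s_hat))
print(x)
```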