EE658 Optimization Techniques
Lecture 8
• Optimization Techniques for Deep Learning
Reference: “Pattern Recognition and Machine Learning” by Christopher M. Bishop
Deep Feedforward Networks
• First we construct M linear combinations of the input variables x_1, \dots, x_D in the form

  a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}

  The quantities a_j, j = 1, \dots, M, are known as activations.
• Each a_j is then transformed using a differentiable, nonlinear activation function h(·) to give

  z_j = h(a_j)
Deep Feedforward Networks (contd.)
• The z_j values are again linearly combined to give the output unit activations

  a_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)}

  where k = 1, \dots, K, and K is the total number of outputs.
Deep Feedforward Networks (contd.)
• Each output unit activation is transformed using a logistic sigmoid function so that

  y_k = σ(a_k), where σ(a) = \frac{1}{1 + \exp(−a)}

• Combining these stages gives the overall network function which, for sigmoidal output unit activation functions, takes the form
  y_k(x, w) = σ\left( \sum_{j=1}^{M} w_{kj}^{(2)} \, h\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right)

  y(x, w) = [y_1(x, w), y_2(x, w), \dots, y_K(x, w)]^T
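A minimal NumPy sketch of this forward computation (illustrative, not from the slides; the biases w_{j0}^{(1)} and w_{k0}^{(2)} are stored as separate vectors b1 and b2, and tanh is just one possible choice for the hidden activation h):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2, h=np.tanh):
    """Evaluate y(x, w) for the two-layer network above."""
    a_hidden = W1 @ x + b1     # a_j = sum_i w_ji^(1) x_i + w_j0^(1)
    z = h(a_hidden)            # z_j = h(a_j)
    a_out = W2 @ z + b2        # a_k = sum_j w_kj^(2) z_j + w_k0^(2)
    return sigmoid(a_out)      # y_k = sigma(a_k)

# Example: D = 3 inputs, M = 4 hidden units, K = 2 outputs.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
y = forward(rng.normal(size=3), W1, b1, W2, b2)  # length-2 vector in (0, 1)
```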
Network Training
• Given
  • a training set comprising a set of input vectors {x_n}, where n = 1, \dots, N, and
  • target vectors {t_n},
  the objective is to minimize the error function

  E(w) = \frac{1}{2} \sum_{n=1}^{N} \| y(x_n, w) − t_n \|^2

  Our goal is to find a vector w such that E(w) takes its smallest value.
  FONC: ∇E(w) = 0
Gradient Descent Optimization:

  w^{(τ+1)} = w^{(τ)} − η ∇E(w^{(τ)})

where the step-size parameter η is known as the learning rate.

(Figure: error surface E(w) with w_A a local minimum and w_B the global minimum.)
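A sketch of this iteration in NumPy (illustrative; grad_E, eta, and the stopping tolerance are placeholders), stopping when the first-order necessary condition approximately holds:

```python
import numpy as np

def gradient_descent(grad_E, w0, eta=0.1, max_iters=10_000, tol=1e-8):
    """Iterate w <- w - eta * grad_E(w) until grad E(w) is nearly zero."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iters):
        g = grad_E(w)
        if np.linalg.norm(g) < tol:   # approximate FONC
            break
        w = w - eta * g
    return w

# Example on E(w) = ||w||^2 / 2, whose gradient is w; the minimum is w = 0.
w_star = gradient_descent(lambda w: w, w0=[3.0, -2.0])
```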
Error Backpropagation or Backprop
• Error Backpropagation, or Backprop, is an efficient technique for evaluating the gradient of an error function E(w).
• The error function comprises a sum of terms, one for each data point in the training set:

  E(w) = \sum_{n=1}^{N} E_n(w), where E_n = \frac{1}{2} \sum_{k} (y_{nk} − t_{nk})^2

• Each unit computes a weighted sum of its inputs:

  a_j = \sum_{i} w_{ji} z_i

(Figure: network diagram indicating the forward and backward directions of propagation.)
Error Backpropagation (contd.)
• After activation, we get

  z_j = h(a_j)

  where h(·) is a nonlinear activation function.

Table: List of activation functions

  Name      h(a_j)                              Range
  Linear    a_j                                 (−∞, ∞)
  Sigmoid   1 / (1 + \exp(−a_j))                (0, 1)
  Softmax   \exp(a_j) / \sum_{j'} \exp(a_{j'})  (0, 1)
  ReLU      max(0, a_j)                         [0, ∞)
  tanh      tanh(a_j)                           (−1, 1)
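Each of these is a one-liner in NumPy (a minimal sketch; the function names are our own):

```python
import numpy as np

def linear(a):
    return a                                # range (-inf, inf)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))         # range (0, 1)

def softmax(a):
    e = np.exp(a - np.max(a))               # subtract max for numerical stability
    return e / e.sum()                      # entries in (0, 1), summing to 1

def relu(a):
    return np.maximum(0.0, a)               # range [0, inf)

# tanh is available directly as np.tanh, with range (-1, 1).
```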
Error Backpropagation (contd.)
• By using the chain rule of partial derivatives, we get

  \frac{∂E_n}{∂w_{ji}} = \underbrace{\frac{∂E_n}{∂a_j}}_{δ_j} \underbrace{\frac{∂a_j}{∂w_{ji}}}_{z_i} = δ_j z_i

  For a hidden unit,

  δ_j = \frac{∂E_n}{∂a_j} = \sum_{k} \frac{∂E_n}{∂a_k} \frac{∂a_k}{∂a_j} = h'(a_j) \sum_{k} δ_k w_{kj}

  [We have z_j = h(a_j); therefore a_j = \sum_{i} w_{ji} z_i = \sum_{i} w_{ji} h(a_i).]
For the sigmoid activation function, we have

  h'(a_j) = \frac{d}{da_j} \left[ \frac{1}{1 + \exp(−a_j)} \right] = \frac{1}{1 + \exp(−a_j)} \left( 1 − \frac{1}{1 + \exp(−a_j)} \right) = z_j (1 − z_j)
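A quick numerical check of this identity (illustrative; the test point a = 0.755 is arbitrary, chosen to match the worked example later):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = 0.755
z = sigmoid(a)
analytic = z * (1.0 - z)                                   # h'(a) = z(1 - z)
numeric = (sigmoid(a + 1e-6) - sigmoid(a - 1e-6)) / 2e-6   # central difference
assert abs(analytic - numeric) < 1e-8                      # the two agree
```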
Error Backpropagation (contd.)
• What is the δ at the output layer?

  δ_k = \frac{∂E_n}{∂a_k} = \frac{∂E_n}{∂y_k} \frac{∂y_k}{∂a_k}

  \frac{∂E_n}{∂y_k} = \frac{∂}{∂y_k} \left[ \frac{1}{2} \sum_{k} (y_k − t_k)^2 \right] = y_k − t_k

  \frac{∂y_k}{∂a_k} = \frac{∂}{∂a_k} [σ(a_k)] = \frac{∂}{∂a_k} \left[ \frac{1}{1 + \exp(−a_k)} \right] = \frac{\exp(−a_k)}{(1 + \exp(−a_k))^2} = y_k (1 − y_k)

• At the output layer,

  δ_k = y_k (1 − y_k) (y_k − t_k)
Steps for Error Backpropagation
Error Backpropagation
1. Apply an input vector x_n to the network and forward propagate through the network using the following equations to find the activations of all the hidden and output units:

   a_j = \sum_{i} w_{ji} z_i  and  z_j = h(a_j)

2. Evaluate δ_k for all the output units using

   δ_k = y_k (1 − y_k) (y_k − t_k)

3. Backpropagate the δs using the following equation to obtain δ_j for each hidden unit in the network:

   δ_j = h'(a_j) \sum_{k} δ_k w_{kj}

4. Evaluate the required derivatives:

   \frac{∂E_n}{∂w_{ji}} = δ_j z_i

5. Use the derivatives in gradient descent:

   w^{(τ+1)} = w^{(τ)} − η ∇E(w^{(τ)})
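The five steps fit in a few lines of NumPy. The following is a minimal sketch for a two-layer network with sigmoid hidden and output units and no bias terms, matching the worked example that follows; backprop_step and eta are our own names:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, W1, W2, eta=1.0):
    """One pass of steps 1-5 for a two-layer sigmoid network without biases."""
    # Step 1: forward propagate.
    z = sigmoid(W1 @ x)                    # hidden activations z_j = h(a_j)
    y = sigmoid(W2 @ z)                    # outputs y_k = sigma(a_k)
    # Step 2: output deltas, delta_k = y_k (1 - y_k)(y_k - t_k).
    d_out = y * (1.0 - y) * (y - t)
    # Step 3: backpropagate, delta_j = h'(a_j) sum_k delta_k w_kj.
    d_hid = z * (1.0 - z) * (W2.T @ d_out)
    # Step 4: derivatives dE_n/dw_ji = delta_j z_i.
    dW2, dW1 = np.outer(d_out, z), np.outer(d_hid, x)
    # Step 5: gradient-descent update.
    return W1 - eta * dW1, W2 - eta * dW2
```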
Example for BackProp
Forward Pass

  a_1 = w_{11}^{(1)} x_1 + w_{12}^{(1)} x_2 = 0.1 × 0.35 + 0.8 × 0.9 = 0.755

  a_2 = w_{21}^{(1)} x_1 + w_{22}^{(1)} x_2 = 0.4 × 0.35 + 0.6 × 0.9 = 0.68

  z_1 = \frac{1}{1 + \exp(−a_1)} = 0.68

  z_2 = \frac{1}{1 + \exp(−a_2)} = 0.6637

  y_1 = σ( w_{11}^{(2)} z_1 + w_{12}^{(2)} z_2 ) = σ(0.801) = \frac{1}{1 + \exp(−0.801)} = 0.69

(Figure: two-input, two-hidden-unit, single-output network with x_1 = 0.35, x_2 = 0.9; weights w_{11}^{(1)} = 0.1, w_{12}^{(1)} = 0.8, w_{21}^{(1)} = 0.4, w_{22}^{(1)} = 0.6, w_{11}^{(2)} = 0.3, w_{12}^{(2)} = 0.9; computed values a_1 = 0.755, a_2 = 0.68, z_1 = 0.68, z_2 = 0.6637, y_1 = 0.69; target t_1 = 0.5.)
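These numbers can be reproduced in a few lines (a sketch; the values in the comments are rounded):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([0.35, 0.9])
W1 = np.array([[0.1, 0.8],    # w11^(1), w12^(1)
               [0.4, 0.6]])   # w21^(1), w22^(1)
W2 = np.array([[0.3, 0.9]])   # w11^(2), w12^(2)

a = W1 @ x            # [0.755, 0.68]
z = sigmoid(a)        # [0.6803, 0.6637]
y = sigmoid(W2 @ z)   # [0.6903], i.e. y_1 ~ 0.69
```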
Example for BackProp
Backward Pass (Layer 2)

  δ_1^{out} = y_1 (1 − y_1)(y_1 − t_1) = 0.69 × (1 − 0.69) × (0.69 − 0.5) = 0.0406

  \frac{∂E}{∂w_{11}^{(2)}} = δ_1^{out} z_1 = 0.0406 × 0.68 = 0.02761

  \frac{∂E}{∂w_{12}^{(2)}} = δ_1^{out} z_2 = 0.0406 × 0.6637 = 0.0269

Weight updates (taking the learning rate η = 1):

  w_{11}^{(2)} ← w_{11}^{(2)} − \frac{∂E}{∂w_{11}^{(2)}} = 0.27239

  w_{12}^{(2)} ← w_{12}^{(2)} − \frac{∂E}{∂w_{12}^{(2)}} = 0.8731
(Figure: the same network annotated with δ_1^{out} = 0.0406, δ_1^{hid} = 0.00265, δ_2^{hid} = 0.0082 and the forward/backward propagation directions; recall \frac{∂E_n}{∂w_{ji}} = δ_j z_i.)
Example for BackProp
Backward Pass (Layer 1)

  δ_1^{hid} = z_1 (1 − z_1) δ_1^{out} w_{11}^{(2)} = 0.68 × (1 − 0.68) × 0.0406 × 0.3 = 0.00265

  δ_2^{hid} = z_2 (1 − z_2) δ_1^{out} w_{12}^{(2)} = 0.6637 × (1 − 0.6637) × 0.0406 × 0.9 = 0.0082

(Note that the old value w_{11}^{(2)} = 0.3 is used here: all gradients are evaluated at the current weights, before any update is applied.)

  \frac{∂E}{∂w_{11}^{(1)}} = δ_1^{hid} x_1 = 0.0009275

  \frac{∂E}{∂w_{12}^{(1)}} = δ_1^{hid} x_2 = 0.002385

  \frac{∂E}{∂w_{21}^{(1)}} = δ_2^{hid} x_1 = 0.00287

  \frac{∂E}{∂w_{22}^{(1)}} = δ_2^{hid} x_2 = 0.00738
(Figure: the same annotated network. The hidden-unit rule used here is δ_j = z_j (1 − z_j) \sum_{k} δ_k w_{kj}, together with \frac{∂E_n}{∂w_{ji}} = δ_j z_i.)
Example for BackProp
Backward Pass (Layer 1): weight updates

  w_{11}^{(1)} ← w_{11}^{(1)} − \frac{∂E}{∂w_{11}^{(1)}} = 0.0990725

  w_{12}^{(1)} ← w_{12}^{(1)} − \frac{∂E}{∂w_{12}^{(1)}} = 0.797612

  w_{21}^{(1)} ← w_{21}^{(1)} − \frac{∂E}{∂w_{21}^{(1)}} = 0.39713

  w_{22}^{(1)} ← w_{22}^{(1)} − \frac{∂E}{∂w_{22}^{(1)}} = 0.59262
Example for BackProp
Updated Network

The error after this iteration is

  E = \frac{1}{2} (0.6820 − 0.5)^2 = 0.0166

Note that before this iteration, the error was

  E = \frac{1}{2} (0.69 − 0.5)^2 = 0.0180

(Figure: the updated network with w_{11}^{(1)} = 0.0990725, w_{12}^{(1)} = 0.7976, w_{21}^{(1)} = 0.397, w_{22}^{(1)} = 0.59262, w_{11}^{(2)} = 0.27239, w_{12}^{(2)} = 0.8731, giving a_1 = 0.75, a_2 = 0.67, z_1 = 0.6797, z_2 = 0.6620, y_1 = 0.682 for x_1 = 0.35, x_2 = 0.9, t_1 = 0.5.)
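The whole iteration can be checked end to end with a short script (a sketch with η = 1 as in the slides; tiny differences from the slide values come from rounding on the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x, t = np.array([0.35, 0.9]), np.array([0.5])
W1 = np.array([[0.1, 0.8], [0.4, 0.6]])
W2 = np.array([[0.3, 0.9]])

z = sigmoid(W1 @ x)
y = sigmoid(W2 @ z)
E_before = 0.5 * float((y - t) @ (y - t))   # ~0.0180

d_out = y * (1 - y) * (y - t)               # delta_1^out ~ 0.0406
d_hid = z * (1 - z) * (W2.T @ d_out)        # ~[0.00265, 0.00817]
W2 = W2 - np.outer(d_out, z)                # ~[[0.2723, 0.8730]]
W1 = W1 - np.outer(d_hid, x)                # ~[[0.0991, 0.7976], [0.3971, 0.5926]]

y_new = sigmoid(W2 @ sigmoid(W1 @ x))       # ~0.682
E_after = 0.5 * float((y_new - t) @ (y_new - t))   # ~0.0166, down from ~0.0180
```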
Momentum
• Gradient descent will take a long time to traverse a nearly flat surface, as shown in the following figure: regions that are nearly flat have gradients with small magnitudes and can thus require many iterations of gradient descent to traverse.

(Figure: a long, nearly flat region with small gradient.)

• The idea is to introduce memory:

  v^{(k+1)} = β v^{(k)} − α ∇f(x^{(k)})

  x^{(k+1)} = x^{(k)} + v^{(k+1)}

• Observe that for β = 0, we recover gradient descent.
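A minimal sketch of the momentum update (illustrative; the step sizes alpha and beta are placeholder defaults, not prescribed by the slides):

```python
import numpy as np

def momentum_step(x, v, grad_f, alpha=0.01, beta=0.9):
    """v^(k+1) = beta v^(k) - alpha grad f(x^(k));  x^(k+1) = x^(k) + v^(k+1)."""
    v = beta * v - alpha * grad_f(x)
    return x + v, v

# Example: minimize f(x) = ||x||^2 / 2, whose gradient is x.
x, v = np.array([5.0, 5.0]), np.zeros(2)
for _ in range(200):
    x, v = momentum_step(x, v, grad_f=lambda x: x)
```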
Nesterov Momentum
• Nesterov momentum uses the gradient at the projected future position x^{(k)} + β v^{(k)}:

  v^{(k+1)} = β v^{(k)} − α ∇f( \underbrace{x^{(k)} + β v^{(k)}}_{\text{future position}} )

  x^{(k+1)} = x^{(k)} + v^{(k+1)}
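The only change from the momentum step is where the gradient is evaluated (same illustrative conventions as the momentum sketch above):

```python
import numpy as np

def nesterov_step(x, v, grad_f, alpha=0.01, beta=0.9):
    """Evaluate the gradient at the look-ahead point x + beta*v, then step."""
    v = beta * v - alpha * grad_f(x + beta * v)
    return x + v, v
```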
Adaptive Subgradient (Adagrad)
• Momentum and Nesterov momentum update all components of x with the same learning rate.
• The adaptive subgradient method, or Adagrad, adapts a learning rate for each component of x:

  x_i^{(k+1)} = x_i^{(k)} − \frac{α}{ε + \sqrt{s_i^{(k)}}} g_i^{(k)}

  [g_i is the ith component of the gradient]

  where s^{(k)} is a vector whose ith entry is the sum of the squares of the partials, with respect to x_i, up to time step k:

  s_i^{(k)} = \sum_{j=1}^{k} \left( g_i^{(j)} \right)^2

• The components of s accumulate the squared partials, which causes the effective learning rate to decrease during training.
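A sketch of one Adagrad step (illustrative; alpha and eps are placeholder defaults):

```python
import numpy as np

def adagrad_step(x, s, grad_f, alpha=0.01, eps=1e-8):
    """Each component of x gets its own effective rate alpha/(eps + sqrt(s_i))."""
    g = grad_f(x)
    s = s + g * g                                 # running sum of squared partials
    return x - alpha / (eps + np.sqrt(s)) * g, s
```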
RMSProp
• RMSProp extends Adagrad to avoid the effect of a monotonically decreasing learning rate. It maintains a decaying average of the squared gradient:

  ŝ^{(k+1)} = γ ŝ^{(k)} + (1 − γ) \left( g^{(k)} ⊙ g^{(k)} \right)

  where the decay γ ∈ [0, 1] is typically close to 0.9.
• RMSProp's update equation:

  x_i^{(k+1)} = x_i^{(k)} − \frac{α}{ε + \sqrt{ŝ_i^{(k)}}} g_i^{(k)} = x_i^{(k)} − \frac{α}{ε + \text{RMS}(g_i)} g_i^{(k)}
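A sketch of one RMSProp step (illustrative defaults; only the accumulator changes relative to Adagrad):

```python
import numpy as np

def rmsprop_step(x, s_hat, grad_f, alpha=0.01, gamma=0.9, eps=1e-8):
    """Replace Adagrad's running sum with an exponentially decaying average."""
    g = grad_f(x)
    s_hat = gamma * s_hat + (1 - gamma) * g * g
    return x - alpha / (eps + np.sqrt(s_hat)) * g, s_hat
```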
Adam
• Adam is a combination of RMSProp and momentum.
• It stores both
  • an exponentially decaying squared gradient, like RMSProp, and
  • an exponentially decaying gradient, like momentum.
• The update equations are
  • Biased decaying momentum:         v^{(k+1)} = γ_v v^{(k)} + (1 − γ_v) g^{(k)}
  • Biased decaying sq. gradient:     s^{(k+1)} = γ_s s^{(k)} + (1 − γ_s) \left( g^{(k)} ⊙ g^{(k)} \right)
  • Corrected decaying momentum:      v̂^{(k+1)} = v^{(k+1)} / (1 − γ_v^k)
  • Corrected decaying sq. gradient:  ŝ^{(k+1)} = s^{(k+1)} / (1 − γ_s^k)
  • Next iterate:                     x^{(k+1)} = x^{(k)} − α v̂^{(k+1)} / \left( ε + \sqrt{ŝ^{(k+1)}} \right)
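A sketch of one Adam step (illustrative; the defaults alpha = 0.001, gamma_v = 0.9, gamma_s = 0.999 are commonly used values, not taken from the slides):

```python
import numpy as np

def adam_step(x, v, s, k, grad_f, alpha=0.001,
              gamma_v=0.9, gamma_s=0.999, eps=1e-8):
    """One Adam update; k is the 1-based iteration index used for bias correction."""
    g = grad_f(x)
    v = gamma_v * v + (1 - gamma_v) * g        # biased decaying momentum
    s = gamma_s * s + (1 - gamma_s) * g * g    # biased decaying sq. gradient
    v_hat = v / (1 - gamma_v ** k)             # bias-corrected momentum
    s_hat = s / (1 - gamma_s ** k)             # bias-corrected sq. gradient
    return x - alpha * v_hat / (eps + np.sqrt(s_hat)), v, s
```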