Derivation of the gradient descent rule.

2217004

A topic in neural networks and machine learning

• To calculate the direction of steepest descent
along the error surface:
• This direction can be found by computing the
derivative of E with respect to each
component of the vector 𝑤.
• This vector derivative is called the gradient of
E with respect to 𝑤, written 𝛻𝐸(𝑤).
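
Written out component by component, the gradient can be sketched as follows. The indexing of the weights from w_0 to w_n is an assumption made here for illustration; the surviving slide text does not state it.

```latex
\nabla E(\vec{w}) \equiv
\left[
  \frac{\partial E}{\partial w_0},\
  \frac{\partial E}{\partial w_1},\
  \ldots,\
  \frac{\partial E}{\partial w_n}
\right]
```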

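• Since the gradient specifies the direction of
steepest increase of E, the training rule for
gradient descent moves 𝑤 in the opposite direction:
𝑤 ← 𝑤 + Δ𝑤, where Δ𝑤 = −η 𝛻𝐸(𝑤)
and η is a positive constant called the learning rate.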

• The training rule can also be written in its
component form:
wi ← wi + Δwi, where Δwi = −η ∂E/∂wi
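
A minimal sketch of standard (batch) gradient descent, assuming a single linear unit with output o = 𝑤 · 𝑥 and the squared error E = ½ Σ (t − o)² summed over all training examples. The error definition, the function name `gradient_descent`, and the parameters `eta` and `epochs` are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def gradient_descent(X, t, eta=0.05, epochs=500):
    """Batch gradient descent for a linear unit (illustrative sketch).

    X: (num_examples, num_weights) array of inputs.
    t: (num_examples,) array of target outputs.
    Assumes E(w) = 1/2 * sum_d (t_d - o_d)^2 with o_d = w . x_d.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                   # outputs for all training examples
        grad = -(X.T @ (t - o))     # dE/dw_i = -sum_d (t_d - o_d) * x_id
        w = w - eta * grad          # step against the gradient
    return w

# Usage: fit t = 1 + 2x; the first input column is a constant 1 (bias).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
t = np.array([1.0, 3.0, 5.0, 7.0])
w = gradient_descent(X, t)          # approaches [1, 2]
```

Each weight update is computed only after summing the error contribution of every training example, which is what distinguishes this form from the incremental variant.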

STOCHASTIC APPROXIMATION TO
GRADIENT DESCENT
• Gradient descent is a strategy for searching
through a large or infinite hypothesis space
that can be applied whenever
(1) the hypothesis space contains continuously
parameterized hypotheses
(2) the error can be differentiated with respect
to these hypothesis parameters.

• The key practical difficulties in applying
gradient descent are
(1) converging to a local minimum can
sometimes be quite slow
(2) if there are multiple local minima in the error
surface, then there is no guarantee that the
procedure will find the global minimum.

• One common variation on gradient descent
intended to alleviate these difficulties is called
incremental gradient descent, or alternatively
stochastic gradient descent.

• Whereas the standard gradient descent
training rule presented above computes
weight updates after summing over all the
training examples in D, the idea behind
stochastic gradient descent is to approximate
this gradient descent search by updating
weights incrementally, following the
calculation of the error for each individual
example.
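
The incremental update can be sketched as below, again assuming a linear unit and a squared error per example, so that each example d contributes the update Δwi = η (t − o) xi. The function name `sgd_linear_unit` and its parameters are hypothetical.

```python
import numpy as np

def sgd_linear_unit(X, t, eta=0.05, epochs=500):
    """Stochastic (incremental) gradient descent for a linear unit.

    Assumes the per-example error E_d = 1/2 * (t_d - o_d)^2.
    The weights change after every example, not after a full pass over D.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_d, t_d in zip(X, t):
            o_d = x_d @ w                    # output for this example only
            w += eta * (t_d - o_d) * x_d     # immediate weight update
    return w

# Usage on the same kind of data: t = 1 + 2x with a constant bias input.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
t = np.array([1.0, 3.0, 5.0, 7.0])
w = sgd_linear_unit(X, t)                    # approaches [1, 2]
```

The only structural difference from the batch version is where the update sits: inside the loop over examples instead of after it.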

MULTILAYER NETWORKS AND
THE BACKPROPAGATION ALGORITHM
• Single perceptrons can only express linear
decision surfaces.

A Differentiable Threshold Unit
• The sigmoid threshold unit computes its output o as
o = σ(𝑤 · 𝑥), where σ(y) = 1 / (1 + e^(−y))
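
A small sketch of the sigmoid function and its derivative, assuming the usual logistic form σ(y) = 1 / (1 + e^(−y)). The derivative σ(y)(1 − σ(y)) is the reason the factor o(1 − o) appears in the backpropagation rules. Function names are illustrative.

```python
import math

def sigmoid(y):
    """Logistic function: a smooth, differentiable threshold.

    Note: math.exp(-y) can overflow for very large negative y;
    that is acceptable for a sketch but not for production code.
    """
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_derivative(y):
    """d(sigma)/dy expressed through the output: sigma(y) * (1 - sigma(y))."""
    s = sigmoid(y)
    return s * (1.0 - s)

print(sigmoid(0.0))             # 0.5
print(sigmoid_derivative(0.0))  # 0.25
```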

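The BACKPROPAGATION Algorithm
• We begin by redefining E to sum the errors
over all of the network output units:
E(𝑤) = ½ Σ_{d ∈ D} Σ_{k ∈ outputs} (t_kd − o_kd)²
where t_kd and o_kd are the target and output values
of the k-th output unit for training example d.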

Derivation of the
BACKPROPAGATION Rule
• The specific problem we address here is
deriving the stochastic gradient descent rule
implemented by the BACKPROPAGATION algorithm.

• Stochastic gradient descent involves iterating
through the training examples one at a time,
for each training example d descending the
gradient of the error Ed with respect to this
single example.

• In other words, for each training example d
every weight wji is updated by adding to it
Δwji, where Δwji = −η ∂Ed/∂wji.
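
The per-example error Ed being descended can be written out as below. This assumes the squared-error form over the network's output units, with t_k and o_k denoting the target and actual output of unit k for the current example; these symbols are assumptions where the slide's own equation did not survive.

```latex
E_d(\vec{w}) \equiv \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2
```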

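• To begin, notice that weight wji can influence
the rest of the network only through netj,
where netj = Σi wji xji is the weighted sum of
inputs for unit j and xji is the i-th input to unit j.
By the chain rule,
∂Ed/∂wji = (∂Ed/∂netj)(∂netj/∂wji) = (∂Ed/∂netj) xji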

Case 1: Training Rule for Output Unit
Weights.
• netj can influence the network only through oj.

• Finally, we have the stochastic gradient
descent rule for output units:
Δwji = −η ∂Ed/∂wji = η (tj − oj) oj (1 − oj) xji
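
The intermediate steps behind the output-unit rule can be sketched as follows, assuming a sigmoid output unit with o_j = σ(net_j) and the squared-error E_d over output units (both assumptions where the original equations are missing).

```latex
\begin{aligned}
\frac{\partial E_d}{\partial net_j}
  &= \frac{\partial E_d}{\partial o_j}\,\frac{\partial o_j}{\partial net_j} \\
\frac{\partial E_d}{\partial o_j}
  &= \frac{\partial}{\partial o_j}\,\frac{1}{2}\sum_{k \in outputs}(t_k - o_k)^2
   = -(t_j - o_j) \\
\frac{\partial o_j}{\partial net_j}
  &= \frac{\partial \sigma(net_j)}{\partial net_j}
   = o_j\,(1 - o_j) \\
\frac{\partial E_d}{\partial net_j}
  &= -(t_j - o_j)\, o_j\,(1 - o_j)
\end{aligned}
```

Only the k = j term of the sum depends on o_j, which is why the sum collapses to a single term in the second line.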

Case 2: Training Rule for Hidden Unit
Weights

• netj can influence the network outputs (and
therefore Ed) only through the units in
Downstream(j).
• So, writing δj for −∂Ed/∂netj,
δj = oj (1 − oj) Σ_{k ∈ Downstream(j)} δk wkj
and Δwji = η δj xji
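
Both cases can be combined into one stochastic gradient descent step. The sketch below assumes a network with a single hidden layer of sigmoid units, sigmoid output units, no bias terms, and the squared error Ed = ½ Σ (tk − ok)². For such a network, Downstream(j) of a hidden unit j is simply the set of all output units. The function names and the weight-matrix layout are illustrative assumptions.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def example_error(W_hidden, W_out, x, t):
    """E_d = 1/2 * sum_k (t_k - o_k)^2 for one training example."""
    o = sigmoid(W_out @ sigmoid(W_hidden @ x))
    return 0.5 * np.sum((t - o) ** 2)

def backprop_step(W_hidden, W_out, x, t, eta=0.5):
    """One SGD step for a single example (x, t); returns updated copies.

    W_hidden[j, i]: weight from input i to hidden unit j.
    W_out[k, j]:    weight from hidden unit j to output unit k.
    """
    # Forward pass.
    h = sigmoid(W_hidden @ x)       # hidden unit outputs o_j
    o = sigmoid(W_out @ h)          # network outputs o_k
    # Case 1, output units: delta_k = (t_k - o_k) * o_k * (1 - o_k)
    delta_out = (t - o) * o * (1.0 - o)
    # Case 2, hidden units: delta_j = o_j * (1 - o_j) * sum_k delta_k * w_kj
    delta_hidden = h * (1.0 - h) * (W_out.T @ delta_out)
    # Weight updates: Delta w_ji = eta * delta_j * x_ji
    new_W_out = W_out + eta * np.outer(delta_out, h)
    new_W_hidden = W_hidden + eta * np.outer(delta_hidden, x)
    return new_W_hidden, new_W_out

# Usage: a 2-input, 3-hidden, 2-output network with random weights.
rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(3, 2))
W_out = rng.normal(size=(2, 3))
x = np.array([0.5, -1.0])
t = np.array([1.0, 0.0])
W_hidden2, W_out2 = backprop_step(W_hidden, W_out, x, t, eta=0.05)
```

The hidden deltas are computed from the output weights as they were before the update, so both weight matrices are updated from the same forward pass.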


