RECURSIVE FORMULATION OF LOSS GRADIENT IN A DENSE
FEED-FORWARD DEEP NEURAL NETWORK
ASHWIN RAO
1. Motivation
The purpose of this short paper is to derive the recursive formulation of the gradient of
the loss function with respect to the parameters of a dense feed-forward deep neural network.
We start by setting up the notation and then we state the recursive formulation of the loss
gradient. We derive this formulation for a fairly general case: where the supervisory
variable has a conditional probability density modeled as an arbitrary Generalized
Linear Model (GLM) “normal-form” probability density, when the output layer’s
activation function is the GLM canonical link function, and when the loss function
is cross-entropy loss. Finally, we show that Linear Regression and Softmax Classification are
special cases of this generic GLM structure, and so this formulation of the gradient is applicable
to the special cases of dense feed-forward deep neural networks whose output layer activation
function is linear (for regression) or softmax (for classification).
2. Notation
Consider a “vanilla” (i.e., dense feed-forward) deep neural network with L layers. Layers
l = 0, 1, . . . , L − 1 carry the hidden layer neurons and layer l = L carries the output layer
neurons.
Let Il be the inputs to layer l and let Ol be the outputs of layer l for all l = 0, 1, . . . , L. We
know that Il+1 = Ol for all l = 0, 1, . . . , L − 1. Note that the number of neurons in layer l
will be the number of outputs of layer l (= |Ol|). In the following exposition, we assume that
the input to this network is a single training data point (input features) denoted as X and the
output of the network is supervised by the supervisory value Y associated with X (we work
with a single training data point in this entire paper because the gradient calculation can be
independently applied to each training data point and finally summed over the training data
points). So, I0 = X and OL is the network’s prediction for input X (that will be supervised by
Y ).
We denote by $K$ the number of elements in the supervisory variable $Y$ ($K$ will also be the number of neurons in the output layer as well as the number of outputs produced by the output layer). So, $|O_L| = |Y| = K$. We denote the $r$-th element of $O_L$ (the output of the $r$-th neuron in the output layer) as $O_L^{(r)}$ and the $r$-th element of $Y$ as $Y^{(r)}$ for all $r = 1, \ldots, K$.
Note that $K = |O_L| = |Y| = 1$ for regression. For classification, $K$ is the number of classes, with $K = |O_L|$ neurons in the output layer. The supervisory variable $Y$ will have a “one-hot” encoding of length $K$: if the supervisory variable represents class $k \in [1, K]$, then $Y^{(k)} = 1$ and $Y^{(r)} = 0$ for all $r = 1, \ldots, k-1, k+1, \ldots, K$ (the non-$k$ elements of $Y$ are 0).
The error $E = O_L - Y$ is a scalar for regression. For classification, $E = O_L - Y$ is a vector of length $K$ whose $k$-th element is $O_L^{(k)} - 1$ and whose other (non-$k$) elements are $O_L^{(r)}$ for all $r = 1, \ldots, k-1, k+1, \ldots, K$.
We denote the parameters (“weights”) of layer $l$ as $W_l$ ($W_l$ is a matrix of size $|O_l| \times |I_l|$). For ease of exposition, we will ignore the bias parameters (as we can always treat $X$ as augmented by a “fake” feature that is always 1). We denote the activation function of layer $l$ as $g_l(\cdot)$ for all $l = 0, 1, \ldots, L$. Let $S_l = I_l \cdot W_l$ for all $l = 0, 1, \ldots, L$, so that $O_l = g_l(S_l)$.
Notation Description
$I_l$: Inputs to layer $l$ for all $l = 0, 1, \ldots, L$
$O_l$: Outputs of layer $l$ for all $l = 0, 1, \ldots, L$
$X$: Input features to the network for a single training data point
$Y$: Supervisory value associated with input $X$
$W_l$: Parameters (“weights”) of layer $l$ for all $l = 0, 1, \ldots, L$
$E$: Error $O_L - Y$ for the single training data point $(X, Y)$
$g_l(\cdot)$: Activation function of layer $l$ for all $l = 0, 1, \ldots, L$
$S_l$: $S_l = I_l \cdot W_l$, $O_l = g_l(S_l)$ for all $l = 0, 1, \ldots, L$
$P_l$: “Proxy Error” for layer $l$ for all $l = 0, 1, \ldots, L$
$\lambda_l$: Regularization coefficient for layer $l$ for all $l = 0, 1, \ldots, L$
3. Recursive Formulation
We define the “proxy error” $P_l$ for layer $l$ recursively as follows for all $l = 0, 1, \ldots, L$:
$$P_l = (P_{l+1} \cdot W_{l+1}) \circ g_l'(S_l)$$
with the recursion terminating as:
$$P_L = E = O_L - Y$$
Then, the gradient of the loss function with respect to the parameters of layer $l$ for all $l = 0, 1, \ldots, L$ will be:
$$\frac{\partial Loss}{\partial W_l} = P_l \otimes I_l$$
Section 4 provides a proof of the above when the supervisory variable has a conditional probability density modeled as the Generalized Linear Model (GLM) “normal-form” probability density (i.e., a fairly generic family of probability distributions), when the output layer’s activation function is the GLM canonical link function, and when the loss function is cross-entropy loss. We show in the final two sections that Linear Regression and Softmax Classification are special cases of this generic GLM structure, and so this formulation of the gradient is applicable to the special cases of dense feed-forward deep neural networks whose output layer activation function is linear (for regression) or softmax (for classification). For now, we give an overview of the proof in the following three bullet points:
• “Proxy Error” $P_l$ is defined as $\frac{\partial Loss}{\partial S_l}$, and so the gradient $\frac{\partial Loss}{\partial W_l} = \frac{\partial Loss}{\partial S_l} \otimes \frac{\partial S_l}{\partial W_l} = P_l \otimes I_l$.
• The important result $P_L = E = O_L - Y$ is due to the important GLM result $A'(S_L) = g_L(S_L) = O_L$, where $A(\cdot)$ is the key function in the probability density functional form for GLM.
• The recursive formulation for $P_l$ has nothing to do with GLM; it is purely due to the structure of the feed-forward network: $S_{l+1} = g_l(S_l) \cdot W_{l+1}$. The differentiation chain rule yields $\frac{\partial Loss}{\partial S_l} = \frac{\partial Loss}{\partial S_{l+1}} \cdot \frac{\partial S_{l+1}}{\partial S_l}$, which gives us the recursive formulation $P_l = (P_{l+1} \cdot W_{l+1}) \circ g_l'(S_l)$.
The detailed proof is in Section 4.
Note that $P_{l+1} \cdot W_{l+1}$ is the product of the $|O_{l+1}|$-size vector $P_{l+1}$ and the $|O_{l+1}| \times |I_{l+1}|$-size matrix $W_{l+1}$, and the resultant $|I_{l+1}| = |O_l|$-size vector $P_{l+1} \cdot W_{l+1}$ is multiplied pointwise (Hadamard product) with the $|O_l|$-size vector $g_l'(S_l)$ to yield the $|O_l|$-size vector $P_l$.
$P_l \otimes I_l$ is the matrix outer-product of the $|O_l|$-size vector $P_l$ and the $|I_l|$-size vector $I_l$. Hence, $\frac{\partial Loss}{\partial W_l}$ is a matrix of size $|O_l| \times |I_l|$.
If we do L2 regularization (with $\lambda_l$ as the regularization coefficient in layer $l$), then:
$$\frac{\partial Loss}{\partial W_l} = P_l \otimes I_l + 2\lambda_l W_l$$
If we do L1 regularization (with $\lambda_l$ as the regularization coefficient in layer $l$), then:
$$\frac{\partial Loss}{\partial W_l} = P_l \otimes I_l + \lambda_l \operatorname{sign}(W_l)$$
where $\operatorname{sign}(W_l)$ is the sign operation applied pointwise to the elements of the matrix $W_l$.
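To make the recursion concrete, here is a minimal NumPy sketch of the backward pass for a single training point. This is my own illustration rather than code from the paper; the relu hidden activation, the function names, and the `out_activation` argument are assumptions. It follows the paper’s conventions: $W_l$ has shape $|O_l| \times |I_l|$, $S_l = I_l \cdot W_l$ is computed as a matrix-vector product, the recursion starts from $P_L = O_L - Y$, and each layer’s gradient is the outer product $P_l \otimes I_l$, with optional L2 or L1 regularization added as above.

```python
import numpy as np

# Illustrative sketch (not from the paper); relu hidden layers are an assumed choice.
def relu(s):
    return np.maximum(s, 0.0)

def relu_deriv(s):
    return (s > 0.0).astype(float)

def forward(x, weights, out_activation):
    """Forward pass; records each layer's input I_l and pre-activation S_l."""
    inputs, pre_acts = [], []
    a = x                                    # I_0 = X
    L = len(weights) - 1
    for l, W in enumerate(weights):
        inputs.append(a)                     # I_l
        s = W @ a                            # S_l = I_l . W_l  (W_l has shape (|O_l|, |I_l|))
        pre_acts.append(s)
        a = out_activation(s) if l == L else relu(s)   # O_l = g_l(S_l)
    return a, inputs, pre_acts               # a is O_L

def loss_gradients(x, y, weights, out_activation, lam=0.0, reg="l2"):
    """dLoss/dW_l for every layer via the proxy-error recursion."""
    o_L, inputs, pre_acts = forward(x, weights, out_activation)
    L = len(weights) - 1
    grads = [None] * len(weights)
    P = o_L - y                              # P_L = E = O_L - Y
    for l in range(L, -1, -1):
        grad = np.outer(P, inputs[l])        # dLoss/dW_l = P_l (outer product) I_l
        if lam > 0.0:                        # optional L2 / L1 regularization terms
            grad += 2.0 * lam * weights[l] if reg == "l2" else lam * np.sign(weights[l])
        grads[l] = grad
        if l > 0:                            # P_{l-1} = (P_l . W_l) Hadamard g'_{l-1}(S_{l-1})
            P = (P @ weights[l]) * relu_deriv(pre_acts[l - 1])
    return grads
```

For regression one would pass the identity function as `out_activation`; for softmax classification, a softmax. In both cases the derivation in Section 4 is what justifies starting the recursion from $O_L - Y$.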
4. Derivation of Gradient Formulation for GLM conditional probability
density and Canonical Link Function
Here we show that the gradient formula shown in the previous section is applicable when the supervisory variable has a conditional probability density modeled as the Generalized Linear Model (GLM) “normal-form” probability density (i.e., a fairly generic family of probability distributions), when the output layer’s activation function is the GLM canonical link function, and when the loss function is cross-entropy loss. First, we give a quick summary of GLM (independent of neural networks).
4.1. The GLM Setting. In the GLM setting, the conditional probability density of the supervisory variable $Y$ is modeled as:
$$f(Y \mid \theta, \tau) = h(Y, \tau) \cdot e^{\frac{b(\theta) \cdot T(Y) - A(\theta)}{d(\tau)}}$$
where $\theta$ should be thought of as the “center” parameter (related to the mean) of the probability distribution and $\tau$ should be thought of as the “dispersion” parameter (related to the variance) of the distribution. $h(\cdot, \cdot), b(\cdot), A(\cdot), d(\cdot)$ are general functions whose specializations define the family of distributions that can be modeled in the GLM framework. Note that this form is a generalization of the exponential-family functional form incorporating the “dispersion” parameter $\tau$ (for this reason, the above form is known as the “overdispersed exponential family” of distributions).
It is important to mention here that the GLM framework operates even when the supervisory variable $Y$ is multi-dimensional. We will denote the dimension of $Y$ and $\theta$ as $K$ (e.g., for classification, this would mean $K$ classes). Note that $b(\theta) \cdot T(Y)$ is the inner-product between the vectors $b(\theta)$ and $T(Y)$.
When we specialize the above probability density form to the so-called “normal-form” (which means both $b(\cdot)$ and $T(\cdot)$ are identity functions), the conditional probability density of the supervisory variable becomes (this is the form we work with in this paper):
$$f(Y \mid \theta, \tau) = h(Y, \tau) \cdot e^{\frac{\theta \cdot Y - A(\theta)}{d(\tau)}}$$
The GLM link function $q(\cdot)$ is the function that transforms the linear predictor $\eta$ (the inner-product of the input vector and the parameter matrix) to the mean of the supervisory variable $Y$ conditioned on the linear predictor. Therefore,
$$E[Y \mid \theta] = q(\eta)$$
We will specialize the GLM link function $q(\cdot)$ to be the canonical link function (“canonical” simply means that $\theta$ is equal to the linear predictor $\eta$). So for a canonical link function $q(\cdot)$,
$$E[Y \mid \theta] = q(\theta)$$
Note that $q$ transforms a vector of length $K$ to a vector of length $K$ ($\theta \to E[Y \mid \theta]$).
4.2. Key Result. Since
$$\int_y f(y \mid \theta, \tau)\,dy = 1,$$
its partial derivative with respect to $\theta$ is zero. In other words,
$$\frac{\partial}{\partial \theta} \left( \int_y f(y \mid \theta, \tau)\,dy \right) = 0$$
Hence,
$$\frac{\partial}{\partial \theta} \left( \int_y h(y, \tau) \cdot e^{\frac{\theta \cdot y - A(\theta)}{d(\tau)}}\,dy \right) = 0$$
Taking the partial derivative inside the integral, we get:
$$\int_y h(y, \tau) \cdot e^{\frac{\theta \cdot y - A(\theta)}{d(\tau)}} \cdot \frac{y - \frac{\partial A}{\partial \theta}}{d(\tau)}\,dy = 0$$
$$\int_y f(y \mid \theta, \tau) \cdot \left(y - \frac{\partial A}{\partial \theta}\right)\,dy = 0$$
$$E[Y \mid \theta] - \frac{\partial A}{\partial \theta} = 0$$
This gives us the key result (which we will utilize later in our proof):
$$E[Y \mid \theta] = \frac{\partial A}{\partial \theta} \quad \text{(which, as we know from above, is } = q(\theta)\text{)}$$
As an aside, we want to mention that apart from allowing for a non-linear relationship between the input and the mean of the supervisory variable, GLM allows for heteroskedastic models (i.e., the variance of the supervisory variable is a function of the input variables). Specifically,
$$Variance[Y \mid \theta] = \frac{\partial^2 A}{\partial \theta^2} \cdot d(\tau)$$
(the derivation is similar to that of $E[Y \mid \theta]$ above).
4.3. Common Examples of GLM. Several standard probability distributions paired with
their canonical link function are special cases of “normal-form” GLM. For example,
• Normal distribution $N(\mu, \sigma)$ paired with the identity function ($q(\theta) = \theta$) is Linear Regression ($\theta = \mu$, $\tau = \sigma$, $h(Y, \tau) = \frac{e^{-\frac{Y^2}{2\tau^2}}}{\sqrt{2\pi}\tau}$, $A(\theta) = \frac{\theta^2}{2}$, $d(\tau) = \tau^2$).
• Bernoulli distribution parameterized by $p$ paired with the logistic function ($q(\theta) = \frac{1}{1+e^{-\theta}}$) is Logistic Regression ($\theta = \log(\frac{p}{1-p})$, $\tau = h(Y, \tau) = d(\tau) = 1$, $A(\theta) = \log(1 + e^{\theta})$).
• Poisson distribution parameterized by $\lambda$ paired with the exponential function ($q(\theta) = e^{\theta}$) is Log-Linear Regression ($\theta = \log \lambda$, $\tau = d(\tau) = 1$, $h(Y, \tau) = \frac{1}{Y!}$, $A(\theta) = e^{\theta}$).
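As a small numerical check of the key result $E[Y \mid \theta] = \frac{\partial A}{\partial \theta}$ (my own illustration, not part of the paper), the sketch below uses the Bernoulli specialization above: $A(\theta) = \log(1 + e^{\theta})$, whose first derivative should equal the mean $p = q(\theta)$, and whose second derivative (times $d(\tau) = 1$) should equal the variance $p(1 - p)$.

```python
import numpy as np

# Illustrative check of E[Y|theta] = A'(theta) for the Bernoulli GLM (assumed example value of theta).
theta = 0.7                                    # an arbitrary canonical parameter
p = 1.0 / (1.0 + np.exp(-theta))               # mean of the Bernoulli, q(theta)

A = lambda t: np.log(1.0 + np.exp(t))          # A(theta) for the Bernoulli case
eps = 1e-5
dA = (A(theta + eps) - A(theta - eps)) / (2 * eps)                   # numerical A'(theta)
d2A = (A(theta + eps) - 2 * A(theta) + A(theta - eps)) / eps ** 2    # numerical A''(theta)

assert abs(dA - p) < 1e-6                      # E[Y|theta] = A'(theta) = q(theta)
assert abs(d2A - p * (1 - p)) < 1e-3           # Var[Y|theta] = A''(theta) * d(tau), with d(tau) = 1
```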
4.4. GLM in a Dense Feed-Forward Deep Neural Network. Now we come to our dense
feed-forward deep neural network setting. Let us assume that the supervisory variable Y has
a conditional probability distribution of the GLM “normal-form” expressed above and let us
assume that the activation function of the output layer gL(·) is the canonical link function q(·).
Then,
$$E[Y \mid \theta] = \frac{\partial A}{\partial \theta} = q(\theta) = q(I_L \cdot W_L) = g_L(I_L \cdot W_L) = g_L(S_L) = O_L$$
In the above equation, we note that each of $Y, \theta, S_L, O_L$ is a vector of length $K$.
The Cross-Entropy Loss (Negative Log-Likelihood) of the training data point $(X, Y)$ is given by:
$$Loss = -\log(h(Y, \tau)) + \frac{A(\theta) - \theta \cdot Y}{d(\tau)}$$
We define the “Proxy Error” $P_l$ for layer $l$ (for all $l = 0, 1, \ldots, L$) as:
$$P_l = d(\tau) \cdot \frac{\partial Loss}{\partial S_l}$$
4.5. Recursion Termination. We want to first establish the termination of the recursion for $P_l$, i.e., we want to establish that:
$$P_L = d(\tau) \cdot \frac{\partial Loss}{\partial S_L} = d(\tau) \cdot \frac{\partial Loss}{\partial \theta} = O_L - Y = E$$
To calculate $P_L$, we have to evaluate the partial derivatives of $Loss$ with respect to the $\theta$ vector:
$$\frac{\partial Loss}{\partial \theta} = \frac{\frac{\partial A}{\partial \theta} - Y}{d(\tau)}$$
It follows that:
$$P_L = d(\tau) \cdot \frac{\partial Loss}{\partial \theta} = \frac{\partial A}{\partial \theta} - Y = q(\theta) - Y = g_L(\theta) - Y = O_L - Y$$
This establishes the termination of the recursion for $P_l$, i.e., $P_L = O_L - Y = E$ (the “error” in the output of the output layer). It pays to re-emphasize that this result ($P_L = E$) owes its validity to the key result $\frac{\partial A}{\partial \theta} = q(\theta)$, which is applicable when the conditional probability density of the supervisory variable is in GLM “normal form” and when the link function $q(\cdot)$ is canonical. We are fortunate that this is still a fairly general setting, and so it applies to a wide range of regressions and classifications.
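As a quick numerical sanity check of this termination result (my own illustration, using the Bernoulli/logistic special case from Section 4.3): with a sigmoid output $O_L = q(S_L)$ and cross-entropy loss, the derivative of the loss with respect to $S_L$ should equal $O_L - Y$.

```python
import numpy as np

# Illustrative check that dLoss/dS_L = O_L - Y for a sigmoid output (assumed example values).
def bce_loss(s, y):
    o = 1.0 / (1.0 + np.exp(-s))               # O_L = g_L(S_L), sigmoid output
    return -(y * np.log(o) + (1 - y) * np.log(1 - o))

s_L, y = 0.3, 1.0
o_L = 1.0 / (1.0 + np.exp(-s_L))
eps = 1e-6
numeric = (bce_loss(s_L + eps, y) - bce_loss(s_L - eps, y)) / (2 * eps)
assert abs(numeric - (o_L - y)) < 1e-6         # P_L = d(tau) * dLoss/dS_L = O_L - Y, with d(tau) = 1
```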
4.6. Recursion. Next, we want to establish the following recursion for $P_l$ for all $l = 0, 1, \ldots, L-1$:
$$P_l = (P_{l+1} \cdot W_{l+1}) \circ g_l'(S_l)$$
We note that:
$$S_{l+1} = g_l(S_l) \cdot W_{l+1}$$
Hence,
$$P_l = d(\tau) \cdot \frac{\partial Loss}{\partial S_l} = d(\tau) \cdot \frac{\partial Loss}{\partial S_{l+1}} \cdot \frac{\partial S_{l+1}}{\partial S_l} = \left(d(\tau) \cdot \frac{\partial Loss}{\partial S_{l+1}} \cdot W_{l+1}\right) \circ g_l'(S_l) = (P_{l+1} \cdot W_{l+1}) \circ g_l'(S_l)$$
This establishes the recursion for Pl for all l = 0, 1, . . . , L − 1. Again, it pays to re-emphasize
that this recursion has nothing to do with GLM, it only has to do with the structure of the
feed-forward network: Sl+1 = gl(Sl) · Wl+1.
4.7. Loss Gradient Formulation. Finally, we are ready to express the gradient of the loss function with respect to the parameters of layer $l$ (for all $l = 0, 1, \ldots, L$):
$$\frac{\partial Loss}{\partial W_l} = \frac{\partial Loss}{\partial S_l} \cdot \frac{\partial S_l}{\partial W_l} = P_l \otimes I_l \quad \text{(ignoring the constant factor } d(\tau)\text{)}$$
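The following is a minimal NumPy sketch (my own illustration; the two-layer sizes, the relu hidden activation, and the identity output are assumptions) that checks $\frac{\partial Loss}{\partial W_l} = P_l \otimes I_l$ against a finite-difference estimate for one weight entry.

```python
import numpy as np

# Illustrative gradient check (not from the paper); layer sizes and relu are assumed choices.
rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=2)               # single training point (X, Y)
W0, W1 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))   # relu hidden layer, identity output layer

def loss(W0, W1):
    s0 = W0 @ x                                  # S_0 = I_0 . W_0
    o0 = np.maximum(s0, 0.0)                     # O_0 = g_0(S_0) = I_1
    return 0.5 * np.sum((W1 @ o0 - y) ** 2)      # squared-error loss (normal case, up to constants)

# analytic gradients via the proxy-error recursion
s0 = W0 @ x
o0 = np.maximum(s0, 0.0)
P1 = W1 @ o0 - y                                 # P_L = O_L - Y
P0 = (P1 @ W1) * (s0 > 0)                        # P_0 = (P_1 . W_1) Hadamard g_0'(S_0)
grad_W0, grad_W1 = np.outer(P0, x), np.outer(P1, o0)   # dLoss/dW_l = P_l (outer product) I_l

# finite-difference check on one entry of W_0
eps = 1e-6
W0p = W0.copy()
W0p[2, 1] += eps
numeric = (loss(W0p, W1) - loss(W0, W1)) / eps
assert abs(numeric - grad_W0[2, 1]) < 1e-4
```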
5. Regression on a dense feed-forward deep neural network
Linear Regression is a special case of the GLM family because the linear regression loss function
is simply the cross-entropy loss function (negative log likelihood) when the supervisory variable
Y follows a linear-predictor-conditional normal distribution and the canonical link function for
the normal distribution is the identity function. Here we consider a dense feed-forward deep
neural network whose output layer activation function is the identity function and we define the
loss function to be mean-squared error (as follows):
(IL · WL − Y )2
The Cross-Entropy Loss (Negative Loss Likelihood) when Y |(IL ·WL) is a normal distribution
N(µ, σ) is as follows:
− log (
1
√
2πσ
· e−
(Y −(IL·WL))2
2σ2
) =
(Y − (IL · WL))2
2σ2
− log (
1
√
2πσ
)
Since σ is a constant, the gradient of cross-entropy loss is same as the gradient of mean-squared
error up to a constant factor. Hence, the gradient formulation we derived for GLM is applicable
here.
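To make the “up to a constant factor” claim concrete, here is a small numerical check (my own illustration, with assumed scalar values): the gradient of the normal negative log-likelihood with respect to the output equals the gradient of the squared error scaled by $\frac{1}{2\sigma^2}$, and multiplying it by $d(\tau) = \sigma^2$ recovers $O_L - Y$.

```python
import numpy as np

# Illustrative check for the regression case (sigma, s_L, y are assumed example values).
sigma, s_L, y = 1.5, 2.0, 0.7                   # dispersion, linear output O_L = S_L, target Y

def nll(s):                                      # negative log-likelihood of N(s, sigma) at y
    return 0.5 * np.log(2 * np.pi * sigma ** 2) + (y - s) ** 2 / (2 * sigma ** 2)

def mse(s):
    return (s - y) ** 2

eps = 1e-6
grad_nll = (nll(s_L + eps) - nll(s_L - eps)) / (2 * eps)
grad_mse = (mse(s_L + eps) - mse(s_L - eps)) / (2 * eps)
assert abs(grad_nll - grad_mse / (2 * sigma ** 2)) < 1e-6   # same gradient up to 1/(2 sigma^2)
assert abs(grad_nll * sigma ** 2 - (s_L - y)) < 1e-6        # d(tau) * dLoss/dS_L = O_L - Y
```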
6. Classification on a dense feed-forward deep neural network
Softmax Classification is a special case of the GLM family where the supervisory variable Y
follows a linear-predictor-conditional Multinoulli distribution and the canonical link function for
Multinoulli is Softmax. Here we consider a dense feed-forward deep neural network whose output
layer activation function is softmax and we define the loss function to be cross-entropy loss.
We assume $K$ classes. We denote the $r$-th element of $O_L$ (the output of the $r$-th neuron in the output layer) as $O_L^{(r)}$, the $r$-th element of the training data point $Y$ as $Y^{(r)}$, and the $r$-th element of $S_L$ as $S_L^{(r)}$ for all $r = 1, \ldots, K$. The supervisory variable $Y$ will have a “one-hot” encoding of length $K$: if the supervisory variable represents class $k \in [1, K]$, then $Y^{(k)} = 1$ and $Y^{(r)} = 0$ for all $r = 1, \ldots, k-1, k+1, \ldots, K$ (the non-$k$ elements of $Y$ are 0).
The softmax classification cross-entropy loss function is given by:
$$-\sum_{r=1}^{K} Y^{(r)} \cdot \log O_L^{(r)}$$
where
$$O_L^{(r)} = \frac{e^{S_L^{(r)}}}{\sum_{i=1}^{K} e^{S_L^{(i)}}}$$
We can write the above cross-entropy loss function as the negative log-likelihood of a Multinoulli $(p_1, p_2, \ldots, p_K)$ distribution that is a special case of a GLM distribution:
$$-\log f(Y \mid \theta, \tau) = -\log\left(h(Y, \tau) \cdot e^{\frac{\theta \cdot Y - A(\theta)}{d(\tau)}}\right) = -\log(h(Y, \tau)) + \frac{A(\theta) - \theta \cdot Y}{d(\tau)}$$
with the following specializations:
$$\tau = h(Y, \tau) = d(\tau) = 1$$
$$\theta^{(r)} = \log(p_r) = \log(O_L^{(r)}) = S_L^{(r)} \quad \text{for all } r = 1, \ldots, K$$
$$A(\theta) = A(\theta^{(1)}, \ldots, \theta^{(K)}) = \log\left(\sum_{r=1}^{K} e^{\theta^{(r)}}\right)$$
The canonical link function for the Multinoulli distribution is the softmax function $g(\theta)$ given by:
$$\frac{\partial A}{\partial \theta} = g(\theta) = g(\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(K)}) = \left(\frac{e^{\theta^{(1)}}}{\sum_{r=1}^{K} e^{\theta^{(r)}}}, \frac{e^{\theta^{(2)}}}{\sum_{r=1}^{K} e^{\theta^{(r)}}}, \ldots, \frac{e^{\theta^{(K)}}}{\sum_{r=1}^{K} e^{\theta^{(r)}}}\right)$$
So we can see that:
$$O_L = g(\theta) = g(S_L)$$
Hence, the gradient formulation we derived for GLM is applicable here.
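As a numerical illustration (my own addition, not from the paper), the following sketch checks, for an arbitrary $S_L$ and a one-hot $Y$, that the gradient of the softmax cross-entropy loss with respect to $S_L$ equals $O_L - Y$, i.e., $P_L = E$.

```python
import numpy as np

# Illustrative check that dLoss/dS_L = softmax(S_L) - Y (assumed example values, K = 3).
def softmax(s):
    e = np.exp(s - s.max())                     # numerically stable softmax
    return e / e.sum()

s_L = np.array([0.2, -1.0, 0.5])                # pre-activations S_L of the output layer
y = np.array([0.0, 1.0, 0.0])                   # one-hot supervisory value, class k = 2

def xent(s):
    return -np.sum(y * np.log(softmax(s)))      # cross-entropy loss

eps = 1e-6
numeric = np.array([(xent(s_L + eps * np.eye(3)[i]) - xent(s_L - eps * np.eye(3)[i])) / (2 * eps)
                    for i in range(3)])
assert np.allclose(numeric, softmax(s_L) - y, atol=1e-6)   # P_L = O_L - Y = E
```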