Recursive Formulation of Gradient in a Dense Feed-Forward Deep Neural Network. Derived for a fairly general setting where the supervisory variable has a conditional probability density modeled as an arbitrary Generalized Linear Model's "normal-form" probability density, and where the output layer's activation function is the GLM canonical link function.
Recursive Formulation of Gradient in a Dense Feed-Forward Deep Neural Network
RECURSIVE FORMULATION OF LOSS GRADIENT IN A DENSE FEED-FORWARD DEEP NEURAL NETWORK
ASHWIN RAO
1. Motivation
The purpose of this short paper is to derive the recursive formulation of the gradient of the loss function with respect to the parameters of a dense feed-forward deep neural network. We start by setting up the notation and then we state the recursive formulation of the loss gradient. We derive this formulation for a fairly general case: where the supervisory variable has a conditional probability density modeled as an arbitrary Generalized Linear Model (GLM)'s "normal-form" probability density, when the output layer's activation function is the GLM canonical link function, and when the loss function is cross-entropy loss. Finally, we show that Linear Regression and Softmax Classification are special cases of this generic GLM structure, and so this formulation of the gradient is applicable to the special cases of dense feed-forward deep neural networks whose output layer activation function is linear (for regression) or softmax (for classification).
2. Notation
Consider a "vanilla" (i.e., dense feed-forward) deep neural network whose layers are indexed by l = 0, 1, . . . , L. Layers l = 0, 1, . . . , L − 1 carry the hidden layer neurons and layer l = L carries the output layer neurons.
Let Il be the inputs to layer l and let Ol be the outputs of layer l for all l = 0, 1, . . . , L. We know that Il+1 = Ol for all l = 0, 1, . . . , L − 1. Note that the number of neurons in layer l will be the number of outputs of layer l (= |Ol|). In the following exposition, we assume that the input to this network is a single training data point (input features) denoted as X and the output of the network is supervised by the supervisory value Y associated with X (we work with a single training data point in this entire paper because the gradient calculation can be independently applied to each training data point and finally summed over the training data points). So, I0 = X and OL is the network's prediction for input X (that will be supervised by Y).
We denote K as the number of elements in the supervisory variable Y (K will also be the number of neurons in the output layer as well as the number of outputs produced by the output layer). So, |OL| = |Y| = K. We denote the rth element of OL (output of the rth neuron in the output layer) as OL^(r) and the rth element of Y as Y^(r) for all r = 1, . . . , K.
Note that K = |OL| = |Y| = 1 for regression. For classification, K is the number of classes, with K = |OL| neurons in the output layer. The supervisory variable Y will have a "one-hot" encoding of length K: if the supervisory variable is meant to represent class k ∈ [1, K], then Y^(k) = 1 and Y^(r) = 0 for all r = 1, . . . , k − 1, k + 1, . . . , K (the non-k elements of Y are 0).
The error E = OL − Y is a scalar for regression. For classification, E = OL − Y is a vector of length K whose kth element is OL^(k) − 1 and whose other (non-k) elements are OL^(r) for all r = 1, . . . , k − 1, k + 1, . . . , K.
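As a concrete illustration (a hypothetical example of ours, not from the paper), consider K = 3 classes with true class k = 2 and some made-up output-layer values:

```python
import numpy as np

# Hypothetical output-layer values O_L for a K = 3 classification example
O_L = np.array([0.2, 0.7, 0.1])

# One-hot supervisory variable Y representing class k = 2, i.e. Y^(2) = 1
Y = np.array([0.0, 1.0, 0.0])

# Error vector E = O_L - Y: the kth element is O_L^(k) - 1,
# and each non-k element is simply O_L^(r)
E = O_L - Y
```

Here E works out to (0.2, −0.3, 0.1): only the true-class coordinate has 1 subtracted from it.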
We denote the parameters ("weights") for layer l as Wl (Wl is a matrix of size |Ol| × |Il|). For ease of exposition, we will ignore the bias parameters (as we can always treat X as augmented by a "fake" feature that is always 1). We denote the activation function of layer l as gl(·) for all l = 0, 1, . . . , L (with gL being the output layer's activation, the GLM canonical link function). Let Sl = Il · Wl for all l = 0, 1, . . . , L, so Ol = gl(Sl).
Notation : Description
Il : Inputs to layer l for all l = 0, 1, . . . , L
Ol : Outputs of layer l for all l = 0, 1, . . . , L
X : Input features to the network for a single training data point
Y : Supervisory value associated with input X
Wl : Parameters ("weights") for layer l for all l = 0, 1, . . . , L
E : Error OL − Y for the single training data point (X, Y)
gl(·) : Activation function for layer l for all l = 0, 1, . . . , L
Sl : Sl = Il · Wl, Ol = gl(Sl) for all l = 0, 1, . . . , L
Pl : "Proxy error" for layer l for all l = 0, 1, . . . , L
λl : Regularization coefficient for layer l for all l = 0, 1, . . . , L
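The notation above translates directly into a forward pass. The following is a minimal sketch of ours (not code from the paper), using NumPy, with the activation functions gl passed in as plain callables:

```python
import numpy as np

def forward(x, weights, activations):
    """Forward pass through a dense feed-forward network.

    x           : input features, I_0 = X (1-D array)
    weights     : list of matrices W_l, each of shape (|O_l|, |I_l|)
    activations : list of elementwise activation functions g_l
    Returns the lists of pre-activations S_l and outputs O_l.
    """
    S, O = [], []
    inp = x                      # I_0 = X
    for W_l, g_l in zip(weights, activations):
        s = W_l @ inp            # S_l = I_l . W_l  (matrix acting on the input vector)
        o = g_l(s)               # O_l = g_l(S_l)
        S.append(s)
        O.append(o)
        inp = o                  # I_{l+1} = O_l
    return S, O
```

For instance, with a single layer, an identity activation and W0 = [[1, 2], [3, 4]], the input (1, 1) yields S0 = O0 = (3, 7).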
3. Recursive Formulation
We define “proxy error” Pl for layer l recursively as follows for all l = 0, 1, . . . , L:
Pl = (Pl+1 · Wl+1) ◦ gl(Sl)
with the recursion terminating as:
PL = E = OL − Y
Then, the gradient of the loss function with respect to the parameters of layer l for all l = 0, 1, . . . , L will be:

∂Loss/∂Wl = Pl ⊗ Il
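The recursion and gradient formula above can be sketched in code. This is our own minimal NumPy illustration (not the paper's), which assumes the caller supplies each hidden layer's activation derivative g′l alongside the activations:

```python
import numpy as np

def loss_gradients(x, y, weights, activations, activation_derivs):
    """Gradients dLoss/dW_l for every layer via the proxy-error recursion:
    P_L = O_L - Y, then P_l = (P_{l+1} . W_{l+1}) Hadamard g'_l(S_l),
    and dLoss/dW_l = outer(P_l, I_l)."""
    # Forward pass: record layer inputs I_l and pre-activations S_l.
    inputs, pre_acts = [], []
    inp = x
    for W_l, g_l in zip(weights, activations):
        inputs.append(inp)
        s = W_l @ inp
        pre_acts.append(s)
        inp = g_l(s)
    O_L = inp

    # Backward pass: proxy errors, starting from P_L = E = O_L - Y
    # (valid under the GLM canonical-link / cross-entropy-loss setting).
    grads = [None] * len(weights)
    P = O_L - y
    grads[-1] = np.outer(P, inputs[-1])          # dLoss/dW_L = P_L (outer) I_L
    for l in range(len(weights) - 2, -1, -1):
        # P_l = (P_{l+1} . W_{l+1}) Hadamard g'_l(S_l);
        # the vector-matrix product is W_{l+1}^T @ P in NumPy terms.
        P = (weights[l + 1].T @ P) * activation_derivs[l](pre_acts[l])
        grads[l] = np.outer(P, inputs[l])        # dLoss/dW_l = P_l (outer) I_l
    return grads
```

For an all-linear network this reproduces the familiar squared-error gradients, which makes it easy to sanity-check by hand on a one-neuron-per-layer example.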
Section 4 provides a proof of the above when the supervisory variable has a conditional probability density modeled as the Generalized Linear Model (GLM)'s "normal-form" probability density (i.e., a fairly generic family of probability distributions), when the output layer's activation function is the GLM canonical link function, and when the loss function is cross-entropy loss. We show in the final two sections that Linear Regression and Softmax Classification are special cases of this generic GLM structure, and so this formulation of the gradient is applicable to the special cases of dense feed-forward deep neural networks whose output layer activation function is linear (for regression) or softmax (for classification). For now, we give an overview of the proof in the following 3 bullet points:
• The “proxy error” Pl is defined as ∂Loss/∂Sl, and so the gradient is ∂Loss/∂Wl = ∂Loss/∂Sl ⊗ ∂Sl/∂Wl = Pl ⊗ Il.
• The important result PL = E = OL − Y is due to the key GLM result ∂A/∂θ = q(θ), which at θ = SL gives gL(SL) = OL, where A(·) is the key function in the probability density functional form for GLM.
• The recursive formulation for Pl has nothing to do with GLM; it is purely due to the structure of the feed-forward network: Sl+1 = gl(Sl) · Wl+1. The differentiation chain rule yields ∂Loss/∂Sl = ∂Loss/∂Sl+1 · ∂Sl+1/∂Sl, which gives us the recursive formulation Pl = (Pl+1 · Wl+1) ◦ g′l(Sl).
RECURSIVE FORMULATION OF LOSS GRADIENT IN A DENSE FEED-FORWARD DEEP NEURAL NETWORK
The detailed proof is in Section 4.
Note that Pl+1 · Wl+1 is the vector-matrix product of the |Ol+1|-size vector Pl+1 and the |Ol+1| × |Il+1|-size matrix Wl+1, and the resultant |Il+1| = |Ol|-size vector Pl+1 · Wl+1 is multiplied pointwise (Hadamard product) with the |Ol|-size vector g′l(Sl) to yield the |Ol|-size vector Pl.
Pl ⊗ Il is the matrix outer-product of the |Ol|-size vector Pl and the |Il|-size vector Il. Hence, ∂Loss/∂Wl is a matrix of size |Ol| × |Il|.
If we do L2 regularization (with λl as the regularization coefficient in layer l), then:

∂Loss/∂Wl = Pl ⊗ Il + 2λlWl

If we do L1 regularization (with λl as the regularization coefficient in layer l), then:

∂Loss/∂Wl = Pl ⊗ Il + λl · sign(Wl)

where sign(Wl) is the sign operation applied pointwise to the elements of the matrix Wl.
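To make the recursion concrete, here is a minimal NumPy sketch (our own illustration, not the author's code), assuming sigmoid hidden layers, an identity output layer (the regression case, so PL = OL − Y corresponds to squared-error loss with d(τ) = 1), and optional L2 regularization; all function and variable names are our own:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def forward(X, Ws):
    """Forward pass; returns per-layer inputs I_l, pre-activations S_l, and O_L.
    Hidden layers use sigmoid; the output layer is the identity (regression)."""
    I, Is, Ss = X, [], []
    for l, W in enumerate(Ws):
        Is.append(I)
        S = W @ I                                   # S_l (W_l stored as |O_l| x |I_l|)
        Ss.append(S)
        I = sigmoid(S) if l < len(Ws) - 1 else S    # O_l becomes I_{l+1}
    return Is, Ss, I                                # final I is O_L

def gradients(X, Y, Ws, lam=0.0):
    """Loss gradients via the proxy-error recursion, with an optional L2 term."""
    Is, Ss, O_L = forward(X, Ws)
    P = O_L - Y                                     # P_L = E = O_L - Y
    grads = [None] * len(Ws)
    for l in range(len(Ws) - 1, -1, -1):
        grads[l] = np.outer(P, Is[l]) + 2.0 * lam * Ws[l]   # P_l (x) I_l + 2*lam*W_l
        if l > 0:
            g = sigmoid(Ss[l - 1])
            P = (P @ Ws[l]) * g * (1.0 - g)         # P_{l-1} = (P_l . W_l) o g'(S_{l-1})
    return grads
```

Note that the code follows the convention Sl = Wl · Il (matrix-vector product), which is equivalent to the paper's Il · Wl up to transposition; sigmoid's derivative g′(S) = g(S)(1 − g(S)) is used in the backward recursion.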
4. Derivation of Gradient Formulation for GLM conditional probability
density and Canonical Link Function
Here we show that the formula for gradient shown in the previous section is applicable when
the supervisory variable has a conditional probability density modeled as the Generalized Linear
Model(GLM)’s “normal-form” probability density (i.e., fairly generic family of probability dis-
tributions), when the output layer’s activation function is the GLM canonical link function, and
when the loss function is cross-entropy loss. First, we give a quick summary of GLM (independent
of neural networks).
4.1. The GLM Setting. In the GLM setting, the conditional probability density of the super-
visory variable Y is modeled as:
f(Y |θ, τ) = h(Y, τ) · e^((b(θ)·T(Y) − A(θ)) / d(τ))
where θ should be thought of as the “center” parameter (related to the mean) of the prob-
ability distribution and τ should be thought of as the “dispersion” parameter (related to the
variance) of the distribution. h(·, ·), b(·), A(·), d(·) are general functions whose specializations
define the family of distributions that can be modeled in the GLM framework. Note that this
form is a generalization of the exponential-family functional form incorporating the “dispersion”
parameter τ (for this reason, the above form is known as the “overdispersed exponential family”
of distributions).
It is important to mention here that the GLM framework operates even when the
supervisory variable Y is multi-dimensional. We will denote the dimension of Y and θ as
K (e.g., for classification, this would mean K classes). Note that b(θ) · T(Y ) is the inner-product
between the vectors b(θ) and T(Y ).
When we specialize the above probability density form to the so-called “normal-form” (which
means both b(·) and T(·) are identity functions), then the conditional probability density of the
supervisory variable is (this is the form we work with in this paper):
f(Y |θ, τ) = h(Y, τ) · e^((θ·Y − A(θ)) / d(τ))
The GLM link function q(·) is the function that transforms the linear predictor to the mean
of the supervisory variable Y conditioned on the linear predictor η (linear predictor η is the
inner-product of the input vector and the parameters matrix). Therefore,
ASHWIN RAO
E[Y |θ] = q(η)
We will specialize the GLM link function q(·) to be the canonical link function (“canonical”
simply means that θ is equal to the linear predictor η). So for a canonical link function q(·),
E[Y |θ] = q(θ)
Note that q transforms a vector of length K to a vector of length K (θ → E[Y |θ]).
4.2. Key Result. Since

∫y f(y|θ, τ) dy = 1,

its partial derivative with respect to θ is zero. In other words,

∂(∫y f(y|θ, τ) dy) / ∂θ = 0
Hence,

∂(∫y h(y, τ) · e^((θ·y − A(θ))/d(τ)) dy) / ∂θ = 0

Taking the partial derivative inside the integral, we get:

∫y h(y, τ) · e^((θ·y − A(θ))/d(τ)) · ((y − ∂A/∂θ) / d(τ)) dy = 0

∫y f(y|θ, τ) · (y − ∂A/∂θ) dy = 0

E[Y |θ] − ∂A/∂θ = 0
This gives us the key result (which we will utilize later in our proof):

E[Y |θ] = ∂A/∂θ

(which, as we know from above, is equal to q(θ))
As an aside, we want to mention that apart from allowing for a non-linear relationship between
the input and the mean of the supervisory variable, GLM allows for heteroskedastic models (i.e.,
the variance of the supervisory variable is a function of the input variables). Specifically,
Variance[Y |θ] = (∂²A/∂θ²) · d(τ) (the derivation is similar to that of E[Y |θ] above)
4.3. Common Examples of GLM. Several standard probability distributions paired with
their canonical link function are special cases of “normal-form” GLM. For example,
• The Normal distribution N(µ, σ) paired with the identity link function (q(θ) = θ) gives Linear Regression: θ = µ, τ = σ, h(Y, τ) = e^(−Y²/(2τ²)) / (√(2π)·τ), A(θ) = θ²/2, d(τ) = τ².
• The Bernoulli distribution parameterized by p, paired with the logistic link function (q(θ) = 1/(1 + e^(−θ))), gives Logistic Regression: θ = log (p/(1 − p)), τ = h(Y, τ) = d(τ) = 1, A(θ) = log (1 + e^θ).
• The Poisson distribution parameterized by λ, paired with the exponential link function (q(θ) = e^θ), gives Log-Linear Regression: θ = log λ, τ = d(τ) = 1, h(Y, τ) = 1/Y!, A(θ) = e^θ.
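These specializations can be checked numerically: for each example, a finite-difference estimate of ∂A/∂θ should match the canonical link q(θ), and ∂²A/∂θ² · d(τ) should match the known variance. The sketch below (our own illustrative values, not from the paper) does exactly that:

```python
import numpy as np

theta, tau = 0.7, 1.3
p, lam_ = 1 / (1 + np.exp(-theta)), np.exp(theta)   # Bernoulli p, Poisson lambda

# Each entry: (A(.), canonical link q(.), d(tau), known variance of Y|theta).
cases = {
    "normal":    (lambda t: t * t / 2,             lambda t: t,                    tau**2, tau**2),
    "bernoulli": (lambda t: np.log(1 + np.exp(t)), lambda t: 1 / (1 + np.exp(-t)), 1.0,    p * (1 - p)),
    "poisson":   (lambda t: np.exp(t),             lambda t: np.exp(t),            1.0,    lam_),
}

eps, results = 1e-5, {}
for name, (A, q, d_tau, var) in cases.items():
    dA  = (A(theta + eps) - A(theta - eps)) / (2 * eps)             # ~ dA/dtheta
    d2A = (A(theta + eps) - 2 * A(theta) + A(theta - eps)) / eps**2 # ~ d2A/dtheta2
    results[name] = (dA, q(theta), d2A * d_tau, var)
```

In each case the first derivative of A recovers the mean q(θ) and the second derivative (scaled by d(τ)) recovers the variance, as derived above.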
4.4. GLM in a Dense Feed-Forward Deep Neural Network. Now we come to our dense
feed-forward deep neural network setting. Let us assume that the supervisory variable Y has
a conditional probability distribution of the GLM “normal-form” expressed above and let us
assume that the activation function of the output layer gL(·) is the canonical link function q(·).
Then,
E[Y |θ] = ∂A/∂θ = q(θ) = q(IL · WL) = gL(IL · WL) = gL(SL) = OL
In the above equation, we note that each of Y, θ, SL, OL is a vector of length K.
The Cross-Entropy Loss (Negative Log-Likelihood) of the training data point (X, Y ) is given
by:
Loss = − log (h(Y, τ)) + (A(θ) − θ · Y) / d(τ)
We define the “proxy error” Pl for layer l (for all l = 0, 1, . . . , L) as:

Pl = d(τ) · ∂Loss/∂Sl
4.5. Recursion Termination. We want to first establish the termination of the recursion for
Pl, i.e., we want to establish that:
PL = d(τ) · ∂Loss/∂SL = d(τ) · ∂Loss/∂θ = OL − Y = E
To calculate PL, we have to evaluate the partial derivatives of Loss with respect to the θ
vector.
∂Loss/∂θ = (∂A/∂θ − Y) / d(τ)
It follows that:
PL = d(τ) · ∂Loss/∂θ = ∂A/∂θ − Y = q(θ) − Y = gL(SL) − Y = OL − Y
This establishes the termination of the recursion for Pl, i.e., PL = OL − Y = E (the “error” in the output of the output layer). It pays to re-emphasize that this result (PL = E) rests on the key result ∂A/∂θ = q(θ), which is applicable when the conditional probability density of the supervisory variable is in GLM “normal form” and when the link function q(·) is canonical. Fortunately, this is still a fairly general setting, so it applies to a wide range of regressions and classifications.
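As a quick numerical check of this termination result, consider the Bernoulli special case (h(Y, τ) = d(τ) = 1, so Loss = A(θ) − θ · Y with A(θ) = log(1 + e^θ) and q(θ) the logistic function): a finite-difference estimate of ∂Loss/∂θ should equal q(θ) − Y. The values below are our own illustrative choices:

```python
import numpy as np

# Bernoulli / logistic case of the GLM "normal form" with h = d = 1.
A = lambda t: np.log(1 + np.exp(t))     # A(theta) = log(1 + e^theta)
q = lambda t: 1 / (1 + np.exp(-t))      # canonical link (logistic function)

theta, Y, eps = 0.4, 1.0, 1e-6
loss  = lambda t: A(t) - t * Y          # cross-entropy loss, d(tau) = 1
dloss = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)  # ~ dLoss/dtheta

P_L = q(theta) - Y                      # the claimed P_L = O_L - Y
```

The numerical derivative and the analytic q(θ) − Y agree, illustrating PL = OL − Y for this case.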
4.6. Recursion. Next, we want to establish the following recursion for Pl for all l = 0, 1, . . . , L−
1:
Pl = (Pl+1 · Wl+1) ◦ g′l(Sl)
We note that:
Sl+1 = gl(Sl) · Wl+1
Hence,
Pl = d(τ) · ∂Loss/∂Sl = d(τ) · ∂Loss/∂Sl+1 · ∂Sl+1/∂Sl = (d(τ) · ∂Loss/∂Sl+1 · Wl+1) ◦ g′l(Sl) = (Pl+1 · Wl+1) ◦ g′l(Sl)
This establishes the recursion for Pl for all l = 0, 1, . . . , L − 1. Again, it pays to re-emphasize that this recursion has nothing to do with GLM; it is purely due to the structure of the feed-forward network: Sl+1 = gl(Sl) · Wl+1.
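The recursion can also be verified numerically on a toy two-layer example (sigmoid hidden layer, identity output, squared-error loss; all names and sizes are our own illustrative choices): the analytic Pl should match a finite-difference estimate of ∂Loss/∂Sl.

```python
import numpy as np

sigmoid = lambda s: 1 / (1 + np.exp(-s))
rng = np.random.default_rng(2)
W1 = rng.standard_normal((2, 3))   # W_{l+1} (here layer 1), |O_1| x |I_1|
S0 = rng.standard_normal(3)        # S_l (here layer 0)
Y  = rng.standard_normal(2)

def loss_from_S0(s):
    S1 = W1 @ sigmoid(s)                 # S_{l+1} = g_l(S_l) . W_{l+1}
    return 0.5 * np.sum((S1 - Y) ** 2)   # identity output, squared-error loss

# Analytic proxy errors: P_1 = O_L - Y, then P_0 = (P_1 . W_1) o g'_0(S_0).
P1 = W1 @ sigmoid(S0) - Y
g  = sigmoid(S0)
P0 = (P1 @ W1) * g * (1 - g)

# Numerical dLoss/dS_0, one coordinate at a time.
eps = 1e-6
num = np.array([(loss_from_S0(S0 + eps * np.eye(3)[i])
               - loss_from_S0(S0 - eps * np.eye(3)[i])) / (2 * eps) for i in range(3)])
```

The agreement between `P0` and `num` illustrates that the recursion is a pure chain-rule consequence of the network structure, independent of the GLM assumptions.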
4.7. Loss Gradient Formulation. Finally, we are ready to express the gradient of the loss function with respect to the parameters of layer l (for all l = 0, 1, . . . , L):

∂Loss/∂Wl = ∂Loss/∂Sl · ∂Sl/∂Wl = Pl ⊗ Il (ignoring the constant factor d(τ))
5. Regression on a dense feed-forward deep neural network
Linear Regression is a special case of the GLM family because the linear regression loss function
is simply the cross-entropy loss function (negative log likelihood) when the supervisory variable
Y follows a linear-predictor-conditional normal distribution and the canonical link function for
the normal distribution is the identity function. Here we consider a dense feed-forward deep
neural network whose output layer activation function is the identity function and we define the
loss function to be mean-squared error (as follows):
(IL · WL − Y)²
The Cross-Entropy Loss (Negative Log-Likelihood) when Y |(IL · WL) follows a normal distribution N(µ, σ) is:

− log ( (1/(√(2π)·σ)) · e^(−(Y − IL·WL)²/(2σ²)) ) = (Y − IL·WL)²/(2σ²) − log (1/(√(2π)·σ))
Since σ is a constant, the gradient of the cross-entropy loss is the same as the gradient of the mean-squared error up to a constant factor. Hence, the gradient formulation we derived for GLM is applicable here.
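This constant-factor relationship can be checked numerically with a toy scalar example (all names and values are our own illustrative choices): the gradient of the negative log-likelihood equals the gradient of the mean-squared error divided by 2σ².

```python
import numpy as np

# Toy scalar setting: linear predictor I_L . W_L = 2w (i.e., I_L = 2), target Y.
sigma, eps = 1.5, 1e-6
Y, w = 0.8, 0.3
pred = lambda w_: 2.0 * w_

mse = lambda w_: (pred(w_) - Y) ** 2
nll = lambda w_: (Y - pred(w_)) ** 2 / (2 * sigma**2) + np.log(np.sqrt(2 * np.pi) * sigma)

g_mse = (mse(w + eps) - mse(w - eps)) / (2 * eps)   # numerical d(MSE)/dw
g_nll = (nll(w + eps) - nll(w - eps)) / (2 * eps)   # numerical d(NLL)/dw
```

Since the two gradients differ only by the constant 1/(2σ²), gradient descent on either loss follows the same direction, which is why the GLM gradient formulation carries over.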
6. Classification on a dense feed-forward deep neural network
Softmax Classification is a special case of the GLM family where the supervisory variable Y
follows a linear-predictor-conditional Multinoulli distribution and the canonical link function for
Multinoulli is Softmax. Here we consider a dense feed-forward deep neural network whose output
layer activation function is softmax and we define the loss function to be cross-entropy loss.
We assume K classes. We denote the rth element of OL (the output of the rth neuron in the output layer) as OL(r), the rth element of the training data point Y as Y(r), and the rth element of SL as SL(r), for all r = 1, . . . , K. The supervisory variable Y will have a “one-hot” encoding of length K: if the supervisory variable is meant to represent class k ∈ [1, K], then Y(k) = 1 and Y(r) = 0 for all r ≠ k (the non-k elements of Y are 0).
The softmax classification cross-entropy loss function is given by:

− Σ(r=1 to K) Y(r) · log OL(r)

where

OL(r) = e^(SL(r)) / Σ(i=1 to K) e^(SL(i))
We can write the above cross-entropy loss function as the negative log-likelihood of a Multinoulli (p1, p2, . . . , pK) distribution, which is a special case of a GLM distribution:

− log f(Y |θ, τ) = − log (h(Y, τ) · e^((θ·Y − A(θ))/d(τ))) = − log (h(Y, τ)) + (A(θ) − θ · Y)/d(τ)
with the following specializations:
τ = h(Y, τ) = d(τ) = 1

θ(r) = log pr = log OL(r) = SL(r) for all r = 1, . . . , K

A(θ) = A(θ(1), . . . , θ(K)) = log (Σ(r=1 to K) e^(θ(r)))
The canonical link function for the Multinoulli distribution is the softmax function g(θ), given by:

∂A/∂θ = g(θ) = g(θ(1), θ(2), . . . , θ(K)) = ( e^(θ(1)) / Σ(r=1 to K) e^(θ(r)), e^(θ(2)) / Σ(r=1 to K) e^(θ(r)), . . . , e^(θ(K)) / Σ(r=1 to K) e^(θ(r)) )
So we can see that:
OL = g(θ) = g(SL)
Hence, the gradient formulation we derived for GLM is applicable here.
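As a sanity check on this special case, the following sketch (our own toy example, with K = 4 and arbitrary pre-activations) verifies numerically that ∂Loss/∂SL = OL − Y when the output activation is softmax and the loss is cross-entropy:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())          # shift by the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(3)
S_L = rng.standard_normal(4)         # output-layer pre-activations, K = 4
Y   = np.eye(4)[2]                   # one-hot supervisory value (class k = 3)

def cross_entropy(s):
    return -np.sum(Y * np.log(softmax(s)))

P_L = softmax(S_L) - Y               # the claimed P_L = O_L - Y

# Numerical dLoss/dS_L, one coordinate at a time.
eps = 1e-6
num = np.array([(cross_entropy(S_L + eps * np.eye(4)[i])
               - cross_entropy(S_L - eps * np.eye(4)[i])) / (2 * eps) for i in range(4)])
```

The agreement between the analytic OL − Y and the finite-difference gradient illustrates that softmax classification indeed inherits the GLM gradient formulation.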