APPLIED MACHINE LEARNING: FINALS
Course Code: CS501
DEEP NEURAL NETWORKS & COMPUTATIONAL GRAPHS
Name Puranam Revanth Kumar
(Research Scholar)
Roll No. 19STRCHH01004
Question
1. Deep Neural Networks & Computational Graphs
(a) Explain the Concept - derivatives, partial derivatives, optimization, training set,
activation functions etc.
(b) Give simple examples of Chain Rule then generalize - assume all activation func-
tions have partial derivatives.
(c) Demonstrate on simple example such as Sigmoid activation functions.
Contents

Question
1 Deep Neural Networks & Computational Graphs
  1.1 Explain the Concept - derivatives, partial derivatives, optimization, training set, activation functions etc.
  1.2 Give simple examples of Chain Rule then generalize - assume all activation functions have partial derivatives
    1.2.1 Forward Propagation
    1.2.2 Chain Rule in Back Propagation
  1.3 Demonstrate on simple example such as Sigmoid activation functions
Summary
List of Figures

1.1 Artificial neural network
1.2 A simple computational graph with two nodes
1.3 Illustration of chain rule in computational graphs: The products of node-specific partial derivatives along paths from weight w to output o are aggregated. The resulting value yields the derivative of output o with respect to weight w. Only two paths between input and output exist in this simplified example.
1.4 Plot of the function f(a) = 3a
1.5 Function f(a) with slope
1.6 Sigmoid activation function
1.7 Hyperbolic tangent
1.8 ReLU function
1.9 Leaky ReLU function
1.10 Neural network
1.11 Neural network with two hidden layers
1.12 Sigmoid function
1.13 DNN using sigmoid function
1. Deep Neural Networks & Computational Graphs
Deep learning is a technique that loosely mimics the human brain. Scientists and researchers asked whether a machine could be made to learn in a similar way, and that question led to the invention of the neural network. The simplest type of neural network is the perceptron. The perceptron had trouble learning many functions because of the concepts it relied on, but in the 1980s Geoffrey Hinton and his collaborators popularized the backpropagation algorithm [1]. As a result, architectures such as ANNs, CNNs, and RNNs became efficient enough that many companies now use them and have developed a wide range of applications.
An artificial neural network computes a function of the inputs by propagating the com-
puted values from the input neurons to the output neuron(s) and using the weights as inter-
mediate parameters. Learning occurs by changing the weights connecting the neurons [2, 3].
Just as external stimuli are needed for learning in biological organisms, the external stimulus
in artificial neural networks is provided by the training data containing examples of input-
output pairs of the function to be learned. For example, the training data might contain pixel
representations of images (input) and their annotated labels (e.g., cat, dog) as the output.
These training data pairs are fed into the neural network by using the input representations
to make predictions about the output labels.
Figure 1.1: Artificial neural network
Here, f1, f2, f3 are the input features.
• For multi-class classification, the output layer contains more than one node.
• For binary classification, a single output node is sufficient.
The training data provides feedback to the correctness of the weights in the neural network
depending on how well the predicted output (e.g., probability of cat) for a particular input
matches the annotated output label in the training data. One can view the errors made by the
neural network in the computation of a function as a kind of unpleasant feedback in a bio-
logical organism, leading to an adjustment in the synaptic strengths. Similarly, the weights
between neurons are adjusted in a neural network in response to prediction errors. The goal
of changing the weights is to modify the computed function to make the predictions more
correct in future iterations. Therefore, the weights are changed carefully in a mathematically
justified way so as to reduce the error in computation on that example. By successively ad-
justing the weights between neurons over many input-output pairs, the function computed by
the neural network is refined over time so that it provides more accurate predictions.
Computational Graphs
A neural network is a computational graph, in which a unit of computation is the neu-
ron. Neural networks are fundamentally more powerful than their building blocks because
the parameters of these models are learned jointly to create a highly optimized composition
function of these models [4]. Furthermore, the nonlinear activations between the different
layers add to the expressive power of the network. A multilayer network evaluates compo-
sitions of functions computed at individual nodes. A path of length 2 in the neural network
in which the function f(·) follows g(·) can be considered a composition function f(g(·)). Just
to provide an idea, let us look at a trivial computational graph with two nodes, in which the
sigmoid function is applied at each node to the input weight w. In such a case, the computed
function appears as follows:
$$f(g(w)) = \frac{1}{1 + \exp\left(-\dfrac{1}{1 + \exp(-w)}\right)}$$
Derivatives of such a composition are computed by an iterative dynamic-programming approach, and the corresponding update is really the chain rule of differential calculus. In order to understand how the chain rule works in a computational graph, we will discuss the two basic variants of the rule that one needs to keep in mind. The simplest version of the chain rule works for a straightforward composition of functions:
$$\frac{\partial f(g(w))}{\partial w} = \frac{\partial f(g(w))}{\partial g(w)} \cdot \frac{\partial g(w)}{\partial w}$$
Figure 1.2: A simple computational graph with two nodes
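As a quick sanity check of this two-node composition and of the chain rule above, the following sketch (a minimal illustration, not part of the original report; the helper names are my own) evaluates f(g(w)) = sigmoid(sigmoid(w)) and compares the chain-rule derivative with a numerical finite difference.

```python
import math

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def composed(w):
    """Two-node graph: the sigmoid is applied at each node, f(g(w)) = sigmoid(sigmoid(w))."""
    return sigmoid(sigmoid(w))

def composed_grad(w):
    """Chain rule: d f(g(w))/dw = f'(g(w)) * g'(w), using sigmoid'(x) = sigmoid(x)(1 - sigmoid(x))."""
    g = sigmoid(w)
    return sigmoid(g) * (1 - sigmoid(g)) * g * (1 - g)

w = 0.7
eps = 1e-6
numeric = (composed(w + eps) - composed(w - eps)) / (2 * eps)  # central finite difference
print(composed_grad(w), numeric)  # the two values should agree to several decimal places
```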
Consider a sequence of hidden units h1, h2, ..., hk followed by output o, with respect to
which the loss function L is computed. Furthermore, assume that the weight of the connec-
tion from hidden unit hr to hr+1 is w(hr,hr+1). Then, in the case that a single path exists from
h1 to o, one can derive the gradient of the loss function with respect to any of these edge
weights using the chain rule:
$$\frac{\partial L}{\partial w_{(h_{r-1},h_r)}} = \frac{\partial L}{\partial o} \cdot \left[\frac{\partial o}{\partial h_k} \prod_{i=r}^{k-1} \frac{\partial h_{i+1}}{\partial h_i}\right] \cdot \frac{\partial h_r}{\partial w_{(h_{r-1},h_r)}} \qquad \forall\, r \in 1 \ldots k$$
Figure 1.3: Illustration of chain rule in computational graphs: The products of node-specific
partial derivatives along paths from weight w to output o are aggregated. The resulting value
yields the derivative of output o with respect to weight w. Only two paths between input and
output exist in this simplified example.
$$\frac{\partial o}{\partial w} = \frac{\partial o}{\partial p}\frac{\partial p}{\partial w} + \frac{\partial o}{\partial q}\frac{\partial q}{\partial w} \qquad \text{[Multivariable Chain Rule]}$$
$$= \frac{\partial o}{\partial p}\frac{\partial p}{\partial y}\frac{\partial y}{\partial w} + \frac{\partial o}{\partial q}\frac{\partial q}{\partial z}\frac{\partial z}{\partial w} \qquad \text{[Univariate Chain Rule]}$$
$$= \underbrace{\frac{\partial K(p,q)}{\partial p}\, g'(y)\, f'(w)}_{\text{First path}} + \underbrace{\frac{\partial K(p,q)}{\partial q}\, h'(z)\, f'(w)}_{\text{Second path}}$$
1.1 Explain the Concept - derivatives, partial derivatives,
optimization, training set, activation functions etc.
(a) Derivatives: The derivative of a function of a single variable at a chosen input value,
when it exists, is the slope of the tangent line to the graph of the function at that point [5].
Example: Consider the function f(a) = 3a, which is just a straight line.
Figure 1.4: plot of the function f(a) = 3a
Let a = 2. Then f(a) = 3a = 6. Now bump a up a little, to 2.001; this 0.001 difference is too small to show on the plot.
Giving a that little nudge to the right, f(a) becomes three times 2.001, i.e. 6.003, which we also plot.
Figure 1.5: function f(a) with slope
Looking at the little triangle formed, the slope, i.e. the derivative, of f(a) at a = 2 is 3. The term derivative essentially means slope; formally, the slope is defined as height over width, which here equals 3.
Therefore,
$$\frac{d f(a)}{d a} = \frac{d}{d a} f(a) = 3.$$
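To make the "nudge" argument concrete, a tiny numerical check (illustrative only, not from the report) reproduces the slope of 3:

```python
def f(a):
    return 3 * a           # the straight line f(a) = 3a

a = 2.0
da = 0.001                 # the small nudge used in the text
slope = (f(a + da) - f(a)) / da   # height / width of the little triangle
print(slope)               # 3.0, up to floating-point rounding
```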
(b) Partial derivatives: Finding the gradient is essentially finding the derivative of the function. Because there are many independent variables that we can tweak (all the weights and biases), we have to find the derivative with respect to each variable. This is known as a partial derivative, written with the symbol ∂.
Computing the partial derivative of simple functions is easy: simply treat every other variable in the equation as a constant and take the usual scalar derivative.
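A small sketch (using an example function of my own, not one from the report) shows the "treat every other variable as a constant" recipe and checks it against finite differences:

```python
def f(x, y):
    return x**2 * y + y            # example function of two variables

def df_dx(x, y):
    return 2 * x * y               # differentiate in x, treating y as a constant

def df_dy(x, y):
    return x**2 + 1                # differentiate in y, treating x as a constant

x, y, eps = 1.5, -2.0, 1e-6
print(df_dx(x, y), (f(x + eps, y) - f(x - eps, y)) / (2 * eps))  # analytic vs. numeric
print(df_dy(x, y), (f(x, y + eps) - f(x, y - eps)) / (2 * eps))
```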
(c) Optimization: Optimization means choosing inputs that result in the best possible outputs. Optimizers are algorithms or methods used to change the attributes of a neural network, such as its weights and learning rate, in order to reduce the loss.
How the weights or learning rate of the neural network should be changed to reduce the loss is defined by the optimizer you use.
Example (gradient descent update for a parameter θ1, where J denotes the cost being minimized):
$$\theta_1 := \theta_1 - \alpha \frac{\partial J(\theta_1)}{\partial \theta_1}$$
If α is too small, gradient descent can be slow. If α is too large, gradient descent can overshoot the minimum; it may fail to converge or even diverge.
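The effect of the learning rate is easy to see on a one-parameter toy cost such as J(θ) = (θ − 3)²; the sketch below is an assumed example, not something from the report.

```python
def grad_J(theta):
    """Gradient of the toy cost J(theta) = (theta - 3)**2, whose minimum is at theta = 3."""
    return 2 * (theta - 3)

def gradient_descent(alpha, steps=25, theta=0.0):
    for _ in range(steps):
        theta = theta - alpha * grad_J(theta)   # theta := theta - alpha * dJ/dtheta
    return theta

print(gradient_descent(alpha=0.01))   # too small: progress is slow, still far from 3
print(gradient_descent(alpha=0.1))    # moderate: converges close to the minimum at 3
print(gradient_descent(alpha=1.1))    # too large: the iterates overshoot and diverge
```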
(d) Training set: A training set is a set of pairs of input patterns with corresponding desired output patterns. Each pair represents how the network is supposed to respond to a particular input.
There are two approaches to training: supervised and unsupervised. Supervised training involves a mechanism for providing the network with the desired output, either by manually "grading" the network's performance or by supplying the desired outputs alongside the inputs. Unsupervised training is where the network has to make sense of the inputs without outside help.
(e) Activation function: The activation function is a mathematical “gate” in between the
input feeding the current neuron and its output going to the next layer. It can be as simple as
a step function that turns the neuron output on and off, depending on a rule or threshold [7].
Sigmoid / Logistic: maps any real input smoothly into the range (0, 1), which makes the output convenient to interpret as a probability.
Figure 1.6: Sigmoid Activation function
TanH / Hyperbolic Tangent: Zero centered—making it easier to model inputs that have
strongly negative, neutral, and strongly positive values. Otherwise like the Sigmoid function.
Figure 1.7: Hyperbolic Tangent
ReLU (Rectified Linear Unit): computationally efficient, allowing the network to converge very quickly. Non-linear: although it looks like a linear function, ReLU has a derivative and allows for backpropagation.
Figure 1.8: ReLU function
Leaky ReLU: prevents the dying-ReLU problem. This variation of ReLU has a small positive slope in the negative region, so it enables backpropagation even for negative input values; otherwise it behaves like ReLU.
Figure 1.9: Leaky ReLU function
Softmax: able to handle multiple classes, whereas the other activation functions handle only one class. It normalizes the output for each class to lie between 0 and 1 and divides by their sum, giving the probability of the input belonging to a specific class.
Useful for output neurons: typically Softmax is used only in the output layer, for neural networks that need to classify inputs into multiple categories.
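For reference, the activation functions discussed above can be written in a few lines of NumPy. This is a minimal sketch; the leaky-ReLU slope of 0.01 is a common default and is my assumption, not a value fixed by the report.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # squashes into (0, 1)

def tanh(x):
    return np.tanh(x)                      # zero-centred, range (-1, 1)

def relu(x):
    return np.maximum(0.0, x)              # zero for negative inputs, identity otherwise

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)   # small slope keeps gradients alive for x < 0

def softmax(x):
    e = np.exp(x - np.max(x))              # subtract the max for numerical stability
    return e / e.sum()                     # outputs lie in (0, 1) and sum to 1

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), relu(z), leaky_relu(z), softmax(z))
```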
1.2 Give simple examples of Chain Rule then generalize -
assume all activation functions have partial derivatives
Suppose u is a differentiable function of x1, x2, ..., xn and each xj is a differentiable function of t1, t2, ..., tn. Then u is a function of t1, t2, ..., tn, and the partial derivative of u with respect to t1 is [6]:
$$\frac{\partial u}{\partial t_1} = \frac{\partial u}{\partial x_1}\frac{\partial x_1}{\partial t_1} + \frac{\partial u}{\partial x_2}\frac{\partial x_2}{\partial t_1} + \cdots + \frac{\partial u}{\partial x_n}\frac{\partial x_n}{\partial t_1} \qquad (1)$$
The same formula holds for each of the other variables t2, ..., tn.
1.2.1 Forward Propagation
In this phase, the inputs for a training instance are fed into the neural network. This results
in a forward cascade of computations across the layers, using the current set of weights. The
final predicted output can be compared to that of the training instance and the derivative of
the loss function with respect to the output is computed. The derivative of this loss now
needs to be computed with respect to the weights in all layers in the backwards phase.
Let the inputs be $x_1, x_2, x_3$. These inputs are passed to a hidden neuron, where two important operations take place:
Figure 1.10: Neural Network
Step 1: Compute the weighted sum of the inputs:
$$\sum_{i=1}^{n} w_i x_i, \qquad y = w_1 x_1 + w_2 x_2 + w_3 x_3$$
Step 2: Before the activation function is applied, the bias is added to this sum; the activated value is then scaled by the weight $w_4$ on the connection to the next layer:
$$y = w_1 x_1 + w_2 x_2 + w_3 x_3 + b_i$$
$$z = \text{Act}(y)$$
$$z = z \times w_4$$
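A minimal NumPy sketch of these two steps for a single hidden neuron with a sigmoid activation; the specific weight and bias values are hypothetical, chosen only for illustration.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

x = np.array([0.5, -1.0, 2.0])      # inputs x1, x2, x3
w = np.array([0.1, 0.4, -0.3])      # weights w1, w2, w3 (illustrative values)
b = 0.2                             # bias added before the activation
w4 = 0.7                            # weight on the connection to the next layer

y = np.dot(w, x) + b                # Step 1 plus bias: y = w1*x1 + w2*x2 + w3*x3 + b
z = sigmoid(y)                      # Step 2: apply the activation function
out_to_next_layer = z * w4          # value passed forward, z * w4
print(y, z, out_to_next_layer)
```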
1.2.2 Chain Rule in Back Propagation
The main goal of the backward phase is to learn the gradient of the loss function with re-
spect to the different weights by using the chain rule of differential calculus. These gradients
are used to update the weights. Since these gradients are learned in the backward direction,
starting from the output node, this learning process is referred to as the backward phase.
Suppose the inputs are x1, x2, x3, x4, connected to two hidden layers. Hidden layer one contains 3 neurons and hidden layer two contains 2 neurons. A convenient notation for the weights is $w^{1}_{11}$ for the 1st hidden layer and $w^{2}_{11}$ for the 2nd hidden layer, where the superscript indexes the layer and the subscripts index the connected neurons [4].
Figure 1.11: Neural network with two hidden layers
To reduce the loss value, backpropagation needs to be used. During backpropagation these weights get updated.
For a single record, the difference can be measured by the loss function:
$$\text{Loss} = (y - \hat{y})^2$$
For multiple records, a cost function needs to be defined:
$$\text{Cost} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
where
$w^{1}_{11}, w^{2}_{11}, w^{3}_{11}$ are weights,
HL1, HL2 are the hidden layers,
$O_{11}, O_{21}, O_{31}$ are the outputs of the hidden-layer and output neurons.
• Let us update the weights:
$$w^{3}_{11,\,\text{new}} = w^{3}_{11,\,\text{old}} - \alpha \frac{\partial L}{\partial w^{3}_{11}}$$
• $w^{3}_{11}$ needs to be updated during backpropagation: the forward pass gives us $\hat{y}$ and hence a loss value, and when we backpropagate we update the weights.
• Now let us see how to find the derivative $\frac{\partial L}{\partial w^{3}_{11}}$. It basically indicates a slope, and it is obtained through the chain rule.
• The weight $w^{3}_{11}$ impacts the output $O_{31}$. Since it affects the loss only through $O_{31}$, the derivative can be written as
$$\frac{\partial L}{\partial w^{3}_{11}} = \frac{\partial L}{\partial O_{31}} \times \frac{\partial O_{31}}{\partial w^{3}_{11}},$$
which is basically the chain rule.
• Similarly, for the derivative with respect to $w^{3}_{21}$:
$$\frac{\partial L}{\partial w^{3}_{21}} = \frac{\partial L}{\partial O_{31}} \times \frac{\partial O_{31}}{\partial w^{3}_{21}}$$
• To find the derivative with respect to $w^{2}_{11}$, the chain becomes one step longer:
$$\frac{\partial L}{\partial w^{2}_{11}} = \frac{\partial L}{\partial O_{31}} \times \frac{\partial O_{31}}{\partial O_{21}} \times \frac{\partial O_{21}}{\partial w^{2}_{11}}$$
• For a weight in the first hidden layer, such as $w^{1}_{11}$, there are two paths to the output because both neurons of the second hidden layer ($O_{21}$ and $O_{22}$) are affected. After computing the derivative along one path, the derivative along the other path is added:
$$\frac{\partial L}{\partial w^{1}_{11}} = \left(\frac{\partial L}{\partial O_{31}} \times \frac{\partial O_{31}}{\partial O_{21}} \times \frac{\partial O_{21}}{\partial O_{11}} \times \frac{\partial O_{11}}{\partial w^{1}_{11}}\right) + \left(\frac{\partial L}{\partial O_{31}} \times \frac{\partial O_{31}}{\partial O_{22}} \times \frac{\partial O_{22}}{\partial O_{11}} \times \frac{\partial O_{11}}{\partial w^{1}_{11}}\right)$$
• As these derivatives are used, the weights get updated, $\hat{y}$ changes, and the process repeats until the loss approaches the global minimum.
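As a check on these chain-rule expressions, the sketch below works through a deliberately simplified chain of three sigmoid neurons (x → O11 → O21 → O31), not the full 4-3-2 network of Figure 1.11; all weight values are hypothetical. It computes ∂L/∂w for each weight by the chain rule and compares it with a finite-difference estimate.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, w1, w2, w3):
    """Chain of three sigmoid neurons: x -> O11 -> O21 -> O31."""
    o11 = sigmoid(w1 * x)
    o21 = sigmoid(w2 * o11)
    o31 = sigmoid(w3 * o21)
    return o11, o21, o31

def grads(x, y, w1, w2, w3):
    """Backward pass: chain rule applied layer by layer, starting from the loss (y - O31)^2."""
    o11, o21, o31 = forward(x, w1, w2, w3)
    dL_do31 = -2 * (y - o31)
    do31_dw3 = o31 * (1 - o31) * o21
    do31_do21 = o31 * (1 - o31) * w3
    do21_dw2 = o21 * (1 - o21) * o11
    do21_do11 = o21 * (1 - o21) * w2
    do11_dw1 = o11 * (1 - o11) * x
    dL_dw3 = dL_do31 * do31_dw3
    dL_dw2 = dL_do31 * do31_do21 * do21_dw2
    dL_dw1 = dL_do31 * do31_do21 * do21_do11 * do11_dw1
    return dL_dw1, dL_dw2, dL_dw3

def loss(x, y, w1, w2, w3):
    return (y - forward(x, w1, w2, w3)[2]) ** 2

x, y, w1, w2, w3, eps = 1.0, 0.0, 0.5, -0.3, 0.8, 1e-6
analytic = grads(x, y, w1, w2, w3)
numeric = ((loss(x, y, w1 + eps, w2, w3) - loss(x, y, w1 - eps, w2, w3)) / (2 * eps),
           (loss(x, y, w1, w2 + eps, w3) - loss(x, y, w1, w2 - eps, w3)) / (2 * eps),
           (loss(x, y, w1, w2, w3 + eps) - loss(x, y, w1, w2, w3 - eps)) / (2 * eps))
print(analytic)
print(numeric)   # should match the chain-rule gradients to several decimal places
```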
1.3 Demonstrate on simple example such as Sigmoid activation functions
The activation function is a mathematical “gate” in between the input feeding the current
neuron and its output going to the next layer. It can be as simple as a step function that turns
the neuron output on and off depending on a rule or threshold [3].
$$\sigma(y) = \frac{1}{1 + e^{-y}}, \qquad y = \sum_{i=1}^{n} w_i x_i + b_i$$
The inputs can then be classified by comparing the sigmoid output against a threshold that we decide on.
Figure 1.12: Sigmoid function
Here 0.5 is the threshold: any input whose output falls above the threshold is classified into one cluster, and any input below it is classified into the other. The sigmoid transforms the value to lie between 0 and 1; if the output is below 0.5 it is treated as 0, otherwise as 1.
Figure 1.13: DNN using Sigmoid function
A nice property of the sigmoid is
$$\frac{d\sigma(x)}{dx} = \sigma(x)\,(1 - \sigma(x)),$$
where $w_0, w_1, \ldots, w_n$ denote the weights and $x_1, x_2, \ldots, x_n$ the inputs.
In the diagram above, the activation function, i.e. the sigmoid, is applied to the weighted summation. Differentiating the sigmoid with respect to $x$:
$$\frac{d\sigma(x)}{dx} = \frac{1}{(1+e^{-x})^{2}} \cdot e^{-x} = \frac{e^{-x}}{(1+e^{-x})^{2}} = \underbrace{\frac{1}{1+e^{-x}}}_{\sigma(x)} \cdot \underbrace{\frac{e^{-x}}{1+e^{-x}}}_{1-\sigma(x)}$$
$$\therefore\; \frac{d\sigma(x)}{dx} = \sigma(x)\,(1 - \sigma(x)).$$
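This "nice property" is easy to confirm numerically; the small sketch below (illustrative only) compares σ(x)(1 − σ(x)) with a finite-difference derivative of σ at a few points.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in (-2.0, 0.0, 1.5):
    eps = 1e-6
    numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)   # finite-difference slope
    analytic = sigmoid(x) * (1 - sigmoid(x))                      # sigma(x) * (1 - sigma(x))
    print(x, analytic, numeric)   # the two columns should agree closely
```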
Summary
Although a neural network can be viewed as a simulation of the learning process in living
organisms, a more direct understanding of neural networks is as computational graphs. Such
computational graphs perform recursive composition of simpler functions in order to learn
more complex functions. Since these computational graphs are parameterized, the problem
generally boils down to learning the parameters of the graph in order to optimize a loss
function. The simplest types of neural networks are often basic machine learning models
like least-squares regression. The real power of neural networks is unleashed by using more
complex combinations of the underlying functions. The parameters of such networks are
learned by using a dynamic programming method, referred to as backpropagation. There
are several challenges associated with learning neural network models, such as overfitting
and training instability. In recent years, numerous algorithmic advancements have reduced
these problems. Lastly, the mathematical intuition behind forward and backward propagation has been derived in order to show how the training of a dataset proceeds internally and how the error is minimized using backpropagation. The design of deep learning methods in specific domains such as text and images requires carefully crafted architectures.
Bibliography
[1] https://en.wikipedia.org/wiki/Geoffrey_Hinton
[2] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[3] https://www.youtube.com/watch?v=DKSZHN7jftIt=4s
[4] D. Rumelhart, G. Hinton, and R. Williams. Learning representations by back-propagating errors. Nature, 323 (6088), pp. 533–536, 1986.
[5] https://www.coursera.org
[6] https://towardsdatascience.com/understanding-backpropagation-algorithm-7bb3aa2f95fd
[7] https://missinglink.ai/