NEURAL NETS
DECONSTRUCTED
Paul Sterk

AI Engineer
1
WHAT ARE WE GOING TO COVER?
➤ What is a Neural Network (NN)?
➤ What does it mean to ‘deconstruct’ a NN?
➤ How is this helpful to students and those new to the field?
➤ We will also touch on learning psychology: some thoughts on
what habits I found beneficial in learning this material
2
GETTING STARTED - THE “BIG PICTURE”
➤ How do AI, ML, Deep Learning and Neural Nets relate to each other?
➤ “You can think of deep learning, machine learning and
artificial intelligence as a set of Russian dolls nested within
each other, beginning with the smallest and working out.
Deep learning is a subset of machine learning, and machine
learning is a subset of AI, which is an umbrella term for
any computer program that does something smart.”[1]
➤ Deep Learning is an ML method based on training Neural
Networks.
[1] https://skymind.ai/wiki/ai-vs-machine-learning-vs-deep-learning
(Pictured: John McCarthy)
3
WHAT IS A NEURAL NET?
➤ Definition: A neural network (NN) is an interconnected group of natural or
artificial neurons that uses a mathematical or computational model for
statistical data modeling or decision making.
➤ In most cases an ANN is an adaptive system that changes its
structure based on external or internal information that flows
through the network.
➤ They can be used to model complex relationships between
inputs and outputs or to find patterns in data.
source: https://en.wikipedia.org/wiki/Artificial_neural_network 4
APPLICATIONS FOR NEURAL NETS
➤ The tasks to which artificial neural networks are applied tend
to fall within the following broad categories:
➤ Function approximation, or regression analysis, including
time series prediction and modeling.
➤ Classification, including pattern and sequence recognition,
novelty detection and sequential decision making.
➤ Data processing, including filtering, clustering, blind signal
separation and compression.
source: https://en.wikipedia.org/wiki/Artificial_neural_network 5
LET’S LOOK AT A SIMPLE EXAMPLE
➤ Let's say you have a data set with six houses. You have the size of each house (in square
feet or square meters) and its price. You then want to fit a function to predict the price of
a house as a function of its size. If you are familiar with linear regression, you might try
to fit a straight line to these data to model the relationship.
➤ So you can think of this function that you've just fit to the housing prices as a very simple
neural network.
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 6
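To make this concrete, here is a minimal sketch (not from the course) that fits a straight line to a small, made-up set of house sizes and prices with NumPy; the fitted function price ≈ w·size + b is exactly the "single neuron" described above. The six data points are hypothetical.

```python
import numpy as np

# Hypothetical data: six houses, size in square feet and price in $1000s.
size = np.array([850, 900, 1200, 1500, 2000, 2300], dtype=float)
price = np.array([150, 180, 240, 280, 370, 420], dtype=float)

# Fit price = w * size + b by least squares (a one-input, one-output
# "neuron" with no activation function).
w, b = np.polyfit(size, price, deg=1)

print(f"price ≈ {w:.3f} * size + {b:.1f}")
print("predicted price for a hypothetical 1800 sq ft house:", w * 1800 + b)
```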
HOUSING PREDICTION CONTINUED…
➤ Given these input features, the job of the neural network will be to predict the price y. Notice also that each
of these circles, called hidden units in the neural network, takes as its input all four input features.
➤ The middle layer of the neural network is densely connected because every input feature is connected to
every one of these circles in the middle. And the remarkable thing about neural networks is that, given
enough data about x and y, given enough training examples with both x and y, neural networks are
remarkably good at figuring out functions that accurately map x to y.
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 7
SUPERVISED LEARNING
➤ One of the most exciting things about the rise of neural networks is that computers
are now much better at interpreting unstructured data than they were just a few
years ago. This creates opportunities for many exciting new applications that use
speech recognition, image recognition, and natural language processing on text.
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 8
HOW DO NEURAL NETS ‘LEARN’?
1. Start with values (often random) for the network parameters (the weights w and biases b).
2. Take a set of input examples and pass them through the network to obtain their
predictions.
3. Compare these predictions with the expected labels and calculate the loss.
4. Perform backpropagation to propagate this loss back to each of the parameters that
make up the model of the neural network.
5. Use this propagated information to update the parameters with gradient descent so
that the total loss is reduced and a better model is obtained.
6. Continue iterating over the previous steps until we consider that we have a good model
(a minimal sketch of this loop follows below).
source: towardsdatascience.com 9
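Here is a minimal sketch of these six steps applied to the one-neuron housing model from the earlier slide. The data values, learning rate, and the use of mean squared error as the loss are all assumptions for illustration (the slides later use cross-entropy for classification).

```python
import numpy as np

# Hypothetical housing data: size (in 1000s of sq ft) and price (in $100k).
x = np.array([0.85, 0.90, 1.20, 1.50, 2.00, 2.30])
y = np.array([1.50, 1.80, 2.40, 2.80, 3.70, 4.20])

# Step 1: start with (arbitrary) initial parameter values.
w, b = 0.0, 0.0
alpha = 0.1  # learning rate

for step in range(2000):
    # Step 2: pass the inputs through the "network" to get predictions.
    pred = w * x + b
    # Step 3: compare predictions with labels via a loss (mean squared error here).
    loss = np.mean((pred - y) ** 2)
    # Step 4: propagate the loss back to the parameters (derivatives of the loss).
    dw = np.mean(2 * (pred - y) * x)
    db = np.mean(2 * (pred - y))
    # Step 5: update the parameters with gradient descent.
    w -= alpha * dw
    b -= alpha * db

# Step 6: stop once the model is good enough (here, after a fixed number of steps).
print(f"price ≈ {w:.2f} * size + {b:.2f}, loss = {loss:.4f}")
```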
HOW DO NEURAL NETS ‘LEARN’?
source: towardsdatascience.com 10
LOGISTIC REGRESSION
➤ We will use logistic regression in order to make the ideas
easier to understand. Logistic regression is an algorithm for
binary classification.
➤ Here's an example of a binary classification problem. You
might have an input of an image, and want to output a label
to recognize this image as either being a cat, in which case
you output 1, or not-cat in which case you output 0, and we're
going to use y to denote the output label.
➤ In the logistic model, the log-odds (the logarithm of the odds)
for the value labeled "1" is a linear combination of one or
more independent variables (“predictors").
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 11
LOGISTIC REGRESSION: ASSUMPTIONS
➤ First, binary logistic regression requires the dependent variable to be binary and ordinal
logistic regression requires the dependent variable to be ordinal.
➤ Second, logistic regression requires the observations to be independent of each other.  In
other words, the observations should not come from repeated measurements or
matched data.
➤ Third, logistic regression requires there to be little or no multicollinearity among the
independent variables.  This means that the independent variables should not be too
highly correlated with each other.
➤ Fourth, logistic regression assumes linearity of independent variables and log odds. 
Although the dependent and independent variables do not have to be related linearly, it
requires that the independent variables are linearly related to the log odds.
➤ Finally, logistic regression typically requires a large sample size.  A general guideline is
that you need a minimum of 10 cases with the least frequent outcome for each
independent variable in your model. For example, if you have 5 independent variables
and the expected probability of your least frequent outcome is .10, then you would need
a minimum sample size of 500 (10*5 / .10).
source: https://www.statisticssolutions.com/assumptions-of-logistic-regression/ 12
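As a quick check of the arithmetic in that rule of thumb, here is a tiny helper (my own illustration, not from the cited page):

```python
def min_sample_size(num_predictors: int, least_frequent_rate: float,
                    cases_per_predictor: int = 10) -> float:
    """Rule-of-thumb minimum sample size for logistic regression:
    (cases_per_predictor * num_predictors) / least_frequent_rate."""
    return cases_per_predictor * num_predictors / least_frequent_rate

# Example from the slide: 5 predictors, least frequent outcome probability 0.10.
print(min_sample_size(5, 0.10))  # -> 500.0
```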
LOGISTIC ACTIVATION FUNCTION: SIGMOID
➤ The goal is to predict the target class y from input z. The probability
P(y=1|z) that input z is classified as class y=1 is represented by the output
ŷ of the sigmoid function, computed as ŷ = σ(z):

σ(z) = 1 / (1 + e^(−z))

➤ Note that input z to the logistic function corresponds to the log odds ratio:

z = log( ŷ / (1 − ŷ) )

➤ This means that the log odds ratio changes linearly with z. Furthermore, since
z = wᵀ · x, input z changes linearly with the parameters w and the
input samples x. This linearity property is a requirement for logistic
regression.
source: https://peterroelants.github.io/posts/cross-entropy-logistic/ 13
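A small sketch of the sigmoid and its inverse (the log-odds), illustrating the relationships above; the sample values of z are arbitrary.

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z)); maps log-odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_odds(p):
    """Inverse of the sigmoid: z = log(p / (1 - p))."""
    return np.log(p / (1.0 - p))

z = np.array([-2.0, 0.0, 0.5, 3.0])   # arbitrary inputs
p = sigmoid(z)
print(p)              # probabilities P(y=1|z)
print(log_odds(p))    # recovers z: the log-odds change linearly with z
```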
SIGMOID FUNCTION: DECONSTRUCTED
➤ Consider a model with two predictors, x1 and x2, and one
binary (Bernoulli) response variable Y. Then the general form
of the log-odds (here denoted by ℓ) is:

ℓ = log( P / (1 − P) ) = β0 + β1·x1 + β2·x2

➤ Exponentiating gives the odds, and solving for P recovers the sigmoid:

P / (1 − P) = odds = e^(β0 + β1·x1 + β2·x2) = e^z,  where z = β0 + β1·x1 + β2·x2

P = odds / (odds + 1) = e^z / (e^z + 1) = 1 / (1 + 1/e^z) = 1 / (1 + e^(−z))

➤ Note: when we look at neural nets, β0 will be modeled as the
parameter b, and β1 and β2 will be modeled as w1 and w2:

z = b + w1·x1 + w2·x2
source: https://en.wikipedia.org/wiki/Logistic_regression 14
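A quick numeric check of the algebra above, with hypothetical coefficient and input values: computing P from the linear combination via the odds and via the sigmoid gives the same answer.

```python
import numpy as np

# Hypothetical coefficients and inputs.
b0, b1, b2 = -1.0, 0.8, 0.5
x1, x2 = 2.0, 1.0

z = b0 + b1 * x1 + b2 * x2       # log-odds
odds = np.exp(z)                 # P / (1 - P)

p_from_odds = odds / (odds + 1)          # P = odds / (odds + 1)
p_from_sigmoid = 1 / (1 + np.exp(-z))    # P = 1 / (1 + e^(-z))

print(z, odds)
print(p_from_odds, p_from_sigmoid)       # identical, as derived above
```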
EXAMPLE OF BINARY CLASSIFICATION - IMAGE RECOGNITION
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 15
Is this a picture of a cat? Yes: 1, No: 0
NEURAL NET NOTATION
➤ Sigma(σ), in this context, is the activation function of a
node which defines the output of that node given an input or
set of inputs.
➤ The linear function, z, is the input and the activation, a, is the
output.
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 16
LOGISTIC ACTIVATION FUNCTION: SIGMOID
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 17
CROSS ENTROPY LOSS FUNCTION
➤ The loss function used to optimize the classification is the cross-entropy
loss function:

L(y, a) = −(1/n) · Σ_{i=1..n} [ y_i·log(a_i) + (1 − y_i)·log(1 − a_i) ]

➤ The output of the model a = σ(z) can be interpreted as the probability a
that input z belongs to one class (y=1), or the probability 1 − a that z belongs
to the other class (y=0).
➤ The neural network model will be optimized by maximizing the
likelihood that a given set of parameters θ of the model can result in a
prediction of the correct class of each input sample. The likelihood
maximization can be written as:

argmax_θ ℒ(θ|y, z) = argmax_θ ∏_{i=1..n} ℒ(θ|y_i, z_i)
source: https://peterroelants.github.io/posts/cross-entropy-logistic/ 18
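A minimal sketch of the cross-entropy loss as written above; the predicted probabilities a and labels y are made-up values.

```python
import numpy as np

def cross_entropy(y, a, eps=1e-12):
    """L(y, a) = -(1/n) * sum( y*log(a) + (1-y)*log(1-a) ).
    eps guards against log(0)."""
    a = np.clip(a, eps, 1 - eps)
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

y = np.array([1, 0, 1, 1])           # hypothetical true labels
a = np.array([0.9, 0.2, 0.7, 0.95])  # hypothetical predicted probabilities
print(cross_entropy(y, a))
```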
WHY DO WE CARE ABOUT THE LIKELIHOOD FUNCTION?
➤ Why do we care about the likelihood function? Because it describes
exactly the situation we are in: we have observed data (outcomes
and inputs), but do not know anything about the parameters that
establish the relationship between the two.
➤ The likelihood function (often simply the likelihood) is the
joint probability distribution of observed data expressed as a
function of statistical parameters. Given the outcome x, the
parameter θ, and a continuous probability density function f, the
likelihood function is:

ℒ(θ|x) = f_θ(x)
source: https://en.wikipedia.org/wiki/Likelihood_function 19
WHY DO WE CARE ABOUT THE LIKELIHOOD FUNCTION? (CONT.)
➤ The likelihood function describes the relative probability or
odds of obtaining the observed data for all permissible values
of the parameters, and is used to identify the particular
parameter values that are most plausible given the observed data.
➤ The likelihood function is a function of the parameter only,
with the data held as a fixed constant. It is the probability of
the data given the parameter value.
➤ Over the domain of permissible parameter values, the
likelihood function describes a surface. The peak of that
surface, if it exists, identifies the point in the parameter space
that maximizes the likelihood; that is the value that is most
likely to be the parameter of the joint probability distribution
underlying the observed data.
source: https://en.wikipedia.org/wiki/Likelihood_function 20
CROSS ENTROPY LOSS FUNCTION CONT.
➤ The likelihood function can be written as a joint probability of
generating y and z, given parameters θ:

ℒ(θ|y, z) = P(y, z|θ)

➤ Since we are not interested in the probability of z, we can
reduce this to P(y|z, θ).
source: https://peterroelants.github.io/posts/cross-entropy-logistic/ 21
CROSS ENTROPY LOSS FUNCTION CONT.
➤ Since y_i is a Bernoulli variable, and the probability P(y|z) is
fixed for a given θ, we can further simplify:

P(y|z) = ∏_{i=1..n} P(y_i = 1|z_i)^(y_i) · (1 − P(y_i = 1|z_i))^(1−y_i)

➤ Since P(y = 1|z) = σ(z) = a, this becomes:

P(y|z) = ∏_{i=1..n} a_i^(y_i) · (1 − a_i)^(1−y_i)

➤ Why is the above a product? Since the probability of y given z is for a
sample of size n, we have to account for the probability of each
outcome in the sample. For example: if the sample size is 3, the
probability of y=1 for each outcome is 0.9 given z, and we have three
outcomes where y=1, then the probability of getting y=1 for the whole
sample, given z, is 0.9 × 0.9 × 0.9 = 0.729.
source: https://peterroelants.github.io/posts/cross-entropy-logistic/ 22
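The 0.9 × 0.9 × 0.9 example above, written out as the likelihood product for a hypothetical sample of three positive outcomes:

```python
import numpy as np

a = np.array([0.9, 0.9, 0.9])   # P(y_i = 1 | z_i) for each of three samples
y = np.array([1, 1, 1])         # all three observed outcomes are y = 1

# Likelihood of the whole sample: product of a_i^y_i * (1 - a_i)^(1 - y_i)
likelihood = np.prod(a**y * (1 - a)**(1 - y))
print(likelihood)   # 0.729
```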
BERNOULLI DISTRIBUTION
➤ The Bernoulli distribution is the discrete probability
distribution of a random variable which takes the value 1 with
probability p and the value 0 with probability q = 1 − p.
➤ It is the probability distribution of any single experiment that
asks a yes–no question. As such, it is a special case of the
binomial distribution where a single trial is conducted.
source:https://en.wikipedia.org/wiki/Bernoulli_distribution 23
CROSS ENTROPY LOSS FUNCTION CONT.
➤ Taking the log of the likelihood function gives the log-likelihood; because
the log is monotonic, maximizing it is equivalent to maximizing the
likelihood, and its negative is a convex loss function whose minimum we can find:

log ℒ(θ|y, z) = log ∏_{i=1..n} a_i^(y_i) · (1 − a_i)^(1−y_i) = Σ_{i=1..n} [ y_i·log(a_i) + (1 − y_i)·log(1 − a_i) ]

➤ Minimizing the negative of this function (minimizing the
negative log likelihood) corresponds to maximizing the
likelihood. This loss function L(y, a) is known as the cross-
entropy error (loss) function, also known as the log-loss. For a
single sample:

L(y, a) = −log(a_i)       if y_i = 1
L(y, a) = −log(1 − a_i)   if y_i = 0

➤ Why the negative of the function? So we can minimize the loss,
i.e., the difference between predicted and actual observations.
source: https://peterroelants.github.io/posts/cross-entropy-logistic/ 24
CROSS ENTROPY LOSS FUNCTION CONT.
➤ By minimizing the negative log probability, we maximize
the log probability. And since y can only be 0 or 1, we can
write L(y, a) for a single sample as:

L(y, a) = −y·log(a) − (1 − y)·log(1 − a)

➤ Which gives the following if we sum over all n samples:

L(y, a) = −(1/n) · Σ_{i=1..n} [ y_i·log(a_i) + (1 − y_i)·log(1 − a_i) ]

➤ So what we end up with is a loss function that is 0 if the
probability of predicting the correct class is 1, and goes to infinity
as the probability of predicting the correct class goes to 0.
source: https://peterroelants.github.io/posts/cross-entropy-logistic/ 25
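A short sketch of the behavior just described: the per-sample loss is near 0 when the predicted probability of the correct class is near 1, and blows up as that probability goes to 0. The probability values are arbitrary.

```python
import numpy as np

def sample_loss(y, a):
    """Per-sample cross-entropy: -y*log(a) - (1-y)*log(1-a)."""
    return -y * np.log(a) - (1 - y) * np.log(1 - a)

for a in [0.999, 0.9, 0.5, 0.1, 0.001]:
    print(f"y=1, a={a:>6}: loss = {sample_loss(1, a):.4f}")
# Loss approaches 0 as a -> 1 and grows without bound as a -> 0.
```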
MINIMIZING THE LOSS FUNCTION: GRADIENT DESCENT
➤ Recall that our goal is to minimize the loss function by
traversing the function’s surface.
➤ To minimize the loss function, we use the gradient descent
algorithm with respect to the parameters, w and b.
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 26
GRADIENT DESCENT - CONT.
➤ The gradient descent algorithm works by taking the gradient
(derivative) of the loss function L with respect to the
parameters w and b, and updating the parameters in the direction
of the negative gradient (down along the loss function).
➤ What is the derivative of the loss function?

∂L/∂w = (a_i − y_i) · x_i

➤ The parameters w are updated at every iteration k by taking steps
proportional to the negative of the gradient:

w(k+1) = w(k) − Δw(k+1)

➤ Δw is defined as:

Δw = α · ∂L/∂w
source: https://peterroelants.github.io/posts/neural-network-implementation-part02/ 27
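A minimal sketch of one gradient descent step for a logistic-regression weight vector, using the update rule above. The mini-batch, initial weights, and learning rate are hypothetical, and the bias is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mini-batch: 4 samples, 2 features, binary labels.
X = np.array([[0.5, 1.2], [1.0, 0.2], [1.5, 1.8], [0.2, 0.4]])
y = np.array([1, 0, 1, 0])

w = rng.normal(size=2)   # current parameters w(k)
alpha = 0.1              # learning rate

a = 1 / (1 + np.exp(-(X @ w)))    # predictions a = sigmoid(X @ w)
grad = X.T @ (a - y) / len(y)     # dL/dw: average of x_i * (a_i - y_i)
delta_w = alpha * grad            # Δw = α * dL/dw
w_next = w - delta_w              # w(k+1) = w(k) - Δw(k+1)

print("gradient:", grad)
print("updated weights:", w_next)
```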
GRADIENT DESCENT - CONT.
➤ Below is a diagram that shows the algorithm ‘moving down’ the
negative gradient in steps of size alpha (learning rate)
➤ Note: this function maps the parameters to the loss function:

J(w, b) = L(y, a)
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 28
WHY THE NEGATIVE GRADIENT?
➤ Because your goal is to minimize the loss function J(θ) =
J(w,b):
source: https://medium.com/@aerinykim/why-do-we-subtract-the-slope-a-in-gradient-descent-73c7368644fa 29
THE GRADIENT: DECONSTRUCTED
➤ OK, so the gradient is the derivative of the loss function, which
is:

∂L/∂w = x_i · (a_i − y_i)

➤ The next question is: why? How is this derivative
calculated?
30
CROSS ENTROPY LOSS & WEIGHTS: CALCULATING THE DERIVATIVE
Prove:

∂L/∂w = x_i · (a_i − y_i)

Let’s break it down with the chain rule:

∂L/∂w = (∂L/∂a) · (∂a/∂z) · (∂z/∂w)

Let’s handle the easy one first:

∂z/∂w = ∂(x · w)/∂w = x
31
CROSS ENTROPY LOSS & ACTIVATION FUNCTION: CALCULATING THE DERIVATIVE
Let’s handle ∂L/∂a next. Recall the chain rule decomposition:

∂L/∂w = (∂L/∂a) · (∂a/∂z) · (∂z/∂w)

Recall the loss function:

L(a, y) = −( y·log(a) + (1 − y)·log(1 − a) )

∂L/∂a = ∂( −( y·log(a) + (1 − y)·log(1 − a) ) ) / ∂a
      = ∂( −y·log(a) )/∂a + ∂( −(1 − y)·log(1 − a) )/∂a
      = −y/a + (1 − y)/(1 − a)
      = −y(1 − a)/( a(1 − a) ) + a(1 − y)/( a(1 − a) )
      = ( −y + ya + a − ay ) / ( a(1 − a) )
      = ( a − y ) / ( a(1 − a) )
32
CROSS ENTROPY LOSS & SIGMOID FUNCTION: CALCULATING THE DERIVATIVE
Finally, ∂a/∂z. Recall that a is the sigmoid activation function:

a = σ(z) = 1 / (1 + e^(−z)) = (1 + e^(−z))^(−1)

Differentiate with the chain rule (−1 from the exponent and −1 from the derivative of −z):

∂a/∂z = (−1)(1 + e^(−z))^(−2) · (−1)(e^(−z))
      = e^(−z) / (1 + e^(−z))^2
      = [ 1 / (1 + e^(−z)) ] · [ e^(−z) / (1 + e^(−z)) ]

Also, note:

1 − a = 1 − 1/(1 + e^(−z)) = (1 + e^(−z))/(1 + e^(−z)) − 1/(1 + e^(−z)) = e^(−z) / (1 + e^(−z))

Substitute a and 1 − a:

∂a/∂z = a(1 − a)
33
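A quick numerical check (my own sketch, not from the slides) that the derivative of the sigmoid really is a(1 − a), comparing the closed form against a finite-difference approximation at a few arbitrary points:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.array([-3.0, -0.5, 0.0, 1.2, 4.0])   # arbitrary test points
a = sigmoid(z)

analytic = a * (1 - a)                                   # derived: da/dz = a(1 - a)
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)    # central difference

print(np.max(np.abs(analytic - numeric)))   # tiny difference: the two agree
```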
LOSS FUNCTION WITH RESPECT TO Z: DERIVATIVE
Putting it all together:

∂L/∂w = (∂L/∂a) · (∂a/∂z) · (∂z/∂w) = [ (a − y) / ( a(1 − a) ) ] · a(1 − a) · x_i = x_i · (a − y)

Equivalently, with respect to z alone: ∂L/∂z = (∂L/∂a) · (∂a/∂z) = a − y.
34
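And a similar finite-difference check (again a hypothetical sketch) that ∂L/∂w = x·(a − y) for the single-sample cross-entropy loss through the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, x, y):
    """Single-sample cross-entropy loss through the sigmoid."""
    a = sigmoid(np.dot(w, x))
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Hypothetical sample and weights.
x = np.array([0.5, -1.2, 2.0])
y = 1.0
w = np.array([0.3, 0.7, -0.4])

a = sigmoid(np.dot(w, x))
analytic = x * (a - y)               # derived gradient: x_i * (a - y)

h = 1e-6
numeric = np.zeros_like(w)
for j in range(len(w)):              # finite differences, one weight at a time
    e = np.zeros_like(w)
    e[j] = h
    numeric[j] = (loss(w + e, x, y) - loss(w - e, x, y)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))   # tiny difference: the derivation checks out
```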
DERIVATIVES AND BACKPROPAGATION
➤ Backpropagation is an iterative, recursive and efficient method for calculating
the weight updates needed to improve the network until it is able to perform the
task for which it is being trained.
➤ The important part is the blue text on the right: note how we adjust
the weights (w1, w2 and b) by subtracting the gradient (scaled by the learning
rate), i.e., stepping in the direction of the negative gradient.
∂L
∂w
= xi ⋅ (ai − yi)
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 35
NEURAL NETS: FULL CIRCLE
1. Start with values (often random) for the network parameters (the weights w and biases b).
2. Take a set of input examples and pass them through the network to obtain their
predictions.
3. Compare these predictions with the expected labels and calculate the loss.
4. Perform backpropagation to propagate this loss back to each of the parameters that
make up the model of the neural network.
5. Use this propagated information to update the parameters with gradient descent so
that the total loss is reduced and a better model is obtained.
6. Continue iterating over the previous steps until we consider that we have a good model
(a complete worked example follows below).
source: towardsdatascience.com 36
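Putting the whole loop together, here is a compact sketch that trains a logistic-regression "neuron" end to end, using the cross-entropy loss and the gradient x·(a − y) derived earlier. The toy dataset, learning rate, and iteration count are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical binary-classification data: 200 samples, 2 features.
# Class-1 points are shifted so the problem is learnable.
X = rng.normal(size=(200, 2))
y = (rng.random(200) < 0.5).astype(float)
X[y == 1] += 1.5

# Step 1: random initial parameters.
w = rng.normal(size=2)
b = 0.0
alpha = 0.5  # learning rate

for epoch in range(500):
    # Steps 2-3: forward pass and cross-entropy loss.
    a = sigmoid(X @ w + b)
    loss = -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

    # Step 4: backpropagate — dL/dw = mean of x_i (a_i - y_i), dL/db = mean of (a_i - y_i).
    dw = X.T @ (a - y) / len(y)
    db = np.mean(a - y)

    # Step 5: gradient descent update.
    w -= alpha * dw
    b -= alpha * db

# Step 6: check whether the model is good enough.
accuracy = np.mean((a > 0.5) == y)
print(f"final loss {loss:.3f}, training accuracy {accuracy:.2f}")
```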
WELL, WAS IT WORTH THE EFFORT?
➤ Consider this…
➤ One of the most famous and consequential meetings in the
history of science took place in the summer of 1684 when the
young astronomer Edmund Halley paid a visit to Isaac Newton.
After they had been some time together, the Doctor asked him what
he thought the curve would be that would be described by the
planets, supposing the force of attraction towards the sun to be
reciprocal to the square of their distance from it. Sir Isaac replied
immediately that it would be an ellipse. The Doctor, struck with joy
and amazement, asked him how he knew it.
Why, saith he, I have calculated it.
source: https://www.mathpages.com/home/kmath658/kmath658.htm 37
THANK YOU!
38
