NEURAL NETS
DECONSTRUCTED
Paul Sterk

AI Engineer
1
WHAT ARE WE GOING TO COVER?
➤ What is a Neural Network (NN)?
➤ What does it mean to ‘deconstruct’ a NN?
➤ How is this helpful to students and those new to the field?
➤ We will also touch on learning psychology: some thoughts on
what habits I found beneficial in learning this material
2
GETTING STARTED - THE “BIG PICTURE”
➤ How do AI, ML, Deep Learning and Neural Nets relate to each other?
➤ “You can think of deep learning, machine learning and
artificial intelligence as a set of Russian dolls nested within
each other, beginning with the smallest and working out.
Deep learning is a subset of machine learning, and machine
learning is a subset of AI, which is an umbrella term for
any computer program that does something smart.”[1]
➤ Deep Learning is an ML method based on training Neural
Networks.
[1] https://skymind.ai/wiki/ai-vs-machine-learning-vs-deep-learning
(Pictured: John McCarthy)
3
WHAT IS A NEURAL NET?
➤ Definition: A neural network (NN) is an interconnected group of natural or
artificial neurons that uses a mathematical or computational model for
statistical data modeling or decision making.
➤ In most cases an ANN is an adaptive system that changes its
structure based on external or internal information that flows
through the network.
➤ They can be used to model complex relationships between
inputs and outputs or to find patterns in data.
source: https://en.wikipedia.org/wiki/Artificial_neural_network 4
APPLICATIONS FOR NEURAL NETS
➤ The tasks to which artificial neural networks are applied tend
to fall within the following broad categories:
➤ Function approximation, or regression analysis, including
time series prediction and modeling.
➤ Classification, including pattern and sequence recognition,
novelty detection and sequential decision making.
➤ Data processing, including filtering, clustering, blind signal
separation and compression.
source: https://en.wikipedia.org/wiki/Artificial_neural_network 5
LET’S LOOK AT A SIMPLE EXAMPLE
➤ Let's say you have a data set with six houses. You have the size of each house (in square
feet or square meters) and its price. You then want to fit a function to predict the price of
a house as a function of its size. If you are familiar with linear regression, you might try
to fit a straight line to these data to model the relationship.
➤ So you can think of this function that you've just fit to the housing prices as a very simple
neural network.
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 6
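To make this concrete, here is a minimal sketch (not from the course) that fits a straight line to a small, made-up set of house sizes and prices with NumPy; the fitted function price ≈ w·size + b is exactly the "single neuron" described above. The six data points are hypothetical.

```python
import numpy as np

# Hypothetical data: six houses, size in square feet and price in $1000s.
size = np.array([850, 900, 1200, 1500, 2000, 2300], dtype=float)
price = np.array([150, 180, 240, 280, 370, 420], dtype=float)

# Fit price = w * size + b by least squares (a one-input, one-output
# "neuron" with no activation function).
w, b = np.polyfit(size, price, deg=1)

print(f"price ≈ {w:.3f} * size + {b:.1f}")
print("predicted price for a hypothetical 1800 sq ft house:", w * 1800 + b)
```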
HOUSING PREDICTION CONTINUED…
➤ Given these input features, the job of the neural network will be to predict the price y. Notice also that each
of these circles, called hidden units in the neural network, takes as its input all four input features.
➤ The middle layer of the neural network is densely connected because every input feature is connected to
every one of these circles in the middle. And the remarkable thing about neural networks is that, given
enough data about x and y, given enough training examples with both x and y, neural networks are
remarkably good at figuring out functions that accurately map x to y.
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 7
SUPERVISED LEARNING
➤ One of the most exciting things about the rise of neural networks is that computers
are now much better at interpreting unstructured data than they were just a few
years ago. This creates opportunities for many exciting new applications that use
speech recognition, image recognition, and natural language processing on text.
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 8
HOW DO NEURAL NETS ‘LEARN’?
1. Start with values (often random) for the network parameters (the weights w and biases b).
2. Take a set of input examples and pass them through the network to obtain their
predictions.
3. Compare these predictions with the expected labels and calculate the loss.
4. Perform backpropagation to propagate this loss back to each of the parameters that
make up the model of the neural network.
5. Use this propagated information to update the parameters with gradient descent so
that the total loss is reduced and a better model is obtained.
6. Continue iterating over the previous steps until we consider that we have a good model
(a minimal sketch of this loop follows below).
source: towardsdatascience.com 9
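Here is a minimal sketch of these six steps applied to the one-neuron housing model from the earlier slide. The data values, learning rate, and the use of mean squared error as the loss are all assumptions for illustration (the slides later use cross-entropy for classification).

```python
import numpy as np

# Hypothetical housing data: size (in 1000s of sq ft) and price (in $100k).
x = np.array([0.85, 0.90, 1.20, 1.50, 2.00, 2.30])
y = np.array([1.50, 1.80, 2.40, 2.80, 3.70, 4.20])

# Step 1: start with (arbitrary) initial parameter values.
w, b = 0.0, 0.0
alpha = 0.1  # learning rate

for step in range(2000):
    # Step 2: pass the inputs through the "network" to get predictions.
    pred = w * x + b
    # Step 3: compare predictions with labels via a loss (mean squared error here).
    loss = np.mean((pred - y) ** 2)
    # Step 4: propagate the loss back to the parameters (derivatives of the loss).
    dw = np.mean(2 * (pred - y) * x)
    db = np.mean(2 * (pred - y))
    # Step 5: update the parameters with gradient descent.
    w -= alpha * dw
    b -= alpha * db

# Step 6: stop once the model is good enough (here, after a fixed number of steps).
print(f"price ≈ {w:.2f} * size + {b:.2f}, loss = {loss:.4f}")
```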
HOW DO NEURAL NETS ‘LEARN’?
source: towardsdatascience.com 10
LOGISTIC REGRESSION
➤ We will use logistic regression in order to make the ideas
easier to understand. Logistic regression is an algorithm for
binary classification.
➤ Here's an example of a binary classification problem. You
might have an input of an image, and want to output a label
to recognize this image as either being a cat, in which case
you output 1, or not-cat in which case you output 0, and we're
going to use y to denote the output label.
➤ In the logistic model, the log-odds (the logarithm of the odds)
for the value labeled "1" is a linear combination of one or
more independent variables (“predictors").
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 11
LOGISTIC REGRESSION: ASSUMPTIONS
➤ First, binary logistic regression requires the dependent variable to be binary and ordinal
logistic regression requires the dependent variable to be ordinal.
➤ Second, logistic regression requires the observations to be independent of each other.  In
other words, the observations should not come from repeated measurements or
matched data.
➤ Third, logistic regression requires there to be little or no multicollinearity among the
independent variables.  This means that the independent variables should not be too
highly correlated with each other.
➤ Fourth, logistic regression assumes linearity of independent variables and log odds. 
Although the dependent and independent variables do not have to be related linearly, it
requires that the independent variables are linearly related to the log odds.
➤ Finally, logistic regression typically requires a large sample size.  A general guideline is
that you need a minimum of 10 cases with the least frequent outcome for each
independent variable in your model. For example, if you have 5 independent variables
and the expected probability of your least frequent outcome is .10, then you would need
a minimum sample size of 500 (10*5 / .10).
source: https://www.statisticssolutions.com/assumptions-of-logistic-regression/ 12
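As a quick check of the arithmetic in that rule of thumb, here is a tiny helper (my own illustration, not from the cited page):

```python
def min_sample_size(num_predictors: int, least_frequent_rate: float,
                    cases_per_predictor: int = 10) -> float:
    """Rule-of-thumb minimum sample size for logistic regression:
    (cases_per_predictor * num_predictors) / least_frequent_rate."""
    return cases_per_predictor * num_predictors / least_frequent_rate

# Example from the slide: 5 predictors, least frequent outcome probability 0.10.
print(min_sample_size(5, 0.10))  # -> 500.0
```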
LOGISTIC ACTIVATION FUNCTION: SIGMOID
➤ The goal is to predict the target class y from input z. The probability
P(y=1|z) that input z is classified as class y=1 is represented by the output
ŷ of the sigmoid function, computed as ŷ = σ(z):

σ(z) = 1 / (1 + e^(−z))

➤ Note that input z to the logistic function corresponds to the log odds ratio:

z = log( ŷ / (1 − ŷ) )

➤ This means that the log odds ratio changes linearly with z. Furthermore, since
z = wᵀ · x, input z changes linearly with the parameters w and the
input samples x. This linearity property is a requirement for logistic
regression.
source: https://peterroelants.github.io/posts/cross-entropy-logistic/ 13
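A small sketch of the sigmoid and its inverse (the log-odds), illustrating the relationships above; the sample values of z are arbitrary.

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z)); maps log-odds z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_odds(p):
    """Inverse of the sigmoid: z = log(p / (1 - p))."""
    return np.log(p / (1.0 - p))

z = np.array([-2.0, 0.0, 0.5, 3.0])   # arbitrary inputs
p = sigmoid(z)
print(p)              # probabilities P(y=1|z)
print(log_odds(p))    # recovers z: the log-odds change linearly with z
```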
SIGMOID FUNCTION: DECONSTRUCTED
➤ Consider a model with two predictors, x1 and x2, and one
binary (Bernoulli) response variable Y. Then the general form
of the log-odds (here denoted by ℓ) is:

ℓ = log( P / (1 − P) ) = β0 + β1·x1 + β2·x2

➤ Exponentiating gives the odds, and solving for P recovers the sigmoid:

P / (1 − P) = odds = e^(β0 + β1·x1 + β2·x2) = e^z,  where z = β0 + β1·x1 + β2·x2

P = odds / (odds + 1) = e^z / (e^z + 1) = 1 / (1 + 1/e^z) = 1 / (1 + e^(−z))

➤ Note: when we look at neural nets, β0 will be modeled as the
parameter b, and β1 and β2 will be modeled as w1 and w2:

z = b + w1·x1 + w2·x2
source: https://en.wikipedia.org/wiki/Logistic_regression 14
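A quick numeric check of the algebra above, with hypothetical coefficient and input values: computing P from the linear combination via the odds and via the sigmoid gives the same answer.

```python
import numpy as np

# Hypothetical coefficients and inputs.
b0, b1, b2 = -1.0, 0.8, 0.5
x1, x2 = 2.0, 1.0

z = b0 + b1 * x1 + b2 * x2       # log-odds
odds = np.exp(z)                 # P / (1 - P)

p_from_odds = odds / (odds + 1)          # P = odds / (odds + 1)
p_from_sigmoid = 1 / (1 + np.exp(-z))    # P = 1 / (1 + e^(-z))

print(z, odds)
print(p_from_odds, p_from_sigmoid)       # identical, as derived above
```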
EXAMPLE OF BINARY CLASSIFICATION - IMAGE RECOGNITION
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 15
Is this a picture of a cat? Yes: 1, No: 0
NEURAL NET NOTATION
➤ Sigma(σ), in this context, is the activation function of a
node which defines the output of that node given an input or
set of inputs.
➤ The linear function, z, is the input and the activation, a, is the
output.
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 16
LOGISTIC ACTIVATION FUNCTION: SIGMOID
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 17
CROSS ENTROPY LOSS FUNCTION
➤ The loss function used to optimize the classification is the cross-entropy
loss function:

L(y, a) = −(1/n) · Σ_{i=1..n} [ y_i·log(a_i) + (1 − y_i)·log(1 − a_i) ]

➤ The output of the model a = σ(z) can be interpreted as the probability a
that input z belongs to one class (y=1), or the probability 1 − a that z belongs
to the other class (y=0).
➤ The neural network model will be optimized by maximizing the
likelihood that a given set of parameters θ of the model can result in a
prediction of the correct class of each input sample. The likelihood
maximization can be written as:

argmax_θ ℒ(θ|y, z) = argmax_θ ∏_{i=1..n} ℒ(θ|y_i, z_i)
source: https://peterroelants.github.io/posts/cross-entropy-logistic/ 18
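A minimal sketch of the cross-entropy loss as written above; the predicted probabilities a and labels y are made-up values.

```python
import numpy as np

def cross_entropy(y, a, eps=1e-12):
    """L(y, a) = -(1/n) * sum( y*log(a) + (1-y)*log(1-a) ).
    eps guards against log(0)."""
    a = np.clip(a, eps, 1 - eps)
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

y = np.array([1, 0, 1, 1])           # hypothetical true labels
a = np.array([0.9, 0.2, 0.7, 0.95])  # hypothetical predicted probabilities
print(cross_entropy(y, a))
```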
WHY DO WE CARE ABOUT THE LIKELIHOOD FUNCTION?
➤ Why do we care about the likelihood function? Because it describes
exactly the situation we are in: we have observed data (outcomes
and inputs), but do not know anything about the parameters that
establish the relationship between the two.
➤ The likelihood function (often simply the likelihood) is the
joint probability distribution of observed data expressed as a
function of statistical parameters. Given the outcome x, the
parameter θ, and a continuous probability density function f, the
likelihood function is:

ℒ(θ|x) = f_θ(x)
source: https://en.wikipedia.org/wiki/Likelihood_function 19
WHY DO WE CARE ABOUT THE LIKELIHOOD FUNCTION? (CONT.)
➤ The likelihood function describes the relative probability or
odds of obtaining the observed data for all permissible values
of the parameters, and is used to identify the particular
parameter values that are most plausible given the observed data.
➤ The likelihood function is a function of the parameter only,
with the data held as a fixed constant. It is the probability of
the data given the parameter value.
➤ Over the domain of permissible parameter values, the
likelihood function describes a surface. The peak of that
surface, if it exists, identifies the point in the parameter space
that maximizes the likelihood; that is the value that is most
likely to be the parameter of the joint probability distribution
underlying the observed data.
source: https://en.wikipedia.org/wiki/Likelihood_function 20
CROSS ENTROPY LOSS FUNCTION CONT.
➤ The likelihood function can be written as a joint probability of
generating y and z, given parameters θ:

ℒ(θ|y, z) = P(y, z|θ)

➤ Since we are not interested in the probability of z, we can
reduce this to P(y|z, θ).
source: https://peterroelants.github.io/posts/cross-entropy-logistic/ 21
CROSS ENTROPY LOSS FUNCTION CONT.
➤ Since y_i is a Bernoulli variable, and the probability P(y|z) is
fixed for a given θ, we can further simplify:

P(y|z) = ∏_{i=1..n} P(y_i = 1|z_i)^(y_i) · (1 − P(y_i = 1|z_i))^(1−y_i)

➤ Since P(y = 1|z) = σ(z) = a, this becomes:

P(y|z) = ∏_{i=1..n} a_i^(y_i) · (1 − a_i)^(1−y_i)

➤ Why is the above a product? Since the probability of y given z is for a
sample of size n, we have to account for the probability of each
outcome in the sample. For example: if the sample size is 3, the
probability of y=1 for each outcome is 0.9 given z, and we have three
outcomes where y=1, then the probability of getting y=1 for the whole
sample, given z, is 0.9 × 0.9 × 0.9 = 0.729.
source: https://peterroelants.github.io/posts/cross-entropy-logistic/ 22
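The 0.9 × 0.9 × 0.9 example above, written out as the likelihood product for a hypothetical sample of three positive outcomes:

```python
import numpy as np

a = np.array([0.9, 0.9, 0.9])   # P(y_i = 1 | z_i) for each of three samples
y = np.array([1, 1, 1])         # all three observed outcomes are y = 1

# Likelihood of the whole sample: product of a_i^y_i * (1 - a_i)^(1 - y_i)
likelihood = np.prod(a**y * (1 - a)**(1 - y))
print(likelihood)   # 0.729
```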
BERNOULLI DISTRIBUTION
➤ The Bernoulli distribution is the discrete probability
distribution of a random variable which takes the value 1 with
probability p and the value 0 with probability q = 1 − p.
➤ It is the probability distribution of any single experiment that
asks a yes–no question. As such, it is a special case of the
binomial distribution where a single trial is conducted.
source:https://en.wikipedia.org/wiki/Bernoulli_distribution 23
CROSS ENTROPY LOSS FUNCTION CONT.
➤ Taking the log of the likelihood function gives the log-likelihood; because
the log is monotonic, maximizing it is equivalent to maximizing the
likelihood, and its negative is a convex loss function whose minimum we can find:

log ℒ(θ|y, z) = log ∏_{i=1..n} a_i^(y_i) · (1 − a_i)^(1−y_i) = Σ_{i=1..n} [ y_i·log(a_i) + (1 − y_i)·log(1 − a_i) ]

➤ Minimizing the negative of this function (minimizing the
negative log likelihood) corresponds to maximizing the
likelihood. This loss function L(y, a) is known as the cross-
entropy error (loss) function, also known as the log-loss. For a
single sample:

L(y, a) = −log(a_i)       if y_i = 1
L(y, a) = −log(1 − a_i)   if y_i = 0

➤ Why the negative of the function? So we can minimize the loss,
i.e., the difference between predicted and actual observations.
source: https://peterroelants.github.io/posts/cross-entropy-logistic/ 24
CROSS ENTROPY LOSS FUNCTION CONT.
➤ By minimizing the negative log probability, we maximize
the log probability. And since y can only be 0 or 1, we can
write L(y, a) for a single sample as:

L(y, a) = −y·log(a) − (1 − y)·log(1 − a)

➤ Which gives the following if we sum over all n samples:

L(y, a) = −(1/n) · Σ_{i=1..n} [ y_i·log(a_i) + (1 − y_i)·log(1 − a_i) ]

➤ So what we end up with is a loss function that is 0 if the
probability of predicting the correct class is 1, and goes to infinity
as the probability of predicting the correct class goes to 0.
source: https://peterroelants.github.io/posts/cross-entropy-logistic/ 25
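A short sketch of the behavior just described: the per-sample loss is near 0 when the predicted probability of the correct class is near 1, and blows up as that probability goes to 0. The probability values are arbitrary.

```python
import numpy as np

def sample_loss(y, a):
    """Per-sample cross-entropy: -y*log(a) - (1-y)*log(1-a)."""
    return -y * np.log(a) - (1 - y) * np.log(1 - a)

for a in [0.999, 0.9, 0.5, 0.1, 0.001]:
    print(f"y=1, a={a:>6}: loss = {sample_loss(1, a):.4f}")
# Loss approaches 0 as a -> 1 and grows without bound as a -> 0.
```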
MINIMIZING THE LOSS FUNCTION: GRADIENT DESCENT
➤ Recall that our goal is to minimize the loss function by
traversing the function’s surface.
➤ To minimize the loss function, we use the gradient descent
algorithm with respect to the parameters, w and b.
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 26
GRADIENT DESCENT - CONT.
➤ The gradient descent algorithm works by taking the gradient
(derivative) of the loss function L with respect to the
parameters w and b, and updating the parameters in the direction
of the negative gradient (down along the loss function).
➤ What is the derivative of the loss function?

∂L/∂w = (a_i − y_i) · x_i

➤ The parameters w are updated at every iteration k by taking steps
proportional to the negative of the gradient:

w(k+1) = w(k) − Δw(k+1)

➤ Δw is defined as:

Δw = α · ∂L/∂w
source: https://peterroelants.github.io/posts/neural-network-implementation-part02/ 27
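A minimal sketch of one gradient descent step for a logistic-regression weight vector, using the update rule above. The mini-batch, initial weights, and learning rate are hypothetical, and the bias is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mini-batch: 4 samples, 2 features, binary labels.
X = np.array([[0.5, 1.2], [1.0, 0.2], [1.5, 1.8], [0.2, 0.4]])
y = np.array([1, 0, 1, 0])

w = rng.normal(size=2)   # current parameters w(k)
alpha = 0.1              # learning rate

a = 1 / (1 + np.exp(-(X @ w)))    # predictions a = sigmoid(X @ w)
grad = X.T @ (a - y) / len(y)     # dL/dw: average of x_i * (a_i - y_i)
delta_w = alpha * grad            # Δw = α * dL/dw
w_next = w - delta_w              # w(k+1) = w(k) - Δw(k+1)

print("gradient:", grad)
print("updated weights:", w_next)
```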
GRADIENT DESCENT - CONT.
➤ Below is a diagram that shows the algorithm ‘moving down’ the
negative gradient in steps of size alpha (learning rate)
➤ Note: this function maps the parameters to the loss function:

J(w, b) = L(y, a)
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 28
WHY THE NEGATIVE GRADIENT?
➤ Because your goal is to minimize the loss function J(θ) =
J(w,b):
source: https://medium.com/@aerinykim/why-do-we-subtract-the-slope-a-in-gradient-descent-73c7368644fa 29
THE GRADIENT: DECONSTRUCTED
➤ OK, so the gradient is the derivative of the loss function, which
is:

∂L/∂w = x_i · (a_i − y_i)

➤ The next question is: why? How is this derivative
calculated?
30
CROSS ENTROPY LOSS & WEIGHTS: CALCULATING THE DERIVATIVE
Prove:

∂L/∂w = x_i · (a_i − y_i)

Let’s break it down with the chain rule:

∂L/∂w = (∂L/∂a) · (∂a/∂z) · (∂z/∂w)

Let’s handle the easy one first:

∂z/∂w = ∂(x · w)/∂w = x
31
CROSS ENTROPY LOSS & ACTIVATION FUNCTION: CALCULATING THE DERIVATIVE
Let’s handle ∂L/∂a next. Recall the chain rule decomposition:

∂L/∂w = (∂L/∂a) · (∂a/∂z) · (∂z/∂w)

Recall the loss function:

L(a, y) = −( y·log(a) + (1 − y)·log(1 − a) )

∂L/∂a = ∂( −( y·log(a) + (1 − y)·log(1 − a) ) ) / ∂a
      = ∂( −y·log(a) )/∂a + ∂( −(1 − y)·log(1 − a) )/∂a
      = −y/a + (1 − y)/(1 − a)
      = −y(1 − a)/( a(1 − a) ) + a(1 − y)/( a(1 − a) )
      = ( −y + ya + a − ay ) / ( a(1 − a) )
      = ( a − y ) / ( a(1 − a) )
32
CROSS ENTROPY LOSS & SIGMOID FUNCTION: CALCULATING THE DERIVATIVE
Finally, ∂a/∂z. Recall that a is the sigmoid activation function:

a = σ(z) = 1 / (1 + e^(−z)) = (1 + e^(−z))^(−1)

Differentiate with the chain rule (−1 from the exponent and −1 from the derivative of −z):

∂a/∂z = (−1)(1 + e^(−z))^(−2) · (−1)(e^(−z))
      = e^(−z) / (1 + e^(−z))^2
      = [ 1 / (1 + e^(−z)) ] · [ e^(−z) / (1 + e^(−z)) ]

Also, note:

1 − a = 1 − 1/(1 + e^(−z)) = (1 + e^(−z))/(1 + e^(−z)) − 1/(1 + e^(−z)) = e^(−z) / (1 + e^(−z))

Substitute a and 1 − a:

∂a/∂z = a(1 − a)
33
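A quick numerical check (my own sketch, not from the slides) that the derivative of the sigmoid really is a(1 − a), comparing the closed form against a finite-difference approximation at a few arbitrary points:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.array([-3.0, -0.5, 0.0, 1.2, 4.0])   # arbitrary test points
a = sigmoid(z)

analytic = a * (1 - a)                                   # derived: da/dz = a(1 - a)
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)    # central difference

print(np.max(np.abs(analytic - numeric)))   # tiny difference: the two agree
```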
LOSS FUNCTION WITH RESPECT TO Z: DERIVATIVE
Putting it all together:

∂L/∂w = (∂L/∂a) · (∂a/∂z) · (∂z/∂w) = [ (a − y) / ( a(1 − a) ) ] · a(1 − a) · x_i = x_i · (a − y)

Equivalently, with respect to z alone: ∂L/∂z = (∂L/∂a) · (∂a/∂z) = a − y.
34
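And a similar finite-difference check (again a hypothetical sketch) that ∂L/∂w = x·(a − y) for the single-sample cross-entropy loss through the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, x, y):
    """Single-sample cross-entropy loss through the sigmoid."""
    a = sigmoid(np.dot(w, x))
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Hypothetical sample and weights.
x = np.array([0.5, -1.2, 2.0])
y = 1.0
w = np.array([0.3, 0.7, -0.4])

a = sigmoid(np.dot(w, x))
analytic = x * (a - y)               # derived gradient: x_i * (a - y)

h = 1e-6
numeric = np.zeros_like(w)
for j in range(len(w)):              # finite differences, one weight at a time
    e = np.zeros_like(w)
    e[j] = h
    numeric[j] = (loss(w + e, x, y) - loss(w - e, x, y)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))   # tiny difference: the derivation checks out
```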
DERIVATIVES AND BACKPROPAGATION
➤ Backpropagation is an iterative, recursive and efficient method for calculating
the weight updates needed to improve the network until it is able to perform the
task for which it is being trained.
➤ The important part is the blue text on the right: note how we adjust
the weights (w1, w2 and b) by subtracting the gradient (scaled by the learning
rate), i.e., stepping in the direction of the negative gradient.
∂L
∂w
= xi ⋅ (ai − yi)
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 35
NEURAL NETS: FULL CIRCLE
1. Start with values (often random) for the network parameters (the weights w and biases b).
2. Take a set of input examples and pass them through the network to obtain their
predictions.
3. Compare these predictions with the expected labels and calculate the loss.
4. Perform backpropagation to propagate this loss back to each of the parameters that
make up the model of the neural network.
5. Use this propagated information to update the parameters with gradient descent so
that the total loss is reduced and a better model is obtained.
6. Continue iterating over the previous steps until we consider that we have a good model
(a complete worked example follows below).
source: towardsdatascience.com 36
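Putting the whole loop together, here is a compact sketch that trains a logistic-regression "neuron" end to end, using the cross-entropy loss and the gradient x·(a − y) derived earlier. The toy dataset, learning rate, and iteration count are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical binary-classification data: 200 samples, 2 features.
# Class-1 points are shifted so the problem is learnable.
X = rng.normal(size=(200, 2))
y = (rng.random(200) < 0.5).astype(float)
X[y == 1] += 1.5

# Step 1: random initial parameters.
w = rng.normal(size=2)
b = 0.0
alpha = 0.5  # learning rate

for epoch in range(500):
    # Steps 2-3: forward pass and cross-entropy loss.
    a = sigmoid(X @ w + b)
    loss = -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

    # Step 4: backpropagate — dL/dw = mean of x_i (a_i - y_i), dL/db = mean of (a_i - y_i).
    dw = X.T @ (a - y) / len(y)
    db = np.mean(a - y)

    # Step 5: gradient descent update.
    w -= alpha * dw
    b -= alpha * db

# Step 6: check whether the model is good enough.
accuracy = np.mean((a > 0.5) == y)
print(f"final loss {loss:.3f}, training accuracy {accuracy:.2f}")
```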
WELL, WAS IT WORTH THE EFFORT?
➤ Consider this…
➤ One of the most famous and consequential meetings in the
history of science took place in the summer of 1684 when the
young astronomer Edmund Halley paid a visit to Isaac Newton.
After they had been some time together, the Doctor asked him what
he thought the curve would be that would be described by the
planets, supposing the force of attraction towards the sun to be
reciprocal to the square of their distance from it. Sir Isaac replied
immediately that it would be an ellipse. The Doctor, struck with joy
and amazement, asked him how he knew it.
Why, saith he, I have calculated it.
source: https://www.mathpages.com/home/kmath658/kmath658.htm 37
THANK YOU!
38
