NEURAL NETS
DECONSTRUCTED
Paul Sterk

AI Engineer
1
WHAT ARE WE GOING TO COVER?
➤ What is a Neural Network (NN)?
➤ What does it mean to ‘deconstruct’ a NN?
➤ How is this helpful to students and those new to the field?
➤ We will also touch on learning psychology - some thoughts on
what habits I found beneficial in learning this material
2
GETTING STARTED - THE “BIG PICTURE”
➤ How do AI, ML, Deep Learning and Neural Nets relate to each other?
➤ “You can think of deep learning, machine learning and
artificial intelligence as a set of Russian dolls nested within
each other, beginning with the smallest and working out.
Deep learning is a subset of machine learning, and machine
learning is a subset of AI, which is an umbrella term for
any computer program that does something smart.”[1]
➤ Deep Learning is an ML method based on training Neural
Networks.
[1] https://skymind.ai/wiki/ai-vs-machine-learning-vs-deep-learning
[Photo: John McCarthy]
3
WHAT IS A NEURAL NET?
➤ Definition: A neural network (NN) is an interconnected group of natural or
artificial neurons that uses a mathematical or computational model for
statistical data modeling or decision making.
➤ In most cases an ANN is an adaptive system that changes its
structure based on external or internal information that flows
through the network.
➤ They can be used to model complex relationships between
inputs and outputs or to find patterns in data.
source: https://en.wikipedia.org/wiki/Artificial_neural_network 4
APPLICATIONS FOR NEURAL NETS
➤ The tasks to which artificial neural networks are applied tend
to fall within the following broad categories:
➤ Function approximation, or regression analysis, including
time series prediction and modeling.
➤ Classification, including pattern and sequence recognition,
novelty detection and sequential decision making.
➤ Data processing, including filtering, clustering, blind source
separation, and compression.
source: https://en.wikipedia.org/wiki/Artificial_neural_network 5
LET’S LOOK AT A SIMPLE EXAMPLE
➤ Let's say you have a data set with six houses. You have the size of the houses (in square
feet or square meters) and the price of each house. You then want to fit a function to predict
the price of the houses as a function of the size. If you are familiar with linear regression,
you might try to fit a straight line to these data to model the relationship.
➤ So you can think of this function that you've just fit to the housing prices as a very simple
neural network.
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 6
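To make the slide concrete, here is a minimal sketch (not from the course) that fits a straight line to six hypothetical houses with NumPy; the sizes and prices are made-up illustration data.

```python
import numpy as np

# Hypothetical data: size in square feet, price in $1000s (illustration only).
size = np.array([650, 800, 1000, 1200, 1500, 1800], dtype=float)
price = np.array([100, 130, 160, 200, 240, 280], dtype=float)

# Fit price = w * size + b by least squares -- the "straight line" in the slide.
w, b = np.polyfit(size, price, deg=1)

# Predict the price of a new house; max(0, .) keeps the prediction non-negative,
# which is how the course motivates a single "neuron".
new_size = 1100
predicted = max(0.0, w * new_size + b)
print(f"w={w:.3f}, b={b:.1f}, predicted price ~ {predicted:.0f}k")
```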
HOUSING PREDICTION CONTINUED…
➤ Given these input features, the job of the neural network will be to predict the price y. Notice also that each
of these circles, called hidden units in the neural network, takes as its inputs all
four input features.
➤ The middle layer of the neural network is densely connected because every input feature is connected to
every one of these circles in the middle. And the remarkable thing about neural networks is that, given
enough data about x and y, given enough training examples with both x and y, neural networks are
remarkably good at figuring out functions that accurately map x to y.
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 7
SUPERVISED LEARNING
➤ One of the most exciting things about the rise of neural networks is that computers
are now much better at interpreting unstructured data than they were just a few
years ago. And this creates opportunities for many new exciting applications that use
speech recognition, image recognition, and natural language processing on text.
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 8
HOW DO NEURAL NETS ‘LEARN’?
1. Start with values (often random) for the network parameters (wij weights and bj biases).
2. Take a set of examples of input data and pass them through the network to obtain their
prediction.
3. Compare these predictions obtained with the values of expected labels and calculate the
loss with them.
4. Perform the backpropagation in order to propagate this loss to each and every one of the
parameters that make up the model of the neural network.
5. Use this propagated information to update the parameters of the neural network with the
gradient descent in a way that the total loss is reduced and a better model is obtained.
6. Continue iterating in the previous steps until we consider that we have a good model.
source: towardsdatascience.com 9
HOW DO NEURAL NETS ‘LEARN’?
source: towardsdatascience.com 10
LOGISTIC REGRESSION
➤ We will use logistic regression in order to make the ideas
easier to understand. Logistic regression is an algorithm for
binary classification.
➤ Here's an example of a binary classification problem. You
might have an input of an image, and want to output a label
to recognize this image as either being a cat, in which case
you output 1, or not-cat in which case you output 0, and we're
going to use y to denote the output label.
➤ In the logistic model, the log-odds (the logarithm of the odds)
for the value labeled "1" is a linear combination of one or
more independent variables (“predictors").
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 11
LOGISTIC REGRESSION: ASSUMPTIONS
➤ First, binary logistic regression requires the dependent variable to be binary and ordinal
logistic regression requires the dependent variable to be ordinal.
➤ Second, logistic regression requires the observations to be independent of each other.  In
other words, the observations should not come from repeated measurements or
matched data.
➤ Third, logistic regression requires there to be little or no multicollinearity among the
independent variables.  This means that the independent variables should not be too
highly correlated with each other.
➤ Fourth, logistic regression assumes linearity of independent variables and log odds. 
Although the dependent and independent variables do not have to be related linearly, it
requires that the independent variables are linearly related to the log odds.
➤ Finally, logistic regression typically requires a large sample size. A general guideline is
that you need a minimum of 10 cases with the least frequent outcome for each
independent variable in your model. For example, if you have 5 independent variables
and the expected probability of your least frequent outcome is .10, then you would need
a minimum sample size of 500 (10*5 / .10).
source: https://www.statisticssolutions.com/assumptions-of-logistic-regression/ 12
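As a quick check of the rule of thumb above, here is a small sketch that computes the suggested minimum sample size from the number of predictors and the least-frequent-outcome probability (the function and variable names are mine):

```python
def min_sample_size(num_predictors: int, least_frequent_prob: float,
                    cases_per_predictor: int = 10) -> float:
    """Rule of thumb: 10 cases of the least frequent outcome per predictor."""
    return cases_per_predictor * num_predictors / least_frequent_prob

# Example from the slide: 5 predictors, least frequent outcome probability 0.10.
print(min_sample_size(5, 0.10))  # 500.0
```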
LOGISTIC ACTIVATION FUNCTION: SIGMOID
➤ The goal is to predict the target class y from input z. The probability
P(y=1|z) that input z is classified as class y=1 is represented by the output
ŷ of the sigmoid function, computed as ŷ = σ(z) = 1 / (1 + e^(−z)).
➤ Note that the input z to the logistic function corresponds to the log odds ratio:

z = log( P(y=1|z) / (1 − P(y=1|z)) )

➤ This means that the log odds ratio changes linearly with z. Furthermore, since
z = wT ⋅ x, input z changes linearly with the parameters w and the
input samples x. This linearity property is a requirement for logistic
regression.
source: https://peterroelants.github.io/posts/cross-entropy-logistic/ 13
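A minimal sketch of the sigmoid described above, assuming NumPy; it maps a real-valued z = wT ⋅ x to a probability ŷ = σ(z) between 0 and 1 (the weights and input are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z)); squashes the log-odds z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.2])   # example weights (illustrative)
x = np.array([2.0, 1.0])    # one input sample
z = w @ x                   # z = w^T . x
print(sigmoid(z))           # P(y=1 | z), about 0.45 here
```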
SIGMOID FUNCTION: DECONSTRUCTED
➤ Consider a model with two predictors, x1 and x2, and one
binary (Bernoulli) response variable Y. Then the general form
of the log-odds (here denoted by ℓ) is:

ℓ = log( P / (1 − P) ) = β0 + β1x1 + β2x2

➤ Exponentiating gives the odds, and solving for P gives the sigmoid:

P / (1 − P) = e^ℓ = e^(β0 + β1x1 + β2x2) = e^z,   where z = β0 + β1x1 + β2x2

P = e^z / (e^z + 1) = 1 / (1 + 1/e^z) = 1 / (1 + e^(−z))

➤ Note: when we look at neural nets, β0 will be modeled as
parameter b, and β1 and β2 will be modeled as w1 and w2, so that
z = b + w1x1 + w2x2.
source: https://en.wikipedia.org/wiki/Logistic_regression 14
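To see the algebra above numerically, this short sketch (with made-up β values) computes the probability both from the odds form e^z / (e^z + 1) and from the sigmoid form 1 / (1 + e^(−z)) and confirms they agree:

```python
import math

b0, w1, w2 = -1.0, 0.8, 0.3     # beta_0, beta_1, beta_2 (illustrative values)
x1, x2 = 2.0, 1.0               # one pair of predictor values

z = b0 + w1 * x1 + w2 * x2                      # the log-odds
p_from_odds = math.exp(z) / (math.exp(z) + 1)   # e^z / (e^z + 1)
p_sigmoid = 1 / (1 + math.exp(-z))              # 1 / (1 + e^-z)

print(z, p_from_odds, p_sigmoid)  # both probabilities are identical
```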
EXAMPLE OF BINARY CLASSIFICATION - IMAGE RECOGNITION
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 15
Is this a picture of a cat? Yes: 1, No: 0
NEURAL NET NOTATION
➤ Sigma(σ), in this context, is the activation function of a
node which defines the output of that node given an input or
set of inputs.
➤ The linear function, z, is the input and the activation, a, is the
output.
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 16
LOGISTIC ACTIVATION FUNCTION: SIGMOID
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 17
CROSS ENTROPY LOSS FUNCTION
➤ The loss function used to optimize the classification is the cross-entropy
loss function:

L(y, a) = −(1/N) ∑_{i=1}^{n} [ yi log(ai) + (1 − yi) log(1 − ai) ]

➤ The output of the model a = σ(z) can be interpreted as a probability a
that input z belongs to one class (y=1), or probability 1 − a that z belongs
to the other class (y=0).
➤ The neural network model will be optimized by maximizing the
likelihood that a given set of parameters θ of the model can result in a
prediction of the correct class of each input sample. The likelihood
maximization can be written as:

argmax_θ ℒ(θ|y, z) = argmax_θ ∏_{i=1}^{n} ℒ(θ|yi, zi)
source: https://peterroelants.github.io/posts/cross-entropy-logistic/ 18
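A minimal sketch of this loss, assuming NumPy; it averages the per-sample term yi log(ai) + (1 − yi) log(1 − ai) over the batch and negates it (the labels and predictions are made up):

```python
import numpy as np

def cross_entropy_loss(y, a, eps=1e-12):
    """L(y, a) = -(1/N) * sum_i [ y_i*log(a_i) + (1-y_i)*log(1-a_i) ]."""
    a = np.clip(a, eps, 1 - eps)           # avoid log(0)
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

y = np.array([1, 0, 1, 1])                 # true labels
a = np.array([0.9, 0.2, 0.7, 0.99])        # predicted P(y=1) for each sample
print(cross_entropy_loss(y, a))            # small, since predictions match labels
```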
WHY DO WE CARE ABOUT THE LIKELIHOOD FUNCTION?
➤ Why do we care about the likelihood function? Because it is
the best model for the use case at hand: we have this observed
data, outcomes and data inputs, but do not know anything
about the parameters that establish a relationship between
the two.
➤ The likelihood function (often simply the likelihood) is the
joint probability distribution of observed data expressed as a
function of statistical parameters. Given the outcome, x, and
parameter θ and continuous probability density function f, the
likelihood function is:
ℒ(θ|x) = fθ (x)
source: https://en.wikipedia.org/wiki/Likelihood_function 19
WHY DO WE CARE ABOUT THE LIKELIHOOD FUNCTION? (CONT.)
➤ The likelihood function describes the relative probability or
odds of obtaining the observed data for all permissible values
of the parameters, and is used to identify the particular
parameter values that are most plausible given the observed data.
➤ The likelihood function is a function of the parameter only,
with the data held as a fixed constant. It is the probability of
the data given the parameter value.
➤ Over the domain of permissible parameter values, the
likelihood function describes a surface.[5] The peak of that
surface, if it exists, identifies the point in the parameter space
that maximizes the likelihood; that is the value that is most
likely to be the parameter of the joint probability distribution
underlying the observed data.
source: https://en.wikipedia.org/wiki/Likelihood_function 20
CROSS ENTROPY LOSS FUNCTION CONT.
➤ The likelihood function can be written as a joint probability of
generating y and z, given parameters θ:

ℒ(θ|y, z) = P(y, z|θ)

➤ Since we are not interested in the probability of z, we can
reduce this to: ℒ(θ|y, z) = P(y|z, θ)
source: https://peterroelants.github.io/posts/cross-entropy-logistic/ 21
CROSS ENTROPY LOSS FUNCTION CONT.
➤ Since yi is a Bernoulli variable, and the probability P(y|z) is
fixed for a given θ with P(y = 1|z) = σ(z) = a, we can further simplify:

P(y|z) = ∏_{i=1}^{n} P(yi = 1|zi)^yi ⋅ (1 − P(yi = 1|zi))^(1−yi) = ∏_{i=1}^{n} ai^yi ⋅ (1 − ai)^(1−yi)

➤ Why is the above a product? Since the probability of y
given z is for a sample of size n, we have to account for the
probability of y=1 for each outcome in the sample. For
example: if the sample size is 3, the probability of y=1 for
each outcome is 0.9 given z, and we have three outcomes
where y=1, then the probability of getting y=1 for the whole sample,
given z, is 0.9 × 0.9 × 0.9 = 0.729.
source: https://peterroelants.github.io/posts/cross-entropy-logistic/ 22
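The 0.9 × 0.9 × 0.9 example above can be checked directly; this sketch evaluates the Bernoulli product ∏ ai^yi ⋅ (1 − ai)^(1−yi) for a tiny sample (the values are the slide's example):

```python
import numpy as np

y = np.array([1, 1, 1])          # three outcomes, all with y = 1
a = np.array([0.9, 0.9, 0.9])    # P(y_i = 1 | z_i) for each outcome

likelihood = np.prod(a**y * (1 - a)**(1 - y))
print(likelihood)                # 0.9 * 0.9 * 0.9 = 0.729
```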
BERNOULLI DISTRIBUTION
➤ The Bernoulli distribution is the discrete probability
distribution of a random variable which takes the value 1 with
probability p and the value 0 with probability q = 1 − p.
➤ It is the probability distribution of any single experiment that
asks a yes–no question. As such, it is a special case of the
binomial distribution where a single trial is conducted.
source:https://en.wikipedia.org/wiki/Bernoulli_distribution 23
CROSS ENTROPY LOSS FUNCTION CONT.
➤ Taking the log of the likelihood function results in a convex
loss function whose minimum value we can determine:

log ℒ(θ|y, z) = log ∏_{i=1}^{n} ai^yi ⋅ (1 − ai)^(1−yi) = ∑_{i=1}^{n} [ yi log(ai) + (1 − yi) log(1 − ai) ]

➤ Minimizing the negative of this function (minimizing the
negative log likelihood) corresponds to maximizing the
likelihood. This loss function L(y, a) is known as the cross-
entropy error (loss) function, also known as the log-loss. For a single sample:

L(y, a) = −log(ai)        if yi = 1
L(y, a) = −log(1 − ai)    if yi = 0

➤ Why the negative of the function? So we can minimize the loss,
or the difference between predicted and actual observations.
source: https://peterroelants.github.io/posts/cross-entropy-logistic/ 24
CROSS ENTROPY LOSS FUNCTION CONT.
➤ By minimizing the negative log probability, we will maximize
the log probability. And since y can only be 0 or 1, we can
write L(y, a) for a single sample as:

L(y, a) = −y log(a) − (1 − y) log(1 − a)

➤ Which gives the following if we sum over all n samples:

L(y, a) = −(1/N) ∑_{i=1}^{n} [ yi log(ai) + (1 − yi) log(1 − ai) ]

➤ So what we end up with is a loss function that is 0 if the
probability of predicting the correct class is 1, and goes to infinity
as the probability of predicting the correct class goes to 0.
source: https://peterroelants.github.io/posts/cross-entropy-logistic/ 25
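The last bullet is easy to see numerically: for a single positive example (y = 1) the loss is −log(a), which is 0 when a = 1 and blows up as a → 0. A tiny sketch:

```python
import numpy as np

def single_sample_loss(y, a):
    """Cross-entropy loss for one sample: -y*log(a) - (1-y)*log(1-a)."""
    return -y * np.log(a) - (1 - y) * np.log(1 - a)

for a in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(a, single_sample_loss(1, a))   # loss grows as P(correct class) shrinks
```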
MINIMIZING THE LOSS FUNCTION: GRADIENT DESCENT
➤ Recall that our goal is to minimize the loss function by
traversing the function’s surface.
➤ To minimize the loss function, we use the gradient descent
algorithm with respect to the parameters, w and b.
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 26
GRADIENT DESCENT - CONT.
➤ The gradient descent algorithm works by taking the gradient
(derivative) of the loss function L with respect to the
parameters, w and b, and updating the parameters in the direction
of the negative gradient (down along the loss function).
➤ What is the derivative of the loss function?

∂L/∂w = (ai − yi) ⋅ xi

➤ The parameters w are updated at every iteration k by taking steps
proportional to the negative of the gradient:

w(k + 1) = w(k) − Δw(k + 1)

➤ Δw is defined as: Δw = α ⋅ ∂L/∂w
source: https://peterroelants.github.io/posts/neural-network-implementation-part02/ 27
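A sketch of one gradient-descent update for the logistic-regression parameters, assuming NumPy (function name, data, and learning rate are mine); it averages the per-sample gradient xi ⋅ (ai − yi) over a small batch and applies Δw = α ⋅ ∂L/∂w, plus the analogous bias update:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_step(w, b, X, y, alpha=0.1):
    """One update of w and b along the negative gradient of the cross-entropy loss."""
    a = sigmoid(X @ w + b)                 # predictions for the whole batch
    dw = X.T @ (a - y) / len(y)            # dL/dw = mean_i x_i * (a_i - y_i)
    db = np.mean(a - y)                    # dL/db = mean_i (a_i - y_i)
    return w - alpha * dw, b - alpha * db  # step against the gradient

# Toy batch (illustrative values only).
X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
w, b = np.zeros(2), 0.0
w, b = gradient_descent_step(w, b, X, y)
print(w, b)
```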
GRADIENT DESCENT - CONT.
➤ Below is a diagram that shows the algorithm ‘moving down’ the
negative gradient in steps of size alpha (learning rate)
➤ Note: this function maps the parameters to the loss function:

J(w, b) = L(y, a)
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 28
WHY THE NEGATIVE GRADIENT?
➤ Because your goal is to minimize the loss function J(θ) =
J(w,b):
source: https://medium.com/@aerinykim/why-do-we-subtract-the-slope-a-in-gradient-descent-73c7368644fa 29
THE GRADIENT: DECONSTRUCTED
➤ Ok, so the gradient is the derivative of the loss function, which
is:

∂L/∂w = xi ⋅ (ai − yi)

➤ The next question is: WHY? How is this derivative
calculated?
30
CROSS ENTROPY LOSS & WEIGHTS: CALCULATING THE DERIVATIVE
Prove:

∂L/∂w = xi ⋅ (ai − yi)

Let’s break it down:

∂L/∂w = (∂L/∂a) ⋅ (∂a/∂z) ⋅ (∂z/∂w)

Let’s handle the easy one first:

∂z/∂w = ∂(x ⋅ w)/∂w = x
31
CROSS ENTROPY LOSS & ACTIVATION FUNCTION: CALCULATING THE DERIVATIVE
Let’s handle this one next: ∂L/∂a. Recall the chain rule and the loss function:

∂L/∂w = (∂L/∂a) ⋅ (∂a/∂z) ⋅ (∂z/∂w)

L(a, y) = −(y log(a) + (1 − y) log(1 − a))

∂L/∂a = ∂(−y log(a))/∂a + ∂(−(1 − y) log(1 − a))/∂a
      = −y/a + (1 − y)/(1 − a)
      = (−y(1 − a))/(a(1 − a)) + (a(1 − y))/(a(1 − a))
      = (−y(1 − a) + a(1 − y)) / (a(1 − a))
      = (ya − y + a − ay) / (a(1 − a))
      = (a − y) / (a(1 − a))
32
CROSS ENTROPY LOSS & SIGMOID FUNCTION: CALCULATING THE DERIVATIVE
Finally: ∂a/∂z. Recall that a is the sigmoid activation function:

a = σ(z) = 1 / (1 + e^(−z)) = (1 + e^(−z))^(−1)

Differentiate using the chain rule (−1 for the exponent and −1 for −z):

∂a/∂z = (1 + e^(−z))^(−2) ⋅ (−1)(−1)(e^(−z)) = e^(−z) / (1 + e^(−z))^2 = [1 / (1 + e^(−z))] ⋅ [e^(−z) / (1 + e^(−z))]

Also, note:

1 − a = 1 − 1/(1 + e^(−z)) = (1 + e^(−z))/(1 + e^(−z)) − 1/(1 + e^(−z)) = e^(−z) / (1 + e^(−z))

Substitute a and 1 − a:

∂a/∂z = [1 / (1 + e^(−z))] ⋅ [e^(−z) / (1 + e^(−z))] = a(1 − a)
33
LOSS FUNCTION WITH RESPECT TO Z: DERIVATIVE
Putting it all together:

∂L/∂w = (∂L/∂a) ⋅ (∂a/∂z) ⋅ (∂z/∂w) = [(a − y) / (a(1 − a))] ⋅ a(1 − a) ⋅ xi = xi ⋅ (a − y)
34
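As a sanity check on the derivation, the sketch below compares the analytic gradient xi ⋅ (a − y) against a finite-difference estimate of ∂L/∂w for one sample (a standard numerical check, not part of the slides; the sample values are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x, y):
    a = sigmoid(np.dot(w, x))
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

x = np.array([0.7, -1.3])          # one input sample (illustrative)
y = 1.0                            # its label
w = np.array([0.4, 0.2])           # current weights

a = sigmoid(np.dot(w, x))
analytic = x * (a - y)             # dL/dw = x_i * (a - y)

eps = 1e-6                         # finite-difference step
numeric = np.array([
    (loss(w + eps * np.eye(2)[i], x, y) - loss(w - eps * np.eye(2)[i], x, y)) / (2 * eps)
    for i in range(2)
])
print(analytic, numeric)           # the two gradients should match closely
```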
DERIVATIVES AND BACKPROPAGATION
➤ Backpropagation is an iterative, recursive and efficient method for calculating
the weight updates needed to improve the network until it is able to perform the
task for which it is being trained.
➤ The important part is the blue text on the right: note how we are adjusting
the weights (w1, w2 and b) by subtracting the gradient (derivative), i.e., stepping
along the negative gradient.

∂L/∂w = xi ⋅ (ai − yi)
source: https://www.coursera.org/learn/neural-networks-deep-learning/ 35
NEURAL NETS: FULL CIRCLE
1. Start with values (often random) for the network parameters (wij weights and bj biases).
2. Take a set of examples of input data and pass them through the network to obtain their
prediction.
3. Compare these predictions obtained with the values of expected labels and calculate the
loss with them.
4. Perform the backpropagation in order to propagate this loss to each and every one of the
parameters that make up the model of the neural network.
5. Use this propagated information to update the parameters of the neural network with the
gradient descent in a way that the total loss is reduced and a better model is obtained.
6. Continue iterating in the previous steps until we consider that we have a good model.
source: towardsdatascience.com 36
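Putting the six steps above into code, here is a compact sketch of the whole loop for the logistic-regression "network" used throughout the deck (NumPy, made-up data; the variable names and learning rate are mine): it initializes parameters, runs forward passes, computes the cross-entropy loss, backpropagates the gradient, and updates with gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Hypothetical training data and random initial parameters (w weights, b bias).
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)    # a simple rule for the network to learn
w, b = rng.normal(size=2), 0.0
alpha = 0.5                                   # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(200):
    a = sigmoid(X @ w + b)                                   # 2. forward pass
    a_safe = np.clip(a, 1e-12, 1 - 1e-12)                    # avoid log(0)
    loss = -np.mean(y * np.log(a_safe) +
                    (1 - y) * np.log(1 - a_safe))            # 3. cross-entropy loss
    dw = X.T @ (a - y) / len(y)                              # 4. backpropagated gradient
    db = np.mean(a - y)
    w -= alpha * dw                                          # 5. gradient descent update
    b -= alpha * db
# 6. iterate until the model is good enough
print(loss, (np.round(sigmoid(X @ w + b)) == y).mean())      # final loss and accuracy
```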
WELL, WAS IT WORTH THE EFFORT?
➤ Consider this…
➤ One of the most famous and consequential meetings in the
history of science took place in the summer of 1684 when the
young astronomer Edmund Halley paid a visit to Isaac Newton.
After they had been some time together, the Dr asked him what
he thought the curve would be that would be described by the
planets supposing the force of attraction towards the sun to be
reciprocal to the square of their distance from it. Sir Isaac replied
immediately that it would be an ellipse. The Doctor,

struck with joy and amazement, asked him how he 

knew it.



Why, saith he, I have calculated it.
source: https://www.mathpages.com/home/kmath658/kmath658.htm 37
THANK YOU!
38
