Deep dive into the mathematics and algorithms of neural nets. Covers the sigmoid activation function, the cross-entropy loss function, gradient descent, and the derivatives used in backpropagation.
2. WHAT ARE WE GOING TO COVER?
➤ What is a Neural Network (NN)?
➤ What does it mean to ‘deconstruct’ a NN?
➤ How is this helpful to students and those new to the field?
➤ We will also touch on learning psychology - some thoughts on habits I found beneficial in learning this material
3. GETTING STARTED - THE “BIG PICTURE”
➤ How do AI, ML, Deep Learning and
Neural Nets relate to each other?
➤ “You can think of deep learning, machine learning and
artificial intelligence as a set of Russian dolls nested within
each other, beginning with the smallest and working out.
Deep learning is a subset of machine learning, and machine
learning is a subset of AI, which is an umbrella term for
any computer program that does something smart.”[1]
➤ Deep Learning is an ML method based on training Neural Networks.
[1] https://skymind.ai/wiki/ai-vs-machine-learning-vs-deep-learning
John McCarthy
4. WHAT IS A NEURAL NET?
➤ Definition: A neural network (NN) is an interconnected group of natural or artificial neurons that uses a mathematical or computational model for statistical data modeling or decision making.
➤ In most cases an ANN is an adaptive system that changes its
structure based on external or internal information that flows
through the network.
➤ They can be used to model complex relationships between
inputs and outputs or to find patterns in data.
source: https://en.wikipedia.org/wiki/Artificial_neural_network
5. APPLICATIONS FOR NEURAL NETS
➤ The tasks to which artificial neural networks are applied tend
to fall within the following broad categories:
➤ Function approximation, or regression analysis, including
time series prediction and modeling.
➤ Classification, including pattern and sequence recognition,
novelty detection and sequential decision making.
➤ Data processing, including filtering, clustering, blind signal separation, and compression.
source: https://en.wikipedia.org/wiki/Artificial_neural_network
6. LET’S LOOK AT A SIMPLE EXAMPLE
➤ Let's say you have a data set with six houses. You have the size of each house (in square feet or square meters) and its price. You want to fit a function to predict the price of a house as a function of its size. If you are familiar with linear regression, you might try to fit a straight line to these data to model the relationship.
➤ So you can think of this function that you've just fit to the housing prices as a very simple neural network.
source: https://www.coursera.org/learn/neural-networks-deep-learning/
7. HOUSING PREDICTION CONTINUED…
➤ Given these input features, the job of the neural network will be to predict the price y. Notice also that each of these circles (called hidden units in the neural network) takes as its input all four input features.
➤ The middle layer of the neural network is densely connected because every input feature is connected to every one of these circles in the middle. And the remarkable thing about neural networks is that, given enough training examples with both x and y, neural networks are remarkably good at figuring out functions that accurately map x to y.
source: https://www.coursera.org/learn/neural-networks-deep-learning/
8. SUPERVISED LEARNING
➤ One of the most exciting things about the rise of neural networks is that computers are now much better at interpreting unstructured data than they were just a few years ago. This creates opportunities for many new exciting applications that use speech recognition, image recognition, and natural language processing on text.
source: https://www.coursera.org/learn/neural-networks-deep-learning/
9. HOW DO NEURAL NETS ‘LEARN’?
1. Start with values (often random) for the network parameters (wij weights and bj biases).
2. Take a set of examples of input data and pass them through the network to obtain their prediction.
3. Compare these predictions with the expected labels and calculate the loss.
4. Perform backpropagation in order to propagate this loss back to each and every one of the parameters that make up the model of the neural network.
5. Use this propagated information to update the parameters of the neural network with gradient descent, so that the total loss is reduced and a better model is obtained.
6. Continue iterating the previous steps until we consider that we have a good model (a minimal code sketch of this loop follows below).
source: towardsdatascience.com
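To make the loop concrete, here is a minimal NumPy sketch of these six steps for a single sigmoid unit; the data, learning rate, and variable names are illustrative assumptions, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data (illustrative): 4 samples, 2 features, binary labels.
X = np.array([[0.5, 1.2], [1.0, 0.2], [0.1, 0.9], [1.5, 1.1]])
y = np.array([1.0, 0.0, 1.0, 0.0])

rng = np.random.default_rng(0)
w = rng.normal(size=2)        # 1. start with random weights
b = 0.0                       #    and a bias
alpha = 0.1                   # learning rate

for step in range(1000):      # 6. keep iterating
    a = sigmoid(X @ w + b)    # 2. pass inputs through the network
    loss = -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))  # 3. cross-entropy loss
    grad_w = X.T @ (a - y) / len(y)   # 4. backpropagate: dL/dw = x * (a - y)
    grad_b = np.mean(a - y)           #    dL/db = (a - y)
    w -= alpha * grad_w       # 5. gradient descent update
    b -= alpha * grad_b
```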
10. HOW DO NEURAL NETS ‘LEARN’?
source: towardsdatascience.com
11. LOGISTIC REGRESSION
➤ We will use logistic regression in order to make the ideas
easier to understand. Logistic regression is an algorithm for
binary classification.
➤ Here's an example of a binary classification problem. You might have an image as input and want to output a label recognizing this image as either a cat, in which case you output 1, or not-cat, in which case you output 0. We're going to use y to denote the output label.
➤ In the logistic model, the log-odds (the logarithm of the odds) for the value labeled "1" is a linear combination of one or more independent variables ("predictors").
source: https://www.coursera.org/learn/neural-networks-deep-learning/
12. LOGISTIC REGRESSION: ASSUMPTIONS
➤ First, binary logistic regression requires the dependent variable to be binary and ordinal
logistic regression requires the dependent variable to be ordinal.
➤ Second, logistic regression requires the observations to be independent of each other. In
other words, the observations should not come from repeated measurements or
matched data.
➤ Third, logistic regression requires there to be little or no multicollinearity among the
independent variables. This means that the independent variables should not be too
highly correlated with each other.
➤ Fourth, logistic regression assumes linearity of independent variables and log odds.
Although the dependent and independent variables do not have to be related linearly, it
requires that the independent variables are linearly related to the log odds.
➤ Finally, logistic regression typically requires a large sample size. A general guideline is that you need a minimum of 10 cases with the least frequent outcome for each independent variable in your model. For example, if you have 5 independent variables and the expected probability of your least frequent outcome is .10, then you would need a minimum sample size of 500 (10*5 / .10).
source: https://www.statisticssolutions.com/assumptions-of-logistic-regression/
13. LOGISTIC ACTIVATION FUNCTION: SIGMOID
➤ The goal is to predict the target class y from input z. The probability P(y=1|z) that input z is classified as class y=1 is represented by the output ŷ of the sigmoid function, computed as ŷ = σ(z).
➤ Note that the input z to the logistic function corresponds to the log odds ratio:

    z = \log \frac{P(y=1|z)}{1 - P(y=1|z)}

➤ This means that the log odds ratio changes linearly with z. Furthermore, since z = w^T \cdot x, this means input z changes linearly with the parameters w and input samples x. This linearity property is a requirement for logistic regression.
source: https://peterroelants.github.io/posts/cross-entropy-logistic/
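As a quick sanity check, here is the sigmoid as a few lines of NumPy; an illustrative sketch, with the printed values following directly from the formula above:

```python
import numpy as np

def sigmoid(z):
    """Squash the log-odds z into a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5 -> even odds
print(sigmoid(4.0))    # ~0.982 -> strongly favors class y=1
print(sigmoid(-4.0))   # ~0.018 -> strongly favors class y=0
```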
14. SIGMOID FUNCTION: DECONSTRUCTED
➤ Consider a model with two predictors, x1 and x2, and one binary (Bernoulli) response variable Y. Then the general form of the log-odds (here denoted by ℓ) is:

    \ell = \log \frac{P}{1 - P} = \beta_0 + \beta_1 x_1 + \beta_2 x_2

➤ Exponentiating both sides gives the odds, with z = \beta_0 + \beta_1 x_1 + \beta_2 x_2:

    \frac{P}{1 - P} = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2} = e^z

➤ Solving for P recovers the sigmoid function:

    P = \frac{e^z}{e^z + 1} = \frac{1}{1 + \frac{1}{e^z}} = \frac{1}{1 + e^{-z}}

➤ Note: when we look at neural nets, β0 will be modeled as the parameter b, and β1 and β2 will be modeled as w1 and w2:

    z = b + w_1 x_1 + w_2 x_2

source: https://en.wikipedia.org/wiki/Logistic_regression
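A small sketch of the same algebra with made-up coefficients, confirming that the odds P / (1 - P) equal e^z:

```python
import numpy as np

# Hypothetical coefficients and one sample; values are made up for illustration.
b, w1, w2 = -1.0, 0.8, 1.5    # beta_0, beta_1, beta_2 in neural-net notation
x1, x2 = 2.0, 0.5

z = b + w1 * x1 + w2 * x2      # log-odds: linear in the parameters
P = 1.0 / (1.0 + np.exp(-z))   # sigmoid recovers the probability

# The odds P / (1 - P) equal e^z, matching the derivation above.
print(P, P / (1 - P), np.exp(z))
```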
15. EXAMPLE OF BINARY CLASSIFICATION - IMAGE RECOGNITION
source: https://www.coursera.org/learn/neural-networks-deep-learning/
Is this a picture of a cat? Yes: 1, No: 0
16. NEURAL NET NOTATION
➤ Sigma(σ), in this context, is the activation function of a
node which defines the output of that node given an input or
set of inputs.
➤ The linear function, z, is the input and the activation, a, is the
output.
source: https://www.coursera.org/learn/neural-networks-deep-learning/
18. CROSS ENTROPY LOSS FUNCTION
➤ The loss function used to optimize the classification is the cross-entropy loss function.
➤ The output of the model, a = σ(z), can be interpreted as the probability a that input z belongs to one class (y=1), or the probability 1 - a that z belongs to the other class (y=0).
➤ The neural network model will be optimized by maximizing the likelihood that a given set of parameters θ of the model can result in a prediction of the correct class of each input sample. The likelihood maximization can be written as:

    \arg\max_\theta \mathcal{L}(\theta|y, z) = \arg\max_\theta \prod_{i=1}^{n} \mathcal{L}(\theta|y_i, z_i)

➤ Minimizing the negative log of this likelihood (derived over the next few slides) gives the cross-entropy loss function:

    L(y, a) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(a_i) + (1 - y_i) \log(1 - a_i) \right]

source: https://peterroelants.github.io/posts/cross-entropy-logistic/
19. WHY DO WE CARE ABOUT THE LIKELIHOOD FUNCTION?
➤ Why do we care about the likelihood function? Because it fits the use case at hand: we have observed data (outcomes and data inputs) but do not know anything about the parameters that establish a relationship between the two.
➤ The likelihood function (often simply the likelihood) is the joint probability distribution of observed data expressed as a function of statistical parameters. Given the outcome x, the parameter θ, and a continuous probability density function f, the likelihood function is:

    \mathcal{L}(\theta|x) = f_\theta(x)

source: https://en.wikipedia.org/wiki/Likelihood_function
20. WHY DO WE CARE ABOUT THE LIKELIHOOD FUNCTION? (CONT.)
➤ The likelihood function describes the relative probability or
odds of obtaining the observed data for all permissible values
of the parameters, and is used to identify the particular
parameter values that are most plausible given the observed data.
➤ The likelihood function is a function of the parameter only,
with the data held as a fixed constant. It is the probability of
the data given the parameter value.
➤ Over the domain of permissible parameter values, the
likelihood function describes a surface.[5] The peak of that
surface, if it exists, identifies the point in the parameter space
that maximizes the likelihood; that is the value that is most
likely to be the parameter of the joint probability distribution
underlying the observed data.
source: https://en.wikipedia.org/wiki/Likelihood_function
21. CROSS ENTROPY LOSS FUNCTION CONT.
➤ The likelihood function can be written as a joint probability of generating y and z, given parameters θ:

    \mathcal{L}(\theta|y, z) = P(y, z|\theta)

➤ Since we are not interested in the probability of z, we can reduce this to: P(y|z, \theta).
source: https://peterroelants.github.io/posts/cross-entropy-logistic/
22. CROSS ENTROPY LOSS FUNCTION CONT.
➤ Since y_i is a Bernoulli variable, and the probability P(y|z) is fixed for a given θ, we can write P(y = 1|z) = σ(z) = a and further simplify:

    P(y|z) = \prod_{i=1}^{n} P(y_i = 1|z_i)^{y_i} \cdot (1 - P(y_i = 1|z_i))^{1 - y_i}

    P(y|z) = \prod_{i=1}^{n} a_i^{y_i} \cdot (1 - a_i)^{1 - y_i}

➤ Why is the above a product? Since the probability of y given z is for a sample of size n, we have to account for the probability of each outcome in the sample. For example: if the sample size is 3, the probability of y=1 for each outcome is 0.9 given z, and we have three outcomes where y=1, then the probability of getting y=1 for the whole sample, given z, is 0.9 x 0.9 x 0.9 = 0.729.
source: https://peterroelants.github.io/posts/cross-entropy-logistic/
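The 0.9 x 0.9 x 0.9 example above can be checked with a few lines of NumPy (reusing the slide's own toy numbers):

```python
import numpy as np

# Three samples where P(y_i = 1 | z_i) = 0.9 and all observed y_i = 1.
a = np.array([0.9, 0.9, 0.9])
y = np.array([1, 1, 1])

# P(y|z) = prod_i a_i^y_i * (1 - a_i)^(1 - y_i)
likelihood = np.prod(a**y * (1 - a)**(1 - y))
print(likelihood)   # 0.9 * 0.9 * 0.9 = 0.729 (up to floating-point rounding)
```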
23. BERNOULLI DISTRIBUTION
➤ The Bernoulli distribution, is the discrete probability
distribution of a random variable which takes the value 1 with
probability p and the value 0 with probability q = 1 - p
➤ It is the probability distribution of any single experiment that
asks a yes–no question. As such, it is a special case of the
binomial distribution where a single trial is conducted.
source: https://en.wikipedia.org/wiki/Bernoulli_distribution
24. CROSS ENTROPY LOSS FUNCTION CONT.
➤ Taking the log of the likelihood function results in a convex loss function whose minimum value we can determine:

    \log \mathcal{L}(\theta|y, z) = \log \prod_{i=1}^{n} a_i^{y_i} \cdot (1 - a_i)^{1 - y_i} = \sum_{i=1}^{n} y_i \log(a_i) + (1 - y_i) \log(1 - a_i)

➤ Minimizing the negative of this function (minimizing the negative log likelihood) corresponds to maximizing the likelihood. This loss function L(y, a) is known as the cross-entropy error (loss) function, also known as the log-loss:

    L(y, a) =
    \begin{cases}
      -\log(a_i) & \text{if } y_i = 1 \\
      -\log(1 - a_i) & \text{if } y_i = 0
    \end{cases}

➤ Why the negative of the function? So we can minimize the loss, or the difference between predicted and actual observations.
source: https://peterroelants.github.io/posts/cross-entropy-logistic/
25. CROSS ENTROPY LOSS FUNCTION CONT.
➤ By minimizing the negative log probability, we will maximize the log probability. And since y can only be 0 or 1, we can write L(y, a) as:

    L(y, a) = -y \log(a) - (1 - y) \log(1 - a)

➤ Which gives the following if we sum over all n samples:

    L(y, a) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(a_i) + (1 - y_i) \log(1 - a_i) \right]

➤ So what we end up with is a loss function that is 0 if the probability of predicting the correct class is 1, and goes to infinity as the probability of predicting the correct class goes to 0.
source: https://peterroelants.github.io/posts/cross-entropy-logistic/
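A minimal sketch of this loss in NumPy; the predictions passed in are made-up values chosen to show both extremes:

```python
import numpy as np

def cross_entropy(y, a):
    """Mean cross-entropy (log-loss) over the samples."""
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

y = np.array([1.0, 0.0, 1.0])
print(cross_entropy(y, np.array([0.99, 0.01, 0.99])))  # ~0.01: confident and correct
print(cross_entropy(y, np.array([0.01, 0.99, 0.01])))  # ~4.6: confident and wrong
```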
26. MINIMIZING THE LOSS FUNCTION: GRADIENT DESCENT
➤ Recall that our goal is to minimize the loss function by traversing its surface.
➤ To minimize the loss function, we use the gradient descent algorithm with respect to the parameters, w and b.
source: https://www.coursera.org/learn/neural-networks-deep-learning/
27. GRADIENT DESCENT - CONT.
➤ The gradient descent algorithm works by taking the gradient (derivative) of the loss function L with respect to the parameters, w and b, and updating the parameters in the direction of the negative gradient (down along the loss function).
➤ What is the derivative of the loss function?

    \frac{\partial L}{\partial w} = (a_i - y_i) \cdot x_i

➤ The parameters w are updated every iteration k by taking steps proportional to the negative of the gradient:

    w(k + 1) = w(k) - \Delta w(k + 1)

➤ Δw is defined as:

    \Delta w = \alpha \frac{\partial L}{\partial w}

source: https://peterroelants.github.io/posts/neural-network-implementation-part02/
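A sketch of one such update in NumPy, assuming for brevity that the bias is folded into w; the data, step count, and learning rate are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(w, X, y, alpha):
    """One update w(k+1) = w(k) - alpha * dL/dw, averaging x_i * (a_i - y_i) over the batch."""
    a = sigmoid(X @ w)               # predictions a_i
    grad = X.T @ (a - y) / len(y)    # dL/dw = mean of x_i * (a_i - y_i)
    return w - alpha * grad          # step along the negative gradient

# Illustrative usage: each call moves w a little further down the loss surface.
X = np.array([[0.5, 1.2], [1.0, 0.2]])
y = np.array([1.0, 0.0])
w = np.zeros(2)
for _ in range(100):
    w = gradient_step(w, X, y, alpha=0.1)
print(w)
```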
28. GRADIENT DESCENT - CONT.
➤ Below is a diagram that shows the algorithm 'moving down' the loss surface along the negative gradient, in steps of size alpha (the learning rate).
➤ Note: this function maps the parameters to the loss function: J(w, b) = L(y, a)
source: https://www.coursera.org/learn/neural-networks-deep-learning/
29. WHY THE NEGATIVE GRADIENT?
➤ Because your goal is to minimize the loss function J(θ) =
J(w,b):
source: https://medium.com/@aerinykim/why-do-we-subtract-the-slope-a-in-gradient-descent-73c7368644fa
30. THE GRADIENT: DECONSTRUCTED
➤ Ok, so the gradient is the derivative of the loss function, which is:

    \frac{\partial L}{\partial w} = x_i \cdot (a_i - y_i)

➤ The next question is: WHY? How is this derivative calculated?
31. CROSS ENTROPY LOSS & WEIGHTS: CALCULATING THE DERIVATIVE
Prove:

    \frac{\partial L}{\partial w} = x_i \cdot (a_i - y_i)

Let's break it down:

    \frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}

Let's handle the easy one first:

    \frac{\partial z}{\partial w} = \frac{\partial (x \cdot w)}{\partial w} = x
32. CROSS ENTROPY LOSS & ACTIVATION FUNCTION: CALCULATING THE DERIVATIVE
Let's handle this one next: \frac{\partial L}{\partial a}

Recall the loss function:

    L(a, y) = -(y \log(a) + (1 - y) \log(1 - a))

    \frac{\partial L}{\partial a} = \frac{\partial (-y \log(a))}{\partial a} + \frac{\partial (-(1 - y) \log(1 - a))}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a}

Putting both terms over the common denominator a(1 - a):

    = \frac{-y(1 - a) + a(1 - y)}{a(1 - a)} = \frac{ya - y + a - ay}{a(1 - a)} = \frac{a - y}{a(1 - a)}
33. CROSS ENTROPY LOSS & SIGMOID FUNCTION: CALCULATING THE DERIVATIVE
Finally: \frac{\partial a}{\partial z}

Recall that a is the sigmoid activation function:

    a = \sigma(z) = \frac{1}{1 + e^{-z}} = (1 + e^{-z})^{-1}

Apply the chain rule (-1 for the exponent and -1 for -z):

    \frac{\partial a}{\partial z} = (1 + e^{-z})^{-2}(-1)(-1)(e^{-z}) = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}

Also, note:

    1 - a = 1 - \frac{1}{1 + e^{-z}} = \frac{1 + e^{-z}}{1 + e^{-z}} - \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{1 + e^{-z}}

Substitute a and 1 - a:

    \frac{\partial a}{\partial z} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = a(1 - a)
34. LOSS FUNCTION WITH RESPECT TO Z: DERIVATIVE
Putting it all together:

    \frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w} = x_i \cdot a(1 - a) \cdot \frac{a - y}{a(1 - a)} = x_i \cdot (a - y)
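As a check on the derivation, the sketch below compares this analytic gradient against a numerical (finite-difference) gradient; the values of x, y, and w are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x, y):
    """Cross-entropy loss for a single sample x with label y."""
    a = sigmoid(np.dot(x, w))
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

x = np.array([0.7, -1.2])   # made-up sample
y = 1.0
w = np.array([0.3, 0.5])    # made-up weights
eps = 1e-6

a = sigmoid(np.dot(x, w))
analytic = x * (a - y)      # dL/dw = x * (a - y), as derived above
numeric = np.array([(loss(w + eps * np.eye(2)[j], x, y) -
                     loss(w - eps * np.eye(2)[j], x, y)) / (2 * eps)
                    for j in range(2)])
print(analytic)
print(numeric)              # should match the analytic gradient closely
```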
35. DERIVATIVES AND BACKPROPAGATION
➤ Backpropagation is an iterative, recursive and efficient method for calculating the weight updates needed to improve the network until it is able to perform the task for which it is being trained.
➤ The important part is the blue text on the right: note how we are adjusting the weights (w1, w2 and b) by subtracting the gradient (derivative), i.e., moving along the negative gradient:

    \frac{\partial L}{\partial w} = x_i \cdot (a_i - y_i)

source: https://www.coursera.org/learn/neural-networks-deep-learning/
36. NEURAL NETS: FULL CIRCLE
1. Start with values (often random) for the network parameters (wij weights and bj biases).
2. Take a set of examples of input data and pass them through the network to obtain their prediction.
3. Compare these predictions with the expected labels and calculate the loss.
4. Perform backpropagation in order to propagate this loss back to each and every one of the parameters that make up the model of the neural network.
5. Use this propagated information to update the parameters of the neural network with gradient descent, so that the total loss is reduced and a better model is obtained.
6. Continue iterating the previous steps until we consider that we have a good model.
source: towardsdatascience.com
37. WELL, WAS IT WORTH THE EFFORT?
➤ Consider this…
➤ One of the most famous and consequential meetings in the history of science took place in the summer of 1684, when the young astronomer Edmond Halley paid a visit to Isaac Newton. After they had been some time together, the Dr asked him what he thought the curve would be that would be described by the planets, supposing the force of attraction towards the sun to be reciprocal to the square of their distance from it. Sir Isaac replied immediately that it would be an ellipse. The Doctor, struck with joy and amazement, asked him how he knew it.
➤ Why, saith he, I have calculated it.
source: https://www.mathpages.com/home/kmath658/kmath658.htm