Alex Mirugwe
Victoria University
Artificial Neural Networks
Outline
The objective of this part of the Supervised Learning lectures is to gain
an understanding of:
• Background for ANNs
• How ANNs for regression and classification problems work
• Perceptron learning algorithm
• Gradient descent algorithm
• Stochastic gradient descent algorithm
• How to analyze datasets with ANNs in R
• How to interpret the results
Resources
• The Elements of Statistical Learning, Chapter 11 Neural Networks, p. 389
Biological Neuron and
Links to Perceptrons
Introduction
• An Artificial Neural Network (ANN) models the relationship between a
set of input signals (features) and an output signal (y variable) using a
model derived from our understanding of how a biological brain responds
to stimuli from sensory inputs.
• Just as a brain uses a network of interconnected cells called neurons to
create a massive parallel processor, an ANN uses a network of artificial
neurons or nodes to solve learning problems.
• Before we explain ANNs, let us understand how the biological brain
works.
Neurons
Ref: https://www.verywellmind.com/what-is-a-neuron-2794890
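The Neuron
[Figure: a single neuron with inputs $x_1, \dots, x_5$, weights $w_1, \dots, w_5$ and output $\hat{y}$]
$Z = \sum_{j=1}^{p} w_j x_j = w_1 x_1 + w_2 x_2 + \dots + w_5 x_5$
$\hat{y} = \begin{cases} 1 & \text{if } Z \ge \theta \\ 0 & \text{if } Z < \theta \end{cases}$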
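Simple example
• $\mathbf{w} = (w_0, w_1, w_2, w_3)^\top = (1, 2, -1, 0.5)^\top$, the $((p+1) \times 1)$ weights matrix
• $\mathbf{x} = \begin{pmatrix} 1 & 1 & 0 & -1 \\ 1 & 2 & 1 & 0 \\ 1 & 3 & 1 & 2 \\ 1 & 4 & 0 & -2 \\ 1 & 5 & 1 & 1 \end{pmatrix}$, the $(n \times (p+1))$ input matrix
• $\mathbf{y} = (1, 0, 0, 1, 1)^\top$, the $(n \times 1)$ output matrix
• $\hat{\mathbf{y}} = (?, ?, ?, ?, ?)^\top$, the $(n \times 1)$ predicted output matrix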
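Simple example
[Figure: inputs $x_0, \dots, x_3$ with weights $w_0 = 1$, $w_1 = 2$, $w_2 = -1$, $w_3 = 0.5$ feeding $Z$, $\phi(Z)$ and the output $\hat{y} = 1$]
For the first row of $\mathbf{x}$, $(1, 1, 0, -1)$:
$Z = \sum_{j=0}^{p} w_j x_j = 1 \cdot 1 + 1 \cdot 2 + 0 \cdot (-1) + (-1) \cdot 0.5 = 2.5$
$\phi(Z) = \begin{cases} 1 & \text{if } Z = \sum_{j=0}^{p} w_j x_j \ge 0 \\ 0 & \text{otherwise} \end{cases}$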
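Simple example
[Figure: inputs $x_0, \dots, x_3$ with weights $w_0 = 1$, $w_1 = 2$, $w_2 = -1$, $w_3 = 0.5$ feeding $Z$, $\phi(Z)$ and the output $\hat{y} = 1$]
For the third row of $\mathbf{x}$, $(1, 3, 1, 2)$:
$Z = \sum_{j=0}^{p} w_j x_j = 1 \cdot 1 + 3 \cdot 2 + 1 \cdot (-1) + 2 \cdot 0.5 = 7$
$\phi(Z) = \begin{cases} 1 & \text{if } Z = \sum_{j=0}^{p} w_j x_j \ge 0 \\ 0 & \text{otherwise} \end{cases}$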
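Simple example
• $\mathbf{w} = (w_0, w_1, w_2, w_3)^\top = (1, 2, -1, 0.5)^\top$, the $((p+1) \times 1)$ weights matrix
• $\mathbf{x} = \begin{pmatrix} 1 & 1 & 0 & -1 \\ 1 & 2 & 1 & 0 \\ 1 & 3 & 1 & 2 \\ 1 & 4 & 0 & -2 \\ 1 & 5 & 1 & 1 \end{pmatrix}$, the $(n \times (p+1))$ input matrix
• $\mathbf{y} = (1, 0, 0, 1, 1)^\top$, the $(n \times 1)$ output matrix
• $\hat{\mathbf{y}} = (1, 1, 1, ?, ?)^\top$, the $(n \times 1)$ predicted output matrix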
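Simple example
[Figure: inputs $x_0, \dots, x_3$ with weights $w_0 = 1$, $w_1 = 2$, $w_2 = -1$, $w_3 = 0.5$ feeding $Z$, $\phi(Z)$ and the output $\hat{y} = 1$]
For the fourth row of $\mathbf{x}$, $(1, 4, 0, -2)$:
$Z = \sum_{j=0}^{p} w_j x_j = 1 \cdot 1 + 4 \cdot 2 + 0 \cdot (-1) + (-2) \cdot 0.5 = 8$
$\phi(Z) = \begin{cases} 1 & \text{if } Z = \sum_{j=0}^{p} w_j x_j \ge 0 \\ 0 & \text{otherwise} \end{cases}$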
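Simple example
• $\mathbf{w} = (w_0, w_1, w_2, w_3)^\top = (1, 2, -1, 0.5)^\top$, the $((p+1) \times 1)$ weights matrix
• $\mathbf{x} = \begin{pmatrix} 1 & 1 & 0 & -1 \\ 1 & 2 & 1 & 0 \\ 1 & 3 & 1 & 2 \\ 1 & 4 & 0 & -2 \\ 1 & 5 & 1 & 1 \end{pmatrix}$, the $(n \times (p+1))$ input matrix
• $\mathbf{y} = (1, 0, 0, 1, 1)^\top$, the $(n \times 1)$ output matrix
• $\hat{\mathbf{y}} = (1, 1, 1, 1, ?)^\top$, the $(n \times 1)$ predicted output matrix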
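Simple example
[Figure: inputs $x_0, \dots, x_3$ with weights $w_0 = 1$, $w_1 = 2$, $w_2 = -1$, $w_3 = 0.5$ feeding $Z$, $\phi(Z)$ and the output $\hat{y} = 1$]
For the fifth row of $\mathbf{x}$, $(1, 5, 1, 1)$:
$Z = \sum_{j=0}^{p} w_j x_j = 1 \cdot 1 + 5 \cdot 2 + 1 \cdot (-1) + 1 \cdot 0.5 = 10.5$
$\phi(Z) = \begin{cases} 1 & \text{if } Z = \sum_{j=0}^{p} w_j x_j \ge 0 \\ 0 & \text{otherwise} \end{cases}$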
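Simple example
• $\mathbf{w} = (w_0, w_1, w_2, w_3)^\top = (1, 2, -1, 0.5)^\top$, the $((p+1) \times 1)$ weights matrix
• $\mathbf{x} = \begin{pmatrix} 1 & 1 & 0 & -1 \\ 1 & 2 & 1 & 0 \\ 1 & 3 & 1 & 2 \\ 1 & 4 & 0 & -2 \\ 1 & 5 & 1 & 1 \end{pmatrix}$, the $(n \times (p+1))$ input matrix
• $\mathbf{y} = (1, 0, 0, 1, 1)^\top$, the $(n \times 1)$ output matrix
• $\hat{\mathbf{y}} = (1, 1, 1, 1, 1)^\top$, the $(n \times 1)$ predicted output matrix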
• It is clear that this set of weights does not achieve a good prediction, so the
weights need to be updated.
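The forward pass of this simple example can be reproduced in a few lines. This is a minimal sketch in plain Python rather than the R code the lectures refer to. The target vector `y` is an assumption, read from the partly garbled slide text as (1, 0, 0, 1, 1), and the helper names `net_input` and `step` are illustrative.

```python
# Weights and inputs copied from the slide; the first column x0 = 1 is the bias input.
w = [1.0, 2.0, -1.0, 0.5]
X = [
    [1, 1, 0, -1],
    [1, 2, 1, 0],
    [1, 3, 1, 2],
    [1, 4, 0, -2],
    [1, 5, 1, 1],
]
y = [1, 0, 0, 1, 1]  # assumption: targets as reconstructed from the slide


def net_input(row, weights):
    """Z = sum_j w_j * x_j for one sample."""
    return sum(w_j * x_j for w_j, x_j in zip(weights, row))


def step(z):
    """Binary step activation: 1 if Z >= 0, otherwise 0."""
    return 1 if z >= 0 else 0


Z = [net_input(row, w) for row in X]
y_hat = [step(z) for z in Z]
n_errors = sum(1 for target, pred in zip(y, y_hat) if target != pred)

print(Z)         # [2.5, 4.0, 7.0, 8.0, 10.5]
print(y_hat)     # [1, 1, 1, 1, 1]
print(n_errors)  # 2
```

Every net input is positive, so the step function predicts 1 for all five samples, whatever the targets are.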
Perceptron Learning
Algorithm
How do we find the weights, w?
• 1) Perceptron Learning Algorithm:
– Step 0: Training begins by assigning some initial random values to the
network parameters. A good initial heuristic is to start with the average of the
positive input vectors minus the average of the negative input vectors. In many
cases this yields an initial vector near the solution region.
– Step 1: Present the input vectors to the network and apply the activation
function (FORWARD PROPAGATION).
– Step 2: Update the weights according to the following rule
(BACKWARD PROPAGATION):
$w_j := w_j + \Delta w_j$
$\Delta w_j = \eta \left( y^{(i)} - \hat{y}^{(i)} \right) x_j^{(i)}$
Here $\eta$ is the learning rate, with $0 < \eta \le 1$, and $i = 1, \dots, n$ indexes the
samples.
• Continue the iteration until the perceptron classifies all training examples
correctly.
• $:=$ is an assignment.
Perceptron as a neural network: Going Backward
[Figure: input layer ($x_0, \dots, x_3$ with weights $w_0, \dots, w_3$) feeding the output layer ($Z$, $\phi(Z)$, $\hat{y}$)]
$error = y - \hat{y}$
Goal is to minimize the error
Rate of change: $\Delta w_j = \eta \left( y^{(i)} - \hat{y}^{(i)} \right) x_j^{(i)}$
• Scenario 1: The output is correct: $y^{(i)} = 1$, $\hat{y}^{(i)} = 1$
• Scenario 2: The output is incorrect: $y^{(i)} = 1$, $\hat{y}^{(i)} = 0$
• Scenario 1: $\Delta w_j = \eta (1 - 1) x_j^{(i)} = 0$, so no change is necessary
• Scenario 2: $\Delta w_j = \eta (1 - 0) x_j^{(i)} = \eta x_j^{(i)}$, so the weight update is
proportional to the value of $x_j^{(i)}$
• In summary: where the perceptron predicts the class label correctly, the weights
remain unchanged; where the perceptron predicts the class label incorrectly, the
weights are updated in proportion to the value of the input. The perceptron
learning algorithm selects a search direction in weight space according to the
incorrect classification of the last tested vector.
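The update rule and the two scenarios can be sketched as a short training loop. This is a minimal illustration under stated assumptions, not the lecture's R code. The toy data (a logical AND with a bias column), the zero starting weights in place of random ones, $\eta = 1$, and the names `predict` and `train_perceptron` are all illustrative choices.

```python
def predict(weights, row):
    """Forward pass: binary step applied to Z = sum_j w_j * x_j."""
    z = sum(w_j * x_j for w_j, x_j in zip(weights, row))
    return 1 if z >= 0 else 0


def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron learning: w_j := w_j + eta * (y - y_hat) * x_j, sample by sample."""
    weights = [0.0] * len(X[0])  # assumption: start from zeros instead of random values
    epochs_run = 0
    for _ in range(max_epochs):
        epochs_run += 1
        errors = 0
        for row, target in zip(X, y):
            update = eta * (target - predict(weights, row))
            if update != 0.0:
                # Misclassified sample: move each weight in proportion to its input.
                errors += 1
                weights = [w_j + update * x_j for w_j, x_j in zip(weights, row)]
        if errors == 0:
            break  # every training example is classified correctly
    return weights, epochs_run


# Hypothetical, linearly separable toy data: logical AND, with x0 = 1 as bias input.
X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
y = [0, 0, 0, 1]

weights, epochs_run = train_perceptron(X, y)
predictions = [predict(weights, row) for row in X]
print(weights, epochs_run, predictions)
```

A correctly classified sample gives `update == 0.0` and leaves the weights unchanged, matching Scenario 1. A misclassified sample changes each weight by $\pm \eta x_j$; Scenario 2 is the positive case.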
Perceptron Learning
Algorithm with Iris
Dataset
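Our good old Iris Dataset
• Check the perceptron learning algorithm R codes and the video.
[Figure: scatter plot of the Iris data on axes $x_1$ and $x_2$ with the decision boundary $w_0 + w_1 x_1 + w_2 x_2 = 0$]
Slope = $-w_1 / w_2$
Intercept = $-w_0 / w_2$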
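Solving $w_0 + w_1 x_1 + w_2 x_2 = 0$ for $x_2$ gives the slope and intercept of the decision boundary. A small sketch follows; the function name and the weight values are hypothetical and are not taken from the Iris fit.

```python
def decision_boundary(w0, w1, w2):
    """Return (slope, intercept) of the line x2 = slope * x1 + intercept."""
    if w2 == 0:
        raise ValueError("w2 must be non-zero for a boundary of the form x2 = f(x1)")
    return -w1 / w2, -w0 / w2


# Hypothetical weights, for illustration only.
slope, intercept = decision_boundary(-3.0, 2.0, 1.0)
print(slope, intercept)  # -2.0 3.0
```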
Multilayer Perceptrons
Linearly separable – inseparable cases
• It is important to note that the convergence of the perceptron is only
guaranteed if the two classes are linearly separable and the learning rate is
sufficiently small.
Multilayer perceptrons
• Single layer perceptrons are only capable of solving linearly separable
problems.
• In order to overcome the linearly inseparable problem, we can combine 2 or
more perceptrons, creating a multilayer perceptron.
• By joining several hyperplanes, we can then define a new set of
decision rules.
Example 1
[Figure: network with 3 neurons in layer L=1, 4 neurons in layer L=2 and 2 neurons in layer L=3]
• 3 × 4 = 12 w between L=1 and L=2; 4 × 2 = 8 w between L=2 and L=3
• In total we have 12+8 = 20 weights to optimize
Example 2
[Figure: network with 4 × 5 = 20 w followed by 5 × 1 = 5 w]
• In total we have 20+5 = 25 weights to optimize
Example 3
[Figure: network with 4 × 3 = 12 w, 3 × 3 = 9 w and 3 × 1 = 3 w]
• In total we have 12+9+3 = 24 weights to optimize
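The counting rule in these three examples is the sum of products of consecutive layer sizes. A minimal sketch, assuming the layer sizes are given exactly as drawn on the slides; `count_weights` is an illustrative name.

```python
def count_weights(layer_sizes):
    """Sum of (units in layer k) * (units in layer k+1) over consecutive layers."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


print(count_weights([3, 4, 2]))     # Example 1: 12 + 8 = 20
print(count_weights([4, 5, 1]))     # Example 2: 20 + 5 = 25
print(count_weights([4, 3, 3, 1]))  # Example 3: 12 + 9 + 3 = 24
```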
1 Input Layer, 1 Hidden Layer NN with 3 input variables
and 1 output variable (numeric output) – Going Forward
[Figure: input layer ($k=1$) with $x_0, \dots, x_3$; hidden layer ($k=2$) with units $Z_1^{(2)}, a_1^{(2)}$ and $Z_2^{(2)}, a_2^{(2)}$, reached through weights $w_{01}^{(1)}, \dots, w_{32}^{(1)}$; output layer ($k=3$) with $Z_1^{(3)}, a_1^{(3)} = \hat{y}$, reached through weights $w_{01}^{(2)}, w_{11}^{(2)}, w_{21}^{(2)}$]
$x_j = a_j^{(1)}$, $a_0 = 1$
$Z_j^{(2)} = \sum_i w_{ij}^{(1)} x_i$
$Z_j^{(k+1)} = \sum_i w_{ij}^{(k)} a_i^{(k)}$
$a_j^{(k+1)} = \phi\left( Z_j^{(k+1)} \right)$
1 Input Layer, 1 Hidden Layer NN with 3 input variables and
1 output variable (categorical output – with 3 categories)
[Figure: input layer ($k=1$) with $x_0, \dots, x_3$; hidden layer ($k=2$) with bias $a_0^{(2)} = 1$ and units $Z_1^{(2)}, a_1^{(2)}$ and $Z_2^{(2)}, a_2^{(2)}$, reached through weights $w_{01}^{(1)}, \dots, w_{32}^{(1)}$; output layer ($k=3$) with units $Z_j^{(3)}, a_j^{(3)} = \hat{y}_j$ for $j = 1, 2, 3$, reached through weights $w_{01}^{(2)}, \dots, w_{23}^{(2)}$]
• $\mathbf{W}^{(k)}$: matrix of weights controlling the function mapping from layer $k$ to layer
$k + 1$. (Here $k = 1, \dots, L$.)
• $\mathbf{Z}^{(k+1)}$: vector of linear combinations of weights and inputs from layer $k$:
$Z_j^{(k+1)} = \sum_i w_{ij}^{(k)} a_i^{(k)}$
where $a_j^{(1)} = x_j$, $a_0^{(k)} = 1$ (acts as a bias) and $a_j^{(L)} = \hat{y}_j$
(in case of a regression problem we have one output).
• $a_j^{(k)}$: activation of unit $j$ in layer $k$ with a pre-specified activation
function. (Here $j = 0, \dots, P^{(k)}$ and is specific to the layer.)
$a_j^{(k+1)} = \phi\left( Z_j^{(k+1)} \right)$
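The notation above describes a layer-by-layer forward pass. Below is a minimal sketch for the network with 3 inputs, 2 hidden units and 1 numeric output, in plain Python. The weight values are hypothetical, `W[i][j]` is taken to be the weight from unit $i$ of layer $k$ (with $i = 0$ the bias) to unit $j$ of layer $k+1$, and the hidden layer is assumed to use a sigmoid with a linear output unit.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def identity(z):
    return z


def layer_forward(a_prev, W, phi):
    """Compute a^(k+1) = phi(Z^(k+1)) with Z_j = sum_i W[i][j] * a_i and a_0 = 1."""
    a_with_bias = [1.0] + list(a_prev)
    n_out = len(W[0])
    Z = [
        sum(W[i][j] * a_with_bias[i] for i in range(len(a_with_bias)))
        for j in range(n_out)
    ]
    return [phi(z) for z in Z]


def forward(x, weights, activations):
    """Propagate the inputs x = a^(1) through all layers and return a^(L)."""
    a = list(x)
    for W, phi in zip(weights, activations):
        a = layer_forward(a, W, phi)
    return a


# Hypothetical weights: (bias + 3 inputs) x 2 hidden units, then (bias + 2 hidden) x 1 output.
W1 = [[0.0, 0.0], [1.0, -1.0], [0.0, 0.0], [0.0, 0.0]]
W2 = [[0.5], [1.0], [1.0]]

y_hat = forward([2.0, 0.0, 0.0], [W1, W2], [sigmoid, identity])
print(y_hat)
```

With these weights the two hidden activations are $\phi(2)$ and $\phi(-2)$, which sum to 1, so the output is approximately 0.5 + 1 = 1.5.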
Activation Functions
• In perceptrons, a small change in the weights or bias of any single perceptron
in the network can sometimes cause the output of that perceptron to
completely flip, say from 0 to 1. That flip may then cause the behaviour of
the rest of the network to completely change in some very complicated way.
• There are several different activation functions:
– Step function
– Constant function
– Threshold function (step)
– Threshold function (ramp)
– Linear function
– Sigmoid function
– Hyperbolic Tangent function
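Activation Function – Step Function (Symmetric)
$a = \phi(Z) = \begin{cases} 1 & \text{if } Z = \sum_{j=0}^{p} w_j x_j \ge 0 \\ -1 & \text{otherwise} \end{cases}$
Activation Function – Step Function (Binary)
$a = \phi(Z) = \begin{cases} 1 & \text{if } Z = \sum_{j=0}^{p} w_j x_j \ge 0 \\ 0 & \text{otherwise} \end{cases}$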
Activation Function – Linear Function
$a = \phi(Z) = Z$
Activation Function – Semilinear Function
$a = \phi(Z) = \begin{cases} 1 & \text{if } Z = \sum_{j=0}^{p} w_j x_j > 1 \\ Z & \text{if } -1 \le \sum_{j=0}^{p} w_j x_j \le 1 \\ -1 & \text{if } Z = \sum_{j=0}^{p} w_j x_j < -1 \end{cases}$
Activation Function – Sigmoid Function
$a = \phi(Z) = \frac{1}{1 + \exp(-\alpha Z)}$
We will focus on the sigmoid activation function for the moment.
Activation Function – Hyperbolic Tangent (Tanh)
Function
$a = \phi(Z) = \frac{\exp(Z) - \exp(-Z)}{\exp(Z) + \exp(-Z)}$
Activation Function – ReLU Function (Rectified
Linear Unit)
• Not differentiable at 0; however, it is differentiable everywhere else. At the
value of zero, the derivative can be set to either 0 or 1 by convention.
$a = \phi(Z) = \begin{cases} Z & \text{if } Z = \sum_{j=0}^{p} w_j x_j > 0 \\ 0 & \text{otherwise} \end{cases}$
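The activation functions on these slides are all one-liners. A minimal sketch in plain Python follows. The function names are illustrative, and `alpha` in `sigmoid` is the slope parameter from the sigmoid slide, not the learning rate.

```python
import math


def step_symmetric(z):
    return 1 if z >= 0 else -1


def step_binary(z):
    return 1 if z >= 0 else 0


def linear(z):
    return z


def semilinear(z):
    # Equal to Z on [-1, 1] and clipped to -1 or 1 outside that interval.
    return max(-1.0, min(1.0, z))


def sigmoid(z, alpha=1.0):
    return 1.0 / (1.0 + math.exp(-alpha * z))


def tanh(z):
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))


def relu(z):
    return z if z > 0 else 0.0


for phi in (step_symmetric, step_binary, linear, semilinear, sigmoid, tanh, relu):
    print(phi.__name__, [phi(z) for z in (-2.0, 0.0, 2.0)])
```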
How do we find the weights, w?
• One way of attacking the problem is to use calculus to try
to find the minimum analytically.
• We could compute derivatives and then try using them to
find places where C is an extremum. With some luck that
might work when C is a function of just one or a few
variables.
• But it'll turn into a nightmare when we have many more
variables.
• And for neural networks we'll often want far more
variables - the biggest neural networks have cost functions
which depend on billions of weights and biases in an
extremely complicated way.
• Using calculus to minimize that just won't work!
How do we find the weights, w? – Going Backward
• 2) Gradient Descent Algorithm:
– Step 0: Training begins by assigning some initial random values to the network
parameters.
– Step 1: Present the input vectors to the network and apply the activation function
(FORWARD PROPAGATION).
– Step 2: Calculate the error using a cost function:
$J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$
for regression problems (without any regularization)
$J(w) = -\frac{1}{2n} \sum_{i=1}^{n} \left[ y^{(i)} \log \hat{y}^{(i)} + \left( 1 - y^{(i)} \right) \log\left( 1 - \hat{y}^{(i)} \right) \right]$
for classification problems (without any regularization)
Our goal is to minimise the error $J(w)$ with respect to $w_j$.
– Step 3: Update the weights according to the following rule:
$w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$
Here $\alpha$ is the learning rate, with $0 < \alpha \le 1$.
• Continue the iteration until convergence.
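The two cost functions in Step 2 can be written directly from their definitions. This is a minimal sketch that keeps the $1/(2n)$ scaling used on the slide for both costs; many texts scale the cross-entropy by $1/n$ instead, which does not change the minimiser. The function names are illustrative.

```python
import math


def squared_error_cost(y, y_hat):
    """J(w) = 1/(2n) * sum_i (y_i - y_hat_i)^2, for regression."""
    n = len(y)
    return sum((t - p) ** 2 for t, p in zip(y, y_hat)) / (2 * n)


def cross_entropy_cost(y, y_hat):
    """J(w) = -1/(2n) * sum_i [y_i log(y_hat_i) + (1 - y_i) log(1 - y_hat_i)].

    Assumes every y_hat_i lies strictly between 0 and 1.
    """
    n = len(y)
    total = sum(
        t * math.log(p) + (1 - t) * math.log(1 - p) for t, p in zip(y, y_hat)
    )
    return -total / (2 * n)


print(squared_error_cost([0, 2, 4, 6], [0, 0, 0, 0]))  # 7.0
print(cross_entropy_cost([1, 0], [0.5, 0.5]))          # log(2) / 2
```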
How do we find the weights, w? – Going Backward
2) Gradient Descent Algorithm:
• Let us examine the weight update function:
• $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$
• The partial derivative answers the question “What is the slope of $J(w)$ at
the point $w$?”
• $\alpha$ determines the size of the change. If $\alpha$ is too
small, we take small steps towards the optimal values and it will take too long to
reach the optimum.
• If $\alpha$ is too big, we may overshoot the optimal values and fail to converge.
• Let us have a look at these concepts with a small example:
x: (size) c(0, 1, 2, 3)
y: (price) c(0, 2, 4, 6)
one feature to estimate a numeric variable.
Batch, Stochastic Gradient Descent
• In batch gradient descent learning, the weight update is
calculated based on all samples in the training set (instead of
updating the weights incrementally after each sample),
which is why this approach is also referred to as “batch”
gradient descent.
• Vector – matrix operations
Batch, Stochastic Gradient Descent
• Now imagine we have a very large dataset with millions of data points,
which is not uncommon in many machine learning applications. Running
batch gradient descent can be computationally quite costly in such scenarios
since we need to re-evaluate the whole training dataset each time we take
one step towards the global minimum.
• A popular alternative to the batch gradient descent algorithm is stochastic
gradient descent, sometimes also called iterative or on-line gradient descent.
Instead of updating the weights based on the sum of the accumulated errors
over all samples, we update the weights incrementally for each training
sample.
for (i in 1:n) {
  # calculate the error for sample i
  # calculate the derivatives
  # update the weights
}
Batch, Stochastic Gradient Descent
• A compromise between batch gradient descent and stochastic
gradient descent is the so-called mini-batch learning. In
stochastic (on-line) learning, a neural network learns from just one
training input at a time.
• Mini-batch learning can be understood as applying batch
gradient descent to smaller subsets of the training data, for
example, 50 samples at a time.
• By averaging over this small sample it turns out that we
can quickly get a good estimate of the true gradient,
and this helps speed up gradient descent, and thus
learning.
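Batch, stochastic and mini-batch gradient descent differ only in how many samples feed each weight update. The sketch below makes that explicit with a single `batch_size` parameter, applied to the small size/price example (x = 0, 1, 2, 3 and y = 0, 2, 4, 6) with a linear model $\hat{y} = w_0 + w_1 x$. The function name, the learning rate, the number of epochs and the fixed random seed are assumptions made for illustration.

```python
import random


def minibatch_gradient_descent(x, y, alpha=0.05, batch_size=1, epochs=2000, seed=0):
    """Fit y_hat = w0 + w1 * x by gradient descent on batches of the given size.

    batch_size = len(x) gives batch gradient descent, batch_size = 1 gives
    stochastic (on-line) gradient descent, anything in between is mini-batch.
    """
    rng = random.Random(seed)
    w0, w1 = 0.0, 0.0
    indices = list(range(len(x)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            m = len(batch)
            # Gradient of 1/(2m) * sum (y - y_hat)^2 over the current batch.
            g0 = -sum(y[i] - (w0 + w1 * x[i]) for i in batch) / m
            g1 = -sum((y[i] - (w0 + w1 * x[i])) * x[i] for i in batch) / m
            w0 -= alpha * g0
            w1 -= alpha * g1
    return w0, w1


x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 2.0, 4.0, 6.0]

for batch_size in (4, 2, 1):
    print(batch_size, minibatch_gradient_descent(x, y, batch_size=batch_size))
```

On this noise-free data all three variants should approach $w_0 = 0$, $w_1 = 2$. With noisy data the stochastic and mini-batch variants would keep fluctuating around the minimum.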
Final note on Gradient Descent:
Input variable preprocessing
• Gradient descent is one of the many algorithms that benefit from feature
scaling. Each input variable should be preprocessed so that its mean value,
averaged over the entire training sample, is close to zero, or else is
small compared to its standard deviation.
• In order to accelerate the back-propagation learning process, the
normalization of the inputs should also include two other measures (LeCun,
1993):
• The input variables contained in the training set should be uncorrelated; this
can be done by using principal-components analysis (USL).
• The decorrelated input variables should be scaled so that their covariances
are approximately equal, thereby ensuring that the different synaptic weights
in the network learn at approximately the same speed.
Final note on Gradient Descent:
Input variable preprocessing
• We will use a feature scaling method called standardization, which gives each
feature zero mean and unit variance, as in a standard normal distribution.
• The mean of each input feature is centered at value 0 and the feature column
has a standard deviation of 1:
$x_j^{st} = \frac{x_j - \bar{x}_j}{s_j}$
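Standardization of one feature column can be sketched with the standard library alone. The use of the sample standard deviation (divisor $n - 1$) is an assumption here; dividing by the population standard deviation is an equally common choice. The function name `standardize` is illustrative.

```python
import statistics


def standardize(column):
    """Return (x - mean) / sd for every value in one feature column."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)  # assumption: sample standard deviation (n - 1)
    return [(value - mean) / sd for value in column]


z = standardize([0, 1, 2, 3])
print(z)
print(statistics.mean(z), statistics.stdev(z))
```

After the transformation the column has mean 0 and standard deviation 1, up to floating point error.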
Hypothetical Data – The exact function is $\hat{y} = 0 + 2x$
> dat
  x y
1 0 0
2 1 2
3 2 4
4 3 6
[Figure: scatter plot of the data with the line $\hat{y} = 0 + 2x$]
1 input layer, 1 output layer
[Figure: input layer ($x_0$, $x_1$ with weights $w_0$, $w_1$) feeding the output layer ($Z$, $\phi(Z)$, $\hat{y}$)]
$Z = w_0 x_0 + w_1 x_1$
$\phi(Z) = Z$
$error = y - \hat{y}$
For: $w_0 = 0$, $w_1 = 0$: $J(0, 0) = 7$
> dat
  x y yhat=0+0*x   error
1 0 0 0+0*0 = 0    e1 = 0
2 1 2 0+0*1 = 0    e2 = 2
3 2 4 0+0*2 = 0    e3 = 4
4 3 6 0+0*3 = 0    e4 = 6
$J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 = \frac{1}{2 \cdot 4} \left( 0^2 + 2^2 + 4^2 + 6^2 \right) = 7$
For: $w_0 = 0$, $w_1 = 0.5$: $J(0, 0.5) = 3.9375$
> dat
  x y yhat=0+0.5*x     error
1 0 0 0+0.5*0 = 0      e1 = 0
2 1 2 0+0.5*1 = 0.5    e2 = 1.5
3 2 4 0+0.5*2 = 1      e3 = 3
4 3 6 0+0.5*3 = 1.5    e4 = 4.5
$J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 = \frac{1}{2 \cdot 4} \left( 0^2 + 1.5^2 + 3^2 + 4.5^2 \right) = 3.9375$
For: $w_0 = 0$, $w_1 = 1$: $J(0, 1) = 1.75$
> dat
  x y yhat=0+1*x   error
1 0 0 0+1*0 = 0    e1 = 0
2 1 2 0+1*1 = 1    e2 = 1
3 2 4 0+1*2 = 2    e3 = 2
4 3 6 0+1*3 = 3    e4 = 3
$J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 = \frac{1}{2 \cdot 4} \left( 0^2 + 1^2 + 2^2 + 3^2 \right) = 1.75$
For: $w_0 = 0$, $w_1 = 2$: $J(0, 2) = 0$
> dat
  x y yhat=0+2*x   error
1 0 0 0+2*0 = 0    e1 = 0
2 1 2 0+2*1 = 2    e2 = 0
3 2 4 0+2*2 = 4    e3 = 0
4 3 6 0+2*3 = 6    e4 = 0
$J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 = \frac{1}{2 \cdot 4} \left( 0^2 + 0^2 + 0^2 + 0^2 \right) = 0$
Cost function with respect to w1
[Figure sequence: the cost curve $J(w)$ plotted against $w_1$, with the current value of $w_1$ marked on each slide]
$w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$
• $w_1 = 0$, $w_1 = 0.5$, $w_1 = 1.25$: $\frac{\partial}{\partial w_j} J(w) < 0$, so the update increases $w_1$
• $w_1 = 2.75$, $w_1 = 2.5$: $\frac{\partial}{\partial w_j} J(w) > 0$, so the update decreases $w_1$
• $w_1 = 2$: $\frac{\partial}{\partial w_j} J(w) = 0$, the minimum, so the update leaves $w_1$ unchanged
Cost function with respect to w1, one w parameter to
optimize
[Figure: cost curve $J(w)$ with gradient descent steps towards the minimum]
Ref: Raschka, p. 35
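The cost values and the sign of the slope along $w_1$ can be checked numerically for the hypothetical data ($x = 0, 1, 2, 3$; $y = 0, 2, 4, 6$) with $w_0$ fixed at 0. This is a minimal sketch with illustrative function names; for $w_1 = 0.5$, direct evaluation gives a cost of 3.9375.

```python
x = [0, 1, 2, 3]
y = [0, 2, 4, 6]


def cost(w0, w1):
    """J(w) = 1/(2n) * sum_i (y_i - (w0 + w1 * x_i))^2."""
    n = len(x)
    return sum((y_i - (w0 + w1 * x_i)) ** 2 for x_i, y_i in zip(x, y)) / (2 * n)


def dcost_dw1(w0, w1):
    """dJ/dw1 = -1/n * sum_i (y_i - (w0 + w1 * x_i)) * x_i."""
    n = len(x)
    return -sum((y_i - (w0 + w1 * x_i)) * x_i for x_i, y_i in zip(x, y)) / n


for w1 in (0, 0.5, 1.25, 2, 2.5, 2.75):
    slope = dcost_dw1(0, w1)
    if slope < 0:
        direction = "increase w1"
    elif slope > 0:
        direction = "decrease w1"
    else:
        direction = "keep w1"
    print(w1, cost(0, w1), slope, direction)
```

The slope is negative to the left of $w_1 = 2$, zero at $w_1 = 2$ and positive to the right, so the update rule always moves $w_1$ towards 2.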
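Example 1
Gradient Descent in a very simple example:
1 input layer, 1 output layer, 1 x, 1 numeric y
• Consider the linear regression example:
• $J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$ and $\hat{y}^{(i)} = w_0 x_0 + w_1 x_1$
• We have 2 weights!
• $\frac{\partial}{\partial w_0} J(w)$ and $\frac{\partial}{\partial w_1} J(w)$ need to be calculated:
• $\nabla J = \left( \frac{\partial}{\partial w_0} J(w), \frac{\partial}{\partial w_1} J(w) \right)$
• $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$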
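• $J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$ and $\hat{y}^{(i)} = w_0 x_0 + w_1 x_1$
$\frac{\partial}{\partial w_0} J(w) = \frac{-2}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) \frac{\partial}{\partial w_0} \hat{y}^{(i)}(w) = \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) x_0$
• and
$\frac{\partial}{\partial w_1} J(w) = \frac{-2}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) \frac{\partial}{\partial w_1} \hat{y}^{(i)}(w) = \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) x_1$
$\frac{\partial}{\partial w_0} J(w) = \frac{-1}{n} \sum_{i=1}^{n} \left( y - w_0 x_0 - w_1 x_1 \right) x_0$
$\frac{\partial}{\partial w_1} J(w) = \frac{-1}{n} \sum_{i=1}^{n} \left( y - w_0 x_0 - w_1 x_1 \right) x_1$
$w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$
• $\frac{\partial}{\partial w} J(w) < 0$ means that we will increase the weight
• $\frac{\partial}{\partial w} J(w) > 0$ means that we will decrease the weight
• $\frac{\partial}{\partial w} J(w) = 0$ means that we will not change the weight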
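The two partial derivatives and the update rule can be combined into a short batch gradient descent loop for the hypothetical data ($x = 0, 1, 2, 3$; $y = 0, 2, 4, 6$; $x_0 = 1$). This is a minimal sketch. The zero starting weights, the learning rate of 0.1 and the iteration count are assumptions chosen for illustration.

```python
def gradient_descent(x, y, alpha=0.1, n_iter=1000):
    """Batch gradient descent for y_hat = w0 * 1 + w1 * x with squared-error cost."""
    n = len(x)
    w0, w1 = 0.0, 0.0  # assumption: start at zero instead of random values
    history = []
    for _ in range(n_iter):
        errors = [y_i - (w0 + w1 * x_i) for x_i, y_i in zip(x, y)]
        history.append(sum(e * e for e in errors) / (2 * n))  # J(w) before the update
        grad_w0 = -sum(errors) / n                            # dJ/dw0, with x0 = 1
        grad_w1 = -sum(e * x_i for e, x_i in zip(errors, x)) / n  # dJ/dw1
        w0 -= alpha * grad_w0
        w1 -= alpha * grad_w1
    return w0, w1, history


w0, w1, history = gradient_descent([0, 1, 2, 3], [0, 2, 4, 6])
print(w0, w1)
print(history[0], history[-1])
```

The first recorded cost is $J(0, 0) = 7$, and the weights should approach the exact function $\hat{y} = 0 + 2x$.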
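Example 2
Gradient Descent in a very simple example:
1 input layer, 1 hidden layer (2 neurons + bias), 1 output
layer, 1 x, 1 numeric y
[Figure: input layer with $x_0$, $x_1$; hidden layer with units $Z_1^{(2)}, a_1^{(2)}$ and $Z_2^{(2)}, a_2^{(2)}$, reached through weights $w_{01}^{(1)}, w_{11}^{(1)}, w_{02}^{(1)}, w_{12}^{(1)}$; output layer with $Z_1^{(3)}, a_1^{(3)} = \hat{y}$, reached through weights $w_{01}^{(2)}, w_{11}^{(2)}, w_{21}^{(2)}$]
$x_j = a_j^{(1)}$, $a_0 = 1$
$Z_j^{(2)} = \sum_i w_{ij}^{(1)} x_i$
$a_j^{(2)} = \phi\left( Z_j^{(2)} \right) = \frac{1}{1 + \exp\left( -Z_j^{(2)} \right)}$
$\hat{y} = a_1^{(3)} = Z_1^{(3)} = \sum_j w_{j1}^{(2)} a_j^{(2)}$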
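Gradient Descent in a very simple example:
1 input layer, 1 hidden layer (2 neurons + bias), 1
output layer, 1 x, 1 numeric y
• $J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$
• $\frac{\partial}{\partial w} J(w) = \frac{-2}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) \frac{\partial}{\partial w} \hat{y}^{(i)}(w)$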
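Gradient Descent in a very simple example:
1 input layer, 1 hidden layer (2 neurons + bias), 1
output layer, 1 x, 1 numeric y
• $\hat{y}^{(i)} = a_0^{(2)} w_{01}^{(2)} + a_1^{(2)} w_{11}^{(2)} + a_2^{(2)} w_{21}^{(2)}$ and $a_0^{(2)} = 1$
Output layer weights:
$\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{01}^{(2)}} = a_0^{(2)} = 1$
$\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{11}^{(2)}} = a_1^{(2)}$
$\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{21}^{(2)}} = a_2^{(2)}$
Hidden layer weights:
$\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{01}^{(1)}} = ?$
$\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{02}^{(1)}} = ?$
$\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{11}^{(1)}} = ?$
$\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{12}^{(1)}} = ?$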
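Update the weights from the hidden layer to the
output layer
• $w_{01}^{(2)} := w_{01}^{(2)} - \alpha \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)$
• $w_{11}^{(2)} := w_{11}^{(2)} - \alpha \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) a_1^{(2)}$
• $w_{21}^{(2)} := w_{21}^{(2)} - \alpha \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) a_2^{(2)}$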
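• $\hat{y}^{(i)} = 1 \cdot w_{01}^{(2)} + a_1^{(2)} w_{11}^{(2)} + a_2^{(2)} w_{21}^{(2)}$ and $a_0^{(2)} = 1$
• $a_1^{(2)} = \frac{1}{1 + \exp\left( -Z_1^{(2)} \right)}$, $a_2^{(2)} = \frac{1}{1 + \exp\left( -Z_2^{(2)} \right)}$
• $Z_1^{(2)} = w_{01}^{(1)} \cdot 1 + w_{11}^{(1)} x_1$, $Z_2^{(2)} = w_{02}^{(1)} \cdot 1 + w_{12}^{(1)} x_1$
Output layer weights:
$\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{01}^{(2)}} = 1$
$\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{11}^{(2)}} = a_1^{(2)}$
$\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{21}^{(2)}} = a_2^{(2)}$
Hidden layer weights:
$\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{01}^{(1)}} = \frac{\partial \hat{y}^{(i)}(w)}{\partial a_1^{(2)}} \frac{\partial a_1^{(2)}}{\partial Z_1^{(2)}} \frac{\partial Z_1^{(2)}}{\partial w_{01}^{(1)}} = w_{11}^{(2)} a_1^{(2)} \left( 1 - a_1^{(2)} \right) \cdot 1$
$\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{11}^{(1)}} = \frac{\partial \hat{y}^{(i)}(w)}{\partial a_1^{(2)}} \frac{\partial a_1^{(2)}}{\partial Z_1^{(2)}} \frac{\partial Z_1^{(2)}}{\partial w_{11}^{(1)}} = w_{11}^{(2)} a_1^{(2)} \left( 1 - a_1^{(2)} \right) x_1$
$\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{02}^{(1)}} = \frac{\partial \hat{y}^{(i)}(w)}{\partial a_2^{(2)}} \frac{\partial a_2^{(2)}}{\partial Z_2^{(2)}} \frac{\partial Z_2^{(2)}}{\partial w_{02}^{(1)}} = w_{21}^{(2)} a_2^{(2)} \left( 1 - a_2^{(2)} \right) \cdot 1$
$\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{12}^{(1)}} = \frac{\partial \hat{y}^{(i)}(w)}{\partial a_2^{(2)}} \frac{\partial a_2^{(2)}}{\partial Z_2^{(2)}} \frac{\partial Z_2^{(2)}}{\partial w_{12}^{(1)}} = w_{21}^{(2)} a_2^{(2)} \left( 1 - a_2^{(2)} \right) x_1$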
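Update the weights from the input layer to the
hidden layer
• $w_{01}^{(1)} := w_{01}^{(1)} - \alpha \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) w_{11}^{(2)} a_1^{(2)} \left( 1 - a_1^{(2)} \right)$
• $w_{11}^{(1)} := w_{11}^{(1)} - \alpha \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) w_{11}^{(2)} a_1^{(2)} \left( 1 - a_1^{(2)} \right) x_1$
• $w_{02}^{(1)} := w_{02}^{(1)} - \alpha \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) w_{21}^{(2)} a_2^{(2)} \left( 1 - a_2^{(2)} \right)$
• $w_{12}^{(1)} := w_{12}^{(1)} - \alpha \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) w_{21}^{(2)} a_2^{(2)} \left( 1 - a_2^{(2)} \right) x_1$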
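The chain-rule derivatives for the network with one input, two sigmoid hidden units and a linear output can be checked with a small sketch. The starting weights, the learning rate and the iteration count are hypothetical. `W1[i][j]` holds $w_{ij}^{(1)}$ (with $i = 0$ the bias and $j$ the zero-based hidden unit) and `W2[j]` holds $w_{j1}^{(2)}$.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x1, W1, W2):
    """Return hidden activations a1, a2 and the prediction y_hat for one sample."""
    a1 = sigmoid(W1[0][0] + W1[1][0] * x1)
    a2 = sigmoid(W1[0][1] + W1[1][1] * x1)
    y_hat = W2[0] + W2[1] * a1 + W2[2] * a2
    return a1, a2, y_hat


def cost(x, y, W1, W2):
    n = len(x)
    return sum((t - forward(v, W1, W2)[2]) ** 2 for v, t in zip(x, y)) / (2 * n)


def gradients(x, y, W1, W2):
    """Gradients of J(w), accumulated over all samples with the chain rule."""
    n = len(x)
    gW1 = [[0.0, 0.0], [0.0, 0.0]]
    gW2 = [0.0, 0.0, 0.0]
    for v, t in zip(x, y):
        a1, a2, y_hat = forward(v, W1, W2)
        delta = -(t - y_hat) / n  # the (-1/n)(y - y_hat) factor for this sample
        gW2[0] += delta
        gW2[1] += delta * a1
        gW2[2] += delta * a2
        d1 = delta * W2[1] * a1 * (1 - a1)  # through hidden unit 1
        d2 = delta * W2[2] * a2 * (1 - a2)  # through hidden unit 2
        gW1[0][0] += d1
        gW1[1][0] += d1 * v
        gW1[0][1] += d2
        gW1[1][1] += d2 * v
    return gW1, gW2


def train(x, y, W1, W2, alpha=0.01, n_iter=200):
    for _ in range(n_iter):
        gW1, gW2 = gradients(x, y, W1, W2)
        W1 = [[W1[i][j] - alpha * gW1[i][j] for j in range(2)] for i in range(2)]
        W2 = [W2[j] - alpha * gW2[j] for j in range(3)]
    return W1, W2


x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 2.0, 4.0, 6.0]
W1 = [[0.1, -0.2], [0.3, 0.4]]  # hypothetical starting weights
W2 = [0.0, 0.5, -0.5]

cost_before = cost(x, y, W1, W2)
W1_new, W2_new = train(x, y, W1, W2)
cost_after = cost(x, y, W1_new, W2_new)
print(cost_before, cost_after)
```

A finite-difference check of one hidden-layer weight against `gradients` is a quick way to confirm that the chain-rule expressions are implemented consistently.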
𝛼 Learning Rate
What does 𝛼 do?
• If $\alpha$ is too small, the rate of change in the
weights will be tiny. It will take too long to
reach the optimum solution.
• If $\alpha$ is too big, the rate of change in the
weights will be very big. We may never
find the optimum solution; our algorithm
may fail to converge.
𝛼, adaptive learning
• In stochastic gradient descent implementations, the fixed learning rate $\alpha$ is
often replaced by an adaptive learning rate that decreases over time, for
example,
$\frac{C_1}{\#\text{iterations} + C_2}$
• where $C_1$ and $C_2$ are constants. Note that stochastic gradient descent does
not reach the global minimum exactly but an area very close to it. By using an
adaptive learning rate, we can anneal further towards the
minimum.
• Ref: Raschka, p. 40
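The decaying schedule can be written as a one-line function. The constants used below are hypothetical values chosen only to show the shape of the decay.

```python
def adaptive_learning_rate(iteration, c1=1.0, c2=10.0):
    """Learning rate C1 / (#iterations + C2), which decreases as training proceeds."""
    return c1 / (iteration + c2)


rates = [adaptive_learning_rate(t) for t in range(5)]
print(rates)
```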
What if we have $w_0$ and $w_1$ together to change?
[Figure: left pane, 3D surface of $J(w_0, w_1)$ over $w_0$ and $w_1$; right pane, contour plot over $w_0$ and $w_1$]
• The cost function $J(w_0, w_1)$ will be a 3D surface plot (left pane)
• The contour plot shows the same cost along each contour (right pane)
Controlling the
Complexity of NNs
1) Regularization in
Neural Networks
Regularization in Neural Networks
• In multi-layer neural networks, the number of input and output units is
generally determined by the dimensionality of the data set.
• On the other hand, we are free to choose the number of hidden layer units (M).
We may typically have hundreds, thousands, or even billions of weights
that we need to optimize.
• Choose the number of hidden layer units (M) that gives the best
generalization performance, balancing underfitting against
overfitting.
• A network is said to generalize well when the input–output mapping
computed by the network is correct (or nearly so) for test data never used in
creating or training the network. Here, it is assumed that the test data are
drawn from the same population used to generate the training data.
Regularization in Neural Networks
• The generalization error, however, is not a simple function of the number of
hidden layer units (M) due to the presence of local minima in the error
function.
• Because each run starts from random values of the weight vector, we try
multiple random initializations for each hidden layer unit size considered,
over a range of values of M, and look at their effect on the error.
• In practice, one approach to choosing M is in fact to plot a graph of M
against the errors, and then to choose the specific solution having the smallest
validation set error.
Regularization in Neural Networks
• $J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$
• There are, however, other ways to control the complexity of a neural
network model in order to avoid over-fitting, such as adding a quadratic
regularizer (L2):
• $\tilde{J}(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 + \frac{\lambda}{2} \lVert w \rVert^2$
• This regularizer is also known as weight decay.
• The effective model complexity is then determined by the choice of the
regularization coefficient λ.
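A minimal sketch of the L2-regularized cost follows, assuming the penalty has the form $\frac{\lambda}{2} \lVert w \rVert^2$ added to the squared-error cost with its $1/(2n)$ scaling. The weights and the value of `lam` in the usage lines are hypothetical.

```python
def l2_regularized_cost(y, y_hat, weights, lam):
    """Squared-error cost plus the quadratic (weight decay) penalty lam/2 * ||w||^2."""
    n = len(y)
    data_term = sum((t - p) ** 2 for t, p in zip(y, y_hat)) / (2 * n)
    penalty = (lam / 2) * sum(w * w for w in weights)
    return data_term + penalty


# Hypothetical weights and regularization coefficient.
print(l2_regularized_cost([0, 2, 4, 6], [0, 0, 0, 0], [1.0, 2.0], lam=0.0))
print(l2_regularized_cost([0, 2, 4, 6], [0, 0, 0, 0], [1.0, 2.0], lam=0.1))
```

With the same predictions, a larger λ or larger weights increases the cost, which is what pushes the optimizer towards smaller weights.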
2) Early Stopping
Early Stopping in NNs
• An alternative to regularization as a way of controlling the effective
complexity of a network is the procedure of early stopping.
• The training of nonlinear network models corresponds to an iterative
reduction of the error function defined with respect to a set of training data.
• The error measured with respect to independent data, generally called a
validation set, often shows a decrease at first, followed by an increase as
the network starts to over-fit. Training can therefore be stopped at the point
of smallest error with respect to the validation data set, in order to obtain a
network having good generalization performance.
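Early stopping amounts to tracking the validation error per epoch and keeping the epoch with the smallest value. The sketch below adds a `patience` parameter, which is not from the slides: it stops once the validation error has failed to improve for that many epochs. The error values are hypothetical.

```python
def early_stopping_epoch(validation_errors, patience=2):
    """Return (best_epoch, best_error), stopping after `patience` epochs without improvement."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # validation error has stopped improving: the network starts to over-fit
    return best_epoch, best_error


# Hypothetical validation errors: decreasing at first, then increasing.
validation_errors = [0.9, 0.6, 0.45, 0.40, 0.42, 0.47, 0.55]
best_epoch, best_error = early_stopping_epoch(validation_errors)
print(best_epoch, best_error)  # 3 0.4
```

In practice the weights from the best epoch would be saved and restored when training stops.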
When to use NNs
• When dealing with unstructured datasets
• When you do not need interpretable results, for example when you just
want to classify your pictures based on cats and dogs, you don’t need to
know why the outcome is classified as a cat or a dog. You don’t need to
explain the relationships.
• When you have many features, with regularization
• When you have nonlinear relationships
Resources
• A free online book by Michael Nielsen (brilliant resource for partial
derivatives and gradient descent):
http://neuralnetworksanddeeplearning.com/
• The Elements of Statistical Learning, Trevor Hastie (p. 389)
• Pattern Recognition and Machine Learning, Christopher Bishop
(p. 227)
• Machine Learning with R, Brett Lantz (p. 219)
• Neural Networks: A Comprehensive Foundation, Simon S. Haykin
• Python Machine Learning, Sebastian Raschka (p. 17)
• Neural Network Design, Hagan, Demuth, Beale, De Jesus
(http://hagan.okstate.edu/nnd.html)
• https://github.com/stephencwelch/Neural-Networks-Demystified
• Of course, again, Prof. Patrick Henry Winston’s MIT YouTube lectures.

 
mini mental status format.docx
mini    mental       status     format.docxmini    mental       status     format.docx
mini mental status format.docx
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Concept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.CompdfConcept of Vouching. B.Com(Hons) /B.Compdf
Concept of Vouching. B.Com(Hons) /B.Compdf
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its Characteristics
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 
How to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptxHow to Make a Pirate ship Primary Education.pptx
How to Make a Pirate ship Primary Education.pptx
 
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
 
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data_Math 4-Q4 Week 5.pptx Steps in Collecting Data
_Math 4-Q4 Week 5.pptx Steps in Collecting Data
 
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
KSHARA STURA .pptx---KSHARA KARMA THERAPY (CAUSTIC THERAPY)————IMP.OF KSHARA ...
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 

Neural Networks

  • 2. Outline
      The objective of this part of the Supervised Learning lectures is to gain an understanding of:
      • Background for ANNs
      • How ANNs for regression and classification problems work
      • Perceptron learning algorithm
      • Gradient descent algorithm
      • Stochastic gradient descent algorithm
      • How to analyze datasets with ANNs in R
      • How to interpret the results
  • 3. Resources • Page 389, Chapter 11 Neural Networks
  • 5. Introduction
      • An Artificial Neural Network (ANN) models the relationship between a set of input signals (features) and an output signal (y variable) using a model derived from our understanding of how a biological brain responds to stimuli from sensory inputs.
      • Just as a brain uses a network of interconnected cells called neurons to create a massive parallel processor, an ANN uses a network of artificial neurons or nodes to solve learning problems.
      • Before we explain ANNs, let us understand how the biological brain works.
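  • 7. The Neuron
      [Figure: inputs $x_1, \dots, x_5$ with weights $w_1, \dots, w_5$ feeding a single neuron.]
      $Z = \sum_{j=1}^{p} w_j x_j = w_1 x_1 + w_2 x_2 + \cdots + w_5 x_5$
      $\hat{y} = 1$ if $Z \ge \theta$, and $\hat{y} = 0$ if $Z < \theta$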
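  • 12. Simple example
      • $\boldsymbol{w} = (w_0, w_1, w_2, w_3)^\top = (1, 2, -1, 0.5)^\top$: $((p+1) \times 1)$ weights matrix
      • $\boldsymbol{x}$: $(n \times (p+1))$ input matrix, one sample per row:
            1  1  0  -1
            1  2  1   0
            1  3  1   2
            1  4  0  -2
            1  5  1   1
      • $\boldsymbol{y} = (1, 0, 0, 1, 1)^\top$: $(n \times 1)$ output matrix
      • $\hat{\boldsymbol{y}} = (?, ?, ?, ?, ?)^\top$: $(n \times 1)$ predicted output matrix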
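  • 13. Simple example: first sample, $(x_0, x_1, x_2, x_3) = (1, 1, 0, -1)$, with $w_0 = 1$, $w_1 = 2$, $w_2 = -1$, $w_3 = 0.5$
      $Z = \sum_{j=0}^{p} w_j x_j = 1 \cdot 1 + 1 \cdot 2 + 0 \cdot (-1) + (-1) \cdot 0.5 = 2.5$
      $\phi(Z) = \begin{cases} 1 & \text{if } Z = \sum_{j=0}^{p} w_j x_j \ge 0 \\ 0 & \text{otherwise} \end{cases}$, so $\hat{y} = 1$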
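  • 17. Simple example: third sample, $(x_0, x_1, x_2, x_3) = (1, 3, 1, 2)$, with $w_0 = 1$, $w_1 = 2$, $w_2 = -1$, $w_3 = 0.5$
      $Z = \sum_{j=0}^{p} w_j x_j = 1 \cdot 1 + 3 \cdot 2 + 1 \cdot (-1) + 2 \cdot 0.5 = 7$
      $\phi(Z) = \begin{cases} 1 & \text{if } Z = \sum_{j=0}^{p} w_j x_j \ge 0 \\ 0 & \text{otherwise} \end{cases}$, so $\hat{y} = 1$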
  • 18. Simple example (same $\boldsymbol{w}$ and $\boldsymbol{x}$ as on slide 12)
      • $\boldsymbol{y} = (1, 0, 0, 1, 1)^\top$: $(n \times 1)$ output matrix
      • $\hat{\boldsymbol{y}} = (1, 1, 1, ?, ?)^\top$: $(n \times 1)$ predicted output matrix
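  • 19. Simple example: fourth sample, $(x_0, x_1, x_2, x_3) = (1, 4, 0, -2)$, with $w_0 = 1$, $w_1 = 2$, $w_2 = -1$, $w_3 = 0.5$
      $Z = \sum_{j=0}^{p} w_j x_j = 1 \cdot 1 + 4 \cdot 2 + 0 \cdot (-1) + (-2) \cdot 0.5 = 8$
      $\phi(Z) = \begin{cases} 1 & \text{if } Z = \sum_{j=0}^{p} w_j x_j \ge 0 \\ 0 & \text{otherwise} \end{cases}$, so $\hat{y} = 1$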
  • 20. Simple example (same $\boldsymbol{w}$ and $\boldsymbol{x}$ as on slide 12)
      • $\boldsymbol{y} = (1, 0, 0, 1, 1)^\top$: $(n \times 1)$ output matrix
      • $\hat{\boldsymbol{y}} = (1, 1, 1, 1, ?)^\top$: $(n \times 1)$ predicted output matrix
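  • 21. Simple example: fifth sample, $(x_0, x_1, x_2, x_3) = (1, 5, 1, 1)$, with $w_0 = 1$, $w_1 = 2$, $w_2 = -1$, $w_3 = 0.5$
      $Z = \sum_{j=0}^{p} w_j x_j = 1 \cdot 1 + 5 \cdot 2 + 1 \cdot (-1) + 1 \cdot 0.5 = 10.5$
      $\phi(Z) = \begin{cases} 1 & \text{if } Z = \sum_{j=0}^{p} w_j x_j \ge 0 \\ 0 & \text{otherwise} \end{cases}$, so $\hat{y} = 1$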
  • 22. Simple example (same $\boldsymbol{w}$ and $\boldsymbol{x}$ as on slide 12)
      • $\boldsymbol{y} = (1, 0, 0, 1, 1)^\top$: $(n \times 1)$ output matrix
      • $\hat{\boldsymbol{y}} = (1, 1, 1, 1, 1)^\top$: $(n \times 1)$ predicted output matrix
      • This set of weights does not achieve a good prediction: the second and third samples are misclassified. The weights need to be updated.
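The forward pass of the simple example can be reproduced in a few lines. The following is a minimal sketch in Python (not the R code used in the course); the helper names `net_input` and `step` are illustrative, and the observed outputs `y` are as read from the slides.

```python
# Weights and inputs from the "Simple example" slides.
w = [1.0, 2.0, -1.0, 0.5]
X = [
    [1, 1, 0, -1],
    [1, 2, 1, 0],
    [1, 3, 1, 2],
    [1, 4, 0, -2],
    [1, 5, 1, 1],
]
y = [1, 0, 0, 1, 1]  # observed outputs (as read from the slides)


def net_input(row, weights):
    """Weighted sum Z = sum_j w_j * x_j for one sample."""
    return sum(wj * xj for wj, xj in zip(weights, row))


def step(z):
    """Binary step activation: 1 if Z >= 0, otherwise 0."""
    return 1 if z >= 0 else 0


Z = [net_input(row, w) for row in X]
y_hat = [step(z) for z in Z]
# 1-based indices of the samples whose prediction differs from y
misclassified = [i + 1 for i, (t, p) in enumerate(zip(y, y_hat)) if t != p]

print("Z =", Z)
print("y_hat =", y_hat)
print("misclassified samples:", misclassified)
```

Every net input is positive, so every sample is predicted as class 1, which is why the samples with $y = 0$ are misclassified.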
  • 24. How do we find the weights, w?
      • 1) Perceptron Learning Algorithm:
        – Step 0: Training begins by assigning some initial random values to the network parameters. A good initial heuristic is to start with the average of the positive input vectors minus the average of the negative input vectors. In many cases this yields an initial vector near the solution region.
        – Step 1: Present the input vectors to the network and apply the activation function (FORWARD PROPAGATION).
        – Step 2: Update the weights according to the following rule (BACKWARD PROPAGATION):
          $w_j := w_j + \Delta w_j$, with $\Delta w_j = \eta \, (y^{(i)} - \hat{y}^{(i)}) \, x_j^{(i)}$
          Here $\eta$ is the learning rate, $0 < \eta \le 1$, and $i = 1, \dots, n$ indexes the samples.
      • Continue the iteration until the perceptron classifies all training examples correctly.
      • $:=$ denotes assignment.
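As an illustration of the three steps, here is a minimal Python sketch (not the course's R code) that applies the perceptron rule to the five-sample example, starting from the weights $(1, 2, -1, 0.5)$. The function names, the choice $\eta = 1$ and the `max_epochs` safety cap are assumptions made for this sketch.

```python
X = [
    [1, 1, 0, -1],
    [1, 2, 1, 0],
    [1, 3, 1, 2],
    [1, 4, 0, -2],
    [1, 5, 1, 1],
]
y = [1, 0, 0, 1, 1]  # observed outputs (as read from the slides)


def predict(row, weights):
    """Forward propagation: weighted sum followed by the binary step."""
    z = sum(wj * xj for wj, xj in zip(weights, row))
    return 1 if z >= 0 else 0


def train_perceptron(X, y, weights, eta=1.0, max_epochs=1000):
    """Repeat the update w_j := w_j + eta * (y - y_hat) * x_j sample by sample
    until one full pass over the data produces no misclassification."""
    weights = list(weights)
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for row, target in zip(X, y):
            update = eta * (target - predict(row, weights))
            if update != 0:
                errors += 1
                weights = [wj + update * xj for wj, xj in zip(weights, row)]
        if errors == 0:
            return weights, epoch
    return weights, max_epochs


weights, epochs = train_perceptron(X, y, [1.0, 2.0, -1.0, 0.5])
print(weights, epochs)
print([predict(row, weights) for row in X])
```

The loop stops only because these five samples happen to be linearly separable; on inseparable data it would run until `max_epochs`.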
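  • 25. Perceptron as a neural network: Going Backward
      [Figure: input layer $x_0, x_1, x_2, x_3$ with weights $w_0, w_1, w_2, w_3$ feeding $Z$ and $\phi(Z)$ in the output layer, which produces $\hat{y}$.]
      $error = y - \hat{y}$; the goal is to minimize the error.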
  • 26. Rate of change: $\Delta w_j = \eta \, (y^{(i)} - \hat{y}^{(i)}) \, x_j^{(i)}$
      • Scenario 1: The output is correct, $y^{(i)} = 1$, $\hat{y}^{(i)} = 1$. Then $\Delta w_j = \eta (1 - 1) x_j^{(i)} = 0$, so no change is necessary.
      • Scenario 2: The output is incorrect, $y^{(i)} = 1$, $\hat{y}^{(i)} = 0$. Then $\Delta w_j = \eta (1 - 0) x_j^{(i)} = \eta x_j^{(i)}$, so the weight update is proportional to the value of $x_j^{(i)}$.
      • In summary: where the perceptron predicts the class label correctly, the weights remain unchanged; where it predicts the class label incorrectly, the weights are updated in proportion to the value of the input. The perceptron learning algorithm selects a search direction in weight space according to the incorrect classification of the last tested vector.
  • 28. Our good old Iris Dataset
      [Figure: scatter plot of $x_2$ against $x_1$ with the decision boundary $w_0 + w_1 x_1 + w_2 x_2 = 0$.]
      Slope $= -w_1 / w_2$, Intercept $= -w_0 / w_2$
      • Check the perceptron learning algorithm R codes and the video.
  • 30. Linearly separable – inseparable cases • It is important to note that the convergence of the perceptron is only guaranteed if the two classes are linearly separable and the learning rate is sufficiently small.
  • 31. Multilayer perceptrons
      • Single layer perceptrons are only capable of solving linearly separable problems.
      • To overcome the linearly inseparable problem, we can join two or more perceptrons together to create a multilayer perceptron.
      • By joining several hyperplanes in this way, we can define a new set of decision rules.
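  • 32. Example 1
      [Figure: network with 3 neurons in layer L=1, 4 neurons in layer L=2 and 2 neurons in layer L=3.]
      • Between L=1 and L=2: $3 \times 4 = 12$ weights; between L=2 and L=3: $4 \times 2 = 8$ weights.
      • In total we have $12 + 8 = 20$ weights to optimize.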
  • 33. Example 2
      • $4 \times 5 = 20$ weights and $5 \times 1 = 5$ weights.
      • In total we have $20 + 5 = 25$ weights to optimize.
  • 34. Example 3
      • $4 \times 3 = 12$ weights, $3 \times 3 = 9$ weights and $3 \times 1 = 3$ weights.
      • In total we have $12 + 9 + 3 = 24$ weights to optimize.
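The counting in the three examples follows one pattern: multiply the sizes of each pair of adjacent layers and add the products. A small Python sketch of that rule follows; the function name `count_weights` is illustrative, and the layer sizes for Examples 2 and 3 are inferred from the products shown on the slides.

```python
def count_weights(layer_sizes):
    """Number of connection weights in a fully connected feed-forward network.

    layer_sizes lists the number of units per layer, input layer first.
    As on the slides, units are counted as drawn (no extra bias terms added).
    """
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


print(count_weights([3, 4, 2]))     # Example 1
print(count_weights([4, 5, 1]))     # Example 2
print(count_weights([4, 3, 3, 1]))  # Example 3
```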
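  • 35. 1 Input Layer, 1 Hidden Layer NN with 3 input variables and 1 output variable (numeric output) – Going Forward
      [Figure: input layer ($k=1$) with $x_0, x_1, x_2, x_3$; hidden layer ($k=2$) with units $Z_1^{(2)} \to a_1^{(2)}$ and $Z_2^{(2)} \to a_2^{(2)}$, reached through weights $w_{01}^{(1)}, w_{11}^{(1)}, w_{21}^{(1)}, w_{31}^{(1)}$ and $w_{02}^{(1)}, w_{12}^{(1)}, w_{22}^{(1)}, w_{32}^{(1)}$; output layer ($k=3$) with $Z_1^{(3)} \to a_1^{(3)} = \hat{y}$, reached through weights $w_{01}^{(2)}, w_{11}^{(2)}, w_{21}^{(2)}$.]
      $x_j = a_j^{(1)}$, $a_0 = 1$
      $Z_j^{(2)} = \boldsymbol{w}_j^{(1)} x$
      $Z_j^{(k+1)} = \boldsymbol{w}_j^{(k)} a^{(k)}$
      $a_j^{(k+1)} = \phi(Z_j^{(k+1)})$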
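  • 36. 1 Input Layer, 1 Hidden Layer NN with 3 input variables and 1 output variable (categorical output – with 3 categories)
      [Figure: input layer ($k=1$) with $x_0, x_1, x_2, x_3$; hidden layer ($k=2$) with bias $a_0^{(2)} = 1$ and units $Z_1^{(2)} \to a_1^{(2)}$, $Z_2^{(2)} \to a_2^{(2)}$, reached through weights $w_{01}^{(1)}, \dots, w_{32}^{(1)}$; output layer ($k=3$) with three units $Z_1^{(3)} \to a_1^{(3)}$, $Z_2^{(3)} \to a_2^{(3)}$, $Z_3^{(3)} \to a_3^{(3)}$, one $\hat{y}_j$ per category, reached through weights $w_{01}^{(2)}, w_{11}^{(2)}, w_{21}^{(2)}, w_{02}^{(2)}, w_{12}^{(2)}, w_{22}^{(2)}, w_{03}^{(2)}, w_{13}^{(2)}, w_{23}^{(2)}$.]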
  • 38. Notation
      • $\boldsymbol{W}^{(k)}$: matrix of weights controlling the function mapping from layer $(k)$ to layer $(k+1)$. (Here $k = 1, \dots, L$.)
      • $\boldsymbol{Z}^{(k+1)}$: vector of linear combinations of weights and inputs from layer $(k)$: $Z_j^{(k+1)} = \boldsymbol{w}_j^{(k)} a^{(k)}$, where $a_j^{(1)} = x_j$, $a_0^{(k)} = 1$ (acts as a bias) and $a_j^{(L)} = \hat{y}_j$ (in the case of a regression problem we have one output).
      • $a_j^{(k)}$: activation of unit $(j)$ in layer $(k)$ with a pre-specified activation function: $a_j^{(k+1)} = \phi(Z_j^{(k+1)})$. (Here $j = 0, \dots, P^{(k)}$, specific to the layer.)
      • There are several different activation functions.
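The layer-by-layer recursion $Z^{(k+1)} = w^{(k)} a^{(k)}$, $a^{(k+1)} = \phi(Z^{(k+1)})$ can be sketched as a short loop. This is an illustrative Python version under stated assumptions: a sigmoid is used in the hidden layers, the output layer is linear (numeric output), each weight row stores the bias weight first, and the example weights are hypothetical values chosen only so the result is easy to check by hand.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, weights, output_activation=lambda z: z):
    """Forward propagation through a fully connected network.

    weights[k] is the weight matrix from layer k+1 to layer k+2 (1-based layers):
    one row per unit of the next layer, with the bias weight in position 0.
    """
    a = list(x)                               # a^(1) = x
    for k, W in enumerate(weights):
        a_with_bias = [1.0] + a               # a_0 = 1 acts as the bias
        z = [sum(w * ai for w, ai in zip(row, a_with_bias)) for row in W]
        is_output = (k == len(weights) - 1)
        a = [output_activation(zj) if is_output else sigmoid(zj) for zj in z]
    return a


# Hypothetical weights: 3 inputs, 2 hidden units, 1 numeric output.
W1 = [[0.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 0.0, 0.0]]
W2 = [[1.0, 2.0, 2.0]]
y_hat = forward([0.5, -1.0, 2.0], [W1, W2])
print(y_hat)
```

With all first-layer weights at zero, both hidden activations equal $\phi(0) = 0.5$, so the output is $1 + 2 \cdot 0.5 + 2 \cdot 0.5 = 3$.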
  • 39. Activation Functions
      • In perceptrons, a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to completely flip, say from 0 to 1. That flip may then cause the behaviour of the rest of the network to completely change in some very complicated way.
      • There are several different activation functions:
        – Step function
        – Constant function
        – Threshold function (step)
        – Threshold function (ramp)
        – Linear function
        – Sigmoid function
        – Hyperbolic Tangent function
  • 40. Activation Function – Step Function (Symmetric)
      $a = \phi(Z) = \begin{cases} 1 & \text{if } Z = \sum_{j=0}^{p} w_j x_j \ge 0 \\ -1 & \text{otherwise} \end{cases}$
  • 41. Activation Function – Step Function (Binary)
      $a = \phi(Z) = \begin{cases} 1 & \text{if } Z = \sum_{j=0}^{p} w_j x_j \ge 0 \\ 0 & \text{otherwise} \end{cases}$
  • 42. Activation Function – Linear Function
      $a = \phi(Z) = Z$
  • 43. Activation Function – Semilinear Function
      $a = \phi(Z) = \begin{cases} 1 & \text{if } Z = \sum_{j=0}^{p} w_j x_j > 1 \\ Z & \text{if } -1 \le \sum_{j=0}^{p} w_j x_j \le 1 \\ -1 & \text{if } Z = \sum_{j=0}^{p} w_j x_j < -1 \end{cases}$
  • 44. Activation Function – Sigmoid Function
      $a = \phi(Z) = \frac{1}{1 + \exp(-\alpha Z)}$
      We will focus on the sigmoid activation function for the moment.
  • 45. Activation Function – Hyperbolic Tangent (Tanh) Function
      $a = \phi(Z) = \frac{\exp(Z) - \exp(-Z)}{\exp(Z) + \exp(-Z)}$
  • 46. Activation Function – ReLU Function (Rectified Linear Unit)
      $a = \phi(Z) = \begin{cases} Z & \text{if } Z = \sum_{j=0}^{p} w_j x_j > 0 \\ 0 & \text{otherwise} \end{cases}$
      • Not differentiable at 0, but differentiable everywhere else. At zero, the derivative can be set arbitrarily to either 0 or 1.
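For reference, the activation functions listed above can be written as one-line functions. This is a minimal Python sketch with illustrative function names; `alpha` in the sigmoid is the slope parameter $\alpha$ from the sigmoid slide.

```python
import math


def step_symmetric(z):
    """Symmetric step: 1 if Z >= 0, otherwise -1."""
    return 1.0 if z >= 0 else -1.0


def step_binary(z):
    """Binary step: 1 if Z >= 0, otherwise 0."""
    return 1.0 if z >= 0 else 0.0


def linear(z):
    return z


def semilinear(z):
    """Z clipped to the interval [-1, 1]."""
    return max(-1.0, min(1.0, z))


def sigmoid(z, alpha=1.0):
    return 1.0 / (1.0 + math.exp(-alpha * z))


def tanh(z):
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))


def relu(z):
    """Z if Z > 0, otherwise 0."""
    return z if z > 0 else 0.0


for phi in (step_symmetric, step_binary, linear, semilinear, sigmoid, tanh, relu):
    print(phi.__name__, [round(phi(z), 4) for z in (-2.0, -0.5, 0.0, 0.5, 2.0)])
```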
  • 47. How do we find the weights, w?
      • One way of attacking the problem is to use calculus to try to find the minimum analytically.
      • We could compute derivatives and then try using them to find places where the cost function C (written $J(w)$ in these slides) is an extremum. With some luck that might work when C is a function of just one or a few variables.
      • But it'll turn into a nightmare when we have many more variables.
      • And for neural networks we'll often want far more variables: the biggest neural networks have cost functions which depend on billions of weights and biases in an extremely complicated way.
      • Using calculus to minimize that just won't work!
  • 48. How do we find the weights, w? – Going Backward
      • 2) Gradient Descent Algorithm:
        – Step 0: Training begins by assigning some initial random values to the network parameters.
        – Step 1: Present the input vectors to the network and apply the activation function (FORWARD PROPAGATION).
        – Step 2: Calculate the error using a cost function:
          $J(w) = \frac{1}{2n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)})^2$ for regression problems (without any regularization)
          $J(w) = \frac{-1}{2n} \sum_{i=1}^{n} \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)}) \right]$ for classification problems (without any regularization)
          Our goal is to minimise the error $J(w)$ with respect to $w_j$.
        – Step 3: Update the weights according to the following rule:
          $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$
          Here $\alpha$ is the learning rate and $0 < \alpha \le 1$.
      • Continue the iteration until convergence.
  • 49. How do we find the weights, w? – Going Backward
      2) Gradient Descent Algorithm:
      • Let us examine the weight update function: $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$
      • The partial derivative answers the question "What is the slope of $J(w)$ at the point $w$?"
      • $\alpha$ determines the size of the change. If $\alpha$ is too small, we take small steps and it will take too long to reach the optimum.
      • If $\alpha$ is too big, we may overshoot the optimal values and fail to converge.
      • Let us have a look at these concepts with a small example, with one feature used to estimate a numeric variable: x (size): c(0, 1, 2, 3); y (price): c(0, 2, 4, 6).
  • 50. Batch, Stochastic Gradient Descent • In batch gradient descent learning, the weight update is calculated based on all samples in the training set (instead of updating the weights incrementally after each sample), which is why this approach is also referred to as “batch” gradient descent. • Vector – matrix operations
  • 51. Batch, Stochastic Gradient Descent
      • Now imagine we have a very large dataset with millions of data points, which is not uncommon in many machine learning applications. Running batch gradient descent can be computationally quite costly in such scenarios since we need to re-evaluate the whole training dataset each time we take one step towards the global minimum.
      • A popular alternative to the batch gradient descent algorithm is stochastic gradient descent, sometimes also called iterative or on-line gradient descent. Instead of updating the weights based on the sum of the accumulated errors over all samples, we update the weights incrementally for each training sample:
          for (i in 1:n) {
            # calculate the error for sample i
            # calculate the derivatives
            # update the weights
          }
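A runnable counterpart of the per-sample loop is sketched below in Python (not the course's R code), applied to the small size/price data from these slides with the linear model $\hat{y} = w_0 + w_1 x$. The learning rate, the number of epochs, the seeded shuffling of the sample order and the function name `sgd` are all assumptions of this sketch.

```python
import random

x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 2.0, 4.0, 6.0]


def sgd(x, y, alpha=0.05, epochs=2000, seed=0):
    """Stochastic gradient descent for y_hat = w0 + w1 * x.

    The weights are updated after every single sample, using the gradient of
    that sample's squared error 0.5 * (y - y_hat)^2.
    """
    rng = random.Random(seed)
    w0, w1 = 0.0, 0.0
    order = list(range(len(x)))
    for _ in range(epochs):
        rng.shuffle(order)                    # visit the samples in random order
        for i in order:
            error = y[i] - (w0 + w1 * x[i])   # calculate the error
            grad_w0 = -error                  # calculate the derivatives
            grad_w1 = -error * x[i]
            w0 -= alpha * grad_w0             # update the weights
            w1 -= alpha * grad_w1
    return w0, w1


w0, w1 = sgd(x, y)
print(w0, w1)
```

Because this toy data lies exactly on the line $y = 2x$, the per-sample updates settle at $w_0 \approx 0$, $w_1 \approx 2$; on noisy data the weights would keep fluctuating around the minimum.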
  • 52. Batch, Stochastic Gradient Descent
      • A compromise between batch gradient descent and stochastic gradient descent is the so-called mini-batch learning. In stochastic (on-line) learning, a neural network learns from just one training input at a time; in mini-batch learning, it learns from a small group of training inputs at a time.
      • Mini-batch learning can be understood as applying batch gradient descent to smaller subsets of the training data, for example 50 samples at a time.
      • By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient, and this helps speed up gradient descent, and thus learning.
  • 53. Final note on Gradient Descent: Input variable preprocessing
      • Gradient descent is one of the many algorithms that benefit from feature scaling. Each input variable should be preprocessed so that its mean value, averaged over the entire training sample, is close to zero, or at least small compared to its standard deviation.
      • In order to accelerate the back-propagation learning process, the normalization of the inputs should also include two other measures (LeCun, 1993):
        – The input variables contained in the training set should be uncorrelated; this can be done by using principal-components analysis (USL).
        – The decorrelated input variables should be scaled so that their covariances are approximately equal, thereby ensuring that the different synaptic weights in the network learn at approximately the same speed.
  • 54. Final note on Gradient Descent: Input variable preprocessing
      • We will use a feature scaling method called standardization, which gives each feature the mean and standard deviation of a standard normal distribution.
      • The mean of each input feature is centered at value 0 and the feature column has a standard deviation of 1:
        $x_j^{st} = \frac{x_j - \bar{x}_j}{s_j}$
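Standardization of one feature column takes only a few lines. The sketch below is in Python with an illustrative function name; it assumes $s_j$ is the sample standard deviation (divisor $n - 1$, as R's `sd()` computes it), which the slide does not specify.

```python
import statistics


def standardize(column):
    """Return (x - mean) / sd for every value of one feature column."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)  # assumption: sample standard deviation (n - 1)
    return [(value - mean) / sd for value in column]


size = [0.0, 1.0, 2.0, 3.0]
size_st = standardize(size)
print(size_st)
print(statistics.mean(size_st), statistics.stdev(size_st))
```

In practice the mean and standard deviation are estimated on the training set only and then reused to transform validation and test data.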
  • 55. Hypothetical Data – The exact function is $\hat{y} = 0 + 2x$
      > dat
        x y
      1 0 0
      2 1 2
      3 2 4
      4 3 6
      [Figure: the four data points with the line $\hat{y} = 0 + 2x$.]
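  • 56. 1 input layer, 1 output layer
      [Figure: input layer $x_0, x_1$ with weights $w_0, w_1$ feeding $Z$ and $\phi(Z)$ in the output layer, which produces $\hat{y}$.]
      $Z = w_0 x_0 + w_1 x_1$, $\phi(Z) = Z$, $error = y - \hat{y}$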
  • 57. For $w_0 = 0$, $w_1 = 0$: $J(0, 0) = 7$
      > dat
        x y  yhat = 0 + 0*x   e = y - yhat
      1 0 0  0 + 0*0 = 0      e1 = 0
      2 1 2  0 + 0*1 = 0      e2 = 2
      3 2 4  0 + 0*2 = 0      e3 = 4
      4 3 6  0 + 0*3 = 0      e4 = 6
      $J(w) = \frac{1}{2n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)})^2 = \frac{1}{2 \cdot 4} (0^2 + 2^2 + 4^2 + 6^2) = 7$
  • 58. For $w_0 = 0$, $w_1 = 0.5$: $J(0, 0.5) = 3.9375$
      > dat
        x y  yhat = 0 + 0.5*x     e = y - yhat
      1 0 0  0 + 0.5*0 = 0        e1 = 0
      2 1 2  0 + 0.5*1 = 0.5      e2 = 1.5
      3 2 4  0 + 0.5*2 = 1        e3 = 3
      4 3 6  0 + 0.5*3 = 1.5      e4 = 4.5
      $J(w) = \frac{1}{2n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)})^2 = \frac{1}{2 \cdot 4} (0^2 + 1.5^2 + 3^2 + 4.5^2) = 3.9375$
  • 59. For $w_0 = 0$, $w_1 = 1$: $J(0, 1) = 1.75$
      > dat
        x y  yhat = 0 + 1*x   e = y - yhat
      1 0 0  0 + 1*0 = 0      e1 = 0
      2 1 2  0 + 1*1 = 1      e2 = 1
      3 2 4  0 + 1*2 = 2      e3 = 2
      4 3 6  0 + 1*3 = 3      e4 = 3
      $J(w) = \frac{1}{2n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)})^2 = \frac{1}{2 \cdot 4} (0^2 + 1^2 + 2^2 + 3^2) = 1.75$
  • 60. For $w_0 = 0$, $w_1 = 2$: $J(0, 2) = 0$
      > dat
        x y  yhat = 0 + 2*x   e = y - yhat
      1 0 0  0 + 2*0 = 0      e1 = 0
      2 1 2  0 + 2*1 = 2      e2 = 0
      3 2 4  0 + 2*2 = 4      e3 = 0
      4 3 6  0 + 2*3 = 6      e4 = 0
      $J(w) = \frac{1}{2n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)})^2 = \frac{1}{2 \cdot 4} (0^2 + 0^2 + 0^2 + 0^2) = 0$
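The cost evaluations for the four candidate values of $w_1$ can be checked with a short script. This is a Python sketch (not the course's R code) of $J(w) = \frac{1}{2n} \sum_i (y^{(i)} - \hat{y}^{(i)})^2$ for the model $\hat{y} = w_0 + w_1 x$; the function name `cost` is illustrative.

```python
x = [0, 1, 2, 3]
y = [0, 2, 4, 6]


def cost(w0, w1):
    """Half mean squared error of y_hat = w0 + w1 * x on the toy data."""
    n = len(x)
    return sum((yi - (w0 + w1 * xi)) ** 2 for xi, yi in zip(x, y)) / (2 * n)


for w1 in (0, 0.5, 1, 2):
    print("w1 =", w1, " J(0, w1) =", cost(0, w1))
```

With $w_0 = 0$ the cost simplifies to $J(0, w_1) = 1.75\,(2 - w_1)^2$, a parabola with its minimum at $w_1 = 2$.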
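  • 61. Cost function with respect to $w_1$: at $w_1 = 0$, $\frac{\partial}{\partial w_j} J(w) < 0$, so the update $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$ increases the weight.
  • 62. Cost function with respect to $w_1$: at $w_1 = 0.5$, $\frac{\partial}{\partial w_j} J(w) < 0$, so the update $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$ increases the weight.
  • 63. Cost function with respect to $w_1$: at $w_1 = 1.25$, $\frac{\partial}{\partial w_j} J(w) < 0$, so the update $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$ increases the weight.
  • 64. Cost function with respect to $w_1$: at $w_1 = 2$ the minimum is reached, $\frac{\partial}{\partial w_j} J(w) = 0$, and the update $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$ leaves the weight unchanged.
  • 65. Cost function with respect to $w_1$: at $w_1 = 2.75$, $\frac{\partial}{\partial w_j} J(w) > 0$, so the update $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$ decreases the weight.
  • 66. Cost function with respect to $w_1$: at $w_1 = 2.5$, $\frac{\partial}{\partial w_j} J(w) > 0$, so the update $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$ decreases the weight.
  • 67. Cost function with respect to $w_1$: at $w_1 = 2$ the minimum is reached, $\frac{\partial}{\partial w_j} J(w) = 0$, and the update $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$ leaves the weight unchanged.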
  • 68. Cost function with respect to w1, one w parameter to optimize. Ref: Raschka, p. 35
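  • 70. Gradient Descent in a very simple example: 1 input layer, 1 output layer, 1 x, 1 numeric y
      • Consider the linear regression example:
      • $J(w) = \frac{1}{2n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)})^2$ and $\hat{y}^{(i)} = w_0 x_0 + w_1 x_1$
      • We have 2 weights!
      • $\frac{\partial}{\partial w_0} J(w)$ and $\frac{\partial}{\partial w_1} J(w)$ need to be calculated:
      • $\nabla J = \left( \frac{\partial}{\partial w_0} J(w), \frac{\partial}{\partial w_1} J(w) \right)$
      • $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$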
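  • 71. Partial derivatives of the cost function
      • $J(w) = \frac{1}{2n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)})^2$ and $\hat{y}^{(i)} = w_0 x_0 + w_1 x_1$
      • $\frac{\partial}{\partial w_0} J(w) = \frac{-2}{2n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)}) \frac{\partial}{\partial w_0} \hat{y}^{(i)}(w) = \frac{-1}{n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)}) \, x_0^{(i)}$
      • and $\frac{\partial}{\partial w_1} J(w) = \frac{-2}{2n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)}) \frac{\partial}{\partial w_1} \hat{y}^{(i)}(w) = \frac{-1}{n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)}) \, x_1^{(i)}$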
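  • 72. Gradients and the direction of the update
      $\frac{\partial}{\partial w_0} J(w) = \frac{-1}{n} \sum_{i=1}^{n} (y^{(i)} - w_0 x_0^{(i)} - w_1 x_1^{(i)}) \, x_0^{(i)}$
      $\frac{\partial}{\partial w_1} J(w) = \frac{-1}{n} \sum_{i=1}^{n} (y^{(i)} - w_0 x_0^{(i)} - w_1 x_1^{(i)}) \, x_1^{(i)}$
      $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$
      • $\frac{\partial}{\partial w} J(w) < 0$ means we will increase the weight.
      • $\frac{\partial}{\partial w} J(w) > 0$ means we will decrease the weight.
      • $\frac{\partial}{\partial w} J(w) = 0$ means we will not change the weight.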
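The two partial derivatives and the update rule translate directly into batch gradient descent. The sketch below is in Python (not the course's R code) and uses the small size/price data from these slides; the learning rate $\alpha = 0.1$, the zero starting weights, the iteration count and the function names are assumptions.

```python
x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 2.0, 4.0, 6.0]


def gradient(w0, w1):
    """Partial derivatives of J(w) = 1/(2n) * sum (y - w0*x0 - w1*x1)^2, x0 = 1."""
    n = len(x)
    residuals = [yi - (w0 + w1 * xi) for xi, yi in zip(x, y)]
    d_w0 = -sum(residuals) / n
    d_w1 = -sum(r * xi for r, xi in zip(residuals, x)) / n
    return d_w0, d_w1


def gradient_descent(alpha=0.1, iterations=5000):
    """Batch gradient descent: every update uses all n samples."""
    w0, w1 = 0.0, 0.0
    for _ in range(iterations):
        d_w0, d_w1 = gradient(w0, w1)
        # Simultaneous update of both weights
        w0, w1 = w0 - alpha * d_w0, w1 - alpha * d_w1
    return w0, w1


print(gradient(0.0, 0.0))
print(gradient_descent())
```

At the starting point both derivatives are negative, so both weights are increased first; the iteration then settles near $w_0 = 0$, $w_1 = 2$.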
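  • 74. Gradient Descent in a very simple example: 1 input layer, 1 hidden layer (2 neurons + bias), 1 output layer, 1 x, 1 numeric y
      [Figure: inputs $x_0 = 1$ and $x_1$ connect through weights $w_{01}^{(1)}, w_{11}^{(1)}, w_{02}^{(1)}, w_{12}^{(1)}$ to the hidden units $Z_1^{(2)} \to a_1^{(2)}$ and $Z_2^{(2)} \to a_2^{(2)}$; the hidden activations, together with the bias $a_0^{(2)} = 1$, connect through weights $w_{01}^{(2)}, w_{11}^{(2)}, w_{21}^{(2)}$ to the output $Z_1^{(3)} \to a_1^{(3)} = \hat{y}$.]
      $x_j = a_j^{(1)}$, $a_0 = 1$
      $Z_j^{(2)} = \boldsymbol{w}_j^{(1)} x$
      $a_j^{(2)} = \phi(Z_j^{(2)}) = \frac{1}{1 + \exp(-Z_j^{(2)})}$
      $\hat{y} = a_1^{(3)} = Z_1^{(3)} = \boldsymbol{w}_1^{(2)} a^{(2)}$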
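  • 75. Gradient Descent in a very simple example: 1 input layer, 1 hidden layer (2 neurons + bias), 1 output layer, 1 x, 1 numeric y
      • $J(w) = \frac{1}{2n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)})^2$
      • $\frac{\partial}{\partial w} J(w) = \frac{-2}{2n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)}) \frac{\partial}{\partial w} \hat{y}^{(i)}(w)$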
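  • 76. Gradient Descent in a very simple example: 1 input layer, 1 hidden layer (2 neurons + bias), 1 output layer, 1 x, 1 numeric y
      • $\hat{y}^{(i)} = a_0^{(2)} w_{01}^{(2)} + a_1^{(2)} w_{11}^{(2)} + a_2^{(2)} w_{21}^{(2)}$ and $a_0^{(2)} = 1$
      • Output layer weights:
        $\frac{\partial}{\partial w_{01}^{(2)}} \hat{y}^{(i)}(w) = a_0^{(2)} = 1$
        $\frac{\partial}{\partial w_{11}^{(2)}} \hat{y}^{(i)}(w) = a_1^{(2)}$
        $\frac{\partial}{\partial w_{21}^{(2)}} \hat{y}^{(i)}(w) = a_2^{(2)}$
      • Hidden layer weights:
        $\frac{\partial}{\partial w_{01}^{(1)}} \hat{y}^{(i)}(w) = ?$
        $\frac{\partial}{\partial w_{02}^{(1)}} \hat{y}^{(i)}(w) = ?$
        $\frac{\partial}{\partial w_{11}^{(1)}} \hat{y}^{(i)}(w) = ?$
        $\frac{\partial}{\partial w_{12}^{(1)}} \hat{y}^{(i)}(w) = ?$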
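  • 77. Update the weights from the hidden layer to the output layer
      • $w_{01}^{(2)} := w_{01}^{(2)} - \alpha \, \frac{-1}{n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)})$
      • $w_{11}^{(2)} := w_{11}^{(2)} - \alpha \, \frac{-1}{n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)}) \, a_1^{(2)}$
      • $w_{21}^{(2)} := w_{21}^{(2)} - \alpha \, \frac{-1}{n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)}) \, a_2^{(2)}$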
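  • 78. Derivatives of the prediction with respect to every weight
      • $\hat{y}^{(i)} = 1 \cdot w_{01}^{(2)} + a_1^{(2)} w_{11}^{(2)} + a_2^{(2)} w_{21}^{(2)}$ and $a_0^{(2)} = 1$
      • $a_1^{(2)} = \frac{1}{1 + \exp(-Z_1^{(2)})}$, $a_2^{(2)} = \frac{1}{1 + \exp(-Z_2^{(2)})}$
      • $Z_1^{(2)} = w_{01}^{(1)} \cdot 1 + w_{11}^{(1)} x_1$, $Z_2^{(2)} = w_{02}^{(1)} \cdot 1 + w_{12}^{(1)} x_1$
      • Output layer weights:
        $\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{01}^{(2)}} = 1$, $\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{11}^{(2)}} = a_1^{(2)}$, $\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{21}^{(2)}} = a_2^{(2)}$
      • Hidden layer weights (chain rule):
        $\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{01}^{(1)}} = \frac{\partial \hat{y}^{(i)}(w)}{\partial a_1^{(2)}} \frac{\partial a_1^{(2)}}{\partial Z_1^{(2)}} \frac{\partial Z_1^{(2)}}{\partial w_{01}^{(1)}} = w_{11}^{(2)} \, a_1^{(2)} (1 - a_1^{(2)}) \cdot 1$
        $\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{11}^{(1)}} = \frac{\partial \hat{y}^{(i)}(w)}{\partial a_1^{(2)}} \frac{\partial a_1^{(2)}}{\partial Z_1^{(2)}} \frac{\partial Z_1^{(2)}}{\partial w_{11}^{(1)}} = w_{11}^{(2)} \, a_1^{(2)} (1 - a_1^{(2)}) \, x_1$
        $\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{02}^{(1)}} = \frac{\partial \hat{y}^{(i)}(w)}{\partial a_2^{(2)}} \frac{\partial a_2^{(2)}}{\partial Z_2^{(2)}} \frac{\partial Z_2^{(2)}}{\partial w_{02}^{(1)}} = w_{21}^{(2)} \, a_2^{(2)} (1 - a_2^{(2)}) \cdot 1$
        $\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{12}^{(1)}} = \frac{\partial \hat{y}^{(i)}(w)}{\partial a_2^{(2)}} \frac{\partial a_2^{(2)}}{\partial Z_2^{(2)}} \frac{\partial Z_2^{(2)}}{\partial w_{12}^{(1)}} = w_{21}^{(2)} \, a_2^{(2)} (1 - a_2^{(2)}) \, x_1$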
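  • 79. Update the weights from the input layer to the hidden layer
      • $w_{01}^{(1)} := w_{01}^{(1)} - \alpha \, \frac{-1}{n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)}) \, w_{11}^{(2)} a_1^{(2)} (1 - a_1^{(2)})$
      • $w_{11}^{(1)} := w_{11}^{(1)} - \alpha \, \frac{-1}{n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)}) \, w_{11}^{(2)} a_1^{(2)} (1 - a_1^{(2)}) \, x_1$
      • $w_{02}^{(1)} := w_{02}^{(1)} - \alpha \, \frac{-1}{n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)}) \, w_{21}^{(2)} a_2^{(2)} (1 - a_2^{(2)})$
      • $w_{12}^{(1)} := w_{12}^{(1)} - \alpha \, \frac{-1}{n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)}) \, w_{21}^{(2)} a_2^{(2)} (1 - a_2^{(2)}) \, x_1$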
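The output-layer and hidden-layer update rules can be combined into one gradient routine for this 1-input, 2-hidden-unit, 1-output network. The following Python sketch (not the course's R code) assumes a sigmoid hidden layer and a linear output as on the slides; the starting weights, the learning rate, the iteration count and all function names are hypothetical choices for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x1, w1, w2):
    """w1[j] = [bias weight, slope weight] into hidden unit j+1;
    w2 = [w01, w11, w21] are the output-layer weights (bias first)."""
    a = [sigmoid(bias + slope * x1) for bias, slope in w1]
    y_hat = w2[0] + w2[1] * a[0] + w2[2] * a[1]
    return a, y_hat


def cost(xs, ys, w1, w2):
    n = len(xs)
    return sum((y - forward(x, w1, w2)[1]) ** 2 for x, y in zip(xs, ys)) / (2 * n)


def gradients(xs, ys, w1, w2):
    """Partial derivatives of J for every weight, via the chain rule."""
    n = len(xs)
    g1 = [[0.0, 0.0], [0.0, 0.0]]
    g2 = [0.0, 0.0, 0.0]
    for x, y in zip(xs, ys):
        a, y_hat = forward(x, w1, w2)
        common = -(y - y_hat) / n              # (-1/n) * (y - y_hat)
        g2[0] += common                        # output bias weight: d y_hat / d w01 = 1
        for j in range(2):
            g2[j + 1] += common * a[j]         # output weight: d y_hat / d w = a_j
            delta = common * w2[j + 1] * a[j] * (1.0 - a[j])
            g1[j][0] += delta                  # hidden bias weight
            g1[j][1] += delta * x              # hidden slope weight
    return g1, g2


xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 2.0, 4.0, 6.0]
w1 = [[0.1, 0.2], [-0.1, 0.3]]   # hypothetical starting weights
w2 = [0.0, 0.5, -0.5]
alpha = 0.05
print("cost before:", cost(xs, ys, w1, w2))
for _ in range(2000):
    g1, g2 = gradients(xs, ys, w1, w2)
    w1 = [[w - alpha * g for w, g in zip(wr, gr)] for wr, gr in zip(w1, g1)]
    w2 = [w - alpha * g for w, g in zip(w2, g2)]
print("cost after:", cost(xs, ys, w1, w2))
```

A useful sanity check for hand-derived gradients like these is to compare them with finite differences of the cost function, perturbing one weight at a time.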
  • 81. What does $\alpha$ do?
      • If $\alpha$ is too small, the rate of change in the weights will be tiny. It will take too long to reach the optimum solution.
      • If $\alpha$ is too big, the rate of change in the weights will be very big. We may never find the optimum solution; the algorithm may fail to converge.
  • 82. $\alpha$, adaptive learning
      • In stochastic gradient descent implementations, the fixed learning rate $\alpha$ is often replaced by an adaptive learning rate that decreases over time, for example
        $\alpha = \frac{C_1}{\#iterations + C_2}$
        where $C_1$ and $C_2$ are constants.
      • Note that stochastic gradient descent does not reach the global minimum exactly but an area very close to it. By using an adaptive learning rate, we can anneal further and end up closer to the global minimum.
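The decaying schedule is a one-line function of the iteration counter. A minimal Python sketch follows; the function name and the constants $C_1 = 1$, $C_2 = 10$ are arbitrary illustrative choices, not values from the slides.

```python
def learning_rate(iteration, c1=1.0, c2=10.0):
    """Adaptive learning rate alpha = C1 / (#iterations + C2)."""
    return c1 / (iteration + c2)


for t in (0, 10, 90, 990):
    print(t, learning_rate(t))
```

$C_2$ controls how large the first steps are, while $C_1$ scales the whole schedule.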
  • 83. Ref: Raschka, p. 40
  • 84. What if we have $w_0$ and $w_1$ together to change?
      • The cost function $J(w_0, w_1)$ will be a 3D surface plot over $w_0$ and $w_1$ (left pane).
      • The contour plot shows the same cost along each contour, with axes $w_0$ and $w_1$ (right pane).
  • 87. Regularization in Neural Networks
      • In multi-layer neural networks, the number of input and output units is generally determined by the dimensionality of the data set.
      • On the other hand, we are free to choose the number of hidden layer units (M). We may typically have hundreds, thousands, or even billions of weights that we need to optimize.
      • Choose the number of hidden layer units (M) that gives the best generalization performance, balancing underfitting against overfitting.
      • A network is said to generalize well when the input–output mapping computed by the network is correct (or nearly so) for test data never used in creating or training the network. Here, it is assumed that the test data are drawn from the same population used to generate the training data.
  • 88. Regularization in Neural Networks
      • The generalization error, however, is not a simple function of the number of hidden layer units (M) due to the presence of local minima in the error function.
      • Because every training run starts from random values of the weight vector, repeating the run several times for each value of M shows the effect of the random initialization across a range of values of M.
      • In practice, one approach to choosing M is to plot M against the errors and then choose the specific solution having the smallest validation set error.
  • 89. Regularization in Neural Networks
      • $J(w) = \frac{1}{2n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)})^2$
      • There are, however, other ways to control the complexity of a neural network model in order to avoid over-fitting, such as adding a quadratic regularizer (L2):
      • $\tilde{J}(w) = \frac{1}{2n} \sum_{i=1}^{n} (y^{(i)} - \hat{y}^{(i)})^2 + \frac{\lambda}{2} \lVert w \rVert^2$
      • This regularizer is also known as weight decay.
      • The effective model complexity is then determined by the choice of the regularization coefficient $\lambda$.
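To make the effect of the penalty concrete, the sketch below adds the L2 term to the cost and gradient of the simple linear model $\hat{y} = w_0 + w_1 x$ on the toy size/price data. It is written in Python with illustrative function names, and it assumes that both weights, including the bias $w_0$, are penalized; in practice the bias is often left out of the penalty.

```python
x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 2.0, 4.0, 6.0]


def regularized_cost(w0, w1, lam):
    """J~(w) = 1/(2n) * sum (y - y_hat)^2 + (lam / 2) * ||w||^2."""
    n = len(x)
    data_term = sum((yi - (w0 + w1 * xi)) ** 2 for xi, yi in zip(x, y)) / (2 * n)
    penalty = lam / 2 * (w0 ** 2 + w1 ** 2)
    return data_term + penalty


def regularized_gradient(w0, w1, lam):
    """The penalty adds lam * w_j to each partial derivative (weight decay)."""
    n = len(x)
    residuals = [yi - (w0 + w1 * xi) for xi, yi in zip(x, y)]
    d_w0 = -sum(residuals) / n + lam * w0
    d_w1 = -sum(r * xi for r, xi in zip(residuals, x)) / n + lam * w1
    return d_w0, d_w1


print(regularized_cost(0.0, 2.0, lam=0.1))
print(regularized_gradient(0.0, 2.0, lam=0.1))
```

At the unregularized optimum $(0, 2)$ the data term vanishes but the gradient of $w_1$ is positive, so gradient descent shrinks $w_1$ slightly below 2: the larger $\lambda$, the stronger the pull towards zero.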
  • 91. Early Stopping in NNs • An alternative to regularization as a way of controlling the effective complexity of a network is the procedure of early stopping. • The training of nonlinear network models corresponds to an iterative reduction of the error function defined with respect to a set of training data. • The error measured with respect to independent data, generally called a validation set, often shows a decrease at first, followed by an increase as the network starts to over-fit. Training can therefore be stopped at the point of smallest error with respect to the validation data set, in order to obtain a network having good generalization performance.
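The stopping rule can be sketched independently of any particular network: given the validation error after each epoch, stop once it has failed to improve for a few epochs in a row and keep the best epoch seen. In this Python sketch the function name, the `patience` parameter and the error values are hypothetical; the slides only state the idea of stopping at the smallest validation error.

```python
def early_stopping_epoch(validation_errors, patience=2):
    """Return (best_epoch, best_error) under a simple patience rule.

    Training is considered stopped once the validation error has not improved
    for `patience` consecutive epochs; later epochs are never looked at.
    """
    best_epoch, best_error, waited = 0, float("inf"), 0
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error, waited = epoch, error, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_error


# Made-up validation errors: they fall, then rise as over-fitting sets in.
errors = [1.0, 0.8, 0.6, 0.65, 0.7, 0.5]
print(early_stopping_epoch(errors, patience=2))
```

In this made-up sequence training stops after epoch 4, so the later value 0.5 is never reached; a larger `patience` trades extra training time for robustness against such temporary increases.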
  • 92. When to use NNs
      • When dealing with unstructured datasets.
      • When you do not need interpretable results. For example, when you just want to classify your pictures as cats or dogs, you don't need to know why the outcome is classified as a cat or a dog. You don't need to explain the relationships.
      • When you have many features, with regularization.
      • When you have nonlinear relationships.
  • 93. Resources
      • A free online book by Michael Nielsen (brilliant resource for partial derivatives and gradient descent): http://neuralnetworksanddeeplearning.com/
      • The Elements of Statistical Learning, Trevor Hastie (p. 389)
      • Pattern Recognition and Machine Learning, Christopher Bishop (p. 227)
      • Machine Learning with R, Brett Lantz (p. 219)
      • Neural Networks: A Comprehensive Foundation, Simon S. Haykin
      • Python Machine Learning, Sebastian Raschka (p. 17)
      • Neural Network Design, Hagan, Demuth, Beale, De Jesus (http://hagan.okstate.edu/nnd.html)
      • https://github.com/stephencwelch/Neural-Networks-Demystified
      • And of course, again, Prof. Patrick Henry Winston's MIT YouTube lectures.