Outline
The objective of this part of the Supervised Learning lectures will be to gain
an understanding of:
• Background for ANNs
• How ANNs for regression and classification problems work
• The perceptron learning algorithm
• The gradient descent algorithm
• The stochastic gradient descent algorithm
• How to analyze datasets with ANNs in R
• How to interpret the results
Introduction
• An Artificial Neural Network (ANN) models the relationship between a set of
input signals (features) and an output signal (the y variable) using a model
derived from our understanding of how a biological brain responds to stimuli
from sensory inputs.
• Just as a brain uses a network of interconnected cells called neurons to
create a massive parallel processor, an ANN uses a network of artificial
neurons or nodes to solve learning problems.
• Before we explain ANNs, let us understand how the biological brain works.
How do we find the weights, w?
• 1) Perceptron Learning Algorithm:
– Step 0: Training begins by assigning some initial random values to the
network parameters. A good initial heuristic is to start with the average of
the positive input vectors minus the average of the negative input vectors; in
many cases this yields an initial vector near the solution region.
– Step 1: Present the input vectors to the network and apply the activation
function (FORWARD PROPAGATION).
– Step 2: Update the weights according to the following rule
(BACKWARD PROPAGATION):
w_j := w_j + Δw_j,  where  Δw_j = η (y^(i) − ŷ^(i)) x_j^(i)
Here η is the learning rate with 0 < η ≤ 1, and i = 1, …, n indexes the
training samples.
• Continue the iteration until the perceptron classifies all training examples
correctly.
• := denotes assignment.
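The steps above can be sketched as a small program. This is an illustrative sketch (in Python rather than the course's R), using a tiny hypothetical AND-style dataset and zero initial weights instead of the heuristic from Step 0:

```python
# Perceptron learning rule: w_j := w_j + eta * (y - y_hat) * x_j
def predict(w, x):
    # x includes a leading 1 so that w[0] acts as the bias weight
    z = sum(wj * xj for wj, xj in zip(w, x))
    return 1 if z > 0 else 0

def train_perceptron(X, y, eta=0.1, max_epochs=100):
    w = [0.0] * len(X[0])              # Step 0: initial weights (zeros here)
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):       # Step 1: forward propagation
            y_hat = predict(w, xi)
            delta = eta * (yi - y_hat)
            if delta != 0:
                errors += 1
            w = [wj + delta * xj for wj, xj in zip(w, xi)]  # Step 2
        if errors == 0:                # all training examples correct: stop
            break
    return w

# AND function: inputs (1, x1, x2); linearly separable, so training converges
X = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
y = [0, 0, 0, 1]
w = train_perceptron(X, y)
print([predict(w, xi) for xi in X])  # matches y once converged
```

Because the AND problem is linearly separable, the loop terminates with every example classified correctly, exactly as the convergence condition above requires.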
Perceptron as a neural network: Going Backward

[Diagram: a single perceptron with inputs x_0, x_1, x_2, x_3 in the input
layer, weights w_0, w_1, w_2, w_3, and an output layer producing ŷ. The error
for sample i is error^(i) = y^(i) − ŷ^(i); the goal is to minimize this error.]
Rate of change: Δw_j = η (y^(i) − ŷ^(i)) x_j^(i)

• Scenario 1: the output is correct – y^(i) = 1, ŷ^(i) = 1
• Scenario 2: the output is incorrect – y^(i) = 1, ŷ^(i) = 0
• Scenario 1: Δw_j = η (1 − 1) x_j^(i) = 0, so no change is necessary.
• Scenario 2: Δw_j = η (1 − 0) x_j^(i) = η x_j^(i), so the weight update is
proportional to the value of x_j^(i).
• In summary: where the perceptron predicts the class label correctly, the
weights remain unchanged; where the perceptron predicts the class label
incorrectly, the weights are updated proportionally to the value of the input.
The perceptron learning algorithm selects a search direction in weight space
according to the incorrect classification of the last tested vector.
Linearly separable – inseparable cases
• It is important to note that the convergence of the perceptron is only
guaranteed if the two classes are linearly separable and the learning rate is
sufficiently small.
Multilayer perceptrons
• Single-layer perceptrons are only capable of solving linearly separable
problems.
• In order to overcome the linearly inseparable problem, we can combine two or
more perceptrons, creating a multilayer perceptron.
• Therefore, by joining several hyperplanes, we can define a new set of
decision rules.
Example 1
• A network with 3 neurons in layer L=1, 4 neurons in layer L=2, and 2 neurons
in layer L=3.
• 3 × 4 = 12 w between L=1 and L=2; 4 × 2 = 8 w between L=2 and L=3.
• In total we have 12 + 8 = 20 weights to optimize.
Example 2
• 4 × 5 = 20 w and 5 × 1 = 5 w.
• In total we have 20 + 5 = 25 weights to optimize.
Example 3
• 4 × 3 = 12 w, 3 × 3 = 9 w, and 3 × 1 = 3 w.
• In total we have 12 + 9 + 3 = 24 weights to optimize.
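The three examples follow one pattern: for fully connected layers (ignoring separate bias terms), the number of weights between consecutive layers is the product of their sizes. A quick sketch of that count:

```python
def count_weights(layer_sizes):
    # Fully connected: weights between layer k and k+1 = n_k * n_(k+1)
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

print(count_weights([3, 4, 2]))     # Example 1: 12 + 8 = 20
print(count_weights([4, 5, 1]))     # Example 2: 20 + 5 = 25
print(count_weights([4, 3, 3, 1]))  # Example 3: 12 + 9 + 3 = 24
```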
• W^(k): matrix of weights controlling the function mapping from layer (k) to
layer (k + 1). (Here k = 1, …, L.)
• z^(k+1): vector of linear combinations of weights and activations from
layer (k):
z^(k+1) = W^(k) a^(k)
where a^(1) = x, a_0^(k) = 1 (acts as a bias), and a^(L) = ŷ (in the case of a
regression problem we have one output).
• a_j^(k): activation of unit (j) in layer (k) with a pre-specified activation
function. (Here j = 0, …, n^(k) and specific to the layer.)
a^(k+1) = g(z^(k+1))
• There are several different activation functions:
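The layer-to-layer mapping above can be sketched directly. This is a minimal illustration (in Python), assuming a sigmoid activation g and handling the bias by prepending a_0 = 1 at each layer; the weight matrices W1 and W2 are hypothetical values chosen only for the example:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights):
    # weights[k] is W^(k): one row per unit in layer k+1, and each row
    # includes a weight for the bias activation a_0 = 1.
    a = list(x)                              # a^(1) = x
    for W in weights:
        a = [1.0] + a                        # a_0 = 1 acts as the bias
        z = [sum(wj * aj for wj, aj in zip(row, a)) for row in W]
        a = [sigmoid(zj) for zj in z]        # a^(k+1) = g(z^(k+1))
    return a                                 # a^(L) = y_hat

# Two inputs -> hidden layer of two units -> one output (hypothetical weights)
W1 = [[0.1, 0.4, -0.2], [0.0, 0.3, 0.5]]
W2 = [[0.2, -0.6, 0.7]]
out = forward([1.0, 2.0], [W1, W2])
print(out)
```

With a sigmoid output unit, the single output lands between 0 and 1; for a regression problem the last layer would typically use a linear activation instead.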
Activation Functions
• In perceptrons, a small change in the weights or bias of any single
perceptron in the network can sometimes cause the output of that perceptron to
completely flip, say from 0 to 1. That flip may then cause the behaviour of
the rest of the network to completely change in some very complicated way.
• There are several different activation functions:
– Step function
– Constant function
– Threshold function (step)
– Threshold function (ramp)
– Linear function
– Sigmoid function
– Hyperbolic tangent function
Activation Function – ReLU Function (Rectified Linear Unit)
• Not differentiable at 0; however, it is differentiable everywhere else. At
the value of zero, either 0 or 1 can be chosen for the derivative.

a = g(z) = z if z = Σ_{j=0}^{m} w_j x_j > 0, and 0 otherwise.
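A minimal sketch of the ReLU and a conventional choice for its derivative at zero (0 is picked here; 1 is equally admissible, as noted above):

```python
def relu(z):
    # g(z) = z if z > 0, otherwise 0
    return z if z > 0 else 0.0

def relu_derivative(z):
    # Undefined at exactly 0; by convention we pick 0 here (1 is also valid)
    return 1.0 if z > 0 else 0.0

print([relu(z) for z in (-2.0, 0.0, 3.5)])             # [0.0, 0.0, 3.5]
print([relu_derivative(z) for z in (-2.0, 0.0, 3.5)])  # [0.0, 0.0, 1.0]
```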
How do we find the weights, w?
• One way of attacking the problem is to use calculus to try to find the
minimum analytically.
• We could compute derivatives and then try using them to find places where
the cost function C is an extremum. With some luck that might work when C is a
function of just one or a few variables.
• But it will turn into a nightmare when we have many more variables.
• And for neural networks we will often want far more variables – the biggest
neural networks have cost functions which depend on billions of weights and
biases in an extremely complicated way.
• Using calculus to minimize that just won't work!
How do we find the weights, w? – Going Backward
• 2) Gradient Descent Algorithm:
– Step 0: Training begins by assigning some initial random values to the
network parameters.
– Step 1: Present the input vectors to the network and apply the activation
function (FORWARD PROPAGATION).
– Step 2: Calculate the error using a cost function:
J(w) = (1 / 2n) Σ_{i=1}^{n} (y^(i) − ŷ^(i))²  for regression problems
(without any regularization)
J(w) = −(1 / n) Σ_{i=1}^{n} [ y^(i) log ŷ^(i) + (1 − y^(i)) log(1 − ŷ^(i)) ]
for classification problems (without any regularization)
Our goal is to minimize the error J(w) with respect to w_j.
– Step 3: Update the weights according to the following rule:
w_j := w_j − α ∂J(w)/∂w_j
Here α is the learning rate and 0 < α ≤ 1.
• Continue the iteration until convergence.
How do we find the weights, w? – Going Backward
2) Gradient Descent Algorithm:
• Let us examine the weight update function:
• w_j := w_j − α ∂J(w)/∂w_j
• The partial derivative answers the question "What is the slope of J(w) at
point w?"
• And α determines the amount of change to be made. If α is too small, we take
small steps toward the optimal values, and it will take too long to reach the
optimum.
• If α is too big, we may overshoot the optimal values and fail to converge.
• Let us have a look at these concepts with a small example:
x (size): c(0, 1, 2, 3)
y (price): c(0, 2, 4, 6)
One feature to estimate a numeric variable.
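For this toy example (x = 0, 1, 2, 3; y = 0, 2, 4, 6, so the exact fit is ŷ = 0 + 2x), the update rule can be sketched end to end. This is an illustrative Python sketch of batch gradient descent for a model ŷ = w0 + w1·x; the learning rate and iteration count are arbitrary choices for the example:

```python
# Batch gradient descent for y_hat = w0 + w1 * x with
# J(w) = 1/(2n) * sum_i (y_i - y_hat_i)^2
x = [0.0, 1.0, 2.0, 3.0]   # size
y = [0.0, 2.0, 4.0, 6.0]   # price
n = len(x)

w0, w1 = 0.0, 0.0          # Step 0: initial weights
alpha = 0.1                # learning rate

for _ in range(1000):
    y_hat = [w0 + w1 * xi for xi in x]        # Step 1: forward pass
    # Partial derivatives of J(w):
    # dJ/dw0 = -(1/n) * sum(err),  dJ/dw1 = -(1/n) * sum(err * x)
    grad_w0 = -sum(yi - yh for yi, yh in zip(y, y_hat)) / n
    grad_w1 = -sum((yi - yh) * xi for yi, yh, xi in zip(y, y_hat, x)) / n
    w0 -= alpha * grad_w0                     # Step 3: w := w - alpha * dJ/dw
    w1 -= alpha * grad_w1

print(round(w0, 3), round(w1, 3))  # approaches 0 and 2
```

Because the data lie exactly on a line, the weights converge to the exact-function values w0 = 0 and w1 = 2.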
Batch, Stochastic Gradient Descent
• In batch gradient descent learning, the weight update is calculated based on
all samples in the training set (instead of updating the weights incrementally
after each sample), which is why this approach is also referred to as "batch"
gradient descent.
• Vector – matrix operations
Batch, Stochastic Gradient Descent
• Now imagine we have a very large dataset with millions of data points, which
is not uncommon in many machine learning applications. Running batch gradient
descent can be computationally quite costly in such scenarios, since we need
to re-evaluate the whole training dataset each time we take one step towards
the global minimum.
• A popular alternative to the batch gradient descent algorithm is stochastic
gradient descent, sometimes also called iterative or on-line gradient descent.
Instead of updating the weights based on the sum of the accumulated errors
over all samples, we update the weights incrementally for each training
sample (sketched here in R for a single-feature linear model):

for (i in 1:n) {
  error <- y[i] - (w0 + w1 * x[i])     # calculate the error for sample i
  w0    <- w0 + alpha * error          # update the weights using the
  w1    <- w1 + alpha * error * x[i]   # per-sample derivatives
}
Batch, Stochastic Gradient Descent
• A compromise between batch gradient descent and stochastic gradient descent
is the so-called mini-batch learning. In mini-batch learning, a neural network
learns from a small subset of training inputs at a time.
• Mini-batch learning can be understood as applying batch gradient descent to
smaller subsets of the training data – for example, 50 samples at a time.
• By averaging over this small sample, it turns out that we can quickly get a
good estimate of the true gradient, and this helps speed up gradient descent,
and thus learning.
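The mini-batch idea can be sketched as applying the batch update to small index subsets. This Python sketch reuses the toy data above with a hypothetical batch size of 2 (real applications would use larger batches, e.g. 50):

```python
# Mini-batch gradient descent: batch updates applied to small subsets
x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 2.0, 4.0, 6.0]
batch_size = 2
w0, w1, alpha = 0.0, 0.0, 0.1

for _ in range(3000):
    for start in range(0, len(x), batch_size):
        xb = x[start:start + batch_size]       # one mini-batch
        yb = y[start:start + batch_size]
        m = len(xb)
        errs = [yi - (w0 + w1 * xi) for xi, yi in zip(xb, yb)]
        w0 += alpha * sum(errs) / m            # gradient averaged over batch
        w1 += alpha * sum(e * xi for e, xi in zip(errs, xb)) / m

print(round(w0, 2), round(w1, 2))  # close to 0 and 2
```

Each step only touches `batch_size` samples, yet the averaged gradient still drives the weights toward the same solution as full-batch descent.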
Final note on Gradient Descent: Input variable preprocessing
• Gradient descent is one of the many algorithms that benefit from feature
scaling. Each input variable should be preprocessed so that its mean value,
averaged over the entire training sample, is close to zero, or else small
compared to its standard deviation.
• In order to accelerate the back-propagation learning process, the
normalization of the inputs should also include two other measures (LeCun,
1993):
• The input variables contained in the training set should be uncorrelated;
this can be done by using principal components analysis (USL).
• The decorrelated input variables should be scaled so that their covariances
are approximately equal, thereby ensuring that the different synaptic weights
in the network learn at approximately the same speed.
Final note on Gradient Descent: Input variable preprocessing
• We will use a feature scaling method called standardization, which gives our
data the property of a standard normal distribution.
• The mean of each input feature is centered at 0 and the feature column has a
standard deviation of 1:
x_j' = (x_j − μ_j) / σ_j
where μ_j is the mean and σ_j the standard deviation of feature j.
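Standardization as defined above can be sketched in a few lines (a minimal Python version using the population standard deviation; in R one would typically call scale(), which uses the sample standard deviation instead):

```python
def standardize(values):
    # x' = (x - mean) / sd, giving mean 0 and standard deviation 1
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5  # population sd
    return [(v - mean) / sd for v in values]

x = [0.0, 1.0, 2.0, 3.0]
x_std = standardize(x)
print(x_std)  # mean ~0, standard deviation ~1
```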
Hypothetical Data – The exact function is ŷ = 0 + 2x

> dat
  x y
1 0 0
2 1 2
3 2 4
4 3 6
What does α do?
• If α is too small, the rate of change in the weights will be tiny. It will
take too long to reach the optimum solution.
• If α is too big, the rate of change in the weights will be very big. We may
never find the optimum solution; our algorithm may fail to converge.
α, adaptive learning
• In stochastic gradient descent implementations, the fixed learning rate α is
often replaced by an adaptive learning rate that decreases over time, for
example,
α = C1 / (number of iterations + C2)
• where C1 and C2 are constants. Note that stochastic gradient descent does
not reach the global minimum but an area very close to it. By using an
adaptive learning rate, we can achieve further annealing toward the minimum.
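The decreasing schedule α = C1 / (number of iterations + C2) can be sketched directly; C1 and C2 below are arbitrary illustrative constants:

```python
def adaptive_alpha(iteration, c1=1.0, c2=10.0):
    # alpha = C1 / (iteration + C2): larger steps early, smaller steps later
    return c1 / (iteration + c2)

print([round(adaptive_alpha(t), 3) for t in (0, 10, 100, 1000)])
# [0.1, 0.05, 0.009, 0.001]
```

Early iterations take large steps toward the minimum; later iterations take ever smaller steps, which is the annealing effect described above.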
What if we have w0 and w1 to change together?
• The cost function J(w0, w1) will be a 3D surface plot (left pane).
• The contour plot will provide the same cost along the same contour (right
pane).
[Figure: surface plot of J(w0, w1) over the (w0, w1) plane, alongside the
corresponding contour plot.]
Regularization in Neural Networks
• In multi-layer neural networks, the number of input and output units is
generally determined by the dimensionality of the data set.
• On the other hand, we are free to choose the number of hidden layer units
(M). We may typically have hundreds, thousands, or even billions of weights
that we need to optimize.
• Choose the optimum number of hidden layer units (M) that gives the best
generalization performance, balancing underfitting and overfitting.
• A network is said to generalize well when the input–output mapping computed
by the network is correct (or nearly so) for test data never used in creating
or training the network. Here, it is assumed that the test data are drawn from
the same population used to generate the training data.
Regularization in Neural Networks
• The generalization error, however, is not a simple function of the number of
hidden layer units (M), due to the presence of local minima in the error
function.
• Each time we start from random values of the weight vector for each hidden
layer size considered, we see the effect of choosing multiple random
initializations for the weight vector over a range of values of M.
• In practice, one approach to choosing M is to plot a graph of M versus the
errors, and then choose the specific solution having the smallest validation
set error.
Regularization in Neural Networks
• J(w) = (1 / 2n) Σ_{i=1}^{n} (y^(i) − ŷ^(i))²
• There are, however, other ways to control the complexity of a neural network
model in order to avoid over-fitting, such as adding a quadratic regularizer
(L2):
• J̃(w) = (1 / 2n) Σ_{i=1}^{n} (y^(i) − ŷ^(i))² + (λ / 2) ‖w‖²
• This regularizer is also known as weight decay.
• The effective model complexity is then determined by the choice of the
regularization coefficient λ.
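The weight-decay cost can be sketched as a direct translation of the formula (a minimal Python version; for simplicity all weights are penalized here, although in practice the bias weight is often excluded from the penalty):

```python
def l2_cost(y, y_hat, w, lam):
    # J(w) = 1/(2n) * sum_i (y_i - y_hat_i)^2 + lambda/2 * ||w||^2
    n = len(y)
    sse = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
    penalty = (lam / 2.0) * sum(wj ** 2 for wj in w)
    return sse / (2.0 * n) + penalty

# Toy data from earlier, hypothetical predictions and weights
cost = l2_cost([0, 2, 4, 6], [0.1, 1.9, 4.2, 5.8], [0.0, 2.0], lam=0.1)
print(cost)
```

Increasing lam raises the penalty on large weights, shrinking them toward zero and thereby reducing the effective model complexity.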
Early Stopping in NNs
• An alternative to regularization as a way of controlling the effective
complexity of a network is the procedure of early stopping.
• The training of nonlinear network models corresponds to an iterative
reduction of the error function defined with respect to a set of training
data.
• The error measured with respect to independent data, generally called a
validation set, often shows a decrease at first, followed by an increase as
the network starts to over-fit. Training can therefore be stopped at the point
of smallest error with respect to the validation data set, in order to obtain
a network having good generalization performance.
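The stopping rule above can be sketched as monitoring the validation error and halting once it stops improving. This is an illustrative Python sketch with a hypothetical validation-error curve and a hypothetical `patience` parameter; real training would compute these errors epoch by epoch:

```python
def early_stopping_epoch(val_errors, patience=2):
    # Stop once the validation error has not improved for `patience` epochs;
    # return the epoch with the smallest validation error seen so far.
    best_epoch, best_err, waited = 0, float("inf"), 0
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_epoch, best_err, waited = epoch, err, 0
        else:
            waited += 1
            if waited >= patience:
                break                # validation error is rising: over-fitting
    return best_epoch

# Hypothetical validation curve: decreases, then the network starts to over-fit
errors = [0.9, 0.6, 0.4, 0.35, 0.37, 0.41, 0.5]
print(early_stopping_epoch(errors))  # epoch 3 (error 0.35)
```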
When to use NNs
• When dealing with unstructured datasets.
• When you do not need interpretable results; for example, when you just want
to classify your pictures as cats or dogs, you don't need to know why the
outcome is classified as a cat or a dog. You don't need to explain the
relationships.
• When you have many features (with regularization).
• When you have nonlinear relationships.
Resources
• A free online book by Michael Nielsen (a brilliant resource for partial
derivatives and gradient descent):
http://neuralnetworksanddeeplearning.com/
• The Elements of Statistical Learning, Trevor Hastie et al. (p. 389)
• Pattern Recognition and Machine Learning, Christopher Bishop (p. 227)
• Machine Learning with R, Brett Lantz (p. 219)
• Neural Networks – A Comprehensive Foundation, Simon S. Haykin
• Python Machine Learning, Sebastian Raschka (p. 17)
• Neural Network Design, Hagan, Demuth, Beale, De Jesus
(http://hagan.okstate.edu/nnd.html)
• https://github.com/stephencwelch/Neural-Networks-Demystified
• And, of course, Prof. Patrick Henry Winston's MIT YouTube lectures.