Alex Mirugwe
Victoria University
Artificial Neural Networks
Outline
The objective of this part of the Supervised Learning lectures is to gain an
understanding of:
• Background for ANNs
• How ANNs work for regression and classification problems
• The perceptron learning algorithm
• The gradient descent algorithm
• The stochastic gradient descent algorithm
• How to analyze datasets with ANNs in R
• How to interpret the results
Resources
• Page 389, Chapter 11 Neural Networks
Biological Neuron and Links to Perceptrons
Introduction
• An Artificial Neural Network (ANN) models the relationship between a
set of input signals (features) and an output signal (y variable) using a
model derived from our understanding of how a biological brain responds
to stimuli from sensory inputs.
• Just as a brain uses a network of interconnected cells called neurons to
create a massive parallel processor, an ANN uses a network of artificial
neurons or nodes to solve learning problems.
• Before we explain ANNs, let us understand how the biological brain
works.
Neurons
[Figure: structure of a biological neuron. Ref: https://www.verywellmind.com/what-is-a-neuron-2794890]
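The Neuron
[Figure: a single artificial neuron with inputs $x_1, \dots, x_5$, weights $w_1, \dots, w_5$ and output $\hat{y}$.]
$$Z = \sum_{j=1}^{p} w_j x_j = w_1 x_1 + w_2 x_2 + \dots + w_5 x_5$$
$$\hat{y} = \begin{cases} 1 & \text{if } Z \ge \theta \\ 0 & \text{if } Z < \theta \end{cases}$$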
Simple example
• Weights, a $((p+1) \times 1)$ matrix:
$$\boldsymbol{w} = \begin{pmatrix} w_0 \\ w_1 \\ w_2 \\ w_3 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \\ -1 \\ 0.5 \end{pmatrix}$$
• Inputs, an $(n \times (p+1))$ matrix with columns $x_0, x_1, x_2, x_3$:
$$\boldsymbol{x} = \begin{pmatrix} 1 & 1 & 0 & -1 \\ 1 & 2 & 1 & 0 \\ 1 & 3 & 1 & 2 \\ 1 & 4 & 0 & -2 \\ 1 & 5 & 1 & 1 \end{pmatrix}$$
• Observed outputs, an $(n \times 1)$ matrix, and the $(n \times 1)$ predicted outputs still to be computed:
$$\boldsymbol{y} = \begin{pmatrix} 1 \\ 0 \\ 0 \\ 1 \\ 1 \end{pmatrix}, \qquad \hat{\boldsymbol{y}} = \begin{pmatrix} ? \\ ? \\ ? \\ ? \\ ? \end{pmatrix}$$
[Figure: perceptron with inputs $x_0, \dots, x_3$, weights $w_0 = 1$, $w_1 = 2$, $w_2 = -1$, $w_3 = 0.5$, net input $Z$, activation $\phi(Z)$ and output $\hat{y}$.]
• Each row of $\boldsymbol{x}$ is passed through the neuron:
$$Z = \sum_{j=0}^{p} w_j x_j, \qquad \phi(Z) = \begin{cases} 1 & \text{if } Z \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
• Row 1: $Z = 1 \cdot 1 + 1 \cdot 2 + 0 \cdot (-1) + (-1) \cdot 0.5 = 2.5$, so $\hat{y} = 1$
• Row 2: $Z = 1 \cdot 1 + 2 \cdot 2 + 1 \cdot (-1) + 0 \cdot 0.5 = 4$, so $\hat{y} = 1$
• Row 3: $Z = 1 \cdot 1 + 3 \cdot 2 + 1 \cdot (-1) + 2 \cdot 0.5 = 7$, so $\hat{y} = 1$
• Row 4: $Z = 1 \cdot 1 + 4 \cdot 2 + 0 \cdot (-1) + (-2) \cdot 0.5 = 8$, so $\hat{y} = 1$
• Row 5: $Z = 1 \cdot 1 + 5 \cdot 2 + 1 \cdot (-1) + 1 \cdot 0.5 = 9.5$, so $\hat{y} = 1$
• Predicted outputs:
$$\hat{\boldsymbol{y}} = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}$$
• It is clear that this set of weights does not achieve a good prediction. The
weights need to be updated.
Perceptron Learning Algorithm
How do we find the weights, w?
• 1) Perceptron Learning Algorithm:
– Step 0: Training begins by assigning some initial random values to the
network parameters. A good initial heuristic is to start with the average of the
positive input vectors minus the average of the negative input vectors. In many
cases this yields an initial vector near the solution region.
– Step 1: Present the input vectors to the network and apply the activation
function (FORWARD PROPAGATION).
– Step 2: Update the weights according to the following rule
(BACKWARD PROPAGATION):
$$w_j := w_j + \Delta w_j, \qquad \Delta w_j = \eta \left( y^{(i)} - \hat{y}^{(i)} \right) x_j^{(i)}$$
Here $\eta$ is the learning rate, with $0 < \eta \le 1$, and $i = 1, \dots, n$ indexes the
samples.
• Continue the iteration until the perceptron classifies all training examples
correctly.
• $:=$ denotes assignment.
Perceptron as a neural network: Going Backward
[Figure: perceptron with inputs $x_0, \dots, x_3$ (input layer), weights $w_0, \dots, w_3$, and net input $Z$, activation $\phi(Z)$ and output $\hat{y}$ (output layer).]
$$\text{error} = y - \hat{y}$$
The goal is to minimise the error.
Rate of change: $\Delta w_j = \eta \left( y^{(i)} - \hat{y}^{(i)} \right) x_j^{(i)}$
• Scenario 1: The output is correct, $y^{(i)} = 1$, $\hat{y}^{(i)} = 1$:
$$\Delta w_j = \eta (1 - 1) x_j^{(i)} = 0$$
No change is necessary.
• Scenario 2: The output is incorrect, $y^{(i)} = 1$, $\hat{y}^{(i)} = 0$:
$$\Delta w_j = \eta (1 - 0) x_j^{(i)} = \eta x_j^{(i)}$$
The weight update is proportional to the value of $x_j^{(i)}$.
• In summary: where the perceptron predicts the class label correctly, the weights
remain unchanged; where the perceptron predicts the class label incorrectly, the
weights are updated in proportion to the value of the input. The perceptron
learning algorithm selects a search direction in weight space according to the
incorrect classification of the last tested vector.
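A minimal sketch of the perceptron learning loop follows. The training data (the logical AND of two inputs, which is linearly separable), the zero initial weights and the function names are assumptions chosen for illustration; the lecture's own implementation is the R code it refers to. The update adds $\eta (y - \hat{y}) x_j$ to each weight, so a correct prediction leaves the weights unchanged.

```python
def predict(w, x):
    """Forward propagation: binary step applied to Z = sum_j w_j * x_j."""
    z = sum(wj * xj for wj, xj in zip(w, x))
    return 1 if z >= 0 else 0


def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron learning rule; stops once an epoch has no misclassification."""
    w = [0.0] * len(X[0])  # assumption: start from zero instead of random values
    for _ in range(max_epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            delta = eta * (y_i - predict(w, x_i))  # zero when the prediction is correct
            if delta != 0:
                errors += 1
                w = [wj + delta * xj for wj, xj in zip(w, x_i)]
        if errors == 0:
            break
    return w


# Hypothetical, linearly separable data: logical AND, with x0 = 1 as the bias input.
X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
y = [0, 0, 0, 1]

w = train_perceptron(X, y)
predictions = [predict(w, x) for x in X]
print("weights:", w)
print("predictions:", predictions)
```

On data that is not linearly separable the loop would simply run until `max_epochs`, which is why the epoch cap is there.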
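Perceptron Learning Algorithm with Iris Dataset
Our good old Iris Dataset
[Figure: Iris data plotted on axes $x_1$ and $x_2$ with the perceptron decision boundary.]
The decision boundary is the line
$$w_0 + w_1 x_1 + w_2 x_2 = 0$$
Solving for $x_2$ gives $x_2 = -\frac{w_0}{w_2} - \frac{w_1}{w_2} x_1$, so
Slope = $-w_1 / w_2$
Intercept = $-w_0 / w_2$
• Check the perceptron learning algorithm R code and the video.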
Multilayer Perceptrons
Linearly separable – inseparable cases
• It is important to note that the convergence of the perceptron is only
guaranteed if the two classes are linearly separable and the learning rate is
sufficiently small.
Multilayer perceptrons
• Single layer perceptrons are only capable of solving linearly separable
problems.
• In order to overcome the linearly inseparable problem, we can combine 2 or
more perceptrons, creating a multilayer perceptron.
• By joining several hyperplanes in this way, we can define a new set of
decision rules.
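Example 1
[Figure: network with 3 neurons in layer L=1, 4 neurons in layer L=2 and 2 neurons in layer L=3.]
• Between L=1 and L=2: $3 \times 4 = 12$ weights
• Between L=2 and L=3: $4 \times 2 = 8$ weights
• In total we have 12 + 8 = 20 weights to optimize
Example 2
[Figure: network diagram.]
• $4 \times 5 = 20$ weights and $5 \times 1 = 5$ weights
• In total we have 20 + 5 = 25 weights to optimize
Example 3
[Figure: network diagram.]
• $4 \times 3 = 12$ weights, $3 \times 3 = 9$ weights and $3 \times 1 = 3$ weights
• In total we have 12 + 9 + 3 = 24 weights to optimize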
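The weight counts in the three examples follow one pattern: multiply the sizes of each pair of consecutive layers and add the products. The sketch below assumes the layer sizes implied by the products on the slides (for instance 4, 5, 1 for Example 2) and counts units exactly as the slides do; `count_weights` is an illustrative name.

```python
def count_weights(layer_sizes):
    """Sum of (units in layer k) * (units in layer k+1) over consecutive layers."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


print(count_weights([3, 4, 2]))     # Example 1
print(count_weights([4, 5, 1]))     # Example 2
print(count_weights([4, 3, 3, 1]))  # Example 3
```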
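1 Input Layer, 1 Hidden Layer NN with 3 input variables
and 1 output variable (numeric output) – Going Forward
[Figure: feed-forward network. Input layer (k=1): $x_0, x_1, x_2, x_3$, with $x = a^{(1)}$. Hidden layer (k=2): bias unit $a_0^{(2)} = 1$ and two units with net inputs $Z_1^{(2)}, Z_2^{(2)}$ and activations $a_1^{(2)}, a_2^{(2)}$. Output layer (k=3): one unit with net input $Z_1^{(3)}$ and activation $a_1^{(3)} = \hat{y}$. Weights $w_{01}^{(1)}, w_{11}^{(1)}, w_{21}^{(1)}, w_{31}^{(1)}, w_{02}^{(1)}, w_{12}^{(1)}, w_{22}^{(1)}, w_{32}^{(1)}$ connect the input layer to the hidden layer; weights $w_{01}^{(2)}, w_{11}^{(2)}, w_{21}^{(2)}$ connect the hidden layer to the output layer.]
$$Z_j^{(2)} = \sum_i w_{ij}^{(1)} x_i$$
$$Z_j^{(k+1)} = \sum_i w_{ij}^{(k)} a_i^{(k)}, \qquad a_j^{(k+1)} = \phi\left( Z_j^{(k+1)} \right)$$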
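1 Input Layer, 1 Hidden Layer NN with 3 input variables and
1 output variable (categorical output – with 3 categories)
[Figure: feed-forward network. Input layer (k=1): $x_0, x_1, x_2, x_3$. Hidden layer (k=2): bias unit $a_0^{(2)} = 1$ and two units with net inputs $Z_1^{(2)}, Z_2^{(2)}$ and activations $a_1^{(2)}, a_2^{(2)}$. Output layer (k=3): three units with net inputs $Z_1^{(3)}, Z_2^{(3)}, Z_3^{(3)}$ and activations $a_1^{(3)}, a_2^{(3)}, a_3^{(3)}$, one output $\hat{y}_j$ per category. Weights $w_{01}^{(1)}, \dots, w_{32}^{(1)}$ connect the input layer to the hidden layer; weights $w_{01}^{(2)}, w_{02}^{(2)}, w_{03}^{(2)}, w_{11}^{(2)}, w_{12}^{(2)}, w_{13}^{(2)}, w_{21}^{(2)}, w_{22}^{(2)}, w_{23}^{(2)}$ connect the hidden layer to the output layer.]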
• $\boldsymbol{W}^{(k)}$: matrix of weights controlling the function mapping from layer $k$ to layer
$k + 1$ (here $k = 1, \dots, L - 1$).
• $\boldsymbol{Z}^{(k+1)}$: vector of linear combinations of weights and inputs from layer $k$:
$$Z_j^{(k+1)} = \sum_i w_{ij}^{(k)} a_i^{(k)}$$
where $a_j^{(1)} = x_j$, $a_0^{(k)} = 1$ (acts as a bias) and $a^{(L)} = \hat{y}$ (in a regression problem
there is a single output).
• $a_j^{(k)}$: activation of unit $j$ in layer $k$ with a pre-specified activation
function (here $j = 0, \dots, P^{(k)}$, where $P^{(k)}$ is specific to the layer):
$$a_j^{(k+1)} = \phi\left( Z_j^{(k+1)} \right)$$
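The layer-by-layer notation can be traced with a small forward pass for the network with 3 inputs, 2 hidden units and 1 numeric output. This is a sketch under assumptions: the hidden layer uses a sigmoid activation, the output unit is linear, and all weight values are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, W2):
    """Forward propagation through one hidden layer.

    x  : list of p inputs (without the bias term)
    W1 : W1[i][j] is the weight from input unit i (i = 0 is the bias) to hidden unit j
    W2 : W2[i] is the weight from hidden unit i (i = 0 is the bias) to the output unit
    """
    a1 = [1.0] + list(x)  # a^(1) = x, with a_0 = 1 acting as the bias
    n_hidden = len(W1[0])
    z2 = [sum(W1[i][j] * a1[i] for i in range(len(a1))) for j in range(n_hidden)]
    a2 = [1.0] + [sigmoid(z) for z in z2]  # a^(2), again with a bias unit
    z3 = sum(W2[i] * a2[i] for i in range(len(a2)))
    return z3  # assumption: linear output unit, so y_hat = Z^(3)


# Hypothetical weights: 4 rows (bias + 3 inputs), 2 columns (hidden units).
W1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2], [0.05, -0.1]]
W2 = [0.5, 1.0, -1.0]

print(forward([1.0, 2.0, 3.0], W1, W2))
```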
Activation Functions
• In perceptrons, a small change in the weights or bias of any single perceptron
in the network can sometimes cause the output of that perceptron to
completely flip, say from 0 to 1. That flip may then cause the behaviour of
the rest of the network to change completely in some very complicated way.
• There are several different activation functions:
– Step function
– Constant function
– Threshold function (step)
– Threshold function (ramp)
– Linear function
– Sigmoid function
– Hyperbolic Tangent function
In each definition below, $Z = \sum_{j=0}^{p} w_j x_j$.
Activation Function – Step Function (Symmetric)
$$a = \phi(Z) = \begin{cases} 1 & \text{if } Z \ge 0 \\ -1 & \text{otherwise} \end{cases}$$
Activation Function – Step Function (Binary)
$$a = \phi(Z) = \begin{cases} 1 & \text{if } Z \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
Activation Function – Linear Function
$$a = \phi(Z) = Z$$
Activation Function – Semilinear Function
$$a = \phi(Z) = \begin{cases} 1 & \text{if } Z > 1 \\ Z & \text{if } -1 \le Z \le 1 \\ -1 & \text{if } Z < -1 \end{cases}$$
Activation Function – Sigmoid Function
$$a = \phi(Z) = \frac{1}{1 + \exp(-\alpha Z)}$$
Here $\alpha$ controls the steepness of the curve; $\alpha = 1$ gives the standard sigmoid.
We will focus on the sigmoid activation function for the moment.
Activation Function – Hyperbolic Tangent (Tanh) Function
$$a = \phi(Z) = \frac{\exp(Z) - \exp(-Z)}{\exp(Z) + \exp(-Z)}$$
Activation Function – ReLU Function (Rectified Linear Unit)
• Not differentiable at 0, but differentiable everywhere else. At zero, the
derivative can be taken to be either 0 or 1.
$$a = \phi(Z) = \begin{cases} Z & \text{if } Z > 0 \\ 0 & \text{otherwise} \end{cases}$$
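The activation functions above translate directly into code. The sketch below uses only the standard library; the function names are illustrative, and `alpha` in `sigmoid` is the steepness parameter from the formula, defaulting to 1.

```python
import math


def step_symmetric(z):
    return 1.0 if z >= 0 else -1.0


def step_binary(z):
    return 1.0 if z >= 0 else 0.0


def linear(z):
    return z


def semilinear(z):
    """Identity on [-1, 1], clipped to -1 below and to 1 above."""
    return max(-1.0, min(1.0, z))


def sigmoid(z, alpha=1.0):
    return 1.0 / (1.0 + math.exp(-alpha * z))


def tanh(z):
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))


def relu(z):
    return z if z > 0 else 0.0


for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(z, step_binary(z), semilinear(z), round(sigmoid(z), 4), round(tanh(z), 4), relu(z))
```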
How do we find the weights, w?
• One way of attacking the problem is to use calculus to try
to find the minimum analytically.
• We could compute derivatives and then try using them to
find places where C is an extremum. With some luck that
might work when C is a function of just one or a few
variables.
• But it'll turn into a nightmare when we have many more
variables.
• And for neural networks we'll often want far more
variables - the biggest neural networks have cost functions
which depend on billions of weights and biases in an
extremely complicated way.
• Using calculus to minimize that just won't work!
How do we find the weights, w? – Going Backward
• 2) Gradient Descent Algorithm:
– Step 0: Training begins by assigning some initial random values to the network
parameters.
– Step 1: Present the input vectors to the network and apply the activation function
(FORWARD PROPAGATION).
– Step 2: Calculate the error using a cost function:
$$J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$$
for regression problems (without any regularization)
$$J(w) = -\frac{1}{2n} \sum_{i=1}^{n} \left[ y^{(i)} \log \hat{y}^{(i)} + \left( 1 - y^{(i)} \right) \log \left( 1 - \hat{y}^{(i)} \right) \right]$$
for classification problems (without any regularization)
Our goal is to minimise the error $J(w)$ with respect to $w_j$.
– Step 3: Update the weights according to the following rule:
$$w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$$
Here $\alpha$ is the learning rate and $0 < \alpha \le 1$.
• Continue the iteration until convergence.
How do we find the weights, w? – Going Backward
2) Gradient Descent Algorithm:
• Let us examine the weight update function:
$$w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$$
• The partial derivative answers the question "What is the slope of $J(w)$ at the
point $w$?"
• $\alpha$ determines the size of each change. If $\alpha$ is too small, we take small
steps towards the optimal values and it takes too long to reach the optimum.
• If $\alpha$ is too big, we may overshoot the optimal values and fail to converge.
• Let us have a look at these concepts with a small example, using one feature to
estimate a numeric variable:
x: (size) c(0, 1, 2, 3)
y: (price) c(0, 2, 4, 6)
Batch, Stochastic Gradient Descent
• In batch gradient descent learning, the weight update is
calculated based on all samples in the training set (instead of
updating the weights incrementally after each sample),
which is why this approach is referred to as "batch"
gradient descent.
• It can be written with vector–matrix operations.
• Now imagine we have a very large dataset with millions of data points,
which is not uncommon in many machine learning applications. Running
batch gradient descent can be computationally quite costly in such scenarios
since we need to re-evaluate the whole training dataset each time we take
one step towards the global minimum.
• A popular alternative to the batch gradient descent algorithm is stochastic
gradient descent, sometimes also called iterative or on-line gradient descent.
Instead of updating the weights based on the sum of the accumulated errors
over all samples, we update the weights incrementally for each training
sample.
for (i in 1:n) {
  # assumes x (n x (p+1) matrix), y, w and alpha exist; linear output, squared error
  error <- y[i] - sum(w * x[i, ])   # calculate error for sample i
  grad  <- -error * x[i, ]          # calculate derivatives
  w     <- w - alpha * grad         # update weights
}
• A compromise between batch gradient descent and stochastic
gradient descent is the so-called mini-batch learning. In
mini-batch learning, a neural network learns from a small
subset of the training inputs at a time, rather than from one
input or from the whole training set.
• Mini-batch learning can be understood as applying batch
gradient descent to smaller subsets of the training data, for
example, 50 samples at a time.
• By averaging over this small sample it turns out that we
can quickly get a good estimate of the true gradient,
and this helps speed up gradient descent, and thus
learning.
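The three variants differ only in how many samples feed each weight update, so one function with a `batch_size` argument can sketch all of them: `batch_size=1` behaves as stochastic gradient descent and `batch_size=n` as batch gradient descent. The sketch assumes a linear model $\hat{y} = w_0 + w_1 x$ with the squared-error cost and uses the small size/price data from the lecture; the function name, learning rate, epoch count and seed are illustrative choices.

```python
import random

x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 2.0, 4.0, 6.0]


def minibatch_gd(x, y, batch_size, alpha=0.05, epochs=2000, seed=0):
    """Gradient descent on y_hat = w0 + w1 * x using batches of the given size."""
    rng = random.Random(seed)
    w0, w1 = 0.0, 0.0
    indices = list(range(len(x)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            residuals = [y[i] - (w0 + w1 * x[i]) for i in batch]
            # Gradient of the squared-error cost, averaged over the batch only.
            g0 = -sum(residuals) / len(batch)
            g1 = -sum(r * x[i] for r, i in zip(residuals, batch)) / len(batch)
            w0 -= alpha * g0
            w1 -= alpha * g1
    return w0, w1


for size in (1, 2, len(x)):
    print(size, minibatch_gd(x, y, batch_size=size))
```

On this noise-free data all three settings approach the same weights; on real data the smaller batches trade a noisier gradient for cheaper updates.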
Final note on Gradient Descent:
Input variable preprocessing
• Gradient descent is one of the many algorithms that benefit from feature
scaling. Each input variable should be preprocessed so that its mean value,
averaged over the entire training sample, is close to zero, or else is
small compared to its standard deviation.
• In order to accelerate the back-propagation learning process, the
normalization of the inputs should also include two other measures (LeCun,
1993):
– The input variables contained in the training set should be uncorrelated; this
can be done by using principal-components analysis (USL).
– The decorrelated input variables should be scaled so that their covariances
are approximately equal, thereby ensuring that the different synaptic weights
in the network learn at approximately the same speed.
• We will use a feature scaling method called standardization, which centers
each input feature at mean 0 and scales it to a standard deviation of 1, the
mean and standard deviation of a standard normal distribution. The shape of
the distribution itself is not changed.
$$x_j^{std} = \frac{x_j - \bar{x}_j}{s_j}$$
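As a quick illustration of the standardization formula, the sketch below rescales one feature column with the standard library. It assumes $s_j$ is the sample standard deviation (divisor $n - 1$), which is what `statistics.stdev` computes; `standardize` is an illustrative name.

```python
import statistics


def standardize(column):
    """Return (x - mean) / s for every value in one feature column."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)  # sample standard deviation s_j
    return [(value - mean) / sd for value in column]


sizes = [0.0, 1.0, 2.0, 3.0]
scaled = standardize(sizes)
print(scaled)
print(statistics.mean(scaled), statistics.stdev(scaled))
```

The mean and standard deviation should be computed on the training data only and then reused to transform any validation or test data.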
Hypothetical Data – The exact function is $\hat{y} = 0 + 2x$
> dat
  x y
1 0 0
2 1 2
3 2 4
4 3 6
1 input layer, 1 output layer
[Figure: network with inputs $x_0$, $x_1$ (input layer), weights $w_0$, $w_1$, and a single output unit with net input $Z$, activation $\phi(Z)$ and output $\hat{y}$ (output layer).]
$$Z = w_0 x_0 + w_1 x_1, \qquad \phi(Z) = Z, \qquad \text{error} = y - \hat{y}$$
For: $w_0 = 0$, $w_1 = 0$, $J(0, 0) = 7$
> dat
  x y yhat=0+0*x      e = y - yhat
1 0 0 0+0*0 = 0       e1 = 0
2 1 2 0+0*1 = 0       e2 = 2
3 2 4 0+0*2 = 0       e3 = 4
4 3 6 0+0*3 = 0       e4 = 6
$$J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 = \frac{1}{2 \cdot 4} \left( 0^2 + 2^2 + 4^2 + 6^2 \right) = 7$$
For: $w_0 = 0$, $w_1 = 0.5$, $J(0, 0.5) = 3.9375$
> dat
  x y yhat=0+0.5*x    e = y - yhat
1 0 0 0+0.5*0 = 0     e1 = 0
2 1 2 0+0.5*1 = 0.5   e2 = 1.5
3 2 4 0+0.5*2 = 1     e3 = 3
4 3 6 0+0.5*3 = 1.5   e4 = 4.5
$$J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 = \frac{1}{2 \cdot 4} \left( 0^2 + 1.5^2 + 3^2 + 4.5^2 \right) = 3.9375$$
For: $w_0 = 0$, $w_1 = 1$, $J(0, 1) = 1.75$
> dat
  x y yhat=0+1*x      e = y - yhat
1 0 0 0+1*0 = 0       e1 = 0
2 1 2 0+1*1 = 1       e2 = 1
3 2 4 0+1*2 = 2       e3 = 2
4 3 6 0+1*3 = 3       e4 = 3
$$J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 = \frac{1}{2 \cdot 4} \left( 0^2 + 1^2 + 2^2 + 3^2 \right) = 1.75$$
For: $w_0 = 0$, $w_1 = 2$, $J(0, 2) = 0$
> dat
  x y yhat=0+2*x      e = y - yhat
1 0 0 0+2*0 = 0       e1 = 0
2 1 2 0+2*1 = 2       e2 = 0
3 2 4 0+2*2 = 4       e3 = 0
4 3 6 0+2*3 = 6       e4 = 0
$$J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 = \frac{1}{2 \cdot 4} \left( 0^2 + 0^2 + 0^2 + 0^2 \right) = 0$$
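The cost for any pair of weights can be checked by evaluating the regression cost function directly on the four data points. In this sketch the data and the $\frac{1}{2n}$ cost come from the lecture, while the function name `cost` and the choice of trial weights are illustrative.

```python
x = [0, 1, 2, 3]
y = [0, 2, 4, 6]


def cost(w0, w1):
    """J(w) = 1/(2n) * sum of squared errors for y_hat = w0 + w1 * x."""
    n = len(x)
    return sum((yi - (w0 + w1 * xi)) ** 2 for xi, yi in zip(x, y)) / (2 * n)


for w1 in (0, 0.5, 1, 2):
    print("w1 =", w1, "J =", cost(0, w1))
```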
Cost function with respect to w1
[Figure: $J(w)$ plotted against $w_1$; successive slides mark the current value of $w_1$ and the slope of the curve at that point.]
At every step the weight is updated with
$$w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$$
Approaching the minimum from the left:
• $w_1 = 0$: $\frac{\partial}{\partial w_j} J(w) < 0$
• $w_1 = 0.5$: $\frac{\partial}{\partial w_j} J(w) < 0$
• $w_1 = 1.25$: $\frac{\partial}{\partial w_j} J(w) < 0$
• $w_1 = 2$: $\frac{\partial}{\partial w_j} J(w) = 0$, the minimum
Approaching the minimum from the right:
• $w_1 = 2.75$: $\frac{\partial}{\partial w_j} J(w) > 0$
• $w_1 = 2.5$: $\frac{\partial}{\partial w_j} J(w) > 0$
• $w_1 = 2$: $\frac{\partial}{\partial w_j} J(w) = 0$, the minimum
Cost function with respect to w1, one w parameter to optimize
[Figure: gradient descent on a cost function of a single weight. Ref: Raschka, p. 35]
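Example 1
Gradient Descent in a very simple example:
1 input layer, 1 output layer, 1 x, 1 numeric y
• Consider the linear regression example:
$$J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 \quad \text{and} \quad \hat{y}^{(i)} = w_0 x_0^{(i)} + w_1 x_1^{(i)}$$
• We have 2 weights, so $\frac{\partial}{\partial w_0} J(w)$ and $\frac{\partial}{\partial w_1} J(w)$ need to be calculated:
$$\nabla J = \left( \frac{\partial}{\partial w_0} J(w), \; \frac{\partial}{\partial w_1} J(w) \right), \qquad w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$$
• Applying the chain rule:
$$\frac{\partial}{\partial w_0} J(w) = \frac{-2}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) \frac{\partial}{\partial w_0} \hat{y}^{(i)}(w) = \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) x_0^{(i)}$$
• and
$$\frac{\partial}{\partial w_1} J(w) = \frac{-2}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) \frac{\partial}{\partial w_1} \hat{y}^{(i)}(w) = \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) x_1^{(i)}$$
• Substituting $\hat{y}^{(i)}$:
$$\frac{\partial}{\partial w_0} J(w) = \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - w_0 x_0^{(i)} - w_1 x_1^{(i)} \right) x_0^{(i)}$$
$$\frac{\partial}{\partial w_1} J(w) = \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - w_0 x_0^{(i)} - w_1 x_1^{(i)} \right) x_1^{(i)}$$
• In the update rule $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$:
– $\frac{\partial}{\partial w_j} J(w) < 0$ means we will increase the weight
– $\frac{\partial}{\partial w_j} J(w) > 0$ means we will decrease the weight
– $\frac{\partial}{\partial w_j} J(w) = 0$ means we will not change the weight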
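The two partial derivatives and the update rule are enough to run batch gradient descent on the small size/price data. This is a sketch rather than the lecture's implementation: the starting weights of zero, the learning rate `alpha=0.1` and the iteration count are assumptions, and $x_0 = 1$ is written out explicitly to mirror the formulas.

```python
x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 2.0, 4.0, 6.0]


def gradients(w0, w1):
    """Partial derivatives of J(w) = 1/(2n) * sum (y - y_hat)^2 for y_hat = w0*x0 + w1*x1."""
    n = len(x)
    residuals = [yi - (w0 * 1.0 + w1 * xi) for xi, yi in zip(x, y)]
    g0 = -sum(r * 1.0 for r in residuals) / n             # x0 = 1 for every sample
    g1 = -sum(r * xi for r, xi in zip(residuals, x)) / n
    return g0, g1


def gradient_descent(alpha=0.1, iterations=2000):
    w0, w1 = 0.0, 0.0
    for _ in range(iterations):
        g0, g1 = gradients(w0, w1)
        # A negative derivative increases the weight, a positive one decreases it.
        w0, w1 = w0 - alpha * g0, w1 - alpha * g1
    return w0, w1


print(gradients(0.0, 0.0))
print(gradient_descent())
```

Both weights are updated simultaneously from gradients evaluated at the same point, which is what the tuple assignment in the loop ensures.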
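Example 2
Gradient Descent in a very simple example:
1 input layer, 1 hidden layer (2 neurons + bias), 1 output
layer, 1 x, 1 numeric y
[Figure: network with inputs $x_0 = 1$ and $x_1$ ($x = a^{(1)}$); hidden layer with bias unit $a_0^{(2)} = 1$ and two units with net inputs $Z_1^{(2)}, Z_2^{(2)}$ and activations $a_1^{(2)}, a_2^{(2)}$; one output unit with $Z_1^{(3)}$ and $a_1^{(3)} = \hat{y}$. Weights $w_{01}^{(1)}, w_{11}^{(1)}, w_{02}^{(1)}, w_{12}^{(1)}$ connect the input layer to the hidden layer; weights $w_{01}^{(2)}, w_{11}^{(2)}, w_{21}^{(2)}$ connect the hidden layer to the output layer.]
$$Z_j^{(2)} = \sum_i w_{ij}^{(1)} x_i, \qquad a_j^{(2)} = \phi\left( Z_j^{(2)} \right) = \frac{1}{1 + \exp\left( -Z_j^{(2)} \right)}$$
$$\hat{y} = a_1^{(3)} = Z_1^{(3)} = \sum_j w_{j1}^{(2)} a_j^{(2)}$$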
• Cost function:
$$J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$$
• Its derivative with respect to any weight $w$:
$$\frac{\partial}{\partial w} J(w) = \frac{-2}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) \frac{\partial}{\partial w} \hat{y}^{(i)}(w)$$
• The prediction is
$$\hat{y}^{(i)} = a_0^{(2)} w_{01}^{(2)} + a_1^{(2)} w_{11}^{(2)} + a_2^{(2)} w_{21}^{(2)} \quad \text{and} \quad a_0^{(2)} = 1$$
Output layer weights:
$$\frac{\partial}{\partial w_{01}^{(2)}} \hat{y}^{(i)}(w) = a_0^{(2)} = 1, \qquad \frac{\partial}{\partial w_{11}^{(2)}} \hat{y}^{(i)}(w) = a_1^{(2)}, \qquad \frac{\partial}{\partial w_{21}^{(2)}} \hat{y}^{(i)}(w) = a_2^{(2)}$$
Hidden layer weights:
$$\frac{\partial}{\partial w_{01}^{(1)}} \hat{y}^{(i)}(w) = \; ?, \qquad \frac{\partial}{\partial w_{02}^{(1)}} \hat{y}^{(i)}(w) = \; ?, \qquad \frac{\partial}{\partial w_{11}^{(1)}} \hat{y}^{(i)}(w) = \; ?, \qquad \frac{\partial}{\partial w_{12}^{(1)}} \hat{y}^{(i)}(w) = \; ?$$
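Update the weights from the hidden layer to the
output layer
• $w_{01}^{(2)} := w_{01}^{(2)} - \alpha \left[ \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) \right]$
• $w_{11}^{(2)} := w_{11}^{(2)} - \alpha \left[ \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) a_1^{(2)} \right]$
• $w_{21}^{(2)} := w_{21}^{(2)} - \alpha \left[ \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) a_2^{(2)} \right]$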
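• $\hat{y}^{(i)} = 1 \cdot w_{01}^{(2)} + a_1^{(2)} w_{11}^{(2)} + a_2^{(2)} w_{21}^{(2)}$ and $a_0^{(2)} = 1$
• $a_1^{(2)} = \frac{1}{1 + \exp\left( -Z_1^{(2)} \right)}$ and $a_2^{(2)} = \frac{1}{1 + \exp\left( -Z_2^{(2)} \right)}$
• $Z_1^{(2)} = w_{01}^{(1)} \cdot 1 + w_{11}^{(1)} x_1$ and $Z_2^{(2)} = w_{02}^{(1)} \cdot 1 + w_{12}^{(1)} x_1$
Output layer weights:
$$\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{01}^{(2)}} = 1, \qquad \frac{\partial \hat{y}^{(i)}(w)}{\partial w_{11}^{(2)}} = a_1^{(2)}, \qquad \frac{\partial \hat{y}^{(i)}(w)}{\partial w_{21}^{(2)}} = a_2^{(2)}$$
Hidden layer weights (chain rule, using the sigmoid derivative $\frac{\partial a}{\partial Z} = a (1 - a)$):
$$\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{01}^{(1)}} = \frac{\partial \hat{y}^{(i)}(w)}{\partial a_1^{(2)}} \frac{\partial a_1^{(2)}}{\partial Z_1^{(2)}} \frac{\partial Z_1^{(2)}}{\partial w_{01}^{(1)}} = w_{11}^{(2)} \, a_1^{(2)} \left( 1 - a_1^{(2)} \right) \cdot 1$$
$$\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{11}^{(1)}} = \frac{\partial \hat{y}^{(i)}(w)}{\partial a_1^{(2)}} \frac{\partial a_1^{(2)}}{\partial Z_1^{(2)}} \frac{\partial Z_1^{(2)}}{\partial w_{11}^{(1)}} = w_{11}^{(2)} \, a_1^{(2)} \left( 1 - a_1^{(2)} \right) x_1$$
$$\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{02}^{(1)}} = \frac{\partial \hat{y}^{(i)}(w)}{\partial a_2^{(2)}} \frac{\partial a_2^{(2)}}{\partial Z_2^{(2)}} \frac{\partial Z_2^{(2)}}{\partial w_{02}^{(1)}} = w_{21}^{(2)} \, a_2^{(2)} \left( 1 - a_2^{(2)} \right) \cdot 1$$
$$\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{12}^{(1)}} = \frac{\partial \hat{y}^{(i)}(w)}{\partial a_2^{(2)}} \frac{\partial a_2^{(2)}}{\partial Z_2^{(2)}} \frac{\partial Z_2^{(2)}}{\partial w_{12}^{(1)}} = w_{21}^{(2)} \, a_2^{(2)} \left( 1 - a_2^{(2)} \right) x_1$$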
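Update the weights from the input layer to the
hidden layer
• $w_{01}^{(1)} := w_{01}^{(1)} - \alpha \left[ \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) w_{11}^{(2)} \, a_1^{(2)} \left( 1 - a_1^{(2)} \right) \right]$
• $w_{11}^{(1)} := w_{11}^{(1)} - \alpha \left[ \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) w_{11}^{(2)} \, a_1^{(2)} \left( 1 - a_1^{(2)} \right) x_1 \right]$
• $w_{02}^{(1)} := w_{02}^{(1)} - \alpha \left[ \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) w_{21}^{(2)} \, a_2^{(2)} \left( 1 - a_2^{(2)} \right) \right]$
• $w_{12}^{(1)} := w_{12}^{(1)} - \alpha \left[ \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) w_{21}^{(2)} \, a_2^{(2)} \left( 1 - a_2^{(2)} \right) x_1 \right]$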
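The forward and backward passes for this small network (1 input, 2 sigmoid hidden units plus bias, 1 linear output) fit in a short sketch. The starting weights, the learning rate and the data are assumptions for illustration; `W1[i][j]` stands for $w_{ij}^{(1)}$ and `W2[i]` for $w_{i1}^{(2)}$. The gradients follow the chain-rule expressions term by term, with the common factor $\frac{-1}{n} (y^{(i)} - \hat{y}^{(i)})$ pulled out.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x1, W1, W2):
    """Return hidden activations a1, a2 and the prediction y_hat for one sample."""
    a1 = sigmoid(W1[0][0] + W1[1][0] * x1)   # Z_1 = w01 * 1 + w11 * x1
    a2 = sigmoid(W1[0][1] + W1[1][1] * x1)   # Z_2 = w02 * 1 + w12 * x1
    y_hat = W2[0] + W2[1] * a1 + W2[2] * a2  # linear output unit
    return a1, a2, y_hat


def cost(x, y, W1, W2):
    n = len(x)
    return sum((yi - forward(xi, W1, W2)[2]) ** 2 for xi, yi in zip(x, y)) / (2 * n)


def gradients(x, y, W1, W2):
    """Partial derivatives of J with respect to every weight."""
    n = len(x)
    gW1 = [[0.0, 0.0], [0.0, 0.0]]
    gW2 = [0.0, 0.0, 0.0]
    for xi, yi in zip(x, y):
        a1, a2, y_hat = forward(xi, W1, W2)
        common = -(yi - y_hat) / n
        # Output layer: d y_hat / d w = 1, a1, a2.
        gW2[0] += common
        gW2[1] += common * a1
        gW2[2] += common * a2
        # Hidden layer: d y_hat / d w = w(2) * a * (1 - a) * (1 or x1).
        d1 = W2[1] * a1 * (1.0 - a1)
        d2 = W2[2] * a2 * (1.0 - a2)
        gW1[0][0] += common * d1
        gW1[1][0] += common * d1 * xi
        gW1[0][1] += common * d2
        gW1[1][1] += common * d2 * xi
    return gW1, gW2


def update(x, y, W1, W2, alpha):
    """One gradient descent step on all seven weights."""
    gW1, gW2 = gradients(x, y, W1, W2)
    new_W1 = [[W1[i][j] - alpha * gW1[i][j] for j in range(2)] for i in range(2)]
    new_W2 = [W2[i] - alpha * gW2[i] for i in range(3)]
    return new_W1, new_W2


x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 2.0, 4.0, 6.0]
W1 = [[0.1, -0.2], [0.3, 0.4]]  # hypothetical starting weights
W2 = [0.5, -0.3, 0.8]

before = cost(x, y, W1, W2)
W1, W2 = update(x, y, W1, W2, alpha=0.05)
after = cost(x, y, W1, W2)
print(before, after)
```

A common sanity check for hand-derived gradients like these is to compare them with finite differences of the cost, perturbing one weight at a time.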
𝛼 Learning Rate
What does $\alpha$ do?
• If $\alpha$ is too small, the change in the weights at each step will be tiny.
It will take too long to reach the optimum solution.
• If $\alpha$ is too big, the change in the weights at each step will be very
large. We may never find the optimum solution; the algorithm may fail to
converge.
$\alpha$, adaptive learning
• In stochastic gradient descent implementations, the fixed learning rate $\alpha$ is
often replaced by an adaptive learning rate that decreases over time, for
example,
$$\frac{C_1}{\#\text{iterations} + C_2}$$
where $C_1$ and $C_2$ are constants.
• Note that stochastic gradient descent does not reach the global minimum but
an area very close to it. By using an adaptive learning rate, we can achieve
further annealing towards the global minimum.
• Ref: Raschka, p. 40
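The decaying schedule above is a one-line function. The constants `c1` and `c2` below are arbitrary illustrative values, not ones given in the lecture.

```python
def adaptive_learning_rate(iteration, c1=1.0, c2=10.0):
    """Learning rate C1 / (#iterations + C2), which shrinks as training proceeds."""
    return c1 / (iteration + c2)


for iteration in (0, 10, 90, 990):
    print(iteration, adaptive_learning_rate(iteration))
```

A larger `c2` keeps the early steps smaller, while `c1` scales the whole schedule.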
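What if both $w_0$ and $w_1$ change together?
[Figure: left pane, 3D surface plot of the cost $J(w_0, w_1)$ over the axes $w_0$ and $w_1$; right pane, contour plot over the axes $w_0$ and $w_1$.]
• The cost function $J(w_0, w_1)$ will be a 3D surface plot (left pane)
• In the contour plot, all points on the same contour have the same cost (right pane)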
Controlling the Complexity of NNs
1) Regularization in Neural Networks
• In multi-layer neural networks, the number of input and output units is
generally determined by the dimensionality of the data set.
• On the other hand, we are free to choose the number of hidden layer units (M).
We may typically have hundreds, thousands, or even billions of weights
that we need to optimize.
• Choose the number of hidden layer units (M) that gives the best
generalization performance, balancing underfitting against
overfitting.
• A network is said to generalize well when the input–output mapping
computed by the network is correct (or nearly so) for test data never used in
creating or training the network. Here, it is assumed that the test data are
drawn from the same population used to generate the training data.
• The generalization error, however, is not a simple function of the number of
hidden layer units (M) due to the presence of local minima in the error
function.
• Because training starts from random values of the weight vector, each hidden
layer size considered can be trained several times, which shows the effect of
multiple random initializations across a range of values of M.
• In practice, one approach to choosing M is to plot M against the validation
errors and then choose the specific solution having the smallest
validation set error.
• The unregularized cost function is
$$J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$$
• There are, however, other ways to control the complexity of a neural
network model in order to avoid over-fitting, such as adding a quadratic
regularizer (L2):
$$\tilde{J}(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 + \frac{\lambda}{2} \lVert w \rVert^2$$
• This regularizer is also known as weight decay.
• The effective model complexity is then determined by the choice of the
regularization coefficient $\lambda$.
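To see how the quadratic penalty changes the cost, the sketch below evaluates the regularized cost for a linear model on the small size/price data. It assumes, as the formula is written, that every weight including $w_0$ is penalized; many implementations leave the bias weight out of the penalty. The function name and the value of `lam` are illustrative.

```python
def regularized_cost(w, X, y, lam):
    """Squared-error cost 1/(2n) * sum (y - y_hat)^2 plus (lambda / 2) * ||w||^2."""
    n = len(X)
    data_term = sum(
        (yi - sum(wj * xj for wj, xj in zip(w, row))) ** 2 for row, yi in zip(X, y)
    ) / (2 * n)
    penalty = lam / 2 * sum(wj ** 2 for wj in w)
    return data_term + penalty


X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]  # first column is x0 = 1
y = [0.0, 2.0, 4.0, 6.0]

print(regularized_cost([0.0, 2.0], X, y, lam=0.0))  # perfect fit, no penalty
print(regularized_cost([0.0, 2.0], X, y, lam=0.1))  # same fit, penalized weights
print(regularized_cost([0.0, 1.0], X, y, lam=0.1))  # worse fit, smaller weights
```

With `lam > 0` the weights that fit the data exactly are no longer free, so the minimizer is pulled towards smaller weights.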
2) Early Stopping in NNs
• An alternative to regularization as a way of controlling the effective
complexity of a network is the procedure of early stopping.
• The training of nonlinear network models corresponds to an iterative
reduction of the error function defined with respect to a set of training data.
• The error measured with respect to independent data, generally called a
validation set, often shows a decrease at first, followed by an increase as
the network starts to over-fit. Training can therefore be stopped at the point
of smallest error with respect to the validation data set, in order to obtain a
network having good generalization performance.
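The stopping decision can be sketched independently of any particular network: given the validation error after each epoch, keep the best epoch seen so far and stop once the error has failed to improve for a few epochs in a row. The `patience` parameter and the error values below are assumptions added for illustration; the lecture itself only states that training stops at the smallest validation error.

```python
def early_stopping_epoch(validation_errors, patience=3):
    """Return (best_epoch, best_error), stopping after `patience` epochs without improvement."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # validation error has stopped improving
    return best_epoch, best_error


# Hypothetical validation curve: falls at first, then rises as the network over-fits.
errors = [0.90, 0.70, 0.50, 0.45, 0.47, 0.50, 0.60, 0.40]
print(early_stopping_epoch(errors))
```

In practice the weights from the best epoch are saved during training so that they can be restored once the loop stops.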
When to use NNs
• When dealing with unstructured datasets
• When you do not need interpretable results, for example when you just
want to classify your pictures as cats or dogs, you don't need to
know why the outcome is classified as a cat or a dog. You don't need to
explain the relationships.
• When you have many features, with regularization
• When you have nonlinear relationships
Resources
• A free online book by Michael Nielsen (brilliant resource for partial
derivatives and gradient descent):
http://neuralnetworksanddeeplearning.com/
• The Elements of Statistical Learning, Trevor Hastie et al. (p. 389)
• Pattern Recognition and Machine Learning, Christopher Bishop
(p. 227)
• Machine Learning with R, Brett Lantz (p. 219)
• Neural Networks – A Comprehensive Foundation, Simon S. Haykin
• Python Machine Learning, Sebastian Raschka (p. 17)
• Neural Network Design, Hagan, Demuth, Beale, De Jesus
(http://hagan.okstate.edu/nnd.html)
• https://github.com/stephencwelch/Neural-Networks-Demystified
• Of course again Prof. Patrick Henry Winston's MIT YouTube lectures.

Earth Day Presentation wow hello nice greatYousafMalik24
Β 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.arsicmarija21
Β 
Planning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptxPlanning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptxLigayaBacuel1
Β 

Recently uploaded (20)

TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdfTataKelola dan KamSiber Kecerdasan Buatan v022.pdf
TataKelola dan KamSiber Kecerdasan Buatan v022.pdf
Β 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
Β 
Atmosphere science 7 quarter 4 .........
Atmosphere science 7 quarter 4 .........Atmosphere science 7 quarter 4 .........
Atmosphere science 7 quarter 4 .........
Β 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
Β 
Rapple "Scholarly Communications and the Sustainable Development Goals"
Rapple "Scholarly Communications and the Sustainable Development Goals"Rapple "Scholarly Communications and the Sustainable Development Goals"
Rapple "Scholarly Communications and the Sustainable Development Goals"
Β 
How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17How to Configure Email Server in Odoo 17
How to Configure Email Server in Odoo 17
Β 
Types of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptxTypes of Journalistic Writing Grade 8.pptx
Types of Journalistic Writing Grade 8.pptx
Β 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
Β 
Hierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of managementHierarchy of management that covers different levels of management
Hierarchy of management that covers different levels of management
Β 
9953330565 Low Rate Call Girls In Rohini Delhi NCR
9953330565 Low Rate Call Girls In Rohini  Delhi NCR9953330565 Low Rate Call Girls In Rohini  Delhi NCR
9953330565 Low Rate Call Girls In Rohini Delhi NCR
Β 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
Β 
Crayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon ACrayon Activity Handout For the Crayon A
Crayon Activity Handout For the Crayon A
Β 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
Β 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Β 
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptxEPANDING THE CONTENT OF AN OUTLINE using notes.pptx
EPANDING THE CONTENT OF AN OUTLINE using notes.pptx
Β 
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdfLike-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Like-prefer-love -hate+verb+ing & silent letters & citizenship text.pdf
Β 
Model Call Girl in Bikash Puri Delhi reach out to us at πŸ”9953056974πŸ”
Model Call Girl in Bikash Puri  Delhi reach out to us at πŸ”9953056974πŸ”Model Call Girl in Bikash Puri  Delhi reach out to us at πŸ”9953056974πŸ”
Model Call Girl in Bikash Puri Delhi reach out to us at πŸ”9953056974πŸ”
Β 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
Β 
AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.AmericanHighSchoolsprezentacijaoskolama.
AmericanHighSchoolsprezentacijaoskolama.
Β 
Planning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptxPlanning a health career 4th Quarter.pptx
Planning a health career 4th Quarter.pptx
Β 

Neural Networks

  • 2. Outline
      The objective of this part of the Supervised Learning lectures is to gain an understanding of:
      • the background of ANNs
      • how ANNs work for regression and classification problems
      • the perceptron learning algorithm
      • the gradient descent algorithm
      • the stochastic gradient descent algorithm
      • how to analyze datasets with ANNs in R
      • how to interpret the results
  • 3. Resources
      • Page 389, Chapter 11, Neural Networks
  • 5. Introduction
      • An Artificial Neural Network (ANN) models the relationship between a set of input signals (features) and an output signal (the y variable) using a model derived from our understanding of how a biological brain responds to stimuli from sensory inputs.
      • Just as a brain uses a network of interconnected cells called neurons to form a massively parallel processor, an ANN uses a network of artificial neurons, or nodes, to solve learning problems.
      • Before we explain ANNs, let us look at how the biological brain works.
  • 7. The Neuron
      • Inputs $x_1, \dots, x_5$ enter with weights $w_1, \dots, w_5$ and are summed: $Z = \sum_{j=1}^{p} w_j x_j = w_1 x_1 + w_2 x_2 + \dots + w_5 x_5$
      • Output: $\hat{y} = 1$ if $Z \ge \theta$, and $\hat{y} = 0$ if $Z < \theta$
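  • 12. Simple example
      • $\mathbf{w} = (w_0, w_1, w_2, w_3)^T = (1, 2, -1, 0.5)^T$, the $(p+1) \times 1$ weights matrix
      • $\mathbf{x} = \begin{pmatrix} 1 & 1 & 0 & -1 \\ 1 & 2 & 1 & 0 \\ 1 & 3 & 1 & 2 \\ 1 & 4 & 0 & -2 \\ 1 & 5 & 1 & 1 \end{pmatrix}$, the $n \times (p+1)$ input matrix
      • $\mathbf{y} = (1, 0, 0, 1, 1)^T$, the $n \times 1$ output matrix; $\hat{\mathbf{y}} = (?, ?, ?, ?, ?)^T$, the $n \times 1$ predicted output matrix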
  • 13. Simple example
      • Weights $w_0 = 1$, $w_1 = 2$, $w_2 = -1$, $w_3 = 0.5$; first row of $\mathbf{x}$: $(x_0, x_1, x_2, x_3) = (1, 1, 0, -1)$
      • $Z = \sum_{j=0}^{p} w_j x_j = 1 \cdot 1 + 1 \cdot 2 + 0 \cdot (-1) + (-1) \cdot 0.5 = 2.5$
      • $\phi(Z) = 1$ if $Z = \sum_{j=0}^{p} w_j x_j \ge 0$, and $0$ otherwise, so $\hat{y} = 1$
  • 17. Simple example
      • Weights $w_0 = 1$, $w_1 = 2$, $w_2 = -1$, $w_3 = 0.5$; third row of $\mathbf{x}$: $(x_0, x_1, x_2, x_3) = (1, 3, 1, 2)$
      • $Z = \sum_{j=0}^{p} w_j x_j = 1 \cdot 1 + 3 \cdot 2 + 1 \cdot (-1) + 2 \cdot 0.5 = 7$
      • $\phi(Z) = 1$ if $Z \ge 0$, and $0$ otherwise, so $\hat{y} = 1$
  • 18. Simple example
      • $\mathbf{w}$, $\mathbf{x}$ and $\mathbf{y} = (1, 0, 0, 1, 1)^T$ are as on slide 12.
      • Predicted output so far: $\hat{\mathbf{y}} = (1, 1, 1, ?, ?)^T$
  • 19. Simple example
      • Weights $w_0 = 1$, $w_1 = 2$, $w_2 = -1$, $w_3 = 0.5$; fourth row of $\mathbf{x}$: $(x_0, x_1, x_2, x_3) = (1, 4, 0, -2)$
      • $Z = \sum_{j=0}^{p} w_j x_j = 1 \cdot 1 + 4 \cdot 2 + 0 \cdot (-1) + (-2) \cdot 0.5 = 8$
      • $\phi(Z) = 1$ if $Z \ge 0$, and $0$ otherwise, so $\hat{y} = 1$
  • 20. Simple example
      • $\mathbf{w}$, $\mathbf{x}$ and $\mathbf{y} = (1, 0, 0, 1, 1)^T$ are as on slide 12.
      • Predicted output so far: $\hat{\mathbf{y}} = (1, 1, 1, 1, ?)^T$
  • 21. Simple example
      • Weights $w_0 = 1$, $w_1 = 2$, $w_2 = -1$, $w_3 = 0.5$; fifth row of $\mathbf{x}$: $(x_0, x_1, x_2, x_3) = (1, 5, 1, 1)$
      • $Z = \sum_{j=0}^{p} w_j x_j = 1 \cdot 1 + 5 \cdot 2 + 1 \cdot (-1) + 1 \cdot 0.5 = 10.5$
      • $\phi(Z) = 1$ if $Z \ge 0$, and $0$ otherwise, so $\hat{y} = 1$
  • 22. Simple example
      • $\mathbf{w}$, $\mathbf{x}$ and $\mathbf{y} = (1, 0, 0, 1, 1)^T$ are as on slide 12.
      • Predicted output: $\hat{\mathbf{y}} = (1, 1, 1, 1, 1)^T$
      • This set of weights clearly does not achieve a good prediction: samples 2 and 3 have $y = 0$ but are predicted as 1. The weights need to be updated.
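The forward pass of the simple example can be reproduced in a few lines. This is a minimal sketch in Python (the lecture's own code is in R and is not reproduced here); the weights, inputs and targets are the values read from the slides, and the helper name `step` is an illustrative choice.

```python
# Weights, inputs and targets as read from the "Simple example" slides.
w = [1, 2, -1, 0.5]
X = [
    [1, 1, 0, -1],
    [1, 2, 1, 0],
    [1, 3, 1, 2],
    [1, 4, 0, -2],
    [1, 5, 1, 1],
]
y = [1, 0, 0, 1, 1]


def step(z):
    """Binary step activation: 1 if z >= 0, otherwise 0."""
    return 1 if z >= 0 else 0


# Z = sum_j w_j * x_j for every sample (row of X).
z_values = [sum(wj * xj for wj, xj in zip(w, row)) for row in X]
# Predicted labels and the per-sample error y - y_hat.
y_hat = [step(z) for z in z_values]
errors = [yi - yh for yi, yh in zip(y, y_hat)]

for i, (z, yh, e) in enumerate(zip(z_values, y_hat, errors), start=1):
    print(f"sample {i}: Z = {z}, y_hat = {yh}, error = {e}")
```

A non-zero entry in `errors` marks a misclassified sample, which is what motivates updating the weights.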
  • 24. How do we find the weights, w?
      1) Perceptron learning algorithm:
      – Step 0: Training begins by assigning initial random values to the network parameters. A good initial heuristic is to start with the average of the positive input vectors minus the average of the negative input vectors. In many cases this yields an initial vector near the solution region.
      – Step 1: Present the input vectors to the network and apply the activation function (FORWARD PROPAGATION).
      – Step 2: Update the weights according to the following rule (BACKWARD PROPAGATION): $w_j := w_j + \Delta w_j$, with $\Delta w_j = \eta \, (y^{(i)} - \hat{y}^{(i)}) \, x_j^{(i)}$. Here $\eta$ is the learning rate, $0 < \eta \le 1$, and $i = 1, \dots, n$ indexes the samples.
      • Continue iterating until the perceptron classifies all training examples correctly.
      • $:=$ denotes assignment.
  • 25. Perceptron as a neural network: going backward
      • [Diagram: the input layer $x_0, \dots, x_3$ with weights $w_0, \dots, w_3$ feeds $Z$ and $\phi(Z)$ in the output layer, which produces $\hat{y}$.]
      • $\text{error} = y - \hat{y}$; the goal is to minimize the error.
  • 26. Rate of change: $\Delta w_j = \eta \, (y^{(i)} - \hat{y}^{(i)}) \, x_j^{(i)}$
      • Scenario 1: the output is correct, $y^{(i)} = 1$, $\hat{y}^{(i)} = 1$. Then $\Delta w_j = \eta \, (1 - 1) \, x_j^{(i)} = 0$, so no change is necessary.
      • Scenario 2: the output is incorrect, $y^{(i)} = 1$, $\hat{y}^{(i)} = 0$. Then $\Delta w_j = \eta \, (1 - 0) \, x_j^{(i)} = \eta \, x_j^{(i)}$, so the weight update is proportional to the value of $x_j^{(i)}$.
      • In summary: where the perceptron predicts the class label correctly, the weights remain unchanged; where it predicts the class label incorrectly, the weights are updated in proportion to the value of the input. The perceptron learning algorithm selects a search direction in weight space according to the incorrect classification of the last tested vector.
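The perceptron learning steps can be sketched as follows. This is an illustrative Python sketch, not the lecture's R code. It assumes the additive update $w_j := w_j + \eta \, (y^{(i)} - \hat{y}^{(i)}) \, x_j^{(i)}$, under which Scenario 2 raises $Z$ for a missed positive sample, and it starts from zero weights instead of random ones so that the run is reproducible. The data are the five samples of the simple example; `train_perceptron` and `max_epochs` are hypothetical names.

```python
def step(z):
    """Binary step activation: 1 if z >= 0, otherwise 0."""
    return 1 if z >= 0 else 0


def predict(w, x):
    """Forward propagation for one sample: phi(sum_j w_j * x_j)."""
    return step(sum(wj * xj for wj, xj in zip(w, x)))


def train_perceptron(X, y, eta=1.0, max_epochs=1000):
    """Repeat the update rule until an epoch finishes with no mistakes."""
    w = [0.0] * len(X[0])  # assumption: zero start instead of random values
    for epoch in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            error = yi - predict(w, xi)  # 0 when correct, +1 or -1 when wrong
            if error != 0:
                mistakes += 1
                # Delta w_j = eta * (y - y_hat) * x_j
                w = [wj + eta * error * xj for wj, xj in zip(w, xi)]
        if mistakes == 0:
            return w, epoch
    return w, max_epochs


X = [[1, 1, 0, -1], [1, 2, 1, 0], [1, 3, 1, 2], [1, 4, 0, -2], [1, 5, 1, 1]]
y = [1, 0, 0, 1, 1]
weights, epochs_used = train_perceptron(X, y)
print(weights, epochs_used)
```

These five samples happen to be linearly separable, so the loop stops before `max_epochs`; on inseparable data it would simply run out of epochs.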
  • 28. Our good old Iris dataset
      • Decision boundary in the $(x_1, x_2)$ plane: $w_0 + w_1 x_1 + w_2 x_2 = 0$, with slope $= -w_1 / w_2$ and intercept $= -w_0 / w_2$.
      • Check the R code for the perceptron learning algorithm and the video.
  • 30. Linearly separable and inseparable cases
      • It is important to note that the convergence of the perceptron is only guaranteed if the two classes are linearly separable and the learning rate is sufficiently small.
  • 31. Multilayer perceptrons
      • Single-layer perceptrons are only capable of solving linearly separable problems.
      • To overcome the linearly inseparable problem, we can combine two or more perceptrons into a multilayer perceptron.
      • By joining several hyperplanes in this way, we can define a new set of decision rules.
  • 32. Example 1
      • Layers: 3 neurons (L=1), 4 neurons (L=2), 2 neurons (L=3), giving $3 \times 4 = 12$ and $4 \times 2 = 8$ weights.
      • In total we have 12 + 8 = 20 weights to optimize.
  • 33. Example 2
      • $4 \times 5 = 20$ and $5 \times 1 = 5$ weights.
      • In total we have 20 + 5 = 25 weights to optimize.
  • 34. Example 3
      • $4 \times 3 = 12$, $3 \times 3 = 9$ and $3 \times 1 = 3$ weights.
      • In total we have 12 + 9 + 3 = 24 weights to optimize.
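The weight counts in the three examples follow one rule: multiply the sizes of each pair of adjacent layers and add the products. A small Python sketch, where `count_weights` is a hypothetical helper and the layer sizes are taken as the slides count them:

```python
def count_weights(layer_sizes):
    """Sum of (units in layer k) * (units in layer k+1) over adjacent layers."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


print(count_weights([3, 4, 2]))     # Example 1: 3x4 + 4x2
print(count_weights([4, 5, 1]))     # Example 2: 4x5 + 5x1
print(count_weights([4, 3, 3, 1]))  # Example 3: 4x3 + 3x3 + 3x1
```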
  • 35. 1 input layer, 1 hidden layer NN with 3 input variables and 1 output variable (numeric output): going forward
      • [Diagram: input layer ($k=1$) with $x_0, \dots, x_3$; hidden layer ($k=2$) with $Z_1^{(2)}, a_1^{(2)}$ and $Z_2^{(2)}, a_2^{(2)}$, reached through weights $w_{01}^{(1)}, \dots, w_{32}^{(1)}$; output layer ($k=3$) with $Z_1^{(3)}, a_1^{(3)} = \hat{y}$, reached through weights $w_{01}^{(2)}, w_{11}^{(2)}, w_{21}^{(2)}$.]
      • $x_j = a_j^{(1)}$ and $a_0 = 1$
      • $Z_j^{(2)} = \mathbf{w}_j^{(1)} \mathbf{x}$
      • $Z_j^{(k+1)} = \mathbf{w}_j^{(k)} \mathbf{a}^{(k)}$ and $a_j^{(k+1)} = \phi(Z_j^{(k+1)})$
  • 36. 1 input layer, 1 hidden layer NN with 3 input variables and 1 output variable (categorical output with 3 categories)
      • [Diagram: input layer ($k=1$) with $x_0, \dots, x_3$; hidden layer ($k=2$) with $a_0^{(2)} = 1$, $Z_1^{(2)}, a_1^{(2)}$ and $Z_2^{(2)}, a_2^{(2)}$, reached through weights $w_{01}^{(1)}, \dots, w_{32}^{(1)}$; output layer ($k=3$) with $Z_j^{(3)}, a_j^{(3)} = \hat{y}_j$ for $j = 1, 2, 3$, reached through weights $w_{01}^{(2)}, \dots, w_{23}^{(2)}$.]
  • 38. Notation
      • $\mathbf{W}^{(k)}$: matrix of weights controlling the function mapping from layer $k$ to layer $k+1$ (here $k = 1, \dots, L$).
      • $\mathbf{Z}^{(k+1)}$: vector of linear combinations of weights and inputs from layer $k$: $Z_j^{(k+1)} = \mathbf{w}_j^{(k)} \mathbf{a}^{(k)}$, where $a_j^{(1)} = x_j$, $a_0^{(k)} = 1$ (acts as a bias) and $a_j^{(L)} = \hat{y}_j$ (in a regression problem there is one output).
      • $a_j^{(k)}$: activation of unit $j$ in layer $k$ under a pre-specified activation function (here $j = 0, \dots, P^{(k)}$, specific to the layer): $a_j^{(k+1)} = \phi(Z_j^{(k+1)})$
      • There are several different activation functions:
  • 39. Activation Functions
      • In perceptrons, a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to flip completely, say from 0 to 1. That flip may then cause the behaviour of the rest of the network to change completely in some very complicated way.
      • There are several different activation functions:
          – Step function
          – Constant function
          – Threshold function (step)
          – Threshold function (ramp)
          – Linear function
          – Sigmoid function
          – Hyperbolic tangent function
  • 40. Activation Function – Step Function (Symmetric)
      • $a = \phi(Z) = 1$ if $Z = \sum_{j=0}^{p} w_j x_j \ge 0$, and $-1$ otherwise
  • 41. Activation Function – Step Function (Binary)
      • $a = \phi(Z) = 1$ if $Z = \sum_{j=0}^{p} w_j x_j \ge 0$, and $0$ otherwise
  • 42. Activation Function – Linear Function
      • $a = \phi(Z) = Z$
  • 43. Activation Function – Semilinear Function
      • $a = \phi(Z) = 1$ if $Z = \sum_{j=0}^{p} w_j x_j > 1$; $\phi(Z) = Z$ if $-1 \le \sum_{j=0}^{p} w_j x_j \le 1$; $\phi(Z) = -1$ if $Z = \sum_{j=0}^{p} w_j x_j < -1$
  • 44. Activation Function – Sigmoid Function
      • $a = \phi(Z) = \frac{1}{1 + \exp(-\alpha Z)}$
      • We will focus on the sigmoid activation function for the moment.
  • 45. Activation Function – Hyperbolic Tangent (Tanh) Function
      • $a = \phi(Z) = \frac{\exp(Z) - \exp(-Z)}{\exp(Z) + \exp(-Z)}$
  • 46. Activation Function – ReLU Function (Rectified Linear Unit)
      • $a = \phi(Z) = Z$ if $Z = \sum_{j=0}^{p} w_j x_j > 0$, and $0$ otherwise
      • ReLU is not differentiable at 0 but is differentiable everywhere else. At zero, the derivative can be chosen arbitrarily as 0 or 1.
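The activation functions above, together with the layer-by-layer rule $a_j^{(k+1)} = \phi(Z_j^{(k+1)})$, can be sketched in Python as follows. The function names, the example weights and the choice of a sigmoid hidden layer with a linear output are illustrative assumptions, not values from the slides.

```python
import math


def step_symmetric(z):
    return 1.0 if z >= 0 else -1.0


def step_binary(z):
    return 1.0 if z >= 0 else 0.0


def linear(z):
    return z


def semilinear(z):
    # 1 above 1, -1 below -1, and Z itself in between.
    return max(-1.0, min(1.0, z))


def sigmoid(z, alpha=1.0):
    return 1.0 / (1.0 + math.exp(-alpha * z))


def tanh(z):
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))


def relu(z):
    return z if z > 0 else 0.0


def layer_forward(weights, activations, phi):
    """weights[j] holds the weights into unit j; activations[0] is the bias 1."""
    return [phi(sum(w * a for w, a in zip(w_j, activations))) for w_j in weights]


# Hypothetical network: 3 inputs plus bias, 2 hidden units, 1 numeric output.
a1 = [1.0, 0.5, -1.0, 2.0]                   # a_0 = 1, then x_1, x_2, x_3
W1 = [[0.1, 0.2, -0.3, 0.4],                 # weights into hidden unit 1
      [-0.2, 0.1, 0.5, 0.3]]                 # weights into hidden unit 2
W2 = [[0.5, -1.0, 1.0]]                      # weights into the output unit

a2 = [1.0] + layer_forward(W1, a1, sigmoid)  # prepend the hidden-layer bias
y_hat = layer_forward(W2, a2, linear)[0]
print(y_hat)
```

Passing `phi` as an argument keeps the forward pass identical for every layer; only the activation changes.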
  • 47. How do we find the weights, w?
      • One way of attacking the problem is to use calculus to try to find the minimum of the cost function C analytically.
      • We could compute derivatives and then try using them to find places where C is an extremum. With some luck that might work when C is a function of just one or a few variables.
      • But it will turn into a nightmare when we have many more variables.
      • And for neural networks we will often want far more variables: the biggest neural networks have cost functions which depend on billions of weights and biases in an extremely complicated way.
      • Using calculus to minimize that just won't work!
  • 48. How do we find the weights, w? – Going Backward
      2) Gradient descent algorithm:
      – Step 0: Training begins by assigning initial random values to the network parameters.
      – Step 1: Present the input vectors to the network and apply the activation function (FORWARD PROPAGATION).
      – Step 2: Calculate the error using a cost function:
          $J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$ for regression problems (without any regularization)
          $J(w) = -\frac{1}{2n} \sum_{i=1}^{n} \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log\left(1 - \hat{y}^{(i)}\right) \right]$ for classification problems (without any regularization)
        Our goal is to minimize the error $J(w)$ with respect to $w_j$.
      – Step 3: Update the weights according to the following rule: $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$, where $\alpha$ is the learning rate and $0 < \alpha \le 1$.
      • Continue iterating until convergence.
  • 49. How do we find the weights, w? – Going Backward
      2) Gradient descent algorithm:
      • Let us examine the weight update function: $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$
      • The partial derivative answers the question "What is the slope of $J(w)$ at the point $w$?"
      • $\alpha$ determines the size of the change. If $\alpha$ is too small, we take small steps and it takes too long to reach the optimum.
      • If $\alpha$ is too big, we may overshoot the optimal values and fail to converge.
      • Let us look at these concepts with a small example that uses one feature to estimate a numeric variable: x (size): c(0, 1, 2, 3) and y (price): c(0, 2, 4, 6).
  • 50. Batch, Stochastic Gradient Descent
      • In batch gradient descent learning, the weight update is calculated based on all samples in the training set (instead of updating the weights incrementally after each sample), which is why this approach is also referred to as "batch" gradient descent.
      • Vector-matrix operations
  • 51. Batch, Stochastic Gradient Descent
      • Now imagine we have a very large dataset with millions of data points, which is not uncommon in many machine learning applications. Running batch gradient descent can be computationally quite costly in such scenarios, since we need to re-evaluate the whole training dataset each time we take one step towards the global minimum.
      • A popular alternative to the batch gradient descent algorithm is stochastic gradient descent, sometimes also called iterative or on-line gradient descent. Instead of updating the weights based on the sum of the accumulated errors over all samples, we update the weights incrementally for each training sample:
          for (i in 1:n) { calculate error; calculate derivatives; update weight }
  • 52. Batch, Stochastic Gradient Descent
      • A compromise between batch gradient descent and stochastic gradient descent is so-called mini-batch learning. Whereas stochastic (online) learning updates the network from just one training input at a time, mini-batch learning updates it from a small group of inputs.
      • Mini-batch learning can be understood as applying batch gradient descent to smaller subsets of the training data, for example 50 samples at a time.
      • By averaging over this small sample, it turns out that we can quickly get a good estimate of the true gradient, which helps speed up gradient descent and thus learning.
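The three variants differ only in how many samples feed each weight update, so one loop with a batch-size parameter covers all of them. The sketch below is in Python rather than the lecture's R; it uses the small size/price example with a linear model $\hat{y} = w_0 + w_1 x$, and the names `gradient_descent`, `batch_size` and `seed` are illustrative assumptions.

```python
import random


def gradient_descent(xs, ys, alpha=0.05, epochs=2000, batch_size=1, seed=0):
    """batch_size = n is batch GD, 1 is stochastic GD, in between is mini-batch."""
    rng = random.Random(seed)      # fixed seed so the run is reproducible
    w0, w1 = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)       # visit the samples in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            g0 = g1 = 0.0
            for i in batch:
                err = ys[i] - (w0 + w1 * xs[i])    # y - y_hat
                g0 += -err / len(batch)            # dJ/dw0 averaged over the batch
                g1 += -err * xs[i] / len(batch)    # dJ/dw1 averaged over the batch
            w0 -= alpha * g0
            w1 -= alpha * g1
    return w0, w1


xs = [0, 1, 2, 3]
ys = [0, 2, 4, 6]
for size in (4, 2, 1):  # batch, mini-batch, stochastic
    print(size, gradient_descent(xs, ys, batch_size=size))
```

Because this toy data set has no noise, all three settings approach the same weights; on real data the stochastic and mini-batch runs would fluctuate around the minimum.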
  • 53. Final note on Gradient Descent: Input variable preprocessing
      • Gradient descent is one of the many algorithms that benefit from feature scaling. Each input variable should be preprocessed so that its mean value, averaged over the entire training sample, is close to zero, or else is small compared to its standard deviation.
      • In order to accelerate the back-propagation learning process, the normalization of the inputs should also include two other measures (LeCun, 1993):
          – The input variables contained in the training set should be uncorrelated; this can be done by using principal-components analysis (USL).
          – The decorrelated input variables should be scaled so that their covariances are approximately equal, thereby ensuring that the different synaptic weights in the network learn at approximately the same speed.
  • 54. Final note on Gradient Descent: Input variable preprocessing
      • We will use a feature scaling method called standardization, which gives each feature the mean and standard deviation of a standard normal distribution.
      • The mean of each input feature is centered at 0 and the feature column has a standard deviation of 1: $x_j^{std} = \frac{x_j - \bar{x}_j}{s_j}$
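Standardization of one feature column can be sketched as below. This Python sketch assumes that $s_j$ is the sample standard deviation (divisor $n - 1$, as R's `sd()` computes it); `standardize` is a hypothetical helper name.

```python
import math


def standardize(column):
    """Return (x - mean) / s for every value, with s the sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / s for v in column]


x_std = standardize([0, 1, 2, 3])
print(x_std)
```

In practice the mean and $s_j$ computed on the training set are reused to scale validation and test data.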
  • 55. Hypothetical Data – The exact function is $\hat{y} = 0 + 2x$
      > dat
        x y
      1 0 0
      2 1 2
      3 2 4
      4 3 6
  • 56. 1 input layer, 1 output layer
      • [Diagram: input layer $x_0, x_1$ with weights $w_0, w_1$ feeds $Z$ and $\phi(Z)$ in the output layer, which produces $\hat{y}$.]
      • $Z = w_0 x_0 + w_1 x_1$, $\phi(Z) = Z$, $\text{error} = y - \hat{y}$
  • 57. For $w_0 = 0$, $w_1 = 0$: $J(0, 0) = 7$
      > dat
        x y  yhat = 0 + 0*x
      1 0 0  0 + 0*0 = 0
      2 1 2  0 + 0*1 = 0
      3 2 4  0 + 0*2 = 0
      4 3 6  0 + 0*3 = 0
      • Errors: $e_1 = 0$, $e_2 = 2$, $e_3 = 4$, $e_4 = 6$
      • $J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 = \frac{1}{2 \cdot 4} \left( 0^2 + 2^2 + 4^2 + 6^2 \right) = 7$
  • 58. For $w_0 = 0$, $w_1 = 0.5$: $J(0, 0.5) = 3.9375$
      > dat
        x y  yhat = 0 + 0.5*x
      1 0 0  0 + 0.5*0 = 0
      2 1 2  0 + 0.5*1 = 0.5
      3 2 4  0 + 0.5*2 = 1
      4 3 6  0 + 0.5*3 = 1.5
      • Errors: $e_1 = 0$, $e_2 = 1.5$, $e_3 = 3$, $e_4 = 4.5$
      • $J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 = \frac{1}{2 \cdot 4} \left( 0^2 + 1.5^2 + 3^2 + 4.5^2 \right) = 3.9375$
  • 59. For $w_0 = 0$, $w_1 = 1$: $J(0, 1) = 1.75$
      > dat
        x y  yhat = 0 + 1*x
      1 0 0  0 + 1*0 = 0
      2 1 2  0 + 1*1 = 1
      3 2 4  0 + 1*2 = 2
      4 3 6  0 + 1*3 = 3
      • Errors: $e_1 = 0$, $e_2 = 1$, $e_3 = 2$, $e_4 = 3$
      • $J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 = \frac{1}{2 \cdot 4} \left( 0^2 + 1^2 + 2^2 + 3^2 \right) = 1.75$
  • 60. For $w_0 = 0$, $w_1 = 2$: $J(0, 2) = 0$
      > dat
        x y  yhat = 0 + 2*x
      1 0 0  0 + 2*0 = 0
      2 1 2  0 + 2*1 = 2
      3 2 4  0 + 2*2 = 4
      4 3 6  0 + 2*3 = 6
      • Errors: $e_1 = 0$, $e_2 = 0$, $e_3 = 0$, $e_4 = 0$
      • $J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 = \frac{1}{2 \cdot 4} \left( 0^2 + 0^2 + 0^2 + 0^2 \right) = 0$
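The cost calculations on these slides all apply the same formula to different values of $w_1$, which a short function makes explicit. This is a Python sketch (the lecture itself works in R); `cost` is a hypothetical helper and the data are the four size/price points.

```python
def cost(w0, w1, xs, ys):
    """J(w) = 1/(2n) * sum_i (y_i - (w0 + w1 * x_i))^2."""
    n = len(xs)
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys)) / (2 * n)


xs = [0, 1, 2, 3]
ys = [0, 2, 4, 6]
for w1 in (0, 0.5, 1, 2):
    print(f"w1 = {w1}: J = {cost(0, w1, xs, ys)}")
```

Evaluating `cost` on a grid of $w_1$ values traces out the parabola that the following slides walk along.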
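  • 61. Cost function with respect to w1: at $w_1 = 0$, $\frac{\partial}{\partial w_j} J(w) < 0$; update $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$
  • 62. Cost function with respect to w1: at $w_1 = 0.5$, $\frac{\partial}{\partial w_j} J(w) < 0$; update $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$
  • 63. Cost function with respect to w1: at $w_1 = 1.25$, $\frac{\partial}{\partial w_j} J(w) < 0$; update $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$
  • 64. Cost function with respect to w1: at $w_1 = 2$, approached from the left, $\frac{\partial}{\partial w_j} J(w) = 0$ (the minimum); update $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$
  • 65. Cost function with respect to w1: at $w_1 = 2.75$, $\frac{\partial}{\partial w_j} J(w) > 0$; update $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$
  • 66. Cost function with respect to w1: at $w_1 = 2.5$, $\frac{\partial}{\partial w_j} J(w) > 0$; update $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$
  • 67. Cost function with respect to w1: at $w_1 = 2$, approached from the right, $\frac{\partial}{\partial w_j} J(w) = 0$ (the minimum); update $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$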
  • 68. Cost function with respect to w1, one w parameter to optimize (Ref: Raschka, p. 35)
  • 70. Gradient Descent in a very simple example: 1 input layer, 1 output layer, 1 x, 1 numeric y
      • Consider the linear regression example: $J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$ and $\hat{y}^{(i)} = w_0 x_0 + w_1 x_1$
      • We have 2 weights!
      • $\frac{\partial}{\partial w_0} J(w)$ and $\frac{\partial}{\partial w_1} J(w)$ need to be calculated: $\nabla J = \left( \frac{\partial}{\partial w_0} J(w), \frac{\partial}{\partial w_1} J(w) \right)$
      • $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$
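  • 71. Derivatives of the cost function
      • $J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$ and $\hat{y}^{(i)} = w_0 x_0 + w_1 x_1$
      • $\frac{\partial}{\partial w_0} J(w) = \frac{-2}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) \frac{\partial}{\partial w_0} \hat{y}^{(i)}(w) = \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) x_0$
      • and $\frac{\partial}{\partial w_1} J(w) = \frac{-2}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) \frac{\partial}{\partial w_1} \hat{y}^{(i)}(w) = \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) x_1$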
  • 72. Direction of the weight update
      • $\frac{\partial}{\partial w_0} J(w) = \frac{-1}{n} \sum_{i=1}^{n} \left( y - w_0 x_0 - w_1 x_1 \right) x_0$
      • $\frac{\partial}{\partial w_1} J(w) = \frac{-1}{n} \sum_{i=1}^{n} \left( y - w_0 x_0 - w_1 x_1 \right) x_1$
      • $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$
      • $\frac{\partial}{\partial w} J(w) < 0$ means we will increase the weight.
      • $\frac{\partial}{\partial w} J(w) > 0$ means we will decrease the weight.
      • $\frac{\partial}{\partial w} J(w) = 0$ means we will not change the weight.
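The two partial derivatives and the update rule can be combined into a complete gradient descent run on the size/price data. This is a minimal Python sketch; the function names, the learning rate 0.1, the zero starting weights and the iteration count are illustrative assumptions.

```python
def gradient(w0, w1, xs, ys):
    """Partial derivatives of J(w) = 1/(2n) * sum (y - w0*x0 - w1*x1)^2, with x0 = 1."""
    n = len(xs)
    residuals = [y - w0 * 1.0 - w1 * x for x, y in zip(xs, ys)]
    d_w0 = -sum(r * 1.0 for r in residuals) / n
    d_w1 = -sum(r * x for r, x in zip(residuals, xs)) / n
    return d_w0, d_w1


def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    w0, w1 = 0.0, 0.0  # assumption: start from zero rather than random values
    for _ in range(iterations):
        d_w0, d_w1 = gradient(w0, w1, xs, ys)
        # A negative derivative increases the weight, a positive one decreases it.
        w0 -= alpha * d_w0
        w1 -= alpha * d_w1
    return w0, w1


xs = [0, 1, 2, 3]
ys = [0, 2, 4, 6]
print(gradient(0.0, 0.0, xs, ys))  # both derivatives are negative at the start
print(gradient_descent(xs, ys))    # approaches w0 = 0, w1 = 2
```

At the starting point both derivatives are negative, so both weights increase; the steps shrink as the derivatives approach zero near $w_0 = 0$, $w_1 = 2$.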
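  • 74. Gradient Descent in a very simple example: 1 input layer, 1 hidden layer (2 neurons + bias), 1 output layer, 1 x, 1 numeric y
      • [Diagram: inputs $x_0, x_1$; hidden units $Z_1^{(2)}, a_1^{(2)}$ and $Z_2^{(2)}, a_2^{(2)}$, reached through weights $w_{01}^{(1)}, w_{11}^{(1)}, w_{02}^{(1)}, w_{12}^{(1)}$; output unit $Z_1^{(3)}, a_1^{(3)}$, reached through weights $w_{01}^{(2)}, w_{11}^{(2)}, w_{21}^{(2)}$.]
      • $x_j = a_j^{(1)}$ and $a_0 = 1$
      • $Z_j^{(2)} = \mathbf{w}_j^{(1)} \mathbf{x}$ and $a_j^{(2)} = \phi(Z_j^{(2)}) = \frac{1}{1 + \exp(-Z_j^{(2)})}$
      • $\hat{y} = a_1^{(3)} = Z_1^{(3)} = \mathbf{w}_1^{(2)} \mathbf{a}^{(2)}$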
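  • 75. Gradient Descent in a very simple example: 1 input layer, 1 hidden layer (2 neurons + bias), 1 output layer, 1 x, 1 numeric y
      • $J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$
      • $\frac{\partial}{\partial w} J(w) = \frac{-2}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) \frac{\partial}{\partial w} \hat{y}^{(i)}(w)$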
  • 76. Gradient Descent in a very simple example: 1 input layer, 1 hidden layer (2 neurons + bias), 1 output layer, 1 x, 1 numeric y
      • $\hat{y}^{(i)} = a_0^{(2)} w_{01}^{(2)} + a_1^{(2)} w_{11}^{(2)} + a_2^{(2)} w_{21}^{(2)}$ and $a_0^{(2)} = 1$
      • Output layer weights: $\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{01}^{(2)}} = a_0^{(2)} = 1$, $\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{11}^{(2)}} = a_1^{(2)}$, $\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{21}^{(2)}} = a_2^{(2)}$
      • Hidden layer weights: $\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{01}^{(1)}} = ?$, $\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{02}^{(1)}} = ?$, $\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{11}^{(1)}} = ?$, $\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{12}^{(1)}} = ?$
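  • 77. Update the weights from the hidden layer to the output layer
      • $w_{01}^{(2)} := w_{01}^{(2)} - \alpha \, \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)$
      • $w_{11}^{(2)} := w_{11}^{(2)} - \alpha \, \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) a_1^{(2)}$
      • $w_{21}^{(2)} := w_{21}^{(2)} - \alpha \, \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) a_2^{(2)}$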
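  • 78.
      • $\hat{y}^{(i)} = 1 \cdot w_{01}^{(2)} + a_1^{(2)} w_{11}^{(2)} + a_2^{(2)} w_{21}^{(2)}$ and $a_0^{(2)} = 1$
      • $a_1^{(2)} = \frac{1}{1 + \exp(-Z_1^{(2)})}$ and $a_2^{(2)} = \frac{1}{1 + \exp(-Z_2^{(2)})}$
      • $Z_1^{(2)} = w_{01}^{(1)} \cdot 1 + w_{11}^{(1)} x_1$ and $Z_2^{(2)} = w_{02}^{(1)} \cdot 1 + w_{12}^{(1)} x_1$
      • Output layer weights:
          $\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{01}^{(2)}} = 1$, $\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{11}^{(2)}} = a_1^{(2)}$, $\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{21}^{(2)}} = a_2^{(2)}$
      • Hidden layer weights:
          $\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{01}^{(1)}} = \frac{\partial \hat{y}^{(i)}(w)}{\partial a_1^{(2)}} \frac{\partial a_1^{(2)}}{\partial Z_1^{(2)}} \frac{\partial Z_1^{(2)}}{\partial w_{01}^{(1)}} = w_{11}^{(2)} \, a_1^{(2)} \left( 1 - a_1^{(2)} \right) \cdot 1$
          $\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{11}^{(1)}} = \frac{\partial \hat{y}^{(i)}(w)}{\partial a_1^{(2)}} \frac{\partial a_1^{(2)}}{\partial Z_1^{(2)}} \frac{\partial Z_1^{(2)}}{\partial w_{11}^{(1)}} = w_{11}^{(2)} \, a_1^{(2)} \left( 1 - a_1^{(2)} \right) x_1$
          $\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{02}^{(1)}} = \frac{\partial \hat{y}^{(i)}(w)}{\partial a_2^{(2)}} \frac{\partial a_2^{(2)}}{\partial Z_2^{(2)}} \frac{\partial Z_2^{(2)}}{\partial w_{02}^{(1)}} = w_{21}^{(2)} \, a_2^{(2)} \left( 1 - a_2^{(2)} \right) \cdot 1$
          $\frac{\partial \hat{y}^{(i)}(w)}{\partial w_{12}^{(1)}} = \frac{\partial \hat{y}^{(i)}(w)}{\partial a_2^{(2)}} \frac{\partial a_2^{(2)}}{\partial Z_2^{(2)}} \frac{\partial Z_2^{(2)}}{\partial w_{12}^{(1)}} = w_{21}^{(2)} \, a_2^{(2)} \left( 1 - a_2^{(2)} \right) x_1$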
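  • 79. Update the weights from the input layer to the hidden layer
      • $w_{01}^{(1)} := w_{01}^{(1)} - \alpha \, \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) w_{11}^{(2)} \, a_1^{(2)} \left( 1 - a_1^{(2)} \right)$
      • $w_{11}^{(1)} := w_{11}^{(1)} - \alpha \, \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) w_{11}^{(2)} \, a_1^{(2)} \left( 1 - a_1^{(2)} \right) x_1$
      • $w_{02}^{(1)} := w_{02}^{(1)} - \alpha \, \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) w_{21}^{(2)} \, a_2^{(2)} \left( 1 - a_2^{(2)} \right)$
      • $w_{12}^{(1)} := w_{12}^{(1)} - \alpha \, \frac{-1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right) w_{21}^{(2)} \, a_2^{(2)} \left( 1 - a_2^{(2)} \right) x_1$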
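The chain-rule derivatives and both sets of weight updates can be put together for the network with two sigmoid hidden neurons and a linear output. This Python sketch is illustrative only: the data are the size/price points, and the starting weights, the learning rate 0.05 and the names `forward`, `gradients` and `train` are assumptions rather than values from the slides.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w1, w2, x):
    """w1[j] = (bias, slope) into hidden unit j+1; w2 = (w01, w11, w21) into the output."""
    a = [sigmoid(bias + slope * x) for bias, slope in w1]
    y_hat = w2[0] * 1.0 + w2[1] * a[0] + w2[2] * a[1]
    return a, y_hat


def cost(w1, w2, xs, ys):
    n = len(xs)
    return sum((y - forward(w1, w2, x)[1]) ** 2 for x, y in zip(xs, ys)) / (2 * n)


def gradients(w1, w2, xs, ys):
    """dJ/dw for every weight, using dJ/dw = (-1/n) * sum (y - y_hat) * d y_hat/dw."""
    n = len(xs)
    g1 = [[0.0, 0.0], [0.0, 0.0]]
    g2 = [0.0, 0.0, 0.0]
    for x, y in zip(xs, ys):
        a, y_hat = forward(w1, w2, x)
        common = -(y - y_hat) / n
        g2[0] += common * 1.0        # d y_hat / d w01(2) = 1
        g2[1] += common * a[0]       # d y_hat / d w11(2) = a1
        g2[2] += common * a[1]       # d y_hat / d w21(2) = a2
        for j in range(2):
            delta = common * w2[j + 1] * a[j] * (1.0 - a[j])
            g1[j][0] += delta * 1.0  # bias weight into hidden unit j+1
            g1[j][1] += delta * x    # input weight into hidden unit j+1
    return g1, g2


def train(w1, w2, xs, ys, alpha=0.05, iterations=500):
    w1 = [row[:] for row in w1]
    w2 = list(w2)
    for _ in range(iterations):
        g1, g2 = gradients(w1, w2, xs, ys)
        w2 = [w - alpha * g for w, g in zip(w2, g2)]
        w1 = [[w - alpha * g for w, g in zip(wr, gr)] for wr, gr in zip(w1, g1)]
    return w1, w2


xs = [0, 1, 2, 3]
ys = [0, 2, 4, 6]
w1_start = [[0.1, 0.2], [-0.1, 0.3]]  # hypothetical starting weights
w2_start = [0.0, 0.5, -0.5]
w1_fit, w2_fit = train(w1_start, w2_start, xs, ys)
print(cost(w1_start, w2_start, xs, ys), cost(w1_fit, w2_fit, xs, ys))
```

All gradients are computed from the current weights before any weight is changed, so the hidden-layer updates use the same output-layer weights that produced the forward pass.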
  • 81. What does $\alpha$ do?
      • If $\alpha$ is too small, the rate of change in the weights will be tiny. It will take too long to reach the optimum solution.
      • If $\alpha$ is too big, the rate of change in the weights will be very big. We may never find the optimum solution, and our algorithm may fail to converge.
  • 82. $\alpha$, adaptive learning
      • In stochastic gradient descent implementations, the fixed learning rate $\alpha$ is often replaced by an adaptive learning rate that decreases over time, for example $\frac{C_1}{\#\text{iterations} + C_2}$, where $C_1$ and $C_2$ are constants.
      • Note that stochastic gradient descent does not reach the global minimum but an area very close to it. By using an adaptive learning rate, we can anneal further and get closer to the global minimum.
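The decaying schedule can be written as a one-line function. A Python sketch follows; the constants $C_1 = 1$ and $C_2 = 10$ are arbitrary illustrative values, not values from the lecture.

```python
def learning_rate(iteration, c1=1.0, c2=10.0):
    """Adaptive learning rate alpha_t = C1 / (iteration + C2)."""
    return c1 / (iteration + c2)


for t in (0, 10, 90, 990):
    print(t, learning_rate(t))
```

In a stochastic gradient descent loop the weight update would then use `learning_rate(t)` in place of a fixed $\alpha$.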
  • 84. What if we have $w_0$ and $w_1$ together to change?
      • The cost function $J(w_0, w_1)$ will be a 3D surface plot over $w_0$ and $w_1$ (left pane).
      • The contour plot shows the same cost along each contour (right pane).
  • 87. Regularization in Neural Networks
      • In multi-layer neural networks, the number of input and output units is generally determined by the dimensionality of the data set.
      • On the other hand, we are free to choose the number of hidden layer units (M). We may typically have hundreds, thousands, or even billions of weights that we need to optimize.
      • Choose the number of hidden layer units (M) that gives the best generalization performance, balancing underfitting against overfitting.
      • A network is said to generalize well when the input-output mapping computed by the network is correct (or nearly so) for test data never used in creating or training the network. Here, it is assumed that the test data are drawn from the same population used to generate the training data.
  • 88. Regularization in Neural Networks
      • The generalization error, however, is not a simple function of the number of hidden layer units (M), due to the presence of local minima in the error function.
      • For each hidden-layer size M considered, training is started from several random initializations of the weight vector, which shows the effect of the initialization across a range of values of M.
      • In practice, one approach to choosing M is to plot M against the errors and then choose the specific solution with the smallest validation set error.
  • 89. Regularization in Neural Networks
      • $J(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$
      • There are, however, other ways to control the complexity of a neural network model in order to avoid over-fitting, such as adding a quadratic regularizer (L2):
      • $\tilde{J}(w) = \frac{1}{2n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 + \frac{\lambda}{2} \lVert w \rVert^2$
      • This regularizer is also known as weight decay.
      • The effective model complexity is then determined by the choice of the regularization coefficient $\lambda$.
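The regularized cost adds the penalty $\frac{\lambda}{2} \lVert w \rVert^2$ to the squared-error term. The Python sketch below uses the linear model $\hat{y} = w_0 + w_1 x$ on the size/price data purely for illustration; it penalizes every weight including the bias, which is an assumption, and `regularized_cost` and `lam` are hypothetical names.

```python
def regularized_cost(w, xs, ys, lam):
    """J~(w) = 1/(2n) * sum (y - y_hat)^2 + (lam / 2) * sum_j w_j^2."""
    n = len(xs)
    data_term = sum((y - (w[0] + w[1] * x)) ** 2 for x, y in zip(xs, ys)) / (2 * n)
    penalty = (lam / 2) * sum(wj ** 2 for wj in w)  # assumption: bias is penalized too
    return data_term + penalty


xs = [0, 1, 2, 3]
ys = [0, 2, 4, 6]
print(regularized_cost([0.0, 2.0], xs, ys, lam=0.0))  # exact fit, no penalty
print(regularized_cost([0.0, 2.0], xs, ys, lam=0.1))  # same fit, penalized
```

With $\lambda > 0$ the exact fit is no longer free, so the minimizer is pulled towards smaller weights.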
  • 91. Early Stopping in NNs
      • An alternative to regularization as a way of controlling the effective complexity of a network is the procedure of early stopping.
      • The training of nonlinear network models corresponds to an iterative reduction of the error function defined with respect to a set of training data.
      • The error measured with respect to independent data, generally called a validation set, often shows a decrease at first, followed by an increase as the network starts to over-fit. Training can therefore be stopped at the point of smallest error with respect to the validation data set, in order to obtain a network having good generalization performance.
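Early stopping can be sketched as a rule applied to the sequence of validation errors. In the Python sketch below, the `patience` parameter (how many epochs without improvement to tolerate before stopping) is a common practical device but an assumption here, and the error values are made up for illustration.

```python
def early_stopping_epoch(validation_errors, patience=3):
    """Return (epoch, error) of the best validation error seen before stopping."""
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(validation_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch, best_error


# Made-up validation errors: a decrease followed by an increase.
val_errors = [0.90, 0.70, 0.50, 0.45, 0.47, 0.50, 0.56, 0.40]
print(early_stopping_epoch(val_errors))
```

In a real training loop the weights from the best epoch would be saved and restored once training stops.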
  • 92. When to use NNs
      • When dealing with unstructured datasets
      • When you do not need interpretable results. For example, when you just want to classify your pictures as cats or dogs, you don't need to know why the outcome is classified as a cat or a dog, and you don't need to explain the relationships.
      • When you have many features, with regularization
      • When you have nonlinear relationships
  • 93. Resources
      • A free online book by Michael Nielsen (a brilliant resource for partial derivatives and gradient descent): http://neuralnetworksanddeeplearning.com/
      • The Elements of Statistical Learning, Trevor Hastie (p. 389)
      • Pattern Recognition and Machine Learning, Christopher Bishop (p. 227)
      • Machine Learning with R, Brett Lantz (p. 219)
      • Neural Networks: A Comprehensive Foundation, Simon S. Haykin
      • Python Machine Learning, Sebastian Raschka (p. 17)
      • Neural Network Design, Hagan, Demuth, Beale, De Jesus (http://hagan.okstate.edu/nnd.html)
      • https://github.com/stephencwelch/Neural-Networks-Demystified
      • And, of course, Prof. Patrick Henry Winston's MIT YouTube lectures.