Outline
The objective of this part of the Supervised Learning lectures will be to gain
an understanding of:
• Background for ANNs
• How ANNs for regression and classification problems work
• The perceptron learning algorithm
• The gradient descent algorithm
• The stochastic gradient descent algorithm
• How to analyze datasets with ANNs in R
• How to interpret the results
Introduction
• An Artificial Neural Network (ANN) models the relationship between a set of
input signals (features) and an output signal (the y variable) using a model
derived from our understanding of how a biological brain responds to stimuli
from sensory inputs.
• Just as a brain uses a network of interconnected cells called neurons to
create a massive parallel processor, an ANN uses a network of artificial
neurons or nodes to solve learning problems.
• Before we explain ANNs, let us understand how the biological brain works.
How do we find the weights, w?
• 1) Perceptron Learning Algorithm:
– Step 0: Training begins by assigning some initial random values to the
network parameters. A good initial heuristic is to start with the average of
the positive input vectors minus the average of the negative input vectors; in
many cases this yields an initial vector near the solution region.
– Step 1: Present the input vectors to the network and apply the activation
function (FORWARD PROPAGATION).
– Step 2: Update the weights according to the following rule
(BACKWARD PROPAGATION):
w_j := w_j + Δw_j,  where  Δw_j = η (y^(i) − ŷ^(i)) x_j^(i)
Here η is the learning rate with 0 < η ≤ 1, and i = 1, …, n indexes the
training samples.
• Continue the iteration until the perceptron classifies all training examples
correctly.
• := denotes assignment.
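The steps above can be sketched as a small program. This is an illustrative sketch (in Python rather than the course's R), using a tiny hypothetical AND-style dataset and zero initial weights instead of the heuristic from Step 0:

```python
# Perceptron learning rule: w_j := w_j + eta * (y - y_hat) * x_j
def predict(w, x):
    # x includes a leading 1 so that w[0] acts as the bias weight
    z = sum(wj * xj for wj, xj in zip(w, x))
    return 1 if z > 0 else 0

def train_perceptron(X, y, eta=0.1, max_epochs=100):
    w = [0.0] * len(X[0])              # Step 0: initial weights (zeros here)
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):       # Step 1: forward propagation
            y_hat = predict(w, xi)
            delta = eta * (yi - y_hat)
            if delta != 0:
                errors += 1
            w = [wj + delta * xj for wj, xj in zip(w, xi)]  # Step 2
        if errors == 0:                # all training examples correct: stop
            break
    return w

# AND function: inputs (1, x1, x2); linearly separable, so training converges
X = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
y = [0, 0, 0, 1]
w = train_perceptron(X, y)
print([predict(w, xi) for xi in X])  # matches y once converged
```

Because the AND problem is linearly separable, the loop terminates with every example classified correctly, exactly as the convergence condition above requires.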
Perceptron as a neural network: Going Backward

[Diagram: a single perceptron with inputs x_0, x_1, x_2, x_3 in the input
layer, weights w_0, w_1, w_2, w_3, and an output layer producing ŷ. The error
for sample i is error^(i) = y^(i) − ŷ^(i); the goal is to minimize this error.]
Rate of change: Δw_j = η (y^(i) − ŷ^(i)) x_j^(i)

• Scenario 1: the output is correct – y^(i) = 1, ŷ^(i) = 1
• Scenario 2: the output is incorrect – y^(i) = 1, ŷ^(i) = 0
• Scenario 1: Δw_j = η (1 − 1) x_j^(i) = 0, so no change is necessary.
• Scenario 2: Δw_j = η (1 − 0) x_j^(i) = η x_j^(i), so the weight update is
proportional to the value of x_j^(i).
• In summary: where the perceptron predicts the class label correctly, the
weights remain unchanged; where the perceptron predicts the class label
incorrectly, the weights are updated proportionally to the value of the input.
The perceptron learning algorithm selects a search direction in weight space
according to the incorrect classification of the last tested vector.
Linearly separable – inseparable cases
• It is important to note that the convergence of the perceptron is only
guaranteed if the two classes are linearly separable and the learning rate is
sufficiently small.
Multilayer perceptrons
• Single-layer perceptrons are only capable of solving linearly separable
problems.
• In order to overcome the linearly inseparable problem, we can combine two or
more perceptrons, creating a multilayer perceptron.
• Therefore, by joining several hyperplanes, we can define a new set of
decision rules.
Example 1
• A network with 3 neurons in layer L=1, 4 neurons in layer L=2, and 2 neurons
in layer L=3.
• 3 × 4 = 12 w between L=1 and L=2; 4 × 2 = 8 w between L=2 and L=3.
• In total we have 12 + 8 = 20 weights to optimize.
Example 2
• 4 × 5 = 20 w and 5 × 1 = 5 w.
• In total we have 20 + 5 = 25 weights to optimize.
Example 3
• 4 × 3 = 12 w, 3 × 3 = 9 w, and 3 × 1 = 3 w.
• In total we have 12 + 9 + 3 = 24 weights to optimize.
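The three examples follow one pattern: for fully connected layers (ignoring separate bias terms), the number of weights between consecutive layers is the product of their sizes. A quick sketch of that count:

```python
def count_weights(layer_sizes):
    # Fully connected: weights between layer k and k+1 = n_k * n_(k+1)
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

print(count_weights([3, 4, 2]))     # Example 1: 12 + 8 = 20
print(count_weights([4, 5, 1]))     # Example 2: 20 + 5 = 25
print(count_weights([4, 3, 3, 1]))  # Example 3: 12 + 9 + 3 = 24
```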
• W^(k): matrix of weights controlling the function mapping from layer (k) to
layer (k + 1). (Here k = 1, …, L.)
• z^(k+1): vector of linear combinations of weights and activations from
layer (k):
z^(k+1) = W^(k) a^(k)
where a^(1) = x, a_0^(k) = 1 (acts as a bias), and a^(L) = ŷ (in the case of a
regression problem we have one output).
• a_j^(k): activation of unit (j) in layer (k) with a pre-specified activation
function. (Here j = 0, …, n^(k) and specific to the layer.)
a^(k+1) = g(z^(k+1))
• There are several different activation functions:
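The layer-to-layer mapping above can be sketched directly. This is a minimal illustration (in Python), assuming a sigmoid activation g and handling the bias by prepending a_0 = 1 at each layer; the weight matrices W1 and W2 are hypothetical values chosen only for the example:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights):
    # weights[k] is W^(k): one row per unit in layer k+1, and each row
    # includes a weight for the bias activation a_0 = 1.
    a = list(x)                              # a^(1) = x
    for W in weights:
        a = [1.0] + a                        # a_0 = 1 acts as the bias
        z = [sum(wj * aj for wj, aj in zip(row, a)) for row in W]
        a = [sigmoid(zj) for zj in z]        # a^(k+1) = g(z^(k+1))
    return a                                 # a^(L) = y_hat

# Two inputs -> hidden layer of two units -> one output (hypothetical weights)
W1 = [[0.1, 0.4, -0.2], [0.0, 0.3, 0.5]]
W2 = [[0.2, -0.6, 0.7]]
out = forward([1.0, 2.0], [W1, W2])
print(out)
```

With a sigmoid output unit, the single output lands between 0 and 1; for a regression problem the last layer would typically use a linear activation instead.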
Activation Functions
• In perceptrons, a small change in the weights or bias of any single
perceptron in the network can sometimes cause the output of that perceptron to
completely flip, say from 0 to 1. That flip may then cause the behaviour of
the rest of the network to completely change in some very complicated way.
• There are several different activation functions:
– Step function
– Constant function
– Threshold function (step)
– Threshold function (ramp)
– Linear function
– Sigmoid function
– Hyperbolic tangent function
Activation Function – ReLU Function (Rectified Linear Unit)
• Not differentiable at 0; however, it is differentiable everywhere else. At
the value of zero, either 0 or 1 can be chosen for the derivative.

a = g(z) = z if z = Σ_{j=0}^{m} w_j x_j > 0, and 0 otherwise.
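A minimal sketch of the ReLU and a conventional choice for its derivative at zero (0 is picked here; 1 is equally admissible, as noted above):

```python
def relu(z):
    # g(z) = z if z > 0, otherwise 0
    return z if z > 0 else 0.0

def relu_derivative(z):
    # Undefined at exactly 0; by convention we pick 0 here (1 is also valid)
    return 1.0 if z > 0 else 0.0

print([relu(z) for z in (-2.0, 0.0, 3.5)])             # [0.0, 0.0, 3.5]
print([relu_derivative(z) for z in (-2.0, 0.0, 3.5)])  # [0.0, 0.0, 1.0]
```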
How do we find the weights, w?
• One way of attacking the problem is to use calculus to try to find the
minimum analytically.
• We could compute derivatives and then try using them to find places where
the cost function C is an extremum. With some luck that might work when C is a
function of just one or a few variables.
• But it will turn into a nightmare when we have many more variables.
• And for neural networks we will often want far more variables – the biggest
neural networks have cost functions which depend on billions of weights and
biases in an extremely complicated way.
• Using calculus to minimize that just won't work!
How do we find the weights, w? – Going Backward
• 2) Gradient Descent Algorithm:
– Step 0: Training begins by assigning some initial random values to the
network parameters.
– Step 1: Present the input vectors to the network and apply the activation
function (FORWARD PROPAGATION).
– Step 2: Calculate the error using a cost function:
J(w) = (1 / 2n) Σ_{i=1}^{n} (y^(i) − ŷ^(i))²  for regression problems
(without any regularization)
J(w) = −(1 / n) Σ_{i=1}^{n} [ y^(i) log ŷ^(i) + (1 − y^(i)) log(1 − ŷ^(i)) ]
for classification problems (without any regularization)
Our goal is to minimize the error J(w) with respect to w_j.
– Step 3: Update the weights according to the following rule:
w_j := w_j − α ∂J(w)/∂w_j
Here α is the learning rate and 0 < α ≤ 1.
• Continue the iteration until convergence.
How do we find the weights, w? – Going Backward
2) Gradient Descent Algorithm:
• Let us examine the weight update function:
• w_j := w_j − α ∂J(w)/∂w_j
• The partial derivative answers the question "What is the slope of J(w) at
point w?"
• And α determines the amount of change to be made. If α is too small, we take
small steps toward the optimal values, and it will take too long to reach the
optimum.
• If α is too big, we may overshoot the optimal values and fail to converge.
• Let us have a look at these concepts with a small example:
x (size): c(0, 1, 2, 3)
y (price): c(0, 2, 4, 6)
One feature to estimate a numeric variable.
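For this toy example (x = 0, 1, 2, 3; y = 0, 2, 4, 6, so the exact fit is ŷ = 0 + 2x), the update rule can be sketched end to end. This is an illustrative Python sketch of batch gradient descent for a model ŷ = w0 + w1·x; the learning rate and iteration count are arbitrary choices for the example:

```python
# Batch gradient descent for y_hat = w0 + w1 * x with
# J(w) = 1/(2n) * sum_i (y_i - y_hat_i)^2
x = [0.0, 1.0, 2.0, 3.0]   # size
y = [0.0, 2.0, 4.0, 6.0]   # price
n = len(x)

w0, w1 = 0.0, 0.0          # Step 0: initial weights
alpha = 0.1                # learning rate

for _ in range(1000):
    y_hat = [w0 + w1 * xi for xi in x]        # Step 1: forward pass
    # Partial derivatives of J(w):
    # dJ/dw0 = -(1/n) * sum(err),  dJ/dw1 = -(1/n) * sum(err * x)
    grad_w0 = -sum(yi - yh for yi, yh in zip(y, y_hat)) / n
    grad_w1 = -sum((yi - yh) * xi for yi, yh, xi in zip(y, y_hat, x)) / n
    w0 -= alpha * grad_w0                     # Step 3: w := w - alpha * dJ/dw
    w1 -= alpha * grad_w1

print(round(w0, 3), round(w1, 3))  # approaches 0 and 2
```

Because the data lie exactly on a line, the weights converge to the exact-function values w0 = 0 and w1 = 2.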
Batch, Stochastic Gradient Descent
• In batch gradient descent learning, the weight update is calculated based on
all samples in the training set (instead of updating the weights incrementally
after each sample), which is why this approach is also referred to as "batch"
gradient descent.
• Vector – matrix operations
Batch, Stochastic Gradient Descent
• Now imagine we have a very large dataset with millions of data points, which
is not uncommon in many machine learning applications. Running batch gradient
descent can be computationally quite costly in such scenarios, since we need
to re-evaluate the whole training dataset each time we take one step towards
the global minimum.
• A popular alternative to the batch gradient descent algorithm is stochastic
gradient descent, sometimes also called iterative or on-line gradient descent.
Instead of updating the weights based on the sum of the accumulated errors
over all samples, we update the weights incrementally for each training
sample (sketched here in R for a single-feature linear model):

for (i in 1:n) {
  error <- y[i] - (w0 + w1 * x[i])     # calculate the error for sample i
  w0    <- w0 + alpha * error          # update the weights using the
  w1    <- w1 + alpha * error * x[i]   # per-sample derivatives
}
Batch, Stochastic Gradient Descent
• A compromise between batch gradient descent and stochastic gradient descent
is the so-called mini-batch learning. In mini-batch learning, a neural network
learns from a small subset of training inputs at a time.
• Mini-batch learning can be understood as applying batch gradient descent to
smaller subsets of the training data – for example, 50 samples at a time.
• By averaging over this small sample, it turns out that we can quickly get a
good estimate of the true gradient, and this helps speed up gradient descent,
and thus learning.
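The mini-batch idea can be sketched as applying the batch update to small index subsets. This Python sketch reuses the toy data above with a hypothetical batch size of 2 (real applications would use larger batches, e.g. 50):

```python
# Mini-batch gradient descent: batch updates applied to small subsets
x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 2.0, 4.0, 6.0]
batch_size = 2
w0, w1, alpha = 0.0, 0.0, 0.1

for _ in range(3000):
    for start in range(0, len(x), batch_size):
        xb = x[start:start + batch_size]       # one mini-batch
        yb = y[start:start + batch_size]
        m = len(xb)
        errs = [yi - (w0 + w1 * xi) for xi, yi in zip(xb, yb)]
        w0 += alpha * sum(errs) / m            # gradient averaged over batch
        w1 += alpha * sum(e * xi for e, xi in zip(errs, xb)) / m

print(round(w0, 2), round(w1, 2))  # close to 0 and 2
```

Each step only touches `batch_size` samples, yet the averaged gradient still drives the weights toward the same solution as full-batch descent.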
Final note on Gradient Descent: Input variable preprocessing
• Gradient descent is one of the many algorithms that benefit from feature
scaling. Each input variable should be preprocessed so that its mean value,
averaged over the entire training sample, is close to zero, or else small
compared to its standard deviation.
• In order to accelerate the back-propagation learning process, the
normalization of the inputs should also include two other measures (LeCun,
1993):
• The input variables contained in the training set should be uncorrelated;
this can be done by using principal components analysis (USL).
• The decorrelated input variables should be scaled so that their covariances
are approximately equal, thereby ensuring that the different synaptic weights
in the network learn at approximately the same speed.
Final note on Gradient Descent: Input variable preprocessing
• We will use a feature scaling method called standardization, which gives our
data the property of a standard normal distribution.
• The mean of each input feature is centered at 0 and the feature column has a
standard deviation of 1:
x_j' = (x_j − μ_j) / σ_j
where μ_j is the mean and σ_j the standard deviation of feature j.
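Standardization as defined above can be sketched in a few lines (a minimal Python version using the population standard deviation; in R one would typically call scale(), which uses the sample standard deviation instead):

```python
def standardize(values):
    # x' = (x - mean) / sd, giving mean 0 and standard deviation 1
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5  # population sd
    return [(v - mean) / sd for v in values]

x = [0.0, 1.0, 2.0, 3.0]
x_std = standardize(x)
print(x_std)  # mean ~0, standard deviation ~1
```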
Hypothetical Data – The exact function is ŷ = 0 + 2x

> dat
  x y
1 0 0
2 1 2
3 2 4
4 3 6
What does α do?
• If α is too small, the rate of change in the weights will be tiny. It will
take too long to reach the optimum solution.
• If α is too big, the rate of change in the weights will be very big. We may
never find the optimum solution; our algorithm may fail to converge.
α, adaptive learning
• In stochastic gradient descent implementations, the fixed learning rate α is
often replaced by an adaptive learning rate that decreases over time, for
example,
α = C1 / (number of iterations + C2)
• where C1 and C2 are constants. Note that stochastic gradient descent does
not reach the global minimum but an area very close to it. By using an
adaptive learning rate, we can achieve further annealing toward the minimum.
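The decreasing schedule α = C1 / (number of iterations + C2) can be sketched directly; C1 and C2 below are arbitrary illustrative constants:

```python
def adaptive_alpha(iteration, c1=1.0, c2=10.0):
    # alpha = C1 / (iteration + C2): larger steps early, smaller steps later
    return c1 / (iteration + c2)

print([round(adaptive_alpha(t), 3) for t in (0, 10, 100, 1000)])
# [0.1, 0.05, 0.009, 0.001]
```

Early iterations take large steps toward the minimum; later iterations take ever smaller steps, which is the annealing effect described above.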
What if we have w0 and w1 to change together?
• The cost function J(w0, w1) will be a 3D surface plot (left pane).
• The contour plot will provide the same cost along the same contour (right
pane).
[Figure: surface plot of J(w0, w1) over the (w0, w1) plane, alongside the
corresponding contour plot.]
Regularization in Neural Networks
• In multi-layer neural networks, the number of input and output units is
generally determined by the dimensionality of the data set.
• On the other hand, we are free to choose the number of hidden layer units
(M). We may typically have hundreds, thousands, or even billions of weights
that we need to optimize.
• Choose the optimum number of hidden layer units (M) that gives the best
generalization performance, balancing underfitting and overfitting.
• A network is said to generalize well when the input–output mapping computed
by the network is correct (or nearly so) for test data never used in creating
or training the network. Here, it is assumed that the test data are drawn from
the same population used to generate the training data.
Regularization in Neural Networks
• The generalization error, however, is not a simple function of the number of
hidden layer units (M), due to the presence of local minima in the error
function.
• Each time we start from random values of the weight vector for each hidden
layer size considered, we see the effect of choosing multiple random
initializations for the weight vector over a range of values of M.
• In practice, one approach to choosing M is to plot a graph of M versus the
errors, and then choose the specific solution having the smallest validation
set error.
Regularization in Neural Networks
• J(w) = (1 / 2n) Σ_{i=1}^{n} (y^(i) − ŷ^(i))²
• There are, however, other ways to control the complexity of a neural network
model in order to avoid over-fitting, such as adding a quadratic regularizer
(L2):
• J̃(w) = (1 / 2n) Σ_{i=1}^{n} (y^(i) − ŷ^(i))² + (λ / 2) ‖w‖²
• This regularizer is also known as weight decay.
• The effective model complexity is then determined by the choice of the
regularization coefficient λ.
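The weight-decay cost can be sketched as a direct translation of the formula (a minimal Python version; for simplicity all weights are penalized here, although in practice the bias weight is often excluded from the penalty):

```python
def l2_cost(y, y_hat, w, lam):
    # J(w) = 1/(2n) * sum_i (y_i - y_hat_i)^2 + lambda/2 * ||w||^2
    n = len(y)
    sse = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
    penalty = (lam / 2.0) * sum(wj ** 2 for wj in w)
    return sse / (2.0 * n) + penalty

# Toy data from earlier, hypothetical predictions and weights
cost = l2_cost([0, 2, 4, 6], [0.1, 1.9, 4.2, 5.8], [0.0, 2.0], lam=0.1)
print(cost)
```

Increasing lam raises the penalty on large weights, shrinking them toward zero and thereby reducing the effective model complexity.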
Early Stopping in NNs
• An alternative to regularization as a way of controlling the effective
complexity of a network is the procedure of early stopping.
• The training of nonlinear network models corresponds to an iterative
reduction of the error function defined with respect to a set of training
data.
• The error measured with respect to independent data, generally called a
validation set, often shows a decrease at first, followed by an increase as
the network starts to over-fit. Training can therefore be stopped at the point
of smallest error with respect to the validation data set, in order to obtain
a network having good generalization performance.
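The stopping rule above can be sketched as monitoring the validation error and halting once it stops improving. This is an illustrative Python sketch with a hypothetical validation-error curve and a hypothetical `patience` parameter; real training would compute these errors epoch by epoch:

```python
def early_stopping_epoch(val_errors, patience=2):
    # Stop once the validation error has not improved for `patience` epochs;
    # return the epoch with the smallest validation error seen so far.
    best_epoch, best_err, waited = 0, float("inf"), 0
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_epoch, best_err, waited = epoch, err, 0
        else:
            waited += 1
            if waited >= patience:
                break                # validation error is rising: over-fitting
    return best_epoch

# Hypothetical validation curve: decreases, then the network starts to over-fit
errors = [0.9, 0.6, 0.4, 0.35, 0.37, 0.41, 0.5]
print(early_stopping_epoch(errors))  # epoch 3 (error 0.35)
```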
When to use NNs
• When dealing with unstructured datasets.
• When you do not need interpretable results; for example, when you just want
to classify your pictures as cats or dogs, you don't need to know why the
outcome is classified as a cat or a dog. You don't need to explain the
relationships.
• When you have many features (with regularization).
• When you have nonlinear relationships.
Resources
• A free online book by Michael Nielsen (a brilliant resource for partial
derivatives and gradient descent):
http://neuralnetworksanddeeplearning.com/
• The Elements of Statistical Learning, Trevor Hastie et al. (p. 389)
• Pattern Recognition and Machine Learning, Christopher Bishop (p. 227)
• Machine Learning with R, Brett Lantz (p. 219)
• Neural Networks – A Comprehensive Foundation, Simon S. Haykin
• Python Machine Learning, Sebastian Raschka (p. 17)
• Neural Network Design, Hagan, Demuth, Beale, De Jesus
(http://hagan.okstate.edu/nnd.html)
• https://github.com/stephencwelch/Neural-Networks-Demystified
• And, of course, Prof. Patrick Henry Winston's MIT YouTube lectures.