1. Machine Learning Algorithms: Theory,
Applications and Software Tools
Lecture 2
Basics of ANN: MLP
Prof. Mikhail Kanevski
Institute of Geomatics and Analysis of Risk,
University of Lausanne
Mikhail.Kanevski@unil.ch
2. Contents
• Introduction to artificial neural networks
• Multilayer perceptron
• Case studies
3. Basics of ANN
Artificial neural networks are analytical systems that
address problems whose solutions have not been
explicitly formulated.
In this way they contrast with classical computers and
computer programs, which are designed to solve
problems whose solutions - although they may be
extremely complex - have been made explicit.
4. Basics of ANN
• We can program or train neural networks to store,
recognise, and associatively retrieve patterns;
• to filter noise from measurement data;
• to control ill-defined problems;
in summary:
• to estimate sampled functions when we do not
know the form of the functions.
5. Basics of ANN
Unlike statistical estimators, they estimate a function
without a mathematical model of how outputs
depend on inputs.
Neural networks are model-semifree estimators
(semiparametric models). They "learn from
experience" with numerical and, sometimes,
linguistic sample data.
6. Basics of ANN
The major applications of ANN:
• Feature recognition (pattern classification), speech recognition
• Signal processing
• Time-series prediction
• Function approximation and regression, classification
• Data Mining
• Intelligent control
• Associative memories
• Optimisation
• And many others
9. Examples of transfer functions
$$f(x) = \frac{1}{1 + \exp(-x)}$$
$$\tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}$$
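To make these formulas concrete, here is a minimal Python sketch of the two transfer functions (NumPy is an assumption of this illustration, not part of the lecture):

```python
import numpy as np

def logistic(x):
    """Logistic sigmoid: f(x) = 1 / (1 + exp(-x)), values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Hyperbolic tangent: (exp(x) - exp(-x)) / (exp(x) + exp(-x)), values in (-1, 1)."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

x = np.linspace(-5, 5, 11)
print(logistic(x))                        # squashes the real line into (0, 1)
print(np.allclose(tanh(x), np.tanh(x)))   # matches NumPy's built-in: True
```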
10. Basics of ANN
The main parts of ANN:
• Neurones
(nodes, cells, units, processing
elements)
• Network topology
(connections between neurones)
11. Basics of ANN
In general, Artificial Neural Networks are
a collection of simple computational
units (cells) interlinked by a system of
connections (synaptic connections).
The number of units and the pattern of
connections define the network topology.
13. Basics of ANN.
ANN learning/training
Supervised learning is the most common form of training. Many samples
(Input(i), Output(i)) are prepared as a training set. A subset of the
training data set is then selected, and its samples are presented to the
network one by one. For each sample, the result obtained by the network,
O(Input(i)), is compared with the desired Output(i). After the entire
training subset has been presented, the weights are updated. This updating
is done in such a way that a measure of the error between the network's
outputs and the desired outputs is reduced. One pass through the subset of
training samples, together with an update of the weights, is called an
epoch. The number of samples in the subset is called the epoch size.
Sometimes an epoch size of one is used.
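The following minimal sketch illustrates the epoch/epoch-size vocabulary on the simplest possible "network", a single linear neuron; the data, learning rate, and epoch size are all invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # Input(i)
t = X @ np.array([1.5, -2.0]) + 0.3      # desired Output(i)

w, b, eta = np.zeros(2), 0.0, 0.05       # weights, bias, learning rate
epoch_size = 20                          # samples presented per epoch

for epoch in range(200):
    idx = rng.choice(len(X), size=epoch_size, replace=False)  # training subset
    grad_w, grad_b = np.zeros(2), 0.0
    for i in idx:                        # present samples one by one
        err = (w @ X[i] + b) - t[i]      # network output minus desired output
        grad_w += err * X[i]
        grad_b += err
    # weights are updated once, after the whole subset has been presented
    w -= eta * grad_w / epoch_size
    b -= eta * grad_b / epoch_size

print(w, b)   # should approach [1.5, -2.0] and 0.3
```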
14. Basics of ANN.
ANN supervised learning.
[Diagram: supervised learning. Examples are presented both to a teacher and to the neural network; an evaluation of the network's response against the teacher's response drives the learning algorithm, which modifies the network.]
15. Basics of ANN
Feedforward ANN.
If there are no feedback or lateral
connections, we have a feedforward
ANN. The most frequently used model
is the so-called multilayer perceptron.
The term feedforward means that
information flows in only one direction -
from the input to the output.
16. ANN Multi-layer Perceptron (MLP)
• Depends only on the data and its inner structure
• Is able to learn from data and generalise
• Good at modelling nonlinearities
• Robust to noise and outliers
[ANN = artificial neurons + connection weights]
17. Basics of ANN
All of an ANN's knowledge is encoded in the
synaptic weights between units.
18. The Universality Property
• A two-layer feedforward neural network
with step activation functions can
implement any Boolean function,
provided that the number of hidden
neurons H is sufficiently large.
19. MLP modelling
$$F_1(t, w) = w_1^{out} f(w_1 t + b_1) + b^{out},$$
$$F_2(t, w) = w_1^{out} f(w_1 t + b_1) + w_2^{out} f(w_2 t + b_2) + b^{out},$$
$$F_3(t, w) = w_1^{out} f(w_1 t + b_1) + w_2^{out} f(w_2 t + b_2) + w_3^{out} f(w_3 t + b_3) + b^{out}.$$
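A small sketch of how these one-input models F_H(t, w) with H hidden neurons can be evaluated; the logistic choice of f and all parameter values are arbitrary assumptions:

```python
import numpy as np

def f(x):
    """Hidden-layer transfer function (logistic sigmoid assumed here)."""
    return 1.0 / (1.0 + np.exp(-x))

def mlp_1d(t, w, b, w_out, b_out):
    """F_H(t, w) = sum_k w_out[k] * f(w[k] * t + b[k]) + b_out."""
    return np.sum(w_out * f(w * t + b)) + b_out

# H = 3 hidden neurons, arbitrary parameter values for illustration
w = np.array([1.0, -2.0, 0.5]); b = np.array([0.0, 1.0, -1.0])
w_out = np.array([2.0, 1.0, -1.5]); b_out = 0.1
print(mlp_1d(0.7, w, b, w_out, b_out))
```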
21. Error function depends on
network’s weights (W)
$$E_l(W) = \frac{1}{n} \sum_{j=0}^{n-1} \left\{ T_{lj} - Z_{lj}^{out}(W) \right\}^2$$
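Written as code, the error of one output unit over n samples is a direct transcription of this formula (the target and output arrays below are made up):

```python
import numpy as np

def mse(T, Z):
    """E_l(W) = (1/n) * sum_j (T_lj - Z_lj(W))^2 for one output unit l."""
    return np.mean((T - Z) ** 2)

T = np.array([1.0, 0.0, 1.0, 1.0])   # targets T_lj
Z = np.array([0.9, 0.2, 0.8, 0.6])   # network outputs Z_lj(W)
print(mse(T, Z))                     # (0.01 + 0.04 + 0.04 + 0.16) / 4 = 0.0625
```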
22. MLP training algorithms
Optimisation algorithms used for MLP training:
• Stochastic
− Annealing
− Genetic algorithm
• Gradient
− Conjugate gradients (slow 1st order gradient algorithm)
− Levenberg-Marquardt (fast 2nd order gradient algorithm)
− BFGS formula – quasi Newton
− Steepest Descent
− RProp – resilient propagation
− BackProp – back propagation
23. Feedforward ANN: Multilayer
perceptron. Backprop algorithm
• The possibilities and capabilities of multilayer perceptrons stem from
the nonlinearities used within the nodes. An MLP can learn with a supervised
learning rule - the backpropagation algorithm. The backward error
propagation algorithm for ANN learning/training caused a
breakthrough in the application of multilayer perceptrons.
• The backpropagation algorithm is a supervised learning algorithm: an
iterative gradient algorithm designed to minimise an error measure
between the actual output of the neural network and the desired output.
We have to optimise a very nonlinear system consisting of a large number
of highly correlated variables.
24. Basics of ANN
Backpropagation Algorithm
The backpropagation algorithm follows the next
algorithmic steps:
• 1. Initialize the weights. It is usually recommended to set all
weights and node offsets to small random values. In our
study we shall use simulated annealing and/or a genetic
algorithm to select starting values more intelligently, as
recommended in [Masters].
• 2. Present inputs and desired outputs. The vectors
(Input_l, Output_l = t_l) are presented to the network.
• 3. Calculate the actual output of the ANN.
25. Basics of ANN
Backpropagation Algorithm
• 4. Calculate the error measure and update the
weights. Use a recursive algorithm starting at the
output neurons (nodes) and working back to the
first hidden layer - it is this backward propagation
of output errors that inspired the name of the
training algorithm. Update the weights W as follows.
26. We want to know how to modify
weights in order to decrease the
error function
$$w_{ij}(t+1) - w_{ij}(t) \propto -\frac{\partial E(t)}{\partial w_{ij}(t)}$$
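With a learning rate η, the proportionality becomes the usual gradient descent step. A tiny sketch on an invented one-dimensional error E(w) = (w − 3)²:

```python
def E_grad(w):
    """Gradient of E(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

w, eta = 0.0, 0.1
for step in range(50):
    w -= eta * E_grad(w)   # w(t+1) = w(t) - eta * dE/dw
print(w)                   # approaches the minimum at w = 3
```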
27. Basics of ANN
Backpropagation Algorithm
$$w_{ij}^{m}(n+1) = w_{ij}^{m}(n) + \eta\, \delta_i^{m} Z_j^{(m-1)}$$
where n is the iteration step, η is the learning rate (0 < η ≤ 1), Z_j^{(m−1)} is the output of the j-th neuron in layer (m−1), and the error δ_i for the output layer is defined by the following equation.
28. Basics of ANN
Backpropagation Algorithm
$$\delta_i^{out} = Z_i^{out}(1 - Z_i^{out})(T_i - Z_i^{out})$$
$$\delta_i^{(h-1)} = Z_i^{h}(1 - Z_i^{h}) \sum_j w_{ij}^{h} \delta_j^{h}$$
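A minimal sketch that applies these update and delta rules to a 2-input, 2-hidden, 1-output network with logistic units; the sample, layer sizes, and learning rate are assumptions of the example, and biases are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    """Logistic transfer function."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([0.5, -1.0])                 # one input sample
T = np.array([1.0])                       # desired output
W1 = rng.normal(scale=0.1, size=(2, 2))   # input -> hidden weights
W2 = rng.normal(scale=0.1, size=(1, 2))   # hidden -> output weights
eta = 0.5                                 # learning rate

for n in range(1000):
    # forward pass
    Zh = f(W1 @ x)                        # hidden-layer outputs
    Zo = f(W2 @ Zh)                       # output-layer outputs
    # backward pass: delta_out = Z(1 - Z)(T - Z)
    d_out = Zo * (1 - Zo) * (T - Zo)
    # hidden delta: Z(1 - Z) * sum over output deltas weighted by connections
    d_hid = Zh * (1 - Zh) * (W2.T @ d_out)
    # weight update: w(n+1) = w(n) + eta * delta_i * Z_j
    W2 += eta * np.outer(d_out, Zh)
    W1 += eta * np.outer(d_hid, x)

print(Zo)   # approaches the target 1.0
```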
29. Basics of ANN
Backpropagation Algorithm
Other error measures (such as maximum absolute error and
median squared error) have advantages over mean squared
error in many situations. For example, median squared error is
useful because, unlike the mean, the median is a robust
statistic: its value is insensitive to occasional large errors
in the training data. Unfortunately, practical techniques for
implementing these more desirable error measures do not
yet exist. Thus, most neural networks today are tied to
mean squared error measurements.
30. Basics of ANN
Backpropagation Algorithm
More general error functions can be written by taking into
account the importance of the samples presented to the network
(weighting, declustering, economic criteria, etc.):
$$E_l(W) = \sum_{j=0}^{n-1} \left\{ T_{lj} - Z_{lj}^{out}(W) \right\}^2 \omega_{lj}$$
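In code, the only change from the plain sum of squares is a per-sample weight ω (all arrays below are invented):

```python
import numpy as np

def weighted_sse(T, Z, omega):
    """E_l(W) = sum_j (T_lj - Z_lj(W))^2 * omega_lj."""
    return np.sum(((T - Z) ** 2) * omega)

T = np.array([1.0, 0.0, 1.0]); Z = np.array([0.8, 0.1, 0.5])
omega = np.array([1.0, 0.5, 2.0])   # e.g. declustering weights
print(weighted_sse(T, Z, omega))    # 0.04*1 + 0.01*0.5 + 0.25*2 = 0.545
```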
31. Gradient descent
[Figure: gradient descent on the error surface J(w); the gradient J′(w) points in the direction of steepest ascent, and the weights move in the opposite direction toward the minimum.]
33. In reality, the situation with the error function and the corresponding optimization problem is much more complicated: multiple local minima are present!
36. How important are local
minima?
(Duda et al. 2001)
In computational practice, we do not want our
network to be caught in a local minimum with
high training error, because this usually indicates
that key features of the problem have not been
learned by the network.
In such cases it is traditional to reinitialize the
weights and train again, possibly also altering
other parameters in the net.
37. How important are local
minima?
(Duda et al. 2001)
In many problems, convergence to a
nonglobal minimum is acceptable if the
error is nevertheless fairly low.
Furthermore, common stopping criteria
demand that training terminate even
before the minimum is reached, so it is
not essential that the network converge
toward the global minimum to achieve
acceptable performance.
38. In short
The presence of multiple minima does not
necessarily present difficulties in training
nets, and a few simple heuristics can often
overcome such problems (see the next slide).
39. Practical techniques for
improving backpropagation
• Activation function (sigmoid, hyperbolic tangent,..)
• Scaling inputs
• Training with noise (noise injection)
• Initializing weights (simulated annealing)
• Regularization (weight decay)
• Number of hidden layers
• Learning parameters (rates, momentum,..)
• Cost function
• …
40. Interpretation of network’s
outputs
Consider the limit in which the size N of the training data set goes to
infinity [Bishop 1995]. In this limit we can replace the finite sum over
patterns in the sum-of-squares error with an integral of the form
$$E = \lim_{N \to \infty} \frac{1}{2N} \sum_{n=1}^{N} \sum_{k} \{ y_k(x^n; w) - t_k^n \}^2 = \frac{1}{2} \sum_{k} \iint \{ y_k(x; w) - t_k \}^2 \, p(t_k, x) \, dt_k \, dx$$
41. Interpretation of network’s
outputs
Minimizing this error shows that the network mapping is given
by the conditional average of the target data, i.e. the
regression of t_k conditioned on x:
$$y_k(x; w^*) = \langle t_k \mid x \rangle$$
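This result can be checked numerically: on synthetic data, the constant output minimizing the squared error at a given x0 coincides with the empirical conditional mean of the targets there. A sketch (all data invented):

```python
import numpy as np

rng = np.random.default_rng(2)
# three repeated input locations, noisy targets around a "true" mean
x = np.repeat([0.0, 1.0, 2.0], 1000)
t = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# the squared-error-optimal prediction at each x0 is the conditional mean <t|x0>
for x0 in [0.0, 1.0, 2.0]:
    preds = np.linspace(-2, 2, 401)                       # candidate constant outputs
    sse = [np.mean((t[x == x0] - p) ** 2) for p in preds]
    best = preds[np.argmin(sse)]
    print(x0, best, t[x == x0].mean())                    # best ≈ empirical mean
```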
43. MLP and number of layers
• The problem with an MLP using a single hidden
layer is that the neurons tend to interact with
each other globally. In complex situations,
this interaction makes it difficult to improve
the approximation at one point without
worsening it at some other point.
• On the other hand, with two hidden layers,
the approximation process becomes more
manageable.
44. Two hidden layers! (Haykin)
1. Local features are extracted in the first hidden
layer. Specifically, some neurons in the first
hidden layer are used to partition the input space
into regions, and other neurons in that layer
learn the local features characterizing those
regions.
2. Global features are extracted in the second
layer. Specifically, a neuron in the second
hidden layer combines the outputs of neurons in
the first hidden layer operating on a particular
region of the input space and thereby learns the
global features for that region and outputs zero
elsewhere.
45. Data Preprocessing
• Machine learning algorithms are data-driven methods.
• The quality and quantity of data are essential for training and generalization.
[Diagram: Input data → Pre-processing → MLA → Post-processing → Results]
46. Types of pre-processing:
1. Linear and nonlinear transformations
e.g. input scaling/normalisation, Z-score transform,
square root transform, N-score transform, etc.
2. Dimensionality reduction
3. Incorporation of prior knowledge
invariants, hints, …
4. Feature extraction
linear/nonlinear combinations of input variables
5. Feature selection
deciding which features to use
47. Dimensionality reduction
• Two approaches are available for performing
dimensionality reduction:
• Feature extraction: creating a set of new
features from combinations of the existing
features
• Feature selection: choosing a subset of all
the features (the most informative ones)
49. Feature selection
• Reducing the feature space by throwing
out some of the features (covariates)
– Also called variable selection
• Motivating idea: try to find a simple,
“parsimonious” model (Occam’s razor!)
51. Dimensionality Reduction
Dimensionality reduction clearly loses some information, but
this can be helpful because of the curse of dimensionality.
We need some way of deciding which dimensions to keep:
1. Random choice
2. Principal component analysis (PCA)
3. Independent component analysis (ICA)
4. Self-organised maps (SOM)
52. Data transform
• Y = aZ + b
• Y = Log(Z)
• Y = Ind(Z, Zs)
• Normalisation (Z-score): Y = (Z − Zm)/σ
• Box-Cox nonlinear transform:
$$Y(\lambda) = \frac{Z^{\lambda} - 1}{\lambda} \quad \text{if } \lambda > 0, \qquad Y(\lambda = 0) = \ln(Z)$$
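A hedged sketch of these transforms in Python; the indicator cutoff Zs and the Box-Cox λ are arbitrary choices, and Ind(Z, Zs) is coded with one common convention (1 if Z ≤ Zs):

```python
import numpy as np

Z = np.array([0.5, 1.0, 2.0, 4.0, 8.0])

zscore = (Z - Z.mean()) / Z.std()          # Y = (Z - Zm) / sigma
indicator = (Z <= 2.0).astype(float)       # Y = Ind(Z, Zs), cutoff Zs = 2.0 assumed

def box_cox(Z, lam):
    """Y(lambda) = (Z^lambda - 1) / lambda for lambda > 0, ln(Z) at lambda = 0."""
    return np.log(Z) if lam == 0 else (Z ** lam - 1.0) / lam

print(zscore, indicator, box_cox(Z, 0.5), box_cox(Z, 0.0))
```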
54. Guillaume d'Occam (1285 - 1349)
“Pluralitas non est ponenda sine
necessitate”
Occam’s razor:
“The simpler explanation of the phenomena
is more likely to be correct”
56. Model Selection:
Estimating the performance of different
models in order to choose the
(approximate) best one
Model Assessment:
Having chosen a final model, estimating its
prediction error (generalization error) on
new data
57. If we are in a data-rich situation, the best
solution is to split the data randomly (?):
Raw Data → Train: 50% | Validation: 25% | Test: 25%
(some texts instead name these subsets train, test, and validation)
58. Interpretation
• The training set is used to fit the models
• The validation set is used to estimate prediction
error for model selection (tuning
hyperparameters)
• The test set is used for assessment of the
generalization error of the final chosen model
Elements of Statistical Learning- Hastie, Tibshirani & Friedman 2001
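A minimal sketch of the 50/25/25 random split in Python (NumPy-based; the function and variable names are my own):

```python
import numpy as np

def split_data(X, y, seed=0):
    """Randomly split into 50% train, 25% validation, 25% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train, n_val = len(X) // 2, len(X) // 4
    tr = idx[:n_train]
    va = idx[n_train:n_train + n_val]
    te = idx[n_train + n_val:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

X, y = np.arange(40).reshape(20, 2), np.arange(20)
train, val, test = split_data(X, y)
print(len(train[0]), len(val[0]), len(test[0]))   # 10 5 5
```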
59. Bias and Variance.
Model’s complexity
[Figure: two fits to the same data, illustrating (b) overfitting and (c) underfitting.]
60. One of the most serious problems that arises in
connectionist learning by neural networks is
overfitting of the provided training examples.
This means that the learned function fits the
training data very closely; however, it does not
generalise well, that is, it cannot model
sufficiently well unseen data from the same task.
Solution: balance the statistical bias and the statistical
variance when doing neural network learning, in
order to achieve the smallest average generalization
error.
62. We can derive an expression for the
expected prediction error of a
regression at an input point X=x0
using squared-error loss:
63.
$$\mathrm{Err}(x_0) = E[(Y - \hat{f}(x_0))^2 \mid X = x_0]$$
$$= \sigma_\varepsilon^2 + [E\hat{f}(x_0) - f(x_0)]^2 + E[\hat{f}(x_0) - E\hat{f}(x_0)]^2$$
$$= \sigma_\varepsilon^2 + \mathrm{Bias}^2(\hat{f}(x_0)) + \mathrm{Var}(\hat{f}(x_0))$$
$$= \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}$$
64. • The first term is the variance of the target around
its true mean f(x_0); it cannot be avoided no
matter how well we estimate f(x_0), unless σ_ε² = 0.
• The second term is the squared bias, the amount
by which the average of our estimate differs from
the true mean.
• The last term is the variance, the expected
squared deviation of $\hat{f}(x_0)$ around its mean.
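The three terms can be estimated by simulation. The sketch below uses synthetic data and a deliberately biased estimator (0.8 times the sample mean) so that all three terms are visible; every number is an assumption of the example:

```python
import numpy as np

rng = np.random.default_rng(3)
f_x0, sigma = 2.0, 0.5          # true mean and noise level at X = x0
n_rep, m = 20000, 10            # repeated training sets, samples per set

f_hat = np.empty(n_rep)
for r in range(n_rep):
    y = f_x0 + rng.normal(scale=sigma, size=m)   # a fresh training set at x0
    f_hat[r] = 0.8 * y.mean()                    # a deliberately biased estimator

bias2 = (f_hat.mean() - f_x0) ** 2   # [E f_hat - f]^2, approx (0.2 * 2.0)^2 = 0.16
var = f_hat.var()                    # E[f_hat - E f_hat]^2, approx 0.64 * sigma^2 / m
err = sigma ** 2 + bias2 + var       # expected prediction error at x0
print(bias2, var, err)
```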
67. • A neural network is only as good as the
training data!
• Poor training data inevitably leads to an
unreliable and unpredictable network.
• Exploratory Data Analysis and data
preprocessing are extremely important!!!
68. MLP modelling. Case Studies.
[Maps: original data (10 000 points) and the training subset (900 points).]
69. MLP modeling
[Maps: original data vs. MLP prediction.] Which result do you prefer? Train: RMSE 1.97, Ro 0.69
70. MLP modeling
[Maps: original data vs. MLP prediction.] Which result do you prefer? Train: RMSE 1.61, Ro 0.80
71. MLP modeling
[Maps: original data vs. MLP prediction.] Which result do you prefer? Train: RMSE 1.67, Ro 0.79
72. MLP modeling
[Maps: original data vs. MLP prediction.] Which result do you prefer? Train: RMSE 1.10, Ro 0.92
73. MLP modeling
[Maps: original data vs. MLP prediction.] Which result do you prefer? Train: RMSE 0.83, Ro 0.95
74. MLP modeling
[Maps: original data vs. MLP prediction.] Which result do you prefer? Train: RMSE 0.55, Ro 0.98
75. MLP modeling
[Plots: training statistics (RMSE and Ro) versus MLP architecture (5, 10, 5-5, 10-10, 15-15, 20-20); training RMSE falls from 1.97 to 0.55 and Ro rises from 0.69 to 0.98 as the network grows.]
Model 20-20 is the best?
80. ANNEX model: Artificial Neural
Networks with External drift
for environmental data mapping
81. Traditional application of
ANN to spatial predictions
Data are available at measurement points: F(x_i, y_i),
for i = 1, …, N.
Problem: predict F(x, y) at points without
measurements, usually on a regular grid.
ANN solution: x, y - 2 inputs, F - output:
- select an ANN architecture
- train with the available data
- after training, use the network to predict
82. ANNEX is similar to “Kriging with
External Drift Model”:
If there is additional information
(available at both training and prediction points)
that is related to the primary variable, we can use it
as additional inputs to the ANN.
Inputs: x, y, + f_ext(x, y)
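In practice, ANNEX only changes the input matrix. A sketch with dummy coordinates and external-drift values (note that f_ext must also be available at every prediction point):

```python
import numpy as np

# coordinates of training points and the secondary variable at those points
xy = np.array([[0.1, 0.2], [0.4, 0.8], [0.9, 0.5]])   # (x, y) of measurements
f_ext = np.array([12.3, 10.8, 11.5])                  # e.g. DEM value at each point

X_plain = xy                               # traditional ANN inputs: (x, y)
X_annex = np.column_stack([xy, f_ext])     # ANNEX inputs: (x, y, f_ext(x, y))
print(X_annex.shape)                       # (3, 3): one extra input per point
```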
83. Examples of external
information
• Cheap information on a secondary variable
• Physical model of the phenomenon
• Remotely sensed images
• GIS data
• DEM data
84. Kriging with external drift
Kriging with external drift is the model in which trends
are limited to
$$E\{F(x, y)\} = m(x, y) = \lambda_0 + \lambda_1 f_{ext}(x, y) \qquad (1)$$
where the smooth variability of the secondary variable
is considered to be related (e.g., linearly correlated) to
that of the primary variable F(x, y) being estimated.
In general, kriging with an external drift is a simple
and efficient algorithm for incorporating a secondary
variable into the estimation of the primary variable.
85. ANNEX model
What relationship between the primary and the
external information should hold in the case
of ANNEX?
86. ANNEX model
What does external “related” information bring
(and how should relatedness be measured - correlation
between the variables?)
Improved accuracy of prediction?
Reduced uncertainty of prediction?
An important problem is the quality of the additional
data: there is a dilemma between introducing new
information and/or new noise.
87. Case study: Kazakh Priaralie,
monitoring network
1 400 000 km², 400 monitoring stations
88. Datasets
[Maps: GIS DEM model; average long-term air temperatures in June (°C).]
98. MLP: real case study
Wind fields in Switzerland
99. Modeling of wind fields with MLP
using regularization technique
(pp 168-172 of the book)
Monitoring network:
111 stations in Switzerland
(80 training + 31 for validation)
Mapping of daily:
• Mean speed
• Maximum gust
• Average direction
100. Modeling of wind fields with MLP
and regularization technique
Monitoring network:
111 stations in Switzerland (80 training + 31 for validation)
Mapping of daily:
• Mean speed
• Maximum gust
• Average direction
Input information:
X,Y geographical coordinates
DEM (resolution 500 m)
23 DEM-based « geo-features »
Total 26 features
Model:
MLP 26-20-20-3
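As a rough analogue of this setup, here is a hedged scikit-learn sketch: scikit-learn's MLPRegressor does not implement RPROP, so the 'adam' solver stands in, and all data are random placeholders for the 26 inputs and 3 outputs:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 26))     # 80 training stations x 26 features
Y = rng.normal(size=(80, 3))      # mean speed, maximum gust, average direction

# two hidden layers of 20 neurons, 3 outputs (a 26-20-20-3 architecture);
# alpha adds weight-decay regularization of the kind discussed earlier
mlp = MLPRegressor(hidden_layer_sizes=(20, 20), activation='tanh',
                   solver='adam', alpha=1e-3, max_iter=500, random_state=0)
mlp.fit(X, Y)
print(mlp.predict(X[:2]))         # predictions for the first two stations
```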
101. Training of the MLP
Model:
MLP 26-20-20-3
Training:
• Random initialization
• 500 iterations of the
RPROP algorithm
104. Results: summary
[Maps: with noise-injection regularization vs. without regularization (overfitting).]
105. Conclusion
• The MLP is a nonlinear, universal tool for
learning from and modeling data.
An excellent exploratory tool.
• Its application demands deep expert
knowledge and experience.