Csss2010 20100803-kanevski-lecture2


  1. 1. Machine Learning Algorithms: Theory, Applications and Software Tools Lecture 2 Basics of ANN: MLP Prof. Mikhail Kanevski Institute of Geomatics and Analysis of Risk, University of Lausanne Mikhail.Kanevski@unil.ch Prof. M. Kanevski 1
  2. 2. Contents • Introduction to artificial neural networks • Multilayer perceptron • Case studies Prof. M. Kanevski 2
  3. 3. Basics of ANN Artificial neural networks are analytical systems that address problems whose solutions have not been explicitly formulated. In this way they contrast to classical computers and computer programs, which are designed to solve problems whose solutions - although they may be extremely complex - have been made explicit. Prof. M. Kanevski 3
  4. 4. Basics of ANN • We can program or train neural networks to store, recognise, and associatively retrieve patterns; • to filter noise from measurement data; • to control ill-defined problems; in summary: • to estimate sampled functions when we do not know the form of the functions. Prof. M. Kanevski 4
  5. 5. Basics of ANN Unlike statistical estimators, they estimate a function without a mathematical model of how outputs depend on inputs. Neural networks are model-semifree estimators (semiparametric models). They "learn from experience" with numerical and, sometimes, linguistic sample data. Prof. M. Kanevski 5
  6. 6. Basics of ANN The major applications of ANN: • Feature recognition (pattern classification). Speech recognition • Signal processing • Time-series prediction • Function approximation and regression, classification • Data Mining • Intelligent control • Associative memories • Optimisation • And many others Prof. M. Kanevski 6
  7. 7. Basics of ANN. Simple biological neuron Prof. M. Kanevski 7
  8. 8. Basics of ANN. Simple model of the neuron Prof. M. Kanevski 8
  9. 9. Examples of transfer functions: the logistic (sigmoid) function f(x) = 1 / (1 + exp(-x)) and the hyperbolic tangent tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x)). Prof. M. Kanevski 9
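As a concrete illustration of these two transfer functions (not part of the original slides), here is a minimal NumPy sketch; the function names are my own:

```python
import numpy as np

def logistic(x):
    """Logistic (sigmoid) transfer function: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Hyperbolic tangent: (exp(x) - exp(-x)) / (exp(x) + exp(-x))."""
    return np.tanh(x)

x = np.linspace(-5, 5, 11)
print(logistic(x))   # values in (0, 1)
print(tanh(x))       # values in (-1, 1)
```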
  10. 10. Basics of ANN The main parts of ANN: • Neurones (nodes, cells, units, processing elements) • Network topology (connections between neurones) Prof. M. Kanevski 10
  11. 11. Basics of ANN In general, Artificial Neural Networks are a collection of simple computational units (cells) interlinked by a system of connections (synaptic connections). The number of units and connections form a network topology. Prof. M. Kanevski 11
  12. 12. Multilayer perceptron Prof. M. Kanevski 12
  13. 13. Basics of ANN. ANN learning/training. Supervised learning is the most common form of training. Many samples (Input(i), Output(i)) are prepared as a training set. Then a subset of the training data set is selected. Samples from this subset are presented to the network one by one. For each sample, the result obtained by the network, O(Input(i)), is compared with the desired Output(i). After the entire training subset has been presented, the weights are updated. This updating is done in such a way that a measure of the error between the network's and the desired outputs is reduced. One pass through the subset of training samples, along with an updating of the weights, is called an epoch. The number of samples in the subset is called the epoch size. Sometimes an epoch size of one is used. Prof. M. Kanevski 13
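The epoch/epoch-size scheme described above can be sketched as follows. This is an illustrative toy (a simple linear model stands in for the network and the data are synthetic), not the course software:

```python
import numpy as np

# Toy data and a toy linear "network" y = w*x + b, just to make the loop concrete.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
T = 2.0 * X + 0.5 + rng.normal(0, 0.1, size=100)

w, b, eta = 0.0, 0.0, 0.1
epoch_size = 20                       # number of samples presented before an update

for epoch in range(200):
    idx = rng.choice(len(X), size=epoch_size, replace=False)  # training subset
    grad_w, grad_b = 0.0, 0.0
    for i in idx:                     # present the samples one by one
        err = (w * X[i] + b) - T[i]   # network output minus desired output
        grad_w += err * X[i]
        grad_b += err
    # the weights are updated only after the whole subset has been presented
    w -= eta * grad_w / epoch_size
    b -= eta * grad_b / epoch_size

print(w, b)   # approaches the true values 2.0 and 0.5
```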
  14. 14. Basics of ANN. ANN supervised learning. [Diagram: a teacher provides examples; the neural network produces a response; the response is evaluated and the learning algorithm makes modifications to the network] Prof. M. Kanevski 14
  15. 15. Basics of ANN. Feedforward ANN. If there are no feedback and lateral connections we have a feedforward ANN. The most frequently used model is the so-called multi-layer perceptron. The term feedforward means that information flows only in one direction - from the input to the output. Prof. M. Kanevski 15
  16. 16. ANN Multi-layer Perceptron (MLP) • Depends only on the data and its inner structure • Is able to learn from data and generalise • Good at modelling non-linearities • Robust to noise and outliers [ANN = artificial neurons + connection weights] Prof. M. Kanevski 16
  17. 17. Basics of ANN All knowledge of an ANN is based on the synaptic weights between units. Prof. M. Kanevski 17
  18. 18. The Universality Property • A two-layer feed-forward neural network with step activation functions can implement any Boolean function, provided that the number of hidden neurons H is sufficiently large. Prof. M. Kanevski 18
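As a small hedged illustration of this universality property, the sketch below (my own example, not from the slides) implements the Boolean XOR function with a two-layer feed-forward network using step activations and two hidden neurons:

```python
import numpy as np

def step(x):
    """Step (threshold) activation."""
    return (x > 0).astype(int)

def xor_net(x1, x2):
    """Two-layer feed-forward net with step activations implementing XOR."""
    h1 = step(x1 + x2 - 0.5)    # hidden unit 1: logical OR
    h2 = step(x1 + x2 - 1.5)    # hidden unit 2: logical AND
    return step(h1 - h2 - 0.5)  # output: OR and not AND  ->  XOR

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(np.array(a), np.array(b)))
```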
  19. 19. MLP modelling. F1(t, w) = w1^out f(w1 t + b1) + b^out, F2(t, w) = w1^out f(w1 t + b1) + w2^out f(w2 t + b2) + b^out, F3(t, w) = w1^out f(w1 t + b1) + w2^out f(w2 t + b2) + w3^out f(w3 t + b3) + b^out. Prof. M. Kanevski 19
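A minimal sketch of F_H(t, w) for a one-input MLP with H hidden neurons and the logistic transfer function f; the weights used here are arbitrary illustrative values, not fitted ones:

```python
import numpy as np

def mlp_1d(t, w_hidden, b_hidden, w_out, b_out):
    """F_H(t, w) = sum_h w_out[h] * f(w_hidden[h] * t + b_hidden[h]) + b_out,
    with f the logistic transfer function; one input, H hidden neurons, one output."""
    f = lambda z: 1.0 / (1.0 + np.exp(-z))
    return np.sum(w_out * f(w_hidden * t + b_hidden)) + b_out

# F_3: three hidden neurons (weights chosen arbitrarily for the example)
w_h = np.array([1.0, -2.0, 0.5])
b_h = np.array([0.0, 1.0, -0.5])
w_o = np.array([2.0, 1.0, -1.5])
print(mlp_1d(0.7, w_h, b_h, w_o, b_out=0.1))
```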
  20. 20. Backpropagation training Prof. M. Kanevski 20
  21. 21. The error function depends on the network's weights (W): E_l(W) = (1/n) * sum_{j=0}^{n-1} { T_lj - Z_lj^out(W) }^2 Prof. M. Kanevski 21
  22. 22. MLP training algorithms. Optimisation algorithms used for MLP training: • Stochastic - Annealing - Genetic algorithm • Gradient - Conjugate gradients (slow 1st order gradient algorithm) - Levenberg-Marquardt (fast 2nd order gradient algorithm) - BFGS formula (quasi-Newton) - Steepest Descent - RProp (resilient propagation) - BackProp (back propagation) Prof. M. Kanevski 22
  23. 23. Feedforward ANN: Multilayer perceptron. Backprop algorithm • The possibilities and capabilities of multi-layer perceptrons stem from the non-linearities used within nodes. MLP can learn with a supervised learning rule - the backpropagation algorithm. The Backward Error Propagation algorithm for ANN learning/training caused a breakthrough in the application of multilayer perceptrons. • The backpropagation algorithm is a supervised, iterative gradient algorithm designed to minimise the error measure between the actual output of the neural network and the desired output. We have to optimise a very non-linear system consisting of a large number of highly correlated variables. Prof. M. Kanevski 23
  24. 24. Basics of ANN Backpropagation Algorithm The backpropagation algorithm follows these algorithmic steps: • 1. Initialize the weights. It is usually recommended to set all weights and node offsets to small random values. In our study we shall use simulated annealing and/or a genetic algorithm to select starting values more intelligently, as recommended in [Masters]. • 2. Present inputs and desired outputs. The vectors (Input_l, Output_l = t_l) are presented to the network. • 3. Calculate the actual output of the ANN. Prof. M. Kanevski 24
  25. 25. Basics of ANN Backpropagation Algorithm • 4. Calculate the error measure and update the weights. Use a recursive algorithm starting at the output neurons (nodes) and working back to the first hidden layer - it is this backward propagation of output errors that inspired the name for this training algorithm. Update the weights W by: Prof. M. Kanevski 25
  26. 26. We want to know how to modify the weights in order to decrease the error function: w_ij(t+1) - w_ij(t) ∝ - ∂E(t)/∂w_ij(t) Prof. M. Kanevski 26
  27. 27. Basics of ANN Backpropagation Algorithm w_ij^m(n+1) = w_ij^m(n) + η δ_i^m Z_j^(m-1), where n is the iteration step, η is the rate of learning (0 < η ≤ 1), and Z_j^(m-1) is the output of the j-th neurone in layer (m-1); the error δ_i for the output layer is defined by the following equation. Prof. M. Kanevski 27
  28. 28. Basics of ANN Backpropagation Algorithm δ_i^out = Z_i^out (1 - Z_i^out)(T_i - Z_i^out), δ_i^(h-1) = Z_i^h (1 - Z_i^h) Σ_j w_ij^h δ_j^h Prof. M. Kanevski 28
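Putting the update rule and the two δ equations together, a simplified NumPy sketch of one backpropagation step for a single-hidden-layer MLP with logistic units is given below. Biases are omitted and the indexing convention is mine, so treat it as illustrative rather than the exact algorithm of the slides:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W1, W2, eta=0.1):
    """One weight update for a one-hidden-layer MLP with logistic units,
    following w(n+1) = w(n) + eta * delta * z from the slides (biases omitted)."""
    # forward pass
    z_hid = logistic(W1 @ x)            # hidden-layer outputs Z^h
    z_out = logistic(W2 @ z_hid)        # output-layer outputs Z^out
    # deltas: output layer first, then propagated back to the hidden layer
    delta_out = z_out * (1 - z_out) * (t - z_out)
    delta_hid = z_hid * (1 - z_hid) * (W2.T @ delta_out)
    # weight updates: w <- w + eta * delta_i * z_j
    W2 += eta * np.outer(delta_out, z_hid)
    W1 += eta * np.outer(delta_hid, x)
    return W1, W2

rng = np.random.default_rng(1)
W1 = rng.normal(0.0, 0.5, size=(3, 2))  # 2 inputs -> 3 hidden neurons
W2 = rng.normal(0.0, 0.5, size=(1, 3))  # 3 hidden -> 1 output neuron
x, t = np.array([0.2, -0.4]), np.array([0.8])
for _ in range(100):
    W1, W2 = backprop_step(x, t, W1, W2)
print(logistic(W2 @ logistic(W1 @ x)))  # output moves towards the target 0.8
```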
  29. 29. Basics of ANN Backpropagation Algorithm. Other error measures (such as maximum absolute error and median squared error) have even greater advantages in many situations. For example, median squared error is useful because, unlike the mean, the median is a robust statistic - its value is insensitive to occasional large errors in the training data. Unfortunately, practical techniques for implementing these more desirable error measures do not yet exist. Thus, most neural networks today are tied to mean squared error measurements. Prof. M. Kanevski 29
  30. 30. Basics of ANN Backpropagation Algorithm More general error functions can be written taking into account the importance (weighting, declustering, economic criteria, etc.) of the samples presented to the network: E_l(W) = sum_{j=0}^{n-1} ω_lj { T_lj - Z_lj^out(W) }^2 Prof. M. Kanevski 30
  31. 31. Gradient descent [Figure: error surface J(w) with the direction of the gradient J'(w) and the minimum marked] Prof. M. Kanevski 31
  32. 32. Gradient descent [Figure: J(w) with the minimum marked] Prof. M. Kanevski 32
  33. 33. In reality the situation with the error function and the corresponding optimization problem is much more complicated: the presence of multiple local minima! Prof. M. Kanevski 33
  34. 34. Gradient descent Local minima Prof. M. Kanevski 34
  35. 35. SA (simulated annealing): Illustration Prof. M. Kanevski 35
  36. 36. How important are local minima? (Duda et al. 2001) In computational practice, we do not want our network to be caught in a local minimum with high training error, because this usually indicates that key features of the problem have not been learned by the network. In such cases it is traditional to reinitialize the weights and train again, possibly also altering other parameters in the net. Prof. M. Kanevski 36
  37. 37. How important are local minima? (Duda et al. 2001) In many problems, convergence to a nonglobal minimum is acceptable if the error is nevertheless fairly low. Furthermore, common stopping criteria demand that training terminate even before the minimum is reached, and thus it is not essential that the network converge to the global minimum in order to reach acceptable performance. Prof. M. Kanevski 37
  38. 38. In short: the presence of multiple minima does not necessarily present difficulties in training nets, and a few simple heuristics can often overcome such problems (see next slide). Prof. M. Kanevski 38
  39. 39. Practical techniques for improving backpropagation • Activation function (sigmoid, hyperbolic tangent, ...) • Scaling inputs • Training with noise (noise injection) • Initializing weights (simulated annealing) • Regularization (weight decay) • Number of hidden layers • Learning parameters (rates, momentum, ...) • Cost function Prof. M. Kanevski 39
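As an illustration of two of these heuristics (training with noise injection and weight decay), here is a small sketch; to keep it self-contained, a linear model trained by gradient descent stands in for the MLP, and the noise level and decay coefficient are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))                       # training inputs
y = 1.5 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(0, 0.1, 200)

w = np.zeros(2)
eta, sigma, lam = 0.1, 0.05, 1e-3       # learning rate, injected-noise std, weight decay

for epoch in range(200):
    X_noisy = X + rng.normal(0.0, sigma, size=X.shape)      # noise injection on the inputs
    err = X_noisy @ w - y
    grad = X_noisy.T @ err / len(y) + lam * w               # weight decay = L2 penalty on w
    w -= eta * grad

print(w)   # close to the true coefficients (1.5, -0.7)
```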
  40. 40. Interpretation of network's outputs. Consider the limit in which the size N of the training data set goes to infinity [Bishop 1995]. In this limit we can replace the finite sum over patterns in the sum-of-squares error with an integral of the form E = lim_{N→∞} (1/2N) Σ_{n=1}^{N} Σ_k { y_k(x^n; w) - t_k^n }^2 = (1/2) Σ_k ∫∫ { y_k(x; w) - t_k }^2 p(t_k, x) dt_k dx Prof. M. Kanevski 40
  41. 41. Interpretation of network's outputs. The network mapping is given by the conditional average of the target data, i.e. the regression of t_k conditioned on x: y_k(x; w*) = ⟨ t_k | x ⟩ Prof. M. Kanevski 41
  42. 42. DEMO Prof. M. Kanevski 42
  43. 43. MLP and number of layers • The problem with an MLP using a single hidden layer is that the neurons tend to interact with each other globally. In complex situations, this interaction makes it difficult to improve the approximation at one point without worsening it at some other point. • On the other hand, with two hidden layers, the approximation process becomes more manageable. Prof. M. Kanevski 43
  44. 44. Two hidden layers! (Haykin) 1. Local features are extracted in the first hidden layer. Specifically, some neurons in the first hidden layer are used to partition the input space into regions, and other neurons in that layer learn the local features characterizing those regions. 2. Global features are extracted in the second layer. Specifically, a neuron in the second hidden layer combines the outputs of neurons in the first hidden layer operating on a particular region of the input space and thereby learns the global features for that region and outputs zero elsewhere. Prof. M. Kanevski 44
  45. 45. Data Preprocessing • Machine learning algorithms are data-driven methods. • The quality and quantity of data is essential for training and generalization. [Diagram: Input data → Pre-processing → MLA → Post-processing → Results] Prof. M. Kanevski 45
  46. 46. Types of pre-processing: 1. Linear and nonlinear transformations, e.g. input scaling/normalisation, Z-score transform, square root transform, N-score transform, etc. 2. Dimensionality reduction 3. Incorporate prior knowledge (invariants, hints, ...) 4. Feature extraction: linear/nonlinear combination of input variables 5. Feature selection: decide which features to use Prof. M. Kanevski 46
  47. 47. Dimensionality reduction • Two approaches are available to perform dimensionality reduction: • Feature extraction: creating a subset of new features by combinations of the existing features • Feature selection: choosing a subset of all the features (the more informative ones) Prof. M. Kanevski 47
  48. 48. Feature selection/extraction Prof. M. Kanevski 48
  49. 49. Feature selection • Reducing the feature space by throwing out some of the features (covariates) – also called variable selection • Motivating idea: try to find a simple, "parsimonious" model (Occam's razor!) Prof. M. Kanevski 49
  50. 50. Univariate selection may fail. Guyon-Elisseeff, JMLR 2004; Springer 2006 Prof. M. Kanevski 50
  51. 51. Dimensionality Reduction. We clearly lose some information, but this can be helpful due to the curse of dimensionality. We need some way of deciding which dimensions to keep: 1. Random choice 2. Principal components analysis (PCA) 3. Independent components analysis (ICA) 4. Self-organised maps (SOM) Prof. M. Kanevski 51
  52. 52. Data transforms • Y = aZ + b • Y = Log(Z) • Y = Ind(Z, Zs) • Normalisation (Z-score): Y = (Z - Zm)/σ • Box-Cox nonlinear transform: Y(λ) = (Z^λ - 1)/λ if λ > 0, Y(λ = 0) = Ln(Z) Prof. M. Kanevski 52
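A short sketch of the Z-score and Box-Cox transforms listed above (the Box-Cox form assumes Z > 0, and the test data are synthetic):

```python
import numpy as np

def zscore(z):
    """Z-score normalisation: Y = (Z - Zm) / sigma."""
    return (z - z.mean()) / z.std()

def box_cox(z, lam):
    """Box-Cox transform: (Z**lam - 1)/lam for lam > 0, Ln(Z) for lam = 0 (requires Z > 0)."""
    return np.log(z) if lam == 0 else (z**lam - 1.0) / lam

z = np.random.default_rng(0).lognormal(mean=1.0, sigma=0.5, size=1000)
print(zscore(z)[:3])
print(box_cox(z, 0.5)[:3])
print(box_cox(z, 0.0)[:3])
```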
  53. 53. Model Selection & Model Evaluation Prof. M. Kanevski 53
  54. 54. Guillaume d'Occam (1285 - 1349): "Pluralitas non est ponenda sine necessitate" Occam's razor: "The simpler explanation of the phenomena is more likely to be correct" Prof. M. Kanevski 54
  55. 55. Model Assessment and Model Selection: Two separate goals Prof. M. Kanevski 55
  56. 56. Model Selection: estimating the performance of different models in order to choose the (approximately) best one. Model Assessment: having chosen a final model, estimating its prediction error (generalization error) on new data. Prof. M. Kanevski 56
  57. 57. If we are in a data-rich situation, the best solution is to split the data randomly (?): Raw Data → Train: 50%, Validation: 25%, Test: 25% Prof. M. Kanevski 57
  58. 58. Interpretation • The training set is used to fit the models • The validation set is used to estimate prediction error for model selection (tuning hyperparameters) • The test set is used for assessment of the generalization error of the final chosen model. Elements of Statistical Learning - Hastie, Tibshirani & Friedman 2001 Prof. M. Kanevski 58
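A minimal sketch of the random 50/25/25 split described on the previous slides; the function name and the data are illustrative stand-ins:

```python
import numpy as np

def split_data(X, y, seed=0):
    """Random 50/25/25 split into training, validation and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train, n_val = int(0.50 * len(X)), int(0.25 * len(X))
    tr, va, te = idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

X = np.random.default_rng(1).normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5])
train, val, test = split_data(X, y)
print(len(train[0]), len(val[0]), len(test[0]))   # 500 250 250
```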
  59. 59. Bias and Variance. Model's complexity. [Figure: fitted curves illustrating (b) overfitting and (c) underfitting] Prof. M. Kanevski 59
  60. 60. One of the most serious problems that arises in connectionist learning by neural networks is overfitting of the provided training examples. This means that the learned function fits the training data very closely but does not generalise well, that is, it cannot model sufficiently well unseen data from the same task. Solution: balance the statistical bias and statistical variance when doing neural network learning in order to achieve the smallest average generalization error. Prof. M. Kanevski 60
  61. 61. Bias-Variance Dilemma. Assume that Y = f(X) + ε, where E(ε) = 0 and Var(ε) = σ_ε^2. Prof. M. Kanevski 61
  62. 62. We can derive an expression for the expected prediction error of a regression at an input point X=x0 using squared-error loss: Prof. M. Kanevski 62
  63. 63. Err(x0) = E[(Y - f̂(x0))^2 | X = x0] = σ_ε^2 + [E f̂(x0) - f(x0)]^2 + E[f̂(x0) - E f̂(x0)]^2 = σ_ε^2 + Bias^2(f̂(x0)) + Var(f̂(x0)) = Irreducible Error + Bias^2 + Variance Prof. M. Kanevski 63
  64. 64. • The first term is the variance of the target around its true mean f(x0), and cannot be avoided no matter how well we estimate f(x0), unless σ_ε^2 = 0. • The second term is the squared bias, the amount by which the average of our estimate differs from the true mean. • The last term is the variance, the expected squared deviation of f̂(x0) around its mean. Prof. M. Kanevski 64
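The decomposition can be checked numerically. The Monte Carlo sketch below (my own example: a cubic polynomial fitted to noisy samples of a sine function) estimates the squared bias and the variance of f̂(x0) by repeating the fit over many training sets:

```python
import numpy as np

f = lambda x: np.sin(2 * np.pi * x)        # true function f(x)
x0, sigma_eps, n, trials = 0.3, 0.2, 30, 2000
degree = 3                                 # complexity of the fitted model

rng = np.random.default_rng(0)
preds = np.empty(trials)
for i in range(trials):
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma_eps, n)          # Y = f(X) + eps
    coeffs = np.polyfit(x, y, degree)               # the estimator f_hat
    preds[i] = np.polyval(coeffs, x0)               # f_hat(x0) for this training set

bias2 = (preds.mean() - f(x0))**2                   # [E f_hat(x0) - f(x0)]^2
variance = preds.var()                              # E[f_hat(x0) - E f_hat(x0)]^2
print(bias2, variance, sigma_eps**2)                # adding sigma_eps^2 gives the expected error at x0
```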
  65. 65. Elements of Statistical Learning. Hastie, Tibshirani & Friedman 2001 Prof. M. Kanevski 65
  66. 66. Prof. M. Kanevski 66
  67. 67. • A neural network is only as good as the training data! • Poor training data inevitably leads to an unreliable and unpredictable network. • Exploratory Data Analysis and data preprocessing are extremely important!!! Prof. M. Kanevski 67
  68. 68. MLP modelling. Case Studies. Original (10 000 points), Training (900 points) Prof. M. Kanevski 68
  69. 69. MLP modeling. Original vs. MLP prediction. Which result do you prefer? Train RMSE 1.97, Ro 0.69 Prof. M. Kanevski 69
  70. 70. MLP modeling. Original vs. MLP prediction. Which result do you prefer? Train RMSE 1.61, Ro 0.80 Prof. M. Kanevski 70
  71. 71. MLP modeling. Original vs. MLP prediction. Which result do you prefer? Train RMSE 1.67, Ro 0.79 Prof. M. Kanevski 71
  72. 72. MLP modeling. Original vs. MLP prediction. Which result do you prefer? Train RMSE 1.10, Ro 0.92 Prof. M. Kanevski 72
  73. 73. MLP modeling. Original vs. MLP prediction. Which result do you prefer? Train RMSE 0.83, Ro 0.95 Prof. M. Kanevski 73
  74. 74. MLP modeling. Original vs. MLP prediction. Which result do you prefer? Train RMSE 0.55, Ro 0.98 Prof. M. Kanevski 74
  75. 75. MLP modeling. Training statistics. [Charts: training RMSE and Ro for MLP architectures 5, 10, 5-5, 10-10, 15-15, 20-20] Model 20-20 is the best? Prof. M. Kanevski 75
  76. 76. MLP modeling. Training statistics (MLP: RMSE, Ro): 5: 1.97, 0.69; 10: 1.61, 0.80; 5-5: 1.67, 0.79; 10-10: 1.10, 0.92; 15-15: 0.83, 0.95; 20-20: 0.55, 0.98 Prof. M. Kanevski 76
  77. 77. MLP modeling. Training & Validation statistics. [Charts: training and validation RMSE and Ro for MLP architectures 5, 10, 5-5, 10-10, 15-15, 20-20] Prof. M. Kanevski 77
  78. 78. MLP modeling. Training & Validation statistics. [Same charts repeated] Prof. M. Kanevski 78
  79. 79. MLP modeling. Validation statistics (MLP: RMSE, Ro): 5: 2.01, 0.68; 10: 1.66, 0.80; 5-5: 1.70, 0.79; 10-10: 1.25, 0.89; 15-15: 1.24, 0.89; 20-20: 1.39, 0.88 Prof. M. Kanevski 79
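Model selection then amounts to picking the architecture with the best validation error rather than the best training error. Using only the numbers copied from the two tables above:

```python
# Training and validation RMSE per MLP architecture, copied from the slides
train_rmse = {"5": 1.97, "10": 1.61, "5-5": 1.67, "10-10": 1.10, "15-15": 0.83, "20-20": 0.55}
val_rmse   = {"5": 2.01, "10": 1.66, "5-5": 1.70, "10-10": 1.25, "15-15": 1.24, "20-20": 1.39}

print(min(train_rmse, key=train_rmse.get))   # "20-20": lowest training RMSE ...
print(min(val_rmse, key=val_rmse.get))       # ... but "15-15" has the lowest validation RMSE
```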
  80. 80. ANNEX model: Artificial Neural Networks with External drift for environmental data mapping Prof. M. Kanevski 80
  81. 81. Traditional application of ANN to spatial predictions. Data are available at measurement points: F(xi, yi), for i = 1, ..., N. Problem: predict F(x, y) at the points without measurements, usually on a regular grid. ANN solution: x, y - 2 inputs, F - output; select the ANN architecture; train with the available data; after training, use the network to predict. Prof. M. Kanevski 81
  82. 82. ANNEX is similar to the "Kriging with External Drift" model: if there is additional information (available at training and prediction points) related to the primary variable, we can use it as additional inputs to the ANN. Inputs: x, y, + fext(x, y) Prof. M. Kanevski 82
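A sketch of how the ANNEX input matrix differs from the classical ANN one; all arrays below are hypothetical stand-ins (in the case study that follows, f_ext would be the DEM altitude at the stations):

```python
import numpy as np

# Hypothetical measurement data: station coordinates, external drift and primary variable
x, y = np.random.default_rng(0).uniform(0, 100, (2, 400))                # station coordinates
f_ext = 0.01 * x + np.sin(y / 10)                                        # e.g. DEM altitude
F = 25.0 - 0.5 * f_ext + np.random.default_rng(1).normal(0, 0.3, 400)    # primary variable

X_ann   = np.column_stack([x, y])          # classical ANN: 2 inputs (x, y)
X_annex = np.column_stack([x, y, f_ext])   # ANNEX:         3 inputs (x, y, f_ext(x, y))
print(X_ann.shape, X_annex.shape)          # (400, 2) (400, 3)
```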
  83. 83. Examples of external information: • Cheap information on a secondary variable • Physical model of the phenomena • Remotely sensed images • GIS data • DEM data Prof. M. Kanevski 83
  84. 84. Kriging with external drift. Kriging with external drift is the model where the trend is limited to E{F(x,y)} = m(x,y) = λ0 + λ1 fext(x,y) (1), where the smooth variability of the secondary variable is considered to be related (e.g., linearly correlated) to that of the primary variable F(x,y) being estimated. In general, kriging with an external drift is a simple and efficient algorithm to incorporate a secondary variable in the estimation of the primary variable. Prof. M. Kanevski 84
  85. 85. ANNEX model. What relationship between the primary and the external information should there be in the case of ANNEX? Prof. M. Kanevski 85
  86. 86. ANNEX model. What does external "related" information bring (and how to measure it: correlation between variables?)? Improved accuracy of prediction? Reduced uncertainty of prediction? An important problem is related to the quality of the additional data: there is a dilemma between introducing new information and/or new noise. Prof. M. Kanevski 86
  87. 87. Case study: Kazakh Priaralie, monitoring network 1 400 000 km2 - 400 monitoring stations 87 Prof. M. Kanevski
  88. 88. Datasets: GIS DEM model; average long-term temperatures of air in June (°C) Prof. M. Kanevski 88
  89. 89. Correlation: Air temperature vs. Altitude Prof. M. Kanevski 89
  90. 90. Train and Test datasets Train Test Prof. M. Kanevski 90
  91. 91. ANN and ANNEX models (Model: Correlation, RMSE, MAE, MRE): 2-7-5-1: 0.917, 2.57, 1.96, -0.02; 3-3-1: 0.989, 0.96, 0.73, -0.01; 3-5-1: 0.99, 0.9, 0.7, -0.007; 3-7-1: 0.991, 0.85, 0.66, -0.004; 3-8-1: 0.991, 0.84, 0.68, -0.001; 3-9-1: 0.991, 0.88, 0.69, -0.01; 3-10-1: 0.99, 0.92, 0.74, -0.01; Kriging with external drift: 0.984, 1.19, 0.91, -0.03 Prof. M. Kanevski 91
  92. 92. Scatter plots: Kriging, Cokriging, Drift Kriging, ANNEX Prof. M. Kanevski 92
  93. 93. Mapping results: Kriging, Cokriging, Drift Kriging, ANNEX Prof. M. Kanevski 93
  94. 94. Modelling the noisy "altitude" effect (100 %): Before / After Prof. M. Kanevski 94
  95. 95. Scatter plots between variables (noisy 100 % altitude): Train / Test Prof. M. Kanevski 95
  96. 96. Mapping noise results: ANNEX, air temperature (°C) Prof. M. Kanevski 96
  97. 97. Noise results (Model: Correlation, RMSE, MAE, MRE): Kriging: 0.874, 3.13, 2.04, -0.06; Kriging - external drift: 0.984, 1.19, 0.91, -0.03; 3-7-1: 0.991, 0.85, 0.66, -0.004; 3-8-1: 0.991, 0.84, 0.68, -0.001; 3-8-1 (100% noise): 0.839, 3.54, 2.37, -0.13; 3-7-1 (10% noise) Test 1: 0.939, 2.32, -1.49, -0.003; Kriging - external drift (10% noise) Test 1: 0.941, 2.23, 1.54, -0.06; 3-7-1 (10% noise) Test 2: 0.899, 2.81, 1.52, -0.08; Kriging - external drift (10% noise) Test 2: 0.903, 2.81, 1.59, -0.103 Prof. M. Kanevski 97
  98. 98. MLP: real case study. Wind fields in Switzerland Prof. M. Kanevski 98
  99. 99. Modeling of wind fields with MLP using a regularization technique (pp 168-172 of the book). Monitoring network: 111 stations in Switzerland (80 training + 31 for validation). Mapping of daily: • Mean speed • Maximum gust • Average direction Prof. M. Kanevski 99
  100. 100. Modeling of wind fields with MLP and a regularization technique. Monitoring network: 111 stations in Switzerland (80 training + 31 for validation). Mapping of daily: • Mean speed • Maximum gust • Average direction. Input information: X, Y geographical coordinates; DEM (resolution 500 m); 23 DEM-based « geo-features »; total 26 features. Model: MLP 26-20-20-3 Prof. M. Kanevski 100
  101. 101. Training of the MLP. Model: MLP 26-20-20-3. Training: • Random initialization • 500 iterations of the RPROP algorithm Prof. M. Kanevski 101
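For orientation only, a 26-20-20-3 architecture of this kind could be sketched with scikit-learn as below. Note the assumptions: scikit-learn's MLPRegressor offers Adam/L-BFGS/SGD solvers rather than the RPROP algorithm used on the slide, and random arrays stand in for the real geo-features and wind targets:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 26))          # 80 training stations x 26 input features (placeholder data)
Y = rng.normal(size=(80, 3))           # 3 outputs: mean speed, max gust, direction (placeholder data)

# Architecture 26-20-20-3; solver is Adam here, not the slide's RPROP
net = MLPRegressor(hidden_layer_sizes=(20, 20), activation="tanh",
                   solver="adam", max_iter=500, random_state=0)
net.fit(X, Y)
print(net.predict(X[:5]).shape)        # (5, 3)
```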
  102. 102. Results: naïve approach Prof. M. Kanevski 102
  103. 103. Results: noise injection regularization Prof. M. Kanevski 103
  104. 104. Results: summary. Noise injection regularization vs. without regularization (overfitting) Prof. M. Kanevski 104
  105. 105. Conclusion • MLP is a nonlinear universal tool for learning from and modeling data. An excellent exploratory tool. • Its application demands deep expert knowledge and experience. Prof. M. Kanevski 105
