This document provides an introduction to statistical learning methods. It begins with background information on statistical learning problems and discusses concepts like underfitting, overfitting, and consistency. It then summarizes decision trees and random forests, describing how they are learned from data and make predictions. Support vector machines and neural networks are also briefly mentioned. Key goals of statistical learning methods include accuracy on training data as well as generalization to new data.
Kernel methods for data integration in systems biology (tuxette)
This document provides an overview of a seminar presentation on kernel methods for data integration in systems biology. It begins with a short biography of the presenter, who is trained as a mathematician and statistician and applies these skills to research in human health and animal genomics using various omics data types. Examples are given of the presenter's past work inferring networks and integrating gene expression and lipid data, as well as expression and 3D DNA location data. The talk will discuss how to integrate multiple omics data from different sources and types using kernels. Kernels allow reducing high-dimensional data to similarity matrices and are not restricted to numeric data. They also allow embedding expert knowledge and provide a framework for statistical learning.
ACCOST is a method for differential analysis of Hi-C data between two conditions with replicates. It models Hi-C interaction counts with a negative binomial distribution that accounts for distance effects between loci through an offset term. ACCOST normalizes counts with ICE and estimates model parameters to obtain a p-value for each bin pair comparing the two conditions. It was validated on several datasets and shown to identify more differential contacts than other methods like diffHic and FIND, particularly at short genomic distances.
La statistique et le machine learning pour l'intégration de données de la bio... (tuxette)
This document summarizes a presentation on using statistics and machine learning for integrating high-throughput biological data. It discusses how biological data is large in volume, multi-scaled and heterogeneous in type, creating bottlenecks for analysis. It presents different methods for integrating multiple data tables, including multiple kernel learning to combine similarity matrices. An example application to TARA Oceans data is described, identifying Rhizaria abundance as structuring ocean differences. Interpretability of results is discussed along with prospects for deep learning and predicting phenotypes while understanding relationships.
This document provides an introduction to neural networks. It begins with an outline covering statistical machine learning concepts like underfitting, overfitting and consistency. It then discusses multi-layer perceptrons, the basic building blocks of neural networks. It covers how perceptrons are presented, their theoretical properties, and how learning occurs. Finally, it provides an overview of deep neural networks and convolutional neural networks. The goal is to introduce fundamental concepts in neural networks from statistical learning to modern deep learning architectures.
Reproducibility and differential analysis with Selfish (tuxette)
Selfish is a Python tool for identifying differentially interacting chromatin regions from Hi-C contact maps of two conditions with no replicates. It begins by distance-correcting the interaction frequencies. It then computes Gaussian filters over neighboring bins to capture spatial dependencies. It compares the evolution of these filters between conditions and assigns p-values assuming Gaussian differences. Selfish is faster than existing methods and shows enrichment for epigenetic markers near differential regions. However, its statistical justification could be improved as it does not model overdispersion like other methods.
Kernel methods and variable selection for exploratory analysis and multi-omic... (tuxette)
Nathalie Vialaneix
4th course on Computational Systems Biology of Cancer: Multi-omics and Machine Learning Approaches
International course, Curie training
https://training.institut-curie.org/courses/sysbiocancer2021
(remote)
September 29th, 2021
Graph Neural Network for Phenotype Prediction (tuxette)
This document describes a study on using graph neural networks (GNNs) for phenotype prediction from gene expression data. The objectives are to determine if including network information can improve predictions, which network types work best, and if GNNs can learn network inferences. It provides background on GNNs and how they generalize convolutional layers to graph data. The authors implemented a GNN model from previous work as a starting point and tested it on different network types to see which network information is most useful for predictions. Their methodology involves comparing GNN performance to other methods like random forests using 10-fold cross validation.
Mini useR! in Melbourne https://www.meetup.com/fr-FR/MelbURN-Melbourne-Users-of-R-Network/events/251933078/
MelbURN (Melbourne useR group) https://www.meetup.com/fr-FR/MelbURN-Melbourne-Users-of-R-Network
July 16th, 2018
Melbourne, Australia
Convolutional networks and graph networks through kernels (tuxette)
This presentation discusses how convolutional kernel networks (CKNs) can be used to model sequential and graph-structured data through kernels defined over sequences and graphs. CKNs define feature maps from substructures like n-mers in sequences and paths in graphs into high-dimensional spaces, which are then approximated to obtain low-dimensional representations that can be used for prediction tasks like classification. This approach is analogous to convolutional neural networks and can be extended to multiple layers. The presentation provides examples showing CKNs achieve good performance on problems involving protein sequences and social networks.
Machine Learning: Foundations Course Number 0368403401 (butest)
This machine learning foundations course will consist of 4 homework assignments with both theoretical and programming problems in Matlab, plus a final exam. Students will work in groups of 2-3 to take notes during classes in LaTeX format; these class notes will contribute 30% to the overall grade. The course will cover basic machine learning concepts like storage and retrieval, learning rules, and estimating flexible models, as well as applications in areas like control, medical diagnosis, and document retrieval.
This document summarizes kernel methods in machine learning. It begins with an introductory example of using a kernel function to perform binary classification in a reproducing kernel Hilbert space. It then defines positive definite kernels and shows how they allow representing algorithms as operating in linear dot product spaces while using nonlinear kernel functions. The document covers fundamental properties of kernels, provides examples, and discusses how kernels define reproducing kernel Hilbert spaces for regularization. It overviews various kernel-based machine learning approaches and modeling structured responses using statistical models in reproducing kernel Hilbert spaces.
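To make the kernel idea above concrete, here is a minimal NumPy sketch (an illustration, not code from the summarized document) that builds a Gaussian (RBF) kernel matrix, a standard positive definite kernel; the bandwidth gamma is an arbitrary choice:

import numpy as np

def rbf_kernel_matrix(X, gamma=0.5):
    # Gaussian (RBF) kernel: K[i, j] = exp(-gamma * ||x_i - x_j||^2).
    # The result is symmetric positive semi-definite, hence a valid kernel.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
print(rbf_kernel_matrix(X).round(3))  # close points near 1, distant points near 0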
Machine Learning: Foundations Course Number 0368403401 (butest)
This machine learning course will cover theoretical and practical machine learning concepts. It will include 4 homework assignments and programming in Matlab. Lectures will be supplemented by student-submitted class notes in LaTeX. Topics will include learning approaches like storage and retrieval, rule learning, and flexible model estimation, as well as applications in areas like control, medical diagnosis, and web search. A final exam format has not been determined yet.
The document discusses prototype-based models in machine learning. It provides an overview of unsupervised learning techniques including vector quantization and self-organizing maps, which group similar data points together to form clusters or reduce dimensions. It also discusses supervised learning methods like learning vector quantization, which learns prototype vectors to classify new examples based on their distances to the prototypes. The document uses examples like clustering iris flower data with a self-organizing map to illustrate prototype-based modeling approaches.
The Advancement and Challenges in Computational Physics - Phdassistance (PhD Assistance)
For the last five decades, computational physics has been a valuable scientific tool in physics. Compared with using only theoretical and experimental approaches, it has enabled physicists to understand complex problems better. At the time, however, computational physics was mostly a research activity, with relatively few organised undergraduate programmes.
Kernel Methods and Relational Learning in Computational Biology (Michiel Stock)
This document discusses kernel methods and relational learning in computational biology. It begins with an introduction to kernel methods, describing how they can handle structured and heterogeneous biological data. It then provides overviews of various kernel techniques for dealing with sequences, graphs, and other objects. The document also discusses learning relationships between different types of objects using Kronecker kernels and conditional ranking algorithms. It gives an example application of predicting enzyme function and concludes that kernel methods are well-suited for computational biology challenges involving complex objects and relationships between objects.
RECENT ADVANCES in PREDICTIVE (MACHINE) LEARNING (butest)
This document provides an introduction to recent advances in predictive machine learning, specifically support vector machines and boosted decision trees. It begins with an overview of predictive learning and common methods. It then describes kernel methods, including how they were extended to support vector machines. Next, it discusses extending decision trees with boosting. The document concludes by comparing support vector machines and boosted decision trees, and noting they are not the only recent advances in machine learning.
This document provides an introduction to statistical model selection. It discusses various approaches to model selection including predictive risk, Bayesian methods, information theoretic measures like AIC and MDL, and adaptive methods. The key goals of model selection are to understand the bias-variance tradeoff and select models that offer the best guaranteed predictive performance on new data. Model selection aims to find the right level of complexity to explain patterns in available data while avoiding overfitting.
Learning for Optimization: EDAs, probabilistic modelling, or ... (butest)
Marcus Gallagher gave a talk on explicit modelling in metaheuristic optimization. He discussed estimation of distribution algorithms which use probabilistic models to represent promising regions of the search space. He provided examples of modelling approaches like PBIL, MIMIC, COMIT and BOA. Finally, he summarized that EDAs take an explicit modelling approach to optimization using existing statistical models and can solve challenging problems by visualizing the model.
Invited lecture on Machine Learning in Medicine at the joint "Integrated Omics" course of Hanze University and University Hospital UMCG, Groningen, The Netherlands
Intuition-Based Teaching Mathematics for Engineers (IDES Editor)
It is suggested that mathematics for engineers be taught through the development of mathematical intuition, thus combining conceptual and operational approaches. The main mathematical concepts are taught through discussion of carefully selected case studies, followed by the solving of algorithmically generated problems to help students master the appropriate mathematical tools. The former component develops mathematical intuition; the latter applies adaptive instructional technology to improve operational skills. The proposed approach is applied to teaching uniform convergence and to knowledge generation using object-oriented methodology from computer science.
The document describes the Mendel approach for understanding class hierarchies in object-oriented programs. Mendel uses a simple model and metrics to identify interesting classes based on their size, novelty within a hierarchy, and a combination of both. It also analyzes subclassing behaviors to distinguish between classes that mainly extend versus override functionality. The approach is demonstrated on example systems like JHotDraw and Azureus to provide insights into their key classes and inheritance usage.
The aim of this research is to find an accurate solution to Troesch's problem using a high-performance technique based on a parallel-processing implementation.
Design/methodology/approach – A feed-forward neural network is designed to solve an important type of differential equation that arises in many applied sciences and engineering applications. The design rests on choosing a suitable learning rate, transfer function, and training algorithm. The authors use backpropagation with a new implementation of the Levenberg-Marquardt training algorithm, together with a new idea for choosing the weights. The effectiveness of the suggested network design is demonstrated by using it to solve Troesch's problem in many cases.
Findings – A new idea for choosing the weights of the neural network and a new implementation of the Levenberg-Marquardt training algorithm help to speed up convergence, and the implementation of the suggested design demonstrates its usefulness in finding exact solutions.
Gabriella Casalino, Nicoletta Del Buono, Corrado Mencar (2011). Subtractive Initialization of Nonnegative Matrix Factorizations for Document Clustering. In Fuzzy Logic and Applications (WILF 2011), 188-195.
The 9th International Workshop on Fuzzy Logic and Applications, August 29-31 2011, Trani
The Bayesian paradigm provides a coherent approach for quantifying uncertainty given available data and prior information. Aspects of uncertainty that arise in practice include uncertainty regarding parameters within a model, the choice of model, and propagation of uncertainty in parameters and models for predictions. In this talk I will present Bayesian approaches for addressing model uncertainty given a collection of competing models including model averaging and ensemble methods that potentially use all available models and will highlight computational challenges that arise in implementation of the paradigm.
Introduction to Machine Learning Lectures (ssuserfece35)
This lecture discusses ensemble methods in machine learning. It introduces bagging, which trains multiple models on random subsets of the training data and averages their predictions, in order to reduce variance and prevent overfitting. Bagging is effective because it decreases the correlation between predictions. Random forests apply bagging to decision trees while also introducing more randomness by selecting a random subset of features to consider at each node. The next lecture will cover boosting, which aims to reduce bias by training models sequentially to focus on examples previously misclassified.
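To make the bagging recipe in this summary concrete, here is a minimal illustrative sketch (assuming scikit-learn is available; iris is an arbitrary stand-in for the training data): each tree is trained on a bootstrap resample, and predictions are aggregated by majority vote.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

# Bagging: train each tree on a bootstrap sample (drawn with replacement).
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Aggregate by majority vote over the individual tree predictions.
votes = np.stack([t.predict(X) for t in trees])
y_pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("training accuracy:", (y_pred == y).mean())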
Predictive Modeling in Insurance in the context of (possibly) big data (Arthur Charpentier)
This document discusses predictive modeling in insurance in the context of big data. It begins with an introduction to the speaker and outlines some key concepts in actuarial science from both American and European perspectives. It then provides examples of common actuarial problems involving ratemaking, pricing, and claims reserving. The document reviews the history of actuarial models and discusses issues around statistical learning, machine learning, and their relationship to statistics. It also covers model evaluation and various loss functions used in modeling.
Module-2_Notes-with-Example for data science (pujashri1975)
The document discusses several key concepts in probability and statistics:
- Conditional probability is the probability of one event occurring given that another event has already occurred.
- The binomial distribution models the probability of success in a fixed number of binary experiments. It applies when there are a fixed number of trials, two possible outcomes, and the same probability of success on each trial.
- The normal distribution is a continuous probability distribution that is symmetric and bell-shaped. It is characterized by its mean and standard deviation. Many real-world variables approximate a normal distribution.
- Other concepts discussed include range, interquartile range, variance, and standard deviation. The interquartile range describes the spread of a dataset's middle 50%.
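A short illustrative sketch of these concepts (assuming NumPy and SciPy; the numbers are invented for the example):

import numpy as np
from scipy import stats

# Binomial: probability of exactly 3 successes in 10 trials with p = 0.5.
print(stats.binom.pmf(3, n=10, p=0.5))

# Normal: probability that a N(0, 1) variable falls below 1.96 (about 0.975).
print(stats.norm.cdf(1.96))

# Spread measures on a small sample: variance, standard deviation, and IQR.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(x.var(ddof=1), x.std(ddof=1))
q1, q3 = np.percentile(x, [25, 75])
print("IQR:", q3 - q1)  # spread of the middle 50% of the data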
This document provides an overview of Bayesian networks through a 3-day tutorial. Day 1 introduces Bayesian networks and provides a medical diagnosis example. It defines key concepts like Bayes' theorem and influence diagrams. Day 2 covers propagation algorithms, demonstrating how evidence is propagated through a sample chain network. Day 3 will cover learning from data and using continuous variables and software. The overview outlines propagation algorithms for singly and multiply connected graphs.
This document provides an overview of probability, statistics, and their applications in engineering. It defines key probability and statistics concepts like trials, outcomes, random experiments, and frequency distributions. It explains how engineers use statistics and probability to analyze data from tests and experiments to better understand product quality and failure rates. Examples are given of measures of central tendency like mean and median, measures of variation like standard deviation and variance, and the normal distribution curve. Engineering applications include using these analytical techniques to assess results from a class and compare two data histograms.
The document discusses using unusual data sources in insurance. It provides examples of using pictures, text, social media data, telematics, and satellite imagery in insurance. It also discusses challenges in analyzing complex and high-dimensional data from these sources and introduces machine learning tools like PCA, generalized linear models, and evaluating models using loss, risk, and cross-validation.
This document discusses classifier performance evaluation. It outlines different methods for evaluating classifier performance, including hold-out, k-fold cross-validation, and bootstrap aggregating. It emphasizes that evaluation should be treated as statistical hypothesis testing, using metrics like accuracy, precision, and recall calculated from a confusion matrix. Proper evaluation also requires partitioning data into separate training and test sets to avoid overfitting and to get an accurate estimate of a classifier's generalization performance.
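An illustrative sketch of this evaluation workflow (assuming scikit-learn; the dataset and model choices are arbitrary):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Hold-out: keep a separate test set to estimate generalization performance.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
print(confusion_matrix(y_te, y_pred))  # rows: true class, columns: predicted
print("precision:", precision_score(y_te, y_pred))
print("recall:", recall_score(y_te, y_pred))

# k-fold cross-validation: average accuracy over 5 train/test partitions.
cv_model = make_pipeline(StandardScaler(), LogisticRegression())
print("5-fold CV accuracy:", cross_val_score(cv_model, X, y, cv=5).mean())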
The document discusses key concepts in probability theory and statistical decision making under uncertainty. It covers topics like data generation processes being modelled as random variables, Bayes' rule for calculating conditional probabilities, discriminant functions for classification, and utility theory for making rational decisions. Bayesian networks and influence diagrams are introduced as graphical models for representing conditional independence between variables and making decisions. Finally, the document notes that future chapters will focus on estimating probabilities from data using parametric, semiparametric, and nonparametric approaches.
The document discusses multiple statistical comparisons and techniques for controlling error rates when performing multiple hypothesis tests on data. It introduces the concepts of family-wise error rate (FWER) and false discovery rate (FDR), and methods like the Sidak correction, Bonferroni correction, and Benjamini-Hochberg procedure for controlling FWER and FDR. It also discusses how p-value distributions can be used to estimate FDR and calculate q-values. Interactive demonstrations are provided to help illustrate key concepts like Type I and Type II errors.
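The Benjamini-Hochberg procedure mentioned above is short enough to sketch directly (a minimal illustration; the p-values are invented):

import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    # Returns a boolean mask of hypotheses rejected at FDR level alpha.
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order]
    # Find the largest k with p_(k) <= (k / m) * alpha, then reject ranks 1..k.
    below = ranked <= (np.arange(1, m + 1) / m) * alpha
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        rejected[order[:k + 1]] = True
    return rejected

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.74, 0.9]
print(benjamini_hochberg(pvals))  # only the smallest p-values survive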
Composing graphical models with neural networks for structured representatio... (Jeongmin Cha)
This presentation discusses the Structural Variational Autoencoder (SVAE) model, which combines graphical models and neural networks. SVAE uses neural networks to model observations and produce dense low-dimensional representations, while also explicitly representing discrete mixture components through a graphical model. This allows for structured probabilistic representations and fast exact inference. SVAE leverages conjugacy between the prior and the likelihood, which aids Bayesian inference and makes the marginal likelihood tractable. The model is demonstrated on a mouse behavior video segmentation task.
In this lecture, I will present a general tour of some of the most commonly used kernel methods in statistical machine learning and data mining. I will touch on elements of artificial neural networks and then highlight their intricate connections to some general purpose kernel methods like Gaussian process learning machines. I will also resurrect the famous universal approximation theorem and will most likely ignite a [controversial] debate around the theme: could it be that [shallow] networks like radial basis function networks or Gaussian processes are all we need for well-behaved functions? Do we really need many hidden layers as the hype around Deep Neural Network architectures seem to suggest or should we heed Ockham’s principle of parsimony, namely “Entities should not be multiplied beyond necessity.” (“Entia non sunt multiplicanda praeter necessitatem.”) I intend to spend the last 15 minutes of this lecture sharing my personal tips and suggestions with our precious postdoctoral fellows on how to make the most of their experience.
Machine Learning, Financial Engineering and Quantitative Investing (Shengyuan Wang Steven)
This document discusses machine learning applications in financial engineering and quantitative investing. It covers machine learning techniques for curve construction, model calibration, instrument valuation, and risk measurement in quantitative finance. Specifically, it discusses using machine learning methods for yield curve construction, volatility surface calibration, discount curve calibration, and model parameter estimation from historical data. The goal is to apply machine learning to automate quantitative finance tasks and improve the accuracy of pricing and risk models.
This document provides an overview of latent Gaussian models and the INLA methodology. It discusses how hierarchical Bayesian models can be represented as latent Gaussian models, with a latent Gaussian field and hyperparameters. Latent Gaussian models have computational benefits due to the sparse precision matrix encoding conditional independence. Several examples of latent Gaussian models are provided, including mixed effects models, time series models, and disease mapping models. The document outlines how the INLA method can be used for Bayesian computation with these types of models.
Unit 1 - Mean Median Mode - 18MAB303T - PPT - Part 1.pdf (AravindS199)
Sir Francis Galton was a prominent English statistician, anthropologist, eugenicist, and psychometrician in the 19th century. He produced over 340 papers and books, and created the statistical concepts of correlation and regression. As a pioneer in meteorology and differential psychology, he devised early weather maps, proposed theories of weather patterns, and developed questionnaires to study human communities and intelligence. The document discusses Galton's background and contributions to statistics, anthropology, meteorology, and psychometrics.
Similar to A short introduction to statistical learning (20)
Racines en haut et feuilles en bas : les arbres en maths (tuxette)
1. The document discusses methods for clustering and differential analysis of Hi-C matrices, which represent the 3D organization of DNA.
2. It proposes extending Ward's hierarchical clustering to directly use Hi-C similarity matrices while enforcing adjacency constraints. A fast algorithm was also developed.
3. A new method called "treediff" was created to perform differential analysis of Hi-C matrices based on the Wasserstein distance between hierarchical clusterings. Software implementations of these methods were also developed.
Méthodes à noyaux pour l'intégration de données hétérogènes (tuxette)
The document discusses a presentation about multi-omics data integration methods using kernel methods. The presentation introduces kernel methods, how they can be used to integrate heterogeneous omics data, and examples of applications. Specifically, it discusses using kernel methods to perform unsupervised transformation-based integration of multi-omics data. It also presents an application of constrained kernel hierarchical clustering to analyze Hi-C data by directly using Hi-C matrices as kernels.
Méthodologies d'intégration de données omiques (tuxette)
This document summarizes a presentation on multi-omics data integration methods given by Nathalie Vialaneix on December 13, 2023. The presentation discusses different types of omics data that can be integrated, both vertically across different levels of omics data on the same samples and horizontally across similar types of omics data on different samples. It also discusses different analysis approaches that can be taken, including supervised and unsupervised methods. The rest of the presentation focuses on unsupervised transformation-based integration methods using kernels.
The document discusses current and future work on analyzing Hi-C data and differential analysis of Hi-C matrices. It describes a clustering method developed to partition chromosomes based on Hi-C matrix similarity. It also introduces a new method called treediff for differential analysis of Hi-C data that calculates the distance between hierarchical clusterings. Current work includes reviewing differential analysis methods, investigating differential subtrees with multiple testing control, and inferring chromatin interaction networks.
Can deep learning learn chromatin structure from sequence? (tuxette)
This document discusses a deep learning model called ORCA that can predict chromatin structure from DNA sequence. The model uses a neural network with an encoder to extract features from sequence and a decoder to predict Hi-C matrices. It was trained on Hi-C data from multiple cell types and can predict interactions between regions at various resolutions. The model accurately captures features like CTCF-mediated loops and can predict effects of structural variants on chromatin structure. It allows for in silico mutagenesis to study how mutations may alter 3D genome organization.
Multi-omics data integration methods: kernel and other machine learning appro... (tuxette)
The document discusses multi-omics data integration methods, particularly kernel methods. It describes how kernel methods transform data into similarity matrices between samples rather than relying on variable space. Multiple kernel integration approaches are presented that combine multiple similarity matrices into a consensus kernel in an unsupervised manner, such as through a STATIS-like framework that maximizes the similarity between kernels. Examples of applications to datasets from the TARA Oceans expedition are given.
This document provides an overview of the MetaboWean and Idefics projects. MetaboWean aims to study the co-evolution of gut microbiota and epithelium during suckling-to-weaning transition in rabbits, using metabolomics, metagenomics, and single-cell RNA sequencing data. Idefics integrates multiple omics datasets from human skin samples to understand relationships between microorganisms and molecules and how they are structured in patient groups. The datasets include metagenomics, metabolomics, and proteomics from host and microbiota.
Rserve, renv, flask, Vue.js dans un docker pour intégrer des données omiques ... (tuxette)
ASTERICS is an interactive and integrative data analysis tool for omics data. It uses Rserve and PyRserve with Flask and Vue.js in a Docker container to integrate omics data. The backend uses Rserve and PyRserve with Flask on the server side, while the frontend uses Vue.js. This architecture was chosen for its open source and light design. Data communication between Rserve and PyRserve is limited, requiring an object database. ASTERICS is deployed using three Docker containers for R, Python, and
Apprentissage pour la biologie moléculaire et l'analyse de données omiques (tuxette)
This document summarizes a scientific presentation about molecular biology and omics data analysis. The presentation covers topics related to analyzing large omics datasets using methods like kernel methods, graphical models, and neural networks to learn gene regulation networks and predict phenotypes. Key challenges addressed are handling big data, missing values, non-Gaussian data types like counts and compositional data. The goal is to better understand complex biological systems from multi-omics data.
Quelques résultats préliminaires de l'évaluation de méthodes d'inférence de r... (tuxette)
The document summarizes preliminary results from evaluating methods for inferring gene regulatory networks from expression data in Bacillus subtilis. It finds that recall of the known network is generally poor (<20% for random forest), but inferred clusters still retain biological information about common regulators. It plans to confirm results, test restricting edges to sigma factors, and explore other inference methods like Bayesian networks and ARACNE.
Intégration de données omiques multi-échelles : méthodes à noyau et autres ap... (tuxette)
The document discusses methods for integrating multi-scale omics data using kernel and machine learning approaches. It describes how omics data is large, heterogeneous, and multi-scaled, creating bottlenecks for analysis. Methods discussed for data integration include multiple kernel learning to combine different relational datasets in an unsupervised way. The methods are applied to integrate different datasets from the TARA Oceans expedition to identify patterns in ocean microbial communities. Improving interpretability of the methods and making them more accessible to biological users is discussed.
Journal club: Validation of cluster analysis results on validation data (tuxette)
This document presents a framework for validating cluster analysis results on validation data. It describes situations where clustering is inferential versus descriptive and recommends using validation data separate from the data used for clustering. A typology of validation methods is provided, including validation based on the clustering method or results, and evaluation using internal validation, external validation, visual properties, or stability measures.
The document discusses the differences between overfitting and overparametrization in machine learning models. It explores how random forests may exhibit a phenomenon known as "double descent" where test error initially decreases then increases with more parameters before decreasing again. While double descent has been observed in other models, the document questions whether it is directly due to model complexity in random forests since very large trees may be unable to fully interpolate extremely large datasets.
Selective inference and single-cell differential analysis (tuxette)
This document discusses selective inference and single-cell differential analysis. It introduces the problem of "double dipping" in the standard single-cell analysis pipeline where the same dataset is used for clustering and differential analysis. Two approaches for addressing this are presented: 1) A method that perturbs clusters before testing for differences, and 2) A test based on a truncated distribution that assumes clusters and genes are given separately. Experiments applying these methods to real single-cell datasets are described. The document outlines challenges in extending these approaches to more complex analyses.
SOMbrero : un package R pour les cartes auto-organisatrices (tuxette)
SOMbrero is an R package that implements self-organizing map (SOM) algorithms. It can handle numeric, non-numeric, and relational data. The package contains functions for training SOMs, diagnosing results, and plotting maps. It also includes tools like a shiny app and vignettes to aid users without programming experience. SOMbrero supports missing data imputation and extends SOM to relational datasets through non-Euclidean distance measures.
A short and naive introduction to using network in prediction models (tuxette)
The document provides an introduction to using network information in prediction models. It discusses representing a network as a graph with a Laplacian matrix. The Laplacian captures properties like random walks on the graph and heat diffusion. Eigenvectors of the Laplacian related to small eigenvalues are strongly tied to graph structure. The document discusses using the Laplacian in prediction models by working in the feature space defined by the Laplacian eigenvectors or directly regularizing a linear model with the Laplacian. This introduces network information and encourages similar contributions from connected nodes. The approaches are applied to problems like predicting phenotypes from gene expression using a known gene network.
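A minimal NumPy sketch of the Laplacian machinery described above (the toy graph is invented for the illustration):

import numpy as np

# Adjacency matrix of a small graph: two triangles joined by a single edge.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)

# Unnormalized graph Laplacian: L = D - A.
L = np.diag(A.sum(axis=1)) - A

# Eigenvectors associated with small eigenvalues reflect the graph structure:
# the second one (the Fiedler vector) separates the two triangles by sign.
eigvals, eigvecs = np.linalg.eigh(L)
print(eigvals.round(3))
print(eigvecs[:, 1].round(3))  # opposite signs on the two clusters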
This document summarizes different approaches for structure learning in graph neural networks. It discusses three main classes of methods: 1) metric-based learning which learns a similarity matrix between nodes, 2) probabilistic models which learn the parameters of a distribution over graphs, and 3) direct optimization which directly optimizes the graph adjacency matrix. The document provides examples of methods within each class and notes challenges such as the simplicity of probabilistic models and computational difficulties of direct optimization.
Anti-Universe And Emergent Gravity and the Dark Universe (Sérgio Sacani)
Recent theoretical progress indicates that spacetime and gravity emerge together from the entanglement structure of an underlying microscopic theory. These ideas are best understood in Anti-de Sitter space, where they rely on the area law for entanglement entropy. The extension to de Sitter space requires taking into account the entropy and temperature associated with the cosmological horizon. Using insights from string theory, black hole physics and quantum information theory we argue that the positive dark energy leads to a thermal volume law contribution to the entropy that overtakes the area law precisely at the cosmological horizon. Due to the competition between area and volume law entanglement the microscopic de Sitter states do not thermalise at sub-Hubble scales: they exhibit memory effects in the form of an entropy displacement caused by matter. The emergent laws of gravity contain an additional ‘dark’ gravitational force describing the ‘elastic’ response due to the entropy displacement. We derive an estimate of the strength of this extra force in terms of the baryonic mass, Newton’s constant and the Hubble acceleration scale a0 = cH0, and provide evidence for the fact that this additional ‘dark gravity force’ explains the observed phenomena in galaxies and clusters currently attributed to dark matter.
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf (Selcen Ozturkcan)
Ozturkcan, S., Berndt, A., & Angelakis, A. (2024). Mending clothing to support sustainable fashion. Presented at the 31st Annual Conference by the Consortium for International Marketing Research (CIMaR), 10-13 Jun 2024, University of Gävle, Sweden.
Signatures of wave erosion in Titan's coasts (Sérgio Sacani)
The shorelines of Titan's hydrocarbon seas trace flooded erosional landforms such as river valleys; however, it is unclear whether coastal erosion has subsequently altered these shorelines. Spacecraft observations and theoretical models suggest that wind may cause waves to form on Titan's seas, potentially driving coastal erosion, but the observational evidence of waves is indirect, and the processes affecting shoreline evolution on Titan remain unknown. No widely accepted framework exists for using shoreline morphology to quantitatively discern coastal erosion mechanisms, even on Earth, where the dominant mechanisms are known. We combine landscape evolution models with measurements of shoreline shape on Earth to characterize how different coastal erosion mechanisms affect shoreline morphology. Applying this framework to Titan, we find that the shorelines of Titan's seas are most consistent with flooded landscapes that subsequently have been eroded by waves, rather than a uniform erosional process or no coastal erosion, particularly if wave growth saturates at fetch lengths of tens of kilometers.
Mechanisms and Applications of Antiviral Neutralizing Antibodies - Creative B... (Creative-Biolabs)
Neutralizing antibodies, pivotal in immune defense, specifically bind and inhibit viral pathogens, thereby playing a crucial role in protecting against and mitigating infectious diseases. In this slide, we will introduce what antibodies and neutralizing antibodies are, the production and regulation of neutralizing antibodies, their mechanisms of action, classification and applications, as well as the challenges they face.
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc... (PsychoTech Services)
A proprietary approach developed by bringing together the best of learning theories from Psychology, design principles from the world of visualization, and pedagogical methods from over a decade of training experience, that enables you to: Learn better, faster!
Microbial interaction
Microorganisms interact with each other and can be physically associated with other organisms in a variety of ways.
One organism can be located on the surface of another organism as an ectobiont, or within another organism as an endobiont.
Microbial interactions may be positive, such as mutualism, proto-cooperation, and commensalism, or negative, such as parasitism, predation, or competition.
Types of microbial interaction
Positive interaction: mutualism, proto-cooperation, commensalism
Negative interaction: Ammensalism (antagonism), parasitism, predation, competition
I. Mutualism:
It is defined as a relationship in which each interacting organism benefits from the association. It is an obligatory relationship in which the mutualist and the host are metabolically dependent on each other.
The mutualistic relationship is very specific: one member of the association cannot be replaced by another species.
Mutualism requires close physical contact between the interacting organisms.
The mutualistic relationship allows organisms to exist in habitats that could not be occupied by either species alone.
The mutualistic relationship allows the interacting organisms to behave as a single organism.
Examples of mutualism:
i. Lichens:
Lichens are an excellent example of mutualism.
They are associations of specific fungi with certain genera of algae. In a lichen, the fungal partner is called the mycobiont and the algal partner is called the phycobiont.
II. Syntrophism:
It is an association in which the growth of one organism either depends on, or is improved by, a substrate provided by another organism.
In syntrophism, both organisms in the association benefit.
[Diagram: Compound A (utilized by population 1) → Compound B (utilized by population 2) → Compound C (utilized by both populations 1 + 2) → Products]
In this theoretical example of syntrophism, population 1 is able to utilize and metabolize compound A, forming compound B, but cannot metabolize beyond compound B without the cooperation of population 2. Population 2 is unable to utilize compound A, but it can metabolize compound B, forming compound C. Together, populations 1 and 2 can carry out a metabolic sequence leading to the formation of an end product that neither population could produce alone.
Examples of syntrophism:
i. Methanogenic ecosystem in sludge digester
Methane production by methanogenic bacteria depends upon interspecies hydrogen transfer from other fermentative bacteria.
Anaerobic fermentative bacteria generate CO2 and H2 from carbohydrates, which are then utilized by methanogenic bacteria (Methanobacter) to produce methane.
ii. Lactobacillus arabinosus and Enterococcus faecalis:
In minimal medium, Lactobacillus arabinosus and Enterococcus faecalis are able to grow together but not alone.
The synergistic relationship between E. faecalis and L. arabinosus occurs because E. faecalis requires folic acid, which is produced by L. arabinosus, while L. arabinosus in turn requires phenylalanine produced by E. faecalis.
5-8. Background
Purpose: predict Y from X.
What we have: n observations of (X, Y): (x1, y1), . . . , (xn, yn).
What we want: estimate the unknown Y for new values of X: xn+1, . . . , xm.
X can be:
numeric variables;
or factors;
or a combination of numeric variables and factors.
Y can be:
a numeric variable (Y ∈ R) ⇒ (supervised) regression;
a factor ⇒ (supervised) classification.
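A minimal sketch of this setup (assuming scikit-learn; the toy data are invented for the illustration): a regression model is fitted when Y is numeric, a classifier when Y is a factor, and both are then used to predict Y for new values of X.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# n observations (x_i, y_i), then predictions for new x_{n+1}, ..., x_m.
x = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])

# Y numeric => (supervised) regression.
y_num = np.array([0.1, 1.2, 1.9, 3.2, 3.8, 5.1])
print(LinearRegression().fit(x, y_num).predict([[6.0], [7.0]]))

# Y a factor => (supervised) classification.
y_fac = np.array(["low", "low", "low", "high", "high", "high"])
print(LogisticRegression().fit(x, y_fac).predict([[0.5], [4.5]]))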
9-14. Basics
From the observations (xi, yi)i, a machine Φn is defined such that
$\hat{y}_{\mathrm{new}} = \Phi_n(x_{\mathrm{new}})$.
If Y is numeric, Φn is called a regression function; if Y is a factor, Φn is called a classifier.
Φn is said to be trained or learned from the observations (xi, yi)i.
Desirable properties
accuracy to the observations: predictions made on known data are close to observed values;
generalization ability: predictions made on new data are also accurate.
Conflicting objectives!!
Underfitting / Overfitting
[Series of figures, all on the same simulated example: the function $x \mapsto y$ to be estimated; observations we might have; observations we do have; a first estimation from the observations (underfitting); a second estimation (accurate estimation); a third estimation (overfitting).]
Errors
training error (measures the accuracy to the observations):
if $y$ is a factor: misclassification rate
$$\frac{\#\{i : \hat{y}_i \neq y_i,\ i = 1, \ldots, n\}}{n};$$
if $y$ is numeric: mean square error (MSE)
$$\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2,$$
or root mean square error (RMSE), or pseudo-$R^2$: $1 - \mathrm{MSE}/\mathrm{Var}((y_i)_i)$.
test error: a way to prevent overfitting (it estimates the generalization error) is simple validation:
1 split the data into training/test sets (usually 80%/20%);
2 train $\Phi_n$ on the training dataset;
3 compute the test error on the remaining data.
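A minimal R sketch of this simple-validation scheme; the data frame d and the response y are hypothetical names, and any learner could replace the linear model used here:
set.seed(42)
n <- nrow(d)
train_idx <- sample(n, size = round(0.8 * n))   # 80% of rows for training
d_train <- d[train_idx, ]
d_test  <- d[-train_idx, ]
model <- lm(y ~ ., data = d_train)              # train Phi_n on the training set
mse <- function(y, yhat) mean((y - yhat)^2)
mse(d_train$y, predict(model, d_train))         # training error
mse(d_test$y,  predict(model, d_test))          # test error: estimates generalization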
Bias / Variance trade-off
This problem is also related to the well-known bias/variance trade-off:
bias: error that comes from erroneous assumptions in the learning algorithm (average error of the predictor);
variance: error that comes from sensitivity to small fluctuations in the training set (variance of the predictor).
The overall error is: $\mathbb{E}(\mathrm{MSE}) = \mathrm{Bias}^2 + \mathrm{Variance}$.
Consistency in the parametric/nonparametric case
Example in the parametric framework (linear methods): an assumption is made on the form of the relation between $X$ and $Y$:
$$Y = \beta^T X + \epsilon.$$
$\beta$ is estimated from the observations $(x_1, y_1), \ldots, (x_n, y_n)$ by a given method, which computes an estimate $\beta_n$. The estimation is said to be consistent if $\beta_n \xrightarrow{n \to +\infty} \beta$, under (possibly) technical assumptions on $X$, $\epsilon$, $Y$.
Example in the nonparametric framework: the form of the relation between $X$ and $Y$ is unknown:
$$Y = \Phi(X) + \epsilon.$$
$\Phi$ is estimated from the observations $(x_1, y_1), \ldots, (x_n, y_n)$ by a given method, which computes an estimate $\Phi_n$. The estimation is said to be consistent if $\Phi_n \xrightarrow{n \to +\infty} \Phi$, under (possibly) technical assumptions on $X$, $\epsilon$, $Y$.
Consistency from the statistical learning perspective [Vapnik, 1995]
Question: are we really interested in estimating $\Phi$, or rather in having the smallest prediction error?
Statistical learning perspective: a method that builds a machine $\Phi_n$ from the observations is said to be (universally) consistent if, given a risk function $R : \mathbb{R} \times \mathbb{R} \to \mathbb{R}^+$ (which measures an error),
$$\mathbb{E}\left(R(\Phi_n(X), Y)\right) \xrightarrow{n \to +\infty} \inf_{\Phi : \mathcal{X} \to \mathbb{R}} \mathbb{E}\left(R(\Phi(X), Y)\right),$$
for any distribution of $(X, Y) \in \mathcal{X} \times \mathbb{R}$.
Definitions: $L^* = \inf_{\Phi : \mathcal{X} \to \mathbb{R}} \mathbb{E}\left(R(\Phi(X), Y)\right)$ and $L_\Phi = \mathbb{E}\left(R(\Phi(X), Y)\right)$.
Desirable properties from a mathematical perspective
Simplified framework: $X \in \mathcal{X}$ and $Y \in \{-1, 1\}$ (binary classification).
Learning process: choose a machine $\Phi_n$ in a class of functions $\mathcal{C} \subset \{\Phi : \mathcal{X} \to \mathbb{R}\}$ (e.g., $\mathcal{C}$ is the set of all functions that can be built using an SVM).
Error decomposition:
$$L_{\Phi_n} - L^* \leq \left[ L_{\Phi_n} - \inf_{\Phi \in \mathcal{C}} L_\Phi \right] + \left[ \inf_{\Phi \in \mathcal{C}} L_\Phi - L^* \right],$$
where
$\inf_{\Phi \in \mathcal{C}} L_\Phi - L^*$ is the richness of $\mathcal{C}$ (i.e., $\mathcal{C}$ must be rich to ensure that this term is small);
$L_{\Phi_n} - \inf_{\Phi \in \mathcal{C}} L_\Phi \leq 2 \sup_{\Phi \in \mathcal{C}} |\hat{L}_n(\Phi) - L_\Phi|$, with $\hat{L}_n(\Phi) = \frac{1}{n} \sum_{i=1}^{n} R(\Phi(x_i), y_i)$, is the generalization capability of $\mathcal{C}$ (i.e., in the worst case, the empirical error must be close to the true error: $\mathcal{C}$ must not be too rich to ensure that this term is small).
Outline
1 Introduction
Background and notations
Underfitting / Overfitting
Consistency
2 CART and random forests
Introduction to CART
Learning
Prediction
Overview of random forests
Bootstrap/Bagging
Random forest
3 SVM
4 (not deep) Neural networks
Seminal references
Multi-layer perceptrons
Theoretical properties of perceptrons
Learning perceptrons
Learning in practice
Overview
CART: Classification And Regression Trees, introduced by [Breiman et al., 1984].
Advantages
classification OR regression (i.e., $Y$ can be a numeric variable or a factor);
nonparametric method: no prior assumption needed;
can deal with a large number of input variables, either numeric variables or factors (a variable selection is included in the method);
provides an intuitive interpretation.
Drawbacks
requires a large training dataset to be efficient;
as a consequence, trees are often too simple to provide accurate predictions.
Example
$X$ = (Gender, Age, Height) and $Y$ = Weight.
[Tree diagram: the root is split on Height (< 1.60 m vs. > 1.60 m); subsequent nodes are split on Gender (M/F) and on Age (< 30 vs. > 30); the leaves give the predictions $Y_1, \ldots, Y_5$. The diagram illustrates the terms root, split, node, and leaf (terminal node).]
CART learning process
Algorithm
Start from the root
repeat
  move to a "new" node
  if the node is homogeneous or small enough then
    STOP
  else
    split the node into two child nodes with maximal "homogeneity"
  end if
until all nodes are processed
Further details
Homogeneity?
if $Y$ is a numeric variable: variance of $(y_i)_i$ for the observations assigned to the node (the Gini index is also sometimes used);
if $Y$ is a factor: node purity, i.e., the percentage of observations assigned to the node whose $Y$ values are not the node's majority class.
Stopping criteria?
minimum node size (generally 1 or 5);
minimum node purity or variance;
maximum tree depth.
Hyperparameters can be tuned by cross-validation using a grid search. An alternative approach is pruning.
Making new predictions
A new observation $x_{\mathrm{new}}$ is assigned to a leaf (straightforward); the corresponding prediction $\hat{y}_{\mathrm{new}}$ is:
if $Y$ is numeric, the mean value of the (training) observations assigned to the same leaf;
if $Y$ is a factor, the majority class of the (training) observations assigned to the same leaf.
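A minimal sketch in R with the rpart package (a standard CART implementation; the control settings below mirror the stopping criteria discussed above) covering both learning and prediction:
library(rpart)
data(iris)
# learn a classification tree; minsplit/maxdepth/cp are the stopping criteria
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(minsplit = 5, maxdepth = 5, cp = 0.01))
plot(fit); text(fit)    # the tree itself gives an intuitive interpretation
# prediction: each observation is routed to a leaf and gets its majority class
predict(fit, newdata = iris[c(1, 51, 101), ], type = "class")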
Advantages/Drawbacks
Random Forest: introduced by [Breiman, 2001].
Advantages
classification OR regression (i.e., $Y$ can be a numeric variable or a factor);
nonparametric method (no prior assumption needed) and accurate;
can deal with a large number of input variables, either numeric variables or factors;
can deal with small samples.
Drawbacks
black-box model;
until now, supported by only a few mathematical results (consistency, ...).
Basic description
A fact: when the sample size is small, you might be unable to estimate the model properly.
This issue is commonly tackled by bootstrapping and, more specifically, by bagging (Bootstrap Aggregating), which reduces the variance of the estimator.
Bagging: combination of simple (and individually inefficient) regression (or classification) functions.
Random forest: bagging applied to CART trees.
Bootstrap
Bootstrap sample: random sampling (with replacement) of the training dataset; the samples have the same size as the original dataset.
It is a general (and robust) approach to solve several problems, e.g., estimating confidence intervals (for a statistic of $X$ with no prior assumption on the distribution of $X$):
1 build $P$ bootstrap samples from $(x_i)_i$;
2 use them to compute $P$ estimates of the statistic;
3 the confidence interval is based on the percentiles of the resulting empirical distribution.
Bootstrapping is also useful to estimate p-values, residuals, ...
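A minimal sketch of the percentile bootstrap in R, here for a confidence interval on the mean of a sample x (simulated only for illustration):
set.seed(1)
x <- rexp(50)                     # a sample; no distributional assumption is used below
P <- 1000                         # number of bootstrap samples
boot_means <- replicate(P, mean(sample(x, replace = TRUE)))
quantile(boot_means, c(0.025, 0.975))   # 95% CI from the empirical percentiles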
Bagging
Average the estimates of the regression (or classification) function obtained from $B$ bootstrap samples.
Bagging with regression trees
1: for $b = 1, \ldots, B$ do
2:   build a bootstrap sample $\xi_b$
3:   train a regression tree $\hat{\phi}_b$ on $\xi_b$
4: end for
5: estimate the regression function by
$$\hat{\Phi}_n(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{\phi}_b(x)$$
For classification, the predicted class is the majority-vote class.
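A minimal R sketch of this bagging algorithm with rpart regression trees; d and y are hypothetical names for a data frame and its numeric response:
library(rpart)
B <- 100
trees <- vector("list", B)
for (b in 1:B) {
  xi_b <- d[sample(nrow(d), replace = TRUE), ]   # bootstrap sample xi_b
  trees[[b]] <- rpart(y ~ ., data = xi_b)        # regression tree phi_b
}
# aggregated estimate: average of the B tree predictions
Phi_hat <- function(newdata) rowMeans(sapply(trees, predict, newdata = newdata))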
Random forests
CART bagging with two additional variations:
1 each node split is based on a random (and different) subset of $q$ variables (an advisable choice for $q$ is $\sqrt{p}$ for classification and $p/3$ for regression);
2 each tree is fully developed (overfitted).
Hyperparameters
those of the CART algorithm;
those specific to the random forest: $q$ and the number of trees.
Random forests are not very sensitive to hyperparameter settings: the default value for $q$ and 500-1000 trees should work in most cases.
Additional tools
OOB (Out-Of-Bag) error: error computed on the observations not included in the "bag". Stabilization of the OOB error is a good indication that there are enough trees in the forest.
Importance of a variable, to help interpretation: for a given variable $X^{(j)}$,
1: randomize the values of the variable;
2: make predictions from this new dataset;
3: the importance is the mean decrease in accuracy (MSE or misclassification rate).
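In R, the randomForest package exposes all of this; a minimal sketch on iris:
library(randomForest)
data(iris)
set.seed(2)
# the default mtry is sqrt(p) for classification and p/3 for regression
rf <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)
rf               # prints the OOB error estimate
plot(rf)         # OOB error vs. number of trees: check stabilization
importance(rf)   # permutation importance (mean decrease in accuracy)
varImpPlot(rf)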
Basic introduction
Binary classification problem: $X \in \mathcal{H}$ and $Y \in \{-1, 1\}$; a training set $(x_1, y_1), \ldots, (x_n, y_n)$ is given.
SVM is a kernel-based method. It is universally consistent, provided that the kernel is universal [Steinwart, 2002]. Extensions to the regression case exist (SVR or LS-SVM) that are also universally consistent when the kernel is universal.
Optimal margin classification
[Figure: separating hyperplane with normal vector $w$, margin $1/\|w\|_2$, and the support vectors highlighted.]
$w$ is chosen such that:
$\min_w \|w\|^2$ (the margin is the largest),
under the constraints $y_i(\langle w, x_i \rangle + b) \geq 1$, $1 \leq i \leq n$ (the separation between the two classes is perfect).
⇒ ensures a good generalization capability.
Soft margin classification
[Figure: separating hyperplane with margin $1/\|w\|_2$, support vectors, and a few observations allowed inside the margin.]
$w$ is chosen such that:
$\min_{w,\xi} \|w\|^2 + C \sum_{i=1}^{n} \xi_i$ (the margin is the largest),
under the constraints $y_i(\langle w, x_i \rangle + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$, $1 \leq i \leq n$ (the separation between the two classes is almost perfect).
⇒ allowing a few errors improves the richness of the class.
Non linear SVM
[Figure: data that are not linearly separable in the original space $\mathcal{X}$ become linearly separable in the feature space $\mathcal{H}$ after a non linear mapping $\Psi$.]
$w \in \mathcal{H}$ is chosen such that ($P_{C,\mathcal{H}}$):
$\min_{w,\xi} \|w\|_{\mathcal{H}}^2 + C \sum_{i=1}^{n} \xi_i$ (the margin in the feature space is the largest),
under the constraints $y_i(\langle w, \Psi(x_i) \rangle_{\mathcal{H}} + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$, $1 \leq i \leq n$ (the separation between the two classes in the feature space is almost perfect).
SVM from different points of view
A regularization problem: $(P_{C,\mathcal{H}}) \Leftrightarrow$
$$(P^2_{\lambda,\mathcal{H}}) : \min_{w \in \mathcal{H}} \underbrace{\frac{1}{n} \sum_{i=1}^{n} R(f_w(x_i), y_i)}_{\text{error term}} + \underbrace{\lambda \|w\|_{\mathcal{H}}^2}_{\text{penalization term}},$$
where $f_w(x) = \langle \Psi(x), w \rangle_{\mathcal{H}}$ and $R(\hat{y}, y) = \max(0, 1 - \hat{y}y)$ (the hinge loss function).
[Figure: errors versus $\hat{y}$ for $y = 1$; blue: hinge loss; green: misclassification error.]
A dual problem: $(P_{C,\mathcal{H}}) \Leftrightarrow$
$$(D_{C,\mathcal{X}}) : \max_{\alpha \in \mathbb{R}^n} \sum_{i=1}^{n} \alpha_i - \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j),$$
with $\sum_{i=1}^{n} \alpha_i y_i = 0$ and $0 \leq \alpha_i \leq C$, $1 \leq i \leq n$.
There is no need to know $\Psi$ and $\mathcal{H}$:
choose a function $K$ with a few good properties;
use it as the dot product in $\mathcal{H}$: $\forall\, u, v \in \mathcal{X}$, $K(u, v) = \langle \Psi(u), \Psi(v) \rangle_{\mathcal{H}}$.
Which kernels?
Minimum properties that a kernel should fulfil:
symmetry: $K(u, u') = K(u', u)$;
positivity: $\forall\, N \in \mathbb{N}$, $\forall\, (\alpha_i)_i \subset \mathbb{R}^N$, $\forall\, (x_i)_i \subset \mathcal{X}^N$, $\sum_{i,j} \alpha_i \alpha_j K(x_i, x_j) \geq 0$.
[Aronszajn, 1950]: then there exist a Hilbert space $(\mathcal{H}, \langle \cdot, \cdot \rangle_{\mathcal{H}})$ and a function $\Psi : \mathcal{X} \to \mathcal{H}$ such that $\forall\, u, v \in \mathcal{X}$, $K(u, v) = \langle \Psi(u), \Psi(v) \rangle_{\mathcal{H}}$.
Examples
the Gaussian kernel: $\forall\, x, x' \in \mathbb{R}^d$, $K(x, x') = e^{-\gamma \|x - x'\|^2}$ (it is universal on every bounded subset of $\mathbb{R}^d$);
the linear kernel: $\forall\, x, x' \in \mathbb{R}^d$, $K(x, x') = x^T x'$ (it is not universal).
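A minimal R sketch computing a Gaussian kernel matrix and checking the two properties above numerically:
data(iris)
x <- as.matrix(iris[1:10, 1:4])           # 10 observations in R^4
gamma <- 0.5
K <- exp(-gamma * as.matrix(dist(x))^2)   # K(x, x') = exp(-gamma ||x - x'||^2)
isSymmetric(K)                            # symmetry
min(eigen(K, symmetric = TRUE, only.values = TRUE)$values)   # >= 0: positivity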
In summary, what does the solution look like?
$$\Phi_n(x) = \sum_{i} \alpha_i y_i K(x_i, x),$$
where only a few $\alpha_i \neq 0$. The observations $i$ such that $\alpha_i \neq 0$ are the support vectors!
I'm almost dead with all this stuff on my mind!!! What in practice?
data(iris)
# keep a two-class problem: versicolor vs. virginica
iris <- iris[iris$Species %in% c("versicolor", "virginica"), ]
plot(iris$Petal.Length, iris$Petal.Width, col = iris$Species, pch = 19)
legend("topleft", pch = 19, col = c(2, 3),
       legend = c("versicolor", "virginica"))
With a linear kernel, tuning the cost parameter by cross validation:
library(e1071)
res.tune <- tune.svm(Species ~ ., data = iris, kernel = "linear",
                     cost = 2^(-1:4))
# Parameter tuning of 'svm':
# - sampling method: 10-fold cross validation
# - best parameters:
#     cost
#     0.5
# - best performance: 0.05
res.tune$best.model
# Call:
# best.svm(x = Species ~ ., data = iris, cost = 2^(-1:4),
#     kernel = "linear")
# Parameters:
#   SVM-Type: C-classification
#   SVM-Kernel: linear
#   cost: 0.5
#   gamma: 0.25
# Number of Support Vectors: 21
Confusion matrix of the fitted model (5 training errors) and a plot of the decision boundary:
table(res.tune$best.model$fitted, iris$Species)
#              setosa versicolor virginica
#   setosa          0          0         0
#   versicolor      0         45         0
#   virginica       0          5        50
plot(res.tune$best.model, data = iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 2.872, Sepal.Length = 6.262))
With the (default) Gaussian kernel, tuning both gamma and the cost:
res.tune <- tune.svm(Species ~ ., data = iris, gamma = 2^(-1:1),
                     cost = 2^(2:4))
# Parameter tuning of 'svm':
# - sampling method: 10-fold cross validation
# - best parameters:
#     gamma cost
#       0.5    4
# - best performance: 0.08
res.tune$best.model
# Call:
# best.svm(x = Species ~ ., data = iris, gamma = 2^(-1:1),
#     cost = 2^(2:4))
# Parameters:
#   SVM-Type: C-classification
#   SVM-Kernel: radial
#   cost: 4
#   gamma: 0.5
# Number of Support Vectors: 32
Confusion matrix (1 training error) and decision boundary:
table(res.tune$best.model$fitted, iris$Species)
#              setosa versicolor virginica
#   setosa          0          0         0
#   versicolor      0         49         0
#   virginica       0          1        50
plot(res.tune$best.model, data = iris, Petal.Width ~ Petal.Length,
     slice = list(Sepal.Width = 2.872, Sepal.Length = 6.262))
What are (artificial) neural networks?
Common properties
(artificial) "neural networks": a general name for supervised and unsupervised methods developed in (vague) analogy to the brain;
a combination (network) of simple elements (neurons).
[Figure: graphical representation of a network, with input units on the left and output units on the right.]
Different types of neural networks
A neural network is defined by:
1 the network structure;
2 the neuron type.
Standard examples
multilayer perceptrons (MLP): dedicated to supervised problems (classification and regression);
radial basis function networks (RBF): same purpose but based on local smoothing;
self-organizing maps (SOM, also sometimes called Kohonen's maps) or topographic maps: dedicated to unsupervised problems (clustering), self-organized;
...
In this talk, the focus is on MLP.
MLP: Advantages/Drawbacks
Advantages
classification OR regression (i.e., $Y$ can be a numeric variable or a factor);
nonparametric method: flexible;
good theoretical properties.
Drawbacks
hard to train (high computational cost, especially when $d$ is large);
overfits easily;
"black box" model (hard to interpret).
References
Advised references:
[Bishop, 1995, Ripley, 1996]: overviews of the topic from a learning (more than statistical) perspective;
[Devroye et al., 1996, Györfi et al., 2002]: dedicated chapters present the statistical properties of perceptrons.
Analogy to the brain
1 a neuron collects signals from neighboring neurons through its dendrites;
2 when the total signal is above a given threshold, the neuron is activated...
3 ... and a signal is sent to other neurons through the axon.
Connections which frequently lead to activating a neuron are reinforced (they tend to have an increasing impact on the destination neuron).
First model of an artificial neuron [Mc Culloch and Pitts, 1943, Rosenblatt, 1958, Rosenblatt, 1962]
[Figure: inputs $x^{(1)}, \ldots, x^{(p)}$ are combined with weights $w_1, \ldots, w_p$ and a bias $w_0$, then thresholded to produce $f(x)$.]
$$f : x \in \mathbb{R}^p \mapsto \mathbb{1}_{\left\{ \sum_{j=1}^{p} w_j x^{(j)} + w_0 \geq 0 \right\}}$$
(artificial) Perceptron
Layers
MLP have one input layer ($x \in \mathbb{R}^p$), one output layer ($y \in \mathbb{R}$, or $y \in \{1, \ldots, K-1\}$ for classification) and several hidden layers;
no connections within a layer;
connections only between two consecutive layers (feedforward).
[Figure: a 2-hidden-layer MLP for regression: inputs $x = (x^{(1)}, \ldots, x^{(p)})$, layer 1 with weights $w^{(1)}_{jk}$, layer 2, and the output $y$.]
A neuron in MLP
[Figure: a neuron receives inputs $v_1, v_2, v_3$ weighted by $w_1, w_2, w_3$, adds a bias $w_0$, and applies an activation function.]
Standard activation functions
biologically inspired: the Heaviside function
$$h(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{otherwise,} \end{cases}$$
whose main issue is that it is not continuous;
the identity, $h(z) = z$, which however yields a linear model if used with one hidden layer: not flexible enough;
the logistic function
$$h(z) = \frac{1}{1 + \exp(-z)};$$
the rectified linear unit (ReLU), $h(z) = \max(0, z)$, another popular activation function (useful to model positive real numbers);
more generally, a sigmoid: a nondecreasing function $h : \mathbb{R} \to \mathbb{R}$ such that
$$\lim_{z \to +\infty} h(z) = 1 \qquad \lim_{z \to -\infty} h(z) = 0.$$
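These activation functions are one-liners in R; a quick sketch to visualize them:
heaviside <- function(z) ifelse(z < 0, 0, 1)
logistic  <- function(z) 1 / (1 + exp(-z))
relu      <- function(z) pmax(0, z)
z <- seq(-4, 4, length.out = 200)
plot(z, logistic(z), type = "l", ylab = "h(z)", ylim = c(0, 2))
lines(z, heaviside(z), lty = 2)
lines(z, relu(z), lty = 3)
legend("topleft", lty = 1:3, legend = c("logistic", "Heaviside", "ReLU"))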
Focus on one-hidden-layer perceptrons
Regression case:
$$f(x) = \sum_{k=1}^{Q} w^{(2)}_k h_k\!\left( x^T w^{(1)}_k + w^{(0)}_k \right) + w^{(2)}_0,$$
with $h_k$ a (logistic) sigmoid.
Binary classification case: the same architecture computes
$$\psi(x) = h_0\!\left( \sum_{k=1}^{Q} w^{(2)}_k h_k\!\left( x^T w^{(1)}_k + w^{(0)}_k \right) + w^{(2)}_0 \right),$$
with $h_0$ a logistic sigmoid or the identity, and the decision is
$$f(x) = \begin{cases} 0 & \text{if } \psi(x) < 1/2 \\ 1 & \text{otherwise.} \end{cases}$$
Extension to any classification problem in $\{1, \ldots, K-1\}$: straightforward with a multiple-output perceptron (number of output units equal to $K$) and a maximum-probability rule for the decision.
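A minimal R sketch of the forward pass $f(x)$ of a one-hidden-layer perceptron (regression case), with randomly drawn weights just for illustration:
logistic <- function(z) 1 / (1 + exp(-z))
forward <- function(x, W1, w10, w2, w20) {
  # W1: Q x p matrix of first-layer weights; w10: Q first-layer biases
  # w2: Q second-layer weights; w20: second-layer bias
  hidden <- logistic(W1 %*% x + w10)   # h_k(x' w_k^(1) + w_k^(0))
  sum(w2 * hidden) + w20               # sum_k w_k^(2) h_k(...) + w_0^(2)
}
set.seed(3)
p <- 2; Q <- 4
forward(c(0.5, -1), matrix(rnorm(Q * p), Q, p), rnorm(Q), rnorm(Q), rnorm(1))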
Theoretical properties of perceptrons
This section answers two questions:
1 can we approximate any function $g : [0,1]^p \to \mathbb{R}$ arbitrarily well with a perceptron?
2 when a perceptron is trained with i.i.d. observations from an arbitrary random variable pair $(X, Y)$, is it consistent? (i.e., does it reach the minimum possible error asymptotically when the number of observations grows to infinity?)
Illustration of the universal approximation property
A simple example: a function to approximate, $g : [0,1] \to \mathbb{R}$, $g(x) = \sin\!\left( \frac{1}{x + 0.1} \right)$, and attempts to approximate it (how this is performed is explained later in this talk) with MLPs having different numbers of neurons on their hidden layer.
Universal property from a theoretical point of view
Set of MLPs with a given size:
$$\mathcal{P}_Q(h) = \left\{ x \in \mathbb{R}^p \mapsto \sum_{k=1}^{Q} w^{(2)}_k h\!\left( x^T w^{(1)}_k + w^{(0)}_k \right) + w^{(2)}_0 \;:\; w^{(2)}_k, w^{(0)}_k \in \mathbb{R},\; w^{(1)}_k \in \mathbb{R}^p \right\}$$
Set of all MLPs: $\mathcal{P}(h) = \cup_{Q \in \mathbb{N}} \mathcal{P}_Q(h)$.
Universal approximation [Pinkus, 1999]
If $h$ is a non-polynomial continuous function then, for any continuous function $g : [0,1]^p \to \mathbb{R}$ and any $\epsilon > 0$, there exists $f \in \mathcal{P}(h)$ such that:
$$\sup_{x \in [0,1]^p} |f(x) - g(x)| \leq \epsilon.$$
Remarks on universal approximation
continuity of the activation function is not required (see [Devroye et al., 1996] for a result with arbitrary sigmoids);
other versions of this property are given in [Hornik, 1991, Hornik, 1993, Stinchcombe, 1999] for different functional spaces for $g$;
none of the spaces $\mathcal{P}_Q(h)$, for a fixed $Q$, has this property;
this result can be used to show that perceptrons are consistent whenever $Q \log(n)/n \xrightarrow{n \to +\infty} 0$ [Farago and Lugosi, 1993, Devroye et al., 1996].
Empirical error minimization
Given i.i.d. observations $(X_i, Y_i)$ of $(X, Y)$, how do we choose the weights $w$?
Standard approach: minimize the empirical $L^2$ risk:
$$\hat{R}_n(w) = \sum_{i=1}^{n} [f_w(X_i) - Y_i]^2,$$
with
$Y_i \in \mathbb{R}$ for the regression case;
$Y_i \in \{0, 1\}$ for the classification case, with the associated decision rule $x \mapsto \mathbb{1}_{\{f_w(x) \geq 1/2\}}$.
But $\hat{R}_n(w)$ is not convex in $w$ ⇒ a general (non-convex) optimization problem.
Optimization with gradient descent
Method: initialize the weights $w(0) \in \mathbb{R}^{Qp + 2Q + 1}$ (randomly or with some prior knowledge), then iterate.
Batch approach: for $t = 1, \ldots, T$,
$$w(t+1) = w(t) - \mu(t)\, \nabla_w \hat{R}_n(w(t)).$$
Online (or stochastic) approach: write
$$\hat{R}_n(w) = \sum_{i=1}^{n} \underbrace{[f_w(X_i) - Y_i]^2}_{= E_i(w)}$$
and, for $t = 1, \ldots, T$, randomly pick $i \in \{1, \ldots, n\}$ and update:
$$w(t+1) = w(t) - \mu(t)\, \nabla_w E_i(w(t)).$$
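A minimal R sketch of the batch version, on a one-hidden-layer perceptron for a 1D regression problem; finite-difference gradients are used here only to keep the code short (backpropagation, presented below, computes the same gradient analytically), and convergence is to a local minimum at best:
set.seed(4)
X <- runif(100)
Y <- sin(1 / (X + 0.1)) + rnorm(100, sd = 0.1)
Q <- 5
risk <- function(w) {   # empirical L2 risk of a one-hidden-layer MLP
  W1  <- w[1:Q]                   # first-layer weights (p = 1 here)
  w10 <- w[(Q + 1):(2 * Q)]       # first-layer biases
  w2  <- w[(2 * Q + 1):(3 * Q)]   # second-layer weights
  w20 <- w[3 * Q + 1]             # second-layer bias
  H <- 1 / (1 + exp(-(outer(X, W1) + rep(w10, each = length(X)))))
  mean((drop(H %*% w2) + w20 - Y)^2)
}
num_grad <- function(w, eps = 1e-6)   # central finite-difference gradient
  sapply(seq_along(w), function(j) {
    e <- replace(numeric(length(w)), j, eps)
    (risk(w + e) - risk(w - e)) / (2 * eps)
  })
w <- rnorm(3 * Q + 1, sd = 0.5)                 # w(0): random initialization
for (t in 1:2000) w <- w - 0.1 * num_grad(w)    # w(t+1) = w(t) - mu * grad
risk(w)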
Discussion about practical choices for this approach
the batch version converges (from an optimization point of view) to a local minimum of the error for a good choice of $\mu(t)$, but convergence can be slow;
the stochastic version is usually very inefficient but is useful for large datasets ($n$ large);
more efficient algorithms exist to solve the optimization task: the one implemented in the R package nnet uses higher-order derivatives (the BFGS algorithm);
in all cases, the solutions returned are, at best, local minima that strongly depend on the initialization: using more than one initialization state is advised.
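A minimal sketch with nnet (one hidden layer: size is $Q$; linout = TRUE requests a linear output unit for regression), keeping the best of several random initializations as advised:
library(nnet)
set.seed(5)
d <- data.frame(x = runif(100))
d$y <- sin(1 / (d$x + 0.1)) + rnorm(100, sd = 0.1)
fits <- lapply(1:10, function(i)   # 10 random initializations
  nnet(y ~ x, data = d, size = 5, linout = TRUE, maxit = 500, trace = FALSE))
best <- fits[[which.min(sapply(fits, function(f) f$value))]]   # lowest fitted risk
mean((predict(best, d) - d$y)^2)   # training MSE of the retained fit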
Gradient backpropagation method [Rumelhart and Mc Clelland, 1986]
The gradient backpropagation principle is used to easily calculate gradients in perceptrons (and in other types of neural networks). This way, stochastic gradient descent alternates:
a forward step, which aims at calculating the outputs for all observations $X_i$ given a value of the weights $w$;
a backward step, in which the gradient backpropagation principle is used to obtain the gradient for the current weights $w$.
Initialization and stopping of the training algorithm
1 How to initialize the weights? Standard choices: $w^{(1)}_{jk} \sim \mathcal{N}(0, 1/\sqrt{p})$ and $w^{(2)}_k \sim \mathcal{N}(0, 1/\sqrt{Q})$. In the R package nnet, weights are sampled uniformly in $[-0.5, 0.5]$, or in $\left[ -\frac{1}{\max_i X_i^{(j)}}, \frac{1}{\max_i X_i^{(j)}} \right]$ if $X^{(j)}$ is large.
2 When to stop the algorithm (gradient descent or alike)? Standard choices:
a bounded number of iterations $T$;
a target value of the error $\hat{R}_n(w)$;
a target value of the evolution $\hat{R}_n(w(t)) - \hat{R}_n(w(t+1))$.
In the R package nnet, a combination of the three criteria is used and is tunable.
Strategies to avoid overfitting
properly tune $Q$ with a CV or bootstrap estimation of the generalization ability of the method;
early stopping: for $Q$ large enough, use a part of the data as a validation set and stop the training (gradient descent) when the empirical error calculated on this dataset starts to increase;
weight decay: for $Q$ large enough, penalize the empirical risk with a function of the weights, e.g., $\hat{R}_n(w) + \lambda\, w^T w$;
noise injection: modify the input data with random noise during the training.
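A minimal sketch of the first and third strategies with nnet and the tune wrappers from e1071 (assuming, as for tune.svm above, that tune.nnet accepts the formula interface; d is the data frame from the previous sketch): $Q$ (size) and the weight-decay penalty $\lambda$ (decay) are selected by cross-validation:
library(e1071)
library(nnet)
set.seed(6)
tuned <- tune.nnet(y ~ x, data = d, size = c(2, 5, 10),
                   decay = c(0, 0.01, 0.1),   # weight-decay penalty lambda
                   linout = TRUE, trace = FALSE,
                   tunecontrol = tune.control(sampling = "cross", cross = 10))
summary(tuned)
tuned$best.parameters   # retained (size, decay) pair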
References
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337-404.
Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press, New York, USA.
Breiman, L. (2001). Random forests. Machine Learning, 45(1):5-32.
Breiman, L., Friedman, J., Olsen, R., and Stone, C. (1984). Classification and Regression Trees. Chapman and Hall, Boca Raton, Florida, USA.
Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory for Pattern Recognition. Springer-Verlag, New York, NY, USA.
Farago, A. and Lugosi, G. (1993). Strong universal consistency of neural network classifiers. IEEE Transactions on Information Theory, 39(4):1146-1151.
Györfi, L., Kohler, M., Krzyżak, A., and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer-Verlag, New York, NY, USA.
Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251-257.
Hornik, K. (1993). Some new results on neural network approximation. Neural Networks, 6(8):1069-1072.
Mc Culloch, W. and Pitts, W. (1943). A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5(4):115-133.
Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143-195.
Ripley, B. (1996). Pattern Recognition and Neural Networks. Cambridge University Press.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386-408.
Rosenblatt, F. (1962). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington, DC, USA.
Rumelhart, D. and Mc Clelland, J. (1986). Parallel Distributed Processing: Exploration in the MicroStructure of Cognition. MIT Press, Cambridge, MA, USA.
Steinwart, I. (2002). Support vector machines are universally consistent. Journal of Complexity, 18:768-791.
Stinchcombe, M. (1999). Neural network approximation of continuous functionals and continuous functions on compactifications. Neural Networks, 12(3):467-477.
Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, New York, USA.