An Introduction to Neural Networks
Nathalie Vialaneix & Hyphen
nathalie.vialaneix@inra.fr
http://www.nathalievialaneix.eu &
https://www.hyphen-stat.com/en/

Outline
1 Statistical / machine / automatic learning
   Background and notations
   Underfitting / Overfitting
   Consistency
   Avoiding overfitting
2 (non deep) Neural networks
   Presentation of multi-layer perceptrons
   Theoretical properties of perceptrons
   Learning perceptrons
   Learning in practice
3 Deep neural networks
   Overview
   CNN

Background
Purpose: predict Y from X.
What we have: n observations of (X, Y): (x_1, y_1), ..., (x_n, y_n).
What we want: estimate the unknown Y for new observations of X: x_{n+1}, ..., x_m.

Basics
From the observations (x_i, y_i)_i, we define a machine Φ_n such that:
    ŷ_new = Φ_n(x_new).

Underfitting / Overfitting (sous-apprentissage / sur-apprentissage)
[Figure: a target function x → y and three estimations of it from the observations]
  - first estimation from the observations: underfitting;
  - second estimation from the observations: accurate estimation;
  - third estimation from the observations: overfitting.

Errors
Training error (measures the accuracy on the observations):
  - if y is a factor: misclassification rate
        #{i : ŷ_i ≠ y_i, i = 1, ..., n} / n
  - if y is numeric: mean squared error (MSE)
        (1/n) Σ_{i=1}^{n} (ŷ_i − y_i)²
    or root mean squared error (RMSE), or pseudo-R²: 1 − MSE / Var((y_i)_i).
Test error: a way to prevent overfitting (it estimates the generalization error) is simple validation (a minimal R sketch is given below):
  1 split the data into training / test sets (usually 80% / 20%);
  2 train Φ_n on the training dataset;
  3 compute the test error on the remaining data.
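
A minimal sketch of simple validation with the R package nnet (the data frame `dat`, its response `y` and predictor `x` are hypothetical toy data built here only for illustration):

```r
library(nnet)                           # one-hidden-layer perceptrons
set.seed(42)

## hypothetical toy data: y is a noisy function of x
dat <- data.frame(x = runif(200))
dat$y <- sin(1 / (dat$x + 0.1)) + rnorm(200, sd = 0.1)

## 1. split the data into training / test sets (80% / 20%)
train_id <- sample(nrow(dat), size = 0.8 * nrow(dat))
train <- dat[train_id, ]
test  <- dat[-train_id, ]

## 2. train the machine Phi_n on the training set
phi_n <- nnet(y ~ x, data = train, size = 5, linout = TRUE, trace = FALSE)

## 3. compute the training and test errors (MSE)
mse <- function(y, yhat) mean((y - yhat)^2)
mse(train$y, predict(phi_n, train))     # training error
mse(test$y,  predict(phi_n, test))      # test error (estimates the generalization error)
```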

Practical consequences
A machine (or an algorithm) is a method that:
  - starts from a given set of prediction functions C;
  - uses the observations (x_i, y_i)_i to pick the "best" prediction function according to what has already been observed.
This is often done by estimating some parameters (e.g., coefficients in linear models, weights in neural networks, ...).
C can depend on user choices (e.g., architecture, number of neurons in a neural network) → hyper-parameters:
  - parameters: learned from the data;
  - hyper-parameters: cannot be learned from the data.
Hyper-parameters often control the richness of C: they must be carefully tuned to ensure good accuracy and to avoid overfitting.

Model selection / Fair error evaluation
Several strategies can be used to estimate the error of a set C and to select the best family of models C (tuning of the hyper-parameters):
  - split the data into a training/test set (∼ 67% / 33%): the model is estimated on the training dataset and a test error is computed on the remaining observations;
  - cross validation (a minimal sketch follows this slide): split the data into L folds; for each fold l, train a model Φ^{−l} without the data of that fold and compute its error err(Φ^{−l}, fold l) on the data of the fold; the average of these L errors is the cross-validation error,
        CV error = (1/L) Σ_{l=1}^{L} err(Φ^{−l}, fold l);
  - out-of-bag error: create B bootstrap samples (of size n, drawn with replacement from the original sample) and compute a prediction based on the out-of-bag samples (see next slide).
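
A minimal sketch of L-fold cross validation used to tune the number of hidden neurons Q (reusing the hypothetical data frame `dat` built in the earlier sketch; the grid of Q values is arbitrary):

```r
library(nnet)
set.seed(42)

cv_error <- function(dat, Q, L = 5) {
  folds <- sample(rep(1:L, length.out = nrow(dat)))         # random fold assignment
  errs <- sapply(1:L, function(l) {
    fit <- nnet(y ~ ., data = dat[folds != l, ], size = Q,  # train without fold l
                linout = TRUE, trace = FALSE)
    mean((dat$y[folds == l] - predict(fit, dat[folds == l, ]))^2)  # error on fold l
  })
  mean(errs)                                                # average of the L fold errors
}

## tune Q over a small grid
sapply(c(2, 5, 10), function(Q) cv_error(dat, Q))
```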

Out-of-bag observations, prediction and error
OOB (out-of-bag) error: error based on the observations not included in the "bag":
  - B bootstrap samples T^1, ..., T^B are drawn from the observations (x_i, y_i)_i;
  - each bootstrap sample T^b is used to train a prediction algorithm Φ^b;
  - the OOB prediction for observation i is obtained from the machines Φ^b whose bootstrap sample T^b does not contain i;
  - the OOB error compares these OOB predictions with the observed y_i.
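
A minimal sketch of an out-of-bag error estimate for a perceptron (again on the hypothetical `dat`; the number of bags B, the hidden-layer size and the aggregation by averaging are illustrative choices):

```r
library(nnet)
set.seed(42)
B <- 25
oob_pred <- matrix(NA, nrow = nrow(dat), ncol = B)
for (b in 1:B) {
  bag <- sample(nrow(dat), replace = TRUE)            # bootstrap sample T^b
  fit <- nnet(y ~ ., data = dat[bag, ], size = 5,     # machine Phi^b trained on the bag
              linout = TRUE, trace = FALSE)
  oob <- setdiff(seq_len(nrow(dat)), bag)             # observations not in the bag
  oob_pred[oob, b] <- predict(fit, dat[oob, ])
}
phi_oob <- rowMeans(oob_pred, na.rm = TRUE)           # OOB prediction: average over the bags not containing i
mean((dat$y - phi_oob)^2, na.rm = TRUE)               # OOB error
```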

Other model selection methods and strategies to avoid overfitting
Main idea: penalize the training error with a term that represents the complexity of the model.
  - Model selection with an explicit penalization strategy: minimize (approximately) error + complexity.
    Example: the AIC/BIC criteria in linear models, e.g., BIC = −2L + log(n) × p (with L the log-likelihood and p the number of parameters).
  - Avoiding overfitting with an implicit penalization strategy: constrain the parameters w to take their values within a reduced subset,
        min_w error(w) + λ ‖w‖,
    with w the (learned) parameters of the model and ‖·‖ typically the L1 or L2 norm.
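
A minimal illustration of the explicit penalization idea with the built-in AIC/BIC functions of R, on a simple linear model fitted to the hypothetical `dat` from the earlier sketch:

```r
lin <- lm(y ~ x, data = dat)
AIC(lin)   # -2 * logLik + 2 * (number of estimated parameters)
BIC(lin)   # -2 * logLik + log(n) * (number of estimated parameters)
```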

Outline (reminder)
Part 2, (non deep) Neural networks: presentation of multi-layer perceptrons, theoretical properties of perceptrons, learning perceptrons, learning in practice.

What are (artificial) neural networks?
Common properties:
  - (artificial) "neural networks" is a general name for supervised and unsupervised methods developed in (vague) analogy to the brain;
  - they are combinations (networks) of simple elements (neurons).
[Figure: example of graphical representation, a network of neurons connecting the INPUTS to the OUTPUTS]

Different types of neural networks
A neural network is defined by:
  1 the network structure;
  2 the neuron type.
Standard examples:
  - Multilayer perceptrons (MLP, perceptron multi-couches): dedicated to supervised problems (classification and regression);
  - Radial basis function networks (RBF): same purpose, but based on local smoothing;
  - Self-organizing maps (SOM, also sometimes called Kohonen maps) or topographic maps: dedicated to unsupervised problems (clustering), self-organized;
  - ...
In this talk, the focus is on MLP.

MLP: Advantages / Drawbacks
Advantages:
  - classification OR regression (i.e., Y can be a numeric variable or a factor);
  - non-parametric method: flexible;
  - good theoretical properties.
Drawbacks:
  - hard to train (high computational cost, especially when the input dimension is large);
  - they overfit easily;
  - "black box" models (hard to interpret).

References
Advised references:
  - [Bishop, 1995, Ripley, 1996]: an overview of the topic from a learning (more than statistical) perspective;
  - [Devroye et al., 1996, Györfi et al., 2002]: dedicated chapters present the statistical properties of perceptrons.

Analogy to the brain
  1 a neuron collects signals from neighboring neurons through its dendrites;
  2 when the total signal is above a given threshold, the neuron is activated...
  3 ... and a signal is sent to other neurons through the axon.
Connections which frequently lead to activating a neuron are reinforced (they tend to have an increasing impact on the destination neuron).

First model of an artificial neuron
[Mc Culloch and Pitts, 1943, Rosenblatt, 1958, Rosenblatt, 1962]
The neuron computes a weighted sum of its inputs x = (x(1), ..., x(p)), with weights w1, ..., wp and a bias w0, and fires when this sum is nonnegative:
    f : x ∈ R^p → 1_{Σ_{j=1}^{p} wj x(j) + w0 ≥ 0}.
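
A minimal sketch of this threshold unit in R (the input and the weights are arbitrary values chosen for illustration):

```r
## f(x) = 1 if sum_j wj * x^(j) + w0 >= 0, and 0 otherwise (Heaviside activation)
perceptron_unit <- function(x, w, w0) {
  as.numeric(sum(w * x) + w0 >= 0)
}
perceptron_unit(x = c(0.2, -1.5, 0.7), w = c(1, 0.5, -2), w0 = 0.1)
```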

(artificial) Perceptron
Layers:
  - MLP have one input layer (x ∈ R^p), one output layer (y ∈ R, or y taking values in {1, ..., K − 1}) and several hidden layers;
  - no connections within a layer;
  - connections only between two consecutive layers (feedforward).
[Figure: example for regression (y ∈ R), a 2-hidden-layer MLP connecting the INPUTS x = (x(1), ..., x(p)) to Layer 1 (weights w(1)_jk), then to Layer 2, then to the OUTPUT y]

A neuron in MLP
A neuron receives inputs v1, v2, v3, ... (the outputs of the previous layer), computes the weighted sum w1 v1 + w2 v2 + w3 v3 + w0 (w0 is the bias, biais) and passes it through an activation function.

Standard activation functions (fonctions de transfert / d'activation; a minimal R sketch follows this slide):
  - biologically inspired: the Heaviside function
        h(z) = 0 if z < 0, and 1 otherwise;
    main issue with the Heaviside function: it is not continuous!
  - identity: h(z) = z;
    but the identity activation function gives a linear model if used with one hidden layer: not flexible enough;
  - logistic function: h(z) = 1 / (1 + exp(−z));
  - rectified linear unit (ReLU), another popular activation function (useful to model positive real numbers): h(z) = max(0, z);
  - general sigmoid: a nondecreasing function h : R → R such that
        lim_{z→+∞} h(z) = 1 and lim_{z→−∞} h(z) = 0.
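
A minimal R sketch of the activation functions listed above (the quick plot at the end is just a visual check):

```r
heaviside <- function(z) as.numeric(z >= 0)     # 0 if z < 0, 1 otherwise
identity_act <- function(z) z                   # identity
logistic <- function(z) 1 / (1 + exp(-z))       # logistic sigmoid
relu <- function(z) pmax(0, z)                  # rectified linear unit
curve(logistic, from = -5, to = 5)              # visual check of one of them
```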

Focus on one-hidden-layer perceptrons (R package nnet)
Regression case: with Q neurons on the hidden layer, the perceptron maps x = (x(1), ..., x(p)) to
    Φ(x) = Σ_{k=1}^{Q} w(2)_k h_k( xᵀ w(1)_k + w(0)_k ) + w(2)_0,
with h_k a (logistic) sigmoid, w(1)_k and w(0)_k the weights and bias of hidden neuron k, and w(2) the weights of the output layer.
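
A minimal sketch of this forward pass in R (the dimensions and the random weights are purely illustrative):

```r
set.seed(1)
p <- 3; Q <- 4
W1 <- matrix(rnorm(Q * p), nrow = Q)      # hidden-layer weights w(1)_k (one row per neuron)
b1 <- rnorm(Q)                            # hidden-layer biases w(0)_k
w2 <- rnorm(Q)                            # output weights w(2)_k
b2 <- rnorm(1)                            # output bias w(2)_0
logistic <- function(z) 1 / (1 + exp(-z))

phi <- function(x) sum(w2 * logistic(W1 %*% x + b1)) + b2
phi(c(0.5, -1, 2))
```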

Binary classification case (y ∈ {0, 1}): the same network outputs
    p(x) = h0( Σ_{k=1}^{Q} w(2)_k h_k( xᵀ w(1)_k + w(0)_k ) + w(2)_0 ),
with h0 the logistic sigmoid or the identity, and the decision is made with
    Φ(x) = 0 if p(x) < 1/2, and 1 otherwise.

Extension to any classification problem in {1, ..., K − 1}: straightforward extension to multiple classes with a multiple-output perceptron (number of output units equal to K) and a maximum-probability rule for the decision.
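
A minimal sketch of the binary classification case with nnet (hypothetical two-class toy data; the decision rule thresholds the fitted output at 1/2):

```r
library(nnet)
set.seed(42)
n <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- factor(as.numeric(df$x1 + df$x2 + rnorm(n, sd = 0.3) > 0))

clf <- nnet(y ~ x1 + x2, data = df, size = 5, trace = FALSE)
p_hat <- predict(clf, df)                  # fitted output p(x)
y_hat <- as.numeric(p_hat >= 0.5)          # decision: class 1 if p(x) >= 1/2
mean(y_hat != (as.numeric(df$y) - 1))      # training misclassification rate
```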

Theoretical properties of perceptrons
This section answers two questions:
  1 can we approximate any function g : [0, 1]^p → R arbitrarily well with a perceptron?
  2 when a perceptron is trained with i.i.d. observations from an arbitrary random variable pair (X, Y), is it consistent? (i.e., does it reach the minimum possible error asymptotically when the number of observations grows to infinity?)

Illustration of the universal approximation property
Simple example:
  - a function to approximate: g : x ∈ [0, 1] → sin(1 / (x + 0.1));
  - this function is approximated (how this is performed is explained later in this talk) with MLPs having different numbers of neurons on their hidden layer (a minimal R sketch follows).
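
A minimal sketch of this illustration with nnet (the grid of hidden-layer sizes and the number of iterations are arbitrary choices):

```r
library(nnet)
set.seed(42)
grid <- data.frame(x = seq(0, 1, length.out = 200))
grid$y <- sin(1 / (grid$x + 0.1))                    # the function g to approximate

for (Q in c(2, 5, 20)) {
  fit <- nnet(y ~ x, data = grid, size = Q, linout = TRUE,
              maxit = 500, trace = FALSE)
  cat("Q =", Q, "  training MSE =", mean((grid$y - predict(fit, grid))^2), "\n")
}
```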

Sketch of the main properties of 1-hidden-layer MLP
  1 universal approximation [Pinkus, 1999, Devroye et al., 1996, Hornik, 1991, Hornik, 1993, Stinchcombe, 1999]: 1-hidden-layer MLP can approximate arbitrarily well any (regular enough) function (but, in theory, the number of neurons may have to grow arbitrarily for that);
  2 consistency [Farago and Lugosi, 1993, Devroye et al., 1996, White, 1990, White, 1991, Barron, 1994, McCaffrey and Gallant, 1994]: 1-hidden-layer MLP are universally consistent provided the number of neurons grows to +∞ slower than O(n / log n).
⇒ the number of neurons controls the richness of the MLP and has to increase with n.

Empirical error minimization
Given i.i.d. observations (x_i, y_i)_i of (X, Y), how do we choose the weights w of the perceptron Φ_w?
Standard approach: minimize the empirical L2 risk
    L_n(w) = Σ_{i=1}^{n} [Φ_w(x_i) − y_i]²,
with
  - y_i ∈ R for the regression case;
  - y_i ∈ {0, 1} for the classification case, with the associated decision rule x → 1_{Φ_w(x) ≥ 1/2}.
But: L_n(w) is not convex in w ⇒ a general (non-convex) optimization problem.

Optimization with gradient descent
Method: initialize the weights w(0) ∈ R^{Qp+2Q+1} (randomly or with some prior knowledge), then iterate:
  - batch approach: for t = 1, ..., T,
        w(t + 1) = w(t) − μ(t) ∇_w L_n(w(t));
  - online (or stochastic) approach: write
        L_n(w) = Σ_{i=1}^{n} [Φ_w(x_i) − y_i]²,   where the i-th term is denoted E_i(w),
    and, for t = 1, ..., T, randomly pick i ∈ {1, ..., n} and update
        w(t + 1) = w(t) − μ(t) ∇_w E_i(w(t)).
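
A minimal sketch of the two update rules, shown on a simple least-squares problem rather than on a full perceptron (the gradient of each term E_i is then available in closed form; the data, learning rates and numbers of iterations are illustrative):

```r
set.seed(42)
n <- 100
x <- runif(n); y <- 2 * x + 1 + rnorm(n, sd = 0.1)

## gradient of E_i(w) = (w[1] + w[2] * x_i - y_i)^2 with respect to w = (intercept, slope)
grad_i <- function(w, i) {
  r <- w[1] + w[2] * x[i] - y[i]
  2 * r * c(1, x[i])
}

## batch gradient descent: use the full (averaged) gradient at each step
w <- c(0, 0); mu <- 0.05
for (t in 1:500) w <- w - mu * rowSums(sapply(1:n, grad_i, w = w)) / n
w

## stochastic gradient descent: one randomly picked observation per step
w <- c(0, 0)
for (t in 1:5000) w <- w - (mu / sqrt(t)) * grad_i(w, sample(n, 1))
w
```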

Discussion of practical choices for this approach
  - the batch version converges (from an optimization point of view) to a local minimum of the error for a good choice of μ(t), but convergence can be slow;
  - the stochastic version is usually less efficient, but it is useful for large datasets (large n);
  - more efficient algorithms exist to solve the optimization task: the one implemented in the R package nnet uses higher-order derivatives (BFGS algorithm), but this is a very active area of research in which many improvements have been made recently;
  - in all cases, the solutions returned are, at best, local minima which strongly depend on the initialization: using more than one initialization state is advised, as well as checking the variability among a few runs.

Initialization and stopping of the training algorithm
  1 How to initialize the weights? Standard choices: w(1)_jk ∼ N(0, 1/√p) and w(2)_k ∼ N(0, 1/√Q).
    In the R package nnet, weights are sampled uniformly in [−0.5, 0.5], or in [−1/max_i X(j)_i, 1/max_i X(j)_i] if X(j) is large.
  2 When to stop the algorithm (gradient descent or alike)? Standard choices:
    - a bounded number of iterations T;
    - a target value for the error L̂_n(w);
    - a target value for the improvement L̂_n(w(t)) − L̂_n(w(t + 1)).
    In the R package nnet, a combination of the three criteria is used and can be tuned (see the sketch below).
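
A minimal sketch of how these stopping criteria map to arguments of nnet (again on the hypothetical `dat`; the values given to the arguments are illustrative):

```r
library(nnet)
fit <- nnet(y ~ ., data = dat, size = 5, linout = TRUE,
            maxit  = 200,    # bounded number of iterations T
            abstol = 1e-4,   # stop if the fit criterion falls below this value
            reltol = 1e-8)   # stop if the criterion cannot be reduced by this relative amount
```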

Strategies to avoid overfitting
  - properly tune Q with a CV or a bootstrap estimation of the generalization ability of the method;
  - early stopping: for Q large enough, use part of the data as a validation set and stop the training (gradient descent) when the empirical error computed on this dataset starts to increase;
  - weight decay: for Q large enough, penalize the empirical risk with a function of the weights, e.g.,
        L̂_n(w) + λ ‖w‖²
    (see the sketch below);
  - noise injection: modify the input data with a random noise during the training.
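
A minimal sketch of weight decay with nnet, where the `decay` argument sets the penalty λ applied to the sum of squared weights (hypothetical `dat`, illustrative values):

```r
library(nnet)
fit_decay <- nnet(y ~ ., data = dat, size = 20, linout = TRUE,
                  decay = 0.01,    # lambda: strength of the weight penalty
                  trace = FALSE)
```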

Outline (reminder)
Part 3, Deep neural networks: overview, CNN.

What is "deep learning"?
Learning with neural networks (as the 2-hidden-layer MLP mapping the INPUTS x = (x(1), ..., x(p)) to the OUTPUT y seen before), except that... the number of neurons is huge!
Image taken from https://towardsdatascience.com/

Main types of deep NN
  - deep MLP;
  - CNN (Convolutional Neural Network): used in image processing;
  - RNN (Recurrent Neural Network): used for data with a temporal sequence (in natural language processing, for instance).
Images taken from [Krizhevsky et al., 2012, Zeiler and Fergus, 2014]

Improvements in learning parameters
  - optimization algorithm: gradient descent is still used, but the choice of the descent step μ(t) has been improved [Ruder, 2016] and mini-batches are also often used to speed up the learning (e.g., ADAM [Kingma and Ba, 2017]);
  - overfitting is avoided with dropout: randomly removing a fraction p of the neurons (and their connections) during each step of the training (a larger number of iterations is needed to converge, but each update is faster) [Srivastava et al., 2014].
[Figure: dropout illustrated on the 2-hidden-layer MLP seen before, from the INPUTS x = (x(1), ..., x(p)) to the OUTPUT y]

Main available implementations
  - Theano (2008-2017): main dev Montreal University (Y. Bengio); language: Python.
  - TensorFlow (2015-...): main dev Google Brain; languages: C++ / Python.
  - Caffe (2013-...): main dev Berkeley University; languages: C++ / Python / Matlab.
  - PyTorch (and autograd) (2016-...): community driven (Facebook, Google, ...); language: Python (with strong GPU accelerations).
  - Keras (2017-...): main dev François Chollet / RStudio; high-level API on top of TensorFlow (and CNTK, Theano, ...); languages: Python, R.
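
A minimal sketch of a small dense network with dropout and the ADAM optimizer, using the keras R interface listed above (it assumes keras and a TensorFlow backend are installed; `x_train` and `y_train` are hypothetical training matrices):

```r
library(keras)
model <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu",
              input_shape = ncol(x_train)) %>%
  layer_dropout(rate = 0.5) %>%            # dropout: randomly removes 50% of the units
  layer_dense(units = 1)                   # single output unit (regression)
model %>% compile(loss = "mse", optimizer = "adam")
model %>% fit(x_train, y_train, epochs = 20, batch_size = 32)   # mini-batches + ADAM
```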

General architecture of CNN
Image taken from the article https://doi.org/10.3389/fpls.2017.02235

Convolutional layer
Basic principle: slide a small filter (kernel) over the input image and, at each position, compute the weighted sum of the values it covers; the resulting feature map is passed to the next layer.
Images taken from https://towardsdatascience.com
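
A minimal sketch of this principle in base R: a 3×3 filter is slid over a small image and the weighted sum is computed at each position (the image and filter values are arbitrary):

```r
convolve2d <- function(img, kernel) {
  kh <- nrow(kernel); kw <- ncol(kernel)
  out <- matrix(0, nrow(img) - kh + 1, ncol(img) - kw + 1)
  for (i in seq_len(nrow(out))) {
    for (j in seq_len(ncol(out))) {
      out[i, j] <- sum(img[i:(i + kh - 1), j:(j + kw - 1)] * kernel)   # weighted sum under the filter
    }
  }
  out
}

img <- matrix(runif(36), 6, 6)                                  # a tiny 6 x 6 "image"
edge_filter <- matrix(c(-1, -1, -1, 0, 0, 0, 1, 1, 1), 3, 3)    # a simple edge-detection filter
convolve2d(img, edge_filter)                                    # 4 x 4 feature map
```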

Pooling layer
Basic principle: select a subsample of the previous layer values.
Generally used: max pooling.
Image taken from Wikimedia Commons, by Aphex34
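
A minimal sketch of 2×2 max pooling, applied to the feature map produced by the previous sketch:

```r
max_pool <- function(fm, size = 2) {
  rows <- seq(1, nrow(fm) - size + 1, by = size)
  cols <- seq(1, ncol(fm) - size + 1, by = size)
  out <- matrix(0, length(rows), length(cols))
  for (i in seq_along(rows)) for (j in seq_along(cols)) {
    out[i, j] <- max(fm[rows[i]:(rows[i] + size - 1),
                        cols[j]:(cols[j] + size - 1)])   # keep only the maximum of each block
  }
  out
}
max_pool(convolve2d(img, edge_filter))                   # 2 x 2 pooled map
```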

References

Barron, A. (1994). Approximation and estimation bounds for artificial neural networks. Machine Learning, 14:115-133.

Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press, New York, USA.

Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, NY, USA.

Farago, A. and Lugosi, G. (1993). Strong universal consistency of neural network classifiers. IEEE Transactions on Information Theory, 39(4):1146-1151.

Györfi, L., Kohler, M., Krzyżak, A., and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer-Verlag, New York, NY, USA.

Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251-257.

Hornik, K. (1993). Some new results on neural network approximation. Neural Networks, 6(8):1069-1072.

Kingma, D. P. and Ba, J. (2017). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Pereira, F., Burges, C., Bottou, L., and Weinberger, K., editors, Proceedings of Neural Information Processing Systems (NIPS 2012), volume 25.

Mc Culloch, W. and Pitts, W. (1943). A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5(4):115-133.

McCaffrey, D. and Gallant, A. (1994). Convergence rates for single hidden layer feedforward networks. Neural Networks, 7(1):115-133.

Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143-195.

Ripley, B. (1996). Pattern Recognition and Neural Networks. Cambridge University Press.

Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386-408.

Rosenblatt, F. (1962). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington, DC, USA.

Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.

Rumelhart, D. and Mc Clelland, J. (1986). Parallel Distributed Processing: Exploration in the MicroStructure of Cognition. MIT Press, Cambridge, MA, USA.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929-1958.

Stinchcombe, M. (1999). Neural network approximation of continuous functionals and continuous functions on compactifications. Neural Networks, 12(3):467-477.

White, H. (1990). Connectionist nonparametric regression: multilayer feedforward networks can learn arbitrary mappings. Neural Networks, 3:535-549.

White, H. (1991). Nonparametric estimation of conditional quantiles using neural networks. In Proceedings of the 23rd Symposium of the Interface: Computing Science and Statistics, pages 190-199, Alexandria, VA, USA. American Statistical Association.

Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional networks. In Proceedings of European Conference on Computer Vision (ECCV 2014).