
An introduction to neural networks

CATI Bios4Biol, March 30th, 2020 (remote conference)


  1. An introduction to neural networks. Nathalie Vialaneix & Hyphen, nathalie.vialaneix@inra.fr, http://www.nathalievialaneix.eu & https://www.hyphen-stat.com/en/. CATI Bios4Biol, March 30th, 2020.
  2. Outline
     1. Statistical / machine / automatic learning: background and notations; underfitting / overfitting; consistency; avoiding overfitting.
     2. (non deep) Neural networks: presentation of multi-layer perceptrons; theoretical properties of perceptrons; learning perceptrons; learning in practice.
     3. Deep neural networks: overview; CNN.
  3. Background
     - Purpose: predict Y from X.
     - What we have: $n$ observations of (X, Y): $(x_1, y_1), \ldots, (x_n, y_n)$.
     - What we want: estimate the unknown Y for new observations of X: $x_{n+1}, \ldots, x_m$.
  4. Basics
     - From $(x_i, y_i)_i$, definition of a machine $\Phi_n$ such that $\hat y_{new} = \Phi_n(x_{new})$.
     - If Y is numeric, $\Phi_n$ is called a regression function (fonction de régression); if Y is a factor, $\Phi_n$ is called a classifier (classifieur).
     - $\Phi_n$ is said to be trained or learned from the observations $(x_i, y_i)_i$.
     - Desirable properties: accuracy to the observations (predictions made on known data are close to the observed values) and generalization ability (predictions made on new data are also accurate). Conflicting objectives!
  5. Underfitting / Overfitting (sous / sur-apprentissage). Sequence of illustrative figures: the function $x \mapsto y$ to be estimated; observations we might have; observations we do have; a first estimation from the observations (underfitting); a second estimation (accurate estimation); a third estimation (overfitting); summary.
  6. Errors
     - Training error (measures the accuracy to the observations):
       - if y is a factor: misclassification rate $\frac{\#\{i :\ \hat y_i \neq y_i,\ i = 1, \ldots, n\}}{n}$;
       - if y is numeric: mean square error (MSE) $\frac{1}{n} \sum_{i=1}^n (\hat y_i - y_i)^2$, or root mean square error (RMSE), or pseudo-$R^2$: $1 - \mathrm{MSE} / \mathrm{Var}((y_i)_i)$.
     - Test error: a way to prevent overfitting (it estimates the generalization error) is simple validation:
       1. split the data into training / test sets (usually 80% / 20%);
       2. train $\Phi_n$ from the training dataset;
       3. compute the test error from the remaining data.
     (A minimal R sketch of this procedure follows below.)
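To make the errors and the simple-validation procedure concrete, here is a minimal R sketch on simulated data (the dataset, the 80/20 split and the linear model used as a placeholder machine $\Phi_n$ are illustrative assumptions, not part of the original slides):

```r
set.seed(42)

## simulated regression data: y = sin(1 / (x + 0.1)) + noise
n <- 200
x <- runif(n)
y <- sin(1 / (x + 0.1)) + rnorm(n, sd = 0.1)
d <- data.frame(x = x, y = y)

## simple validation: 80% training / 20% test split
in_train <- sample(seq_len(n), size = 0.8 * n)
train <- d[in_train, ]
test  <- d[-in_train, ]

## any machine Phi_n would do; a linear model is used here only as a placeholder
phi_n <- lm(y ~ x, data = train)

## training and test errors (MSE, RMSE, pseudo-R2)
mse <- function(y_hat, y_obs) mean((y_hat - y_obs)^2)
train_mse <- mse(predict(phi_n, train), train$y)
test_mse  <- mse(predict(phi_n, test),  test$y)

c(train_MSE = train_mse,
  test_MSE  = test_mse,
  test_RMSE = sqrt(test_mse),
  pseudo_R2 = 1 - test_mse / var(test$y))  # var() uses n - 1; close enough here
```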
  7. Example. Sequence of illustrative figures: the observations; the training / test datasets; the training / test errors; summary.
  8. Practical consequences
     - A machine (or an algorithm) is a method that starts from a given set of prediction functions $\mathcal{C}$ and uses the observations $(x_i, y_i)_i$ to pick the "best" prediction function according to what has already been observed. This is often done by estimating some parameters (e.g., coefficients in linear models, weights in neural networks, ...).
     - $\mathcal{C}$ can depend on user choices (e.g., architecture, number of neurons in neural networks) $\Rightarrow$ hyper-parameters. Parameters are learned from the data; hyper-parameters cannot be learned from the data.
     - Hyper-parameters often control the richness of $\mathcal{C}$: they must be carefully tuned to ensure a good accuracy and avoid overfitting.
  9. Model selection / Fair error evaluation. Several strategies can be used to estimate the error of a set $\mathcal{C}$ and to select the best family of models $\mathcal{C}$ (tuning of the hyper-parameters):
     - split the data into a training / test set ($\sim$ 67% / 33%): the model is estimated with the training dataset and a test error is computed with the remaining observations;
     - cross validation: split the data into $L$ folds. For each fold, train a model without the data included in the current fold and compute the error with the data included in the fold; the average of these $L$ errors is the cross validation error, $\mathrm{CV\ error} = \frac{1}{L} \sum_{l=1}^{L} \mathrm{err}(\Phi^{-l}, \mathrm{fold}\ l)$, where $\Phi^{-l}$ is trained without fold $l$;
     - out-of-bag error: create $B$ bootstrap samples (size $n$, taken with replacement in the original sample) and compute a prediction based on out-of-bag samples (see next slide).
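A minimal R sketch of the cross-validation error described above, reusing the simulated data frame `d`, the `mse()` helper and the placeholder linear model from the earlier sketch (all assumptions of these notes, not objects defined in the slides):

```r
## L-fold cross-validation error for a given family of models
L <- 5
folds <- sample(rep(1:L, length.out = nrow(d)))   # random fold assignment

cv_errors <- sapply(1:L, function(l) {
  train_l <- d[folds != l, ]                      # train without fold l
  test_l  <- d[folds == l, ]
  phi_ml  <- lm(y ~ x, data = train_l)            # placeholder machine
  mse(predict(phi_ml, test_l), test_l$y)          # error on fold l
})

cv_error <- mean(cv_errors)                       # (1/L) * sum_l err(Phi^{-l}, fold l)
cv_error
```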
  10. Out-of-bag observations, prediction and error. The OOB (out-of-bag) error is an error based on the observations not included in the "bag": for each bootstrap sample $T^b$, $b = 1, \ldots, B$, a machine $\Phi^b$ is trained on $T^b$; the OOB prediction for an observation $(x_i, y_i)$ is computed from the machines $\Phi^b$ whose bag does not contain $i$, i.e., from the models for which $(x_i, y_i)$ was not used in training.
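A minimal R sketch of an out-of-bag error estimate under the same illustrative assumptions (data frame `d`, `mse()` helper, linear model as placeholder machine; the aggregation by averaging is one common choice):

```r
## out-of-bag (OOB) error with B bootstrap samples
B <- 50
n <- nrow(d)
oob_pred <- matrix(NA_real_, nrow = n, ncol = B)

for (b in 1:B) {
  bag   <- sample(seq_len(n), size = n, replace = TRUE)   # bootstrap sample T^b
  phi_b <- lm(y ~ x, data = d[bag, ])                     # machine trained on the bag
  oob   <- setdiff(seq_len(n), bag)                       # observations not in T^b
  oob_pred[oob, b] <- predict(phi_b, d[oob, ])
}

## OOB prediction for x_i: average over the machines whose bag excludes i
oob_hat   <- rowMeans(oob_pred, na.rm = TRUE)
oob_error <- mse(oob_hat, d$y)
oob_error
```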
  11. Other model selection methods and strategies to avoid overfitting. Main idea: penalize the training error with a quantity that represents the complexity of the model.
     - Model selection with an explicit penalization strategy: minimize ($\sim$) error + complexity. Example: the BIC/AIC criteria in linear models, e.g. $-2L + p \log n$ (with $L$ the log-likelihood and $p$ the number of parameters).
     - Avoiding overfitting with an implicit penalization strategy: constrain the parameters $w$ to take their values within a reduced subset, $\min_w\ \mathrm{error}(w) + \lambda \|w\|$, with $w$ the (learned) parameters of the model and $\|\cdot\|$ typically the $\ell_1$ or $\ell_2$ norm.
  12. What are (artificial) neural networks? Common properties: (artificial) "neural networks" is a general name for supervised and unsupervised methods developed in (vague) analogy to the brain; they are combinations (networks) of simple elements (neurons). Example of graphical representation: a network of neurons connecting INPUTS to OUTPUTS.
  13. Different types of neural networks. A neural network is defined by: 1. the network structure; 2. the neuron type. Standard examples:
     - Multilayer perceptrons (MLP, perceptron multi-couches): dedicated to supervised problems (classification and regression);
     - Radial basis function networks (RBF): same purpose but based on local smoothing;
     - Self-organizing maps (SOM, also sometimes called Kohonen's maps) or topographic maps: dedicated to unsupervised problems (clustering), self-organized;
     - ...
     In this talk, the focus is on MLP.
  14. MLP: Advantages / Drawbacks
     - Advantages: classification OR regression (i.e., Y can be a numeric variable or a factor); non parametric method, hence flexible; good theoretical properties.
     - Drawbacks: hard to train (high computational cost, especially when the dimension $d$ is large); overfit easily; "black box" models (hard to interpret).
  15. References. Advised references: [Bishop, 1995, Ripley, 1996] give an overview of the topic from a learning (more than statistical) perspective; [Devroye et al., 1996, Györfi et al., 2002] present statistical properties of perceptrons in dedicated chapters.
  16. Analogy to the brain
     1. a neuron collects signals from neighboring neurons through its dendrites;
     2. when the total signal is above a given threshold, the neuron is activated;
     3. ... and a signal is sent to other neurons through the axon.
     Connections which frequently lead to activating a neuron are reinforced (they tend to have an increasing impact on the destination neuron).
  17. First model of artificial neuron [Mc Culloch and Pitts, 1943, Rosenblatt, 1958, Rosenblatt, 1962]. The neuron computes a weighted sum of its inputs $x^{(1)}, \ldots, x^{(p)}$ followed by a threshold: $f:\ x \in \mathbb{R}^p \mapsto \mathbb{1}_{\left\{\sum_{j=1}^p w_j x^{(j)} + w_0 \geq 0\right\}}$.
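A one-line R sketch of this threshold unit (the weights chosen below are arbitrary, for illustration only):

```r
## McCulloch-Pitts / Rosenblatt threshold unit: f(x) = 1 if sum_j w_j x_j + w_0 >= 0
threshold_neuron <- function(x, w, w0) as.numeric(sum(w * x) + w0 >= 0)

## example with arbitrary weights
threshold_neuron(x = c(0.2, -1.5, 0.7), w = c(1, 0.5, 2), w0 = -0.3)
```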
  18. (artificial) Perceptron: layers
     - MLP have one input layer ($x \in \mathbb{R}^p$), one output layer ($y \in \mathbb{R}$, or taking values in $\{1, \ldots, K-1\}$) and several hidden layers;
     - no connections within a layer;
     - connections only between two consecutive layers (feedforward).
     Example (regression, $y \in \mathbb{R}$): inputs $x = (x^{(1)}, \ldots, x^{(p)})$, a first hidden layer (weights $w^{(1)}_{jk}$), a second hidden layer, then the output $y$: a 2-hidden-layer MLP.
  19. A neuron in MLP. A neuron receives inputs $v_1, v_2, v_3, \ldots$, computes the weighted sum $\sum_j w_j v_j + w_0$ (where $w_0$ is the bias, biais), and applies an activation function. Standard activation functions (fonctions de transfert / d'activation):
     - biologically inspired: the Heaviside function $h(z) = 0$ if $z < 0$, $1$ otherwise; main issue: it is not continuous;
     - identity: $h(z) = z$; but the identity activation used with one hidden layer gives a linear model, which is not flexible enough;
     - logistic function: $h(z) = \frac{1}{1 + \exp(-z)}$;
     - rectified linear (ReLU), another popular activation function (useful to model positive real numbers): $h(z) = \max(0, z)$;
     - general sigmoid: a nondecreasing function $h: \mathbb{R} \to \mathbb{R}$ such that $\lim_{z \to +\infty} h(z) = 1$ and $\lim_{z \to -\infty} h(z) = 0$.
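The activation functions listed above, written as plain R functions (a direct transcription of the formulas on the slide; the plotting grid is an arbitrary choice):

```r
## standard activation functions
heaviside    <- function(z) ifelse(z < 0, 0, 1)
identity_act <- function(z) z
logistic     <- function(z) 1 / (1 + exp(-z))
relu         <- function(z) pmax(0, z)

## quick visual comparison on a grid
z <- seq(-4, 4, by = 0.01)
plot(z, logistic(z), type = "l", ylim = c(-1, 4), ylab = "h(z)")
lines(z, heaviside(z), lty = 2)
lines(z, relu(z), lty = 3)
legend("topleft", legend = c("logistic", "Heaviside", "ReLU"), lty = 1:3)
```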
  20. Focus on one-hidden-layer perceptrons (R package nnet)
     - Regression case: $\Phi(x) = \sum_{k=1}^{Q} w^{(2)}_k\, h_k\!\left( x^\top w^{(1)}_k + w^{(0)}_k \right) + w^{(2)}_0$, with $h_k$ a (logistic) sigmoid and $Q$ the number of hidden neurons.
     - Binary classification case ($y \in \{0, 1\}$): $\phi(x) = h_0\!\left( \sum_{k=1}^{Q} w^{(2)}_k\, h_k\!\left( x^\top w^{(1)}_k + w^{(0)}_k \right) + w^{(2)}_0 \right)$, with $h_0$ the logistic sigmoid or the identity; decision with $\Phi(x) = 0$ if $\phi(x) < 1/2$ and $1$ otherwise.
     - Extension to any classification problem in $\{1, \ldots, K-1\}$: straightforward extension to multiple classes with a multiple-output perceptron (number of output units equal to $K$) and a maximum probability rule for the decision.
     (See the small nnet example below.)
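A minimal sketch with the R package nnet mentioned on the slide, fitting a one-hidden-layer perceptron with Q = 5 hidden neurons on the simulated regression data `d` from the earlier sketch, and on the iris data for classification (the choice of Q and of the datasets are illustrative assumptions):

```r
library(nnet)

## regression: linout = TRUE gives an identity output unit (h_0 = identity)
fit_reg <- nnet(y ~ x, data = d, size = 5, linout = TRUE, maxit = 500, trace = FALSE)
head(predict(fit_reg, d))

## classification: sigmoid / softmax output and maximum-probability decision rule
fit_cls <- nnet(Species ~ ., data = iris, size = 5, maxit = 500, trace = FALSE)
table(predicted = predict(fit_cls, iris, type = "class"), observed = iris$Species)
```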
  21. Theoretical properties of perceptrons. This section answers two questions:
     1. can we approximate any function $g: [0, 1]^p \to \mathbb{R}$ arbitrarily well with a perceptron?
     2. when a perceptron is trained with i.i.d. observations from an arbitrary random variable pair (X, Y), is it consistent? (i.e., does it reach the minimum possible error asymptotically when the number of observations grows to infinity?)
  22. Illustration of the universal approximation property. Simple example: a function to approximate, $g: x \in [0, 1] \mapsto \sin\!\left(\frac{1}{x + 0.1}\right)$; we try to approximate this function (how this is performed is explained later in this talk) with MLPs having different numbers of neurons on their hidden layer.
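A possible R sketch reproducing this illustration with nnet (the grid of hidden-layer sizes and the plotting choices are assumptions, not taken from the slides):

```r
library(nnet)

## the function to approximate, sampled on [0, 1]
x <- seq(0, 1, length.out = 300)
g <- sin(1 / (x + 0.1))
dat <- data.frame(x = x, g = g)

## fit one-hidden-layer MLPs with increasing numbers of hidden neurons
plot(x, g, type = "l", lwd = 2, ylab = "g(x)")
sizes <- c(2, 5, 10, 20)
for (k in seq_along(sizes)) {
  fit <- nnet(g ~ x, data = dat, size = sizes[k], linout = TRUE,
              maxit = 1000, trace = FALSE)
  lines(x, predict(fit, dat), col = k + 1)
}
legend("topright", legend = paste(sizes, "neurons"), col = 2:5, lty = 1)
```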
  23. Sketch of main properties of 1-hidden-layer MLP
     1. universal approximation [Pinkus, 1999, Devroye et al., 1996, Hornik, 1991, Hornik, 1993, Stinchcombe, 1999]: 1-hidden-layer MLP can approximate arbitrarily well any (regular enough) function (but, in theory, the number of neurons can grow arbitrarily for that);
     2. consistency [Farago and Lugosi, 1993, Devroye et al., 1996, White, 1990, White, 1991, Barron, 1994, McCaffrey and Gallant, 1994]: 1-hidden-layer MLP are universally consistent with a number of neurons that grows to $+\infty$ slower than $O\!\left(\frac{n}{\log n}\right)$.
     $\Rightarrow$ the number of neurons controls the richness of the MLP and has to increase with $n$.
  24. Empirical error minimization. Given i.i.d. observations $(x_i, y_i)_i$ of (X, Y), how to choose the weights $w$? Standard approach: minimize the empirical $L^2$ risk $L_n(w) = \sum_{i=1}^n \left[ f_w(x_i) - y_i \right]^2$, with $y_i \in \mathbb{R}$ for the regression case and $y_i \in \{0, 1\}$ for the classification case (with the associated decision rule $x \mapsto \mathbb{1}_{\{\Phi_w(x) \geq 1/2\}}$). But $L_n(w)$ is not convex in $w$ $\Rightarrow$ general (non convex) optimization problem.
  25. Optimization with gradient descent. Method: initialize the weights $w(0) \in \mathbb{R}^{Qp + 2Q + 1}$ (randomly or with some prior knowledge), then
     - batch approach: for $t = 1, \ldots, T$: $w(t+1) = w(t) - \mu(t)\, \nabla_w L_n(w(t))$;
     - online (or stochastic) approach: write $L_n(w) = \sum_{i=1}^n \underbrace{[f_w(x_i) - y_i]^2}_{=\, E_i(w)}$ and, for $t = 1, \ldots, T$, randomly pick $i \in \{1, \ldots, n\}$ and update $w(t+1) = w(t) - \mu(t)\, \nabla_w E_i(w(t))$.
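A minimal R sketch of the two update rules above. To keep the gradients short, the machine $f_w$ is taken to be a simple linear model rather than a perceptron; the data, step sizes and numbers of iterations are illustrative assumptions:

```r
## batch vs stochastic (online) gradient descent on the empirical L2 risk
## L_n(w) = sum_i [f_w(x_i) - y_i]^2, with a linear machine f_w(x) = w_1 + w_2 * x
set.seed(1)
x <- runif(100)
y <- 2 * x - 1 + rnorm(100, sd = 0.1)

grad_Ln <- function(w) {                 # gradient of the full risk (batch)
  r <- (w[1] + w[2] * x) - y
  2 * c(sum(r), sum(r * x))
}
grad_Ei <- function(w, i) {              # gradient of a single term E_i (online)
  r <- (w[1] + w[2] * x[i]) - y[i]
  2 * c(r, r * x[i])
}

## batch approach: w(t+1) = w(t) - mu(t) * grad L_n(w(t))
w_batch <- c(0, 0)
for (t in 1:500) w_batch <- w_batch - 0.002 * grad_Ln(w_batch)

## online / stochastic approach: pick i at random, use grad E_i only
w_sgd <- c(0, 0)
for (t in 1:5000) {
  i <- sample.int(100, 1)
  w_sgd <- w_sgd - (0.2 / (1 + 0.005 * t)) * grad_Ei(w_sgd, i)  # decreasing step
}

rbind(batch = w_batch, stochastic = w_sgd, target = c(-1, 2))
```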
  26. Discussion about practical choices for this approach
     - the batch version converges (from an optimization point of view) to a local minimum of the error for a good choice of $\mu(t)$, but convergence can be slow;
     - the stochastic version is usually inefficient but is useful for large datasets ($n$ large);
     - more efficient algorithms exist to solve the optimization task: the one implemented in the R package nnet uses higher order derivatives (BFGS algorithm), and this is a very active area of research in which many improvements have been made recently;
     - in all cases, the solutions returned are, at best, local minima that strongly depend on the initialization: using more than one initialization state is advised, as well as checking the variability among a few runs.
  27. Gradient backpropagation method [Rumelhart and Mc Clelland, 1986]. The gradient backpropagation (rétropropagation du gradient) principle is used to compute gradients easily in perceptrons (or in other types of neural networks). Stochastic gradient descent then alternates:
     - a forward step, which computes the outputs for the observations $x_i$ given the current value of the weights $w$;
     - a backward step, in which the gradient backpropagation principle is used to obtain the gradient for the current weights $w$.
  28. Backpropagation in practice (one-hidden-layer perceptron, one observation $(x_i, y_i)$)
     - Initialize the weights.
     - Forward step: for all $k$, compute $a^{(1)}_k = x_i^\top w^{(1)}_k + w^{(0)}_k$, then $z^{(1)}_k = h_k(a^{(1)}_k)$, and finally $a^{(2)} = \sum_{k=1}^{Q} w^{(2)}_k z^{(1)}_k + w^{(2)}_0$, so that $\Phi_w(x_i) = h_0(a^{(2)})$.
     - Backward step: compute $\frac{\partial E_i}{\partial w^{(2)}_k} = \delta^{(2)} z^{(1)}_k$ with $\delta^{(2)} = \frac{\partial E_i}{\partial a^{(2)}} = \frac{\partial [h_0(a^{(2)}) - y_i]^2}{\partial a^{(2)}} = 2\, h_0'(a^{(2)})\, [h_0(a^{(2)}) - y_i]$; then $\frac{\partial E_i}{\partial w^{(1)}_{kj}} = \delta^{(1)}_k x^{(j)}_i$ with $\delta^{(1)}_k = \frac{\partial E_i}{\partial a^{(1)}_k} = \frac{\partial E_i}{\partial a^{(2)}} \times \frac{\partial a^{(2)}}{\partial a^{(1)}_k} = \delta^{(2)} w^{(2)}_k h_k'(a^{(1)}_k)$; and $\frac{\partial E_i}{\partial w^{(0)}_k} = \delta^{(1)}_k$.
     (See the R sketch below.)
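The forward and backward steps above, transcribed into a small R function for one observation (logistic hidden units and an identity output unit are assumed here; the weights and data in the example call are random, for illustration only):

```r
## forward / backward pass for a one-hidden-layer perceptron on one observation
## hidden activations h_k = logistic, output activation h_0 = identity (regression)
logistic <- function(z) 1 / (1 + exp(-z))

backprop_one <- function(xi, yi, W1, w0, w2, w20) {
  ## W1: p x Q matrix of first-layer weights, w0: Q first-layer biases,
  ## w2: Q second-layer weights, w20: output bias
  a1  <- drop(crossprod(W1, xi)) + w0     # a^(1)_k = x_i' w^(1)_k + w^(0)_k
  z1  <- logistic(a1)                     # z^(1)_k = h_k(a^(1)_k)
  a2  <- sum(w2 * z1) + w20               # a^(2)
  phi <- a2                               # Phi_w(x_i) = h_0(a^(2)), h_0 = identity

  delta2 <- 2 * (phi - yi)                # dE_i / da^(2) (h_0' = 1)
  delta1 <- delta2 * w2 * z1 * (1 - z1)   # dE_i / da^(1)_k, logistic'(a) = z(1 - z)

  list(output   = phi,
       grad_w2  = delta2 * z1,            # dE_i / dw^(2)_k
       grad_w20 = delta2,                 # dE_i / dw^(2)_0
       grad_W1  = outer(xi, delta1),      # dE_i / dw^(1)_kj = delta^(1)_k x^(j)_i
       grad_w0  = delta1)                 # dE_i / dw^(0)_k
}

## tiny example: p = 3 inputs, Q = 4 hidden neurons, random weights
set.seed(2)
p <- 3; Q <- 4
backprop_one(xi = rnorm(p), yi = 0.5,
             W1 = matrix(rnorm(p * Q), p, Q), w0 = rnorm(Q),
             w2 = rnorm(Q), w20 = 0)
```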
  29. Initialization and stopping of the training algorithm
     1. How to initialize the weights? Standard choices: $w^{(1)}_{jk} \sim \mathcal{N}(0, 1/\sqrt{p})$ and $w^{(2)}_k \sim \mathcal{N}(0, 1/\sqrt{Q})$. In the R package nnet, weights are sampled uniformly in $[-0.5, 0.5]$, or in $\left[ -\frac{1}{\max_i X^{(j)}_i}, \frac{1}{\max_i X^{(j)}_i} \right]$ if $X^{(j)}$ is large.
     2. When to stop the algorithm (gradient descent or alike)? Standard choices: a bounded number of iterations $T$; a target value of the error $\hat L_n(w)$; a target value of the evolution $\hat L_n(w(t)) - \hat L_n(w(t+1))$. In the R package nnet, a combination of the three criteria is used and is tunable.
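For reference, a hedged sketch of how these choices surface in the nnet interface (rang, maxit, abstol and reltol are documented nnet arguments; the values below and the reuse of the simulated data `d` are illustrative assumptions):

```r
library(nnet)

## rang:   half-width of the uniform interval used to initialize the weights
## maxit:  bound T on the number of iterations
## abstol: stop when the fitting criterion falls below this value
## reltol: stop when the relative decrease of the criterion is below this value
fit <- nnet(y ~ x, data = d, size = 5, linout = TRUE,
            rang = 0.5, maxit = 200, abstol = 1e-4, reltol = 1e-8,
            trace = FALSE)
fit$value   # final value of the fitting criterion
```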
  30. Strategies to avoid overfitting
     - properly tune $Q$ with a CV or a bootstrap estimation of the generalization ability of the method;
     - early stopping: for $Q$ large enough, use a part of the data as a validation set and stop the training (gradient descent) when the empirical error computed on this dataset starts to increase;
     - weight decay: for $Q$ large enough, penalize the empirical risk with a function of the weights, e.g., $\hat L_n(w) + \lambda\, w^\top w$;
     - noise injection: modify the input data with a random noise during the training.
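A minimal sketch of weight decay and of tuning Q by cross validation with nnet (the grids for size and decay, and the reuse of the data frame `d`, the fold assignment `folds`, the number of folds `L` and the `mse()` helper from the earlier sketches, are assumptions of these notes):

```r
library(nnet)

## tune the number of hidden neurons Q and the weight decay lambda by L-fold CV
grid <- expand.grid(size = c(2, 5, 10), decay = c(0, 0.001, 0.01, 0.1))

cv_err <- apply(grid, 1, function(g) {
  errs <- sapply(1:L, function(l) {
    fit <- nnet(y ~ x, data = d[folds != l, ], size = g["size"],
                decay = g["decay"], linout = TRUE, maxit = 500, trace = FALSE)
    mse(predict(fit, d[folds == l, ]), d$y[folds == l])
  })
  mean(errs)
})

cbind(grid, cv_err)[order(cv_err), ][1, ]   # best (size, decay) pair
```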
  31. What is "deep learning"? Learning with neural networks (as in the 2-hidden-layer MLP shown before), except that... the number of neurons is huge! Image taken from https://towardsdatascience.com/.
  32. Main types of deep NN
     - deep MLP;
     - CNN (Convolutional Neural Network): used in image processing (images taken from [Krizhevsky et al., 2012, Zeiler and Fergus, 2014]);
     - RNN (Recurrent Neural Network): used for data with a temporal sequence (in natural language processing for instance).
  33. Why such a big buzz around deep learning? ImageNet results, http://www.image-net.org/challenges/LSVRC/ (image by courtesy of Stéphane Canu). See also: http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
  34. Improvements in learning parameters
     - optimization algorithm: gradient descent is still used, but the choice of the descent step $\mu(t)$ has been improved [Ruder, 2016] and mini-batches are also often used to speed up the learning (e.g., ADAM [Kingma and Ba, 2017]);
     - avoid overfitting with dropout: randomly remove a fraction $p$ of the neurons (and their connections) during each step of the training (a larger number of iterations is needed to converge, but each update is faster) [Srivastava et al., 2014].
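A hedged sketch of how dropout and the ADAM optimizer are typically specified with the keras package in R (this assumes keras is installed with a TensorFlow backend; the architecture, the dropout rate of 0.5 and the 784-dimensional input are illustrative choices, not taken from the slides):

```r
library(keras)

## a deep MLP with dropout after each hidden layer, trained with ADAM
model <- keras_model_sequential() %>%
  layer_dense(units = 256, activation = "relu", input_shape = c(784)) %>%
  layer_dropout(rate = 0.5) %>%        # randomly drop 50% of the neurons at each step
  layer_dense(units = 128, activation = "relu") %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 10, activation = "softmax")

model %>% compile(
  optimizer = optimizer_adam(),        # ADAM: adaptive choice of the descent step
  loss = "categorical_crossentropy",
  metrics = "accuracy"
)

## model %>% fit(x_train, y_train, epochs = 20, batch_size = 128,
##               validation_split = 0.2)   # x_train / y_train not defined here
```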
  35. Main available implementations

     |             | Theano | TensorFlow | Caffe | PyTorch (and autograd) | Keras |
     |-------------|--------|------------|-------|------------------------|-------|
     | Date        | 2008-2017 | 2015-... | 2013-... | 2016-... | 2017-... |
     | Main dev    | Montreal University (Y. Bengio) | Google Brain | Berkeley University | community driven (Facebook, Google, ...) | François Chollet / RStudio |
     | Language(s) | Python | C++ / Python | C++ / Python / Matlab | Python (with strong GPU accelerations) | high-level API on top of TensorFlow (and CNTK, Theano, ...): Python, R |
  36. General architecture of CNN. Image taken from the article https://doi.org/10.3389/fpls.2017.02235.
  37. Convolutional layer. Images taken from https://towardsdatascience.com.
  38. Pooling layer. Basic principle: select a subsample of the previous layer values; max pooling is generally used. Image taken from Wikimedia Commons, by Aphex34.
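To tie the convolutional and pooling layers together, a hedged keras (R) sketch of a small CNN in the spirit of the architecture slide (the numbers of filters, kernel sizes and the 28x28 grayscale input are illustrative assumptions):

```r
library(keras)

## convolution layers extract local features, max pooling subsamples them,
## and dense layers on the flattened maps produce the final classification
cnn <- keras_model_sequential() %>%
  layer_conv_2d(filters = 16, kernel_size = c(3, 3), activation = "relu",
                input_shape = c(28, 28, 1)) %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_conv_2d(filters = 32, kernel_size = c(3, 3), activation = "relu") %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 10, activation = "softmax")

summary(cnn)   # layer-by-layer description of the architecture
```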
  39. References
     - Barron, A. (1994). Approximation and estimation bounds for artificial neural networks. Machine Learning, 14:115–133.
     - Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press, New York, USA.
     - Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, NY, USA.
     - Farago, A. and Lugosi, G. (1993). Strong universal consistency of neural network classifiers. IEEE Transactions on Information Theory, 39(4):1146–1151.
     - Györfi, L., Kohler, M., Krzyżak, A., and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer-Verlag, New York, NY, USA.
     - Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257.
     - Hornik, K. (1993). Some new results on neural network approximation. Neural Networks, 6(8):1069–1072.
     - Kingma, D. P. and Ba, J. (2017). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
     - Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Pereira, F., Burges, C., Bottou, L., and Weinberger, K., editors, Proceedings of Neural Information Processing Systems (NIPS 2012), volume 25.
     - Mc Culloch, W. and Pitts, W. (1943). A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5(4):115–133.
     - McCaffrey, D. and Gallant, A. (1994). Convergence rates for single hidden layer feedforward networks. Neural Networks, 7(1):115–133.
     - Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143–195.
     - Ripley, B. (1996). Pattern Recognition and Neural Networks. Cambridge University Press.
     - Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–408.
     - Rosenblatt, F. (1962). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington, DC, USA.
     - Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
     - Rumelhart, D. and Mc Clelland, J. (1986). Parallel Distributed Processing: Exploration in the MicroStructure of Cognition. MIT Press, Cambridge, MA, USA.
     - Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958.
     - Stinchcombe, M. (1999). Neural network approximation of continuous functionals and continuous functions on compactifications. Neural Networks, 12(3):467–477.
     - White, H. (1990). Connectionist nonparametric regression: multilayer feedforward networks can learn arbitrary mappings. Neural Networks, 3:535–549.
     - White, H. (1991). Nonparametric estimation of conditional quantiles using neural networks. In Proceedings of the 23rd Symposium of the Interface: Computing Science and Statistics, pages 190–199, Alexandria, VA, USA. American Statistical Association.
     - Zeiler, M. D. and Fergus, R. (2014). Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV 2014).
