Artificial neural networks (I)
The Delta rule
The simplest choice is a linear combination of the inputs:

    y(x) = \sum_{i=1}^{d} w_i x_i + w_0

What kind of network gives rise to this function?

Which can be extended to multiple outputs:

    y_k(x) = \sum_{i=1}^{d} w_{ki} x_i + w_{k0}, \qquad k = 1, \dots, m

And now?

Finally, let us add a non-linearity to the output:

    y_k(x) = g\left( \sum_{i=1}^{d} w_{ki} x_i + w_{k0} \right), \qquad k = 1, \dots, m

And now?
Artificial neural networks (I)
The Delta rule
For convenience, define x = (1, x_1, \dots, x_d)^T, \; w_k = (w_{k0}, w_{k1}, \dots, w_{kd})^T.

We have

    y_k(x) = g\left( \sum_{j=0}^{d} w_{kj} x_j \right), \qquad 1 \le k \le m.

Define now the weight matrix W_{(d+1) \times m} by gathering all the weight vectors
by columns and introduce the notation g[\cdot] to mean that g is applied
component-wise. The network then computes y(x) = g[W^T x].

The activation g is often a sigmoidal function: differentiable, with a non-negative
(or non-positive) bell-shaped first derivative and horizontal asymptotes at \pm\infty:

    \text{(logistic)} \quad \frac{1}{1 + e^{-\beta z}} \in (0, 1) \qquad\qquad
    \text{(tanh)} \quad \frac{e^{\beta z} - e^{-\beta z}}{e^{\beta z} + e^{-\beta z}} \in (-1, 1), \qquad \beta > 0
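For concreteness, here is a minimal NumPy sketch (not part of the original slides) of the two sigmoidal activations and of the single-layer computation y(x) = g[W^T x]; the function names, shapes and random weights are illustrative assumptions.

```python
import numpy as np

def logistic(z, beta=1.0):
    """Logistic sigmoid, values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-beta * z))

def tanh_sigmoid(z, beta=1.0):
    """Hyperbolic-tangent sigmoid, values in (-1, 1)."""
    return np.tanh(beta * z)

def network_output(W, x, g=logistic):
    """y(x) = g[W^T x], where x is already extended with a leading 1 (bias)."""
    return g(W.T @ x)

# Example: d = 2 inputs, m = 3 outputs, so W has shape (d+1) x m.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))
x = np.array([1.0, 0.5, -1.2])   # (1, x1, x2)
print(network_output(W, x))
```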
Artificial neural networks (I)
[Figure: the logistic (left) and tanh (right) activations plotted on [-4, 4] for several
values of the slope a (note: a is β).]
Artificial neural networks (I)
The Delta rule
We wish to fit this function to a set of learning examples
S = {(xn, tn)}n=1,...,N, where xn ∈ Rd, tn ∈ Rm.
Define the (empirical) mean-square error of the network as:
    E_{emp}(W) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{m} (t_{nk} - y_k(x_n))^2
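A direct NumPy transcription of this error (an illustrative sketch; the array shapes and the default logistic g are my own conventions, not the slides'):

```python
import numpy as np

def empirical_error(W, X, T, g=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """E_emp(W) = 1/2 * sum_n sum_k (t_nk - y_k(x_n))^2.

    X: (N, d+1) inputs with a leading column of ones, T: (N, m) targets,
    W: (d+1, m) weight matrix."""
    Y = g(X @ W)                      # network outputs for all examples
    return 0.5 * np.sum((T - Y) ** 2)
```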
Artificial neural networks (I)
The Delta rule
Let f : \mathbb{R}^r \to \mathbb{R} be differentiable; we wish to minimize it by making changes
in its variables. Then the increment in each variable is proportional to the
corresponding derivative: x_i(t + 1) = x_i(t) + \Delta x_i(t), with

    \Delta x_i(t) = -\alpha \left. \frac{\partial f}{\partial x_i} \right|_{x = x(t)}, \qquad \alpha > 0, \; i = 1, \dots, r

Illustration with r = 1. Let f(x) = 3x^2 + x - 4 and take \alpha = 0.05. We have
f'(x) = 6x + 1. Then x(0) = 1, x(1) = x(0) - \alpha f'(1) = 1 - 0.05 \cdot 7 = 0.65,
x(2) = 0.65 - 0.05 \cdot 4.9 = 0.405, \dots We find \lim_{i \to \infty} x(i) = -1/6.
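The illustration can be reproduced with a few lines of Python (a sketch added for this write-up; the function names are mine):

```python
def gradient_descent_1d(f_prime, x0=1.0, alpha=0.05, steps=200):
    """Plain gradient descent: x(t+1) = x(t) - alpha * f'(x(t))."""
    x = x0
    for _ in range(steps):
        x -= alpha * f_prime(x)
    return x

# f(x) = 3x^2 + x - 4, so f'(x) = 6x + 1; the minimum is at x = -1/6.
print(gradient_descent_1d(lambda x: 6 * x + 1))   # ~ -0.1666...
```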
Artificial neural networks (I)
The Delta rule
In our case, the function to be minimized is the empirical error and the
variables are the weights W of the network:
    \Delta w_{ij}(t) = -\alpha \left. \frac{\partial E_{emp}(W)}{\partial w_{ij}} \right|_{W = W(t)}, \qquad \alpha > 0, \; i = 1, \dots, m, \; j = 0, \dots, d

We have

    \frac{\partial E_{emp}(W)}{\partial w_{ij}} = - \sum_{n=1}^{N} (t_{ni} - y_i(x_n)) \, g'(\hat{y}_{ni}) \, x_{nj}

where \delta_{ni} \equiv t_{ni} - y_i(x_n) is called the delta and \hat{y}_{ni} \equiv \sum_{j=0}^{d} w_{ij} x_{nj} (sometimes
called the net input); in other words, y_i(x_n) = g(\hat{y}_{ni}).
Artificial neural networks (I)
The Delta rule
Therefore

    \Delta w_{ij}(t) = \alpha \sum_{n=1}^{N} (t_{ni} - y_i(x_n)) \, g'(\hat{y}_{ni}) \, x_{nj}

evaluated at W(t) is the Delta rule (aka the α-LMS learning rule).

The network represents a linear regressor whose regression coefficients are
estimated iteratively. This rule is the most analyzed and applied simple learning rule.

This is a form of learning (because of the adaptation to the example data), but
as yet it is not incremental: we need all the examples from the beginning (this
is sometimes referred to as a "batch" rule).
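A possible NumPy sketch of the batch Delta rule for a single-layer network with logistic outputs (illustrative only; the initialization scale, learning rate and variable names are assumptions, not part of the slides):

```python
import numpy as np

def logistic(z, beta=1.0):
    return 1.0 / (1.0 + np.exp(-beta * z))

def delta_rule_batch(X, T, alpha=0.1, epochs=100, beta=1.0, seed=0):
    """Batch Delta rule for y = g(W^T x) with a logistic g.

    X: (N, d+1) inputs with leading 1s, T: (N, m) targets in (0, 1).
    Returns the learned weight matrix W of shape (d+1, m)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(X.shape[1], T.shape[1]))
    for _ in range(epochs):
        A = X @ W                            # net inputs \hat{y}
        Y = logistic(A, beta)                # outputs
        delta = T - Y                        # deltas t - y
        gprime = beta * Y * (1.0 - Y)        # g'(\hat{y}) for the logistic
        W += alpha * X.T @ (delta * gprime)  # accumulate over all examples
    return W
```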
Artificial neural networks (I)
The Delta rule
In the “on-line” version of the rule, we begin with W arbitrary and apply:
    \Delta w_{ij}(t) = \alpha_t (t_{ni} - y_i(x_n)) \, g'(\hat{y}_{ni}) \, x_{nj}

At each learning step t, the input vector x_n is drawn at random.

If \sum_{t \ge 0} \alpha_t = \infty and \sum_{t \ge 0} \alpha_t^2 < \infty, then W(t) converges to the global
minimum W^* asymptotically, in the mean-square sense:

    \lim_{t \to \infty} \| W(t) - W^* \|^2 = 0

One such procedure is \alpha_t = \frac{\alpha}{t + 1}, with \alpha > 0.
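A corresponding on-line (stochastic) sketch using the decreasing learning rate α_t = α/(t+1) (again illustrative; names and defaults are mine):

```python
import numpy as np

def delta_rule_online(X, T, alpha=0.5, epochs=50, beta=1.0, seed=0):
    """On-line Delta rule: one randomly drawn example per step,
    with the decreasing learning rate alpha_t = alpha / (t + 1)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(X.shape[1], T.shape[1]))
    t = 0
    for _ in range(epochs):
        for n in rng.permutation(len(X)):          # random presentation order
            y = 1.0 / (1.0 + np.exp(-beta * (X[n] @ W)))
            alpha_t = alpha / (t + 1)
            W += alpha_t * np.outer(X[n], (T[n] - y) * beta * y * (1.0 - y))
            t += 1
    return W
```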
Artificial neural networks (I)
The Delta rule
The fit can be kept under tighter control using regularization:

    E_{emp}^{\lambda}(W) = E_{emp}(W) + \lambda \|W\|^2, \qquad \lambda > 0, \qquad \|W\|^2 = \sum_{ij} w_{ij}^2

In this context, this technique is known as weight decay, because it leads
to the new updating recipe:

    \Delta w_{ij}(t) = -\alpha \left( \left. \frac{\partial E_{emp}(W)}{\partial w_{ij}} \right|_{W = W(t)} + \lambda w_{ij}(t) \right), \qquad i = 1, \dots, m, \; j = 0, \dots, d
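The same batch update with the weight-decay term added, as a sketch (the `lam` argument stands for λ; the default values are arbitrary assumptions):

```python
import numpy as np

def delta_rule_weight_decay(X, T, alpha=0.1, lam=1e-3, epochs=100, beta=1.0, seed=0):
    """Batch Delta rule with weight decay: Delta W = -alpha * (dE_emp/dW + lambda * W)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(X.shape[1], T.shape[1]))
    for _ in range(epochs):
        Y = 1.0 / (1.0 + np.exp(-beta * (X @ W)))
        grad = -X.T @ ((T - Y) * beta * Y * (1.0 - Y))   # dE_emp/dW
        W -= alpha * (grad + lam * W)                    # gradient step plus decay term
    return W
```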
Artificial neural networks (I)
The Delta rule
The value of λ is quite often chosen by resampling techniques.
How do we stop the process in practice?
1. When the number of iterations reaches a predetermined maximum
2. When the relative error reaches a predetermined tolerance
Artificial neural networks (II)
How could we obtain a model that is non-linear in the parameters (a
non-linear model)? We start again from:

    y_k(x) = g\left( \sum_{i=1}^{d} w_{ki} x_i + w_{k0} \right), \qquad k = 1, \dots, m

where g is a sigmoidal function. This is a linear model.

The solution is to apply non-linear functions to the input data:

    y_k(x) = g\left( \sum_{i=0}^{h} w_{ki} \phi_i(x) \right), \qquad k = 1, \dots, m

We recover the previous "linear" situation by taking h = d and \phi_i(x) = x_i,
with \phi_0(x) = 1.
Artificial neural networks (II)
Approach 1. Make \Phi = (\phi_0, \dots, \phi_h) a set of predefined functions. This
is perfectly illustrated by the univariate case (d = 1) and polynomial fitting. Consider
the problem of fitting the function

    p(x) = w_0 + w_1 x + \dots + w_h x^h = \sum_{i=0}^{h} w_i x^i \quad \text{to } x_1, \dots, x_p

This can be seen as a special case of linear regression, where the set of
regressors is 1, x, x^2, \dots, x^h. Therefore \phi_i(x) = x^i.

The weights w_0, w_1, \dots, w_h can be estimated by standard techniques (ordinary
least squares) or by the Delta rule.
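As a small worked example (the toy data and names are mine), the polynomial coefficients can be obtained by ordinary least squares with NumPy:

```python
import numpy as np

# Fit p(x) = w0 + w1*x + ... + wh*x^h by ordinary least squares.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
t = np.sin(np.pi * x) + 0.1 * rng.normal(size=x.shape)   # noisy toy targets

h = 5
Phi = np.vander(x, h + 1, increasing=True)    # columns 1, x, x^2, ..., x^h
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # ordinary least squares
print(w)                                      # estimated w0, ..., wh
```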
Artificial neural networks (II)
What if we have a multivariate input x = (x_1, \dots, x_d)^T? The corresponding
h-degree polynomial is:

    p(x) = w_0 + \sum_{i_1=1}^{d} w_{i_1} x_{i_1}
               + \sum_{i_1=1}^{d} \sum_{i_2=i_1+1}^{d} w_{i_1 i_2} x_{i_1} x_{i_2}
               + \sum_{i_1=1}^{d} \sum_{i_2=i_1+1}^{d} \sum_{i_3=i_2+1}^{d} w_{i_1 i_2 i_3} x_{i_1} x_{i_2} x_{i_3} + \dots

The number of possible regressors grows as the binomial coefficient \binom{d+h}{h}!

So many regressors (while holding N fixed) cause insurmountable trouble for estimating
their parameters:

It is quite convenient (and sometimes mandatory!) to have more observations than
regressors.

Statistical significance decreases with the number of regressors and increases with
the number of observations.
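To get a feeling for this growth, the count \binom{d+h}{h} can be tabulated directly (an illustrative snippet added here):

```python
from math import comb

# Number of monomials of degree <= h in d variables: C(d+h, h).
for d in (5, 10, 20):
    for h in (2, 3, 5):
        print(f"d={d:2d}, h={h}: {comb(d + h, h):>8d} regressors")
```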
Artificial neural networks (II)
Approach 2. Why not try to engineer adaptive regressors? By adapting
the regressors to the problem, it is reasonable to expect that we shall need
a much smaller number of them for a correct fit.

The basic neural network idea is to duplicate the model:

    y_k(x) = g\left( \sum_{i=0}^{h} w_{ki} \phi_i(x) \right), \qquad k = 1, \dots, m

    \text{where } \phi_i(x) = g\left( \sum_{j=0}^{d} v_{ij} x_j \right), \quad \text{with } \phi_0(x) = 1, \; x_0 = 1
Artificial neural networks (II)
We now have a new set of regressors \Phi(x) = (\phi_0(x), \dots, \phi_h(x))^T.

These regressors are adaptive via the v_i parameters (called the non-linear
parameters). Once the regressors are fully specified, the remaining task is
again a linear fit (via the w_k parameters).

What kind of neural network gives rise to this function? The Multilayer
Perceptron or MLP.

Under other choices for the regressors, other networks are obtained:

    \phi_i(x) = \exp\left( - \frac{\|x - \mu_i\|^2}{2\sigma_i^2} \right)

gives the Gaussian RBF network.
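A minimal sketch of the Gaussian RBF regressors (the function name, the array shapes and the added constant regressor φ_0 = 1 are my own conventions):

```python
import numpy as np

def gaussian_rbf_features(X, centres, sigmas):
    """phi_i(x) = exp(-||x - mu_i||^2 / (2*sigma_i^2)), plus a constant phi_0 = 1.

    X: (N, d), centres: (h, d), sigmas: (h,). Returns an (N, h+1) design matrix."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)  # squared distances
    Phi = np.exp(-d2 / (2.0 * sigmas ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), Phi])
```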
Artificial neural networks (II)
Error functions
We have a set of learning examples
S = {(xn, tn)}n=1,...,N, where xn ∈ Rd, tn ∈ Rm (assume S is i.i.d.).
Ideally, we would like to know the statistical properties, namely p(t|x)
An error function can be derived by maximizing the likelihood of S:

    L = \prod_{n=1}^{N} p(t_n | x_n)

The different outputs are assumed to be independent, so we can write:

    p(t|x) = \prod_{k=1}^{m} p(t_k | x)
When convenient, we can also maximize a strictly monotonic function of L, namely
the log-likelihood l = ln L
Artificial neural networks (II)
Error functions for regression
We model functional ignorance as stochastic variability, putting a continuous pdf p(t|X = x)
around each point x. The optimal solution (known as the regression function) is:

    y_k^*(x) = E[t_k | x] = \int_{\mathbb{R}} t_k \, p(t_k | x) \, dt_k

We take t_k to be a deterministic function, distorted by Gaussian "noise":
t_k = h_k(x) + \epsilon, with \epsilon \sim N(0, \sigma^2). Note \epsilon does not depend on k or x (homoscedasticity).
Artificial neural networks (II)
Error functions for regression
Therefore we have:

    p(\epsilon) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( - \frac{\epsilon^2}{2\sigma^2} \right)

Also note that in this case the optimal function would be:

    y_k^*(x) = E[t_k | x] = E[h_k(x) + \epsilon] = E[h_k(x)] + E[\epsilon] = h_k(x)

If we rewrite t_k \sim N(h_k(x), \sigma^2), we obtain:

    p(t_k | x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( - \frac{(t_k - y_k(x))^2}{2\sigma^2} \right)
Artificial neural networks (II)
Error functions for regression
Let us try to define and minimize the negative log-likelihood as the error:

    -l = -\ln L = -\ln \prod_{n=1}^{N} p(t_n | x_n) = -\ln \prod_{n=1}^{N} \prod_{k=1}^{m} p(t_{nk} | x_n) = -\sum_{n=1}^{N} \sum_{k=1}^{m} \ln p(t_{nk} | x_n)

    = -\sum_{n=1}^{N} \sum_{k=1}^{m} \left[ \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(t_{nk} - y_k(x_n))^2}{2\sigma^2} \right]
    = \sum_{n=1}^{N} \sum_{k=1}^{m} \left[ \frac{1}{2} \ln(2\pi\sigma^2) + \frac{(t_{nk} - y_k(x_n))^2}{2\sigma^2} \right]

    = \frac{Nm}{2} \left( \ln(2\pi) + 2 \ln \sigma \right) + \frac{1}{2\sigma^2} \sum_{n=1}^{N} \sum_{k=1}^{m} (t_{nk} - y_k(x_n))^2

The first term is out of our control (it does not depend on the model y_k), so we should
minimize:

    E \equiv \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{m} (t_{nk} - y_k(x_n))^2 = \frac{1}{2} \sum_{n=1}^{N} \| t_n - y(x_n) \|^2
Artificial neural networks (II)
Error functions for classification
The goal in classification is to model the posterior probabilities for every class P(ωk|x).
In two-class problems, we model the posterior with an ANN having one output neuron (m = 1)
that represents y(x) = P(\omega_1|x); therefore 1 - y(x) = P(\omega_2|x).

Suppose we have a set of learning examples
S = \{(x_n, t_n)\}_{n=1,\dots,N}, where x_n \in \mathbb{R}^d, t_n \in \{0, 1\} (assume S is i.i.d.).

We take the convention that t_n = 1 means x_n \in \omega_1 and t_n = 0 means x_n \in \omega_2, to model:

    P(t|x) = \begin{cases} y(x) & \text{if } x \in \omega_1 \\ 1 - y(x) & \text{if } x \in \omega_2 \end{cases}

which can be more conveniently expressed as P(t|x) = y(x)^t (1 - y(x))^{1-t}, \; t = 0, 1.
Artificial neural networks (II)
Error functions for classification
This is a Bernoulli distribution. The likelihood is:

    L = \prod_{n=1}^{N} y(x_n)^{t_n} (1 - y(x_n))^{1 - t_n}

So which error should we use? Let us define and minimize again the
negative log-likelihood as the error:

    E \equiv -\ln L = -\sum_{n=1}^{N} \left\{ t_n \ln y(x_n) + (1 - t_n) \ln(1 - y(x_n)) \right\}
known as the cross-entropy; it can be generalized to more than two
classes.
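A direct sketch of the two-class cross-entropy (the clipping constant `eps` is an implementation detail I added to avoid log(0); it is not in the slides):

```python
import numpy as np

def cross_entropy(t, y, eps=1e-12):
    """E = -sum_n [ t_n*ln y(x_n) + (1 - t_n)*ln(1 - y(x_n)) ] for two classes."""
    y = np.clip(y, eps, 1.0 - eps)   # avoid log(0)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

# Example: targets in {0, 1} and predicted posteriors P(omega_1 | x).
print(cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))
```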
Artificial neural networks (III)
A gentle derivation of backpropagation
An MLP with c hidden layers is a function F : \mathbb{R}^d \to \mathbb{R}^m made up of pieces
F_1, \dots, F_m of the form:

    F_k(x) = g\left( \sum_{j=0}^{h_c} w_{kj}^{(c+1)} \phi_j^{(c)}(x) \right), \qquad k = 1, \dots, m

where, for every l = 1, \dots, c, W^{(l)} = (w_{ji}^{(l)}) is the matrix of weights connecting
layers l - 1 and l, h_l is the size of hidden layer l, and

    \phi_j^{(l)}(x) = g\left( \sum_{i=0}^{h_{l-1}} w_{ji}^{(l)} \phi_i^{(l-1)}(x) \right), \qquad \text{for } l = 1, \dots, c,

with \phi_i^{(0)}(x) = x_i, \; \phi_0^{(l)}(x) = 1 (in particular, x_0 = 1) and h_0 = d.
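An illustrative forward pass for this definition, representing the network as a list of weight matrices W^{(1)}, ..., W^{(c+1)} (this representation and all names are my own, not the slides'):

```python
import numpy as np

def logistic(z, beta=1.0):
    return 1.0 / (1.0 + np.exp(-beta * z))

def mlp_forward(x, weights, g=logistic):
    """Forward pass of an MLP given as a list of weight matrices W^(1..c+1).

    Each W^(l) has shape (h_l, h_{l-1}+1); the extra column multiplies the
    constant phi_0 = 1 (bias). Returns the outputs z^(l) of every layer."""
    zs = [x]
    z = x
    for W in weights:
        z_ext = np.concatenate(([1.0], z))   # prepend the bias input
        a = W @ z_ext                        # net inputs a^(l)
        z = g(a)                             # outputs z^(l) = g(a^(l))
        zs.append(z)
    return zs

# Example: d = 3 inputs, one hidden layer with h1 = 4 units, m = 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 4)), rng.normal(size=(2, 5))]
print(mlp_forward(np.array([0.2, -1.0, 0.5]), weights)[-1])
```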
Artificial neural networks (III)
A gentle derivation of backpropagation
The goal in regression is to minimize the empirical error of the network on
the training data sample S = {(xn, tn)}n=1,...,N, where xn ∈ Rd, tn ∈ Rm:
    E_{emp}(W) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{m} (t_{nk} - F_k(x_n))^2

where W = \{W^{(1)}, \dots, W^{(c+1)}\} is the set of all network weights.
Artificial neural networks (III)
A gentle derivation of backpropagation
Note that, if g admits a derivative everywhere, E_{emp}(W) is a differentiable
function of every weight w_{ji}^{(l)}.

If we want to apply gradient descent, we need to compute the partial
derivative of the error w.r.t. every weight, the gradient vector:

    \nabla E_{emp}(W) = \left( \frac{\partial E_{emp}(W)}{\partial w_{ji}^{(l)}} \right)_{l,j,i}

There exists a reasonably efficient algorithm for computing this gradient
vector: the backpropagation algorithm.
Artificial neural networks (III)
A gentle derivation of backpropagation
Consider an MLP where, for notational simplicity, we define:

    z_j^{(l)} \equiv g(a_j^{(l)}) \equiv g\left( \sum_i w_{ji}^{(l)} z_i^{(l-1)} \right), \qquad z_j^{(0)} = x_j

Note that E_{emp} is the sum of the (independent) errors for every
input/output example (x_n, t_n):

    E_{emp}(W) = \sum_{n=1}^{N} \frac{1}{2} \sum_{k=1}^{m} (t_{nk} - F_k(x_n))^2 \equiv \sum_{n=1}^{N} E_{emp}^{(n)}(W)
Artificial neural networks (III)
A gentle derivation of backpropagation
Suppose we present x_n to the network and compute all the neurons'
outputs z_j^{(l)} (this is known as the forward propagation). Now,

    \frac{\partial E_{emp}^{(n)}(W)}{\partial w_{ji}^{(l)}} = \frac{\partial E_{emp}^{(n)}(W)}{\partial a_j^{(l)}} \cdot \frac{\partial a_j^{(l)}}{\partial w_{ji}^{(l)}} = \delta_j^{(l)} \cdot z_i^{(l-1)}

where we have defined \delta_j^{(l)} \equiv \frac{\partial E_{emp}^{(n)}(W)}{\partial a_j^{(l)}}.
Artificial neural networks (III)
A gentle derivation of backpropagation
What have we done? We have found that, in order to compute the desired
derivative \partial E_{emp}^{(n)}(W) / \partial w_{ji}^{(l)}, we only need to find the \delta_j^{(l)}:

Let us concentrate on an arbitrary neuron k. Suppose first that k is an
output neuron, then

    \delta_k^{(c+1)} = \frac{\partial E_{emp}^{(n)}(W)}{\partial a_k^{(c+1)}} = -g'(a_k^{(c+1)}) \cdot (t_{nk} - F_k(x_n)) = g'(a_k^{(c+1)}) \cdot (z_k^{(c+1)} - t_{nk})

where we have made use of the identity F_k(x_n) = g(a_k^{(c+1)}) = z_k^{(c+1)}.
Artificial neural networks (III)
A gentle derivation of backpropagation
When g is the logistic function l_\beta(z) = \frac{1}{1 + e^{-\beta z}} \in (0, 1), we obtain:

    g'(a_k^{(c+1)}) = \beta g(a_k^{(c+1)}) [1 - g(a_k^{(c+1)})] = \beta z_k^{(c+1)} (1 - z_k^{(c+1)})

Therefore

    \frac{\partial E_{emp}^{(n)}(W)}{\partial w_{ji}^{(c+1)}} = \beta z_j^{(c+1)} (1 - z_j^{(c+1)}) (z_j^{(c+1)} - t_{nj}) \, z_i^{(c)}
Artificial neural networks (III)
A gentle derivation of backpropagation
Suppose now that k is a hidden neuron, located in a layer l \in \{1, \dots, c\}:

    \delta_k^{(l)} = \frac{\partial E_{emp}^{(n)}(W)}{\partial a_k^{(l)}}
    = \sum_q \frac{\partial E_{emp}^{(n)}(W)}{\partial a_q^{(l+1)}} \cdot \frac{\partial a_q^{(l+1)}}{\partial a_k^{(l)}}
    = \sum_q \delta_q^{(l+1)} \cdot \frac{\partial a_q^{(l+1)}}{\partial a_k^{(l)}}

    = \sum_q \delta_q^{(l+1)} \cdot \frac{\partial a_q^{(l+1)}}{\partial z_k^{(l)}} \cdot \frac{\partial z_k^{(l)}}{\partial a_k^{(l)}}
    = \sum_q \delta_q^{(l+1)} w_{qk}^{(l+1)} g'(a_k^{(l)})
    = g'(a_k^{(l)}) \sum_q \delta_q^{(l+1)} w_{qk}^{(l+1)}

Again, when g is the logistic, g'(a_k^{(l)}) = \beta g(a_k^{(l)}) [1 - g(a_k^{(l)})] = \beta z_k^{(l)} (1 - z_k^{(l)}).
Artificial neural networks (III)
A gentle derivation of backpropagation
Therefore

    \frac{\partial E_{emp}(W)}{\partial w_{ji}^{(l)}} = \sum_{n=1}^{N} \frac{\partial E_{emp}^{(n)}(W)}{\partial w_{ji}^{(l)}}

and we have \nabla E_{emp}(W) = \left( \frac{\partial E_{emp}(W)}{\partial w_{ji}^{(l)}} \right)_{l,j,i}

The updating formula for the weights is:

    w_{ji}^{(l)}(t + 1) \leftarrow w_{ji}^{(l)}(t) - \alpha \left. \frac{\partial E_{emp}(W)}{\partial w_{ji}^{(l)}} \right|_{W = W(t)}
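Putting the pieces together, here is a sketch of backpropagation plus the batch update for the squared error, using the same list-of-weight-matrices representation as the forward-pass sketch above (an illustration under those assumptions, not the authors' code):

```python
import numpy as np

def logistic(z, beta=1.0):
    return 1.0 / (1.0 + np.exp(-beta * z))

def backprop_gradients(x, t, weights, beta=1.0):
    """Backpropagation for one example (x, t) and squared error.

    weights is a list of matrices W^(l) of shape (h_l, h_{l-1}+1), as in the
    forward-pass sketch. Returns the list of dE^(n)/dW^(l)."""
    # Forward propagation, keeping every layer's output z^(l).
    zs = [x]
    z = x
    for W in weights:
        z = logistic(W @ np.concatenate(([1.0], z)), beta)
        zs.append(z)

    grads = [None] * len(weights)
    # Output layer: delta = g'(a) * (z - t) = beta * z * (1 - z) * (z - t).
    delta = beta * zs[-1] * (1.0 - zs[-1]) * (zs[-1] - t)
    for l in range(len(weights) - 1, -1, -1):
        z_prev_ext = np.concatenate(([1.0], zs[l]))
        grads[l] = np.outer(delta, z_prev_ext)     # dE/dW^(l) = delta * z^(l-1)
        if l > 0:
            # Hidden layer: delta_k = g'(a_k) * sum_q delta_q * w_qk (skip bias column).
            back = weights[l][:, 1:].T @ delta
            delta = beta * zs[l] * (1.0 - zs[l]) * back
    return grads

def train_step(X, T, weights, alpha=0.1, beta=1.0):
    """One batch gradient-descent update: W <- W - alpha * dE_emp/dW."""
    total = [np.zeros_like(W) for W in weights]
    for x, t in zip(X, T):
        for g_acc, g in zip(total, backprop_gradients(x, t, weights, beta)):
            g_acc += g
    return [W - alpha * G for W, G in zip(weights, total)]
```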
