3. Artificial neural networks (I)
The Delta rule
The simplest choice is a linear combination of the inputs:
    y(x) = \sum_{i=1}^{d} w_i x_i + w_0

What kind of network gives rise to this function?

This can be extended to multiple outputs:

    y_k(x) = \sum_{i=1}^{d} w_{ki} x_i + w_{k0}, \quad k = 1, \dots, m

And now?

Finally, let us add a non-linearity to the output:

    y_k(x) = g\left( \sum_{i=1}^{d} w_{ki} x_i + w_{k0} \right), \quad k = 1, \dots, m

And now?
4. Artificial neural networks (I)
The Delta rule
For convenience, define x = (1, x_1, \dots, x_d)^T and w_k = (w_{k0}, w_{k1}, \dots, w_{kd})^T.
We then have

    y_k(x) = g\left( \sum_{j=0}^{d} w_{kj} x_j \right), \quad 1 \le k \le m.

Define now the weight matrix W_{(d+1) \times m} by gathering all the weight vectors
by columns, and introduce the notation g[\cdot] to mean that g is applied
component-wise. The network then computes y(x) = g[W^T x].
The activation g is often sigmoidal: differentiable, with a bell-shaped first
derivative of constant sign (non-negative or non-positive) and horizontal
asymptotes at \pm\infty. Two common choices, for \beta > 0:

    (logistic)  \frac{1}{1 + e^{-\beta z}} \in (0, 1)

    (tanh)  \frac{e^{\beta z} - e^{-\beta z}}{e^{\beta z} + e^{-\beta z}} \in (-1, 1)
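To make this concrete, the computation y(x) = g[W^T x] is a one-liner in numpy;
the following is a minimal sketch (the function names and the random example are
ours, not part of the slides):

    import numpy as np

    def logistic(z, beta=1.0):
        # logistic sigmoid, values in (0, 1)
        return 1.0 / (1.0 + np.exp(-beta * z))

    def network_output(W, x, g=logistic):
        # W has shape (d+1, m); x is a raw input of shape (d,)
        x_ext = np.concatenate(([1.0], x))   # prepend the bias component x_0 = 1
        return g(W.T @ x_ext)                # y(x) = g[W^T x], component-wise

    # toy example with d = 3 inputs and m = 2 outputs
    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 2))
    print(network_output(W, np.array([0.5, -1.0, 2.0])))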
6. Artificial neural networks (I)
The Delta rule
We wish to fit this function to a set of learning examples
S = {(x_n, t_n)}_{n=1,\dots,N}, where x_n \in R^d, t_n \in R^m.

Define the (empirical) mean-square error of the network as:

    E_{emp}(W) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{m} (t_{nk} - y_k(x_n))^2
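For later reference, this error is immediate to compute once targets and outputs
are collected into arrays (a minimal sketch; names are ours):

    import numpy as np

    def empirical_error(T, Y):
        # T[n, k] = t_nk (targets), Y[n, k] = y_k(x_n) (network outputs)
        return 0.5 * np.sum((T - Y) ** 2)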
7. Artificial neural networks (I)
The Delta rule
Let f : R^r \to R be differentiable; we wish to minimize it by making changes
in its variables. The increment in each variable is proportional to the
corresponding partial derivative: x_i(t+1) = x_i(t) + \Delta x_i(t), with

    \Delta x_i(t) = -\alpha \left. \frac{\partial f}{\partial x_i} \right|_{x = x(t)}, \quad \alpha > 0, \quad i = 1, \dots, r
Illustration with r = 1. Let f(x) = 3x^2 + x - 4 and take \alpha = 0.05. We
have f'(x) = 6x + 1. Then x(0) = 1, x(1) = x(0) - \alpha f'(1) = 1 - 0.05 \cdot 7 =
0.65, x(2) = 0.65 - 0.05 \cdot 4.9 = 0.405, \dots. We find \lim_{i \to \infty} x(i) = -1/6.
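This illustration is easy to check numerically; a minimal sketch (names are ours):

    # gradient descent on f(x) = 3x^2 + x - 4, with f'(x) = 6x + 1
    def gradient_descent(df, x0, alpha=0.05, steps=100):
        x = x0
        for _ in range(steps):
            x -= alpha * df(x)        # x(t+1) = x(t) - alpha * f'(x(t))
        return x

    print(gradient_descent(lambda x: 6 * x + 1, x0=1.0))   # -> approx -1/6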
8. Artificial neural networks (I)
The Delta rule
In our case, the function to be minimized is the empirical error and the
variables are the weights W of the network:

    \Delta w_{ij}(t) = -\alpha \left. \frac{\partial E_{emp}(W)}{\partial w_{ij}} \right|_{W = W(t)}, \quad \alpha > 0, \quad i = 1, \dots, m, \; j = 0, \dots, d
We have

    \frac{\partial E_{emp}(W)}{\partial w_{ij}} = -\sum_{n=1}^{N} (t_{ni} - y_i(x_n)) \, g'(\hat{y}_{ni}) \, x_{nj}

where \delta_{ni} \equiv t_{ni} - y_i(x_n) is called the delta and \hat{y}_{ni} \equiv \sum_{j=0}^{d} w_{ij} x_{nj}
(sometimes called the net input); in other words, y_i(x_n) = g(\hat{y}_{ni}).
9. Artificial neural networks (I)
The Delta rule
Therefore

    \Delta w_{ij}(t) = \alpha \sum_{n=1}^{N} (t_{ni} - y_i(x_n)) \, g'(\hat{y}_{ni}) \, x_{nj}

evaluated at W(t), is the Delta rule (aka the \alpha-LMS learning rule).
The network represents a linear regressor whose regression coefficients
are estimated iteratively. This is among the most analyzed and most
widely applied simple learning rules.

This is a form of learning (because of the adaptation to the example
data), but it is not yet incremental: we need all the examples from
the beginning (this is sometimes referred to as a "batch" rule).
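A minimal numpy sketch of the batch Delta rule as stated above, assuming a
logistic activation (all names are ours):

    import numpy as np

    logistic = lambda z: 1.0 / (1.0 + np.exp(-z))
    logistic_prime = lambda z: logistic(z) * (1.0 - logistic(z))

    def delta_rule(X, T, g=logistic, g_prime=logistic_prime, alpha=0.1, epochs=100):
        # X: (N, d) inputs; T: (N, m) targets
        N, d = X.shape
        Xb = np.hstack([np.ones((N, 1)), X])   # prepend x_0 = 1
        W = np.zeros((d + 1, T.shape[1]))
        for _ in range(epochs):
            Y_hat = Xb @ W                     # net inputs \hat{y}_{ni}
            delta = T - g(Y_hat)               # deltas t_ni - y_i(x_n)
            W += alpha * Xb.T @ (delta * g_prime(Y_hat))   # sum over all n
        return W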
10. Artificial neural networks (I)
The Delta rule
In the "on-line" version of the rule, we begin with an arbitrary W and apply:

    \Delta w_{ij}(t) = \alpha_t (t_{ni} - y_i(x_n)) \, g'(\hat{y}_{ni}) \, x_{nj}

At each learning step t, the input vector x_n is drawn at random.

If \sum_{t \ge 0} \alpha_t = \infty and \sum_{t \ge 0} \alpha_t^2 < \infty, then W(t) converges to the global
minimum W^* asymptotically, in the mean-square sense:

    \lim_{t \to \infty} \| W(t) - W^* \|^2 = 0

One such schedule is \alpha_t = \frac{\alpha}{t+1}, with \alpha > 0.
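A sketch of the on-line variant with the schedule \alpha_t = \alpha/(t+1), under
the same assumptions as the batch version above:

    import numpy as np

    def delta_rule_online(X, T, g, g_prime, alpha=0.5, steps=10000, seed=0):
        rng = np.random.default_rng(seed)
        N, d = X.shape
        Xb = np.hstack([np.ones((N, 1)), X])
        W = np.zeros((d + 1, T.shape[1]))
        for t in range(steps):
            n = rng.integers(N)                 # draw one example at random
            y_hat = Xb[n] @ W                   # net inputs for x_n
            delta = T[n] - g(y_hat)
            alpha_t = alpha / (t + 1)           # meets the convergence conditions
            W += alpha_t * np.outer(Xb[n], delta * g_prime(y_hat))
        return W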
11. Artificial neural networks (I)
The Delta rule
The fit can be brought under tighter control using regularization:

    E^{\lambda}_{emp}(W) = E_{emp}(W) + \lambda \|W\|^2, \quad \lambda > 0, \quad \|W\|^2 = \sum_{ij} w_{ij}^2

In this context, this technique is known as weight decay, because it leads
to the new updating recipe:

    \Delta w_{ij}(t) = -\alpha \left( \left. \frac{\partial E_{emp}(W)}{\partial w_{ij}} \right|_{W = W(t)} + \lambda w_{ij}(t) \right), \quad i = 1, \dots, m, \; j = 0, \dots, d
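In code, weight decay only adds the term \lambda w_{ij}(t) to each gradient step;
a sketch of the modified batch update:

    import numpy as np

    def delta_rule_weight_decay(X, T, g, g_prime, alpha=0.1, lam=1e-3, epochs=100):
        N, d = X.shape
        Xb = np.hstack([np.ones((N, 1)), X])
        W = np.zeros((d + 1, T.shape[1]))
        for _ in range(epochs):
            Y_hat = Xb @ W
            grad = -Xb.T @ ((T - g(Y_hat)) * g_prime(Y_hat))   # dE_emp/dW
            W -= alpha * (grad + lam * W)                      # decay term lam * w_ij(t)
        return W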
12. Artificial neural networks (I)
The Delta rule
The value of λ is quite often chosen by resampling techniques.
How do we stop the process in practice?
1. When the number of iterations reaches a predetermined maximum
2. When the relative error falls below a predetermined tolerance
13. Artificial neural networks (II)
How could we obtain a model that is non-linear in the parameters (a
non-linear model)? We start again from:

    y_k(x) = g\left( \sum_{i=1}^{d} w_{ki} x_i + w_{k0} \right), \quad k = 1, \dots, m

where g is a sigmoidal function. This is a linear model.
The solution is to apply non-linear functions to the input data:
    y_k(x) = g\left( \sum_{i=0}^{h} w_{ki} \phi_i(x) \right), \quad k = 1, \dots, m

We recover the previous "linear" situation by setting h = d and \phi_i(x) = x_i,
with \phi_0(x) = 1.
14. Artificial neural networks (II)
Approach 1. Make \Phi = (\phi_0, \dots, \phi_h) a set of predefined functions. This
is perfectly illustrated in the univariate case d = 1 with polynomial fitting.
Consider the problem of fitting the function:

    p(x) = w_0 + w_1 x + \dots + w_h x^h = \sum_{i=0}^{h} w_i x^i

to the sample points x_1, \dots, x_N.

This can be seen as a special case of linear regression, where the set of
regressors is 1, x, x^2, \dots, x^h; therefore \phi_i(x) = x^i.

The weights w_0, w_1, \dots, w_h can be estimated by standard techniques
(ordinary least squares) or by the Delta rule.
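For instance, ordinary least squares over the polynomial regressors amounts to
solving a linear system on the design matrix (a numpy sketch; the toy data are ours):

    import numpy as np

    def fit_polynomial(x, t, h):
        Phi = np.vander(x, h + 1, increasing=True)    # Phi[n, i] = x_n ** i
        w, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # minimizes ||Phi w - t||^2
        return w

    x = np.linspace(-1, 1, 20)
    t = 3 * x**2 + x - 4                              # noiseless toy targets
    print(fit_polynomial(x, t, h=2))                  # -> approx [-4, 1, 3]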
15. Artificial neural networks (II)
What if we have a multivariate input x = (x_1, \dots, x_d)^T? The corresponding
h-degree polynomial is:

    p(x) = w_0 + \sum_{i_1=1}^{d} w_{i_1} x_{i_1}
               + \sum_{i_1=1}^{d} \sum_{i_2=i_1+1}^{d} w_{i_1 i_2} x_{i_1} x_{i_2}
               + \sum_{i_1=1}^{d} \sum_{i_2=i_1+1}^{d} \sum_{i_3=i_2+1}^{d} w_{i_1 i_2 i_3} x_{i_1} x_{i_2} x_{i_3} + \dots

The number of possible regressors grows as \binom{d+h}{h}!
So many regressors (while holding N fixed) cause insurmountable trouble when estimating
their parameters:

It is quite convenient (and sometimes mandatory!) to have more observations than
regressors.

Statistical significance decreases with the number of regressors and increases with
the number of observations.
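The growth is easy to appreciate numerically (a sketch using Python's math.comb):

    from math import comb

    # number of possible regressors C(d+h, h) for a few (d, h) pairs
    for d, h in [(5, 3), (10, 3), (20, 5), (50, 5)]:
        print(d, h, comb(d + h, h))
    # e.g. d = 50, h = 5 already gives 3 478 761 regressors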
16. Artificial neural networks (II)
Approach 2. Why not try to engineer adaptive regressors? By adapting
the regressors to the problem, it is reasonable to expect that we shall need
a much smaller number of them for a correct fit.

The basic neural network idea is to duplicate the model:

    y_k(x) = g\left( \sum_{i=0}^{h} w_{ki} \phi_i(x) \right), \quad k = 1, \dots, m

where \phi_i(x) = g\left( \sum_{j=0}^{d} v_{ij} x_j \right), with \phi_0(x) = 1, x_0 = 1.
17. Artificial neural networks (II)
We now have a new set of regressors \Phi(x) = (\phi_0(x), \dots, \phi_h(x))^T.

These regressors are adaptive via the v_i parameters (called the non-linear
parameters). Once the regressors are fully specified, the remaining
task is again a linear fit (via the w_k parameters).

What kind of neural network gives rise to this function? The Multilayer
Perceptron or MLP.

Under other choices for the regressors, other networks are obtained:

    \phi_i(x) = \exp\left( -\frac{\|x - \mu_i\|^2}{2\sigma_i^2} \right)

gives the Gaussian RBF network.
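For illustration, the Gaussian RBF regressors can be computed as follows (a
sketch; the array shapes are our convention):

    import numpy as np

    def rbf_features(X, centres, sigmas):
        # X: (N, d) inputs; centres mu_i: (h, d); widths sigma_i: (h,)
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)  # ||x - mu_i||^2
        return np.exp(-d2 / (2.0 * sigmas ** 2))                       # (N, h) matrix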
18. Artificial neural networks (II)
Error functions
We have a set of learning examples
S = {(x_n, t_n)}_{n=1,\dots,N}, where x_n \in R^d, t_n \in R^m (assume S is i.i.d.).

Ideally, we would like to know the statistical properties, namely p(t|x).
An error function can be derived by maximizing the likelihood of S:

    L = \prod_{n=1}^{N} p(t_n | x_n)

The different outputs are assumed to be independent, so we can write:

    p(t|x) = \prod_{k=1}^{m} p(t_k | x)

When convenient, we can also maximize a strictly monotonic function of L, namely
the log-likelihood l = \ln L.
19. Artificial neural networks (II)
Error functions for regression
We model functional ignorance as stochastic variability, placing a continuous pdf p(t|X =
x) around each point x. The optimal solution (known as the regression function) is:

    y_k^*(x) = E[t_k | x] = \int_{R} t_k \, p(t_k | x) \, dt_k

We take t_k to be a deterministic function, distorted by Gaussian "noise":
t_k = h_k(x) + \epsilon, with \epsilon \sim N(0, \sigma^2). Note \epsilon does not depend on k or x (homoscedasticity).
20. Artificial neural networks (II)
Error functions for regression
Therefore we have:

    p(\epsilon) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{\epsilon^2}{2\sigma^2} \right)

Also note that in this case the optimal function would be:

    y_k^*(x) = E[t_k | x] = E[h_k(x) + \epsilon] = E[h_k(x)] + E[\epsilon] = h_k(x)

If we rewrite t_k \sim N(h_k(x), \sigma^2), we obtain:

    p(t_k | x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(t_k - y_k(x))^2}{2\sigma^2} \right)
21. Artificial neural networks (II)
Error functions for regression
Let us try to define and minimize the negative log-likelihood as the error:

    -l = -\ln L = -\ln \prod_{n=1}^{N} p(t_n | x_n) = -\ln \prod_{n=1}^{N} \prod_{k=1}^{m} p(t_{nk} | x_n) = -\sum_{n=1}^{N} \sum_{k=1}^{m} \ln p(t_{nk} | x_n)

    = -\sum_{n=1}^{N} \sum_{k=1}^{m} \left( \ln \frac{1}{\sqrt{2\pi}\sigma} - \frac{(t_{nk} - y_k(x_n))^2}{2\sigma^2} \right) = \sum_{n=1}^{N} \sum_{k=1}^{m} \left( \frac{1}{2} \ln(2\pi\sigma^2) + \frac{(t_{nk} - y_k(x_n))^2}{2\sigma^2} \right)

    = \frac{Nm}{2} \left( \ln(2\pi) + 2 \ln \sigma \right) + \frac{1}{2\sigma^2} \sum_{n=1}^{N} \sum_{k=1}^{m} (t_{nk} - y_k(x_n))^2

The first term does not depend on the model y_k, so it is out of our control; we should
therefore minimize:

    E \equiv \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{m} (t_{nk} - y_k(x_n))^2 = \frac{1}{2} \sum_{n=1}^{N} \| t_n - y(x_n) \|^2
22. Artificial neural networks (II)
Error functions for classification
The goal in classification is to model the posterior probabilities P(\omega_k|x) for every class.
In two-class problems, we create an ANN with one output neuron (m = 1)
to represent y(x) = P(\omega_1|x); therefore 1 - y(x) = P(\omega_2|x).
Suppose we have a set of learning examples
S = {(x_n, t_n)}_{n=1,\dots,N}, where x_n \in R^d, t_n \in \{0, 1\} (assume S is i.i.d.).
We take the convention that t_n = 1 means x_n \in \omega_1 and t_n = 0 means x_n \in \omega_2, to model:

    P(t|x) = y(x)      if x \in \omega_1
             1 - y(x)  if x \in \omega_2

which can be more conveniently expressed as P(t|x) = y(x)^t (1 - y(x))^{1-t}, t \in \{0, 1\}.
23. Artificial neural networks (II)
Error functions for classification
This is a Bernoulli distribution. The likelihood is:

    L = \prod_{n=1}^{N} y(x_n)^{t_n} (1 - y(x_n))^{1 - t_n}
So which error should we use? Let us define and minimize again the
negative log-likelihood as the error:

    E \equiv -\ln L = -\sum_{n=1}^{N} \{ t_n \ln y(x_n) + (1 - t_n) \ln(1 - y(x_n)) \}

known as the cross-entropy; it can be generalized to more than two
classes.
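A minimal sketch of the two-class cross-entropy (the clipping guard against
log(0) is our addition):

    import numpy as np

    def cross_entropy(t, y, eps=1e-12):
        # t in {0, 1}; y = network output, interpreted as P(omega_1 | x)
        y = np.clip(y, eps, 1.0 - eps)
        return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

    print(cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))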
24. Artificial neural networks (III)
A gentle derivation of backpropagation
An MLP with c hidden layers is a function F : R^d \to R^m made up of components
F_1, \dots, F_m of the form:

    F_k(x) = g\left( \sum_{j=0}^{h_c} w^{(c+1)}_{kj} \phi^{(c)}_j(x) \right), \quad k = 1, \dots, m

where, for every l = 1, \dots, c, W^{(l)} = (w^{(l)}_{ji}) is the matrix of weights
connecting layers l-1 and l, h_l is the size of hidden layer l, and

    \phi^{(l)}_j(x) = g\left( \sum_{i=0}^{h_{l-1}} w^{(l)}_{ji} \phi^{(l-1)}_i(x) \right), \quad l = 1, \dots, c

with \phi^{(0)}_i(x) = x_i, \phi^{(l)}_0(x) = 1 (in particular, x_0 = 1) and h_0 = d.
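Forward propagation through such an MLP is just a repeated affine map followed
by g; a minimal sketch (the weight-matrix layout is our convention):

    import numpy as np

    def forward(x, weights, g=np.tanh):
        # weights: list of c+1 matrices; weights[l] has shape (h_{l+1}, h_l + 1),
        # with the bias weights w_j0 stored in column 0
        z = np.asarray(x, dtype=float)
        for W in weights:
            z = g(W @ np.concatenate(([1.0], z)))   # phi^(l), bias unit prepended
        return z                                    # the network output F(x)

    # toy example: d = 3 inputs, one hidden layer of 4 units, m = 2 outputs
    rng = np.random.default_rng(1)
    weights = [rng.normal(size=(4, 4)), rng.normal(size=(2, 5))]
    print(forward([0.5, -1.0, 2.0], weights))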
25. Artificial neural networks (III)
A gentle derivation of backpropagation
The goal in regression is to minimize the empirical error of the network on
the training data sample S = {(x_n, t_n)}_{n=1,\dots,N}, where x_n \in R^d, t_n \in R^m:

    E_{emp}(W) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{m} (t_{nk} - F_k(x_n))^2

where W = {W^{(1)}, \dots, W^{(c+1)}} is the set of all network weights.
26. Artificial neural networks (III)
A gentle derivation of backpropagation
Note that, if g admits a derivative everywhere, E_{emp}(W) is a differentiable
function of every weight w^{(l)}_{ji}.

If we want to apply gradient descent, we need to compute the partial
derivative of the error w.r.t. every weight, i.e. the gradient vector:

    \nabla E_{emp}(W) = \left( \frac{\partial E_{emp}(W)}{\partial w^{(l)}_{ji}} \right)_{l,j,i}

There exists a reasonably efficient algorithm for computing this gradient
vector: the backpropagation algorithm.
27. Artificial neural networks (III)
A gentle derivation of backpropagation
Consider an MLP where, for notational simplicity, we define:

    z^{(l)}_j \equiv g(a^{(l)}_j) \equiv g\left( \sum_{i} w^{(l)}_{ji} z^{(l-1)}_i \right), \quad z^{(0)}_j = x_j

Note that E_{emp} is the sum of the (independent) errors for every
input/output example (x_n, t_n):

    E_{emp}(W) = \sum_{n=1}^{N} \frac{1}{2} \sum_{k=1}^{m} (t_{nk} - F_k(x_n))^2 \equiv \sum_{n=1}^{N} E^{(n)}_{emp}(W)
28. Artificial neural networks (III)
A gentle derivation of backpropagation
Suppose we present x_n to the network and compute all the neurons'
outputs z^{(l)}_j (this is known as the forward propagation). Now,

    \frac{\partial E^{(n)}_{emp}(W)}{\partial w^{(l)}_{ji}} = \frac{\partial E^{(n)}_{emp}(W)}{\partial a^{(l)}_j} \cdot \frac{\partial a^{(l)}_j}{\partial w^{(l)}_{ji}} = \delta^{(l)}_j \cdot z^{(l-1)}_i

where we have defined \delta^{(l)}_j \equiv \frac{\partial E^{(n)}_{emp}(W)}{\partial a^{(l)}_j}.
29. Artificial neural networks (III)
A gentle derivation of backpropagation
What have we done? We have found that, in order to compute the desired
derivative \frac{\partial E^{(n)}_{emp}(W)}{\partial w^{(l)}_{ji}}, we only need to find the \delta^{(l)}_j.

Let us concentrate on an arbitrary neuron k. Suppose first that k is an
output neuron; then

    \delta^{(c+1)}_k = \frac{\partial E^{(n)}_{emp}(W)}{\partial a^{(c+1)}_k} = -g'(a^{(c+1)}_k) \cdot (t_{nk} - F_k(x_n)) = g'(a^{(c+1)}_k) \cdot (z^{(c+1)}_k - t_{nk})

where we have made use of the identity F_k(x_n) = g(a^{(c+1)}_k) = z^{(c+1)}_k.
30. Artificial neural networks (III)
A gentle derivation of backpropagation
When g is the logistic function l_\beta(z) = \frac{1}{1 + e^{-\beta z}} \in (0, 1), we obtain:

    g'(a^{(c+1)}_k) = \beta g(a^{(c+1)}_k) [1 - g(a^{(c+1)}_k)] = \beta z^{(c+1)}_k (1 - z^{(c+1)}_k)

Therefore

    \frac{\partial E^{(n)}_{emp}(W)}{\partial w^{(c+1)}_{ji}} = \beta z^{(c+1)}_j (1 - z^{(c+1)}_j) (z^{(c+1)}_j - t_{nj}) \, z^{(c)}_i
31. Artificial neural networks (III)
A gentle derivation of backpropagation
Suppose now that k is a hidden neuron, located in a layer l \in \{1, \dots, c\}:

    \delta^{(l)}_k = \frac{\partial E^{(n)}_{emp}(W)}{\partial a^{(l)}_k}
                   = \sum_{q} \frac{\partial E^{(n)}_{emp}(W)}{\partial a^{(l+1)}_q} \cdot \frac{\partial a^{(l+1)}_q}{\partial a^{(l)}_k}
                   = \sum_{q} \delta^{(l+1)}_q \cdot \frac{\partial a^{(l+1)}_q}{\partial a^{(l)}_k}

    = \sum_{q} \delta^{(l+1)}_q \cdot \frac{\partial a^{(l+1)}_q}{\partial z^{(l)}_k} \cdot \frac{\partial z^{(l)}_k}{\partial a^{(l)}_k}
    = \sum_{q} \delta^{(l+1)}_q w^{(l+1)}_{qk} g'(a^{(l)}_k)
    = g'(a^{(l)}_k) \sum_{q} \delta^{(l+1)}_q w^{(l+1)}_{qk}

Again, when g is the logistic, g'(a^{(l)}_k) = \beta g(a^{(l)}_k) [1 - g(a^{(l)}_k)] = \beta z^{(l)}_k (1 - z^{(l)}_k).
32. Artificial neural networks (III)
A gentle derivation of backpropagation
Therefore

    \frac{\partial E_{emp}(W)}{\partial w^{(l)}_{ji}} = \sum_{n=1}^{N} \frac{\partial E^{(n)}_{emp}(W)}{\partial w^{(l)}_{ji}}

and we have \nabla E_{emp}(W) = \left( \frac{\partial E_{emp}(W)}{\partial w^{(l)}_{ji}} \right)_{l,j,i}.

The updating formula for the weights is:

    w^{(l)}_{ji}(t+1) \leftarrow w^{(l)}_{ji}(t) - \alpha \left. \frac{\partial E_{emp}(W)}{\partial w^{(l)}_{ji}} \right|_{W = W(t)}
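Putting everything together, here is a minimal sketch of backpropagation for
c = 1 (one hidden layer), logistic g with \beta = 1, and per-example updates
(the batch rule above would sum these gradients over n before updating); all
names are ours:

    import numpy as np

    logistic = lambda z: 1.0 / (1.0 + np.exp(-z))

    def backprop_epoch(X, T, W1, W2, alpha=0.1):
        # W1: (h, d+1) hidden-layer weights; W2: (m, h+1) output-layer weights
        for x, t in zip(X, T):
            # forward propagation: compute all z^(l)
            z0 = np.concatenate(([1.0], x))    # input with bias unit
            z1 = logistic(W1 @ z0)             # hidden activations z^(1)
            z1b = np.concatenate(([1.0], z1))
            z2 = logistic(W2 @ z1b)            # outputs z^(2) = F(x)
            # backward pass: for the logistic (beta = 1), g'(a) = z (1 - z)
            d2 = z2 * (1.0 - z2) * (z2 - t)               # output deltas
            d1 = z1 * (1.0 - z1) * (W2[:, 1:].T @ d2)     # hidden deltas
            # gradient steps: dE/dw_ji = delta_j * z_i
            W2 -= alpha * np.outer(d2, z1b)
            W1 -= alpha * np.outer(d1, z0)
        return W1, W2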