3. Artificial neural networks (I)
The Delta rule
The simplest choice is a linear combination of the inputs:
    y(x) = \sum_{i=1}^{d} w_i x_i + w_0

What kind of network gives rise to this function?

This can be extended to multiple outputs:

    y_k(x) = \sum_{i=1}^{d} w_{ki} x_i + w_{k0}, \quad k = 1, \dots, m

And now?

Finally, let us add a non-linearity to the output:

    y_k(x) = g\left( \sum_{i=1}^{d} w_{ki} x_i + w_{k0} \right), \quad k = 1, \dots, m

And now?
4. Artificial neural networks (I)
The Delta rule
For convenience, define x = (1, x_1, \dots, x_d)^T and w_k = (w_{k0}, w_{k1}, \dots, w_{kd})^T.
We then have

    y_k(x) = g\left( \sum_{j=0}^{d} w_{kj} x_j \right), \quad 1 \le k \le m.

Define now the weight matrix W_{(d+1) \times m} by gathering all the weight vectors
by columns, and introduce the notation g[\cdot] to mean that g is applied
component-wise. The network then computes y(x) = g[W^T x].
The activation g is often sigmoidal: differentiable, with a bell-shaped first
derivative of constant sign (non-negative or non-positive) and horizontal
asymptotes at \pm\infty. Two common choices, for \beta > 0:

    (logistic)  \frac{1}{1 + e^{-\beta z}} \in (0, 1)

    (tanh)  \frac{e^{\beta z} - e^{-\beta z}}{e^{\beta z} + e^{-\beta z}} \in (-1, 1)
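To make this concrete, the computation y(x) = g[W^T x] is a one-liner in numpy;
the following is a minimal sketch (the function names and the random example are
ours, not part of the slides):

    import numpy as np

    def logistic(z, beta=1.0):
        # logistic sigmoid, values in (0, 1)
        return 1.0 / (1.0 + np.exp(-beta * z))

    def network_output(W, x, g=logistic):
        # W has shape (d+1, m); x is a raw input of shape (d,)
        x_ext = np.concatenate(([1.0], x))   # prepend the bias component x_0 = 1
        return g(W.T @ x_ext)                # y(x) = g[W^T x], component-wise

    # toy example with d = 3 inputs and m = 2 outputs
    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 2))
    print(network_output(W, np.array([0.5, -1.0, 2.0])))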
6. Artificial neural networks (I)
The Delta rule
We wish to fit this function to a set of learning examples
S = {(x_n, t_n)}_{n=1,\dots,N}, where x_n \in R^d, t_n \in R^m.

Define the (empirical) mean-square error of the network as:

    E_{emp}(W) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{m} (t_{nk} - y_k(x_n))^2
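For later reference, this error is immediate to compute once targets and outputs
are collected into arrays (a minimal sketch; names are ours):

    import numpy as np

    def empirical_error(T, Y):
        # T[n, k] = t_nk (targets), Y[n, k] = y_k(x_n) (network outputs)
        return 0.5 * np.sum((T - Y) ** 2)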
7. Artificial neural networks (I)
The Delta rule
Let f : R^r \to R be differentiable; we wish to minimize it by making changes
in its variables. The increment in each variable is proportional to the
corresponding partial derivative: x_i(t+1) = x_i(t) + \Delta x_i(t), with

    \Delta x_i(t) = -\alpha \left. \frac{\partial f}{\partial x_i} \right|_{x = x(t)}, \quad \alpha > 0, \quad i = 1, \dots, r
Illustration with r = 1. Let f(x) = 3x^2 + x - 4 and take \alpha = 0.05. We
have f'(x) = 6x + 1. Then x(0) = 1, x(1) = x(0) - \alpha f'(1) = 1 - 0.05 \cdot 7 =
0.65, x(2) = 0.65 - 0.05 \cdot 4.9 = 0.405, \dots. We find \lim_{i \to \infty} x(i) = -1/6.
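This illustration is easy to check numerically; a minimal sketch (names are ours):

    # gradient descent on f(x) = 3x^2 + x - 4, with f'(x) = 6x + 1
    def gradient_descent(df, x0, alpha=0.05, steps=100):
        x = x0
        for _ in range(steps):
            x -= alpha * df(x)        # x(t+1) = x(t) - alpha * f'(x(t))
        return x

    print(gradient_descent(lambda x: 6 * x + 1, x0=1.0))   # -> approx -1/6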
8. Artificial neural networks (I)
The Delta rule
In our case, the function to be minimized is the empirical error and the
variables are the weights W of the network:

    \Delta w_{ij}(t) = -\alpha \left. \frac{\partial E_{emp}(W)}{\partial w_{ij}} \right|_{W = W(t)}, \quad \alpha > 0, \quad i = 1, \dots, m, \; j = 0, \dots, d
We have

    \frac{\partial E_{emp}(W)}{\partial w_{ij}} = -\sum_{n=1}^{N} (t_{ni} - y_i(x_n)) \, g'(\hat{y}_{ni}) \, x_{nj}

where \delta_{ni} \equiv t_{ni} - y_i(x_n) is called the delta and \hat{y}_{ni} \equiv \sum_{j=0}^{d} w_{ij} x_{nj}
(sometimes called the net input); in other words, y_i(x_n) = g(\hat{y}_{ni}).
9. Artificial neural networks (I)
The Delta rule
Therefore

    \Delta w_{ij}(t) = \alpha \sum_{n=1}^{N} (t_{ni} - y_i(x_n)) \, g'(\hat{y}_{ni}) \, x_{nj}

evaluated at W(t), is the Delta rule (aka the \alpha-LMS learning rule).
The network represents a linear regressor whose regression coefficients
are estimated iteratively. This is among the most analyzed and most
widely applied simple learning rules.

This is a form of learning (because of the adaptation to the example
data), but it is not yet incremental: we need all the examples from
the beginning (this is sometimes referred to as a "batch" rule).
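A minimal numpy sketch of the batch Delta rule as stated above, assuming a
logistic activation (all names are ours):

    import numpy as np

    logistic = lambda z: 1.0 / (1.0 + np.exp(-z))
    logistic_prime = lambda z: logistic(z) * (1.0 - logistic(z))

    def delta_rule(X, T, g=logistic, g_prime=logistic_prime, alpha=0.1, epochs=100):
        # X: (N, d) inputs; T: (N, m) targets
        N, d = X.shape
        Xb = np.hstack([np.ones((N, 1)), X])   # prepend x_0 = 1
        W = np.zeros((d + 1, T.shape[1]))
        for _ in range(epochs):
            Y_hat = Xb @ W                     # net inputs \hat{y}_{ni}
            delta = T - g(Y_hat)               # deltas t_ni - y_i(x_n)
            W += alpha * Xb.T @ (delta * g_prime(Y_hat))   # sum over all n
        return W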
10. Artificial neural networks (I)
The Delta rule
In the "on-line" version of the rule, we begin with an arbitrary W and apply:

    \Delta w_{ij}(t) = \alpha_t (t_{ni} - y_i(x_n)) \, g'(\hat{y}_{ni}) \, x_{nj}

At each learning step t, the input vector x_n is drawn at random.

If \sum_{t \ge 0} \alpha_t = \infty and \sum_{t \ge 0} \alpha_t^2 < \infty, then W(t) converges to the global
minimum W^* asymptotically, in the mean-square sense:

    \lim_{t \to \infty} \| W(t) - W^* \|^2 = 0

One such schedule is \alpha_t = \frac{\alpha}{t+1}, with \alpha > 0.
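A sketch of the on-line variant with the schedule \alpha_t = \alpha/(t+1), under
the same assumptions as the batch version above:

    import numpy as np

    def delta_rule_online(X, T, g, g_prime, alpha=0.5, steps=10000, seed=0):
        rng = np.random.default_rng(seed)
        N, d = X.shape
        Xb = np.hstack([np.ones((N, 1)), X])
        W = np.zeros((d + 1, T.shape[1]))
        for t in range(steps):
            n = rng.integers(N)                 # draw one example at random
            y_hat = Xb[n] @ W                   # net inputs for x_n
            delta = T[n] - g(y_hat)
            alpha_t = alpha / (t + 1)           # meets the convergence conditions
            W += alpha_t * np.outer(Xb[n], delta * g_prime(y_hat))
        return W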
11. Artificial neural networks (I)
The Delta rule
The fit can be brought under tighter control using regularization:

    E^{\lambda}_{emp}(W) = E_{emp}(W) + \lambda \|W\|^2, \quad \lambda > 0, \quad \|W\|^2 = \sum_{ij} w_{ij}^2

In this context, this technique is known as weight decay, because it leads
to the new updating recipe:

    \Delta w_{ij}(t) = -\alpha \left( \left. \frac{\partial E_{emp}(W)}{\partial w_{ij}} \right|_{W = W(t)} + \lambda w_{ij}(t) \right), \quad i = 1, \dots, m, \; j = 0, \dots, d
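In code, weight decay only adds the term \lambda w_{ij}(t) to each gradient step;
a sketch of the modified batch update:

    import numpy as np

    def delta_rule_weight_decay(X, T, g, g_prime, alpha=0.1, lam=1e-3, epochs=100):
        N, d = X.shape
        Xb = np.hstack([np.ones((N, 1)), X])
        W = np.zeros((d + 1, T.shape[1]))
        for _ in range(epochs):
            Y_hat = Xb @ W
            grad = -Xb.T @ ((T - g(Y_hat)) * g_prime(Y_hat))   # dE_emp/dW
            W -= alpha * (grad + lam * W)                      # decay term lam * w_ij(t)
        return W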
12. Artificial neural networks (I)
The Delta rule
The value of λ is quite often chosen by resampling techniques.
How do we stop the process in practice?
1. When the number of iterations reaches a predetermined maximum
2. When the relative error falls below a predetermined tolerance
13. Artificial neural networks (II)
How could we obtain a model that is non-linear in the parameters (a
non-linear model)? We start again from:

    y_k(x) = g\left( \sum_{i=1}^{d} w_{ki} x_i + w_{k0} \right), \quad k = 1, \dots, m

where g is a sigmoidal function. This is a linear model.
The solution is to apply non-linear functions to the input data:
    y_k(x) = g\left( \sum_{i=0}^{h} w_{ki} \phi_i(x) \right), \quad k = 1, \dots, m

We recover the previous "linear" situation by setting h = d and \phi_i(x) = x_i,
with \phi_0(x) = 1.
14. Artificial neural networks (II)
Approach 1. Make \Phi = (\phi_0, \dots, \phi_h) a set of predefined functions. This
is perfectly illustrated in the univariate case d = 1 with polynomial fitting.
Consider the problem of fitting the function:

    p(x) = w_0 + w_1 x + \dots + w_h x^h = \sum_{i=0}^{h} w_i x^i

to the sample points x_1, \dots, x_N.

This can be seen as a special case of linear regression, where the set of
regressors is 1, x, x^2, \dots, x^h; therefore \phi_i(x) = x^i.

The weights w_0, w_1, \dots, w_h can be estimated by standard techniques
(ordinary least squares) or by the Delta rule.
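For instance, ordinary least squares over the polynomial regressors amounts to
solving a linear system on the design matrix (a numpy sketch; the toy data are ours):

    import numpy as np

    def fit_polynomial(x, t, h):
        Phi = np.vander(x, h + 1, increasing=True)    # Phi[n, i] = x_n ** i
        w, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # minimizes ||Phi w - t||^2
        return w

    x = np.linspace(-1, 1, 20)
    t = 3 * x**2 + x - 4                              # noiseless toy targets
    print(fit_polynomial(x, t, h=2))                  # -> approx [-4, 1, 3]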
15. Artificial neural networks (II)
What if we have a multivariate input x = (x_1, \dots, x_d)^T? The corresponding
h-degree polynomial is:

    p(x) = w_0 + \sum_{i_1=1}^{d} w_{i_1} x_{i_1}
               + \sum_{i_1=1}^{d} \sum_{i_2=i_1+1}^{d} w_{i_1 i_2} x_{i_1} x_{i_2}
               + \sum_{i_1=1}^{d} \sum_{i_2=i_1+1}^{d} \sum_{i_3=i_2+1}^{d} w_{i_1 i_2 i_3} x_{i_1} x_{i_2} x_{i_3} + \dots

The number of possible regressors grows as \binom{d+h}{h}!
So many regressors (while holding N fixed) cause insurmountable trouble when estimating
their parameters:

It is quite convenient (and sometimes mandatory!) to have more observations than
regressors.

Statistical significance decreases with the number of regressors and increases with
the number of observations.
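The growth is easy to appreciate numerically (a sketch using Python's math.comb):

    from math import comb

    # number of possible regressors C(d+h, h) for a few (d, h) pairs
    for d, h in [(5, 3), (10, 3), (20, 5), (50, 5)]:
        print(d, h, comb(d + h, h))
    # e.g. d = 50, h = 5 already gives 3 478 761 regressors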
16. Artificial neural networks (II)
Approach 2. Why not try to engineer adaptive regressors? By adapting
the regressors to the problem, it is reasonable to expect that we shall need
a much smaller number of them for a correct fit.

The basic neural network idea is to duplicate the model:

    y_k(x) = g\left( \sum_{i=0}^{h} w_{ki} \phi_i(x) \right), \quad k = 1, \dots, m

where \phi_i(x) = g\left( \sum_{j=0}^{d} v_{ij} x_j \right), with \phi_0(x) = 1, x_0 = 1.
17. Artificial neural networks (II)
We now have a new set of regressors \Phi(x) = (\phi_0(x), \dots, \phi_h(x))^T.

These regressors are adaptive via the v_i parameters (called the non-linear
parameters). Once the regressors are fully specified, the remaining
task is again a linear fit (via the w_k parameters).

What kind of neural network gives rise to this function? The Multilayer
Perceptron or MLP.

Under other choices for the regressors, other networks are obtained:

    \phi_i(x) = \exp\left( -\frac{\|x - \mu_i\|^2}{2\sigma_i^2} \right)

gives the Gaussian RBF network.
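For illustration, the Gaussian RBF regressors can be computed as follows (a
sketch; the array shapes are our convention):

    import numpy as np

    def rbf_features(X, centres, sigmas):
        # X: (N, d) inputs; centres mu_i: (h, d); widths sigma_i: (h,)
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)  # ||x - mu_i||^2
        return np.exp(-d2 / (2.0 * sigmas ** 2))                       # (N, h) matrix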
18. Artificial neural networks (II)
Error functions
We have a set of learning examples
S = {(x_n, t_n)}_{n=1,\dots,N}, where x_n \in R^d, t_n \in R^m (assume S is i.i.d.).

Ideally, we would like to know the statistical properties, namely p(t|x).
An error function can be derived by maximizing the likelihood of S:

    L = \prod_{n=1}^{N} p(t_n | x_n)

The different outputs are assumed to be independent, so we can write:

    p(t|x) = \prod_{k=1}^{m} p(t_k | x)

When convenient, we can also maximize a strictly monotonic function of L, namely
the log-likelihood l = \ln L.
19. Artificial neural networks (II)
Error functions for regression
We model functional ignorance as stochastic variability, placing a continuous pdf p(t|X =
x) around each point x. The optimal solution (known as the regression function) is:

    y_k^*(x) = E[t_k | x] = \int_{R} t_k \, p(t_k | x) \, dt_k

We take t_k to be a deterministic function, distorted by Gaussian "noise":
t_k = h_k(x) + \epsilon, with \epsilon \sim N(0, \sigma^2). Note \epsilon does not depend on k or x (homoscedasticity).
20. Artificial neural networks (II)
Error functions for regression
Therefore we have:

    p(\epsilon) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{\epsilon^2}{2\sigma^2} \right)

Also note that in this case the optimal function would be:

    y_k^*(x) = E[t_k | x] = E[h_k(x) + \epsilon] = E[h_k(x)] + E[\epsilon] = h_k(x)

If we rewrite t_k \sim N(h_k(x), \sigma^2), we obtain:

    p(t_k | x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(t_k - y_k(x))^2}{2\sigma^2} \right)
21. Artificial neural networks (II)
Error functions for regression
Let us try to define and minimize the negative log-likelihood as the error:

    -l = -\ln L = -\ln \prod_{n=1}^{N} p(t_n | x_n) = -\ln \prod_{n=1}^{N} \prod_{k=1}^{m} p(t_{nk} | x_n) = -\sum_{n=1}^{N} \sum_{k=1}^{m} \ln p(t_{nk} | x_n)

    = -\sum_{n=1}^{N} \sum_{k=1}^{m} \left( \ln \frac{1}{\sqrt{2\pi}\sigma} - \frac{(t_{nk} - y_k(x_n))^2}{2\sigma^2} \right) = \sum_{n=1}^{N} \sum_{k=1}^{m} \left( \frac{1}{2} \ln(2\pi\sigma^2) + \frac{(t_{nk} - y_k(x_n))^2}{2\sigma^2} \right)

    = \frac{Nm}{2} \left( \ln(2\pi) + 2 \ln \sigma \right) + \frac{1}{2\sigma^2} \sum_{n=1}^{N} \sum_{k=1}^{m} (t_{nk} - y_k(x_n))^2

The first term does not depend on the model y_k, so it is out of our control; we should
therefore minimize:

    E \equiv \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{m} (t_{nk} - y_k(x_n))^2 = \frac{1}{2} \sum_{n=1}^{N} \| t_n - y(x_n) \|^2
22. Artificial neural networks (II)
Error functions for classification
The goal in classification is to model the posterior probabilities P(\omega_k|x) for every class.
In two-class problems, we create an ANN with one output neuron (m = 1)
to represent y(x) = P(\omega_1|x); therefore 1 - y(x) = P(\omega_2|x).
Suppose we have a set of learning examples
S = {(x_n, t_n)}_{n=1,\dots,N}, where x_n \in R^d, t_n \in \{0, 1\} (assume S is i.i.d.).
We take the convention that t_n = 1 means x_n \in \omega_1 and t_n = 0 means x_n \in \omega_2, to model:

    P(t|x) = y(x)      if x \in \omega_1
             1 - y(x)  if x \in \omega_2

which can be more conveniently expressed as P(t|x) = y(x)^t (1 - y(x))^{1-t}, t \in \{0, 1\}.
23. Artificial neural networks (II)
Error functions for classification
This is a Bernoulli distribution. The likelihood is:

    L = \prod_{n=1}^{N} y(x_n)^{t_n} (1 - y(x_n))^{1 - t_n}
So which error should we use? Let us define and minimize again the
negative log-likelihood as the error:

    E \equiv -\ln L = -\sum_{n=1}^{N} \{ t_n \ln y(x_n) + (1 - t_n) \ln(1 - y(x_n)) \}

known as the cross-entropy; it can be generalized to more than two
classes.
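A minimal sketch of the two-class cross-entropy (the clipping guard against
log(0) is our addition):

    import numpy as np

    def cross_entropy(t, y, eps=1e-12):
        # t in {0, 1}; y = network output, interpreted as P(omega_1 | x)
        y = np.clip(y, eps, 1.0 - eps)
        return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

    print(cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))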
24. Artificial neural networks (III)
A gentle derivation of backpropagation
An MLP with c hidden layers is a function F : R^d \to R^m made up of components
F_1, \dots, F_m of the form:

    F_k(x) = g\left( \sum_{j=0}^{h_c} w^{(c+1)}_{kj} \phi^{(c)}_j(x) \right), \quad k = 1, \dots, m

where, for every l = 1, \dots, c, W^{(l)} = (w^{(l)}_{ji}) is the matrix of weights
connecting layers l-1 and l, h_l is the size of hidden layer l, and

    \phi^{(l)}_j(x) = g\left( \sum_{i=0}^{h_{l-1}} w^{(l)}_{ji} \phi^{(l-1)}_i(x) \right), \quad l = 1, \dots, c

with \phi^{(0)}_i(x) = x_i, \phi^{(l)}_0(x) = 1 (in particular, x_0 = 1) and h_0 = d.
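Forward propagation through such an MLP is just a repeated affine map followed
by g; a minimal sketch (the weight-matrix layout is our convention):

    import numpy as np

    def forward(x, weights, g=np.tanh):
        # weights: list of c+1 matrices; weights[l] has shape (h_{l+1}, h_l + 1),
        # with the bias weights w_j0 stored in column 0
        z = np.asarray(x, dtype=float)
        for W in weights:
            z = g(W @ np.concatenate(([1.0], z)))   # phi^(l), bias unit prepended
        return z                                    # the network output F(x)

    # toy example: d = 3 inputs, one hidden layer of 4 units, m = 2 outputs
    rng = np.random.default_rng(1)
    weights = [rng.normal(size=(4, 4)), rng.normal(size=(2, 5))]
    print(forward([0.5, -1.0, 2.0], weights))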
25. Artificial neural networks (III)
A gentle derivation of backpropagation
The goal in regression is to minimize the empirical error of the network on
the training data sample S = {(x_n, t_n)}_{n=1,\dots,N}, where x_n \in R^d, t_n \in R^m:

    E_{emp}(W) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{m} (t_{nk} - F_k(x_n))^2

where W = {W^{(1)}, \dots, W^{(c+1)}} is the set of all network weights.
26. Artificial neural networks (III)
A gentle derivation of backpropagation
Note that, if g admits a derivative everywhere, E_{emp}(W) is a differentiable
function of every weight w^{(l)}_{ji}.

If we want to apply gradient descent, we need to compute the partial
derivative of the error w.r.t. every weight, i.e. the gradient vector:

    \nabla E_{emp}(W) = \left( \frac{\partial E_{emp}(W)}{\partial w^{(l)}_{ji}} \right)_{l,j,i}

There exists a reasonably efficient algorithm for computing this gradient
vector: the backpropagation algorithm.
27. Artificial neural networks (III)
A gentle derivation of backpropagation
Consider an MLP where, for notational simplicity, we define:

    z^{(l)}_j \equiv g(a^{(l)}_j) \equiv g\left( \sum_{i} w^{(l)}_{ji} z^{(l-1)}_i \right), \quad z^{(0)}_j = x_j

Note that E_{emp} is the sum of the (independent) errors for every
input/output example (x_n, t_n):

    E_{emp}(W) = \sum_{n=1}^{N} \frac{1}{2} \sum_{k=1}^{m} (t_{nk} - F_k(x_n))^2 \equiv \sum_{n=1}^{N} E^{(n)}_{emp}(W)
28. Artificial neural networks (III)
A gentle derivation of backpropagation
Suppose we present x_n to the network and compute all the neurons'
outputs z^{(l)}_j (this is known as the forward propagation). Now,

    \frac{\partial E^{(n)}_{emp}(W)}{\partial w^{(l)}_{ji}} = \frac{\partial E^{(n)}_{emp}(W)}{\partial a^{(l)}_j} \cdot \frac{\partial a^{(l)}_j}{\partial w^{(l)}_{ji}} = \delta^{(l)}_j \cdot z^{(l-1)}_i

where we have defined \delta^{(l)}_j \equiv \frac{\partial E^{(n)}_{emp}(W)}{\partial a^{(l)}_j}.
29. Artificial neural networks (III)
A gentle derivation of backpropagation
What have we done? We have found that, in order to compute the desired
derivative \frac{\partial E^{(n)}_{emp}(W)}{\partial w^{(l)}_{ji}}, we only need to find the \delta^{(l)}_j.

Let us concentrate on an arbitrary neuron k. Suppose first that k is an
output neuron; then

    \delta^{(c+1)}_k = \frac{\partial E^{(n)}_{emp}(W)}{\partial a^{(c+1)}_k} = -g'(a^{(c+1)}_k) \cdot (t_{nk} - F_k(x_n)) = g'(a^{(c+1)}_k) \cdot (z^{(c+1)}_k - t_{nk})

where we have made use of the identity F_k(x_n) = g(a^{(c+1)}_k) = z^{(c+1)}_k.
30. Artificial neural networks (III)
A gentle derivation of backpropagation
When g is the logistic function l_\beta(z) = \frac{1}{1 + e^{-\beta z}} \in (0, 1), we obtain:

    g'(a^{(c+1)}_k) = \beta g(a^{(c+1)}_k) [1 - g(a^{(c+1)}_k)] = \beta z^{(c+1)}_k (1 - z^{(c+1)}_k)

Therefore

    \frac{\partial E^{(n)}_{emp}(W)}{\partial w^{(c+1)}_{ji}} = \beta z^{(c+1)}_j (1 - z^{(c+1)}_j) (z^{(c+1)}_j - t_{nj}) \, z^{(c)}_i
31. Artificial neural networks (III)
A gentle derivation of backpropagation
Suppose now that k is a hidden neuron, located in a layer l \in \{1, \dots, c\}:

    \delta^{(l)}_k = \frac{\partial E^{(n)}_{emp}(W)}{\partial a^{(l)}_k}
                   = \sum_{q} \frac{\partial E^{(n)}_{emp}(W)}{\partial a^{(l+1)}_q} \cdot \frac{\partial a^{(l+1)}_q}{\partial a^{(l)}_k}
                   = \sum_{q} \delta^{(l+1)}_q \cdot \frac{\partial a^{(l+1)}_q}{\partial a^{(l)}_k}

    = \sum_{q} \delta^{(l+1)}_q \cdot \frac{\partial a^{(l+1)}_q}{\partial z^{(l)}_k} \cdot \frac{\partial z^{(l)}_k}{\partial a^{(l)}_k}
    = \sum_{q} \delta^{(l+1)}_q w^{(l+1)}_{qk} g'(a^{(l)}_k)
    = g'(a^{(l)}_k) \sum_{q} \delta^{(l+1)}_q w^{(l+1)}_{qk}

Again, when g is the logistic, g'(a^{(l)}_k) = \beta g(a^{(l)}_k) [1 - g(a^{(l)}_k)] = \beta z^{(l)}_k (1 - z^{(l)}_k).
32. Artificial neural networks (III)
A gentle derivation of backpropagation
Therefore

    \frac{\partial E_{emp}(W)}{\partial w^{(l)}_{ji}} = \sum_{n=1}^{N} \frac{\partial E^{(n)}_{emp}(W)}{\partial w^{(l)}_{ji}}

and we have \nabla E_{emp}(W) = \left( \frac{\partial E_{emp}(W)}{\partial w^{(l)}_{ji}} \right)_{l,j,i}.

The updating formula for the weights is:

    w^{(l)}_{ji}(t+1) \leftarrow w^{(l)}_{ji}(t) - \alpha \left. \frac{\partial E_{emp}(W)}{\partial w^{(l)}_{ji}} \right|_{W = W(t)}
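Putting everything together, here is a minimal sketch of backpropagation for
c = 1 (one hidden layer), logistic g with \beta = 1, and per-example updates
(the batch rule above would sum these gradients over n before updating); all
names are ours:

    import numpy as np

    logistic = lambda z: 1.0 / (1.0 + np.exp(-z))

    def backprop_epoch(X, T, W1, W2, alpha=0.1):
        # W1: (h, d+1) hidden-layer weights; W2: (m, h+1) output-layer weights
        for x, t in zip(X, T):
            # forward propagation: compute all z^(l)
            z0 = np.concatenate(([1.0], x))    # input with bias unit
            z1 = logistic(W1 @ z0)             # hidden activations z^(1)
            z1b = np.concatenate(([1.0], z1))
            z2 = logistic(W2 @ z1b)            # outputs z^(2) = F(x)
            # backward pass: for the logistic (beta = 1), g'(a) = z (1 - z)
            d2 = z2 * (1.0 - z2) * (z2 - t)               # output deltas
            d1 = z1 * (1.0 - z1) * (W2[:, 1:].T @ d2)     # hidden deltas
            # gradient steps: dE/dw_ji = delta_j * z_i
            W2 -= alpha * np.outer(d2, z1b)
            W1 -= alpha * np.outer(d1, z0)
        return W1, W2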