Two algorithms to accelerate training of
back-propagation neural networks
Vincent Vanhoucke
May 23, 1999
http://www.stanford.edu/ nouk/
Abstract
This project is aimed at developing techniques to initialize the weights of multi-
layer back-propagation neural networks, in order to speed up the rate of convergence.
Two algorithms are proposed for a general class of networks.
1 Conventions and notations
Figure 1: General notations: L-layer neural network
In this paper we restrict ourselves to back-propagation networks with L layers and N nodes per layer [1]. This restriction allows us to derive the behaviour of the network from matrix equations (see footnote 1).
Symbol        Meaning
$L$           Number of layers
$N$           Number of nodes per layer
$T$           Size of the training set
$\eta$        Backpropagation coefficient
$\phi()$      Activation function
$x^k_{i,j}$   $j$th input of the $k$th layer for the $i$th training vector
$y_{i,j}$     Output of the $j$th node for the $i$th training vector
$d_{i,j}$     Desired output of the $j$th node for the $i$th training vector
$w^k_{i,j}$   $i$th weight of the $j$th node of the $k$th layer
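With these conventions, we can write:

\[
X_i = \begin{pmatrix}
x^i_{1,1} & \cdots & x^i_{1,N} \\
\vdots & \ddots & \vdots \\
x^i_{T,1} & \cdots & x^i_{T,N}
\end{pmatrix}
\;(i \in [1, L])
\qquad
W_i = \begin{pmatrix}
w^i_{1,1} & \cdots & w^i_{1,N} \\
\vdots & \ddots & \vdots \\
w^i_{N,1} & \cdots & w^i_{N,N}
\end{pmatrix}
\;(i \in [1, L])
\tag{1}
\]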
Footnote 1: Extension to a wider variety of neural nets has not been studied yet.
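\[
Y = \begin{pmatrix}
y_{1,1} & \cdots & y_{1,N} \\
\vdots & \ddots & \vdots \\
y_{T,1} & \cdots & y_{T,N}
\end{pmatrix}
\qquad
D = \begin{pmatrix}
d_{1,1} & \cdots & d_{1,N} \\
\vdots & \ddots & \vdots \\
d_{T,1} & \cdots & d_{T,N}
\end{pmatrix}
\tag{2}
\]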
These notations allow us to express the forward propagation of the whole training set compactly:

\[
\begin{aligned}
X_2 &= \phi(X_1 W_1) \\
&\;\;\vdots \\
X_{i+1} &= \phi(X_i W_i) \\
&\;\;\vdots \\
Y &= \phi(X_L W_L)
\end{aligned}
\tag{3}
\]
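As an illustration of equation 3, the recursion can be sketched in a few lines of NumPy. This is only a minimal sketch: the function names `phi` and `forward`, the random seed, and the choice of three layers are assumptions made for the example; the activation tanh(4x) and the uniform random values between 0 and 1 follow the experimental protocol given later in the paper.

```python
import numpy as np

def phi(z, gain=4.0):
    """Sigmoid activation; tanh(gain * z) with gain 4 is the form used in the experiments."""
    return np.tanh(gain * z)

def forward(X1, weights, activation=phi):
    """Apply X_{i+1} = phi(X_i W_i) for every layer and return all activations.

    X1      : (T, N) array, one training vector per row
    weights : list of L arrays of shape (N, N)
    returns : list [X_1, X_2, ..., X_L, Y] of (T, N) arrays
    """
    activations = [X1]
    for W in weights:
        activations.append(activation(activations[-1] @ W))
    return activations

# Hypothetical sizes and seed, chosen only for the example.
rng = np.random.default_rng(0)
T, N, L = 4, 4, 3
X1 = rng.uniform(0.0, 1.0, size=(T, N))
weights = [rng.uniform(0.0, 1.0, size=(N, N)) for _ in range(L)]
activations = forward(X1, weights)
Y = activations[-1]
print(Y.shape)
```

Because every training vector is a row of $X_1$, one matrix product per layer propagates the whole training set at once.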
2 Mathematical introduction
Consider the recursion in equation 3. Successive application of the sigmoid function $\phi()$ drives the input asymptotically to the fixed points of $\phi()$. This statement is rigorously true when no weighting is done ($W_i = I$), but holds statistically for a wider class of matrices.

Figure 2: Two kinds of sigmoids. Left: $\phi'(0) > 1$. Right: $\phi'(0) \le 1$.

Figure 2 shows two different behaviors, depending on the slope of the sigmoid at the origin:

\[
\phi'(0) > 1: \quad x \to
\begin{cases}
1 - \epsilon & (x > 0) \\
0 & (x = 0) \\
-1 + \epsilon & (x < 0)
\end{cases}
\qquad\qquad
\phi'(0) \le 1: \quad x \to 0
\tag{4}
\]
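The two behaviors of equation 4 can be checked numerically by iterating the sigmoid with no weighting ($W_i = I$). The sketch below is illustrative only: the starting points, the number of iterations, and the names `iterate`, `steep`, and `shallow` are assumptions; tanh(4x) and tanh(x/4) are the two sigmoids used in the experiments.

```python
import numpy as np

def iterate(phi, x0, steps=200):
    """Apply x <- phi(x) repeatedly, i.e. the recursion of equation 3 with W_i = I."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = phi(x)
    return x

def steep(x):
    return np.tanh(4.0 * x)   # slope 4 at the origin, so phi'(0) > 1

def shallow(x):
    return np.tanh(x / 4.0)   # slope 1/4 at the origin, so phi'(0) <= 1

x0 = np.array([-0.7, -0.01, 0.0, 0.01, 0.7])
limit_steep = iterate(steep, x0)
limit_shallow = iterate(shallow, x0)
alpha = limit_steep[-1]       # positive fixed point, written 1 - epsilon in the text

print(limit_steep)    # negative inputs go to -alpha, zero stays at zero, positive inputs go to +alpha
print(limit_shallow)  # every input goes to zero
```

Even a very small non-zero input ends up at $\pm\alpha$ with the steep sigmoid, which illustrates the instability of the origin.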
Conclusions:
1. If we fix the initial weights of the network randomly, we are statistically more likely to get an output, before adaptation of the weights, close to these convergence points. As a consequence, we might be able to transform the objectives $D$ of the neural network so that their coefficients are close to these convergence points, and thereby achieve a lower initial distortion.
2. In the case $\phi'(0) > 1$, there are three convergence points, depending on the sign of the input. Note the instability of the origin: for an output to be zero, the input has to be exactly zero. We will focus on this constraint later.
3. In the case $\phi'(0) \le 1$, the nodes drive every input toward zero. There is no natural clustering of the inputs induced by the sigmoid. We do not consider this case in the following discussion; section 3.3 shows that our algorithm performs poorly in that case.
3 Algorithm I: Initialization by Forward Propagated Orthogonalization
3.1 Case N = T
The case N = T, where the number of nodes equals the size of the training set, is not likely
to be encountered in practice. However, it allows us to derive an exact solution that holds
for that case, and extend it to a set of approximate solutions for the more general case.
3.1.1 Exact solution
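If $N = T$, all the matrices are square, and thus we can write (see footnote 2):

Case $L = 1$:
\[
D = \phi(X_1 W_1) \;\Rightarrow\; W_1 = X_1^{-1}\,\phi^{-1}(D)
\tag{5}
\]

Case $L > 1$: let $\alpha = 1 - \epsilon$ be the positive fixed point of $\phi()$ (see footnote 3).
\[
\begin{aligned}
W_1 = \alpha X_1^{-1} &\;\Rightarrow\; X_2 = \alpha I \\
&\;\;\vdots \\
W_i = \alpha I &\;\Rightarrow\; X_{i+1} = \alpha I \\
&\;\;\vdots \\
W_L = \alpha^{-1}\phi^{-1}(D) &\;\Rightarrow\; Y = D
\end{aligned}
\tag{6}
\]

With this set of weights, the network reaches the exact solution without any adaptation.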
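The weights of equations 5 and 6 can be tried out numerically. The following is a minimal sketch, assuming $\phi(x) = \tanh(4x)$ as in the experiments, a random invertible $X_1$, and desired outputs drawn inside (0.05, 0.95) so that $\phi^{-1}(D)$ stays finite; the helper names and the way $\alpha$ is computed (by iterating $\phi$) are choices made for this example. One point worth noting: with intermediate weights $\alpha I$, the intermediate activations are $\phi(\alpha^2) I$, which is very close to but not exactly $\alpha I$, so for more than two layers the sketch checks $Y \approx D$ within a tolerance rather than exact equality.

```python
import numpy as np

GAIN = 4.0

def phi(z):
    return np.tanh(GAIN * z)

def phi_inv(y):
    return np.arctanh(y) / GAIN

def positive_fixed_point(steps=200):
    """Iterate x <- phi(x) from x = 1 to approximate alpha = phi(alpha) > 0."""
    x = 1.0
    for _ in range(steps):
        x = phi(x)
    return x

def exact_init(X1, D, L):
    """Weights of equation 5 (L = 1) or equation 6 (L > 1) for a square, invertible X1."""
    X1_inv = np.linalg.inv(X1)
    if L == 1:
        return [X1_inv @ phi_inv(D)]
    alpha = positive_fixed_point()
    N = X1.shape[0]
    middle = [alpha * np.eye(N) for _ in range(L - 2)]
    return [alpha * X1_inv] + middle + [phi_inv(D) / alpha]

def forward(X, weights):
    for W in weights:
        X = phi(X @ W)
    return X

rng = np.random.default_rng(0)
N = 4
X1 = rng.uniform(0.0, 1.0, size=(N, N))
D = rng.uniform(0.05, 0.95, size=(N, N))   # kept strictly inside the range of tanh

errors = {}
for L in (1, 2, 3, 6):
    Y = forward(X1, exact_init(X1, D, L))
    errors[L] = np.max(np.abs(Y - D))
    print(L, errors[L])
```

For one and two layers the output matches $D$ up to rounding; for deeper networks the residual comes only from the small gap between $\phi(\alpha^2)$ and $\alpha$.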
3.1.2 Low complexity asymptotically optimal solution
We might not be able, or willing, to invert the matrix $X_1$. In that case, we can derive an approximate solution by letting $W_n = \alpha^{-1}\phi^{-1}(D)$, which implies $X_n = \alpha I$.
Footnote 2: Subject to $X_1$ being invertible.
Footnote 3: This excludes the case $\phi'(0) \le 1$.
Recall that in section 2 we concluded that we might be able to take advantage of a change in the objective $D$, so that its coefficients are all fixed points of the sigmoid. Here we achieve this by letting the new goal be $X_n = \alpha I$.
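Procedure:

• Step 1: write the rows of $X_1$ as $x_1^1, \dots, x_1^N$ and choose the columns of $W_1$ so that the product $X_1 W_1$ has the form shown on the right:

\[
W_1 = \begin{pmatrix} \tilde{y}_1 & \tilde{x}_1^1 & \cdots & \tilde{x}_1^i & \cdots & \tilde{x}_1^{N-1} \end{pmatrix},
\quad
X_1 = \begin{pmatrix} x_1^1 \\ \vdots \\ x_1^i \\ \vdots \\ x_1^N \end{pmatrix},
\quad
X_1 W_1 = \begin{pmatrix}
\alpha & 0 & 0 & \cdots & 0 \\
0 & * & * & \cdots & * \\
* & * & * & \cdots & * \\
\vdots & \vdots & \vdots & & \vdots
\end{pmatrix}
\tag{7}
\]

With:
\[
\begin{aligned}
\tilde{x}_1^i \text{ given by: } & \forall i \in [1, N-1], \;\; \langle x_1^1 \mid \tilde{x}_1^i \rangle = 0 \\
\tilde{y}_1 \text{ given by: } & \langle x_1^2 \mid \tilde{y}_1 \rangle = 0 \;\text{ and }\; \langle x_1^1 \mid \tilde{y}_1 \rangle = \alpha
\end{aligned}
\tag{8}
\]

Note on the $\tilde{x}_1^i$: they do not all need to be distinct. $\operatorname{rank}\{\tilde{x}_1^i,\, i \in [1, N-1]\} > 1$ is a necessary condition, as shown in step 2, but there might be further requirements to ensure that the procedure works in all cases.

Note on $\tilde{y}_1$: the second constraint can be weakened to $\langle x_1^1 \mid \tilde{y}_1 \rangle > 0$. The remarks of section 2 apply, ensuring that the coefficient will naturally converge to $\alpha$.

• Step 2: the first row and column are left untouched, and the same construction is applied to the remaining entries, with $x_2^1, x_2^2, \dots$ denoting the remaining parts of rows 2, 3, ... of $X_2$:

\[
W_2 = \begin{pmatrix}
1 & 0 & 0 & \cdots & 0 \\
\tilde{x}_2^1 & \tilde{y}_2 & \tilde{x}_2^2 & \cdots & \tilde{x}_2^{N-1}
\end{pmatrix},
\quad
X_2 = \begin{pmatrix}
\alpha & 0 \\
0 & x_2^1 \\
* & x_2^2 \\
\vdots & \vdots \\
* & x_2^{N-1}
\end{pmatrix},
\quad
X_2 W_2 = \begin{pmatrix}
\alpha & 0 & 0 & \cdots & 0 \\
0 & \alpha & 0 & \cdots & 0 \\
* & 0 & * & \cdots & * \\
\vdots & \vdots & \vdots & & \vdots
\end{pmatrix}
\tag{9}
\]

With:
\[
\begin{aligned}
\tilde{x}_2^i \text{ given by: } & \forall i \in [1, N-1], \;\; \langle x_2^1 \mid \tilde{x}_2^i \rangle = 0 \\
\tilde{y}_2 \text{ given by: } & \langle x_2^2 \mid \tilde{y}_2 \rangle = 0 \;\text{ and }\; \langle x_2^1 \mid \tilde{y}_2 \rangle = \alpha
\end{aligned}
\tag{10}
\]

Note on $\tilde{y}_2$: this can never be achieved if $\operatorname{rank}\{\tilde{x}_1^i,\, i \in [1, N-1]\} = 1$.

• Step 3:

\[
W_3 = \begin{pmatrix}
1 & 0 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & 0 & \cdots & 0 \\
\tilde{x}_3^1 & \tilde{x}_3^2 & \tilde{y}_3 & \tilde{x}_3^3 & \cdots & \tilde{x}_3^{N-1}
\end{pmatrix},
\quad
X_3 = \begin{pmatrix}
\alpha & 0 & 0 \\
0 & \alpha & 0 \\
\beta & 0 & x_3^1 \\
* & * & x_3^2 \\
\vdots & \vdots & \vdots
\end{pmatrix},
\quad
X_3 W_3 = \begin{pmatrix}
\alpha & 0 & 0 & 0 & \cdots & 0 \\
0 & \alpha & 0 & 0 & \cdots & 0 \\
0 & 0 & \alpha & 0 & \cdots & 0 \\
* & * & 0 & * & \cdots & * \\
\vdots & \vdots & \vdots & \vdots & & \vdots
\end{pmatrix}
\tag{11}
\]

With:
\[
\begin{aligned}
\tilde{x}_3^1 \text{ given by: } & \langle [\beta \;\; x_3^1] \mid [1 \;\; \tilde{x}_3^1] \rangle = 0 \\
\tilde{x}_3^i \;(i > 1) \text{ given by: } & \forall i \in [2, N-1], \;\; \langle x_3^1 \mid \tilde{x}_3^i \rangle = 0 \\
\tilde{y}_3 \text{ given by: } & \langle x_3^2 \mid \tilde{y}_3 \rangle = 0 \;\text{ and }\; \langle x_3^1 \mid \tilde{y}_3 \rangle = \alpha
\end{aligned}
\tag{12}
\]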
• And so on.

This procedure diagonalizes the matrix step by step, and is fully completed in $N$ steps. We will see later that $N$ layers are not required for the algorithm to perform well: even a partial diagonalization reduces the error significantly.
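To make the first step of the procedure concrete, the sketch below builds one possible $W_1$ satisfying the constraints of equation 8. The constraints leave a lot of freedom, so the specific choices here are assumptions made for the example and are not prescribed by the text: $\tilde{y}_1$ is taken as the minimum-norm solution of its two linear constraints, the $\tilde{x}_1^i$ are random vectors projected onto the orthogonal complement of $x_1^1$, and the value of `alpha` is an approximate fixed point of tanh(4x).

```python
import numpy as np

def step_one_weights(X1, alpha, rng):
    """Build W_1 = [y~ | x~^1 ... x~^(N-1)] satisfying the constraints of equation 8."""
    N = X1.shape[1]
    x1, x2 = X1[0], X1[1]          # first two rows of X_1

    # y~: minimum-norm solution of <x1|y> = alpha and <x2|y> = 0.
    A = np.vstack([x1, x2])
    y = np.linalg.pinv(A) @ np.array([alpha, 0.0])

    # x~^i: random vectors with their component along x1 removed,
    # so that <x1|x~^i> = 0 for every i.
    R = rng.normal(size=(N, N - 1))
    R -= np.outer(x1, x1 @ R) / (x1 @ x1)

    return np.column_stack([y, R])

rng = np.random.default_rng(0)
N, alpha = 4, 0.9993               # alpha: approximate positive fixed point of tanh(4x)
X1 = rng.uniform(0.0, 1.0, size=(N, N))
W1 = step_one_weights(X1, alpha, rng)
product = X1 @ W1
print(np.round(product, 3))        # first row (alpha, 0, 0, 0); first entry of second row 0
```

Drawing the $\tilde{x}_1^i$ at random makes them span more than one direction, which is the rank condition required for step 2 to be feasible.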
3.2 Experiments: Case N = 4
3.2.1 Algorithm I applied to a 4-node, L-layer network, N = T
See figure 3. The detailed experimental protocol is:

Parameter                         Value
Number of layers                  1 to 6
Number of nodes per layer         4
Size of training set              4
Sigmoid                           $\phi(x) = \tanh(4x)$
Adaptation rule                   Back propagation
Error measure                     MSE averaged over all training sequences
Input vectors                     Random coefficients between 0 and 1
Desired output                    Random coefficients between 0 and 1
Reference                         Random weights between 0 and 1
Number of training iterations     200
Number of experiments averaged    100
The adaptation time is greatly reduced by the algorithm: the initial transient phase, in which the network searches for a stable configuration, is removed. However, in this case, a perfect fit without adaptation can be achieved using the inversion procedure described in section 3.1.1.
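For readers who want to reproduce the random-weights reference curve, a baseline along the lines of the protocol above could look like the following sketch. Several elements are assumptions, since the text does not specify them: the step size `eta`, the use of full-batch updates, the constant factor of the MSE gradient being absorbed into `eta`, and the function name `train`. The scaling of the MSE reported in the figures and the averaging over 100 experiments are not reproduced here.

```python
import numpy as np

GAIN = 4.0  # phi(x) = tanh(4x), as in the protocol

def train(X1, D, weights, eta=0.01, iterations=200):
    """Full-batch back-propagation; returns the final weights and the MSE before each update."""
    weights = [W.copy() for W in weights]
    history = []
    for _ in range(iterations):
        # Forward pass, keeping the activations of every layer.
        acts = [X1]
        for W in weights:
            acts.append(np.tanh(GAIN * (acts[-1] @ W)))
        error = acts[-1] - D
        history.append(np.mean(error ** 2))

        # Backward pass: d tanh(GAIN z)/dz = GAIN * (1 - tanh(GAIN z)^2).
        delta = error * GAIN * (1.0 - acts[-1] ** 2)
        for k in reversed(range(len(weights))):
            grad = acts[k].T @ delta
            if k > 0:
                # Propagate through the not-yet-updated weights of layer k.
                delta = (delta @ weights[k].T) * GAIN * (1.0 - acts[k] ** 2)
            weights[k] -= eta * grad
    return weights, history

rng = np.random.default_rng(0)
N = T = 4
X1 = rng.uniform(0.0, 1.0, size=(T, N))
D = rng.uniform(0.0, 1.0, size=(T, N))
W0 = [rng.uniform(0.0, 1.0, size=(N, N)) for _ in range(3)]
trained, mse = train(X1, D, W0)
print(mse[0], mse[-1])
```

Passing a different list of initial weights as `weights` (for instance weights produced by an initialization procedure) allows the same loop to be used for comparison against the random reference.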
Figure 3: MSE for various numbers of layers in the network. Six panels (one to six layers), each plotting MSE against iterations (0 to 200) for random weights and Algorithm I.
3.2.2 Case N < T
When the training set is larger than $N$, one solution is to adapt the initial weights to a subset of size $N$ of the training set, using one of the two techniques described above.
The results shown in figure 4 demonstrate that, in that case, Algorithm I performs best. The inversion procedure 'overfits' the reduced portion of the training set, and thus the adaptation time required to train the other elements of the training set is much higher.
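The overfitting effect of the inversion procedure can be illustrated on a single layer. In the sketch below, the weights are fitted exactly on the first $N$ training vectors using equation 5, and the initial error is then measured separately on the fitted half and on the remaining half. The choice of the first $N$ rows as the subset, the single-layer setting, the seed, and the restriction of the desired outputs to (0.05, 0.95) are all assumptions made for the example.

```python
import numpy as np

GAIN = 4.0

def phi(z):
    return np.tanh(GAIN * z)

def phi_inv(y):
    return np.arctanh(y) / GAIN

rng = np.random.default_rng(0)
N, T = 4, 8                                   # doubled training set
X1 = rng.uniform(0.0, 1.0, size=(T, N))
D = rng.uniform(0.05, 0.95, size=(T, N))      # kept strictly inside the range of tanh

# Fit a single layer exactly on the first N training vectors (equation 5).
subset = slice(0, N)
W1 = np.linalg.inv(X1[subset]) @ phi_inv(D[subset])

Y = phi(X1 @ W1)
mse_seen = np.mean((Y[subset] - D[subset]) ** 2)
mse_unseen = np.mean((Y[N:] - D[N:]) ** 2)
print(mse_seen, mse_unseen)
```

The fitted half is reproduced up to rounding, while nothing constrains the output on the other half, which is where the extra adaptation time comes from.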
Figure 4: MSE for a doubled size of the training set. MSE against iterations (0 to 200) for random weights, the inversion procedure, and Algorithm I.

Figure 5: MSE averaged over 200 iterations. Averaged MSE (scale $\times 10^5$) against the size of the training set (4 to 20) for random weights and Algorithm I.
3.3 Case $\phi'(0) \le 1$
Theoretically, the procedure described should work poorly if $\phi'(0) \le 1$ (see section 2). Figure 6 shows that this is indeed the case.
Figure 6: MSE with $\phi(x) = \tanh(x/4)$. Panel title: "Use of different phi". MSE against iterations (0 to 200) for random weights and Algorithm I, each with $\phi'(0) > 1$ and with $\phi'(0) < 1$.
4 Algorithm II: Initialization by Joint Diagonalization
4.1 Description
In the description of the exact procedure in section 3.1.1, the key step was to get:
\[
X_2 = \alpha I \quad\text{and}\quad X_L = \alpha I
\tag{13}
\]
Now considering a larger size of the training set, we are forced to weaken these constraints
in order to be able to work with non-square matrices.
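Let us assume that the size of the training set is $T = kN$, a multiple of the number of nodes in each layer. $X_i$ and $D$ can then be decomposed into square blocks:
\[
X_i = \begin{pmatrix} X_i^1 \\ \vdots \\ X_i^k \end{pmatrix}
\quad\text{and}\quad
D = \begin{pmatrix} D^1 \\ \vdots \\ D^k \end{pmatrix}
\tag{14}
\]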
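In code, the block decomposition of equation 14 amounts to a reshape. The sketch below is illustrative; the function name `square_blocks` and the example sizes are assumptions.

```python
import numpy as np

def square_blocks(M, N):
    """Split a (k*N, N) matrix into k stacked (N, N) blocks, as in equation 14."""
    T, cols = M.shape
    if cols != N or T % N != 0:
        raise ValueError("expected a (k*N, N) matrix")
    return M.reshape(T // N, N, N)

N, k = 4, 5
X1 = np.arange(k * N * N, dtype=float).reshape(k * N, N)
blocks = square_blocks(X1, N)
print(blocks.shape)   # (5, 4, 4)
```

Block `blocks[j]` holds training vectors `j*N` to `(j+1)*N - 1`, so the order of the training set determines which vectors share a block.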
We can achieve a fairly good separation of the elements of the training set by transforming the constraints in equation 13 into:
\[
X_2 = X_L = \begin{pmatrix} \alpha I \\ \vdots \\ \alpha I \end{pmatrix}
\tag{15}
\]
If we can meet these constraints, each separated bin will contain at most $k$ elements.
Using equation 3, we can see that meeting these constraints is equivalent to being able to invert $X_1^1, \dots, X_1^k$ with a common inverse $W_1$, as well as inverting $D^1, \dots, D^k$ with $W_L$. This cannot be done strictly, but by weakening equality 15 one step further, we might be able to achieve joint diagonalization. If $X_2$ and $X_L$ are diagonal by blocks, the remarks made in section 2 apply and the blocks are good approximations of the objectives $\alpha I$.
The joint diagonalization theorem states that:

Given $A_1, \dots, A_n$, a set of normal matrices which commute, there exists an orthogonal matrix $P$ such that $P^\top A_1 P, \dots, P^\top A_n P$ are all diagonal.

If the matrices are not normal, and/or do not commute, we can achieve a close result by minimizing the off-diagonal terms of the set of matrices, as shown in [2]. With:
\[
\operatorname{off}(A) = \sum_{1 \le i \ne j \le N} |a_{i,j}|^2
\tag{16}
\]
the minimization criterion (see footnote 4) is:
\[
P = \underset{P \,:\, P^\top P = I}{\arg\min} \; \sum_{i=1}^{k} \operatorname{off}\!\left(P^\top A_i P\right)
\tag{17}
\]
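The quantities in equations 16 and 17 are simple to evaluate. The sketch below only computes the criterion for a given $P$; it does not perform the minimization, for which the Jacobi-angle method of [2] would be needed. The test matrices are an assumption made for the example: symmetric matrices sharing an orthogonal eigenbasis `Q`, which therefore commute and are exactly jointly diagonalizable.

```python
import numpy as np

def off(A):
    """Sum of squared moduli of the off-diagonal entries (equation 16)."""
    A = np.asarray(A)
    return np.sum(np.abs(A) ** 2) - np.sum(np.abs(np.diag(A)) ** 2)

def joint_criterion(P, matrices):
    """Objective of equation 17: sum of off(P^T A_i P) over the set of matrices."""
    return sum(off(P.T @ A @ P) for A in matrices)

# Commuting symmetric matrices built from a shared orthogonal eigenbasis Q.
rng = np.random.default_rng(0)
N, k = 4, 5
Q, _ = np.linalg.qr(rng.normal(size=(N, N)))
matrices = [Q @ np.diag(rng.uniform(size=N)) @ Q.T for _ in range(k)]

print(joint_criterion(np.eye(N), matrices))  # without rotation: generally far from zero
print(joint_criterion(Q, matrices))          # with the shared eigenbasis: numerically zero
```

When the matrices do not commute, no orthogonal $P$ brings the criterion to zero, and the minimizer of equation 17 is only the best compromise over the whole set.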
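As in equation 3, the objective is to diagonalize:
\[
\phi^{-1}(X_2^i) = X_1^i W_1
\quad\text{and}\quad
X_L^i = \phi^{-1}(D^i)\, W_L^{-1},
\qquad i \in [1, k]
\tag{18}
\]
Which can be expressed formally as:
\[
\underbrace{\Delta_i}_{\phi^{-1}(X_2^i)} = \underbrace{P^\top}_{W_1}\,\underbrace{A_i P}_{X_1^i}
\quad\text{and}\quad
\underbrace{\Delta_i}_{X_L^i} = P^\top \underbrace{A_i}_{\phi^{-1}(D^i)}\,\underbrace{P}_{W_L}
\tag{19}
\]
As a consequence, provided we are willing to change the inputs $X_1$ and the desired outputs $D$ by bijective transforms, $X_1 \to X_1 P$ and $D \to \phi(P\,\phi^{-1}(D))$, then setting $W_1 = P^\top$ and $W_L = P$ provides a good approximation of the ideal case. Here $P$ stands for the joint diagonalizer of the relevant set of blocks: the $X_1^i$ on the input side, and the $\phi^{-1}(D^i)$ on the output side.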
Footnote 4: A Matlab function performing this operation is given at the URL in [3].
4.2 Algorithm
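Decompose the inputs and the desired outputs into square blocks:
\[
X_i = \begin{pmatrix} X_i^1 \\ \vdots \\ X_i^k \end{pmatrix}
\quad\text{and}\quad
D = \begin{pmatrix} D^1 \\ \vdots \\ D^k \end{pmatrix}
\]
$P$: best orthogonal matrix diagonalizing $X_1^1, \dots, X_1^k$
$P'$: best orthogonal matrix diagonalizing $\phi^{-1}(D^1), \dots, \phi^{-1}(D^k)$

$X_1 \to (\,\cdot\,P) \to$ [L layers, N nodes neural net with weights $W_1, \dots, W_L$] $\to (P'\,\cdot\,) \to D$

$W_1 = P^\top$, $\quad \forall i \in [2, L-1]\;\; W_i = \alpha I$, $\quad W_L = P'$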
4.3 Experiments
Experiments made using the same protocol (see figure 7) clearly show the gain in performance of Algorithm II. Note that even when the objectives are identical, the gain in speed of convergence is significant.
Figure 7: MSE for $k = 5$ (20 training elements). Panel title: "Algorithm II". Curves: random weights, random weights with modified objectives, and Algorithm II.
5 Comparison between the two algorithms
Figure 8 shows the difference in performance between Algorithms I and II. The averaged MSE is not a very good quantitative indicator of the speed of convergence, but the orders of magnitude are consistent with the data provided by the learning curves.
Figure 8: MSE averaged over 200 iterations. Panel title: "Comparison between the algorithms". Averaged MSE (scale $\times 10^5$) against the size of the training set (4 to 20) for random weights, Algorithm I, and Algorithm II.
Algorithm II clearly performs better, but with the restriction that a transform has to be applied to the data before feeding the neural network. On the other hand, Algorithm I is computationally very inexpensive, but by design is unlikely to perform well on very large training sets.
6 Conclusion
The two proposed algorithms speed up the learning of multi-layer back-propagation neural networks. Constraints on these networks include:
• A fixed number of nodes per layer
• No constrained weight in the network
• An activation function with slope greater than one at the origin.
Adaptation of these techniques to different classes of networks is still to be explored.
References
[1] Simon Haykin, Neural Networks: A Comprehensive Foundation, Second Edition, Prentice Hall, 1999
[2] Jean-François Cardoso and Antoine Souloumiac, Jacobi angles for simultaneous diagonalization, SIAM J. Mat. Anal. Appl., vol. 17, #1, p. 161-164, Jan. 1996
ftp://sig.enst.fr/pub/jfc/Papers/siam_note.ps.gz
[3] Description of the joint diagonalization algorithm and Matlab code:
http://www-sig.enst.fr/~cardoso/jointdiag.html