Two algorithms to accelerate training of back-propagation neural networks

Transcript
  • 1. Two algorithms to accelerate training of back-propagation neural networks
Vincent Vanhoucke, May 23, 1999
http://www.stanford.edu/ nouk/

Abstract: This project is aimed at developing techniques to initialize the weights of multi-layer back-propagation neural networks, in order to speed up the rate of convergence. Two algorithms are proposed for a general class of networks.
  • 2. 1 Conventions and notations

[Figure 1: General notations: an L-layer, N-node-per-layer neural network, with inputs x^k_{i,j}, weights w^k_{i,j} and outputs y_{i,j}.]

In this paper we restrict ourselves to back-propagation networks with L layers and N nodes per layer [1]. This restriction will allow us to derive the behaviour of the network from matrix equations.¹

L — number of layers
N — number of nodes per layer
T — size of the training set
η — back-propagation coefficient
φ() — activation function
x^k_{i,j} — j-th input of the k-th layer for the i-th training vector
y_{i,j} — output of the j-th node for the i-th training vector
d_{i,j} — desired output of the j-th node for the i-th training vector
w^k_{i,j} — i-th weight of the j-th node of the k-th layer

With these conventions, we can write:

$$X_i = \begin{pmatrix} x^i_{1,1} & \cdots & x^i_{1,N} \\ \vdots & \ddots & \vdots \\ x^i_{T,1} & \cdots & x^i_{T,N} \end{pmatrix}, \qquad W_i = \begin{pmatrix} w^i_{1,1} & \cdots & w^i_{1,N} \\ \vdots & \ddots & \vdots \\ w^i_{N,1} & \cdots & w^i_{N,N} \end{pmatrix}, \qquad i \in [1, L] \tag{1}$$

¹ Extension to a wider variety of neural nets has not been studied yet.
  • 3.
$$Y = \begin{pmatrix} y_{1,1} & \cdots & y_{1,N} \\ \vdots & \ddots & \vdots \\ y_{T,1} & \cdots & y_{T,N} \end{pmatrix}, \qquad D = \begin{pmatrix} d_{1,1} & \cdots & d_{1,N} \\ \vdots & \ddots & \vdots \\ d_{T,1} & \cdots & d_{T,N} \end{pmatrix} \tag{2}$$

These notations allow us to express simply the forward propagation of the whole training set:

$$\begin{cases} X_2 = \varphi(X_1 W_1) \\ \quad\vdots \\ X_{i+1} = \varphi(X_i W_i) \\ \quad\vdots \\ Y = \varphi(X_L W_L) \end{cases} \tag{3}$$

2 Mathematical introduction

Consider the recursion in equation 3. Successive application of the sigmoid function φ() drives the input asymptotically to the fixed points of φ(). This statement is rigorously true when no weighting is done (W_i = I), but holds statistically for a wider class of matrices.

[Figure 2: Two kinds of sigmoids: left, φ'(0) > 1; right, φ'(0) ≤ 1.]

Figure 2 shows two different behaviours, depending on the slope of the sigmoid:

$$\varphi'(0) > 1: \quad x \to \begin{cases} 1 - \epsilon & (x > 0) \\ 0 & (x = 0) \\ -1 + \epsilon & (x < 0) \end{cases} \qquad\qquad \varphi'(0) \le 1: \quad x \to 0 \tag{4}$$
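To make the recursion in equation 3 and the limiting behaviour in equation 4 concrete, here is a small Python/NumPy sketch (not part of the report; it assumes φ(x) = tanh(4x), the sigmoid used in the experiments of section 3.2, and takes W_i = I as in the "no weighting" case discussed above):

```python
import numpy as np

# Sketch (not from the report): repeated application of phi(x) = tanh(4x),
# the sigmoid used in the experiments of section 3.2, with identity weights
# W_i = I, illustrating the fixed points of equation (4) when phi'(0) > 1.
rng = np.random.default_rng(0)
N, T = 4, 4
phi = lambda x: np.tanh(4.0 * x)

X = rng.uniform(-1.0, 1.0, size=(T, N))   # X_1: one training vector per row
X[0, 0] = 0.0                             # an exactly-zero input stays at zero

for _ in range(8):                        # X_{i+1} = phi(X_i W_i), here W_i = I
    X = phi(X @ np.eye(N))

# Positive entries end up near +0.9993, negative ones near -0.9993,
# and the entry forced to zero stays at 0.
print(np.round(X, 4))
```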
  • 4. Conclusions:

1. If we fix the initial weights of the network randomly, we are statistically more likely to get an output (before adaptation of the weights) close to these convergence points. As a consequence, we might be able to transform the objectives D of the neural network so that their coefficients are close to these convergence points, and thereby achieve a lower initial distortion.

2. In the case φ'(0) > 1, there are three convergence points, depending on the sign of the input. Note the instability of the origin: for an output to be zero, the input has to be exactly zero. We will come back to this constraint later.

3. In the case φ'(0) ≤ 1, the node drives every entry it is given towards zero. There is no natural clustering of the entries induced by the sigmoid. We won't consider this case in the following discussion; section 3.3 will show that our algorithm performs poorly in that case.

3 Algorithm I: Initialization by Forward Propagated Orthogonalization

3.1 Case N = T

The case N = T, where the number of nodes equals the size of the training set, is not likely to be encountered in practice. However, it allows us to derive an exact solution for that case, and to extend it to a set of approximate solutions for the more general case.

3.1.1 Exact solution

If N = T, all the matrices are square, and thus we can write²:

Case T = 1:
$$D = \varphi(X_1 W_1) \;\Rightarrow\; W_1 = X_1^{-1}\varphi^{-1}(D) \tag{5}$$

Case T > 1: let α = 1 − ε be the positive fixed point of φ()³.
$$W_1 = \alpha X_1^{-1} \Rightarrow X_2 = \alpha I, \quad\ldots\quad W_i = \alpha I \Rightarrow X_{i+1} = \alpha I, \quad\ldots\quad W_L = \alpha^{-1}\varphi^{-1}(D) \Rightarrow Y = D \tag{6}$$

This set of weights ensures that we aim at the exact solution without any adaptation.

3.1.2 Low-complexity, asymptotically optimal solution

We might not be able, or not be willing, to invert the matrix X_1. In that case, we can derive an approximate solution by letting W_L = α^{-1}φ^{-1}(D), which makes X_L = αI the new objective.

² Subject to X_1 being invertible.
³ This excludes the case φ'(0) ≤ 1.
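As a numerical check of the exact solution of section 3.1.1 (a sketch under assumptions, not code from the report: φ(x) = tanh(4x) as in section 3.2, α obtained by fixed-point iteration, and a randomly drawn, hence almost surely invertible, X_1), the following Python/NumPy snippet builds the weights of equation 6 and verifies that the network reproduces D without any training:

```python
import numpy as np

# Sketch of the exact initialization of section 3.1.1 (case N = T), assuming
# phi(x) = tanh(4x) so that phi^{-1}(y) = arctanh(y) / 4.
rng = np.random.default_rng(1)
N = T = 4
L = 4                                    # number of layers

phi     = lambda x: np.tanh(4.0 * x)
phi_inv = lambda y: np.arctanh(y) / 4.0

alpha = 0.5                              # fixed-point iteration alpha <- tanh(4*alpha)
for _ in range(50):
    alpha = np.tanh(4.0 * alpha)         # converges to alpha = 1 - epsilon ~ 0.9993

X1 = rng.uniform(0.0, 1.0, size=(T, N))  # training inputs (assumed invertible)
D  = rng.uniform(0.1, 0.9, size=(T, N))  # desired outputs, inside phi's range

# Equation (6): W_1 = alpha*X1^{-1}, W_i = alpha*I, W_L = alpha^{-1}*phi^{-1}(D)
W = [alpha * np.linalg.inv(X1)]
W += [alpha * np.eye(N) for _ in range(L - 2)]
W += [phi_inv(D) / alpha]

X = X1
for Wi in W:                             # forward propagation, equation (3)
    X = phi(X @ Wi)

print(np.max(np.abs(X - D)))             # tiny (~1e-5): Y is essentially D, no training
```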
  • 5. Remember that in section 2 we came to the conclusion that we might be able to take advantage of a change in the objective D, so that its coefficients are only fixed points of the sigmoid. Here we achieve this by letting the new goal be X_L = αI.

Procedure:

• Step 1: choose

$$W_1 = \begin{pmatrix} \tilde y_1 & \tilde x_1^1 & \cdots & \tilde x_1^i & \cdots & \tilde x_1^{N-1} \end{pmatrix}, \qquad X_1 = \begin{pmatrix} x_1^1 \\ \vdots \\ x_1^i \\ \vdots \\ x_1^N \end{pmatrix} \;\longrightarrow\; X_1 W_1 = \begin{pmatrix} \alpha & 0 & \cdots & 0 \\ * & * & \cdots & * \\ \vdots & \vdots & & \vdots \\ * & * & \cdots & * \end{pmatrix} \tag{7}$$

with the x̃_1^i given by¹: ⟨x_1^1 | x̃_1^i⟩ = 0 for all i ∈ [1, N−1], and ỹ_1 given by²: ⟨x_1^2 | ỹ_1⟩ = 0 and ⟨x_1^1 | ỹ_1⟩ = α.   (8)

• Step 2: choose

$$W_2 = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ \tilde x_2^1 & \tilde y_2 & \tilde x_2^2 & \cdots & \tilde x_2^{N-1} \end{pmatrix}, \qquad X_2 = \begin{pmatrix} \alpha & 0 & \cdots & 0 \\ 0 & x_2^1 \\ * & x_2^2 \\ \vdots & \vdots \\ * & x_2^{N-1} \end{pmatrix} \;\longrightarrow\; X_2 W_2 = \begin{pmatrix} \alpha & 0 & 0 & \cdots & 0 \\ 0 & \alpha & 0 & \cdots & 0 \\ * & 0 & * & \cdots & * \\ \vdots & \vdots & \vdots & & \vdots \end{pmatrix} \tag{9}$$

with the x̃_2^i given by: ⟨x_2^1 | x̃_2^i⟩ = 0 for all i ∈ [1, N−1], and ỹ_2 given by³: ⟨x_2^2 | ỹ_2⟩ = 0 and ⟨x_2^1 | ỹ_2⟩ = α.   (10)

¹ Note that the x̃_1^i do not need to be all distinct. rank{x̃_1^i, i ∈ [1, N−1]} > 1 is a necessary condition, as shown in step two, but there might be some more requirements to ensure that the procedure works in all cases.
² The second constraint can be weakened to ⟨x_1^1 | ỹ_1⟩ > 0. The remarks of section 2 apply, ensuring that the coefficient will naturally converge to α.
³ This can never be achieved if rank{x̃_1^i, i ∈ [1, N−1]} = 1.
  • 6. • Step 3: choose

$$W_3 = \begin{pmatrix} 1 & 0 & \cdots & & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ \tilde x_3^1 & \tilde x_3^2 & \tilde y_3 & \tilde x_3^3 \cdots & \tilde x_3^{N-1} \end{pmatrix}, \qquad X_3 = \begin{pmatrix} \alpha & 0 & 0 & \cdots \\ 0 & \alpha & 0 & \cdots \\ \beta & 0 & x_3^1 \\ * & * & x_3^2 \\ \vdots & & \vdots \end{pmatrix} \;\longrightarrow\; X_3 W_3 = \begin{pmatrix} \alpha & 0 & 0 & 0 & \cdots & 0 \\ 0 & \alpha & 0 & 0 & \cdots & 0 \\ 0 & 0 & \alpha & 0 & \cdots & 0 \\ * & * & 0 & * & \cdots & * \\ \vdots & & & & & \vdots \end{pmatrix} \tag{11}$$

with x̃_3^1 given by: ⟨[β x_3^1] | [1 x̃_3^1]⟩ = 0, the x̃_3^i (i > 1) given by: ⟨x_3^1 | x̃_3^i⟩ = 0 for all i ∈ [2, N−1], and ỹ_3 given by: ⟨x_3^2 | ỹ_3⟩ = 0 and ⟨x_3^1 | ỹ_3⟩ = α.   (12)

• And so on...

This procedure performs a step-by-step diagonalization of the matrix, which is fully completed in N steps. We will see later that N layers are not required for the algorithm to perform well: even a partial diagonalization reduces the error by a significant amount.

3.2 Experiments: Case N = 4

3.2.1 Algorithm I applied to a 4-node, L-layer network, N = T

See figure 3. The detailed experimental protocol is:

Number of layers: 1 to 6
Number of nodes per layer: 4
Size of training set: 4
Sigmoid: φ(x) = tanh(4x)
Adaptation rule: back-propagation
Error measure: MSE averaged over all training sequences
Input vectors: random coefficients between 0 and 1
Desired output: random coefficients between 0 and 1
Reference: random weights between 0 and 1
Number of training iterations: 200
Number of experiments averaged: 100

The adaptation time is greatly reduced by the algorithm: the first transitory phase, during which the network searches for a stable configuration, is removed. However, in this case a perfect fit without any adaptation can be achieved using the inversion procedure described in section 3.1.1.
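Going back to the construction of section 3.1.2, here is a Python/NumPy sketch of Step 1 only (equations 7 and 8), not taken from the report: the vectors orthogonal to x_1^1 are obtained from a QR decomposition (one possible choice among many), and ỹ_1 is taken as the minimum-norm vector satisfying ⟨x_1^1 | ỹ_1⟩ = α and ⟨x_1^2 | ỹ_1⟩ = 0.

```python
import numpy as np

# Sketch of Step 1 of Algorithm I (equations 7-8), not code from the report.
# Assumes phi(x) = tanh(4x); the orthogonal vectors are one possible choice.
rng = np.random.default_rng(2)
N = T = 4
phi = lambda x: np.tanh(4.0 * x)

alpha = 0.5
for _ in range(50):                      # positive fixed point of phi
    alpha = np.tanh(4.0 * alpha)

X1 = rng.uniform(0.0, 1.0, size=(T, N))
x11, x12 = X1[0], X1[1]                  # rows x_1^1 and x_1^2

# Columns orthogonal to x_1^1: last N-1 columns of a complete QR basis.
Q, _ = np.linalg.qr(x11.reshape(-1, 1), mode='complete')
x_tilde = Q[:, 1:]                       # <x_1^1 | x_tilde_i> = 0

# y_tilde: <x_1^1 | y> = alpha and <x_1^2 | y> = 0 (minimum-norm solution).
A = np.vstack([x11, x12])
y_tilde = np.linalg.lstsq(A, np.array([alpha, 0.0]), rcond=None)[0]

W1 = np.column_stack([y_tilde, x_tilde])
X2 = phi(X1 @ W1)
print(np.round(X2[0], 4))                # first row of X2 is (alpha, 0, ..., 0)
```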
  • 7. [Figure 3: MSE as a function of training iteration for networks of one to six layers, comparing random initial weights with Algorithm I.]
  • 8. 3.2.2 Case N < T

When the training set is bigger than N, a solution consists of adapting the initial weights to a subset of size N of the training set, using one of the two techniques described before. The results shown in figure 4 demonstrate that in this case Algorithm I is the one that performs best: the inversion procedure 'overfits' the reduced portion of the training set, and the adaptation time required to train on the other elements of the training set is therefore much higher.

[Figure 4: MSE for a doubled size of the training set, comparing random weights, the inversion procedure and Algorithm I.]

[Figure 5: MSE averaged over 200 iterations as a function of the size of the training set, comparing random weights and Algorithm I.]
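A minimal sketch of this subset-based initialization (assuming the inversion procedure of section 3.1.1 as the initializer and, arbitrarily, the first N training vectors as the subset; the report does not specify how the subset is chosen):

```python
import numpy as np

# Sketch (not from the report): when T > N, initialize on a size-N subset
# of the training set, then train on the full set with back-propagation.
def init_on_subset(X_full, D_full, N, L, phi_inv, alpha):
    """Build W_1..W_L from the first N training vectors (section 3.1.1)."""
    X1, D = X_full[:N], D_full[:N]          # an arbitrary size-N subset
    W = [alpha * np.linalg.inv(X1)]         # W_1 = alpha * X1^{-1}
    W += [alpha * np.eye(N) for _ in range(L - 2)]
    W += [phi_inv(D) / alpha]               # W_L = alpha^{-1} * phi^{-1}(D)
    return W                                # starting point for back-propagation
```

Back-propagation then proceeds from these weights on the full training set; as figure 4 suggests, an initialization that fits the subset less tightly (Algorithm I) leaves less to undo on the remaining vectors.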
  • 9. 3.3 Case φ'(0) ≤ 1

Theoretically, the procedure described should work poorly if φ'(0) ≤ 1 (see section 2). Figure 6 shows that this is indeed the case.

[Figure 6: MSE with φ(x) = tanh(x/4), comparing random weights and Algorithm I for both φ'(0) > 1 and φ'(0) < 1.]

4 Algorithm II: Initialization by Joint Diagonalization

4.1 Description

In the description of the exact procedure in section 3.1.1, the key step was to get:

$$X_2 = \alpha I \quad\text{and}\quad X_L = \alpha I \tag{13}$$

Now, considering a larger training set, we are forced to weaken these constraints in order to be able to work with non-square matrices. Let us consider that the size of the training set is T = kN, a multiple of the number of nodes in each layer. X_i and D can then be decomposed into square blocks:

$$X_i = \begin{pmatrix} X_i^1 \\ \vdots \\ X_i^k \end{pmatrix} \quad\text{and}\quad D = \begin{pmatrix} D^1 \\ \vdots \\ D^k \end{pmatrix} \tag{14}$$

We can achieve a fairly good separation of the elements of the training set by transforming the constraints in 13 into:
  • 10.
$$X_2 = X_L = \begin{pmatrix} \alpha I \\ \vdots \\ \alpha I \end{pmatrix} \tag{15}$$

If we can meet these constraints, each separated bin will contain at most k elements. Using equation 3, we can see that meeting these constraints is equivalent to being able to invert X_1^1, …, X_1^k with a common inverse W_1, as well as inverting D^1, …, D^k with W_L. This cannot be done strictly, but by weakening equality 15 one step further, we might be able to achieve joint diagonalization. If X_2 and X_L are diagonal by blocks, the remarks made in section 2 apply and the blocks are good approximations of the objectives αI.

The joint diagonalization theorem states that: given A_1, …, A_n a set of normal matrices which commute, there exists an orthogonal matrix P such that P^T A_1 P, …, P^T A_n P are all diagonal.

If the matrices are not normal, and/or do not commute, we can achieve a close result by minimizing the off-diagonal terms of the set of matrices, as shown in [2]. With:

$$\mathrm{off}(A) \triangleq \sum_{1 \le i \ne j \le N} |a_{i,j}|^2 \tag{16}$$

the minimization criterion⁴ is:

$$P = \arg\inf_{P :\, P^T P = I} \;\sum_{i=1}^{k} \mathrm{off}\!\left(P^T A_i P\right) \tag{17}$$

As in equation 3, the objective is to diagonalize:

$$\varphi^{-1}(X_2^i) = X_1^i W_1 \quad\text{and}\quad X_L^i = \varphi^{-1}(D^i)\, W_L^{-1}, \qquad i \in [1, k] \tag{18}$$

which can be expressed formally as:

$$\Delta_i \triangleq \varphi^{-1}(X_2^i) = \underbrace{A_i}_{X_1^i}\,\underbrace{P}_{W_1} \quad\text{and}\quad \Delta_i' \triangleq X_L^i = \underbrace{A_i'}_{\varphi^{-1}(D^i)}\,\underbrace{P'}_{W_L^{-1}} \tag{19}$$

As a consequence, subject to being willing to change the inputs X_1 and the desired outputs D by a bijective transform, X_1 → X_1 P and D → φ(P′ φ^{-1}(D)), setting W_1 = P and W_L = P′ provides a good approximation of the ideal case.

⁴ A Matlab function performing this operation is given at the URL [3].
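To make the criterion of equations 16 and 17 concrete, here is a Python/NumPy sketch (an illustration of the theorem quoted above, not the Jacobi-angle algorithm of [2], whose Matlab implementation is referenced in [3]): for commuting symmetric matrices a single orthogonal P cancels off(P^T A_i P) for every i, while a generic orthogonal matrix does not.

```python
import numpy as np

# Sketch: the off(.) criterion of equation (16) and a shared orthogonal
# diagonalizer for commuting symmetric matrices, illustrating the quoted
# theorem (this is not the Jacobi-angle algorithm of [2]).
def off(A):
    """Sum of squared off-diagonal entries, equation (16)."""
    return np.sum(np.abs(A) ** 2) - np.sum(np.abs(np.diag(A)) ** 2)

rng = np.random.default_rng(3)
N, k = 4, 3

# k commuting symmetric matrices A_i = Q diag(d_i) Q^T (same eigenbasis Q).
Q, _ = np.linalg.qr(rng.normal(size=(N, N)))
A = [Q @ np.diag(rng.normal(size=N)) @ Q.T for _ in range(k)]

# A shared diagonalizer: the eigenvectors of any one of them.
_, P = np.linalg.eigh(A[0])
print([float(off(P.T @ Ai @ P)) for Ai in A])    # ~0 for every A_i

# A generic orthogonal matrix does not diagonalize them: off() stays large.
R, _ = np.linalg.qr(rng.normal(size=(N, N)))
print([float(off(R.T @ Ai @ R)) for Ai in A])
```

When the blocks X_1^i or φ^{-1}(D^i) do not commute, equation 17 can only be minimized approximately, which is the situation Algorithm II operates in.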
  • 11. 4.2 Algorithm

$$X_i = \begin{pmatrix} X_i^1 \\ \vdots \\ X_i^k \end{pmatrix} \quad\text{and}\quad D = \begin{pmatrix} D^1 \\ \vdots \\ D^k \end{pmatrix}$$

P: best orthogonal matrix jointly diagonalizing X_1^1, …, X_1^k
P′: best orthogonal matrix jointly diagonalizing φ^{-1}(D^1), …, φ^{-1}(D^k)

X_1 → (·P) → [L-layer, N-node neural net: W_1, …, W_L] → (·P′) → D

with W_1 = P, W_i = αI for all i ∈ [2, L−1], and W_L = P′.

4.3 Experiments

Experiments made using the same protocol (see figure 7) clearly show the gain in performance of Algorithm II. Note that even when the objectives are identical, the gain in speed of convergence is significant.

[Figure 7: MSE for k = 5 (20 training elements), comparing random weights, random weights with modified objectives, and Algorithm II.]

5 Comparison between the two algorithms

Figure 8 shows the difference in performance between Algorithms I and II. The averaged MSE is not a very good quantitative indicator of the speed of convergence, but the orders of magnitude are consistent with the data provided by the learning curves.
  • 12. [Figure 8: MSE averaged over 200 iterations as a function of the size of the training set, comparing random weights, Algorithm I and Algorithm II.]

Algorithm II clearly performs better, but with the restriction of having to apply a transform to the data before feeding the neural network. On the other hand, Algorithm I is computationally very inexpensive, but by design it is unlikely to perform well on very large training sets.

6 Conclusion

The two algorithms proposed speed up the learning rate of multi-layer back-propagation neural networks. Constraints on these networks include:

• a fixed number of nodes per layer;
• no constrained weight in the network;
• an activation function with slope greater than one at the origin.

Adaptation of these techniques to different classes of networks is still to be explored.
  • 13. References

[1] Simon Haykin, Neural Networks: A Comprehensive Foundation, Second Edition, Prentice Hall, 1999.

[2] Jean-François Cardoso and Antoine Souloumiac, "Jacobi angles for simultaneous diagonalization", SIAM J. Mat. Anal. Appl., vol. 17, no. 1, pp. 161-164, Jan. 1996. ftp://sig.enst.fr/pub/jfc/Papers/siam_note.ps.gz

[3] Description of the joint diagonalization algorithm and Matlab code: http://www-sig.enst.fr/~cardoso/jointdiag.html