Multilayer Neural Networks


Published in: Education, Technology

Optimising the Widths of Radial Basis Functions

Mark Orr
Centre for Cognitive Science, Edinburgh University
2, Buccleuch Street, Edinburgh EH8 9LW, Scotland, UK

Abstract

In the context of regression analysis with penalised linear models (such as RBF networks) certain model selection criteria can be differentiated to yield a re-estimation formula for the regularisation parameter such that an initial guess can be iteratively improved until a local minimum of the criterion is reached. In this paper we discuss some enhancements of this general approach including improved computational efficiency, detection of the global minimum and simultaneous optimisation of the basis function widths. The benefits of these improvements are demonstrated on a practical problem.

1 Introduction

Consider a radial basis function (RBF) network with centres at $\{c_j\}$, weights $\{w_j\}$ and radial functions

$$h_j(x) = \exp\left( -\frac{(x - c_j)^\top (x - c_j)}{r^2} \right)$$

($j = 1 \ldots m$) all having the same width $r$. The centres are fixed but the weights and width are adaptable. The response of the network to an input $x$ is

$$f(x) = \sum_{j=1}^{m} w_j \, h_j(x)$$

Suppose that this network is trained on a regression data set $\{x_i, y_i\}$ ($i = 1 \ldots p$) by minimising the penalised sum-squared-error cost function

$$C(w) = e^\top e + \lambda \, w^\top w$$

where $w$ is the $m$-dimensional weight vector and $e$ is the $p$-dimensional error vector, $e_i = y_i - f(x_i)$. The second term penalises large weights and is designed to avoid overfit should the unregularised model be too complex for the data. The size of the penalty is controlled by $\lambda$, the regularisation parameter, which, like $w$ and $r$, is free to adapt to the training set. Given a value for $\lambda$, the weight vector which minimises the cost is

$$\hat{w} = A^{-1} H^\top y$$

where $H_{ij} = h_j(x_i)$ is the design matrix and contains the responses of the $m$ centres to the $p$ inputs of the training set, $A = H^\top H + \lambda I_m$ and $I_m$ is the $m$-dimensional identity matrix.

The adjustable parameters in the model are the $m$ weights $w_j$, the basis function width $r$ and the regularisation parameter $\lambda$. The fixed parameters are the centre positions $c_j$ and their number $m$. Below we will assume that the inputs of the training set are used as the fixed centres, in which case $m = p$ and $c_j = x_j$, but our results apply equally to other choices of fixed centres.

Various model selection criteria, such as generalised cross-validation (GCV) [2] or the marginal likelihood of the data (the "evidence") [3], can be differentiated and set equal to zero to yield a re-estimation formula for the regularisation parameter. For example, in a previous paper [5] we derived the following formula from GCV

$$\lambda = \frac{\eta}{p - \gamma} \, \frac{\hat{e}^\top \hat{e}}{\hat{w}^\top A^{-1} \hat{w}} \qquad (1)$$

where $\hat{e} = y - H \hat{w}$, $\gamma = m - \lambda \, \mathrm{tr}\, A^{-1}$ (the effective number of parameters [4]) and $\eta = \mathrm{tr}\left( A^{-1} - \lambda A^{-2} \right)$.

However, there are problems with simply trying to iterate equation (1) to convergence. Firstly, depending on the initial guess, a non-optimal local minimum may be found and secondly, inversion of $A$ is liable to become numerically unstable if $\lambda$ gravitates towards very small values. Furthermore, the value of the width parameter $r$ remains fixed. In the next section we describe a computationally efficient and numerically stable method of iterating (1) which is fast enough that the global minimum and the optimal value of $r$ can be found by explicit search.
2 Efficient Computation

Continually recomputing the inverse of $A$ each time the value of $\lambda$ changes requires of order $m^3$ floating point operations per iteration and is vulnerable to numerical instability if $\lambda$ becomes very small. A more efficient and stable method involves initially computing the eigenvalues $\{\mu_i\}$ and eigenvectors $\{u_i\}$ of $H H^\top$ and $\{z_i\}$, the projections of the data onto the eigenvectors ($z_i = y^\top u_i$). Thereafter, the four terms appearing in (1) can be computed efficiently (with cost only linear in $p$) by

$$\hat{e}^\top \hat{e} = \sum_{i=1}^{p} \frac{\lambda^2 z_i^2}{(\mu_i + \lambda)^2} \qquad (2)$$

$$\hat{w}^\top A^{-1} \hat{w} = \sum_{i=1}^{p} \frac{\mu_i z_i^2}{(\mu_i + \lambda)^3} \qquad (3)$$

$$\eta = \sum_{i=1}^{p} \frac{\mu_i}{(\mu_i + \lambda)^2} \qquad (4)$$

$$p - \gamma = \sum_{i=1}^{p} \frac{\lambda}{\mu_i + \lambda} \qquad (5)$$

Note that if $p > m$ then the last $p - m$ eigenvalues (assuming they are ordered from largest to smallest) are zero. However, as remarked earlier, if we have one centre for each training set input then $p = m$ and the cost of calculating the eigenvalues and eigenvectors of the $p \times p$ matrix $H H^\top$ is of the same order as inverting the $m \times m$ matrix $A$. Therefore, unless (1) converges almost immediately, it is much more efficient to calculate the eigensystem once and then use (2-5) than to invert $A$ on each iteration.

Once the eigensystem has been established, GCV

$$\mathrm{GCV} = \frac{p \, \hat{e}^\top \hat{e}}{(p - \gamma)^2}$$

can also be cheaply calculated for any given $\lambda$ using (2,5). Thus it is feasible to evaluate GCV for a number of trial values of $\lambda$ searching for local minima, refine those that are found by iterating (1) to convergence (using (2-5) of course) and finally select from the local minima the one with the smallest GCV. Assuming a wide and dense enough range of trial values is employed, this procedure will find the global minimum.

We now demonstrate this method on a toy problem consisting of $p = 60$ samples taken from the function $y = 0.8 \sin(6 \pi x)$ at random points in the range $0 < x < 1$ and corrupted by Gaussian noise of standard deviation 0.1. The data was modelled by an RBF network with $m = 60$ centres coincident with the input points and basis functions of fixed width $r = 0.2$. Figure 1 shows the variation of GCV with $\lambda$ for one particular realisation of this problem.

Figure 1: Local (diamonds) and global (star) minima of GCV. Horizontal axis: log(λ), from -16 to 2. Vertical axis: log(GCV).

An array of 50 trial values of $\lambda$, evenly spaced between $\log \lambda = -16$ and $\log \lambda = 2$, were used to find rough positions for the local minima and equation (1) was then iterated for each one found. Convergence was assumed once changes in GCV from one iteration to the next had dipped below a threshold of 1 part in a million. In the example problem three local minima were detected (see figure 1) and the one with the lowest GCV corresponded to $\lambda \approx 2$. Searching for the minima and refining the candidate solutions took up only 0.6% of the total computation time; the rest was accounted for by the calculation of eigenvalues and eigenvectors. Notice that if we had simply started with a single guess for $\lambda$ and iterated equation (1) to find the solution, any initial guess below about $10^{-4}$ would have led to a sub-optimal solution.

Occasionally the value of $\lambda$ re-estimated from equation (1) bounces back and forward between two values on each side of a local minimum and then either takes a long time to pass through this bistable state before finally converging or does not converge at all. To solve this problem we devised the following heuristic. Suppose the sequence of re-estimated values is $\lambda_1, \ldots, \lambda_{k-2}, \lambda_{k-1}, \lambda_k$, with $\lambda_k$ being the current value. Then if

$$|\lambda_k - \lambda_{k-1}| > |\lambda_k - \lambda_{k-2}|$$

replace $\lambda_k$ by the geometric mean of $\lambda_{k-1}$ and $\lambda_{k-2}$ before proceeding to the next iteration.
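The procedure of this section can be sketched in NumPy as follows: one eigendecomposition of $H H^\top$, the sums (2)-(5), a grid search for local minima of GCV, and refinement of each candidate by iterating (1) with the geometric-mean heuristic. This is a minimal sketch under stated assumptions rather than the author's implementation. The function names, the random seed, the base-10 reading of the log axis, the positivity guard inside `refine`, and the choice to keep a grid point when its refinement has a higher GCV are all assumptions of the example.

```python
import numpy as np

def eigensystem(H, y):
    """Eigenvalues of H H^T and squared projections of y onto its eigenvectors."""
    mu, U = np.linalg.eigh(H @ H.T)
    mu = np.clip(mu, 0.0, None)  # round-off can leave tiny negative eigenvalues
    return mu, (U.T @ y) ** 2

def terms(mu, z2, lam):
    """Equations (2)-(5): e'e, w'A^{-1}w, eta and p - gamma, each an O(p) sum."""
    d = mu + lam
    return (np.sum(lam**2 * z2 / d**2), np.sum(mu * z2 / d**3),
            np.sum(mu / d**2), np.sum(lam / d))

def gcv(mu, z2, lam):
    ee, _, _, p_minus_gamma = terms(mu, z2, lam)
    return mu.size * ee / p_minus_gamma**2

def refine(mu, z2, lam, tol=1e-6, max_iter=200):
    """Iterate formula (1) from lam, damping oscillations with the geometric mean."""
    hist, g = [lam], gcv(mu, z2, lam)
    for _ in range(max_iter):
        ee, wAw, eta, pmg = terms(mu, z2, hist[-1])
        new = eta * ee / (pmg * wAw)
        if not (np.isfinite(new) and new > 0.0):
            break  # safeguard added for this sketch, not part of the text
        if len(hist) >= 2 and abs(new - hist[-1]) > abs(new - hist[-2]):
            new = np.sqrt(hist[-1] * hist[-2])   # bistable-state heuristic
        hist.append(new)
        g_old, g = g, gcv(mu, z2, new)
        if abs(g - g_old) <= tol * abs(g_old):   # 1 part in a million
            break
    return hist[-1], g

def global_search(mu, z2, grid):
    """Refine every local minimum found on the grid and return the best (lam, GCV)."""
    g = np.array([gcv(mu, z2, lam) for lam in grid])
    padded = np.r_[np.inf, g, np.inf]
    is_min = (g < padded[:-2]) & (g <= padded[2:])
    best_lam, best_g = None, np.inf
    for lam0, g0 in zip(grid[is_min], g[is_min]):
        lam1, g1 = refine(mu, z2, lam0)
        if g1 < g0:                  # assumption: keep the grid point otherwise
            lam0, g0 = lam1, g1
        if g0 < best_g:
            best_lam, best_g = lam0, g0
    return best_lam, best_g

# Hypothetical realisation of the toy problem (seed chosen arbitrarily).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 60)
y = 0.8 * np.sin(6 * np.pi * x) + rng.normal(0.0, 0.1, 60)
H = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.2**2)   # m = p, r = 0.2

mu, z2 = eigensystem(H, y)
grid = np.logspace(-16, 2, 50)       # assumes the log axis is base 10
lam_best, gcv_best = global_search(mu, z2, grid)
print(lam_best, gcv_best)
```

After the single call to `eigensystem`, every evaluation of GCV or of formula (1) costs only a few sums over $p$ terms, which is what makes the dense grid over $\lambda$ affordable.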
3 Optimising the Width

When GCV is differentiated with respect to the regularisation parameter and set equal to zero the resulting equation can be manipulated so that $\lambda$ alone appears on the left hand side, enabling the equation to be used as a re-estimation formula. Unfortunately the same trick does not work with $r$ because, after setting the derivative of GCV with respect to $r$ to zero, the terms explicitly involving $r$ cancel so $r$ cannot be isolated and a re-estimation formula is impossible. The same applies to other model selection criteria such as maximum likelihood of the data [6].

When there is only one width parameter, as we assume here, it is feasible to tackle the problem of choosing an optimal value by experimenting with a number of trial values and selecting the one most favoured by the model selection criterion. The range of trial values used will be problem specific and could be determined by the likely maximum and minimum scales involved in the particular problem. The number of trial values between these limits will depend on the size of the problem ($p$) and the available computing resources since for each trial value an eigensystem computation (with cost proportional to $p^3$) will be necessary.

Figure 2: Tracking the global minimum with respect to λ as r changes. Horizontal axis: r, from 0 to 1. Vertical axis: GCV.

Figure 2 illustrates using the toy problem described earlier. It shows the value of GCV at the global minimum over $\lambda$ for 50 trial values of $r$ between 0.1 and 1.0. The value of $r = 0.2$, which we used earlier, appears to have been a little on the small side. The optimal value is close to 0.45.

As $r$ changes the location ($\lambda$) and height (GCV) of the local minima (see figure 1) change smoothly. Usually this means the location of the global minimum also changes smoothly with $r$ but there are particular values of the width where the identity of the local minimum with the smallest GCV switches, causing an abrupt change in location (but not height) of the global minimum. This explains the discontinuous changes of slope in the curve of figure 2. Local minima can also be created or destroyed as $r$ changes, so discontinuous changes in value are also possible.

Of course, the ultimate arbiter of generalisation performance is not the value of a model selection criterion (such as GCV) on a particular realisation of the problem but the error on an independent test set averaged over multiple realisations. We perform such a test in the next section.

4 Results

For a thorough test of the method we turn to a more realistic problem stemming from Friedman's MARS paper [1] and later used to compare RBFs and MARS [5]. The problem involves the prediction of impedance $Z$ and phase $\phi$ from the four parameters (resistance, frequency, inductance and capacitance) of an electrical circuit. Training sets of three different sizes (100, 200, 400) and with a signal-to-noise ratio of about 3:1 were replicated 100 times each. The input components were normalised to have unit variance and zero mean for each replication. The learning method, as described above, was applied using a set of 10 trial values of $r$ between 1 and 10. Generalisation performance was estimated by scaled sum of squared errors over two independent test sets (one for $Z$ and one for $\phi$) of size 5000 and uncorrupted by noise. This is the same experimental set up as in the previous papers [1, 5] from which further details can be obtained.

          Z               φ
  p     NEW    OLD     NEW    OLD
  100   0.34   0.45    0.27   0.26
  200   0.19   0.26    0.18   0.20
  400   0.14   0.14    0.13   0.16

Table 1: Average generalisation errors for the new method, which optimises the width r, and an older method which does not.

Table 1 summarises the results. The left hand column gives training set size. Two sets of results, one for $Z$ and one for $\phi$, are given. The figures quoted are the average (over 100 replications) of the scaled sum of squared prediction errors. Apart from the method described above, which involves optimisation of $r$, the average errors of an older RBF algorithm, regularised forward selection (RFS), are also quoted (taken from [5]). The main differences to the method described here are that RFS uses a fixed value of $r$ and creates a parsimonious network. The latter has a relatively small effect on generalisation performance.

RFS is clearly inferior to the new method for the $Z$ problem and marginally worse for $\phi$. We think the optimisation of $r$ for each training set explains the superior performance of the new method and the lack of such optimisation is a partial explanation for the poor performance of RFS compared to MARS [5]. The fixed value of $r$ used for RFS was 3.5 but the average optimal values determined by the new method were 8.7 for $Z$ and 2.8 for $\phi$. Thus it looks as if the fixed value used for RFS was an underestimate in the case of $Z$ (where the new algorithm considerably improved the results) but about right for $\phi$ (where the new method made less of an impact).

Figure 3: Z as a function of L and C. Horizontal axes: L and C, each from -2 to 2. Vertical axis: Z.

Note that while $r = 8.7$ may sound rather large, especially in view of the normalised input components, such large basis function widths do not necessarily imply a lack of structure in the fitted function, as might be assumed. Figure 3 plots $Z$ (impedance) against $C$ (capacitance) and $L$ (inductance) for fixed values of the other two components (resistance and frequency). This function was fitted to one of the $p = 200$ training sets for which the algorithm had found an optimal basis function width of $r = 10$. The function still exhibits considerable structure over the ranges of $L$ and $C$ even though they are less than half the size of $r$.

5 Conclusions

We have described a new computational method for re-estimating the regularisation parameter of an RBF network based on generalised cross-validation (GCV). It utilises an eigensystem related to the design matrix of the regression problem and is more efficient and more stable than methods which involve a direct matrix inverse at each iteration. We have extended the algorithm to optimise the basis function width simply by testing a number of trial values and selecting the one associated with the smallest value of GCV.

We tested the method on a practical problem involving 4 input dimensions and a few hundred training examples. Our method, which can adapt the width of the basis functions, but not their number, was found to have better prediction performance than a similar RBF network which can adapt the number of functions but is stuck with the same fixed width.

The new method, with its head-on approaches to finding the global minimum with respect to the regularisation parameter and to optimising the basis function width, does not scale up well for multiple regularisation parameters or multiple widths. Additionally, there is a limit on how many training examples and basis functions can be handled due to the computational cost of calculating the eigensystem. It is best suited to problems involving a single regularisation parameter, a single basis function width and about 1000 (or fewer) training set examples.

References

[1] J. Friedman. Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19:1-141, 1991.
[2] G. Golub, M. Heath, and G. Wahba. Generalised cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215-223, 1979.
[3] D. MacKay. Bayesian interpolation. Neural Computation, 4(3):415-447, 1992.
[4] J. Moody. The effective number of parameters: An analysis of generalisation and regularisation in nonlinear learning systems. In J. Moody, S. Hanson, and R. Lippmann, editors, Neural Information Processing Systems 4, pages 847-854. Morgan Kaufmann, San Mateo CA, 1992.
[5] M. Orr. Regularisation in the selection of radial basis function centres. Neural Computation, 7(3):606-623, 1995.
[6] M. Orr. An EM algorithm for regularised radial basis function networks. In International Conference on Neural Networks and Brain, Beijing, China, October 1998.