Recent Advances in Radial Basis Function Networks

Mark J. L. Orr¹
Institute for Adaptive and Neural Computation
Division of Informatics, Edinburgh University
Edinburgh EH8 9LW, Scotland, UK

June 25, 1999

Abstract

In 1996 an Introduction to Radial Basis Function Networks was published on the web² along with a package of Matlab functions³. The emphasis was on the linear character of RBF networks and on two techniques borrowed from statistics: forward selection and ridge regression. This document⁴ is an update on developments between 1996 and 1999 and is associated with a second version of the Matlab package⁵. Improvements have been made to the forward selection and ridge regression methods, and a new method, a cross between regression trees and RBF networks, has been developed.

¹ mjo@anc.ed.ac.uk
² www.anc.ed.ac.uk/~mjo/papers/intro.ps
³ www.anc.ed.ac.uk/~mjo/software/rbf.zip
⁴ www.anc.ed.ac.uk/~mjo/papers/recad.ps
⁵ www.anc.ed.ac.uk/~mjo/software/rbf2.zip
Contents

1 Introduction
  1.1 MacKay's Hermite Polynomial
  1.2 Friedman's Simulated Circuit
2 Maximum Marginal Likelihood
  2.1 Introduction
  2.2 Review
  2.3 The EM Algorithm
  2.4 The DM Algorithm
  2.5 Conclusions
3 Optimising the Size of RBFs
  3.1 Introduction
  3.2 Review
  3.3 Efficient Re-estimation of $\lambda$
  3.4 Avoiding Local Minima
  3.5 The Optimal RBF Size
  3.6 Trial Values in Other Contexts
  3.7 Conclusions
4 Regression Trees and RBF Networks
  4.1 Introduction
  4.2 The Basic Idea
  4.3 Generating the Regression Tree
  4.4 From Hyperrectangles to RBFs
  4.5 Selecting the Subset of RBFs
  4.6 The Best Parameter Values
  4.7 Demonstrations
  4.8 Conclusions
5 Appendix
  A Applying the EM Algorithm
  B The Eigensystem of $H H^\top$
1 Introduction

In 1996 an introduction to radial basis function (RBF) networks was published on the web [16] along with an associated Matlab software package [17]. The approach taken stressed the linear character of RBF networks, which traditionally have only a single hidden layer, and borrowed techniques from statistics, such as forward selection and ridge regression, as strategies for controlling model complexity, the main challenge facing all methods of nonparametric regression.

That was three years ago. Since then, some improvements have been made, a new algorithm devised and the package of Matlab functions is now in its second version [19]. This document describes the theory of the new developments and will be of interest to practitioners using the new software package and theorists enhancing existing methods or developing new ones.

Section 2 describes what happens when the expectation-maximisation algorithm is applied to RBF networks. Section 3 describes a simple procedure for optimising the RBF widths, particularly for ridge regression. Finally, section 4 describes the new algorithm which uses a regression tree to generate the centres and sizes of a set of candidate RBFs and to help select a subset of these for the network. Two simulated data sets, used for demonstration, are described below.

1.1 MacKay's Hermite Polynomial

The first data set is from [10] and is based on a one-dimensional Hermite polynomial,

$$ y = 1 + (1 - x + 2x^2)\, e^{-x^2} . $$

100 input values are sampled randomly in the range $-4 < x < 4$ and Gaussian noise of standard deviation $\sigma = 0.1$ is added to the outputs (figure 1.1).

[Figure 1.1: Sample Hermite data (stars) and the actual function (curve).]
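As a concrete illustration, here is a minimal Python/NumPy sketch of how such a training set could be generated. The function and noise level are taken from the text; the random number generator, seed handling and function name are my own and not part of the original setup.

```python
import numpy as np

def make_hermite_data(p=100, noise_sd=0.1, seed=None):
    """Sample the Hermite data of section 1.1: p inputs drawn uniformly
    from (-4, 4), outputs y = 1 + (1 - x + 2x^2) exp(-x^2) plus
    Gaussian noise of standard deviation noise_sd."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-4.0, 4.0, size=p)
    y = 1.0 + (1.0 - x + 2.0 * x**2) * np.exp(-x**2)
    return x, y + rng.normal(0.0, noise_sd, size=p)
```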
1.2 Friedman's Simulated Circuit

The second data set simulates an alternating current circuit with four parameters: resistance ($R$ ohms), angular frequency ($\omega$ radians per second), inductance ($L$ henries) and capacitance ($C$ farads) in the ranges

$$ 0 \le R \le 100, \qquad 40 \le \omega \le 560, \qquad 0 \le L \le 1, \qquad 1 \times 10^{-6} \le C \le 11 \times 10^{-6} . $$

200 random samples of the four parameters in these ranges were used to generate corresponding values of the impedance,

$$ Z = \sqrt{R^2 + \left( \omega L - \frac{1}{\omega C} \right)^2} , $$

to which Gaussian noise of standard deviation $\sigma = 175$ was added. This resulted in a training set of 200 cases with four-dimensional inputs $x = [R\ \omega\ L\ C]^\top$ and a scalar output $y = Z$. The problem originates from [6]. Before applying any learning algorithms to this data, the original inputs, with their very different dynamic ranges, are rescaled to the range $[-1, 1]$ in each component.
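In the same illustrative spirit as the previous sketch (arbitrary random number generator and function name), the data set and the per-component rescaling could be generated as follows.

```python
import numpy as np

def make_friedman_circuit(p=200, noise_sd=175.0, seed=None):
    """Sample the simulated-circuit data of section 1.2 and rescale
    each input component to the range [-1, 1]."""
    rng = np.random.default_rng(seed)
    R = rng.uniform(0.0, 100.0, p)
    w = rng.uniform(40.0, 560.0, p)        # angular frequency omega
    L = rng.uniform(0.0, 1.0, p)
    C = rng.uniform(1e-6, 11e-6, p)
    Z = np.sqrt(R**2 + (w * L - 1.0 / (w * C))**2)
    y = Z + rng.normal(0.0, noise_sd, p)
    X = np.column_stack([R, w, L, C])
    lo, hi = X.min(axis=0), X.max(axis=0)
    return 2.0 * (X - lo) / (hi - lo) - 1.0, y   # inputs rescaled to [-1, 1]
```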
2 Maximum Marginal Likelihood

2.1 Introduction

The expectation-maximisation (EM) algorithm [5, 3] performs maximum likelihood estimation for problems in which some of the variables are unobserved. Recently it has been successfully applied to density estimation [2] and probabilistic principal components [23, 22], for example. This section discusses the application of EM to RBF networks.

First we review the probability model of a linear neural network and come up with an expression for the marginal likelihood of the data. It is this likelihood which we ultimately want to maximise. Then we show the results of applying the EM algorithm: a pair of re-estimation formulae for the model parameters. However, it turns out that a similar set of re-estimation formulae can be derived by a simpler method and also that they converge more rapidly than the EM versions. Finally, we draw some conclusions.

2.2 Review

The model estimated by a linear neural network from noisy samples $\{(x_i, y_i)\}_{i=1}^p$ can be written

$$ f(x) = \sum_{j=1}^{m} w_j\, h_j(x) \qquad (2.1) $$

where the $\{h_j\}_{j=1}^m$ are fixed basis functions and $\{w_j\}_{j=1}^m$ are unknown weights (to be estimated). The vector of residual errors between model and data is $e = y - Hw$, where $H$ is the design matrix with elements $H_{ij} = h_j(x_i)$.

In a Bayesian approach to analysing the estimation process, the a priori probability of the weights $w$ can be modelled as a Gaussian of variance $\varsigma^2$,

$$ p(w) \propto \varsigma^{-m} \exp\left( -\frac{w^\top w}{2\,\varsigma^2} \right) . \qquad (2.2) $$

The conditional probability of the data $y$ given the weights $w$ can also be modelled as a Gaussian, with variance $\sigma^2$, to account for the noise included in the outputs of the training set, $\{y_i\}_{i=1}^p$,

$$ p(y \mid w) \propto \sigma^{-p} \exp\left( -\frac{e^\top e}{2\,\sigma^2} \right) . \qquad (2.3) $$

The joint probability of data and weights is the product of $p(w)$ with $p(y \mid w)$ and can be represented as an equivalent cost function by taking logarithms, multiplying by -2 and dropping constant terms to obtain

$$ E(y, w) = p \ln \sigma^2 + m \ln \varsigma^2 + \frac{e^\top e}{\sigma^2} + \frac{w^\top w}{\varsigma^2} . \qquad (2.4) $$
The conditional probability of the weights $w$ given the data $y$ is found using Bayes' rule, again involves the product of (2.2) with (2.3), and is another Gaussian,

$$ p(w \mid y) \propto |W|^{-1/2} \exp\left( -\tfrac{1}{2} (w - \hat{w})^\top W^{-1} (w - \hat{w}) \right) \qquad (2.5) $$

where

$$ \hat{w} = A^{-1} H^\top y, \qquad W = \sigma^2 A^{-1}, \qquad A = H^\top H + \lambda I_m, \qquad \lambda = \frac{\sigma^2}{\varsigma^2} . \qquad (2.6) $$

Finally, the marginal likelihood of the data is

$$ p(y) = \int p(y \mid w)\, p(w)\, dw \;\propto\; \sigma^{-p}\, |P|^{1/2} \exp\left( -\frac{y^\top P y}{2\,\sigma^2} \right) \qquad (2.7) $$

where $P = I_p - H A^{-1} H^\top$. Note that there is an equivalent cost function for $p(y)$ which is obtained by taking logarithms, multiplying by -2 and dropping the constant terms,

$$ E(y) = p \ln \sigma^2 - \ln |P| + \frac{y^\top P y}{\sigma^2} . \qquad (2.8) $$

2.3 The EM Algorithm

The EM algorithm estimates the parameters of a model iteratively, starting from some initial guess. Each iteration consists of an expectation (E) step which finds the distribution for the unobserved variables and a maximisation (M) step which re-estimates the parameters of the model to be those with the maximum likelihood for the observed and missing data combined.

In the context of a linear neural network it is possible to consider the training set $\{(x_i, y_i)\}_{i=1}^p$ as the observed data, the weights $\{w_j\}_{j=1}^m$ as the missing data, and the variance of the noise $\sigma^2$ and the a priori variance of the weights $\varsigma^2$ as the model parameters.

In the E-step, the expectation of the conditional probability of the missing data (2.5) is taken and substituted, in the M-step, into the joint probability of the combined data, or its equivalent cost function (2.4), which is then optimised with respect to the model parameters $\sigma^2$ and $\varsigma^2$. These two steps are guaranteed to increase the marginal probability of the observed data and, when iterated, converge to a local maximum.
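Since the iterations are meant to increase the marginal likelihood, it is convenient to be able to evaluate the equivalent cost (2.8) directly and check that it never rises. The following is a minimal Python/NumPy sketch of that computation; it is illustrative only and not taken from the author's Matlab package (zeta2 stands in for $\varsigma^2$).

```python
import numpy as np

def marginal_cost(H, y, sigma2, zeta2):
    """Cost (2.8): -2 log marginal likelihood, up to an additive constant,
    for noise variance sigma2 and prior weight variance zeta2."""
    p, m = H.shape
    lam = sigma2 / zeta2                         # regularisation parameter (2.6)
    A = H.T @ H + lam * np.eye(m)
    P = np.eye(p) - H @ np.linalg.solve(A, H.T)  # P = I - H A^{-1} H^T
    _, logdetP = np.linalg.slogdet(P)
    return p * np.log(sigma2) - logdetP + y @ P @ y / sigma2
```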
Detailed analysis (see appendix A) results in a pair of re-estimation formulae for the parameters $\sigma^2$ and $\varsigma^2$:

$$ \sigma^2 = \frac{\hat{e}^\top \hat{e} + \gamma\, \sigma^2}{p} \qquad (2.9) $$

$$ \varsigma^2 = \frac{\hat{w}^\top \hat{w} + (m - \gamma)\, \varsigma^2}{m} \qquad (2.10) $$

where

$$ \hat{e} = y - H \hat{w}, \qquad \gamma = m - \lambda\, \mathrm{tr}\, A^{-1} . $$

Initial guesses are substituted into the right hand sides which produce new guesses. The process is repeated until a local minimum of (2.8) is reached. Note that equation (2.10) was derived in [11] by a free energy approach. It has been shown that free energy and the EM algorithm are intimately connected [13].

Figure 2.1 illustrates with the Hermite data described in section 1.1. Centres of radius $r = 1$ were created for each training set input. The figure plots logarithmic contours of (2.8) and the sequence of $\sigma^2$ and $\varsigma^2$ values re-estimated by (2.9, 2.10).

[Figure 2.1: Optimisation of $\sigma^2$ and $\varsigma^2$ by EM (axes $\log \sigma^2$ and $\log \varsigma^2$).]

2.4 The DM Algorithm

An alternative approach to minimising (2.8) is simply to differentiate it and set the results to zero. This is easily done and results in the pair of re-estimation formulae

$$ \sigma^2 = \frac{\hat{e}^\top \hat{e}}{p - \gamma} \qquad (2.11) $$

$$ \varsigma^2 = \frac{\hat{w}^\top \hat{w}}{\gamma} . \qquad (2.12) $$
I call this method the "DM algorithm" after David MacKay who first derived these equations [10]. Its disadvantage is the absence of any guarantee that the iterations converge, unlike their EM counterparts (2.9, 2.10) which are known to increase the marginal likelihood (or leave it the same if a fixed point has been reached). Any fixed point of DM is also a fixed point of EM, and vice versa, but if there are multiple fixed points there is no guarantee that both methods will converge to the same one, even when starting from the same guess.

Figure 2.2 plots the sequence of re-estimated values using (2.11, 2.12) for the same training set, RBF network and initial values of $\sigma^2$ and $\varsigma^2$ used for figure 2.1. It is apparent that convergence is faster for DM than for EM in this example, taking 6 iterations for DM compared to 28 for EM. In fact, our empirical observation is that DM always converges considerably faster than EM if they start from the same guess and converge to the same local minimum. Furthermore, DM has never failed to converge.

[Figure 2.2: Optimisation of $\sigma^2$ and $\varsigma^2$ by DM (axes $\log \sigma^2$ and $\log \varsigma^2$).]

2.5 Conclusions

We started by applying the EM algorithm to RBF networks using the weight-decay (ridge regression) style of penalised likelihood and ended with a pair of re-estimation formulae for the noise variance $\sigma^2$ and prior weight variance $\varsigma^2$. However, these turned out to be less efficient than a similar pair of formulae which had been known in the literature for some time.

The rbf_rr_2 method in the Matlab software package [19] has an option to use maximum marginal likelihood (MML) as the model selection criterion (instead of GCV or BIC, for example). When this option is selected the regularisation parameter $\lambda$ (2.6) is re-estimated using, by default, the DM equations (2.11, 2.12). Another option can be set so that the EM versions (2.9, 2.10) are used instead.
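For concreteness, here is a minimal Python/NumPy sketch of both update rules. It is illustrative only, not the package's code; the function and argument names are my own, and zeta2 again plays the role of $\varsigma^2$.

```python
import numpy as np

def reestimate_variances(H, y, sigma2=1.0, zeta2=1.0, rule="dm",
                         n_iter=100, tol=1e-8):
    """Iterate the EM updates (2.9, 2.10) or the DM updates (2.11, 2.12)
    for the noise variance sigma^2 and prior weight variance zeta^2."""
    p, m = H.shape
    for _ in range(n_iter):
        lam = sigma2 / zeta2                              # lambda = sigma^2 / zeta^2 (2.6)
        A = H.T @ H + lam * np.eye(m)
        w_hat = np.linalg.solve(A, H.T @ y)
        gamma = m - lam * np.trace(np.linalg.inv(A))      # effective number of parameters
        e_hat = y - H @ w_hat
        if rule == "em":
            new = ((e_hat @ e_hat + gamma * sigma2) / p,          # (2.9)
                   (w_hat @ w_hat + (m - gamma) * zeta2) / m)     # (2.10)
        else:
            new = ((e_hat @ e_hat) / (p - gamma),                 # (2.11)
                   (w_hat @ w_hat) / gamma)                       # (2.12)
        done = (abs(new[0] - sigma2) <= tol * sigma2 and
                abs(new[1] - zeta2) <= tol * zeta2)
        sigma2, zeta2 = new
        if done:
            break
    return sigma2, zeta2
```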
3 Optimising the Size of RBFs

3.1 Introduction

In previous work [15, 16] we concentrated on methods for optimising the regularisation parameter, $\lambda$, of an RBF network. However, another key parameter is the size of the RBFs and until now no methods have been provided for its optimisation. This section describes a simple scheme to find an overall scale size for the RBFs in a network.

We first review the basic concepts already covered elsewhere [16] and then describe an improved version of the re-estimation formula for the regularisation parameter which is considerably more efficient and allows multiple initial guesses for $\lambda$ to be optimised in an effort to avoid getting trapped in local minima (the details are given in appendix B). We then describe a method for choosing the best overall size for the RBFs from a number of trial values which is rendered tractable by the efficient optimisation of $\lambda$. We then make some concluding remarks.

3.2 Review

In a linear model with fixed basis functions $\{h_j\}_{j=1}^m$ and weights $\{w_j\}_{j=1}^m$,

$$ f(x) = \sum_{j=1}^{m} w_j\, h_j(x), \qquad (3.1) $$

the model complexity can be controlled by the addition of a penalty term to the sum of squared errors over the training set, $\{(x_i, y_i)\}_{i=1}^p$. When this combined error,

$$ E = \sum_{i=1}^{p} \left( y_i - f(x_i) \right)^2 + \lambda \sum_{j=1}^{m} w_j^2 , $$

is optimised, large components in the weight vector $w$ are inhibited. This kind of penalty is known as ridge regression or weight-decay and the parameter $\lambda$, which controls the amount of penalty, is known as the regularisation parameter.

While the nominal number of free parameters is $m$ (the weights), the effective number is less, due to the penalty term, and is given in [12] by

$$ \gamma = m - \lambda\, \mathrm{tr}\, A^{-1}, \qquad (3.2) $$

$$ A = H^\top H + \lambda I_m, \qquad (3.3) $$

where $H$ is the design matrix with elements $H_{ij} = h_j(x_i)$. The expression for $\gamma$ is monotonic in $\lambda$ so model complexity can be decreased (or increased) by raising (or lowering) the value of $\lambda$. The parameter $\lambda$ has a Bayesian interpretation: it is the ratio of $\sigma^2$, the variance of the noise corrupting the training set outputs, to $\varsigma^2$, the a priori variance of the weights (see section 2). If the value of $\lambda$ is known then the optimal weight vector is

$$ \hat{w} = A^{-1} H^\top y . \qquad (3.4) $$
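As a minimal sketch, in the same illustrative Python/NumPy style as before, the ridge solution (3.4) and the effective number of parameters (3.2) for a given $\lambda$ could be computed as follows.

```python
import numpy as np

def ridge_fit(H, y, lam):
    """Ridge-regression weights (3.4) and effective number of
    parameters gamma (3.2) for regularisation parameter lam."""
    m = H.shape[1]
    A = H.T @ H + lam * np.eye(m)
    A_inv = np.linalg.inv(A)
    w_hat = A_inv @ (H.T @ y)
    gamma = m - lam * np.trace(A_inv)
    return w_hat, gamma
```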
However, neither $\sigma^2$ nor $\varsigma^2$ may be available in a practical situation so it is usually necessary to establish an effective value for $\lambda$ in parallel with optimising the weights. This may be done with model selection criteria such as BIC (Bayesian information criterion), GCV (generalised cross-validation) or MML (maximum marginal likelihood; see section 2), and in particular with one or more re-estimation formulae. For GCV the single formula is

$$ \lambda = \frac{\eta\, \hat{e}^\top \hat{e}}{(p - \gamma)\, \hat{w}^\top A^{-1} \hat{w}} \qquad (3.5) $$

where

$$ \hat{e} = y - H \hat{w}, \qquad \eta = \mathrm{tr}\left( A^{-1} - \lambda A^{-2} \right) . $$

An initial guess for $\lambda$ is used to evaluate the right hand side of (3.5) which produces a new guess. The resulting sequence of re-estimated values converges to a local minimum of GCV. Each iteration requires the inverse of the m-by-m matrix $A$ and therefore costs of order $m^3$ floating point operations.

3.3 Efficient Re-estimation of $\lambda$

The optimisation of $\lambda$ by iteration of the re-estimation formula is burdened by the necessity of having to compute an expensive matrix inverse every iteration. However, by a reformulation of the individual terms of the equation using the eigenvalues and eigenvectors of $H H^\top$ it is possible to perform most of the work during the first iteration and reuse the results for subsequent ones. Thus the amount of computation required to complete an optimisation which takes $q$ steps to converge is reduced by almost a factor of $1/q$. Unfortunately, the technique only works for a single global regularisation parameter [15], not for multiple parameters applying to different groups of weights or to individual weights [14].

Suppose the eigenvalues and eigenvectors of $H H^\top$ are $\{\mu_i\}_{i=1}^p$ and $\{u_i\}_{i=1}^p$ and that the projections of $y$ onto the eigenvectors are $\tilde{y}_i = y^\top u_i$. Then, as shown in appendix B, the four terms involved in the re-estimation formula (3.5) are

$$ p - \gamma = \sum_{i=1}^{p} \frac{\lambda}{\mu_i + \lambda} \qquad (3.6) $$

$$ \eta = \sum_{i=1}^{p} \frac{\mu_i}{(\mu_i + \lambda)^2} \qquad (3.7) $$

$$ \hat{e}^\top \hat{e} = \sum_{i=1}^{p} \frac{\lambda^2\, \tilde{y}_i^2}{(\mu_i + \lambda)^2} \qquad (3.8) $$

$$ \hat{w}^\top A^{-1} \hat{w} = \sum_{i=1}^{p} \frac{\mu_i\, \tilde{y}_i^2}{(\mu_i + \lambda)^3} . \qquad (3.9) $$

If $\lambda$ is re-estimated by computing (3.6-3.9), instead of explicitly calculating the inverse in (3.5), then the computational cost of each iteration is only of order $p$ instead of $m^3$.
The overhead of initially calculating the eigensystem, which is of order $p^3$, has to be taken into account but is only performed once. For problems in which $p$ is not much bigger than $m$ this represents a significant saving in computation time and makes it feasible to optimise multiple guesses for the initial value of $\lambda$ to decrease the chances of getting caught in a local minimum.

3.4 Avoiding Local Minima

If the initial guess for $\lambda$ is close to a local minimum of GCV (or whatever model selection criterion is employed) then re-estimation using (3.5) is likely to get trapped. We illustrate by using Friedman's data set as described in section 1.2 with an RBF network of 200 Gaussian centres coincident with the inputs of the training set and of fixed radius $r = 8$. The solid curve in figure 3.1 shows the variation of GCV with $\lambda$. The open circles show a sequence of re-estimated values with their corresponding GCV scores. The initial guess was $\lambda = 10^{-1}$ and the sequence converged near $10^{-6}$ (shown by the closed circle) at a local minimum. The global minimum near $10^{-10}$ was missed.

[Figure 3.1: The variation of GCV with $\lambda$ for Friedman's problem and a sequence of re-estimations starting at $\lambda = 10^{-1}$.]

Compare figure 3.1 with figure 3.2 where the only change was to use a different guess, $10^{-7}$, for the initial value of $\lambda$. This time the guess is sufficiently close to the global minimum that the re-estimations are attracted towards it.

[Figure 3.2: Same as figure 3.1 except the initial guess is $\lambda = 10^{-7}$.]

Note that the set of eigenvalues and eigenvectors used to compute the sequences in figures 3.1 and 3.2 are identical. Since the calculation of the eigensystem dominates the other computational costs it is almost as expensive to optimise one trial value as it is to optimise several. Thus, to avoid falling into a local minimum, several trial values spread over a wide range can be optimised and the solution with the lowest GCV selected as the overall winner. This value can then be used to determine the weights (3.4, 3.3), and ultimately the predictions (3.1), of the network.
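A minimal sketch of this re-estimation scheme, with every term computed from the eigensystem of $H H^\top$ as in (3.6-3.9), might look as follows. It is illustrative Python/NumPy, not the Matlab package's implementation; the GCV score itself is evaluated with the standard expression $p\, \hat{e}^\top \hat{e} / (p - \gamma)^2$.

```python
import numpy as np

def reestimate_lambda(mu, y_t2, lam0, n_iter=50, tol=1e-10):
    """Iterate (3.5) using the eigenvalues mu of H H^T and the squared
    projections y_t2 of y onto its eigenvectors, as in (3.6)-(3.9).
    Returns the converged lambda and its GCV score."""
    p = len(mu)
    lam = lam0
    for _ in range(n_iter):
        d = mu + lam
        p_minus_gamma = np.sum(lam / d)              # (3.6)
        eta = np.sum(mu / d**2)                      # (3.7)
        ete = np.sum(lam**2 * y_t2 / d**2)           # (3.8)
        wAw = np.sum(mu * y_t2 / d**3)               # (3.9)
        lam_new = eta * ete / (p_minus_gamma * wAw)  # (3.5)
        if abs(lam_new - lam) <= tol * lam:
            lam = lam_new
            break
        lam = lam_new
    d = mu + lam
    gcv = p * np.sum(lam**2 * y_t2 / d**2) / np.sum(lam / d)**2
    return lam, gcv

# One eigendecomposition serves every initial guess (section 3.4), e.g.:
#   mu, U = np.linalg.eigh(H @ H.T)
#   y_t2 = (U.T @ y) ** 2
#   lam, score = min((reestimate_lambda(mu, y_t2, g)
#                     for g in 10.0 ** np.arange(-12.0, 1.0)),
#                    key=lambda pair: pair[1])
```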
3.5 The Optimal RBF Size

For Gaussian radial functions of fixed width $r$ the transfer functions of the hidden units are

$$ h_j(x) = \exp\left( -\frac{(x - c_j)^\top (x - c_j)}{r^2} \right) . $$

Unfortunately, there is no re-estimation formula for $r$, as there is for $\lambda$, even in this simple case where the same scale is used for each RBF and each component of the input [18]. To properly optimise the value of $r$ would thus require the use of a nonlinear optimisation algorithm and would have to incorporate the optimisation of $\lambda$ (since the optimal value of $\lambda$ changes as $r$ changes).

An alternative, if rather crude, approach is to test a number of trial values for $r$. For each value an optimal $\lambda$ is calculated (by using the re-estimation method above) and the model selection score noted. When all the values have been checked, the one associated with the lowest score wins. The computational cost of this procedure is dominated, once again, by the cost of computing the eigenvalues and eigenvectors of $H H^\top$, and these have to be calculated separately for each value of $r$.

While this procedure is less computationally demanding than a full nonlinear optimisation of $r$, its drawback is that it is only capable of identifying the best value for $r$ from a finite number of alternatives. On the other hand, given that the value of $\lambda$ is fully optimised and that the model selection criteria are heuristic (in other words, approximate) in nature, it is arguable that a more precise location for the optimal value of $r$ is unlikely to have much practical significance.

We illustrate the method on the Hermite data as described in section 1.1. Once again we use each training set input as an RBF centre. We tried seven different trial values for $r$: 0.4, 0.6, 0.8, 1.0, 1.2, 1.4 and 1.6. For each trial value we plot, in figure 3.3, the variation of GCV with $\lambda$ (the curves), as well as the optimal $\lambda$ (the closed circles) found by re-estimation as described above.
[Figure 3.3: The Hermite data set with seven sizes of RBFs.]

The radius value which led to the lowest GCV score was $r = 1.0$ and the corresponding optimal regularisation parameter is $\lambda \approx 0.3$. Initially, as $r$ increases from its lowest value, the GCV score at the optimum decreases. Eventually it reaches its lowest value at $r = 1.0$. Above that there is not much increase in optimised GCV, although the optimal $\lambda$ decreases rapidly.

3.6 Trial Values in Other Contexts

The use of trial values is limited to cases where there is a small number of parameters to optimise, such as the single parameter $r$. If there are several parameters with trial values then the number of different combinations to evaluate can easily become prohibitively large. In RBF networks where there is a separate scale parameter for each dimension, so that the transfer functions are, for example in the case of Gaussians,

$$ h_j(x) = \exp\left( -\sum_{k=1}^{n} \frac{(x_k - c_{jk})^2}{r_{jk}^2} \right) , $$

there would be $t^{mn}$ combinations to check, where $t$ is the number of trial values for each $r_{jk}$, $m$ is the number of basis functions and $n$ the number of dimensions.

However, it is possible to test trial values for an overall scale size if some other mechanism can be used to generate the scales $r_{jk}$. Here, the transfer functions are

$$ h_j(x) = \exp\left( -\sum_{k=1}^{n} \frac{(x_k - c_{jk})^2}{\alpha^2\, r_{jk}^2} \right) . $$

This is the approach taken for the method in section 4, where a regression tree determines the values of $r_{jk}$ but the overall scale size $\alpha$ is optimised by testing trial values.
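Continuing the illustrative Python/NumPy sketches, and reusing reestimate_lambda from the previous one, the trial-value search over an overall RBF size could be organised like this. The helper names are my own, not the Matlab package's; X and C are (p, n) and (m, n) arrays, so one-dimensional inputs such as the Hermite data should be reshaped with x[:, None].

```python
import numpy as np

def gaussian_design(X, C, r):
    """Design matrix H_ij = exp(-||x_i - c_j||^2 / r^2) for Gaussian
    RBFs of common radius r centred at the rows of C."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / r**2)

def select_rbf_size(X, y, C, trial_radii, trial_lambdas):
    """Section 3.5: for each trial radius, optimise lambda from several
    initial guesses and keep the (radius, lambda) pair with lowest GCV."""
    best = None
    for r in trial_radii:
        H = gaussian_design(X, C, r)
        mu, U = np.linalg.eigh(H @ H.T)     # one eigendecomposition per radius
        y_t2 = (U.T @ y) ** 2
        for lam0 in trial_lambdas:
            lam, score = reestimate_lambda(mu, y_t2, lam0)
            if best is None or score < best[0]:
                best = (score, r, lam)
    return best   # (lowest GCV score, chosen radius, chosen lambda)
```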
3.7 Conclusions

We've shown how trial values for the overall size of the RBFs can be compared using a model selection criterion. In the case of ridge regression, an efficient method for optimising the regularisation parameter helps reduce the computational burden of training a separate network for each trial value. However, the same technique can also be used with other methods of complexity control, including those in which there is no regularisation.

In the Matlab software package [19] each method can be configured with a set of trial values for the overall RBF scale. The best value is chosen and used to generate the RBF network which the Matlab function returns.
4 Regression Trees and RBF Networks

4.1 Introduction

This section is about a novel method for nonparametric regression involving a combination of regression trees and RBF networks [8]. The basic idea of a regression tree is to recursively partition the input space in two and approximate the function in each half by the average output value of the samples it contains [4]. Each split is parallel to one of the axes so it can be expressed by an inequality involving one of the input components (e.g. $x_k > b$). The input space is thus divided into hyperrectangles organised into a binary tree where each branch is determined by the dimension ($k$) and boundary ($b$) which together minimise the residual error between model and data.

A benefit of regression trees is the information provided in the split statistics about the relevance of each input variable. The components which carry the most information about the output tend to be split earliest and most often. A weakness of regression trees is the discontinuous model caused by the output value jumping across the boundary between two hyperrectangles. There is also the problem of deciding when to stop growing the tree (or equivalently, how much to prune after it has fully grown), which is the familiar bias-variance dilemma faced by all methods of nonparametric regression [7]. The use of radial basis functions in conjunction with regression trees can help to solve both these problems.

Below we outline the basic method of combining RBFs and regression trees as it appeared originally and describe our version of this idea and why we think it is an improvement. Finally we show some results and summarise our conclusions.

4.2 The Basic Idea

The combination of trees and RBF networks was first suggested by [9] in the context of classification rather than regression (though the two cases are very similar). Further elaboration of the idea appeared in [8]. Essentially, each terminal node of the classification tree contributes one hidden unit to the RBF network, the centre and radius of which are determined by the position and size of the corresponding hyperrectangle. Thus the tree sets the number, positions and sizes of all RBFs in the network. Model complexity is controlled by two parameters: -c determines the amount of tree pruning in C4.5 [20] (the software package used by [8] to generate classification trees) and $\alpha$ fixes the size of RBFs relative to hyperrectangles.

Our major reservation about the approach taken by [8] is the treatment of model complexity. In the case of the scaling parameter ($\alpha$), the author claimed it had little effect on prediction accuracy, but this is not in accord with our previous experience of RBF networks. As for the amount of pruning (-c), he demonstrated its effect on prediction accuracy yet used a fixed value in his benchmark tests. Moreover, there was no discussion of how to control scaling and pruning to optimise model complexity for a given data set.
Our method is a variation on Kubat's with the following alterations.

1. We address the model complexity issue by using the nodes of the regression tree not to fix the RBF network but rather to generate a set of RBFs from which the final network can be selected. Thus the burden of controlling model complexity shifts from tree generation to RBF selection.

2. The regression tree from which the RBFs are produced can also be used to order selections such that certain candidate RBFs are allowed to enter the model before others. We describe one way to achieve such an ordering and demonstrate that it produces more accurate models than plain forward selection.

3. We show that, contrary to the conclusions of [8], the method is typically quite sensitive to the parameter $\alpha$ and discuss its optimisation by the use of multiple trial values.

4.3 Generating the Regression Tree

The first stage of our method (and Kubat's) is to generate a regression tree. The root node of the tree is the smallest hyperrectangle which contains all the training set inputs, $\{x_i\}_{i=1}^p$. Its size $s_k$ (the half-width) and centre $c_k$ in each dimension $k$ are

$$ s_k = \frac{1}{2} \left( \max_{i \in S}(x_{ik}) - \min_{i \in S}(x_{ik}) \right), \qquad c_k = \frac{1}{2} \left( \max_{i \in S}(x_{ik}) + \min_{i \in S}(x_{ik}) \right), $$

where $S = \{1, 2, \ldots, p\}$ is the set of training set indices. A split of the root node divides the training samples into left and right subsets, $S_L$ and $S_R$, on either side of a boundary $b$ in one of the dimensions $k$ such that

$$ S_L = \{ i : x_{ik} \le b \}, \qquad S_R = \{ i : x_{ik} > b \} . $$

The mean output value on either side of the split is

$$ \bar{y}_L = \frac{1}{p_L} \sum_{i \in S_L} y_i, \qquad \bar{y}_R = \frac{1}{p_R} \sum_{i \in S_R} y_i, $$

where $p_L$ and $p_R$ are the number of samples in each subset. The residual square error between model and data is then

$$ E(k, b) = \frac{1}{p} \left( \sum_{i \in S_L} (y_i - \bar{y}_L)^2 + \sum_{i \in S_R} (y_i - \bar{y}_R)^2 \right) . $$
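The next paragraph explains how this split search drives the recursive construction of the tree. As a concrete illustration, here is a minimal Python/NumPy sketch of the exhaustive search over dimensions $k$ and boundaries $b$ for a single node; the constant factor $1/p$ is dropped since it does not affect the minimiser, p_min is the minimum-samples-per-node parameter introduced below, and the function name is my own.

```python
import numpy as np

def best_split(X, y, idx, p_min):
    """Exhaustive search for the split (k, b) minimising E(k, b) over the
    samples indexed by the integer array idx, subject to both children
    keeping at least p_min samples.  Returns (k, b) or None."""
    best, best_err = None, np.inf
    for k in range(X.shape[1]):
        order = idx[np.argsort(X[idx, k])]
        xk, yk = X[order, k], y[order]
        for s in range(p_min, len(order) - p_min + 1):
            if xk[s - 1] == xk[s]:          # boundary must separate distinct values
                continue
            yl, yr = yk[:s], yk[s:]
            err = ((yl - yl.mean()) ** 2).sum() + ((yr - yr.mean()) ** 2).sum()
            if err < best_err:
                best_err = err
                best = (k, 0.5 * (xk[s - 1] + xk[s]))
    return best
```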
The split which minimises $E(k, b)$ over all possible choices of $k$ and $b$ is used to create the children of the root node and is easily found by discrete search over the $n$ dimensions and $p$ cases. The children of the root node are split recursively in the same manner and the process terminates when a node cannot be split without creating a child containing fewer samples than a given minimum, $p_{\min}$, which is a parameter of the method. Compared to their parent nodes, the child centres will be shifted and their sizes reduced in the $k$-th dimension. Since the size of the regression tree does not determine the model complexity, there is no need to perform the final pruning step normally associated with recursive splitting methods [4, 20, 8].

4.4 From Hyperrectangles to RBFs

The regression tree contains a root node, some nonterminal nodes (having children) and some terminal nodes (having no children). Each node is associated with a hyperrectangle of input space having a centre $c$ and size $s$ as described above. The node corresponding to the largest hyperrectangle is the root node and the node sizes decrease down the tree as they are divided into smaller and smaller pieces.

To translate a hyperrectangle into a Gaussian RBF we use its centre $c$ as the RBF centre and its size $s$ scaled by a parameter $\alpha$ as the RBF radius, $r = \alpha s$. The scalar $\alpha$ has the same value for all nodes and is another parameter of the method (in addition to $p_{\min}$). Our $\alpha$ is not quite the same as Kubat's (they are related by an inverse and a factor of 2) but plays exactly the same role.

4.5 Selecting the Subset of RBFs

After the tree nodes are translated into RBFs the next step of our method is to select a subset of them for inclusion in the model. This is in contrast to the method of [8] where all RBFs from terminal nodes were included in the model, which was thus heavily dependent on the extent of tree pruning to control model complexity. Selection can be performed either using a standard method such as forward selection [15, 16] or in a novel way, by employing the tree to guide the order in which candidate RBFs are considered.

In the standard methods for subset selection the RBFs generated from the regression tree are treated as an unstructured collection with no distinction between RBFs corresponding to different nodes in the tree. However, intuition suggests that the best order to consider RBFs for inclusion in the model is large ones first and small ones last, to synthesise coarse structure before fine details. This, in turn, suggests searching for RBF candidates by traversing the tree from the largest hyperrectangle (and RBF) at the root to the smallest hyperrectangles (and RBFs) at the terminal nodes. Thus the first decision should be whether to include the root node in the model, the second whether to include any of the children of the root node, and so on, until the terminal nodes are reached.

The scheme we eventually developed for selecting RBFs goes somewhat beyond this simple picture and was influenced by two other considerations. The first concerns a classic problem with forward selection, namely, that one regressor can block the selection of other more explanatory regressors which would have been chosen in preference had they been considered first.
In our case there was a danger that a parent RBF could block its own children. To avoid this situation, when considering whether to add the children of a node which had already been selected we also considered the effect of deleting the parent. Thus our method has a measure of backward elimination as well as forward selection. This is reminiscent of the selection schemes developed for the MARS [6] and MAPS [1] algorithms.

A second reason for departing from a simple breadth-first search is that the size of a hyperrectangle (in terms of volume) on one level is not guaranteed to be smaller than the size of all the hyperrectangles in the level above (only its parent), so it is not easy to achieve a strict largest-to-smallest ordering. In view of this, we abandoned any attempt to achieve a strict ordering and instead devised a search algorithm which dynamically adjusts the set of selectable RBFs by replacing selected RBFs with their children.

The algorithm depends on the concept of an active list of nodes. At any given moment during the selection process only these nodes and their children are considered for inclusion or exclusion from the model. Every time RBFs are added or subtracted from the model the active list expands by having a node replaced by its children. Eventually the active list becomes coincident with the terminal nodes and the search is terminated. In detail, the steps of the algorithm are as follows.

1. Initialise the active list with the root node and the model with the root node's RBF.

2. For all nonterminal nodes on the active list consider the effect (on the model selection criterion) of adding both or just one of the children's RBFs (three possible modifications to the model). If the parent's RBF is already in the model, also consider the effect of first removing it before adding one or both children's RBFs or of just removing it (a further four possible modifications).

3. The total number of possible adjustments to the model is somewhere between three and seven times the number of active nonterminal nodes, depending on how many of their RBFs are already in the model. From all these possibilities choose the one which most decreases the model selection criterion. Update the current model and remove the node involved from the active list, replacing it with its children. If none of the modifications decrease the selection criterion then choose one of the active nodes at random and replace it by its children but leave the model unaltered.

4. Return to step 2 and repeat until all the active nodes are terminal nodes.

Once the selection process has terminated the network weights can be calculated in the usual way by solving the normal equation,

$$ w = \left( H^\top H \right)^{-1} H^\top y, $$

where $H$ is the design matrix. There is no need for a regularisation term, as appears in equations (3.4, 3.3) for example, because model complexity is limited by the selection process.
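To make the last two steps concrete, here is a minimal Python/NumPy sketch of turning a tree node's hyperrectangle into a Gaussian RBF (section 4.4) and of fitting a selected subset of RBFs. The function names are my own, and the tree-guided selection loop itself (steps 1-4 above) is not reproduced here.

```python
import numpy as np

def node_to_rbf(centre, size, alpha):
    """Translate a node's hyperrectangle (centre c, half-widths s) into a
    Gaussian RBF with centre c and radii r = alpha * s (section 4.4)."""
    return np.asarray(centre, float), alpha * np.asarray(size, float)

def rbf_column(X, centre, radii):
    """One design-matrix column for the diagonal-metric Gaussian
    h(x) = exp(-sum_k (x_k - c_k)^2 / r_k^2)."""
    return np.exp(-(((X - centre) / radii) ** 2).sum(axis=1))

def fit_selected(X, y, selected):
    """Weights for the selected RBFs: the least-squares solution of
    H w = y, equivalent to the normal equation w = (H^T H)^{-1} H^T y
    when H has full column rank (computed with lstsq for stability).
    No regularisation term is needed because model complexity is
    limited by the selection process."""
    H = np.column_stack([rbf_column(X, c, r) for c, r in selected])
    w, *_ = np.linalg.lstsq(H, y, rcond=None)
    return H, w
```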
4.6 The Best Parameter Values

Our method has three main parameters: the model selection criterion, $p_{\min}$ which controls the depth of the regression tree, and $\alpha$ which determines the relative size between hyperrectangles and RBFs.

For the model selection criterion we found that the more conservative BIC, which tends to produce more parsimonious models, rarely performed worse than GCV and often did significantly better. This is in line with the experiences of other practitioners of algorithms based on subset selection such as [6], who modified GCV to make it more conservative, and [1], who also found BIC gave better results than GCV.

For $p_{\min}$ and $\alpha$ we use the simple method of comparing the model selection scores of a number of trial values, as for the RBF widths in section 3. This means growing several trees (one for each trial value of $p_{\min}$) and then (for each tree) selecting models from several sets of RBFs (one for each value of $\alpha$). The cost is extra computation: the more trial values there are, the longer the algorithm takes to search through them. However, the basic algorithm is not unduly expensive and if the number of trial values is kept fairly low (about 10 or fewer alternatives for each parameter), the computation time is acceptable.

4.7 Demonstrations

Figure 4.1 shows the prediction of a pure regression tree for a sample of Hermite data (section 1.1). For clarity, the samples themselves are not shown, just the target function and the prediction. Of course, the model is discontinuous and each horizontal section corresponds to one terminal node in the tree.

[Figure 4.1: A pure regression tree prediction on the Hermite data.]
The tree which produced the prediction shown in figure 4.1 was grown until further splitting would have violated the minimum number of samples allowed per node ($p_{\min}$). There was no pruning or any other sophisticated form of complexity control, so this kind of tree is not suitable for practical use as a prediction method. However, in our method the tree is only used to create RBF candidates. Model complexity is controlled by a separate process which selects a subset of RBFs for the network.

Figure 4.2 shows the predictions of the combined method on the same data set used in figure 4.1 after a subset of RBFs was selected from the pool of candidates generated by the tree nodes. Now the model is continuous and its complexity is well matched to the data.

[Figure 4.2: The combined method on the Hermite data.]

As a last demonstration we turn our attention to Friedman's data set (section 1.2). In experiments with the MARS algorithm [6], Friedman estimated the accuracy of his method by replicating data sets 100 times and computing the mean and standard deviation of the scaled sum-square-error. For this data set his best results, corresponding to the most favourable values for the parameters of the MARS method, were 0.12 ± 0.06.

To compare our algorithm with MARS, and also to test the effect of using multiple trial values for our method's parameters, $p_{\min}$ and $\alpha$, we conducted a similar experiment. Before we started, we tried some different settings for the trial values and identified one which gave good results on test data. Then, for each of the 100 replications, we applied the method twice. In the first run we used the trial values we had discovered earlier. In the second run we used only a single "best" value for each parameter, the average of the trial values, forcing this value to be used for every replicated data set. The results are shown in table 1. It is apparent that the results are practically identical to MARS when the full sets of trial values are used but significantly inferior when only single "best" values are used.
  p_min      alpha            error
  3, 4, 5    6, 7, 8, 9, 10   0.12 ± 0.05
  4          8                0.18 ± 0.07

Table 1: Results on 100 replications of Friedman's data set.

In another test using replicated data sets we compared the two alternative methods of selecting the RBFs from the candidates generated by the tree: standard forward selection or the method described above in section 4.5 which uses the tree to guide the order in which candidates are considered. This was the only difference between the two runs; the model parameters were the same as in the first row of table 1. The performance of tree-guided selection was 0.12 ± 0.05 (as in table 1), but forward selection was significantly worse, 0.28 ± 0.10.

4.8 Conclusions

We have described a method for nonparametric regression based on combining regression trees and radial basis function networks. The method is similar to [8] and has the same advantages (a continuous model and automatic relevance determination) but also some significant improvements. The main enhancement is the addition of an automatic method for the control of model complexity through the selection of RBFs. We have also developed a novel procedure for selecting the RBFs based on the structure of the tree.

We've presented evidence that the method is comparable in performance to the well known MARS algorithm and that some of its novel features (trial parameter values, tree-guided selection) are actually beneficial. More detailed evaluations with DELVE [21] data sets are in preparation and preliminary results support these conclusions.

The Matlab software package [19] has two implementations of the method. One function, rbf_rt_1, uses tree-guided selection, while the other, rbf_rt_2, uses forward selection. The operation of each function is described, with examples, in a comprehensive manual.
5 Appendix

A Applying the EM Algorithm

We want to maximise the marginal probability of the observed data (2.7) by substituting expectations of the conditional probability of the unobserved data (2.5) into the cost function for the joint probability of the combined data (2.4) and minimising this with respect to the parameters $\sigma^2$ (the noise variance) and $\varsigma^2$ (the a priori weight variance).

From (2.5), $\langle w \rangle = \hat{w}$ and $\langle (w - \hat{w})(w - \hat{w})^\top \rangle = W = \sigma^2 A^{-1}$. The expectation of $w^\top w$ is then

$$ \begin{aligned}
\langle w^\top w \rangle &= \mathrm{tr}\, \langle w\, w^\top \rangle \\
&= \hat{w}^\top \hat{w} + \mathrm{tr}\, \langle w\, w^\top - \hat{w}\, \hat{w}^\top \rangle \\
&= \hat{w}^\top \hat{w} + \mathrm{tr}\, \langle (w - \hat{w})(w - \hat{w})^\top \rangle \\
&= \hat{w}^\top \hat{w} + \sigma^2\, \mathrm{tr}\, A^{-1} \\
&= \hat{w}^\top \hat{w} + \varsigma^2 (m - \gamma) . \qquad (A.1)
\end{aligned} $$

The last step follows from $\gamma = m - \lambda\, \mathrm{tr}\, A^{-1}$ (the effective number of parameters) and $\lambda = \sigma^2 / \varsigma^2$ (the regularisation parameter). Similarly,

$$ \begin{aligned}
\langle e^\top e \rangle &= \mathrm{tr}\, \langle e\, e^\top \rangle \\
&= \hat{e}^\top \hat{e} + \mathrm{tr}\, \langle e\, e^\top - \hat{e}\, \hat{e}^\top \rangle \\
&= \hat{e}^\top \hat{e} + \mathrm{tr}\left( H\, \langle w\, w^\top - \hat{w}\, \hat{w}^\top \rangle\, H^\top \right) \\
&= \hat{e}^\top \hat{e} + \sigma^2\, \mathrm{tr}\left( H A^{-1} H^\top \right) \\
&= \hat{e}^\top \hat{e} + \sigma^2 \gamma, \qquad (A.2)
\end{aligned} $$

since $e = y - Hw$ is linear in $w$ and $\mathrm{tr}\left( H A^{-1} H^\top \right)$ is another expression for the effective number of parameters $\gamma$.

Equations (A.1, A.2) summarise the expectation of the conditional probability for $w$ and can be substituted into the joint probability of the combined data or the equivalent cost function (2.4) so that the resulting expression can be optimised with respect to $\sigma^2$ and $\varsigma^2$. Note that in (A.1, A.2) these parameters are held constant at their old values; only the explicit occurrences of $\sigma^2$ and $\varsigma^2$ in (2.4) are varied in the optimisation. After differentiating (2.4) with respect to $\sigma^2$ and $\varsigma^2$, equating the results to zero and finally substituting the expectations (A.1, A.2) we get the re-estimation formulae

$$ \sigma^2 = \frac{\hat{e}^\top \hat{e} + \gamma\, \sigma^2}{p}, \qquad \varsigma^2 = \frac{\hat{w}^\top \hat{w} + (m - \gamma)\, \varsigma^2}{m} . $$
B The Eigensystem of $H H^\top$

We want to derive expressions for each of the terms in (3.5) using the eigenvalues and eigenvectors of $H H^\top$. We start with a singular value decomposition of the design matrix, $H = U S V^\top$, where $U = [u_1\ u_2\ \ldots\ u_p] \in \mathbb{R}^{p \times p}$ and $V \in \mathbb{R}^{m \times m}$ are orthogonal and $S \in \mathbb{R}^{p \times m}$,

$$ S = \begin{bmatrix}
\sqrt{\mu_1} & 0 & \cdots & 0 \\
0 & \sqrt{\mu_2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sqrt{\mu_m} \\
0 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 0
\end{bmatrix}, $$

contains the singular values, $\{\sqrt{\mu_i}\}$. Note that, due to the orthogonality of $V$,

$$ H H^\top = U S S^\top U^\top = \sum_{i=1}^{p} \mu_i\, u_i u_i^\top , $$

so the $\mu_i$ are the eigenvalues, and the $u_i$ the eigenvectors, of the matrix $H H^\top$. The eigenvalues are non-negative and, we assume, ordered from largest to smallest so that if $p > m$ then $\mu_i = 0$ for $i > m$. The eigenvectors are orthonormal ($u_i^\top u_{i'} = \delta_{ii'}$).

As a preliminary step, we derive some more basic relations. First, the matrix inverse in each re-estimation is

$$ A^{-1} = \left( H^\top H + \lambda I_m \right)^{-1} = \left( V S^\top S V^\top + \lambda V V^\top \right)^{-1} = V \left( S^\top S + \lambda I_m \right)^{-1} V^\top . \qquad (B.1) $$

Note that the second step would have been impossible if the regularisation term, $\lambda I_m$, had not been proportional to the identity matrix, which is where the analysis breaks down in the case of multiple regularisation parameters. Secondly, the optimal weight vector is

$$ \hat{w} = A^{-1} H^\top y = V \left( S^\top S + \lambda I_m \right)^{-1} S^\top U^\top y = V \left( S^\top S + \lambda I_m \right)^{-1} S^\top \tilde{y} \qquad (B.2) $$

where $\tilde{y} = U^\top y$ is the projection of $y$ onto the eigenbasis $U$.
Thirdly, from (B.1) we can further derive

$$ \begin{aligned}
\gamma &= m - \lambda\, \mathrm{tr}\, A^{-1} \\
&= m - \lambda\, \mathrm{tr}\left( V \left( S^\top S + \lambda I_m \right)^{-1} V^\top \right) \\
&= m - \lambda\, \mathrm{tr}\left( S^\top S + \lambda I_m \right)^{-1} \\
&= m - \sum_{j=1}^{m} \frac{\lambda}{\mu_j + \lambda} \\
&= \sum_{j=1}^{m} \frac{\mu_j}{\mu_j + \lambda} \\
&= \sum_{i=1}^{p} \frac{\mu_i}{\mu_i + \lambda} . \qquad (B.3)
\end{aligned} $$

Here we have assumed $p \ge m$ so the last step follows (for $\lambda > 0$) because if $p > m$ then the last $(p - m)$ eigenvalues are zero. However, the conclusion is also true if $p < m$ since in that case the last $(m - p)$ singular values are annihilated in the product $S^\top S$.

Fourthly, and last of the preliminary calculations, the vector of residual errors is

$$ \hat{e} = y - H \hat{w} = \left( I_p - U S \left( S^\top S + \lambda I_m \right)^{-1} S^\top U^\top \right) y = U \left( I_p - S \left( S^\top S + \lambda I_m \right)^{-1} S^\top \right) \tilde{y} . \qquad (B.4) $$

Now we are ready to tackle the terms in (3.5). From (B.3) we have

$$ p - \gamma = p - \sum_{i=1}^{p} \frac{\mu_i}{\mu_i + \lambda} = \sum_{i=1}^{p} \frac{\lambda}{\mu_i + \lambda} . \qquad (B.5) $$

From (B.1) and a set of steps similar to the derivation of (B.3) it follows that

$$ \eta = \mathrm{tr}\, A^{-1} - \lambda\, \mathrm{tr}\, A^{-2} = \sum_{j=1}^{m} \frac{1}{\mu_j + \lambda} - \sum_{j=1}^{m} \frac{\lambda}{(\mu_j + \lambda)^2} = \sum_{j=1}^{m} \frac{\mu_j}{(\mu_j + \lambda)^2} = \sum_{i=1}^{p} \frac{\mu_i}{(\mu_i + \lambda)^2} . \qquad (B.6) $$
The last step follows in a similar way to the last step of (B.3). Next we tackle the term $\hat{w}^\top A^{-1} \hat{w}$. From (B.1) and (B.2) we get

$$ \hat{w}^\top A^{-1} \hat{w} = \tilde{y}^\top S \left( S^\top S + \lambda I_m \right)^{-3} S^\top \tilde{y} = \sum_{i=1}^{p} \frac{\mu_i\, \tilde{y}_i^2}{(\mu_i + \lambda)^3} . \qquad (B.7) $$

The sum of squared residual errors is, from (B.4),

$$ \hat{e}^\top \hat{e} = \left\| \left( I_p - S \left( S^\top S + \lambda I_m \right)^{-1} S^\top \right) \tilde{y} \right\|^2 = \sum_{j=1}^{m} \frac{\lambda^2\, \tilde{y}_j^2}{(\mu_j + \lambda)^2} + \sum_{i=m+1}^{p} \tilde{y}_i^2 = \sum_{i=1}^{p} \frac{\lambda^2\, \tilde{y}_i^2}{(\mu_i + \lambda)^2} . \qquad (B.8) $$

For this derivation we assumed that $p \ge m$ but, for reasons similar to those stated for the derivation of (B.3), the result is also true for $p < m$.

Equations (B.5-B.8) express each of the four terms in (3.5) using the eigenvalues and eigenvectors of $H H^\top$, which was our main goal in this appendix. Other useful expressions involving the eigensystem of $H H^\top$ are

$$ \ln |P| = \sum_{i=1}^{p} \ln \frac{\sigma^2}{\varsigma^2 \mu_i + \sigma^2} = p \ln \sigma^2 - \sum_{i=1}^{p} \ln\left( \varsigma^2 \mu_i + \sigma^2 \right), $$

$$ y^\top P y = \sum_{i=1}^{p} \frac{\sigma^2\, \tilde{y}_i^2}{\varsigma^2 \mu_i + \sigma^2}, $$

where $P = I_p - H A^{-1} H^\top$, $\sigma^2$ is the noise variance and $\varsigma^2$ is the a priori variance of the weights (see section 2). For example, if these expressions are substituted in equation (2.8) for the cost function associated with the marginal likelihood of the data, the two $p \ln \sigma^2$ terms cancel, leaving

$$ E(y) = p \ln \sigma^2 - \ln |P| + \frac{y^\top P y}{\sigma^2} = \sum_{i=1}^{p} \ln\left( \varsigma^2 \mu_i + \sigma^2 \right) + \sum_{i=1}^{p} \frac{\tilde{y}_i^2}{\varsigma^2 \mu_i + \sigma^2} . $$
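These identities are easy to check numerically. The following short Python/NumPy script (illustrative only, with arbitrary random data) verifies (B.3) and (B.6-B.8) against direct matrix computations.

```python
import numpy as np

rng = np.random.default_rng(0)
p, m, lam = 30, 10, 0.1
H = rng.normal(size=(p, m))
y = rng.normal(size=p)

mu, U = np.linalg.eigh(H @ H.T)          # eigensystem of H H^T
y_t = U.T @ y                            # projections of y onto the eigenvectors
A = H.T @ H + lam * np.eye(m)
A_inv = np.linalg.inv(A)
w_hat = A_inv @ H.T @ y
e_hat = y - H @ w_hat

# (B.3): gamma,  (B.6): eta,  (B.7): w^T A^{-1} w,  (B.8): e^T e
assert np.isclose(m - lam * np.trace(A_inv), np.sum(mu / (mu + lam)))
assert np.isclose(np.trace(A_inv) - lam * np.trace(A_inv @ A_inv),
                  np.sum(mu / (mu + lam) ** 2))
assert np.isclose(w_hat @ A_inv @ w_hat, np.sum(mu * y_t**2 / (mu + lam) ** 3))
assert np.isclose(e_hat @ e_hat, np.sum(lam**2 * y_t**2 / (mu + lam) ** 2))
print("eigensystem identities confirmed")
```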
References

[1] A.R. Barron and X. Xiao. Discussion of "Multivariate adaptive regression splines" by J.H. Friedman. Annals of Statistics, 19:67-82, 1991.

[2] C.M. Bishop, M. Svensen, and C.K.I. Williams. EM optimization of latent-variable density models. In D.S. Touretzky, M.C. Mozer, and M.E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 465-471. MIT Press, Cambridge, MA, 1996.

[3] C.M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.

[4] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.

[5] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society (B), 39(1):1-38, 1977.

[6] J.H. Friedman. Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19:1-141, 1991.

[7] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1-58, 1992.

[8] M. Kubat. Decision trees can initialize radial-basis function networks. IEEE Transactions on Neural Networks, 9(5):813-821, 1998.

[9] M. Kubat and I. Ivanova. Initialization of RBF networks with decision trees. In Proc. of the 5th Belgian-Dutch Conf. Machine Learning, BENELEARN'95, pages 61-70, 1995.

[10] D.J.C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415-447, 1992.

[11] D.J.C. MacKay. Comparison of approximate methods of handling hyperparameters. Accepted for publication by Neural Computation, 1999.

[12] J.E. Moody. The effective number of parameters: An analysis of generalisation and regularisation in nonlinear learning systems. In J.E. Moody, S.J. Hanson, and R.P. Lippmann, editors, Neural Information Processing Systems 4, pages 847-854. Morgan Kaufmann, San Mateo, CA, 1992.

[13] R.M. Neal and G.E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M.I. Jordan, editor, Learning in Graphical Models. Kluwer Academic Press, 1998.

[14] M.J.L. Orr. Local smoothing of radial basis function networks. In International Symposium on Artificial Neural Networks, Hsinchu, Taiwan, 1995.

[15] M.J.L. Orr. Regularisation in the selection of radial basis function centres. Neural Computation, 7(3):606-623, 1995.

[16] M.J.L. Orr. Introduction to radial basis function networks. Technical report, Institute for Adaptive and Neural Computation, Division of Informatics, Edinburgh University, 1996. www.anc.ed.ac.uk/~mjo/papers/intro.ps.

[17] M.J.L. Orr. Matlab routines for subset selection and ridge regression in linear neural networks. Technical report, Institute for Adaptive and Neural Computation, Division of Informatics, Edinburgh University, 1996. www.anc.ed.ac.uk/~mjo/software/rbf.zip.

[18] M.J.L. Orr. An EM algorithm for regularised radial basis function networks. In International Conference on Neural Networks and Brain, Beijing, China, October 1998.

[19] M.J.L. Orr. Matlab functions for radial basis function networks. Technical report, Institute for Adaptive and Neural Computation, Division of Informatics, Edinburgh University, 1999. Download from www.anc.ed.ac.uk/~mjo/software/rbf2.zip.

[20] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.

[21] C.E. Rasmussen, R.M. Neal, G.E. Hinton, D. van Camp, Z. Ghahramani, M. Revow, R. Kustra, and R. Tibshirani. The DELVE Manual, 1996. http://www.cs.utoronto.ca/~delve/.

[22] M.E. Tipping and C.M. Bishop. Mixtures of principal component analysers. Technical Report NCRG/97/003, Neural Computing Research Group, Aston University, UK, 1997.

[23] M.E. Tipping and C.M. Bishop. Probabilistic principal component analysis. Technical Report NCRG/97/010, Neural Computing Research Group, Aston University, UK, 1997.
