This document presents a lecture on radial basis function networks and forward selection heuristics for neural networks. It begins by outlining the topics to be covered: predicting the variance of the weights and of the output, selecting the regularization parameter, and forward selection algorithms. It then derives an expression for the variance of the weight vector w under normally distributed noise, shows how to compute the variance matrix and how to select the regularization parameter λ, discusses the effective number of parameters (how many dimensions are needed), and closes with an overview of forward selection algorithms.
18. Machine Learning: Radial Basis Function Networks - Forward Heuristics
1. Neural Networks
Radial Basis Function Networks - Forward Heuristics
Andres Mendez-Vazquez
December 10, 2015
2. Outline
1 Predicting Variance of w and the output d
   The Variance Matrix
   Selecting Regularization Parameter
2 How many dimensions?
   How many dimensions?
3 Forward Selection Algorithms
   Introduction
   Incremental Operations
   Complexity Comparison
   Adding Basis Function Under Regularization
   Removing an Old Basis Function under Regularization
   A Possible Forward Algorithm
4. What is the variance of the weight vector w?
The meaning
If the weights have been calculated on the basis of an estimate of a stochastic variable d: what is the corresponding uncertainty in the estimate of w?
Assume that the noise affecting d is normal and independently, identically distributed:
$$E_D\left[\left(d - \bar{d}\right)\left(d - \bar{d}\right)^T\right] = \sigma^2 I \quad (1)$$
where $\sigma$ is the standard deviation of the noise and $\bar{d}$ is the mean value of $d$.
Thus, $d \sim N\left(\bar{d}, \sigma^2 I\right)$.
8. Remember
We are using a linear model
$$f(x_i) = \sum_{j=1}^{d_1} w_j \phi_j(x_i) \quad (2)$$
Thus, solving the error under regularization gives
$$\hat{w} = \left(\Phi^T \Phi + \Lambda\right)^{-1} \Phi^T d \quad (3)$$
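For concreteness, here is a minimal numpy sketch of equations (2) and (3): it builds a Gaussian RBF design matrix Φ for an assumed toy 1-D data set and solves the regularized least-squares problem for the weights. The centers, the common width, and the choice Λ = λI are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data: N noisy samples of a smooth target function (assumed).
N = 50
x = np.linspace(-1.0, 1.0, N)
d = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(N)

# Gaussian RBF design matrix Phi (N x d1): phi_j(x) = exp(-(x - c_j)^2 / (2 s^2)).
centers = np.linspace(-1.0, 1.0, 10)   # d1 = 10 basis functions (assumed)
s = 0.3                                # common width (assumed)
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * s ** 2))

# Regularized solution of equation (3): w = (Phi^T Phi + Lambda)^{-1} Phi^T d,
# here with Lambda = lam * I (standard ridge regression).
lam = 1e-2
A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
w = np.linalg.solve(A, Phi.T @ d)

f = Phi @ w                            # fitted values f(x_i) of equation (2)
print("training sum-squared error:", np.sum((d - f) ** 2))
```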
10. Thus
Getting the expected value
$$\begin{aligned}
\bar{w} = E_D\left[\hat{w}\right] &= E_D\left[\left(\Phi^T\Phi + \Lambda\right)^{-1}\Phi^T d\right] \\
&= \left(\Phi^T\Phi + \Lambda\right)^{-1}\Phi^T E_D\left[d\right] \\
&= \left(\Phi^T\Phi + \Lambda\right)^{-1}\Phi^T \bar{d}
\end{aligned}$$
13. Thus, we have
The variance of w is
$$\begin{aligned}
W &= E_D\left[\left(\hat{w} - \bar{w}\right)\left(\hat{w} - \bar{w}\right)^T\right] \\
&= E_D\left[\left(\left(\Phi^T\Phi + \Lambda\right)^{-1}\Phi^T d - \left(\Phi^T\Phi + \Lambda\right)^{-1}\Phi^T \bar{d}\right)\left(\left(\Phi^T\Phi + \Lambda\right)^{-1}\Phi^T d - \left(\Phi^T\Phi + \Lambda\right)^{-1}\Phi^T \bar{d}\right)^T\right] \\
&= E_D\left[\left(\Phi^T\Phi + \Lambda\right)^{-1}\Phi^T\left(d - \bar{d}\right)\left(d - \bar{d}\right)^T\Phi\left(\Phi^T\Phi + \Lambda\right)^{-1}\right] \\
&= \left(\Phi^T\Phi + \Lambda\right)^{-1}\Phi^T E_D\left[\left(d - \bar{d}\right)\left(d - \bar{d}\right)^T\right]\Phi\left(\Phi^T\Phi + \Lambda\right)^{-1} \\
&= \left(\Phi^T\Phi + \Lambda\right)^{-1}\Phi^T \sigma^2 I\, \Phi\left(\Phi^T\Phi + \Lambda\right)^{-1} \\
&= \sigma^2\left(\Phi^T\Phi + \Lambda\right)^{-1}\Phi^T\Phi\left(\Phi^T\Phi + \Lambda\right)^{-1}
\end{aligned}$$
19. The Least Squared Error Case
We have
$$\Lambda = 0 \implies W = \sigma^2\left(\Phi^T\Phi\right)^{-1} \quad (4)$$
The following matrix is known as the variance matrix
$$A^{-1} = \left(\Phi^T\Phi + \Lambda\right)^{-1} \quad (5)$$
For standard ridge regression, where $\Phi^T\Phi = A - \lambda I_{d_1}$,
$$W = \sigma^2 A^{-1}\left[A - \lambda I_{d_1}\right]A^{-1} = \sigma^2\left(A^{-1} - \lambda A^{-2}\right)$$
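A quick numpy check of the variance expressions above, on an assumed toy design matrix (the centers, width, λ and σ² are illustrative): it forms W = σ²A⁻¹ΦᵀΦA⁻¹ directly and compares it with the ridge form σ²(A⁻¹ − λA⁻²).

```python
import numpy as np

# Toy Gaussian RBF design matrix (illustrative setup, assumed values).
N, d1 = 50, 10
x = np.linspace(-1.0, 1.0, N)
centers = np.linspace(-1.0, 1.0, d1)
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * 0.3 ** 2))

lam, sigma2 = 1e-2, 0.1 ** 2           # ridge parameter and noise variance (assumed)
A = Phi.T @ Phi + lam * np.eye(d1)
A_inv = np.linalg.inv(A)               # the "variance matrix" A^{-1} of equation (5)

# General form: W = sigma^2 A^{-1} Phi^T Phi A^{-1}.
W = sigma2 * A_inv @ Phi.T @ Phi @ A_inv

# Ridge special case (Phi^T Phi = A - lam I): W = sigma^2 (A^{-1} - lam A^{-2}).
W_ridge = sigma2 * (A_inv - lam * A_inv @ A_inv)

print("max |W - W_ridge| =", np.abs(W - W_ridge).max())   # ~ 0 up to rounding
```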
23. Outline: Selecting Regularization Parameter
24. How to select the regularization parameter λ?
We know that
$$f = \Phi\left(\Phi^T\Phi + \lambda I_{d_1}\right)^{-1}\Phi^T d \quad (6)$$
Thus, for the difference between d and f,
$$d - f = d - \Phi\left(\Phi^T\Phi + \lambda I_{d_1}\right)^{-1}\Phi^T d \quad (7)$$
which defines the projection matrix P:
$$d - f = \underbrace{\left[I_N - \Phi\left(\Phi^T\Phi + \lambda I_{d_1}\right)^{-1}\Phi^T\right]}_{P}\, d \quad (8)$$
27. We can use this to rewrite the cost function
Cost function
$$C(w, \lambda) = \sum_{i=1}^{N}\left(d_i - f(x_i)\right)^2 + \sum_{j=1}^{d_1}\lambda_j w_j^2 \quad (9)$$
At the optimal weight vector $\hat{w} = A^{-1}\Phi^T d$ we then have
$$\begin{aligned}
C(\hat{w}, \lambda) &= \left(\Phi\hat{w} - d\right)^T\left(\Phi\hat{w} - d\right) + \hat{w}^T\Lambda\hat{w} \\
&= d^T\left(\Phi A^{-1}\Phi^T - I_N\right)\left(\Phi A^{-1}\Phi^T - I_N\right)d + d^T\Phi A^{-1}\Lambda A^{-1}\Phi^T d
\end{aligned}$$
30. However
We have
$$\begin{aligned}
\Phi A^{-1}\Lambda A^{-1}\Phi^T &= \Phi A^{-1}\left(A - \Phi^T\Phi\right)A^{-1}\Phi^T \\
&= \Phi A^{-1}\Phi^T - \left(\Phi A^{-1}\Phi^T\right)^2 \\
&= P - P^2
\end{aligned}$$
Simplifying the minimum cost
$$C(\hat{w}, \lambda) = d^T P^2 d + d^T\left(P - P^2\right)d = d^T P d \quad (10)$$
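To see this identity numerically, the following sketch (with an assumed toy Φ, d and λ, and Λ = λI) evaluates the regularized cost at the optimal weights and compares it with dᵀPd, where P = I_N − ΦA⁻¹Φᵀ.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d1 = 40, 8
Phi = rng.standard_normal((N, d1))     # assumed toy design matrix
d = rng.standard_normal(N)             # assumed toy targets
lam = 0.5

A = Phi.T @ Phi + lam * np.eye(d1)
w = np.linalg.solve(A, Phi.T @ d)                  # optimal regularized weights
P = np.eye(N) - Phi @ np.linalg.solve(A, Phi.T)    # projection matrix

cost = np.sum((d - Phi @ w) ** 2) + lam * np.sum(w ** 2)   # C(w, lambda) at the optimum
print(cost, d @ P @ d)                 # the two numbers agree up to rounding
```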
34. In summary, we have for the Ridge Regression
Something Notable
$$A = \Phi^T\Phi + \lambda I_{d_1}, \qquad \hat{w} = A^{-1}\Phi^T d, \qquad P = I_N - \Phi A^{-1}\Phi^T$$
Important Observation
Some sort of model selection must be used to choose a value for the regularization parameter λ. The value chosen is the one associated with the lowest prediction error.
Question
Which method should be used to predict the error, and how is the optimal value found?
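Since these three quantities recur in everything that follows, it can help to compute them in one place; the helper below (ridge_quantities is a hypothetical name, not from the slides) is a minimal numpy sketch that also checks the residual identity d − f = Pd.

```python
import numpy as np

def ridge_quantities(Phi, d, lam):
    """Return (A_inv, w, P) for ridge regression with design matrix Phi,
    targets d and regularization parameter lam, following the summary above."""
    N, d1 = Phi.shape
    A = Phi.T @ Phi + lam * np.eye(d1)
    A_inv = np.linalg.inv(A)
    w = A_inv @ Phi.T @ d
    P = np.eye(N) - Phi @ A_inv @ Phi.T
    return A_inv, w, P

# Example with an assumed toy problem.
rng = np.random.default_rng(2)
Phi = rng.standard_normal((30, 6))
d = rng.standard_normal(30)
A_inv, w, P = ridge_quantities(Phi, d, lam=0.1)
print(np.allclose(d - Phi @ w, P @ d))   # residual identity d - f = P d
```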
43. Answer
Something Notable
The answer to the first question is that nobody knows for sure.
There are many methods that can be used to estimate that value:
Leave-one-out cross-validation.
Generalized cross-validation.
Final prediction error.
Bayesian information criterion.
Bootstrap methods.
49. We will use an iterative method
We have the following re-estimation formula from Generalized Cross-Validation:
$$\lambda = \frac{d^T P^2 d \; \operatorname{trace}\left(A^{-1} - \lambda A^{-2}\right)}{\hat{w}^T A^{-1}\hat{w} \; \operatorname{trace}\left(P\right)} \quad (11)$$
For the derivation, see Appendix A.10 of "Introduction to Radial Basis Function Networks" by Mark J. L. Orr.
The iterative process starts with an initial λ, and the value is updated until convergence.
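The re-estimation formula (11) can be turned into a short fixed-point loop; the sketch below uses an assumed toy Φ and d, an assumed initial λ and tolerance, and simply iterates until the update stabilizes (see Orr's Appendix A.10 for the derivation of the formula itself).

```python
import numpy as np

rng = np.random.default_rng(3)
N, d1 = 60, 12
Phi = rng.standard_normal((N, d1))     # assumed toy design matrix
d = Phi @ rng.standard_normal(d1) + 0.2 * rng.standard_normal(N)  # assumed targets

lam = 1.0                              # initial guess for lambda (assumed)
for _ in range(100):
    A_inv = np.linalg.inv(Phi.T @ Phi + lam * np.eye(d1))
    w = A_inv @ Phi.T @ d
    P = np.eye(N) - Phi @ A_inv @ Phi.T
    # Re-estimation formula (11) from generalized cross-validation.
    num = (d @ P @ P @ d) * np.trace(A_inv - lam * A_inv @ A_inv)
    den = (w @ A_inv @ w) * np.trace(P)
    lam_new = num / den
    if abs(lam_new - lam) < 1e-8 * max(lam, 1.0):
        break
    lam = lam_new

print("GCV estimate of lambda:", lam)
```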
52. Outline: How many dimensions?
53. How many dimensions for the mapping to high dimensions?
For ordinary least squares (no regularization) we have
$$A^{-1} = \left(\Phi^T\Phi\right)^{-1} \quad (12)$$
Now, suppose that
You are given a set of numbers $\{x_i\}_{i=1}^{N}$ randomly drawn from a Gaussian distribution, and you are asked to estimate the variance without being told the mean.
We can calculate the sample mean
$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i \quad (13)$$
56. Thus
This allows us to calculate the sample variance
$$\hat{\sigma}^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2 \quad (14)$$
Problem: where does the factor N − 1 come from?
It comes from the fact that the estimated parameter $\bar{x}$ has partly fitted the noise.
The sample has N degrees of freedom
Thus the underestimation of the variance is corrected by reducing the remaining degrees of freedom by one.
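A small simulation (illustrative, not from the slides) makes the point: dividing by N systematically underestimates the variance when the mean is estimated from the same sample, while dividing by N − 1 is unbiased on average.

```python
import numpy as np

rng = np.random.default_rng(4)
N, trials, true_var = 10, 100_000, 4.0

samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, N))
dev = samples - samples.mean(axis=1, keepdims=True)      # deviations from sample mean
biased = (dev ** 2).sum(axis=1) / N                      # divide by N
unbiased = (dev ** 2).sum(axis=1) / (N - 1)              # divide by N - 1

print(biased.mean(), unbiased.mean())   # ~3.6 vs ~4.0 for a true variance of 4.0
```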
59. In Supervised Learning
Similarly
It would be a mistake to divide the sum-squared training error by the number of patterns in order to estimate the noise variance, since some degrees of freedom will have been used up in fitting the model.
In our linear model there are d1 weights and N patterns in the training set
This leaves N − d1 degrees of freedom.
The estimate of the variance is then
$$\hat{\sigma}^2 = \frac{\hat{S}}{N - d_1} \quad (15)$$
Remark: $\hat{S}$ is the sum-squared error over the training set at the optimal weight vector, and $\hat{\sigma}^2$ is called the unbiased estimate of variance.
62. First, standard least squared error
Although there are still d1 weights in the model
The effective number of parameters (John Moody), γ, is less than d1, and it depends on the size of the regularization parameters.
We have the following (Moody and MacKay)
$$\gamma = N - \operatorname{trace}\left(P\right) \quad (16)$$
64. First, standard least squared error
In standard least squares without regularization, $A^{-1} = \left(\Phi^T\Phi\right)^{-1}$:
$$\begin{aligned}
\gamma &= N - \operatorname{trace}\left(I_N - \Phi A^{-1}\Phi^T\right) \\
&= \operatorname{trace}\left(\Phi A^{-1}\Phi^T\right) \\
&= \operatorname{trace}\left(A^{-1}\Phi^T\Phi\right) \\
&= \operatorname{trace}\left(I_{d_1}\right) \\
&= d_1
\end{aligned}$$
70. Now, with the regularization term
We have $A^{-1} = \left(\Phi^T\Phi + \lambda I_{d_1}\right)^{-1}$:
$$\begin{aligned}
\gamma &= \operatorname{trace}\left(A^{-1}\Phi^T\Phi\right) \\
&= \operatorname{trace}\left(A^{-1}\left(A - \lambda I_{d_1}\right)\right) \\
&= \operatorname{trace}\left(I_{d_1} - \lambda A^{-1}\right) \\
&= d_1 - \lambda\operatorname{trace}\left(A^{-1}\right)
\end{aligned}$$
74. Now
If the eigenvalues of the matrix $\Phi^T\Phi$ are $\{\mu_j\}_{j=1}^{d_1}$, then
$$\gamma = d_1 - \lambda\operatorname{trace}\left(A^{-1}\right) = d_1 - \lambda\sum_{j=1}^{d_1}\frac{1}{\lambda + \mu_j} = \sum_{j=1}^{d_1}\frac{\mu_j}{\lambda + \mu_j}$$
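The following numpy sketch checks the equivalent expressions for γ on an assumed toy design matrix: N − trace(P), d₁ − λ·trace(A⁻¹), and the eigenvalue form Σⱼ μⱼ/(λ + μⱼ) all coincide.

```python
import numpy as np

rng = np.random.default_rng(5)
N, d1, lam = 40, 8, 0.3
Phi = rng.standard_normal((N, d1))     # assumed toy design matrix

G = Phi.T @ Phi
A_inv = np.linalg.inv(G + lam * np.eye(d1))
P = np.eye(N) - Phi @ A_inv @ Phi.T

gamma_trace = N - np.trace(P)                     # gamma = N - trace(P)
gamma_ridge = d1 - lam * np.trace(A_inv)          # gamma = d1 - lam * trace(A^{-1})
mu = np.linalg.eigvalsh(G)                        # eigenvalues of Phi^T Phi
gamma_eigs = np.sum(mu / (lam + mu))              # eigenvalue form

print(gamma_trace, gamma_ridge, gamma_eigs)       # all three agree up to rounding
```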
77. Outline: Forward Selection Algorithms
78. About Ridge Regression
Remark
Ridge regression is used as a way to balance bias and variance by varying the effective number of parameters in a linear model.
An alternative strategy
It is to compare models made up of different subsets of basis functions drawn from the same fixed set of candidates.
This is called
Subset selection in statistics and machine learning.
81. Problem
This is normally intractable: when you have N candidate basis functions, there are
\[
2^N - 1 \quad (17)
\]
subsets to test.
We could use different methods
1 K-means, which is explained in the book.
2 Forward Selection heuristics, which we will explain here.
Forward Selection
It starts with an empty subset to which one basis function is added at a time:
The one that reduces the sum-squared error the most.
Until a chosen criterion, such as the GCV, stops decreasing.
26 / 58
87. Subset Selection Vs. Optimization
Classic Neural Network Optimization
It involves the optimization, by gradient descent, of a nonlinear
sum-squared-error surface in a high-dimensional space defined by the
network parameters.
Specifically, in an RBF network
The network parameters are the centers, sizes and hidden-to-output
weights.
27 / 58
89. Subset Selection Vs. Optimization
In Subset Selection
The heuristic searches in a discrete space of subsets of a set of hidden units with fixed centers and sizes while finding a subset with the lowest prediction error.
It uses a minimization criterion such as the GCV variance estimate:
\[
\hat{\sigma}^2_{GCV} = \frac{N\,\hat{d}^T P^2 \hat{d}}{\left(\operatorname{trace}(P)\right)^2} \quad (18)
\]
28 / 58
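As a rough illustration of equation (18), the following numpy sketch computes the GCV variance estimate from a design matrix Φ, a target vector d̂ and a single shared ridge parameter λ; the function name, the single-λ assumption and the explicit matrix inversion are simplifications of my own, not part of the original slides.

```python
import numpy as np

def gcv_variance(Phi, d, lam):
    """GCV variance estimate of equation (18):
    sigma^2_GCV = N * d^T P^2 d / trace(P)^2,
    with P = I_N - Phi (Phi^T Phi + lam I)^{-1} Phi^T."""
    N, d1 = Phi.shape
    A_inv = np.linalg.inv(Phi.T @ Phi + lam * np.eye(d1))
    P = np.eye(N) - Phi @ A_inv @ Phi.T   # projection matrix
    return N * (d @ P @ P @ d) / np.trace(P) ** 2
```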
92. In addition
Hidden-to-Output Weights
They are not selected, they are slaved to the centers and sizes of the
chosen subset.
Forward selection is a non-linear type of heuristic with the following
advantages
There is no need to fix the number of hidden units in advance.
The model selection criteria are tractable.
The computational requirements are relatively low.
29 / 58
96. Thus, under the classic least squared error
Something Notable
In forward selection each step involves growing the network by one basis function.
Therefore
Adding a new basis function is one of the incremental operations, using the equation
\[
P_{d_1+1} = P_{d_1} - \frac{P_{d_1}\phi_j\phi_j^T P_{d_1}}{\phi_j^T P_{d_1}\phi_j} \quad (19)
\]
30 / 58
98. Thus
Where
\(P_{d_1+1}\) is the succeeding projection matrix if the j-th member of the candidate set is added.
\(P_{d_1}\) is the projection matrix for the current \(d_1\) hidden units.
The vectors \(\{\phi_j\}_{j=1}^{N}\) are the column vectors of the matrix \(\Phi\), with \(N \geq d_1\).
31 / 58
101. Thus
We have that
\[
\Phi_N = \left[\phi_1\;\; \phi_2\;\; \ldots\;\; \phi_N\right] \quad (20)
\]
if we take into account all the possible centers given by all the basis functions.
32 / 58
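To illustrate equation (19) for the unregularized case, here is a small numpy sketch that updates P when one candidate column is appended, with a sanity check against the direct definition of the projection matrix; the function name and the random toy data are assumptions for illustration only.

```python
import numpy as np

def add_basis(P, phi_j):
    """Projection-matrix update of equation (19) when the candidate
    column phi_j is appended to the design matrix (no regularization)."""
    Pphi = P @ phi_j
    return P - np.outer(Pphi, Pphi) / (phi_j @ Pphi)

# Sanity check against the direct definition P = I - Phi (Phi^T Phi)^{-1} Phi^T
rng = np.random.default_rng(1)
Phi = rng.normal(size=(50, 4))
phi_new = rng.normal(size=50)
P_old = np.eye(50) - Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T
Phi_new = np.column_stack([Phi, phi_new])
P_new = np.eye(50) - Phi_new @ np.linalg.inv(Phi_new.T @ Phi_new) @ Phi_new.T
print(np.allclose(add_basis(P_old, phi_new), P_new))  # True
```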
102. Outline
1 Predicting Variance of w and the output d
The Variance Matrix
Selecting Regularization Parameter
2 How many dimensions?
How many dimensions?
3 Forward Selection Algorithms
Introduction
Incremental Operations
Complexity Comparison
Adding Basis Function Under Regularization
Removing an Old Basis Function under Regularization
A Possible Forward Algorithm
33 / 58
103. What are we going to do?
This is what we want to do
1 Adding a new basis function
2 Removing an old basis function
34 / 58
104. Given a matrix
Given a square matrix of size \(d_1\), we have the following
\[
B^{-1}B = I_{d_1}, \qquad BB^{-1} = I_{d_1}
\]
Inverse of a matrix with a small-rank adjustment
Suppose that a \(d_1 \times d_1\) matrix \(B_1\) is obtained by adding a small-rank adjustment \(XRY^T\) to the matrix \(B_0\),
\[
B_1 = B_0 + XRY^T \quad (21)
\]
Where
\(B_0^{-1} \in \mathbb{R}^{d_1\times d_1}\) is the known inverse, \(X, Y \in \mathbb{R}^{d_1\times r}\) are known with \(d_1 > r\), \(R \in \mathbb{R}^{r\times r}\), and the inverse of \(B_1\) is sought.
35 / 58
107. We can do the following
We have the following formula
\[
B_1^{-1} = B_0^{-1} - B_0^{-1}X\left(Y^T B_0^{-1}X + R^{-1}\right)^{-1}Y^T B_0^{-1} \quad (22)
\]
Something Notable
This is much more efficient because it only involves inverting the \(r \times r\) matrix \(Y^T B_0^{-1}X + R^{-1}\).
36 / 58
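The following numpy sketch checks formula (22) numerically against a direct inversion of B1; the matrix sizes, the random entries, and the diagonal shifts used to keep everything well conditioned are arbitrary choices of mine.

```python
import numpy as np

def small_rank_update_inv(B0_inv, X, R, Y):
    """Formula (22): inverse of B1 = B0 + X R Y^T from the known inverse B0^{-1}.
    Only the r x r matrix (Y^T B0^{-1} X + R^{-1}) is inverted."""
    middle = np.linalg.inv(Y.T @ B0_inv @ X + np.linalg.inv(R))
    return B0_inv - B0_inv @ X @ middle @ Y.T @ B0_inv

# Numerical check against inverting B1 directly
rng = np.random.default_rng(2)
d1, r = 6, 2
B0 = rng.normal(size=(d1, d1)) + d1 * np.eye(d1)  # kept away from singularity
X, Y = rng.normal(size=(d1, r)), rng.normal(size=(d1, r))
R = rng.normal(size=(r, r)) + r * np.eye(r)
B1 = B0 + X @ R @ Y.T
print(np.allclose(small_rank_update_inv(np.linalg.inv(B0), X, R, Y),
                  np.linalg.inv(B1)))             # True
```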
109. Thus, we can then partition the matrix B
We have the following partition
\[
B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} \quad (23)
\]
We have that
\[
B^{-1} = \begin{pmatrix}
\left(B_{11} - B_{12}B_{22}^{-1}B_{21}\right)^{-1} & B_{11}^{-1}B_{12}\left(B_{21}B_{11}^{-1}B_{12} - B_{22}\right)^{-1} \\
\left(B_{21}B_{11}^{-1}B_{12} - B_{22}\right)^{-1}B_{21}B_{11}^{-1} & \left(B_{22} - B_{21}B_{11}^{-1}B_{12}\right)^{-1}
\end{pmatrix} \quad (24)
\]
37 / 58
111. Finally, we get, using \(\Delta = B_{22} - B_{21}B_{11}^{-1}B_{12}\)
We have
\[
B^{-1} = \begin{pmatrix}
B_{11}^{-1} + B_{11}^{-1}B_{12}\Delta^{-1}B_{21}B_{11}^{-1} & -B_{11}^{-1}B_{12}\Delta^{-1} \\
-\Delta^{-1}B_{21}B_{11}^{-1} & \Delta^{-1}
\end{pmatrix} \quad (25)
\]
Using this equation we obtain the following improvement
Without it, every time we retrain the network we would need to:
Construct the new design matrix.
Multiply it with itself.
Add the regularizer (if there is one).
Take the inverse to obtain the variance matrix.
Recompute the projection matrix.
38 / 58
118. Outline
1 Predicting Variance of w and the output d
The Variance Matrix
Selecting Regularization Parameter
2 How many dimensions?
How many dimensions?
3 Forward Selection Algorithms
Introduction
Incremental Operations
Complexity Comparison
Adding Basis Function Under Regularization
Removing an Old Basis Function under Regularization
A Possible Forward Algorithm
39 / 58
119. Complexity of calculation of P
We have the following approximate number of multiplications

Operation             | Completely retrain       | Using the incremental operation
--------------------- | ------------------------ | -------------------------------
Add a new basis       | d1^3 + N d1^2 + N^2 d1   | N^2
Remove an old basis   | d1^3 + N d1^2 + N^2 d1   | N^2
Add a new pattern     | d1^3 + N d1^2 + N^2 d1   | 2 d1^2 + d1 N + N^2
Remove an old pattern | d1^3 + N d1^2 + N^2 d1   | 2 d1^2 + d1 N + N^2

40 / 58
120. Outline
1 Predicting Variance of w and the output d
The Variance Matrix
Selecting Regularization Parameter
2 How many dimensions?
How many dimensions?
3 Forward Selection Algorithms
Introduction
Incremental Operations
Complexity Comparison
Adding Basis Function Under Regularization
Removing an Old Basis Function under Regularization
A Possible Forward Algorithm
41 / 58
121. Adding a Basis Function
We do the following
If the j-th basis function is chosen, then \(\phi_j\) is appended as the last column of \(\Phi_{d_1}\) and renamed \(\phi_{d_1+1}\).
Thus, incrementing to the new matrix
\[
\Phi_{d_1+1} = \left[\Phi_{d_1}\;\; \phi_{d_1+1}\right] \quad (26)
\]
42 / 58
123. Where
We have that
\[
\phi_{d_1+1} = \begin{pmatrix}
\phi_{d_1+1}\left(x_1\right) \\
\phi_{d_1+1}\left(x_2\right) \\
\vdots \\
\phi_{d_1+1}\left(x_N\right)
\end{pmatrix} \quad (27)
\]
43 / 58
124. Using our variance matrix
We have the following variance matrix for the general case
\[
A_{d_1+1} = \Phi_{d_1+1}^T\Phi_{d_1+1} + \Lambda_{d_1+1} \quad (28)
\]
We have
\[
A_{d_1+1} = \Phi_{d_1+1}^T\Phi_{d_1+1} + \Lambda_{d_1+1} =
\begin{pmatrix} \Phi_{d_1}^T \\ \phi_{d_1+1}^T \end{pmatrix}
\begin{pmatrix} \Phi_{d_1} & \phi_{d_1+1} \end{pmatrix} +
\begin{pmatrix} \Lambda_{d_1} & 0 \\ 0^T & \lambda_{d_1+1} \end{pmatrix}
\]
44 / 58
132. We have then that
We can use the previous result for \(A_{d_1+1}^{-1}\)
\[
P_{d_1+1} = I_N - \Phi_{d_1+1}A_{d_1+1}^{-1}\Phi_{d_1+1}^T = P_{d_1} - \frac{P_{d_1}\phi_{d_1+1}\phi_{d_1+1}^T P_{d_1}}{\lambda_{d_1+1} + \phi_{d_1+1}^T P_{d_1}\phi_{d_1+1}}
\]
48 / 58
134. How do we select the new basis?
We can choose the one that gives the greatest decrease in the sum-squared error
\[
\hat{S}_{d_1} = \hat{y}^T P_{d_1}^2\,\hat{y} \quad (29)
\]
In addition, we have
\[
\hat{S}_{d_1+1} = \hat{y}^T P_{d_1+1}^2\,\hat{y} \quad (30)
\]
49 / 58
140. An alternative is to seek to maximize the decrease in the cost function
We have
\[
\hat{C}_{d_1} - \hat{C}_{d_1+1} = \hat{y}^T P_{d_1}\hat{y} - \hat{y}^T P_{d_1+1}\hat{y}
= \hat{y}^T\frac{P_{d_1}\phi_{d_1+1}\phi_{d_1+1}^T P_{d_1}}{\lambda_{d_1+1} + \phi_{d_1+1}^T P_{d_1}\phi_{d_1+1}}\hat{y}
= \frac{\left(\hat{y}^T P_{d_1}\phi_{d_1+1}\right)^2}{\lambda_{d_1+1} + \phi_{d_1+1}^T P_{d_1}\phi_{d_1+1}}
\]
51 / 58
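A greedy selection step based on this last expression could look like the following numpy sketch; the names candidates (the pool of unused columns) and lam (the regularization weight λ_{d1+1}) are hypothetical choices of my own.

```python
import numpy as np

def best_candidate(P, y, candidates, lam):
    """Greedy step: score each unused candidate column phi by the cost
    decrease (y^T P phi)^2 / (lam + phi^T P phi) and return the best one."""
    best_j, best_drop = None, -np.inf
    for j, phi in enumerate(candidates):
        Pphi = P @ phi
        drop = (y @ Pphi) ** 2 / (lam + phi @ Pphi)
        if drop > best_drop:
            best_j, best_drop = j, drop
    return best_j, best_drop
```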
141. Outline
1 Predicting Variance of w and the output d
The Variance Matrix
Selecting Regularization Parameter
2 How many dimensions?
How many dimensions?
3 Forward Selection Algorithms
Introduction
Incremental Operations
Complexity Comparison
Adding Basis Function Under Regularization
Removing an Old Basis Function under Regularization
A Possible Forward Algorithm
52 / 58
142. Removing an Old Basis Function under Regularization
Here, we can remove any column
Process:
1 Move the selected j-th column to the end (a permutation).
2 Apply our well-known equation with \(P_{d_1}\) in place of \(P_{d_1+1}\) and \(P_{d_1-1}\) in place of \(P_{d_1}\).
3 In addition, \(\phi_j\) in place of \(\phi_{d_1+1}\).
4 And \(\lambda_j\) in place of \(\lambda_{d_1+1}\).
Thus, we have
\[
P_{d_1} = P_{d_1-1} - \frac{P_{d_1-1}\phi_j\phi_j^T P_{d_1-1}}{\lambda_j + \phi_j^T P_{d_1-1}\phi_j} \quad (31)
\]
53 / 58
147. Thus
If \(\lambda_j \neq 0\)
We can first post- and then pre-multiply by \(\phi_j\) to obtain expressions for \(P_{d_1-1}\phi_j\) and \(\phi_j^T P_{d_1-1}\phi_j\) in terms of \(P_{d_1}\)
Thus, we have
\[
P_{d_1-1} = P_{d_1} + \frac{P_{d_1}\phi_j\phi_j^T P_{d_1}}{\lambda_j - \phi_j^T P_{d_1}\phi_j} \quad (32)
\]
However
For small \(\lambda_j\), the round-off error can be problematic!!!
54 / 58
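A direct numpy transcription of equation (32) might look as follows; the function name is mine and, as noted above, the update should be used with care when λ_j is small.

```python
import numpy as np

def remove_basis(P, phi_j, lam_j):
    """Projection-matrix downdate of equation (32) after removing the
    j-th basis function. Assumes lam_j != 0; for small lam_j the
    division is numerically delicate, as the slide warns."""
    Pphi = P @ phi_j
    return P + np.outer(Pphi, Pphi) / (lam_j - phi_j @ Pphi)
```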
150. Outline
1 Predicting Variance of w and the output d
The Variance Matrix
Selecting Regularization Parameter
2 How many dimensions?
How many dimensions?
3 Forward Selection Algorithms
Introduction
Incremental Operations
Complexity Comparison
Adding Basis Function Under Regularization
Removing an Old Basis Function under Regularization
A Possible Forward Algorithm
55 / 58
151. Based on the previous ideas
We are ready for a basic algorithm
However, this can be improved.
56 / 58
152. We have the following pseudocode
Forward-Regularization(D)
1 Select the initial \(d_1\) functions to be used as basis functions, based on the data D.
  This can be done randomly or using the clustering method described in Haykin.
2 Select an \(\epsilon > 0\) stopping criterion.
3 \(\hat{C}_{d_1} = \hat{y}^T P_{d_1}\hat{y}\)
4 Do
5   \(\hat{C}_{d_1} = \hat{C}_{d_1+1}\)
6   \(d_1 = d_1 + 1\)
7   Do
8     Select a new basis element and generate \(\phi_{d_1+1}\). Several strategies exist.
9     Generate \(A_{d_1+1}^{-1}\) and \(P_{d_1+1}\).
10    Calculate \(\hat{C}_{d_1+1}\).
11  Until \(\dfrac{\left(\hat{y}^T P_{d_1}\phi_{d_1+1}\right)^2}{\lambda_{d_1+1} + \phi_{d_1+1}^T P_{d_1}\phi_{d_1+1}} > 0\)
12 Until \(\left(\hat{C}_{d_1} - \hat{C}_{d_1+1}\right)^2 < \epsilon\)
57 / 58
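For concreteness, here is a self-contained Python sketch of such a forward-selection loop. It combines the incremental projection update and the cost-decrease score from the earlier slides, where Phi_all holds one column per candidate basis function (for example one radial basis function per training point, as in equation (20)). The function signature, the single shared λ, and the stopping rule based directly on the cost decrease are my own simplifications rather than the exact pseudocode above.

```python
import numpy as np

def forward_regularization(Phi_all, y, lam=1e-3, eps=1e-8, max_basis=None):
    """Greedy forward selection over the candidate columns of Phi_all.
    Starts from the empty model and repeatedly adds the column that most
    decreases the regularized cost C = y^T P y, using the incremental
    projection update instead of retraining from scratch."""
    N, n_cand = Phi_all.shape
    max_basis = max_basis if max_basis is not None else n_cand
    P = np.eye(N)                  # projection matrix of the empty model
    selected = []
    while len(selected) < max_basis:
        # Score every unused candidate by its cost decrease
        best_j, best_drop = None, 0.0
        for j in range(n_cand):
            if j in selected:
                continue
            phi = Phi_all[:, j]
            Pphi = P @ phi
            drop = (y @ Pphi) ** 2 / (lam + phi @ Pphi)
            if drop > best_drop:
                best_j, best_drop = j, drop
        if best_j is None or best_drop < eps:
            break                  # no candidate improves the cost enough
        # Incremental update of P (the regularized update from the earlier slides)
        phi = Phi_all[:, best_j]
        Pphi = P @ phi
        P = P - np.outer(Pphi, Pphi) / (lam + phi @ Pphi)
        selected.append(best_j)
    # The hidden-to-output weights are slaved to the chosen subset
    if not selected:
        return selected, np.zeros(0)
    Phi = Phi_all[:, selected]
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(len(selected)), Phi.T @ y)
    return selected, w
```

The weights are computed only once, at the end, because they are slaved to whatever subset the search settles on.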
166. For more on this
Please read the following
Introduction to Radial Basis Function Networks by Mark J. L. Orr
And there is much more
Look at the book Bootstrap Methods and their Application by A. C. Davison and D. V. Hinkley
58 / 58