This document discusses radial basis function networks. It begins by introducing the basic structure of RBF networks, which typically involve an input layer, a hidden layer that applies a nonlinear transformation using radial basis functions, and an output layer with a linear transformation. The document then discusses Cover's theorem, which states that pattern classification problems are more likely to be linearly separable when mapped to a higher-dimensional space through a nonlinear transformation. Several key concepts are introduced, including dichotomies, phi-separable functions, and using hidden functions to map patterns to a hidden feature space.
2. Outline
1 Introduction
Main Idea
Basic Radial-Basis Functions
2 Separability
Cover’s Theorem on the separability of patterns
Dichotomy
φ-separable functions
The Stochastic Experiment
The XOR Problem
Separating Capacity of a Surface
3 Interpolation Problem
What is gained?
Feedforward Network
Learning Process
Radial-Basis Functions (RBF)
4 Introduction
Description of the Problem
Well-posed or ill-posed
The Main Problem
5 Regularization Theory
Solving the issue
Bias-Variance Dilemma
Measuring the difference between optimal and learned
The Bias-Variance
How can we use this?
Getting a solution
We still need to talk about...
4. Introduction
Observation
The back-propagation algorithm for the design of a multilayer perceptron, as described in the previous chapter, may be viewed as the application of a recursive technique known in statistics as stochastic approximation.
Now
We take a completely different approach by viewing the design of a neural network as a curve-fitting (approximation) problem in a high-dimensional space.
Thus
Learning is equivalent to finding a surface in a multidimensional space that provides a best fit to the training data, under a statistical metric.
7. Thus
In the context of a neural network
The hidden units provide a set of "functions" that constitute a "basis" for the input patterns when they are expanded into the hidden space.
Name of these functions
Radial-Basis Functions.
9. History
These functions were first introduced
As the solution of the real multivariate interpolation problem.
Right now
It is one of the main fields of research in numerical analysis.
12. A Basic Structure
We have the following structure
1 Input Layer to connect with the environment.
2 Hidden Layer applying a non-linear transformation.
3 Output Layer applying a linear transformation.
Example
[Figure: input nodes feeding a layer of nonlinear (RBF) nodes, followed by a single linear output node.]
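The three-layer structure above can be sketched numerically. The following is a minimal illustration (not code from the slides), assuming Gaussian basis functions, evenly spaced centers, and a least-squares fit of the linear output layer; the width and number of units are arbitrary illustrative choices:

```python
import numpy as np

def gaussian_hidden_layer(X, centers, width):
    # Hidden layer: one Gaussian radial-basis unit per center.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-(d / width) ** 2)

# Toy 1-D regression: approximate y = sin(x) on [0, 2*pi].
X = np.linspace(0.0, 2.0 * np.pi, 40).reshape(-1, 1)
y = np.sin(X).ravel()

centers = np.linspace(0.0, 2.0 * np.pi, 10).reshape(-1, 1)  # hidden-unit centers
Phi = gaussian_hidden_layer(X, centers, width=1.0)          # nonlinear hidden layer
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)                 # linear output layer
pred = Phi @ w

print(float(np.max(np.abs(pred - y))))  # small training error
```

The choice of centers and the need for regularization are exactly the issues taken up later in the deck.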
16. Why the non-linear transformation?
The justification
In a paper by Cover (1965), it was shown that a pattern-classification problem mapped nonlinearly into a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.
Thus
A good reason to make the dimension of the hidden space in a Radial-Basis Function (RBF) network high.
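Cover's observation can be checked on the XOR problem mentioned in the outline. The sketch below (an illustration, using the classic choice of two Gaussian hidden functions centered at (0,0) and (1,1)) maps the four XOR patterns into a hidden space where a single linear threshold separates them:

```python
import numpy as np

# The four XOR patterns; XOR is famously not linearly separable in R^2.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = np.array([0, 1, 1, 0])

# Two Gaussian hidden functions centered at (0,0) and (1,1).
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
Phi = np.exp(-np.linalg.norm(X[:, None] - centers[None], axis=2) ** 2)

# In the hidden space the classes fall on opposite sides of a line:
# phi1 + phi2 is about 1.14 for XOR-false patterns and 0.74 for XOR-true ones.
score = Phi.sum(axis=1)
pred = (score < 0.95).astype(int)
print(pred)  # [0 1 1 0], matching the XOR labels
```

The same four points admit no separating line in the original input space, which is the point of the theorem.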
19. Cover’s Theorem
The Summarized Statement
A complex pattern-classification problem cast nonlinearly in a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.
Actually
It is quite a bit more complex...
21. Some facts
A fact
Once we know a set of patterns is linearly separable, the problem is easy to solve.
Consider
A family of surfaces that separate the space into two regions.
In addition
We have a set of patterns
H = {x1, x2, ..., xN} (1)
25. Dichotomy (Binary Partition)
Now
The pattern set is split into two classes H1 and H2.
Definition
A dichotomy (binary partition) of the points is said to be separable with respect to the family of surfaces if a surface exists in the family that separates the points in class H1 from those in class H2.
Define
For each pattern x ∈ H, we define a set of real-valued measurement functions {φ1(x), φ2(x), ..., φd1(x)}.
28. Thus
We define the following function (vector of measurements)
φ : H → R^d1 (2)
Defined as
φ(x) = (φ1(x), φ2(x), ..., φd1(x))^T (3)
Now
Suppose that the pattern x is a vector in a d0-dimensional input space.
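As a concrete instance of this construction, the vector of measurements φ can be assembled from any list of real-valued functions; the particular φi below are purely illustrative choices, not ones fixed by the text:

```python
import numpy as np

# A vector of measurement functions (phi_1, ..., phi_d1), here with d0 = 2
# and d1 = 4; each entry maps a pattern x in R^2 to a real number.
measurements = [
    lambda x: x[0],
    lambda x: x[1],
    lambda x: x[0] * x[1],
    lambda x: np.exp(-np.sum(x ** 2)),
]

def phi(x):
    # The map phi : H -> R^d1, stacking the measurements into one vector.
    return np.array([m(x) for m in measurements])

x = np.array([1.0, 2.0])      # a pattern in the d0 = 2 input space
print(phi(x))                 # its image in the d1 = 4 hidden space
```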
31. Then...
We have that the mapping φ(x)
maps points in the d0-dimensional input space into corresponding points in a new space of dimension d1.
Each of these functions φi(x)
is known as a hidden function, because it plays a role similar to that of a hidden unit in a feed-forward neural network.
Thus
The space spanned by the set of hidden functions {φi(x)}, i = 1, ..., d1, is called the hidden space or feature space.
35. φ-separable functions
Definition
A dichotomy {H1, H2} of H is said to be φ-separable if there exists a d1-dimensional vector w such that
1 w^T φ(x) > 0 if x ∈ H1.
2 w^T φ(x) < 0 if x ∈ H2.
Clearly, the separating hyperplane is defined by the equation
w^T φ(x) = 0 (4)
Now
The inverse image of this hyperplane,
Hyp^-1 = {x | w^T φ(x) = 0}, (5)
defines the separating surface in the input space.
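φ-separability can be tested constructively: the perceptron rule applied to the hidden-space images converges to such a vector w exactly when one exists. A sketch, reusing the XOR patterns with two Gaussian hidden functions plus a constant function φ0 = 1 so that the hyperplane w^T φ(x) = 0 need not pass through the origin (all illustrative choices):

```python
import numpy as np

# Hidden-space images of the four XOR patterns under two Gaussian hidden
# functions (centers (0,0) and (1,1)), plus a constant component phi0 = 1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
Phi = np.exp(-np.linalg.norm(X[:, None] - centers[None], axis=2) ** 2)
Phi = np.hstack([Phi, np.ones((4, 1))])
t = np.array([1, -1, -1, 1])  # +1 for patterns in H1, -1 for those in H2

# Perceptron rule: converges in finitely many updates iff the dichotomy
# is phi-separable.
w = np.zeros(3)
for _ in range(1000):
    mistakes = 0
    for phi_x, target in zip(Phi, t):
        if target * (w @ phi_x) <= 0:   # on the wrong side of the hyperplane
            w = w + target * phi_x
            mistakes += 1
    if mistakes == 0:                   # a full clean pass: w separates H1, H2
        break

separable = all(target * (w @ phi_x) > 0 for phi_x, target in zip(Phi, t))
print(separable)  # True: this dichotomy is phi-separable
```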
39. Now
Taking into consideration
A natural class of mappings is obtained by using a linear combination of r-wise products of the pattern vector coordinates.
They are called
The rth-order rational varieties.
A rational variety of order r in dimension d0 is described by
Σ_{0 ≤ i1 ≤ i2 ≤ ... ≤ ir ≤ d0} a_{i1 i2 ... ir} x_{i1} x_{i2} ... x_{ir} = 0 (6)
where xi is the ith coordinate of the input vector x and x0 is set to unity in order to express the equation in homogeneous form.
43. Homogeneous Functions
Definition
A function f(x) is said to be homogeneous of degree n if, introducing a
constant parameter λ and replacing the variable x with λx, we find:
f(λx) = λ^n f(x) (7)
20 / 96
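As a quick numerical check of Eq. 7 (a sketch, not part of the slides): a third-order monomial such as f(x) = x1 x2 x3 is homogeneous of degree 3, so scaling the input by λ scales the output by λ^3.

```python
def f(x):
    # A third-order monomial: homogeneous of degree n = 3 (Eq. 7).
    x1, x2, x3 = x
    return x1 * x2 * x3

lam, n = 2.0, 3
x = (1.0, 2.0, 3.0)
scaled = f(tuple(lam * xi for xi in x))
print(scaled, lam ** n * f(x))  # both equal 48.0
```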
44. Homogeneous Equation
Equation (Eq. 6)
An rth-order product of entries x_i of x, x_{i1} x_{i2} ... x_{ir}, is called a monomial.
Properties
For an input space of dimensionality d0, there are
(d0 choose r) = d0! / ((d0 − r)! r!) (8)
monomials in (Eq. 6).
21 / 96
46. Example of these surfaces
Hyperplanes (first-order rational varieties)
22 / 96
47. Example of these surfaces
Hyperplanes (first-order rational varieties)
23 / 96
48. Example of these surfaces
Quadrics (second-order rational varieties)
24 / 96
49. Example of these surfaces
Hyperspheres (quadrics with certain linear constraints on the
coefficients)
25 / 96
50. Outline
1 Introduction
Main Idea
Basic Radial-Basis Functions
2 Separability
Cover’s Theorem on the separability of patterns
Dichotomy
φ-separable functions
The Stochastic Experiment
The XOR Problem
Separating Capacity of a Surface
3 Interpolation Problem
What is gained?
Feedforward Network
Learning Process
Radial-Basis Functions (RBF)
4 Introduction
Description of the Problem
Well-posed or ill-posed
The Main Problem
5 Regularization Theory
Solving the issue
Bias-Variance Dilemma
Measuring the difference between optimal and learned
The Bias-Variance
How can we use this?
Getting a solution
We still need to talk about...
26 / 96
51. The Stochastic Experiment
Suppose
The activation patterns x1, x2, ..., xN are chosen independently.
Suppose
That all possible dichotomies of H = {x1, x2, ..., xN } are equiprobable.
Now, let P (N, d1) be the probability that a particular dichotomy
picked at random is φ-separable. Then
P (N, d1) = (1/2)^{N−1} Σ_{m=0}^{d1−1} C(N − 1, m) (9)
27 / 96
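Eq. 9 is easy to evaluate numerically. The sketch below (helper name is my own, not from the slides) shows that for a fixed number of patterns N, raising the hidden-space dimension d1 drives P(N, d1) toward one:

```python
from math import comb

def cover_probability(N, d1):
    """P(N, d1) of Eq. 9: probability that a random dichotomy of N
    patterns is phi-separable in a d1-dimensional hidden space."""
    return 0.5 ** (N - 1) * sum(comb(N - 1, m) for m in range(d1))

# For fixed N = 20, increasing the hidden dimension d1 drives P toward 1.
for d1 in (2, 5, 10, 20):
    print(d1, cover_probability(20, d1))
```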
54. What?
Basically, (Eq. 9) represents
The essence of Cover’s Separability Theorem.
Something Notable
It is a statement of the cumulative binomial distribution: the probability
that N − 1 flips of a fair coin produce d1 − 1 or fewer heads.
Specifically
The higher we make the dimension of the hidden space in the radial-basis
function network, the closer the probability P (N, d1) gets to one.
28 / 96
57. Final ingredients of Cover’s Theorem
First
Nonlinear formulation of the hidden functions defined by φi (x), where x is
the input vector and i = 1, 2, ..., d1.
Second
High dimensionality of the hidden space compared to the input space.
This dimensionality is determined by the value assigned to d1 (i.e.,
the number of hidden units).
Then
In general, a complex pattern-classification problem cast nonlinearly in a
high-dimensional space is more likely to be linearly separable
than in a low-dimensional space.
29 / 96
62. There is always an exception to every rule!!!
The XOR Problem
[Figure: the four XOR input patterns on the unit square, labeled Class 1 and Class 2 — not linearly separable in the original space]
31 / 96
63. Now
We define the following radial functions
φ1 (x) = exp(−‖x − t1‖²), where t1 = (1, 1)^T
φ2 (x) = exp(−‖x − t2‖²), where t2 = (0, 0)^T
Then
If we apply our classic mapping φ (x) = [φ1 (x) , φ2 (x)]:
Original → Mapping
(0, 1) → (0.3678, 0.3678)
(1, 0) → (0.3678, 0.3678)
(0, 0) → (0.1353, 1)
(1, 1) → (1, 0.1353)
32 / 96
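The mapped values can be reproduced in a few lines (a sketch; note that the tabulated values imply the second center is t2 = (0, 0)^T — with both centers at (1, 1)^T the two coordinates would be identical for every pattern):

```python
import math

# Gaussian hidden functions; centers inferred from the mapped values.
t1, t2 = (1.0, 1.0), (0.0, 0.0)

def phi(x, t):
    # phi(x) = exp(-||x - t||^2)
    return math.exp(-((x[0] - t[0]) ** 2 + (x[1] - t[1]) ** 2))

for x in [(0.0, 1.0), (1.0, 0.0), (0.0, 0.0), (1.0, 1.0)]:
    print(x, (round(phi(x, t1), 4), round(phi(x, t2), 4)))
```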
65. New Space
We have the following new φ1 − φ2 space
[Figure: the mapped patterns in the φ1–φ2 plane, where Class 1 and Class 2 become linearly separable]
33 / 96
67. Separating Capacity of a Surface
Something Notable
(Eq. 9) has an important bearing on the expected maximum number of
randomly assigned patterns that are linearly separable in a
multidimensional space.
Now, given our patterns {xi}, i = 1, ..., N
Let N be a random variable defined as the largest integer such that the
sequence is φ-separable.
We have that
Prob (N = n) = P (n, d1) − P (n + 1, d1) (10)
35 / 96
70. Separating Capacity of a Surface
Then
Prob (N = n) = (1/2)^n C(n − 1, d1 − 1), n = 0, 1, 2, ... (11)
Remark:
C(n, d1) = C(n − 1, d1 − 1) + C(n − 1, d1), 0 < d1 < n
To interpret this
Recall the negative binomial distribution:
It is a repeated sequence of Bernoulli trials
With k failures preceding the rth success.
36 / 96
73. Separating Capacity of a Surface
Thus, we have that
Given p and q the probabilities of success and failure, respectively, with
p + q = 1.
Definition
p (K = k|p, q) = C(r + k − 1, k) p^r q^k (12)
What happens when p = q = 1/2 and k + r = n?
Any idea?
37 / 96
76. Separating Capacity of a Surface
Thus
(Eq. 11) is just the negative binomial distribution shifted d1 units to the
right, with parameters d1 and 1/2.
Finally
N corresponds to the “waiting time” for the d1th failure in a sequence of
tosses of a fair coin.
We have then
E [N] = 2d1
Median [N] = 2d1
38 / 96
79. This allows to define the Corollary to Cover’s Theorem
A celebrated asymptotic result
The expected maximum number of randomly assigned patterns (vectors)
that are linearly separable in a space of dimensionality d1 is equal to 2d1 .
Something Notable
This result suggests that 2d1 is a natural definition of the separating
capacity of a family of decision surfaces having d1 degrees of freedom.
39 / 96
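The corollary can be checked numerically against Eq. 11 (a sketch; truncating the infinite sums at n = 500 is an assumption that leaves only a negligible tail):

```python
from math import comb

d1 = 4

def prob_N(n, d1):
    """Prob(N = n) from Eq. 11: (1/2)^n C(n-1, d1-1), a negative
    binomial distribution shifted d1 units to the right."""
    return 0.5 ** n * comb(n - 1, d1 - 1)

# The probabilities sum to 1 and the mean is 2 * d1, as the corollary states.
total = sum(prob_N(n, d1) for n in range(d1, 500))
mean = sum(n * prob_N(n, d1) for n in range(d1, 500))
print(round(total, 6), round(mean, 6))
```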
82. Given a problem of non-linearly separable patterns
It is possible to see that
There is a benefit to be gained by mapping the input space into a new
space of high enough dimension.
For this, we use a non-linear map
Quite similar to solving a difficult non-linear filtering problem by mapping
it to a high dimension and then solving it as a linear filtering problem.
41 / 96
85. Take in consideration the following architecture
Mapping from input space to hidden space, followed by a linear
mapping to output space!!!
[Figure: input nodes feeding nonlinear hidden nodes, followed by a single linear output node]
43 / 96
86. This can be seen as
We have the following map
s : R^{d0} → R (13)
Therefore
We may think of s as a hypersurface (graph) Γ ⊂ R^{d0+1}
44 / 96
88. Example
We have that the Red planes represent the mappings and the Gray is
the Linear Separator
45 / 96
90. General Idea
First
The training phase constitutes the optimization of a fitting procedure
for the surface Γ.
It is based on the known data points given as input-output patterns.
Second
The generalization phase is synonymous with interpolation between
the data points.
The interpolation is performed along the constrained surface
generated by the fitting procedure.
47 / 96
94. This leads to the theory of multi-variable interpolation
Interpolation Problem
Given a set of N different points {xi ∈ R^{d0} | i = 1, 2, ..., N} and a
corresponding set of N real numbers {di ∈ R | i = 1, 2, ..., N}, find a
function F : R^{d0} → R that satisfies the interpolation condition:
F (xi) = di, i = 1, 2, ..., N (14)
Remark
For strict interpolation as specified here, the interpolating surface is
constrained to pass through all the training data points.
48 / 96
97. Radial-Basis Functions (RBF)
The function F has the following form (Powell, 1988)
F (x) = Σ_{i=1}^{N} wi φ (‖x − xi‖) (15)
Where
{φ (‖x − xi‖) | i = 1, ..., N}
is a set of N arbitrary, generally non-linear, functions known as radial-basis
functions, and ‖·‖ denotes a norm that is usually Euclidean.
In addition
The known data points xi ∈ R^{d0}, i = 1, 2, ..., N, are taken to be the
centers of the radial basis functions.
50 / 96
100. A Set of Simultaneous Linear Equations
Given
φji = φ (‖xj − xi‖) , (j, i) = 1, 2, ..., N (16)
Using (Eq. 14) and (Eq. 15), we get the N × N linear system
| φ11 φ12 · · · φ1N | | w1 |   | d1 |
| φ21 φ22 · · · φ2N | | w2 | = | d2 |
|  ·   ·  · · ·  ·  | |  · |   |  · |
| φN1 φN2 · · · φNN | | wN |   | dN |
(17)
51 / 96
102. Now
We can create the following vectors
d = [d1, d2, ..., dN ]^T (Response vector).
w = [w1, w2, ..., wN ]^T (Linear weight vector).
Now, we define an N × N matrix called the interpolation matrix
Φ = {φji | (j, i) = 1, 2, ..., N} (18)
Thus, we have
Φw = d (19)
52 / 96
105. From here
Assuming that Φ is a non-singular matrix
w = Φ^{-1} d (20)
Question
How can we be sure that the interpolation matrix Φ is non-singular?
Answer
It turns out that for a large class of radial-basis functions, and under
certain conditions, non-singularity is guaranteed!!!
53 / 96
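Strict interpolation with a Gaussian radial-basis function can be sketched in a few lines (the toy data, kernel width, and variable names below are assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(8, 2))    # N = 8 training points in R^2
d = np.sin(X[:, 0]) + X[:, 1] ** 2     # target values d_i

# Interpolation matrix of Eq. 16/17: phi_ji = exp(-||x_j - x_i||^2).
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Phi = np.exp(-dist ** 2)

# Eq. 20: w = Phi^{-1} d, solved without forming the inverse explicitly.
w = np.linalg.solve(Phi, d)

# Strict interpolation: the fitted surface passes through every point.
print(np.allclose(Phi @ w, d))  # True
```

For distinct centers, the Gaussian interpolation matrix is one of the classes for which non-singularity is guaranteed, so the solve succeeds.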
109. Introduction
Observation
The strict interpolation procedure described may not be a good strategy
for the training of RBF networks for certain classes of tasks.
Reason
If the number of data points is much larger than the number of degrees of
freedom of the underlying physical process.
Thus
The network may end up fitting misleading variations due to idiosyncrasies
or noise in the input data.
55 / 96
113. Well-posed
The Problem
Assume that we have a domain X and a range Y, both metric spaces.
They are related by a mapping
f : X → Y (21)
Definition
The problem of reconstructing the mapping f is said to be well-posed if
three conditions are satisfied: Existence, Uniqueness and Continuity.
57 / 96
116. Defining the meaning of this
Existence
For every input vector x ∈ X, there exists an output y = f (x), where
y ∈ Y .
Uniqueness
For any pair of input vectors x, t ∈ X, we have f (x) = f (t) if and only if
x = t.
Continuity
The mapping is continuous if, for any ε > 0, there exists a δ > 0 such that
dX (x, t) < δ implies dY (f (x) , f (t)) < ε.
58 / 96
120. Ill-Posed
Therefore
If any of these conditions is not satisfied, the problem is said to be
ill-posed.
Basically
An ill-posed problem means that large data sets may contain a
surprisingly small amount of information about the desired solution.
60 / 96
124. We have the following
Physical Phenomena
Speech, pictures, radar signals, sonar signals, seismic data.
Generating such data is a well-posed problem
But learning from such data, i.e., rebuilding the hypersurface, can be an
ill-posed inverse problem.
63 / 96
126. Why
First
The existence criterion may be violated in that a distinct output may not
exist for every input
Second
There may not be as much information in the training sample as we really
need to reconstruct the input-output mapping uniquely.
Third
The unavoidable presence of noise or imprecision in real-life training data
adds uncertainty to the reconstructed input-output mapping.
64 / 96
130. How?
This can happen when
There is a lack of information!!!
Lanczos, 1964
“A lack of information cannot be remedied by any mathematical trickery.”
66 / 96
133. How do we solve the problem?
Something Notable
In 1963, Tikhonov proposed a new method called regularization for solving
ill-posed problems.
Tikhonov
He was a Soviet and Russian mathematician known for important
contributions to topology, functional analysis, mathematical physics, and
ill-posed problems.
68 / 96
135. Also Known as Ridge Regression
Setup
We have:
Input signals {x_i ∈ R^{d_0}}_{i=1}^{N}.
Output signals {d_i ∈ R}_{i=1}^{N}.
In addition
Note that the output is assumed to be one-dimensional.
69 / 96
137. Now, assuming that you have an approximation function y = F(x)
Standard Error Term
E_s(F) = \frac{1}{2}\sum_{i=1}^{N}(d_i - y_i)^2 = \frac{1}{2}\sum_{i=1}^{N}(d_i - F(x_i))^2    (22)
Regularization Term
E_c(F) = \frac{1}{2}\|DF\|^2    (23)
Where
D is a linear differential operator.
70 / 96
140. Now
Ordinarily y = F(x)
Normally, the function space representing the functional F is the L_2 space
that consists of all real-valued functions f(x) with x ∈ R^{d_0}.
The quantity to be minimized in regularization theory is
E(f) = \frac{1}{2}\sum_{i=1}^{N}(d_i - f(x_i))^2 + \frac{\lambda}{2}\|Df\|^2    (24)
Where
λ is a positive real number called the regularization parameter.
E(f) is called the Tikhonov functional.
71 / 96
145. Introduction
What did we see until now?
The design of learning machines from two main points of view:
Statistical Point of View
Linear Algebra and Optimization Point of View
Going back to the probability models
We might think of the machine to be learned as a function g(x|D)...
Something like curve fitting...
Under a data set
D = {(x_i, y_i) | i = 1, 2, ..., N}    (25)
Remark: where the x_i ∼ p(x|Θ)!!!
73 / 96
152. Thus, we have that
Two main functions
A function g(x|D) obtained using some algorithm!!!
E[y|x], the optimal regression...
Important
The key factor here is the dependence of the approximation on D.
Why?
The approximation may be very good for a specific training data set but
very bad for another.
This is the reason for studying fusion of information at the decision level...
74 / 96
157. How do we measure the difference?
We have that
Var(X) = E[(X - \mu)^2]
We can do that for our data
Var_D(g(x|D)) = E_D[(g(x|D) - E[y|x])^2]
Now, if we add and subtract
E_D[g(x|D)]    (26)
Remark: the expected output of the machine g(x|D).
75 / 96
161. Thus, we have that
Our original variance
Var_D(g(x|D)) = E_D[(g(x|D) - E[y|x])^2]
= E_D[(g(x|D) - E_D[g(x|D)] + E_D[g(x|D)] - E[y|x])^2]
= E_D[(g(x|D) - E_D[g(x|D)])^2] + 2E_D[(g(x|D) - E_D[g(x|D)])(E_D[g(x|D)] - E[y|x])] + (E_D[g(x|D)] - E[y|x])^2
Finally
E_D[(g(x|D) - E_D[g(x|D)])(E_D[g(x|D)] - E[y|x])] = ?    (27)
Note that the second factor is constant with respect to D, and E_D[g(x|D) - E_D[g(x|D)]] = 0, so this cross term vanishes.
76 / 96
165. We have the Bias-Variance
Our Final Equation
E_D[(g(x|D) - E[y|x])^2] = E_D[(g(x|D) - E_D[g(x|D)])^2] (VARIANCE) + (E_D[g(x|D)] - E[y|x])^2 (BIAS)
Where the variance
It represents the measure of the error between our machine g(x|D) and the
expected output of the machine under x_i ∼ p(x|Θ).
Where the bias
It represents the quadratic error between the expected output of the
machine under x_i ∼ p(x|Θ) and the expected output of the optimal
regression.
77 / 96
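The decomposition above can be checked numerically. The following sketch is illustrative and not from the slides: it takes sin(x) as the optimal regression E[y|x], fits a cubic polynomial as g(x|D) to many independently drawn data sets D, and confirms that the mean squared error splits exactly into variance plus squared bias at every evaluation point (names such as fit_and_predict are ad hoc).

```python
import numpy as np

rng = np.random.default_rng(0)

def true_regression(x):
    # The optimal regression E[y|x] in this toy setup
    return np.sin(x)

def fit_and_predict(x_train, y_train, x_eval, degree=3):
    # g(x|D): a polynomial least-squares fit to one data set D
    coeffs = np.polyfit(x_train, y_train, degree)
    return np.polyval(coeffs, x_eval)

x_eval = np.linspace(0.0, np.pi, 50)   # points where we evaluate the machine
preds = []
for _ in range(2000):                  # many independent data sets D
    x = rng.uniform(0.0, np.pi, 20)
    y = true_regression(x) + rng.normal(0.0, 0.3, 20)
    preds.append(fit_and_predict(x, y, x_eval))
preds = np.array(preds)                # shape (2000, 50)

mean_g   = preds.mean(axis=0)                                  # E_D[g(x|D)]
mse      = ((preds - true_regression(x_eval)) ** 2).mean(axis=0)
variance = ((preds - mean_g) ** 2).mean(axis=0)
bias_sq  = (mean_g - true_regression(x_eval)) ** 2

# Decomposition holds pointwise: MSE = variance + bias^2
assert np.allclose(mse, variance + bias_sq)
```

The identity holds up to floating-point roundoff because the cross term vanishes exactly once the expectation over D is replaced by the empirical mean over the simulated data sets.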
170. Using this in our favor!!!
Something Notable
Introducing bias is equivalent to restricting the range of functions a
model can represent.
Typically this is achieved by removing degrees of freedom.
Examples
Lowering the order of a polynomial or reducing the number of weights in
a neural network!!!
Ridge Regression
It does not explicitly remove degrees of freedom but instead reduces the
effective number of parameters.
79 / 96
174. Example
In the case of a linear regression model
C(w) = \sum_{i=1}^{N}(d_i - w^T x_i)^2 + \lambda\sum_{j=1}^{d_0} w_j^2    (28)
Thus
This is ridge regression (weight decay), and the regularization
parameter λ > 0 controls the balance between fitting the data and
avoiding the penalty.
A small value for λ means the data can be fit tightly without causing
a large penalty.
A large value for λ means a tight fit has to be sacrificed if it requires
large weights.
80 / 96
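Equation (28) has the closed-form minimizer w = (X^T X + λI)^{-1} X^T d. A minimal numpy sketch (toy data and the helper name ridge are assumptions, not from the slides) shows how increasing λ shrinks the weight vector:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d0 = 50, 5
X = rng.normal(size=(N, d0))                      # inputs x_i in R^{d0}
w_true = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
d = X @ w_true + rng.normal(0.0, 0.1, N)          # noisy targets d_i

def ridge(X, d, lam):
    # Minimizer of sum_i (d_i - w^T x_i)^2 + lam * sum_j w_j^2
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ d)

w_ols   = ridge(X, d, 0.0)     # lambda = 0: ordinary least squares
w_small = ridge(X, d, 1.0)
w_large = ridge(X, d, 100.0)

# Larger lambda trades fit quality for smaller weights
assert np.linalg.norm(w_large) < np.linalg.norm(w_small) < np.linalg.norm(w_ols)
```

This illustrates the trade-off stated on the slide: the norm of the solution is strictly decreasing in λ, so a large λ sacrifices tight fit to avoid large weights.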
178. Important
The Bias
It favors solutions involving small weights and the effect is to smooth the
output function.
81 / 96
180. Now, we can carry out the optimization
First, we rewrite the cost function in the following way
S(w) = \sum_{i=1}^{N}(d_i - f(x_i))^2    (29)
And we will use a generalized version for f
f(x_i) = \sum_{j=1}^{d_1} w_j \phi_j(x_i)    (30)
Where
The free variables are the weights {w_j}_{j=1}^{d_1}.
83 / 96
183. Where
For \phi_j(x_i), in our case, we may use the Gaussian function
\phi_j(x_i) = \phi(x_i, x_j)    (31)
With
\phi(x, x_j) = \exp\left(-\frac{1}{2\sigma^2}\|x - x_j\|^2\right)    (32)
84 / 96
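With the Gaussian basis of equation (32), the design matrix has entries Φ[i, j] = φ_j(x_i) = φ(x_i, c_j). A small numpy sketch (the toy data, the choice of centres as a subset of the inputs, and the name gaussian_rbf are illustrative assumptions):

```python
import numpy as np

def gaussian_rbf(x, c, sigma=1.0):
    # phi(x, c) = exp(-||x - c||^2 / (2 sigma^2)), eq. (32)
    return np.exp(-np.sum((x - c) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 2))        # N = 6 input points in R^2
centres = X[:3]                    # d1 = 3 centres, here a subset of the data

# Design matrix Phi with Phi[i, j] = phi_j(x_i) = phi(x_i, c_j)
Phi = np.array([[gaussian_rbf(x, c) for c in centres] for x in X])

assert Phi.shape == (6, 3)                      # N rows, d1 columns
assert np.allclose(Phi[:3].diagonal(), 1.0)     # phi(c_j, c_j) = exp(0) = 1
assert (Phi > 0).all() and (Phi <= 1).all()     # Gaussian values lie in (0, 1]
```

Each basis function responds most strongly near its own centre, which is why the diagonal of the first block is exactly one.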
185. Thus
Final cost function, assuming there is a regularization term per weight
C(w, \lambda) = \sum_{i=1}^{N}(d_i - f(x_i))^2 + \sum_{j=1}^{d_1} \lambda_j w_j^2    (33)
What do we do?
1 Differentiate the function with respect to the free variables.
2 Equate the results with zero.
3 Solve the resulting equations.
85 / 96
189. Differentiate the function with respect to the free variables.
First
\frac{\partial C(w, \lambda)}{\partial w_j} = -2\sum_{i=1}^{N}(d_i - f(x_i))\frac{\partial f(x_i)}{\partial w_j} + 2\lambda_j w_j    (34)
We get the differential \partial f(x_i)/\partial w_j
\frac{\partial f(x_i)}{\partial w_j} = \phi_j(x_i)    (35)
86 / 96
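The analytic gradient (34) and (35) can be verified against central finite differences. The sketch below uses a random design matrix and per-weight regularization parameters purely as illustrative test data:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d1 = 8, 4
Phi = rng.normal(size=(N, d1))      # Phi[i, j] = phi_j(x_i)
d = rng.normal(size=N)              # targets d_i
lam = rng.uniform(0.1, 1.0, d1)     # one regularization parameter per weight
w = rng.normal(size=d1)

def C(w):
    f = Phi @ w                                  # f(x_i) = sum_j w_j phi_j(x_i)
    return np.sum((d - f) ** 2) + np.sum(lam * w ** 2)

# Analytic gradient, eq. (34)-(35):
# dC/dw_j = -2 sum_i (d_i - f(x_i)) phi_j(x_i) + 2 lam_j w_j
grad = -2.0 * Phi.T @ (d - Phi @ w) + 2.0 * lam * w

# Independent check via central finite differences
eps = 1e-6
num = np.array([(C(w + eps * e) - C(w - eps * e)) / (2 * eps)
                for e in np.eye(d1)])

assert np.allclose(grad, num, atol=1e-4)
```

Agreement with the numerical gradient confirms the minus sign in (34): increasing w_j raises f(x_i) by φ_j(x_i), which decreases the residual d_i − f(x_i).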
191. Now
We have then
\sum_{i=1}^{N} f(x_i)\phi_j(x_i) + \lambda_j w_j = \sum_{i=1}^{N} d_i \phi_j(x_i)    (36)
Something Notable
There are d_1 such equations, for 1 ≤ j ≤ d_1, each representing one
constraint on the solution.
Since there are exactly as many constraints as there are unknowns, the
system has, except under certain pathological conditions, a unique
solution.
87 / 96
194. Using Our Linear Algebra
We have then
\phi_j^T f + \lambda_j w_j = \phi_j^T d    (37)
Where
\phi_j = (\phi_j(x_1), \phi_j(x_2), ..., \phi_j(x_N))^T,
f = (f(x_1), f(x_2), ..., f(x_N))^T,
d = (d_1, d_2, ..., d_N)^T    (38)
88 / 96
196. Now
Since there is one of these equations for each j, each relating one scalar
quantity to another, we can stack them
(\phi_1^T f, \phi_2^T f, ..., \phi_{d_1}^T f)^T + (\lambda_1 w_1, \lambda_2 w_2, ..., \lambda_{d_1} w_{d_1})^T = (\phi_1^T d, \phi_2^T d, ..., \phi_{d_1}^T d)^T    (39)
Now, if we define
\Phi = (\phi_1 \; \phi_2 \; ... \; \phi_{d_1})    (40)
Written in full form
\Phi = \begin{pmatrix} \phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_{d_1}(x_1) \\ \phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_{d_1}(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(x_N) & \phi_2(x_N) & \cdots & \phi_{d_1}(x_N) \end{pmatrix}    (41)
89 / 96
199. We can then
Define the following matrix equation
\Phi^T f + \Lambda w = \Phi^T d    (42)
Where
\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, ..., \lambda_{d_1})    (43)
90 / 96
201. Now, we have that
The vector f can be decomposed into the product of two terms: the design
matrix and the weight vector
We have then
f_i = f(x_i) = \sum_{j=1}^{d_1} w_j \phi_j(x_i) = \tilde{\phi}_i^T w    (44)
Where \tilde{\phi}_i is the i-th row of \Phi written as a column vector
\tilde{\phi}_i = (\phi_1(x_i), \phi_2(x_i), ..., \phi_{d_1}(x_i))^T    (45)
91 / 96
204. Furthermore
We get that
f = (f_1, f_2, ..., f_N)^T = (\tilde{\phi}_1^T w, \tilde{\phi}_2^T w, ..., \tilde{\phi}_N^T w)^T = \Phi w    (46)
Finally, we have that
\Phi^T d = \Phi^T f + \Lambda w = \Phi^T \Phi w + \Lambda w = (\Phi^T \Phi + \Lambda) w
92 / 96
206. Now...
We get finally
w = (\Phi^T \Phi + \Lambda)^{-1} \Phi^T d    (47)
Remember
This equation is the most general form of the normal equation.
We have two cases
In standard ridge regression, \lambda_j = \lambda for 1 ≤ j ≤ d_1.
Ordinary least squares, where there is no weight penalty, i.e., all
\lambda_j = 0 for 1 ≤ j ≤ d_1.
93 / 96
209. Thus, we have
First Case
w = (\Phi^T \Phi + \lambda I_{d_1})^{-1} \Phi^T d    (48)
Second Case
w = (\Phi^T \Phi)^{-1} \Phi^T d    (49)
94 / 96
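Both cases of the general normal equation (47)-(49) are one linear solve away. The sketch below (random design matrix and targets as illustrative data, helper name solve_weights assumed) computes the standard ridge solution and the ordinary least squares solution from the same routine:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d1 = 30, 5
Phi = rng.normal(size=(N, d1))     # design matrix, Phi[i, j] = phi_j(x_i)
d = rng.normal(size=N)             # target vector

def solve_weights(Phi, d, lam_diag):
    # w = (Phi^T Phi + Lambda)^{-1} Phi^T d, with Lambda = diag(lam_diag)
    return np.linalg.solve(Phi.T @ Phi + np.diag(lam_diag), Phi.T @ d)

w_ridge = solve_weights(Phi, d, np.full(d1, 0.5))   # lambda_j = lambda, eq. (48)
w_ols   = solve_weights(Phi, d, np.zeros(d1))       # lambda_j = 0,      eq. (49)

# The OLS solution satisfies the unregularized normal equation
assert np.allclose(Phi.T @ Phi @ w_ols, Phi.T @ d)
# Regularization shrinks the weight vector
assert np.linalg.norm(w_ridge) < np.linalg.norm(w_ols)
```

Note that np.linalg.solve is preferred over forming the explicit inverse: it is both cheaper and numerically better behaved when Φ^T Φ is ill-conditioned, which is precisely the pathological case the slides mention.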
212. There are still several things that we need to look at...
First
What is the variance of the weight vector? The Variance Matrix.
Second
The prediction of the output at any of the training set inputs: The
Projection Matrix.
Finally
The incremental algorithm for the problem!!!
96 / 96