Section 5: Radial Basis Function (RBF) Networks


  • Speaker note: Good morning. First of all, I would like to thank you for attending today's presentation. My name is Jeen-Shing Wang, and the topic of today's presentation is "On the Structure and Learning of Self-Adaptive Neuro-Fuzzy Inference Systems."

1. Section 5: Radial Basis Function (RBF) Networks
   Course: Introduction to Neural Networks
   Instructor: Jeen-Shing Wang
   Department of Electrical Engineering, National Cheng Kung University
   Fall 2005
2. Outline
   • Origin: Cover's theorem
   • Interpolation problem
   • Regularization theory
   • Generalized RBFN
     • Universal approximation
     • Comparison with MLP
     • RBFN = kernel regression
   • Learning
     • Centers, widths, and weights
   • Simulations
3. Origin: Cover's Theorem
   • A complex pattern-classification problem cast nonlinearly into a high-dimensional space is more likely to be linearly separable than in a low-dimensional space (Cover, 1965).
   Cover, T. M., 1965. "Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition." IEEE Transactions on Electronic Computers, EC-14, 326-334.
4. Cover's Theorem
   • Cover's theorem on the separability of patterns (1965)
   • N patterns x_1, x_2, …, x_N, with x_i ∈ ℝ^p, are assigned to two classes C_1 and C_2
   • φ-separability: the dichotomy {C_1, C_2} is φ-separable if there exists a vector w such that
       wᵀφ(x) > 0 for x ∈ C_1
       wᵀφ(x) < 0 for x ∈ C_2
     where φ(x) = [φ_1(x), φ_2(x), …, φ_M(x)]ᵀ
5. Cover's Theorem (cont'd)
   • Two basic ingredients of Cover's theorem:
     • Nonlinear functions φ(x)
     • Hidden-space dimension M greater than input-space dimension p → probability of separability closer to 1
   • Figure: (a) linearly separable, (b) spherically separable, (c) quadrically separable dichotomies
6. Interpolation Problem
   • Given N points (x_i, d_i), x_i ∈ ℝ^p, d_i ∈ ℝ, 1 ≤ i ≤ N:
   • Find F such that F(x_i) = d_i, 1 ≤ i ≤ N
   • Radial basis function (RBF) technique (Powell, 1988):
       F(x) = Σ_{i=1}^{N} w_i φ(‖x − x_i‖)
     • The φ(‖x − x_i‖) are arbitrary nonlinear functions
     • The number of functions equals the number of data points
     • The centers are fixed at the known points x_i
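The exact-interpolation recipe above can be sketched in a few lines of NumPy: build the N×N matrix of basis-function responses with one Gaussian per data point, then solve for the weights. The toy data, width, and basis choice are illustrative, not from the slides.

```python
import numpy as np

# Exact RBF interpolation: one Gaussian basis function per data point,
# centers fixed at the training inputs x_i (hypothetical 1-D toy data).
x = np.linspace(0.0, 1.0, 8)          # N distinct points -> Phi non-singular
d = np.sin(2 * np.pi * x)             # targets d_i

sigma = 0.2
Phi = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))  # N x N

w = np.linalg.solve(Phi, d)           # weights so that F(x_i) = d_i

F = Phi @ w                           # evaluate the interpolant at the x_i
assert np.allclose(F, d)              # exact interpolation at the data points
```

Because the points are distinct, Micchelli's theorem (next slide) guarantees that Phi is non-singular, so `solve` succeeds.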
7. Interpolation Problem (cont'd)
   • Matrix form:
       Φw = d
     where Φ = {φ_ji}, φ_ji = φ(‖x_j − x_i‖), w = [w_1, …, w_N]ᵀ, and d = [d_1, …, d_N]ᵀ
   • Vital question: is Φ non-singular?
8. Micchelli's Theorem
   • If the points x_i are distinct, Φ is non-singular (regardless of the dimension of the input space)
   • Valid for a large class of RBFs, e.g.:
     • Multiquadrics: φ(r) = (r² + c²)^{1/2}
     • Inverse multiquadrics: φ(r) = 1/(r² + c²)^{1/2}
     • Gaussian functions: φ(r) = exp(−r²/2σ²)
9. Learning: Ill- and Well-Posed Problems
   • Given a set of data points, learning is viewed as a hypersurface reconstruction or approximation problem, i.e., an inverse problem
     • Well-posed problem
       • A mapping from input to output exists for all input values
       • The mapping is unique
       • The mapping is continuous
     • Ill-posed problem
       • Noisy or imprecise data add uncertainty, so the mapping cannot be reconstructed uniquely
       • Not enough training data to reconstruct the mapping uniquely
         • Degraded generalization performance
       • Regularization is needed
10. Regularization Theory
    • The basic idea of regularization is to stabilize the solution by means of an auxiliary functional that embeds prior information, e.g., smoothness constraints on the input-output mapping (i.e., the solution to the approximation problem), and thereby turn an ill-posed problem into a well-posed one (Poggio and Girosi, 1990).
11. Solution to the Regularization Problem
    • Minimize the cost functional E(F) with respect to F:
        E(F) = (1/2) Σ_{i=1}^{N} [d_i − F(x_i)]² + (λ/2) ‖CF‖²
      (the first sum is the standard error term; the second is the regularizing term)
    • λ is the regularization parameter
12. Solution to the Regularization Problem (cont'd)
    • Poggio & Girosi (1990): if C is a (problem-dependent) linear differential operator, the solution to the regularization problem is
        F(x) = Σ_{i=1}^{N} w_i G(x, x_i),  with  (G + λI) w = d
      where G(x, x_i) is the Green's function associated with the operator C
13. Interpolation vs. Regularization
    • Interpolation
      • Exact interpolator: F(x) = Σ_{i=1}^{N} w_i φ(‖x − x_i‖), with w = Φ⁻¹d
      • Possible RBF: the Gaussian φ(r) = exp(−r²/2σ²)
    • Regularization
      • Exact interpolator: F(x) = Σ_{i=1}^{N} w_i G(x, x_i), with w = (G + λI)⁻¹d
      • Equal to the "interpolation" solution if λ = 0
      • Example of Green's function: the Gaussian G(x, x_i) = exp(−‖x − x_i‖²/2σ_i²)
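The λ contrast above can be sketched numerically: with a Gaussian Green's function, the interpolation weights solve Gw = d while the regularized weights solve (G + λI)w = d. The noisy toy data, width, and λ value are hypothetical.

```python
import numpy as np

# Interpolation vs. regularization with a Gaussian Green's function:
# lambda = 0 reproduces the data (noise included) exactly;
# lambda > 0 trades exact fit for a smoother, stabler solution.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 10)
d = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)   # noisy targets

sigma, lam = 0.1, 1e-2
G = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))

w_interp = np.linalg.solve(G, d)                   # lambda = 0: exact fit
w_reg = np.linalg.solve(G + lam * np.eye(10), d)   # lambda > 0: smoothed

assert np.allclose(G @ w_interp, d)                # interpolates the noise
assert np.abs(G @ w_reg - d).max() > 1e-4          # regularized fit does not
```

The residual of the regularized solution equals λ·w_reg, so it shrinks smoothly to zero as λ → 0, matching the "equal if λ = 0" remark above.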
14. Generalized RBF Network (GRBFN)
    • The regularization solution uses as many radial basis functions as training patterns:
      • Computationally intensive
      • Ill-conditioned matrix
      • Regularization is not easy (C is problem-dependent)
    • Possible solution → the generalized RBFN approach:
        F(x) = Σ_{i=1}^{M} w_i φ(‖x − c_i‖),  M < N
      with adjustable parameters: weights w_i, centers c_i, and widths σ_i
15. d-Dimensional Gaussian Distribution
    • Components independent of each other:
        p(x) = Π_{j=1}^{d} (1/(√(2π) σ_j)) exp(−(x_j − μ_j)²/2σ_j²)
    • General form:
        p(x) = (2π)^{−d/2} |Σ|^{−1/2} exp(−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ))
16. d-Dimensional Gaussian Distribution (cont'd)
    • Figure: a 2-dimensional Gaussian distribution
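The factorized form on slide 15 can be checked numerically: for a diagonal covariance, the d-dimensional density is the product of the 1-D densities. The evaluation point, mean, and widths below are arbitrary illustrative values.

```python
import numpy as np

# d-dimensional Gaussian with independent components: the joint density
# factors into a product of 1-D Gaussians (diagonal covariance case).
def gauss_nd(x, mu, sigma):
    """Density of N(mu, diag(sigma^2)) at x; all arrays of length d."""
    d = len(x)
    norm = (2 * np.pi) ** (d / 2) * np.prod(sigma)
    return np.exp(-0.5 * np.sum(((x - mu) / sigma) ** 2)) / norm

x = np.array([0.5, -0.2])
mu = np.zeros(2)
sigma = np.array([1.0, 2.0])

# Product of the two 1-D densities agrees with the joint density.
p1 = np.exp(-0.5 * (0.5 / 1.0) ** 2) / np.sqrt(2 * np.pi)
p2 = np.exp(-0.5 * (-0.2 / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))
assert np.isclose(gauss_nd(x, mu, sigma), p1 * p2)
```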
17. Radial Basis Function Networks
    • Figure: the RBF network architecture
18. RBFN: Universal Approximation
    • Park & Sandberg (1991): for any continuous input-output mapping function f(x), an RBFN with kernel K can approximate f to arbitrary accuracy
      • The theorem is even stronger (radial symmetry of K is not needed)
      • K is not specified
      • Provides a theoretical basis for practical RBFNs!
19. Kernel Regression
    • Consider the nonlinear regression model:
        y_i = f(x_i) + ε_i,  i = 1, 2, …, N
    • Recall that the minimum mean-square estimate is the conditional mean:
        F(x) = E[y | x]                                             (1)
    • From probability theory,
        E[y | x] = ∫ y f_{Y|X}(y | x) dy = ∫ y f_{X,Y}(x, y) dy / f_X(x)   (2)
    • By using (2) in (1),
        F(x) = ∫ y f_{X,Y}(x, y) dy / f_X(x)                        (3)
20. Kernel Regression (cont'd)
    • We do not know the densities f_X(x) and f_{X,Y}(x, y); they can be estimated by the Parzen-Rosenblatt density estimator:
        f̂_X(x) = (1/(N h^{m})) Σ_{i=1}^{N} K((x − x_i)/h)           (4)
        f̂_{X,Y}(x, y) = (1/(N h^{m+1})) Σ_{i=1}^{N} K((x − x_i)/h) K((y − y_i)/h)
      where h is the bandwidth and m the input dimension
21. Kernel Regression (cont'd)
    • Integrating over y, using ∫ K(u) du = 1 and the symmetry of K, we get:
        ∫ y f̂_{X,Y}(x, y) dy = (1/(N h^{m})) Σ_{i=1}^{N} y_i K((x − x_i)/h)   (5)
22. Kernel Regression (cont'd)
    • By using (4) and (5) as estimates of the denominator and numerator of (3):
        F̂(x) = Σ_{i=1}^{N} y_i K((x − x_i)/h) / Σ_{j=1}^{N} K((x − x_j)/h)   (6)
23. Nadaraya-Watson Regression Estimator
    • By defining the normalized weighting function
        W_i(x) = K((x − x_i)/h) / Σ_{j=1}^{N} K((x − x_j)/h),  with  Σ_{i=1}^{N} W_i(x) = 1,
    • we can rewrite (6) as:
        F̂(x) = Σ_{i=1}^{N} W_i(x) y_i
    • F(x): a weighted average of the y-observables
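Estimator (6) is short enough to implement directly. A minimal sketch with a Gaussian kernel on hypothetical 1-D toy data; the bandwidth h is an illustrative choice.

```python
import numpy as np

# Nadaraya-Watson estimator: F(x) is a kernel-weighted average of the
# observed y_i, with weights W_i(x) that sum to one.
def nadaraya_watson(x_query, x_train, y_train, h=0.1):
    K = np.exp(-((x_query - x_train) ** 2) / (2 * h ** 2))  # kernel values
    return np.sum(K * y_train) / np.sum(K)                  # weighted average

x_train = np.linspace(0, 1, 50)
y_train = x_train ** 2

f = nadaraya_watson(0.5, x_train, y_train)

# A weighted average with non-negative weights summing to one is a convex
# combination, so the estimate lies between min(y) and max(y).
assert y_train.min() <= f <= y_train.max()
```

Near x = 0.5 the estimate sits close to 0.5² = 0.25, smoothed slightly upward by the curvature of y = x² under the kernel window.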
24. Normalized RBF Network
    • Assume spherical symmetry of K(x); then:
        K((x − x_i)/h) = K(‖x − x_i‖/h)  for all i
    • The normalized radial basis function is defined as:
        ψ_i(x) = K(‖x − x_i‖/h) / Σ_{j=1}^{N} K(‖x − x_j‖/h)
25. Normalized RBF Network (cont'd)
    • Letting w_i = y_i for all i, we may rewrite (6) as:
        F(x) = Σ_{i=1}^{N} w_i ψ_i(x)
    • ψ_i(x) may be interpreted as the probability of an event x conditional on x_i
26. Multivariate Gaussian Distribution
    • If we take the kernel function to be the multivariate Gaussian distribution:
        K(x) = (2π)^{−m/2} exp(−‖x‖²/2)
    • then we can write:
        K((x − x_i)/h) = (2π)^{−m/2} exp(−‖x − x_i‖²/2h²)
27. Multivariate Gaussian Distribution (cont'd)
    • And the NRBF is:
        ψ_i(x) = exp(−‖x − x_i‖²/2σ²) / Σ_{j=1}^{N} exp(−‖x − x_j‖²/2σ²)
    • The centers of the RBF coincide with the data points
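The normalized basis functions above can be sketched directly: each Gaussian response is divided by the sum of all responses, so the ψ_i form a partition of unity. The centers and width below are hypothetical.

```python
import numpy as np

# Normalized RBF (NRBF) basis: Gaussian responses divided by their sum,
# so the basis functions sum to one at every input x.
def nrbf_basis(x, centers, sigma=0.3):
    g = np.exp(-((x - centers) ** 2) / (2 * sigma ** 2))
    return g / g.sum()          # normalized basis, acts like P(x_i | x)

centers = np.array([0.0, 0.5, 1.0])
psi = nrbf_basis(0.4, centers)

assert np.isclose(psi.sum(), 1.0)   # partition-of-unity property
assert psi.argmax() == 1            # nearest center (0.5) dominates
```

This normalization is what distinguishes the NRBF network from the plain RBFN of slide 6: far from all centers, the ψ_i still sum to one instead of all decaying to zero.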
28. RBFN vs. MLP
    • RBFN
      • Single hidden layer
      • Nonlinear hidden layer and linear output layer
      • Argument of hidden units: Euclidean norm
      • Universal approximation property
      • Local approximator
    • MLP
      • Single or multiple hidden layers
      • Nonlinear hidden layer(s) and linear or nonlinear output layer
      • Argument of hidden units: scalar product
      • Universal approximation property
      • Global approximator
29. Learning Strategies
    • Parameters to be determined: w_i, c_i, and σ_i
    • Traditional learning strategy: split the computation into three phases
      • Centers, c_i
      • Widths, σ_i
      • Weights, w_i
30. Computation of Centers
    • Vector quantization: the centers c_i should reflect the density of the training patterns x_i
      • Random selection from the training set
      • Competitive learning
      • Frequency-sensitive learning
      • Kohonen learning
    • This phase uses only the input information (x_i), not the output (d_i)
31. K-Means Clustering
    • k(x) = index of the best-matching (winning) center:
        k(x) = arg min_k ‖x(n) − t_k(n)‖,  k = 1, 2, …, M
    • Center update:
        t_k(n+1) = t_k(n) + η [x(n) − t_k(n)]  if k = k(x(n)),  else  t_k(n+1) = t_k(n)
    • M = number of clusters; t_k(n) = location of the k-th center at step n
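The batch form of this idea (Lloyd's algorithm) can be sketched in a few lines: assign each pattern to its winning center, then move each center to the mean of its cluster until nothing changes. The deterministic initialization and the two-blob toy data are illustrative choices.

```python
import numpy as np

# Batch k-means (Lloyd's algorithm) for placing RBF centers.
def kmeans(X, M, iters=100):
    # deterministic init: M points spread across the data set (illustrative)
    centers = X[np.linspace(0, len(X) - 1, M).astype(int)]
    for _ in range(iters):
        # index of the best-matching (winning) center for each pattern
        k = np.argmin(np.linalg.norm(X[:, None] - centers[None, :], axis=2),
                      axis=1)
        # move each center to the mean of its cluster (keep empty clusters)
        new = np.array([X[k == j].mean(axis=0) if np.any(k == j)
                        else centers[j] for j in range(M)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, k

# Two well-separated blobs; the recovered centers land near the blob means.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])
centers, labels = kmeans(X, 2)
assert np.allclose(np.sort(centers[:, 0]), [0.0, 5.0], atol=0.5)
```

As the slide notes, only the inputs X are used here; the targets d_i play no role in center placement.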
32. Computation of Widths
    • Universal approximation property: valid even with identical widths
    • In practice (limited training patterns), variable widths σ_i are used
    • One approach: use local clusters
      • Select σ_i according to the standard deviation of each cluster
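The cluster-standard-deviation heuristic above can be sketched as follows; the two synthetic clusters and their assignments are hypothetical stand-ins for the output of a previous clustering step.

```python
import numpy as np

# Width heuristic: set each sigma_i to the RMS distance between a cluster's
# patterns and its center, so wider clusters get wider basis functions.
def cluster_widths(X, centers, labels):
    widths = np.empty(len(centers))
    for j in range(len(centers)):
        pts = X[labels == j]
        widths[j] = np.sqrt(np.mean(np.sum((pts - centers[j]) ** 2, axis=1)))
    return widths

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, (40, 2)), rng.normal(3, 1.5, (40, 2))])
centers = np.array([[0.0, 0.0], [3.0, 3.0]])
labels = np.array([0] * 40 + [1] * 40)

w = cluster_widths(X, centers, labels)
assert w[0] < w[1]    # the more spread-out cluster gets the larger sigma
```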
33. Computation of Widths (cont'd)
    • Figure — red dotted line: estimated distribution; blue solid line: actual distribution
34. Computation of Weights (SVD)
    • With the centers and widths kept constant, the problem becomes linear!
    • Solve the least-squares criterion:
        min_w ‖Φw − d‖²,  Φ ∈ ℝ^{N×M}
    • In practice, use the SVD
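Once centers and widths are frozen, the weight computation is ordinary linear least squares; NumPy's `lstsq` solves it via the SVD. The toy target, centers, and width below are illustrative.

```python
import numpy as np

# Linear weight computation for a generalized RBFN: M fixed Gaussian
# centers (M << N), weights from an SVD-based least-squares solve.
x = np.linspace(0, 1, 40)
d = np.cos(2 * np.pi * x)

centers = np.linspace(0, 1, 6)                  # M = 6 centers, N = 40 data
sigma = 0.2
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

w, *_ = np.linalg.lstsq(Phi, d, rcond=None)     # SVD-based least squares
mse = np.mean((Phi @ w - d) ** 2)
assert mse < 0.05                               # close fit with few centers
```

Unlike the interpolation case (slide 6), Φ is rectangular here, so the SVD-based pseudo-inverse is the natural solver and also handles near-rank-deficient Φ gracefully.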
35. Computation of Weights (Gradient Descent)
    • Linear weights (output layer):
        w_i(n+1) = w_i(n) − η_w ∂E(n)/∂w_i(n)
    • Positions of centers (hidden layer):
        c_i(n+1) = c_i(n) − η_c ∂E(n)/∂c_i(n)
    • Widths of centers (hidden layer):
        σ_i(n+1) = σ_i(n) − η_σ ∂E(n)/∂σ_i(n)
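The three update rules can be sketched for a 1-D Gaussian RBFN. For E = ½Σₙeₙ², the gradients follow from the chain rule: ∂E/∂wᵢ = Σₙ eₙGᵢ(xₙ), ∂E/∂cᵢ = wᵢ Σₙ eₙGᵢ(xₙ)(xₙ−cᵢ)/σᵢ², and ∂E/∂σᵢ = wᵢ Σₙ eₙGᵢ(xₙ)(xₙ−cᵢ)²/σᵢ³. The learning rates and toy problem are hypothetical.

```python
import numpy as np

# Full-batch gradient descent on weights, centers, and widths of a
# 1-D Gaussian RBFN (a sketch; eta_* values are illustrative).
def rbf_forward(x, w, c, s):
    G = np.exp(-(x[:, None] - c[None, :]) ** 2 / (2 * s[None, :] ** 2))
    return G @ w, G

def gd_step(x, d, w, c, s, eta_w=0.005, eta_c=0.001, eta_s=0.001):
    y, G = rbf_forward(x, w, c, s)
    e = y - d                               # error at each training pattern
    diff = x[:, None] - c[None, :]          # (N, M) offsets to centers
    grad_w = G.T @ e                        # dE/dw_i = sum_n e_n G_i(x_n)
    grad_c = ((e[:, None] * G * diff) / s ** 2).sum(axis=0) * w
    grad_s = ((e[:, None] * G * diff ** 2) / s ** 3).sum(axis=0) * w
    return w - eta_w * grad_w, c - eta_c * grad_c, s - eta_s * grad_s

x = np.linspace(0, 1, 30)
d = np.sin(np.pi * x)
w, c, s = np.zeros(4), np.linspace(0, 1, 4), np.full(4, 0.3)

e0 = np.mean((rbf_forward(x, w, c, s)[0] - d) ** 2)
for _ in range(300):
    w, c, s = gd_step(x, d, w, c, s)
e1 = np.mean((rbf_forward(x, w, c, s)[0] - d) ** 2)
assert e1 < e0    # training error decreases
```

In practice the three learning rates are chosen separately (the slide's η_w, η_c, η_σ), since the weight problem is linear while centers and widths enter nonlinearly.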
36. Summary
    • Learning is finding the surface in multidimensional space that best fits the training data
    • Approximate the function with a linear combination of radial basis functions:
        F(x) = Σ_{i=1}^{N} w_i G(‖x − x_i‖)
    • G(‖x − x_i‖) is called a Green's function
      • It can be a uniform or a Gaussian function
    • When N = number of samples, we call it a regularization network
    • When N < number of samples, it is a (generalized) radial basis function network