This document discusses radial basis function networks. It begins by introducing the basic structure of RBF networks, which typically involve an input layer, a hidden layer that applies a nonlinear transformation using radial basis functions, and an output layer with a linear transformation. The document then discusses Cover's theorem, which states that pattern classification problems are more likely to be linearly separable when mapped to a higher-dimensional space through a nonlinear transformation. Several key concepts are introduced, including dichotomies, phi-separable functions, and using hidden functions to map patterns to a hidden feature space.
2. Outline
1 Introduction
Main Idea
Basic Radial-Basis Functions
2 Separability
Cover’s Theorem on the separability of patterns
Dichotomy
φ-separable functions
The Stochastic Experiment
The XOR Problem
Separating Capacity of a Surface
3 Interpolation Problem
What is gained?
Feedforward Network
Learning Process
Radial-Basis Functions (RBF)
4 Introduction
Description of the Problem
Well-posed or ill-posed
The Main Problem
5 Regularization Theory
Solving the issue
Bias-Variance Dilemma
Measuring the difference between optimal and learned
The Bias-Variance
How can we use this?
Getting a solution
We still need to talk about...
4. Introduction
Observation
The back-propagation algorithm for the design of a multilayer perceptron, as described in the previous chapter, may be viewed as the application of a recursive technique known in statistics as stochastic approximation.
Now
We take a completely different approach by viewing the design of a neural network as a curve-fitting (approximation) problem in a high-dimensional space.
Thus
Learning is equivalent to finding a surface in a multidimensional space that provides a best fit to the training data, under a statistical metric.
7. Thus
In the context of a neural network
The hidden units provide a set of "functions" that constitute a "basis" for the input patterns when they are expanded into the hidden space.
Name of these functions
Radial-Basis Functions.
9. History
These functions were first introduced
As the solution of the real multivariate interpolation problem.
Right now
It is one of the main fields of research in numerical analysis.
12. A Basic Structure
We have the following structure
1 Input Layer to connect with the environment.
2 Hidden Layer applying a non-linear transformation.
3 Output Layer applying a linear transformation.
Example
[Figure: input nodes feeding a layer of nonlinear (RBF) nodes, followed by a single linear output node.]
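The three-layer structure above can be sketched numerically. The following is a minimal illustration (not code from the slides), assuming Gaussian basis functions, evenly spaced centers, and a least-squares fit of the linear output layer; the width and number of units are arbitrary illustrative choices:

```python
import numpy as np

def gaussian_hidden_layer(X, centers, width):
    # Hidden layer: one Gaussian radial-basis unit per center.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-(d / width) ** 2)

# Toy 1-D regression: approximate y = sin(x) on [0, 2*pi].
X = np.linspace(0.0, 2.0 * np.pi, 40).reshape(-1, 1)
y = np.sin(X).ravel()

centers = np.linspace(0.0, 2.0 * np.pi, 10).reshape(-1, 1)  # hidden-unit centers
Phi = gaussian_hidden_layer(X, centers, width=1.0)          # nonlinear hidden layer
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)                 # linear output layer
pred = Phi @ w

print(float(np.max(np.abs(pred - y))))  # small training error
```

The choice of centers and the need for regularization are exactly the issues taken up later in the deck.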
16. Why the non-linear transformation?
The justification
In a paper by Cover (1965), it was shown that a pattern-classification problem mapped nonlinearly into a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.
Thus
A good reason to make the dimension of the hidden space in a Radial-Basis Function (RBF) network high.
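Cover's observation can be checked on the XOR problem mentioned in the outline. The sketch below (an illustration, using the classic choice of two Gaussian hidden functions centered at (0,0) and (1,1)) maps the four XOR patterns into a hidden space where a single linear threshold separates them:

```python
import numpy as np

# The four XOR patterns; XOR is famously not linearly separable in R^2.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = np.array([0, 1, 1, 0])

# Two Gaussian hidden functions centered at (0,0) and (1,1).
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
Phi = np.exp(-np.linalg.norm(X[:, None] - centers[None], axis=2) ** 2)

# In the hidden space the classes fall on opposite sides of a line:
# phi1 + phi2 is about 1.14 for XOR-false patterns and 0.74 for XOR-true ones.
score = Phi.sum(axis=1)
pred = (score < 0.95).astype(int)
print(pred)  # [0 1 1 0], matching the XOR labels
```

The same four points admit no separating line in the original input space, which is the point of the theorem.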
19. Cover’s Theorem
The Summarized Statement
A complex pattern-classification problem cast nonlinearly in a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.
Actually
It is quite a bit more complex...
21. Some facts
A fact
Once we know a set of patterns is linearly separable, the problem is easy to solve.
Consider
A family of surfaces that separate the space into two regions.
In addition
We have a set of patterns
H = {x1, x2, ..., xN} (1)
25. Dichotomy (Binary Partition)
Now
The pattern set is split into two classes H1 and H2.
Definition
A dichotomy (binary partition) of the points is said to be separable with respect to the family of surfaces if a surface exists in the family that separates the points in class H1 from those in class H2.
Define
For each pattern x ∈ H, we define a set of real-valued measurement functions {φ1(x), φ2(x), ..., φd1(x)}.
28. Thus
We define the following function (vector of measurements)
φ : H → R^d1 (2)
Defined as
φ(x) = (φ1(x), φ2(x), ..., φd1(x))^T (3)
Now
Suppose that the pattern x is a vector in a d0-dimensional input space.
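As a concrete instance of this construction, the vector of measurements φ can be assembled from any list of real-valued functions; the particular φi below are purely illustrative choices, not ones fixed by the text:

```python
import numpy as np

# A vector of measurement functions (phi_1, ..., phi_d1), here with d0 = 2
# and d1 = 4; each entry maps a pattern x in R^2 to a real number.
measurements = [
    lambda x: x[0],
    lambda x: x[1],
    lambda x: x[0] * x[1],
    lambda x: np.exp(-np.sum(x ** 2)),
]

def phi(x):
    # The map phi : H -> R^d1, stacking the measurements into one vector.
    return np.array([m(x) for m in measurements])

x = np.array([1.0, 2.0])      # a pattern in the d0 = 2 input space
print(phi(x))                 # its image in the d1 = 4 hidden space
```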
31. Then...
We have that the mapping φ(x)
maps points in the d0-dimensional input space into corresponding points in a new space of dimension d1.
Each of these functions φi(x)
is known as a hidden function, because it plays a role similar to that of a hidden unit in a feed-forward neural network.
Thus
The space spanned by the set of hidden functions {φi(x)}, i = 1, ..., d1, is called the hidden space or feature space.
35. φ-separable functions
Definition
A dichotomy {H1, H2} of H is said to be φ-separable if there exists a d1-dimensional vector w such that
1 w^T φ(x) > 0 if x ∈ H1.
2 w^T φ(x) < 0 if x ∈ H2.
Clearly, the separating hyperplane is defined by the equation
w^T φ(x) = 0 (4)
Now
The inverse image of this hyperplane,
Hyp^-1 = {x | w^T φ(x) = 0}, (5)
defines the separating surface in the input space.
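φ-separability can be tested constructively: the perceptron rule applied to the hidden-space images converges to such a vector w exactly when one exists. A sketch, reusing the XOR patterns with two Gaussian hidden functions plus a constant function φ0 = 1 so that the hyperplane w^T φ(x) = 0 need not pass through the origin (all illustrative choices):

```python
import numpy as np

# Hidden-space images of the four XOR patterns under two Gaussian hidden
# functions (centers (0,0) and (1,1)), plus a constant component phi0 = 1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
Phi = np.exp(-np.linalg.norm(X[:, None] - centers[None], axis=2) ** 2)
Phi = np.hstack([Phi, np.ones((4, 1))])
t = np.array([1, -1, -1, 1])  # +1 for patterns in H1, -1 for those in H2

# Perceptron rule: converges in finitely many updates iff the dichotomy
# is phi-separable.
w = np.zeros(3)
for _ in range(1000):
    mistakes = 0
    for phi_x, target in zip(Phi, t):
        if target * (w @ phi_x) <= 0:   # on the wrong side of the hyperplane
            w = w + target * phi_x
            mistakes += 1
    if mistakes == 0:                   # a full clean pass: w separates H1, H2
        break

separable = all(target * (w @ phi_x) > 0 for phi_x, target in zip(Phi, t))
print(separable)  # True: this dichotomy is phi-separable
```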
39. Now
Taking into consideration
A natural class of mappings is obtained by using a linear combination of r-wise products of the pattern vector coordinates.
They are called
The rth-order rational varieties.
A rational variety of order r in dimension d0 is described by
Σ_{0 ≤ i1 ≤ i2 ≤ ... ≤ ir ≤ d0} a_{i1 i2 ... ir} x_{i1} x_{i2} ... x_{ir} = 0 (6)
where xi is the ith coordinate of the input vector x and x0 is set to unity in order to express the equation in homogeneous form.
43. Homogeneous Functions
Definition
A function f(x) is said to be homogeneous of degree n if, introducing a
constant parameter λ and replacing the variable x with λx, we find:
f(λx) = λ^n f(x) (7)
20 / 96
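As a quick numerical check of Eq. 7 (a sketch, not part of the slides): a third-order monomial such as f(x) = x1 x2 x3 is homogeneous of degree 3, so scaling the input by λ scales the output by λ^3.

```python
def f(x):
    # A third-order monomial: homogeneous of degree n = 3 (Eq. 7).
    x1, x2, x3 = x
    return x1 * x2 * x3

lam, n = 2.0, 3
x = (1.0, 2.0, 3.0)
scaled = f(tuple(lam * xi for xi in x))
print(scaled, lam ** n * f(x))  # both equal 48.0
```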
44. Homogeneous Equation
Equation (Eq. 6)
An rth-order product of entries x_i of x, x_{i1} x_{i2} ... x_{ir}, is called a monomial.
Properties
For an input space of dimensionality d0, there are
(d0 choose r) = d0! / ((d0 − r)! r!) (8)
monomials in (Eq. 6).
21 / 96
46. Example of these surfaces
Hyperplanes (first-order rational varieties)
22 / 96
47. Example of these surfaces
Hyperplanes (first-order rational varieties)
23 / 96
48. Example of these surfaces
Quadrics (second-order rational varieties)
24 / 96
49. Example of these surfaces
Hyperspheres (quadrics with certain linear constraints on the
coefficients)
25 / 96
50. Outline
1 Introduction
Main Idea
Basic Radial-Basis Functions
2 Separability
Cover’s Theorem on the separability of patterns
Dichotomy
φ-separable functions
The Stochastic Experiment
The XOR Problem
Separating Capacity of a Surface
3 Interpolation Problem
What is gained?
Feedforward Network
Learning Process
Radial-Basis Functions (RBF)
4 Introduction
Description of the Problem
Well-posed or ill-posed
The Main Problem
5 Regularization Theory
Solving the issue
Bias-Variance Dilemma
Measuring the difference between optimal and learned
The Bias-Variance
How can we use this?
Getting a solution
We still need to talk about...
26 / 96
51. The Stochastic Experiment
Suppose
The activation patterns x1, x2, ..., xN are chosen independently.
Suppose
That all possible dichotomies of H = {x1, x2, ..., xN } are equiprobable.
Now, let P (N, d1) be the probability that a particular dichotomy
picked at random is φ-separable. Then
P (N, d1) = (1/2)^{N−1} Σ_{m=0}^{d1−1} C(N − 1, m) (9)
27 / 96
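Eq. 9 is easy to evaluate numerically. The sketch below (helper name is my own, not from the slides) shows that for a fixed number of patterns N, raising the hidden-space dimension d1 drives P(N, d1) toward one:

```python
from math import comb

def cover_probability(N, d1):
    """P(N, d1) of Eq. 9: probability that a random dichotomy of N
    patterns is phi-separable in a d1-dimensional hidden space."""
    return 0.5 ** (N - 1) * sum(comb(N - 1, m) for m in range(d1))

# For fixed N = 20, increasing the hidden dimension d1 drives P toward 1.
for d1 in (2, 5, 10, 20):
    print(d1, cover_probability(20, d1))
```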
54. What?
Basically, (Eq. 9) represents
The essence of Cover’s Separability Theorem.
Something Notable
It is a statement of the cumulative binomial distribution: the probability
that N − 1 flips of a fair coin produce d1 − 1 or fewer heads.
Specifically
The higher we make the dimension of the hidden space in the radial-basis
function network, the closer the probability P (N, d1) gets to one.
28 / 96
57. Final ingredients of Cover’s Theorem
First
Nonlinear formulation of the hidden functions defined by φi (x), where x is
the input vector and i = 1, 2, ..., d1.
Second
High dimensionality of the hidden space compared to the input space.
This dimensionality is determined by the value assigned to d1 (i.e.,
the number of hidden units).
Then
In general, a complex pattern-classification problem cast nonlinearly in a
high-dimensional space is more likely to be linearly separable
than in a low-dimensional space.
29 / 96
62. There is always an exception to every rule!!!
The XOR Problem
[Figure: the four XOR input patterns on the unit square, labeled Class 1 and Class 2 — not linearly separable in the original space]
31 / 96
63. Now
We define the following radial functions
φ1 (x) = exp(−‖x − t1‖²), where t1 = (1, 1)^T
φ2 (x) = exp(−‖x − t2‖²), where t2 = (0, 0)^T
Then
If we apply our classic mapping φ (x) = [φ1 (x) , φ2 (x)]:
Original → Mapping
(0, 1) → (0.3678, 0.3678)
(1, 0) → (0.3678, 0.3678)
(0, 0) → (0.1353, 1)
(1, 1) → (1, 0.1353)
32 / 96
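The mapped values can be reproduced in a few lines (a sketch; note that the tabulated values imply the second center is t2 = (0, 0)^T — with both centers at (1, 1)^T the two coordinates would be identical for every pattern):

```python
import math

# Gaussian hidden functions; centers inferred from the mapped values.
t1, t2 = (1.0, 1.0), (0.0, 0.0)

def phi(x, t):
    # phi(x) = exp(-||x - t||^2)
    return math.exp(-((x[0] - t[0]) ** 2 + (x[1] - t[1]) ** 2))

for x in [(0.0, 1.0), (1.0, 0.0), (0.0, 0.0), (1.0, 1.0)]:
    print(x, (round(phi(x, t1), 4), round(phi(x, t2), 4)))
```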
65. New Space
We have the following new φ1 − φ2 space
[Figure: the mapped patterns in the φ1–φ2 plane, where Class 1 and Class 2 become linearly separable]
33 / 96
67. Separating Capacity of a Surface
Something Notable
(Eq. 9) has an important bearing on the expected maximum number of
randomly assigned patterns that are linearly separable in a
multidimensional space.
Now, given our patterns {xi}, i = 1, ..., N
Let N be a random variable defined as the largest integer such that the
sequence is φ-separable.
We have that
Prob (N = n) = P (n, d1) − P (n + 1, d1) (10)
35 / 96
70. Separating Capacity of a Surface
Then
Prob (N = n) = (1/2)^n C(n − 1, d1 − 1), n = 0, 1, 2, ... (11)
Remark:
C(n, d1) = C(n − 1, d1 − 1) + C(n − 1, d1), 0 < d1 < n
To interpret this
Recall the negative binomial distribution:
It is a repeated sequence of Bernoulli trials
With k failures preceding the rth success.
36 / 96
73. Separating Capacity of a Surface
Thus, we have that
Given p and q the probabilities of success and failure, respectively, with
p + q = 1.
Definition
p (K = k|p, q) = C(r + k − 1, k) p^r q^k (12)
What happens when p = q = 1/2 and k + r = n?
Any idea?
37 / 96
76. Separating Capacity of a Surface
Thus
(Eq. 11) is just the negative binomial distribution shifted d1 units to the
right, with parameters d1 and 1/2.
Finally
N corresponds to the “waiting time” for the d1th failure in a sequence of
tosses of a fair coin.
We have then
E [N] = 2d1
Median [N] = 2d1
38 / 96
79. This allows to define the Corollary to Cover’s Theorem
A celebrated asymptotic result
The expected maximum number of randomly assigned patterns (vectors)
that are linearly separable in a space of dimensionality d1 is equal to 2d1 .
Something Notable
This result suggests that 2d1 is a natural definition of the separating
capacity of a family of decision surfaces having d1 degrees of freedom.
39 / 96
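The corollary can be checked numerically against Eq. 11 (a sketch; truncating the infinite sums at n = 500 is an assumption that leaves only a negligible tail):

```python
from math import comb

d1 = 4

def prob_N(n, d1):
    """Prob(N = n) from Eq. 11: (1/2)^n C(n-1, d1-1), a negative
    binomial distribution shifted d1 units to the right."""
    return 0.5 ** n * comb(n - 1, d1 - 1)

# The probabilities sum to 1 and the mean is 2 * d1, as the corollary states.
total = sum(prob_N(n, d1) for n in range(d1, 500))
mean = sum(n * prob_N(n, d1) for n in range(d1, 500))
print(round(total, 6), round(mean, 6))
```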
82. Given a problem of non-linearly separable patterns
It is possible to see that
There is a benefit to be gained by mapping the input space into a new
space of high enough dimension.
For this, we use a non-linear map
Quite similar to solving a difficult non-linear filtering problem by mapping
it to a high dimension and then solving it as a linear filtering problem.
41 / 96
85. Take in consideration the following architecture
Mapping from input space to hidden space, followed by a linear
mapping to output space!!!
[Figure: input nodes feeding nonlinear hidden nodes, followed by a single linear output node]
43 / 96
86. This can be seen as
We have the following map
s : R^{d0} → R (13)
Therefore
We may think of s as a hypersurface (graph) Γ ⊂ R^{d0+1}
44 / 96
88. Example
We have that the Red planes represent the mappings and the Gray is
the Linear Separator
45 / 96
90. General Idea
First
The training phase constitutes the optimization of a fitting procedure
for the surface Γ.
It is based on the known data points given as input-output patterns.
Second
The generalization phase is synonymous with interpolation between
the data points.
The interpolation is performed along the constrained surface
generated by the fitting procedure.
47 / 96
94. This leads to the theory of multi-variable interpolation
Interpolation Problem
Given a set of N different points {xi ∈ R^{d0} | i = 1, 2, ..., N} and a
corresponding set of N real numbers {di ∈ R | i = 1, 2, ..., N}, find a
function F : R^{d0} → R that satisfies the interpolation condition:
F (xi) = di, i = 1, 2, ..., N (14)
Remark
For strict interpolation as specified here, the interpolating surface is
constrained to pass through all the training data points.
48 / 96
97. Radial-Basis Functions (RBF)
The function F has the following form (Powell, 1988)
F (x) = Σ_{i=1}^{N} wi φ (‖x − xi‖) (15)
Where
{φ (‖x − xi‖) | i = 1, ..., N}
is a set of N arbitrary, generally non-linear, functions known as radial-basis
functions, and ‖·‖ denotes a norm that is usually Euclidean.
In addition
The known data points xi ∈ R^{d0}, i = 1, 2, ..., N, are taken to be the
centers of the radial basis functions.
50 / 96
100. A Set of Simultaneous Linear Equations
Given
φji = φ (‖xj − xi‖) , (j, i) = 1, 2, ..., N (16)
Using (Eq. 14) and (Eq. 15), we get the N × N linear system
| φ11 φ12 · · · φ1N | | w1 |   | d1 |
| φ21 φ22 · · · φ2N | | w2 | = | d2 |
|  ·   ·  · · ·  ·  | |  · |   |  · |
| φN1 φN2 · · · φNN | | wN |   | dN |
(17)
51 / 96
102. Now
We can create the following vectors
d = [d1, d2, ..., dN ]^T (Response vector).
w = [w1, w2, ..., wN ]^T (Linear weight vector).
Now, we define an N × N matrix called the interpolation matrix
Φ = {φji | (j, i) = 1, 2, ..., N} (18)
Thus, we have
Φw = d (19)
52 / 96
105. From here
Assuming that Φ is a non-singular matrix
w = Φ^{-1} d (20)
Question
How can we be sure that the interpolation matrix Φ is non-singular?
Answer
It turns out that for a large class of radial-basis functions, and under
certain conditions, non-singularity is guaranteed!!!
53 / 96
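Strict interpolation with a Gaussian radial-basis function can be sketched in a few lines (the toy data, kernel width, and variable names below are assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(8, 2))    # N = 8 training points in R^2
d = np.sin(X[:, 0]) + X[:, 1] ** 2     # target values d_i

# Interpolation matrix of Eq. 16/17: phi_ji = exp(-||x_j - x_i||^2).
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Phi = np.exp(-dist ** 2)

# Eq. 20: w = Phi^{-1} d, solved without forming the inverse explicitly.
w = np.linalg.solve(Phi, d)

# Strict interpolation: the fitted surface passes through every point.
print(np.allclose(Phi @ w, d))  # True
```

For distinct centers, the Gaussian interpolation matrix is one of the classes for which non-singularity is guaranteed, so the solve succeeds.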
109. Introduction
Observation
The strict interpolation procedure described may not be a good strategy
for the training of RBF networks for certain classes of tasks.
Reason
If the number of data points is much larger than the number of degrees of
freedom of the underlying physical process.
Thus
The network may end up fitting misleading variations due to idiosyncrasies
or noise in the input data.
55 / 96
113. Well-posed
The Problem
Assume that we have a domain X and a range Y, both metric spaces.
They are related by a mapping
f : X → Y (21)
Definition
The problem of reconstructing the mapping f is said to be well-posed if
three conditions are satisfied: Existence, Uniqueness and Continuity.
57 / 96
116. Defining the meaning of this
Existence
For every input vector x ∈ X, there exists an output y = f (x), where
y ∈ Y .
Uniqueness
For any pair of input vectors x, t ∈ X, we have f (x) = f (t) if and only if
x = t.
Continuity
The mapping is continuous if, for any ε > 0, there exists a δ > 0 such that
dX (x, t) < δ implies dY (f (x) , f (t)) < ε.
58 / 96
120. Ill-Posed
Therefore
If any of these conditions is not satisfied, the problem is said to be
ill-posed.
Basically
An ill-posed problem means that large data sets may contain a
surprisingly small amount of information about the desired solution.
60 / 96
124. We have the following
Physical Phenomena
Speech, pictures, radar signals, sonar signals, seismic data.
Generating such data is a well-posed problem
But learning from such data, i.e., rebuilding the hypersurface, can be an
ill-posed inverse problem.
63 / 96
126. Why
First
The existence criterion may be violated in that a distinct output may not
exist for every input
Second
There may not be as much information in the training sample as we really
need to reconstruct the input-output mapping uniquely.
Third
The unavoidable presence of noise or imprecision in real-life training data
adds uncertainty to the reconstructed input-output mapping.
64 / 96
130. How?
This can happen when
There is a lack of information!!!
Lanczos, 1964
“A lack of information cannot be remedied by any mathematical trickery.”
66 / 96
133. How do we solve the problem?
Something Notable
In 1963, Tikhonov proposed a new method called regularization for solving
ill-posed problems.
Tikhonov
He was a Soviet and Russian mathematician known for important
contributions to topology, functional analysis, mathematical physics, and
ill-posed problems.
68 / 96
135. Also Known as Ridge Regression
Setup
We have:
Input signals {x_i ∈ R^{d_0}}_{i=1}^{N}.
Output signals {d_i ∈ R}_{i=1}^{N}.
In addition
Note that the output is assumed to be one-dimensional.
69 / 96
137. Now, assuming that you have an approximation function y = F(x)
Standard Error Term
E_s(F) = \frac{1}{2}\sum_{i=1}^{N}(d_i - y_i)^2 = \frac{1}{2}\sum_{i=1}^{N}(d_i - F(x_i))^2    (22)
Regularization Term
E_c(F) = \frac{1}{2}\|DF\|^2    (23)
Where
D is a linear differential operator.
70 / 96
140. Now
Ordinarily y = F(x)
Normally, the function space representing the functional F is the L_2 space
that consists of all real-valued functions f(x) with x ∈ R^{d_0}.
The quantity to be minimized in regularization theory is
E(f) = \frac{1}{2}\sum_{i=1}^{N}(d_i - f(x_i))^2 + \frac{\lambda}{2}\|Df\|^2    (24)
Where
λ is a positive real number called the regularization parameter.
E(f) is called the Tikhonov functional.
71 / 96
145. Introduction
What did we see until now?
The design of learning machines from two main points of view:
Statistical Point of View
Linear Algebra and Optimization Point of View
Going back to the probability models
We might think of the machine to be learned as a function g(x|D)...
Something like curve fitting...
Under a data set
D = {(x_i, y_i) | i = 1, 2, ..., N}    (25)
Remark: where the x_i ∼ p(x|Θ)!!!
73 / 96
152. Thus, we have that
Two main functions
A function g(x|D) obtained using some algorithm!!!
E[y|x], the optimal regression...
Important
The key factor here is the dependence of the approximation on D.
Why?
The approximation may be very good for a specific training data set but
very bad for another.
This is the reason for studying fusion of information at the decision level...
74 / 96
157. How do we measure the difference?
We have that
Var(X) = E[(X - \mu)^2]
We can do that for our data
Var_D(g(x|D)) = E_D[(g(x|D) - E[y|x])^2]
Now, if we add and subtract
E_D[g(x|D)]    (26)
Remark: the expected output of the machine g(x|D).
75 / 96
161. Thus, we have that
Our original variance
Var_D(g(x|D)) = E_D[(g(x|D) - E[y|x])^2]
= E_D[(g(x|D) - E_D[g(x|D)] + E_D[g(x|D)] - E[y|x])^2]
= E_D[(g(x|D) - E_D[g(x|D)])^2] + 2E_D[(g(x|D) - E_D[g(x|D)])(E_D[g(x|D)] - E[y|x])] + (E_D[g(x|D)] - E[y|x])^2
Finally
E_D[(g(x|D) - E_D[g(x|D)])(E_D[g(x|D)] - E[y|x])] = ?    (27)
Note that the second factor is constant with respect to D, and E_D[g(x|D) - E_D[g(x|D)]] = 0, so this cross term vanishes.
76 / 96
165. We have the Bias-Variance
Our Final Equation
E_D[(g(x|D) - E[y|x])^2] = E_D[(g(x|D) - E_D[g(x|D)])^2] (VARIANCE) + (E_D[g(x|D)] - E[y|x])^2 (BIAS)
Where the variance
It represents the measure of the error between our machine g(x|D) and the
expected output of the machine under x_i ∼ p(x|Θ).
Where the bias
It represents the quadratic error between the expected output of the
machine under x_i ∼ p(x|Θ) and the expected output of the optimal
regression.
77 / 96
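The decomposition above can be checked numerically. The following sketch is illustrative and not from the slides: it takes sin(x) as the optimal regression E[y|x], fits a cubic polynomial as g(x|D) to many independently drawn data sets D, and confirms that the mean squared error splits exactly into variance plus squared bias at every evaluation point (names such as fit_and_predict are ad hoc).

```python
import numpy as np

rng = np.random.default_rng(0)

def true_regression(x):
    # The optimal regression E[y|x] in this toy setup
    return np.sin(x)

def fit_and_predict(x_train, y_train, x_eval, degree=3):
    # g(x|D): a polynomial least-squares fit to one data set D
    coeffs = np.polyfit(x_train, y_train, degree)
    return np.polyval(coeffs, x_eval)

x_eval = np.linspace(0.0, np.pi, 50)   # points where we evaluate the machine
preds = []
for _ in range(2000):                  # many independent data sets D
    x = rng.uniform(0.0, np.pi, 20)
    y = true_regression(x) + rng.normal(0.0, 0.3, 20)
    preds.append(fit_and_predict(x, y, x_eval))
preds = np.array(preds)                # shape (2000, 50)

mean_g   = preds.mean(axis=0)                                  # E_D[g(x|D)]
mse      = ((preds - true_regression(x_eval)) ** 2).mean(axis=0)
variance = ((preds - mean_g) ** 2).mean(axis=0)
bias_sq  = (mean_g - true_regression(x_eval)) ** 2

# Decomposition holds pointwise: MSE = variance + bias^2
assert np.allclose(mse, variance + bias_sq)
```

The identity holds up to floating-point roundoff because the cross term vanishes exactly once the expectation over D is replaced by the empirical mean over the simulated data sets.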
170. Using this in our favor!!!
Something Notable
Introducing bias is equivalent to restricting the range of functions a
model can represent.
Typically this is achieved by removing degrees of freedom.
Examples
Lowering the order of a polynomial or reducing the number of weights in
a neural network!!!
Ridge Regression
It does not explicitly remove degrees of freedom but instead reduces the
effective number of parameters.
79 / 96
174. Example
In the case of a linear regression model
C(w) = \sum_{i=1}^{N}(d_i - w^T x_i)^2 + \lambda\sum_{j=1}^{d_0} w_j^2    (28)
Thus
This is ridge regression (weight decay), and the regularization
parameter λ > 0 controls the balance between fitting the data and
avoiding the penalty.
A small value for λ means the data can be fit tightly without causing
a large penalty.
A large value for λ means a tight fit has to be sacrificed if it requires
large weights.
80 / 96
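Equation (28) has the closed-form minimizer w = (X^T X + λI)^{-1} X^T d. A minimal numpy sketch (toy data and the helper name ridge are assumptions, not from the slides) shows how increasing λ shrinks the weight vector:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d0 = 50, 5
X = rng.normal(size=(N, d0))                      # inputs x_i in R^{d0}
w_true = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
d = X @ w_true + rng.normal(0.0, 0.1, N)          # noisy targets d_i

def ridge(X, d, lam):
    # Minimizer of sum_i (d_i - w^T x_i)^2 + lam * sum_j w_j^2
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ d)

w_ols   = ridge(X, d, 0.0)     # lambda = 0: ordinary least squares
w_small = ridge(X, d, 1.0)
w_large = ridge(X, d, 100.0)

# Larger lambda trades fit quality for smaller weights
assert np.linalg.norm(w_large) < np.linalg.norm(w_small) < np.linalg.norm(w_ols)
```

This illustrates the trade-off stated on the slide: the norm of the solution is strictly decreasing in λ, so a large λ sacrifices tight fit to avoid large weights.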
178. Important
The Bias
It favors solutions involving small weights and the effect is to smooth the
output function.
81 / 96
180. Now, we can carry out the optimization
First, we rewrite the cost function in the following way
S(w) = \sum_{i=1}^{N}(d_i - f(x_i))^2    (29)
And we will use a generalized version for f
f(x_i) = \sum_{j=1}^{d_1} w_j \phi_j(x_i)    (30)
Where
The free variables are the weights {w_j}_{j=1}^{d_1}.
83 / 96
183. Where
For \phi_j(x_i), in our case, we may use the Gaussian function
\phi_j(x_i) = \phi(x_i, x_j)    (31)
With
\phi(x, x_j) = \exp\left(-\frac{1}{2\sigma^2}\|x - x_j\|^2\right)    (32)
84 / 96
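With the Gaussian basis of equation (32), the design matrix has entries Φ[i, j] = φ_j(x_i) = φ(x_i, c_j). A small numpy sketch (the toy data, the choice of centres as a subset of the inputs, and the name gaussian_rbf are illustrative assumptions):

```python
import numpy as np

def gaussian_rbf(x, c, sigma=1.0):
    # phi(x, c) = exp(-||x - c||^2 / (2 sigma^2)), eq. (32)
    return np.exp(-np.sum((x - c) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 2))        # N = 6 input points in R^2
centres = X[:3]                    # d1 = 3 centres, here a subset of the data

# Design matrix Phi with Phi[i, j] = phi_j(x_i) = phi(x_i, c_j)
Phi = np.array([[gaussian_rbf(x, c) for c in centres] for x in X])

assert Phi.shape == (6, 3)                      # N rows, d1 columns
assert np.allclose(Phi[:3].diagonal(), 1.0)     # phi(c_j, c_j) = exp(0) = 1
assert (Phi > 0).all() and (Phi <= 1).all()     # Gaussian values lie in (0, 1]
```

Each basis function responds most strongly near its own centre, which is why the diagonal of the first block is exactly one.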
185. Thus
Final cost function, assuming there is a regularization term per weight
C(w, \lambda) = \sum_{i=1}^{N}(d_i - f(x_i))^2 + \sum_{j=1}^{d_1} \lambda_j w_j^2    (33)
What do we do?
1 Differentiate the function with respect to the free variables.
2 Equate the results with zero.
3 Solve the resulting equations.
85 / 96
189. Differentiate the function with respect to the free variables.
First
\frac{\partial C(w, \lambda)}{\partial w_j} = -2\sum_{i=1}^{N}(d_i - f(x_i))\frac{\partial f(x_i)}{\partial w_j} + 2\lambda_j w_j    (34)
We get the differential \partial f(x_i)/\partial w_j
\frac{\partial f(x_i)}{\partial w_j} = \phi_j(x_i)    (35)
86 / 96
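The analytic gradient (34) and (35) can be verified against central finite differences. The sketch below uses a random design matrix and per-weight regularization parameters purely as illustrative test data:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d1 = 8, 4
Phi = rng.normal(size=(N, d1))      # Phi[i, j] = phi_j(x_i)
d = rng.normal(size=N)              # targets d_i
lam = rng.uniform(0.1, 1.0, d1)     # one regularization parameter per weight
w = rng.normal(size=d1)

def C(w):
    f = Phi @ w                                  # f(x_i) = sum_j w_j phi_j(x_i)
    return np.sum((d - f) ** 2) + np.sum(lam * w ** 2)

# Analytic gradient, eq. (34)-(35):
# dC/dw_j = -2 sum_i (d_i - f(x_i)) phi_j(x_i) + 2 lam_j w_j
grad = -2.0 * Phi.T @ (d - Phi @ w) + 2.0 * lam * w

# Independent check via central finite differences
eps = 1e-6
num = np.array([(C(w + eps * e) - C(w - eps * e)) / (2 * eps)
                for e in np.eye(d1)])

assert np.allclose(grad, num, atol=1e-4)
```

Agreement with the numerical gradient confirms the minus sign in (34): increasing w_j raises f(x_i) by φ_j(x_i), which decreases the residual d_i − f(x_i).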
191. Now
We have then
\sum_{i=1}^{N} f(x_i)\phi_j(x_i) + \lambda_j w_j = \sum_{i=1}^{N} d_i \phi_j(x_i)    (36)
Something Notable
There are d_1 such equations, for 1 ≤ j ≤ d_1, each representing one
constraint on the solution.
Since there are exactly as many constraints as there are unknowns, the
system has, except under certain pathological conditions, a unique
solution.
87 / 96
194. Using Our Linear Algebra
We have then
\phi_j^T f + \lambda_j w_j = \phi_j^T d    (37)
Where
\phi_j = (\phi_j(x_1), \phi_j(x_2), ..., \phi_j(x_N))^T,
f = (f(x_1), f(x_2), ..., f(x_N))^T,
d = (d_1, d_2, ..., d_N)^T    (38)
88 / 96
196. Now
Since there is one of these equations for each j, each relating one scalar
quantity to another, we can stack them
(\phi_1^T f, \phi_2^T f, ..., \phi_{d_1}^T f)^T + (\lambda_1 w_1, \lambda_2 w_2, ..., \lambda_{d_1} w_{d_1})^T = (\phi_1^T d, \phi_2^T d, ..., \phi_{d_1}^T d)^T    (39)
Now, if we define
\Phi = (\phi_1 \; \phi_2 \; ... \; \phi_{d_1})    (40)
Written in full form
\Phi = \begin{pmatrix} \phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_{d_1}(x_1) \\ \phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_{d_1}(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_1(x_N) & \phi_2(x_N) & \cdots & \phi_{d_1}(x_N) \end{pmatrix}    (41)
89 / 96
199. We can then
Define the following matrix equation
\Phi^T f + \Lambda w = \Phi^T d    (42)
Where
\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, ..., \lambda_{d_1})    (43)
90 / 96
201. Now, we have that
The vector f can be decomposed into the product of two terms: the design
matrix and the weight vector
We have then
f_i = f(x_i) = \sum_{j=1}^{d_1} w_j \phi_j(x_i) = \tilde{\phi}_i^T w    (44)
Where \tilde{\phi}_i is the i-th row of \Phi written as a column vector
\tilde{\phi}_i = (\phi_1(x_i), \phi_2(x_i), ..., \phi_{d_1}(x_i))^T    (45)
91 / 96
204. Furthermore
We get that
f = (f_1, f_2, ..., f_N)^T = (\tilde{\phi}_1^T w, \tilde{\phi}_2^T w, ..., \tilde{\phi}_N^T w)^T = \Phi w    (46)
Finally, we have that
\Phi^T d = \Phi^T f + \Lambda w = \Phi^T \Phi w + \Lambda w = (\Phi^T \Phi + \Lambda) w
92 / 96
206. Now...
We get finally
w = (\Phi^T \Phi + \Lambda)^{-1} \Phi^T d    (47)
Remember
This equation is the most general form of the normal equation.
We have two cases
In standard ridge regression, \lambda_j = \lambda for 1 ≤ j ≤ d_1.
Ordinary least squares, where there is no weight penalty, i.e., all
\lambda_j = 0 for 1 ≤ j ≤ d_1.
93 / 96
209. Thus, we have
First Case
w = (\Phi^T \Phi + \lambda I_{d_1})^{-1} \Phi^T d    (48)
Second Case
w = (\Phi^T \Phi)^{-1} \Phi^T d    (49)
94 / 96
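Both cases of the general normal equation (47)-(49) are one linear solve away. The sketch below (random design matrix and targets as illustrative data, helper name solve_weights assumed) computes the standard ridge solution and the ordinary least squares solution from the same routine:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d1 = 30, 5
Phi = rng.normal(size=(N, d1))     # design matrix, Phi[i, j] = phi_j(x_i)
d = rng.normal(size=N)             # target vector

def solve_weights(Phi, d, lam_diag):
    # w = (Phi^T Phi + Lambda)^{-1} Phi^T d, with Lambda = diag(lam_diag)
    return np.linalg.solve(Phi.T @ Phi + np.diag(lam_diag), Phi.T @ d)

w_ridge = solve_weights(Phi, d, np.full(d1, 0.5))   # lambda_j = lambda, eq. (48)
w_ols   = solve_weights(Phi, d, np.zeros(d1))       # lambda_j = 0,      eq. (49)

# The OLS solution satisfies the unregularized normal equation
assert np.allclose(Phi.T @ Phi @ w_ols, Phi.T @ d)
# Regularization shrinks the weight vector
assert np.linalg.norm(w_ridge) < np.linalg.norm(w_ols)
```

Note that np.linalg.solve is preferred over forming the explicit inverse: it is both cheaper and numerically better behaved when Φ^T Φ is ill-conditioned, which is precisely the pathological case the slides mention.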
212. There are still several things that we need to look at...
First
What is the variance of the weight vector? The Variance Matrix.
Second
The prediction of the output at any of the training set inputs: The
Projection Matrix.
Finally
The incremental algorithm for the problem!!!
96 / 96