Statistical Machine Learning-
The Basic Approach and
Current Research Challenges
Shai Ben-David
CS497
February, 2007
A High Level Agenda
“The purpose of science is
to find meaningful simplicity
in the midst of disorderly complexity”
Herbert Simon
Representative learning tasks
 Medical research.
 Detection of fraudulent activity
(credit card transactions, intrusion
detection, stock market manipulation)
 Analysis of genome functionality
 Email spam detection.
 Spatial prediction of landslide hazards.
Common to all such tasks
 We wish to develop algorithms that detect meaningful
regularities in large complex data sets.
 We focus on data that is too complex for humans to
figure out its meaningful regularities.
 We consider the task of finding such regularities from
random samples of the data population.
 We should derive conclusions in a timely manner.
Computational efficiency is essential.
Different types of learning tasks
 Classification prediction –
we wish to classify data points into categories, and we
are given already classified samples as our training
input.
For example:
 Training a spam filter
 Medical Diagnosis (Patient info → High/Low risk).
 Stock market prediction (predict tomorrow’s market
trend from companies’ performance data)
Other Learning Tasks
 Clustering –
grouping data into representative collections
– a fundamental tool for data analysis (a minimal sketch follows the examples below).
Examples :
 Clustering customers for targeted marketing.
 Clustering pixels to detect objects in images.
 Clustering web pages for content similarity.
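To make the clustering task concrete, here is a minimal sketch using k-means from scikit-learn; the toy data and the choice of two clusters are assumptions made purely for illustration, not part of the lecture.

```python
# Minimal clustering sketch (illustrative only; toy data and k=2 are assumptions).
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points forming two loose groups.
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.1]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # group assignment of each point
print(kmeans.cluster_centers_)  # representative center of each group
```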
Differences from Classical Statistics
 We are interested in hypothesis generation
rather than hypothesis testing.
 We wish to make no prior assumptions
about the structure of our data.
 We develop algorithms for automated
generation of hypotheses.
 We are concerned with computational
efficiency.
Learning Theory:
The fundamental dilemma…
[Figure: sample points in the X–Y plane and a fitted curve y = f(x); good models should enable prediction of new data.]
Tradeoff between
accuracy and simplicity
A Fundamental Dilemma of Science:
Model Complexity vs Prediction Accuracy
[Figure: prediction accuracy vs. model complexity over the possible models/representations, given limited data.]
Problem Outline
 We are interested in
(automated) Hypothesis Generation,
rather than traditional Hypothesis Testing
 First obstacle: The danger of overfitting.
 First solution:
Consider only a limited set of candidate hypotheses.
Empirical Risk Minimization
Paradigm
 Choose a Hypothesis Class H of subsets of X.
 For an input sample S, find some h in H that fits S
well.
 For a new point x, predict a label according to its
membership in h.
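As a toy illustration of this paradigm (a sketch, assuming H is the class of threshold functions on the real line, a choice made only for this example):

```python
# Toy ERM: H = { h_t : h_t(x) = 1 iff x >= t }, thresholds on the real line.
import numpy as np

def erm_threshold(xs, ys):
    """Return the h in H (a threshold t) with the fewest mistakes on the sample S."""
    candidates = np.concatenate(([-np.inf], np.sort(xs)))
    best_t, best_err = -np.inf, np.inf
    for t in candidates:
        err = np.mean((xs >= t).astype(int) != ys)   # empirical risk of h_t
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

xs = np.array([0.2, 0.5, 1.1, 1.7, 2.3])
ys = np.array([0, 0, 1, 1, 1])
t, err = erm_threshold(xs, ys)
print(t, err)     # a new point x is labeled 1 exactly when x >= t
```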
The Mathematical Justification
Assume both a training sample S and the test point
(x, l) are generated i.i.d. by the same distribution over
X × {0,1}. Then, if H is not too rich (in some formal sense),
for every h in H, the training error of h on the
sample S is a good estimate of its probability of
success on the new x.
In other words – there is no overfitting.
The Mathematical Justification - Formally
If S is sampled i.i.d. by some probability distribution P over X × {0,1},
then, with probability > 1 − δ, for all h in H:

$$\Pr_{(x,y)\sim P}\big[h(x)\neq y\big]\;\le\;\frac{\big|\{(x,y)\in S:\ h(x)\neq y\}\big|}{|S|}\;+\;c\,\sqrt{\frac{\mathrm{VCdim}(H)+\ln(1/\delta)}{|S|}}$$

(the left-hand side is the expected test error; the first term on the right is the training error, and the second is the complexity term)
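Plugging numbers into the bound shows how the complexity term shrinks with the sample size. This is only a sketch, since the constant c is left unspecified on the slide; c = 1 below is an arbitrary assumption.

```python
# Complexity term of the VC bound: c * sqrt((VCdim(H) + ln(1/delta)) / |S|).
# The constant c is unspecified in the slide; c = 1.0 here is an assumption.
import math

def complexity_term(vc_dim, sample_size, delta, c=1.0):
    return c * math.sqrt((vc_dim + math.log(1.0 / delta)) / sample_size)

# E.g., half-spaces in R^10 (VC-dimension 11), |S| = 10000, confidence 95%:
print(complexity_term(vc_dim=11, sample_size=10_000, delta=0.05))  # ~0.037
```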
The Types of Errors to be Considered
[Figure: error decomposition within the class H. The gap between the best
regressor for P and the best h (in H) for P is the approximation error; the
gap between the best h in H and the training error minimizer is the
estimation error; together they make up the total error.]
The Model Selection Problem
Expanding H
will lower the approximation error
BUT
it will increase the estimation error
(lower statistical soundness)
Yet another problem –
Computational Complexity
Once we have a large enough training
sample,
how much computation is required to
search for a good hypothesis?
(That is, an empirically good one.)
The Computational Problem
Given a class H of subsets of Rn
 Input: A finite set of {0, 1}-labeled points S in Rn
 Output: Some ‘hypothesis’ function h in H that
maximizes the number of correctly labeled points of S.
Hardness-of-Approximation Results
For each of the following classes, approximating the
best agreement rate for h in H (on a given input
sample S) up to some constant ratio is NP-hard:
 Monomials
 Monotone Monomials
 Half-spaces
 Balls
 Axis-aligned Rectangles
 Constant-width Threshold NN’s
(BD-Eiron-Long; Bartlett-BD)
The Types of Errors to be Considered
[Figure: error decomposition within the class H, now including a computational
component. Relative to the best regressor for D, the best in-class hypothesis
$\operatorname{Arg\,min}\{\,Er(h) : h \in H\,\}$ determines the approximation
error; the training error minimizer
$\operatorname{Arg\,min}\{\,\widehat{Er}_S(h) : h \in H\,\}$ adds the estimation
error; and the output of the learning algorithm, which may fail to find the
empirical minimizer, adds the computational error. Together they make up the
total error.]
Our hypothesis set should balance
several requirements:
Expressiveness – being able to capture the
structure of our learning task.
Statistical ‘compactness’ – having low
combinatorial complexity.
Computational manageability – existence of
efficient ERM algorithms.
Concrete learning paradigm – linear separators
The predictor h: sign(Σᵢ wᵢxᵢ + b)
(where w is the weight vector of the hyperplane h,
and x = (x1, …, xi, …, xn) is the example to classify)
[Figure: a hyperplane h separating the two classes.]
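As a direct transcription of this predictor (the weight vector and example below are made-up values for illustration):

```python
# The linear predictor h(x) = sign(<w, x> + b).
import numpy as np

def predict(w, b, x):
    """Label +1 on one side of the hyperplane, -1 on the other."""
    return 1 if np.dot(w, x) + b >= 0 else -1

w = np.array([2.0, -1.0])    # weight vector of the hyperplane h (made up)
b = 0.5                      # bias term (made up)
print(predict(w, b, np.array([1.0, 1.0])))   # -> 1
```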
Potential problem –
data may not be linearly separable
The SVM Paradigm
 Choose an Embedding of the domain X into
some high dimensional Euclidean space,
so that the data sample becomes (almost)
linearly separable.
 Find a large-margin data-separating hyperplane
in this image space, and use it for prediction.
Important gain: When the data is separable,
finding such a hyperplane is computationally feasible.
The SVM Idea: an Example
[Figures: one-dimensional data that is not linearly separable on the line
becomes linearly separable after the embedding x ↦ (x, x²).]
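A numerical sketch of this example: points on the line whose labels no single threshold can separate become linearly separable in the plane after the embedding x ↦ (x, x²). The sample points below are made up.

```python
# The embedding x -> (x, x^2) turns a non-separable 1-D sample into a
# linearly separable 2-D sample (toy data, chosen for illustration).
import numpy as np

x = np.array([-2.0, -0.5, 0.3, 1.8])
y = np.array([1, -1, -1, 1])          # no threshold on the line separates these

phi = np.column_stack([x, x ** 2])    # image of the sample in R^2
# In the image space the horizontal line x2 = 1 separates the classes:
print(np.sign(phi[:, 1] - 1.0))       # -> [ 1. -1. -1.  1.], agreeing with y
```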
Controlling Computational Complexity
Potentially the embeddings may require
very high Euclidean dimension.
How can we search for hyperplanes
efficiently?
The Kernel Trick: Use algorithms that
depend only on the inner product of
sample points.
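To see why inner products suffice, note (as a standard illustration, not from the slides) that the quadratic kernel K(x, z) = ⟨x, z⟩² on R² computes exactly the inner product under an explicit embedding into R³, so an algorithm can work with K and never form the embedding:

```python
# Kernel trick sketch: K(x, z) = <x, z>^2 equals <phi(x), phi(z)> for the
# embedding phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2) of R^2 into R^3.
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(np.dot(x, z) ** 2)           # kernel value, computed in R^2 -> 16.0
print(np.dot(phi(x), phi(z)))      # same value, computed in R^3  -> 16.0
```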
Kernel-Based Algorithms
Rather than define the embedding explicitly, define
just the matrix of the inner products in the range
space:

$$K=\begin{pmatrix} K(x_1,x_1) & K(x_1,x_2) & \cdots & K(x_1,x_m)\\ \vdots & K(x_i,x_j) & & \vdots\\ K(x_m,x_1) & & \cdots & K(x_m,x_m) \end{pmatrix}$$

Mercer’s Theorem: If the matrix is symmetric and positive
semi-definite, then it is the inner product matrix with
respect to some embedding.
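A numerical sketch of such a matrix (the Gaussian/RBF kernel and the random sample are assumptions for the example); Mercer's conditions can be checked directly:

```python
# Building a kernel (Gram) matrix and checking symmetry and positive
# semi-definiteness numerically (RBF kernel and sample points are assumptions).
import numpy as np

def rbf(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

X = np.random.default_rng(0).normal(size=(5, 2))    # m = 5 points in R^2
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])

print(np.allclose(K, K.T))                      # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)    # PSD (up to rounding error)
```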
Support Vector Machines (SVMs)
On input: a sample (x1, y1), …, (xm, ym) and a
kernel matrix K.
Output: a “good” separating
hyperplane.
A Potential Problem: Generalization
 VC-dimension bounds: The VC-dimension of
the class of half-spaces in Rn is n+1.
Can we guarantee low dimension of the embedding’s
range?
 Margin bounds: Regardless of the Euclidean
dimension, generalization can be bounded as a function of
the margins of the hypothesis hyperplane.
Can one guarantee the existence of a large-margin
separation?
The Margins of a Sample

$$\max_{\text{separating } h}\ \min_{x_i}\ |\langle w, x_i\rangle|$$

(where w is the weight vector of the hyperplane h)
[Figure: a separating hyperplane h and the margins around it.]
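A sketch of computing this quantity for one candidate hyperplane (with w normalized so that |⟨w, xᵢ⟩| is the geometric distance; the sample and w are made up):

```python
# Margin of a sample with respect to a given (homogeneous) hyperplane.
import numpy as np

def margin(w, X):
    w = w / np.linalg.norm(w)        # unit-norm weight vector
    return np.min(np.abs(X @ w))     # distance of the closest sample point

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.5, -1.0]])
print(margin(np.array([1.0, 1.0]), X))
# A max-margin learner maximizes this value over all separating hyperplanes h.
```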
Summary of SVM learning
1. The user chooses a “Kernel Matrix”
- a measure of similarity between input
points.
2. Upon viewing the training data, the
algorithm finds a linear separator that
maximizes the margins (in the high-
dimensional “Feature Space”).
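These two steps correspond directly to standard SVM implementations; for instance, a sketch with scikit-learn's SVC, where choosing kernel="rbf" plays the role of step 1 and fit performs step 2 (the toy data are an assumption):

```python
# End-to-end SVM sketch (assumes scikit-learn; toy data chosen for illustration).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # not linearly separable in R^2

clf = SVC(kernel="rbf", C=1.0)    # step 1: the user's choice of kernel/similarity
clf.fit(X, y)                     # step 2: find a large-margin separator
print(clf.predict([[0.1, 0.0], [2.0, 2.0]]))   # expected: class 0, then class 1
```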
How are the basic requirements met?
Expressiveness – by allowing all types of kernels
there is (potentially) high expressive power.
Statistical ‘compactness’ – only if we are lucky
and the algorithm finds a large-margin good
separator.
Computational manageability – it turns out that the
search for a large-margin classifier can be done in
time polynomial in the input size.