The Setting
Prediction problems
After observing example pairs (X, Y), build a function g : X → Y that predicts well: g(X) ≈ Y (a small sketch follows this slide)
The typical setting is statistical (data assumed to be sampled i.i.d.)
Another setting: online adversarial (no assumption on the data generation mechanism)
Goal: find the best algorithm
Theoretical answer: fundamental limits of learning
Practical answer: guidelines for algorithm design
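To make this concrete, here is a minimal sketch (the distribution, the threshold rule, and all names are illustrative, not from the talk): sample i.i.d. pairs, build g from the first n of them, and check g(X) ≈ Y on fresh draws.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pairs(n):
    """Illustrative distribution: X uniform on [0,1], Y = 1{X > 0.5}, label flipped w.p. 0.1."""
    x = rng.uniform(0, 1, n)
    y = (x > 0.5).astype(int)
    flip = rng.uniform(0, 1, n) < 0.1
    return x, np.where(flip, 1 - y, y)

# Build a rule g from n observed pairs: threshold at the midpoint of the class means.
x_train, y_train = sample_pairs(1000)
t = 0.5 * (x_train[y_train == 0].mean() + x_train[y_train == 1].mean())
g = lambda x: (x > t).astype(int)

# Check that g predicts well on fresh data: the error approaches the 0.1 noise level.
x_test, y_test = sample_pairs(100_000)
print("test error:", np.mean(g(x_test) != y_test))
```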
Definitions
We consider the classification setting: Y = {0, 1} with data sampled i.i.d.
A rule (or learning algorithm) is a mapping g_n : (X × Y)^n × X → Y.
Sample: S_n = {(X_1, Y_1), . . . , (X_n, Y_n)}
Misclassification error: L(g) = P(g(X) ≠ Y) (conditional on the sample)
Bayes error: best possible error L* = inf_g L(g) over all measurable functions
Sequence of classification rules {g_n}: defined for any sample size (algorithms are usually defined this way, possibly with a sample-size-dependent parameter)
Consistency: lim_{n→∞} E L(g_n) = L*
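These definitions can be exercised numerically. Below is a hedged sketch (the distribution, the constant ETA, and all names are illustrative): the Bayes error is L* = ETA by construction, and E L(g_n) is estimated on held-out data for a k-nearest-neighbor rule with k ≈ √n, a standard universally consistent choice (k → ∞, k/n → 0), so the printed errors should approach L* as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)
ETA = 0.2  # P(Y=1 | X=x) is 1-ETA for x > 0.5 and ETA otherwise, so L* = ETA

def sample(n):
    x = rng.uniform(0, 1, n)
    p1 = np.where(x > 0.5, 1 - ETA, ETA)
    return x, (rng.uniform(0, 1, n) < p1).astype(int)

def knn_rule(xs, ys, k):
    """g_n: majority vote among the k nearest sample points."""
    def g(x):
        idx = np.argsort(np.abs(xs[None, :] - x[:, None]), axis=1)[:, :k]
        return (ys[idx].mean(axis=1) > 0.5).astype(int)
    return g

x_test, y_test = sample(4000)            # held-out points to estimate L(g_n)
for n in [50, 500, 2000]:
    k = max(1, int(np.sqrt(n)))          # k -> inf, k/n -> 0: a consistent choice
    xs, ys = sample(n)
    L_hat = np.mean(knn_rule(xs, ys, k)(x_test) != y_test)
    print(f"n={n:5d}  L(g_n) ~ {L_hat:.3f}  (L* = {ETA})")
```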
Consistency
How do we build a consistent sequence of rules?
Countable X
Very easy, just wait: eventually every point with non-zero probability is observed an unbounded number of times (i.e. take a majority vote over the labels observed at each x, and predict at random on unobserved points; a code sketch follows this slide)
Uncountable X
The observed sample has measure zero (for non-atomic measures), so this trick does not work
Instead, take a local majority vote under two conditions: more and more local, but also more and more points averaged
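A minimal sketch of that trivially consistent rule for countable X (the toy sample is illustrative): a majority vote over the labels observed at each point, and a coin flip elsewhere.

```python
import random
from collections import Counter, defaultdict

def majority_vote_rule(sample):
    votes = defaultdict(Counter)
    for x, y in sample:
        votes[x][y] += 1
    def g(x):
        if x in votes:
            return votes[x].most_common(1)[0][0]   # majority vote at x
        return random.randint(0, 1)                # unseen point: random prediction
    return g

# Each x with P(X = x) > 0 is eventually observed infinitely often, so the
# vote at x converges to the Bayes decision at x.
sample = [(1, 1), (1, 1), (1, 0), (2, 0)]
g = majority_vote_rule(sample)
print(g(1), g(2))  # 1, 0 (g(3) would be a coin flip)
```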
Consistency of Histograms
Histogram in R^d: cubic cells of side h_n, prediction is constant over each cell (majority vote)
h_n → 0 and n h_n^d → ∞ are enough for universal consistency (a code sketch follows this slide)
Idea of the proof
Continuous functions with bounded support are dense in L^p(ν)
Such functions are uniformly continuous and can thus be approximated by histograms (the average of the function over each cell), provided the cell size goes to 0
Since cells will contain more and more points (second condition), the cell value will eventually converge to the average over the cell
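A hedged sketch of the histogram rule (d = 2, the noiseless distribution, and the schedule h_n = n^(-1/(2d)) are illustrative choices that satisfy the two conditions): the test error should shrink as n grows.

```python
import numpy as np
from collections import defaultdict

def histogram_rule(xs, ys, h):
    counts = defaultdict(lambda: [0, 0])
    for x, y in zip(xs, ys):
        counts[tuple(np.floor(x / h).astype(int))][y] += 1
    def g(x):
        c0, c1 = counts.get(tuple(np.floor(x / h).astype(int)), (0, 0))
        return int(c1 > c0)        # majority vote within the cell (ties -> 0)
    return g

rng = np.random.default_rng(2)
d = 2
for n in [100, 1000, 10000]:
    h = n ** (-1 / (2 * d))        # h_n -> 0 while n * h_n**d = sqrt(n) -> infinity
    xs = rng.uniform(0, 1, (n, d))
    ys = (xs.sum(axis=1) > 1).astype(int)     # noiseless problem, so L* = 0
    g = histogram_rule(xs, ys, h)
    xt = rng.uniform(0, 1, (2000, d))
    yt = (xt.sum(axis=1) > 1).astype(int)
    err = np.mean([g(x) != y for x, y in zip(xt, yt)])
    print(f"n={n:6d}  h={h:.3f}  test error={err:.3f}")
```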
No Free Lunch
We can "learn" anything
Is the problem solved?
The question becomes: among the consistent algorithms, which one is the best?
We consider here the special case of classification
Similar phenomena occur for regression or density estimation
Unfortunately, there is no free lunch
No Free Lunch 1
Out-of-sample error: L(g_n) = P(g_n(X) ≠ Y | X ∉ S_n)
Consider a uniform probability distribution µ over problems, i.e. for all x, E_µ P(Y = 1 | X = x) = E_µ P(Y = 0 | X = x)
All classifiers have the same average error
Theorem (Wolpert96)
For any classification rule g_n,
E_µ E L(g_n) = 1/2
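A hedged simulation of this statement (the finite X, the 1-NN rule, and all constants are illustrative choices, not part of the theorem): draw the target labeling uniformly at random; the out-of-sample error of the fixed rule, averaged over problems, comes out near 1/2, and the same holds for any other rule plugged in.

```python
import numpy as np

rng = np.random.default_rng(3)
N, n = 100, 30                       # |X| = N, sample size n

def rule(xs, ys, x):                 # an arbitrary fixed rule: 1-NN on the integers
    return ys[np.argmin(np.abs(xs - x))]

errs = []
for _ in range(2000):                # E_mu: average over random problems
    f = rng.integers(0, 2, N)        # target labeling, drawn uniformly
    xs = rng.integers(0, N, n)       # i.i.d. sample of inputs
    ys = f[xs]
    unseen = np.setdiff1d(np.arange(N), xs)
    if len(unseen):
        errs.append(np.mean([rule(xs, ys, x) != f[x] for x in unseen]))
print("average out-of-sample error:", np.mean(errs))   # close to 0.5
```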
No Free Lunch 2
A consequence of NFL1 is that there are always cases where an algorithm can be beaten.
A stronger version of NFL1: No Super Classifier
Theorem (DGL96)
For every sequence of classification rules {g_n} there is a universally consistent sequence {g'_n} such that, for some distribution,
L(g_n) > L(g'_n)
for all n.
No Free Lunch 3
A variation of NFL1
Arbitrarily bad error for fixed sample sizes
Theorem (Devroye82)
Fix ε > 0. For any integer n and classification rule g_n, there exists a distribution of (X, Y) with Bayes risk L* = 0 such that
E L(g_n) ≥ 1/2 − ε
No Free Lunch 4
NFL3 possibly considers a different distribution for each n
What happens for a fixed distribution as n increases?
Slow rate phenomenon
Theorem (Devroye82)
Let {a_n} be a sequence of positive numbers converging to zero with 1/16 ≥ a_1 ≥ a_2 ≥ . . .. For every sequence of classification rules, there exists a distribution of (X, Y) with Bayes risk L* = 0 such that
E L(g_n) ≥ a_n
for all n.
Proofs
The idea is to create a "bad" distribution
It turns out that random ones are bad enough: just create a problem with no structure (the prediction at x is unrelated to the prediction at x')
All proofs work on finite (for fixed n) or countable (for varying n) spaces (no need to introduce an uncountable X)
The trick is to make sure that there are enough points that have not been observed yet (on those, the error will be 1/2)
A closer look at consistency
Consider the trivially consistent rule for a countable space (majority vote)
Its error decreases with increasing sample size:
∀n, E L(g_n) ≥ E L(g_{n+1})
Is this true in general for universally consistent rules?
Smart rules
Consistency for uncountable spaces is not so trivial
Definition
A sequence {g_n} of classification rules is smart if, for any distribution and any integer n,
E L(g_n) ≥ E L(g_{n+1})
For uncountable spaces, some of the known universally consistent rules can be shown to be non-smart
Conjecture: on R^d no smart rule is universally consistent
Interpretation: consistency on uncountable spaces requires adapting the degree of smoothness to the sample size, which means that there will be some point at which the degree of smoothness is too large
Anti-learning
The average error is 1/2, so there are problems for which the error is much worse than random guessing!
One can indeed construct distributions for which some standard algorithms have E L(g_n) arbitrarily close to 1 even with L* = 0!
Of course this occurs for a fixed sample size
Can one always do that (for any rule)?
The problem should have structure, but structure opposite to the kind preferred by the algorithm
Bayes Error Estimation
Assume we just want to estimate L*.
Of course, we could use any universally consistent algorithm and estimate its error. But we get slow rates!
Is there a better way?
Theorem (DGL96)
For every n, for any estimate L̂_n of the Bayes error L* and for every ε > 0, there exists a distribution of (X, Y) such that
E|L̂_n − L*| ≥ 1/4 − ε
Estimating this single number does not seem easier than estimating the whole set {x : P(Y = 1|x) > 1/2}
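For intuition only, here is a hedged sketch of one plug-in route (the binned estimate of η(x) = P(Y = 1 | X = x) and the toy η are illustrative, not from the talk): estimate L* = E[min(η(X), 1 − η(X))] from data. It behaves well on this smooth toy problem, which does not contradict the theorem: at any fixed n there is some distribution on which any such estimate is off by nearly 1/4.

```python
import numpy as np

rng = np.random.default_rng(4)
n, bins = 20_000, 50
x = rng.uniform(0, 1, n)
eta = 0.5 * (1 + np.sin(6 * x)) * 0.8 + 0.1     # true eta(x), ranging over [0.1, 0.9]
y = (rng.uniform(0, 1, n) < eta).astype(int)

# Binned estimate of eta, then plug into L* = E[min(eta(X), 1 - eta(X))].
cell = np.minimum((x * bins).astype(int), bins - 1)
eta_hat = np.bincount(cell, weights=y, minlength=bins) / \
          np.maximum(np.bincount(cell, minlength=bins), 1)
L_hat = np.mean(np.minimum(eta_hat, 1 - eta_hat)[cell])

# Ground truth for this toy eta, computed on a fine grid.
grid = np.linspace(0, 1, 100_001)
eta_g = 0.5 * (1 + np.sin(6 * grid)) * 0.8 + 0.1
print("L_hat =", L_hat, " L* ~", np.mean(np.minimum(eta_g, 1 - eta_g)))
```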
What can we hope to prove?
Our framework is too general! Nothing interesting can be said about learning algorithms
Can we prove something interesting under slightly more restrictive assumptions?
Are the distributions used to prove the NFLs pathological? (NFL4 holds even within classes of "reasonable" distributions!)
If we can define which problems actually occur in real life, we can hope to derive appropriate algorithms (optimal on this class of problems)
The Bayesian Way
Assume something about how the data is generated
Consider an algorithm specifically tuned to this property
Prove that under this assumption the algorithm does well
Most results go in this direction (sometimes in a subtle way)
Bayesian algorithms
Most minimax results are of this form:
inf_{g_n} sup_{P ∈ P} [ L(g_n) − inf_g L(g) ]
Seems reasonable and useful for understanding, but does not provide guarantees
The Worst Case Way
Assume nothing about the data (distribution-free)
Restrict your objectives
Derive an algorithm that reaches this objective no matter what the data is:
inf_{g_n} sup_P [ L(g_n) − inf_{g ∈ G} L(g) ]
Gives guarantees
In between: adaptation (the two criteria are contrasted below)
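Spelled out side by side (a reconstruction of the two displayed criteria; \mathcal{P} is the restricted class of distributions from the previous slide and \mathcal{G} the restricted comparison class here):

```latex
% Bayesian / minimax way: restrict the class of distributions
\inf_{\{g_n\}} \; \sup_{P \in \mathcal{P}} \; \Big( L(g_n) - \inf_{g} L(g) \Big)

% Worst-case way: keep all distributions, restrict the comparison class
\inf_{\{g_n\}} \; \sup_{P} \; \Big( L(g_n) - \inf_{g \in \mathcal{G}} L(g) \Big)
```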
Does this help practically?
We can probably come up with algorithms that work well on most real-world problems
If we have a characterization of these problems, we can even prove something about such algorithms
However, there is no guarantee that a new problem will satisfy this characterization
So there cannot be a formal proof that an algorithm is good or bad
If theory cannot help, what can we do?
Essentially a matter of finding an algorithm that implements the right notion of smoothness for the problem at hand (a small sketch follows this slide)
More an art than a science!
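As a toy illustration of "the right notion of smoothness" (all choices below are illustrative, not the talk's): in a k-nearest-neighbor rule the parameter k sets a smoothness scale, and a simple held-out split is one pragmatic way to pick it for the problem at hand.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
x = rng.uniform(0, 1, n)
# A wiggly target with 10% label noise: too much smoothing washes it out,
# too little smoothing fits the noise.
y = ((np.sin(20 * x) > 0) ^ (rng.uniform(0, 1, n) < 0.1)).astype(int)

def knn_err(k, x_tr, y_tr, x_va, y_va):
    idx = np.argsort(np.abs(x_tr[None, :] - x_va[:, None]), axis=1)[:, :k]
    pred = (y_tr[idx].mean(axis=1) > 0.5).astype(int)
    return np.mean(pred != y_va)

half = n // 2
for k in [1, 5, 25, 125, 625]:      # small k: wiggly rule; large k: smooth rule
    err = knn_err(k, x[:half], y[:half], x[half:], y[half:])
    print(f"k={k:4d}  validation error={err:.3f}")
```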
Priors
Algorithm design is composed of two steps
Choosing a preference
The first step is based on knowledge of the problem; this is where guidance (but no theory) is needed.
Exploiting it for inference
The second step can possibly be formalized (optimality with respect to assumptions). The main issue is computational cost.
Why can algorithms fail in practice?
1 Data representation (inappropriate features, errors, ...)
2 Data scarcity (not enough data samples)
3 Data overload (too many variables, too much noise)
4 Lack of understanding of the result (validation impossible) / lack of validation data
Examples
Forgetting to remove the output variable (or a version of it): the algorithm picks it up (a sketch of this failure mode follows this slide)
An irrelevant variable happens to be discriminative (e.g. the date of sample collection)
An error in a measurement (a misalignment in the database)
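A hedged sketch of the first example (the data and the stump learner are invented for illustration): a copy of the output variable is left among the features, and even a trivial learner immediately "discovers" it, producing a deceptively perfect fit.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1000
signal = rng.uniform(0, 1, n)
y = (signal + 0.3 * rng.normal(size=n) > 0.5).astype(int)
leak = y.copy()                      # "a version of" the output left in the features
X = np.column_stack([signal, leak])

def stump_errors(X, y):
    """Best threshold-rule training error for each feature."""
    errs = []
    for j in range(X.shape[1]):
        ts = np.quantile(X[:, j], np.linspace(0, 1, 21))
        errs.append(min(np.mean((X[:, j] > t) != y) for t in ts))
    return errs

errs = stump_errors(X, y)
print("training error per feature:", errs)      # the leaked column looks "perfect"
print("picked feature:", int(np.argmin(errs)))  # 1 = the leaked column
```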
So, what would be helpful?
Flexible ways to incorporate knowledge/expertise
Provide tools that allow prior knowledge to be formulated in a natural way
Look for other types of prior assumptions that occur in various problems (e.g. manifold structure, clusteredness, analogy...)
Ability to understand what is found by the algorithm (need a language to interact with experts)
Investigate how to improve understandability (simpler models, separate models and language for interaction...)
Improve interaction (understand the user's intent)
Computationally efficient algorithms
Scalability, anytime behavior
Incorporate time complexity in the theoretical analysis (trade complexity for accuracy)
References
L. Devroye: Necessary and Sufficient Conditions for the Almost Everywhere Convergence of Nearest Neighbor Regression Function Estimates. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 61: 467-481 (1982)
D. Wolpert: The Lack of A Priori Distinctions Between Learning Algorithms. Neural Computation 8 (1996)
L. Devroye, L. Györfi and G. Lugosi: A Probabilistic Theory of Pattern Recognition. Springer (1996)