The Setting
Prediction problems
After observing example pairs (X, Y), build a function g : X → Y that predicts well: g(X) ≈ Y (a small sketch follows this slide)
The typical setting is statistical (data assumed to be sampled i.i.d.)
Another setting: online adversarial (no assumption on the data generation mechanism)
Goal: find the best algorithm
Theoretical answer: fundamental limits of learning
Practical answer: guidelines for algorithm design
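To make this concrete, here is a minimal sketch (the distribution, the threshold rule, and all names are illustrative, not from the talk): sample i.i.d. pairs, build g from the first n of them, and check g(X) ≈ Y on fresh draws.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pairs(n):
    """Illustrative distribution: X uniform on [0,1], Y = 1{X > 0.5}, label flipped w.p. 0.1."""
    x = rng.uniform(0, 1, n)
    y = (x > 0.5).astype(int)
    flip = rng.uniform(0, 1, n) < 0.1
    return x, np.where(flip, 1 - y, y)

# Build a rule g from n observed pairs: threshold at the midpoint of the class means.
x_train, y_train = sample_pairs(1000)
t = 0.5 * (x_train[y_train == 0].mean() + x_train[y_train == 1].mean())
g = lambda x: (x > t).astype(int)

# Check that g predicts well on fresh data: the error approaches the 0.1 noise level.
x_test, y_test = sample_pairs(100_000)
print("test error:", np.mean(g(x_test) != y_test))
```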
Definitions
We consider the classification setting: Y = {0, 1} with data sampled i.i.d.
A rule (or learning algorithm) is a mapping g_n : (X × Y)^n × X → Y.
Sample: S_n = {(X_1, Y_1), . . . , (X_n, Y_n)}
Misclassification error: L(g) = P(g(X) ≠ Y) (conditional on the sample)
Bayes error: best possible error L* = inf_g L(g) over all measurable functions
Sequence of classification rules {g_n}: defined for any sample size (algorithms are usually defined this way, possibly with a sample-size-dependent parameter)
Consistency: lim_{n→∞} E L(g_n) = L*
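These definitions can be exercised numerically. Below is a hedged sketch (the distribution, the constant ETA, and all names are illustrative): the Bayes error is L* = ETA by construction, and E L(g_n) is estimated on held-out data for a k-nearest-neighbor rule with k ≈ √n, a standard universally consistent choice (k → ∞, k/n → 0), so the printed errors should approach L* as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)
ETA = 0.2  # P(Y=1 | X=x) is 1-ETA for x > 0.5 and ETA otherwise, so L* = ETA

def sample(n):
    x = rng.uniform(0, 1, n)
    p1 = np.where(x > 0.5, 1 - ETA, ETA)
    return x, (rng.uniform(0, 1, n) < p1).astype(int)

def knn_rule(xs, ys, k):
    """g_n: majority vote among the k nearest sample points."""
    def g(x):
        idx = np.argsort(np.abs(xs[None, :] - x[:, None]), axis=1)[:, :k]
        return (ys[idx].mean(axis=1) > 0.5).astype(int)
    return g

x_test, y_test = sample(4000)            # held-out points to estimate L(g_n)
for n in [50, 500, 2000]:
    k = max(1, int(np.sqrt(n)))          # k -> inf, k/n -> 0: a consistent choice
    xs, ys = sample(n)
    L_hat = np.mean(knn_rule(xs, ys, k)(x_test) != y_test)
    print(f"n={n:5d}  L(g_n) ~ {L_hat:.3f}  (L* = {ETA})")
```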
Consistency
How do we build a consistent sequence of rules?
Countable X
Very easy, just wait: eventually every point with non-zero probability is observed an unbounded number of times (i.e. take a majority vote over the labels observed at each x, and predict at random on unobserved points; a code sketch follows this slide)
Uncountable X
The observed sample has measure zero (for non-atomic measures), so this trick does not work
Instead, take a local majority vote under two conditions: more and more local, but also more and more points averaged
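A minimal sketch of that trivially consistent rule for countable X (the toy sample is illustrative): a majority vote over the labels observed at each point, and a coin flip elsewhere.

```python
import random
from collections import Counter, defaultdict

def majority_vote_rule(sample):
    votes = defaultdict(Counter)
    for x, y in sample:
        votes[x][y] += 1
    def g(x):
        if x in votes:
            return votes[x].most_common(1)[0][0]   # majority vote at x
        return random.randint(0, 1)                # unseen point: random prediction
    return g

# Each x with P(X = x) > 0 is eventually observed infinitely often, so the
# vote at x converges to the Bayes decision at x.
sample = [(1, 1), (1, 1), (1, 0), (2, 0)]
g = majority_vote_rule(sample)
print(g(1), g(2))  # 1, 0 (g(3) would be a coin flip)
```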
Consistency of Histograms
Histogram in R^d: cubic cells of side h_n, prediction is constant over each cell (majority vote)
h_n → 0 and n h_n^d → ∞ are enough for universal consistency (a code sketch follows this slide)
Idea of the proof
Continuous functions with bounded support are dense in L^p(ν)
Such functions are uniformly continuous and can thus be approximated by histograms (the average of the function over each cell), provided the cell size goes to 0
Since cells will contain more and more points (second condition), the cell value will eventually converge to the average over the cell
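A hedged sketch of the histogram rule (d = 2, the noiseless distribution, and the schedule h_n = n^(-1/(2d)) are illustrative choices that satisfy the two conditions): the test error should shrink as n grows.

```python
import numpy as np
from collections import defaultdict

def histogram_rule(xs, ys, h):
    counts = defaultdict(lambda: [0, 0])
    for x, y in zip(xs, ys):
        counts[tuple(np.floor(x / h).astype(int))][y] += 1
    def g(x):
        c0, c1 = counts.get(tuple(np.floor(x / h).astype(int)), (0, 0))
        return int(c1 > c0)        # majority vote within the cell (ties -> 0)
    return g

rng = np.random.default_rng(2)
d = 2
for n in [100, 1000, 10000]:
    h = n ** (-1 / (2 * d))        # h_n -> 0 while n * h_n**d = sqrt(n) -> infinity
    xs = rng.uniform(0, 1, (n, d))
    ys = (xs.sum(axis=1) > 1).astype(int)     # noiseless problem, so L* = 0
    g = histogram_rule(xs, ys, h)
    xt = rng.uniform(0, 1, (2000, d))
    yt = (xt.sum(axis=1) > 1).astype(int)
    err = np.mean([g(x) != y for x, y in zip(xt, yt)])
    print(f"n={n:6d}  h={h:.3f}  test error={err:.3f}")
```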
No Free Lunch
We can "learn" anything
Is the problem solved?
The question becomes: among the consistent algorithms, which one is the best?
We consider here the special case of classification
Similar phenomena occur for regression or density estimation
Unfortunately, there is no free lunch
No Free Lunch 1
Out-of-sample error: L(g_n) = P(g_n(X) ≠ Y | X ∉ S_n)
Consider a uniform probability distribution µ over problems, i.e. for all x, E_µ P(Y = 1 | X = x) = E_µ P(Y = 0 | X = x)
All classifiers have the same average error
Theorem (Wolpert96)
For any classification rule g_n,
E_µ E L(g_n) = 1/2
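A hedged simulation of this statement (the finite X, the 1-NN rule, and all constants are illustrative choices, not part of the theorem): draw the target labeling uniformly at random; the out-of-sample error of the fixed rule, averaged over problems, comes out near 1/2, and the same holds for any other rule plugged in.

```python
import numpy as np

rng = np.random.default_rng(3)
N, n = 100, 30                       # |X| = N, sample size n

def rule(xs, ys, x):                 # an arbitrary fixed rule: 1-NN on the integers
    return ys[np.argmin(np.abs(xs - x))]

errs = []
for _ in range(2000):                # E_mu: average over random problems
    f = rng.integers(0, 2, N)        # target labeling, drawn uniformly
    xs = rng.integers(0, N, n)       # i.i.d. sample of inputs
    ys = f[xs]
    unseen = np.setdiff1d(np.arange(N), xs)
    if len(unseen):
        errs.append(np.mean([rule(xs, ys, x) != f[x] for x in unseen]))
print("average out-of-sample error:", np.mean(errs))   # close to 0.5
```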
No Free Lunch 2
A consequence of NFL1 is that there are always cases where an algorithm can be beaten.
A stronger version of NFL1: No Super Classifier
Theorem (DGL96)
For every sequence of classification rules {g_n} there is a universally consistent sequence {g'_n} such that, for some distribution,
L(g_n) > L(g'_n)
for all n.
No Free Lunch 3
A variation of NFL1
Arbitrarily bad error for fixed sample sizes
Theorem (Devroye82)
Fix ε > 0. For any integer n and classification rule g_n, there exists a distribution of (X, Y) with Bayes risk L* = 0 such that
E L(g_n) ≥ 1/2 − ε
No Free Lunch 4
NFL3 possibly considers a different distribution for each n
What happens for a fixed distribution as n increases?
Slow rate phenomenon
Theorem (Devroye82)
Let {a_n} be a sequence of positive numbers converging to zero with 1/16 ≥ a_1 ≥ a_2 ≥ . . .. For every sequence of classification rules, there exists a distribution of (X, Y) with Bayes risk L* = 0 such that
E L(g_n) ≥ a_n
for all n.
Proofs
The idea is to create a "bad" distribution
It turns out that random ones are bad enough: just create a problem with no structure (the prediction at x is unrelated to the prediction at x')
All proofs work on finite (for fixed n) or countable (for varying n) spaces (no need to introduce an uncountable X)
The trick is to make sure that there are enough points that have not been observed yet (on those, the error will be 1/2)
A closer look at consistency
Consider the trivially consistent rule for a countable space (majority vote)
Its error decreases with increasing sample size:
∀n, E L(g_n) ≥ E L(g_{n+1})
Is this true in general for universally consistent rules?
Smart rules
Consistency for uncountable spaces is not so trivial
Definition
A sequence {g_n} of classification rules is smart if, for any distribution and any integer n,
E L(g_n) ≥ E L(g_{n+1})
For uncountable spaces, some of the known universally consistent rules can be shown to be non-smart
Conjecture: on R^d no smart rule is universally consistent
Interpretation: consistency on uncountable spaces requires adapting the degree of smoothness to the sample size, which means that there will be some point at which the degree of smoothness is too large
Anti-learning
The average error is 1/2, so there are problems for which the error is much worse than random guessing!
One can indeed construct distributions for which some standard algorithms have E L(g_n) arbitrarily close to 1 even with L* = 0!
Of course this occurs for a fixed sample size
Can one always do that (for any rule)?
The problem should have structure, but structure opposite to the kind preferred by the algorithm
Bayes Error Estimation
Assume we just want to estimate L*.
Of course, we could use any universally consistent algorithm and estimate its error. But we get slow rates!
Is there a better way?
Theorem (DGL96)
For every n, for any estimate L̂_n of the Bayes error L* and for every ε > 0, there exists a distribution of (X, Y) such that
E|L̂_n − L*| ≥ 1/4 − ε
Estimating this single number does not seem easier than estimating the whole set {x : P(Y = 1|x) > 1/2}
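For intuition only, here is a hedged sketch of one plug-in route (the binned estimate of η(x) = P(Y = 1 | X = x) and the toy η are illustrative, not from the talk): estimate L* = E[min(η(X), 1 − η(X))] from data. It behaves well on this smooth toy problem, which does not contradict the theorem: at any fixed n there is some distribution on which any such estimate is off by nearly 1/4.

```python
import numpy as np

rng = np.random.default_rng(4)
n, bins = 20_000, 50
x = rng.uniform(0, 1, n)
eta = 0.5 * (1 + np.sin(6 * x)) * 0.8 + 0.1     # true eta(x), ranging over [0.1, 0.9]
y = (rng.uniform(0, 1, n) < eta).astype(int)

# Binned estimate of eta, then plug into L* = E[min(eta(X), 1 - eta(X))].
cell = np.minimum((x * bins).astype(int), bins - 1)
eta_hat = np.bincount(cell, weights=y, minlength=bins) / \
          np.maximum(np.bincount(cell, minlength=bins), 1)
L_hat = np.mean(np.minimum(eta_hat, 1 - eta_hat)[cell])

# Ground truth for this toy eta, computed on a fine grid.
grid = np.linspace(0, 1, 100_001)
eta_g = 0.5 * (1 + np.sin(6 * grid)) * 0.8 + 0.1
print("L_hat =", L_hat, " L* ~", np.mean(np.minimum(eta_g, 1 - eta_g)))
```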
What can we hope to prove?
Our framework is too general! Nothing interesting can be said about learning algorithms
Can we prove something interesting under slightly more restrictive assumptions?
Are the distributions used to prove the NFLs pathological? (NFL4 holds even within classes of "reasonable" distributions!)
If we can define which problems actually occur in real life, we can hope to derive appropriate algorithms (optimal on this class of problems)
The Bayesian Way
Assume something about how the data is generated
Consider an algorithm specifically tuned to this property
Prove that under this assumption the algorithm does well
Most results go in this direction (sometimes in a subtle way)
Bayesian algorithms
Most minimax results are of this form:
inf_{g_n} sup_{P ∈ P} [ L(g_n) − inf_g L(g) ]
Seems reasonable and useful for understanding, but does not provide guarantees
The Worst Case Way
Assume nothing about the data (distribution-free)
Restrict your objectives
Derive an algorithm that reaches this objective no matter what the data is:
inf_{g_n} sup_P [ L(g_n) − inf_{g ∈ G} L(g) ]
Gives guarantees
In between: adaptation (the two criteria are contrasted below)
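Spelled out side by side (a reconstruction of the two displayed criteria; \mathcal{P} is the restricted class of distributions from the previous slide and \mathcal{G} the restricted comparison class here):

```latex
% Bayesian / minimax way: restrict the class of distributions
\inf_{\{g_n\}} \; \sup_{P \in \mathcal{P}} \; \Big( L(g_n) - \inf_{g} L(g) \Big)

% Worst-case way: keep all distributions, restrict the comparison class
\inf_{\{g_n\}} \; \sup_{P} \; \Big( L(g_n) - \inf_{g \in \mathcal{G}} L(g) \Big)
```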
Does this help practically?
We can probably come up with algorithms that work well on most real-world problems
If we have a characterization of these problems, we can even prove something about such algorithms
However, there is no guarantee that a new problem will satisfy this characterization
So there cannot be a formal proof that an algorithm is good or bad
If theory cannot help, what can we do?
Essentially a matter of finding an algorithm that implements the right notion of smoothness for the problem at hand (a small sketch follows this slide)
More an art than a science!
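As a toy illustration of "the right notion of smoothness" (all choices below are illustrative, not the talk's): in a k-nearest-neighbor rule the parameter k sets a smoothness scale, and a simple held-out split is one pragmatic way to pick it for the problem at hand.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
x = rng.uniform(0, 1, n)
# A wiggly target with 10% label noise: too much smoothing washes it out,
# too little smoothing fits the noise.
y = ((np.sin(20 * x) > 0) ^ (rng.uniform(0, 1, n) < 0.1)).astype(int)

def knn_err(k, x_tr, y_tr, x_va, y_va):
    idx = np.argsort(np.abs(x_tr[None, :] - x_va[:, None]), axis=1)[:, :k]
    pred = (y_tr[idx].mean(axis=1) > 0.5).astype(int)
    return np.mean(pred != y_va)

half = n // 2
for k in [1, 5, 25, 125, 625]:      # small k: wiggly rule; large k: smooth rule
    err = knn_err(k, x[:half], y[:half], x[half:], y[half:])
    print(f"k={k:4d}  validation error={err:.3f}")
```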
Priors
Algorithm design is composed of two steps
Choosing a preference
The first step is based on knowledge of the problem; this is where guidance (but no theory) is needed.
Exploiting it for inference
The second step can possibly be formalized (optimality with respect to assumptions). The main issue is computational cost.
Why can algorithms fail in practice?
1 Data representation (inappropriate features, errors, ...)
2 Data scarcity (not enough data samples)
3 Data overload (too many variables, too much noise)
4 Lack of understanding of the result (validation impossible) / lack of validation data
Examples
Forgetting to remove the output variable (or a version of it): the algorithm picks it up (a sketch of this failure mode follows this slide)
An irrelevant variable happens to be discriminative (e.g. the date of sample collection)
An error in a measurement (a misalignment in the database)
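A hedged sketch of the first example (the data and the stump learner are invented for illustration): a copy of the output variable is left among the features, and even a trivial learner immediately "discovers" it, producing a deceptively perfect fit.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1000
signal = rng.uniform(0, 1, n)
y = (signal + 0.3 * rng.normal(size=n) > 0.5).astype(int)
leak = y.copy()                      # "a version of" the output left in the features
X = np.column_stack([signal, leak])

def stump_errors(X, y):
    """Best threshold-rule training error for each feature."""
    errs = []
    for j in range(X.shape[1]):
        ts = np.quantile(X[:, j], np.linspace(0, 1, 21))
        errs.append(min(np.mean((X[:, j] > t) != y) for t in ts))
    return errs

errs = stump_errors(X, y)
print("training error per feature:", errs)      # the leaked column looks "perfect"
print("picked feature:", int(np.argmin(errs)))  # 1 = the leaked column
```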
So, what would be helpful?
Flexible ways to incorporate knowledge/expertise
Provide tools that allow prior knowledge to be formulated in a natural way
Look for other types of prior assumptions that occur in various problems (e.g. manifold structure, clusteredness, analogy...)
Ability to understand what is found by the algorithm (need a language to interact with experts)
Investigate how to improve understandability (simpler models, separate models and language for interaction...)
Improve interaction (understand the user's intent)
Computationally efficient algorithms
Scalability, anytime behavior
Incorporate time complexity in the theoretical analysis (trade complexity for accuracy)
References
L. Devroye: Necessary and Sufficient Conditions for the Almost Everywhere Convergence of Nearest Neighbor Regression Function Estimates. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 61: 467-481 (1982)
D. Wolpert: The Lack of A Priori Distinctions Between Learning Algorithms. Neural Computation 8 (1996)
L. Devroye, L. Györfi and G. Lugosi: A Probabilistic Theory of Pattern Recognition. Springer (1996)