Introduction to Machine Learning

  1. 1. Introduction to Machine Learning Bernhard Schölkopf Empirical Inference Department Max Planck Institute for Intelligent Systems Tübingen, Germany http://www.tuebingen.mpg.de/bs 1
  2. 2. Empirical Inference • Drawing conclusions from empirical data (observations, measurements) • Example 1: scientific inference [figure: data points (x, y) fitted by a linear law y = a*x and by a kernel expansion y = Σi ai k(x,xi) + b] Leibniz, Weyl, Chaitin 2
  3. 3. Empirical Inference• Drawing conclusions from empirical data (observations, measurements)• Example 1: scientific inference “If your experiment needs statistics [inference], you ought to have done a better experiment.” (Rutherford) 3
  4. 4. Empirical Inference, II• Example 2: perception “The brain is nothing but a statistical decision organ” (H. Barlow) 4
  5. 5. Hard Inference Problems Sonnenburg, Rätsch, Schäfer, Schölkopf, 2006, Journal of Machine Learning Research Task: classify human DNA sequence locations into {acceptor splice site, decoy} using 15 million sequences of length 141 and a multiple-kernel Support Vector Machine. PRC = Precision-Recall Curve, fraction of correct positive predictions among all positively predicted cases • High dimensionality – consider many factors simultaneously to find the regularity • Complex regularities – nonlinear, nonstationary, etc. • Little prior knowledge – e.g., no mechanistic models for the data • Need large data sets – processing requires computers and automatic inference methods 5
  6. 6. Hard Inference Problems, II • We can solve scientific inference problems that humans can’t solve • Even if it’s just because of data set size / dimensionality, this is a quantum leap 6
  7. 7. Generalization (thanks to O. Bousquet) • observe 1, 2, 4, 7,.. • What's next? +1 +2 +3 • 1,2,4,7,11,16,…: a(n+1)=a(n)+n ("lazy caterer's sequence") • 1,2,4,7,12,20,…: a(n+2)=a(n+1)+a(n)+1 • 1,2,4,7,13,24,…: "Tribonacci"-sequence • 1,2,4,7,14,28: set of divisors of 28 • 1,2,4,7,1,1,5,…: decimal expansions of π=3.14159… and e=2.718… interleaved • The On-Line Encyclopedia of Integer Sequences: >600 hits… 7
  8. 8. Generalization, II• Question: which continuation is correct (“generalizes”)?• Answer: there’s no way to tell (“induction problem”)• Question of statistical learning theory: how to come up with a law that is (probably) correct (“demarcation problem”) (more accurately: a law that is probably as correct on the test data as it is on the training data) 8
  9. 9. 2-class classification Learn based on m observations generated from some Goal: minimize expected error (“risk”) V. Vapnik Problem: P is unknown. Induction principle: minimize training error (“empirical risk”) over some class of functions. Q: is this “consistent”? 9
  10. 10. The law of large numbers For all and Does this imply “consistency” of empirical risk minimization (optimality in the limit)? No – need a uniform law of large numbers: For all 10
  11. 11. Consistency and uniform convergence
  12. 12. -> LaTeX 12 Bernhard Schölkopf Empirical Inference Department Tübingen , 03 October 2011
  13. 13. Support Vector Machines [figure: two classes mapped by Φ into a feature space F, where k(x,x') = <Φ(x),Φ(x')>] • sparse expansion of solution in terms of SVs (Boser, Guyon, Vapnik 1992): representer theorem (Kimeldorf & Wahba 1971, Schölkopf et al. 2000) • unique solution found by convex QP Bernhard Schölkopf, 03 October 2011 13
  14. 14. Support Vector Machines [figure: two classes mapped by Φ into a feature space F, where k(x,x') = <Φ(x),Φ(x')>] • sparse expansion of solution in terms of SVs (Boser, Guyon, Vapnik 1992): representer theorem (Kimeldorf & Wahba 1971, Schölkopf et al. 2000) • unique solution found by convex QP Bernhard Schölkopf, 03 October 2011 14
  15. 15. Applications in Computational Geometry / GraphicsSteinke, Walder, Blanz et al.,Eurographics’05, ’06, ‘08, ICML ’05,‘08,NIPS ’07 15
  16. 16. Max-Planck-Institut fürbiologische KybernetikBernhard Schölkopf, Tübingen, FIFA World Cup – Germany vs. England, June 27, 20103. Oktober 2011 16
  17. 17. Kernel Quiz 17 Bernhard Schölkopf Empirical Inference Department Tübingen , 03 October 2011
  18. 18. Kernel Methods Bernhard Schölkopf, Max Planck Institute for Intelligent Systems B. Schölkopf, MLSS France 2011
  19. 19. Statistical Learning Theory1. started by Vapnik and Chervonenkis in the Sixties2. model: we observe data generated by an unknown stochastic regularity3. learning = extraction of the regularity from the data4. the analysis of the learning problem leads to notions of capacity of the function classes that a learning machine can implement.5. support vector machines use a particular type of function class: classifiers with large “margins” in a feature space induced by a kernel. [47, 48] B. Sch¨lkopf, MLSS France 2011 o
  20. 20. Example: Regression Estimation y x• Data: input-output pairs (xi, yi) ∈ R × R• Regularity: (x1, y1), . . . (xm, ym) drawn from P(x, y)• Learning: choose a function f : R → R such that the error, averaged over P, is minimized.• Problem: P is unknown, so the average cannot be computed — need an “induction principle”
  21. 21. Pattern Recognition Learn f : X → {±1} from examples (x1, y1), . . . , (xm, ym) ∈ X × {±1}, generated i.i.d. from P(x, y), such that the expected misclassification error on a test set, also drawn from P(x, y), R[f] = ∫ (1/2)|f(x) − y| dP(x, y), is minimal (Risk Minimization (RM)). Problem: P is unknown. −→ need an induction principle. Empirical risk minimization (ERM): replace the average over P(x, y) by an average over the training sample, i.e. minimize the training error Remp[f] = (1/m) Σ_{i=1}^m (1/2)|f(xi) − yi| B. Schölkopf, MLSS France 2011
  22. 22. Convergence of Means to ExpectationsLaw of large numbers: Remp[f ] → R[f ]as m → ∞.Does this imply that empirical risk minimization will give us theoptimal result in the limit of infinite sample size (“consistency”of empirical risk minimization)?No.Need a uniform version of the law of large numbers. Uniform overall functions that the learning machine can implement. B. Sch¨lkopf, MLSS France 2011 o
  23. 23. Consistency and Uniform Convergence [figure: risk R[f] and empirical risk Remp[f] plotted over the function class, with the minimizers fopt of R and fm of Remp] B. Schölkopf, MLSS France 2011
  24. 24. The Importance of the Set of Functions What about allowing all functions from X to {±1}? Training set (x1, y1), . . . , (xm, ym) ∈ X × {±1}, test patterns x̄1, . . . , x̄m̄ ∈ X such that {x̄1, . . . , x̄m̄} ∩ {x1, . . . , xm} = {}. For any f there exists f* s.t.: 1. f*(xi) = f(xi) for all i, 2. f*(x̄j) ≠ f(x̄j) for all j. Based on the training set alone, there is no means of choosing which one is better. On the test set, however, they give opposite results. There is 'no free lunch' [24, 56]. −→ a restriction must be placed on the functions that we allow B. Schölkopf, MLSS France 2011
  25. 25. Restricting the Class of FunctionsTwo views:1. Statistical Learning (VC) Theory: take into account the ca-pacity of the class of functions that the learning machine canimplement2. The Bayesian Way: place Prior distributions P(f ) over theclass of functions B. Sch¨lkopf, MLSS France 2011 o
  26. 26. Detailed Analysis • loss ξi := (1/2)|f(xi) − yi| in {0, 1} • the ξi are independent Bernoulli trials • empirical mean (1/m) Σ_{i=1}^m ξi (by def: equals Remp[f]) • expected value E[ξ] (equals R[f]) B. Schölkopf, MLSS France 2011
  27. 27. Chernoff's Bound P{ |(1/m) Σ_{i=1}^m ξi − E[ξ]| ≥ ε } ≤ 2 exp(−2mε²) • here, P refers to the probability of getting a sample ξ1, . . . , ξm with the property |(1/m) Σ_{i=1}^m ξi − E[ξ]| ≥ ε (P is a product measure) Useful corollary: Given a 2m-sample of Bernoulli trials, we have P{ |(1/m) Σ_{i=1}^m ξi − (1/m) Σ_{i=m+1}^{2m} ξi| ≥ ε } ≤ 4 exp(−mε²/2). B. Schölkopf, MLSS France 2011
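As a quick illustration of how this bound is typically used (a sketch, not from the slides), one can evaluate it numerically or invert it to get the sample size needed for a given tolerance ε and confidence δ:

```python
import math

def chernoff_bound(m, eps):
    """Upper bound on P(|empirical mean - expectation| >= eps) for m Bernoulli trials."""
    return 2 * math.exp(-2 * m * eps**2)

def samples_needed(eps, delta):
    """Smallest m such that the Chernoff bound 2*exp(-2*m*eps^2) drops below delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps**2))

print(chernoff_bound(m=1000, eps=0.05))      # roughly 0.013
print(samples_needed(eps=0.05, delta=0.01))  # 1060
```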
  28. 28. Chernoff ’s Bound, IITranslate this back into machine learning terminology: the prob-ability of obtaining an m-sample where the training error and testerror differ by more than ǫ > 0 is bounded by P Remp[f ] − R[f ] ≥ ǫ ≤ 2 exp(−2mǫ2).• refers to one fixed f• not allowed to look at the data before choosing f , hence not suitable as a bound on the test error of a learning algorithm using empirical risk minimization B. Sch¨lkopf, MLSS France 2011 o
  29. 29. Uniform Convergence (Vapnik & Chervonenkis)Necessary and sufficient conditions for nontrivial consistency ofempirical risk minimization (ERM):One-sided convergence, uniformly over all functions that can beimplemented by the learning machine. lim P { sup (R[f ] − Remp[f ]) > ǫ} = 0 m→∞ f ∈Ffor all ǫ > 0.• note that this takes into account the whole set of functions that can be implemented by the learning machine• this is hard to check for a learning machineAre there properties of learning machines (≡ sets of functions)which ensure uniform convergence of risk? B. Sch¨lkopf, MLSS France 2011 o
  30. 30. How to Prove a VC BoundTake a closer look at P{supf ∈F (R[f ] − Remp[f ]) > ǫ}.Plan:• if the function class F contains only one function, then Cher- noff’s bound suffices: P{ sup (R[f ] − Remp[f ]) > ǫ} ≤ 2 exp(−2mǫ2). f ∈F• if there are finitely many functions, we use the ’union bound’• even if there are infinitely many, then on any finite sample there are effectively only finitely many (use symmetrization and capacity concepts) B. Sch¨lkopf, MLSS France 2011 o
  31. 31. The Case of Two Functions Suppose F = {f1, f2}. Rewrite P{ sup_{f∈F} (R[f] − Remp[f]) > ε} = P(C_ε^1 ∪ C_ε^2), where C_ε^i := {(x1, y1), . . . , (xm, ym) | (R[fi] − Remp[fi]) > ε} denotes the event that the risks of fi differ by more than ε. The RHS equals P(C_ε^1 ∪ C_ε^2) = P(C_ε^1) + P(C_ε^2) − P(C_ε^1 ∩ C_ε^2) ≤ P(C_ε^1) + P(C_ε^2). Hence by Chernoff's bound P{ sup_{f∈F} (R[f] − Remp[f]) > ε} ≤ P(C_ε^1) + P(C_ε^2) ≤ 2 · 2 exp(−2mε²).
  32. 32. The Union Bound Similarly, if F = {f1, . . . , fn}, we have P{ sup_{f∈F} (R[f] − Remp[f]) > ε} = P(C_ε^1 ∪ · · · ∪ C_ε^n), and P(C_ε^1 ∪ · · · ∪ C_ε^n) ≤ Σ_{i=1}^n P(C_ε^i). Use Chernoff for each summand, to get an extra factor n in the bound. Note: this becomes an equality if and only if all the events C_ε^i involved are disjoint. B. Schölkopf, MLSS France 2011
  33. 33. Infinite Function Classes• Note: empirical risk only refers to m points. On these points, the functions of F can take at most 2m values• for Remp, the function class thus “looks” finite• how about R?• need to use a trick B. Sch¨lkopf, MLSS France 2011 o
  34. 34. Symmetrization Lemma 1 (Vapnik & Chervonenkis (e.g., [46, 12])) For mε² ≥ 2 we have P{ sup_{f∈F} (R[f] − Remp[f]) > ε} ≤ 2 P{ sup_{f∈F} (Remp[f] − R′emp[f]) > ε/2} Here, the first P refers to the distribution of iid samples of size m, while the second one refers to iid samples of size 2m. In the latter case, Remp measures the loss on the first half of the sample, and R′emp on the second half. B. Schölkopf, MLSS France 2011
  35. 35. Shattering Coefficient • Hence, we only need to consider the maximum size of F on 2m points. Call it N(F, 2m). • N(F, 2m) = max. number of different outputs (y1, . . . , y2m) that the function class can generate on 2m points — in other words, the max. number of different ways the function class can separate 2m points into two classes. • N(F, 2m) ≤ 2^{2m} • if N(F, 2m) = 2^{2m}, then the function class is said to shatter 2m points. B. Schölkopf, MLSS France 2011
  36. 36. Putting Everything Together We now use (1) symmetrization, (2) the shattering coefficient, and (3) the union bound, to get P{sup_{f∈F}(R[f] − Remp[f]) > ε} ≤ 2 P{sup_{f∈F}(Remp[f] − R′emp[f]) > ε/2} = 2 P{(Remp[f1] − R′emp[f1]) > ε/2 ∨ . . . ∨ (Remp[f_{N(F,2m)}] − R′emp[f_{N(F,2m)}]) > ε/2} ≤ 2 Σ_{n=1}^{N(F,2m)} P{(Remp[fn] − R′emp[fn]) > ε/2}. B. Schölkopf, MLSS France 2011
  37. 37. ctd. Use Chernoff's bound for each term:* P{ (1/m) Σ_{i=1}^m ξi − (1/m) Σ_{i=m+1}^{2m} ξi ≥ ε } ≤ 2 exp(−mε²/2). This yields P{ sup_{f∈F} (R[f] − Remp[f]) > ε} ≤ 4 N(F, 2m) exp(−mε²/8). • provided that N(F, 2m) does not grow exponentially in m, this is nontrivial • such bounds are called VC type inequalities • two types of randomness: (1) the P refers to the drawing of the training examples, and (2) R[f] is an expectation over the drawing of test examples. *Note that the fi depend on the 2m-sample. A rigorous treatment would need to use a second randomization over permutations of the 2m-sample, see [36].
  38. 38. Confidence Intervals Rewrite the bound: specify the probability with which we want R to be close to Remp, and solve for ε: With a probability of at least 1 − δ, R[f] ≤ Remp[f] + √( (8/m) (ln N(F, 2m) + ln(4/δ)) ). This bound holds independent of f; in particular, it holds for the function f_m minimizing the empirical risk. B. Schölkopf, MLSS France 2011
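As a numerical illustration (a sketch, not from the slides), the confidence term can be evaluated directly once ln N(F, 2m) is known or bounded, e.g. via the growth-function bound that appears on slide 42:

```python
import math

def vc_confidence_term(m, log_shatter_2m, delta):
    """Epsilon such that R[f] <= Remp[f] + epsilon holds with prob. >= 1 - delta,
    obtained by solving 4 * N(F, 2m) * exp(-m * eps^2 / 8) = delta for eps.
    log_shatter_2m = ln N(F, 2m)."""
    return math.sqrt(8.0 / m * (log_shatter_2m + math.log(4.0 / delta)))

# e.g. a class of VC dimension h = 10: ln N(F, 2m) <= h * (ln(2m/h) + 1)
m, h, delta = 10000, 10, 0.05
log_N = h * (math.log(2 * m / h) + 1)
print(vc_confidence_term(m, log_N, delta))  # roughly 0.27
```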
  39. 39. Discussion• tighter bounds are available (better constants etc.)• cannot minimize the bound over f• other capacity concepts can be used B. Sch¨lkopf, MLSS France 2011 o
  40. 40. VC Entropy On an example (x, y), f causes a loss ξ(x, y, f(x)) = (1/2)|f(x) − y| ∈ {0, 1}. For a larger sample (x1, y1), . . . , (xm, ym), the different functions f ∈ F lead to a set of loss vectors ξ_f = (ξ(x1, y1, f(x1)), . . . , ξ(xm, ym, f(xm))), whose cardinality we denote by N(F, (x1, y1), . . . , (xm, ym)). The VC entropy is defined as H_F(m) = E[ln N(F, (x1, y1), . . . , (xm, ym))], where the expectation is taken over the random generation of the m-sample (x1, y1), . . . , (xm, ym) from P. H_F(m)/m → 0 ⇐⇒ uniform convergence of risks (hence consistency)
  41. 41. Further PR Capacity Concepts • exchange 'E' and 'ln': annealed entropy. H_F^ann(m)/m → 0 ⇐⇒ exponentially fast uniform convergence • take 'max' instead of 'E': growth function. Note that G_F(m) = ln N(F, m). G_F(m)/m → 0 ⇐⇒ exponential convergence for all underlying distributions P. G_F(m) = m · ln(2) for all m ⇐⇒ for any m, all loss vectors can be generated, i.e., the m points can be chosen such that by using functions of the learning machine, they can be separated in all 2^m possible ways (shattered). B. Schölkopf, MLSS France 2011
  42. 42. Structure of the Growth Function Either G_F(m) = m · ln(2) for all m ∈ N, or there exists some maximal m for which the above is possible. Call this number the VC-dimension, and denote it by h. For m > h, G_F(m) ≤ h (ln(m/h) + 1). Nothing "in between" linear growth and logarithmic growth is possible. B. Schölkopf, MLSS France 2011
  43. 43. VC-Dimension: Example Half-spaces in R²: f(x, y) = sgn(a + bx + cy), with parameters a, b, c ∈ R • Clearly, we can shatter three non-collinear points. • But we can never shatter four points. • Hence the VC dimension is h = 3 (in this case, equal to the number of parameters) [figure: three points in the plane separated in all 2³ possible ways by half-spaces] B. Schölkopf, MLSS France 2011
  44. 44. A Typical Bound for Pattern Recognition For any f ∈ F and m > h, with a probability of at least 1 − δ, R[f] ≤ Remp[f] + √( (h (log(2m/h) + 1) − log(δ/4)) / m ) holds. • does this mean that we can learn anything? • The study of the consistency of ERM has thus led to concepts and results which let us formulate another induction principle (structural risk minimization) B. Schölkopf, MLSS France 2011
  45. 45. SRM [figure: bound on test error = training error + capacity term, plotted against the VC dimension h of the nested structure S_{n−1} ⊂ S_n ⊂ S_{n+1}; the bound is minimized at R(f*)] B. Schölkopf, MLSS France 2011
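A minimal sketch of the structural-risk-minimization rule suggested by this picture, reusing the bound from the previous slide; the VC dimensions and training errors below are invented illustration values, not results from the slides:

```python
import math

def vc_bound(remp, h, m, delta=0.05):
    """Training error plus the VC capacity term of slide 44."""
    return remp + math.sqrt((h * (math.log(2 * m / h) + 1) - math.log(delta / 4)) / m)

m = 2000
# nested structure S1 < S2 < ...: growing VC dimension, shrinking training error (illustrative numbers)
structure = [(5, 0.20), (20, 0.10), (80, 0.05), (320, 0.02)]  # (h, Remp)
best_h, best_remp = min(structure, key=lambda s: vc_bound(s[1], s[0], m))
print("chosen element of the structure has VC dimension", best_h)
```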
  46. 46. Finding a Good Function Class• recall: separating hyperplanes in R2 have a VC dimension of 3.• more generally: separating hyperplanes in RN have a VC di- mension of N + 1.• hence: separating hyperplanes in high-dimensional feature spaces have extremely large VC dimension, and may not gener- alize well• however, margin hyperplanes can still have a small VC dimen- sion B. Sch¨lkopf, MLSS France 2011 o
  47. 47. Kernels and Feature SpacesPreprocess the data with Φ:X → H x → Φ(x),where H is a dot product space, and learn the mapping from Φ(x)to y [6].• usually, dim(X) ≪ dim(H)• “Curse of Dimensionality”?• crucial issue: capacity, not dimensionality B. Sch¨lkopf, MLSS France 2011 o
  48. 48. Example: All Degree 2 Monomials Φ : R² → R³, (x1, x2) → (z1, z2, z3) := (x1², √2 x1x2, x2²) [figure: data that is not linearly separable in R² becomes linearly separable in the monomial feature space] B. Schölkopf, MLSS France 2011
  49. 49. General Product Feature Space How about patterns x ∈ R^N and product features of order d? Here, dim(H) grows like N^d. E.g. N = 16 × 16, and d = 5 −→ dimension 10^10 B. Schölkopf, MLSS France 2011
  50. 50. The Kernel Trick, N = d = 2 ⟨Φ(x), Φ(x′)⟩ = (x1², √2 x1x2, x2²)(x1′², √2 x1′x2′, x2′²)⊤ = ⟨x, x′⟩² =: k(x, x′) −→ the dot product in H can be computed in R² B. Schölkopf, MLSS France 2011
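A quick numerical check of this identity (a sketch with random test points):

```python
import numpy as np

def phi(x):
    """Degree-2 monomial feature map from the previous slides: R^2 -> R^3."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)

lhs = phi(x) @ phi(xp)       # dot product computed in the feature space H
rhs = (x @ xp) ** 2          # kernel evaluated directly in input space R^2
print(np.isclose(lhs, rhs))  # True
```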
  51. 51. The Kernel Trick, II More generally: x, x′ ∈ R^N, d ∈ N: ⟨x, x′⟩^d = ( Σ_{j=1}^N xj · xj′ )^d = Σ_{j1,...,jd=1}^N xj1 · · · xjd · xj1′ · · · xjd′ = ⟨Φ(x), Φ(x′)⟩, where Φ maps into the space spanned by all ordered products of d input directions B. Schölkopf, MLSS France 2011
  52. 52. Mercer's Theorem If k is a continuous kernel of a positive definite integral operator on L2(X) (where X is some compact space), i.e. ∫_X k(x, x′) f(x) f(x′) dx dx′ ≥ 0, it can be expanded as k(x, x′) = Σ_{i=1}^∞ λi ψi(x) ψi(x′) using eigenfunctions ψi and eigenvalues λi ≥ 0 [30]. B. Schölkopf, MLSS France 2011
  53. 53. The Mercer Feature Map In that case Φ(x) := (√λ1 ψ1(x), √λ2 ψ2(x), . . .)⊤ satisfies ⟨Φ(x), Φ(x′)⟩ = k(x, x′). Proof: ⟨Φ(x), Φ(x′)⟩ = ⟨(√λ1 ψ1(x), √λ2 ψ2(x), . . .)⊤, (√λ1 ψ1(x′), √λ2 ψ2(x′), . . .)⊤⟩ = Σ_{i=1}^∞ λi ψi(x) ψi(x′) = k(x, x′) B. Schölkopf, MLSS France 2011
  54. 54. Positive Definite Kernels It can be shown that the admissible class of kernels coincides with the one of positive definite (pd) kernels: kernels which are symmetric (i.e., k(x, x′) = k(x′, x)), and for • any set of training points x1, . . . , xm ∈ X and • any a1, . . . , am ∈ R satisfy Σ_{i,j} ai aj Kij ≥ 0, where Kij := k(xi, xj). K is called the Gram matrix or kernel matrix. If for pairwise distinct points, Σ_{i,j} ai aj Kij = 0 =⇒ a = 0, call it strictly positive definite. B. Schölkopf, MLSS France 2011
  55. 55. The Kernel Trick — Summary • any algorithm that only depends on dot products can benefit from the kernel trick • this way, we can apply linear methods to vectorial as well as non-vectorial data • think of the kernel as a nonlinear similarity measure • examples of common kernels: Polynomial k(x, x′) = (⟨x, x′⟩ + c)^d, Gaussian k(x, x′) = exp(−‖x − x′‖²/(2σ²)) • Kernels are also known as covariance functions [54, 52, 55, 29] B. Schölkopf, MLSS France 2011
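As a small illustration (kernel parameters chosen arbitrarily), the two kernels above and the positive semidefiniteness of the resulting Gram matrix can be checked numerically:

```python
import numpy as np

def polynomial_kernel(X, Xp, c=1.0, d=3):
    """k(x, x') = (<x, x'> + c)^d for all pairs of rows of X and Xp."""
    return (X @ Xp.T + c) ** d

def gaussian_kernel(X, Xp, sigma=1.0):
    """k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - Xp[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

X = np.random.default_rng(0).normal(size=(50, 5))
K = gaussian_kernel(X, X)
# a pd kernel gives a positive semidefinite Gram matrix (up to numerical error)
print(np.min(np.linalg.eigvalsh(K)) > -1e-10)  # True
```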
  56. 56. Properties of PD Kernels, 1 Assumption: Φ maps X into a dot product space H; x, x′ ∈ X Kernels from Feature Maps. k(x, x′) := ⟨Φ(x), Φ(x′)⟩ is a pd kernel on X × X. Kernels from Feature Maps, II K(A, B) := Σ_{x∈A, x′∈B} k(x, x′), where A, B are finite subsets of X, is also a pd kernel (Hint: use the feature map Φ̃(A) := Σ_{x∈A} Φ(x)) B. Schölkopf, MLSS France 2011
  57. 57. Properties of PD Kernels, 2 [36, 39]Assumption: k, k1, k2, . . . are pd; x, x′ ∈ Xk(x, x) ≥ 0 for all x (Positivity on the Diagonal)k(x, x′)2 ≤ k(x, x)k(x′, x′) (Cauchy-Schwarz Inequality)(Hint: compute the determinant of the Gram matrix)k(x, x) = 0 for all x =⇒ k(x, x′) = 0 for all x, x′ (Vanishing Diagonals)The following kernels are pd: • αk, provided α ≥ 0 • k1 + k2 • k(x, x′) := limn→∞ kn(x, x′), provided it exists • k1 · k2 • tensor products, direct sums, convolutions [22] B. Sch¨lkopf, MLSS France 2011 o
  58. 58. The Feature Space for PD Kernels [4, 1, 35] • define a feature map Φ : X → R^X, x → k(., x). E.g., for the Gaussian kernel: [figure: each point x is mapped to the Gaussian bump Φ(x) = k(., x) centred at x] Next steps: • turn Φ(X) into a linear space • endow it with a dot product satisfying ⟨Φ(x), Φ(x′)⟩ = k(x, x′), i.e., ⟨k(., x), k(., x′)⟩ = k(x, x′) • complete the space to get a reproducing kernel Hilbert space B. Schölkopf, MLSS France 2011
  59. 59. Turn it Into a Linear Space Form linear combinations f(.) = Σ_{i=1}^m αi k(., xi), g(.) = Σ_{j=1}^{m′} βj k(., xj′) (m, m′ ∈ N, αi, βj ∈ R, xi, xj′ ∈ X). B. Schölkopf, MLSS France 2011
  60. 60. Endow it With a Dot Product ⟨f, g⟩ := Σ_{i=1}^m Σ_{j=1}^{m′} αi βj k(xi, xj′) = Σ_{i=1}^m αi g(xi) = Σ_{j=1}^{m′} βj f(xj′) • This is well-defined, symmetric, and bilinear (more later). • So far, it also works for non-pd kernels B. Schölkopf, MLSS France 2011
  61. 61. The Reproducing Kernel PropertyTwo special cases: • Assume f (.) = k(., x). In this case, we have k(., x), g = g(x). • If moreover g(.) = k(., x′), we have k(., x), k(., x′ ) = k(x, x′).k is called a reproducing kernel(up to here, have not used positive definiteness) B. Sch¨lkopf, MLSS France 2011 o
  62. 62. Endow it With a Dot Product, II • It can be shown that ⟨., .⟩ is a p.d. kernel on the set of functions {f(.) = Σ_{i=1}^m αi k(., xi) | αi ∈ R, xi ∈ X}: Σ_{ij} γi γj ⟨fi, fj⟩ = ⟨Σ_i γi fi, Σ_j γj fj⟩ =: ⟨f, f⟩ = ⟨Σ_i αi k(., xi), Σ_i αi k(., xi)⟩ = Σ_{ij} αi αj k(xi, xj) ≥ 0 • furthermore, it is strictly positive definite: f(x)² = ⟨f, k(., x)⟩² ≤ ⟨f, f⟩ ⟨k(., x), k(., x)⟩, hence ⟨f, f⟩ = 0 implies f = 0. • Complete the space in the corresponding norm to get a Hilbert space Hk.
  63. 63. The Empirical Kernel MapRecall the feature map Φ : X → RX x → k(., x).• each point is represented by its similarity to all other points• how about representing it by its similarity to a sample of points?Consider Φm : X → Rm x → k(., x)|(x1 ,...,xm) = (k(x1, x), . . . , k(xm, x))⊤ B. Sch¨lkopf, MLSS France 2011 o
  64. 64. ctd. • Φm(x1), . . . , Φm(xm) contain all necessary information about Φ(x1), . . . , Φ(xm) • the Gram matrix Gij := ⟨Φm(xi), Φm(xj)⟩ satisfies G = K² where Kij = k(xi, xj) • modify Φm to Φw_m : X → R^m, x → K^{−1/2} (k(x1, x), . . . , k(xm, x))⊤ • this "whitened" map ("kernel PCA map") satisfies ⟨Φw_m(xi), Φw_m(xj)⟩ = k(xi, xj) for all i, j = 1, . . . , m. B. Schölkopf, MLSS France 2011
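A small numerical sketch of this whitened map (not from the slides; it assumes the Gram matrix is strictly positive definite so that K^{−1/2} exists, otherwise a pseudo-inverse would be needed):

```python
import numpy as np

def gaussian_kernel(X, Xp, sigma=1.0):
    sq = ((X[:, None, :] - Xp[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

X = np.random.default_rng(1).normal(size=(20, 3))
K = gaussian_kernel(X, X)

# K^{-1/2} via the eigendecomposition of the symmetric Gram matrix
lam, U = np.linalg.eigh(K)
K_inv_sqrt = U @ np.diag(lam**-0.5) @ U.T

Phi_w = (K_inv_sqrt @ K).T              # row i = whitened empirical kernel map of x_i
print(np.allclose(Phi_w @ Phi_w.T, K))  # recovers the Gram matrix, as claimed on the slide
```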
  65. 65. An Example of a Kernel Algorithm Idea: classify points x := Φ(x) in feature space according to which of the two class means is closer. c+ := (1/m+) Σ_{yi=+1} Φ(xi), c− := (1/m−) Σ_{yi=−1} Φ(xi) [figure: the two class means c+ and c−, their midpoint c, and the decision direction w] Compute the sign of the dot product between w := c+ − c− and x − c. B. Schölkopf, MLSS France 2011
  66. 66. An Example of a Kernel Algorithm, ctd. [36] f(x) = sgn( (1/m+) Σ_{i:yi=+1} ⟨Φ(x), Φ(xi)⟩ − (1/m−) Σ_{i:yi=−1} ⟨Φ(x), Φ(xi)⟩ + b ) = sgn( (1/m+) Σ_{i:yi=+1} k(x, xi) − (1/m−) Σ_{i:yi=−1} k(x, xi) + b ), where b = (1/2) ( (1/m−²) Σ_{(i,j):yi=yj=−1} k(xi, xj) − (1/m+²) Σ_{(i,j):yi=yj=+1} k(xi, xj) ). • provides a geometric interpretation of Parzen windows B. Schölkopf, MLSS France 2011
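This classifier translates almost line by line into code; the sketch below assumes a Gaussian kernel and made-up toy data:

```python
import numpy as np

def gaussian_kernel(X, Xp, sigma=1.0):
    sq = ((X[:, None, :] - Xp[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

def mean_classifier(Xtrain, y, Xtest, kernel=gaussian_kernel):
    """Classify by the closer class mean in feature space (formula of slide 66)."""
    pos, neg = Xtrain[y == +1], Xtrain[y == -1]
    Kp, Kn = kernel(Xtest, pos), kernel(Xtest, neg)            # test-vs-class kernel values
    b = 0.5 * (kernel(neg, neg).mean() - kernel(pos, pos).mean())
    return np.sign(Kp.mean(axis=1) - Kn.mean(axis=1) + b)

rng = np.random.default_rng(0)
Xtr = np.vstack([rng.normal(+1, 1, (50, 2)), rng.normal(-1, 1, (50, 2))])
ytr = np.hstack([np.ones(50), -np.ones(50)])
print(mean_classifier(Xtr, ytr, np.array([[2.0, 2.0], [-2.0, -2.0]])))  # [ 1. -1.]
```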
  67. 67. An Example of a Kernel Algorithm, ctd.• Demo• Exercise: derive the Parzen windows classifier by computing the distance criterion directly• SVMs (ppt) B. Sch¨lkopf, MLSS France 2011 o
  68. 68. An example of a kernel algorithm, revisited [figure: class means µ(X) and µ(Y) in the RKHS, with decision direction w] X compact subset of a separable metric space, m, n ∈ N. Positive class X := {x1, . . . , xm} ⊂ X, negative class Y := {y1, . . . , yn} ⊂ X. RKHS means µ(X) = (1/m) Σ_{i=1}^m k(xi, ·), µ(Y) = (1/n) Σ_{i=1}^n k(yi, ·). Get a problem if µ(X) = µ(Y)! B. Schölkopf, MLSS France 2011
  69. 69. When do the means coincide?k(x, x′) = x, x′ : the means coincidek(x, x′) = ( x, x′ + 1)d: all empirical moments up to order d coincidek strictly pd: X =Y.The mean “remembers” each point that contributed to it. B. Sch¨lkopf, MLSS France 2011 o
  70. 70. Proposition 2 Assume X, Y are defined as above, k is strictly pd, and for all i ≠ j, xi ≠ xj and yi ≠ yj. If for some αi, βj ∈ R − {0}, we have Σ_{i=1}^m αi k(xi, .) = Σ_{j=1}^n βj k(yj, .), (1) then X = Y. B. Schölkopf, MLSS France 2011
  71. 71. Proof (by contradiction) W.l.o.g., assume that x1 ∉ Y. Subtract Σ_{j=1}^n βj k(yj, .) from (1), and make it a sum over pairwise distinct points, to get 0 = Σ_i γi k(zi, .), where z1 = x1, γ1 = α1 ≠ 0, and z2, · · · ∈ X ∪ Y − {x1}, γ2, · · · ∈ R. Take the RKHS dot product with Σ_j γj k(zj, .) to get 0 = Σ_{ij} γi γj k(zi, zj), with γ ≠ 0, hence k cannot be strictly pd. B. Schölkopf, MLSS France 2011
  72. 72. The mean map µ : X = (x1, . . . , xm) → (1/m) Σ_{i=1}^m k(xi, ·) satisfies ⟨µ(X), f⟩ = (1/m) Σ_{i=1}^m ⟨k(xi, ·), f⟩ = (1/m) Σ_{i=1}^m f(xi) and ‖µ(X) − µ(Y)‖ = sup_{‖f‖≤1} |⟨µ(X) − µ(Y), f⟩| = sup_{‖f‖≤1} | (1/m) Σ_{i=1}^m f(xi) − (1/n) Σ_{i=1}^n f(yi) |. Note: Large distance = can find a function distinguishing the samples B. Schölkopf, MLSS France 2011
  73. 73. Witness function f = (µ(X) − µ(Y)) / ‖µ(X) − µ(Y)‖, thus f(x) ∝ ⟨µ(X) − µ(Y), k(x, .)⟩: [figure: witness function f for Gauss and Laplace data, plotted together with the two probability densities] This function is in the RKHS of a Gaussian kernel, but not in the RKHS of the linear kernel. B. Schölkopf, MLSS France 2011
  74. 74. The mean map for measures p, q Borel probability measures, Ex,x′∼p[k(x, x′)], Ex,x′∼q[k(x, x′)] < ∞ (‖k(x, .)‖ ≤ M < ∞ is sufficient) Define µ : p → Ex∼p[k(x, ·)]. Note ⟨µ(p), f⟩ = Ex∼p[f(x)] and ‖µ(p) − µ(q)‖ = sup_{‖f‖≤1} |Ex∼p[f(x)] − Ex∼q[f(x)]|. Recall that in the finite sample case, for strictly p.d. kernels, µ was injective — how about now? [43, 17] B. Schölkopf, MLSS France 2011
  75. 75. Theorem 3 [15, 13] p = q ⇐⇒ sup Ex∼p(f (x)) − Ex∼q (f (x)) = 0, f ∈C(X)where C(X) is the space of continuous bounded functions onX.Combine this with µ(p) − µ(q) = sup Ex∼p[f (x)] − Ex∼q [f (x)] . f ≤1Replace C(X) by the unit ball in an RKHS that is dense in C(X)— universal kernel [45], e.g., Gaussian.Theorem 4 [19] If k is universal, then p = q ⇐⇒ µ(p) − µ(q) = 0. B. Sch¨lkopf, MLSS France 2011 o
  76. 76. • µ is invertible on its image M = {µ(p) | p is a probability distribution} (the "marginal polytope", [53]) • generalization of the moment generating function of a RV x with distribution p: Mp(·) = Ex∼p[e^⟨x, ·⟩]. This provides us with a convenient metric on probability distributions, which can be used to check whether two distributions are different — provided that µ is invertible. B. Schölkopf, MLSS France 2011
  77. 77. Fourier Criterion Assume we have densities, the kernel is shift invariant (k(x, y) = k(x − y)), and all Fourier transforms below exist. Note that µ is invertible iff ∫ k(x − y) p(y) dy = ∫ k(x − y) q(y) dy =⇒ p = q, i.e., k̂ (p̂ − q̂) = 0 =⇒ p = q (Sriperumbudur et al., 2008) E.g., µ is invertible if k̂ has full support. Restricting the class of distributions, weaker conditions suffice (e.g., if k̂ has non-empty interior, µ is invertible for all distributions with compact support). B. Schölkopf, MLSS France 2011
  78. 78. Fourier Optics Application: p source of incoherent light, I indicator of a finite aperture. In Fraunhofer diffraction, the intensity image is ∝ p ∗ |Î|². Set k̂ = |Î|², then this equals µ(p). This k̂ does not have full support, thus the imaging process is not invertible for the class of all light sources (Abbe), but it is if we restrict the class (e.g., to compact support). B. Schölkopf, MLSS France 2011
  79. 79. Application 1: Two-sample problem [19] X, Y i.i.d. m-samples from p, q, respectively. ‖µ(p) − µ(q)‖² = Ex,x′∼p[k(x, x′)] − 2 Ex∼p,y∼q[k(x, y)] + Ey,y′∼q[k(y, y′)] = Ex,x′∼p,y,y′∼q[h((x, y), (x′, y′))] with h((x, y), (x′, y′)) := k(x, x′) − k(x, y′) − k(y, x′) + k(y, y′). Define D(p, q)² := Ex,x′∼p,y,y′∼q[h((x, y), (x′, y′))] and D̂(X, Y)² := 1/(m(m−1)) Σ_{i≠j} h((xi, yi), (xj, yj)). D̂(X, Y)² is an unbiased estimator of D(p, q)². It's easy to compute, and works on structured data. B. Schölkopf, MLSS France 2011
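A small sketch of the unbiased estimator D̂(X, Y)² for equal sample sizes, with a Gaussian kernel chosen purely for illustration:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

def mmd2_unbiased(X, Y, kernel=gaussian_kernel):
    """Unbiased estimate of D(p,q)^2 = ||mu(p) - mu(q)||^2 (slide 79), equal m-samples."""
    m = len(X)
    Kxx, Kyy, Kxy = kernel(X, X), kernel(Y, Y), kernel(X, Y)
    # sum of h((x_i, y_i), (x_j, y_j)) over i != j, divided by m(m-1)
    h_sum = (Kxx.sum() - np.trace(Kxx)) + (Kyy.sum() - np.trace(Kyy)) \
            - 2 * (Kxy.sum() - np.trace(Kxy))
    return h_sum / (m * (m - 1))

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(0, 1, (200, 1)), rng.normal(0, 1, (200, 1)))
diff = mmd2_unbiased(rng.normal(0, 1, (200, 1)), rng.normal(1, 1, (200, 1)))
print(same, diff)  # the second value should be clearly larger
```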
  80. 80. Theorem 5 Assume k is bounded. D̂(X, Y)² converges to D(p, q)² in probability with rate O(m^{−1/2}). This could be used as a basis for a test, but uniform convergence bounds are often loose. Theorem 6 We assume E[h²] < ∞. When p ≠ q, then √m (D̂(X, Y)² − D(p, q)²) converges in distribution to a zero mean Gaussian with variance σu² = 4 ( Ez[(Ez′ h(z, z′))²] − (Ez,z′[h(z, z′)])² ). When p = q, then m (D̂(X, Y)² − D(p, q)²) = m D̂(X, Y)² converges in distribution to Σ_{l=1}^∞ λl (ql² − 2), (2) where ql ∼ N(0, 2) i.i.d., λi are the solutions to the eigenvalue equation ∫_X k̃(x, x′) ψi(x) dp(x) = λi ψi(x′), and k̃(xi, xj) := k(xi, xj) − Ex k(xi, x) − Ex k(x, xj) + Ex,x′ k(x, x′) is the centred RKHS kernel. B. Schölkopf, MLSS France 2011
  81. 81. Application 2: Dependence MeasuresAssume that (x, y) are drawn from pxy , with marginals px, py .Want to know whether pxy factorizes.[2, 16]: kernel generalized variance[20, 21]: kernel constrained covariance, HSICMain idea [25, 34]:x and y independent ⇐⇒ ∀ bounded continuous functions f, g,we have Cov(f (x), g(y)) = 0. B. Sch¨lkopf, MLSS France 2011 o
  82. 82. k kernel on X × Y. µ(pxy) := E(x,y)∼pxy [k((x, y), ·)], µ(px × py) := Ex∼px,y∼py [k((x, y), ·)]. Use ∆ := ‖µ(pxy) − µ(px × py)‖ as a measure of dependence. For k((x, y), (x′, y′)) = kx(x, x′) ky(y, y′): ∆² equals the Hilbert-Schmidt norm of the covariance operator between the two RKHSs (HSIC), with empirical estimate m^{−2} tr(H Kx H Ky), where H = I − (1/m) 1 1⊤ [20, 44]. B. Schölkopf, MLSS France 2011
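A minimal sketch of the empirical HSIC estimate above, assuming Gaussian kernels on both variables (kernel choices and bandwidths are arbitrary here):

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

def hsic(x, y, sigma=1.0):
    """Empirical HSIC estimate m^{-2} tr(H Kx H Ky) from the slide above."""
    m = len(x)
    H = np.eye(m) - np.ones((m, m)) / m   # centering matrix
    Kx = gaussian_kernel(x.reshape(-1, 1), sigma)
    Ky = gaussian_kernel(y.reshape(-1, 1), sigma)
    return np.trace(H @ Kx @ H @ Ky) / m**2

rng = np.random.default_rng(0)
x = rng.normal(size=500)
print(hsic(x, rng.normal(size=500)))                 # near 0: independent
print(hsic(x, x**2 + 0.1 * rng.normal(size=500)))    # clearly larger: dependent
```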
  83. 83. Witness function of the equivalent optimisation problem: [figure: dependence witness and sample, shown as a contour plot of the witness function over the (X, Y) plane] Application: learning causal structures (Sun et al., ICML 2007; Fukumizu et al., NIPS 2007) B. Schölkopf, MLSS France 2011
  84. 84. Application 3: Covariate Shift Correction and Local Learning training set X = {(x1, y1), . . . , (xm, ym)} drawn from p, test set X′ = {(x1′, y1′), . . . , (xn′, yn′)} drawn from p′ ≠ p. Assume py|x = p′y|x. [40]: reweight training set B. Schölkopf, MLSS France 2011
  85. 85. Minimize ‖ Σ_{i=1}^m βi k(xi, ·) − µ(X′) ‖² + λ‖β‖² subject to βi ≥ 0, Σ_i βi = 1. Equivalent QP: minimize (1/2) β⊤(K + λ1) β − β⊤l subject to βi ≥ 0 and Σ_i βi = 1, where Kij := k(xi, xj), li = ⟨k(xi, ·), µ(X′)⟩. Experiments show that in underspecified situations (e.g., large kernel widths), this helps [23]. X′ = {x′} leads to a local sample weighting scheme. B. Schölkopf, MLSS France 2011
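A rough sketch of this reweighting scheme, solving the constrained problem with a generic optimizer rather than a dedicated QP package; the Gaussian kernel, λ value, and toy data are illustrative assumptions, not the authors' setup:

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(X, Y, sigma=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

def kernel_mean_matching(Xtr, Xte, lam=0.01, sigma=1.0):
    """Weights beta minimizing ||sum_i beta_i k(x_i,.) - mu(X')||^2 + lam ||beta||^2
    with beta_i >= 0 and sum_i beta_i = 1 (slide 85)."""
    m = len(Xtr)
    K = gaussian_kernel(Xtr, Xtr, sigma)
    l = gaussian_kernel(Xtr, Xte, sigma).mean(axis=1)   # l_i = <k(x_i,.), mu(X')>
    obj = lambda b: 0.5 * b @ (K + lam * np.eye(m)) @ b - b @ l
    cons = [{"type": "eq", "fun": lambda b: b.sum() - 1}]
    res = minimize(obj, np.full(m, 1.0 / m), bounds=[(0, None)] * m, constraints=cons)
    return res.x

# training data from N(0,1), test data shifted to N(1,1): weights should favour larger x
rng = np.random.default_rng(0)
beta = kernel_mean_matching(rng.normal(0, 1, (100, 1)), rng.normal(1, 1, (100, 1)))
print(beta[:10].round(3))
```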
  86. 86. The Representer Theorem Theorem 7 Given: a p.d. kernel k on X × X, a training set (x1, y1), . . . , (xm, ym) ∈ X × R, a strictly monotonic increasing real-valued function Ω on [0, ∞[, and an arbitrary cost function c : (X × R²)^m → R ∪ {∞}. Any f ∈ Hk minimizing the regularized risk functional c((x1, y1, f(x1)), . . . , (xm, ym, f(xm))) + Ω(‖f‖) (3) admits a representation of the form f(.) = Σ_{i=1}^m αi k(xi, .). B. Schölkopf, MLSS France 2011
  87. 87. Remarks • significance: many learning algorithms have solutions that can be expressed as expansions in terms of the training examples • original form, with mean squared loss c((x1, y1, f(x1)), . . . , (xm, ym, f(xm))) = (1/m) Σ_{i=1}^m (yi − f(xi))², and Ω(‖f‖) = λ‖f‖² (λ > 0): [27] • generalization to non-quadratic cost functions: [10] • present form: [36] B. Schölkopf, MLSS France 2011
  88. 88. Proof Decompose f ∈ H into a part in the span of the k(xi, .) and an orthogonal one: f = Σ_i αi k(xi, .) + f⊥, where for all j, ⟨f⊥, k(xj, .)⟩ = 0. Application of f to an arbitrary training point xj yields f(xj) = ⟨f, k(xj, .)⟩ = ⟨Σ_i αi k(xi, .) + f⊥, k(xj, .)⟩ = Σ_i αi ⟨k(xi, .), k(xj, .)⟩, independent of f⊥. B. Schölkopf, MLSS France 2011
  89. 89. Proof: second part of (3) Since f⊥ is orthogonal to Σ_i αi k(xi, .), and Ω is strictly monotonic, we get Ω(‖f‖) = Ω(‖Σ_i αi k(xi, .) + f⊥‖) = Ω( √(‖Σ_i αi k(xi, .)‖² + ‖f⊥‖²) ) ≥ Ω(‖Σ_i αi k(xi, .)‖), (4) with equality occurring if and only if f⊥ = 0. Hence, any minimizer must have f⊥ = 0. Consequently, any solution takes the form f = Σ_i αi k(xi, .). B. Schölkopf, MLSS France 2011
  90. 90. Application: Support Vector Classification Here, yi ∈ {±1}. Use c((xi, yi, f(xi))_i) = (1/λ) Σ_i max(0, 1 − yi f(xi)), and the regularizer Ω(‖f‖) = ‖f‖². λ → 0 leads to the hard margin SVM B. Schölkopf, MLSS France 2011
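For illustration only, a standard library call that optimizes essentially this regularized hinge loss with a Gaussian kernel; scikit-learn's SVC is a convenience assumption here, not the slides' own implementation, and its parameter C plays roughly the role of 1/λ:

```python
import numpy as np
from sklearn.svm import SVC

# toy 2-class data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1, 1, (100, 2)), rng.normal(-1, 1, (100, 2))])
y = np.hstack([np.ones(100), -np.ones(100)])

# soft-margin SVM with a Gaussian (RBF) kernel
clf = SVC(kernel="rbf", C=10.0).fit(X, y)
# by the representer theorem, the solution is an expansion over a (sparse) subset of training points
print(len(clf.support_), "support vectors out of", len(X))
```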
  91. 91. Further Applications Bayesian MAP Estimates. Identify (3) with the negative log posterior (cf. Kimeldorf & Wahba, 1970, Poggio & Girosi, 1990), i.e. • exp(−c((xi, yi, f(xi))_i)) — likelihood of the data • exp(−Ω(‖f‖)) — prior over the set of functions; e.g., Ω(‖f‖) = λ‖f‖² — Gaussian process prior [55] with covariance function k • minimizer of (3) = MAP estimate Kernel PCA (see below) can be shown to correspond to the case of c((xi, yi, f(xi))_{i=1,...,m}) = 0 if (1/m) Σ_i (f(xi) − (1/m) Σ_j f(xj))² = 1, and ∞ otherwise, with Ω an arbitrary strictly monotonically increasing function.
  92. 92. Conclusion• the kernel corresponds to – a similarity measure for the data, or – a (linear) representation of the data, or – a hypothesis space for learning,• kernels allow the formulation of a multitude of geometrical algo- rithms (Parzen windows, 2-sample tests, SVMs, kernel PCA,...) B. Sch¨lkopf, MLSS France 2011 o
  93. 93. Kernel PCA [37] [figure: linear PCA with k(x, y) = ⟨x, y⟩ extracts a linear principal direction in R²; kernel PCA with k(x, y) = ⟨x, y⟩^d extracts features that are nonlinear in input space but correspond to linear PCA in the feature space H reached via Φ] B. Schölkopf, MLSS France 2011
  94. 94. Kernel PCA, II x1, . . . , xm ∈ X, Φ : X → H, C = (1/m) Σ_{j=1}^m Φ(xj) Φ(xj)⊤ Eigenvalue problem λV = CV = (1/m) Σ_{j=1}^m ⟨Φ(xj), V⟩ Φ(xj). For λ ≠ 0, V ∈ span{Φ(x1), . . . , Φ(xm)}, thus V = Σ_{i=1}^m αi Φ(xi), and the eigenvalue problem can be written as λ ⟨Φ(xn), V⟩ = ⟨Φ(xn), CV⟩ for all n = 1, . . . , m B. Schölkopf, MLSS France 2011
  95. 95. Kernel PCA in Dual Variables In terms of the m × m Gram matrix Kij := ⟨Φ(xi), Φ(xj)⟩ = k(xi, xj), this leads to mλKα = K²α, where α = (α1, . . . , αm)⊤. Solve mλα = Kα −→ (λn, αⁿ). ⟨Vⁿ, Vⁿ⟩ = 1 ⇐⇒ λn ⟨αⁿ, αⁿ⟩ = 1, thus divide αⁿ by √λn B. Schölkopf, MLSS France 2011
  96. 96. Feature extraction Compute projections on the Eigenvectors Vⁿ = Σ_{i=1}^m αiⁿ Φ(xi) in H: for a test point x with image Φ(x) in H we get the features ⟨Vⁿ, Φ(x)⟩ = Σ_{i=1}^m αiⁿ ⟨Φ(xi), Φ(x)⟩ = Σ_{i=1}^m αiⁿ k(xi, x) B. Schölkopf, MLSS France 2011
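A compact sketch of these dual steps (not the authors' code; the Gaussian kernel is an arbitrary choice, and the feature-space centering handled in [37] is omitted for brevity):

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

def kernel_pca(X, n_components=2, sigma=1.0):
    """Kernel PCA in dual variables (slides 94-96), returning a feature-extraction function."""
    m = len(X)
    K = gaussian_kernel(X, X, sigma)
    eigval, alpha = np.linalg.eigh(K)                 # eigenvalues of K equal m * lambda_n
    eigval, alpha = eigval[::-1], alpha[:, ::-1]      # sort in descending order
    alpha = alpha[:, :n_components] / np.sqrt(eigval[:n_components])  # enforce <V^n, V^n> = 1
    # features of a test point x: sum_i alpha_i^n k(x_i, x)
    def transform(Xtest):
        return gaussian_kernel(Xtest, X, sigma) @ alpha
    return transform

rng = np.random.default_rng(0)
transform = kernel_pca(rng.normal(size=(100, 3)))
print(transform(rng.normal(size=(5, 3))).shape)  # (5, 2)
```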
  97. 97. The Kernel PCA Map Recall Φw_m : X → R^m, x → K^{−1/2} (k(x1, x), . . . , k(xm, x))⊤ If K = UDU⊤ is K's diagonalization, then K^{−1/2} = UD^{−1/2}U⊤. Thus we have Φw_m(x) = UD^{−1/2}U⊤ (k(x1, x), . . . , k(xm, x))⊤. We can drop the leading U (since it leaves the dot product invariant) to get a map Φw_KPCA(x) = D^{−1/2}U⊤ (k(x1, x), . . . , k(xm, x))⊤. The rows of U⊤ are the eigenvectors αⁿ of K, and the entries of the diagonal matrix D^{−1/2} equal λi^{−1/2}. B. Schölkopf, MLSS France 2011
  98. 98. Toy Example with Gaussian Kernel k(x, x′) = exp(−‖x − x′‖²) B. Schölkopf, MLSS France 2011
  99. 99. Super-Resolution (Kim, Franz, & Schölkopf, 2004) [figure: comparison between different super-resolution methods — a. original image of resolution 528 × 396, b. low resolution image (264 × 198) stretched to the original scale, c. bicubic interpolation, d. supervised example-based learning based on nearest neighbor classifier, f. unsupervised KPCA reconstruction, g. enlarged portions of a-d and f (from left to right)] B. Schölkopf, MLSS France 2011
  100. 100. Support Vector Classifiers [figure: input space mapped by Φ into a feature space in which the two classes become linearly separable] [6] B. Schölkopf, MLSS France 2011
  101. 101. Separating Hyperplane [figure: points of the two classes on either side of the hyperplane {x | ⟨w, x⟩ + b = 0}, with ⟨w, x⟩ + b > 0 on one side and ⟨w, x⟩ + b < 0 on the other, w being the normal vector] B. Schölkopf, MLSS France 2011
  102. 102. Optimal Separating Hyperplane [50] [figure: the separating hyperplane {x | ⟨w, x⟩ + b = 0} with the largest margin to both classes] B. Schölkopf, MLSS France 2011
  103. 103. Eliminating the Scaling Freedom [47] Note: if c ≠ 0, then {x | ⟨w, x⟩ + b = 0} = {x | ⟨cw, x⟩ + cb = 0}. Hence (cw, cb) describes the same hyperplane as (w, b). Definition: The hyperplane is in canonical form w.r.t. X* = {x1, . . . , xr} if min_{xi∈X*} |⟨w, xi⟩ + b| = 1. B. Schölkopf, MLSS France 2011
  104. 104. Canonical Optimal Hyperplane [figure: the hyperplanes {x | ⟨w, x⟩ + b = +1}, {x | ⟨w, x⟩ + b = 0} and {x | ⟨w, x⟩ + b = −1}, with x1 on the +1 side and x2 on the −1 side] Note: ⟨w, x1⟩ + b = +1, ⟨w, x2⟩ + b = −1 =⇒ ⟨w, x1 − x2⟩ = 2 =⇒ ⟨w/‖w‖, x1 − x2⟩ = 2/‖w‖ B. Schölkopf, MLSS France 2011
  105. 105. Canonical Hyperplanes [47] Note: if c ≠ 0, then {x | ⟨w, x⟩ + b = 0} = {x | ⟨cw, x⟩ + cb = 0}. Hence (cw, cb) describes the same hyperplane as (w, b). Definition: The hyperplane is in canonical form w.r.t. X* = {x1, . . . , xr} if min_{xi∈X*} |⟨w, xi⟩ + b| = 1. Note that for canonical hyperplanes, the distance of the closest point to the hyperplane ("margin") is 1/‖w‖: min_{xi∈X*} |⟨w/‖w‖, xi⟩ + b/‖w‖| = 1/‖w‖. B. Schölkopf, MLSS France 2011
  106. 106. Theorem 8 (Vapnik [46]) Consider hyperplanes ⟨w, x⟩ = 0 where w is normalized such that they are in canonical form w.r.t. a set of points X* = {x1, . . . , xr}, i.e., min_{i=1,...,r} |⟨w, xi⟩| = 1. The set of decision functions fw(x) = sgn⟨x, w⟩ defined on X* and satisfying the constraint ‖w‖ ≤ Λ has a VC dimension satisfying h ≤ R²Λ². Here, R is the radius of the smallest sphere around the origin containing X*. B. Schölkopf, MLSS France 2011
  107. 107. [figure: points inside a sphere of radius R, separated with margins γ1 and γ2 by two hyperplanes through the origin] B. Schölkopf, MLSS France 2011
  108. 108. Proof Strategy (Gurvits, 1997) Assume that x1, . . . , xr are shattered by canonical hyperplanes with ‖w‖ ≤ Λ, i.e., for all y1, . . . , yr ∈ {±1}, yi ⟨w, xi⟩ ≥ 1 for all i = 1, . . . , r. (5) Two steps: • prove that the more points we want to shatter (5), the larger ‖Σ_{i=1}^r yi xi‖ must be • upper bound the size of ‖Σ_{i=1}^r yi xi‖ in terms of R Combining the two tells us how many points we can at most shatter. B. Schölkopf, MLSS France 2011
  109. 109. Part I Summing (5) over i = 1, . . . , r yields ⟨w, Σ_{i=1}^r yi xi⟩ ≥ r. By the Cauchy-Schwarz inequality, on the other hand, we have ⟨w, Σ_{i=1}^r yi xi⟩ ≤ ‖w‖ ‖Σ_{i=1}^r yi xi‖ ≤ Λ ‖Σ_{i=1}^r yi xi‖. Combine both: r/Λ ≤ ‖Σ_{i=1}^r yi xi‖. (6) B. Schölkopf, MLSS France 2011
