Introduction to Machine Learning
Presentation Transcript

  • Introduction to Machine Learning Bernhard Schölkopf Empirical Inference Department Max Planck Institute for Intelligent Systems Tübingen, Germany http://www.tuebingen.mpg.de/bs 1
  • Empirical Inference
    • Drawing conclusions from empirical data (observations, measurements)
    • Example 1: scientific inference, e.g. fitting $y = \sum_i a_i k(x, x_i) + b$
    [figure: scattered data points (x, y) fitted by a straight line $y = a \cdot x$]
    Leibniz, Weyl, Chaitin
  • Empirical Inference• Drawing conclusions from empirical data (observations, measurements)• Example 1: scientific inference “If your experiment needs statistics [inference], you ought to have done a better experiment.” (Rutherford) 3
  • Empirical Inference, II• Example 2: perception “The brain is nothing but a statistical decision organ” (H. Barlow) 4
  • Hard Inference Problems
    Sonnenburg, Rätsch, Schäfer, Schölkopf, 2006, Journal of Machine Learning Research
    Task: classify human DNA sequence locations into {acceptor splice site, decoy} using 15 million sequences of length 141 and a multiple-kernel Support Vector Machine. PRC = Precision-Recall Curve, fraction of correct positive predictions among all positively predicted cases
    • High dimensionality – consider many factors simultaneously to find the regularity
    • Complex regularities – nonlinear, nonstationary, etc.
    • Little prior knowledge – e.g., no mechanistic models for the data
    • Need large data sets – processing requires computers and automatic inference methods
  • Hard Inference Problems, II • We can solve scientific inference problems that humans can’t solve • Even if it’s just because of data set size / dimensionality, this is a quantum leap 6
  • Generalization (thanks to O. Bousquet)
    • observe 1, 2, 4, 7, ...
    • What's next? (differences +1, +2, +3)
    • 1, 2, 4, 7, 11, 16, ...: $a_{n+1} = a_n + n$ ("lazy caterer's sequence")
    • 1, 2, 4, 7, 12, 20, ...: $a_{n+2} = a_{n+1} + a_n + 1$
    • 1, 2, 4, 7, 13, 24, ...: "Tribonacci" sequence
    • 1, 2, 4, 7, 14, 28: set of divisors of 28
    • 1, 2, 4, 7, 1, 1, 5, ...: decimal expansions of $\pi = 3.14159\ldots$ and $e = 2.718\ldots$ interleaved
    • The On-Line Encyclopedia of Integer Sequences: >600 hits…
  • Generalization, II• Question: which continuation is correct (“generalizes”)?• Answer: there’s no way to tell (“induction problem”)• Question of statistical learning theory: how to come up with a law that is (probably) correct (“demarcation problem”) (more accurately: a law that is probably as correct on the test data as it is on the training data) 8
  • 2-class classification
    Learn $f$ based on $m$ observations $(x_1, y_1), \dots, (x_m, y_m)$ generated from some unknown distribution $P(x, y)$.
    Goal: minimize the expected error ("risk") $R[f] = \int \tfrac{1}{2}|f(x) - y|\, dP(x, y)$. (V. Vapnik)
    Problem: P is unknown.
    Induction principle: minimize the training error ("empirical risk") over some class of functions.
    Q: is this "consistent"?
  • The law of large numbers
    For all $f$ and $\epsilon > 0$: $\lim_{m\to\infty} P\{|R_{\mathrm{emp}}[f] - R[f]| > \epsilon\} = 0$.
    Does this imply "consistency" of empirical risk minimization (optimality in the limit)?
    No – need a uniform law of large numbers:
    For all $\epsilon > 0$: $\lim_{m\to\infty} P\{\sup_{f\in\mathcal{F}} |R_{\mathrm{emp}}[f] - R[f]| > \epsilon\} = 0$.
  • Consistency and uniform convergence
  • -> LaTeX 12 Bernhard Schölkopf Empirical Inference Department Tübingen , 03 October 2011
  • Support Vector Machines
    [figure: two classes mapped by $\Phi$ into a feature space, where $k(x, x') = \langle\Phi(x), \Phi(x')\rangle$]
    • sparse expansion of the solution in terms of SVs (Boser, Guyon, Vapnik 1992): representer theorem (Kimeldorf & Wahba 1971, Schölkopf et al. 2000)
    • unique solution found by convex QP
  • Applications in Computational Geometry / Graphics
    Steinke, Walder, Blanz et al., Eurographics '05, '06, '08, ICML '05, '08, NIPS '07
  • Max-Planck-Institut für biologische Kybernetik, Bernhard Schölkopf, Tübingen: FIFA World Cup – Germany vs. England, June 27, 2010 (3 October 2011)
  • Kernel Quiz 17 Bernhard Schölkopf Empirical Inference Department Tübingen , 03 October 2011
  • Kernel Methods, Bernhard Schölkopf, Max Planck Institute for Intelligent Systems (MLSS France 2011)
  • Statistical Learning Theory
    1. started by Vapnik and Chervonenkis in the Sixties
    2. model: we observe data generated by an unknown stochastic regularity
    3. learning = extraction of the regularity from the data
    4. the analysis of the learning problem leads to notions of capacity of the function classes that a learning machine can implement
    5. support vector machines use a particular type of function class: classifiers with large "margins" in a feature space induced by a kernel [47, 48]
  • Example: Regression Estimation y x• Data: input-output pairs (xi, yi) ∈ R × R• Regularity: (x1, y1), . . . (xm, ym) drawn from P(x, y)• Learning: choose a function f : R → R such that the error, averaged over P, is minimized.• Problem: P is unknown, so the average cannot be computed — need an “induction principle”
  • Pattern Recognition
    Learn $f: X \to \{\pm 1\}$ from examples $(x_1, y_1), \dots, (x_m, y_m) \in X \times \{\pm 1\}$, generated i.i.d. from $P(x, y)$, such that the expected misclassification error on a test set, also drawn from $P(x, y)$,
    $$R[f] = \int \tfrac{1}{2}|f(x) - y|\, dP(x, y),$$
    is minimal (Risk Minimization (RM)).
    Problem: P is unknown. −→ need an induction principle.
    Empirical risk minimization (ERM): replace the average over $P(x, y)$ by an average over the training sample, i.e. minimize the training error
    $$R_{\mathrm{emp}}[f] = \frac{1}{m}\sum_{i=1}^m \tfrac{1}{2}|f(x_i) - y_i|$$
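A minimal numerical sketch of empirical risk minimization over a toy function class (threshold classifiers on the real line, an arbitrary choice made for this illustration), assuming NumPy is available:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: x in R, labels in {-1, +1}, generated from a simple rule plus noise.
m = 200
x = rng.uniform(-1, 1, size=m)
y = np.where(x + 0.1 * rng.standard_normal(m) > 0, 1, -1)

def emp_risk(f, x, y):
    """R_emp[f] = (1/m) sum_i 0.5 |f(x_i) - y_i| (the training error)."""
    return np.mean(0.5 * np.abs(f(x) - y))

# A small function class F: threshold classifiers f_t(x) = sign(x - t).
thresholds = np.linspace(-1, 1, 41)
risks = [emp_risk(lambda x, t=t: np.where(x > t, 1, -1), x, y) for t in thresholds]
best = thresholds[int(np.argmin(risks))]
print(f"ERM over thresholds picks t = {best:.2f}, training error = {min(risks):.3f}")
```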
  • Convergence of Means to ExpectationsLaw of large numbers: Remp[f ] → R[f ]as m → ∞.Does this imply that empirical risk minimization will give us theoptimal result in the limit of infinite sample size (“consistency”of empirical risk minimization)?No.Need a uniform version of the law of large numbers. Uniform overall functions that the learning machine can implement. B. Sch¨lkopf, MLSS France 2011 o
  • Consistency and Uniform Convergence
    [figure: risk $R[f]$ and empirical risk $R_{\mathrm{emp}}[f]$ plotted over the function class, with the minimizers $f^{\mathrm{opt}}$ and $f^m$]
  • The Importance of the Set of Functions
    What about allowing all functions from X to {±1}?
    Training set $(x_1, y_1), \dots, (x_m, y_m) \in X \times \{\pm 1\}$, test patterns $\bar{x}_1, \dots, \bar{x}_{\bar{m}} \in X$, such that $\{\bar{x}_1, \dots, \bar{x}_{\bar{m}}\} \cap \{x_1, \dots, x_m\} = \{\}$.
    For any f there exists f* s.t.: 1. $f^*(x_i) = f(x_i)$ for all i; 2. $f^*(\bar{x}_j) = -f(\bar{x}_j)$ for all j.
    Based on the training set alone, there is no means of choosing which one is better. On the test set, however, they give opposite results. There is 'no free lunch' [24, 56].
    −→ a restriction must be placed on the functions that we allow
  • Restricting the Class of FunctionsTwo views:1. Statistical Learning (VC) Theory: take into account the ca-pacity of the class of functions that the learning machine canimplement2. The Bayesian Way: place Prior distributions P(f ) over theclass of functions B. Sch¨lkopf, MLSS France 2011 o
  • Detailed Analysis
    • loss $\xi_i := \tfrac{1}{2}|f(x_i) - y_i| \in \{0, 1\}$
    • the $\xi_i$ are independent Bernoulli trials
    • empirical mean $\frac{1}{m}\sum_{i=1}^m \xi_i$ (by definition: equals $R_{\mathrm{emp}}[f]$)
    • expected value $E[\xi]$ (equals $R[f]$)
  • Chernoff's Bound
    $$P\left\{\left|\frac{1}{m}\sum_{i=1}^m \xi_i - E[\xi]\right| \ge \epsilon\right\} \le 2\exp(-2m\epsilon^2)$$
    • here, P refers to the probability of getting a sample $\xi_1, \dots, \xi_m$ with the property $\left|\frac{1}{m}\sum_{i=1}^m \xi_i - E[\xi]\right| \ge \epsilon$ (it is a product measure)
    Useful corollary: given a 2m-sample of Bernoulli trials, we have
    $$P\left\{\left|\frac{1}{m}\sum_{i=1}^m \xi_i - \frac{1}{m}\sum_{i=m+1}^{2m} \xi_i\right| \ge \epsilon\right\} \le 4\exp\left(-\frac{m\epsilon^2}{2}\right).$$
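A quick simulation can illustrate how conservative Chernoff's bound is for one fixed f; the Bernoulli parameter and sample sizes below are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
m, eps, p, trials = 100, 0.1, 0.3, 20000

# Draw many m-samples of Bernoulli(p) losses and estimate how often the
# empirical mean deviates from E[xi] = p by at least eps.
xi = rng.random((trials, m)) < p
deviation = np.abs(xi.mean(axis=1) - p) >= eps
empirical = deviation.mean()
chernoff = 2 * np.exp(-2 * m * eps**2)
print(f"empirical P(|mean - p| >= eps) ~ {empirical:.4f}, Chernoff bound = {chernoff:.4f}")
```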
  • Chernoff's Bound, II
    Translate this back into machine learning terminology: the probability of obtaining an m-sample where the training error and test error differ by more than $\epsilon > 0$ is bounded by
    $$P\left\{\left|R_{\mathrm{emp}}[f] - R[f]\right| \ge \epsilon\right\} \le 2\exp(-2m\epsilon^2).$$
    • refers to one fixed f
    • not allowed to look at the data before choosing f, hence not suitable as a bound on the test error of a learning algorithm using empirical risk minimization
  • Uniform Convergence (Vapnik & Chervonenkis)
    Necessary and sufficient conditions for nontrivial consistency of empirical risk minimization (ERM):
    One-sided convergence, uniformly over all functions that can be implemented by the learning machine,
    $$\lim_{m\to\infty} P\{\sup_{f\in\mathcal{F}} (R[f] - R_{\mathrm{emp}}[f]) > \epsilon\} = 0$$
    for all $\epsilon > 0$.
    • note that this takes into account the whole set of functions that can be implemented by the learning machine
    • this is hard to check for a learning machine
    Are there properties of learning machines (≡ sets of functions) which ensure uniform convergence of risk?
  • How to Prove a VC BoundTake a closer look at P{supf ∈F (R[f ] − Remp[f ]) > ǫ}.Plan:• if the function class F contains only one function, then Cher- noff’s bound suffices: P{ sup (R[f ] − Remp[f ]) > ǫ} ≤ 2 exp(−2mǫ2). f ∈F• if there are finitely many functions, we use the ’union bound’• even if there are infinitely many, then on any finite sample there are effectively only finitely many (use symmetrization and capacity concepts) B. Sch¨lkopf, MLSS France 2011 o
  • The Case of Two Functions
    Suppose $\mathcal{F} = \{f_1, f_2\}$. Rewrite
    $$P\{\sup_{f\in\mathcal{F}} (R[f] - R_{\mathrm{emp}}[f]) > \epsilon\} = P(C_\epsilon^1 \cup C_\epsilon^2),$$
    where $C_\epsilon^i := \{(x_1, y_1), \dots, (x_m, y_m) \mid (R[f_i] - R_{\mathrm{emp}}[f_i]) > \epsilon\}$ denotes the event that the risks of $f_i$ differ by more than $\epsilon$.
    The RHS satisfies
    $$P(C_\epsilon^1 \cup C_\epsilon^2) = P(C_\epsilon^1) + P(C_\epsilon^2) - P(C_\epsilon^1 \cap C_\epsilon^2) \le P(C_\epsilon^1) + P(C_\epsilon^2).$$
    Hence by Chernoff's bound
    $$P\{\sup_{f\in\mathcal{F}} (R[f] - R_{\mathrm{emp}}[f]) > \epsilon\} \le P(C_\epsilon^1) + P(C_\epsilon^2) \le 2 \cdot 2\exp(-2m\epsilon^2).$$
  • The Union Bound
    Similarly, if $\mathcal{F} = \{f_1, \dots, f_n\}$, we have
    $$P\{\sup_{f\in\mathcal{F}} (R[f] - R_{\mathrm{emp}}[f]) > \epsilon\} = P(C_\epsilon^1 \cup \dots \cup C_\epsilon^n)$$
    and
    $$P(C_\epsilon^1 \cup \dots \cup C_\epsilon^n) \le \sum_{i=1}^n P(C_\epsilon^i).$$
    Use Chernoff for each summand, to get an extra factor n in the bound.
    Note: this becomes an equality if and only if all the events $C_\epsilon^i$ involved are disjoint.
  • Infinite Function Classes
    • Note: the empirical risk only refers to m points. On these points, the functions of $\mathcal{F}$ can take at most $2^m$ different values
    • for $R_{\mathrm{emp}}$, the function class thus "looks" finite
    • how about R?
    • need to use a trick
  • Symmetrization
    Lemma 1 (Vapnik & Chervonenkis (e.g., [46, 12])) For $m\epsilon^2 \ge 2$ we have
    $$P\{\sup_{f\in\mathcal{F}} (R[f] - R_{\mathrm{emp}}[f]) > \epsilon\} \le 2 P\{\sup_{f\in\mathcal{F}} (R_{\mathrm{emp}}[f] - R'_{\mathrm{emp}}[f]) > \epsilon/2\}$$
    Here, the first P refers to the distribution of iid samples of size m, while the second one refers to iid samples of size 2m. In the latter case, $R_{\mathrm{emp}}$ measures the loss on the first half of the sample, and $R'_{\mathrm{emp}}$ on the second half.
  • Shattering Coefficient
    • Hence, we only need to consider the maximum size of $\mathcal{F}$ on 2m points. Call it $N(\mathcal{F}, 2m)$.
    • $N(\mathcal{F}, 2m)$ = max. number of different outputs $(y_1, \dots, y_{2m})$ that the function class can generate on 2m points — in other words, the max. number of different ways the function class can separate 2m points into two classes.
    • $N(\mathcal{F}, 2m) \le 2^{2m}$
    • if $N(\mathcal{F}, 2m) = 2^{2m}$, then the function class is said to shatter 2m points.
  • Putting Everything Together
    We now use (1) symmetrization, (2) the shattering coefficient, and (3) the union bound, to get
    $$P\{\sup_{f\in\mathcal{F}}(R[f] - R_{\mathrm{emp}}[f]) > \epsilon\} \le 2 P\{\sup_{f\in\mathcal{F}}(R_{\mathrm{emp}}[f] - R'_{\mathrm{emp}}[f]) > \epsilon/2\}$$
    $$= 2 P\{(R_{\mathrm{emp}}[f_1] - R'_{\mathrm{emp}}[f_1]) > \epsilon/2 \ \vee \dots \vee\ (R_{\mathrm{emp}}[f_{N(\mathcal{F},2m)}] - R'_{\mathrm{emp}}[f_{N(\mathcal{F},2m)}]) > \epsilon/2\}$$
    $$\le 2 \sum_{n=1}^{N(\mathcal{F},2m)} P\{(R_{\mathrm{emp}}[f_n] - R'_{\mathrm{emp}}[f_n]) > \epsilon/2\}.$$
  • ctd.
    Use Chernoff's bound for each term:*
    $$P\left\{\left|\frac{1}{m}\sum_{i=1}^m \xi_i - \frac{1}{m}\sum_{i=m+1}^{2m} \xi_i\right| \ge \epsilon\right\} \le 2\exp\left(-\frac{m\epsilon^2}{2}\right).$$
    This yields
    $$P\{\sup_{f\in\mathcal{F}} (R[f] - R_{\mathrm{emp}}[f]) > \epsilon\} \le 4 N(\mathcal{F}, 2m) \exp\left(-\frac{m\epsilon^2}{8}\right).$$
    • provided that $N(\mathcal{F}, 2m)$ does not grow exponentially in m, this is nontrivial
    • such bounds are called VC type inequalities
    • two types of randomness: (1) the P refers to the drawing of the training examples, and (2) R[f] is an expectation over the drawing of test examples.
    * Note that the $f_i$ depend on the 2m-sample. A rigorous treatment would need to use a second randomization over permutations of the 2m-sample, see [36].
  • Confidence Intervals
    Rewrite the bound: specify the probability with which we want R to be close to $R_{\mathrm{emp}}$, and solve for $\epsilon$:
    With a probability of at least $1 - \delta$,
    $$R[f] \le R_{\mathrm{emp}}[f] + \sqrt{\frac{8}{m}\left(\ln N(\mathcal{F}, 2m) + \ln\frac{4}{\delta}\right)}.$$
    This bound holds independent of f; in particular, it holds for the function $f^m$ minimizing the empirical risk.
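A small sketch of how this confidence term behaves as the sample size grows, using the bound $\ln N(\mathcal{F}, 2m) \le h(\ln(2m/h) + 1)$ from the slides on the growth function; the numbers are example values only:

```python
import numpy as np

def vc_confidence_term(m, shatter_log, delta=0.05):
    """Capacity term sqrt((8/m) * (ln N(F,2m) + ln(4/delta))) from the slide.

    `shatter_log` stands in for ln N(F, 2m); for a class of VC dimension h and
    2m > h it can be upper-bounded by h * (ln(2m/h) + 1).
    """
    return np.sqrt(8.0 / m * (shatter_log + np.log(4.0 / delta)))

# Example: VC dimension h = 3 (half-spaces in R^2), various sample sizes.
h = 3
for m in [100, 1000, 10000]:
    log_N = h * (np.log(2 * m / h) + 1)
    print(m, round(vc_confidence_term(m, log_N), 3))
```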
  • Discussion• tighter bounds are available (better constants etc.)• cannot minimize the bound over f• other capacity concepts can be used B. Sch¨lkopf, MLSS France 2011 o
  • VC Entropy
    On an example (x, y), f causes a loss $\xi(x, y, f(x)) = \tfrac{1}{2}|f(x) - y| \in \{0, 1\}$.
    For a larger sample $(x_1, y_1), \dots, (x_m, y_m)$, the different functions $f \in \mathcal{F}$ lead to a set of loss vectors
    $$\xi_f = (\xi(x_1, y_1, f(x_1)), \dots, \xi(x_m, y_m, f(x_m))),$$
    whose cardinality we denote by $N(\mathcal{F}, (x_1, y_1), \dots, (x_m, y_m))$.
    The VC entropy is defined as
    $$H_{\mathcal{F}}(m) = E[\ln N(\mathcal{F}, (x_1, y_1), \dots, (x_m, y_m))],$$
    where the expectation is taken over the random generation of the m-sample $(x_1, y_1), \dots, (x_m, y_m)$ from P.
    $H_{\mathcal{F}}(m)/m \to 0 \iff$ uniform convergence of risks (hence consistency)
  • Further PR Capacity Concepts
    • exchange 'E' and 'ln': annealed entropy. $H^{\mathrm{ann}}_{\mathcal{F}}(m)/m \to 0 \iff$ exponentially fast uniform convergence
    • take 'max' instead of 'E': growth function. Note that $G_{\mathcal{F}}(m) = \ln N(\mathcal{F}, m)$. $G_{\mathcal{F}}(m)/m \to 0 \iff$ exponential convergence for all underlying distributions P.
    $G_{\mathcal{F}}(m) = m \cdot \ln(2)$ for all $m \iff$ for any m, all loss vectors can be generated, i.e., the m points can be chosen such that by using functions of the learning machine, they can be separated in all $2^m$ possible ways (shattered).
  • Structure of the Growth Function
    Either $G_{\mathcal{F}}(m) = m \cdot \ln(2)$ for all $m \in \mathbb{N}$,
    or there exists some maximal m for which the above is possible. Call this number the VC-dimension, and denote it by h. For $m > h$,
    $$G_{\mathcal{F}}(m) \le h\left(\ln\frac{m}{h} + 1\right).$$
    Nothing "in between" linear growth and logarithmic growth is possible.
  • VC-Dimension: Example
    Half-spaces in $\mathbb{R}^2$: $f(x, y) = \mathrm{sgn}(a + bx + cy)$, with parameters $a, b, c \in \mathbb{R}$
    • Clearly, we can shatter three non-collinear points.
    • But we can never shatter four points.
    • Hence the VC dimension is h = 3 (in this case, equal to the number of parameters)
    [figure: shaded half-planes realizing the possible labelings of three points]
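The shattering claim can be checked numerically: the sketch below solves a small feasibility LP for each of the $2^3$ labelings of three non-collinear points; the use of scipy.optimize.linprog is an assumed convenience, not something from the slides.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """Check whether sgn(a + b*x + c*y) can realize `labels` on `points`.

    Feasibility LP: find w = (a, b, c) with y_i * (a + b*x_i + c*y_i) >= 1,
    i.e. -y_i * [1, x_i, y_i] . w <= -1 for all i.
    """
    A_ub = -labels[:, None] * np.hstack([np.ones((len(points), 1)), points])
    b_ub = -np.ones(len(points))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # non-collinear
shattered = all(separable(pts, np.array(lab))
                for lab in itertools.product([-1, 1], repeat=3))
print("three non-collinear points shattered by half-spaces:", shattered)
```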
  • A Typical Bound for Pattern Recognition
    For any $f \in \mathcal{F}$ and $m > h$, with a probability of at least $1 - \delta$,
    $$R[f] \le R_{\mathrm{emp}}[f] + \sqrt{\frac{h\left(\log\frac{2m}{h} + 1\right) - \log(\delta/4)}{m}}$$
    holds.
    • does this mean that we can learn anything?
    • The study of the consistency of ERM has thus led to concepts and results which let us formulate another induction principle (structural risk minimization)
  • SRM
    [figure: test-error bound as the sum of training error and capacity term, plotted against capacity h; the bound is minimized at $R(f^*)$ over a structure of nested function classes $S_{n-1} \subset S_n \subset S_{n+1}$]
  • Finding a Good Function Class• recall: separating hyperplanes in R2 have a VC dimension of 3.• more generally: separating hyperplanes in RN have a VC di- mension of N + 1.• hence: separating hyperplanes in high-dimensional feature spaces have extremely large VC dimension, and may not gener- alize well• however, margin hyperplanes can still have a small VC dimen- sion B. Sch¨lkopf, MLSS France 2011 o
  • Kernels and Feature SpacesPreprocess the data with Φ:X → H x → Φ(x),where H is a dot product space, and learn the mapping from Φ(x)to y [6].• usually, dim(X) ≪ dim(H)• “Curse of Dimensionality”?• crucial issue: capacity, not dimensionality B. Sch¨lkopf, MLSS France 2011 o
  • Example: All Degree 2 Monomials
    $$\Phi: \mathbb{R}^2 \to \mathbb{R}^3, \quad (x_1, x_2) \mapsto (z_1, z_2, z_3) := (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$$
    [figure: data that is not linearly separable in the input space $(x_1, x_2)$ becomes linearly separable in the feature space $(z_1, z_2, z_3)$]
  • General Product Feature Space
    How about patterns $x \in \mathbb{R}^N$ and product features of order d? Here, $\dim(H)$ grows like $N^d$.
    E.g. $N = 16 \times 16$ and $d = 5$ −→ dimension $10^{10}$
  • The Kernel Trick, N = d = 2
    $$\langle\Phi(x), \Phi(x')\rangle = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)(x_1'^2, \sqrt{2}\,x_1' x_2', x_2'^2)^\top = \langle x, x'\rangle^2 =: k(x, x')$$
    −→ the dot product in H can be computed in $\mathbb{R}^2$
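The identity is easy to verify numerically; the sketch below compares the explicit degree-2 feature map with the squared dot product:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 monomial feature map R^2 -> R^3 from the slide."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(2)
x, xp = rng.standard_normal(2), rng.standard_normal(2)

explicit = phi(x) @ phi(xp)          # <Phi(x), Phi(x')> in feature space
kernel = (x @ xp) ** 2               # k(x, x') = <x, x'>^2 in input space
print(np.isclose(explicit, kernel))  # True: the kernel trick in action
```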
  • The Kernel Trick, II
    More generally: $x, x' \in \mathbb{R}^N$, $d \in \mathbb{N}$:
    $$\langle x, x'\rangle^d = \left(\sum_{j=1}^N x_j \cdot x_j'\right)^d = \sum_{j_1,\dots,j_d=1}^N x_{j_1} \cdots x_{j_d} \cdot x'_{j_1} \cdots x'_{j_d} = \langle\Phi(x), \Phi(x')\rangle,$$
    where $\Phi$ maps into the space spanned by all ordered products of d input directions
  • Mercer's Theorem
    If k is a continuous kernel of a positive definite integral operator on $L_2(X)$ (where X is some compact space),
    $$\int_X k(x, x') f(x) f(x')\, dx\, dx' \ge 0,$$
    it can be expanded as
    $$k(x, x') = \sum_{i=1}^\infty \lambda_i \psi_i(x) \psi_i(x')$$
    using eigenfunctions $\psi_i$ and eigenvalues $\lambda_i \ge 0$ [30].
  • The Mercer Feature Map
    In that case
    $$\Phi(x) := \begin{pmatrix} \sqrt{\lambda_1}\,\psi_1(x) \\ \sqrt{\lambda_2}\,\psi_2(x) \\ \vdots \end{pmatrix}$$
    satisfies $\langle\Phi(x), \Phi(x')\rangle = k(x, x')$.
    Proof:
    $$\langle\Phi(x), \Phi(x')\rangle = \left\langle \begin{pmatrix} \sqrt{\lambda_1}\,\psi_1(x) \\ \sqrt{\lambda_2}\,\psi_2(x) \\ \vdots \end{pmatrix}, \begin{pmatrix} \sqrt{\lambda_1}\,\psi_1(x') \\ \sqrt{\lambda_2}\,\psi_2(x') \\ \vdots \end{pmatrix} \right\rangle = \sum_{i=1}^\infty \lambda_i \psi_i(x) \psi_i(x') = k(x, x')$$
  • Positive Definite Kernels
    It can be shown that the admissible class of kernels coincides with the one of positive definite (pd) kernels: kernels which are symmetric (i.e., $k(x, x') = k(x', x)$), and for
    • any set of training points $x_1, \dots, x_m \in X$ and
    • any $a_1, \dots, a_m \in \mathbb{R}$
    satisfy
    $$\sum_{i,j} a_i a_j K_{ij} \ge 0, \quad \text{where } K_{ij} := k(x_i, x_j).$$
    K is called the Gram matrix or kernel matrix.
    If for pairwise distinct points, $\sum_{i,j} a_i a_j K_{ij} = 0 \implies a = 0$, call it strictly positive definite.
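For a concrete kernel, positive definiteness can be observed on the Gram matrix: its eigenvalues should all be non-negative. A sketch with a Gaussian kernel (an example choice):

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for rows of X."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 4))
K = gaussian_gram(X)

# Positive definiteness of the kernel implies the Gram matrix is positive
# semidefinite, i.e. all eigenvalues are >= 0 (up to numerical error).
eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigvals.min())
```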
  • The Kernel Trick — Summary• any algorithm that only depends on dot products can benefit from the kernel trick• this way, we can apply linear methods to vectorial as well as non-vectorial data• think of the kernel as a nonlinear similarity measure• examples of common kernels: Polynomial k(x, x′) = ( x, x′ + c)d Gaussian k(x, x′) = exp(− x − x′ 2/(2 σ 2))• Kernels are also known as covariance functions [54, 52, 55, 29] B. Sch¨lkopf, MLSS France 2011 o
  • Properties of PD Kernels, 1
    Assumption: $\Phi$ maps X into a dot product space H; $x, x' \in X$
    Kernels from Feature Maps: $k(x, x') := \langle\Phi(x), \Phi(x')\rangle$ is a pd kernel on $X \times X$.
    Kernels from Feature Maps, II: $K(A, B) := \sum_{x\in A,\, x'\in B} k(x, x')$, where A, B are finite subsets of X, is also a pd kernel
    (Hint: use the feature map $\tilde\Phi(A) := \sum_{x\in A} \Phi(x)$)
  • Properties of PD Kernels, 2 [36, 39]Assumption: k, k1, k2, . . . are pd; x, x′ ∈ Xk(x, x) ≥ 0 for all x (Positivity on the Diagonal)k(x, x′)2 ≤ k(x, x)k(x′, x′) (Cauchy-Schwarz Inequality)(Hint: compute the determinant of the Gram matrix)k(x, x) = 0 for all x =⇒ k(x, x′) = 0 for all x, x′ (Vanishing Diagonals)The following kernels are pd: • αk, provided α ≥ 0 • k1 + k2 • k(x, x′) := limn→∞ kn(x, x′), provided it exists • k1 · k2 • tensor products, direct sums, convolutions [22] B. Sch¨lkopf, MLSS France 2011 o
  • The Feature Space for PD Kernels [4, 1, 35]• define a feature map Φ : X → RX x → k(., x).E.g., for the Gaussian kernel: Φ . . x x Φ(x) Φ(x)Next steps:• turn Φ(X) into a linear space• endow it with a dot product satisfying Φ(x), Φ(x′) = k(x, x′), i.e., k(., x), k(., x′ ) = k(x, x′)• complete the space to get a reproducing kernel Hilbert space B. Sch¨lkopf, MLSS France 2011 o
  • Turn it Into a Linear Space
    Form linear combinations
    $$f(\cdot) = \sum_{i=1}^m \alpha_i k(\cdot, x_i), \qquad g(\cdot) = \sum_{j=1}^{m'} \beta_j k(\cdot, x_j')$$
    ($m, m' \in \mathbb{N}$, $\alpha_i, \beta_j \in \mathbb{R}$, $x_i, x_j' \in X$).
  • Endow it With a Dot Product
    $$\langle f, g\rangle := \sum_{i=1}^m \sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x_j') = \sum_{i=1}^m \alpha_i g(x_i) = \sum_{j=1}^{m'} \beta_j f(x_j')$$
    • This is well-defined, symmetric, and bilinear (more later).
    • So far, it also works for non-pd kernels
  • The Reproducing Kernel PropertyTwo special cases: • Assume f (.) = k(., x). In this case, we have k(., x), g = g(x). • If moreover g(.) = k(., x′), we have k(., x), k(., x′ ) = k(x, x′).k is called a reproducing kernel(up to here, have not used positive definiteness) B. Sch¨lkopf, MLSS France 2011 o
  • Endow it With a Dot Product, II
    • It can be shown that $\langle\cdot,\cdot\rangle$ is a p.d. kernel on the set of functions $\{f(\cdot) = \sum_{i=1}^m \alpha_i k(\cdot, x_i) \mid \alpha_i \in \mathbb{R}, x_i \in X\}$:
    $$\sum_{ij} \gamma_i \gamma_j \langle f_i, f_j\rangle = \left\langle \sum_i \gamma_i f_i, \sum_j \gamma_j f_j \right\rangle =: \langle f, f\rangle = \left\langle \sum_i \alpha_i k(\cdot, x_i), \sum_i \alpha_i k(\cdot, x_i) \right\rangle = \sum_{ij} \alpha_i \alpha_j k(x_i, x_j) \ge 0$$
    • furthermore, it is strictly positive definite:
    $$f(x)^2 = \langle f, k(\cdot, x)\rangle^2 \le \langle f, f\rangle \langle k(\cdot, x), k(\cdot, x)\rangle,$$
    hence $\langle f, f\rangle = 0$ implies $f = 0$.
    • Complete the space in the corresponding norm to get a Hilbert space $H_k$.
  • The Empirical Kernel MapRecall the feature map Φ : X → RX x → k(., x).• each point is represented by its similarity to all other points• how about representing it by its similarity to a sample of points?Consider Φm : X → Rm x → k(., x)|(x1 ,...,xm) = (k(x1, x), . . . , k(xm, x))⊤ B. Sch¨lkopf, MLSS France 2011 o
  • ctd.
    • $\Phi_m(x_1), \dots, \Phi_m(x_m)$ contain all necessary information about $\Phi(x_1), \dots, \Phi(x_m)$
    • the Gram matrix $G_{ij} := \langle\Phi_m(x_i), \Phi_m(x_j)\rangle$ satisfies $G = K^2$ where $K_{ij} = k(x_i, x_j)$
    • modify $\Phi_m$ to
    $$\Phi_m^w: X \to \mathbb{R}^m, \quad x \mapsto K^{-\frac{1}{2}}(k(x_1, x), \dots, k(x_m, x))^\top$$
    • this "whitened" map ("kernel PCA map") satisfies $\langle\Phi_m^w(x_i), \Phi_m^w(x_j)\rangle = k(x_i, x_j)$ for all $i, j = 1, \dots, m$.
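A sketch of this whitened map in NumPy; the pseudo-inverse style square root is a numerical safeguard added for the illustration, not part of the slides:

```python
import numpy as np

def whitened_kernel_map(K):
    """Return Phi_w(x_i) as rows, i.e. K^{-1/2} applied to the columns of K.

    Sketch of the "kernel PCA map" from the slide; near-zero eigenvalues are
    zeroed out so the inverse square root stays numerically stable.
    """
    lam, U = np.linalg.eigh(K)               # K = U diag(lam) U^T
    inv_sqrt = np.where(lam > 1e-10, 1.0 / np.sqrt(np.clip(lam, 1e-10, None)), 0.0)
    K_inv_sqrt = U @ np.diag(inv_sqrt) @ U.T
    return (K_inv_sqrt @ K).T                # row i is Phi_w(x_i)

rng = np.random.default_rng(4)
X = rng.standard_normal((30, 3))
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = np.exp(-sq / 2.0)

Phi_w = whitened_kernel_map(K)
print(np.allclose(Phi_w @ Phi_w.T, K, atol=1e-6))  # <Phi_w(x_i), Phi_w(x_j)> = k(x_i, x_j)
```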
  • An Example of a Kernel Algorithm
    Idea: classify points $\mathbf{x} := \Phi(x)$ in feature space according to which of the two class means is closer.
    $$c_+ := \frac{1}{m_+}\sum_{y_i = 1}\Phi(x_i), \qquad c_- := \frac{1}{m_-}\sum_{y_i = -1}\Phi(x_i)$$
    [figure: the two class means $c_+$, $c_-$, their midpoint c, and the vector w between them]
    Compute the sign of the dot product between $w := c_+ - c_-$ and $\mathbf{x} - c$.
  • An Example of a Kernel Algorithm, ctd. [36]
    $$f(x) = \mathrm{sgn}\left(\frac{1}{m_+}\sum_{\{i: y_i = +1\}}\langle\Phi(x), \Phi(x_i)\rangle - \frac{1}{m_-}\sum_{\{i: y_i = -1\}}\langle\Phi(x), \Phi(x_i)\rangle + b\right)$$
    $$= \mathrm{sgn}\left(\frac{1}{m_+}\sum_{\{i: y_i = +1\}} k(x, x_i) - \frac{1}{m_-}\sum_{\{i: y_i = -1\}} k(x, x_i) + b\right)$$
    where
    $$b = \frac{1}{2}\left(\frac{1}{m_-^2}\sum_{\{(i,j): y_i = y_j = -1\}} k(x_i, x_j) - \frac{1}{m_+^2}\sum_{\{(i,j): y_i = y_j = +1\}} k(x_i, x_j)\right).$$
    • provides a geometric interpretation of Parzen windows
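A direct transcription of this classifier, with a Gaussian kernel chosen as an example:

```python
import numpy as np

def class_means_classifier(X, y, X_test, sigma=1.0):
    """Kernel classifier from the slide: sign of the class-mean comparison plus offset b."""
    def k(A, B):
        sq = np.sum((A[:, None] - B[None, :]) ** 2, axis=-1)
        return np.exp(-sq / (2 * sigma**2))

    pos, neg = X[y == 1], X[y == -1]
    b = 0.5 * (k(neg, neg).mean() - k(pos, pos).mean())
    return np.sign(k(X_test, pos).mean(axis=1) - k(X_test, neg).mean(axis=1) + b)

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)
print(class_means_classifier(X, y, np.array([[1.0, 1.0], [-1.0, -1.0]])))
```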
  • An Example of a Kernel Algorithm, ctd.• Demo• Exercise: derive the Parzen windows classifier by computing the distance criterion directly• SVMs (ppt) B. Sch¨lkopf, MLSS France 2011 o
  • An example of a kernel algorithm, revisited
    [figure: the RKHS means $\mu(X)$, $\mu(Y)$ of the two samples and the vector w between them]
    X compact subset of a separable metric space, $m, n \in \mathbb{N}$.
    Positive class $X := \{x_1, \dots, x_m\} \subset \mathcal{X}$, negative class $Y := \{y_1, \dots, y_n\} \subset \mathcal{X}$.
    RKHS means $\mu(X) = \frac{1}{m}\sum_{i=1}^m k(x_i, \cdot)$, $\mu(Y) = \frac{1}{n}\sum_{i=1}^n k(y_i, \cdot)$.
    Get a problem if $\mu(X) = \mu(Y)$!
  • When do the means coincide?k(x, x′) = x, x′ : the means coincidek(x, x′) = ( x, x′ + 1)d: all empirical moments up to order d coincidek strictly pd: X =Y.The mean “remembers” each point that contributed to it. B. Sch¨lkopf, MLSS France 2011 o
  • Proposition 2 Assume X, Y are defined as above, k isstrictly pd, and for all i, j, xi = xj , and yi = yj .If for some αi, βj ∈ R − {0}, we have m n αik(xi, .) = βj k(yj , .), (1) i=1 j=1then X = Y . B. Sch¨lkopf, MLSS France 2011 o
  • Proof (by contradiction)
    W.l.o.g., assume that $x_1 \notin Y$. Subtract $\sum_{j=1}^n \beta_j k(y_j, \cdot)$ from (1), and make it a sum over pairwise distinct points, to get
    $$0 = \sum_i \gamma_i k(z_i, \cdot),$$
    where $z_1 = x_1$, $\gamma_1 = \alpha_1 \ne 0$, and $z_2, \dots \in X \cup Y - \{x_1\}$, $\gamma_2, \dots \in \mathbb{R}$.
    Take the RKHS dot product with $\sum_j \gamma_j k(z_j, \cdot)$ to get
    $$0 = \sum_{ij} \gamma_i \gamma_j k(z_i, z_j),$$
    with $\gamma \ne 0$, hence k cannot be strictly pd.
  • The mean map
    $$\mu: X = (x_1, \dots, x_m) \mapsto \frac{1}{m}\sum_{i=1}^m k(x_i, \cdot)$$
    satisfies
    $$\langle\mu(X), f\rangle = \frac{1}{m}\sum_{i=1}^m \langle k(x_i, \cdot), f\rangle = \frac{1}{m}\sum_{i=1}^m f(x_i)$$
    and
    $$\|\mu(X) - \mu(Y)\| = \sup_{\|f\|\le 1} |\langle\mu(X) - \mu(Y), f\rangle| = \sup_{\|f\|\le 1}\left|\frac{1}{m}\sum_{i=1}^m f(x_i) - \frac{1}{n}\sum_{i=1}^n f(y_i)\right|.$$
    Note: Large distance = can find a function distinguishing the samples
  • Witness function
    $f = \frac{\mu(X) - \mu(Y)}{\|\mu(X) - \mu(Y)\|}$, thus $f(x) \propto \langle\mu(X) - \mu(Y), k(x, \cdot)\rangle$:
    [figure: witness function f for Gauss and Laplace data, plotted together with the two probability densities]
    This function is in the RKHS of a Gaussian kernel, but not in the RKHS of the linear kernel.
  • The mean map for measuresp, q Borel probability measures,Ex,x′∼p[k(x, x′)], Ex,x′∼q [k(x, x′)] < ∞ ( k(x, .) ≤ M < ∞ is sufficient)Define µ : p → Ex∼p[k(x, ·)].Note µ(p), f = Ex∼p[f (x)]and µ(p) − µ(q) = sup Ex∼p[f (x)] − Ex∼q [f (x)] . f ≤1Recall that in the finite sample case, for strictly p.d. kernels, µwas injective — how about now?[43, 17] B. Sch¨lkopf, MLSS France 2011 o
  • Theorem 3 [15, 13] p = q ⇐⇒ sup Ex∼p(f (x)) − Ex∼q (f (x)) = 0, f ∈C(X)where C(X) is the space of continuous bounded functions onX.Combine this with µ(p) − µ(q) = sup Ex∼p[f (x)] − Ex∼q [f (x)] . f ≤1Replace C(X) by the unit ball in an RKHS that is dense in C(X)— universal kernel [45], e.g., Gaussian.Theorem 4 [19] If k is universal, then p = q ⇐⇒ µ(p) − µ(q) = 0. B. Sch¨lkopf, MLSS France 2011 o
  • • µ is invertible on its image $M = \{\mu(p) \mid p \text{ is a probability distribution}\}$ (the "marginal polytope", [53])
    • generalization of the moment generating function of a RV x with distribution p: $M_p(\cdot) = E_{x\sim p}\left[e^{\langle x, \cdot\rangle}\right]$.
    This provides us with a convenient metric on probability distributions, which can be used to check whether two distributions are different — provided that µ is invertible.
  • Fourier Criterion
    Assume we have densities, the kernel is shift invariant ($k(x, y) = k(x - y)$), and all Fourier transforms below exist.
    Note that µ is invertible iff
    $$\int k(x - y) p(y)\,dy = \int k(x - y) q(y)\,dy \implies p = q,$$
    i.e., $\hat{k}\,(\hat{p} - \hat{q}) = 0 \implies p = q$ (Sriperumbudur et al., 2008).
    E.g., µ is invertible if $\hat{k}$ has full support. Restricting the class of distributions, weaker conditions suffice (e.g., if $\hat{k}$ has non-empty interior, µ is invertible for all distributions with compact support).
  • Fourier Optics
    Application: p source of incoherent light, I indicator of a finite aperture. In Fraunhofer diffraction, the intensity image is $\propto p * |\hat{I}|^2$.
    Set $k = |\hat{I}|^2$; then this equals µ(p).
    This $\hat{k}$ does not have full support, thus the imaging process is not invertible for the class of all light sources (Abbe), but it is if we restrict the class (e.g., to compact support).
  • Application 1: Two-sample problem [19]
    X, Y i.i.d. m-samples from p, q, respectively.
    $$\|\mu(p) - \mu(q)\|^2 = E_{x,x'\sim p}[k(x, x')] - 2 E_{x\sim p,\, y\sim q}[k(x, y)] + E_{y,y'\sim q}[k(y, y')] = E_{x,x'\sim p,\, y,y'\sim q}[h((x, y), (x', y'))]$$
    with $h((x, y), (x', y')) := k(x, x') - k(x, y') - k(y, x') + k(y, y')$.
    Define
    $$D(p, q)^2 := E_{x,x'\sim p,\, y,y'\sim q}[h((x, y), (x', y'))], \qquad \hat{D}(X, Y)^2 := \frac{1}{m(m-1)}\sum_{i\ne j} h((x_i, y_i), (x_j, y_j)).$$
    $\hat{D}(X, Y)^2$ is an unbiased estimator of $D(p, q)^2$.
    It's easy to compute, and works on structured data.
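A sketch of the unbiased estimator $\hat{D}(X, Y)^2$; the Gaussian kernel and the toy distributions are example choices:

```python
import numpy as np

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased D_hat(X,Y)^2 = 1/(m(m-1)) * sum_{i != j} h((x_i,y_i),(x_j,y_j)).

    X, Y are (m, d) arrays of equal size; Gaussian kernel used as an example.
    """
    def k(A, B):
        sq = np.sum((A[:, None] - B[None, :]) ** 2, axis=-1)
        return np.exp(-sq / (2 * sigma**2))

    m = len(X)
    H = k(X, X) - k(X, Y) - k(Y, X) + k(Y, Y)   # H_ij = h((x_i, y_i), (x_j, y_j))
    np.fill_diagonal(H, 0.0)                     # exclude i == j terms
    return H.sum() / (m * (m - 1))

rng = np.random.default_rng(6)
same = mmd2_unbiased(rng.normal(0, 1, (200, 1)), rng.normal(0, 1, (200, 1)))
diff = mmd2_unbiased(rng.normal(0, 1, (200, 1)), rng.normal(1, 1, (200, 1)))
print(f"p = q: {same:.4f}   p != q: {diff:.4f}")  # second value should be much larger
```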
  • Theorem 5 Assume k is bounded. $\hat{D}(X, Y)^2$ converges to $D(p, q)^2$ in probability with rate $O(m^{-\frac{1}{2}})$.
    This could be used as a basis for a test, but uniform convergence bounds are often loose.
    Theorem 6 We assume $E[h^2] < \infty$. When $p \ne q$, then $\sqrt{m}\,(\hat{D}(X, Y)^2 - D(p, q)^2)$ converges in distribution to a zero mean Gaussian with variance
    $$\sigma_u^2 = 4\left(E_z\left[(E_{z'} h(z, z'))^2\right] - \left(E_{z,z'} h(z, z')\right)^2\right).$$
    When $p = q$, then $m(\hat{D}(X, Y)^2 - D(p, q)^2) = m\hat{D}(X, Y)^2$ converges in distribution to
    $$\sum_{l=1}^\infty \lambda_l (q_l^2 - 2), \qquad (2)$$
    where $q_l \sim N(0, 2)$ i.i.d., the $\lambda_i$ are the solutions to the eigenvalue equation
    $$\int_X \tilde{k}(x, x')\psi_i(x)\,dp(x) = \lambda_i \psi_i(x'),$$
    and $\tilde{k}(x_i, x_j) := k(x_i, x_j) - E_x k(x_i, x) - E_x k(x, x_j) + E_{x,x'} k(x, x')$ is the centred RKHS kernel.
  • Application 2: Dependence MeasuresAssume that (x, y) are drawn from pxy , with marginals px, py .Want to know whether pxy factorizes.[2, 16]: kernel generalized variance[20, 21]: kernel constrained covariance, HSICMain idea [25, 34]:x and y independent ⇐⇒ ∀ bounded continuous functions f, g,we have Cov(f (x), g(y)) = 0. B. Sch¨lkopf, MLSS France 2011 o
  • k kernel on $\mathcal{X} \times \mathcal{Y}$.
    $$\mu(p_{xy}) := E_{(x,y)\sim p_{xy}}[k((x, y), \cdot)], \qquad \mu(p_x \times p_y) := E_{x\sim p_x,\, y\sim p_y}[k((x, y), \cdot)].$$
    Use $\Delta := \|\mu(p_{xy}) - \mu(p_x \times p_y)\|$ as a measure of dependence.
    For $k((x, y), (x', y')) = k_x(x, x')\,k_y(y, y')$: $\Delta^2$ equals the Hilbert-Schmidt norm of the covariance operator between the two RKHSs (HSIC), with empirical estimate $m^{-2}\,\mathrm{tr}(HK_xHK_y)$, where $H = I - \frac{1}{m}\mathbf{1}\mathbf{1}^\top$ [20, 44].
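A sketch of this empirical HSIC estimate; Gaussian kernels and the toy data are example choices:

```python
import numpy as np

def hsic(X, Y, sigma=1.0):
    """Empirical HSIC estimate m^{-2} tr(H Kx H Ky) with Gaussian kernels.

    Larger values indicate stronger dependence between the samples in X and Y.
    """
    def gram(A):
        sq = np.sum((A[:, None] - A[None, :]) ** 2, axis=-1)
        return np.exp(-sq / (2 * sigma**2))

    m = len(X)
    H = np.eye(m) - np.ones((m, m)) / m          # centering matrix
    return np.trace(H @ gram(X) @ H @ gram(Y)) / m**2

rng = np.random.default_rng(7)
x = rng.standard_normal((300, 1))
print(hsic(x, x + 0.1 * rng.standard_normal((300, 1))))  # dependent: larger value
print(hsic(x, rng.standard_normal((300, 1))))            # independent: near 0
```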
  • Witness function of the equivalent optimisation problem:
    [figure: dependence witness function over the (X, Y) plane, together with the sample]
    Application: learning causal structures (Sun et al., ICML 2007; Fukumizu et al., NIPS 2007)
  • Application 3: Covariate Shift Correction and Local Learning
    training set $X = \{(x_1, y_1), \dots, (x_m, y_m)\}$ drawn from p,
    test set $X' = \{(x_1', y_1'), \dots, (x_n', y_n')\}$ from $p' \ne p$.
    Assume $p_{y|x} = p'_{y|x}$.
    [40]: reweight training set
  • Minimize
    $$\left\|\sum_{i=1}^m \beta_i k(x_i, \cdot) - \mu(X')\right\|^2 + \lambda\|\beta\|_2^2 \quad \text{subject to } \beta_i \ge 0,\ \sum_i \beta_i = 1.$$
    Equivalent QP:
    $$\text{minimize } \frac{1}{2}\beta^\top(K + \lambda\mathbf{1})\beta - \beta^\top l \quad \text{subject to } \beta_i \ge 0 \text{ and } \sum_i \beta_i = 1,$$
    where $K_{ij} := k(x_i, x_j)$, $l_i = \langle k(x_i, \cdot), \mu(X')\rangle$.
    Experiments show that in underspecified situations (e.g., large kernel widths), this helps [23].
    $X' = \{x'\}$ leads to a local sample weighting scheme.
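A sketch of this reweighting problem, solved here with a general-purpose solver (scipy.optimize.minimize, SLSQP) rather than a dedicated QP code; the Gaussian kernel width and the regularization value are arbitrary example settings:

```python
import numpy as np
from scipy.optimize import minimize

def kmm_weights(X_train, X_test, sigma=1.0, lam=1e-3):
    """Minimize 0.5 b'(K + lam I)b - b'l  s.t.  b >= 0, sum(b) = 1 (the slide's QP)."""
    def k(A, B):
        sq = np.sum((A[:, None] - B[None, :]) ** 2, axis=-1)
        return np.exp(-sq / (2 * sigma**2))

    m = len(X_train)
    K = k(X_train, X_train) + lam * np.eye(m)
    l = k(X_train, X_test).mean(axis=1)          # l_i = <k(x_i, .), mu(X')>

    obj = lambda b: 0.5 * b @ K @ b - b @ l
    res = minimize(obj, np.full(m, 1.0 / m), method="SLSQP",
                   bounds=[(0, None)] * m,
                   constraints=[{"type": "eq", "fun": lambda b: b.sum() - 1}])
    return res.x

rng = np.random.default_rng(8)
beta = kmm_weights(rng.normal(0, 1, (100, 1)), rng.normal(1, 0.5, (50, 1)))
print(beta.sum(), beta.min())  # weights sum to 1 and are non-negative
```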
  • The Representer TheoremTheorem 7 Given: a p.d. kernel k on X × X, a training set(x1, y1), . . . , (xm, ym) ∈ X × R, a strictly monotonic increasingreal-valued function Ω on [0, ∞[, and an arbitrary cost functionc : (X × R2)m → R ∪ {∞}Any f ∈ Hk minimizing the regularized risk functional c ((x1, y1, f (x1)), . . . , (xm, ym, f (xm))) + Ω ( f ) (3)admits a representation of the form m f (.) = αik(xi, .). i=1 B. Sch¨lkopf, MLSS France 2011 o
  • Remarks• significance: many learning algorithms have solutions that can be expressed as expansions in terms of the training examples• original form, with mean squared loss m 1 c((x1, y1, f (x1)), . . . , (xm, ym, f (xm))) = (yi − f (xi))2, m i=1 and Ω( f ) = λ f 2 (λ > 0): [27]• generalization to non-quadratic cost functions: [10]• present form: [36] B. Sch¨lkopf, MLSS France 2011 o
  • Proof
    Decompose $f \in H$ into a part in the span of the $k(x_i, \cdot)$ and an orthogonal one:
    $$f = \sum_i \alpha_i k(x_i, \cdot) + f_\perp, \quad \text{where for all } j:\ \langle f_\perp, k(x_j, \cdot)\rangle = 0.$$
    Application of f to an arbitrary training point $x_j$ yields
    $$f(x_j) = \langle f, k(x_j, \cdot)\rangle = \left\langle \sum_i \alpha_i k(x_i, \cdot) + f_\perp,\ k(x_j, \cdot)\right\rangle = \sum_i \alpha_i \langle k(x_i, \cdot), k(x_j, \cdot)\rangle,$$
    independent of $f_\perp$.
  • Proof: second part of (3)
    Since $f_\perp$ is orthogonal to $\sum_i \alpha_i k(x_i, \cdot)$, and $\Omega$ is strictly monotonic, we get
    $$\Omega(\|f\|) = \Omega\left(\left\|\sum_i \alpha_i k(x_i, \cdot) + f_\perp\right\|\right) = \Omega\left(\sqrt{\left\|\sum_i \alpha_i k(x_i, \cdot)\right\|^2 + \|f_\perp\|^2}\right) \ge \Omega\left(\left\|\sum_i \alpha_i k(x_i, \cdot)\right\|\right), \qquad (4)$$
    with equality occurring if and only if $f_\perp = 0$.
    Hence, any minimizer must have $f_\perp = 0$. Consequently, any solution takes the form $f = \sum_i \alpha_i k(x_i, \cdot)$.
  • Application: Support Vector Classification
    Here, $y_i \in \{\pm 1\}$. Use
    $$c((x_i, y_i, f(x_i))_i) = \frac{1}{\lambda}\sum_i \max(0, 1 - y_i f(x_i)),$$
    and the regularizer $\Omega(\|f\|) = \|f\|^2$.
    $\lambda \to 0$ leads to the hard margin SVM
  • Further ApplicationsBayesian MAP Estimates. Identify (3) with the negative logposterior (cf. Kimeldorf & Wahba, 1970, Poggio & Girosi, 1990),i.e.• exp(−c((xi, yi, f (xi))i)) — likelihood of the data• exp(−Ω( f )) — prior over the set of functions; e.g., Ω( f ) = λ f 2 — Gaussian process prior [55] with covariance function k• minimizer of (3) = MAP estimateKernel PCA (see below) can be shown to correspond to the caseof  2 1 1c((xi, yi, f (xi))i=1,...,m) = 0 if m i f (xi) − m j f (xj ) = 1   ∞ otherwisewith g an arbitrary strictly monotonically increasing function.
  • Conclusion• the kernel corresponds to – a similarity measure for the data, or – a (linear) representation of the data, or – a hypothesis space for learning,• kernels allow the formulation of a multitude of geometrical algo- rithms (Parzen windows, 2-sample tests, SVMs, kernel PCA,...) B. Sch¨lkopf, MLSS France 2011 o
  • Kernel PCA [37]
    [figure: linear PCA with $k(x, y) = \langle x, y\rangle$ in $\mathbb{R}^2$, and kernel PCA with $k(x, y) = \langle x, y\rangle^d$: nonlinear contours of constant principal-component value in input space correspond to linear PCA in the feature space H reached via $\Phi$]
  • Kernel PCA, II
    $x_1, \dots, x_m \in X$, $\Phi: X \to H$,
    $$C = \frac{1}{m}\sum_{j=1}^m \Phi(x_j)\Phi(x_j)^\top$$
    Eigenvalue problem
    $$\lambda V = CV = \frac{1}{m}\sum_{j=1}^m \langle\Phi(x_j), V\rangle\,\Phi(x_j).$$
    For $\lambda \ne 0$, $V \in \mathrm{span}\{\Phi(x_1), \dots, \Phi(x_m)\}$, thus
    $$V = \sum_{i=1}^m \alpha_i \Phi(x_i),$$
    and the eigenvalue problem can be written as
    $$\lambda\langle\Phi(x_n), V\rangle = \langle\Phi(x_n), CV\rangle \quad \text{for all } n = 1, \dots, m$$
  • Kernel PCA in Dual Variables
    In terms of the $m \times m$ Gram matrix $K_{ij} := \langle\Phi(x_i), \Phi(x_j)\rangle = k(x_i, x_j)$, this leads to
    $$m\lambda K\alpha = K^2\alpha$$
    where $\alpha = (\alpha_1, \dots, \alpha_m)^\top$.
    Solve
    $$m\lambda\alpha = K\alpha$$
    $\to (\lambda_n, \alpha^n)$; $\langle V^n, V^n\rangle = 1 \iff \lambda_n\langle\alpha^n, \alpha^n\rangle = 1$, thus divide $\alpha^n$ by $\sqrt{\lambda_n}$
  • Feature extraction
    Compute projections on the eigenvectors
    $$V^n = \sum_{i=1}^m \alpha_i^n \Phi(x_i)$$
    in H: for a test point x with image $\Phi(x)$ in H we get the features
    $$\langle V^n, \Phi(x)\rangle = \sum_{i=1}^m \alpha_i^n \langle\Phi(x_i), \Phi(x)\rangle = \sum_{i=1}^m \alpha_i^n k(x_i, x)$$
  • The Kernel PCA Map
    Recall
    $$\Phi_m^w: X \to \mathbb{R}^m, \quad x \mapsto K^{-\frac{1}{2}}(k(x_1, x), \dots, k(x_m, x))^\top$$
    If $K = UDU^\top$ is K's diagonalization, then $K^{-1/2} = UD^{-1/2}U^\top$. Thus we have
    $$\Phi_m^w(x) = UD^{-1/2}U^\top(k(x_1, x), \dots, k(x_m, x))^\top.$$
    We can drop the leading U (since it leaves the dot product invariant) to get a map
    $$\Phi^w_{\mathrm{KPCA}}(x) = D^{-1/2}U^\top(k(x_1, x), \dots, k(x_m, x))^\top.$$
    The rows of $U^\top$ are the eigenvectors $\alpha^n$ of K, and the entries of the diagonal matrix $D^{-1/2}$ equal $\lambda_i^{-1/2}$.
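A compact sketch of kernel PCA feature extraction in these dual variables (Gram-matrix centering is omitted here for brevity):

```python
import numpy as np

def kpca_features(K, K_test, n_components=2):
    """Kernel PCA projections <V^n, Phi(x)> = sum_i alpha_i^n k(x_i, x).

    K is the (m, m) Gram matrix on the training points, K_test is (n_test, m)
    with entries k(x_i, x) for the test points.
    """
    lam, U = np.linalg.eigh(K)                   # ascending eigenvalues of K
    lam, U = lam[::-1], U[:, ::-1]               # largest first
    alphas = U[:, :n_components] / np.sqrt(lam[:n_components])  # so that <V^n, V^n> = 1
    return K_test @ alphas                       # features sum_i alpha_i^n k(x_i, x)

rng = np.random.default_rng(9)
X = rng.standard_normal((60, 2))
sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
K = np.exp(-sq / 2.0)
print(kpca_features(K, K[:5]).shape)             # (5, 2): two nonlinear components
```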
  • Toy Example with Gaussian Kernelk(x, x′) = exp − x − x′ 2 B. Sch¨lkopf, MLSS France 2011 o
  • Super-Resolution (Kim, Franz, & Sch¨lkopf, 2004) oa. original image of resolution b. low resolution image (264 × c. bicubic interpolation d. supervised example-based f. unsupervised KPCA recon-528 × 396 198) stretched to the original learning based on nearest neigh- struction scale bor classifier g. enlarged portions of a-d, and f (from left to right)Comparison between different super-resolution methods. B. Sch¨lkopf, MLSS France 2011 o
  • Support Vector Classifiers
    [figure: training points of two classes in input space, mapped by $\Phi$ into a feature space where they are separated by a hyperplane] [6]
  • Separating Hyperplane
    [figure: two classes separated by the hyperplane $\{x \mid \langle w, x\rangle + b = 0\}$, with $\langle w, x\rangle + b > 0$ on one side and $\langle w, x\rangle + b < 0$ on the other]
  • Optimal Separating Hyperplane [50]
    [figure: the separating hyperplane $\{x \mid \langle w, x\rangle + b = 0\}$ with maximal margin to both classes]
  • Eliminating the Scaling Freedom [47]
    Note: if $c \ne 0$, then $\{x \mid \langle w, x\rangle + b = 0\} = \{x \mid \langle cw, x\rangle + cb = 0\}$. Hence (cw, cb) describes the same hyperplane as (w, b).
    Definition: The hyperplane is in canonical form w.r.t. $X^* = \{x_1, \dots, x_r\}$ if $\min_{x_i \in X^*} |\langle w, x_i\rangle + b| = 1$.
  • Canonical Optimal Hyperplane
    [figure: the hyperplanes $\{x \mid \langle w, x\rangle + b = +1\}$ and $\{x \mid \langle w, x\rangle + b = -1\}$ on either side of $\{x \mid \langle w, x\rangle + b = 0\}$]
    Note: for points $x_1$ (with $y_1 = +1$) and $x_2$ (with $y_2 = -1$) on the two margin hyperplanes,
    $$\langle w, x_1\rangle + b = +1, \quad \langle w, x_2\rangle + b = -1 \implies \langle w, (x_1 - x_2)\rangle = 2 \implies \left\langle \frac{w}{\|w\|}, (x_1 - x_2)\right\rangle = \frac{2}{\|w\|}$$
  • Canonical Hyperplanes [47]
    Note: if $c \ne 0$, then $\{x \mid \langle w, x\rangle + b = 0\} = \{x \mid \langle cw, x\rangle + cb = 0\}$. Hence (cw, cb) describes the same hyperplane as (w, b).
    Definition: The hyperplane is in canonical form w.r.t. $X^* = \{x_1, \dots, x_r\}$ if $\min_{x_i \in X^*} |\langle w, x_i\rangle + b| = 1$.
    Note that for canonical hyperplanes, the distance of the closest point to the hyperplane ("margin") is $1/\|w\|$:
    $$\min_{x_i \in X^*} \left|\left\langle \frac{w}{\|w\|}, x_i\right\rangle + \frac{b}{\|w\|}\right| = \frac{1}{\|w\|}.$$
  • Theorem 8 (Vapnik [46]) Consider hyperplanes w, x = 0where w is normalized such that they are in canonical formw.r.t. a set of points X ∗ = {x1, . . . , xr }, i.e., min | w, xi | = 1. i=1,...,rThe set of decision functions fw(x) = sgn x, w defined onX ∗ and satisfying the constraint w ≤ Λ has a VC dimensionsatisfying h ≤ R2Λ2.Here, R is the radius of the smallest sphere around the origincontaining X ∗. B. Sch¨lkopf, MLSS France 2011 o
  • [figure: points contained in a sphere of radius R around the origin, separated by hyperplanes with margins $\gamma_1$ and $\gamma_2$]
  • Proof Strategy (Gurvits, 1997)
    Assume that $x_1, \dots, x_r$ are shattered by canonical hyperplanes with $\|w\| \le \Lambda$, i.e., for all $y_1, \dots, y_r \in \{\pm 1\}$,
    $$y_i\langle w, x_i\rangle \ge 1 \quad \text{for all } i = 1, \dots, r. \qquad (5)$$
    Two steps:
    • prove that the more points we want to shatter (5), the larger $\left\|\sum_{i=1}^r y_i x_i\right\|$ must be
    • upper bound the size of $\left\|\sum_{i=1}^r y_i x_i\right\|$ in terms of R
    Combining the two tells us how many points we can at most shatter.
  • Part I
    Summing (5) over $i = 1, \dots, r$ yields
    $$\left\langle w, \sum_{i=1}^r y_i x_i\right\rangle \ge r.$$
    By the Cauchy-Schwarz inequality, on the other hand, we have
    $$\left\langle w, \sum_{i=1}^r y_i x_i\right\rangle \le \|w\|\left\|\sum_{i=1}^r y_i x_i\right\| \le \Lambda\left\|\sum_{i=1}^r y_i x_i\right\|.$$
    Combine both:
    $$\frac{r}{\Lambda} \le \left\|\sum_{i=1}^r y_i x_i\right\|. \qquad (6)$$
  • Part II
    Consider independent random labels $y_i \in \{\pm 1\}$, uniformly distributed (Rademacher variables).
    $$E\left[\left\|\sum_{i=1}^r y_i x_i\right\|^2\right] = E\left[\left\langle\sum_{i=1}^r y_i x_i, \sum_{j=1}^r y_j x_j\right\rangle\right] = E\left[\sum_{i=1}^r \left\langle y_i x_i,\ \sum_{j\ne i} y_j x_j + y_i x_i\right\rangle\right]$$
    $$= \sum_{i=1}^r \left(\sum_{j\ne i} E[\langle y_i x_i, y_j x_j\rangle] + E[\langle y_i x_i, y_i x_i\rangle]\right) = \sum_{i=1}^r E\left[\|y_i x_i\|^2\right] = \sum_{i=1}^r \|x_i\|^2$$
  • Part II, ctd.
    Since $\|x_i\| \le R$, we get
    $$E\left[\left\|\sum_{i=1}^r y_i x_i\right\|^2\right] \le rR^2.$$
    • This holds for the expectation over the random choices of the labels, hence there must be at least one set of labels for which it also holds true. Use this set.
    Hence
    $$\left\|\sum_{i=1}^r y_i x_i\right\|^2 \le rR^2.$$
  • Part I and II Combined
    Part I: $\left(\frac{r}{\Lambda}\right)^2 \le \left\|\sum_{i=1}^r y_i x_i\right\|^2$
    Part II: $\left\|\sum_{i=1}^r y_i x_i\right\|^2 \le rR^2$
    Hence
    $$\frac{r^2}{\Lambda^2} \le rR^2, \quad \text{i.e.,} \quad r \le R^2\Lambda^2,$$
    completing the proof.
  • Pattern Noise as Maximum Margin Regularization
    [figure: two classes separated with margin $\rho$; each training point is surrounded by a noise ball of radius r that must not cross the decision boundary]
  • Maximum Margin vs. MDL — 2D Case
    [figure: separating direction at angle $\gamma$, data within radius R, margin $\rho$]
    Can perturb $\gamma$ by $\Delta\gamma$ with $|\Delta\gamma| < \arcsin\frac{\rho}{R}$ and still correctly separate the data.
    Hence we only need to store $\gamma$ with accuracy $\Delta\gamma$ [36, 51].
  • Formulation as an Optimization Problem
    Hyperplane with maximum margin: minimize $\|w\|^2$ (recall: margin $\sim 1/\|w\|$) subject to
    $$y_i \cdot [\langle w, x_i\rangle + b] \ge 1 \quad \text{for } i = 1, \dots, m$$
    (i.e. the training data are separated correctly).
  • Lagrange Function (e.g., [5])
    Introduce Lagrange multipliers $\alpha_i \ge 0$ and a Lagrangian
    $$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^m \alpha_i\left(y_i \cdot [\langle w, x_i\rangle + b] - 1\right).$$
    L has to be minimized w.r.t. the primal variables w and b and maximized with respect to the dual variables $\alpha_i$
    • if a constraint is violated, then $y_i \cdot (\langle w, x_i\rangle + b) - 1 < 0$ −→
      · $\alpha_i$ will grow to increase L — how far?
      · w, b want to decrease L; i.e. they have to change such that the constraint is satisfied. If the problem is separable, this ensures that $\alpha_i < \infty$.
    • similarly: if $y_i \cdot (\langle w, x_i\rangle + b) - 1 > 0$, then $\alpha_i = 0$: otherwise, L could be increased by decreasing $\alpha_i$ (KKT conditions)
  • Derivation of the Dual Problem
    At the extremum, we have
    $$\frac{\partial}{\partial b}L(w, b, \alpha) = 0, \quad \frac{\partial}{\partial w}L(w, b, \alpha) = 0,$$
    i.e.
    $$\sum_{i=1}^m \alpha_i y_i = 0 \quad \text{and} \quad w = \sum_{i=1}^m \alpha_i y_i x_i.$$
    Substitute both into L to get the dual problem
  • The Support Vector Expansion
    $$w = \sum_{i=1}^m \alpha_i y_i x_i$$
    where for all $i = 1, \dots, m$ either
    $y_i \cdot [\langle w, x_i\rangle + b] > 1 \implies \alpha_i = 0$ −→ $x_i$ irrelevant
    or
    $y_i \cdot [\langle w, x_i\rangle + b] = 1$ (on the margin) −→ $x_i$ "Support Vector"
    The solution is determined by the examples on the margin. Thus
    $$f(x) = \mathrm{sgn}(\langle x, w\rangle + b) = \mathrm{sgn}\left(\sum_{i=1}^m \alpha_i y_i \langle x, x_i\rangle + b\right).$$
  • Why it is Good to Have Few SVs
    Leave out an example that does not become an SV −→ same solution.
    Theorem [49]: Denote #SV(m) the number of SVs obtained by training on m examples randomly drawn from P(x, y), and E the expectation. Then
    $$E[\mathrm{Prob(test\ error)}] \le \frac{E[\#\mathrm{SV}(m)]}{m}$$
    Here, Prob(test error) refers to the expected value of the risk, where the expectation is taken over training the SVM on samples of size m − 1.
  • A Mechanical Interpretation [8]
    Assume that each SV $x_i$ exerts a perpendicular force of size $\alpha_i$ and sign $y_i$ on a solid plane sheet lying along the hyperplane. Then the solution is mechanically stable:
    $\sum_{i=1}^m \alpha_i y_i = 0$ implies that the forces sum to zero;
    $w = \sum_{i=1}^m \alpha_i y_i x_i$ implies that the torques sum to zero, via $\sum_i x_i \times y_i\alpha_i \cdot w/\|w\| = w \times w/\|w\| = 0$.
  • Dual Problem
    Dual: maximize
    $$W(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i\alpha_j y_i y_j \langle x_i, x_j\rangle$$
    subject to
    $$\alpha_i \ge 0, \quad i = 1, \dots, m, \quad \text{and} \quad \sum_{i=1}^m \alpha_i y_i = 0.$$
    Both the final decision function and the function to be maximized are expressed in dot products −→ can use a kernel to compute $\langle x_i, x_j\rangle = \langle\Phi(x_i), \Phi(x_j)\rangle = k(x_i, x_j)$.
  • The SVM Architecture
    [figure: classification $f(x) = \mathrm{sgn}\left(\sum_i \lambda_i\, k(x, x_i) + b\right)$; the input vector x is compared to the support vectors $x_1, \dots, x_4$ via the kernel, e.g. $k(x, x_i) = (\langle x, x_i\rangle)^d$, $k(x, x_i) = \exp(-\|x - x_i\|^2/c)$, or $k(x, x_i) = \tanh(\kappa\langle x, x_i\rangle + \theta)$, and the results are combined with weights $\lambda_1, \dots, \lambda_4$]
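For completeness, a usage sketch with scikit-learn's SVC, which trains a soft-margin SVM of this form; the library call and the parameter values are an illustration, not something prescribed by the slides:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(10)
X = np.vstack([rng.normal(-1, 0.7, (100, 2)), rng.normal(+1, 0.7, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)

# Gaussian (RBF) kernel k(x, x') = exp(-gamma * ||x - x'||^2); C is the soft-margin constant.
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

# The decision function has the form f(x) = sgn(sum_i alpha_i y_i k(x, x_i) + b),
# with the sum running only over the support vectors.
print("number of SVs per class:", clf.n_support_)
print("prediction at (1, 1):", clf.predict([[1.0, 1.0]]))
```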
  • Toy Example with Gaussian Kernel k(x, x′) = exp − x − x′ 2 B. Sch¨lkopf, MLSS France 2011 o
  • Nonseparable Problems [3, 9]If yi · ( w, xi + b) ≥ 1 cannot be satisfied, then αi → ∞.Modify the constraint to yi · ( w, xi + b) ≥ 1 − ξiwith ξi ≥ 0(“soft margin”) and add m C· ξi i=1in the objective function. B. Sch¨lkopf, MLSS France 2011 o
  • Soft Margin SVMs
    C-SVM [9]: for C > 0, minimize
    $$\tau(w, \xi) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^m \xi_i$$
    subject to $y_i \cdot (\langle w, x_i\rangle + b) \ge 1 - \xi_i$, $\xi_i \ge 0$ (margin $1/\|w\|$)
    ν-SVM [38]: for $0 \le \nu < 1$, minimize
    $$\tau(w, \xi, \rho) = \frac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{m}\sum_i \xi_i$$
    subject to $y_i \cdot (\langle w, x_i\rangle + b) \ge \rho - \xi_i$, $\xi_i \ge 0$ (margin $\rho/\|w\|$)
  • The ν-PropertySVs: αi > 0“margin errors:” ξi > 0KKT-Conditions =⇒• All margin errors are SVs.• Not all SVs need to be margin errors. Those which are not lie exactly on the edge of the margin.Proposition:1. fraction of Margin Errors ≤ ν ≤ fraction of SVs.2. asymptotically: ... = ν = ... B. Sch¨lkopf, MLSS France 2011 o
  • Duals, Using Kernels
    C-SVM dual: maximize
    $$W(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j k(x_i, x_j)$$
    subject to $0 \le \alpha_i \le C$, $\sum_i \alpha_i y_i = 0$.
    ν-SVM dual: maximize
    $$W(\alpha) = -\frac{1}{2}\sum_{ij} \alpha_i\alpha_j y_i y_j k(x_i, x_j)$$
    subject to $0 \le \alpha_i \le \frac{1}{m}$, $\sum_i \alpha_i y_i = 0$, $\sum_i \alpha_i \ge \nu$.
    In both cases: decision function
    $$f(x) = \mathrm{sgn}\left(\sum_{i=1}^m \alpha_i y_i k(x, x_i) + b\right)$$
  • Connection between ν-SVC and C-SVCProposition. If ν-SV classification leads to ρ > 0, then C-SVclassification, with C set a priori to 1/ρ, leads to the same decisionfunction.Proof. Minimize the primal target, then fix ρ, and minimize only over theremaining variables: nothing will change. Hence the obtained solution w0, b0, ξ0minimizes the primal problem of C-SVC, for C = 1, subject to yi · ( xi, w + b) ≥ ρ − ξi.To recover the constraint yi · ( xi, w + b) ≥ 1 − ξi,rescale to the set of variables w′ = w/ρ, b′ = b/ρ, ξ′ = ξ/ρ. This leaves us, upto a constant scaling factor ρ2, with the C-SV target with C = 1/ρ. B. Sch¨lkopf, MLSS France 2011 o
  • SVM Training• naive approach: the complexity of maximizing m 1 m W (α) = αi − αiαj yiyj k(xi, xj ) i=1 2 i,j=1 scales with the third power of the training set size m• only SVs are relevant −→ only compute (k(xi, xj ))ij for SVs. Extract them iteratively by cycling through the training set in chunks [46].• in fact, one can use chunks which do not even contain all SVs [31]. Maximize over these sub-problems, using your favorite optimizer.• the extreme case: by making the sub-problems very small (just two points), one can solve them analytically [32].• http://www.kernel-machines.org/software.html
  • MNIST Benchmark
    handwritten character benchmark (60000 training & 10000 test examples, 28 × 28)
    [figure: sample MNIST digit images with their labels]
  • MNIST Error Rates
    Classifier (test error, reference):
    linear classifier: 8.4% [7]
    3-nearest-neighbour: 2.4% [7]
    SVM: 1.4% [8]
    Tangent distance: 1.1% [41]
    LeNet4: 1.1% [28]
    Boosted LeNet4: 0.7% [28]
    Translation invariant SVM: 0.56% [11]
    Note: the SVM used a polynomial kernel of degree 9, corresponding to a feature space of dimension ≈ 3.2 · 10^20.
  • References [1] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950. [2] F. R. Bach and M. I. Jordan. Kernel independent component analysis. J. Mach. Learn. Res., 3:1–48, 2002. [3] K. P. Bennett and O. L. Mangasarian. Robust linear programming discrimination of two linearly inseparable sets. Opti- mization Methods and Software, 1:23–34, 1992. [4] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, New York, 1984. [5] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, 1995. [6] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144–152, Pittsburgh, PA, July 1992. ACM Press. [7] L. Bottou, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, L. D. Jackel, Y. LeCun, U. A. M¨ ller, E. S¨ckinger, P. Simard, u a and V. Vapnik. Comparison of classifier methods: a case study in handwritten digit recognition. In Proceedings of the 12th International Conference on Pattern Recognition and Neural Networks, Jerusalem, pages 77–87. IEEE Computer Society Press, 1994. [8] C. J. C. Burges and B. Sch¨lkopf. Improving the accuracy and speed of support vector learning machines. In M. Mozer, o M. Jordan, and T. Petsche, editors, Advances in Neural Information Processing Systems 9, pages 375–381, Cambridge, MA, 1997. MIT Press. [9] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273–297, 1995.[10] D. Cox and F. O’Sullivan. Asymptotic analysis of penalized likelihood and related estimators. Annals of Statistics, 18:1676–1695, 1990.[11] D. DeCoste and B. Sch¨lkopf. Training invariant support vector machines. Machine Learning, 2001. Accepted for publi- o cation. Also: Technical Report JPL-MLTR-00-1, Jet Propulsion Laboratory, Pasadena, CA, 2000.
  • [12] L. Devroye, L. Gy¨rfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Number 31 in Applications of o mathematics. Springer, New York, 1996.[13] R. M. Dudley. Real analysis and probability. Cambridge University Press, Cambridge, UK, 2002.[14] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. In A. J. Smola, P. L. Bartlett, B. Sch¨lkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 171–203, Cambridge, MA, o 2000. MIT Press. e e e ´[15] R. Fortet and E. Mourier. Convergence de la r´paration empirique vers la r´paration th´orique. Ann. Scient. Ecole Norm. Sup., 70:266–285, 1953.[16] K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel hilbert spaces. J. Mach. Learn. Res., 5:73–99, 2004.[17] K. Fukumizu, A. Gretton, X. Sun, and B. Sch¨lkopf. Kernel measures of conditional dependence. In J. C. Platt, D. Koller, o Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20, pages 489–496, Cam- bridge, MA, USA, 09 2008. MIT Press.[18] F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6):1455– 1480, 1998.[19] A. Gretton, K. Borgwardt, M. Rasch, B. Sch¨lkopf, and A. J. Smola. A kernel method for the two-sample-problem. In o B. Sch¨lkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19, volume 19. The o MIT Press, Cambridge, MA, 2007.[20] A. Gretton, O. Bousquet, A.J. Smola, and B. Sch¨lkopf. Measuring statistical dependence with Hilbert-Schmidt norms. o In S. Jain, H. U. Simon, and E. Tomita, editors, Proceedings Algorithmic Learning Theory, pages 63–77, Berlin, Germany, 2005. Springer-Verlag.[21] A. Gretton, R. Herbrich, A. Smola, O. Bousquet, and B. Sch¨lkopf. Kernel methods for measuring independence. J. Mach. o Learn. Res., 6:2075–2129, 2005.[22] D. Haussler. Convolutional kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer Science Depart- ment, University of California at Santa Cruz, 1999.[23] J. Huang, A.J. Smola, A. Gretton, K. Borgwardt, and B. Sch¨lkopf. Correcting sample selection bias by unlabeled data. In o B. Sch¨lkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19, volume 19. The o MIT Press, Cambridge, MA, 2007.
  • [24] I. A. Ibragimov and R. Z. Has’minskii. Statistical Estimation — Asymptotic Theory. Springer-Verlag, New York, 1981.[25] J. Jacod and P. Protter. Probability Essentials. Springer, New York, 2000.[26] G. S. Kimeldorf and G. Wahba. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Annals of Mathematical Statistics, 41:495–502, 1970.[27] G. S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Applic., 33:82–95, 1971.[28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86:2278–2324, 1998.[29] D. J. C. MacKay. Introduction to gaussian processes. In C. M. Bishop, editor, Neural Networks and Machine Learning, pages 133–165. Springer-Verlag, Berlin, 1998.[30] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Phi- los. Trans. Roy. Soc. London, A 209:415–446, 1909.[31] E. Osuna, R. Freund, and F. Girosi. Support vector machines: Training and applications. Technical Report AIM-1602, MIT A.I. Lab., 1996.[32] J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Sch¨lkopf, C. J. C. Burges, o and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 185–208, Cambridge, MA, 1999. MIT Press.[33] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9), September 1990.[34] A. R´nyi. On measures of dependence. Acta Math. Acad. Sci. Hungar., 10:441–451, 1959. e[35] S. Saitoh. Theory of Reproducing Kernels and its Applications. Longman Scientific & Technical, Harlow, England, 1988.[36] B. Sch¨lkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002. o[37] B. Sch¨lkopf, A. Smola, and K.-R. M¨ ller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Compu- o u tation, 10:1299–1319, 1998.[38] B. Sch¨lkopf, A. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, o 12:1207–1245, 2000.[39] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004.
  • [40] H. Shimodaira. Improving predictive inference under convariance shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90, 2000.[41] P. Simard, Y. LeCun, and J. Denker. Efficient pattern recognition using a new transformation distance. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5. Proceedings of the 1992 Conference, pages 50–58, San Mateo, CA, 1993. Morgan Kaufmann.[42] A. Smola, B. Sch¨lkopf, and K.-R. M¨ ller. The connection between regularization operators and support vector kernels. o u Neural Networks, 11:637–649, 1998.[43] A. J. Smola, A. Gretton, L. Song, and B. Sch¨lkopf. A Hilbert space embedding for distributions. In M. Hutter, R. A. o Servedio, and E. Takimoto, editors, Algorithmic Learning Theory: 18th International Conference, pages 13–31, Berlin, 10 2007. Springer.[44] A. J. Smola, A. Gretton, L. Song, and B. Sch¨lkopf. A Hilbert space embedding for distributions. In Proc. Intl. Conf. o Algorithmic Learning Theory, volume 4754 of LNAI. Springer, 2007.[45] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. J. Mach. Learn. Res., 2:67–93, 2001.[46] V. Vapnik. Estimation of Dependences Based on Empirical Data [in Russian]. Nauka, Moscow, 1979. (English translation: Springer Verlag, New York, 1982).[47] V. Vapnik. The Nature of Statistical Learning Theory. Springer, NY, 1995.[48] V. Vapnik. Statistical Learning Theory. Wiley, NY, 1998.[49] V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition [in Russian]. Nauka, Moscow, 1974. (German Translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie–Verlag, Berlin, 1979).[50] V. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24, 1963.[51] U. von Luxburg, O. Bousquet, and B. Sch¨lkopf. A compression approach to support vector model selection. Technical o Report 101, Max Planck Institute for Biological Cybernetics, T¨ bingen, Germany, 2002. see more detailed JMLR version. u[52] G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Math- ematics. SIAM, Philadelphia, 1990.
  • [53] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Technical Report 649, UC Berkeley, Department of Statistics, September 2003.[54] H. L. Weinert. Reproducing Kernel Hilbert Spaces. Hutchinson Ross, Stroudsburg, PA, 1982.[55] C. K. I. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In M. I. Jordan, editor, Learning and Inference in Graphical Models. Kluwer, 1998.[56] D. H. Wolpert. The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7):1341–1390, 1996. B. Sch¨lkopf, MLSS France 2011 o
  • Regularization Interpretation of Kernel Machines
    The norm in H can be interpreted as a regularization term (Girosi 1998, Smola et al., 1998, Evgeniou et al., 2000): if P is a regularization operator (mapping into a dot product space D) such that k is Green's function of $P^*P$, then
    $$\|w\| = \|Pf\|,$$
    where
    $$w = \sum_{i=1}^m \alpha_i\Phi(x_i) \quad \text{and} \quad f(x) = \sum_i \alpha_i k(x_i, x).$$
    Example: for the Gaussian kernel, P is a linear combination of differential operators.
  • 
    $$\|w\|^2 = \sum_{i,j}\alpha_i\alpha_j k(x_i, x_j) = \sum_{i,j}\alpha_i\alpha_j \langle k(x_i, \cdot), \delta_{x_j}(\cdot)\rangle = \sum_{i,j}\alpha_i\alpha_j \langle k(x_i, \cdot), (P^*P k)(x_j, \cdot)\rangle$$
    $$= \sum_{i,j}\alpha_i\alpha_j \langle (Pk)(x_i, \cdot), (Pk)(x_j, \cdot)\rangle_D = \left\langle P\sum_i\alpha_i k(x_i, \cdot),\ P\sum_j\alpha_j k(x_j, \cdot)\right\rangle_D = \|Pf\|^2,$$
    using $f(x) = \sum_i \alpha_i k(x_i, x)$.