CVPR2009 tutorial: Kernel Methods in Computer Vision: part I: Introduction to Kernel Methods, Selecting and Combining Kernels

  1. 1. Kernel Methods in Computer Vision. Christoph Lampert (Max Planck Institute for Biological Cybernetics, Tübingen) and Matthew Blaschko (MPI Tübingen and University of Oxford). June 20, 2009
  2. 2. Overview... 14:00 – 15:00 Introduction to Kernel Classifiers; 15:20 – 15:50 Selecting and Combining Kernels; 15:50 – 16:20 Other Kernel Methods; 16:40 – 17:40 Learning with Structured Outputs. Slides and additional material (soon): http://www.christoph-lampert.de
  3. 3. Introduction to Kernel Classifiers
  4. 4. Linear Classification: separate these two sample sets. [Scatter plot: width vs. height]
  5. 5. Linear Classification: separate these two sample sets. [Scatter plot: width vs. height, with a separating line] Linear Classification
  6. 6. Linear Regression: find a function that interpolates data points. [Scatter plot of data points]
  7. 7. Linear Regression: find a function that interpolates data points. [Scatter plot with a fitted line] Least Squares Regression
  8. 8. Linear Dimensionality Reduction: reduce the dimensionality of a dataset while preserving its structure. [2D point cloud]
  9. 9. Linear Dimensionality Reduction: reduce the dimensionality of a dataset while preserving its structure. [2D point cloud with its principal axis] Principal Component Analysis
  10. 10. Linear Techniques. Three different elementary tasks: classification, regression, dimensionality reduction. In each case, linear techniques are very successful.
  11. 11. Linear Techniques. Linear techniques... often work well: most natural functions are smooth, and smooth functions can be approximated, at least locally, by linear functions. They are fast and easy to solve: elementary maths, even closed-form solutions, typically involving only matrix operations. They are intuitive: the solution can be visualized geometrically and corresponds to common sense.
  12. 12. Example: Maximum Margin Classification. Notation: data points X = {x1, ..., xn}, xi ∈ Rd; class labels Y = {y1, ..., yn}, yi ∈ {+1, −1}; a linear (decision) function f : Rd → R; decide classes based on sign f : Rd → {−1, +1}. Parameterize f(x) = a1 x1 + a2 x2 + ... + ad xd + a0 ≡ ⟨w, x⟩ + b with w = (a1, ..., ad), b = a0, where ⟨., .⟩ is the scalar product in Rd. f is uniquely determined by w ∈ Rd and b ∈ R, but we usually ignore b and only study w, since b can be absorbed into w: set w ← (w, b), x ← (x, 1).
  13. 13. Example: Maximum Margin Classification. Given X = {x1, ..., xn}, Y = {y1, ..., yn}. [Scatter plot of the two classes]
  14. 14. Given X = {x1, ..., xn}, Y = {y1, ..., yn}. Any w partitions the data space into two half-spaces, i.e. defines a classifier: f(x) > 0 on one side, f(x) < 0 on the other. [Plot of a hyperplane with normal vector w] "What's the best w?"
  15. 15. What's the best w? Not these, since they misclassify many examples. [Two plots of badly placed hyperplanes] Criterion 1: Enforce sign⟨w, xi⟩ = yi for i = 1, ..., n.
  16. 16. What's the best w? Better not these, since they would be "risky" for future samples. [Two plots of hyperplanes passing very close to the samples] Criterion 2: Try to ensure sign⟨w, x⟩ = y for future (x, y) as well.
  17. 17. Assume that future samples are similar to current ones. What's the best w? Maximize "stability": use the w such that we can maximally perturb the input samples without introducing misclassifications. [Plot with margin γ on both sides of the hyperplane]
  18. 18. Maximize "stability": use the w such that we can maximally perturb the input samples without introducing misclassifications. [Plot of the margin region of width γ] Central quantity: margin(x) = distance of x to the decision hyperplane = ⟨w/‖w‖, x⟩.
  19. 19. Example: Maximum Margin Classification. The maximum-margin solution is determined by a maximization problem:
  max_{w ∈ Rd, γ ∈ R+} γ
  subject to sign⟨w, xi⟩ = yi for i = 1, ..., n, and ⟨w/‖w‖, xi⟩ ≥ γ for i = 1, ..., n.
  Classify new samples using f(x) = ⟨w, x⟩.
  20. 20. Example: Maximum Margin Classification. The maximum-margin solution is determined by a maximization problem:
  max_{w ∈ Rd, ‖w‖ = 1, γ ∈ R} γ
  subject to yi ⟨w, xi⟩ ≥ γ for i = 1, ..., n.
  Classify new samples using f(x) = ⟨w, x⟩.
  21. 21. Example: Maximum Margin Classification. We can rewrite this as a minimization problem:
  min_{w ∈ Rd} ‖w‖²
  subject to yi ⟨w, xi⟩ ≥ 1 for i = 1, ..., n.
  Classify new samples using f(x) = ⟨w, x⟩.
  22. 22. Example: Maximum Margin Classification. From the view of optimization theory,
  min_{w ∈ Rd} ‖w‖²  subject to  yi ⟨w, xi⟩ ≥ 1 for i = 1, ..., n
  is rather easy: the objective function is differentiable and convex, and the constraints are all linear. We can find the globally optimal w in O(n³) (or faster).
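The problem above is a standard quadratic program, so any generic convex solver can handle it. Below is a minimal sketch (not from the tutorial) using the cvxpy library on made-up synthetic data; the names X, y, and w are placeholders, and the bias term is omitted as on the slides.

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data: 20 points in R^2 with labels +1 / -1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.5, (10, 2)),
               rng.normal([-2, -2], 0.5, (10, 2))])
y = np.array([1] * 10 + [-1] * 10)

# Hard-margin SVM as a quadratic program:
# minimize ||w||^2 subject to y_i <w, x_i> >= 1 for all i.
w = cp.Variable(2)
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)),
                     [cp.multiply(y, X @ w) >= 1])
problem.solve()

print("w =", w.value, " margin =", 1.0 / np.linalg.norm(w.value))
```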
  23. 23. Linear Separability. What is the best w for this dataset? [Scatter plot: width vs. height, classes not linearly separable]
  24.–27. Not this. [Four plots of candidate hyperplanes, each misclassifying some samples]
  28. 28. Linear Separability. The problem
  min_{w ∈ Rd} ‖w‖²  subject to  yi ⟨w, xi⟩ ≥ 1
  has no solution: the constraints contradict each other! We cannot find a maximum-margin hyperplane here, because there is none. To fix this, we must allow hyperplanes that make mistakes.
  29.–30. Linear Separability. What is the best w for this dataset? Possibly this one, even though one sample is misclassified. [Plot of a large-margin hyperplane with one margin violation ξ]
  31.–32. Maybe not this one, even though all points are classified correctly. [Plot of a hyperplane with a very small margin]
  33. 33. Trade-off: large margin vs. few mistakes on the training set. [Plot of a large-margin hyperplane with one margin violation ξ]
  34. 34. Solving for the Soft-Margin Solution. Mathematically, we formulate the trade-off by slack variables ξi:
  min_{w ∈ Rd, ξi ∈ R+} ‖w‖² + C Σ_{i=1}^{n} ξi
  subject to yi ⟨w, xi⟩ ≥ 1 − ξi for i = 1, ..., n.
  We can fulfill every constraint by choosing ξi large enough. The larger ξi, the larger the objective (that we try to minimize). C is a regularization/trade-off parameter: small C → constraints are easily ignored; large C → constraints are hard to ignore; C = ∞ → hard-margin case → no training error. Note: the problem is still convex and efficiently solvable.
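Continuing the cvxpy sketch from above (again an illustration, not the tutorial's own code), the slack variables and the trade-off parameter C enter as follows. In practice one would instead call a dedicated SVM solver such as scikit-learn's LinearSVC or SVC with the same parameter C.

```python
import numpy as np
import cvxpy as cp

def soft_margin_svm(X, y, C=1.0):
    """Soft-margin linear SVM (no bias term): min ||w||^2 + C * sum(xi)."""
    n, d = X.shape
    w = cp.Variable(d)
    xi = cp.Variable(n, nonneg=True)            # slack variables xi_i >= 0
    objective = cp.Minimize(cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w) >= 1 - xi]
    cp.Problem(objective, constraints).solve()
    return w.value, xi.value

# Small C: margin violations are cheap; large C: approaches the hard margin.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([1, 1], 1.0, (15, 2)),
               rng.normal([-1, -1], 1.0, (15, 2))])
y = np.array([1] * 15 + [-1] * 15)
for C in (0.01, 1.0, 100.0):
    w, xi = soft_margin_svm(X, y, C)
    print(f"C={C:7.2f}  margin={1 / np.linalg.norm(w):.3f}  total slack={xi.sum():.2f}")
```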
  35. 35. Linear Separability. So, what is the best soft-margin w for this dataset? [Scatter plot of two classes that are not even approximately linearly separable] None. We need something non-linear!
  36. 36. Non-Linear Classification: Stacking. Idea 1) Use classifier outputs as inputs to other (linear) classifiers. [Diagram: the outputs σ(f1(x)), ..., σ(f4(x)) of linear classifiers fi(x) = ⟨wi, x⟩ feed into a further classifier f5, giving σ(f5(·))] Multilayer Perceptron (Artificial Neural Network) or Boosting ⇒ decisions depend non-linearly on x and wj.
  37. 37. Non-linearity: Data Preprocessing. Idea 2) Preprocess the data. This dataset is not (well) linearly separable in Cartesian coordinates (x, y); this one is, in polar coordinates (r, θ). In fact, both are the same dataset! [Top plot: Cartesian coordinates; bottom plot: polar coordinates]
  38. 38. Non-linear separation in Cartesian coordinates (x, y) corresponds to linear separation in polar coordinates (r, θ): a linear classifier in polar space acts non-linearly in Cartesian space. [Two plots: curved boundary in (x, y), straight boundary in (r, θ)]
  39. 39. Generalized Linear Classifier. Given X = {x1, ..., xn}, Y = {y1, ..., yn} and any (non-linear) feature map ϕ : Rk → Rm, solve the minimization for ϕ(x1), ..., ϕ(xn) instead of x1, ..., xn:
  min_{w ∈ Rm, ξi ∈ R+} ‖w‖² + C Σ_{i=1}^{n} ξi
  subject to yi ⟨w, ϕ(xi)⟩ ≥ 1 − ξi for i = 1, ..., n.
  The weight vector w now comes from the target space Rm. Distances/angles are measured by the scalar product ⟨., .⟩ in Rm. The classifier f(x) = ⟨w, ϕ(x)⟩ is linear in w, but non-linear in x.
  40. 40. Example Feature Mappings. Polar coordinates: ϕ : (x, y) ↦ (√(x² + y²), ∠(x, y)). d-th degree polynomials: ϕ : (x1, ..., xn) ↦ (1, x1, ..., xn, x1², ..., xn², ..., x1^d, ..., xn^d). Distance map: ϕ : x ↦ (‖x − p1‖, ..., ‖x − pN‖) for a set of N prototype vectors pi, i = 1, ..., N.
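As an illustration of the generalized linear classifier (my own sketch, not from the slides), the polar-coordinate map above can be applied explicitly before training an ordinary linear SVM; scikit-learn's LinearSVC stands in for the soft-margin solver, and the ring-vs-blob data is invented for the example.

```python
import numpy as np
from sklearn.svm import LinearSVC

def polar_features(X):
    """Feature map phi(x, y) = (radius, angle), as on the slide."""
    r = np.linalg.norm(X, axis=1)
    theta = np.arctan2(X[:, 1], X[:, 0])
    return np.column_stack([r, theta])

# Ring-vs-blob data: not linearly separable in Cartesian coordinates.
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 100)
ring = np.column_stack([3 * np.cos(angles), 3 * np.sin(angles)]) + rng.normal(0, 0.2, (100, 2))
blob = rng.normal(0, 0.7, (100, 2))
X = np.vstack([ring, blob])
y = np.array([1] * 100 + [-1] * 100)

for name, feats in [("Cartesian", X), ("polar", polar_features(X))]:
    clf = LinearSVC(C=1.0).fit(feats, y)
    print(name, "training accuracy:", clf.score(feats, y))
```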
  41. 41. Is this enough? In this example, changing the coordinates did help. Does this trick always work? [Cartesian plot ↔ polar plot]
  42. 42. Is this enough? Answer: In a way, yes! Lemma: Let (xi)_{i=1,...,n} with xi ≠ xj for i ≠ j, and let ϕ : Rk → Rm be a feature map. If the set ϕ(xi)_{i=1,...,n} is linearly independent, then the points ϕ(xi)_{i=1,...,n} are linearly separable. Lemma: If we choose m > n large enough, we can always find such a map ϕ.
  43.–44. Is this enough? Caveat: We can separate any set, not just one with "reasonable" yi: there is a fixed feature map ϕ : R² → R^20001 such that – no matter how we label them – there is always a hyperplane classifier that has zero training error.
  45. 45. Representer Theorem. Solve the soft-margin minimization for ϕ(x1), ..., ϕ(xn) ∈ Rm:
  min_{w ∈ Rm, ξi ∈ R+} ‖w‖² + C Σ_{i=1}^{n} ξi   (1)
  subject to yi ⟨w, ϕ(xi)⟩ ≥ 1 − ξi for i = 1, ..., n.
  For large m, won't solving for w ∈ Rm become impossible?
  46. 46. For large m, won't solving for w ∈ Rm become impossible? No! Theorem (Representer Theorem): The minimizing solution w to problem (1) can always be written as w = Σ_{j=1}^{n} αj ϕ(xj) for coefficients α1, ..., αn ∈ R.
  47. 47. Kernel Trick. The representer theorem allows us to rewrite the optimization:
  min_{w ∈ Rm, ξi ∈ R+} ‖w‖² + C Σ_{i=1}^{n} ξi
  subject to yi ⟨w, ϕ(xi)⟩ ≥ 1 − ξi for i = 1, ..., n.
  Insert w = Σ_{j=1}^{n} αj ϕ(xj):
  48. 48. We can minimize over αi instead of w:
  min_{αi ∈ R, ξi ∈ R+} ‖Σ_{j=1}^{n} αj ϕ(xj)‖² + C Σ_{i=1}^{n} ξi
  subject to yi Σ_{j=1}^{n} αj ⟨ϕ(xj), ϕ(xi)⟩ ≥ 1 − ξi for i = 1, ..., n.
  49. 49. Use ‖w‖² = ⟨w, w⟩:
  min_{αi ∈ R, ξi ∈ R+} Σ_{j,k=1}^{n} αj αk ⟨ϕ(xj), ϕ(xk)⟩ + C Σ_{i=1}^{n} ξi
  subject to yi Σ_{j=1}^{n} αj ⟨ϕ(xj), ϕ(xi)⟩ ≥ 1 − ξi for i = 1, ..., n.
  Note: ϕ only occurs in ⟨ϕ(.), ϕ(.)⟩ pairs.
  50. 50. Set ⟨ϕ(x), ϕ(x′)⟩ =: k(x, x′), called the kernel function:
  min_{αi ∈ R, ξi ∈ R+} Σ_{j,k=1}^{n} αj αk k(xj, xk) + C Σ_{i=1}^{n} ξi
  subject to yi Σ_{j=1}^{n} αj k(xj, xi) ≥ 1 − ξi for i = 1, ..., n.
  The maximum-margin classifier in this form with a kernel function is often called a Support Vector Machine (SVM).
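Because the optimization only touches the data through k(xi, xj), an SVM can be trained from the Gram matrix alone. A small sketch (illustrative setup with made-up data, not the tutorial's code) using scikit-learn's precomputed-kernel interface:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 5))
y_train = np.where(X_train[:, 0] + 0.3 * rng.normal(size=40) > 0, 1, -1)
X_test = rng.normal(size=(10, 5))

def rbf_kernel(A, B, gamma=0.5):
    """k(x, x') = exp(-gamma * ||x - x'||^2), evaluated for all pairs."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

# The learner only ever sees kernel values, never phi(x).
K_train = rbf_kernel(X_train, X_train)
K_test = rbf_kernel(X_test, X_train)     # rows: test points, columns: training points

clf = SVC(kernel="precomputed", C=10.0).fit(K_train, y_train)
print("test predictions:", clf.predict(K_test))
```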
  51. 51. Why use k(x, x′) instead of ⟨ϕ(x), ϕ(x′)⟩? 1) Speed: We might find an expression for k(xi, xj) that is faster to calculate than forming ϕ(xi) and then ⟨ϕ(xi), ϕ(xj)⟩. Example: 2nd-order polynomial kernel (here for x ∈ R1): ϕ : x ↦ (1, √2 x, x²) ∈ R3, so ⟨ϕ(xi), ϕ(xj)⟩ = ⟨(1, √2 xi, xi²), (1, √2 xj, xj²)⟩ = 1 + 2 xi xj + xi² xj². But equivalently (and faster) we can calculate without ϕ: k(xi, xj) := (1 + xi xj)² = 1 + 2 xi xj + xi² xj².
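A quick numerical check of this identity (illustrative only):

```python
import numpy as np

def phi(x):
    """Explicit feature map of the 2nd-order polynomial kernel on R^1."""
    return np.array([1.0, np.sqrt(2.0) * x, x ** 2])

def k_poly2(xi, xj):
    """Kernel evaluated directly, without the feature map."""
    return (1.0 + xi * xj) ** 2

rng = np.random.default_rng(0)
for _ in range(5):
    xi, xj = rng.normal(size=2)
    assert np.isclose(phi(xi) @ phi(xj), k_poly2(xi, xj))
print("phi(xi).phi(xj) == (1 + xi*xj)^2 holds on random inputs")
```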
  52. 52. Why use k(x, x′) instead of ⟨ϕ(x), ϕ(x′)⟩? 2) Flexibility: There are kernel functions k(xi, xj) for which we know that a feature transformation ϕ exists, but we don't know what ϕ is.
  53. 53. How can that be? Theorem: Let k : X × X → R be a positive definite kernel function. Then there exists a Hilbert space H and a mapping ϕ : X → H such that k(x, x′) = ⟨ϕ(x), ϕ(x′)⟩_H, where ⟨., .⟩_H is the inner product in H.
  54. 54. Positive Definite Kernel Function. Definition (Positive Definite Kernel Function): Let X be a non-empty set. A function k : X × X → R is called a positive definite kernel function iff k is symmetric, i.e. k(x, x′) = k(x′, x) for all x, x′ ∈ X, and for any set of points x1, ..., xn ∈ X the matrix Kij = (k(xi, xj))_{i,j} is positive (semi-)definite, i.e. for all vectors t ∈ Rn: Σ_{i,j=1}^{n} ti Kij tj ≥ 0. Note: Instead of "positive definite kernel function", we will often just say "kernel".
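The definition suggests a simple sample-based sanity check (necessary, not sufficient): build the kernel matrix for one finite sample and inspect its eigenvalue spectrum. A sketch with assumed toy data:

```python
import numpy as np

def is_psd_on_sample(kernel, X, tol=1e-10):
    """Check symmetry and positive semi-definiteness of K_ij = k(x_i, x_j)
    on one finite sample (a necessary condition, not a proof)."""
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    symmetric = np.allclose(K, K.T)
    min_eig = np.linalg.eigvalsh(K).min()   # eigvalsh: eigenvalues of a symmetric matrix
    return symmetric and min_eig >= -tol

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))

rbf = lambda x, y: np.exp(-np.sum((x - y) ** 2))
tanh_kernel = lambda x, y: np.tanh(2.0 * x @ y + 1.0)   # the non-kernel example from the slides

print("RBF PSD on this sample:  ", is_psd_on_sample(rbf, X))
print("tanh PSD on this sample: ", is_psd_on_sample(tanh_kernel, X))
```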
  55. 55. Hilbert Spaces. Definition (Hilbert Space): A Hilbert space H is a vector space H with an inner product ⟨., .⟩_H, i.e. a mapping ⟨., .⟩_H : H × H → R which is symmetric: ⟨v, v′⟩_H = ⟨v′, v⟩_H for all v, v′ ∈ H; positive definite: ⟨v, v⟩_H ≥ 0 for all v ∈ H, where ⟨v, v⟩_H = 0 only for v = 0 ∈ H; and bilinear: ⟨av, v′⟩_H = a⟨v, v′⟩_H for v ∈ H, a ∈ R, and ⟨v + v″, v′⟩_H = ⟨v, v′⟩_H + ⟨v″, v′⟩_H. We can treat a Hilbert space like some Rn, if we only use concepts like vectors, angles, distances. Note: dim H = ∞ is possible!
  56. 56. Kernels for Arbitrary Sets. Theorem: Let k : X × X → R be a positive definite kernel function. Then there exists a Hilbert space H and a mapping ϕ : X → H such that k(x, x′) = ⟨ϕ(x), ϕ(x′)⟩_H, where ⟨., .⟩_H is the inner product in H. Translation: Take any set X and any function k : X × X → R. If k is a positive definite kernel, then we can use k to learn a (soft) maximum-margin classifier for the elements in X! Note: X can be any set, e.g. X = { all images }.
  57. 57. How to Check if a Function is a Kernel. Problem: Checking if a given k : X × X → R fulfills the conditions for a kernel is difficult: we would need to prove or disprove Σ_{i,j=1}^{n} ti k(xi, xj) tj ≥ 0 for any set x1, ..., xn ∈ X and any t ∈ Rn, for any n ∈ N. Workaround: It is easy to construct functions k that are positive definite kernels.
  58. 58. Constructing Kernels. 1) We can construct kernels from scratch: For any ϕ : X → Rm, k(x, x′) = ⟨ϕ(x), ϕ(x′)⟩_Rm is a kernel. If d : X × X → R is a distance function, i.e. d(x, x′) ≥ 0 for all x, x′ ∈ X; d(x, x′) = 0 only for x = x′; d(x, x′) = d(x′, x) for all x, x′ ∈ X; and d(x, x″) ≤ d(x, x′) + d(x′, x″) for all x, x′, x″ ∈ X, then k(x, x′) := exp(−d(x, x′)) is a kernel.
  59. 59. 2) We can construct kernels from other kernels: if k is a kernel and α > 0, then αk and k + α are kernels; if k1, k2 are kernels, then k1 + k2 and k1 · k2 are kernels.
  60.–62. Constructing Kernels. Examples of kernels for X = Rd: any linear combination Σ_j αj kj with αj ≥ 0; polynomial kernels k(x, x′) = (1 + ⟨x, x′⟩)^m, m > 0; Gaussian or RBF kernels k(x, x′) = exp(−‖x − x′‖² / (2σ²)) with σ > 0. Examples of kernels for other X: k(h, h′) = Σ_{i=1}^{n} min(hi, h′i) for n-bin histograms h, h′; k(p, p′) = exp(−KL(p, p′)) with KL the symmetrized KL-divergence between positive probability distributions; k(s, s′) = exp(−D(s, s′)) for strings s, s′ and D the edit distance. Examples of functions X × X → R that are not kernels: tanh(κ⟨x, x′⟩ + θ) (the matrix Kij can have negative eigenvalues).
  63. 63. Kernels in Computer Vision. X = { images }, treat feature extraction as part of the kernel definition. OCR/handwriting recognition: resize the image, normalize brightness/contrast/rotation/skew; polynomial kernel k(x, x′) = (1 + ⟨x, x′⟩)^d, d > 0 [DeCoste, Schölkopf. ML 2002]. Pedestrian detection: resize the image, calculate local intensity gradient directions; local thresholding + linear kernel [Dalal, Triggs. CVPR 2005], or local L1-normalization + histogram intersection kernel [Maji, Berg, Malik. CVPR 2008].
  64. 64. Object category recognition: extract local image descriptors, e.g. SIFT; calculate multi-level pyramid histograms h_{l,k}(x); pyramid match kernel k_PMK(x, x′) = a weighted sum over pyramid levels l = 1, ..., L of the histogram intersections Σ_k min(h_{l,k}(x), h_{l,k}(x′)) [Grauman, Darrell. ICCV 2005]. Scene/object category recognition: extract local image descriptors, e.g. SIFT; quantize descriptors into bag-of-words histograms; χ²-kernel k_χ²(h, h′) = exp(−γ χ²(h, h′)) for γ > 0, with χ²(h, h′) = Σ_{k=1}^{K} (hk − h′k)² / (hk + h′k) [Puzicha, Buhmann, Rubner, Tomasi. ICCV 1999].
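These histogram kernels are easy to implement and to plug into a precomputed-kernel SVM. The sketch below is illustrative only: the "bag-of-words" histograms are randomly generated stand-ins for real image features, and it assumes L1-normalized histograms.

```python
import numpy as np
from sklearn.svm import SVC

def histogram_intersection_kernel(H1, H2):
    """K[i, j] = sum_k min(H1[i, k], H2[j, k])."""
    return np.minimum(H1[:, None, :], H2[None, :, :]).sum(axis=2)

def chi2_kernel(H1, H2, gamma=1.0, eps=1e-10):
    """K[i, j] = exp(-gamma * sum_k (H1[i,k] - H2[j,k])^2 / (H1[i,k] + H2[j,k]))."""
    diff2 = (H1[:, None, :] - H2[None, :, :]) ** 2
    denom = H1[:, None, :] + H2[None, :, :] + eps
    return np.exp(-gamma * (diff2 / denom).sum(axis=2))

# Fake bag-of-words histograms: 60 "images", 50 visual words, two classes.
rng = np.random.default_rng(0)
H = rng.dirichlet(np.ones(50), size=60)
y = np.array([0] * 30 + [1] * 30)
H[y == 1, :10] += 0.05                      # class 1 uses the first 10 words more
H /= H.sum(axis=1, keepdims=True)           # re-normalize to L1 = 1

for name, K in [("intersection", histogram_intersection_kernel(H, H)),
                ("chi2", chi2_kernel(H, H, gamma=0.5))]:
    clf = SVC(kernel="precomputed", C=10.0).fit(K, y)
    print(name, "training accuracy:", clf.score(K, y))
```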
  65. 65. Summary. Linear methods are popular and well understood: classification, regression, dimensionality reduction, ... Kernels are at the same time 1) similarity measures between (arbitrary) objects and 2) scalar products in a (hidden) vector space. Kernelization can make linear techniques more powerful: implicit preprocessing, non-linear in the original data, yet still linear in some feature space ⇒ still intuitive/interpretable. Kernels can be defined over arbitrary inputs, e.g. images: a unified framework for all preprocessing steps; different features, normalization, etc., become kernel choices.
  66. 66. What did we not see? We have skipped the largest part of the theory on kernel methods: Optimization (dualization, algorithms to train SVMs); Kernel Design (systematic methods to construct data-dependent kernels); Statistical Interpretations (what do we assume about samples? what performance can we expect?); Generalization Bounds (the test error of a (kernelized) linear classifier can be controlled using its modelling error and its training error); "Support Vectors". This and much more is in the standard references.
  67. 67. Selecting and Combining Kernels
  68.–70. Selecting From Multiple Kernels. Typically, one has many different kernels to choose from: different functional forms (linear, polynomial, RBF, ...) and different parameters (polynomial degree, Gaussian bandwidth, ...). Different image features give rise to different kernels: color histograms, SIFT bag-of-words, HOG, pyramid match, spatial pyramids, ... How to choose? Ideally, based on the kernels' performance on the task at hand: estimate by cross-validation or validation-set error. Classically part of "Model Selection".
  71. 71. Kernel Parameter Selection. Note: model selection makes a difference! Action classification, KTH dataset:
  Method                                            Accuracy
  Dollár et al., VS-PETS 2005 ("SVM classifier")      80.66
  Nowozin et al., ICCV 2007 ("baseline RBF")          85.19
  Identical features, same kernel function; the difference: Nowozin used cross-validation for model selection (bandwidth and C). Note: there is no overfitting involved here. Model selection is fully automatic and uses only training data.
  72. 72. Kernel Parameter Selection. Rule of thumb for kernel parameters: for kernels based on the exponential function, k(x, x′) = exp(−(1/γ) X(x, x′)) with any dissimilarity X, set γ ≈ mean_{i,j=1,...,n} X(xi, xj). Sometimes better: use only the X(xi, xj) with yi = yj. In general, if there are several classes, then the kernel matrix Kij = k(xi, xj) should have a block structure w.r.t. the classes.
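A sketch of this heuristic for the Gaussian/RBF case (my own illustration on synthetic data; here X(x, x′) is taken to be the squared Euclidean distance, and the parameter divides the distance as in the slide's notation):

```python
import numpy as np

def rbf_gamma_rule_of_thumb(X):
    """Return gamma for k(x, x') = exp(-d(x, x') / gamma),
    with gamma set to the mean pairwise squared Euclidean distance."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    n = len(X)
    off_diag = sq_dists[~np.eye(n, dtype=bool)]   # ignore the zero diagonal
    return off_diag.mean()

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (25, 10)), rng.normal(2, 1, (25, 10))])
gamma = rbf_gamma_rule_of_thumb(X)
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / gamma)
print("rule-of-thumb gamma:", round(gamma, 2), " mean kernel value:", round(K.mean(), 2))
```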
  73.–75. [Figure: the "two moons" dataset and its kernel matrices. Panels: the class-label "kernel" (ideal block structure), the linear kernel, and RBF kernels with γ = 0.001, 0.01, 0.1, 1, 10, 100, 1000; plus γ = 0.6 obtained from the rule of thumb and γ = 1.6 obtained by 5-fold cross-validation.]
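Cross-validated model selection over γ (and C), as used for the last panel, is a one-liner with standard tooling. A sketch on stand-in "two moons" data; note that scikit-learn's gamma multiplies the squared distance, i.e. it corresponds to 1/γ in the slide's notation.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_moons

# "Two moons"-style data as a stand-in for the slide's example.
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

param_grid = {
    "gamma": [0.001, 0.01, 0.1, 1, 10, 100, 1000],   # RBF bandwidth
    "C": [0.1, 1, 10, 100],                          # soft-margin trade-off
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("5-fold CV accuracy:", round(search.best_score_, 3))
```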
  76. 76. Kernel Selection ↔ Kernel Combination. Is there a single best kernel at all? Kernels are typically designed to capture one aspect of the data (texture, color, edges, ...). Choosing one kernel means selecting exactly one such aspect.
  77. 77. Combining aspects is often better than selecting one. Mean accuracy on the Oxford Flowers dataset [Gehler, Nowozin: ICCV 2009]:
  Method         Accuracy
  Colour         60.9 ± 2.1
  Shape          70.2 ± 1.3
  Texture        63.7 ± 2.7
  HOG            58.5 ± 4.5
  HSV            61.3 ± 0.7
  siftint        70.6 ± 1.6
  siftbdy        59.4 ± 3.3
  combination    85.2 ± 1.5
  78. 78. Combining Two Kernels. For two kernels k1, k2: the product k = k1 · k2 is again a kernel (problem: very small kernel values suppress large ones); the average k = ½(k1 + k2) is again a kernel (problem: k1, k2 may be on different scales; re-scale first?); the convex combination kβ = (1 − β)k1 + βk2 with β ∈ [0, 1] is again a kernel. Model selection: cross-validate over β ∈ {0, 0.1, ..., 1}.
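With precomputed kernel matrices, cross-validating over the mixing weight β is straightforward. A sketch with two assumed Gram matrices K1 and K2 built from invented feature channels F1 and F2:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)
F1 = rng.normal(size=(100, 20)) + y[:, None] * 0.5    # feature channel 1 (informative)
F2 = rng.normal(size=(100, 20))                       # feature channel 2 (less informative)

rbf = lambda F, gamma: np.exp(-gamma * ((F[:, None, :] - F[None, :, :]) ** 2).sum(-1))
K1, K2 = rbf(F1, 0.05), rbf(F2, 0.05)

best = (None, -np.inf)
for beta in np.arange(0.0, 1.01, 0.1):
    K = (1 - beta) * K1 + beta * K2                   # convex combination: still a kernel
    score = cross_val_score(SVC(kernel="precomputed", C=10.0), K, y, cv=5).mean()
    if score > best[1]:
        best = (beta, score)
print("best beta:", round(best[0], 1), " CV accuracy:", round(best[1], 3))
```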
  79. 79. Combining Many Kernels. Multiple kernels k1, ..., kK: all convex combinations are kernels: k = Σ_{j=1}^{K} βj kj with βj ≥ 0, Σ_{j=1}^{K} βj = 1. Kernels can be "deactivated" by setting βj = 0. A combinatorial explosion forbids cross-validation over all combinations of βj. Proxy: instead of CV, maximize the SVM objective. Each combined kernel induces a feature space; in which of the feature spaces can we best explain the training data and achieve a large margin between the classes?
  80. 80. Feature Space View of Kernel Combination. Each kernel kj induces a Hilbert space Hj and a mapping ϕj : X → Hj. The weighted kernel kj^{βj} := βj kj induces the same Hilbert space Hj, but a rescaled feature mapping ϕj^{βj}(x) := √βj ϕj(x):
  kj^{βj}(x, x′) ≡ ⟨ϕj^{βj}(x), ϕj^{βj}(x′)⟩_{Hj} = ⟨√βj ϕj(x), √βj ϕj(x′)⟩_{Hj} = βj ⟨ϕj(x), ϕj(x′)⟩_{Hj} = βj kj(x, x′).
  The linear combination k̂ := Σ_{j=1}^{K} βj kj induces the product space Ĥ := ⊕_{j=1}^{K} Hj and the product mapping ϕ̂(x) := (ϕ1^{β1}(x), ..., ϕK^{βK}(x))ᵗ:
  k̂(x, x′) ≡ ⟨ϕ̂(x), ϕ̂(x′)⟩_Ĥ = Σ_{j=1}^{K} ⟨ϕj^{βj}(x), ϕj^{βj}(x′)⟩_{Hj} = Σ_{j=1}^{K} βj kj(x, x′).
  81. 81. Feature Space View of Kernel CombinationImplicit representation of a dataset using two kernels: Kernel k1 , feature representation ϕ1 (x1 ), . . . , ϕ1 (xn ) ∈ H1 Kernel k2 , feature representation ϕ2 (x1 ), . . . , ϕ2 (xn ) ∈ H2Kernel Selection would most likely pick k2 .For k = (1 − β)k1 + βk2 , top is β = 0, bottom is β = 1.
  82.–101. Feature Space View of Kernel Combination. [Animation: the dataset embedded in H1 × H2 with rescaled feature maps √β ϕ1(xi) and √(1 − β) ϕ2(xi), for a sweep of mixing weights β.] The achieved margin as a function of β:
  β = 0.00 → 0.0000, 0.01 → 0.1686, 0.02 → 0.2363, 0.03 → 0.2870, 0.10 → 0.4928, 0.20 → 0.6363, 0.30 → 0.7073, 0.40 → 0.7365, 0.50 → 0.7566, 0.60 → 0.7751, 0.65 → 0.7770, 0.70 → 0.7699, 0.80 → 0.7194, 0.90 → 0.5839, 0.95 → 0.4515, 0.97 → 0.3809, 0.98 → 0.3278, 0.99 → 0.2460, 1.00 → 0.1000.
  The margin is maximized at an intermediate combination (β ≈ 0.65), not at either single kernel (β = 0 or β = 1).
  102. 102. Multiple Kernel Learning. Can we calculate coefficients βj that realize the largest margin? Analyze: how does the margin depend on βj? Remember the standard SVM (here without slack variables):
  min_{w ∈ H} ‖w‖²_H subject to yi ⟨w, ϕ(xi)⟩_H ≥ 1 for i = 1, ..., n.
  H and ϕ were induced by the kernel k. New samples are classified by f(x) = ⟨w, ϕ(x)⟩_H.
  103. 103. Insert k(x, x′) = Σ_{j=1}^{K} βj kj(x, x′)   (2)
  with the Hilbert space H = ⊕_j Hj, feature map ϕ(x) = (√β1 ϕ1(x), ..., √βK ϕK(x))ᵗ, and weight vector w = (w1, ..., wK)ᵗ, such that
  ‖w‖²_H = Σ_j ‖wj‖²_{Hj}   (3)
  ⟨w, ϕ(xi)⟩_H = Σ_j √βj ⟨wj, ϕj(xi)⟩_{Hj}   (4)
  104. 104. For fixed βj, the largest-margin hyperplane is given by
  min_{wj ∈ Hj} Σ_j ‖wj‖²_{Hj} subject to yi Σ_j √βj ⟨wj, ϕj(xi)⟩_{Hj} ≥ 1 for i = 1, ..., n.
  Renaming vj = √βj wj (and defining 0/0 = 0):
  min_{vj ∈ Hj} Σ_j (1/βj) ‖vj‖²_{Hj} subject to yi Σ_j ⟨vj, ϕj(xi)⟩_{Hj} ≥ 1 for i = 1, ..., n.
  105. 105. Therefore, the best hyperplane for variable βj is given by:
  min_{vj ∈ Hj, Σ_j βj = 1, βj ≥ 0} Σ_j (1/βj) ‖vj‖²_{Hj}   (5)
  subject to yi Σ_j ⟨vj, ϕj(xi)⟩_{Hj} ≥ 1 for i = 1, ..., n.   (6)
  This optimization problem is jointly convex in vj and βj. There is a unique global minimum, and we can find it efficiently!
  106. 106. The same holds for the soft margin with slack variables:
  min_{vj ∈ Hj, Σ_j βj = 1, βj ≥ 0, ξi ∈ R+} Σ_j (1/βj) ‖vj‖²_{Hj} + C Σ_i ξi   (7)
  subject to yi Σ_j ⟨vj, ϕj(xi)⟩_{Hj} ≥ 1 − ξi for i = 1, ..., n.   (8)
  This optimization problem is jointly convex in vj and βj. There is a unique global minimum, and we can find it efficiently!
  107. 107. Software for Multiple Kernel Learning. Existing toolboxes allow multiple-kernel SVM training: Shogun (C++ with bindings to Matlab, Python, etc.) http://www.fml.tuebingen.mpg.de/raetsch/projects/shogun; MPI IKL (Matlab with libSVM, CoinIPopt) http://www.kyb.mpg.de/bs/people/pgehler/ikl-webpage/index.html; SimpleMKL (Matlab) http://asi.insa-rouen.fr/enseignants/~arakotom/code/mklindex.html; SKMsmo (Matlab) http://www.di.ens.fr/~fbach/ (older and slower than the others). Typically, one only has to specify the set of candidate kernels and the regularization parameter C.
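For a quick baseline without installing any of these toolboxes, the formulation above can also be approximated by a simple alternating scheme: train an SVM on the current combined kernel, then re-weight each kernel by the norm ‖vj‖ it contributes. The code below is only a minimal illustrative sketch of that idea (not the algorithm used by the toolboxes above), written against scikit-learn's precomputed-kernel SVC on invented data; convergence and numerical details are glossed over.

```python
import numpy as np
from sklearn.svm import SVC

def simple_mkl(kernels, y, C=10.0, iterations=20):
    """Crude alternating optimization for beta_j >= 0, sum_j beta_j = 1.
    Uses ||v_j||^2 = beta_j^2 * a^T K_j a, where a are the signed dual coefficients."""
    K_list = [np.asarray(K, dtype=float) for K in kernels]
    beta = np.ones(len(K_list)) / len(K_list)           # start from the uniform combination
    for _ in range(iterations):
        K = sum(b * Kj for b, Kj in zip(beta, K_list))  # current combined kernel
        svm = SVC(kernel="precomputed", C=C).fit(K, y)
        a = svm.dual_coef_[0]                           # y_i * alpha_i on the support vectors
        sv = svm.support_
        norms = np.array([beta[j] * np.sqrt(max(a @ Kj[np.ix_(sv, sv)] @ a, 0.0))
                          for j, Kj in enumerate(K_list)])
        if norms.sum() == 0:
            break
        beta = norms / norms.sum()                      # optimal beta for fixed v_j
    return beta

# Toy test: one informative kernel, one pure-noise kernel.
rng = np.random.default_rng(0)
y = np.array([-1] * 40 + [1] * 40)
F_good = rng.normal(size=(80, 5)) + y[:, None]
F_noise = rng.normal(size=(80, 5))
rbf = lambda F: np.exp(-0.1 * ((F[:, None, :] - F[None, :, :]) ** 2).sum(-1))
print("learned weights:", np.round(simple_mkl([rbf(F_good), rbf(F_noise)], y), 3))
```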
  108. 108. MKL Toy Example. Support-vector regression to learn samples of f(t) = sin(ωt), using kernels kj(x, x′) = exp(−‖x − x′‖² / (2σj²)) with 2σj² ∈ {0.005, 0.05, 0.5, 1, 10}. Multiple-kernel learning correctly identifies the right bandwidth.
  109. 109. Combining Good Kernels. Observation: if all kernels are reasonable, simple combination methods work as well as difficult ones (and are much faster). Mean accuracy and total runtime (model selection, training, testing) on the Oxford Flowers dataset [Gehler, Nowozin: ICCV 2009]:
  Single features:                     Combination methods:
  Method    Accuracy    Time           Method        Accuracy    Time
  Colour    60.9 ± 2.1     3           product       85.5 ± 1.2     2
  Shape     70.2 ± 1.3     4           averaging     84.9 ± 1.9    10
  Texture   63.7 ± 2.7     3           CG-Boost      84.8 ± 2.2  1225
  HOG       58.5 ± 4.5     4           MKL (SILP)    85.2 ± 1.5    97
  HSV       61.3 ± 0.7     3           MKL (Simple)  85.2 ± 1.5   152
  siftint   70.6 ± 1.6     4           LP-β          85.5 ± 3.0    80
  siftbdy   59.4 ± 3.3     5           LP-B          85.4 ± 2.4    98
  110. 110. Combining Good and Bad Kernels. Observation: if some kernels are helpful, but others are not, smart techniques are better. [Figure: accuracy (45–90%) vs. number of added noise features (0, 1, 5, 10, 25, 50) for product, average, CG-Boost, MKL (SILP or simple), LP-β, LP-B. Mean accuracy on the Oxford Flowers dataset, Gehler, Nowozin: ICCV 2009]
  111. 111. Example: Multi-Class Object LocalizationMKL for joint prediction of different object classes. Objects in images do not occur independently of each other. Chairs and tables often occur together in indoor scenes. Busses often occur together with cars in street scenes. Chairs rarely occur together with cars.One can make use of these dependencies to improve prediction.
  112. 112. Example: Multi-Class Object Localization. Predict candidate regions for all object classes. Train a decision function for each class (red), taking into account the candidate regions for all classes (red and green). Decide per class which other object categories are worth using:
  k(I, I′) = β0 k_χ²(h, h′) + Σ_{j=1}^{20} βj k_χ²(hj, h′j)
  where h is the feature histogram of the full image and hj is the histogram of the region predicted for object class j. Use MKL to learn the weights βj, j = 0, ..., 20. [Lampert and Blaschko, DAGM 2008]
  113. 113. Example: Multi-Class Object Localization Benchmark on PASCAL VOC 2006 and VOC 2007. Combination improves detection accuracy (black vs. blue).
  114. 114. Example: Multi-Class Object Localization. Interpretation of weights (VOC 2007): every class decision depends on the full image and on the object box; high image weights → scene classification?; intuitive connections: chair → diningtable, person → bottle, person → dog; many classes depend on the person class. [Figure: matrix of learned weights; rows: class to be detected; columns: full image plus candidate boxes for the 20 classes aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person, pottedplant, sheep, sofa, train, tvmonitor]
  115. 115. Example: Multi-Class Object Localization. We can turn the non-zero weights into a dependency graph: threshold the relative weights (without the image component) at 0.04; an edge i → j means "class i is used to predict class j." [Figure: dependency graph over the VOC classes with edge weights between 0.04 and 0.27] Interpretable clusters emerge: vehicles, indoor, animals.
  116. 116. Summary: Kernel Selection and Combination. Model selection is important to achieve the highest accuracy. Combining several kernels is often superior to selecting one. Multiple-Kernel Learning: learn weights for the "best" linear kernel combination, a unified approach to feature selection/combination; visit [Gehler, Nowozin. CVPR 2009] on Wednesday afternoon. Beware: MKL is no silver bullet. Other and even simpler techniques might be superior! Always compare against the single best kernel, averaging, and the product. Warning for Caltech101/256: be careful when reading kernel combination results; many reported results rely on "broken" Bosch kernel matrices.
