The “Kernel Trick”Many other algorithms besides SVMs can make use of kernels:            Any algorithm that uses the data ...
Linear Regression Given samples xi ∈ Rd and function values yi ∈ R. Find a linear function f (x) = ⟨w, x⟩ that approximates ...
Nonlinear Regression Given xi ∈ X , yi ∈ R, i = 1, . . . , n, and ϕ : X → H. Find an approximating function f (x) = ⟨w, ϕ(x)⟩ ...
Example: Kernel Ridge Regression What if H and ϕ : X → H are given implicitly by a kernel function? We cannot store the closed ...
Nonlinear Regression Like Least-Squares Regression, (Kernel) Ridge Regression is sensitive to outliers: ...
Nonlinear Regression Support Vector Regression with ε-insensitive loss is more robust: ...
Support Vector Regression Optimization problem similar to support vector machine: ...
Example – Head Pose Estimation        Detect faces in image        Compute gradient representation of face region        T...
Outlier/Anomaly Detection in Rd For unlabeled data, we are interested in detecting outliers, i.e. samples that lie far away from ...
Outlier/Anomaly Detection in Arbitrary Inputs Use a kernel k : X × X → R with an implicit feature map ϕ : X → H. Do outlier...
  1. 1. Kernel Methods in Computer Vision Christoph Lampert Institute of Science and Technology Austria Vienna (Klosterneuburg), Austria Microsoft Computer Vision School, Moscow, 2011
  2. 2. Kernel Methods in Computer VisionOverview...Monday: 9:00 – 10:00 Introduction to Kernel Classifiers 10:00 – 10:10 — mini break — 10:10 – 10:50 More About Kernel MethodsTuesday: 9:00 – 9:50 Conditional Random Fields 9:50 – 10:00 — mini break — 10:00 – 10:50 Structured Support Vector Machines Slides available on my home page: http://www.ist.ac.at/~chl
  3. 3. Extended versions of the lectures in book form: Foundations and Trends in Computer Graphics and Vision, now publishers (www.nowpublishers.com/). Available as PDFs on my homepage.
  4. 4. A Machine Learning View on Multimedia Problems     • Face Detection  • Object Recognition      Classification  • Action Classification      • Sign Language Recognition  •   Image Retrieval  • Pose Estimation    Regression  • Stereo Reconstruction   • Image Denoising  • Anomaly Detection in Videos  Outlier Detection  • Video Summarization   • Image Segmentation Clustering  • Image Duplicate Detection
  5. 5. A Machine Learning View on Multimedia Problems     • ... Classification  • Optical Character Recognition   • ... It’s difficult to program a solution to this. if (I[0,5]<128) & (I[0,6] > 192) & (I[0,7] < 128): return ’A’ elif (I[7,7]<50) & (I[6,3]) != 0: return ’Q’ else: print "I don’t know this letter."
  6. 6. A Machine Learning View on Multimedia Problems     • ... Classification  • Optical Character Recognition   • ... It’s difficult to program a solution to this. if (I[0,5]<128) & (I[0,6] > 192) & (I[0,7] < 128): return ’A’ elif (I[7,7]<50) & (I[6,3]) != 0: return ’Q’ else: print "I don’t know this letter." With Machine Learning, we can avoid this: We don’t program a solution to the specific problem. We program a generic classification program. We solve the problem by training the classifier with examples.
  7. 7. A Machine Learning View on Multimedia Problems • ... Classification • Object Category Recognition • ... It’s not just difficult but impossible to program a solution to this. if ??? With Machine Learning, we can avoid this: We don’t program a solution to the specific problem. We re-use our previous classifier. We solve the problem by training the classifier with examples.
  8. 8. Introduction to Kernel Classifiers
  9. 9. Linear Classification Separate these two sample sets. height width
  10. 10. Linear Classification Separate these two sample sets. height width
  11. 11. Linear Classification Separate these two sample sets.
  12. 12. Linear Classification Separate these two sample sets. height width Linear Classification
  13. 13. Why linear? Linear techniques have been studied for centuries. Why? They are fast and easy to solve: elementary maths, even closed-form solutions; typically involve only matrix operations. They are intuitive: simple is good; the solution can be derived geometrically; the solution corresponds to common sense. They often work very well: many physical processes are linear; almost all natural functions are smooth; smooth functions can be approximated, at least locally, by linear functions. The whole lecture will be about linear methods (with a twist).
  15. 15. Notation... data points X = {x1 , . . . , xn }, xi ∈ Rd (images); class labels Y = {y1 , . . . , yn }, yi ∈ {+1, −1} (cat or no cat); goal: classification rule g : Rd → {−1, +1}. Parameterize g(x) = sign f (x) with f : Rd → R: f (x) = a1 x1 + a2 x2 + . . . + ad xd + a0 . Simplified notation: x̂ = (1, x), ŵ = (a0 , . . . , ad ), so f (x) = ⟨ŵ, x̂⟩ (inner product in Rd+1 ). Out of laziness, we just write f (x) = ⟨w, x⟩ with x, w ∈ Rd .
  17. 17. Example: Linear Classification Given X = {x1 , . . . , xn }, Y = {y1 , . . . , yn }. Any w partitions the data space into two half-spaces by means of f (x) = ⟨w, x⟩: f (x) > 0 on one side, f (x) < 0 on the other. [plot: the two training classes and a hyperplane with normal vector w] “How to find w?”
  20. 20. Example: Ad Hoc Linear Classification Because everything is linear, we only need Linear Algebra! X = [x1 , . . . , xn ] ∈ Rd×n , y = [y1 , . . . , yn ] ∈ Rn ; remember ⟨w, x⟩ = wᵗx, so [f (x1 ), . . . , f (xn )] = [⟨w, x1 ⟩, . . . , ⟨w, xn ⟩] = [wᵗx1 , . . . , wᵗxn ] = wᵗ[x1 , . . . , xn ] = wᵗX. We want f (xi ) > 0 for yi = +1 and f (xi ) < 0 for yi = −1. Let’s just try f (xi ) = yi and solve wᵗX = y ⇒ wᵗXX ᵗ = yX ᵗ ⇒ wᵗ = yX ᵗ (XX ᵗ )⁻¹ (a 1×d vector times a d×d matrix).
  21. 21. Does it work? wᵗ = yX ᵗ (XX ᵗ )⁻¹ = [−7.97, 9.37] · [[47.82, 27.75], [27.75, 28.33]]⁻¹ = [−0.83, 1.14]. [plot: the resulting hyperplane separates the training data] Looks good. But is it the best w?
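A minimal numpy sketch of the ad hoc least-squares classifier from the two slides above; the toy two-cluster data here is made up for illustration and is not the dataset from the figure.

```python
import numpy as np

# Ad hoc linear classifier from the slide: stack training points as columns of X,
# regress the labels with w^t = y X^t (X X^t)^{-1}, classify by the sign of <w, x>.
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2.0, 0.5], scale=0.3, size=(20, 2))   # toy class +1
X_neg = rng.normal(loc=[0.5, 2.0], scale=0.3, size=(20, 2))   # toy class -1
X = np.vstack([X_pos, X_neg]).T            # shape (d, n), one sample per column
y = np.hstack([np.ones(20), -np.ones(20)])

w = np.linalg.solve(X @ X.T, X @ y)        # solve (X X^t) w = X y instead of inverting
print("w =", w, " training accuracy:", np.mean(np.sign(w @ X) == y))
```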
  23. 23. Criteria for Linear Classification What properties should an optimal w have? Given X = {x1 , . . . , xn }, Y = {y1 , . . . , yn }. [plots: two hyperplanes that cut through the point clouds] Are these the best? No, they misclassify many examples. Criterion 1: Enforce sign⟨w, xi ⟩ = yi for i = 1, . . . , n.
  24. 24. Criteria for Linear Classification What properties should an optimal w have? Given X = {x1 , . . . , xn }, Y = {y1 , . . . , yn }. What’s the best w? [plots: two hyperplanes that separate the training data but pass very close to some samples] Are these the best? No, they would be “risky” for future samples. Criterion 2: Ensure sign⟨w, x⟩ = y for future (x, y) as well.
  25. 25. Criteria for Linear Classification Given X = {x1 , . . . , xn }, Y = {y1 , . . . , yn }. Assume that future samples are similar to current ones. What’s the best w? Maximize “robustness”: use w such that we can maximally perturb the input samples without introducing misclassifications. [plots: perturbation balls of radius ρ around the samples and the resulting margin region around the hyperplane] Central quantity: margin(x) = distance of x to the decision hyperplane = ⟨ w/‖w‖ , x ⟩.
  27. 27. Maximum Margin Classification The maximum-margin solution is determined by a maximization problem: max_{w∈Rd , γ∈R+} γ subject to sign⟨w, xi ⟩ = yi and ⟨ w/‖w‖ , xi ⟩ ≥ γ for i = 1, . . . , n. Classify new samples using f (x) = ⟨w, x⟩.
  28. 28. Maximum Margin Classification The maximum-margin solution is determined by a maximization problem: max_{w∈Rd , ‖w‖=1, γ∈R} γ subject to yi ⟨w, xi ⟩ ≥ γ for i = 1, . . . , n. Classify new samples using f (x) = ⟨w, x⟩.
  29. 29. Maximum Margin Classification We can rewrite this as a minimization problem: min_{w∈Rd } ‖w‖² subject to yi ⟨w, xi ⟩ ≥ 1 for i = 1, . . . , n. Classify new samples using f (x) = ⟨w, x⟩.
  30. 30. Maximum Margin Classification From the view of optimization theory, min_{w∈Rd } ‖w‖² subject to yi ⟨w, xi ⟩ ≥ 1 for i = 1, . . . , n is rather easy: the objective function is differentiable and convex, and the constraints are all linear. We can find the globally optimal w in O(d³) (usually faster). There are no local minima. We have a definite stopping criterion.
  31. 31. Linear Separability What is the best w for this dataset?
  32. 32. Linear Separability What is the best w for this dataset? Not this.
  36. 36. Linear Separability The problem min_{w∈Rd } ‖w‖² subject to y1 ⟨w, x1 ⟩ ≥ 1, y2 ⟨w, x2 ⟩ ≥ 1, y3 ⟨w, x3 ⟩ ≥ 1, y4 ⟨w, x4 ⟩ ≥ 1 has no solution: the constraints contradict each other! We cannot find a maximum-margin hyperplane here, because there is none. To fix this, we must allow hyperplanes that make mistakes.
  37. 37. Linear Separability What is the best w for this dataset? [plot: a large-margin hyperplane with margin ρ; one sample xi lies on the wrong side, at margin violation ξi ] Possibly this one, even though one sample is misclassified.
  39. 39. Linear Separability What is the best w for this dataset? [plot: a hyperplane with a very narrow margin] Maybe not this one, even though all points are classified correctly.
  41. 41. Linear Separability What is the best w for this dataset? [plot: a hyperplane with a large margin ρ and a single margin violation ξi ] Trade-off: large margin vs. few mistakes on the training set.
  42. 42. Soft-Margin Classification Mathematically, we formulate the trade-off by slack variables ξi : min_{w∈Rd , ξi ∈R+} ‖w‖² + C Σi=1..n ξi subject to yi ⟨w, xi ⟩ ≥ 1 − ξi for i = 1, . . . , n. We can fulfill every constraint by choosing ξi large enough. The larger ξi , the larger the objective (that we try to minimize). C is a regularization/trade-off parameter: small C → constraints are easily ignored; large C → constraints are hard to ignore; C = ∞ → hard-margin case → no errors on the training set. Note: The problem is still convex and efficiently solvable. Note: This is not a hack. There are strong theoretical justifications, e.g. generalization bounds, see e.g. [Shawe-Taylor, Cristianini: "Kernel Methods for Pattern Analysis", Cambridge University Press, 2004].
  43. 43. Solving for the Soft-Margin Solution Reformulate the constraints into the objective function with the Hinge loss: min_{w∈Rd } ‖w‖² + C Σi=1..n [1 − yi ⟨w, xi ⟩]+ where [t]+ := max{0, t}. Linear Support Vector Machine (linear SVM). Now an unconstrained optimization, but a non-differentiable objective. Solve, e.g., by subgradient descent (see tomorrow’s lecture).
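A minimal sketch of the unconstrained hinge-loss objective above, trained by subgradient descent as the slide suggests; the step size, epoch count, and toy data are arbitrary illustrative choices, not values from the lecture.

```python
import numpy as np

def linear_svm_subgradient(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimize ||w||^2 + C * sum_i max(0, 1 - y_i <w, x_i>) by subgradient descent.
    X: (n, d) samples, y: (n,) labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        active = y * (X @ w) < 1                      # samples with non-zero hinge loss
        grad = 2 * w - C * (y[active, None] * X[active]).sum(axis=0)
        w -= lr * grad
    return w

# toy usage on two separable Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1.5, 0.5, (30, 2)), rng.normal(-1.5, 0.5, (30, 2))])
y = np.hstack([np.ones(30), -np.ones(30)])
w = linear_svm_subgradient(X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```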
  44. 44. Linear SVMs in Practice Linear SVMs are currently the most frequently used classifiers in Computer Vision. Efficient software packages exist: liblinear: http://www.csie.ntu.edu.tw/∼cjlin/liblinear/ SVMperf: http://www.cs.cornell.edu/People/tj/svm_light/svm_perf.html see also Pegasos: http://www.cs.huji.ac.il/∼shais/code/ and sgd: http://leon.bottou.org/projects/sgd Training time: approximately linear in the number of training examples and in the data dimensionality. Evaluation time: linear in the data dimensionality. [O. Chapelle, "Training a support vector machine in the primal", Neural Computation, 2007]
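A usage sketch with scikit-learn's LinearSVC, which wraps the liblinear package listed above; scikit-learn and the synthetic data are assumptions here, standing in for any of the listed solvers.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# synthetic stand-in for image feature vectors
X, y = make_classification(n_samples=1000, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LinearSVC(C=1.0)        # C is the soft-margin trade-off parameter from the slides
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```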
  45. 45. Linear Separability So, what is the best soft-margin w for this dataset? [plot: a two-class dataset that no hyperplane separates well] None. We need something non-linear!
  46. 46. Non-Linear Classification: Stacking Idea 1) Use classifier outputs as input to other classifiers: [diagram: linear units fi (x) = ⟨wi , x⟩ whose outputs σ(f1 (x)), . . . , σ(f4 (x)) feed into a second-stage unit σ(f5 (x))] Multilayer Perceptron aka “(Artificial) Neural Network”; Boosting, Decision Trees, Random Forests. ⇒ decisions depend non-linearly on x and wj .
  47. 47. Non-linearity: Data Preprocessing Idea 2) Preprocess the data: [plot in Cartesian coordinates (x, y)] This dataset is not (well) linearly separable. [plot in polar coordinates (r, θ)] This one is. In fact, both are the same dataset! Top: Cartesian coordinates. Bottom: polar coordinates.
  48. 48. Non-linearity: Data Preprocessing [plot in Cartesian coordinates (x, y): non-linear separation] [plot in polar coordinates (r, θ): linear separation] A linear classifier in polar space acts non-linearly in Cartesian space.
  49. 49. Generalized Linear Classifier Given X = {x1 , . . . , xn }, Y = {y1 , . . . , yn } and any (non-linear) feature map ϕ : Rk → Rm , solve the minimization for ϕ(x1 ), . . . , ϕ(xn ) instead of x1 , . . . , xn : min_{w∈Rm , ξi ∈R+} ‖w‖² + C Σi=1..n ξi subject to yi ⟨w, ϕ(xi )⟩ ≥ 1 − ξi for i = 1, . . . , n. The weight vector w now comes from the target space Rm . Distances/angles are measured by the inner product ⟨. , .⟩ in Rm . The classifier f (x) = ⟨w, ϕ(x)⟩ is linear in w, but non-linear in x.
  50. 50. Example Feature Mappings Polar coordinates: ϕ : (x, y) → ( √(x² + y²), ∠(x, y) ). d-th degree polynomials: ϕ : (x1 , . . . , xn ) → (1, x1 , . . . , xn , x1 ², . . . , xn ², . . . , x1 ᵈ, . . . , xn ᵈ). Distance map: ϕ : x → ( ‖x − p1 ‖, . . . , ‖x − pN ‖ ) for a set of N prototype vectors pi , i = 1, . . . , N .
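A small sketch of the polar-coordinate map from this slide, applied to a made-up dataset in which one class surrounds the other; after the mapping, a simple threshold on the radius separates the classes.

```python
import numpy as np

def polar_map(X):
    """Feature map (x, y) -> (r, theta) from the slide."""
    return np.column_stack([np.hypot(X[:, 0], X[:, 1]),
                            np.arctan2(X[:, 1], X[:, 0])])

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.hstack([rng.normal(1.0, 0.1, 100),      # outer ring: class +1
                   rng.normal(0.3, 0.1, 100)])     # inner cluster: class -1
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.hstack([np.ones(100), -np.ones(100)])

Z = polar_map(X)                                   # linearly separable in (r, theta)
print("accuracy of a threshold on r:", np.mean(np.sign(Z[:, 0] - 0.65) == y))
```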
  51. 51. Is this enough? In this example, changing the coordinates did help. Does this trick always work? Answer: If we do it right, yes! Lemma: Let (xi )i=1,...,n with xi ≠ xj for i ≠ j, and let ϕ : Rk → Rm be a feature map. If the set ϕ(xi )i=1,...,n is linearly independent, then the points ϕ(xi )i=1,...,n are linearly separable. Lemma: If we choose m > n large enough, we can always find such a map ϕ.
  53. 53. Is this enough? Caveat: We can separate any set, not just one with “reasonable” yi : There is a fixed feature map ϕ : R2 → R20001 such that – no matter how we label them – there is always a hyperplane classifier that has zero training error. Never rely on training accuracy! Don’t forget to regularize!
  55. 55. Representer Theorem Solve the soft-margin minimization for ϕ(x1 ), . . . , ϕ(xn ) ∈ Rm : min_{w∈Rm , ξi ∈R+} ‖w‖² + C Σi=1..n ξi (1) subject to yi ⟨w, ϕ(xi )⟩ ≥ 1 − ξi for i = 1, . . . , n. For large m, won’t solving for w ∈ Rm become impossible? No! Theorem (Representer Theorem): The minimizing solution w to problem (1) can always be written as w = Σj=1..n αj ϕ(xj ) for coefficients α1 , . . . , αn ∈ R.
  57. 57. Representer Theorem: Proof Theorem: if w solves (∗) min_{w∈Rm , ξi ∈R+} ‖w‖² + C Σi=1..n ξi subject to yi ⟨w, ϕ(xi )⟩ ≥ 1 − ξi , then w = Σj αj ϕ(xj ) for some α ∈ Rn . Proof: Let V := span{ϕ(x1 ), . . . , ϕ(xn )} = { Σj αj ϕ(xj ) : α ∈ Rn } ⊂ Rm and V⊥ := {u ∈ Rm : ⟨u, v⟩ = 0 for all v ∈ V }. Decompose w = wV + wV⊥ with wV ∈ V , wV⊥ ∈ V⊥ . From ⟨wV⊥ , ϕ(xi )⟩ = 0 it follows that ⟨wV , ϕ(xi )⟩ = ⟨w, ϕ(xi )⟩, and from ⟨wV , wV⊥ ⟩ = 0 that ‖w‖² = ‖wV ‖² + ‖wV⊥ ‖². Now we would like to show wV⊥ = 0. What if wV⊥ ≠ 0? Then ‖wV⊥ ‖² > 0, so ‖w‖² > ‖wV ‖², and wV would solve (∗) with the same ξi but a smaller norm. Contradiction! So wV⊥ = 0, ergo w = wV ∈ V . q.e.d.
  68. 68. Kernel Trick The representer theorem allows us to rewrite the optimization: min_{w∈Rm , ξi ∈R+} ‖w‖² + C Σi=1..n ξi subject to yi ⟨w, ϕ(xi )⟩ ≥ 1 − ξi for i = 1, . . . , n.
  69. 69. Kernel Trick Insert w = Σj=1..n αj ϕ(xj ) and minimize over the αi instead of w: min_{αi ∈R, ξi ∈R+} ‖ Σj=1..n αj ϕ(xj ) ‖² + C Σi=1..n ξi subject to yi Σj=1..n αj ⟨ϕ(xj ), ϕ(xi )⟩ ≥ 1 − ξi for i = 1, . . . , n. The former m-dimensional optimization is now n-dimensional.
  70. 70. Kernel Trick Use ‖w‖² = ⟨w, w⟩ and the linearity of ⟨·, ·⟩: min_{αi ∈R, ξi ∈R+} Σj,k=1..n αj αk ⟨ϕ(xj ), ϕ(xk )⟩ + C Σi=1..n ξi subject to yi Σj=1..n αj ⟨ϕ(xj ), ϕ(xi )⟩ ≥ 1 − ξi for i = 1, . . . , n. Note: ϕ only occurs in ⟨ϕ(·), ϕ(·)⟩ pairs.
  72. 72. Kernel Trick Set ⟨ϕ(x), ϕ(x′ )⟩ =: k(x, x′ ), called the kernel function: min_{αi ∈R, ξi ∈R+} Σj,k=1..n αj αk k(xj , xk ) + C Σi=1..n ξi subject to yi Σj=1..n αj k(xj , xi ) ≥ 1 − ξi for i = 1, . . . , n. To train, we only need to know the kernel matrix K with Kij := k(xj , xi ). To evaluate on new data x, we need the values k(x1 , x), . . . , k(xn , x): f (x) = ⟨w, ϕ(x)⟩ = Σi=1..n αi k(xi , x).
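A short sketch of the two quantities the slide identifies: the kernel matrix Kij for training and f(x) = Σi αi k(xi, x) for evaluation. The Gaussian kernel, the toy data, and the random coefficients αi are placeholders; in practice the αi come out of the SVM training.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """k(x, x') = exp(-gamma * ||x - x'||^2), evaluated for all pairs of rows."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))        # toy training samples
alpha = rng.normal(size=100)               # placeholder for the learned coefficients

K = rbf_kernel(X_train, X_train)           # n x n kernel matrix used during training
x_new = rng.normal(size=(3, 5))            # three new test points
f = rbf_kernel(x_new, X_train) @ alpha     # f(x) = sum_i alpha_i k(x_i, x)
print(K.shape, f)
```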
  74. 74. Dualization More elegant: dualization with Lagrangian multipliers αi gives the standard SVM dual: max_{0≤αi ≤C} Σi=1..n αi − ½ Σi,j=1..n yi yj αi αj k(xi , xj ) subject to Σi=1..n yi αi = 0. Support Vector Machine (SVM). The optimization can be solved numerically by any quadratic programming (QP) solver, but specialized software packages are more efficient.
  75. 75. Why use k(x, x′ ) instead of ⟨ϕ(x), ϕ(x′ )⟩? 1) Memory usage: Storing ϕ(x1 ), . . . , ϕ(xn ) requires O(nm) memory. Storing k(x1 , x1 ), . . . , k(xn , xn ) requires O(n²) memory. 2) Speed: We might find an expression for k(xi , xj ) that is faster to calculate than forming ϕ(xi ) and then ⟨ϕ(xi ), ϕ(xj )⟩. Example: comparing angles (x ∈ [0, 2π]): ϕ : x → (cos(x), sin(x)) ∈ R², so ⟨ϕ(xi ), ϕ(xj )⟩ = ⟨(cos(xi ), sin(xi )), (cos(xj ), sin(xj ))⟩ = cos(xi ) cos(xj ) + sin(xi ) sin(xj ) = cos(xi − xj ). Equivalently, but faster, without ϕ: k(xi , xj ) := cos(xi − xj ).
  78. 78. Why use k(x, x′ ) instead of ⟨ϕ(x), ϕ(x′ )⟩? 3) Flexibility: There are kernel functions k(xi , xj ) for which we know that a feature transformation ϕ exists, but we don’t know what ϕ is. How that??? Theorem: Let k : X × X → R be a positive definite kernel function. Then there exists a Hilbert Space H and a mapping ϕ : X → H such that k(x, x′ ) = ⟨ϕ(x), ϕ(x′ )⟩H , where ⟨. , .⟩H is the inner product in H.
  80. 80. Positive Definite Kernel Function Definition (Positive Definite Kernel Function) Let X be a non-empty set. A function k : X × X → R is called a positive definite kernel function iff k is symmetric, i.e. k(x, x′ ) = k(x′ , x) for all x, x′ ∈ X , and for any set of points x1 , . . . , xn ∈ X the matrix Kij = (k(xi , xj ))i,j is positive (semi-)definite, i.e. for all vectors t ∈ Rn : Σi,j=1..n ti Kij tj ≥ 0. Note: Instead of “positive definite kernel function”, we will often just say “kernel”.
  81. 81. Hilbert Spaces Definition (Hilbert Space) A Hilbert Space H is a vector space H with an inner product ⟨. , .⟩H , i.e. a mapping ⟨. , .⟩H : H × H → R which is symmetric: ⟨v, v′ ⟩H = ⟨v′ , v⟩H for all v, v′ ∈ H ; positive definite: ⟨v, v⟩H ≥ 0 for all v ∈ H , where ⟨v, v⟩H = 0 only for v = 0 ∈ H ; and bilinear: ⟨av, v′ ⟩H = a⟨v, v′ ⟩H for v ∈ H , a ∈ R, and ⟨v + v′ , v′′ ⟩H = ⟨v, v′′ ⟩H + ⟨v′ , v′′ ⟩H . We can treat a Hilbert space like some Rn if we only use concepts like vectors, angles, and distances. Note: dim H = ∞ is possible!
  82. 82. Kernels for Arbitrary Sets Theorem: Let k : X × X → R be a positive definite kernel function. Then there exists a Hilbert Space H and a mapping ϕ : X → H such that k(x, x′ ) = ⟨ϕ(x), ϕ(x′ )⟩H , where ⟨. , .⟩H is the inner product in H. Translation: Take any set X and any function k : X × X → R. If k is a positive definite kernel, then we can use k to learn a (soft) maximum-margin classifier for the elements in X ! Note: X can be any set, e.g. X = "all videos on YouTube" or X = "all permutations of {1, . . . , k}", or X = "the internet".
  83. 83. How to Check if a Function is a Kernel Problem: Checking whether a given k : X × X → R fulfills the conditions for a kernel is difficult: we need to prove or disprove Σi,j=1..n ti k(xi , xj )tj ≥ 0 for any set x1 , . . . , xn ∈ X and any t ∈ Rn , for any n ∈ N. Workaround: It is easy to construct functions k that are positive definite kernels.
  84. 84. Constructing Kernels 1) We can construct kernels from scratch: For any ϕ : X → Rm , k(x, x′ ) = ⟨ϕ(x), ϕ(x′ )⟩Rm is a kernel. Example: ϕ(x) = ("# of red pixels in image x", green, blue). Any norm ‖.‖ : V → R that fulfills the parallelogram equation ‖x + y‖² + ‖x − y‖² = 2‖x‖² + 2‖y‖² induces a kernel by polarization: k(x, y) := ½ ( ‖x + y‖² − ‖x‖² − ‖y‖² ). Example: X = time series with bounded values, ‖x‖² = Σt=1..∞ (1/2ᵗ) xt². If d : X × X → R is conditionally positive definite, i.e. Σi,j=1..n ti d(xi , xj )tj ≥ 0 for any t ∈ Rn with Σi ti = 0, for x1 , . . . , xn ∈ X and any n ∈ N, then k(x, x′ ) := exp(−d(x, x′ )) is a positive definite kernel. Example: d(x, x′ ) = ‖x − x′ ‖²_L2 , k(x, x′ ) = exp(−‖x − x′ ‖²_L2 ).
  87. 87. Constructing Kernels 2) We can construct kernels from other kernels: if k is a kernel and α > 0, then αk and k + α are kernels. If k1 , k2 are kernels, then k1 + k2 and k1 · k2 are kernels. If k is a kernel, then exp(k) is a kernel. Examples of kernels for X = Rd : any linear combination Σj αj kj with αj ≥ 0; polynomial kernels k(x, x′ ) = (1 + ⟨x, x′ ⟩)m , m > 0; Gaussian a.k.a. RBF k(x, x′ ) = exp( −‖x − x′ ‖² / (2σ²) ) with σ > 0. Examples of kernels for other X : k(h, h′ ) = Σi=1..n min(hi , h′i ) for n-bin histograms h, h′ ; k(p, p′ ) = exp(−KL(p, p′ )) with KL the symmetrized KL-divergence between positive probability distributions; k(s, s′ ) = exp(−D(s, s′ )) for strings s, s′ and D = edit distance. Not an example: tanh(a⟨x, x′ ⟩ + b) is not positive definite.
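A small numerical sketch of two kernels from this slide, the Gaussian/RBF kernel and the histogram-intersection kernel, together with a sanity check that the resulting kernel matrices are positive semi-definite; the data is random toy data.

```python
import numpy as np

def rbf(X, Y, sigma=1.0):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def hist_intersection(H1, H2):
    return np.minimum(H1[:, None, :], H2[None, :, :]).sum(-1)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))                       # toy feature vectors
H = rng.random((50, 8))
H /= H.sum(axis=1, keepdims=True)                   # toy 8-bin histograms

for K in (rbf(X, X), hist_intersection(H, H)):
    print("PSD up to round-off:", np.linalg.eigvalsh(K).min() >= -1e-8)
```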
  91. 91. Summary – Classification with Kernels (Generalized) linear classification with SVMs: conceptually simple, but powerful by using kernels. Kernels are at the same time similarity measures between arbitrary objects and inner products in a (hidden) feature space. Kernelization is the implicit application of a feature map: the method can become non-linear in the original data, yet it is still linear in some feature space ⇒ still somewhat intuitive/interpretable. We can build new kernels from explicit inner products, from distances, and from existing kernels. Kernels can be defined for arbitrary input data.
  96. 96. What did we not see? We have skipped a large part of the theory on kernel methods: Optimization: dualization, numerics, algorithms to train SVMs. Statistical interpretations: what are our assumptions on the samples? Generalization bounds: theoretical guarantees on what accuracy the classifier will have! This and much more in standard references, e.g. Schölkopf, Smola: “Learning with Kernels”, MIT Press (50€/60$); Shawe-Taylor, Cristianini: “Kernel Methods for Pattern Analysis”, Cambridge University Press (60€/75$).
  97. 97. Non-Linear SVMs in Practice SVMs with non-linear kernels are commonly used for small to medium sized Computer Vision problems. Software packages: libSVM: http://www.csie.ntu.edu.tw/˜cjlin/libsvm/ SVMlight: http://svmlight.joachims.org/ Training time is typically cubic in the number of training examples. Evaluation time: typically linear in the number of training examples. Classification accuracy is typically higher than with linear SVMs.
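A usage sketch with scikit-learn's SVC, a libSVM wrapper; scikit-learn and the toy data are assumptions, the slide itself only names libSVM and SVMlight. The second call shows the equivalent route via a precomputed kernel matrix.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (np.linalg.norm(X, axis=1) > 3.1).astype(int)    # toy non-linear labels

clf = SVC(kernel="rbf", C=1.0, gamma=0.1).fit(X, y)
print(clf.score(X, y), "support vectors:", len(clf.support_))

# same model via an explicitly precomputed kernel matrix
K = np.exp(-0.1 * ((X[:, None] - X[None, :]) ** 2).sum(-1))
print(SVC(kernel="precomputed", C=1.0).fit(K, y).score(K, y))
```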
  98. 98. Observation 1: Linear SVMs are very fast in training and evaluation. Observation 2: Non-linear kernel SVMs give better results, but do not scale well (with respect to the number of training examples). Can we combine the strengths of both approaches? Yes! By (approximately) going back to explicit feature maps. [S. Maji, A. Berg, J. Malik, "Classification using intersection kernel support vector machines is efficient", CVPR 2008] [A. Rahimi, "Random Features for Large-Scale Kernel Machines", NIPS 2008]
  100. 100. (Approximate) Explicit Feature Maps Core Facts: For every kernel k : X × X → R, there exists an (implicit) ϕ : X → H such that k(x, x′ ) = ⟨ϕ(x), ϕ(x′ )⟩. In case ϕ : X → RD , training a kernelized SVM yields the same prediction function as preprocessing the data (turning every x into ϕ(x)) and training a linear SVM on the new data. Problem: ϕ is generally unknown, and it does not map to RD . Idea: Approximate ϕ : X → H by an explicit ϕ̃ : X → RD such that k(x, x′ ) ≈ ⟨ϕ̃(x), ϕ̃(x′ )⟩.
  102. 102. Example: exact explicit feature map for the Hellinger kernel: kH (x, x′ ) = Σi=1..d √xi √x′i ← ϕH (x) := ( √x1 , . . . , √xd ). Example: approximate feature map for the χ²-kernel: kχ² (x, x′ ) = Σi=1..d xi x′i / (xi + x′i ) ≈ ⟨ϕ̃χ² (x), ϕ̃χ² (x′ )⟩ with ϕ̃χ² (x) := ( √xi , √(2πxi ) cos(log xi ), √(2πxi ) sin(log xi ) )i=1,...,d . Current method of choice for large-scale non-linear learning (see A. Zisserman’s lecture). [A. Vedaldi, A. Zisserman, "Efficient Additive Kernels via Explicit Feature Maps", CVPR 2010]
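A sketch of the exact Hellinger map from this slide: taking element-wise square roots of histograms makes a plain dot product equal to kH, so a linear SVM on √x behaves like a kernel SVM with kH. For the χ² case, scikit-learn's AdditiveChi2Sampler provides an approximate map in this spirit.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.random((100, 32))
H /= H.sum(axis=1, keepdims=True)             # toy 32-bin histograms

Phi = np.sqrt(H)                              # explicit feature map phi_H
K_explicit = Phi @ Phi.T                      # <phi_H(x), phi_H(x')>
K_hellinger = (np.sqrt(H[:, None]) * np.sqrt(H[None, :])).sum(-1)
print("explicit map reproduces the kernel:", np.allclose(K_explicit, K_hellinger))
```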
  104. 104. Selecting and Combining Kernels
  105. 105. Selecting From Multiple Kernels Typically, one has many different kernels to choose from: different functional forms (linear, polynomial, RBF, . . . ), different parameters (polynomial degree, Gaussian bandwidth, . . . ). Different image features give rise to different kernels: color histograms, SIFT bag-of-words, HOG, pyramid match, spatial pyramids, . . . How to choose? Ideally, based on the kernels’ performance on the task at hand: estimate it by cross-validation or validation-set error. Classically part of “Model Selection”.
  108. 108. Kernel Parameter Selection Remark: Model Selection makes a difference! Action Classification, KTH dataset (accuracy on test data): Dollár et al., VS-PETS 2005, “SVM classifier”: 80.66; Nowozin et al., ICCV 2007, “baseline RBF”: 85.19. Identical features, same kernel function; the difference: Nowozin used cross-validation for model selection (bandwidth and C). Message: never rely on default parameters!
  109. 109. Kernel Parameter Selection Rule of thumb for kernel parameters: for generalized Gaussian kernels k(x, x′ ) = exp( −d²(x, x′ ) / (2γ) ) with any distance d, set γ ≈ mediani,j=1,...,n d(xi , xj ). Many variants: mean instead of median; only d(xi , xj ) with yi = yj ; . . . In general, if there are several classes, then the kernel matrix Kij = k(xi , xj ) should have a block structure w.r.t. the classes.
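A sketch of the rule of thumb above: estimate γ as the median pairwise Euclidean distance of (toy) data and plug it into a Gaussian kernel; using all pairs rather than only selected label pairs is one of the variants mentioned.

```python
import numpy as np

def median_heuristic(X):
    """gamma ~ median of the pairwise distances d(x_i, x_j)."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    return np.median(d[np.triu_indices_from(d, k=1)])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
gamma = median_heuristic(X)
K = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1) / (2 * gamma))   # generalized Gaussian kernel
print("gamma =", gamma, " kernel matrix:", K.shape)
```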
  110. 110. [figure: kernel matrices Kij on the “two moons” dataset: the ideal label “kernel”, the linear kernel, and Gaussian kernels with γ = 0.001, 0.01, 0.1, 1, 10, 100, 1000, plus γ = 0.6 (rule of thumb) and γ = 1.6 (5-fold CV)]
  113. 113. Kernel Selection ↔ Kernel Combination Is really one of the kernels best? Kernels are typically designed to capture one aspect of the data: texture, color, edges, . . . Choosing one kernel means selecting exactly one such aspect. Combining aspects is often better than selecting one. Mean accuracy on the Oxford Flowers dataset [Gehler, Nowozin: ICCV 2009]: Colour 60.9 ± 2.1; Shape 70.2 ± 1.3; Texture 63.7 ± 2.7; HOG 58.5 ± 4.5; HSV 61.3 ± 0.7; siftint 70.6 ± 1.6; siftbdy 59.4 ± 3.3; combination 85.2 ± 1.5.
  115. 115. Combining Two Kernels For two kernels k1 , k2 : the product k = k1 k2 is again a kernel (problem: very small kernel values suppress large ones); the average k = ½ (k1 + k2 ) is again a kernel (problem: k1 , k2 may live on different scales; re-scale first?); any convex combination kβ = (1 − β)k1 + βk2 with β ∈ [0, 1] is a kernel. Model selection: cross-validate over β ∈ {0, 0.1, . . . , 1}.
116. Combining Many Kernels
Multiple kernels k1, ..., kK: all convex combinations are kernels:

    k = Σ_{j=1}^{K} βj kj   with βj ≥ 0 and Σ_{j=1}^{K} βj = 1.

Kernels can be "deactivated" by setting βj = 0.
A combinatorial explosion forbids cross-validation over all combinations of the βj.
Proxy: instead of CV, maximize the SVM objective. Each combined kernel induces a feature space. In which combined feature space can we best explain the training data and achieve a large margin between the classes?
117. Multiple Kernel Learning
Same for the soft margin with slack variables:

    min_{vj ∈ Hj, Σ_j βj = 1, βj ≥ 0, ξi ∈ R_+}   Σ_j (1/βj) ‖vj‖²_{Hj} + C Σ_i ξi        (2)

    subject to   yi Σ_j ⟨vj, ϕj(xi)⟩_{Hj} ≥ 1 − ξi    for i = 1, ..., n.                  (3)

This optimization problem is jointly convex in vj and βj. There is a unique global minimum, and we can find it efficiently!
Learning the kernel weights and the SVM classifier together like this is called Multiple Kernel Learning (MKL).
118. Combining Good Kernels
Observation: if all kernels are reasonable, simple combination methods work as well as more complex ones (and are much faster):

Single features:
    Method    Accuracy     Time
    Colour    60.9 ± 2.1   3 s
    Shape     70.2 ± 1.3   4 s
    Texture   63.7 ± 2.7   3 s
    HOG       58.5 ± 4.5   4 s
    HSV       61.3 ± 0.7   3 s
    siftint   70.6 ± 1.6   4 s
    siftbdy   59.4 ± 3.3   5 s

Combination methods:
    Method        Accuracy     Time
    product       85.5 ± 1.2   2 s
    averaging     84.9 ± 1.9   10 s
    CG-Boost      84.8 ± 2.2   1225 s
    MKL (SILP)    85.2 ± 1.5   97 s
    MKL (Simple)  85.2 ± 1.5   152 s
    LP-β          85.5 ± 3.0   80 s
    LP-B          85.4 ± 2.4   98 s

Mean accuracy and total runtime (model selection, training, testing) on the Oxford Flowers dataset [Gehler, Nowozin: ICCV 2009]
Message: never use MKL without comparing to simpler baselines!
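A hedged sketch of the two simplest baselines, averaging and multiplying precomputed kernel matrices before training an SVM; the per-feature kernels below are hypothetical stand-ins for the colour/shape/texture kernels in the table:

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel, rbf_kernel
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Hypothetical stand-ins for per-feature representations; in practice
# each kernel matrix would come from its own descriptor (color, shape, ...).
rng = np.random.RandomState(1)
X_color = np.abs(rng.randn(150, 32))
X_shape = np.abs(rng.randn(150, 64))
y = (X_color[:, 0] > X_shape[:, 0]).astype(int)

kernels = [chi2_kernel(X_color, gamma=1.0),
           rbf_kernel(X_shape, gamma=0.05)]

# Simple baselines: average and product of the base kernels.
K_avg = np.mean(kernels, axis=0)
K_prod = np.prod(kernels, axis=0)

for name, K in [("average", K_avg), ("product", K_prod)]:
    acc = cross_val_score(SVC(kernel="precomputed", C=10.0), K, y, cv=5).mean()
    print(f"{name:8s} kernel: CV accuracy = {acc:.3f}")
```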
119. Combining Good and Bad Kernels
Observation: if some kernels are helpful but others are not, the smarter techniques are better.
[Figure: accuracy (45–90%) as a function of the number of added noise features (0, 1, 5, 10, 25, 50) for product, average, CG-Boost, MKL (SILP or Simple), LP-β and LP-B.]
Mean accuracy and total runtime (model selection, training, testing) on the Oxford Flowers dataset [Gehler, Nowozin: ICCV 2009]
120. Summary — Kernel Selection and Combination
Model selection is important to achieve the highest accuracy.
Combining several kernels is often superior to selecting one.
Simple techniques often work as well as more complex ones: always try single best, averaging, and product first.
121. Other Kernel Methods
122. Multiclass Classification – One-versus-rest SVM
Classification problems with M classes:
    Training samples {x1, ..., xn} ⊂ X,
    Training labels {y1, ..., yn} ⊂ {1, ..., M},
    Task: learn a prediction function f : X → {1, ..., M}.
123. Multiclass Classification – One-versus-rest SVM
Classification problems with M classes:
    Training samples {x1, ..., xn} ⊂ X,
    Training labels {y1, ..., yn} ⊂ {1, ..., M},
    Task: learn a prediction function f : X → {1, ..., M}.
One-versus-rest SVM: train one SVM F_m for each class m:
    all samples with class label m are positive examples,
    all other samples are negative examples.
Classify by finding the maximal response:

    f(x) = argmax_{m=1,...,M} F_m(x)

Advantage: easy to implement, works well.
Disadvantage: with many classes, the training sets become unbalanced.
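A minimal one-versus-rest sketch built from binary kernel SVMs (scikit-learn's SVC is used as the binary learner; the toy data and RBF parameters are assumptions for illustration):

```python
import numpy as np
from sklearn.svm import SVC

def train_one_vs_rest(X, y, classes, **svc_params):
    """Train one binary SVM F_m per class m (class m vs. the rest)."""
    return {m: SVC(**svc_params).fit(X, (y == m).astype(int)) for m in classes}

def predict_one_vs_rest(classifiers, X):
    """f(x) = argmax_m F_m(x), using the real-valued SVM responses."""
    classes = sorted(classifiers)
    scores = np.column_stack([classifiers[m].decision_function(X) for m in classes])
    return np.asarray(classes)[np.argmax(scores, axis=1)]

# Toy usage: three angular sectors in the plane, RBF kernel with assumed parameters.
rng = np.random.RandomState(0)
X = rng.randn(300, 2)
y = (np.arctan2(X[:, 1], X[:, 0]) // (2 * np.pi / 3)).astype(int) % 3

clfs = train_one_vs_rest(X, y, classes=[0, 1, 2], kernel="rbf", gamma=1.0, C=10.0)
print("training accuracy:", np.mean(predict_one_vs_rest(clfs, X) == y))
```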
124. Multiclass Classification – All-versus-all SVM
Classification problems with M classes:
    Training samples {x1, ..., xn} ⊂ X,
    Training labels {y1, ..., yn} ⊂ {1, ..., M},
    Task: learn a prediction function f : X → {1, ..., M}.
125. Multiclass Classification – All-versus-all SVM
Classification problems with M classes:
    Training samples {x1, ..., xn} ⊂ X,
    Training labels {y1, ..., yn} ⊂ {1, ..., M},
    Task: learn a prediction function f : X → {1, ..., M}.
All-versus-all SVM: train one SVM F_ij for each pair of classes 1 ≤ i < j ≤ M, i.e. M(M−1)/2 SVMs in total.
Classify by voting:

    f(x) = argmax_{m=1,...,M} #{ i ∈ {1, ..., M} : F_{m,i}(x) > 0 }

(writing F_{j,i} = −F_{i,j} for j > i and F_{j,j} = 0).
Advantage: the SVM problems stay small and balanced; also works well.
Disadvantage: the number of classifiers grows quadratically with the number of classes.
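The corresponding all-versus-all sketch with pairwise voting, again on assumed toy data and parameters:

```python
import itertools
import numpy as np
from sklearn.svm import SVC

def train_all_vs_all(X, y, classes, **svc_params):
    """Train one binary SVM F_ij per class pair (i, j), i < j."""
    classifiers = {}
    for i, j in itertools.combinations(classes, 2):
        mask = np.isin(y, [i, j])
        classifiers[(i, j)] = SVC(**svc_params).fit(X[mask], (y[mask] == j).astype(int))
    return classifiers

def predict_all_vs_all(classifiers, X, classes):
    """Each pairwise classifier votes for one class; predict the majority vote."""
    votes = np.zeros((len(X), len(classes)), dtype=int)
    index = {m: k for k, m in enumerate(classes)}
    for (i, j), clf in classifiers.items():
        pred_j = clf.predict(X).astype(bool)     # True -> vote for j, False -> vote for i
        votes[pred_j, index[j]] += 1
        votes[~pred_j, index[i]] += 1
    return np.asarray(classes)[np.argmax(votes, axis=1)]

# Toy usage: 4 quadrant classes -> 6 pairwise SVMs (hypothetical parameters).
rng = np.random.RandomState(0)
X = rng.randn(300, 2)
y = ((X[:, 0] > 0) * 2 + (X[:, 1] > 0)).astype(int)
clfs = train_all_vs_all(X, y, classes=[0, 1, 2, 3], kernel="rbf", gamma=1.0, C=10.0)
print("training accuracy:", np.mean(predict_all_vs_all(clfs, X, classes=[0, 1, 2, 3]) == y))
```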
126. The "Kernel Trick"
Many algorithms besides SVMs can make use of kernels: any algorithm that uses the data only in the form of inner products can be kernelized.
How to kernelize an algorithm:
    Write the algorithm in terms of scalar products only.
    Replace every scalar product by a kernel function evaluation.
The resulting algorithm does the same as the linear version, but in the (hidden) feature space H.
Caveat: working in H is no guarantee of better performance. A good choice of k and model selection are important!
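As a small illustration of this recipe (not an algorithm from the slides), a nearest-centroid classifier can be kernelized by expanding the squared distance to each class mean into kernel evaluations, ‖ϕ(x) − c_m‖² = k(x, x) − (2/n_m) Σ_{i: y_i = m} k(x, x_i) + (1/n_m²) Σ_{i,j: y_i = y_j = m} k(x_i, x_j):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kernel_nearest_centroid(X_train, y_train, X_test, kernel=rbf_kernel):
    """Assign each test point to the class whose feature-space mean is closest;
    all distances are expressed through kernel evaluations only."""
    classes = np.unique(y_train)
    K_tt = kernel(X_test, X_train)       # k(x, x_i)
    K_rr = kernel(X_train, X_train)      # k(x_i, x_j)
    dists = []
    for m in classes:
        idx = np.where(y_train == m)[0]
        # ||phi(x) - c_m||^2 = k(x,x) - 2/n_m sum_i k(x,x_i) + 1/n_m^2 sum_ij k(x_i,x_j);
        # the k(x,x) term is identical for every class, so it is dropped for the argmin.
        cross = -2.0 * K_tt[:, idx].mean(axis=1)
        within = K_rr[np.ix_(idx, idx)].mean()
        dists.append(cross + within)
    return classes[np.argmin(np.column_stack(dists), axis=1)]

# Toy usage (hypothetical data: two Gaussian blobs).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + [2, 0], rng.randn(50, 2) - [2, 0]])
y = np.repeat([0, 1], 50)
print(kernel_nearest_centroid(X, y, np.array([[2.0, 0.0], [-2.0, 0.0]])))
```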
127. Linear Regression
Given samples xi ∈ R^d and function values yi ∈ R.
Find a linear function f(x) = ⟨w, x⟩ that approximates the values.
Interpolation error: ei := (yi − ⟨w, xi⟩)²
Solve, for λ ≥ 0:

    min_{w ∈ R^d}  Σ_{i=1}^{n} ei + λ‖w‖²

[Figure: 1-D example data for regression.]
128. Linear Regression
Given samples xi ∈ R^d and function values yi ∈ R.
Find a linear function f(x) = ⟨w, x⟩ that approximates the values.
Interpolation error: ei := (yi − ⟨w, xi⟩)²
Solve, for λ ≥ 0:

    min_{w ∈ R^d}  Σ_{i=1}^{n} ei + λ‖w‖²

Very popular, because it has a closed-form solution:

    w = X (λI_n + X^T X)^{-1} y

where I_n is the n × n identity matrix, X = (x1, ..., xn) ∈ R^{d×n} is the data matrix (one sample per column), and y = (y1, ..., yn)^t is the target vector.
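A direct numpy transcription of the closed form, treating X as a d × n matrix with one sample per column as above (toy 1-D data; λ is an arbitrary placeholder):

```python
import numpy as np

# Toy 1-D regression data (hypothetical).
rng = np.random.RandomState(0)
x = np.linspace(0, 6, 20)
y = 0.5 * x + 0.3 * rng.randn(20)

X = x.reshape(1, -1)          # d x n data matrix, one sample per column
lam = 0.1                     # regularization parameter lambda >= 0
n = X.shape[1]

# w = X (lambda I_n + X^T X)^{-1} y
w = X @ np.linalg.solve(lam * np.eye(n) + X.T @ X, y)

def f(x_new):                 # f(x) = <w, x>
    return w @ np.atleast_2d(x_new)

print("w =", w, " f(3.0) =", f(3.0))
```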
129.–130. Linear Regression
Given samples xi ∈ R^d and function values yi ∈ R.
Find a linear function f(x) = ⟨w, x⟩ that approximates the values.
Interpolation error: ei := (yi − ⟨w, xi⟩)²
Solve, for λ ≥ 0:

    min_{w ∈ R^d}  Σ_{i=1}^{n} ei + λ‖w‖²

Very popular, because it has a closed-form solution:

    w = X (λI_n + X^T X)^{-1} y

where I_n is the n × n identity matrix, X = (x1, ..., xn) ∈ R^{d×n} is the data matrix, and y = (y1, ..., yn)^t is the target vector.
What about non-linear data?
131. Nonlinear Regression
Given xi ∈ X, yi ∈ R, i = 1, ..., n, and a feature map ϕ : X → H.
Find an approximating function f(x) = ⟨w, ϕ(x)⟩_H (non-linear in x, linear in w).
Interpolation error: ei := (yi − ⟨w, ϕ(xi)⟩_H)²
Solve, for λ ≥ 0:

    min_{w ∈ H}  Σ_{i=1}^{n} ei + λ‖w‖²

[Figure: 1-D example data with a nonlinear regression fit.]
132. Nonlinear Regression
Given xi ∈ X, yi ∈ R, i = 1, ..., n, and a feature map ϕ : X → H.
Find an approximating function f(x) = ⟨w, ϕ(x)⟩_H (non-linear in x, linear in w).
Interpolation error: ei := (yi − ⟨w, ϕ(xi)⟩_H)²
Solve, for λ ≥ 0:

    min_{w ∈ H}  Σ_{i=1}^{n} ei + λ‖w‖²

The closed-form solution is still valid:

    w = Φ (λI_n + Φ^T Φ)^{-1} y

where I_n is the n × n identity matrix and Φ = (ϕ(x1), ..., ϕ(xn)).
133. Example: Kernel Ridge Regression
What if H and ϕ : X → H are given only implicitly by a kernel function?
We cannot store the closed-form solution vector w ∈ H,

    w = Φ (λI_n + Φ^T Φ)^{-1} y,

but we can still evaluate f : X → R:

    f(x) = ⟨w, ϕ(x)⟩ = ⟨Φ (λI_n + Φ^T Φ)^{-1} y, ϕ(x)⟩ = y^t (λI_n + K)^{-1} κ(x)

with K = Φ^T Φ (the kernel matrix) and κ(x) = (k(x1, x), ..., k(xn, x))^t.
This is called Kernel Ridge Regression.
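A compact numpy sketch of exactly this formula with a Gaussian kernel; λ and γ below are placeholder values:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def krr_fit(X_train, y_train, lam=0.1, gamma=1.0):
    """Precompute alpha = (lambda I_n + K)^{-1} y for kernel ridge regression."""
    K = rbf_kernel(X_train, gamma=gamma)
    alpha = np.linalg.solve(lam * np.eye(len(X_train)) + K, y_train)
    return alpha

def krr_predict(X_train, alpha, X_test, gamma=1.0):
    """f(x) = y^t (lambda I_n + K)^{-1} kappa(x) = sum_i alpha_i k(x_i, x)."""
    kappa = rbf_kernel(X_test, X_train, gamma=gamma)   # rows: kappa(x)^t
    return kappa @ alpha

# Toy usage: fit a noisy sine curve (hypothetical data and parameters).
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, size=(40, 1)), axis=0)
y = np.sin(X[:, 0]) + 0.1 * rng.randn(40)

alpha = krr_fit(X, y, lam=0.1, gamma=2.0)
print(krr_predict(X, alpha, np.array([[1.5], [4.5]]), gamma=2.0))
```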
134. Nonlinear Regression
Like least-squares regression, (kernel) ridge regression is sensitive to outliers, because the quadratic loss heavily penalizes large residuals.
[Figure: 1-D regression fits illustrating how a single outlier distorts the fitted curve.]
135. Nonlinear Regression
Support vector regression with the ε-insensitive loss is more robust:
[Figure: quadratic loss vs. ε-insensitive loss (flat on [−ε, ε]) as a function of the residual r, and the corresponding regression fits on the same 1-D data.]
136. Support Vector Regression
Optimization problem similar to the support vector machine:

    min_{w ∈ H, ξ1,...,ξn ∈ R_+, ξ1*,...,ξn* ∈ R_+}  ‖w‖² + C Σ_{i=1}^{n} (ξi + ξi*)

    subject to  yi − ⟨w, ϕ(xi)⟩ ≤ ε + ξi,     for i = 1, ..., n,
                ⟨w, ϕ(xi)⟩ − yi ≤ ε + ξi*,    for i = 1, ..., n.

Again, we can use a representer theorem and obtain w = Σ_j αj ϕ(xj).
137. Support Vector Regression
Optimization problem similar to the support vector machine:

    min_{α1,...,αn ∈ R, ξi, ξi* ∈ R_+}  Σ_{i,j=1}^{n} αi αj ⟨ϕ(xi), ϕ(xj)⟩ + C Σ_{i=1}^{n} (ξi + ξi*)

    subject to  yi − Σ_j αj ⟨ϕ(xj), ϕ(xi)⟩ ≤ ε + ξi,     for i = 1, ..., n,
                Σ_j αj ⟨ϕ(xj), ϕ(xi)⟩ − yi ≤ ε + ξi*,    for i = 1, ..., n.

Rewrite in terms of kernel evaluations k(x, x′) = ⟨ϕ(x), ϕ(x′)⟩.
138. Support Vector Regression
Optimization problem similar to the support vector machine:

    min_{α1,...,αn ∈ R, ξi, ξi* ∈ R_+}  Σ_{i,j=1}^{n} αi αj k(xi, xj) + C Σ_{i=1}^{n} (ξi + ξi*)

    subject to  yi − Σ_j αj k(xj, xi) ≤ ε + ξi,     for i = 1, ..., n,
                Σ_j αj k(xj, xi) − yi ≤ ε + ξi*,    for i = 1, ..., n.

Regression function: f(x) = ⟨w, ϕ(x)⟩ = Σ_j αj k(xj, x).
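In practice this program is rarely solved by hand; a hedged sketch using scikit-learn's SVR, which solves an equivalent ε-insensitive formulation (C, ε and γ below are placeholder choices):

```python
import numpy as np
from sklearn.svm import SVR

# Noisy sine data with a single gross outlier (hypothetical).
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, size=(50, 1)), axis=0)
y = np.sin(X[:, 0]) + 0.1 * rng.randn(50)
y[25] += 3.0                                # outlier

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=1.0)
svr.fit(X, y)

# The epsilon-insensitive loss keeps the fit close to the bulk of the data.
print(svr.predict(np.array([[1.5], [4.5]])))
print("number of support vectors:", len(svr.support_))
```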
139. Example – Head Pose Estimation
    Detect faces in the image.
    Compute a gradient representation of the face region.
    Train support vector regression for yaw and tilt (separately).
[Li, Gong, Sherrah, Liddell: "Support vector machine based multi-view face detection and recognition", IVC 2004]
140. Outlier/Anomaly Detection in R^d
For unlabeled data, we want to detect outliers, i.e. samples that lie far away from most of the other samples.
For samples x1, ..., xn, find the smallest ball (center c, radius R) that contains "most" of the samples.
141. Outlier/Anomaly Detection in R^d
For unlabeled data, we want to detect outliers, i.e. samples that lie far away from most of the other samples.
For samples x1, ..., xn, find the smallest ball (center c, radius R) that contains "most" of the samples. Solve

    min_{R ∈ R, c ∈ R^d, ξi ∈ R_+}  R² + (1/(νn)) Σ_i ξi

    subject to  ‖xi − c‖² ≤ R² + ξi   for i = 1, ..., n.

ν ∈ (0, 1) upper-bounds the fraction of "outliers".
142. Outlier/Anomaly Detection for Arbitrary Inputs
Use a kernel k : X × X → R with an implicit feature map ϕ : X → H, and do outlier detection on ϕ(x1), ..., ϕ(xn):
Find the smallest ball (center c ∈ H, radius R) that contains "most" of the samples. Solve

    min_{R ∈ R, c ∈ H, ξi ∈ R_+}  R² + (1/(νn)) Σ_i ξi

    subject to  ‖ϕ(xi) − c‖² ≤ R² + ξi   for i = 1, ..., n.

Representer theorem: c = Σ_j αj ϕ(xj), so everything can be written using only k(xi, xj).
This algorithm is called Support Vector Data Description (SVDD).
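scikit-learn does not provide SVDD directly, but its OneClassSVM implements the closely related ν-one-class SVM, which for kernels with constant k(x, x), such as the Gaussian, yields an equivalent solution; a sketch with placeholder ν and γ:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Mostly "normal" samples plus a few far-away outliers (hypothetical data).
rng = np.random.RandomState(0)
X_normal = rng.randn(95, 2)
X_outlier = rng.uniform(-6, 6, size=(5, 2)) + 8.0
X = np.vstack([X_normal, X_outlier])

# nu upper-bounds the fraction of training points treated as outliers.
detector = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.2).fit(X)

pred = detector.predict(X)          # +1 = inside the ball, -1 = outlier
print("flagged as outliers:", np.where(pred == -1)[0])
```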
