- 1. Kernel Methods in Computer Vision Christoph Lampert Institute of Science and Technology Austria Vienna (Klosterneuburg), Austria Microsoft Computer Vision School, Moscow, 2011
- 2. Kernel Methods in Computer VisionOverview...Monday: 9:00 – 10:00 Introduction to Kernel Classiﬁers 10:00 – 10:10 — mini break — 10:10 – 10:50 More About Kernel MethodsTuesday: 9:00 – 9:50 Conditional Random Fields 9:50 – 10:00 — mini break — 10:00 – 10:50 Structured Support Vector Machines Slides available on my home page: http://www.ist.ac.at/~chl
- 3. Extended versions of the lectures in book form Foundations and Trends in Computer Graphics and Vision, now publisher www.nowpublishers.com/ Available as PDFs on my homepage
- 4. A Machine Learning View on Multimedia Problems • Face Detection • Object Recognition Classiﬁcation • Action Classiﬁcation • Sign Language Recognition • Image Retrieval • Pose Estimation Regression • Stereo Reconstruction • Image Denoising • Anomaly Detection in Videos Outlier Detection • Video Summarization • Image Segmentation Clustering • Image Duplicate Detection
- 5. A Machine Learning View on Multimedia Problems • ... Classiﬁcation • Optical Character Recognition • ... It’s diﬃcult to program a solution to this. if (I[0,5]<128) & (I[0,6] > 192) & (I[0,7] < 128): return ’A’ elif (I[7,7]<50) & (I[6,3] != 0): return ’Q’ else: print "I don’t know this letter."
- 6. A Machine Learning View on Multimedia Problems • ... Classiﬁcation • Optical Character Recognition • ... It’s diﬃcult to program a solution to this. if (I[0,5]<128) & (I[0,6] > 192) & (I[0,7] < 128): return ’A’ elif (I[7,7]<50) & (I[6,3] != 0): return ’Q’ else: print "I don’t know this letter." With Machine Learning, we can avoid this: We don’t program a solution to the speciﬁc problem. We program a generic classiﬁcation program. We solve the problem by training the classiﬁer with examples.
- 7. A Machine Learning View on Multimedia Problems • ... Classiﬁcation • Object Category Recognition • ... It’s diﬃcult impossible to program a solution to this. if ??? With Machine Learning, we can avoid this: We don’t program a solution to the speciﬁc problem. We re-use our previous classiﬁer. We solve the problem by training the classiﬁer with examples.
- 8. Introduction to Kernel Classiﬁers
- 9. Linear Classiﬁcation Separate these two sample sets. height width
- 10. Linear Classiﬁcation Separate these two sample sets. height width
- 11. Linear Classiﬁcation Separate these two sample sets.
- 12. Linear Classiﬁcation Separate these two sample sets. height width Linear Classiﬁcation
- 13. Why linear? Linear techniques have been studied for centuries. Why? They are fast and easy to solve: elementary maths, even closed form solutions, typically involving only matrix operations. They are intuitive: simple is good, the solution can be derived geometrically, the solution corresponds to common sense. They often work very well: many physical processes are linear, almost all natural functions are smooth, and smooth functions can be approximated, at least locally, by linear functions.
- 14. Why linear? Linear techniques have been studied for centuries. Why? They are fast and easy to solve: elementary maths, even closed form solutions, typically involving only matrix operations. They are intuitive: simple is good, the solution can be derived geometrically, the solution corresponds to common sense. They often work very well: many physical processes are linear, almost all natural functions are smooth, and smooth functions can be approximated, at least locally, by linear functions. The whole lecture will be about linear methods (with a twist).
- 15. Notation... data points X = {x1 , . . . , xn }, xi ∈ Rd , (images) class labels Y = {y1 , . . . , yn }, yi ∈ {+1, −1}, (cat or no cat) goal: classiﬁcation rule g : Rd → {−1, +1}.
- 16. Notation... data points X = {x1 , . . . , xn }, xi ∈ Rd , (images) class labels Y = {y1 , . . . , yn }, yi ∈ {+1, −1}, (cat or no cat) goal: classiﬁcation rule g : Rd → {−1, +1}. parameterize g(x) = sign f (x) with f : Rd → R: f (x) = a1 x1 + a2 x2 + . . . + ad xd + a0 simplify notation: x̂ = (1, x), ŵ = (a0 , . . . , ad ): f (x) = ⟨ŵ, x̂⟩ (inner product in Rd+1 ) out of laziness, we just write f (x) = ⟨w, x⟩ with x, w ∈ Rd .
- 17. Example: Linear Classiﬁcation Given X = {x1 , . . . , xn }, Y = {y1 , . . . , yn }. 2.0 1.0 0.0 −1.0 0.0 1.0 2.0 3.0
- 18. Example: Linear Classiﬁcation Given X = {x1 , . . . , xn }, Y = {y1 , . . . , yn }. Any w partitions the data space into two half-spaces by means of f (x) = w, x . 2.0 f (x) > 0 f (x) < 0 1.0 w 0.0 −1.0 0.0 1.0 2.0 3.0
- 19. Example: Linear Classiﬁcation Given X = {x1 , . . . , xn }, Y = {y1 , . . . , yn }. Any w partitions the data space into two half-spaces by means of f (x) = w, x . 2.0 f (x) > 0 f (x) < 0 1.0 w 0.0 −1.0 0.0 1.0 2.0 3.0 “How to ﬁnd w?”
- 20. Example: Ad Hoc Linear Classiﬁcation Because everything is linear, we only need Linear Algebra! X = (x1 , . . . , xn ) ∈ Rd×n , y = (y1 , . . . , yn ) ∈ Rn , remember ⟨w, x⟩ = w^t x. Then (f (x1 ), . . . , f (xn )) = (⟨w, x1 ⟩, . . . , ⟨w, xn ⟩) = (w^t x1 , . . . , w^t xn ) = w^t (x1 , . . . , xn ) = w^t X. We want f (xi ) > 0 for yi = +1 and f (xi ) < 0 for yi = −1. Let’s just try f (xi ) = yi and solve w^t X = y ⇒ w^t XX^t = yX^t ⇒ w^t = yX^t (XX^t )−1 (dimensions: 1×d and d×d).
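A minimal numpy sketch of this least-squares construction (the toy data and all variable names are illustrative, not from the slides):

```python
import numpy as np

# Toy 2D data: columns of X are the samples x_i (d x n), y holds the +-1 labels.
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2.0, 1.5], scale=0.3, size=(20, 2))
X_neg = rng.normal(loc=[0.5, 0.5], scale=0.3, size=(20, 2))
X = np.vstack([X_pos, X_neg]).T                      # d x n data matrix
y = np.concatenate([np.ones(20), -np.ones(20)])

# "Ad hoc" classifier: solve w^t X = y  =>  w^t = y X^t (X X^t)^{-1}
w = np.linalg.solve(X @ X.T, X @ y)                  # = (X X^t)^{-1} X y, the same w transposed

f = w @ X                                            # f(x_i) = <w, x_i>
print("training accuracy:", np.mean(np.sign(f) == y))
```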
- 21. Does it work? w t = yX t (XX t )−1 2.0 1.0 0.0 −1.0 0.0 1.0 2.0 3.0
- 22. Does it work? −1 t t t −1 t 47.82 27.75 −0.83 w = yX (XX ) = −7.97 9.37 = 27.75 28.33 1.14 2.0 1.0 0.0 −1.0 0.0 1.0 2.0 3.0 Looks good. But is it the best w?
- 23. Criteria for Linear Classiﬁcation What properties should an optimal w have? Given X = {x1 , . . . , xn }, Y = {y1 , . . . , yn }. 2.0 2.0 1.0 1.0 0.0 0.0 −1.0 0.0 1.0 2.0 3.0 −1.0 0.0 1.0 2.0 3.0 Are these the best? No, they misclassify many examples. Criterion 1: Enforce sign w, xi = yi for i = 1, . . . , n.
- 24. Criteria for Linear Classiﬁcation What properties should an optimal w have? Given X = {x1 , . . . , xn }, Y = {y1 , . . . , yn }. What’s the best w? 2.0 2.0 1.0 1.0 0.0 0.0 −1.0 0.0 1.0 2.0 3.0 −1.0 0.0 1.0 2.0 3.0 Are these the best? No, they would be “risky” for future samples. Criterion 2: Ensure sign w, x = y for future (x, y) as well.
- 25. Criteria for Linear Classiﬁcation Given X = {x1 , . . . , xn }, Y = {y1 , . . . , yn }. Assume that future samples are similar to current ones. What’s the best w? 2.0 ρ ρ 1.0 0.0 −1.0 0.0 1.0 2.0 3.0 Maximize “robustness”: use w such that we can maximally perturb the input samples without introducing misclassiﬁcations.
- 26. Criteria for Linear Classiﬁcation Given X = {x1 , . . . , xn }, Y = {y1 , . . . , yn }. Assume that future samples are similar to current ones. What’s the best w? [Figure: decision boundaries with a margin region of width ρ] Maximize “robustness”: use w such that we can maximally perturb the input samples without introducing misclassiﬁcations. Central quantity: margin(x) = distance of x to decision hyperplane = ⟨w/‖w‖, x⟩
- 27. Maximum Margin ClassiﬁcationMaximum-margin solution is determined by a maximization problem: max γ w∈Rd ,γ∈R+subject to sign w, xi = yi for i = 1, . . . n. w , xi ≥γ for i = 1, . . . n. wClassify new samples using f (x) = w, x .
- 28. Maximum Margin ClassiﬁcationMaximum-margin solution is determined by a maximization problem: max γ w∈Rd , w =1 γ∈Rsubject to yi w, xi ≥ γ for i = 1, . . . n.Classify new samples using f (x) = w, x .
- 29. Maximum Margin ClassiﬁcationWe can rewrite this as a minimization problem: 2 min w w∈Rdsubject to yi w, xi ≥ 1 for i = 1, . . . n.Classify new samples using f (x) = w, x .
- 30. Maximum Margin ClassiﬁcationFrom the view of optimization theory 2 min w w∈Rdsubject to yi w, xi ≥ 1 for i = 1, . . . nis rather easy: The objective function is diﬀerentiable and convex. The constraints are all linear.We can ﬁnd the globally optimal w in O(d 3 ) (usually faster). There are no local minima. We have a deﬁnite stopping criterion.
- 31. Linear Separability What is the best w for this dataset?
- 32. Linear Separability What is the best w for this dataset? Not this.
- 33. Linear Separability What is the best w for this dataset? Not this.
- 34. Linear Separability What is the best w for this dataset? Not this.
- 35. Linear Separability What is the best w for this dataset? Not this.
- 36. Linear Separability The problem 2 minw∈Rd w subject to y1 w, x1 ≥ 1 y2 w, x2 ≥ 1 y3 w, x3 ≥ 1 y4 w, x4 ≥ 1 has no solution. The constraints con- tradict each other! We cannot ﬁnd a maximum-margin hyperplane here, because there is none. To ﬁx this, we must allow hyperplanes that make mistakes.
- 37. Linear Separability What is the best w for this dataset? 2.0 1.0 0.0 −1.0 0.0 1.0 2.0 3.0
- 38. Linear Separability What is the best w for this dataset? 2.0 rg in on ma vio l ati ρ n rgi ma 1.0 ξi xi 0.0 −1.0 0.0 1.0 2.0 3.0 Possibly this one, even though one sample is misclassiﬁed.
- 39. Linear Separability What is the best w for this dataset? 2.0 1.0 0.0 −1.0 0.0 1.0 2.0 3.0
- 40. Linear Separability What is the best w for this dataset? 2.0 1.0 0.0 −1.0 0.0 1.0 2.0 3.0 Maybe not this one, even though all points are classiﬁed correctly.
- 41. Linear Separability What is the best w for this dataset? 2.0 rg in ma ρ ξi 1.0 0.0 −1.0 0.0 1.0 2.0 3.0 Trade-oﬀ: large margin vs. few mistakes on training set
- 42. Soft-Margin Classiﬁcation Mathematically, we formulate the trade-oﬀ by slack-variables ξi : min_{w∈Rd , ξi ∈R+} ‖w‖² + C Σ_{i=1}^n ξi subject to yi ⟨w, xi ⟩ ≥ 1 − ξi for i = 1, . . . , n. We can fulﬁll every constraint by choosing ξi large enough. The larger ξi , the larger the objective (that we try to minimize). C is a regularization/trade-oﬀ parameter: small C → constraints are easily ignored; large C → constraints are hard to ignore; C = ∞ → hard margin case → no errors on training set. Note: The problem is still convex and eﬃciently solvable. Note: This is not a hack. There are strong theoretical justiﬁcations, e.g. generalization bounds. see e.g. [Shawe-Taylor, Cristianini: "Kernel Methods for Pattern Analysis", Cambridge University Press, 2004]
- 43. Solving for Soft-Margin Solution Reformulate constraints into the objective function with Hinge loss: min_{w∈Rd} ‖w‖² + C Σ_{i=1}^n [1 − yi ⟨w, xi ⟩]+ where [t]+ := max{0, t}. Linear Support Vector Machine (linear SVM) Now unconstrained optimization, but non-diﬀerentiable function. Solve, e.g., by subgradient-descent (see tomorrow’s lecture)
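A rough subgradient-descent sketch of this unconstrained hinge-loss formulation (learning rate, iteration count, and the toy data are arbitrary choices, not prescribed by the slides; a bias term is omitted — the slides absorb it by augmenting x, cf. slide 16):

```python
import numpy as np

def train_linear_svm(X, y, C=10.0, lr=0.01, epochs=200):
    """Minimize ||w||^2 + C * sum_i max(0, 1 - y_i <w, x_i>) by subgradient descent.
    X: n x d array of samples (rows), y: array of +-1 labels."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        active = y * (X @ w) < 1.0                        # samples with non-zero hinge loss
        grad = 2.0 * w - C * (y[active, None] * X[active]).sum(axis=0)
        w -= lr * grad
    return w

# Tiny synthetic example
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1.5, 0.4, (30, 2)), rng.normal(-1.5, 0.4, (30, 2))])
y = np.concatenate([np.ones(30), -np.ones(30)])
w = train_linear_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```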
- 44. Linear SVMs in Practice Linear SVMs are currently the most frequently used classiﬁers in Computer Vision. Eﬃcient software packages exist: liblinear: http://www.csie.ntu.edu.tw/∼cjlin/liblinear/ SVMperf: http://www.cs.cornell.edu/People/tj/svm_light/svm_perf.html see also: Pegasos:, http://www.cs.huji.ac.il/∼shais/code/ see also: sgd:, http://leon.bottou.org/projects/sgd Training time: approximately linear in number of training examples, data dimensionality Evaluation time: linear in data dimensionality [O. Chapelle, "Training a support vector machine in the primal", Neural Computation, 2007]
- 45. Linear Separability So, what is the best soft-margin w for this dataset? 1.5 y 0.5 x −0.5 −1.5 −2.0 −1.0 0.0 1.0 2.0 None. We need something non-linear!
- 46. Non-Linear Classiﬁcation: Stacking Idea 1) Use classiﬁer outputs as input to other classiﬁer: fi=<wi,x> σ(f5(x)) σ(f1(x)) σ(f2(x)) σ(f3(x)) σ(f4(x)) Multilayer Perceptron aka “(Artiﬁcial) Neural Network” Boosting, Decision Trees, Random Forests ⇒ decisions depend non-linearly on x and wj .
- 47. Non-linearity: Data Preprocessing Idea 2) Preprocess the data: y This dataset is not (well) linearly separable: x θ This one is: r In fact, both are the same dataset! Top: Cartesian coordinates. Bottom: polar coordinates
- 48. Non-linearity: Data Preprocessing y Non-linear separation x θ Linear separation r Linear classiﬁer in polar space; acts non-linearly in Cartesian space.
- 49. Generalized Linear Classiﬁer Given X = {x1 , . . . , xn }, Y = {y1 , . . . , yn }. Given any (non-linear) feature map ϕ : Rk → Rm . Solve the minimization for ϕ(x1 ), . . . , ϕ(xn ) instead of x1 , . . . , xn : min_{w∈Rm , ξi ∈R+} ‖w‖² + C Σ_{i=1}^n ξi subject to yi ⟨w, ϕ(xi )⟩ ≥ 1 − ξi for i = 1, . . . , n. The weight vector w now comes from the target space Rm . Distances/angles are measured by the inner product ⟨· , ·⟩ in Rm . Classiﬁer f (x) = ⟨w, ϕ(x)⟩ is linear in w, but non-linear in x.
- 50. Example Feature Mappings Polar coordinates: ϕ : (x, y) → ( √(x² + y²), ∠(x, y) ) d-th degree polynomials: ϕ : (x1 , . . . , xn ) → ( 1, x1 , . . . , xn , x1², . . . , xn², . . . , x1^d, . . . , xn^d ) Distance map: ϕ : x → ( ‖x − p1 ‖, . . . , ‖x − pN ‖ ) for a set of N prototype vectors pi , i = 1, . . . , N .
- 51. Is this enough? In this example, changing the coordinates did help. Does this trick always work? y θ x r ↔
- 52. Is this enough? In this example, changing the coordinates did help. Does this trick always work? y θ x r ↔ Answer: If we do it right, yes! Lemma Let (xi )i=1,...,n with xi = xj for i = j. Let ϕ : Rk → Rm be a feature map. If the set ϕ(xi )i=1,...,n is linearly independent, then the points ϕ(xi )i=1,...,n are linearly separable. Lemma If we choose m > n large enough, we can always ﬁnd a map ϕ.
- 53. Is this enough? Caveat: We can separate any set, not just one with “reasonable” yi : There is a ﬁxed feature map ϕ : R2 → R20001 such that – no matter how we label them – there is always a hyperplane classiﬁer that has zero training error.
- 54. Is this enough? Caveat: We can separate any set, not just one with “reasonable” yi : There is a ﬁxed feature map ϕ : R2 → R20001 such that – no matter how we label them – there is always a hyperplane classiﬁer that has zero training error. Never rely on training accuracy! Don’t forget to regularize!
- 55. Representer TheoremSolve the soft-margin minimization for ϕ(x1 ), . . . , ϕ(xn ) ∈ Rm : n 2 min w +C ξi (1) w∈R ,ξi ∈R+ m i=1subject to yi w, ϕ(xi ) ≥ 1 − ξi for i = 1, . . . n.For large m, won’t solving for w ∈ Rm become impossible?
- 56. Representer TheoremSolve the soft-margin minimization for ϕ(x1 ), . . . , ϕ(xn ) ∈ Rm : n 2 min w +C ξi (1) w∈R ,ξi ∈R+ m i=1subject to yi w, ϕ(xi ) ≥ 1 − ξi for i = 1, . . . n.For large m, won’t solving for w ∈ Rm become impossible? No!Theorem (Representer Theorem)The minimizing solution w to problem (1) can always be written as n w= αj ϕ(xj ) for coeﬃcients α1 , . . . , αn ∈ R. j=1
- 57. Representer Theorem: ProofTheorem: w solves n 2 (∗) min w +C ξi subject to yi w, ϕ(xi ) ≥ 1 − ξi w∈R ,ξi ∈R+ m i=1 then w = j αj ϕ(xj ) for some α ∈ Rn .
- 58. Representer Theorem: ProofTheorem: w solves n 2 (∗) min w +C ξi subject to yi w, ϕ(xi ) ≥ 1 − ξi w∈R ,ξi ∈R+ m i=1 then w = j αj ϕ(xj ) for some α ∈ Rn .Proof: V := span{ϕ(x1 ), . . . , ϕ(xn )} = { j αj ϕ(xj ) : α ∈ Rn } ⊂ Rm .
- 59. Representer Theorem: ProofTheorem: w solves n 2 (∗) min w +C ξi subject to yi w, ϕ(xi ) ≥ 1 − ξi w∈R ,ξi ∈R+ m i=1 then w = j αj ϕ(xj ) for some α ∈ Rn .Proof: V := span{ϕ(x1 ), . . . , ϕ(xn )} = { j αj ϕ(xj ) : α ∈ Rn } ⊂ Rm . V ⊥ := {u ∈ Rm : u, v = 0 for all v ∈ V }.
- 60. Representer Theorem: ProofTheorem: w solves n 2 (∗) min w +C ξi subject to yi w, ϕ(xi ) ≥ 1 − ξi w∈R ,ξi ∈R+ m i=1 then w = j αj ϕ(xj ) for some α ∈ Rn .Proof: V := span{ϕ(x1 ), . . . , ϕ(xn )} = { j αj ϕ(xj ) : α ∈ Rn } ⊂ Rm . V ⊥ := {u ∈ Rm : u, v = 0 for all v ∈ V }. decompose w = wV + wV ⊥ with wV ∈ V , wV ⊥ ∈ V ⊥ .
- 61. Representer Theorem: ProofTheorem: w solves n 2 (∗) min w +C ξi subject to yi w, ϕ(xi ) ≥ 1 − ξi w∈R ,ξi ∈R+ m i=1 then w = j αj ϕ(xj ) for some α ∈ Rn .Proof: V := span{ϕ(x1 ), . . . , ϕ(xn )} = { j αj ϕ(xj ) : α ∈ Rn } ⊂ Rm . V ⊥ := {u ∈ Rm : u, v = 0 for all v ∈ V }. decompose w = wV + wV ⊥ with wV ∈ V , wV ⊥ ∈ V ⊥ . from wV ⊥ , ϕ(xi ) = 0 follows wV , ϕ(xi ) = w, ϕ(xi ) .
- 62. Representer Theorem: ProofTheorem: w solves n 2 (∗) min w +C ξi subject to yi w, ϕ(xi ) ≥ 1 − ξi w∈R ,ξi ∈R+ m i=1 then w = j αj ϕ(xj ) for some α ∈ Rn .Proof: V := span{ϕ(x1 ), . . . , ϕ(xn )} = { j αj ϕ(xj ) : α ∈ Rn } ⊂ Rm . V ⊥ := {u ∈ Rm : u, v = 0 for all v ∈ V }. decompose w = wV + wV ⊥ with wV ∈ V , wV ⊥ ∈ V ⊥ . from wV ⊥ , ϕ(xi ) = 0 follows wV , ϕ(xi ) = w, ϕ(xi ) . from wV , wV ⊥ = 0 follows w 2 = wV 2 + wV ⊥ 2 .
- 63. Representer Theorem: ProofTheorem: w solves n 2 (∗) min w +C ξi subject to yi w, ϕ(xi ) ≥ 1 − ξi w∈R ,ξi ∈R+ m i=1 then w = j αj ϕ(xj ) for some α ∈ Rn .Proof: V := span{ϕ(x1 ), . . . , ϕ(xn )} = { j αj ϕ(xj ) : α ∈ Rn } ⊂ Rm . V ⊥ := {u ∈ Rm : u, v = 0 for all v ∈ V }. decompose w = wV + wV ⊥ with wV ∈ V , wV ⊥ ∈ V ⊥ . from wV ⊥ , ϕ(xi ) = 0 follows wV , ϕ(xi ) = w, ϕ(xi ) . from wV , wV ⊥ = 0 follows w 2 = wV 2 + wV ⊥ 2 .Now we would like to show wV ⊥ = 0.
- 64. Representer Theorem: ProofTheorem: w solves n 2 (∗) min w +C ξi subject to yi w, ϕ(xi ) ≥ 1 − ξi w∈R ,ξi ∈R+ m i=1 then w = j αj ϕ(xj ) for some α ∈ Rn .Proof: V := span{ϕ(x1 ), . . . , ϕ(xn )} = { j αj ϕ(xj ) : α ∈ Rn } ⊂ Rm . V ⊥ := {u ∈ Rm : u, v = 0 for all v ∈ V }. decompose w = wV + wV ⊥ with wV ∈ V , wV ⊥ ∈ V ⊥ . from wV ⊥ , ϕ(xi ) = 0 follows wV , ϕ(xi ) = w, ϕ(xi ) . from wV , wV ⊥ = 0 follows w 2 = wV 2 + wV ⊥ 2 .Now we would like to show wV ⊥ = 0. What if wV ⊥ = 0?
- 65. Representer Theorem: ProofTheorem: w solves n 2 (∗) min w +C ξi subject to yi w, ϕ(xi ) ≥ 1 − ξi w∈R ,ξi ∈R+ m i=1 then w = j αj ϕ(xj ) for some α ∈ Rn .Proof: V := span{ϕ(x1 ), . . . , ϕ(xn )} = { j αj ϕ(xj ) : α ∈ Rn } ⊂ Rm . V ⊥ := {u ∈ Rm : u, v = 0 for all v ∈ V }. decompose w = wV + wV ⊥ with wV ∈ V , wV ⊥ ∈ V ⊥ . from wV ⊥ , ϕ(xi ) = 0 follows wV , ϕ(xi ) = w, ϕ(xi ) . from wV , wV ⊥ = 0 follows w 2 = wV 2 + wV ⊥ 2 .Now we would like to show wV ⊥ = 0. What if wV ⊥ = 0? wV ⊥ 2 > 0 ⇒ w 2 > wV 2 .
- 66. Representer Theorem: ProofTheorem: w solves n 2 (∗) min w +C ξi subject to yi w, ϕ(xi ) ≥ 1 − ξi w∈R ,ξi ∈R+ m i=1 then w = j αj ϕ(xj ) for some α ∈ Rn .Proof: V := span{ϕ(x1 ), . . . , ϕ(xn )} = { j αj ϕ(xj ) : α ∈ Rn } ⊂ Rm . V ⊥ := {u ∈ Rm : u, v = 0 for all v ∈ V }. decompose w = wV + wV ⊥ with wV ∈ V , wV ⊥ ∈ V ⊥ . from wV ⊥ , ϕ(xi ) = 0 follows wV , ϕ(xi ) = w, ϕ(xi ) . from wV , wV ⊥ = 0 follows w 2 = wV 2 + wV ⊥ 2 .Now we would like to show wV ⊥ = 0. What if wV ⊥ = 0? wV ⊥ 2 > 0 ⇒ w 2 > wV 2 . wV solves (∗) with same ξi but smaller norm. Contradiction!
- 67. Representer Theorem: ProofTheorem: w solves n 2 (∗) min w +C ξi subject to yi w, ϕ(xi ) ≥ 1 − ξi w∈R ,ξi ∈R+ m i=1 then w = j αj ϕ(xj ) for some α ∈ Rn .Proof: V := span{ϕ(x1 ), . . . , ϕ(xn )} = { j αj ϕ(xj ) : α ∈ Rn } ⊂ Rm . V ⊥ := {u ∈ Rm : u, v = 0 for all v ∈ V }. decompose w = wV + wV ⊥ with wV ∈ V , wV ⊥ ∈ V ⊥ . from wV ⊥ , ϕ(xi ) = 0 follows wV , ϕ(xi ) = w, ϕ(xi ) . from wV , wV ⊥ = 0 follows w 2 = wV 2 + wV ⊥ 2 .Now we would like to show wV ⊥ = 0. What if wV ⊥ = 0? wV ⊥ 2 > 0 ⇒ w 2 > wV 2 . wV solves (∗) with same ξi but smaller norm. Contradiction!So wV ⊥ = 0, ergo w = wV ∈ V . q.e.d.
- 68. Kernel Trick The representer theorem allows us to rewrite the optimization: n 2 min w +C ξi w∈Rm ,ξi ∈R+ i=1 subject to yi w, ϕ(xi ) ≥ 1 − ξi for i = 1, . . . n.
- 69. Kernel Trick n Insert w = j=1 αj ϕ(xj ) and minimize over αi instead of w: n n 2 min αj ϕ(xj ) +C ξi αi ∈R,ξi ∈R+ j=1 i=1 subject to n yi αj ϕ(xj ), ϕ(xi ) ≥ 1 − ξi for i = 1, . . . n. j=1 The former m-dimensional optimization is now n-dimensional.
- 70. Kernel Trick 2 Use w = w, w and linearity of ·, · : n n min αj αk ϕ(xj ), ϕ(xk ) + C ξi αi ∈R,ξi ∈R+ j,k=1 i=1 subject to n yi αj ϕ(xj ), ϕ(xi ) ≥ 1 − ξi for i = 1, . . . n. j=1
- 71. Kernel Trick 2 Use w = w, w and linearity of ·, · : n n min αj αk ϕ(xj ), ϕ(xk ) + C ξi αi ∈R,ξi ∈R+ j,k=1 i=1 subject to n yi αj ϕ(xj ), ϕ(xi ) ≥ 1 − ξi for i = 1, . . . n. j=1 Note: ϕ only occurs in ϕ(.), ϕ(.) pairs.
- 72. Kernel Trick Set ϕ(x), ϕ(x ) =: k(x, x ), called kernel function. n n min αj αk k(xj , xk ) + C ξi αi ∈R,ξi ∈R+ j,k=1 i=1 subject to n yi αj k(xj , xi ) ≥ 1 − ξi for i = 1, . . . n. j=1
- 73. Kernel Trick Set ϕ(x), ϕ(x ) =: k(x, x ), called kernel function. n n min αj αk k(xj , xk ) + C ξi αi ∈R,ξi ∈R+ j,k=1 i=1 subject to n yi αj k(xj , xi ) ≥ 1 − ξi for i = 1, . . . n. j=1 To train, we only need to know the kernel matrix K Kij := k(xj , xi ) To evaluate on new data x, we need values k(x1 , x), . . . , k(xn , x): n f (x) = w, ϕ(x) = αi k(xi , x) i=1
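To make the kernelized training and prediction rule concrete, here is a small sketch that builds the kernel matrix K and evaluates f(x) = Σ_i α_i k(x_i, x); the coefficients α below are placeholders, since the slides obtain them by solving the optimization above:

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian/RBF kernel between two vectors a and b."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_matrix(X, kernel):
    """K_ij = k(x_i, x_j) for the rows x_i of X; all the training step needs."""
    n = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def decision_function(x, X_train, alpha, kernel):
    """f(x) = sum_i alpha_i k(x_i, x); only kernel evaluations, never phi itself."""
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, X_train))

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
alpha = np.array([0.5, -1.0, 0.7])                   # placeholder coefficients
K = kernel_matrix(X_train, rbf_kernel)
print(K.shape, decision_function(np.array([1.0, 0.0]), X_train, alpha, rbf_kernel))
```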
- 74. Dualization More elegant: dualization with Lagrangian multipliers αi : min_{0 ≤ αi ≤ C} ½ Σ_{i,j=1}^n yi yj αi αj k(xi , xj ) − Σ_{i=1}^n αi subject to Σ_{i=1}^n yi αi = 0 Support-Vector Machine (SVM) Optimization can be solved numerically by any quadratic program (QP) solver, but specialized software packages are more eﬃcient.
- 75. Why use k(x, x ) instead of ϕ(x), ϕ(x ) ? 1) Memory usage: Storing ϕ(x1 ), . . . , ϕ(xn ) requires O(nm) memory. Storing k(x1 , x1 ), . . . , k(xn , xn ) requires O(n 2 ) memory.
- 76. Why use k(x, x ) instead of ϕ(x), ϕ(x ) ? 1) Memory usage: Storing ϕ(x1 ), . . . , ϕ(xn ) requires O(nm) memory. Storing k(x1 , x1 ), . . . , k(xn , xn ) requires O(n 2 ) memory. 2) Speed: We might ﬁnd an expression for k(xi , xj ) that is faster to calculate than forming ϕ(xi ) and then ϕ(xi ), ϕ(xj ) . Example: comparing angles (x ∈ [0, 2π]) ϕ : x → (cos(x), sin(x)) ∈ R2 ϕ(xi ), ϕ(xj ) = (cos(xi ), sin(xi )), (cos(xj ), sin(xj )) = cos(xi ) cos(xj ) + sin(xi ) sin(xj )
- 77. Why use k(x, x ) instead of ϕ(x), ϕ(x ) ? 1) Memory usage: Storing ϕ(x1 ), . . . , ϕ(xn ) requires O(nm) memory. Storing k(x1 , x1 ), . . . , k(xn , xn ) requires O(n 2 ) memory. 2) Speed: We might ﬁnd an expression for k(xi , xj ) that is faster to calculate than forming ϕ(xi ) and then ϕ(xi ), ϕ(xj ) . Example: comparing angles (x ∈ [0, 2π]) ϕ : x → (cos(x), sin(x)) ∈ R2 ϕ(xi ), ϕ(xj ) = (cos(xi ), sin(xi )), (cos(xj ), sin(xj )) = cos(xi ) cos(xj ) + sin(xi ) sin(xj ) = cos(xi − xj ) Equivalently, but faster, without ϕ: k(xi , xj ) : = cos(xi − xj )
- 78. Why use k(x, x ) instead of ϕ(x), ϕ(x ) ? 3) Flexibility: There are kernel functions k(xi , xj ), for which we know that a feature transformation ϕ exists, but we don’t know what ϕ is.
- 79. Why use k(x, x ) instead of ϕ(x), ϕ(x ) ? 3) Flexibility: There are kernel functions k(xi , xj ), for which we know that a feature transformation ϕ exists, but we don’t know what ϕ is. How that??? Theorem Let k : X × X → R be a positive deﬁnite kernel function. Then there exists a Hilbert Space H and a mapping ϕ : X → H such that k(x, x ) = ϕ(x), ϕ(x ) H where . , . H is the inner product in H.
- 80. Positive Deﬁnite Kernel Function Deﬁnition (Positive Deﬁnite Kernel Function) Let X be a non-empty set. A function k : X × X → R is called positive deﬁnite kernel function, iﬀ k is symmetric, i.e. k(x, x ) = k(x , x) for all x, x ∈ X . For any set of points x1 , . . . , xn ∈ X , the matrix Kij = (k(xi , xj ))i,j is positive (semi-)deﬁnite, i.e. for all vectors t ∈ Rn : n ti Kij tj ≥ 0. i,j=1 Note: Instead of “positive deﬁnite kernel function”, we will often just say “kernel”.
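As a quick numerical sanity check, one can test the two conditions on a finite sample; this is only a sketch and only checks necessary conditions on the sampled points, it cannot prove that a function is a kernel:

```python
import numpy as np

def looks_like_kernel(kernel, X, tol=1e-10):
    """Check symmetry and positive semi-definiteness of K_ij = k(x_i, x_j)
    on the given sample points (necessary, but not sufficient, conditions)."""
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    symmetric = np.allclose(K, K.T)
    min_eig = np.linalg.eigvalsh((K + K.T) / 2).min()   # eigenvalues of the symmetrized matrix
    return symmetric and min_eig >= -tol

rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))
tanh_k = lambda a, b: np.tanh(2.0 * np.dot(a, b) + 1.0)  # not positive definite in general

X = np.random.default_rng(0).normal(size=(30, 3))
print("RBF passes the check :", looks_like_kernel(rbf, X))
print("tanh passes the check:", looks_like_kernel(tanh_k, X))
```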
- 81. Hilbert Spaces Deﬁnition (Hilbert Space) A Hilbert Space H is a vector space H with an inner product . , . H , e.g. a mapping .,. H :H ×H →R which is symmetric: v, v H = v ,v H for all v, v ∈ H , positive deﬁnite: v, v H ≥ 0 for all v ∈ H , where v, v H = 0 only for v = 0 ∈ H . bilinear: av, v H = a v, v H for v ∈ H , a ∈ R v + v , v H = v, v H + v , v H We can treat a Hilbert space like some Rn , if we only use concepts like vectors, angles, distances. Note: dim H = ∞ is possible!
- 82. Kernels for Arbitrary Sets Theorem Let k : X × X → R be a positive deﬁnite kernel function. Then there exists a Hilbert Space H and a mapping ϕ : X → H such that k(x, x ) = ϕ(x), ϕ(x ) H where . , . H is the inner product in H. Translation Take any set X and any function k : X × X → R. If k is a positive deﬁnite kernel, then we can use k to learn a (soft) maximum-margin classiﬁer for the elements in X ! Note: X can be any set, e.g. X = "all videos on YouTube" or X = "all permutations of {1, . . . , k}", or X = "the internet".
- 83. How to Check if a Function is a Kernel Problem: Checking if a given k : X × X → R fulﬁlls the conditions for a kernel is diﬃcult: We need to prove or disprove n ti k(xi , xj )tj ≥ 0. i,j=1 for any set x1 , . . . , xn ∈ X and any t ∈ Rn for any n ∈ N. Workaround: It is easy to construct functions k that are positive deﬁnite kernels.
- 84. Constructing Kernels 1) We can construct kernels from scratch: For any ϕ : X → Rm , k(x, x′ ) = ⟨ϕ(x), ϕ(x′ )⟩Rm is a kernel. Example: ϕ(x) = ( "# of red pixels in image x", "# of green pixels", "# of blue pixels" ).
- 85. Constructing Kernels 1) We can construct kernels from scratch: For any ϕ : X → Rm , k(x, x′ ) = ⟨ϕ(x), ϕ(x′ )⟩Rm is a kernel. Example: ϕ(x) = ( "# of red pixels in image x", "# of green pixels", "# of blue pixels" ). Any norm ‖·‖ : V → R that fulﬁlls the parallelogram equation • ‖x + y‖² + ‖x − y‖² = 2‖x‖² + 2‖y‖² induces a kernel by polarization: • k(x, y) := ½ ( ‖x + y‖² − ‖x‖² − ‖y‖² ). Example: X = time series with bounded values, ‖x‖ = ( Σ_{t=1}^∞ x_t² / 2^t )^{1/2}
- 86. Constructing Kernels 1) We can construct kernels from scratch: For any ϕ : X → Rm , k(x, x′ ) = ⟨ϕ(x), ϕ(x′ )⟩Rm is a kernel. Example: ϕ(x) = ( "# of red pixels in image x", "# of green pixels", "# of blue pixels" ). Any norm ‖·‖ : V → R that fulﬁlls the parallelogram equation • ‖x + y‖² + ‖x − y‖² = 2‖x‖² + 2‖y‖² induces a kernel by polarization: • k(x, y) := ½ ( ‖x + y‖² − ‖x‖² − ‖y‖² ). Example: X = time series with bounded values, ‖x‖ = ( Σ_{t=1}^∞ x_t² / 2^t )^{1/2} If d : X × X → R is symmetric and conditionally negative deﬁnite, i.e. • Σ_{i,j=1}^n ti d(xi , xj )tj ≤ 0 for any t ∈ Rn with Σ_i ti = 0, for x1 , . . . , xn ∈ X and any n ∈ N, then • k(x, x′ ) := exp(−d(x, x′ )) is a positive deﬁnite kernel. Example: d(x, x′ ) = ‖x − x′ ‖²_{L²} , k(x, x′ ) = e^{−‖x−x′‖²_{L²}}
- 87. Constructing Kernels 2) We can construct kernels from other kernels: if k is a kernel and α > 0, then αk and k + α are kernels. if k1 , k2 are kernels, then k1 + k2 and k1 · k2 are kernels. if k is a kernel, then exp(k) is a kernel.
- 88. Constructing Kernels 2) We can construct kernels from other kernels: if k is a kernel and α > 0, then αk and k + α are kernels. if k1 , k2 are kernels, then k1 + k2 and k1 · k2 are kernels. if k is a kernel, then exp(k) is a kernel. Examples for kernels for X = Rd : any linear combination j αj kj with αj ≥ 0, polynomial kernels k(x, x ) = (1 + x, x )m , m > 0 x−x 2 Gaussian a.k.a. RBF k(x, x ) = exp − 2σ 2 with σ > 0,
- 89. Constructing Kernels 2) We can construct kernels from other kernels: if k is a kernel and α > 0, then αk and k + α are kernels. if k1 , k2 are kernels, then k1 + k2 and k1 · k2 are kernels. if k is a kernel, then exp(k) is a kernel. Examples for kernels for X = Rd : any linear combination j αj kj with αj ≥ 0, polynomial kernels k(x, x ) = (1 + x, x )m , m > 0 x−x 2 Gaussian a.k.a. RBF k(x, x ) = exp − 2σ 2 with σ > 0, Examples for kernels for other X : k(h, h ) = n min(hi , hi ) for n-bin histograms h, h . i=1 k(p, p ) = exp(−KL(p, p )) with KL the symmetrized KL-divergence between positive probability distributions. k(s, s ) = exp(−D(s, s )) for strings s, s and D = edit distance
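Two of the listed examples written out as plain functions; a minimal sketch assuming the inputs are numpy arrays of equal length (the parameter defaults are arbitrary):

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    """RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

def polynomial_kernel(x, xp, m=3):
    """Polynomial kernel k(x, x') = (1 + <x, x'>)^m."""
    return (1.0 + np.dot(x, xp)) ** m

def histogram_intersection_kernel(h, hp):
    """k(h, h') = sum_i min(h_i, h'_i) for n-bin histograms h, h'."""
    return np.minimum(h, hp).sum()

h1 = np.array([0.2, 0.5, 0.3])
h2 = np.array([0.1, 0.6, 0.3])
print(gaussian_kernel(h1, h2), polynomial_kernel(h1, h2), histogram_intersection_kernel(h1, h2))
```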
- 90. Constructing Kernels 2) We can construct kernels from other kernels: if k is a kernel and α > 0, then αk and k + α are kernels. if k1 , k2 are kernels, then k1 + k2 and k1 · k2 are kernels. if k is a kernel, then exp(k) is a kernel. Examples for kernels for X = Rd : any linear combination j αj kj with αj ≥ 0, polynomial kernels k(x, x ) = (1 + x, x )m , m > 0 x−x 2 Gaussian a.k.a. RBF k(x, x ) = exp − 2σ 2 with σ > 0, Examples for kernels for other X : k(h, h ) = n min(hi , hi ) for n-bin histograms h, h . i=1 k(p, p ) = exp(−KL(p, p )) with KL the symmetrized KL-divergence between positive probability distributions. k(s, s ) = exp(−D(s, s )) for strings s, s and D = edit distance Not an example: tanh (a x, x + b) is not positive deﬁnite.
- 91. Summary – Classiﬁcation with Kernels (Generalized) linear classiﬁcation with SVMs conceptually simple, but powerful by using kernels
- 92. Summary – Classiﬁcation with Kernels (Generalized) linear classiﬁcation with SVMs conceptually simple, but powerful by using kernels Kernels are at the same time similarity measures between arbitrary objects inner products in a (hidden) feature space
- 93. Summary – Classiﬁcation with Kernels (Generalized) linear classiﬁcation with SVMs conceptually simple, but powerful by using kernels Kernels are at the same time similarity measures between arbitrary objects inner products in a (hidden) feature space Kernelization is implicit application of a feature map The method can become non-linear in the original data. The method is still linear in some feature space. ⇒ still somewhat intuitive/interpretable
- 94. Summary – Classiﬁcation with Kernels (Generalized) linear classiﬁcation with SVMs conceptually simple, but powerful by using kernels Kernels are at the same time similarity measures between arbitrary objects inner products in a (hidden) feature space Kernelization is implicit application of a feature map The method can become non-linear in the original data. The method is still linear in some feature space. ⇒ still somewhat intuitive/interpretable We can build new kernels from explicit inner products distances existing kernels
- 95. Summary – Classiﬁcation with Kernels (Generalized) linear classiﬁcation with SVMs conceptually simple, but powerful by using kernels Kernels are at the same time similarity measures between arbitrary objects inner products in a (hidden) feature space Kernelization is implicit application of a feature map The method can become non-linear in the original data. The method is still linear in some feature space. ⇒ still somewhat intuitive/interpretable We can build new kernels from explicit inner products distances existing kernels Kernels can be deﬁned for arbitrary input data.
- 96. What did we not see? We have skipped a large part of theory on kernel methods: Optimization Dualization Numerics Algorithms to train SVMs Statistical Interpretations What are our assumptions on the samples? Generalization Bounds Theoretic guarantees on what accuracy the classiﬁer will have! This and much more in standard references, e.g. Schölkopf, Smola: “Learning with Kernels”, MIT Press (50€/60$) Shawe-Taylor, Cristianini: “Kernel Methods for Pattern Analysis”, Cambridge University Press (60€/75$)
- 97. Non-Linear SVMs in Practice SVMs with non-linear kernel are commonly used for small to medium sized Computer Vision problems. Software packages: libSVM: http://www.csie.ntu.edu.tw/˜cjlin/libsvm/ SVMlight: http://svmlight.joachims.org/ Training time is typically cubic in number of training examples. Evaluation time: typically linear in number of training examples. Classiﬁcation accuracy is typically higher than with linear SVMs.
- 98. Observation 1: Linear SVMs are very fast in training and evaluation. Observation 2: Non-linear kernel SVMs give better results, but do not scale well (with respect to number of training examples) Can we combine the strengths of both approaches?
- 99. Observation 1: Linear SVMs are very fast in training and evaluation. Observation 2: Non-linear kernel SVMs give better results, but do not scale well (with respect to number of training examples) Can we combine the strengths of both approaches? Yes! By (approximately) going back to explicit feature maps. [S. Maji, A. Berg, J. Malik, "Classiﬁcation using intersection kernel support vector machines is eﬃcient", CVPR 2008] [A. Rahimi, "Random Features for Large-Scale Kernel Machines", NIPS, 2008]
- 100. (Approximate) Explicit Feature Maps Core Facts For every kernel k : X × X → R, there exists an (implicit) ϕ : X → H such that k(x, x′ ) = ⟨ϕ(x), ϕ(x′ )⟩. In case that ϕ : X → RD , training a kernelized SVM yields the same prediction function as preprocessing the data: make every x into ϕ(x), then train a linear SVM on the new data. Problem: ϕ is generally unknown, and it does not map to RD .
- 101. (Approximate) Explicit Feature Maps Core Facts For every kernel k : X × X → R, there exists an (implicit) ϕ : X → H such that k(x, x′ ) = ⟨ϕ(x), ϕ(x′ )⟩. In case that ϕ : X → RD , training a kernelized SVM yields the same prediction function as preprocessing the data: make every x into ϕ(x), then train a linear SVM on the new data. Problem: ϕ is generally unknown, and it does not map to RD . Idea: Approximate ϕ : X → H by an explicit ϕ̃ : X → RD such that k(x, x′ ) ≈ ⟨ϕ̃(x), ϕ̃(x′ )⟩
- 102. Example: exact explicit feature map for Hellinger kernel kH (x, x′ ) = Σ_{i=1}^d √(xi ) √(x′i ) ← ϕH (x) := ( √(x1 ), . . . , √(xd ) )
- 103. Example: exact explicit feature map for Hellinger kernel kH (x, x′ ) = Σ_{i=1}^d √(xi ) √(x′i ) ← ϕH (x) := ( √(x1 ), . . . , √(xd ) ) Example: approximate feature map for χ2 -kernel kχ2 (x, x′ ) = Σ_{i=1}^d xi x′i / (xi + x′i ) ← ϕ̃χ2 (x) := ( √(xi ), √(2πxi ) cos(log xi ), √(2πxi ) sin(log xi ) )_{i=1,...,d} Current method of choice for large-scale non-linear learning. (see A. Zisserman’s lecture) [A. Vedaldi, A. Zisserman, "Eﬃcient Additive Kernels via Explicit Feature Maps", CVPR 2010]
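The Hellinger map is exact and trivial to apply as a preprocessing step; a minimal sketch is below (the approximate χ²-map comes with the software accompanying [Vedaldi, Zisserman] and is not re-derived here):

```python
import numpy as np

def hellinger_map(X):
    """Exact explicit feature map for the Hellinger kernel:
    phi_H(x) = (sqrt(x_1), ..., sqrt(x_d)), so <phi_H(x), phi_H(x')> = sum_i sqrt(x_i x'_i).
    X: n x d array of non-negative feature vectors (e.g. histograms)."""
    return np.sqrt(X)

X = np.abs(np.random.default_rng(0).normal(size=(5, 8)))     # toy non-negative features
Phi = hellinger_map(X)
# The linear kernel on Phi equals the Hellinger kernel on X:
K_hellinger = np.sqrt(X[:, None, :] * X[None, :, :]).sum(-1)
print(np.allclose(Phi @ Phi.T, K_hellinger))
```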
- 104. Selecting and Combining Kernels
- 105. Selecting From Multiple Kernels Typically, one has many diﬀerent kernels to choose from: diﬀerent functional forms linear, polynomial, RBF, . . . diﬀerent parameters polynomial degree, Gaussian bandwidth, . . .
- 106. Selecting From Multiple Kernels Typically, one has many diﬀerent kernels to choose from: diﬀerent functional forms linear, polynomial, RBF, . . . diﬀerent parameters polynomial degree, Gaussian bandwidth, . . . Diﬀerent image features give rise to diﬀerent kernels Color histograms, SIFT bag-of-words, HOG, Pyramid match, Spatial pyramids, . . .
- 107. Selecting From Multiple Kernels Typically, one has many diﬀerent kernels to choose from: diﬀerent functional forms linear, polynomial, RBF, . . . diﬀerent parameters polynomial degree, Gaussian bandwidth, . . . Diﬀerent image features give rise to diﬀerent kernels Color histograms, SIFT bag-of-words, HOG, Pyramid match, Spatial pyramids, . . . How to choose? Ideally, based on the kernels’ performance on task at hand: estimate by cross-validation or validation set error Classically part of “Model Selection”.
- 108. Kernel Parameter Selection Remark: Model Selection makes a diﬀerence! Action Classiﬁcation, KTH dataset Method Accuracy (on test data) Dollár et al. VS-PETS 2005: ”SVM classiﬁer“ 80.66 Nowozin et al., ICCV 2007: ”baseline RBF“ 85.19 identical features, same kernel function diﬀerence: Nowozin used cross-validation for model selection (bandwidth and C ) Message: never rely on default parameters!
- 109. Kernel Parameter Selection Rule of thumb for kernel parameters For generalized Gaussian kernels: k(x, x′ ) = exp( − d²(x, x′ ) / (2γ) ) with any distance d, set γ ≈ median_{i,j=1,...,n} d(xi , xj ). Many variants: mean instead of median, only d(xi , xj ) with yi = yj , ... In general, if there are several classes, then the kernel matrix Kij = k(xi , xj ) should have a block structure w.r.t. the classes.
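A sketch of this rule of thumb with d chosen as the Euclidean distance (which variant to use, and whether to take the mean or the median, is a modelling choice, as the slide notes):

```python
import numpy as np

def rule_of_thumb_gamma(X):
    """gamma ~ median_{i != j} d(x_i, x_j), here with d = Euclidean distance,
    for the generalized Gaussian kernel k(x, x') = exp(-d^2(x, x') / (2 gamma))."""
    diffs = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diffs ** 2).sum(axis=-1))                  # n x n matrix of distances
    return np.median(d[~np.eye(len(X), dtype=bool)])        # median over off-diagonal pairs

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
gamma = rule_of_thumb_gamma(X)
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / (2.0 * gamma))                             # kernel matrix with heuristic gamma
print("gamma =", gamma, " K shape:", K.shape)
```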
- 110.–112. [Figure slides: the “two moons” dataset and its label matrix (the “ideal kernel”), compared with the kernel matrices Kij for the linear kernel and for Gaussian kernels with γ = 0.001, 0.01, 0.1, 1, 10, 100, 1000, as well as γ = 0.6 from the rule of thumb and γ = 1.6 from 5-fold cross-validation.]
- 113. Kernel Selection ↔ Kernel Combination Is really one of the kernels best? Kernels are typically designed to capture one aspect of the data: texture, color, edges, . . . Choosing one kernel means to select exactly one such aspect.
- 114. Kernel Selection ↔ Kernel Combination Is really one of the kernels best? Kernels are typically designed to capture one aspect of the data: texture, color, edges, . . . Choosing one kernel means to select exactly one such aspect. Combining aspects is often better than selecting. Method Accuracy Colour 60.9 ± 2.1 Shape 70.2 ± 1.3 Texture 63.7 ± 2.7 HOG 58.5 ± 4.5 HSV 61.3 ± 0.7 siftint 70.6 ± 1.6 siftbdy 59.4 ± 3.3 combination 85.2 ± 1.5 Mean accuracy on Oxford Flowers dataset [Gehler, Nowozin: ICCV2009]
- 115. Combining Two Kernels For two kernels k1 , k2 : product k = k1 · k2 is again a kernel. Problem: very small kernel values suppress large ones. average k = ½ (k1 + k2 ) is again a kernel. Problem: k1 , k2 on diﬀerent scales. Re-scale ﬁrst? convex combination kβ = (1 − β)k1 + βk2 with β ∈ [0, 1]. Model selection: cross-validate over β ∈ {0, 0.1, . . . , 1}.
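A small sketch of these three combination rules for precomputed kernel matrices K1, K2; the diagonal rescaling used here is one common normalization, not the only option:

```python
import numpy as np

def normalize(K):
    """Rescale a kernel matrix so all diagonal entries are 1: K_ij / sqrt(K_ii K_jj).
    Helps when k1 and k2 live on different scales."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def combine(K1, K2, mode="average", beta=0.5):
    K1, K2 = normalize(K1), normalize(K2)
    if mode == "product":
        return K1 * K2                        # element-wise product kernel
    if mode == "average":
        return 0.5 * (K1 + K2)
    if mode == "convex":
        return (1.0 - beta) * K1 + beta * K2  # cross-validate over beta in [0, 1]
    raise ValueError(mode)

A = np.random.default_rng(0).normal(size=(6, 4)); K1 = A @ A.T   # toy PSD matrices
B = np.random.default_rng(1).normal(size=(6, 4)); K2 = B @ B.T
print(combine(K1, K2, "convex", beta=0.3).shape)
```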
- 116. Combining Many Kernels Multiple kernels k1 , . . . , kK : all convex combinations are kernels: k = Σ_{j=1}^K βj kj with βj ≥ 0, Σ_{j=1}^K βj = 1. Kernels can be “deactivated” by βj = 0. Combinatorial explosion forbids cross-validation over all combinations of βj . Proxy: instead of CV, maximize the SVM-objective. Each combined kernel induces a feature space. In which combined feature spaces can we best explain the training data, and achieve a large margin between the classes?
- 117. Multiple Kernel Learning Same for soft-margin with slack-variables: min_{vj ∈Hj , Σj βj =1, βj ≥0, ξi ∈R+} Σj (1/βj ) ‖vj ‖²_{Hj} + C Σi ξi (2) subject to yi Σj ⟨vj , ϕj (xi )⟩_{Hj} ≥ 1 − ξi for i = 1, . . . , n. (3) This optimization problem is jointly convex in vj and βj . There is a unique global minimum, and we can ﬁnd it eﬃciently! Learning kernel weights and SVM classiﬁer together like this is called Multiple-Kernel-Learning (MKL).
- 118. Combining Good Kernels Observation: if all kernels are reasonable, simple combination methods work as well as diﬃcult ones (and are much faster): Single features: Colour 60.9 ± 2.1 (3 s), Shape 70.2 ± 1.3 (4 s), Texture 63.7 ± 2.7 (3 s), HOG 58.5 ± 4.5 (4 s), HSV 61.3 ± 0.7 (3 s), siftint 70.6 ± 1.6 (4 s), siftbdy 59.4 ± 3.3 (5 s). Combination methods: product 85.5 ± 1.2 (2 s), averaging 84.9 ± 1.9 (10 s), CG-Boost 84.8 ± 2.2 (1225 s), MKL (SILP) 85.2 ± 1.5 (97 s), MKL (Simple) 85.2 ± 1.5 (152 s), LP-β 85.5 ± 3.0 (80 s), LP-B 85.4 ± 2.4 (98 s). Mean accuracy and total runtime (model selection, training, testing) on Oxford Flowers dataset [Gehler, Nowozin: ICCV2009] Message: Never use MKL without comparing to simpler baselines!
- 119. Combining Good and Bad kernels Observation: if some kernels are helpful, but others are not, smart techniques are better. [Plot: accuracy vs. number of added noise features (0 to 50) for product, average, CG-Boost, MKL (SILP or Simple), LP-β, LP-B.] Mean accuracy and total runtime (model selection, training, testing) on Oxford Flowers dataset [Gehler, Nowozin: ICCV2009]
- 120. Summary – Kernel Selection and Combination Model selection is important to achieve highest accuracy. Combining several kernels is often superior to selecting one. Simple techniques often work as well as diﬃcult ones. Always try single best, averaging, product ﬁrst.
- 121. Other Kernel Methods
- 122. Multiclass Classiﬁcation – One-versus-rest SVM Classiﬁcation problems with M classes: Training samples {x1 , . . . , xn } ⊂ X , Training labels {y1 , . . . , yn } ⊂ {1, . . . , M }, Task: learn a prediction function f : X → {1, . . . , M }.
- 123. Multiclass Classiﬁcation – One-versus-rest SVM Classiﬁcation problems with M classes: Training samples {x1 , . . . , xn } ⊂ X , Training labels {y1 , . . . , yn } ⊂ {1, . . . , M }, Task: learn a prediction function f : X → {1, . . . , M }. One-versus-rest SVM: train one SVM Fm for each class m: all samples with class label m are positive examples all other samples are negative examples classify by ﬁnding maximal response f (x) = argmax Fm (x) m=1,...,M Advantage: easy to implement, works well. Disadvantage: with many classes, training sets become unbalanced.
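A sketch of the one-versus-rest scheme on top of a generic binary trainer; for self-containedness it reuses the small hinge-loss trainer sketched earlier (no bias term; all names and data are illustrative):

```python
import numpy as np

def train_binary(X, y, C=10.0, lr=0.01, epochs=200):
    """Any binary classifier works; here the tiny hinge-loss (linear SVM) trainer again."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        active = y * (X @ w) < 1.0
        w -= lr * (2.0 * w - C * (y[active, None] * X[active]).sum(axis=0))
    return w

def one_vs_rest_train(X, labels, classes):
    """Train one SVM F_m per class m: samples of class m positive, all others negative."""
    return {m: train_binary(X, np.where(labels == m, 1.0, -1.0)) for m in classes}

def one_vs_rest_predict(x, models):
    """f(x) = argmax_m F_m(x): pick the class with maximal response."""
    return max(models, key=lambda m: models[m] @ x)

# Toy 3-class problem
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (20, 2)) for c in ([0, 2], [2, 0], [-2, -2])])
labels = np.repeat([0, 1, 2], 20)
models = one_vs_rest_train(X, labels, classes=[0, 1, 2])
pred = np.array([one_vs_rest_predict(x, models) for x in X])
print("training accuracy:", np.mean(pred == labels))
```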
- 124. Multiclass Classiﬁcation – All-versus-all SVM Classiﬁcation problems with M classes: Training samples {x1 , . . . , xn } ⊂ X , Training labels {y1 , . . . , yn } ⊂ {1, . . . , M }, Task: learn a prediction function f : X → {1, . . . , M }.
- 125. Multiclass Classiﬁcation – All-versus-all SVM Classiﬁcation problems with M classes: Training samples {x1 , . . . , xn } ⊂ X , Training labels {y1 , . . . , yn } ⊂ {1, . . . , M }, Task: learn a prediction function f : X → {1, . . . , M }. All-versus-all SVM: train one SVM Fij for each pair of classes 1 ≤ i < j ≤ M , in total M (M − 1)/2 SVMs; classify by voting f (x) = argmax_{m=1,...,M} #{i ∈ {1, . . . , M } : Fm,i (x) > 0} (writing Fj,i = −Fi,j for j > i and Fj,j = 0) Advantage: SVM problems stay small and balanced, also works well. Disadvantage: Number of classiﬁers grows quadratically in the number of classes.
- 126. The “Kernel Trick” Many other algorithms besides SVMs can make use of kernels: Any algorithm that uses the data only in form of inner products can be kernelized. How to kernelize an algorithm: Write the algorithm in terms of only scalar products. Replace all scalar products by kernel function evaluations. The resulting algorithm will do the same as the linear version, but in the (hidden) feature space H. Caveat: working in H is not a guarantee for better performance. A good choice of k and model-selection are important!
- 127. Linear Regression Given samples xi ∈ Rd and function values yi ∈ R. Find a linear function f (x) = ⟨w, x⟩ that approximates the values. Interpolation error: ei := (yi − ⟨w, xi ⟩)² Solve for λ ≥ 0: min_{w∈Rd} Σ_{i=1}^n ei + λ‖w‖² [Plot: 1D regression example]
- 128. Linear Regression Given samples xi ∈ Rd and function values yi ∈ R. Find a linear function f (x) = ⟨w, x⟩ that approximates the values. Interpolation error: ei := (yi − ⟨w, xi ⟩)² Solve for λ ≥ 0: min_{w∈Rd} Σ_{i=1}^n ei + λ‖w‖² Very popular, because it has a closed form solution! w = X (λIn + X^T X )−1 y with In the n × n identity matrix, X = (x1 , . . . , xn ) the data matrix (samples as columns), y = (y1 , . . . , yn )^t the target vector.
- 129. Linear Regression Given samples xi ∈ Rd and function values yi ∈ R. Find a linear function f (x) = ⟨w, x⟩ that approximates the values. Interpolation error: ei := (yi − ⟨w, xi ⟩)² Solve for λ ≥ 0: min_{w∈Rd} Σ_{i=1}^n ei + λ‖w‖² What about non-linear? Very popular, because it has a closed form solution! w = X (λIn + X^T X )−1 y with In the n × n identity matrix, X = (x1 , . . . , xn ) the data matrix (samples as columns), y = (y1 , . . . , yn )^t the target vector.
- 130. Linear Regression Given samples xi ∈ Rd and function values yi ∈ R. Find a linear function f (x) = ⟨w, x⟩ that approximates the values. Interpolation error: ei := (yi − ⟨w, xi ⟩)² Solve for λ ≥ 0: min_{w∈Rd} Σ_{i=1}^n ei + λ‖w‖² What about non-linear? Very popular, because it has a closed form solution! w = X (λIn + X^T X )−1 y with In the n × n identity matrix, X = (x1 , . . . , xn ) the data matrix (samples as columns), y = (y1 , . . . , yn )^t the target vector.
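A numpy sketch of this closed form; it follows the columns-as-samples convention used in the formula above, and the toy data and λ value are arbitrary:

```python
import numpy as np

# Columns of X are the samples x_i (d x n), matching w = X (lambda I_n + X^T X)^{-1} y.
rng = np.random.default_rng(0)
n, d = 30, 2
X = rng.uniform(0, 6, size=(d, n))
w_true = np.array([0.4, -0.2])
y = w_true @ X + 0.1 * rng.normal(size=n)            # noisy linear targets

lam = 1.0
w = X @ np.linalg.solve(lam * np.eye(n) + X.T @ X, y)

# Equivalent "primal" form of the same solution: w = (lambda I_d + X X^T)^{-1} X y
w_primal = np.linalg.solve(lam * np.eye(d) + X @ X.T, X @ y)
print(np.allclose(w, w_primal), w)
```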
- 131. Nonlinear Regression Given xi ∈ X , yi ∈ R, i = 1, . . . , n, and ϕ : X → H. Find an approximating function f (x) = ⟨w, ϕ(x)⟩H (non-linear in x, linear in w). Interpolation error: ei := (yi − ⟨w, ϕ(xi )⟩H )² Solve for λ ≥ 0: min_{w∈H} Σ_{i=1}^n ei + λ‖w‖²
- 132. Nonlinear Regression Given xi ∈ X , yi ∈ R, i = 1, . . . , n, and ϕ : X → H. Find an approximating function f (x) = ⟨w, ϕ(x)⟩H (non-linear in x, linear in w). Interpolation error: ei := (yi − ⟨w, ϕ(xi )⟩H )² Solve for λ ≥ 0: min_{w∈H} Σ_{i=1}^n ei + λ‖w‖² Closed form solution is still valid: w = Φ(λIn + Φ^T Φ)−1 y with In the n × n identity matrix, Φ = (ϕ(x1 ), . . . , ϕ(xn )) the feature matrix (columns are the ϕ(xi )).
- 133. Example: Kernel Ridge Regression What if H and ϕ : X → H are given only implicitly by a kernel function? We cannot store the closed form solution vector w ∈ H: w = Φ(λIn + Φ^T Φ)−1 y, but we can still calculate f : X → R: f (x) = ⟨w, ϕ(x)⟩ = ⟨Φ(λIn + Φ^T Φ)−1 y, ϕ(x)⟩ = y^t (λIn + K )−1 κ(x) with K = Φ^T Φ, i.e. Kij = k(xi , xj ), and κ(x) = ( k(x1 , x), . . . , k(xn , x) )^t . Kernel Ridge Regression
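A sketch of kernel ridge regression in exactly this form, with a Gaussian kernel standing in for k; the kernel choice, λ, and bandwidth are illustrative:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """Gaussian kernel, broadcasting over the leading dimensions."""
    return np.exp(-gamma * np.sum((a - b) ** 2, axis=-1))

def kernel_ridge_fit(X, y, lam=0.1, gamma=1.0):
    """Return alpha = (lambda I_n + K)^{-1} y, where K_ij = k(x_i, x_j)."""
    K = rbf(X[:, None, :], X[None, :, :], gamma)
    return np.linalg.solve(lam * np.eye(len(X)) + K, y)

def kernel_ridge_predict(x, X_train, alpha, gamma=1.0):
    """f(x) = y^t (lambda I_n + K)^{-1} kappa(x) = sum_i alpha_i k(x_i, x)."""
    return alpha @ rbf(X_train, x, gamma)

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
alpha = kernel_ridge_fit(X, y)
print(kernel_ridge_predict(np.array([3.0]), X, alpha))
```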
- 134. Nonlinear Regression Like Least-Squares Regression, (Kernel) Ridge Regression is sensitive to outliers, because the quadratic loss function penalizes large residuals. [Plots: regression fits on data with outliers]
- 135. Nonlinear Regression Support Vector Regression with ε-insensitive loss is more robust: [Plots: quadratic loss vs. ε-insensitive loss as functions of the residual r, and the corresponding regression fits on data with outliers]
- 136. Support Vector Regression Optimization problem similar to the support vector machine: min_{w∈H, ξ1 ,...,ξn ∈R+ , ξ′1 ,...,ξ′n ∈R+} ‖w‖² + C Σ_{i=1}^n (ξi + ξ′i ) subject to yi − ⟨w, ϕ(xi )⟩ ≤ ε + ξi , for i = 1, . . . , n, ⟨w, ϕ(xi )⟩ − yi ≤ ε + ξ′i , for i = 1, . . . , n. Again, we can use a Representer Theorem and get w = Σj αj ϕ(xj ).
- 137. Support Vector Regression Optimization problem similar to the support vector machine: min_{α1 ,...,αn ∈R, ξ1 ,...,ξn ∈R+ , ξ′1 ,...,ξ′n ∈R+} Σ_{i,j=1}^n αi αj ⟨ϕ(xi ), ϕ(xj )⟩ + C Σ_{i=1}^n (ξi + ξ′i ) subject to yi − Σj αj ⟨ϕ(xj ), ϕ(xi )⟩ ≤ ε + ξi , for i = 1, . . . , n, Σj αj ⟨ϕ(xj ), ϕ(xi )⟩ − yi ≤ ε + ξ′i , for i = 1, . . . , n. Rewrite in terms of kernel evaluations k(x, x′ ) = ⟨ϕ(x), ϕ(x′ )⟩.
- 138. Support Vector Regression Optimization problem similar to the support vector machine: min_{α1 ,...,αn ∈R, ξ1 ,...,ξn ∈R+ , ξ′1 ,...,ξ′n ∈R+} Σ_{i,j=1}^n αi αj k(xi , xj ) + C Σ_{i=1}^n (ξi + ξ′i ) subject to yi − Σj αj k(xj , xi ) ≤ ε + ξi , for i = 1, . . . , n, Σj αj k(xj , xi ) − yi ≤ ε + ξ′i , for i = 1, . . . , n. Regression function f (x) = ⟨w, ϕ(x)⟩ = Σj αj k(xj , x)
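This quadratic program is normally solved with an off-the-shelf package rather than by hand; as a usage sketch (assuming scikit-learn is installed), its SVR estimator exposes the C, ε, and kernel parameters appearing here, with arbitrary values chosen below:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)
y[::15] += 2.0                                        # inject a few outliers

# epsilon-insensitive loss: residuals smaller than epsilon are not penalized at all
model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=1.0).fit(X, y)
print(model.predict(np.array([[3.0]])))
```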
- 139. Example – Head Pose Estimation Detect faces in image Compute gradient representation of face region Train support vector regression for yaw, tilt (separately) [Li, Gong, Sherrah, Liddell, "Support vector machine based multi-view face detection and recognition", IVC 2004]
- 140. Outlier/Anomaly Detection in RdFor unlabeled data, we are interested to detect outliers, i.e. samplesthat lie far away from most of the other samples. For samples x1 , . . . , xn ﬁnd the smallest ball (center c, radius R) that contains “most” of the samples.
- 141. Outlier/Anomaly Detection in Rd For unlabeled data, we are interested to detect outliers, i.e. samples that lie far away from most of the other samples. For samples x1 , . . . , xn ﬁnd the smallest ball (center c, radius R) that contains “most” of the samples. Solve min_{R∈R, c∈Rd , ξi ∈R+} R² + (1/(νn)) Σi ξi subject to ‖xi − c‖² ≤ R² + ξi for i = 1, . . . , n. ν ∈ (0, 1) upper bounds the number of “outliers”.
- 142. Outlier/Anomaly Detection in Arbitrary Inputs Use a kernel k : X × X → R with an implicit feature map ϕ : X → H. Do outlier detection for ϕ(x1 ), . . . , ϕ(xn ): Find the smallest ball (center c ∈ H, radius R) that contains “most” of the samples: Solve min_{R∈R, c∈H, ξi ∈R+} R² + (1/(νn)) Σi ξi subject to ‖ϕ(xi ) − c‖² ≤ R² + ξi for i = 1, . . . , n. Representer theorem: c = Σj αj ϕ(xj ), and everything can be written using only k(xi , xj ). Algorithm called Support Vector Data Description
