Kernel Methods and Nonlinear Classification
Machine Learning
(CS5350/6350) Kernel Methods September 15, 2011 1 / 16
Kernel Methods: Motivation
Often we want to capture nonlinear patterns in the data
Nonlinear Regression: Input-output relationship may not be linear
Nonlinear Classification: Classes may not be separable by a linear boundary
Linear models (e.g., linear regression, linear SVM) are often just not rich enough
Kernels: Make linear models work in nonlinear settings
By mapping data to higher dimensions where it exhibits linear patterns
Apply the linear model in the new input space
Mapping ≡ changing the feature representation
Note: Such mappings can be expensive to compute in general
Kernels give such mappings for (almost) free
In most cases, the mappings need not even be computed
... using the Kernel Trick!
(CS5350/6350) Kernel Methods September 15, 2011 2 / 16
Classifying non-linearly separable data
Consider this binary classification problem
Each example represented by a single feature x
No linear separator exists for this data
Now map each example as x → {x, x²}
Each example now has two features (“derived” from the old representation)
Data now becomes linearly separable in the new representation
Linear in the new representation ≡ nonlinear in the old representation
(CS5350/6350) Kernel Methods September 15, 2011 3 / 16
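A minimal Python sketch (not part of the original slides) of this x → {x, x²} mapping, using made-up one-dimensional data: no threshold on x separates the classes, but a linear boundary in the (x, x²) plane does. NumPy is assumed to be available.

```python
# Made-up 1-D data: the positive class sits around the origin, so no single
# threshold on x separates the classes.
import numpy as np

x = np.array([-3.0, -2.5, -1.8, -0.5, 0.0, 0.4, 1.9, 2.7, 3.2])
y = np.array([-1, -1, -1, 1, 1, 1, -1, -1, -1])

# Map each example as x -> {x, x^2}: two features derived from the old one.
phi = np.column_stack([x, x ** 2])

# In the new space the classes are split by the line x^2 = 2, i.e. a linear
# boundary in (x, x^2) that is nonlinear (|x| = sqrt(2)) back in the original x.
w, b = np.array([0.0, -1.0]), 2.0
pred = np.sign(phi @ w + b)
print(np.all(pred == y))   # True for this toy data
```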
Classifying non-linearly separable data
Let’s look at another example:
Each example defined by two features x = {x1, x2}
No linear separator exists for this data
Now map each example as x = {x1, x2} → z = {x1², √2 x1x2, x2²}
Each example now has three features (“derived” from the old representation)
Data now becomes linearly separable in the new representation
(CS5350/6350) Kernel Methods September 15, 2011 4 / 16
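A similar sketch (again not from the slides, with made-up data) for the two-feature case: points labeled by whether they fall inside the unit circle are not linearly separable in (x1, x2), but the map z = (x1², √2 x1x2, x2²) turns the circular boundary into a plane.

```python
# Made-up 2-D data labeled by whether the point lies inside the unit circle.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)

# Map {x1, x2} -> {x1^2, sqrt(2)*x1*x2, x2^2}: three derived features.
Z = np.column_stack([X[:, 0] ** 2,
                     np.sqrt(2) * X[:, 0] * X[:, 1],
                     X[:, 1] ** 2])

# The circle x1^2 + x2^2 = 1 becomes the plane z1 + z3 = 1 in the new space.
w, b = np.array([-1.0, 0.0, -1.0]), 1.0
print(np.mean(np.sign(Z @ w + b) == y))   # 1.0: separable after the mapping
```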
Feature Mapping
Consider the following mapping φ for an example x = {x1, . . . , xD}
φ : x → {x1², x2², . . . , xD², x1x2, x1x3, . . . , x1xD, . . . , xD−1xD}
It’s an example of a quadratic mapping
Each new feature uses a pair of the original features
Problem: The mapping usually makes the number of features blow up! (illustrated in the sketch after this slide)
Computing the mapping itself can be inefficient in such cases
Moreover, using the mapped representation could be inefficient too
e.g., imagine computing the similarity between two examples: φ(x)⊤φ(z)
Thankfully, Kernels help us avoid both these issues!
The mapping doesn’t have to be explicitly computed
Computations with the mapped features remain efficient
(CS5350/6350) Kernel Methods September 15, 2011 5 / 16
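A rough illustration (not from the slides) of the blow-up and of the kernel shortcut developed on the next slides: for D original features the quadratic map produces D + D(D−1)/2 ≈ D²/2 features, yet the dot product of two mapped vectors equals (x⊤z)², which costs only O(D). The helper name quad_features is made up, and NumPy is assumed.

```python
import numpy as np
from itertools import combinations

def quad_features(x):
    # Squares plus sqrt(2)-scaled cross terms, so that phi(x).phi(z) = (x.z)^2
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([x ** 2, cross])

D = 1000
rng = np.random.default_rng(0)
x, z = rng.standard_normal(D), rng.standard_normal(D)

print(quad_features(x).shape)                    # (500500,) vs the original 1000
explicit = quad_features(x) @ quad_features(z)   # O(D^2) work and memory
kernel = (x @ z) ** 2                            # same value in O(D)
print(np.allclose(explicit, kernel))             # True
```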
Kernels as High Dimensional Feature Mapping
Consider two examples x = {x1, x2} and z = {z1, z2}
Let’s assume we are given a function k (kernel) that takes as inputs x and z
k(x, z) = (x⊤z)² = (x1z1 + x2z2)²
= x1²z1² + x2²z2² + 2x1x2z1z2
= (x1², √2 x1x2, x2²)⊤ (z1², √2 z1z2, z2²)
= φ(x)⊤φ(z)
The above k implicitly defines a mapping φ to a higher dimensional space
φ(x) = {x1², √2 x1x2, x2²}
Note that we didn’t have to define/compute this mapping
Simply defining the kernel a certain way gives a higher dim. mapping φ
Moreover, the kernel k(x, z) also computes the dot product φ(x)⊤φ(z)
φ(x)⊤φ(z) would otherwise be much more expensive to compute explicitly
All kernel functions have these properties
(CS5350/6350) Kernel Methods September 15, 2011 6 / 16
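A small symbolic check of the identity above, assuming SymPy is available; this is only an illustration, not part of the lecture.

```python
import sympy as sp

x1, x2, z1, z2 = sp.symbols('x1 x2 z1 z2')

kernel = (x1 * z1 + x2 * z2) ** 2                        # (x^T z)^2
phi_x = sp.Matrix([x1**2, sp.sqrt(2) * x1 * x2, x2**2])
phi_z = sp.Matrix([z1**2, sp.sqrt(2) * z1 * z2, z2**2])

# Prints 0: the kernel equals phi(x)^T phi(z) without ever forming phi.
print(sp.simplify(kernel - phi_x.dot(phi_z)))
```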
Kernels: Formally Defined
Recall: Each kernel k has an associated feature mapping φ
φ takes input x ∈ X (input space) and maps it to F (“feature space”)
Kernel k(x, z) takes two inputs and gives their similarity in F space
φ : X → F
k : X × X → R, k(x, z) = φ(x)⊤φ(z)
F needs to be a vector space with a dot product defined on it
Also called a Hilbert Space
Can just any function be used as a kernel function?
No. It must satisfy Mercer’s Condition
(CS5350/6350) Kernel Methods September 15, 2011 7 / 16
Mercer’s Condition
For k to be a kernel function
There must exist a Hilbert Space F for which k defines a dot product
The above is true if k is a positive definite function:
∫ dx ∫ dz f(x) k(x, z) f(z) > 0   (∀f ∈ L2)
This is Mercer’s Condition
Let k1, k2 be two kernel functions; then the following are kernels as well:
k(x, z) = k1(x, z) + k2(x, z): direct sum
k(x, z) = αk1(x, z): scalar product
k(x, z) = k1(x, z)k2(x, z): direct product
Kernels can also be constructed by composing these rules
(CS5350/6350) Kernel Methods September 15, 2011 8 / 16
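A quick numeric illustration (not from the slides) of these closure rules: Gram matrices built from the sum, a positive scaling, and the elementwise product of two kernels all stay positive semi-definite. The data, the scaling factor, and the RBF bandwidth are made up; NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3))

K1 = X @ X.T                                                         # linear kernel
K2 = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))   # RBF kernel

for name, K in [("k1 + k2", K1 + K2),        # direct sum
                ("alpha * k1", 3.0 * K1),    # scalar product (alpha = 3 > 0)
                ("k1 * k2", K1 * K2)]:       # direct (elementwise) product
    min_eig = np.linalg.eigvalsh((K + K.T) / 2).min()
    print(name, min_eig >= -1e-8)            # True: still a valid kernel matrix
```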
The Kernel Matrix
The kernel function k also defines the Kernel Matrix K over the data
Given N examples {x1, . . . , xN }, the (i, j)-th entry of K is defined as:
Kij = k(xi, xj) = φ(xi)⊤φ(xj)
Kij : Similarity between the i-th and j-th example in the feature space F
K: N × N matrix of pairwise similarities between examples in F space
K is a symmetric matrix
K is a positive definite matrix (except for a few exceptions)
For a P.D. matrix: z⊤Kz > 0 for all nonzero z ∈ R^N (also, all eigenvalues are positive)
The Kernel Matrix K is also known as the Gram Matrix
(CS5350/6350) Kernel Methods September 15, 2011 9 / 16
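A minimal sketch (not from the slides) of building the Gram matrix for the quadratic kernel from the earlier slide and checking the properties listed above; the data is made up and NumPy is assumed.

```python
import numpy as np

def k(x, z):
    return (x @ z) ** 2          # quadratic kernel from the earlier slide

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 4))  # N = 6 made-up examples with D = 4 features

N = len(X)
K = np.array([[k(X[i], X[j]) for j in range(N)] for i in range(N)])

print(np.allclose(K, K.T))                       # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)     # no negative eigenvalues
```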
Some Examples of Kernels
The following are the most popular kernels for real-valued vector inputs
Linear (trivial) Kernel:
k(x, z) = x⊤z (the mapping φ is the identity, i.e., no mapping)
Quadratic Kernel:
k(x, z) = (x⊤z)² or (1 + x⊤z)²
Polynomial Kernel (of degree d):
k(x, z) = (x⊤z)^d or (1 + x⊤z)^d
Radial Basis Function (RBF) Kernel:
k(x, z) = exp[−γ‖x − z‖²]
γ is a hyperparameter (also called the kernel bandwidth)
The RBF kernel corresponds to an infinite dimensional feature space F (i.e.,
you can’t actually write down the vector φ(x))
Note: Kernel hyperparameters (e.g., d, γ) chosen via cross-validation
(CS5350/6350) Kernel Methods September 15, 2011 10 / 16
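The kernels above written as plain Python functions, as a sketch (not from the slides); the default values of c, d, and gamma are arbitrary placeholders for the hyperparameters mentioned on the slide.

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, d=3, c=1.0):
    return (c + x @ z) ** d                      # c = 0 gives the (x.z)^d form

def rbf_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, z), polynomial_kernel(x, z, d=2), rbf_kernel(x, z))
```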
Using Kernels
Kernels can turn a linear model into a nonlinear one
Recall: Kernel k(x, z) represents a dot product in some high dimensional
feature space F
Any learning algorithm in which examples only appear as dot products (xi⊤xj) can be kernelized (i.e., made nonlinear)
... by replacing the xi⊤xj terms with φ(xi)⊤φ(xj) = k(xi, xj) (see the kernel perceptron sketch after this slide)
Most learning algorithms are like that
Perceptron, SVM, linear regression, etc.
Many unsupervised learning algorithms can be kernelized too (e.g., K-means clustering, Principal Component Analysis, etc.)
(CS5350/6350) Kernel Methods September 15, 2011 11 / 16
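As one concrete example of such a kernelization (a sketch under these slides' notation, not taken from the lecture), here is a kernel perceptron: the weight vector is never formed, and training and prediction touch the examples only through k(xi, xj). The function names and toy data are made up; NumPy is assumed.

```python
import numpy as np

def rbf(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_perceptron(X, y, kernel=rbf, epochs=20):
    N = len(X)
    alpha = np.zeros(N)                       # one dual variable per example
    K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
    for _ in range(epochs):
        for i in range(N):
            if np.sign(np.sum(alpha * y * K[:, i])) != y[i]:
                alpha[i] += 1                 # mistake: bump this example's alpha
    return alpha

def predict(x, X, y, alpha, kernel=rbf):
    return np.sign(sum(a * yn * kernel(xn, x) for a, yn, xn in zip(alpha, y, X)))

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [-1.0, 2.0]])
y = np.array([1, 1, -1, -1])
alpha = kernel_perceptron(X, y)
print([predict(x, X, y, alpha) for x in X])   # should match the training labels
```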
Kernelized SVM Training
Recall the SVM dual Lagrangian:
Maximize LD(w, b, ξ, α, β) = Σ_{n=1..N} αn − (1/2) Σ_{m,n=1..N} αm αn ym yn (xm⊤xn)
subject to Σ_{n=1..N} αn yn = 0,   0 ≤ αn ≤ C;   n = 1, . . . , N
Replacing xm⊤xn by φ(xm)⊤φ(xn) = k(xm, xn) = Kmn, where k(·, ·) is some suitable kernel function:
Maximize LD(w, b, ξ, α, β) = Σ_{n=1..N} αn − (1/2) Σ_{m,n=1..N} αm αn ym yn Kmn
subject to Σ_{n=1..N} αn yn = 0,   0 ≤ αn ≤ C;   n = 1, . . . , N
SVM now learns a linear separator in the kernel defined feature space F
This corresponds to a non-linear separator in the original space X
(CS5350/6350) Kernel Methods September 15, 2011 12 / 16
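A sketch of solving this kernelized dual in practice by handing a precomputed Gram matrix to an off-the-shelf SVM solver. It assumes scikit-learn is available and uses made-up data with an RBF kernel; this is an illustration, not the lecture's own implementation.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)   # nonlinear labels

gamma = 1.0
def gram(A, B):
    # K[i, j] = exp(-gamma * ||a_i - b_j||^2), the RBF kernel from the slides
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

K = gram(X, X)                                    # N x N kernel matrix
svm = SVC(kernel="precomputed", C=1.0).fit(K, y)  # dual solved over K only

X_test = rng.standard_normal((5, 2))
print(svm.predict(gram(X_test, X)))               # rows pair test vs training points
```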
Kernelized SVM Prediction
Prediction for a test example x (assume b = 0)
y = sign(w⊤x) = sign(Σ_{n∈SV} αn yn xn⊤x)
SV is the set of support vectors (i.e., examples for which αn > 0)
Replacing each example with its feature mapped representation (x → φ(x))
y = sign(Σ_{n∈SV} αn yn φ(xn)⊤φ(x)) = sign(Σ_{n∈SV} αn yn k(xn, x))
The weight vector for the kernelized case can be expressed as:
w = Σ_{n∈SV} αn yn φ(xn) = Σ_{n∈SV} αn yn k(xn, ·)
Important: A kernelized SVM needs the support vectors at test time
(except when you can write φ(xn) as an explicit, reasonably-sized vector)
In the unkernelized version, w = Σ_{n∈SV} αn yn xn can be computed and stored as
a D × 1 vector, so the support vectors need not be stored
(CS5350/6350) Kernel Methods September 15, 2011 13 / 16
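A minimal sketch (not from the slides) of this prediction rule with b = 0: the label is the sign of a kernel-weighted sum over the support vectors. The support set, α values, and kernel bandwidth below are made-up placeholders just to show the call.

```python
import numpy as np

def rbf(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_predict(x, sv_X, sv_y, sv_alpha, kernel=rbf):
    # y = sign( sum_{n in SV} alpha_n * y_n * k(x_n, x) ), with b = 0
    score = sum(a * yn * kernel(xn, x)
                for a, yn, xn in zip(sv_alpha, sv_y, sv_X))
    return np.sign(score)

# Placeholder support set just to show the call signature:
sv_X = np.array([[0.0, 1.0], [1.5, -0.5], [-1.0, -1.0]])
sv_y = np.array([1, -1, 1])
sv_alpha = np.array([0.7, 0.9, 0.4])
print(svm_predict(np.array([0.2, 0.8]), sv_X, sv_y, sv_alpha))
```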
SVM with an RBF Kernel
The learned decision boundary in the original space is nonlinear
(CS5350/6350) Kernel Methods September 15, 2011 14 / 16
Kernels: concluding notes
Kernels give a modular way to learn nonlinear patterns using linear models
All you need to do is replace the inner products with the kernel
All the computations remain as efficient as in the original space
Choice of the kernel is an important factor
Many kernels are tailor-made for specific types of data
Strings (string kernels): DNA matching, text classification, etc.
Trees (tree kernels): Comparing parse trees of phrases/sentences
Kernels can even be learned from the data (hot research topic!)
Kernel learning means learning the similarities between examples (instead of
using some pre-defined notion of similarity)
A question worth thinking about: Wouldn't mapping the data to a
higher-dimensional space cause my classifier (say, an SVM) to overfit?
The answer lies in the concepts of large margins and generalization
(CS5350/6350) Kernel Methods September 15, 2011 15 / 16
Next class..
Intro to probabilistic methods for supervised learning
Linear Regression (probabilistic version)
Logistic Regression
(CS5350/6350) Kernel Methods September 15, 2011 16 / 16

More Related Content

Similar to 3.SVM.pdf

Machine Learning - Dataset Preparation
Machine Learning - Dataset PreparationMachine Learning - Dataset Preparation
Machine Learning - Dataset PreparationAndrew Ferlitsch
 
Deep Learning, Keras, and TensorFlow
Deep Learning, Keras, and TensorFlowDeep Learning, Keras, and TensorFlow
Deep Learning, Keras, and TensorFlowOswald Campesato
 
Instance Based Learning in Machine Learning
Instance Based Learning in Machine LearningInstance Based Learning in Machine Learning
Instance Based Learning in Machine LearningPavithra Thippanaik
 
Matrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsMatrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsYONG ZHENG
 
17- Kernels and Clustering.pptx
17- Kernels and Clustering.pptx17- Kernels and Clustering.pptx
17- Kernels and Clustering.pptxssuser2023c6
 
Choose'10: Ralf Laemmel - Dealing Confortably with the Confusion of Tongues
Choose'10: Ralf Laemmel - Dealing Confortably with the Confusion of TonguesChoose'10: Ralf Laemmel - Dealing Confortably with the Confusion of Tongues
Choose'10: Ralf Laemmel - Dealing Confortably with the Confusion of TonguesCHOOSE
 
MATLAB/SIMULINK for Engineering Applications day 2:Introduction to simulink
MATLAB/SIMULINK for Engineering Applications day 2:Introduction to simulinkMATLAB/SIMULINK for Engineering Applications day 2:Introduction to simulink
MATLAB/SIMULINK for Engineering Applications day 2:Introduction to simulinkreddyprasad reddyvari
 
Trust Measurement Presentation_Part 3
Trust Measurement Presentation_Part 3Trust Measurement Presentation_Part 3
Trust Measurement Presentation_Part 3Gan Chun Chet
 
Introduction to Deep Learning and Tensorflow
Introduction to Deep Learning and TensorflowIntroduction to Deep Learning and Tensorflow
Introduction to Deep Learning and TensorflowOswald Campesato
 
Functional programming with Scala
Functional programming with ScalaFunctional programming with Scala
Functional programming with ScalaNeelkanth Sachdeva
 
MLconf NYC Xiangrui Meng
MLconf NYC Xiangrui MengMLconf NYC Xiangrui Meng
MLconf NYC Xiangrui MengMLconf
 
Fusing Transformations of Strict Scala Collections with Views
Fusing Transformations of Strict Scala Collections with ViewsFusing Transformations of Strict Scala Collections with Views
Fusing Transformations of Strict Scala Collections with ViewsPhilip Schwarz
 
ensembles_emptytemplate_v2
ensembles_emptytemplate_v2ensembles_emptytemplate_v2
ensembles_emptytemplate_v2Shrayes Ramesh
 
(slides 2) Visual Computing: Geometry, Graphics, and Vision
(slides 2)  Visual Computing: Geometry, Graphics, and Vision(slides 2)  Visual Computing: Geometry, Graphics, and Vision
(slides 2) Visual Computing: Geometry, Graphics, and VisionFrank Nielsen
 
Programming in Scala - Lecture Three
Programming in Scala - Lecture ThreeProgramming in Scala - Lecture Three
Programming in Scala - Lecture ThreeAngelo Corsaro
 
Data Wrangling with dplyr and tidyr Cheat Sheet
Data Wrangling with dplyr and tidyr Cheat SheetData Wrangling with dplyr and tidyr Cheat Sheet
Data Wrangling with dplyr and tidyr Cheat SheetDr. Volkan OBAN
 

Similar to 3.SVM.pdf (20)

Machine Learning - Dataset Preparation
Machine Learning - Dataset PreparationMachine Learning - Dataset Preparation
Machine Learning - Dataset Preparation
 
Deep Learning, Keras, and TensorFlow
Deep Learning, Keras, and TensorFlowDeep Learning, Keras, and TensorFlow
Deep Learning, Keras, and TensorFlow
 
Instance Based Learning in Machine Learning
Instance Based Learning in Machine LearningInstance Based Learning in Machine Learning
Instance Based Learning in Machine Learning
 
Matrix Factorization In Recommender Systems
Matrix Factorization In Recommender SystemsMatrix Factorization In Recommender Systems
Matrix Factorization In Recommender Systems
 
17- Kernels and Clustering.pptx
17- Kernels and Clustering.pptx17- Kernels and Clustering.pptx
17- Kernels and Clustering.pptx
 
Choose'10: Ralf Laemmel - Dealing Confortably with the Confusion of Tongues
Choose'10: Ralf Laemmel - Dealing Confortably with the Confusion of TonguesChoose'10: Ralf Laemmel - Dealing Confortably with the Confusion of Tongues
Choose'10: Ralf Laemmel - Dealing Confortably with the Confusion of Tongues
 
SVM.ppt
SVM.pptSVM.ppt
SVM.ppt
 
MATLAB/SIMULINK for Engineering Applications day 2:Introduction to simulink
MATLAB/SIMULINK for Engineering Applications day 2:Introduction to simulinkMATLAB/SIMULINK for Engineering Applications day 2:Introduction to simulink
MATLAB/SIMULINK for Engineering Applications day 2:Introduction to simulink
 
Trust Measurement Presentation_Part 3
Trust Measurement Presentation_Part 3Trust Measurement Presentation_Part 3
Trust Measurement Presentation_Part 3
 
Scala and Deep Learning
Scala and Deep LearningScala and Deep Learning
Scala and Deep Learning
 
Introduction to Deep Learning and Tensorflow
Introduction to Deep Learning and TensorflowIntroduction to Deep Learning and Tensorflow
Introduction to Deep Learning and Tensorflow
 
Functional programming with Scala
Functional programming with ScalaFunctional programming with Scala
Functional programming with Scala
 
MLconf NYC Xiangrui Meng
MLconf NYC Xiangrui MengMLconf NYC Xiangrui Meng
MLconf NYC Xiangrui Meng
 
Fusing Transformations of Strict Scala Collections with Views
Fusing Transformations of Strict Scala Collections with ViewsFusing Transformations of Strict Scala Collections with Views
Fusing Transformations of Strict Scala Collections with Views
 
ensembles_emptytemplate_v2
ensembles_emptytemplate_v2ensembles_emptytemplate_v2
ensembles_emptytemplate_v2
 
(slides 2) Visual Computing: Geometry, Graphics, and Vision
(slides 2)  Visual Computing: Geometry, Graphics, and Vision(slides 2)  Visual Computing: Geometry, Graphics, and Vision
(slides 2) Visual Computing: Geometry, Graphics, and Vision
 
Data Visualization With R
Data Visualization With RData Visualization With R
Data Visualization With R
 
Programming in Scala - Lecture Three
Programming in Scala - Lecture ThreeProgramming in Scala - Lecture Three
Programming in Scala - Lecture Three
 
Lecture 3
Lecture 3Lecture 3
Lecture 3
 
Data Wrangling with dplyr and tidyr Cheat Sheet
Data Wrangling with dplyr and tidyr Cheat SheetData Wrangling with dplyr and tidyr Cheat Sheet
Data Wrangling with dplyr and tidyr Cheat Sheet
 

More from Variable14

12. Multi Resolution analysis, Scale invariant futures.pptx
12. Multi Resolution analysis, Scale invariant futures.pptx12. Multi Resolution analysis, Scale invariant futures.pptx
12. Multi Resolution analysis, Scale invariant futures.pptxVariable14
 
6.Evaluationandmodelselection.pdf
6.Evaluationandmodelselection.pdf6.Evaluationandmodelselection.pdf
6.Evaluationandmodelselection.pdfVariable14
 
2.Find_SandCandidateElimination.pdf
2.Find_SandCandidateElimination.pdf2.Find_SandCandidateElimination.pdf
2.Find_SandCandidateElimination.pdfVariable14
 
5.levelofAbstraction.pptx
5.levelofAbstraction.pptx5.levelofAbstraction.pptx
5.levelofAbstraction.pptxVariable14
 
Emotional Intelligence.pptx
Emotional Intelligence.pptxEmotional Intelligence.pptx
Emotional Intelligence.pptxVariable14
 

More from Variable14 (8)

16. DLT.pdf
16. DLT.pdf16. DLT.pdf
16. DLT.pdf
 
12. Multi Resolution analysis, Scale invariant futures.pptx
12. Multi Resolution analysis, Scale invariant futures.pptx12. Multi Resolution analysis, Scale invariant futures.pptx
12. Multi Resolution analysis, Scale invariant futures.pptx
 
6.Evaluationandmodelselection.pdf
6.Evaluationandmodelselection.pdf6.Evaluationandmodelselection.pdf
6.Evaluationandmodelselection.pdf
 
2.Find_SandCandidateElimination.pdf
2.Find_SandCandidateElimination.pdf2.Find_SandCandidateElimination.pdf
2.Find_SandCandidateElimination.pdf
 
2.ANN.pptx
2.ANN.pptx2.ANN.pptx
2.ANN.pptx
 
5.levelofAbstraction.pptx
5.levelofAbstraction.pptx5.levelofAbstraction.pptx
5.levelofAbstraction.pptx
 
HR.pptx
HR.pptxHR.pptx
HR.pptx
 
Emotional Intelligence.pptx
Emotional Intelligence.pptxEmotional Intelligence.pptx
Emotional Intelligence.pptx
 

Recently uploaded

Working Principle of Echo Sounder and Doppler Effect.pdf
Working Principle of Echo Sounder and Doppler Effect.pdfWorking Principle of Echo Sounder and Doppler Effect.pdf
Working Principle of Echo Sounder and Doppler Effect.pdfSkNahidulIslamShrabo
 
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...drjose256
 
Autodesk Construction Cloud (Autodesk Build).pptx
Autodesk Construction Cloud (Autodesk Build).pptxAutodesk Construction Cloud (Autodesk Build).pptx
Autodesk Construction Cloud (Autodesk Build).pptxMustafa Ahmed
 
Dynamo Scripts for Task IDs and Space Naming.pptx
Dynamo Scripts for Task IDs and Space Naming.pptxDynamo Scripts for Task IDs and Space Naming.pptx
Dynamo Scripts for Task IDs and Space Naming.pptxMustafa Ahmed
 
Artificial Intelligence in due diligence
Artificial Intelligence in due diligenceArtificial Intelligence in due diligence
Artificial Intelligence in due diligencemahaffeycheryld
 
21scheme vtu syllabus of visveraya technological university
21scheme vtu syllabus of visveraya technological university21scheme vtu syllabus of visveraya technological university
21scheme vtu syllabus of visveraya technological universityMohd Saifudeen
 
Seizure stage detection of epileptic seizure using convolutional neural networks
Seizure stage detection of epileptic seizure using convolutional neural networksSeizure stage detection of epileptic seizure using convolutional neural networks
Seizure stage detection of epileptic seizure using convolutional neural networksIJECEIAES
 
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdflitvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdfAlexander Litvinenko
 
Developing a smart system for infant incubators using the internet of things ...
Developing a smart system for infant incubators using the internet of things ...Developing a smart system for infant incubators using the internet of things ...
Developing a smart system for infant incubators using the internet of things ...IJECEIAES
 
15-Minute City: A Completely New Horizon
15-Minute City: A Completely New Horizon15-Minute City: A Completely New Horizon
15-Minute City: A Completely New HorizonMorshed Ahmed Rahath
 
Fuzzy logic method-based stress detector with blood pressure and body tempera...
Fuzzy logic method-based stress detector with blood pressure and body tempera...Fuzzy logic method-based stress detector with blood pressure and body tempera...
Fuzzy logic method-based stress detector with blood pressure and body tempera...IJECEIAES
 
What is Coordinate Measuring Machine? CMM Types, Features, Functions
What is Coordinate Measuring Machine? CMM Types, Features, FunctionsWhat is Coordinate Measuring Machine? CMM Types, Features, Functions
What is Coordinate Measuring Machine? CMM Types, Features, FunctionsVIEW
 
analog-vs-digital-communication (concept of analog and digital).pptx
analog-vs-digital-communication (concept of analog and digital).pptxanalog-vs-digital-communication (concept of analog and digital).pptx
analog-vs-digital-communication (concept of analog and digital).pptxKarpagam Institute of Teechnology
 
NO1 Best Powerful Vashikaran Specialist Baba Vashikaran Specialist For Love V...
NO1 Best Powerful Vashikaran Specialist Baba Vashikaran Specialist For Love V...NO1 Best Powerful Vashikaran Specialist Baba Vashikaran Specialist For Love V...
NO1 Best Powerful Vashikaran Specialist Baba Vashikaran Specialist For Love V...Amil baba
 
Software Engineering Practical File Front Pages.pdf
Software Engineering Practical File Front Pages.pdfSoftware Engineering Practical File Front Pages.pdf
Software Engineering Practical File Front Pages.pdfssuser5c9d4b1
 
Circuit Breakers for Engineering Students
Circuit Breakers for Engineering StudentsCircuit Breakers for Engineering Students
Circuit Breakers for Engineering Studentskannan348865
 
21P35A0312 Internship eccccccReport.docx
21P35A0312 Internship eccccccReport.docx21P35A0312 Internship eccccccReport.docx
21P35A0312 Internship eccccccReport.docxrahulmanepalli02
 
Maher Othman Interior Design Portfolio..
Maher Othman Interior Design Portfolio..Maher Othman Interior Design Portfolio..
Maher Othman Interior Design Portfolio..MaherOthman7
 
Independent Solar-Powered Electric Vehicle Charging Station
Independent Solar-Powered Electric Vehicle Charging StationIndependent Solar-Powered Electric Vehicle Charging Station
Independent Solar-Powered Electric Vehicle Charging Stationsiddharthteach18
 
SLIDESHARE PPT-DECISION MAKING METHODS.pptx
SLIDESHARE PPT-DECISION MAKING METHODS.pptxSLIDESHARE PPT-DECISION MAKING METHODS.pptx
SLIDESHARE PPT-DECISION MAKING METHODS.pptxCHAIRMAN M
 

Recently uploaded (20)

Working Principle of Echo Sounder and Doppler Effect.pdf
Working Principle of Echo Sounder and Doppler Effect.pdfWorking Principle of Echo Sounder and Doppler Effect.pdf
Working Principle of Echo Sounder and Doppler Effect.pdf
 
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
  • 1. Kernel Methods and Nonlinear Classification Machine Learning (CS5350/6350) Kernel Methods September 15, 2011 1 / 16
  • 2-9. Kernel Methods: Motivation. Often we want to capture nonlinear patterns in the data. Nonlinear regression: the input-output relationship may not be linear. Nonlinear classification: the classes may not be separable by a linear boundary. Linear models (e.g., linear regression, linear SVM) are just not rich enough. Kernels make linear models work in nonlinear settings, by mapping the data to higher dimensions where it exhibits linear patterns and then applying the linear model in the new input space. Mapping ≡ changing the feature representation. Note: such mappings can be expensive to compute in general. Kernels give such mappings for (almost) free; in most cases the mappings need not even be computed, thanks to the kernel trick!
  • 10-13. Classifying non-linearly separable data. Consider this binary classification problem: each example is represented by a single feature x, and no linear separator exists for this data. Now map each example as x → {x, x²}. Each example now has two features ("derived" from the old representation), and the data becomes linearly separable in the new representation. Linear in the new representation ≡ nonlinear in the old representation.
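To make this concrete, here is a minimal sketch (not from the slides; it assumes NumPy and scikit-learn, and the tiny dataset is made up for illustration) of the x → {x, x²} mapping turning a non-separable 1-D problem into a separable 2-D one.

```python
# Illustration: the positive class lies on both sides of the negative class,
# so no single threshold on x separates them, but (x, x^2) is separable.
import numpy as np
from sklearn.svm import LinearSVC

x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([ 1,    1,   -1,  -1,  -1,   1,   1])   # +1 outside, -1 inside

# Original representation: a single feature x.
X_orig = x.reshape(-1, 1)
print(LinearSVC(C=100.0).fit(X_orig, y).score(X_orig, y))    # below 1.0: not separable

# Mapped representation: x -> (x, x^2); a linear separator now exists
# (roughly: predict +1 when x^2 exceeds some threshold).
X_mapped = np.column_stack([x, x**2])
print(LinearSVC(C=100.0).fit(X_mapped, y).score(X_mapped, y))  # typically 1.0
```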
  • 14-16. Classifying non-linearly separable data. Let's look at another example: each example is defined by two features, x = {x1, x2}, and no linear separator exists for this data. Now map each example as x = {x1, x2} → z = {x1², √2 x1x2, x2²}. Each example now has three features ("derived" from the old representation), and the data becomes linearly separable in the new representation.
  • 17-24. Feature Mapping. Consider the following mapping φ for an example x = {x1, . . . , xD}: φ : x → {x1², x2², . . . , xD², x1x2, x1x3, . . . , x1xD, . . . , xD−1xD}. It's an example of a quadratic mapping: each new feature uses a pair of the original features. Problem: such mappings usually make the number of features blow up! Computing the mapping itself can be inefficient in such cases. Moreover, using the mapped representation could be inefficient too; e.g., imagine computing the similarity between two examples: φ(x)⊤φ(z). Thankfully, kernels help us avoid both these issues: the mapping doesn't have to be explicitly computed, and computations with the mapped features remain efficient.
  • 25-35. Kernels as High-Dimensional Feature Mapping. Consider two examples x = {x1, x2} and z = {z1, z2}. Let's assume we are given a function k (a kernel) that takes x and z as inputs: k(x, z) = (x⊤z)² = (x1z1 + x2z2)² = x1²z1² + x2²z2² + 2x1x2z1z2 = (x1², √2 x1x2, x2²)⊤(z1², √2 z1z2, z2²) = φ(x)⊤φ(z). The above k implicitly defines a mapping φ to a higher-dimensional space, φ(x) = {x1², √2 x1x2, x2²}. Note that we didn't have to define/compute this mapping; simply defining the kernel a certain way gives a higher-dimensional mapping φ. Moreover, the kernel k(x, z) also computes the dot product φ(x)⊤φ(z), which would otherwise be much more expensive to compute explicitly. All kernel functions have these properties.
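As a quick numerical check of the identity above (a sketch of my own, assuming NumPy), the quadratic kernel evaluated in the original 2-D space matches the dot product after the explicit map φ(x) = (x1², √2 x1x2, x2²).

```python
import numpy as np

def phi(v):
    """Explicit quadratic feature map for a 2-D input: (v1^2, sqrt(2) v1 v2, v2^2)."""
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

def k_quad(x, z):
    """Quadratic kernel k(x, z) = (x . z)^2, computed in the original space."""
    return float(np.dot(x, z)) ** 2

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)

# Both routes give the same number, but k_quad never builds the 3-D vectors.
print(k_quad(x, z))                    # kernel evaluated directly
print(float(np.dot(phi(x), phi(z))))   # dot product after explicit mapping
```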
  • 36-41. Kernels: Formally Defined. Recall: each kernel k has an associated feature mapping φ. φ takes an input x ∈ X (the input space) and maps it to F (the "feature space"). The kernel k(x, z) takes two inputs and gives their similarity in F: φ : X → F, k : X × X → R, k(x, z) = φ(x)⊤φ(z). F needs to be a vector space with a dot product defined on it, also called a Hilbert space. Can just any function be used as a kernel function? No. It must satisfy Mercer's condition.
  • 42-50. Mercer's Condition. For k to be a kernel function, there must exist a Hilbert space F for which k defines a dot product. This holds if k is a positive definite function: ∫∫ f(x) k(x, z) f(z) dx dz > 0 for all f ∈ L2. This is Mercer's condition. Let k1, k2 be two kernel functions; then the following are kernels as well: k(x, z) = k1(x, z) + k2(x, z) (direct sum); k(x, z) = αk1(x, z) (scalar product, with α > 0); k(x, z) = k1(x, z)k2(x, z) (direct product). Kernels can also be constructed by composing these rules.
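As an illustration of these closure rules (a sketch, not from the slides; the linear and RBF kernels and the random data are my own choices), the snippet below builds new kernels by sum, positive scaling, and product, and checks on a sample that each resulting Gram matrix stays positive semi-definite.

```python
import numpy as np

def k_linear(x, z):
    return float(np.dot(x, z))

def k_rbf(x, z, gamma=0.5):
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))

# Closure rules from the slide: sum, positive scaling, and product of kernels.
def k_sum(x, z):   return k_linear(x, z) + k_rbf(x, z)
def k_scale(x, z): return 3.0 * k_rbf(x, z)
def k_prod(x, z):  return k_linear(x, z) * k_rbf(x, z)

def gram(k, X):
    return np.array([[k(xi, xj) for xj in X] for xi in X])

X = np.random.default_rng(1).normal(size=(20, 3))
for k in (k_sum, k_scale, k_prod):
    eig = np.linalg.eigvalsh(gram(k, X))   # symmetric matrix, real eigenvalues
    print(eig.min() >= -1e-9)              # PSD up to numerical tolerance
```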
  • 51-58. The Kernel Matrix. The kernel function k also defines the kernel matrix K over the data. Given N examples {x1, . . . , xN}, the (i, j)-th entry of K is defined as Kij = k(xi, xj) = φ(xi)⊤φ(xj). Kij is the similarity between the i-th and j-th examples in the feature space F, so K is the N × N matrix of pairwise similarities between examples in F. K is a symmetric matrix, and a positive definite matrix (except for a few exceptions). For a P.D. matrix, z⊤Kz > 0 for all z ∈ R^N (equivalently, all eigenvalues are positive). The kernel matrix K is also known as the Gram matrix.
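A small sketch (my own, with an assumed RBF kernel and random data) of building the Gram matrix for a dataset and confirming the symmetry and eigenvalue properties just described.

```python
import numpy as np

def rbf(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))          # N = 50 examples, D = 4 features
N = X.shape[0]

# Gram matrix: K[i, j] = k(x_i, x_j) = phi(x_i) . phi(x_j)
K = np.array([[rbf(X[i], X[j]) for j in range(N)] for i in range(N)])

print(np.allclose(K, K.T))                     # symmetric
print(np.linalg.eigvalsh(K).min() > -1e-10)    # eigenvalues (numerically) non-negative
```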
  • 59-65. Some Examples of Kernels. The following are the most popular kernels for real-valued vector inputs. Linear (trivial) kernel: k(x, z) = x⊤z (the mapping φ is the identity, i.e., no mapping). Quadratic kernel: k(x, z) = (x⊤z)² or (1 + x⊤z)². Polynomial kernel (of degree d): k(x, z) = (x⊤z)^d or (1 + x⊤z)^d. Radial basis function (RBF) kernel: k(x, z) = exp(−γ||x − z||²), where γ is a hyperparameter (also called the kernel bandwidth). The RBF kernel corresponds to an infinite-dimensional feature space F (i.e., you can't actually write down the vector φ(x)). Note: kernel hyperparameters (e.g., d, γ) are chosen via cross-validation.
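For instance, with scikit-learn (an assumption; the slides do not prescribe any library), choosing the RBF bandwidth γ and the soft-margin parameter C by cross-validation might look like this sketch on a synthetic dataset.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic, non-linearly separable data (two concentric circles).
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

# Cross-validate over the RBF bandwidth gamma and the soft-margin parameter C.
param_grid = {"gamma": [0.01, 0.1, 1.0, 10.0], "C": [0.1, 1.0, 10.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # hyperparameters picked by cross-validation
print(search.best_score_)    # mean CV accuracy of the best setting
```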
  • 66-72. Using Kernels. Kernels can turn a linear model into a nonlinear one. Recall: the kernel k(x, z) represents a dot product in some high-dimensional feature space F. Any learning algorithm in which the examples only appear as dot products xi⊤xj can be kernelized (i.e., made nonlinear), by replacing the xi⊤xj terms with φ(xi)⊤φ(xj) = k(xi, xj). Most learning algorithms are like that: Perceptron, SVM, linear regression, etc. Many unsupervised learning algorithms can be kernelized too (e.g., K-means clustering, Principal Component Analysis, etc.).
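As an example of kernelizing an algorithm in which examples appear only through dot products, here is a sketch of a kernel Perceptron (my own, not from the slides): the weight vector is never formed explicitly; instead, per-example coefficients α are kept and predictions use Σ_n αn yn k(xn, x).

```python
import numpy as np

def kernel_perceptron(X, y, k, epochs=10):
    """Train a kernel Perceptron.  X: (N, D) inputs, y: labels in {-1, +1},
    k: kernel function.  Returns the per-example mistake counts alpha."""
    N = X.shape[0]
    alpha = np.zeros(N)
    K = np.array([[k(X[i], X[j]) for j in range(N)] for i in range(N)])
    for _ in range(epochs):
        for i in range(N):
            # Prediction uses only kernel evaluations, never an explicit phi(x).
            score = np.sum(alpha * y * K[:, i])
            if y[i] * score <= 0:          # mistake: strengthen this example
                alpha[i] += 1.0
    return alpha

def predict(X_train, y_train, alpha, k, x_new):
    return np.sign(sum(a * yi * k(xi, x_new)
                       for a, yi, xi in zip(alpha, y_train, X_train)))

# Toy usage with a quadratic kernel: XOR-like data, not linearly separable.
k_quad = lambda x, z: (1.0 + np.dot(x, z)) ** 2
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])
alpha = kernel_perceptron(X, y, k_quad)
print([predict(X, y, alpha, k_quad, x) for x in X])   # should recover y
```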
  • 73-77. Kernelized SVM Training. Recall the SVM dual Lagrangian: maximize L_D(w, b, ξ, α, β) = Σ_{n=1}^N αn − (1/2) Σ_{m,n=1}^N αm αn ym yn (xm⊤xn), subject to Σ_{n=1}^N αn yn = 0 and 0 ≤ αn ≤ C for n = 1, . . . , N. Replacing xm⊤xn by φ(xm)⊤φ(xn) = k(xm, xn) = Kmn, where k(·, ·) is some suitable kernel function, gives: maximize L_D(w, b, ξ, α, β) = Σ_{n=1}^N αn − (1/2) Σ_{m,n=1}^N αm αn ym yn Kmn, subject to the same constraints Σ_{n=1}^N αn yn = 0, 0 ≤ αn ≤ C, n = 1, . . . , N. The SVM now learns a linear separator in the kernel-defined feature space F, which corresponds to a nonlinear separator in the original space X.
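One way to see that the solver only needs the matrix Kmn is to hand scikit-learn's SVC a precomputed Gram matrix (a sketch under that assumption; the RBF kernel, γ value, and dataset are illustrative choices, not from the slides).

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

def rbf_gram(A, B, gamma=2.0):
    """Gram matrix K[m, n] = exp(-gamma * ||a_m - b_n||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

X_train, y_train = make_moons(n_samples=200, noise=0.15, random_state=0)

# Train on the N x N kernel matrix instead of the raw features.
K_train = rbf_gram(X_train, X_train)
svm = SVC(kernel="precomputed", C=1.0).fit(K_train, y_train)

# At test time we need kernel values between test points and the *training* points.
X_test, y_test = make_moons(n_samples=100, noise=0.15, random_state=1)
K_test = rbf_gram(X_test, X_train)
print(svm.score(K_test, y_test))    # accuracy of the kernelized SVM
```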
  • 78-85. Kernelized SVM Prediction. Prediction for a test example x (assume b = 0): y = sign(w⊤x) = sign(Σ_{n∈SV} αn yn xn⊤x), where SV is the set of support vectors (i.e., the examples for which αn > 0). Replacing each example with its feature-mapped representation (x → φ(x)): y = sign(Σ_{n∈SV} αn yn φ(xn)⊤φ(x)) = sign(Σ_{n∈SV} αn yn k(xn, x)). The weight vector for the kernelized case can be expressed as w = Σ_{n∈SV} αn yn φ(xn) = Σ_{n∈SV} αn yn k(xn, ·). Important: the kernelized SVM needs the support vectors at test time (except when you can write φ(xn) as an explicit, reasonably sized vector). In the unkernelized version, w = Σ_{n∈SV} αn yn xn can be computed and stored as a D × 1 vector, so the support vectors need not be stored.
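The prediction rule f(x) = Σ_{n∈SV} αn yn k(xn, x) + b can be reproduced from a fitted model. The sketch below assumes scikit-learn's conventions (dual_coef_ holding αn·yn for the support vectors, support_vectors_, intercept_) and checks the manual sum against the library's decision_function.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)
svm = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)

def rbf(a, b, gamma=2.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

sv = svm.support_vectors_            # the support vectors themselves
alpha_times_y = svm.dual_coef_[0]    # alpha_n * y_n, one entry per support vector
b0 = svm.intercept_[0]

def manual_decision(x):
    # f(x) = sum_{n in SV} alpha_n y_n k(x_n, x) + b
    return sum(ay * rbf(s, x) for ay, s in zip(alpha_times_y, sv)) + b0

# The manual decision values should match the library's decision_function.
print(np.allclose([manual_decision(x) for x in X[:5]],
                  svm.decision_function(X[:5])))
```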
  • 86. SVM with an RBF Kernel The learned decision boundary in the original space is nonlinear (CS5350/6350) Kernel Methods September 15, 2011 14 / 16
  • 87-94. Kernels: concluding notes. Kernels give a modular way to learn nonlinear patterns using linear models: all you need to do is replace the inner products with the kernel, and all the computations remain as efficient as in the original space. The choice of kernel is an important factor. Many kernels are tailor-made for specific types of data: strings (string kernels, e.g., for DNA matching and text classification) and trees (tree kernels, for comparing parse trees of phrases/sentences); a toy string kernel is sketched below. Kernels can even be learned from the data (a hot research topic!); kernel learning means learning the similarities between examples, instead of using some pre-defined notion of similarity. A question worth thinking about: wouldn't mapping the data to a higher-dimensional space cause my classifier (say, an SVM) to overfit? The answer lies in the concepts of large margins and generalization.
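To illustrate kernels on non-vector data, here is a toy p-spectrum string kernel (my own simplified example; real string kernels such as spectrum or subsequence kernels are more refined): it counts shared length-p substrings, and the resulting Gram matrix can be fed to any kernelized learner.

```python
from collections import Counter

def spectrum_kernel(s, t, p=3):
    """Toy p-spectrum kernel: dot product of substring-count feature vectors.
    phi(s) maps a string to the counts of all its length-p substrings."""
    cs = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    ct = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(cs[sub] * ct[sub] for sub in cs)   # only shared substrings contribute

print(spectrum_kernel("GATTACA", "ATTAC"))    # shares "ATT", "TTA", "TAC" -> 3
print(spectrum_kernel("GATTACA", "CCCCGGG"))  # no shared 3-mers -> 0
```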
  • 95. Next class.. Intro to probabilistic methods for supervised learning Linear Regression (probabilistic version) Logistic Regression (CS5350/6350) Kernel Methods September 15, 2011 16 / 16