Support Vector Machines
CSE 6363 – Machine Learning
Vassilis Athitsos
Computer Science and Engineering Department
University of Texas at Arlington
1
A Linearly Separable Problem
• Consider the binary classification problem on the figure.
– The blue points belong to one class, with label +1.
– The orange points belong to the other class, with label -1.
• These two classes are linearly separable.
– Infinitely many lines separate them.
– Are any of those infinitely many lines preferable?
2
A Linearly Separable Problem
• Do we prefer the blue line or the red line, as decision
boundary?
• What criterion can we use?
• Both decision boundaries classify the training data with
100% accuracy.
3
Margin of a Decision Boundary
• The margin of a decision boundary is defined as the
smallest distance between the boundary and any of the
samples.
4
Margin of a Decision Boundary
• One way to visualize the margin is this:
– For each class, draw a line that:
• is parallel to the decision boundary.
• touches the class point that is the closest to the decision boundary.
– The margin is the smallest distance between the decision
boundary and one of those two parallel lines.
• In this example, the decision boundary is equally far from both lines.
5
Support Vector Machines
• One way to visualize the margin is this:
– For each class, draw a line that:
• is parallel to the decision boundary.
• touches the class point that is the closest to the decision boundary.
– The margin is the smallest distance between the decision
boundary and one of those two parallel lines.
6
Support Vector Machines
• Support Vector Machines (SVMs) are a classification
method, whose goal is to find the decision boundary
with the maximum margin.
– The idea is that, even if multiple decision boundaries give
100% accuracy on the training data, larger margins lead to
less overfitting.
– Larger margins can tolerate more perturbations of the data.
7
Support Vector Machines
• Note: so far, we are only discussing cases where the training data
is linearly separable.
• First, we will see how to maximize the margin for such data.
• Second, we will deal with data that are not linearly separable.
– We will define SVMs that classify such training data imperfectly.
• Third, we will see how to define nonlinear SVMs, which can
define non-linear decision boundaries.
8
Support Vector Machines
• Note: so far, we are only discussing cases where the training data
is linearly separable.
• First, we will see how to maximize the margin for such data.
• Second, we will deal with data that are not linearly separable.
– We will define SVMs that classify such training data imperfectly.
• Third, we will see how to define nonlinear SVMs, which can
define non-linear decision boundaries.
9
An example of a nonlinear
decision boundary produced
by a nonlinear SVM.
Support Vectors
• In the figure, the red line is the maximum margin decision
boundary.
• One of the parallel lines touches a single orange point.
– If that orange point moves closer to or farther from the red line, the
optimal boundary changes.
– If other orange points move, the optimal boundary does not change,
unless those points move to the right of the blue line.
10
Support Vectors
• In the figure, the red line is the maximum margin decision
boundary.
• One of the parallel lines touches two blue points.
– If either of those points moves closer to or farther from the red line, the
optimal boundary changes.
– If other blue points move, the optimal boundary does not change, unless
those points move to the left of the blue line.
11
Support Vectors
• In summary, in this example, the maximum margin is
defined by only three points:
– One orange point.
– Two blue points.
• These points are called support vectors.
– They are indicated by a black circle around them.
12
Distances to the Boundary
• The decision boundary consists of all points 𝒙 that
are solutions to equation: 𝒘𝑇𝒙 + 𝑏 = 0.
– 𝒘 is a column vector of parameters (weights).
– 𝒙 is an input vector.
– 𝑏 is a scalar value (a real number).
• If 𝒙𝑛 is a training point, its distance to the boundary
is computed using this equation:
𝐷(𝒙𝑛, 𝒘) = |𝒘𝑇𝒙𝑛 + 𝑏| / ‖𝒘‖
13
Distances to the Boundary
• If 𝒙𝑛 is a training point, its distance to the boundary is
computed using this equation:
𝐷(𝒙𝑛, 𝒘) = |𝒘𝑇𝒙𝑛 + 𝑏| / ‖𝒘‖
• Since the training data are linearly separable, the data from
each class should fall on opposite sides of the boundary.
• Suppose that 𝑡𝑛 = −1 for points of one class, and 𝑡𝑛 = +1
for points of the other class.
• Then, we can rewrite the distance as:
𝐷(𝒙𝑛, 𝒘) = 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) / ‖𝒘‖
14
Distances to the Boundary
• So, given a decision boundary defined by 𝒘 and 𝑏, and given a
training input 𝒙𝑛, the distance of 𝒙𝑛 to the boundary is:
𝐷(𝒙𝑛, 𝒘) = 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) / ‖𝒘‖
• If 𝑡𝑛 = −1, then:
– 𝒘𝑇𝒙𝑛 + 𝑏 < 0.
– 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) > 0.
• If 𝑡𝑛 = 1, then:
– 𝒘𝑇𝒙𝑛 + 𝑏 > 0.
– 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) > 0.
• So, in all cases, 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) is positive.
15
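As a quick numerical illustration, here is a minimal NumPy sketch of this distance computation; the boundary parameters and the training point are made-up values, not data from the slides.

```python
import numpy as np

# Hypothetical boundary parameters and training point (made-up values).
w = np.array([2.0, -1.0])   # weight vector
b = 0.5                     # bias term
x_n = np.array([1.0, 3.0])  # a training point
t_n = -1                    # its label, +1 or -1

# Signed quantity t_n * (w^T x_n + b); positive when x_n is correctly classified.
signed = t_n * (np.dot(w, x_n) + b)

# Distance of x_n to the boundary: t_n * (w^T x_n + b) / ||w||.
distance = signed / np.linalg.norm(w)
print(distance)
```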
Optimization Criterion
• If 𝒙𝑛 is a training point, its distance to the boundary is computed using this equation:
𝐷(𝒙𝑛, 𝒘) = 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) / ‖𝒘‖
• Therefore, the optimal boundary 𝒘opt is defined as:
(𝒘opt, 𝑏opt) = argmax𝒘,𝑏 { min𝑛 [ 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) / ‖𝒘‖ ] }
– In words: find the 𝒘 and 𝑏 that maximize the minimum distance of any training input from the boundary.
16
Optimization Criterion
• The optimal boundary 𝒘opt is defined as:
(𝒘opt, 𝑏opt) = argmax𝒘,𝑏 { min𝑛 [ 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) / ‖𝒘‖ ] }
• Suppose that, for some values 𝒘 and 𝑏, the decision boundary
defined by 𝒘𝑇𝒙𝑛 + 𝑏 = 0 misclassifies some objects.
• Can those values of 𝒘 and 𝑏 be selected as 𝒘opt, 𝑏opt?
17
Optimization Criterion
• The optimal boundary 𝒘opt is defined as:
(𝒘opt, 𝑏opt) = argmax𝒘,𝑏 { min𝑛 [ 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) / ‖𝒘‖ ] }
• Suppose that, for some values 𝒘 and 𝑏, the decision boundary
defined by 𝒘𝑇𝒙𝑛 + 𝑏 = 0 misclassifies some objects.
• Can those values of 𝒘 and 𝑏 be selected as 𝒘opt, 𝑏opt?
• No.
– If some objects get misclassified, then for some 𝒙𝑛 it holds that 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏)/‖𝒘‖ < 0.
– Thus, for such 𝒘 and 𝑏, the expression in red will be negative.
– Since the data is linearly separable, we can find better values for 𝒘 and 𝑏, for which the expression in red will be greater than 0.
18
Scale of 𝒘
• The optimal boundary 𝒘opt is defined as:
(𝒘opt, 𝑏opt) = argmax𝒘,𝑏 { min𝑛 [ 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) / ‖𝒘‖ ] }
• Suppose that 𝑔 is a real number, and 𝑔 > 0.
• If 𝒘opt and 𝑏opt define an optimal boundary, then 𝑔𝒘opt and 𝑔𝑏opt also define an optimal boundary.
• We constrain the scale of 𝒘opt to a single value, by requiring that:
min𝑛 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) = 1
19
Optimization Criterion
• We introduced the requirement that: min𝑛 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) = 1
• Therefore, for any 𝒙𝑛, it holds that: 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) ≥ 1
• The original optimization criterion becomes:
(𝒘opt, 𝑏opt) = argmax𝒘,𝑏 { min𝑛 [ 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) / ‖𝒘‖ ] } ⇒
𝒘opt = argmax𝒘 ( 1 / ‖𝒘‖ ) = argmin𝒘 ‖𝒘‖ ⇒
𝒘opt = argmin𝒘 ( ½ ‖𝒘‖² )
20
These are equivalent formulations.
The textbook uses the last one because
it simplifies subsequent calculations.
Constrained Optimization
• Summarizing the previous slides, we want to find:
𝒘opt = argmin𝒘 ( ½ ‖𝒘‖² )
subject to the following constraints:
∀𝑛 ∈ {1, … , 𝑁}: 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) ≥ 1
• This is a different optimization problem than what we have
seen before.
• We need to minimize a quantity while satisfying a set of
inequalities.
• This type of problem is a constrained optimization problem.
21
Quadratic Programming
• Our constrained optimization problem can be solved
using a method called quadratic programming.
• Describing quadratic programming in depth is
outside the scope of this course.
• Our goal is simply to understand how to use
quadratic programming as a black box, to solve our
optimization problem.
– This way, you can use any quadratic programming toolkit
(Matlab includes one).
22
Quadratic Programming
• The quadratic programming problem is defined as follows:
• Inputs:
– 𝒔: an 𝑅-dimensional column vector.
– 𝑸: an 𝑅 × 𝑅-dimensional symmetric matrix.
– 𝑯: a 𝑄 × 𝑅-dimensional matrix.
– 𝒛: a 𝑄-dimensional column vector.
• Output:
– 𝒖opt: an 𝑅-dimensional column vector, such that:
𝒖opt = argmin𝒖 ( ½ 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 )
subject to the constraint: 𝑯𝒖 ≤ 𝒛
23
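To make the black-box usage concrete, here is a minimal sketch in Python of calling an off-the-shelf QP solver on a tiny made-up problem. It assumes the third-party cvxopt package, whose solvers.qp routine minimizes ½𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 subject to 𝑯𝒖 ≤ 𝒛 (cvxopt names these arguments P, q, G, h); the numbers are arbitrary.

```python
import numpy as np
from cvxopt import matrix, solvers  # assumes cvxopt is installed

# Toy problem: minimize (1/2) u^T Q u + s^T u  subject to  H u <= z.
Q = matrix(np.array([[2.0, 0.0], [0.0, 2.0]]))
s = matrix(np.array([-1.0, -1.0]))
H = matrix(np.array([[-1.0, 0.0], [0.0, -1.0]]))  # encodes u >= 0
z = matrix(np.array([0.0, 0.0]))

solution = solvers.qp(Q, s, H, z)   # cvxopt calls these P, q, G, h
u_opt = np.array(solution['x']).flatten()
print(u_opt)
```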
Quadratic Programming for SVMs
• Quadratic Programming: 𝒖opt = argmin𝒖 ( ½ 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 ), subject to the constraint: 𝑯𝒖 ≤ 𝒛
• In the next slides, we map the SVM optimization problem to this form.
24
Quadratic Programming for SVMs
• Quadratic Programming constraint: 𝑯𝒖 ≤ 𝒛
• SVM constraints: ∀𝑛 ∈ {1, … , 𝑁}: 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) ≥ 1 ⇒
∀𝑛 ∈ {1, … , 𝑁}: −𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) ≤ −1
25
Quadratic Programming for SVMs
• SVM constraints: ∀𝑛 ∈ {1, … , 𝑁}: −𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) ≤ −1
• Define:
𝑯 = the 𝑁 × (𝐷 + 1) matrix whose 𝑛-th row is (−𝑡𝑛, −𝑡𝑛𝑥𝑛1, … , −𝑡𝑛𝑥𝑛𝐷)
𝒖 = (𝑏, 𝑤1, 𝑤2, … , 𝑤𝐷)𝑇, a (𝐷 + 1)-dimensional column vector
𝒛 = (−1, −1, … , −1)𝑇, an 𝑁-dimensional column vector
– Matrix 𝑯 is 𝑁 × (𝐷 + 1), vector 𝒖 has (𝐷 + 1) rows, vector 𝒛 has 𝑁 rows.
• 𝑯𝒖 = (−𝑡1(𝑏 + 𝒘𝑇𝒙1), … , −𝑡𝑁(𝑏 + 𝒘𝑇𝒙𝑁))𝑇, since the 𝑛-th entry is −𝑡𝑛𝑏 − 𝑡𝑛𝑥𝑛1𝑤1 − ⋯ − 𝑡𝑛𝑥𝑛𝐷𝑤𝐷.
• The 𝑛-th row of 𝑯𝒖 is −𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏), which should be ≤ −1.
• SVM constraints ⇔ quadratic programming constraint 𝑯𝒖 ≤ 𝒛.
26
Quadratic Programming for SVMs
• Quadratic programming: 𝒖opt = argmin𝒖 ( ½ 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 )
• SVM: 𝒘opt = argmin𝒘 ( ½ ‖𝒘‖² )
• Define:
𝑸 = the (𝐷 + 1) × (𝐷 + 1) identity matrix, except that 𝑄11 = 0
𝒖 = (𝑏, 𝑤1, 𝑤2, … , 𝑤𝐷)𝑇, as already defined in the previous slides
𝒔 = (0, 0, … , 0)𝑇, a (𝐷 + 1)-dimensional column vector of zeros
• Then, ½ 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 = ½ ‖𝒘‖².
27
Quadratic Programming for SVMs
• Quadratic programming: 𝒖opt = argmin𝒖 ( ½ 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 )
• SVM: 𝒘opt = argmin𝒘 ( ½ ‖𝒘‖² )
• Alternative definitions that would NOT work:
• Define: 𝑸 = the 𝐷 × 𝐷 identity matrix, 𝒖 = 𝒘 = (𝑤1, … , 𝑤𝐷)𝑇, 𝒔 = the 𝐷-dimensional zero vector.
• It still holds that ½ 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 = ½ ‖𝒘‖².
• Why would these definitions not work?
28
Quadratic Programming for SVMs
• Quadratic programming: 𝒖opt = argmin𝒖 ( ½ 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 )
• SVM: 𝒘opt = argmin𝒘 ( ½ ‖𝒘‖² )
• Alternative definitions that would NOT work:
• Define: 𝑸 = the 𝐷 × 𝐷 identity matrix, 𝒖 = 𝒘 = (𝑤1, … , 𝑤𝐷)𝑇, 𝒔 = the 𝐷-dimensional zero vector.
• It still holds that ½ 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 = ½ ‖𝒘‖².
• Why would these definitions not work?
• Vector 𝒖 must also make 𝑯𝒖 ≤ 𝒛 match the SVM constraints.
– With this definition of 𝒖, no appropriate 𝑯 and 𝒛 can be found, because the constraints involve 𝑏, which is not part of 𝒖.
29
Quadratic Programming for SVMs
• Quadratic programming: 𝒖opt = argmin𝒖 ( ½ 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 ), subject to the constraint: 𝑯𝒖 ≤ 𝒛
• SVM goal: 𝒘opt = argmin𝒘 ( ½ ‖𝒘‖² ), subject to the constraints: ∀𝑛 ∈ {1, … , 𝑁}: 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) ≥ 1
• Task: define 𝑸, 𝒔, 𝑯, 𝒛, 𝒖 so that quadratic programming computes 𝒘opt and 𝑏opt.
𝑸 = the (𝐷 + 1) × (𝐷 + 1) identity matrix, except that 𝑄11 = 0
𝒔 = the (𝐷 + 1)-dimensional column vector of zeros
𝑯 = the 𝑁 × (𝐷 + 1) matrix whose 𝑛-th row is (−𝑡𝑛, −𝑡𝑛𝑥𝑛1, … , −𝑡𝑛𝑥𝑛𝐷)
𝒛 = the 𝑁-dimensional column vector with all entries equal to −1
𝒖 = (𝑏, 𝑤1, 𝑤2, … , 𝑤𝐷)𝑇, a (𝐷 + 1)-dimensional column vector
30
31
Quadratic Programming for SVMs
• Quadratic programming: 𝒖opt = argmin𝒖 ( ½ 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 ), subject to the constraint: 𝑯𝒖 ≤ 𝒛
• SVM goal: 𝒘opt = argmin𝒘 ( ½ ‖𝒘‖² ), subject to the constraints: ∀𝑛 ∈ {1, … , 𝑁}: 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) ≥ 1
• Task: define 𝑸, 𝒔, 𝑯, 𝒛, 𝒖 so that quadratic programming computes 𝒘opt and 𝑏opt.
• 𝑸, 𝒔, 𝑯, 𝒛, 𝒖 are defined as in the previous slide.
• Quadratic programming takes as inputs 𝑸, 𝒔, 𝑯, 𝒛, and outputs 𝒖opt, from which we get the 𝒘opt, 𝑏opt values for our SVM.
32
Quadratic Programming for SVMs
• Quadratic programming: 𝒖opt = argmin𝒖 ( ½ 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 ), subject to the constraint: 𝑯𝒖 ≤ 𝒛
• SVM goal: 𝒘opt = argmin𝒘 ( ½ ‖𝒘‖² ), subject to the constraints: ∀𝑛 ∈ {1, … , 𝑁}: 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) ≥ 1
• Task: define 𝑸, 𝒔, 𝑯, 𝒛, 𝒖 so that quadratic programming computes 𝒘opt and 𝑏opt.
• 𝑸, 𝒔, 𝑯, 𝒛, 𝒖 are defined as in the previous slides.
• 𝒘opt is the vector of values at dimensions 2, …, 𝐷 + 1 of 𝒖opt.
• 𝑏opt is the value at dimension 1 of 𝒖opt.
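As an illustration of this mapping, the sketch below builds 𝑸, 𝒔, 𝑯, 𝒛 for a tiny made-up linearly separable dataset and recovers 𝒘opt and 𝑏opt from 𝒖opt. It again assumes the third-party cvxopt package; the data values are arbitrary.

```python
import numpy as np
from cvxopt import matrix, solvers  # assumes cvxopt is installed

solvers.options["show_progress"] = False

# Toy linearly separable data: X is N x D, t holds the +1/-1 labels (made-up values).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
N, D = X.shape

# Q: (D+1) x (D+1) identity with Q[0, 0] = 0 (no penalty on b); s: zero vector.
Q = np.eye(D + 1)
Q[0, 0] = 0.0
s = np.zeros(D + 1)

# H: n-th row is (-t_n, -t_n * x_n); z: all entries -1.  Then H u <= z encodes
# t_n (w^T x_n + b) >= 1 for every n, with u = (b, w_1, ..., w_D).
H = -t[:, None] * np.hstack([np.ones((N, 1)), X])
z = -np.ones(N)

sol = solvers.qp(matrix(Q), matrix(s), matrix(H), matrix(z))
u = np.array(sol['x']).flatten()
b, w = u[0], u[1:]
print("w =", w, "b =", b)
```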
Solving the Same Problem, Again
• So far, we have solved the problem of defining an SVM (i.e.,
defining 𝒘opt and 𝑏opt), so as to maximize the margin between
linearly separable data.
• If this were all that SVMs can do, SVMs would not be that
important.
– Linearly separable data are a rare case, and a very easy case to deal with.
33
Solving the Same Problem, Again
• So far, we have solved the problem of defining an SVM (i.e.,
defining 𝒘opt and 𝑏opt), so as to maximize the margin between
linearly separable data.
• We will see two extensions that
make SVMs much more powerful.
– The extensions will allow SVMs to
define highly non-linear decision
boundaries, as in this figure.
• However, first we need to solve
the same problem again.
– Maximize the margin between
linearly separable data.
• We will get a more complicated solution, but that solution will be
easier to improve upon.
34
Lagrange Multipliers
• Our new solutions are derived using Lagrange
multipliers.
• Here is a quick review from multivariate calculus.
• Let 𝒙 be a 𝐷-dimensional vector.
• Let 𝑓(𝒙) and 𝑔(𝒙) be functions from ℝ𝐷 to ℝ.
– Functions 𝑓 and 𝑔 map 𝐷-dimensional vectors to real
numbers.
• Suppose that we want to minimize 𝑓(𝒙), subject to the
constraint that 𝑔 𝒙 ≥ 0.
• Then, we can solve this problem using a Lagrange
multiplier to define a Lagrangian function.
35
36
Lagrange Multipliers
• To minimize 𝑓(𝒙), subject to the constraint: 𝑔 𝒙 ≥ 0:
• We define the Lagrangian function: 𝐿 𝒙, 𝜆 = 𝑓 𝒙 − 𝜆𝑔(𝒙)
– 𝜆 is called a Lagrange multiplier, 𝜆 ≥ 0.
• We find 𝒙opt = argmin𝒙 𝐿(𝒙, 𝜆) , and a corresponding value for
𝜆, subject to the following constraints:
1. 𝑔 𝒙 ≥ 0
2. 𝜆 ≥ 0
3. 𝜆𝑔 𝒙 = 0
• If 𝑔 𝒙opt > 0, the third constraint implies that 𝜆 = 0.
• Then, the constraint 𝑔 𝒙 ≥ 0 is called inactive.
• If 𝑔 𝒙opt = 0, then 𝜆 > 0.
• Then, constraint 𝑔 𝒙 ≥ 0 is called active.
37
Multiple Constraints
• Suppose that we have 𝑁 constraints:
∀𝑛 ∈ 1, … , 𝑁 , 𝑔𝑛 𝒙 ≥ 0
• We want to minimize 𝑓(𝒙), subject to those 𝑁 constraints.
• Define vector 𝝀 = 𝜆1, … , 𝜆𝑁 .
• Define the Lagrangian function as:
𝐿(𝒙, 𝝀) = 𝑓(𝒙) − Σ𝑛=1..𝑁 𝜆𝑛𝑔𝑛(𝒙)
• We find 𝒙opt = argmin𝒙 𝐿(𝒙, 𝝀) , and a value for 𝝀, subject to:
• ∀𝑛, 𝑔𝑛 𝒙 ≥ 0
• ∀𝑛, 𝜆𝑛 ≥ 0
• ∀𝑛, 𝜆𝑛𝑔𝑛 𝒙 = 0
38
Lagrange Dual Problems
• We have 𝑁 constraints: ∀𝑛 ∈ 1, … , 𝑁 , 𝑔𝑛 𝒙 ≥ 0
• We want to minimize 𝑓(𝒙), subject to those 𝑁 constraints.
• Under some conditions (which are satisfied in our SVM problem),
we can solve an alternative dual problem:
• Define the Lagrangian function as before:
𝐿(𝒙, 𝝀) = 𝑓(𝒙) − Σ𝑛=1..𝑁 𝜆𝑛𝑔𝑛(𝒙)
• We find 𝒙opt, and the best value for 𝝀, denoted as 𝝀opt, by solving:
𝝀opt = argmax𝝀 [ min𝒙 𝐿(𝒙, 𝝀) ]
𝒙opt = argmin𝒙 𝐿(𝒙, 𝝀opt)
subject to constraints: 𝜆𝑛 ≥ 0
39
Lagrange Dual Problems
• Lagrangian dual problem: Solve:
𝝀opt = argmax𝝀 [ min𝒙 𝐿(𝒙, 𝝀) ]
𝒙opt = argmin𝒙 𝐿(𝒙, 𝝀opt)
subject to constraints: 𝜆𝑛 ≥ 0
• This dual problem formulation will be used in training
SVMs.
• The key thing to remember is:
– We minimize the Lagrangian 𝐿 with respect to 𝒙.
– We maximize 𝐿 with respect to the Lagrange multipliers 𝜆𝑛.
40
Lagrange Multipliers and SVMs
• SVM goal: 𝒘opt = argmin𝒘 ( ½ ‖𝒘‖² ), subject to constraints: ∀𝑛 ∈ {1, … , 𝑁}: 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) ≥ 1
• To make the constraints more amenable to Lagrange multipliers, we rewrite them as:
∀𝑛 ∈ {1, … , 𝑁}: 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) − 1 ≥ 0
• Define 𝒂 = (𝑎1, … , 𝑎𝑁) to be a vector of 𝑁 Lagrange multipliers.
• Define the Lagrangian function:
𝐿(𝒘, 𝑏, 𝒂) = ½ ‖𝒘‖² − Σ𝑛=1..𝑁 𝑎𝑛 [𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) − 1]
• Remember from the previous slides, we minimize 𝐿 with respect
to 𝒘, 𝑏, and maximize 𝐿 with respect to 𝒂.
41
Lagrange Multipliers and SVMs
• The 𝒘 and 𝑏 that minimize 𝐿(𝒘, 𝑏, 𝒂) must satisfy: ∂𝐿/∂𝒘 = 0, ∂𝐿/∂𝑏 = 0.
∂𝐿/∂𝒘 = 0 ⇒ ∂[ ½ ‖𝒘‖² − Σ𝑛=1..𝑁 𝑎𝑛 (𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) − 1) ] / ∂𝒘 = 0
⇒ 𝒘 − Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛𝒙𝑛 = 0 ⇒ 𝒘 = Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛𝒙𝑛
42
Lagrange Multipliers and SVMs
• The 𝒘 and 𝑏 that minimize 𝐿(𝒘, 𝑏, 𝒂) must satisfy: ∂𝐿/∂𝒘 = 0, ∂𝐿/∂𝑏 = 0.
∂𝐿/∂𝑏 = 0 ⇒ ∂[ ½ ‖𝒘‖² − Σ𝑛=1..𝑁 𝑎𝑛 (𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) − 1) ] / ∂𝑏 = 0
⇒ Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛 = 0
43
Lagrange Multipliers and SVMs
• Our Lagrangian function is:
𝐿(𝒘, 𝑏, 𝒂) = ½ ‖𝒘‖² − Σ𝑛=1..𝑁 𝑎𝑛 [𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) − 1]
• We showed that 𝒘 = Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛𝒙𝑛. Using that, we get:
½ ‖𝒘‖² = ½ 𝒘𝑇𝒘 = ½ (Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛𝒙𝑛)𝑇 (Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛𝒙𝑛)
44
Lagrange Multipliers and SVMs
• We showed that:
½ ‖𝒘‖² = ½ Σ𝑛=1..𝑁 Σ𝑚=1..𝑁 𝑎𝑛𝑎𝑚𝑡𝑛𝑡𝑚 𝒙𝑛𝑇𝒙𝑚
• Define an 𝑁 × 𝑁 matrix 𝑸 such that 𝑄𝑚𝑛 = 𝑡𝑛𝑡𝑚 𝒙𝑛𝑇𝒙𝑚.
• Remember that we have defined 𝒂 = (𝑎1, … , 𝑎𝑁).
• Then, it follows that:
½ ‖𝒘‖² = ½ 𝒂𝑇𝑸𝒂
45
Lagrange Multipliers and SVMs
𝐿(𝒘, 𝑏, 𝒂) = ½ 𝒂𝑇𝑸𝒂 − Σ𝑛=1..𝑁 𝑎𝑛 [𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) − 1]
• We showed that Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛 = 0. Using that, we get:
Σ𝑛=1..𝑁 𝑎𝑛 [𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) − 1]
= Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛𝒘𝑇𝒙𝑛 + 𝑏 Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛 − Σ𝑛=1..𝑁 𝑎𝑛
= Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛𝒘𝑇𝒙𝑛 − Σ𝑛=1..𝑁 𝑎𝑛
(We have shown before that the term 𝑏 Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛 equals 0.)
46
46
Lagrange Multipliers and SVMs
𝐿(𝒘, 𝒂) = ½ 𝒂𝑇𝑸𝒂 − Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛𝒘𝑇𝒙𝑛 + Σ𝑛=1..𝑁 𝑎𝑛
• Function 𝐿 now does not depend on 𝑏.
• We simplify more, using again the fact that 𝒘 = Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛𝒙𝑛:
Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛𝒘𝑇𝒙𝑛 = Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛 (Σ𝑚=1..𝑁 𝑎𝑚𝑡𝑚𝒙𝑚)𝑇 𝒙𝑛 = Σ𝑛=1..𝑁 Σ𝑚=1..𝑁 𝑎𝑛𝑎𝑚𝑡𝑛𝑡𝑚 𝒙𝑚𝑇𝒙𝑛 = 𝒂𝑇𝑸𝒂
47
47
Lagrange Multipliers and SVMs
𝐿(𝒘, 𝒂) = ½ 𝒂𝑇𝑸𝒂 − 𝒂𝑇𝑸𝒂 + Σ𝑛=1..𝑁 𝑎𝑛 = Σ𝑛=1..𝑁 𝑎𝑛 − ½ 𝒂𝑇𝑸𝒂
• Function 𝐿 now does not depend on 𝒘 anymore.
• We can rewrite 𝐿 as a function whose only input is 𝒂:
𝐿(𝒂) = Σ𝑛=1..𝑁 𝑎𝑛 − ½ 𝒂𝑇𝑸𝒂
• Remember, we want to maximize 𝐿(𝒂) with respect to 𝒂.
48
48
Lagrange Multipliers and SVMs
• By combining the results from the last few slides, our optimization problem becomes:
Maximize 𝐿(𝒂) = Σ𝑛=1..𝑁 𝑎𝑛 − ½ 𝒂𝑇𝑸𝒂
subject to these constraints: 𝑎𝑛 ≥ 0
49
Lagrange Multipliers and SVMs
𝐿(𝒂) = Σ𝑛=1..𝑁 𝑎𝑛 − ½ 𝒂𝑇𝑸𝒂
• We want to maximize 𝐿(𝒂) subject to some constraints.
• Therefore, we want to find an 𝒂opt such that:
𝒂opt = argmax𝒂 [ Σ𝑛=1..𝑁 𝑎𝑛 − ½ 𝒂𝑇𝑸𝒂 ] = argmin𝒂 [ ½ 𝒂𝑇𝑸𝒂 − Σ𝑛=1..𝑁 𝑎𝑛 ]
subject to those constraints.
50
SVM Optimization Problem
• Our SVM optimization problem now is to find an 𝒂opt such that:
𝒂opt = argmin𝒂 [ ½ 𝒂𝑇𝑸𝒂 − Σ𝑛=1..𝑁 𝑎𝑛 ]
subject to these constraints: 𝑎𝑛 ≥ 0
This problem can be solved again using quadratic programming.
Using Quadratic Programming
• Quadratic programming: 𝒖opt = argmin𝒖 ( ½ 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 ), subject to the constraint: 𝑯𝒖 ≤ 𝒛
• SVM problem: find 𝒂opt = argmin𝒂 [ ½ 𝒂𝑇𝑸𝒂 − Σ𝑛=1..𝑁 𝑎𝑛 ]
subject to constraints: 𝑎𝑛 ≥ 0, Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛 = 0
• Again, we must find values for 𝑸, 𝒔, 𝑯, 𝒛 that convert the SVM problem into a quadratic programming problem.
• Note that we already have a matrix 𝑸 in the Lagrangian. It is an 𝑁 × 𝑁 matrix such that 𝑄𝑚𝑛 = 𝑡𝑛𝑡𝑚 𝒙𝑛𝑇𝒙𝑚.
• We will use the same 𝑸 for quadratic programming.
51
Using Quadratic Programming
• Quadratic programming: 𝒖opt = argmin𝒖 ( ½ 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 ), subject to the constraint: 𝑯𝒖 ≤ 𝒛
• SVM problem: find 𝒂opt = argmin𝒂 [ ½ 𝒂𝑇𝑸𝒂 − Σ𝑛=1..𝑁 𝑎𝑛 ]
subject to constraints: 𝑎𝑛 ≥ 0, Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛 = 0
• Define: 𝒖 = 𝒂, and define the 𝑁-dimensional vector 𝒔 = (−1, −1, … , −1)𝑇.
• Then, ½ 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 = ½ 𝒂𝑇𝑸𝒂 − Σ𝑛=1..𝑁 𝑎𝑛.
• We have mapped the SVM minimization goal of finding 𝒂opt to the quadratic programming minimization goal.
52
Using Quadratic Programming
• Quadratic programming constraint: 𝑯𝒖 ≤ 𝒛
• SVM problem constraints: 𝑎𝑛 ≥ 0, Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛 = 0
• Define:
𝑯 = the (𝑁 + 2) × 𝑁 matrix whose first 𝑁 rows are the negation of the 𝑁 × 𝑁 identity matrix, whose row 𝑁 + 1 is (𝑡1, 𝑡2, … , 𝑡𝑁), and whose row 𝑁 + 2 is (−𝑡1, −𝑡2, … , −𝑡𝑁)
𝒛 = the (𝑁 + 2)-dimensional column vector of zeros
• The first 𝑁 rows of 𝑯 are the negation of the 𝑁 × 𝑁 identity matrix.
• Row 𝑁 + 1 of 𝑯 is the transpose of the vector 𝒕 of target outputs.
• Row 𝑁 + 2 of 𝑯 is the negation of the previous row.
53
Using Quadratic Programming
• Quadratic programming constraint: 𝑯𝒖 ≤ 𝒛
• SVM problem constraints: 𝑎𝑛 ≥ 0, Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛 = 0
• 𝑯 and 𝒛 are defined as in the previous slide.
• Since we defined 𝒖 = 𝒂:
– For 𝑛 ≤ 𝑁, the 𝑛-th row of 𝑯𝒖 equals −𝑎𝑛.
– For 𝑛 ≤ 𝑁, the 𝑛-th rows of 𝑯𝒖 and 𝒛 specify that −𝑎𝑛 ≤ 0 ⇒ 𝑎𝑛 ≥ 0.
– The (𝑁 + 1)-th rows of 𝑯𝒖 and 𝒛 specify that Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛 ≤ 0.
– The (𝑁 + 2)-th rows of 𝑯𝒖 and 𝒛 specify that Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛 ≥ 0.
54
Using Quadratic Programming
• Quadratic programming constraint: 𝑯𝒖 ≤ 𝒛
• SVM problem constraints: 𝑎𝑛 ≥ 0, Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛 = 0
• 𝑯 and 𝒛 are defined as in the previous slides.
• Since we defined 𝒖 = 𝒂:
– For 𝑛 ≤ 𝑁, the 𝑛-th rows of 𝑯𝒖 and 𝒛 specify that 𝑎𝑛 ≥ 0.
– The last two rows of 𝑯𝒖 and 𝒛 specify that Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛 = 0.
• We have mapped the SVM constraints to the quadratic programming constraint 𝑯𝒖 ≤ 𝒛.
55
Interpretation of the Solution
• Quadratic programming, given the inputs 𝑸, 𝒔, 𝑯, 𝒛 defined in
the previous slides, outputs the optimal value for vector 𝒂.
• We used this Lagrangian function:
𝐿(𝒘, 𝑏, 𝒂) = ½ ‖𝒘‖² − Σ𝑛=1..𝑁 𝑎𝑛 [𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) − 1]
• Remember, when we find 𝒙opt to minimize any Lagrangian
𝐿(𝒙, 𝝀) = 𝑓(𝒙) − Σ𝑛=1..𝑁 𝜆𝑛𝑔𝑛(𝒙)
one of the constraints we enforce is that ∀𝑛, 𝜆𝑛𝑔𝑛(𝒙) = 0.
• In the SVM case, this means: ∀𝑛, 𝑎𝑛 [𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) − 1] = 0
56
Interpretation of the Solution
• In the SVM case, it holds that:
∀𝑛, 𝑎𝑛 [𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) − 1] = 0
• What does this mean?
• Mathematically, ∀𝑛:
• Either 𝑎𝑛 = 0,
• Or 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) = 1.
• If 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) = 1, what does that imply for 𝒙𝑛?
57
Interpretation of the Solution
• In the SVM case, it holds that:
∀𝑛, 𝑎𝑛 [𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) − 1] = 0
• What does this mean?
• Mathematically, ∀𝑛:
• Either 𝑎𝑛 = 0,
• Or 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) = 1.
• If 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) = 1, what does that imply for 𝒙𝑛?
• Remember, many slides back, we imposed the constraint that ∀𝑛, 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) ≥ 1.
• Equality 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) = 1 holds only for the support vectors.
• Therefore, 𝑎𝑛 > 0 only for the support vectors.
58
Support Vectors
• This is an example we have seen before.
• The three support vectors have a black circle around them.
• When we find the optimal values 𝑎1, 𝑎2, … , 𝑎𝑁 for this problem,
only three of those values will be non-zero.
– If 𝑥𝑛 is not a support vector, then 𝑎𝑛 = 0.
59
Interpretation of the Solution
• We showed before that: 𝒘 = Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛𝒙𝑛
• This means that 𝒘 is a linear combination of the training data.
• However, since 𝑎𝑛 > 0 only for the support vectors, only the support vectors influence 𝒘.
60
Computing 𝑏
• 𝒘 = Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛𝒙𝑛
• Define set 𝑆 = {𝑛 | 𝒙𝑛 is a support vector}.
• Since 𝑎𝑛 > 0 only for the support vectors, we get:
𝒘 = Σ𝑛∈𝑆 𝑎𝑛𝑡𝑛𝒙𝑛
• If 𝒙𝑛 is a support vector, then 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) = 1.
• Substituting 𝒘 = Σ𝑚∈𝑆 𝑎𝑚𝑡𝑚𝒙𝑚, if 𝒙𝑛 is a support vector:
𝑡𝑛 (Σ𝑚∈𝑆 𝑎𝑚𝑡𝑚 𝒙𝑚𝑇𝒙𝑛 + 𝑏) = 1
61
Computing 𝑏
𝑡𝑛 (Σ𝑚∈𝑆 𝑎𝑚𝑡𝑚 𝒙𝑚𝑇𝒙𝑛 + 𝑏) = 1
• Remember that 𝑡𝑛 can only take values 1 and −1.
• Therefore, 𝑡𝑛² = 1.
• Multiplying both sides of the equation with 𝑡𝑛 we get:
𝑡𝑛𝑡𝑛 (Σ𝑚∈𝑆 𝑎𝑚𝑡𝑚 𝒙𝑚𝑇𝒙𝑛 + 𝑏) = 𝑡𝑛
⇒ Σ𝑚∈𝑆 𝑎𝑚𝑡𝑚 𝒙𝑚𝑇𝒙𝑛 + 𝑏 = 𝑡𝑛 ⇒ 𝑏 = 𝑡𝑛 − Σ𝑚∈𝑆 𝑎𝑚𝑡𝑚 𝒙𝑚𝑇𝒙𝑛
62
Computing 𝑏
• Thus, if 𝒙𝑛 is a support vector, we can compute 𝑏 with the formula:
𝑏 = 𝑡𝑛 − Σ𝑚∈𝑆 𝑎𝑚𝑡𝑚 𝒙𝑚𝑇𝒙𝑛
• To avoid numerical problems, instead of using a single support vector to calculate 𝑏, we can use all support vectors (and take the average of the computed values for 𝑏).
• If 𝑁𝑆 is the number of support vectors, then:
𝑏 = (1/𝑁𝑆) Σ𝑛∈𝑆 [ 𝑡𝑛 − Σ𝑚∈𝑆 𝑎𝑚𝑡𝑚 𝒙𝑚𝑇𝒙𝑛 ]
63
Classification Using 𝒂
• To classify a test object 𝒙, we can use the original formula 𝑦(𝒙) = 𝒘𝑇𝒙 + 𝑏.
• However, since 𝒘 = Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛𝒙𝑛, we can substitute that formula for 𝒘, and classify 𝒙 using:
𝑦(𝒙) = Σ𝑛∈𝑆 𝑎𝑛𝑡𝑛 𝒙𝑛𝑇𝒙 + 𝑏
• This formula will be our only choice when we use SVMs that produce nonlinear boundaries.
– Details on that will be coming later in this presentation.
64
Recap of Lagrangian-Based Solution
• We defined the Lagrangian function, which became (after simplifications):
𝐿(𝒂) = Σ𝑛=1..𝑁 𝑎𝑛 − ½ 𝒂𝑇𝑸𝒂
• Using quadratic programming, we find the value of 𝒂 that maximizes 𝐿(𝒂), subject to constraints: 𝑎𝑛 ≥ 0, Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛 = 0
• Then:
𝒘 = Σ𝑛∈𝑆 𝑎𝑛𝑡𝑛𝒙𝑛
𝑏 = (1/𝑁𝑆) Σ𝑛∈𝑆 [ 𝑡𝑛 − Σ𝑚∈𝑆 𝑎𝑚𝑡𝑚 𝒙𝑚𝑇𝒙𝑛 ]
65
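Here is a sketch of this recipe on a tiny made-up dataset, again assuming the third-party cvxopt package. For numerical convenience it hands the equality constraint Σ𝑛 𝑎𝑛𝑡𝑛 = 0 to the solver directly (cvxopt accepts equality constraints as separate arguments), rather than encoding it as the two extra inequality rows described earlier; the 1e-6 threshold for deciding which 𝑎𝑛 are nonzero is an arbitrary choice.

```python
import numpy as np
from cvxopt import matrix, solvers  # assumes cvxopt is installed

solvers.options["show_progress"] = False

def train_linear_svm_dual(X, t):
    """Hard-margin linear SVM trained in the dual, as a sketch of the recipe above."""
    N = X.shape[0]
    Q = np.outer(t, t) * (X @ X.T)        # Q[m, n] = t_m t_n (x_m . x_n)
    s = -np.ones(N)                       # minimize (1/2) a^T Q a - sum_n a_n
    H = -np.eye(N)                        # -a_n <= 0, i.e. a_n >= 0
    z = np.zeros(N)
    A = matrix(t.reshape(1, -1))          # equality constraint sum_n a_n t_n = 0,
    b_eq = matrix(np.zeros(1))            # passed to the solver directly
    sol = solvers.qp(matrix(Q), matrix(s), matrix(H), matrix(z), A, b_eq)
    a = np.array(sol['x']).flatten()
    S = a > 1e-6                          # support vectors have a_n > 0
    w = ((a[S] * t[S])[:, None] * X[S]).sum(axis=0)   # w = sum_{n in S} a_n t_n x_n
    b = np.mean(t[S] - X[S] @ w)          # average of b over the support vectors
    return w, b, a

# Made-up separable data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
w, b, a = train_linear_svm_dual(X, t)
print(np.sign(X @ w + b))   # classifications of the training points
```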
Why Did We Do All This?
• The Lagrangian-based solution solves a problem that
we had already solved, in a more complicated way.
• Typically, we prefer simpler solutions.
• However, this more complicated solution can be
tweaked relatively easily, to produce more powerful
SVMs that:
– Can be trained even if the training data is not linearly
separable.
– Can produce nonlinear decision boundaries.
66
The Not Linearly Separable Case
• Our previous formulation required that the training examples
are linearly separable.
• This requirement was encoded in the constraint:
∀𝑛 ∈ 1, … , 𝑁 , 𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 ≥ 1
• According to that constraint, every 𝒙𝑛 has to be on the correct
side of the boundary.
• To handle data that is not linearly separable, we introduce 𝑁
variables 𝜉𝑛 ≥ 0, to define a modified constraint:
∀𝑛 ∈ 1, … , 𝑁 , 𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 ≥ 1 − 𝜉𝑛
• These variables 𝜉𝑛 are called slack variables.
• The values for 𝜉𝑛 will be computed during optimization.
67
The Meaning of Slack Variables
68
• To handle data that is not linearly separable, we introduce 𝑁
slack variables 𝜉𝑛 ≥ 0, to define a modified constraint:
∀𝑛 ∈ {1, … , 𝑁}: 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) ≥ 1 − 𝜉𝑛
• If 𝜉𝑛 = 0, then 𝒙𝑛 is where it
should be:
– either on the blue line for
objects of its class, or on the
correct side of that blue line.
• If 0 < 𝜉𝑛 < 1, then 𝒙𝑛 is too
close to the decision boundary.
– Between the red line and the
blue line for objects of its class.
• If 𝜉𝑛 = 1, 𝒙𝑛 is on the red line.
• If 𝜉𝑛 > 1, 𝒙𝑛 is misclassified (on the wrong side of the red line).
The Meaning of
Slack Variables
69
• If training data is not linearly
separable, we introduce 𝑁
slack variables 𝜉𝑛 ≥ 0, to
define a modified constraint:
∀𝑛 ∈ 1, … , 𝑁 , 𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 ≥ 1 − 𝜉𝑛
• If 𝜉𝑛 = 0, then 𝒙𝑛 is where it should be:
– either on the blue line for objects of its class, or on the correct side of
that blue line.
• If 0 < 𝜉𝑛 < 1, then 𝒙𝑛 is too close to the decision boundary.
– Between the red line and the blue line for objects of its class.
• If 𝜉𝑛 = 1, 𝒙𝑛 is on the decision boundary (the red line).
• If 𝜉𝑛 > 1, 𝒙𝑛 is misclassified (on the wrong side of the red line).
Optimization Criterion
• Before, we minimized ½ ‖𝒘‖² subject to constraints:
∀𝑛 ∈ {1, … , 𝑁}: 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) ≥ 1
• Now, we want to minimize the error function: 𝐶 Σ𝑛=1..𝑁 𝜉𝑛 + ½ ‖𝒘‖²
subject to constraints:
∀𝑛 ∈ {1, … , 𝑁}: 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) ≥ 1 − 𝜉𝑛
𝜉𝑛 ≥ 0
• 𝐶 is a parameter that we pick manually, 𝐶 ≥ 0.
• 𝐶 controls the trade-off between maximizing the margin, and penalizing training examples that violate the margin.
– The higher 𝜉𝑛 is, the farther away 𝒙𝑛 is from where it should be.
– If 𝒙𝑛 is on the correct side of the decision boundary, and the distance of 𝒙𝑛 to the boundary is greater than or equal to the margin, then 𝜉𝑛 = 0, and 𝒙𝑛 does not contribute to the error function.
70
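If the scikit-learn library is available, its SVC class implements this soft-margin formulation, and its C argument plays the role of 𝐶 above. The sketch below, on made-up overlapping data, simply shows that the number of support vectors changes as 𝐶 changes; the specific numbers are illustrative, not results from the slides.

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is installed

# Made-up, slightly overlapping 2-D data.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
t = np.array([1] * 20 + [-1] * 20)

# Larger C penalizes slack more heavily (fewer margin violations);
# smaller C tolerates more violations in exchange for a larger margin.
for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, t)
    print(C, "support vectors:", clf.support_vectors_.shape[0])
```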
Lagrangian Function
• To do our constrained optimization, we define the Lagrangian:
𝐿(𝒘, 𝑏, 𝝃, 𝒂, 𝝁) = ½ ‖𝒘‖² + 𝐶 Σ𝑛=1..𝑁 𝜉𝑛 − Σ𝑛=1..𝑁 𝑎𝑛 [𝑡𝑛𝑦(𝒙𝑛) − 1 + 𝜉𝑛] − Σ𝑛=1..𝑁 𝜇𝑛𝜉𝑛
• The Lagrange multipliers are now 𝑎𝑛 and 𝜇𝑛.
• The term 𝑎𝑛 [𝑡𝑛𝑦(𝒙𝑛) − 1 + 𝜉𝑛] in the Lagrangian corresponds to the constraint 𝑡𝑛𝑦(𝒙𝑛) − 1 + 𝜉𝑛 ≥ 0.
• The term 𝜇𝑛𝜉𝑛 in the Lagrangian corresponds to the constraint 𝜉𝑛 ≥ 0.
71
Lagrangian Function
• Lagrangian 𝐿(𝒘, 𝑏, 𝝃, 𝒂, 𝝁):
½ ‖𝒘‖² + 𝐶 Σ𝑛=1..𝑁 𝜉𝑛 − Σ𝑛=1..𝑁 𝑎𝑛 [𝑡𝑛𝑦(𝒙𝑛) − 1 + 𝜉𝑛] − Σ𝑛=1..𝑁 𝜇𝑛𝜉𝑛
• Constraints:
𝑎𝑛 ≥ 0
𝜇𝑛 ≥ 0
𝜉𝑛 ≥ 0
𝑡𝑛𝑦(𝒙𝑛) − 1 + 𝜉𝑛 ≥ 0
𝑎𝑛 [𝑡𝑛𝑦(𝒙𝑛) − 1 + 𝜉𝑛] = 0
𝜇𝑛𝜉𝑛 = 0
72
Lagrangian Function
• Lagrangian 𝐿(𝒘, 𝑏, 𝝃, 𝒂, 𝝁):
½ ‖𝒘‖² + 𝐶 Σ𝑛=1..𝑁 𝜉𝑛 − Σ𝑛=1..𝑁 𝑎𝑛 [𝑡𝑛𝑦(𝒙𝑛) − 1 + 𝜉𝑛] − Σ𝑛=1..𝑁 𝜇𝑛𝜉𝑛
• As before, we can simplify the Lagrangian significantly:
∂𝐿/∂𝒘 = 0 ⇒ 𝒘 = Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛𝒙𝑛
∂𝐿/∂𝑏 = 0 ⇒ Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛 = 0
∂𝐿/∂𝜉𝑛 = 0 ⇒ 𝐶 − 𝑎𝑛 − 𝜇𝑛 = 0 ⇒ 𝑎𝑛 = 𝐶 − 𝜇𝑛
73
Lagrangian Function
• Lagrangian 𝐿(𝒘, 𝑏, 𝝃, 𝒂, 𝝁):
½ ‖𝒘‖² + 𝐶 Σ𝑛=1..𝑁 𝜉𝑛 − Σ𝑛=1..𝑁 𝑎𝑛 [𝑡𝑛𝑦(𝒙𝑛) − 1 + 𝜉𝑛] − Σ𝑛=1..𝑁 𝜇𝑛𝜉𝑛
• In computing 𝐿(𝒘, 𝑏, 𝝃, 𝒂, 𝝁), 𝜉𝑛 contributes the following value:
𝐶𝜉𝑛 − 𝑎𝑛𝜉𝑛 − 𝜇𝑛𝜉𝑛 = 𝜉𝑛(𝐶 − 𝑎𝑛 − 𝜇𝑛)
• As we saw in the previous slide, 𝐶 − 𝑎𝑛 − 𝜇𝑛 = 0.
• Therefore, we can eliminate all those occurrences of 𝜉𝑛.
74
Lagrangian Function
• Lagrangian 𝐿(𝒘, 𝑏, 𝝃, 𝒂, 𝝁):
½ ‖𝒘‖² + 𝐶 Σ𝑛=1..𝑁 𝜉𝑛 − Σ𝑛=1..𝑁 𝑎𝑛 [𝑡𝑛𝑦(𝒙𝑛) − 1 + 𝜉𝑛] − Σ𝑛=1..𝑁 𝜇𝑛𝜉𝑛
• In computing 𝐿(𝒘, 𝑏, 𝝃, 𝒂, 𝝁), 𝜉𝑛 contributes the following value:
𝐶𝜉𝑛 − 𝑎𝑛𝜉𝑛 − 𝜇𝑛𝜉𝑛 = 𝜉𝑛(𝐶 − 𝑎𝑛 − 𝜇𝑛)
• As we saw in the previous slide, 𝐶 − 𝑎𝑛 − 𝜇𝑛 = 0.
• Therefore, we can eliminate all those occurrences of 𝜉𝑛.
• By eliminating 𝜉𝑛 we also eliminate all appearances of 𝐶 and 𝜇𝑛.
75
Lagrangian Function
• By eliminating 𝜉𝑛 and 𝜇𝑛, we get:
𝐿(𝒘, 𝑏, 𝒂) = ½ ‖𝒘‖² − Σ𝑛=1..𝑁 𝑎𝑛 [𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) − 1]
• This is exactly the Lagrangian we had for the linearly separable case.
• Following the same steps as in the linearly separable case, we can simplify even more to:
𝐿(𝒂) = Σ𝑛=1..𝑁 𝑎𝑛 − ½ 𝒂𝑇𝑸𝒂
76
Lagrangian Function
• So, we have ended up with the same Lagrangian as in the linearly separable case:
𝐿(𝒂) = Σ𝑛=1..𝑁 𝑎𝑛 − ½ 𝒂𝑇𝑸𝒂
• There is a small difference, however, in the constraints.
77
Linearly separable case: 0 ≤ 𝑎𝑛
Linearly inseparable case: 0 ≤ 𝑎𝑛 ≤ 𝐶
Lagrangian Function
• So, we have ended up with the same Lagrangian as in the linearly separable case:
𝐿(𝒂) = Σ𝑛=1..𝑁 𝑎𝑛 − ½ 𝒂𝑇𝑸𝒂
• Where does the constraint 0 ≤ 𝑎𝑛 ≤ 𝐶 come from?
• 0 ≤ 𝑎𝑛 because 𝑎𝑛 is a Lagrange multiplier.
• 𝑎𝑛 ≤ 𝐶 comes from the fact, shown earlier, that 𝑎𝑛 = 𝐶 − 𝜇𝑛.
– Since 𝜇𝑛 is also a Lagrange multiplier, 𝜇𝑛 ≥ 0 and thus 𝑎𝑛 ≤ 𝐶.
78
Using Quadratic Programming
• Quadratic programming: 𝒖opt = argmin𝒖 ( ½ 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 ), subject to the constraint: 𝑯𝒖 ≤ 𝒛
• SVM problem: find 𝒂opt = argmin𝒂 [ ½ 𝒂𝑇𝑸𝒂 − Σ𝑛=1..𝑁 𝑎𝑛 ]
subject to constraints: 0 ≤ 𝑎𝑛 ≤ 𝐶, Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛 = 0
• Again, we must find values for 𝑸, 𝒔, 𝑯, 𝒛 that convert the SVM problem into a quadratic programming problem.
• Values for 𝑸 and 𝒔 are the same as in the linearly separable case, since in both cases we minimize ½ 𝒂𝑇𝑸𝒂 − Σ𝑛=1..𝑁 𝑎𝑛.
79
Using Quadratic Programming
• Quadratic programming: 𝒖opt = argmin𝒖 ( ½ 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 ), subject to the constraint: 𝑯𝒖 ≤ 𝒛
• SVM problem: find 𝒂opt = argmin𝒂 [ ½ 𝒂𝑇𝑸𝒂 − Σ𝑛=1..𝑁 𝑎𝑛 ]
subject to constraints: 0 ≤ 𝑎𝑛 ≤ 𝐶, Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛 = 0
• 𝑸 is an 𝑁 × 𝑁 matrix such that 𝑄𝑚𝑛 = 𝑡𝑛𝑡𝑚 𝒙𝑛𝑇𝒙𝑚.
• 𝒖 = 𝒂, and 𝒔 = (−1, −1, … , −1)𝑇, an 𝑁-dimensional column vector.
80
Using Quadratic Programming
• Quadratic programming constraint: 𝑯𝒖 ≤ 𝒛
• SVM problem constraints: 0 ≤ 𝑎𝑛 ≤ 𝐶, Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛 = 0
• Define:
𝑯 = the (2𝑁 + 2) × 𝑁 matrix whose rows 1 to 𝑁 are the negation of the 𝑁 × 𝑁 identity matrix, whose rows 𝑁 + 1 to 2𝑁 are the 𝑁 × 𝑁 identity matrix, whose row 2𝑁 + 1 is (𝑡1, 𝑡2, … , 𝑡𝑁), and whose row 2𝑁 + 2 is (−𝑡1, −𝑡2, … , −𝑡𝑁)
𝒛 = the (2𝑁 + 2)-dimensional column vector whose rows 1 to 𝑁 are set to 0, whose rows 𝑁 + 1 to 2𝑁 are set to 𝐶, and whose rows 2𝑁 + 1 and 2𝑁 + 2 are set to 0
• The top 𝑁 rows of 𝑯 are the negation of the 𝑁 × 𝑁 identity matrix.
• Rows 𝑁 + 1 to 2𝑁 of 𝑯 are the 𝑁 × 𝑁 identity matrix.
• Row 2𝑁 + 1 of 𝑯 is the transpose of the vector 𝒕 of target outputs.
• Row 2𝑁 + 2 of 𝑯 is the negation of the previous row.
81
Using Quadratic Programming
• Quadratic programming constraint: 𝑯𝒖 ≤ 𝒛
• SVM problem constraints: 0 ≤ 𝑎𝑛 ≤ 𝐶, Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛 = 0
• 𝑯 and 𝒛 are defined as in the previous slide.
• If 1 ≤ 𝑛 ≤ 𝑁, the 𝑛-th row of 𝑯𝒖 is −𝑎𝑛, and the 𝑛-th row of 𝒛 is 0.
• Thus, the 𝑛-th rows of 𝑯𝒖 and 𝒛 capture the constraint −𝑎𝑛 ≤ 0 ⇒ 𝑎𝑛 ≥ 0.
82
Using Quadratic Programming
• Quadratic programming constraint: 𝑯𝒖 ≤ 𝒛
• SVM problem constraints: 0 ≤ 𝑎𝑛 ≤ 𝐶, Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛 = 0
• 𝑯 and 𝒛 are defined as in the previous slides.
• If 𝑁 + 1 ≤ 𝑛 ≤ 2𝑁, the 𝑛-th row of 𝑯𝒖 is 𝑎𝑛−𝑁, and the 𝑛-th row of 𝒛 is 𝐶.
• Thus, rows 𝑁 + 1 to 2𝑁 of 𝑯𝒖 and 𝒛 capture the constraints 𝑎𝑛 ≤ 𝐶.
83
Using Quadratic Programming
• Quadratic programming constraint: 𝑯𝒖 ≤ 𝒛
• SVM problem constraints: 0 ≤ 𝑎𝑛 ≤ 𝐶, Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛 = 0
• 𝑯 and 𝒛 are defined as in the previous slides.
• If 𝑛 = 2𝑁 + 1, the 𝑛-th row of 𝑯𝒖 is Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛, and the 𝑛-th row of 𝒛 is 0.
• Thus, the 𝑛-th rows of 𝑯𝒖 and 𝒛 capture the constraint Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛 ≤ 0.
84
Using Quadratic Programming
• Quadratic programming constraint: 𝑯𝒖 ≤ 𝒛
• SVM problem constraints: 0 ≤ 𝑎𝑛 ≤ 𝐶, Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛 = 0
• 𝑯 and 𝒛 are defined as in the previous slides.
• If 𝑛 = 2𝑁 + 2, the 𝑛-th row of 𝑯𝒖 is −Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛, and the 𝑛-th row of 𝒛 is 0.
• Thus, the 𝑛-th rows of 𝑯𝒖 and 𝒛 capture the constraint −Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛 ≤ 0 ⇒ Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛 ≥ 0.
85
Using Quadratic Programming
• Quadratic programming constraint: 𝑯𝒖 ≤ 𝒛
• SVM problem constraints: 0 ≤ 𝑎𝑛 ≤ 𝐶, Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛 = 0
• 𝑯 and 𝒛 are defined as in the previous slides.
• Thus, the last two rows of 𝑯𝒖 and 𝒛 in combination capture the constraint Σ𝑛=1..𝑁 𝑎𝑛𝑡𝑛 = 0.
86
Using the Solution
• Quadratic programming, given the inputs 𝑸, 𝒔, 𝑯, 𝒛 defined in the previous slides, outputs the optimal value for vector 𝒂.
• The value of 𝑏 is computed as in the linearly separable case:
𝑏 = (1/𝑁𝑆) Σ𝑛∈𝑆 [ 𝑡𝑛 − Σ𝑚∈𝑆 𝑎𝑚𝑡𝑚 𝒙𝑚𝑇𝒙𝑛 ]
• Classification of an input 𝒙 is done as in the linearly separable case:
𝑦(𝒙) = Σ𝑛∈𝑆 𝑎𝑛𝑡𝑛 𝒙𝑛𝑇𝒙 + 𝑏
87
The Disappearing 𝒘
• We have used vector 𝒘 to define the decision boundary 𝑦(𝒙) = 𝒘𝑇𝒙 + 𝑏.
• However, 𝒘 does not need to be either computed or used.
• Quadratic programming computes the values 𝑎𝑛.
• Using those values 𝑎𝑛, we compute 𝑏:
𝑏 = (1/𝑁𝑆) Σ𝑛∈𝑆 [ 𝑡𝑛 − Σ𝑚∈𝑆 𝑎𝑚𝑡𝑚 𝒙𝑚𝑇𝒙𝑛 ]
• Using those values for 𝑎𝑛 and 𝑏, we can classify test objects 𝒙:
𝑦(𝒙) = Σ𝑛∈𝑆 𝑎𝑛𝑡𝑛 𝒙𝑛𝑇𝒙 + 𝑏
• If we know the values for 𝑎𝑛 and 𝑏, we do not need 𝒘.
88
The Role of Training Inputs
• Where, during training, do we use input vectors?
• Where, during classification, do we use input vectors?
• Overall, input vectors are only used in two formulas:
1. During training, we use vectors 𝒙𝑛 to define matrix 𝑸, where: 𝑄𝑚𝑛 = 𝑡𝑛𝑡𝑚 𝒙𝑛𝑇𝒙𝑚
2. During classification, we use training vectors 𝒙𝑛 and test input 𝒙 in this formula: 𝑦(𝒙) = Σ𝑛∈𝑆 𝑎𝑛𝑡𝑛 𝒙𝑛𝑇𝒙 + 𝑏
• In both formulas, input vectors are only used by taking their dot products.
89
The Role of Training Inputs
• To make this more clear, define a kernel function 𝑘(𝒙, 𝒙′) as: 𝑘(𝒙, 𝒙′) = 𝒙𝑇𝒙′
• During training, we use vectors 𝒙𝑛 to define matrix 𝑸, where: 𝑄𝑚𝑛 = 𝑡𝑛𝑡𝑚 𝑘(𝒙𝑛, 𝒙𝑚)
• During classification, we use training vectors 𝒙𝑛 and test input 𝒙 in this formula: 𝑦(𝒙) = Σ𝑛∈𝑆 𝑎𝑛𝑡𝑛 𝑘(𝒙𝑛, 𝒙) + 𝑏
• In both formulas, input vectors are used only through function 𝑘.
90
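A small sketch of these two uses of the kernel follows. The function names are illustrative, and the support vectors, their labels 𝑡𝑛, their coefficients 𝑎𝑛, and 𝑏 are assumed to come from a solver such as the one sketched earlier.

```python
import numpy as np

def linear_kernel(x, z):
    # k(x, x') = x^T x', the kernel used so far.
    return np.dot(x, z)

def gram_matrix(X, t, kernel=linear_kernel):
    # Q[m, n] = t_m t_n k(x_m, x_n), the matrix used during training.
    N = X.shape[0]
    return np.array([[t[m] * t[n] * kernel(X[m], X[n]) for n in range(N)]
                     for m in range(N)])

def decision_value(x, sv_X, sv_t, sv_a, b, kernel=linear_kernel):
    # y(x) = sum_{n in S} a_n t_n k(x_n, x) + b
    return sum(a * tt * kernel(xn, x)
               for xn, tt, a in zip(sv_X, sv_t, sv_a)) + b
```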
The Kernel Trick
• We have defined the kernel function 𝑘(𝒙, 𝒙′) = 𝒙𝑇𝒙′.
• During training, we use vectors 𝒙𝑛 to define matrix 𝑸, where: 𝑄𝑚𝑛 = 𝑡𝑛𝑡𝑚 𝑘(𝒙𝑛, 𝒙𝑚)
• During classification, we use this formula: 𝑦(𝒙) = Σ𝑛∈𝑆 𝑎𝑛𝑡𝑛 𝑘(𝒙𝑛, 𝒙) + 𝑏
• What if we defined 𝑘 𝒙, 𝒙′ differently?
• The SVM formulation (both for training and for classification)
would remain exactly the same.
• In the SVM formulation, the kernel 𝑘 𝒙, 𝒙′ is a black box.
– You can define 𝑘 𝒙, 𝒙′ any way you like.
91
A Different Kernel
• Let 𝒙 = (𝑥1, 𝑥2) and 𝒛 = (𝑧1, 𝑧2) be 2-dimensional vectors.
• Consider this alternative definition for the kernel:
𝑘(𝒙, 𝒛) = (1 + 𝒙𝑇𝒛)²
• Then:
𝑘(𝒙, 𝒛) = (1 + 𝒙𝑇𝒛)² = (1 + 𝑥1𝑧1 + 𝑥2𝑧2)²
= 1 + 2𝑥1𝑧1 + 2𝑥2𝑧2 + (𝑥1)²(𝑧1)² + 2𝑥1𝑧1𝑥2𝑧2 + (𝑥2)²(𝑧2)²
• Define a basis function 𝜑(𝒙) as:
𝜑(𝒙) = (1, √2 𝑥1, √2 𝑥2, (𝑥1)², √2 𝑥1𝑥2, (𝑥2)²)𝑇
• Then, 𝑘(𝒙, 𝒛) = 𝜑(𝒙)𝑇𝜑(𝒛): the kernel is a dot product in a higher-dimensional feature space.
92
Kernels and Basis Functions
• In general, kernels make it easy to incorporate basis
functions into SVMs:
– Define 𝜑 𝒙 any way you like.
– Define 𝑘 𝒙, 𝒛 = 𝜑 𝒙 𝑻𝜑 𝒛 .
• The kernel function represents a dot product, but in
a (typically) higher-dimensional feature space
compared to the original space of 𝒙 and 𝒛.
93
Polynomial Kernels
• Let 𝒙 and 𝒛 be 𝐷-dimensional vectors.
• A polynomial kernel of degree 𝑑 is defined as:
𝑘(𝒙, 𝒛) = (𝑐 + 𝒙𝑇𝒛)^𝑑
• The kernel 𝑘(𝒙, 𝒛) = (1 + 𝒙𝑇𝒛)² that we saw a couple of slides back was a quadratic kernel.
• Parameter 𝑐 controls the trade-off between the influence of higher-order and lower-order terms.
– Increasing values of 𝑐 give increasing influence to the lower-order terms.
94
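A sketch of this kernel as a plain Python function; the argument names c and d mirror the slide's parameters, and the example values are arbitrary.

```python
import numpy as np

def polynomial_kernel(x, z, c=1.0, d=2):
    # k(x, z) = (c + x^T z)^d ; c and d are chosen manually.
    return (c + np.dot(x, z)) ** d

# With c = 1, d = 2 this is the quadratic kernel from the earlier slide.
print(polynomial_kernel(np.array([1.0, 2.0]), np.array([3.0, 0.5])))  # (1 + 4)^2 = 25
```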
Decision boundary with polynomial kernel of degree 1.
95
This is identical to
the result using
the standard dot
product as kernel.
Polynomial Kernels – An Easy Case
Decision boundary with polynomial kernel of degree 2.
96
The decision
boundary is
not linear
anymore.
Polynomial Kernels – An Easy Case
Decision boundary with polynomial kernel of degree 3.
97
Polynomial Kernels – An Easy Case
Decision boundary with polynomial kernel of degree 1.
98
Polynomial Kernels – A Harder Case
Decision boundary with polynomial kernel of degree 2.
99
Polynomial Kernels – A Harder Case
Decision boundary with polynomial kernel of degree 3.
100
Polynomial Kernels – A Harder Case
Decision boundary with polynomial kernel of degree 4.
101
Polynomial Kernels – A Harder Case
Decision boundary with polynomial kernel of degree 5.
102
Polynomial Kernels – A Harder Case
Decision boundary with polynomial kernel of degree 6.
103
Polynomial Kernels – A Harder Case
Decision boundary with polynomial kernel of degree 7.
104
Polynomial Kernels – A Harder Case
Decision boundary with polynomial kernel of degree 8.
105
Polynomial Kernels – A Harder Case
Decision boundary with polynomial kernel of degree 9.
106
Polynomial Kernels – A Harder Case
Decision boundary with polynomial kernel of degree 10.
107
Polynomial Kernels – A Harder Case
Decision boundary with polynomial kernel of degree 20.
108
Polynomial Kernels – A Harder Case
Decision boundary with polynomial kernel of degree 100.
109
Polynomial Kernels – A Harder Case
RBF/Gaussian Kernels
• The Radial Basis Function (RBF) kernel, also known as the Gaussian kernel, is defined as:
𝑘𝜎(𝒙, 𝒛) = exp( −‖𝒙 − 𝒛‖² / (2𝜎²) )
• Given 𝜎, the value of 𝑘𝜎(𝒙, 𝒛) only depends on the distance between 𝒙 and 𝒛.
– 𝑘𝜎(𝒙, 𝒛) decreases exponentially with the squared distance between 𝒙 and 𝒛.
• Parameter 𝜎 is chosen manually.
– Parameter 𝜎 specifies how fast 𝑘𝜎(𝒙, 𝒛) decreases as 𝒙 moves away from 𝒛.
110
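A sketch of the RBF kernel as a plain Python function; sigma is the manually chosen parameter, and the example inputs are arbitrary.

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    # k_sigma(x, z) = exp( -||x - z||^2 / (2 sigma^2) )
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

# The value depends only on the distance between x and z:
# identical inputs give 1, and the value decays toward 0 as they move apart.
print(rbf_kernel(np.array([0.0, 0.0]), np.array([0.0, 0.0])))   # 1.0
print(rbf_kernel(np.array([0.0, 0.0]), np.array([3.0, 4.0])))   # exp(-12.5)
```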
RBF Output Vs. Distance
• X axis: distance between 𝒙 and 𝒛.
• Y axis: 𝑘𝜎 𝒙, 𝒛 , with 𝜎 = 3.
111
RBF Output Vs. Distance
• X axis: distance between 𝒙 and 𝒛.
• Y axis: 𝑘𝜎 𝒙, 𝒛 , with 𝜎 = 2.
112
RBF Output Vs. Distance
• X axis: distance between 𝒙 and 𝒛.
• Y axis: 𝑘𝜎 𝒙, 𝒛 , with 𝜎 =1.
113
RBF Output Vs. Distance
• X axis: distance between 𝒙 and 𝒛.
• Y axis: 𝑘𝜎 𝒙, 𝒛 , with 𝜎 = 0.5.
114
RBF Kernels – An Easier Dataset
Decision boundary
with a linear kernel.
115
RBF Kernels – An Easier Dataset
Decision boundary
with an RBF kernel.
For this dataset, this
is a relatively large
value for 𝜎, and it
produces a boundary
that is almost linear.
116
RBF Kernels – An Easier Dataset
117
Decision boundary
with an RBF kernel.
Decreasing the value
of 𝜎 leads to less
linear boundaries.
RBF Kernels – An Easier Dataset
Decision boundary
with an RBF kernel.
Decreasing the value
of 𝜎 leads to less
linear boundaries.
118
RBF Kernels – An Easier Dataset
Decision boundary
with an RBF kernel.
Decreasing the value
of 𝜎 leads to less
linear boundaries.
119
RBF Kernels – An Easier Dataset
Decision boundary
with an RBF kernel.
Decreasing the value
of 𝜎 leads to less
linear boundaries.
120
RBF Kernels – An Easier Dataset
Decision boundary
with an RBF kernel.
Decreasing the value
of 𝜎 leads to less
linear boundaries.
121
RBF Kernels – An Easier Dataset
Decision boundary
with an RBF kernel.
Note that smaller
values of 𝜎 increase
dangers of
overfitting.
122
RBF Kernels – An Easier Dataset
Decision boundary
with an RBF kernel.
Note that smaller
values of 𝜎 increase
dangers of
overfitting.
123
RBF Kernels – A Harder Dataset
Decision boundary
with a linear kernel.
124
RBF Kernels – A Harder Dataset
Decision boundary
with an RBF kernel.
The boundary is
almost linear.
125
RBF Kernels – A Harder Dataset
Decision boundary
with an RBF kernel.
The boundary now
is clearly nonlinear.
126
RBF Kernels – A Harder Dataset
Decision boundary
with an RBF kernel.
127
RBF Kernels – A Harder Dataset
Decision boundary
with an RBF kernel.
128
RBF Kernels – A Harder Dataset
Decision boundary
with an RBF kernel.
Again, smaller values
of 𝜎 increase
dangers of
overfitting.
129
RBF Kernels – A Harder Dataset
Decision boundary
with an RBF kernel.
Again, smaller values
of 𝜎 increase
dangers of
overfitting.
130
RBF Kernels and Basis Functions
• The RBF kernel is defined as:
𝑘𝜎(𝒙, 𝒛) = exp( −‖𝒙 − 𝒛‖² / (2𝜎²) )
• Is the RBF kernel equivalent to taking the dot product
in some feature space?
• In other words, is there any basis function 𝜑 such
that 𝑘𝜎 𝒙, 𝒛 = 𝜑 𝒙 𝑇𝜑 𝒛 ?
• The answer is "yes":
– There exists such a function 𝜑, but its output is infinite-
dimensional.
– We will not get into more details here. 131
Kernels for Non-Vector Data
• So far, all our methods have been taking real-valued vectors as
input.
– The inputs have been elements of ℝ𝐷, for some 𝐷 ≥ 1.
• However, there are many interesting problems where the
input is not such real-valued vectors.
• Examples???
132
Kernels for Non-Vector Data
• So far, all our methods have been taking real-valued vectors as
input.
– The inputs have been elements of ℝ𝐷, for some 𝐷 ≥ 1.
• However, there are many interesting problems where the
input is not such real-valued vectors.
• The inputs can be strings, like "cat", "elephant", "dog".
• The inputs can be sets, such as {1, 5, 3}, {5, 1, 2, 7}.
– Sets are NOT vectors. They have very different properties.
– As sets, {1, 5, 3} = {5, 3, 1}.
– As vectors, (1, 5, 3) ≠ (5, 3, 1).
• There are many other types of non-vector data…
• SVMs can be applied to such data, as long as we define an
appropriate kernel function. 133
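As one simple illustration (an example choice, not something prescribed by the slides), the overlap size |A ∩ B| is a valid kernel on sets, because it equals the dot product of the sets' indicator vectors.

```python
def set_intersection_kernel(A, B):
    # k(A, B) = |A ∩ B|: the dot product of the sets' indicator vectors,
    # so it behaves like a dot product even though sets are not vectors.
    return len(set(A) & set(B))

print(set_intersection_kernel({1, 5, 3}, {5, 1, 2, 7}))  # 2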
Training Time Complexity
• Solving the quadratic programming problem takes 𝑂(𝑑³) time, where 𝑑 is the dimensionality of vector 𝒖.
– In other words, 𝑑 is the number of values we want to estimate.
• If we use quadratic programming to compute 𝒘 and 𝑏 directly, then we estimate 𝐷 + 1 values.
– The time complexity is 𝑂(𝐷3), where 𝐷 is the dimensionality of input
vectors 𝒙 (or the dimensionality of 𝜑 𝒙 , if we use a basis function).
• If we use quadratic programming to compute vector 𝒂 =
(𝑎1, … , 𝑎𝑁), then we estimate 𝑁 values.
– The time complexity is 𝑂(𝑁3), where 𝑁 is the number of training
inputs.
• Which one is faster?
134
Training Time Complexity
• If we use quadratic programming to compute directly 𝒘 and
𝑏, then the time complexity is 𝑂(𝐷3).
– If we use no basis function, 𝐷 is the dimensionality of input vectors 𝒙.
– If we use a basis function 𝜑, 𝐷 is the dimensionality of 𝜑 𝒙 .
• If we use quadratic programming to compute vector 𝒂 =
(𝑎1, … , 𝑎𝑁), the time complexity is 𝑂(𝑁3).
• For linear SVMs (i.e., SVMs with linear kernels, that use the
regular dot product), usually 𝐷 is much smaller than 𝑁.
• If we use RBF kernels, 𝜑(𝒙) is infinite-dimensional.
– We cannot compute 𝒘 explicitly, but computing vector 𝒂 still takes 𝑂(𝑁³) time.
• If you want to use the kernel trick, then there is no choice, it
takes 𝑂(𝑁3) time to do training.
135
SVMs for Multiclass Problems
• As usual, you can always train one-vs.-all SVMs if there are
more than two classes.
• Other, more complicated methods are also available.
• You can also train what is called "all-pairs" classifiers:
– Each SVM is trained to discriminate between two classes.
– The number of SVMs is quadratic to the number of classes.
• All-pairs classifiers can be used in different ways to classify an
input object.
– Each pairwise classifier votes for one of the two classes it was trained
on. Classification time is quadratic to the number of classes.
– There is an alternative method, called DAGSVM, where the all-pairs
classifiers are organized in a directed acyclic graph, and classification
time is linear to the number of classes.
136
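A sketch of the one-vs.-all scheme described above, assuming scikit-learn's SVC as the underlying two-class SVM; the helper names and the commented usage are illustrative.

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is installed

def train_one_vs_all(X, labels, **svc_kwargs):
    # One binary SVM per class: class k versus everything else.
    return {k: SVC(**svc_kwargs).fit(X, np.where(labels == k, 1, -1))
            for k in np.unique(labels)}

def predict_one_vs_all(classifiers, x):
    # Pick the class whose SVM gives the largest decision value for x.
    scores = {k: clf.decision_function(x.reshape(1, -1))[0]
              for k, clf in classifiers.items()}
    return max(scores, key=scores.get)

# Example usage (made-up data with classes 0, 1, 2):
# models = train_one_vs_all(X_train, y_train, kernel="linear", C=1.0)
# print(predict_one_vs_all(models, X_test[0]))
```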
SVMs: Recap
• Advantages:
– Training finds globally best solution.
• No need to worry about local optima, or iterations.
– SVMs can define complex decision boundaries.
• Disadvantages:
– Training time is cubic to the number of training data. This
makes it hard to apply SVMs to large datasets.
– High-dimensional kernels increase the risk of overfitting.
• Usually larger training sets help reduce overfitting, but SVMs cannot
be applied to large training sets due to cubic time complexity.
– Some choices must still be made manually.
• Choice of kernel function.
• Choice of parameter 𝐶 in the formula 𝐶 Σ𝑛=1..𝑁 𝜉𝑛 + ½ ‖𝒘‖².
137

  • 3. A Linearly Separable Problem • Do we prefer the blue line or the red line, as decision boundary? • What criterion can we use? • Both decision boundaries classify the training data with 100% accuracy. 3
  • 4. Margin of a Decision Boundary • The margin of a decision boundary is defined as the smallest distance between the boundary and any of the samples. 4
  • 5. Margin of a Decision Boundary • One way to visualize the margin is this: – For each class, draw a line that: • is parallel to the decision boundary. • touches the class point that is the closest to the decision boundary. – The margin is the smallest distance between the decision boundary and one of those two parallel lines. • In this example, the decision boundary is equally far from both lines. 5 margin margin
  • 6. Support Vector Machines • One way to visualize the margin is this: – For each class, draw a line that: • is parallel to the decision boundary. • touches the class point that is the closest to the decision boundary. – The margin is the smallest distance between the decision boundary and one of those two parallel lines. 6 margin
  • 7. Support Vector Machines • Support Vector Machines (SVMs) are a classification method, whose goal is to find the decision boundary with the maximum margin. – The idea is that, even if multiple decision boundaries give 100% accuracy on the training data, larger margins lead to less overfitting. – Larger margins can tolerate more perturbations of the data. 7 margin margin
  • 8. Support Vector Machines • Note: so far, we are only discussing cases where the training data is linearly separable. • First, we will see how to maximize the margin for such data. • Second, we will deal with data that are not linearly separable. – We will define SVMs that classify such training data imperfectly. • Third, we will see how to define nonlinear SVMs, which can define non-linear decision boundaries. 8 margin margin
  • 9. Support Vector Machines • Note: so far, we are only discussing cases where the training data is linearly separable. • First, we will see how to maximize the margin for such data. • Second, we will deal with data that are not linearly separable. – We will define SVMs that classify such training data imperfectly. • Third, we will see how to define nonlinear SVMs, which can define non-linear decision boundaries. 9 An example of a nonlinear decision boundary produced by a nonlinear SVM.
  • 10. Support Vectors • In the figure, the red line is the maximum margin decision boundary. • One of the parallel lines touches a single orange point. – If that orange point moves closer to or farther from the red line, the optimal boundary changes. – If other orange points move, the optimal boundary does not change, unless those points move to the right of the blue line. 10 margin margin
  • 11. Support Vectors • In the figure, the red line is the maximum margin decision boundary. • One of the parallel lines touches two blue points. – If either of those points moves closer to or farther from the red line, the optimal boundary changes. – If other blue points move, the optimal boundary does not change, unless those points move to the left of the blue line. 11 margin margin
  • 12. Support Vectors • In summary, in this example, the maximum margin is defined by only three points: – One orange point. – Two blue points. • These points are called support vectors. – They are indicated by a black circle around them. 12 margin margin
  • 13. Distances to the Boundary • The decision boundary consists of all points 𝒙 that are solutions to equation: 𝒘𝑇𝒙 + 𝑏 = 0. – 𝒘 is a column vector of parameters (weights). – 𝒙 is an input vector. – 𝑏 is a scalar value (a real number). • If 𝒙𝑛 is a training point, its distance to the boundary is computed using this equation: 𝐷 𝒙𝑛, 𝒘 = (𝒘𝑇𝒙𝑛 + 𝑏) / ‖𝒘‖ 13
  • 14. Distances to the Boundary • If 𝒙𝑛 is a training point, its distance to the boundary is computed using this equation: 𝐷 𝒙𝑛, 𝒘 = 𝒘𝑇𝒙𝑛 + 𝑏 𝒘 • Since the training data are linearly separable, the data from each class should fall on opposite sides of the boundary. • Suppose that 𝑡𝑛 = −1 for points of one class, and 𝑡𝑛 = +1 for points of the other class. • Then, we can rewrite the distance as: 𝐷 𝒙𝑛, 𝒘 = 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) 𝒘 14
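The distance formula on slides 13–14 is easy to check numerically. Below is a minimal numpy sketch (not part of the original slides); the data, 𝒘 and 𝑏 in the example are made up purely for illustration.

```python
import numpy as np

def signed_distances(X, t, w, b):
    """t_n * (w^T x_n + b) / ||w|| for every training point, as on slide 14.
    Positive values mean the point is on the correct side of the boundary."""
    return t * (X @ w + b) / np.linalg.norm(w)

# Made-up example: boundary x1 + x2 - 2 = 0, one point per class.
X = np.array([[2.0, 2.0], [0.0, 0.0]])
t = np.array([+1.0, -1.0])
w = np.array([1.0, 1.0])
b = -2.0
print(signed_distances(X, t, w, b))   # both values are positive (about 1.414)
```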
  • 15. Distances to the Boundary • So, given a decision boundary defined 𝒘 and 𝒃, and given a training input 𝒙𝑛, the distance of 𝒙𝑛 to the boundary is: 𝐷 𝒙𝑛, 𝒘 = 𝑡𝑛(𝒘𝑇 𝒙𝑛 + 𝑏) 𝒘 • If 𝑡𝑛 = −1, then: – 𝒘𝑇 𝒙𝑛 + 𝑏 < 0. – 𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 > 0. • If 𝑡𝑛 = 1, then: – 𝒘𝑇𝒙𝑛 + 𝑏 > 0. – 𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 > 0. • So, in all cases, 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) is positive. 15
  • 16. • If 𝒙𝑛 is a training point, its distance to the boundary is computed using this equation: 𝐷 𝒙𝑛, 𝒘 = 𝑡𝑛(𝒘𝑇 𝒙𝑛 + 𝑏) 𝒘 • Therefore, the optimal boundary 𝒘opt is defined as: (𝒘opt, 𝑏opt) = argmax 𝒘,𝑏 min𝑛 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) 𝒘 – In words: find the 𝒘 and 𝑏 that maximize the minimum distance of any training input from the boundary. 16 Optimization Criterion
  • 17. Optimization Criterion • The optimal boundary 𝒘opt is defined as: (𝒘opt, 𝑏opt) = argmax 𝒘,𝑏 min𝑛 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) 𝒘 • Suppose that, for some values 𝒘 and 𝑏, the decision boundary defined by 𝒘𝑇𝒙𝑛 + 𝑏 = 0 misclassifies some objects. • Can those values of 𝒘 and 𝑏 be selected as 𝒘opt, 𝑏opt? 17
  • 18. Optimization Criterion • The optimal boundary 𝒘opt is defined as: (𝒘opt, 𝑏opt) = argmax 𝒘,𝑏 min𝑛 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) 𝒘 • Suppose that, for some values 𝒘 and 𝑏, the decision boundary defined by 𝒘𝑇𝒙𝑛 + 𝑏 = 0 misclassifies some objects. • Can those values of 𝒘 and 𝑏 be selected as 𝒘opt, 𝑏opt? • No. – If some objects get misclassified, then, for some 𝒙𝑛 it holds that 𝑡𝑛(𝒘𝑇𝒙𝑛+𝑏) 𝒘 < 𝟎. – Thus, for such 𝒘 and 𝑏, the expression in red will be negative. – Since the data is linearly separable, we can find better values for 𝒘 and 𝑏, for which the expression in red will be greater than 0. 18
  • 19. Scale of 𝒘 • The optimal boundary 𝒘opt is defined as: (𝒘opt, 𝑏opt) = argmax 𝒘,𝑏 min𝑛 𝑡𝑛(𝒘𝑇 𝒙𝑛 + 𝑏) 𝒘 • Suppose that 𝑔 is a real number, and 𝑔 > 0. • If 𝒘opt and 𝑏opt define an optimal boundary, then 𝑔 ∗ 𝒘opt and 𝑔 ∗ 𝑏opt also define an optimal boundary. • We constrain the scale of 𝒘opt to a single value, by requiring that: min𝑛 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) = 1 19
  • 20. Optimization Criterion • We introduced the requirement that: min𝑛 𝑡𝑛(𝒘𝑇𝒙𝑛 + 𝑏) = 1 • Therefore, for any 𝒙𝑛, it holds that: 𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 ≥ 1 • The original optimization criterion becomes: 𝒘opt, 𝑏opt = argmax 𝒘,𝑏 min𝑛 𝑡𝑛 𝒘𝑇𝒙𝑛+𝑏 𝒘 ⇒ 𝒘opt = argmax 𝒘 1 𝒘 = argmin 𝒘 𝒘 ⇒ 𝒘opt = argmin 𝒘 1 2 𝒘 𝟐 20 These are equivalent formulations. The textbook uses the last one because it simplifies subsequent calculations.
  • 21. Constrained Optimization • Summarizing the previous slides, we want to find: 𝒘opt = argmin 𝒘 1 2 𝒘 𝟐 subject to the following constraints: ∀𝑛 ∈ 1, … , 𝑁 , 𝑡𝑛 𝒘𝑇 𝒙𝑛 + 𝑏 ≥ 1 • This is a different optimization problem than what we have seen before. • We need to minimize a quantity while satisfying a set of inequalities. • This type of problem is a constrained optimization problem. 21
  • 22. Quadratic Programming • Our constrained optimization problem can be solved using a method called quadratic programming. • Describing quadratic programming in depth is outside the scope of this course. • Our goal is simply to understand how to use quadratic programming as a black box, to solve our optimization problem. – This way, you can use any quadratic programming toolkit (Matlab includes one). 22
  • 23. Quadratic Programming • The quadratic programming problem is defined as follows: • Inputs: – 𝒔: an 𝑅-dimensional column vector. – 𝑸: an 𝑅 × 𝑅-dimensional symmetric matrix. – 𝑯: a 𝑄 × 𝑅-dimensional matrix. – 𝒛: a 𝑄-dimensional column vector. • Output: – 𝒖opt: an 𝑅-dimensional column vector, such that: 𝒖opt = argmin𝑢 1 2 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 subject to constraint: 𝑯𝒖 ≤ 𝒛 23
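Since the slides treat the quadratic programming solver as a black box, here is one hedged way to wrap such a solver in code. The sketch assumes the third-party cvxopt package is installed; any QP solver with an equivalent interface (for example MATLAB's quadprog, mentioned on the previous slide) would serve the same role. Some solvers expect 𝑸 to be strictly positive definite, so a tiny ridge is added to the diagonal as a numerical safeguard.

```python
import numpy as np
from cvxopt import matrix, solvers   # assumed third-party QP solver

def solve_qp(Q, s, H, z, ridge=1e-8):
    """Minimize (1/2) u^T Q u + s^T u subject to H u <= z (slide notation)."""
    Q = np.asarray(Q, dtype=float) + ridge * np.eye(len(s))  # numerical safeguard
    solvers.options['show_progress'] = False
    sol = solvers.qp(matrix(Q), matrix(np.asarray(s, dtype=float)),
                     matrix(np.asarray(H, dtype=float)),
                     matrix(np.asarray(z, dtype=float)))
    return np.array(sol['x']).ravel()   # the optimal u
```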
  • 24. Quadratic Programming for SVMs • Quadratic Programming: 𝒖opt = argmin𝑢 1 2 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 subject to constraint: 𝑯𝒖 ≤ 𝒛 24
  • 25. Quadratic Programming for SVMs • Quadratic Programming constraint: 𝑯𝒖 ≤ 𝒛 • SVM constraints: ∀𝑛 ∈ 1, … , 𝑁 , 𝑡𝑛 𝒘𝑇 𝒙𝑛 + 𝑏 ≥ 1 ⇒ ∀𝑛 ∈ 1, … , 𝑁 , −𝑡𝑛 𝒘𝑇 𝒙𝑛 + 𝑏 ≤ −1 25
  • 26. Quadratic Programming for SVMs • SVM constraints: ∀𝑛 ∈ 1, … , 𝑁 , −𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 ≤ −1 • Define: 𝑯 = −𝑡1, −𝑡1𝑥11, … , −𝑡1𝑥1𝐷 −𝑡2, −𝑡2𝑥21, … , −𝑡2𝑥2𝐷 ⋯ −𝑡𝑁, −𝑡𝑁𝑥𝑁1, … , −𝑡𝑁𝑥𝑁𝐷 , 𝒖 = 𝑏 𝑤1 𝑤2 ⋯ 𝑤𝐷 , 𝒛 = −1 −1 ⋯ −1 𝑁 – Matrix 𝑯 is 𝑁 × (𝐷 + 1), vector 𝒖 has (𝐷 + 1) rows, vector 𝒛 has 𝑁 rows. • 𝑯𝒖 = −𝑡1𝑏 − 𝑡1𝑥11𝑤1 − ⋯ − 𝑡1𝑥1𝐷𝑤𝐷 … −𝑡𝑁𝑏 − 𝑡𝑁𝑥𝑁1𝑤1 − ⋯ − 𝑡𝑁𝑥𝑁𝐷𝑤𝐷 = −𝑡1(𝑏 + 𝒘𝑇 𝒙1) … −𝑡𝑁(𝑏 + 𝒘𝑇𝒙𝑁) • The 𝑛-th row of 𝑯𝒖 is −𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 , which should be ≤ −1. • SVM constraints ⇔ quadratic programming constraint 𝑯𝒖 ≤ 𝒛. 26
  • 27. Quadratic Programming for SVMs • Quadratic programming: 𝒖opt = argmin𝑢 1 2 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 • SVM: 𝒘opt = argmin 𝒘 1 2 𝒘 𝟐 • Define: 𝑸 = 0, 0, 0, … , 0 0, 1, 0, … , 0 0, 0, 1, … , 0 ⋯ 0, 0, 0, … , 1 , 𝒖 = 𝑏 𝑤1 𝑤2 ⋯ 𝑤𝐷 , 𝒔 = 0 0 ⋯ 0 𝐷+1 – 𝑸 is like the (𝐷 + 1) × (𝐷 + 1) identity matrix, except that 𝑄11 = 0. • Then, 1 2 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 = 1 2 𝒘 𝟐. 27 𝒖 already defined in the previous slides 𝒔: 𝐷 + 1 - dimensional column vector of zeros.
  • 28. Quadratic Programming for SVMs • Quadratic programming: 𝒖opt = argmin𝑢 1 2 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 • SVM: 𝒘opt = argmin 𝒘 1 2 𝒘 𝟐 • Alternative definitions that would NOT work: • Define: 𝑸 = 1, 0, … , 0 … 0, 0, … , 1 , 𝒖 = 𝑤1 … 𝑤𝐷 , 𝒔 = 0 … 0 𝑫 – 𝑸 is the 𝐷 × 𝐷 identity matrix, 𝒖 = 𝒘, 𝒔 is the 𝐷-dimensional zero vector. • It still holds that 1 2 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 = 1 2 𝒘 𝟐. • Why would these definitions not work? 28
  • 29. Quadratic Programming for SVMs • Quadratic programming: 𝒖opt = argmin𝑢 1 2 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 • SVM: 𝒘opt = argmin 𝒘 1 2 𝒘 𝟐 • Alternative definitions that would NOT work: • Define: 𝑸 = 1, 0, … , 0 … 0, 0, … , 1 , 𝒖 = 𝑤1 … 𝑤𝐷 , 𝒔 = 0 … 0 𝑫 – 𝑸 is the 𝐷 × 𝐷 identity matrix, 𝒖 = 𝒘, 𝒔 is the 𝐷-dimensional zero vector. • It still holds that 1 2 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 = 1 2 𝒘 𝟐. • Why would these definitions not work? • Vector 𝒖 must also make 𝑯𝒖 ≤ 𝒛 match the SVM constraints. – With this definition of 𝒖, no appropriate 𝑯 and 𝒛 can be found. 29
  • 30. Quadratic Programming for SVMs • Quadratic programming: 𝒖opt = argmin𝑢 1 2 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 subject to constraint: 𝑯𝒖 ≤ 𝒛 • SVM goal: 𝒘opt = argmin 𝒘 1 2 𝒘 𝟐 subject to constraints: ∀𝑛 ∈ 1, … , 𝑁 , 𝑡𝑛 𝒘𝑇 𝒙𝑛 + 𝑏 ≥ 1 • Task: define 𝑸, 𝒔, 𝑯, 𝒛, 𝒖 so that quadratic programming computes 𝒘opt and 𝑏opt. 𝑸 = 0, 0, 0, … , 0 0, 1, 0, … , 0 0, 0, 1, … , 0 ⋯ 0, 0, 0, … , 1 , 𝒔 = 0 0 ⋯ 0 , 𝑯 = −𝑡1, −𝑡1𝑥11, … , −𝑡1𝑥1𝐷 −𝑡2, −𝑡2𝑥21, … , −𝑡1𝑥2𝐷 ⋯ −𝑡𝑁, −𝑡𝑁𝑥𝑁1, … , −𝑡𝑁𝑥𝑁𝐷 , 𝒛 = −1 −1 ⋯ −1 , 𝒖 = 𝑏 𝑤1 𝑤2 ⋯ 𝑤𝐷 30 Like (𝐷 + 1) × (𝐷 + 1) identity matrix, except that 𝑄11 = 0 𝐷 + 1 rows 𝐷 + 1 rows 𝑁 rows 𝑁 rows, 𝐷 + 1 columns
  • 31. 31 Quadratic Programming for SVMs • Quadratic programming: 𝒖opt = argmin𝑢 1 2 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 subject to constraint: 𝑯𝒖 ≤ 𝒛 • SVM goal: 𝒘opt = argmin 𝒘 1 2 𝒘 𝟐 subject to constraints: ∀𝑛 ∈ 1, … , 𝑁 , 𝑡𝑛 𝒘𝑇 𝒙𝑛 + 𝑏 ≥ 1 • Task: define 𝑸, 𝒔, 𝑯, 𝒛, 𝒖 so that quadratic programming computes 𝒘opt and 𝑏opt. 𝑸 = 0, 0, 0, … , 0 0, 1, 0, … , 0 0, 0, 1, … , 0 ⋯ 0, 0, 0, … , 1 , 𝒔 = 0 0 ⋯ 0 , 𝑯 = −𝑡1, −𝑡1𝑥11, … , −𝑡1𝑥1𝐷 −𝑡2, −𝑡2𝑥21, … , −𝑡1𝑥2𝐷 ⋯ −𝑡𝑁, −𝑡𝑁𝑥𝑁1, … , −𝑡𝑁𝑥𝑁𝐷 , 𝒛 = −1 −1 ⋯ −1 , 𝒖 = 𝑏 𝑤1 𝑤2 ⋯ 𝑤𝐷 • Quadratic programming takes as inputs 𝑸, 𝒔, 𝑯, 𝒛, and outputs 𝒖opt, from which we get the 𝒘opt, 𝑏opt values for our SVM.
  • 32. 32 Quadratic Programming for SVMs • Quadratic programming: 𝒖opt = argmin𝑢 1 2 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 subject to constraint: 𝑯𝒖 ≤ 𝒛 • SVM goal: 𝒘opt = argmin 𝒘 1 2 𝒘 𝟐 subject to constraints: ∀𝑛 ∈ 1, … , 𝑁 , 𝑡𝑛 𝒘𝑇 𝒙𝑛 + 𝑏 ≥ 1 • Task: define 𝑸, 𝒔, 𝑯, 𝒛, 𝒖 so that quadratic programming computes 𝒘opt and 𝑏opt. 𝑸 = 0, 0, 0, … , 0 0, 1, 0, … , 0 0, 0, 1, … , 0 ⋯ 0, 0, 0, … , 1 , 𝒔 = 0 0 ⋯ 0 , 𝑯 = −𝑡1, −𝑡1𝑥11, … , −𝑡1𝑥1𝐷 −𝑡2, −𝑡2𝑥21, … , −𝑡1𝑥2𝐷 ⋯ −𝑡𝑁, −𝑡𝑁𝑥𝑁1, … , −𝑡𝑁𝑥𝑁𝐷 , 𝒛 = −1 −1 ⋯ −1 , 𝒖 = 𝑏 𝑤1 𝑤2 ⋯ 𝑤𝐷 • 𝒘opt is the vector of values at dimensions 2, …, 𝐷 + 1 of 𝒖opt. • 𝑏opt is the value at dimension 1 of 𝒖opt.
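As a concrete illustration of the mapping on slides 30–32, the sketch below builds 𝑸, 𝒔, 𝑯, 𝒛 for the hard-margin primal problem in numpy and hands them to a QP solver. It assumes linearly separable data and the solve_qp wrapper sketched earlier; it is illustrative code, not code from the slides.

```python
import numpy as np

def train_linear_svm_primal(X, t, solve_qp):
    """Hard-margin primal SVM via the Q, s, H, z mapping of slides 30-32.
    X: (N, D) inputs, t: (N,) labels in {-1, +1}. Returns (w, b)."""
    N, D = X.shape
    Q = np.eye(D + 1)
    Q[0, 0] = 0.0                                        # no penalty on b
    s = np.zeros(D + 1)
    H = -t[:, None] * np.hstack([np.ones((N, 1)), X])    # n-th row: -t_n [1, x_n]
    z = -np.ones(N)
    u = solve_qp(Q, s, H, z)                             # u = (b, w_1, ..., w_D)
    return u[1:], u[0]
```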
  • 33. Solving the Same Problem, Again • So far, we have solved the problem of defining an SVM (i.e., defining 𝒘opt and 𝑏opt), so as to maximize the margin between linearly separable data. • If this were all that SVMs can do, SVMs would not be that important. – Linearly separable data are a rare case, and a very easy case to deal with. 33 margin margin
  • 34. • So far, we have solved the problem of defining an SVM (i.e., defining 𝒘opt and 𝑏opt), so as to maximize the margin between linearly separable data. • We will see two extensions that make SVMs much more powerful. – The extensions will allow SVMs to define highly non-linear decision boundaries, as in this figure. • However, first we need to solve the same problem again. – Maximize the margin between linearly separable data. • We will get a more complicated solution, but that solution will be easier to improve upon. Solving the Same Problem, Again 34
  • 35. Lagrange Multipliers • Our new solutions are derived using Lagrange multipliers. • Here is a quick review from multivariate calculus. • Let 𝒙 be a 𝐷-dimensional vector. • Let 𝑓(𝒙) and 𝑔(𝒙) be functions from ℝ𝐷 to ℝ. – Functions 𝑓 and 𝑔 map 𝐷-dimensional vectors to real numbers. • Suppose that we want to minimize 𝑓(𝒙), subject to the constraint that 𝑔 𝒙 ≥ 0. • Then, we can solve this problem using a Lagrange multiplier to define a Lagrangian function. 35
  • 36. 36 Lagrange Multipliers • To minimize 𝑓(𝒙), subject to the constraint: 𝑔 𝒙 ≥ 0: • We define the Lagrangian function: 𝐿 𝒙, 𝜆 = 𝑓 𝒙 − 𝜆𝑔(𝒙) – 𝜆 is called a Lagrange multiplier, 𝜆 ≥ 0. • We find 𝒙opt = argmin𝒙 𝐿(𝒙, 𝜆) , and a corresponding value for 𝜆, subject to the following constraints: 1. 𝑔 𝒙 ≥ 0 2. 𝜆 ≥ 0 3. 𝜆𝑔 𝒙 = 0 • If 𝑔 𝒙opt > 0, the third constraint implies that 𝜆 = 0. • Then, the constraint 𝑔 𝒙 ≥ 0 is called inactive. • If 𝑔 𝒙opt = 0, then 𝜆 > 0. • Then, constraint 𝑔 𝒙 ≥ 0 is called active.
  • 37. 37 Multiple Constraints • Suppose that we have 𝑁 constraints: ∀𝑛 ∈ 1, … , 𝑁 , 𝑔𝑛 𝒙 ≥ 0 • We want to minimize 𝑓(𝒙), subject to those 𝑁 constraints. • Define vector 𝝀 = 𝜆1, … , 𝜆𝑁 . • Define the Lagrangian function as: 𝐿 𝒙, 𝝀 = 𝑓 𝒙 − 𝑛=1 𝑁 𝜆𝑛𝑔𝑛 𝒙 • We find 𝒙opt = argmin𝒙 𝐿(𝒙, 𝝀) , and a value for 𝝀, subject to: • ∀𝑛, 𝑔𝑛 𝒙 ≥ 0 • ∀𝑛, 𝜆𝑛 ≥ 0 • ∀𝑛, 𝜆𝑛𝑔𝑛 𝒙 = 0
  • 38. 38 Lagrange Dual Problems • We have 𝑁 constraints: ∀𝑛 ∈ 1, … , 𝑁 , 𝑔𝑛 𝒙 ≥ 0 • We want to minimize 𝑓(𝒙), subject to those 𝑁 constraints. • Under some conditions (which are satisfied in our SVM problem), we can solve an alternative dual problem: • Define the Lagrangian function as before: 𝐿 𝒙, 𝝀 = 𝑓 𝒙 − 𝑛=1 𝑁 𝜆𝑛𝑔𝑛 𝒙 • We find 𝒙opt, and the best value for 𝝀, denoted as 𝝀opt, by solving: 𝝀opt = argmax 𝝀 min 𝒙 𝐿 𝒙, 𝝀 𝒙opt = argmin 𝒙 𝐿 𝒙, 𝝀opt subject to constraints: 𝜆𝑛 ≥ 𝟎
  • 39. 39 Lagrange Dual Problems • Lagrangian dual problem: Solve: 𝝀opt = argmax 𝝀 min 𝒙 𝐿 𝒙, 𝝀 𝒙opt = argmin 𝒙 𝐿 𝒙, 𝝀opt subject to constraints: 𝜆𝑛 ≥ 𝟎 • This dual problem formulation will be used in training SVMs. • The key thing to remember is: – We minimize the Lagrangian 𝐿 with respect to 𝒙. – We maximize 𝐿 with respect to the Lagrange multipliers 𝜆𝑛.
  • 40. 40 Lagrange Multipliers and SVMs • SVM goal: 𝒘opt = argmin 𝒘 1 2 𝒘 𝟐 subject to constraints: ∀𝑛 ∈ 1, … , 𝑁 , 𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 ≥ 1 • To make the constraints more amenable to Lagrange multipliers, we rewrite them as: ∀𝑛 ∈ 1, … , 𝑁 , 𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 − 1 ≥ 0 • Define 𝒂 = (𝑎1, … , 𝑎𝑁) to be a vector of 𝑁 Lagrange multipliers. • Define the Lagrangian function: 𝐿 𝒘, 𝑏, 𝒂 = 1 2 𝒘 𝟐 − 𝑛=1 𝑁 𝑎𝑛 𝑡𝑛 𝒘𝑇 𝒙𝑛 + 𝑏 − 1 • Remember from the previous slides, we minimize 𝐿 with respect to 𝒘, 𝑏, and maximize 𝐿 with respect to 𝒂.
  • 41. 41 Lagrange Multipliers and SVMs • The 𝒘 and 𝑏 that minimize 𝐿 𝒘, 𝑏, 𝒂 must satisfy: 𝜕𝐿 𝜕𝒘 = 0, 𝜕𝐿 𝜕𝑏 = 0. 𝜕𝐿 𝜕𝒘 = 0 ⇒ 𝜕 1 2 𝒘 𝟐 − 𝑛=1 𝑁 𝑎𝑛 𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 − 1 𝜕𝒘 = 0 ⇒ 𝒘 − 𝑛=1 𝑁 𝑎𝑛𝑡𝑛𝒙𝑛 = 0 ⇒ 𝒘 = 𝑎𝑛𝑡𝑛𝒙𝑛 𝑁 𝑛=1
  • 42. 42 Lagrange Multipliers and SVMs • The 𝒘 and 𝑏 that minimize 𝐿 𝒘, 𝑏, 𝒂 must satisfy: 𝜕𝐿 𝜕𝒘 = 0, 𝜕𝐿 𝜕𝑏 = 0. 𝜕𝐿 𝜕𝑏 = 0 ⇒ 𝜕 1 2 𝒘 𝟐 − 𝑛=1 𝑁 𝑎𝑛 𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 − 1 𝜕𝑏 = 0 ⇒ 𝑛=1 𝑁 𝑎𝑛𝑡𝑛 = 0
  • 43. 43 Lagrange Multipliers and SVMs • Our Lagrangian function is: 𝐿 𝒘, 𝑏, 𝒂 = 1 2 𝒘 2 − 𝑛=1 𝑁 𝑎𝑛 𝑡𝑛 𝒘𝑇 𝒙𝑛 + 𝑏 − 1 • We showed that 𝒘 = 𝑛=1 𝑁 𝑎𝑛𝑡𝑛𝒙𝑛 . Using that, we get: 1 2 𝒘 2 = 1 2 𝒘𝑇𝒘 = 1 2 𝑛=1 𝑁 𝑎𝑛𝑡𝑛 𝒙𝑛 𝑇 𝑛=1 𝑁 𝑎𝑛𝑡𝑛𝒙𝑛
  • 44. 44 Lagrange Multipliers and SVMs • We showed that: ⇒ 1 2 𝒘 2 = 1 2 𝑛=1 𝑁 𝑚=1 𝑁 𝑎𝑛𝑎𝑚𝑡𝑛𝑡𝑚 𝒙𝑛 𝑇 𝒙𝑚 • Define an 𝑁 × 𝑁 matrix 𝑸 such that 𝑄𝑚𝑛 = 𝑡𝑛𝑡𝑚 𝒙𝑛 𝑇 𝒙𝑚. • Remember that we have defined 𝒂 = (𝑎1, … , 𝑎𝑁). • Then, it follows that: 1 2 𝒘 2 = 1 2 𝒂𝑇 𝑸𝒂
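The matrix 𝑸 defined on this slide can be computed with one line of numpy broadcasting. A small sketch, with X the 𝑁 × 𝐷 matrix of training inputs and t the vector of ±1 labels:

```python
import numpy as np

def dual_Q(X, t):
    """Q[m, n] = t_n * t_m * (x_n . x_m), as defined on slide 44."""
    return np.outer(t, t) * (X @ X.T)
```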
  • 45. 45 Lagrange Multipliers and SVMs 𝐿 𝒘, 𝑏, 𝒂 = 1 2 𝒂𝑇 𝑸𝒂 − 𝑛=1 𝑁 𝑎𝑛 𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 − 1 • We showed that 𝑛=1 𝑁 𝑎𝑛𝑡𝑛 = 0. Using that, we get: 𝑛=1 𝑁 𝑎𝑛 𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 − 1 = 𝑛=1 𝑁 𝑎𝑛𝑡𝑛𝒘𝑇 𝒙𝑛 + 𝑏 𝑛=1 𝑁 𝑎𝑛𝑡𝑛 − 𝑛=1 𝑁 𝑎𝑛 = 𝑛=1 𝑁 𝑎𝑛𝑡𝑛𝒘𝑇𝒙𝑛 − 𝑛=1 𝑁 𝑎𝑛 We have shown before that the red part equals 0.
  • 46. 46 Lagrange Multipliers and SVMs 𝐿 𝒘, 𝒂 = 1 2 𝒂𝑇 𝑸𝒂 − 𝑛=1 𝑁 𝑎𝑛𝑡𝑛𝒘𝑇𝒙𝑛 + 𝑛=1 𝑁 𝑎𝑛 • Function 𝐿 now does not depend on 𝑏. • We simplify more, using again the fact that 𝒘 = 𝑛=1 𝑁 𝑎𝑛𝑡𝑛𝒙𝑛 : 𝑛=1 𝑁 𝑎𝑛𝑡𝑛𝒘𝑇𝒙𝑛 = 𝑛=1 𝑁 𝑎𝑛𝑡𝑛 𝑚=1 𝑁 𝑎𝑚𝑡𝑚 𝒙𝑚 𝑇 𝒙𝑛 = 𝑛=1 𝑁 𝑚=1 𝑁 𝑎𝑛𝑎𝑚𝑡𝑛𝑡𝑚 𝒙𝑛 𝑇 𝒙𝑛 = 𝒂𝑇 𝑸𝒂
  • 47. 47 Lagrange Multipliers and SVMs 𝐿 𝒘, 𝒂 = 1 2 𝒂𝑇 𝑸𝒂 − 𝒂𝑇 𝑸𝒂 + 𝑛=1 𝑁 𝑎𝑛 = 𝑛=1 𝑁 𝑎𝑛 − 1 2 𝒂𝑇 𝑸𝒂 • Function 𝐿 now does not depend on 𝑤 anymore. • We can rewrite 𝐿 as a function whose only input is 𝒂: 𝐿 𝒂 = 1 2 𝒂𝑇 𝑸𝒂 − 𝒂𝑇 𝑸𝒂 + 𝑛=1 𝑁 𝑎𝑛 = 𝑛=1 𝑁 𝑎𝑛 − 1 2 𝒂𝑇 𝑸𝒂 • Remember, we want to maximize 𝐿 𝒂 with respect to 𝒂.
  • 48. 48 Lagrange Multipliers and SVMs • By combining the results from the last few slides, our optimization problem becomes: Maximize 𝐿 𝒂 = 𝑛=1 𝑁 𝑎𝑛 − 1 2 𝒂𝑇 𝑸𝒂 subject to these constraints: 𝑎𝑛 ≥ 0
  • 49. 49 Lagrange Multipliers and SVMs 𝐿 𝒂 = 𝑛=1 𝑁 𝑎𝑛 − 1 2 𝒂𝑇 𝑸𝒂 • We want to maximize 𝐿 𝒂 subject to some constraints. • Therefore, we want to find an 𝒂opt such that: 𝒂opt = argmax 𝒂 𝑛=1 𝑁 𝑎𝑛 − 1 2 𝒂𝑇 𝑸𝒂 = argmin 𝒂 1 2 𝒂𝑇 𝑸𝒂 − 𝑛=1 𝑁 𝑎𝑛 subject to those constraints.
  • 50. 50 SVM Optimization Problem • Our SVM optimization problem now is to find an 𝒂opt such that: 𝒂opt = argmin 𝒂 1 2 𝒂𝑇 𝑸𝒂 − 𝑛=1 𝑁 𝑎𝑛 subject to these constraints: 𝑎𝑛 ≥ 0 This problem can be solved again using quadratic programming.
  • 51. Using Quadratic Programming • Quadratic programming: 𝒖opt = argmin𝑢 1 2 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 subject to constraint: 𝑯𝒖 ≤ 𝒛 • SVM problem: find 𝒂opt = argmin 𝒂 1 2 𝒂𝑇 𝑸𝒂 − 𝑛=1 𝑁 𝑎𝑛 subject to constraints: 𝑎𝑛 ≥ 0, 𝑛=1 𝑁 𝑎𝑛𝑡𝑛 = 0 • Again, we must find values for 𝑸, 𝒔, 𝑯, 𝒛 that convert the SVM problem into a quadratic programming problem. • Note that we already have a matrix 𝑸 in the Lagrangian. It is an 𝑁 × 𝑁 matrix 𝑸 such that 𝑄𝑚𝑛 = 𝑡𝑛𝑡𝑚 𝒙𝑛 𝑇 𝒙𝑚. • We will use the same 𝑸 for quadratic programming. 51
  • 52. Using Quadratic Programming • Quadratic programming: 𝒖opt = argmin𝑢 1 2 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 subject to constraint: 𝑯𝒖 ≤ 𝒛 • SVM problem: find 𝒂opt = argmin 𝒂 1 2 𝒂𝑇 𝑸𝒂 − 𝑛=1 𝑁 𝑎𝑛 subject to constraints: 𝑎𝑛 ≥ 0, 𝑛=1 𝑁 𝑎𝑛𝑡𝑛 = 0 • Define: 𝒖 = 𝒂, and define N-dimensional vector 𝒔 = −1 −1 ⋯ −1 𝑁 . • Then, 1 2 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 = 1 2 𝒂𝑇 𝑸𝒂 − 𝑛=1 𝑁 𝑎𝑛 • We have mapped the SVM minimization goal of finding 𝒂opt to the quadratic programming minimization goal. 52
  • 53. Using Quadratic Programming • Quadratic programming constraint: 𝑯𝒖 ≤ 𝒛 • SVM problem constraints: 𝑎𝑛 ≥ 0, 𝑛=1 𝑁 𝑎𝑛𝑡𝑛 = 0 • Define: 𝑯 = −1, 0, 0, … , 0 0, −1, 0, … , 0 ⋯ 0, 0, 0, … , −1 𝑡1, 𝑡2, 𝑡3, … , 𝑡𝑛 −𝑡1, −𝑡2, −𝑡3, … , −𝑡𝑛 𝑁 , 𝒛 = 0 0 ⋯ 0 𝑁+2 • The first 𝑁 rows of 𝑯 are the negation of the 𝑁 × 𝑁 identity matrix. • Row 𝑁 + 1 of 𝑯 is the transpose of vector 𝒕 of target outputs. • Row 𝑁 + 2 of 𝑯 is the negation of the previous row. 53 𝒛: 𝑁 + 2 - dimensional column vector of zeros.
  • 54. Using Quadratic Programming • Quadratic programming constraint: 𝑯𝒖 ≤ 𝒛 • SVM problem constraints: 𝑎𝑛 ≥ 0, 𝑛=1 𝑁 𝑎𝑛𝑡𝑛 = 0 • Define: 𝑯 = −1, 0, 0, … , 0 0, −1, 0, … , 0 ⋯ 0, 0, 0, … , −1 𝑡1, 𝑡2, 𝑡3, … , 𝑡𝑛 −𝑡1, −𝑡2, −𝑡3, … , −𝑡𝑛 𝑁 , 𝒛 = 0 0 ⋯ 0 𝑁+2 • Since we defined 𝒖 = 𝒂: – For 𝑛 ≤ 𝑁, the 𝑛-th row of 𝑯𝒖 equals −𝑎𝑛. – For 𝑛 ≤ 𝑁, the 𝑛-th row of 𝑯𝒖 and 𝒛 specifies that −𝑎𝑛 ≤ 0 ⇒ 𝑎𝑛 ≥ 0. – The (𝑁 + 1)-th row of 𝑯𝒖 and 𝒛 specifies that 𝑛=1 𝑁 𝑎𝑛𝑡𝑛 ≤ 0. – The (𝑁 + 2)-th row of 𝑯𝒖 and 𝒛 specifies that 𝑛=1 𝑁 𝑎𝑛𝑡𝑛 ≥ 0. 54 𝒛: 𝑁 + 2 - dimensional column vector of zeros.
  • 55. Using Quadratic Programming • Quadratic programming constraint: 𝑯𝒖 ≤ 𝒛 • SVM problem constraints: 𝑎𝑛 ≥ 0, 𝑛=1 𝑁 𝑎𝑛𝑡𝑛 = 0 • Define: 𝑯 = −1, 0, 0, … , 0 0, −1, 0, … , 0 ⋯ 0, 0, 0, … , −1 𝑡1, 𝑡2, 𝑡3, … , 𝑡𝑛 −𝑡1, −𝑡2, −𝑡3, … , −𝑡𝑛 𝑁 , 𝒛 = 0 0 ⋯ 0 𝑁+2 • Since we defined 𝒖 = 𝑎: – For 𝑛 ≤ 𝑁, the 𝑛-th row of 𝑯𝒖 and 𝒛 specifies that 𝑎𝑛 ≥ 0. – The last two rows of 𝑯𝒖 and 𝒛 specify that 𝑛=1 𝑁 𝑎𝑛𝑡𝑛 = 0. • We have mapped the SVM constraints to the quadratic programming constraint 𝑯𝒖 ≤ 𝒛. 55 𝒛: 𝑁 + 2 - dimensional column vector of zeros.
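Putting slides 50–55 together, here is a hedged sketch of the dual hard-margin training step: build 𝑸, 𝒔, 𝑯, 𝒛 as described and pass them to a QP solver (the solve_qp wrapper sketched earlier is assumed). This is illustrative code, not code from the slides.

```python
import numpy as np

def train_svm_dual_hard(X, t, solve_qp):
    """Dual hard-margin SVM: returns the Lagrange multipliers a_n."""
    N = X.shape[0]
    Q = np.outer(t, t) * (X @ X.T)          # Q[m, n] = t_n t_m x_n^T x_m
    s = -np.ones(N)
    H = np.vstack([-np.eye(N),              # rows 1..N:        -a_n <= 0
                   t.reshape(1, -1),        # row N+1:     sum a_n t_n <= 0
                   -t.reshape(1, -1)])      # row N+2:    -sum a_n t_n <= 0
    z = np.zeros(N + 2)
    return solve_qp(Q, s, H, z)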
  • 56. Interpretation of the Solution • Quadratic programming, given the inputs 𝑸, 𝒔, 𝑯, 𝒛 defined in the previous slides, outputs the optimal value for vector 𝒂. • We used this Lagrangian function: 𝐿 𝒘, 𝑏, 𝒂 = 1 2 𝒘 𝟐 − 𝑛=1 𝑁 𝑎𝑛 𝑡𝑛 𝒘𝑇 𝒙𝑛 + 𝑏 − 1 • Remember, when we find 𝒙opt to minimize any Lagrangian 𝐿 𝒙, 𝝀 = 𝑓 𝒙 − 𝑛=1 𝑁 𝜆𝑛𝑔𝑛 𝒙 one of the constraints we enforce is that ∀𝑛, 𝜆𝑛𝑔𝑛 𝒙 = 0. • In the SVM case, this means: ∀𝑛, 𝑎𝑛 𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 − 1 = 0 56
  • 57. Interpretation of the Solution • In the SVM case, it holds that: ∀𝑛, 𝑎𝑛 𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 − 1 = 0 • What does this mean? • Mathematically, ∀𝑛: • Either 𝑎𝑛 = 0, • Or 𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 = 1. • If 𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 = 1, what does that imply for 𝒙𝑛? 57
  • 58. Interpretation of the Solution • In the SVM case, it holds that: ∀𝑛, 𝑎𝑛 𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 − 1 = 0 • What does this mean? • Mathematically, ∀𝑛: • Either 𝑎𝑛 = 0, • Or 𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 = 1. • If 𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 = 1, what does that imply for 𝒙𝑛? • Remember, many slides back, we imposed the constraint that ∀𝑛, 𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 ≥ 1. • Equality 𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 = 1 holds only for the support vectors. • Therefore, 𝑎𝑛 > 0 only for the support vectors. 58
  • 59. Support Vectors • This is an example we have seen before. • The three support vectors have a black circle around them. • When we find the optimal values 𝑎1, 𝑎2, … , 𝑎𝑁 for this problem, only three of those values will be non-zero. – If 𝑥𝑛 is not a support vector, then 𝑎𝑛 = 0. 59 margin margin
  • 60. Interpretation of the Solution • We showed before that: 𝒘 = 𝑛=1 𝑁 𝑎𝑛𝑡𝑛𝒙𝑛 • This means that 𝒘 is a linear combination of the training data. • However, since 𝑎𝑛 > 0 only for the support vectors, obviously only the support vector influence 𝒘. 60 margin margin
  • 61. Computing 𝑏 • 𝒘 = 𝑛=1 𝑁 𝑎𝑛𝑡𝑛𝒙𝑛 • Define set 𝑆 = 𝑛 𝒙𝑛 is a support vector}. • Since 𝑎𝑛 > 0 only for the support vectors, we get: 𝒘 = 𝑛∈𝑆 𝑎𝑛𝑡𝑛𝒙𝑛 • If 𝒙𝑛 is a support vector, then 𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 = 1. • Substituting 𝒘 = 𝑚∈𝑆 𝑎𝑚𝑡𝑚𝒙𝑚 , if 𝒙𝑛 is a support vector: 𝑡𝑛 𝑚∈𝑆 𝑎𝑚𝑡𝑚 𝒙𝑚 𝑇 𝒙𝑛 + 𝑏 = 1 61
  • 62. Computing 𝑏 𝑡𝑛 𝑚∈𝑆 𝑎𝑚𝑡𝑚 𝒙𝑚 𝑇𝒙𝑛 + 𝑏 = 1 • Remember that 𝑡𝑛 can only take values 1 and −1. • Therefore, 𝑡𝑛 2 = 1 • Multiplying both sides of the equation with 𝑡𝑛 we get: 𝑡𝑛𝑡𝑛 𝑚∈𝑆 𝑎𝑚𝑡𝑚 𝒙𝑚 𝑇𝒙𝑛 + 𝑏 = 𝑡𝑛 ⇒ 𝑚∈𝑆 𝑎𝑚𝑡𝑚 𝒙𝑚 𝑇𝒙𝑛 + 𝑏 = 𝑡𝑛 ⇒ 𝑏 = 𝑡𝑛 − 𝑚∈𝑆 𝑎𝑚𝑡𝑚 𝒙𝑚 𝑇𝒙𝑛 62
  • 63. Computing 𝑏 • Thus, if 𝒙𝑛 is a support vector, we can compute 𝑏 with formula: 𝑏 = 𝑡𝑛 − 𝑚∈𝑆 𝑎𝑚𝑡𝑚 𝒙𝑚 𝑇𝒙𝑛 • To avoid numerical problems, instead of using a single support vector to calculate 𝑏, we can use all support vectors (and take the average of the computed values for 𝑏). • If 𝑁𝑆 is the number of support vectors, then: 𝑏 = 1 𝑁𝑆 𝑛∈𝑆 𝑡𝑛 − 𝑚∈𝑆 𝑎𝑚𝑡𝑚 𝒙𝑚 𝑇 𝒙𝑛 63
  • 64. Classification Using 𝒂 • To classify a test object 𝒙, we can use the original formula 𝑦 𝒙 = 𝒘𝑇𝒙 + 𝑏. • However, since 𝒘 = 𝑛=1 𝑁 𝑎𝑛𝑡𝑛𝒙𝑛 , we can substitute that formula for 𝒘, and classify 𝒙 using: 𝑦 𝒙 = 𝑛∈𝑆 𝑎𝑛𝑡𝑛 𝒙𝑛 𝑇𝒙 + 𝑏 • This formula will be our only choice when we use SVMs that produce nonlinear boundaries. – Details on that will be coming later in this presentation. 64
  • 65. Recap of Lagrangian-Based Solution • We defined the Lagrangian function, which became (after simplifications): 𝐿 𝒂 = 𝑛=1 𝑁 𝑎𝑛 − 1 2 𝒂𝑇 𝑸𝒂 • Using quadratic programming, we find the value of 𝒂 that maximizes 𝐿 𝒂 , subject to constraints: 𝑎𝑛 ≥ 0, 𝑛=1 𝑁 𝑎𝑛𝑡𝑛 = 0 • Then: 65 𝒘 = 𝑛∈𝑆 𝑎𝑛𝑡𝑛𝒙𝑛 𝑏 = 1 𝑁𝑆 𝑛∈𝑆 𝑡𝑛 − 𝑚∈𝑆 𝑎𝑚𝑡𝑚 𝒙𝑚 𝑇𝒙𝑛
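To make the recap concrete, here is a small numpy sketch (not from the slides) that recovers 𝒘 and 𝑏 from the multipliers 𝒂 and classifies a test input; the tolerance used to detect support vectors is an arbitrary numerical choice.

```python
import numpy as np

def recover_w_b(X, t, a, tol=1e-8):
    S = a > tol                          # support vectors: a_n > 0
    w = (a[S] * t[S]) @ X[S]             # w = sum over S of a_n t_n x_n
    b = np.mean(t[S] - X[S] @ w)         # averaged over all support vectors
    return w, b

def classify(x, w, b):
    return np.sign(w @ x + b)            # +1 or -1
```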
  • 66. Why Did We Do All This? • The Lagrangian-based solution solves a problem that we had already solved, in a more complicated way. • Typically, we prefer simpler solutions. • However, this more complicated solution can be tweaked relatively easily, to produce more powerful SVMs that: – Can be trained even if the training data is not linearly separable. – Can produce nonlinear decision boundaries. 66
  • 67. The Not Linearly Separable Case • Our previous formulation required that the training examples are linearly separable. • This requirement was encoded in the constraint: ∀𝑛 ∈ 1, … , 𝑁 , 𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 ≥ 1 • According to that constraint, every 𝒙𝑛 has to be on the correct side of the boundary. • To handle data that is not linearly separable, we introduce 𝑁 variables 𝜉𝑛 ≥ 0, to define a modified constraint: ∀𝑛 ∈ 1, … , 𝑁 , 𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 ≥ 1 − 𝜉𝑛 • These variables 𝜉𝑛 are called slack variables. • The values for 𝜉𝑛 will be computed during optimization. 67
  • 68. The Meaning of Slack Variables 68 • To handle data that is not linearly separable, we introduce 𝑁 slack variables 𝜉𝑛 ≥ 0, to define a modified constraint: ∀𝑛 ∈ 1, … , 𝑁 , 𝑡𝑛 𝒘𝑇 𝒙𝑛 + 𝑏 ≥ 1 − 𝜉𝑛 • If 𝜉𝑛 = 0, then 𝒙𝑛 is where it should be: – either on the blue line for objects of its class, or on the correct side of that blue line. • If 0 < 𝜉𝑛 < 1, then 𝒙𝑛 is too close to the decision boundary. – Between the red line and the blue line for objects of its class. • If 𝜉𝑛 = 1, 𝒙𝑛 is on the red line. • If 𝜉𝑛 > 1, 𝒙𝑛 is misclassified (on the wrong side of the red line).
  • 69. The Meaning of Slack Variables 69 • If training data is not linearly separable, we introduce 𝑁 slack variables 𝜉𝑛 ≥ 0, to define a modified constraint: ∀𝑛 ∈ 1, … , 𝑁 , 𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 ≥ 1 − 𝜉𝑛 • If 𝜉𝑛 = 0, then 𝒙𝑛 is where it should be: – either on the blue line for objects of its class, or on the correct side of that blue line. • If 0 < 𝜉𝑛 < 1, then 𝒙𝑛 is too close to the decision boundary. – Between the red line and the blue line for objects of its class. • If 𝜉𝑛 = 1, 𝒙𝑛 is on the decision boundary (the red line). • If 𝜉𝑛 > 1, 𝒙𝑛 is misclassified (on the wrong side of the red line).
  • 70. Optimization Criterion • Before, we minimized 1 2 𝒘 𝟐 subject to constraints: ∀𝑛 ∈ 1, … , 𝑁 , 𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 ≥ 1 • Now, we want to minimize error function: 𝐶 𝑛=1 𝑁 𝜉𝑛 + 1 2 𝒘 𝟐 subject to constraints: ∀𝑛 ∈ 1, … , 𝑁 , 𝑡𝑛 𝒘𝑇 𝒙𝑛 + 𝑏 ≥ 1 − 𝜉𝑛 𝜉𝑛 ≥ 0 • 𝐶 is a parameter that we pick manually, 𝐶 ≥ 0. • 𝐶 controls the trade-off between maximizing the margin, and penalizing training examples that violate the margin. – The higher 𝜉𝑛 is, the farther away 𝒙𝑛 is from where it should be. – If 𝒙𝑛 is on the correct side of the decision boundary, and the distance of 𝒙𝑛 to the boundary is greater than or equal to the margin, then 𝜉𝑛 = 0, and 𝒙𝑛 does not contribute to the error function. 70
  • 71. Lagrangian Function • To do our constrained optimization, we define the Lagrangian: 𝐿 𝒘, 𝑏, 𝝃, 𝒂, 𝝁 = 1 2 𝒘 𝟐 + 𝐶 𝑛=1 𝑁 𝜉𝑛 − 𝑛=1 𝑁 𝑎𝑛 𝑡𝑛𝒚 𝒙𝑛 − 1 + 𝜉𝑛 − 𝑛=1 𝑁 𝜇𝑛𝜉𝑛 • The Lagrange multipliers are now 𝑎𝑛 and 𝜇𝑛. • The term 𝑎𝑛 𝑡𝑛𝒚 𝒙𝑛 − 1 + 𝜉𝑛 in the Lagrangian corresponds to constraint 𝑡𝑛𝒚 𝒙𝑛 − 1 + 𝜉𝑛 ≥ 0. • The term 𝜇𝑛𝜉𝑛 in the Lagrangian corresponds to constraint 𝜉𝑛 ≥ 0. 71
  • 72. Lagrangian Function • Lagrangian 𝐿 𝒘, 𝑏, 𝝃, 𝒂, 𝝁 : 1 2 𝒘 𝟐 + 𝐶 𝑛=1 𝑁 𝜉𝑛 − 𝑛=1 𝑁 𝑎𝑛 𝑡𝑛𝒚 𝒙𝑛 − 1 + 𝜉𝑛 − 𝑛=1 𝑁 𝜇𝑛𝜉𝑛 • Constraints: 𝑎𝑛 ≥ 0 𝜇𝑛 ≥ 0 𝜉𝑛 ≥ 0 𝑡𝑛𝒚 𝒙𝑛 − 1 + 𝜉𝑛 ≥ 0 𝑎𝑛 𝑡𝑛𝒚 𝒙𝑛 − 1 + 𝜉𝑛 = 0 𝜇𝑛𝜉𝑛 = 0 72
  • 73. Lagrangian Function • Lagrangian 𝐿 𝒘, 𝑏, 𝝃, 𝒂, 𝝁 : 1 2 𝒘 𝟐 + 𝐶 𝑛=1 𝑁 𝜉𝑛 − 𝑛=1 𝑁 𝑎𝑛 𝑡𝑛𝒚 𝒙𝑛 − 1 + 𝜉𝑛 − 𝑛=1 𝑁 𝜇𝑛𝜉𝑛 • As before, we can simplify the Lagrangian significantly: 𝜕𝐿 𝜕𝒘 = 0 ⇒ 𝒘 = 𝑛=1 𝑁 𝑎𝑛𝑡𝑛𝒙𝑛 𝜕𝐿 𝜕𝜉𝑛 = 0 ⇒ 𝐶 − 𝑎𝑛 − 𝜇𝑛 = 0 ⇒ 𝑎𝑛 = 𝐶 − 𝜇𝑛 73 𝜕𝐿 𝜕𝑏 = 0 ⇒ 𝑛=1 𝑁 𝑎𝑛𝑡𝑛 = 0
  • 74. Lagrangian Function • Lagrangian 𝐿 𝒘, 𝑏, 𝝃, 𝒂, 𝝁 : 1 2 𝒘 𝟐 + 𝐶 𝑛=1 𝑁 𝜉𝑛 − 𝑛=1 𝑁 𝑎𝑛 𝑡𝑛𝒚 𝒙𝑛 − 1 + 𝜉𝑛 − 𝑛=1 𝑁 𝜇𝑛𝜉𝑛 • In computing 𝐿 𝒘, 𝑏, 𝝃, 𝒂, 𝝁 , 𝜉𝑛 contributes the following value: 𝐶𝜉𝑛 − 𝑎𝑛 𝜉𝑛 − 𝜇𝑛𝜉𝑛 = 𝜉𝑛(𝐶 − 𝑎𝑛 − 𝜇𝑛) • As we saw in the previous slide, 𝐶 − 𝑎𝑛 − 𝜇𝑛 = 0. • Therefore, we can eliminate all those occurrences of 𝜉𝑛. 74
  • 75. Lagrangian Function • Lagrangian 𝐿 𝒘, 𝑏, 𝝃, 𝒂, 𝝁 : 1 2 𝒘 𝟐 + 𝐶 𝑛=1 𝑁 𝜉𝑛 − 𝑛=1 𝑁 𝑎𝑛 𝑡𝑛𝒚 𝒙𝑛 − 1 + 𝜉𝑛 − 𝑛=1 𝑁 𝜇𝑛𝜉𝑛 • In computing 𝐿 𝒘, 𝑏, 𝝃, 𝒂, 𝝁 , 𝜉𝑛 contributes the following value: 𝐶𝜉𝑛 − 𝑎𝑛 𝜉𝑛 − 𝜇𝑛𝜉𝑛 = 𝜉𝑛(𝐶 − 𝑎𝑛 − 𝜇𝑛) • As we saw in the previous slide, 𝐶 − 𝑎𝑛 − 𝜇𝑛 = 0. • Therefore, we can eliminate all those occurrences of 𝜉𝑛. • By eliminating 𝜉𝑛 we also eliminate all appearances of 𝐶 and 𝜇𝑛. 75
  • 76. Lagrangian Function • By eliminating 𝜉𝑛 and 𝜇𝑛, we get: 𝐿 𝒘, 𝑏, 𝒂 = 1 2 𝒘 𝟐 − 𝑛=1 𝑁 𝑎𝑛 𝑡𝑛 𝒘𝑇𝒙𝑛 + 𝑏 − 1 • This is exactly the Lagrangian we had for the linearly separable case. • Following the same steps as in the linearly separable case, we can simplify even more to: 𝐿 𝒂 = 𝑛=1 𝑁 𝑎𝑛 − 1 2 𝒂𝑇 𝑸𝒂 76
  • 77. Lagrangian Function • So, we have ended up with the same Lagrangian as in the linearly separable case: 𝐿 𝒂 = 𝑛=1 𝑁 𝑎𝑛 − 1 2 𝒂𝑇 𝑸𝒂 • There is a small difference, however, in the constraints. 77 Linearly separable case: 0 ≤ 𝑎𝑛 Linearly inseparable case: 0 ≤ 𝑎𝑛 ≤ 𝐶
  • 78. Lagrangian Function • So, we have ended up with the same Lagrangian as in the linearly separable case: 𝐿 𝒂 = 𝑛=1 𝑁 𝑎𝑛 − 1 2 𝒂𝑇 𝑸𝒂 • Where does constraint 0 ≤ 𝑎𝑛 ≤ 𝐶 come from? • 0 ≤ 𝑎𝑛 because 𝑎𝑛 is a Lagrange multiplier. • 𝑎𝑛 ≤ 𝐶 comes from the fact that we showed earlier, that 𝑎𝑛 = 𝐶 − 𝜇𝑛. – Since 𝜇𝑛 is also a Lagrange multiplier, 𝜇𝑛 ≥ 0 and thus 𝑎𝑛 ≤ 𝐶. 78
  • 79. Using Quadratic Programming • Quadratic programming: 𝒖opt = argmin𝑢 1 2 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 subject to constraint: 𝑯𝒖 ≤ 𝒛 • SVM problem: find 𝒂opt = argmin 𝒂 1 2 𝒂𝑇 𝑸𝒂 − 𝑛=1 𝑁 𝑎𝑛 subject to constraints: 0 ≤ 𝑎𝑛 ≤ 𝐶, 𝑛=1 𝑁 𝑎𝑛𝑡𝑛 = 0 • Again, we must find values for 𝑸, 𝒔, 𝑯, 𝒛 that convert the SVM problem into a quadratic programming problem. • Values for 𝑸 and 𝒔 are the same as in the linearly separable case, since in both cases we minimize 1 2 𝒂𝑇 𝑸𝒂 − 𝑛=1 𝑁 𝑎𝑛. 79
  • 80. Using Quadratic Programming • Quadratic programming: 𝒖opt = argmin𝑢 1 2 𝒖𝑇𝑸𝒖 + 𝒔𝑇𝒖 subject to constraint: 𝑯𝒖 ≤ 𝒛 • SVM problem: find 𝒂opt = argmin 𝒂 1 2 𝒂𝑇 𝑸𝒂 − 𝑛=1 𝑁 𝑎𝑛 subject to constraints: 0 ≤ 𝑎𝑛 ≤ 𝐶, 𝑛=1 𝑁 𝑎𝑛𝑡𝑛 = 0 • 𝑸 is an 𝑁 × 𝑁 matrix such that 𝑄𝑚𝑛 = 𝑡𝑛𝑡𝑚 𝒙𝑛 𝑇 𝒙𝑚. • 𝒖 = 𝒂, and 𝒔 = −1 −1 ⋯ −1 𝑁 . 80
  • 81. Using Quadratic Programming • Quadratic programming constraint: 𝑯𝒖 ≤ 𝒛 • SVM problem constraints: 0 ≤ 𝑎𝑛 ≤ 𝐶, 𝑛=1 𝑁 𝑎𝑛𝑡𝑛 = 0 • The top 𝑁 rows of 𝑯 are the negation of the 𝑁 × 𝑁 identity matrix. • Rows 𝑁 + 1 to 2𝑁 of 𝑯 are the 𝑁 × 𝑁 identity matrix. • Row 2𝑁 + 1 of 𝑯 is the transpose of vector 𝒕 of target outputs. • Row 2𝑁 + 2 of 𝑯 is the negation of the previous row. 81 Define: 𝑯 = −1, 0, 0, … , 0 0, −1, 0, … , 0 ⋯ 0, 0, 0, … , −1 1, 0, 0, … , 0 0, 1, 0, … , 0 … 0, 0, 0, … , 1 𝑡1, 𝑡2, 𝑡3, … , 𝑡𝑛 −𝑡1, −𝑡2, −𝑡3, … , −𝑡𝑛 𝒛 = 0 0 ⋯ 0 𝐶 𝐶 … 𝐶 0 0 rows 1 to 𝑁 rows (N + 1) to 2𝑁 rows (2N + 1) and (2N + 2) rows 1 to 𝑁, set to 0. rows (N + 1) to 2𝑁, set to 𝐶. rows 2N + 1, 2N + 2, set to 0.
  • 82. Using Quadratic Programming • Quadratic programming constraint: 𝑯𝒖 ≤ 𝒛 • SVM problem constraints: 0 ≤ 𝑎𝑛 ≤ 𝐶, 𝑛=1 𝑁 𝑎𝑛𝑡𝑛 = 0 • If 1 ≤ 𝑛 ≤ 𝑁, the 𝑛-th row of 𝑯𝒖 is −𝑎𝑛, and the 𝑛-th row of 𝒛 is 0. • Thus, the 𝑛-th rows of 𝑯𝒖 and 𝒛 capture constraint −𝑎𝑛 ≤ 0 ⇒ 𝑎𝑛 ≥ 0. 82 Define: 𝑯 = −1, 0, 0, … , 0 0, −1, 0, … , 0 ⋯ 0, 0, 0, … , −1 1, 0, 0, … , 0 0, 1, 0, … , 0 … 0, 0, 0, … , 1 𝑡1, 𝑡2, 𝑡3, … , 𝑡𝑛 −𝑡1, −𝑡2, −𝑡3, … , −𝑡𝑛 𝒛 = 0 0 ⋯ 0 𝐶 𝐶 … 𝐶 0 0 rows 1 to 𝑁 rows (N + 1) to 2𝑁 rows (2N + 1) and (2N + 2) rows 1 to 𝑁, set to 0. rows (N + 1) to 2𝑁, set to 𝐶. rows 2N + 1, 2N + 2, set to 0.
  • 83. Using Quadratic Programming • Quadratic programming constraint: 𝑯𝒖 ≤ 𝒛 • SVM problem constraints: 0 ≤ 𝑎𝑛 ≤ 𝐶, 𝑛=1 𝑁 𝑎𝑛𝑡𝑛 = 0 • If (N + 1) ≤ 𝑛 ≤ 2𝑁, the 𝑛-th row of 𝑯𝒖 is 𝑎𝑛−𝑁, and the 𝑛-th row of 𝒛 is 𝐶. • Thus, the 𝑛-th rows of 𝑯𝒖 and 𝒛 capture constraint 𝑎𝑛−𝑁 ≤ 𝐶. 83 Define: 𝑯 = −1, 0, 0, … , 0 0, −1, 0, … , 0 ⋯ 0, 0, 0, … , −1 1, 0, 0, … , 0 0, 1, 0, … , 0 … 0, 0, 0, … , 1 𝑡1, 𝑡2, 𝑡3, … , 𝑡𝑛 −𝑡1, −𝑡2, −𝑡3, … , −𝑡𝑛 𝒛 = 0 0 ⋯ 0 𝐶 𝐶 … 𝐶 0 0 rows 1 to 𝑁 rows (N + 1) to 2𝑁 rows (2N + 1) and (2N + 2) rows 1 to 𝑁, set to 0. rows (N + 1) to 2𝑁, set to 𝐶. rows 2N + 1, 2N + 2, set to 0.
  • 84. Using Quadratic Programming • Quadratic programming constraint: 𝑯𝒖 ≤ 𝒛 • SVM problem constraints: 0 ≤ 𝑎𝑛 ≤ 𝐶, 𝑛=1 𝑁 𝑎𝑛𝑡𝑛 = 0 • If 𝑛 = 2𝑁 + 1, the 𝑛-th row of 𝑯𝒖 is 𝒕, and the 𝑛-th row of 𝒛 is 0. • Thus, the 𝑛-th rows of 𝑯𝒖 and 𝒛 capture constraint 𝑛=1 𝑁 𝑎𝑛𝑡𝑛 ≤ 0. 84 Define: 𝑯 = −1, 0, 0, … , 0 0, −1, 0, … , 0 ⋯ 0, 0, 0, … , −1 1, 0, 0, … , 0 0, 1, 0, … , 0 … 0, 0, 0, … , 1 𝑡1, 𝑡2, 𝑡3, … , 𝑡𝑛 −𝑡1, −𝑡2, −𝑡3, … , −𝑡𝑛 𝒛 = 0 0 ⋯ 0 𝐶 𝐶 … 𝐶 0 0 rows 1 to 𝑁 rows (N + 1) to 2𝑁 rows (2N + 1) and (2N + 2) rows 1 to 𝑁, set to 0. rows (N + 1) to 2𝑁, set to 𝐶. rows 2N + 1, 2N + 2, set to 0.
  • 85. Using Quadratic Programming • Quadratic programming constraint: 𝑯𝒖 ≤ 𝒛 • SVM problem constraints: 0 ≤ 𝑎𝑛 ≤ 𝐶, 𝑛=1 𝑁 𝑎𝑛𝑡𝑛 = 0 • If 𝑛 = 2𝑁 + 2, the 𝑛-th row of 𝑯𝒖 is −𝒕, and the 𝑛-th row of 𝒛 is 0. • Thus, the 𝑛-th rows of 𝑯𝒖 and 𝒛 capture constraint − 𝑛=1 𝑁 𝑎𝑛𝑡𝑛 ≤ 0 ⇒ 𝑛=1 𝑁 𝑎𝑛𝑡𝑛 ≥ 0. 85 Define: 𝑯 = −1, 0, 0, … , 0 0, −1, 0, … , 0 ⋯ 0, 0, 0, … , −1 1, 0, 0, … , 0 0, 1, 0, … , 0 … 0, 0, 0, … , 1 𝑡1, 𝑡2, 𝑡3, … , 𝑡𝑛 −𝑡1, −𝑡2, −𝑡3, … , −𝑡𝑛 𝒛 = 0 0 ⋯ 0 𝐶 𝐶 … 𝐶 0 0 rows 1 to 𝑁 rows (N + 1) to 2𝑁 rows (2N + 1) and (2N + 2) rows 1 to 𝑁, set to 0. rows (N + 1) to 2𝑁, set to 𝐶. rows 2N + 1, 2N + 2, set to 0.
  • 86. Using Quadratic Programming • Quadratic programming constraint: 𝑯𝒖 ≤ 𝒛 • SVM problem constraints: 0 ≤ 𝑎𝑛 ≤ 𝐶, 𝑛=1 𝑁 𝑎𝑛𝑡𝑛 = 0 • Thus, the last two rows of 𝑯𝒖 and 𝒛 in combination capture constraint 𝑛=1 𝑁 𝑎𝑛𝑡𝑛 = 0 86 Define: 𝑯 = −1, 0, 0, … , 0 0, −1, 0, … , 0 ⋯ 0, 0, 0, … , −1 1, 0, 0, … , 0 0, 1, 0, … , 0 … 0, 0, 0, … , 1 𝑡1, 𝑡2, 𝑡3, … , 𝑡𝑛 −𝑡1, −𝑡2, −𝑡3, … , −𝑡𝑛 𝒛 = 0 0 ⋯ 0 𝐶 𝐶 … 𝐶 0 0 rows 1 to 𝑁 rows (N + 1) to 2𝑁 rows (2N + 1) and (2N + 2) rows 1 to 𝑁, set to 0. rows (N + 1) to 2𝑁, set to 𝐶. rows 2N + 1, 2N + 2, set to 0.
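For the soft-margin case, only the constraints change relative to the hard-margin dual sketch given earlier. A hedged sketch of the construction described on slides 81–86 (solve_qp is again the assumed QP wrapper, not code from the slides):

```python
import numpy as np

def train_svm_dual_soft(X, t, C, solve_qp):
    """Soft-margin dual: same objective, constraints 0 <= a_n <= C."""
    N = X.shape[0]
    Q = np.outer(t, t) * (X @ X.T)
    s = -np.ones(N)
    H = np.vstack([-np.eye(N),            # rows 1..N:          -a_n <= 0
                   np.eye(N),             # rows N+1..2N:        a_n <= C
                   t.reshape(1, -1),      # rows 2N+1, 2N+2:  sum a_n t_n = 0
                   -t.reshape(1, -1)])
    z = np.concatenate([np.zeros(N), C * np.ones(N), [0.0, 0.0]])
    return solve_qp(Q, s, H, z)
```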
  • 87. Using the Solution • Quadratic programming, given the inputs 𝑸, 𝒔, 𝑯, 𝒛 defined in the previous slides, outputs the optimal value for vector 𝒂. • The value of 𝑏 is computed as in the linearly separable case: 𝑏 = 1 𝑁𝑆 𝑛∈𝑆 𝑡𝑛 − 𝑚∈𝑆 𝑎𝑚𝑡𝑚 𝒙𝑚 𝑇𝒙𝑛 • Classification of input x is done as in the linearly separable case: 𝑦 𝒙 = 𝑛∈𝑆 𝑎𝑛𝑡𝑛 𝒙𝑛 𝑇𝒙 + 𝑏 87
  • 88. The Disappearing 𝒘 • We have used vector 𝒘, to define the decision boundary 𝑦 𝒙 = 𝒘𝑇𝒙 + 𝑏. • However, 𝒘 does not need to be either computed, or used. • Quadratic programming computes values 𝑎𝑛. • Using those values 𝑎𝑛, we compute 𝑏: 𝑏 = 1 𝑁𝑆 𝑛∈𝑆 𝑡𝑛 − 𝑚∈𝑆 𝑎𝑚𝑡𝑚 𝒙𝑚 𝑇𝒙𝑛 • Using those values for 𝑎𝑛 and 𝑏, we can classify test objects 𝒙: 𝑦 𝒙 = 𝑛∈𝑆 𝑎𝑛𝑡𝑛 𝒙𝑛 𝑇𝒙 + 𝑏 • If we know the values for 𝑎𝑛 and 𝑏, we do not need 𝒘. 88
  • 89. The Role of Training Inputs • Where, during training, do we use input vectors? • Where, during classification, do we use input vectors? • Overall, input vectors are only used in two formulas: 1. During training, we use vectors 𝒙𝑛 to define matrix 𝑸, where: 𝑄𝑚𝑛 = 𝑡𝑛𝑡𝑚 𝒙𝑛 𝑇 𝒙𝑚 2. During classification, we use training vectors 𝒙𝑛 and test input 𝒙 in this formula: 𝑦 𝒙 = 𝑛∈𝑆 𝑎𝑛𝑡𝑛 𝒙𝑛 𝑇 𝒙 + 𝑏 • In both formulas, input vectors are only used by taking their dot products. 89
  • 90. The Role of Training Inputs • To make this more clear, define a kernel function 𝑘(𝒙, 𝒙′) as: 𝑘 𝒙, 𝒙′ = 𝒙𝑇 𝒙′ • During training, we use vectors 𝒙𝑛 to define matrix 𝑸, where: 𝑄𝑚𝑛 = 𝑡𝑛𝑡𝑚𝑘(𝒙𝑛, 𝒙𝑚) • During classification, we use training vectors 𝒙𝑛 and test input 𝒙 in this formula: 𝑦 𝒙 = 𝑛∈𝑆 𝑎𝑛𝑡𝑛𝑘(𝒙𝑛, 𝒙) + 𝑏 • In both formulas, input vectors are used only through function 𝑘. 90
  • 91. The Kernel Trick • We have defined kernel function 𝑘 𝒙, 𝒙′ = 𝒙𝑇 𝒙′. • During training, we use vectors 𝒙𝑛 to define matrix 𝑸, where: 𝑄𝑚𝑛 = 𝑡𝑛𝑡𝑚𝑘(𝒙𝑛, 𝒙𝑚) • During classification, we use this formula: 𝑦 𝒙 = 𝑛∈𝑆 𝑎𝑛𝑡𝑛𝑘(𝒙𝑛, 𝒙) + 𝑏 • What if we defined 𝑘 𝒙, 𝒙′ differently? • The SVM formulation (both for training and for classification) would remain exactly the same. • In the SVM formulation, the kernel 𝑘 𝒙, 𝒙′ is a black box. – You can define 𝑘 𝒙, 𝒙′ any way you like. 91
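A sketch of what "the kernel is a black box" means in code: both the training matrix 𝑸 and the decision function take an arbitrary kernel function k as an argument, and the inputs appear nowhere else. This is illustrative code, not code from the slides.

```python
import numpy as np

def kernel_Q(X, t, k):
    """Q[m, n] = t_n t_m k(x_n, x_m); inputs appear only through k."""
    N = X.shape[0]
    K = np.array([[k(X[n], X[m]) for m in range(N)] for n in range(N)])
    return np.outer(t, t) * K

def kernel_classify(x, X, t, a, b, k, tol=1e-8):
    """y(x) = sum over support vectors of a_n t_n k(x_n, x) + b."""
    S = np.where(a > tol)[0]
    return np.sign(sum(a[n] * t[n] * k(X[n], x) for n in S) + b)
```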
  • 92. A Different Kernel • Let 𝒙 = (𝑥1, 𝑥2) and 𝒛 = (𝑧1, 𝑧2) be 2-dimensional vectors. • Consider this alternative definition for the kernel: 𝑘(𝒙, 𝒛) = (1 + 𝒙𝑇𝒛)2 • Then: 𝑘(𝒙, 𝒛) = (1 + 𝒙𝑇𝒛)2 = (1 + 𝑥1𝑧1 + 𝑥2𝑧2)2 = 1 + 2𝑥1𝑧1 + 2𝑥2𝑧2 + (𝑥1)2(𝑧1)2 + 2𝑥1𝑧1𝑥2𝑧2 + (𝑥2)2(𝑧2)2 • Define a basis function 𝜑(𝒙) as: 𝜑(𝒙) = (1, √2 𝑥1, √2 𝑥2, (𝑥1)2, √2 𝑥1𝑥2, (𝑥2)2), so that 𝑘(𝒙, 𝒛) = 𝜑(𝒙)𝑇𝜑(𝒛). 92
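The equivalence on this slide is easy to check numerically. The sketch below uses the basis function 𝜑(𝒙) = (1, √2 𝑥1, √2 𝑥2, 𝑥1², √2 𝑥1𝑥2, 𝑥2²); the specific test vectors are made up for illustration.

```python
import numpy as np

def phi(x):
    """Feature map for the quadratic kernel (1 + x^T z)^2 in two dimensions."""
    x1, x2 = x
    r = np.sqrt(2.0)
    return np.array([1.0, r * x1, r * x2, x1 ** 2, r * x1 * x2, x2 ** 2])

x = np.array([0.5, -1.0])        # made-up test vectors
z = np.array([2.0, 3.0])
print((1.0 + x @ z) ** 2)        # kernel value
print(phi(x) @ phi(z))           # same number: a dot product in feature space
```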
  • 93. Kernels and Basis Functions • In general, kernels make it easy to incorporate basis functions into SVMs: – Define 𝜑 𝒙 any way you like. – Define 𝑘 𝒙, 𝒛 = 𝜑 𝒙 𝑻𝜑 𝒛 . • The kernel function represents a dot product, but in a (typically) higher-dimensional feature space compared to the original space of 𝒙 and 𝒛. 93
  • 94. Polynomial Kernels • Let 𝒙 and 𝒛 be 𝐷-dimensional vectors. • A polynomial kernel of degree 𝑑 is defined as: 𝑘(𝒙, 𝒛) = (𝑐 + 𝒙𝑇𝒛)𝑑 • The kernel 𝑘(𝒙, 𝒛) = (1 + 𝒙𝑇𝒛)2 that we saw a couple of slides back was a quadratic kernel. • Parameter 𝑐 controls the trade-off between the influence of higher-order and lower-order terms. – Increasing values of 𝑐 give increasing influence to lower-order terms. 94
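A minimal sketch of the polynomial kernel as a plain function; the remark about scikit-learn is an aside, noting that its 'poly' kernel is (γ 𝒙𝑇𝒛 + coef0)^degree, so coef0 and degree play the roles of 𝑐 and 𝑑 up to the γ scaling.

```python
import numpy as np

def poly_kernel(x, z, c=1.0, d=2):
    """Polynomial kernel (c + x^T z)^d from slide 94."""
    return (c + np.dot(x, z)) ** d

# Aside: scikit-learn's SVC(kernel='poly', degree=d, coef0=c) uses the closely
# related kernel (gamma * x^T z + coef0)^degree.
```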
  • 95. Decision boundary with polynomial kernel of degree 1. 95 This is identical to the result using the standard dot product as kernel. Polynomial Kernels – An Easy Case
  • 96. Decision boundary with polynomial kernel of degree 2. 96 The decision boundary is not linear anymore. Polynomial Kernels – An Easy Case
  • 97. Decision boundary with polynomial kernel of degree 3. 97 Polynomial Kernels – An Easy Case
  • 98. Decision boundary with polynomial kernel of degree 1. 98 Polynomial Kernels – A Harder Case
  • 99. Decision boundary with polynomial kernel of degree 2. 99 Polynomial Kernels – A Harder Case
  • 100. Decision boundary with polynomial kernel of degree 3. 100 Polynomial Kernels – A Harder Case
  • 101. Decision boundary with polynomial kernel of degree 4. 101 Polynomial Kernels – A Harder Case
  • 102. Decision boundary with polynomial kernel of degree 5. 102 Polynomial Kernels – A Harder Case
  • 103. Decision boundary with polynomial kernel of degree 6. 103 Polynomial Kernels – A Harder Case
  • 104. Decision boundary with polynomial kernel of degree 7. 104 Polynomial Kernels – A Harder Case
  • 105. Decision boundary with polynomial kernel of degree 8. 105 Polynomial Kernels – A Harder Case
  • 106. Decision boundary with polynomial kernel of degree 9. 106 Polynomial Kernels – A Harder Case
  • 107. Decision boundary with polynomial kernel of degree 10. 107 Polynomial Kernels – A Harder Case
  • 108. Decision boundary with polynomial kernel of degree 20. 108 Polynomial Kernels – A Harder Case
  • 109. Decision boundary with polynomial kernel of degree 100. 109 Polynomial Kernels – A Harder Case
  • 110. RBF/Gaussian Kernels • The Radial Basis Function (RBF) kernel, also known as Gaussian kernel, is defined as: 𝑘𝜎(𝒙, 𝒛) = exp(−‖𝒙 − 𝒛‖2 / (2𝜎2)) • Given 𝜎, the value of 𝑘𝜎 𝒙, 𝒛 only depends on the distance between 𝒙 and 𝒛. – 𝑘𝜎 𝒙, 𝒛 decreases exponentially with the distance between 𝒙 and 𝒛. • Parameter 𝜎 is chosen manually. – Parameter 𝜎 specifies how fast 𝑘𝜎 𝒙, 𝒛 decreases as 𝒙 moves away from 𝒛. 110
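A one-function sketch of the RBF kernel; the note on scikit-learn's gamma parameter is an aside (its 'rbf' kernel computes exp(−γ‖𝒙 − 𝒛‖²), so γ corresponds to 1/(2𝜎²)).

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """RBF/Gaussian kernel exp(-||x - z||^2 / (2 sigma^2)) from slide 110."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

# Aside: scikit-learn's SVC(kernel='rbf', gamma=g) computes exp(-g ||x - z||^2),
# so g plays the role of 1 / (2 sigma^2).
```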
  • 111. RBF Output Vs. Distance • X axis: distance between 𝒙 and 𝒛. • Y axis: 𝑘𝜎 𝒙, 𝒛 , with 𝜎 = 3. 111
  • 112. RBF Output Vs. Distance • X axis: distance between 𝒙 and 𝒛. • Y axis: 𝑘𝜎 𝒙, 𝒛 , with 𝜎 = 2. 112
  • 113. RBF Output Vs. Distance • X axis: distance between 𝒙 and 𝒛. • Y axis: 𝑘𝜎 𝒙, 𝒛 , with 𝜎 =1. 113
  • 114. RBF Output Vs. Distance • X axis: distance between 𝒙 and 𝒛. • Y axis: 𝑘𝜎 𝒙, 𝒛 , with 𝜎 = 0.5. 114
  • 115. RBF Kernels – An Easier Dataset Decision boundary with a linear kernel. 115
  • 116. RBF Kernels – An Easier Dataset Decision boundary with an RBF kernel. For this dataset, this is a relatively large value for 𝜎, and it produces a boundary that is almost linear. 116
  • 117. RBF Kernels – An Easier Dataset 117 Decision boundary with an RBF kernel. Decreasing the value of 𝜎 leads to less linear boundaries.
  • 118. RBF Kernels – An Easier Dataset Decision boundary with an RBF kernel. Decreasing the value of 𝜎 leads to less linear boundaries. 118
  • 119. RBF Kernels – An Easier Dataset Decision boundary with an RBF kernel. Decreasing the value of 𝜎 leads to less linear boundaries. 119
  • 120. RBF Kernels – An Easier Dataset Decision boundary with an RBF kernel. Decreasing the value of 𝜎 leads to less linear boundaries. 120
  • 121. RBF Kernels – An Easier Dataset Decision boundary with an RBF kernel. Decreasing the value of 𝜎 leads to less linear boundaries. 121
  • 122. RBF Kernels – An Easier Dataset Decision boundary with an RBF kernel. Note that smaller values of 𝜎 increase dangers of overfitting. 122
  • 123. RBF Kernels – An Easier Dataset Decision boundary with an RBF kernel. Note that smaller values of 𝜎 increase dangers of overfitting. 123
  • 124. RBF Kernels – A Harder Dataset Decision boundary with a linear kernel. 124
  • 125. RBF Kernels – A Harder Dataset Decision boundary with an RBF kernel. The boundary is almost linear. 125
  • 126. RBF Kernels – A Harder Dataset Decision boundary with an RBF kernel. The boundary now is clearly nonlinear. 126
  • 127. RBF Kernels – A Harder Dataset Decision boundary with an RBF kernel. 127
  • 128. RBF Kernels – A Harder Dataset Decision boundary with an RBF kernel. 128
  • 129. RBF Kernels – A Harder Dataset Decision boundary with an RBF kernel. Again, smaller values of 𝜎 increase dangers of overfitting. 129
  • 130. RBF Kernels – A Harder Dataset Decision boundary with an RBF kernel. Again, smaller values of 𝜎 increase dangers of overfitting. 130
  • 131. RBF Kernels and Basis Functions • The RBF kernel is defined as: 𝑘𝜎 𝒙, 𝒛 = 𝑒 − 𝒙−𝒛 2 2𝜎2 • Is the RBF kernel equivalent to taking the dot product in some feature space? • In other words, is there any basis function 𝜑 such that 𝑘𝜎 𝒙, 𝒛 = 𝜑 𝒙 𝑇𝜑 𝒛 ? • The answer is "yes": – There exists such a function 𝜑, but its output is infinite- dimensional. – We will not get into more details here. 131
  • 132. Kernels for Non-Vector Data • So far, all our methods have been taking real-valued vectors as input. – The inputs have been elements of ℝ𝐷, for some 𝐷 ≥ 1. • However, there are many interesting problems where the input is not such real-valued vectors. • Examples??? 132
  • 133. Kernels for Non-Vector Data • So far, all our methods have been taking real-valued vectors as input. – The inputs have been elements of ℝ𝐷, for some 𝐷 ≥ 1. • However, there are many interesting problems where the input is not such real-valued vectors. • The inputs can be strings, like "cat", "elephant", "dog". • The inputs can be sets, such as {1, 5, 3}, {5, 1, 2, 7}. – Sets are NOT vectors. They have very different properties. – As sets, {1, 5, 3} = {5, 3, 1}. – As vectors, 1, 5, 3 ≠ (5, 3, 1) • There are many other types of non-vector data… • SVMs can be applied to such data, as long as we define an appropriate kernel function. 133
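As an illustration (not from the slides) of a kernel on non-vector data: for set-valued inputs, the intersection size is a valid kernel, because it equals the dot product of the sets' indicator vectors.

```python
def set_kernel(A, B):
    """k(A, B) = |A intersection B|: the dot product of indicator vectors,
    hence a valid kernel for set-valued inputs (illustration only)."""
    return float(len(set(A) & set(B)))

print(set_kernel({1, 5, 3}, {5, 1, 2, 7}))   # 2.0 (shared elements: 1 and 5)
```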
  • 134. Training Time Complexity • Solving the quadratic programming problem takes 𝑂(𝑑3) time, where 𝑑 is the dimensionality of vector 𝒖. – In other words, 𝑑 is the number of values we want to estimate. • If we use quadratic programming to compute directly 𝒘 and 𝑏, then we estimate 𝐷 + 1 values. – The time complexity is 𝑂(𝐷3), where 𝐷 is the dimensionality of input vectors 𝒙 (or the dimensionality of 𝜑 𝒙 , if we use a basis function). • If we use quadratic programming to compute vector 𝒂 = (𝑎1, … , 𝑎𝑁), then we estimate 𝑁 values. – The time complexity is 𝑂(𝑁3), where 𝑁 is the number of training inputs. • Which one is faster? 134
  • 135. Training Time Complexity • If we use quadratic programming to compute directly 𝒘 and 𝑏, then the time complexity is 𝑂(𝐷3). – If we use no basis function, 𝐷 is the dimensionality of input vectors 𝒙. – If we use a basis function 𝜑, 𝐷 is the dimensionality of 𝜑 𝒙 . • If we use quadratic programming to compute vector 𝒂 = (𝑎1, … , 𝑎𝑁), the time complexity is 𝑂(𝑁3). • For linear SVMs (i.e., SVMs with linear kernels, that use the regular dot product), usually 𝐷 is much smaller than 𝑁. • If we use RBF kernels, 𝜑 𝒙 is infinite-dimensional. – Computing 𝒘 directly is not an option in that case, but computing vector 𝒂 still takes 𝑂(𝑁3) time. • If you want to use the kernel trick, then there is no choice, it takes 𝑂(𝑁3) time to do training. 135
  • 136. SVMs for Multiclass Problems • As usual, you can always train one-vs.-all SVMs if there are more than two classes. • Other, more complicated methods are also available. • You can also train what is called "all-pairs" classifiers: – Each SVM is trained to discriminate between two classes. – The number of SVMs is quadratic to the number of classes. • All-pairs classifiers can be used in different ways to classify an input object. – Each pairwise classifier votes for one of the two classes it was trained on. Classification time is quadratic to the number of classes. – There is an alternative method, called DAGSVM, where the all-pairs classifiers are organized in a directed acyclic graph, and classification time is linear to the number of classes. 136
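A hedged sketch of the one-vs.-all scheme using scikit-learn, assuming the library is available; the synthetic three-class data is made up purely for illustration.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Made-up three-class data: 20 points per class around three centers.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in centers])
y = np.repeat([0, 1, 2], 20)

clf = OneVsRestClassifier(SVC(kernel='rbf', C=1.0, gamma=0.5))
clf.fit(X, y)
print(clf.predict([[2.9, 0.1], [0.1, 2.8]]))   # should report classes 1 and 2
```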
  • 137. SVMs: Recap • Advantages: – Training finds globally best solution. • No need to worry about local optima, or iterations. – SVMs can define complex decision boundaries. • Disadvantages: – Training time is cubic to the number of training data. This makes it hard to apply SVMs to large datasets. – High-dimensional kernels increase the risk of overfitting. • Usually larger training sets help reduce overfitting, but SVMs cannot be applied to large training sets due to cubic time complexity. – Some choices must still be made manually. • Choice of kernel function. • Choice of parameter 𝐶 in formula 𝐶 𝑛=1 𝑁 𝜉𝑛 + 1 2 𝒘 𝟐 . 137