Support vector machines (SVMs)
Lecture 2
David Sontag
New York University
Slides adapted from Luke Zettlemoyer, Vibhav Gogate,
and Carlos Guestrin
Geometry of linear separators
(see blackboard)
A plane can be specified as the set of all points x = x0 + α u + β v, where x0 is a vector from the origin to a point in the plane and u, v are two non-parallel directions in the plane. (Barber, Section 29.1.1-4)
Alternatively, it can be specified by a normal vector (we will call this w): a point x lies in the plane exactly when the dot product w · x equals a fixed scalar, so only this dot product needs to be specified (we will call it the offset, b).
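As a quick numerical illustration (not from the lecture; the specific point x0, directions u, v, and the use of a 3-D cross product are my own choices), the two descriptions of the same plane can be checked in numpy. Here the offset is taken as b = w · x0, so the plane is the set of points with w · x = b:

```python
import numpy as np

x0 = np.array([1.0, 2.0, 3.0])      # vector from origin to a point in the plane
u = np.array([1.0, 0.0, -1.0])      # two non-parallel directions in the plane
v = np.array([0.0, 1.0, 1.0])

w = np.cross(u, v)                  # normal vector
b = w @ x0                          # the dot product we need to specify (offset)

# Any point x0 + alpha*u + beta*v lies in the plane, i.e. satisfies w . x = b.
for alpha, beta in [(0.5, -2.0), (3.0, 1.0)]:
    x = x0 + alpha * u + beta * v
    print(np.isclose(w @ x, b))     # True
```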
Linear Separators
• If training data is linearly separable, the perceptron is guaranteed to find some linear separator
• Which of these separators is optimal?
• SVMs (Vapnik, 1990s) choose the linear separator with the largest margin
  – Good according to intuition, theory, practice
  – SVMs became famous when, using images as input, they gave accuracy comparable to neural networks with hand-designed features on a handwriting recognition task
Support Vector Machine (SVM)
V. Vapnik
Robust to
outliers!
Support vector machines: 3 key ideas
1. Use optimization to find a solution (i.e., a hyperplane) with few errors
2. Seek a large margin separator to improve generalization
3. Use the kernel trick to make large feature spaces computationally efficient
[Figure: separating hyperplane w · x + b = 0 with margin lines w · x + b = +1 and w · x + b = −1]
Finding a perfect classifier (when one exists)
using linear programming
For every data point (xt, yt), enforce the constraint w · xt + b ≥ +1 for yt = +1, and w · xt + b ≤ −1 for yt = −1.
Equivalently, we want to satisfy all of the linear constraints yt (w · xt + b) ≥ 1.
This linear program can be efficiently
solved using algorithms such as simplex,
interior point, or ellipsoid
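As a rough illustration of the feasibility problem above (not from the slides; the toy data and variable names are mine), an off-the-shelf LP solver such as scipy.optimize.linprog can be given a zero objective so that any returned point satisfies all constraints yt (w · xt + b) ≥ 1:

```python
import numpy as np
from scipy.optimize import linprog

# Toy linearly separable data: rows of X are points, labels y in {+1, -1}.
X = np.array([[2.0, 2.0], [3.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([+1, +1, -1, -1])

n, d = X.shape
# Decision variables z = [w_1, ..., w_d, b].
# Constraint y_t (w . x_t + b) >= 1  <=>  -y_t * [x_t, 1] . z <= -1.
A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
b_ub = -np.ones(n)

# Zero objective: we only care about feasibility.
res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * (d + 1), method="highs")

if res.success:
    w, b = res.x[:d], res.x[d]
    print("separator found:", w, b)
    print("functional margins:", y * (X @ w + b))   # all >= 1
else:
    print("no perfect separator exists (data not linearly separable)")
```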
Finding a perfect classifier (when one exists)
using linear programming
Example of a 2-dimensional linear programming (feasibility) problem in weight space.
For SVMs, each data point gives one inequality: yt (w · xt + b) ≥ 1.
What happens if the data set is not linearly separable?
Minimizing number of errors (0-1 loss)
• Try to find weights that violate as few constraints as possible?
• Formalize this using the 0-1 loss, which counts #(mistakes):
min_{w,b} Σj ℓ0,1(yj, w · xj + b),   where ℓ0,1(y, ŷ) = 1[y ≠ sign(ŷ)]
• Unfortunately, minimizing the 0-1 loss is NP-hard in the worst case
  – Non-starter. We need another approach.
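A minimal numpy sketch (my own, not the lecture's) of the 0-1 objective above; note this only evaluates the loss for a given (w, b), since actually minimizing it is the NP-hard part:

```python
import numpy as np

def zero_one_loss(w, b, X, y):
    """Number of mistakes: sum_j 1[y_j != sign(w . x_j + b)]."""
    return int(np.sum(np.sign(X @ w + b) != y))
```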
Key idea #1: Allow for slack
For each data point:
• If functional margin ≥ 1, don't care
• If functional margin < 1, pay a linear penalty
[Figure: separator w · x + b = 0 with margin lines w · x + b = ±1 and slack variables ξ1, ξ2, ξ3, ξ4 for the violating points]
min_{w,b,ξ} Σj ξj   subject to   yj (w · xj + b) ≥ 1 − ξj,  ξj ≥ 0   ("slack variables")
We now have a linear program again, and can efficiently find its optimum.
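The slack LP above can also be handed to a generic LP solver. A hedged sketch (mine, not the lecture's code), again with scipy.optimize.linprog and variables ordered as z = [w, b, ξ]:

```python
import numpy as np
from scipy.optimize import linprog

def slack_lp(X, y):
    """min sum_j xi_j  s.t.  y_j (w . x_j + b) >= 1 - xi_j,  xi_j >= 0."""
    n, d = X.shape
    c = np.concatenate([np.zeros(d + 1), np.ones(n)])      # objective: sum_j xi_j
    # Rewrite each constraint as  -y_j (w . x_j + b) - xi_j <= -1.
    A_ub = np.hstack([-y[:, None] * X, -y[:, None], -np.eye(n)])
    b_ub = -np.ones(n)
    bounds = [(None, None)] * (d + 1) + [(0, None)] * n     # xi_j >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    w, b, xi = res.x[:d], res.x[d], res.x[d + 1:]
    return w, b, xi
```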
Key idea #1: Allow for slack
[Figure: same soft-margin picture, with slack variables ξ1, ξ2, ξ3, ξ4]
min_{w,b,ξ} Σj ξj   subject to   yj (w · xj + b) ≥ 1 − ξj,  ξj ≥ 0   ("slack variables")
What is the optimal value ξj* as a function of w* and b*?
• If yj (w* · xj + b*) ≥ 1, then ξj = 0
• If yj (w* · xj + b*) < 1, then ξj = 1 − yj (w* · xj + b*)
Sometimes written as ξj = [1 − yj (w* · xj + b*)]+ = max(0, 1 − yj (w* · xj + b*))
Equivalent hinge loss formulation
Start from the slack LP:  min_{w,b,ξ} Σj ξj  subject to  yj (w · xj + b) ≥ 1 − ξj,  ξj ≥ 0.
Substituting the optimal slack ξj = max(0, 1 − yj (w · xj + b)) into the objective, we get:
min_{w,b} Σj max(0, 1 − (w · xj + b) yj)
This is empirical risk minimization, using the hinge loss:
min_{w,b} Σj ℓhinge(yj, w · xj + b)
The hinge loss is defined as ℓhinge(y, ŷ) = max(0, 1 − ŷ y).
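A small numpy sketch (mine) of the unregularized hinge-loss objective above, with a quick check that the hinge loss never drops below the 0-1 loss:

```python
import numpy as np

def hinge_loss(y, y_hat):
    """Elementwise hinge loss max(0, 1 - y_hat * y)."""
    return np.maximum(0.0, 1.0 - y_hat * y)

def hinge_objective(w, b, X, y):
    """Empirical risk with the hinge loss: sum_j max(0, 1 - y_j (w . x_j + b))."""
    return float(np.sum(hinge_loss(y, X @ w + b)))

# Hinge upper-bounds the 0-1 loss pointwise (checked here for y = +1):
scores = np.linspace(-3.0, 3.0, 13)
assert np.all(hinge_loss(1, scores) >= (np.sign(scores) != 1).astype(float))
```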
Hinge loss vs. 0/1 loss
[Figure: hinge loss and 0/1 loss plotted against the margin ŷ y]
Hinge loss upper bounds 0/1 loss!
It is the tightest convex upper bound on the 0/1 loss.
Hinge loss:  ℓhinge(y, ŷ) = max(0, 1 − ŷ y)
0-1 loss:  ℓ0,1(y, ŷ) = 1[y ≠ sign(ŷ)]
Key idea #2: seek large margin
• Suppose again that the data is linearly separable and we are solving a feasibility problem, with constraints yj (w · xj + b) ≥ 1 for all j
• If the length of the weight vector ||w|| is too small, the optimization problem is infeasible! Why?
[Figure: as ||w|| (and |b|) get smaller, the margin lines w · x + b = +1 and w · x + b = −1 move farther apart from the separator w · x + b = 0]
[Figure: separator w · x + b = 0 with margin lines w · x + b = ±1; points x1, x2 and the geometric margin γ]
What is γ (the geometric margin) as a function of w?
If x1 lies on the line w · x + b = +1 and x2 is its projection onto w · x + b = 0, then x1 = x2 + γ w/||w||. We also know that w · x1 + b = +1 and w · x2 + b = 0.
So, γ ||w|| = 1, i.e., γ = 1/||w||.
γi = distance to the i'th data point, and γ = min_i γi
(assuming there is a data point on the w · x + b = +1 or −1 line).
Final result: we can maximize γ by minimizing ||w||²!!!
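As a quick numeric sanity check (my own), the geometric margin γ = 1/||w|| and the full margin width 2γ follow directly from the weight vector:

```python
import numpy as np

def geometric_margin(w):
    """Distance from w . x + b = 0 to w . x + b = +1, i.e. gamma = 1 / ||w||."""
    return 1.0 / np.linalg.norm(w)

w = np.array([3.0, 4.0])            # ||w|| = 5
print(geometric_margin(w))          # 0.2
print(2 * geometric_margin(w))      # full margin width 2*gamma = 0.4
```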
(Hard margin) support vector machines
min_{w,b} ||w||²/2   subject to   yj (w · xj + b) ≥ 1 for all j
• Example of a convex optimization problem
  – A quadratic program
  – Polynomial-time algorithms to solve!
• Hyperplane defined by the support vectors
  – Could use them as a lower-dimensional basis to write down the line, although we haven't seen how yet
• More on these later
[Figure: hard-margin separator w · x + b = 0 with margin 2γ between the lines w · x + b = +1 and w · x + b = −1]
Support Vectors:
• data points on the
canonical lines
Non-support Vectors:
• everything else
• moving them will
not change w
Allowing for slack: “Soft margin SVM”
For each data point:
• If margin ≥ 1, don't care
• If margin < 1, pay a linear penalty
[Figure: soft-margin separator with slack variables ξ1, ξ2, ξ3, ξ4]
min_{w,b,ξ} ||w||²/2 + C Σj ξj   subject to   yj (w · xj + b) ≥ 1 − ξj,  ξj ≥ 0   ("slack variables")
Slack penalty C > 0:
• C = ∞ → have to separate the data!
• C = 0 → ignores the data entirely!
• Select C using cross-validation
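For context (not part of the lecture), a soft-margin linear SVM with C selected by cross-validation might look as follows with scikit-learn; the toy data and the grid of C values are arbitrary choices of mine:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + 2.0, rng.randn(50, 2) - 2.0])   # toy 2-class data
y = np.concatenate([np.ones(50), -np.ones(50)])

# Soft-margin linear SVM; C trades margin size against slack (training errors).
grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                    cv=5)
grid.fit(X, y)
print("best C:", grid.best_params_["C"])
print("number of support vectors:", grid.best_estimator_.support_vectors_.shape[0])
```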
Equivalent formulation using hinge loss
Start from the soft-margin problem:  min_{w,b,ξ} ||w||²/2 + C Σj ξj  subject to  yj (w · xj + b) ≥ 1 − ξj,  ξj ≥ 0.
Substituting the optimal slack ξj = max(0, 1 − yj (w · xj + b)) into the objective, we get:
min_{w,b} ||w||²/2 + C Σj ℓhinge(yj, w · xj + b)
The second term is empirical risk minimization, using the hinge loss ℓhinge(y, ŷ) = max(0, 1 − ŷ y).
The first term is called regularization; it is used to prevent overfitting!
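The regularized objective above can also be minimized directly. Below is a rough subgradient-descent sketch (my own; the step size and iteration count are arbitrary), not the quadratic-programming solver the lecture has in mind:

```python
import numpy as np

def svm_subgradient_descent(X, y, C=1.0, lr=1e-3, iters=5000):
    """Minimize ||w||^2 / 2 + C * sum_j max(0, 1 - y_j (w . x_j + b))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        viol = margins < 1.0                    # points with positive hinge loss
        # Subgradient of the objective at (w, b):
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```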
What if the data is not linearly
separable?
Use features of features
of features of features….
Feature space can get really large really quickly!
φ(x) = ( x(1), …, x(n), x(1)x(2), x(1)x(3), …, e^{x(1)}, … )
[Tommi Jaakkola]
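To make the blow-up concrete, here is a small sketch (mine) of an explicit degree-2 interaction expansion; the number of features already grows quadratically in n, which is exactly what the kernel trick (key idea #3) will let us avoid paying for:

```python
import numpy as np
from itertools import combinations

def pairwise_features(x):
    """Original coordinates plus all pairwise products x_i * x_j (i < j)."""
    pairs = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([x, np.array(pairs)])

x = np.arange(1.0, 6.0)             # n = 5 input features
print(pairwise_features(x).shape)   # (15,): 5 + C(5, 2) = 5 + 10 features
```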
Example
What’s Next!
• Learn one of the most interesting and
exciting recent advancements in machine
learning
– Key idea #3: the “kernel trick”
– High dimensional feature spaces at no extra
cost
• But first, a detour
– Constrained optimization!
Constrained optimization
[Figure: minimizing x² under different constraints. No constraint: x* = 0; with x ≥ −1: x* = 0; with x ≥ 1: x* = 1]
How do we solve with constraints? → Lagrange multipliers!!!
Lagrange multipliers – Dual variables
Start from the constrained problem:  min_x x²  subject to  x ≥ b.
Rewrite the constraint as x − b ≥ 0, add a Lagrange multiplier with the new constraint α ≥ 0, and introduce the Lagrangian (objective):
L(x, α) = x² − α (x − b)
We will solve:  min_x max_{α≥0} L(x, α)
Why is this equivalent? min is fighting max!
• x < b → (x − b) < 0 → max_α −α(x − b) = ∞; min won't let this happen!
• x > b, α ≥ 0 → (x − b) > 0 → max_α −α(x − b) = 0 with α* = 0; min is cool with 0, and L(x, α) = x² (the original objective)
• x = b → α can be anything, and L(x, α) = x² (the original objective)
The min on the outside forces max to behave, so the constraints will be satisfied.
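As a quick numeric illustration of the toy problem (my own sketch), min x² subject to x ≥ 1 can be solved with a generic constrained optimizer and indeed returns x* = 1:

```python
import numpy as np
from scipy.optimize import minimize

objective = lambda x: x[0] ** 2
constraint = {"type": "ineq", "fun": lambda x: x[0] - 1.0}   # encodes x - 1 >= 0

res = minimize(objective, x0=np.array([5.0]), constraints=[constraint], method="SLSQP")
print(res.x)   # approximately [1.0], matching x* = 1 from the slide above
```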
Dual SVM derivation (1) – the linearly separable case (hard margin SVM)
Original optimization problem:  min_{w,b} ||w||²/2  subject to  yj (w · xj + b) − 1 ≥ 0 for all j  (constraints rewritten in the form "… ≥ 0")
Lagrangian, with one Lagrange multiplier αj ≥ 0 per example:
L(w, b, α) = ||w||²/2 − Σj αj [ yj (w · xj + b) − 1 ]
Our goal now is to solve:  min_{w,b} max_{α≥0} L(w, b, α)
Dual SVM derivation (2) – the linearly separable case (hard margin SVM)
Swap min and max:
(Primal)  min_{w,b} max_{α≥0} L(w, b, α)
(Dual)   max_{α≥0} min_{w,b} L(w, b, α)
Slater's condition from convex optimization guarantees that these two optimization problems are equivalent!
Dual SVM derivation (3) – the linearly
separable case (hard margin SVM)
Can solve for the optimal w, b as a function of α:
∂L/∂w = w − Σj αj yj xj = 0   ⇒   w = Σj αj yj xj
∂L/∂b = −Σj αj yj = 0   ⇒   Σj αj yj = 0
Substituting these values back in (and simplifying), we obtain the dual: an objective that sums over all training examples and involves the data only through dot products xi · xj and scalars αi, yi.
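For reference, a small numpy sketch (mine, not the lecture's notation) of the standard hard-margin dual objective Σj αj − ½ Σi,j αi αj yi yj (xi · xj), together with the recovery of w = Σj αj yj xj from the dual variables:

```python
import numpy as np

def dual_objective(alpha, X, y):
    """sum_j alpha_j - 1/2 * sum_{i,j} alpha_i alpha_j y_i y_j (x_i . x_j)."""
    G = X @ X.T                                  # Gram matrix of all dot products
    return alpha.sum() - 0.5 * (alpha * y) @ G @ (alpha * y)

def primal_weights(alpha, X, y):
    """Recover w = sum_j alpha_j y_j x_j from the dual variables."""
    return (alpha * y) @ X
```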
Dual formulation only depends on dot-products of the features!
First, we introduce a feature mapping:  x → φ(x).
Next, replace the dot product with an equivalent kernel function:  K(xi, xj) = φ(xi) · φ(xj).
SVM with kernels
• Never compute features explicitly!!!
  – Compute dot products in closed form
• O(n²) time in the size of the dataset to compute the objective
  – much work on speeding this up
Predict with:  ŷ = sign( Σj αj yj K(xj, x) + b )
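A minimal sketch (mine) of kernelized prediction using the rule above, with a Gaussian kernel as an example; alpha, b, and the training set are assumed to come from some dual solver:

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_predict(x, X_train, y_train, alpha, b, kernel=rbf_kernel):
    """sign( sum_j alpha_j y_j K(x_j, x) + b ), with no explicit features."""
    score = sum(a * yj * kernel(xj, x) for a, yj, xj in zip(alpha, y_train, X_train))
    return np.sign(score + b)
```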
Common kernels
• Polynomials of degree exactly d:  K(x, z) = (x · z)^d
• Polynomials of degree up to d:  K(x, z) = (x · z + 1)^d
• Gaussian kernels:  K(x, z) = exp( −||x − z||² / (2σ²) )
• Sigmoid:  K(x, z) = tanh( η (x · z) + ν )
• And many others: a very active area of research!
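For reference, the kernels listed above written as plain numpy functions (my own sketch; the parameter names are arbitrary):

```python
import numpy as np

def poly_kernel_exact(x, z, d):              # polynomials of degree exactly d
    return (x @ z) ** d

def poly_kernel_upto(x, z, d):               # polynomials of degree up to d
    return (x @ z + 1.0) ** d

def gaussian_kernel(x, z, sigma=1.0):        # Gaussian / RBF kernel
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, z, eta=1.0, nu=0.0):   # sigmoid (not always a valid PSD kernel)
    return np.tanh(eta * (x @ z) + nu)
```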
[Tommi Jaakkola]
Quadratic kernel:  K(x, z) = (x · z + 1)²  [Cynthia Rudin]
Feature mapping given by:  φ(x) = ( 1, √2 x(1), …, √2 x(n), x(1)², …, x(n)², √2 x(1)x(2), …, √2 x(n−1)x(n) )
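A quick numeric check (my own) that the quadratic kernel really is a dot product in the explicit feature space above:

```python
import numpy as np
from itertools import combinations

def quadratic_phi(x):
    """Explicit feature map for K(x, z) = (x . z + 1)^2."""
    cross = [np.sqrt(2.0) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([[1.0], np.sqrt(2.0) * x, x ** 2, np.array(cross)])

rng = np.random.RandomState(0)
x, z = rng.randn(4), rng.randn(4)
lhs = (x @ z + 1.0) ** 2
rhs = quadratic_phi(x) @ quadratic_phi(z)
print(np.isclose(lhs, rhs))   # True: the kernel equals a dot product of features
```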
