2. Support vector machines
We'll look at a simple classification task.
Class 1 : yellow dots, label -1
Class 2 : red dots, label +1
We need a decision rule to separate them.
3. Possible decision rules
All are valid decision boundaries, and there are many more.
When we can do it with a line, should we go for another shape?
"Occam's Razor"
One reason : simpler boundaries are usually harder to find, so there is less chance of fooling ourselves.
This is roughly how your decision boundary would look if you used nearest-neighbor classification.
4. Intuitively best decision rule
The decision boundary that leaves an equal and maximum margin on either side.
This is what we are trying to learn.
SUPPORT VECTORS (the samples that lie on the margin)
5. The approach
Let's define w as the unit vector from the origin, perpendicular to our decision boundary.
Recall : given a vector U and a unit vector w, U.w is the component of U along w.
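A quick numeric check of that recall, as a small Python/NumPy sketch (the vectors below are made up for illustration):

import numpy as np

w = np.array([0.6, 0.8])      # a unit vector: ||w|| = 1
U = np.array([3.0, 4.0])      # an arbitrary vector (made-up values)

# U.w gives the length of U's component along w.
print(U.dot(w))               # 5.0 here, since U happens to point along w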
6. The approach
Let U be an unknown data point.
We say U is of the +ve class if U.w > c.
Otherwise, U belongs to the -ve class.
7. The approach
Let U be an unknown data point.
We say U is of the +ve class if U.w > c, i.e. U.w + b > 0 (put b = -c).
Otherwise, U belongs to the -ve class.
Next, we'll add some constraints.
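As a small sketch of the rule so far (w, b, and the query points below are made-up values, not learned ones):

import numpy as np

def classify(U, w, b):
    # Return +1 if U.w + b > 0, else -1 (the decision rule above).
    return 1 if U.dot(w) + b > 0 else -1

# Made-up parameters and query points, just to exercise the rule.
w = np.array([1.0, 1.0])
b = -3.0
print(classify(np.array([2.5, 2.5]), w, b))   # +1 : lies on the positive side
print(classify(np.array([0.5, 0.5]), w, b))   # -1 : lies on the negative side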
8. The approach
Let the margin be m on each side.
For a +ve sample (not unknown) X+, we want
X+.w + b >= m, i.e. X+.w + b >= 1
(where I divided throughout by m, so w is no longer a unit vector, but that is okay.)
9. The approach
Similarly, for a negative sample X-, we have
X-.w + b <= -m, i.e. X-.w + b <= -1
(again dividing throughout by m, so w is no longer a unit vector, but that is okay.)
10. The approach
To make things easy, consider the label y:
y = +1 for the +ve class
y = -1 for the -ve class
Then we can rewrite the inequalities as follows.
11. The approach
For any sample Xi,
yi (Xi.w + b) >= 1
or, equivalently,
yi (Xi.w + b) - 1 >= 0
(Convince yourself that this is okay.)
12. The approach
For support vectors, we say
yi (Xi.w + b) - 1 = 0
(This is our rule to identify support vectors.)
Now we need an expression for the margin width (which we want to maximize).
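As a hedged sketch, the constraint and the support-vector rule can be checked numerically on a toy set where w and b are worked out by hand (all values below are illustrative):

import numpy as np

# Toy separable data; labels follow the +1 / -1 convention above.
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])

# For this toy set the maximum-margin boundary is x1 + x2 = 5,
# which in the scaled form above gives w = (1, 1), b = -5.
w = np.array([1.0, 1.0])
b = -5.0

margins = y * (X.dot(w) + b)          # yi (Xi.w + b); should be >= 1 for every i
support = np.isclose(margins, 1.0)    # equality holds only for support vectors

print(margins)     # [3. 1. 1. 3.]
print(support)     # [False  True  True False] -> (2,2) and (3,3) are support vectors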
13. The approach
Consider a positive support vector X+ and a negative support vector X-.
Total margin = (X+ - X-).w / ||w||
             = (1 - b - (-1 - b)) / ||w||
             = 2 / ||w||
(This is what we'll maximize.)
14. The approach
Maximizing 2/||w|| is the same as
maximizing 1/||w||, the same as
minimizing ||w||, the same as
minimizing (1/2)||w||^2.
So our problem is:
Minimize (1/2)||w||^2
s.t. yi (Xi.w + b) - 1 >= 0
for i ranging from 1 to n.
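We will let a library solve this optimization. A minimal sketch, assuming scikit-learn's SVC with a linear kernel and a very large C so that it approximates the hard-margin problem above (the toy data are made up):

import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])

# A linear SVC with a very large C approximates:
# minimize (1/2)||w||^2  subject to  yi (Xi.w + b) - 1 >= 0
clf = SVC(kernel='linear', C=1e6).fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]
margin_width = 2.0 / np.linalg.norm(w)    # the 2/||w|| quantity we maximized

print(w, b)                   # roughly (1, 1) and -5 for this toy set
print(margin_width)           # roughly sqrt(2)
print(clf.support_vectors_)   # the points satisfying yi (Xi.w + b) - 1 = 0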
15. Takeaway
We will not go through how this is solved. But the important takeaway is that when we do the algebra, the training samples figure only as dot products. No other forms.
Now we'll look at some problems.
16. Sample problems
What if there are some noisy samples which make the data linearly non-separable?
Soft-margin classification : it will try to minimize the number of points in error.
Picture taken from https://www.cs.utexas.edu/~mooney/cs391L/slides/svm.ppt
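In scikit-learn this trade-off is controlled by SVC's C parameter. A small sketch with invented data containing one deliberately mislabeled point:

import numpy as np
from sklearn.svm import SVC

# Mostly separable data with one noisy (mislabeled) point.
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6], [1.5, 1.5]])
y = np.array([-1, -1, -1, 1, 1, 1, 1])   # the last point is the noisy one

# Smaller C tolerates margin violations; larger C penalizes them more heavily.
soft = SVC(kernel='linear', C=0.1).fit(X, y)
hard = SVC(kernel='linear', C=1000.0).fit(X, y)

print(soft.n_support_, hard.n_support_)   # support-vector counts under each setting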
17. Sample problems : kernel trick
What if the dataset is just too hard?
Move to a new perspective : map the data to a higher dimensional space.
Pictures from https://www.cs.utexas.edu/~mooney/cs391L/slides/svm.ppt
18. Sample problems : kernel trick
Pictures from https://www.cs.utexas.edu/~mooney/cs391L/slides/svm.ppt
19. Sample problems : kernel trick
To do this kind of transformation, the only thing we need to provide is the dot product of training samples in the new space. (We needn't know the transformation of each point, just the dot product.) Such a function is called a kernel.
[This is an implication of the takeaway we noted : only dot products figure in the subsequent steps.]
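scikit-learn's SVC accepts such a function directly as a callable kernel. A minimal sketch, using a degree-2 polynomial dot product on XOR-style data (the kernel choice and data here are illustrative, not from the slides):

import numpy as np
from sklearn.svm import SVC

def my_kernel(A, B):
    # The dot product in an (implicit) transformed space: a degree-2 polynomial.
    return (1.0 + A.dot(B.T)) ** 2

# XOR-style data: no straight line separates these in the original 2-D space.
X = np.array([[-1, -1], [1, 1], [-1, 1], [1, -1]], dtype=float)
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel=my_kernel).fit(X, y)
print(clf.predict(X))   # [-1 -1  1  1] : separable once the kernel supplies the new dot product

Note that we never wrote down the mapping to the higher dimensional space, only the dot product in it.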
20. Some common kernels
Linear kernel : K(Xi, Xj) = Xi.Xj (no transformation)
Polynomial kernel : K(Xi, Xj) = (1 + Xi.Xj)^p
RBF kernel : K(Xi, Xj) = exp(-((Xi - Xj).(Xi - Xj)) / 2σ^2)
Linear separators in the higher dimensional space correspond to non-linear separators in the original space.
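Written out as plain Python functions (a sketch; the parameters p and sigma follow the formulas above, and the check vectors are made up):

import numpy as np

def linear_kernel(xi, xj):
    return xi.dot(xj)                          # K = Xi.Xj

def polynomial_kernel(xi, xj, p=3):
    return (1.0 + xi.dot(xj)) ** p             # K = (1 + Xi.Xj)^p

def rbf_kernel(xi, xj, sigma=1.0):
    diff = xi - xj
    return np.exp(-diff.dot(diff) / (2.0 * sigma ** 2))   # K = exp(-||Xi - Xj||^2 / 2σ^2)

xi, xj = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(linear_kernel(xi, xj), polynomial_kernel(xi, xj), rbf_kernel(xi, xj))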
21. References
Dr. Patrick Winston's simple video lecture (MIT OpenCourseWare) : https://youtu.be/_PwhiWxHK8o
Slides : https://www.cs.utexas.edu/~mooney/cs391L/slides/svm.ppt
For extensions of the code we tried out, refer to scikit-learn's SVM documentation.