This document discusses support vector machines (SVMs). It explains that SVMs are supervised learning models that can be used for classification or regression tasks. The document outlines hard and soft margin SVMs, describing how soft margin SVMs allow some classification errors (margin violations). It presents the mathematical formulation of linear SVMs, including defining the decision boundary, maximizing the margin between classes, and deriving the primal and dual optimization problems. Finally, it introduces kernel methods that extend linear SVMs to nonlinear decision boundaries via the kernel trick.
2. Part I. The Fundamentals of ML
Ch.5 Support Vector Machines
3. In this chapter, we will discuss
• What is an SVM?
- supervised learning
- binary classifier
- linear and nonlinear regressor
- linear and nonlinear learning using the kernel trick
[Figure: support vectors]
4. Linear SVM Classification
• What is the margin?
• SVM is sensitive to feature scales
→ use Scikit-Learn's StandardScaler
• Decision boundary = decision line = decision hyperplane
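For example, a minimal scikit-learn sketch of this recommendation (the dataset and parameter values below are placeholders, not from the slides): scale the features with StandardScaler before fitting a linear SVM.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Toy 2-class data; the second feature has a much larger scale than the first.
X = np.array([[1.0, 2000.0], [2.0, 2400.0], [1.5, 500.0], [0.5, 300.0]])
y = np.array([1, 1, 0, 0])

# Scaling first keeps the margin from being dominated by the large-scale feature.
svm_clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
svm_clf.fit(X, y)
print(svm_clf.predict([[1.2, 2100.0]]))
```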
5. Soft and Hard Margin SVM
• Hard margin SVM: maximize the margin while allowing no errors (no margin violations)
* A hard margin SVM is vulnerable to outliers
6. Soft and Hard Margin SVM
• So we can allow a bit of slack; how much slack is allowed is measured by the margin violations. In that case, samples may lie between the support vectors and the decision line, i.e., inside the margin.
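In scikit-learn this trade-off is controlled by the hyperparameter C: a small C tolerates more margin violations, a large C approaches hard margin behaviour. A small sketch (the dataset and C values are placeholders):

```python
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Two slightly overlapping clusters, so a hard margin is impossible.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Points with y_i * (w·x_i + b) < 1 lie inside the margin (margin violations).
    scores = clf.decision_function(X) * (2 * y - 1)   # map labels {0,1} -> {-1,+1}
    print(C, "margin violations:", int((scores < 1).sum()))
```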
7. Mathematical approach 1
• Let the decision boundary (hyperplane) be
  w·x + b = w_1·x_1 + ⋯ + w_n·x_n + b = 0.
• SVM is a binary classifier: the positive case satisfies w·x + b > 0 and the negative case satisfies w·x + b < 0.
• Define a map s : X → {±1} by s(x) = sign(w·x + b).
  Then we have (w·x_i + b) y_i ≥ 0,
  since w·x_i + b > 0 ⇒ y_i = 1 and w·x_i + b < 0 ⇒ y_i = −1.
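As a quick illustration of this decision rule (w and b below are made-up values, not from the slides):

```python
import numpy as np

w = np.array([2.0, -1.0])   # hypothetical weight vector
b = -0.5                    # hypothetical bias
X = np.array([[1.0, 1.0], [0.0, 0.0], [3.0, 1.0]])

f = X @ w + b               # f(x) = w·x + b for each sample
s = np.sign(f)              # s(x) = sign(w·x + b), the predicted label in {-1, +1}
print(f, s)
```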
8. Mathematical approach 2 : Margin Distance
• Define a map f(x) = w·x + b.
• A point x is said to be on the boundary if f(x) = 0.
• Suppose that f(x) = a (≠ 0).
• For the point x_p on the boundary obtained by projecting x perpendicularly onto it, we have
  x = x_p + r (w/‖w‖),
  where r is the distance between x and the decision boundary.
[Figure: a point x, its perpendicular projection x_p onto the decision boundary, and the distance r]
9. Mathematical approach 2 : Margin Distance
• Since f(x_p) = w·x_p + b = 0, we have
  f(x) = w·x + b = w·(x_p + r (w/‖w‖)) + b = r (w·w)/‖w‖ = r‖w‖.
  Therefore the distance between x and the decision boundary is
  r = f(x)/‖w‖.
[Figure: the same picture, with the distance from x to the decision boundary labeled f(x)/‖w‖]
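A quick NumPy check of r = f(x)/‖w‖ with a hypothetical hyperplane (values chosen for easy arithmetic):

```python
import numpy as np

w = np.array([3.0, 4.0])    # hypothetical normal vector, ||w|| = 5
b = -5.0
x = np.array([3.0, 4.0])

f_x = w @ x + b                 # f(x) = w·x + b = 20
r = f_x / np.linalg.norm(w)     # signed distance to the hyperplane, here 20 / 5 = 4
print(r)
```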
10. Maximizing the margin
• Optimization problem:
  max_{w,b} a/‖w‖ such that (w·x_i + b) y_i ≥ a, ∀i
• Since a was an arbitrary number, we can normalize (rescale w and b) to obtain the optimization problem below:
  min_{w,b} ‖w‖ such that (w·x_i + b) y_i ≥ 1, ∀i
  ⇒ a Quadratic Programming (QP) problem,
  because ‖w‖ involves the squared terms of w (in practice we minimize the equivalent ½‖w‖², see the next slide).
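To make the QP concrete, here is a sketch that solves the hard margin problem min ½‖w‖² s.t. (w·x_i + b) y_i ≥ 1 directly with the CVXPY modelling library (using CVXPY is an assumption of this sketch, not something the slides prescribe), on a tiny separable dataset:

```python
import numpy as np
import cvxpy as cp

# Tiny linearly separable dataset (placeholder values), labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# Primal hard-margin QP: minimize (1/2) w·w  s.t.  y_i (w·x_i + b) >= 1 for all i.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
```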
11. Dual problem
• Constrained optimization:
  min_x f(x) s.t. g(x) ≤ 0, h(x) = 0.
• Lagrange method:
  Lagrange primal function: L(x, α, β) = f(x) + α g(x) + β h(x)
  Lagrange multipliers: α ≥ 0, β
  Lagrange dual function:
  d(α, β) = inf_{x∈X} L(x, α, β) = min_{x∈X} L(x, α, β).
  Then we have
  max_{α≥0, β} L(x, α, β) = f(x) if x is feasible, and ∞ otherwise.
• Substituting our optimization problem
  min_{w,b} ‖w‖ such that (w·x_i + b) y_i ≥ 1, ∀i
  into this framework gives
  f(w) = ½ w·w,
  g_i(w) = 1 − (w·x_i + b) y_i ≤ 0,
  h(w) = 0.
  (Strictly, the objective is f(w) = ‖w‖, but arg min ‖w‖ = arg min ‖w‖², so we minimize ½ w·w instead.)
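Putting these together, the Lagrange primal function of the SVM, which is exactly the expression appearing inside the min/max on the next slides, is

  L(w, b, α) = ½ w·w + Σ_i α_i (1 − (w·x_i + b) y_i) = ½ w·w − Σ_i α_i ((w·x_i + b) y_i − 1),  with α_i ≥ 0, ∀i.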
12. Primal & Dual problem
• Primal problem:
  min_x f(x) s.t. g(x) ≤ 0, h(x) = 0
• Dual problem:
  max_{α≥0, β} d(α, β) s.t. α ≥ 0
• These correspond to the two orders of optimization:
  min_x max_{α≥0,β} L(x, α, β)  (primal)  vs.  max_{α≥0,β} min_x L(x, α, β)  (dual)
• Weak duality theorem:
  1. d(α, β) ≤ f(x) for every feasible x and every α ≥ 0, β
  2. d* = max_{α≥0,β} min_x L(x, α, β) ≤ min_x max_{α≥0,β} L(x, α, β) = p*
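For reference, statement 2 follows from a one-line argument not spelled out in the slides: for any x̄ and any ᾱ ≥ 0, β̄,

  min_x L(x, ᾱ, β̄) ≤ L(x̄, ᾱ, β̄) ≤ max_{α≥0,β} L(x̄, α, β).

The left-hand side does not depend on x̄ and the right-hand side does not depend on (ᾱ, β̄), so taking the maximum over (ᾱ, β̄) on the left and the minimum over x̄ on the right gives d* ≤ p*.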
13. Strong Duality and KKT-conditions
• The Karush-Kuhn-Tucker (KKT) conditions:
  Parallel gradients (stationarity) condition: ∂L(x, α, β)/∂x_i = 0, ∀i
  Orthogonality (complementary slackness) condition: α* g(x*) = 0
  Satisfaction of the original constraints: g(x*) ≤ 0
  Lagrange multiplier nonnegativity: α* ≥ 0
• If the KKT conditions are satisfied, strong duality holds, i.e.,
  d* = max_{α≥0,β} min_x L(x, α, β) = min_x max_{α≥0,β} L(x, α, β) = p*
14. Strong Duality and KKT-conditions
• For our SVM problem, the primal problem, dual problem, and KKT conditions are as follows.
• Primal problem:
  min_{w,b} max_{α≥0} ( ½ w·w − Σ_i α_i ((w·x_i + b) y_i − 1) )  s.t. α_i ≥ 0, ∀i.
• Dual problem:
  max_{α≥0} min_{w,b} ( ½ w·w − Σ_i α_i ((w·x_i + b) y_i − 1) )  s.t. α_i ≥ 0, ∀i.
∴ Our KKT conditions:
  ∂L(w, b, α)/∂w = 0,  ∂L(w, b, α)/∂b = 0
  α_i ≥ 0, ∀i
  α_i ((w·x_i + b) y_i − 1) = 0, ∀i
※ If x_i does not lie on the margin boundary, then (w·x_i + b) y_i − 1 ≠ 0, so the condition α_i ((w·x_i + b) y_i − 1) = 0 together with α_i ≥ 0 forces α_i = 0.
  Therefore, in an SVM the support vectors are what matter!
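A standard completion of this derivation (not shown in the slides): ∂L/∂w = 0 gives w = Σ_i α_i y_i x_i and ∂L/∂b = 0 gives Σ_i α_i y_i = 0, so the weight vector is a combination of only the points with α_i > 0, and the fitted model is determined entirely by its support vectors. A small scikit-learn sketch (the dataset is a placeholder) that inspects them after fitting:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.0, random_state=0)

clf = SVC(kernel="linear", C=1000.0)   # large C ~ nearly hard margin
clf.fit(X, y)

print("support vectors per class:", clf.n_support_)
print("support vectors:\n", clf.support_vectors_)   # the x_i with alpha_i > 0
print("signed dual coefficients (y_i * alpha_i):", clf.dual_coef_)
```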