This document discusses support vector machines (SVMs). It explains that SVMs are supervised learning models that can be used for classification or regression tasks. The document outlines hard and soft margin SVMs, describing how soft margin SVMs allow some classification errors (margin violations). It presents the mathematical formulation of linear SVMs, including defining the decision boundary, maximizing the margin between classes, and deriving the primal and dual optimization problems. Finally, it introduces kernel methods that extend linear SVMs to nonlinear decision boundaries via the kernel trick.
2. Part I. The Fundamentals of ML
Ch.5 Support Vector Machines
3. In this chapter, we will discuss
• What is an SVM?
- supervised learning
- binary classifier
- linear and nonlinear regressor
- linear and nonlinear learning using the kernel trick
[Figure: support vectors]
4. Linear SVM Classification
• What is the margin?
• SVM is sensitive to feature scales
→ use Scikit-Learn's StandardScaler
• Decision boundary = decision line = decision hyperplane
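For example, a minimal scikit-learn sketch of this recommendation (the dataset and parameter values below are placeholders, not from the slides): scale the features with StandardScaler before fitting a linear SVM.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Toy 2-class data; the second feature has a much larger scale than the first.
X = np.array([[1.0, 2000.0], [2.0, 2400.0], [1.5, 500.0], [0.5, 300.0]])
y = np.array([1, 1, 0, 0])

# Scaling first keeps the margin from being dominated by the large-scale feature.
svm_clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
svm_clf.fit(X, y)
print(svm_clf.predict([[1.2, 2100.0]]))
```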
5. Soft and Hard Margin SVM
• Hard margin SVM: maximize the margin while allowing no errors (no margin violations)
* A hard margin SVM is vulnerable to outliers
6. Soft and Hard Margin SVM
• So we can allow a bit of slack; how much slack is allowed is measured by the margin violations. In that case, samples may lie between the support vectors and the decision line, i.e., inside the margin.
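In scikit-learn this trade-off is controlled by the hyperparameter C: a small C tolerates more margin violations, a large C approaches hard margin behaviour. A small sketch (the dataset and C values are placeholders):

```python
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Two slightly overlapping clusters, so a hard margin is impossible.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Points with y_i * (w·x_i + b) < 1 lie inside the margin (margin violations).
    scores = clf.decision_function(X) * (2 * y - 1)   # map labels {0,1} -> {-1,+1}
    print(C, "margin violations:", int((scores < 1).sum()))
```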
7. Mathematical approach 1
• Let the decision boundary (hyperplane) be
  w·x + b = w_1·x_1 + ⋯ + w_n·x_n + b = 0.
• SVM is a binary classifier: the positive case satisfies w·x + b > 0 and the negative case satisfies w·x + b < 0.
• Define a map s : X → {±1} by s(x) = sign(w·x + b).
  Then we have (w·x_i + b) y_i ≥ 0,
  since w·x_i + b > 0 ⇒ y_i = 1 and w·x_i + b < 0 ⇒ y_i = −1.
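As a quick illustration of this decision rule (w and b below are made-up values, not from the slides):

```python
import numpy as np

w = np.array([2.0, -1.0])   # hypothetical weight vector
b = -0.5                    # hypothetical bias
X = np.array([[1.0, 1.0], [0.0, 0.0], [3.0, 1.0]])

f = X @ w + b               # f(x) = w·x + b for each sample
s = np.sign(f)              # s(x) = sign(w·x + b), the predicted label in {-1, +1}
print(f, s)
```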
8. Mathematical approach 2 : Margin Distance
• Define a map f(x) = w·x + b.
• A point x is said to be on the boundary if f(x) = 0.
• Suppose that f(x) = a (≠ 0).
• For the point x_p on the boundary obtained by projecting x perpendicularly onto it, we have
  x = x_p + r (w/‖w‖),
  where r is the distance between x and the decision boundary.
[Figure: a point x, its perpendicular projection x_p onto the decision boundary, and the distance r]
9. Mathematical approach 2 : Margin Distance
• Since f(x_p) = w·x_p + b = 0, we have
  f(x) = w·x + b = w·(x_p + r (w/‖w‖)) + b = r (w·w)/‖w‖ = r‖w‖.
  Therefore the distance between x and the decision boundary is
  r = f(x)/‖w‖.
[Figure: the same picture, with the distance from x to the decision boundary labeled f(x)/‖w‖]
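A quick NumPy check of r = f(x)/‖w‖ with a hypothetical hyperplane (values chosen for easy arithmetic):

```python
import numpy as np

w = np.array([3.0, 4.0])    # hypothetical normal vector, ||w|| = 5
b = -5.0
x = np.array([3.0, 4.0])

f_x = w @ x + b                 # f(x) = w·x + b = 20
r = f_x / np.linalg.norm(w)     # signed distance to the hyperplane, here 20 / 5 = 4
print(r)
```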
10. Maximizing the margin
• Optimization problem:
  max_{w,b} a/‖w‖ such that (w·x_i + b) y_i ≥ a, ∀i
• Since a was an arbitrary number, we can normalize (rescale w and b) to obtain the optimization problem below:
  min_{w,b} ‖w‖ such that (w·x_i + b) y_i ≥ 1, ∀i
  ⇒ a Quadratic Programming (QP) problem,
  because ‖w‖ involves the squared terms of w (in practice we minimize the equivalent ½‖w‖², see the next slide).
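To make the QP concrete, here is a sketch that solves the hard margin problem min ½‖w‖² s.t. (w·x_i + b) y_i ≥ 1 directly with the CVXPY modelling library (using CVXPY is an assumption of this sketch, not something the slides prescribe), on a tiny separable dataset:

```python
import numpy as np
import cvxpy as cp

# Tiny linearly separable dataset (placeholder values), labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# Primal hard-margin QP: minimize (1/2) w·w  s.t.  y_i (w·x_i + b) >= 1 for all i.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
```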
11. Dual problem
• Constrained optimization:
  min_x f(x) s.t. g(x) ≤ 0, h(x) = 0.
• Lagrange method:
  Lagrange primal function: L(x, α, β) = f(x) + α g(x) + β h(x)
  Lagrange multipliers: α ≥ 0, β
  Lagrange dual function:
  d(α, β) = inf_{x∈X} L(x, α, β) = min_{x∈X} L(x, α, β).
  Then we have
  max_{α≥0, β} L(x, α, β) = f(x) if x is feasible, and ∞ otherwise.
• Substituting our optimization problem
  min_{w,b} ‖w‖ such that (w·x_i + b) y_i ≥ 1, ∀i
  into this framework gives
  f(w) = ½ w·w,
  g_i(w) = 1 − (w·x_i + b) y_i ≤ 0,
  h(w) = 0.
  (Strictly, the objective is f(w) = ‖w‖, but arg min ‖w‖ = arg min ‖w‖², so we minimize ½ w·w instead.)
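Putting these together, the Lagrange primal function of the SVM, which is exactly the expression appearing inside the min/max on the next slides, is

  L(w, b, α) = ½ w·w + Σ_i α_i (1 − (w·x_i + b) y_i) = ½ w·w − Σ_i α_i ((w·x_i + b) y_i − 1),  with α_i ≥ 0, ∀i.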
12. Primal & Dual problem
• Primal problem:
  min_x f(x) s.t. g(x) ≤ 0, h(x) = 0
• Dual problem:
  max_{α≥0, β} d(α, β) s.t. α ≥ 0
• These correspond to the two orders of optimization:
  min_x max_{α≥0,β} L(x, α, β)  (primal)  vs.  max_{α≥0,β} min_x L(x, α, β)  (dual)
• Weak duality theorem:
  1. d(α, β) ≤ f(x) for every feasible x and every α ≥ 0, β
  2. d* = max_{α≥0,β} min_x L(x, α, β) ≤ min_x max_{α≥0,β} L(x, α, β) = p*
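For reference, statement 2 follows from a one-line argument not spelled out in the slides: for any x̄ and any ᾱ ≥ 0, β̄,

  min_x L(x, ᾱ, β̄) ≤ L(x̄, ᾱ, β̄) ≤ max_{α≥0,β} L(x̄, α, β).

The left-hand side does not depend on x̄ and the right-hand side does not depend on (ᾱ, β̄), so taking the maximum over (ᾱ, β̄) on the left and the minimum over x̄ on the right gives d* ≤ p*.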
13. Strong Duality and KKT-conditions
• The Karush-Kuhn-Tucker (KKT) conditions:
  Parallel gradients (stationarity) condition: ∂L(x, α, β)/∂x_i = 0, ∀i
  Orthogonality (complementary slackness) condition: α* g(x*) = 0
  Satisfaction of the original constraints: g(x*) ≤ 0
  Lagrange multiplier nonnegativity: α* ≥ 0
• If the KKT conditions are satisfied, strong duality holds, i.e.,
  d* = max_{α≥0,β} min_x L(x, α, β) = min_x max_{α≥0,β} L(x, α, β) = p*
14. Strong Duality and KKT-conditions
• For our SVM problem, the primal problem, dual problem, and KKT conditions are as follows.
• Primal problem:
  min_{w,b} max_{α≥0} ( ½ w·w − Σ_i α_i ((w·x_i + b) y_i − 1) )  s.t. α_i ≥ 0, ∀i.
• Dual problem:
  max_{α≥0} min_{w,b} ( ½ w·w − Σ_i α_i ((w·x_i + b) y_i − 1) )  s.t. α_i ≥ 0, ∀i.
∴ Our KKT conditions:
  ∂L(w, b, α)/∂w = 0,  ∂L(w, b, α)/∂b = 0
  α_i ≥ 0, ∀i
  α_i ((w·x_i + b) y_i − 1) = 0, ∀i
※ If x_i does not lie on the margin boundary, then (w·x_i + b) y_i − 1 ≠ 0, so the condition α_i ((w·x_i + b) y_i − 1) = 0 together with α_i ≥ 0 forces α_i = 0.
  Therefore, in an SVM the support vectors are what matter!
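A standard completion of this derivation (not shown in the slides): ∂L/∂w = 0 gives w = Σ_i α_i y_i x_i and ∂L/∂b = 0 gives Σ_i α_i y_i = 0, so the weight vector is a combination of only the points with α_i > 0, and the fitted model is determined entirely by its support vectors. A small scikit-learn sketch (the dataset is a placeholder) that inspects them after fitting:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.0, random_state=0)

clf = SVC(kernel="linear", C=1000.0)   # large C ~ nearly hard margin
clf.fit(X, y)

print("support vectors per class:", clf.n_support_)
print("support vectors:\n", clf.support_vectors_)   # the x_i with alpha_i > 0
print("signed dual coefficients (y_i * alpha_i):", clf.dual_coef_)
```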