3. How to compute $\bar{w}$?
For + samples: $\bar{w} \cdot \bar{x}_+ + b \geq 1$
For − samples: $\bar{w} \cdot \bar{x}_- + b \leq -1$
Introduce labels $y_i$ such that $y_i = +1$ for + samples and $y_i = -1$ for − samples.
Both constraints then collapse into a single one:
$y_i(\bar{w} \cdot \bar{x}_i + b) \geq 1$, i.e. $y_i(\bar{w} \cdot \bar{x}_i + b) - 1 \geq 0$
For $\bar{x}_i$ in the gutter: $y_i(\bar{w} \cdot \bar{x}_i + b) - 1 = 0$
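A quick check (added here, not on the original slide) that the single labeled constraint recovers both cases:

```latex
y_i = +1:\quad y_i(\bar{w}\cdot\bar{x}_i + b) \ge 1 \iff \bar{w}\cdot\bar{x}_i + b \ge 1
y_i = -1:\quad y_i(\bar{w}\cdot\bar{x}_i + b) \ge 1 \iff \bar{w}\cdot\bar{x}_i + b \le -1
```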
4. How to compute $\bar{w}$?
[Figure: + and − samples on either side of the street; $\bar{x}_+$ and $\bar{x}_-$ lie on the gutters, connected by the vector $\bar{x}_+ - \bar{x}_-$.]
The width of the street is the projection of $\bar{x}_+ - \bar{x}_-$ onto the unit normal $\bar{w}/\|\bar{w}\|$. Using the gutter equalities $\bar{w} \cdot \bar{x}_+ = 1 - b$ and $\bar{w} \cdot \bar{x}_- = -1 - b$:
$\text{Width} = (\bar{x}_+ - \bar{x}_-) \cdot \dfrac{\bar{w}}{\|\bar{w}\|} = \dfrac{(1 - b) - (-1 - b)}{\|\bar{w}\|} = \dfrac{2}{\|\bar{w}\|}$
$\text{MAX}\ \dfrac{2}{\|\bar{w}\|} \;\rightarrow\; \text{MIN}\ \|\bar{w}\| \;\rightarrow\; \text{MIN}\ \dfrac{1}{2}\|\bar{w}\|^2$
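A quick numeric check (an illustrative example, not from the slide):

```latex
\bar{w} = (3, 4) \;\Rightarrow\; \|\bar{w}\| = \sqrt{3^2 + 4^2} = 5 \;\Rightarrow\; \text{Width} = \tfrac{2}{5} = 0.4
```

Shrinking $\|\bar{w}\|$ widens the street, which is why we minimize $\frac{1}{2}\|\bar{w}\|^2$.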
5. How to minimize it?
constraints: $y_i(\bar{w} \cdot \bar{x}_i + b) - 1 \geq 0$
If the constraints were equalities, we would call it an equality-constrained
problem and solve it with Lagrange multipliers.
The method of Lagrange multipliers is generalized by the Karush–
Kuhn–Tucker conditions, which can also take inequality
constraints into account.
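For later use (slide 14 substitutes into this), here is the standard Lagrangian of the SVM primal and the stationarity conditions obtained by setting its derivatives to zero:

```latex
L(\bar{w}, b, \alpha) = \frac{1}{2}\|\bar{w}\|^2 - \sum_i \alpha_i \left[ y_i(\bar{w} \cdot \bar{x}_i + b) - 1 \right]

\frac{\partial L}{\partial \bar{w}} = 0 \;\Rightarrow\; \bar{w} = \sum_i \alpha_i y_i \bar{x}_i
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0
```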
6. How to minimize it?
Karush–Kuhn–Tucker conditions
Conditions on the primal:
the Karush–Kuhn–Tucker conditions are first-
derivative tests (sometimes called
first-order necessary conditions) for a
solution in nonlinear programming to
be optimal, provided that some
regularity conditions are satisfied (i.e.,
the KKT conditions derive from the
relationship between primal and dual
when some regularity conditions are
satisfied).
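Spelled out for the SVM primal (a standard statement; the original slide showed these conditions as an image):

```latex
\begin{aligned}
&\text{Stationarity:} && \bar{w} = \textstyle\sum_i \alpha_i y_i \bar{x}_i, \quad \textstyle\sum_i \alpha_i y_i = 0 \\
&\text{Primal feasibility:} && y_i(\bar{w} \cdot \bar{x}_i + b) - 1 \ge 0 \\
&\text{Dual feasibility:} && \alpha_i \ge 0 \\
&\text{Complementary slackness:} && \alpha_i \left[ y_i(\bar{w} \cdot \bar{x}_i + b) - 1 \right] = 0
\end{aligned}
```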
7. Duality
We want to minimize the primal problem
by maximizing the dual problem.
Due to Slater's condition, the difference between
the optimal primal and dual values (the duality gap) equals zero.
If the optimal duality gap is zero, then
we say that strong duality holds.
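In symbols (standard notation, not from the slide), with $p^*$ the primal optimum and $d^*$ the dual optimum:

```latex
d^* \le p^* \quad \text{(weak duality, always holds)}
\qquad
p^* - d^* = 0 \quad \text{(strong duality / zero duality gap)}
```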
8. How to minimize it?
In order for a minimum point $x^*$ to
satisfy the above KKT conditions,
the problem should satisfy some regularity conditions;
the most common example for our purposes is
Slater's condition.
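Slater's condition stated concretely (a standard definition, not shown on the slide), for a convex problem with inequality constraints $g_i(x) \le 0$:

```latex
\exists\, x :\; g_i(x) < 0 \ \text{for all } i
\quad\Longrightarrow\quad \text{strong duality holds.}
```

In words: there must exist at least one strictly feasible point.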
9. How to minimize it?
For a problem with strong duality:
$x^*$ and $(u^*, v^*)$ are primal and dual solutions
$\iff$
$x^*$ and $(u^*, v^*)$ satisfy the KKT conditions.
10. How to minimize it?
We will find solutions that satisfy the KKT conditions;
because the SVM satisfies Slater's condition,
our solutions will be solutions of both the primal and the dual.
11. How to minimize it?
If the primal is a convex problem with a strictly feasible point, it satisfies Slater's condition.
Fortunately, the SVM satisfies Slater's condition,
and thus strong duality holds.
Strong duality leads us to the KKT conditions
(if $x^*$ satisfies the KKT conditions, then $x^*$
will be a solution of the primal).
Let $h_i(x^*) = y_i(\bar{w} \cdot \bar{x}_i + b) - 1$. Then an
important conclusion of satisfying the KKT conditions is that $\sum_i \alpha_i^* h_i(x^*) = 0$.
Since each term in this sum is nonnegative ($\alpha_i^* \ge 0$ and $h_i(x^*) \ge 0$) and the
sum is zero, we conclude that $\alpha_i^* h_i(x^*) = 0$ for all $i$.
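Why the sum is zero (filling in the strong-duality step the slide skips; a standard argument, with $f$ the primal objective and $g$ the dual function):

```latex
f(x^*) = g(\alpha^*) = \min_x L(x, \alpha^*) \le L(x^*, \alpha^*)
       = f(x^*) - \sum_i \alpha_i^* h_i(x^*) \le f(x^*)
```

Every inequality in the chain must therefore be an equality, forcing $\sum_i \alpha_i^* h_i(x^*) = 0$.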
13. Therefore, $\bar{w}$ is a linear combination of only a few of the $\bar{x}_i$: samples
with $\alpha_i = 0$ simply drop out of the sum. This means that the
algorithm is designed so that $\alpha_i$ is zero for samples $\bar{x}_i$
that do not affect the classification.
$\alpha_i \left[ y_i(\bar{w} \cdot \bar{x}_i + b) - 1 \right] = 0$
(complementary slackness, by the KKT conditions)
If $\alpha_i$ is non-zero, then $\bar{x}_i$ is a support vector (which lies on the
gutter).
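A minimal sketch of this sparsity (using scikit-learn, my choice of tool rather than one from the slides): after fitting, only the support vectors carry non-zero $\alpha_i y_i$, and $\bar{w}$ is recovered from them alone.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: two clusters of - and + samples.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2, size=(20, 2)),
               rng.normal(loc=+2, size=(20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

# A large C approximates the hard-margin SVM from the slides.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only;
# every other sample has alpha_i = 0 and drops out of the sum for w.
print("support vector indices:", clf.support_)
print("alpha_i * y_i:", clf.dual_coef_)

# w = sum_i alpha_i y_i x_i, recovered from the support vectors alone.
w = clf.dual_coef_ @ clf.support_vectors_
print("w from dual coefficients:", w, " vs clf.coef_:", clf.coef_)
```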
14. Substituting $\bar{w} = \sum_i \alpha_i y_i \bar{x}_i$ into the Lagrangian:
$L = \frac{1}{2}\left(\sum_i \alpha_i y_i \bar{x}_i\right) \cdot \left(\sum_j \alpha_j y_j \bar{x}_j\right) - \left(\sum_i \alpha_i y_i \bar{x}_i\right) \cdot \left(\sum_j \alpha_j y_j \bar{x}_j\right) - \sum_i \alpha_i y_i b + \sum_i \alpha_i$
The $b$ term vanishes because $\sum_i \alpha_i y_i = 0$, so
$\therefore\; L = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \,\bar{x}_i \cdot \bar{x}_j$
Maximization will depend only
on the dot product $\bar{x}_i \cdot \bar{x}_j$ of pairs of
support vectors!
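With the constraints restored (standard; the slide shows only the objective), the full dual problem is:

```latex
\max_{\alpha} \;\; \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \,\bar{x}_i \cdot \bar{x}_j
\quad \text{s.t.} \quad \alpha_i \ge 0, \;\; \sum_i \alpha_i y_i = 0
```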
15. Decision rule: if $\sum_i \alpha_i y_i \,\bar{x}_i \cdot \bar{u} + b > 0$, then classify $\bar{u}$ as +.
The decision rule only depends on the
dot product $\bar{x}_i \cdot \bar{u}$ of the support vectors
and the unknown sample. Thanks to
this property, we can perform the magical
kernel trick.
Kernel Tricks
$K(x_i, u) = \phi(x_i) \cdot \phi(u)$
1) Linear kernel: $K(x_i, u) = (x_i \cdot u + 1)^n$, $n = 1$
2) Non-linear kernel (RBF): $K(x_i, u) = e^{-\gamma \|x_i - u\|^2}$
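A sketch of the kernelized decision rule $\operatorname{sign}\left(\sum_i \alpha_i y_i K(\bar{x}_i, \bar{u}) + b\right)$ with the RBF kernel; the support vectors, labels, and multipliers below are hypothetical values for illustration only.

```python
import numpy as np

def rbf_kernel(x, u, gamma=0.5):
    """Non-linear RBF kernel K(x, u) = exp(-gamma * ||x - u||^2)."""
    return np.exp(-gamma * np.sum((x - u) ** 2))

def decision(u, support_x, support_y, alpha, b, kernel=rbf_kernel):
    """Kernelized SVM decision rule:
    sign( sum_i alpha_i * y_i * K(x_i, u) + b ).
    Only support vectors (alpha_i > 0) contribute to the sum."""
    score = sum(a * y * kernel(x, u)
                for a, y, x in zip(alpha, support_y, support_x))
    return 1 if score + b > 0 else -1

# Hypothetical support vectors, labels, and multipliers.
support_x = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]
support_y = [+1, -1]
alpha = [0.7, 0.7]
b = 0.0
print(decision(np.array([0.8, 1.2]), support_x, support_y, alpha, b))  # -> 1
```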
16. How is SVM different from Logit?
Logit calculates the 'distance' or 'error' with respect to the decision boundary for
each data point, while SVM calculates the distance from the decision boundary based
on the data divided into 'groups' of 0/1.
Although SVM is mathematically more elegant, it is inevitably very vulnerable to
outliers.
A typical SVM ignores values other than the support vectors, and by default hinge loss is
less sensitive than logistic loss, so can SVM really be considered more vulnerable to
outliers than Logit?
What if an outlier defines the support vector?
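For reference (standard definitions, not on the slide), the two losses as functions of the margin $z = y(\bar{w} \cdot \bar{x} + b)$:

```latex
\ell_{\text{hinge}}(z) = \max(0,\, 1 - z)
\qquad
\ell_{\text{logistic}}(z) = \log\left(1 + e^{-z}\right)
```

Hinge loss is exactly zero for $z \ge 1$, so points outside the margin are ignored entirely; logistic loss is positive for every point, so every point pulls on the boundary a little.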
21. Kernel Tricks
$K(x_i, u) = \phi(x_i) \cdot \phi(u)$
In the case of the Gaussian kernel, we get the same
effect as an embedding into infinitely many
dimensions (shown by Taylor expansion).
Kernel trick: explicitly expanding into a
high-dimensional space increases the computational
cost, so the kernel method is used to achieve the
same effect without actually scaling up the
dimension.
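The Taylor expansion the slide alludes to, written in one dimension for brevity (a standard derivation):

```latex
e^{-\gamma (x - u)^2}
= e^{-\gamma x^2} e^{-\gamma u^2} e^{2\gamma x u}
= e^{-\gamma x^2} e^{-\gamma u^2} \sum_{k=0}^{\infty} \frac{(2\gamma)^k}{k!}\, x^k u^k
```

Each term factors into a function of $x$ times the same function of $u$, so the kernel equals $\phi(x) \cdot \phi(u)$ for the infinite-dimensional feature map with components $\phi_k(x) = e^{-\gamma x^2} \sqrt{(2\gamma)^k / k!}\; x^k$.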
22. More on Kernels
A mathematical definition of a kernel is very
simple: if there is a mapping function $\phi$ into a
Hilbert space, the kernel can be defined immediately.
But the problem is $\phi$: how do we find $\phi$ on the
Hilbert space, and how do we even know that such a
$\phi$ exists?
->
By Mercer's Theorem, if a kernel K satisfies
certain conditions (symmetric, positive semi-definite, etc.), then K
can be decomposed into eigenfunctions and
eigenvalues. That is, if K satisfies the above
conditions, $\phi$ always exists, so we can create a
kernel even if we do not define $\phi$ explicitly.
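Mercer's expansion in symbols (standard; the original slide showed it as an image), with eigenvalues $\lambda_i \ge 0$ and eigenfunctions $\phi_i$:

```latex
K(x, y) = \sum_{i=1}^{\infty} \lambda_i\, \phi_i(x)\, \phi_i(y),
\qquad
\phi(x) = \left( \sqrt{\lambda_1}\,\phi_1(x),\; \sqrt{\lambda_2}\,\phi_2(x),\; \dots \right)
```

so that $K(x, y) = \phi(x) \cdot \phi(y)$.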
23. More on Kernels
The expansion above indicates that if a kernel K is
positive semi-definite, it can be decomposed into
eigenfunctions and eigenvalues.
*It is an infinite summation formula.
Remember when we proved the Fourier series:
the sum is just an approximation if it is not infinite (important).
25. RKHS
Informal definition of RKHS: a Hilbert space with a reproducing kernel is an RKHS.
Formal definition of RKHS:
the evaluation functional $L_x = \langle \cdot,\, k(\cdot, x) \rangle$ satisfies $L_x(f) = f(x)$.
In other words, if putting a function $f$ into the evaluation functional $L_x$ can be reproduced by a dot
product of the vector $k(\cdot, x)$ with $f$, then the reproducing kernel property is satisfied.
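The two defining identities side by side (standard notation, with $\mathcal{H}$ the RKHS); the second follows from the first by taking $f = k(\cdot, y)$:

```latex
\langle f,\; k(\cdot, x) \rangle_{\mathcal{H}} = f(x)
\qquad
\langle k(\cdot, x),\; k(\cdot, y) \rangle_{\mathcal{H}} = k(x, y)
```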
27. RKHS
WHY IS RKHS GOOD?
The evaluation of $f$ at $x$ is
expressed in terms of the eigenfunctions $\phi$ of
the kernel.
28. RKHS
Another view on RKHS, via the Moore–Aronszajn theorem: the RKHS itself lines up with the kernel
existence conditions. In other words, if we define a kernel K that satisfies those conditions, there is a unique
reproducing kernel Hilbert space H corresponding to K.
The shape of f is affected by the kernel, because the new function f is also decomposed by the eigenfunctions of
the kernel.
e.g.) GPR:
a new function f with a shape similar to the Gaussian kernel is created.
Thus, using the kernel method means fitting our data to the kernel function, so choosing the proper kernel is
what we must focus on in the kernel method.
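A sketch of the GPR remark (illustrative, using only NumPy): functions drawn from a Gaussian process with an RBF kernel inherit the kernel's smooth, Gaussian-bump shape.

```python
import numpy as np

def rbf_kernel_matrix(xs, gamma=0.5):
    """Gram matrix K[i, j] = exp(-gamma * (x_i - x_j)^2) for 1-D inputs."""
    d = xs[:, None] - xs[None, :]
    return np.exp(-gamma * d ** 2)

# Sample functions f ~ GP(0, K): their smoothness comes from the kernel,
# so the draws look like superpositions of Gaussian bumps.
xs = np.linspace(-5, 5, 100)
K = rbf_kernel_matrix(xs) + 1e-8 * np.eye(len(xs))  # jitter for stability
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(xs)), K, size=3)
print(samples.shape)  # (3, 100): three smooth random functions on the grid
```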
https://www.edwith.org/bayesiandeeplearning/lecture/24684/
https://patternsofideas.wordpress.com/2016/12/12/mercers-theorem-and-svms/
https://bi.snu.ac.kr/Publications/Conferences/Domestic/KCC2005_LeeSK.pdf
https://stats.stackexchange.com/questions/268429/do-gaussian-process-regression-have-the-universal-approximation-property
http://iera.name/a-story-of-basis-and-kernel-part-i-function-basis/
https://youtu.be/7JRwjCpKewQ
30. References
• Strongly based on MIT 6.034 Artificial Intelligence, Fall 2010,
  Instructor: Patrick Winston
  - video link: https://youtu.be/_PwhiWxHK8o
• Trevor Hastie et al., The Elements of Statistical Learning (2001)
• Machine Learning Lecture 26 "Gaussian Processes", Cornell CS4780 SP17, by Kilian Weinberger
  - video link: https://www.youtube.com/watch?v=R-NUdqxKjos&t=1000s
• 9.520/6.860S Statistical Learning Theory by Lorenzo Rosasco
  http://www.mit.edu/~9.520/fall14/slides/class03/class03_rkhsPart1.pdf
  - video link: https://www.youtube.com/watch?v=9-oxo_k69qs
• Bayesian Deep Learning by Sungjoon Choi
  - video link: https://www.edwith.org/bayesiandeeplearning/joinLectures/14426