These slides are the course project presentation for INFSCI 2915 Machine Learning Foundations. The presentation reviews how L2 regularization is applied to linear models, including linear regression, logistic regression, the support vector machine, and the perceptron learning algorithm. It then reviews the quadratic programming problem, using the SVM as an example to illustrate the relation between the primal and dual problems. Finally, it states the representer theorem and uses it to connect the kernel trick to L2-regularized linear models.
Linear Machine Learning Models with L2 Regularization and Kernel Tricks
1. Linear Machine Learning Models with L2-Regularization and Kernel Tricks
Fengtao Wu
University of Pittsburgh
few14@pitt.edu
November 15, 2016
Fengtao Wu (Pitt) Machine Learning Foundations November 15, 2016 1 / 46
2. Outline
1 Feature Transformation
Overfitting
2 L2-Regularization for Linear Models
Linear Models
L2-Regularization
3 Quadratic Programming
Standard Form
QP Solver
Example: SVM
4 Kernel Trick and L2-Regularization
Representer Theorem
Kernelized Ridge Regression
Kernelized L2-Regularized Logistic Regression
Support Vector Regression
3. Feature Transformation
Original Dataset
After the data cleaning process, we have the original dataset:
(x_i, y_i), i = 1, 2, ..., N
The feature vector x_i has m dimensions:
x_i = (x_{i1}, x_{i2}, ..., x_{im})^T, i = 1, 2, ..., N
Feature Transformation
The original feature vector x_i is transformed to a new feature vector z_i with l dimensions by the transformation φ : x_i → z_i:
φ(x_i) = z_i = (z_{i1}, z_{i2}, ..., z_{il})^T, i = 1, 2, ..., N
4. Feature Transformation
Example: Quadratic Transformation
The original feature vector x_i is
x_i = (x_{i1}, x_{i2})^T, i = 1, 2, ..., N
The new feature vector φ(x_i) is
φ(x_i) = (x_{i1}, x_{i2}, x_{i1}^2, x_{i1}x_{i2}, x_{i2}^2)^T, i = 1, 2, ..., N
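The quadratic transformation above can be sketched in a few lines of NumPy; `quadratic_transform` is an illustrative name, not from the slides.

```python
import numpy as np

def quadratic_transform(x):
    """Map (x1, x2) to the quadratic features (x1, x2, x1^2, x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

z = quadratic_transform(np.array([2.0, 3.0]))
# z = [2, 3, 4, 6, 9]
```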
5. Overfitting
The feature transformation may cause the overfitting problem:
φ(x_i) = (x_i, x_i^2, x_i^3, ..., x_i^{10})^T, i = 1, 2, ..., N
Figure: Polynomial transformation causes the overfitting problem, from [Wikipedia].
6. Linear Models
Linear Regression
Redefine z_i ≡ (1, z_i)^T; the linear regression model is
h(z) = ω^T z
The error function of the model is
E_in(h) = (1/N) Σ_{i=1}^N (h(z_i) − y_i)^2
E_in(ω) = (1/N) (Zω − y)^T (Zω − y)
The analytic solution for the optimal weight vector ω* is
ω* = (Z^T Z)^{-1} Z^T y
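As a quick check of the closed form, a minimal NumPy sketch (the toy data are invented here; a linear solve replaces the explicit inverse for numerical stability):

```python
import numpy as np

# Toy data lying exactly on y = 1 + 2z; Z already carries the bias column.
Z = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# ω* = (Zᵀ Z)⁻¹ Zᵀ y, computed via a linear solve instead of an inverse
w = np.linalg.solve(Z.T @ Z, Z.T @ y)
# w ≈ [1, 2]
```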
7. Linear Models
Logistic Regression
Redefine z_i ≡ (1, z_i)^T; the logistic regression model is
h(z) = 1/(1 + exp(−ω^T z))
P(y|z) = h(z) if y = +1,  1 − h(z) if y = −1
The error function of the model follows from maximum likelihood:
max_h Π_{i=1}^N h(y_i z_i) ⇐⇒ min_ω Σ_{i=1}^N ln(1 + exp(−y_i ω^T z_i))
E_in(ω) = (1/N) Σ_{i=1}^N ln(1 + exp(−y_i ω^T z_i))
8. Linear Models
PLA and SVM
The PLA and SVM models have the same response function:
h_{PLA/SVM}(z) = sign(ω^T z + b)
The error function of the model counts misclassifications:
E_in(h) = Σ_{i=1}^N χ_{h(z_i) ≠ y_i}(z_i)
E_in(ω) = Σ_{i=1}^N χ_{sign(ω^T z_i + b) ≠ y_i}(z_i)
9. Regularization
Regularization: add constraints when seeking ω* to avoid overfitting.
Constraint 1
min_{ω ∈ ℝ^{11}} E_in(ω)  s.t. ω_3 = ··· = ω_{10} = 0
Constraint 2
min_{ω ∈ ℝ^{11}} E_in(ω)  s.t. Σ_{i=0}^{10} χ_{ω_i ≠ 0}(ω_i) ≤ 3
Regularization by constraint
min_{ω ∈ ℝ^{11}} E_in(ω)  s.t. Σ_{i=0}^{10} ω_i^2 ≤ C
10. Geometric Interpretation of L2-Regularization
L2-Regularization
The optimal ω* satisfies:
−∇E_in(ω*) ∥ ω*
i.e. ∇E_in(ω*) + (2λ/N) ω* = 0
i.e. min_ω E_in(ω) + (λ/N) ω^T ω
Geometric Interpretation
Figure: Interpretation of L2-regularization, from [Lin].
11. L2-Regularized Linear Regression
Ridge Regression
The error function of the linear regression model is
E_in(ω) = (1/N) (Zω − y)^T (Zω − y)
According to ∇E_in(ω*) + (2λ/N) ω* = 0,
the analytic solution for the optimal weight vector ω* is
ω* = (Z^T Z + λI)^{-1} Z^T y
compared to the optimal weight vector of the plain linear regression model,
ω* = (Z^T Z)^{-1} Z^T y
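A minimal sketch comparing the two closed forms on invented toy data; with λ > 0 the ridge weights are shrunk relative to least squares:

```python
import numpy as np

Z = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
lam = 0.5

# Ridge regression: ω* = (Zᵀ Z + λI)⁻¹ Zᵀ y
w_ridge = np.linalg.solve(Z.T @ Z + lam * np.eye(2), Z.T @ y)

# Plain linear regression: ω* = (Zᵀ Z)⁻¹ Zᵀ y
w_ols = np.linalg.solve(Z.T @ Z, Z.T @ y)

# ‖w_ridge‖ < ‖w_ols‖: the regularizer shrinks the weight vector
```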
12. L2-Regularized Logistic Regression
L2-Regularized Logistic Regression
The error function of the logistic regression model is
E_in(ω) = (1/N) Σ_{i=1}^N ln(1 + exp(−y_i ω^T z_i))
According to min_ω E_in(ω) + (λ/N) ω^T ω,
the optimal weight vector ω* is the solution to the unconstrained problem:
min_ω (1/N) Σ_{i=1}^N ln(1 + exp(−y_i ω^T z_i)) + (λ/N) ω^T ω
The problem is solved by gradient descent.
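A gradient-descent sketch of this objective; the toy data, step size, and iteration count are all invented for illustration:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def fit_l2_logreg(Z, y, lam=0.1, eta=0.5, steps=2000):
    """Gradient descent on (1/N) Σ ln(1 + exp(-y_i ωᵀz_i)) + (λ/N) ωᵀω."""
    N, d = Z.shape
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (Z @ w)  # y_i ωᵀz_i
        grad = -(Z.T @ (y * sigmoid(-margins))) / N + (2.0 * lam / N) * w
        w -= eta * grad
    return w

# Separable toy data with labels in {-1, +1}; first column is the bias.
Z = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w = fit_l2_logreg(Z, y)
# sign(Zω) reproduces the labels on this toy set
```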
13. L2-Regularizer
Weight-Decay Regularizer
The L2 regularizer (λ/N) ω^T ω is also called the weight-decay regularizer.
Larger λ ⇐⇒ shorter ω ⇐⇒ smaller C
Is switching the objective function and the constraints just a coincidence?
L2-Regularization
min_{ω ∈ ℝ^{l+1}} E_in(ω)  s.t. Σ_{i=0}^{l} ω_i^2 ≤ C
Hard-Margin SVM
min_{b,ω} (1/2) ω^T ω  s.t. y_n(ω^T z_n + b) ≥ 1, n = 1, 2, ..., N
14. Quadratic Programming [Burke]
QP Standard Form
The standard form of the quadratic program Q is
min_x (1/2) x^T Q x + c^T x  s.t. Ax ≥ b
where A ∈ ℝ^{m×n}, b ∈ ℝ^m, and the matrix Q is symmetric. In the QP standard form, the number of unknown variables is n and the number of constraints is m.
15. Quadratic Programming
Lagrangian Function
The Lagrangian function is
L(x, λ) = (1/2) x^T Q x + c^T x − λ^T (Ax − b)
where λ ≥ 0; the components of λ are called Lagrange multipliers.
Karush-Kuhn-Tucker Conditions for Q
A pair (x, λ) ∈ ℝ^n × ℝ^m is said to be a Karush-Kuhn-Tucker pair for the quadratic program Q if and only if the following conditions are satisfied:
Ax ≥ b (primal feasibility)
λ ≥ 0, c + Qx − A^T λ ≥ 0 (dual feasibility)
λ^T (Ax − b) = 0, x^T (c + Qx − A^T λ) = 0 (complementarity)
16. Theorem
Theorem (First-Order Necessary Conditions for Optimality in QP)
If x* solves Q, then there exists a vector λ* ∈ ℝ^m such that (x*, λ*) is a KKT pair for Q.
Theorem (Necessary and Sufficient Conditions for Optimality in Convex QP)
If Q is symmetric and positive semi-definite, then x* solves Q if and only if there exists a vector λ* ∈ ℝ^m such that (x*, λ*) is a KKT pair for Q.
17. QP Solver [Hoppe]
Direct Solution
Symmetric Indefinite Factorization
Range-Space Approach
Null-Space Approach
Iterative Solution
Krylov Methods
Transforming Range-Space Iterations
Transforming Null-Space Iterations
Active Set Strategies for Convex QP Problems
Primal Active Set Strategies
Primal-dual Active Set Strategies
18. Hard-Margin SVM Primal
Hard-Margin SVM Primal: solve l + 1 variables under N constraints
Hard-Margin SVM Primal
min_{b,ω} (1/2) ω^T ω  s.t. y_n(ω^T z_n + b) ≥ 1, n = 1, 2, ..., N
QP Standard Form
min_x (1/2) x^T Q x + c^T x  s.t. Ax ≥ b
QP Solver
x = (b, ω^T)^T
Q = [ 0        0_{1×l} ]
    [ 0_{l×1}  I_{l×l} ]
c = 0
a_n = (y_n, y_n z_n^T)^T,  A = [a_1^T; ...; a_N^T]
b_n = 1,  b = (b_1, ..., b_N)^T
x ← QP(Q, c, A, b)
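The matrix bookkeeping on this slide can be sketched in NumPy. This only constructs (Q, c, A, b); the actual QP(Q, c, A, b) solver routine is assumed to exist elsewhere, and the tiny dataset is invented:

```python
import numpy as np

def hard_margin_svm_qp(Z, y):
    """Map the hard-margin SVM primal onto min 0.5 xᵀQx + cᵀx s.t. Ax ≥ b,
    with x = (b, ω)ᵀ as on the slide."""
    N, l = Z.shape
    Q = np.zeros((l + 1, l + 1))
    Q[1:, 1:] = np.eye(l)                        # penalize ω only, not b
    c = np.zeros(l + 1)
    A = np.hstack([y[:, None], y[:, None] * Z])  # row n is (y_n, y_n z_nᵀ)
    b = np.ones(N)                               # y_n(ωᵀz_n + b) ≥ 1
    return Q, c, A, b

Z = np.array([[0.0, 1.0], [1.0, 0.0]])
y = np.array([1.0, -1.0])
Q, c, A, b = hard_margin_svm_qp(Z, y)
```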
19. Hard-Margin SVM Primal Lagrangian Function
Hard-Margin SVM Primal Lagrangian Function
The Lagrangian function is
L(b, ω, α) = (1/2) ω^T ω + Σ_{n=1}^N α_n [1 − y_n(ω^T z_n + b)]
where α ≥ 0; the α_n are called Lagrange multipliers.
Hard-Margin SVM Primal Equivalent
Hard-Margin SVM Primal ⇐⇒ min_{b,ω} (max_α L(b, ω, α))
20. Duality
Weak Duality
min_{b,ω} (max_α L(b, ω, α)) ≥ max_α (min_{b,ω} L(b, ω, α))
Strong Duality
min_{b,ω} (max_α L(b, ω, α)) = max_α (min_{b,ω} L(b, ω, α))
Conclusion for Strong Duality
Strong duality holds for a quadratic programming problem if
the objective function of the primal problem is convex,
the constraints of the primal problem are feasible, and
the constraints of the primal problem are linear.
21. Hard-Margin SVM Dual
KKT Conditions for Hard-Margin SVM
Primal feasibility: y_n(ω^T z_n + b) ≥ 1
Dual feasibility: α ≥ 0
Primal-inner optimality: α_n(1 − y_n(ω^T z_n + b)) = 0
Dual-inner optimality: Σ_{n=1}^N y_n α_n = 0,  ω = Σ_{n=1}^N α_n y_n z_n
Hard-Margin SVM Dual
min_α (1/2) Σ_{n=1}^N Σ_{m=1}^N α_n α_m y_n y_m z_n^T z_m − Σ_{n=1}^N α_n
s.t. Σ_{n=1}^N y_n α_n = 0,  α ≥ 0
22. Hard-Margin SVM Dual
Hard-Margin SVM Dual: solve N variables under N + 1 constraints
Hard-Margin SVM Dual
min_α (1/2) Σ_{n=1}^N Σ_{m=1}^N α_n α_m y_n y_m z_n^T z_m − Σ_{n=1}^N α_n
s.t. Σ_{n=1}^N y_n α_n = 0,  α ≥ 0
QP Solver
α ← QP(Q, c, A, b)
ω = Σ_{n=1}^N α_n y_n z_n
b = y_n − ω^T z_n for any support vector (any n with α_n > 0)
23. Soft-Margin SVM
Soft-Margin SVM Primal: solve l + 1 + N variables under 2N constraints
Soft-Margin SVM Dual: solve N variables under 2N + 1 constraints
Soft-Margin SVM Primal
min_{b,ω} (1/2) ω^T ω + C Σ_{n=1}^N ξ_n
s.t. y_n(ω^T z_n + b) ≥ 1 − ξ_n,  ξ_n ≥ 0,  n = 1, 2, ..., N
Soft-Margin SVM Dual
min_α (1/2) Σ_{n=1}^N Σ_{m=1}^N α_n α_m y_n y_m z_n^T z_m − Σ_{n=1}^N α_n
s.t. Σ_{n=1}^N y_n α_n = 0,  0 ≤ α ≤ C
24. Soft-Margin SVM
Unconstrained Form of Soft-Margin SVM Primal
Soft-Margin SVM Primal ⇐⇒ min_{b,ω} (1/2) ω^T ω + C Σ_{n=1}^N max(1 − y_n(ω^T z_n + b), 0)
25. SVM and L2-Regularization
                              Minimize                       Constraint
Regularization by constraint  E_in(h)                        ω^T ω ≤ C
Hard-Margin SVM               ω^T ω                          E_in(h) = 0
L2 Regularization             E_in(h) + (λ/N) ω^T ω          (unconstrained)
Soft-Margin SVM               (1/2) ω^T ω + C·N·E_in(h)      (unconstrained)
Table: SVM as Regularized Model
View SVM as Regularized Model
large margin ⇐⇒ L2 regularization of short ω
larger C ⇐⇒ smaller λ (less regularization)
26. Probabilistic SVM [Platt, 1999]
Platt’s Model of Probabilistic SVM for Soft-Binary Classification
The probabilistic SVM model is
θ(s) = 1/(1 + exp(−s))
h(x) = θ(A(ω_SVM^T Φ(x) + b_SVM) + B)
P(y|x) = h(x) if y = +1,  1 − h(x) if y = −1
A functions as scaling: often A > 0 if ω_SVM is reasonably good
B functions as shifting: often B ≈ 0 if b_SVM is reasonably good
Two-level learning: logistic regression on SVM-transformed data
27. Kernel Trick
SVM Dual
q_{n,m} = y_n y_m z_n^T z_m = y_n y_m Φ(x_n)^T Φ(x_m)
h_SVM(x) = sign(ω^T Φ(x) + b) = sign(Σ_{n=1}^N α_n y_n Φ(x_n)^T Φ(x) + b)
Kernel Trick
K_Φ : X × X → ℝ,  K_Φ(x_n, x_m) = Φ(x_n)^T Φ(x_m)
q_{n,m} = y_n y_m z_n^T z_m = y_n y_m K_Φ(x_n, x_m)
h_SVM(x) = sign(ω^T Φ(x) + b) = sign(Σ_{n=1}^N α_n y_n K_Φ(x_n, x) + b)
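A sketch of the kernelized decision function. Here α and b are assumed to come from a dual QP solver; the tiny hand-picked support vectors and multipliers below are purely illustrative:

```python
import numpy as np

def gaussian_kernel(x1, x2, gamma=1.0):
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def svm_predict(x, X_sv, y_sv, alpha, b, kernel):
    """h_SVM(x) = sign(Σ_n α_n y_n K(x_n, x) + b), without ever forming Φ."""
    s = sum(a * yn * kernel(xn, x) for a, yn, xn in zip(alpha, y_sv, X_sv))
    return np.sign(s + b)

X_sv = np.array([[1.0], [-1.0]])   # hypothetical support vectors
y_sv = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])       # hypothetical multipliers
pred = svm_predict(np.array([2.0]), X_sv, y_sv, alpha, 0.0, gaussian_kernel)
# pred = 1.0: the point falls on the positive side
```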
28. Kernel Trick
General Polynomial Kernel
K_Q(x_1, x_2) = (ξ + γ x_1^T x_2)^Q
corresponding to a Qth-order polynomial transformation
Gaussian Kernel
K(x_1, x_2) = exp(−γ ‖x_1 − x_2‖^2)
What is the corresponding transformation Φ?
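Both kernels written out in NumPy (parameter defaults are illustrative):

```python
import numpy as np

def poly_kernel(x1, x2, xi=1.0, gamma=1.0, Q=2):
    """General polynomial kernel K_Q(x1, x2) = (ξ + γ x1ᵀx2)^Q."""
    return (xi + gamma * np.dot(x1, x2)) ** Q

def gaussian_kernel(x1, x2, gamma=1.0):
    """Gaussian kernel exp(-γ‖x1 - x2‖²); its Φ is infinite-dimensional."""
    return np.exp(-gamma * np.linalg.norm(x1 - x2) ** 2)

x1, x2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
k_poly = poly_kernel(x1, x2)       # (1 + 0)^2 = 1
k_gauss = gaussian_kernel(x1, x2)  # exp(-2)
```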
30. Mercer’s Condition [Wikipedia]
Mercer’s Condition
The function K_Φ : X × X → ℝ, K_Φ(x_i, x_j) = Φ(x_i)^T Φ(x_j), is symmetric.
Define
Z = [Φ(x_1)^T; ...; Φ(x_N)^T],  K = ZZ^T
The matrix K is positive semi-definite.
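Both conditions can be checked numerically on a Gram matrix; this is a finite-sample check on one dataset, not a proof over all of X:

```python
import numpy as np

def is_mercer_gram(K, tol=1e-10):
    """Check symmetry and positive semi-definiteness of a Gram matrix."""
    if not np.allclose(K, K.T):
        return False
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

x = np.array([0.0, 1.0, 2.0])
K = np.exp(-np.subtract.outer(x, x) ** 2)  # Gaussian-kernel Gram matrix
ok = is_mercer_gram(K)                     # True for a valid kernel
```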
31. Representer Theorem
[Schölkopf, Herbrich and Smola, 2001]
Theorem (Representer Theorem)
For any L2-regularized linear model,
min_ω (λ/N) ω^T ω + (1/N) Σ_{n=1}^N E_in(y_n, ω^T z_n)
the optimal ω* satisfies:
ω* = Σ_{n=1}^N α_n z_n = Z^T α
32. Representer Theorem
Proof
Decompose ω* = ω_∥ + ω_⊥, where
ω_∥ ∈ span(z_1, ..., z_N),  ω_⊥ ⊥ span(z_1, ..., z_N)
If ω_⊥ ≠ 0:
E_in(y_n, ω*^T z_n) = E_in(y_n, ω_∥^T z_n)
however
ω*^T ω* = ω_∥^T ω_∥ + ω_⊥^T ω_⊥ > ω_∥^T ω_∥ (contradiction!)
So ω_⊥ = 0 and ω* lies in span(z_1, ..., z_N).
33. Representer Theorem
Conclusion
The optimal ω* of an L2-regularized linear model is a linear combination of the z_n, n = 1, 2, ..., N, i.e. the optimal ω* is represented by the data.
SVM Dual
ω_SVM = Σ_{n=1}^N α_n y_n z_n
PLA
ω_PLA = Σ_{n=1}^N β_n y_n z_n  (if ω_0 = 0)
34. Representer Theorem
Kernelized Linear Model
For any such linear model,
ω*^T z = Σ_{n=1}^N α_n z_n^T z = Σ_{n=1}^N α_n K_Φ(x_n, x)
Any L2-regularized linear model can be kernelized.
Kernelized Linear Model
Kernelized Ridge Regression
Kernelized Logistic Regression
Kernelized Support Vector Regression
35. Kernelized Ridge Regression
Kernelized Ridge Regression
min_ω (λ/N) ω^T ω + (1/N) Σ_{n=1}^N (y_n − ω^T z_n)^2
⇒ min_α (λ/N) Σ_{n=1}^N Σ_{m=1}^N α_n α_m K(x_n, x_m) + (1/N) Σ_{n=1}^N (y_n − Σ_{m=1}^N α_m K(x_m, x_n))^2
⇒ min_α (λ/N) α^T K α + (1/N) (y − Kα)^T (y − Kα) ≡ min_α E_aug(α)
⇒ ∂E_aug(α)/∂α = (2/N) K^T ((λI + K)α − y) = 0
36. Kernelized Ridge Regression
Kernelized Ridge Regression
The optimal α is
α = (λI + K)^{-1} y
The optimal ω* is
ω* = Z^T (λI + K)^{-1} y
compared to the optimal ω* of ridge regression,
ω* = (λI + Z^T Z)^{-1} Z^T y
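With the linear kernel K = ZZᵀ, the kernelized route and the primal ridge route yield the same ω* (the push-through identity); a quick NumPy verification on random toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 3))
y = rng.normal(size=6)
lam = 0.5

# Kernelized route: α = (λI + K)⁻¹ y with K = ZZᵀ, then ω* = Zᵀα
K = Z @ Z.T
alpha = np.linalg.solve(lam * np.eye(6) + K, y)
w_kernel = Z.T @ alpha

# Primal ridge route: ω* = (λI + ZᵀZ)⁻¹ Zᵀ y
w_primal = np.linalg.solve(lam * np.eye(3) + Z.T @ Z, Z.T @ y)
# the two agree to numerical precision
```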
37. Kernelized L2-Regularized Logistic Regression
Kernelized L2-Regularized Logistic Regression
min_ω (λ/N) ω^T ω + (1/N) Σ_{n=1}^N ln(1 + exp(−y_n ω^T z_n))
⇒ min_α (λ/N) Σ_{n=1}^N Σ_{m=1}^N α_n α_m K(x_n, x_m) + (1/N) Σ_{n=1}^N ln(1 + exp(−y_n Σ_{m=1}^N α_m K(x_m, x_n)))
A linear model in ω with the transformation embedded in the kernel and an L2 regularizer;
a linear model in α with the kernel as transformation and a kernelized regularizer.
38. Tube Regression
Tube Regression
The tube regression model is
h(z) = ω^T z + b
The error function of the model is
E_in(h) = Σ_{i=1}^N max(0, |h(z_i) − y_i| − ε)
Figure: Interpretation of tube regression, from [ResearchGate].
39. L2-Regularized Tube Regression
L2-Regularized Tube Regression
The L2-regularized tube regression model is
min_{b,ω} Σ_{i=1}^N max(0, |h(z_i) − y_i| − ε) + (λ/N) ω^T ω
Remember?
Unconstrained Form of Soft-Margin SVM Primal
Soft-Margin SVM Primal ⇐⇒ min_{b,ω} (1/2) ω^T ω + C Σ_{n=1}^N max(1 − y_n(ω^T z_n + b), 0)
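The ε-insensitive (tube) loss in NumPy, with ε = 0.5 chosen for illustration:

```python
import numpy as np

def tube_loss(residuals, eps=0.5):
    """max(0, |h(z) - y| - ε): errors inside the tube cost nothing."""
    return np.maximum(0.0, np.abs(residuals) - eps)

r = np.array([0.2, -0.4, 1.0, -2.0])  # residuals h(z_i) - y_i
losses = tube_loss(r)                  # [0, 0, 0.5, 1.5]
```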
40. Support Vector Regression [Welling, 2004]
SVR Primal: solve l + 1 + 2N variables under 4N constraints
Support Vector Regression Primal
min_{b,ω,ξ} (1/2) ω^T ω + C Σ_{n=1}^N ξ_n
s.t. |ω^T z_n + b − y_n| ≤ ε + ξ_n,  ξ_n ≥ 0,  n = 1, 2, ..., N
Support Vector Regression Primal Refinement
min_{b,ω,ξ̂,ξ̌} (1/2) ω^T ω + C Σ_{n=1}^N (ξ̂_n + ξ̌_n)
s.t. −b − ω^T z_n + ξ̂_n ≥ −ε − y_n
     b + ω^T z_n + ξ̌_n ≥ −ε + y_n
     ξ̂_n ≥ 0,  ξ̌_n ≥ 0,  n = 1, 2, ..., N
41. Support Vector Regression
SVR Dual: solve 2N variables under 4N + 1 constraints
Kernelized Support Vector Regression Dual
min_{α̂,α̌} (1/2) Σ_{n=1}^N Σ_{m=1}^N (α̂_n − α̌_n)(α̂_m − α̌_m) k_{n,m} + Σ_{n=1}^N [(ε − y_n) α̂_n + (ε + y_n) α̌_n]
s.t. Σ_{n=1}^N (α̂_n − α̌_n) = 0
     0 ≤ α̂_n ≤ C,  0 ≤ α̌_n ≤ C,  n = 1, 2, ..., N
42. Support Vector Regression
KKT Conditions of Support Vector Regression
Primal-inner optimality:
α̂_n(ε + ξ̂_n − y_n + ω^T z_n + b) = 0
α̌_n(ε + ξ̌_n + y_n − ω^T z_n − b) = 0
Dual-inner optimality:
ω = Σ_{n=1}^N (α̂_n − α̌_n) z_n,  Σ_{n=1}^N (α̂_n − α̌_n) = 0
44. References
Platt, J. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3), 61–74.
Schölkopf, B., Herbrich, R. and Smola, A. J. (2001). A generalized representer theorem. In International Conference on Computational Learning Theory, 416–426.
Welling, M. (2004). Support vector regression.
45. Q & A