Linear Machine Learning Models with L2 Regularization and Kernel Tricks
These slides cover linear machine learning models, focusing on L2-regularization and the kernel trick. They discuss feature transformation, overfitting, linear regression, logistic regression, and the quadratic programming machinery behind support vector machines (SVMs). Key concepts such as regularization, duality in optimization, and popular kernel functions are also covered.
Linear Machine Learning Models with L2-Regularization and Kernel Tricks
Fengtao Wu
University of Pittsburgh
few14@pitt.edu
November 15, 2016
Outline
1. Feature Transformation
   - Overfitting
2. L2-Regularization for Linear Models
   - Linear Models
   - L2-Regularization
3. Quadratic Programming
   - Standard Form
   - QP Solver
   - Example: SVM
4. Kernel Trick and L2-Regularization
   - Representer Theorem
   - Kernelized Ridge Regression
   - Kernelized L2-Regularized Logistic Regression
   - Support Vector Regression
Feature Transformation

Original Dataset
After the data cleaning process, we have the original dataset
$(x_i, y_i), \quad i = 1, 2, \ldots, N$
The feature vector $x_i$ has $m$ dimensions:
$x_i = (x_{i1}, x_{i2}, \ldots, x_{im})^T, \quad i = 1, 2, \ldots, N$

Feature Transformation
The original feature vector $x_i$ is transformed into a new feature vector $z_i$ with $l$ dimensions by the transformation $\phi : x_i \mapsto z_i$:
$\phi(x_i) = z_i = (z_{i1}, z_{i2}, \ldots, z_{il})^T, \quad i = 1, 2, \ldots, N$
Feature Transformation

Example: Quadratic Transformation
The original feature vector $x_i$ is
$x_i = (x_{i1}, x_{i2})^T, \quad i = 1, 2, \ldots, N$
The new feature vector $\phi(x_i)$ is
$\phi(x_i) = (x_{i1}, x_{i2}, x_{i1}^2, x_{i1}x_{i2}, x_{i2}^2)^T, \quad i = 1, 2, \ldots, N$
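As a concrete illustration (not from the slides), a minimal NumPy sketch of this quadratic transformation applied row-wise to a data matrix might look like:

```python
# A minimal sketch (not from the slides): the quadratic transformation
# phi(x) = (x1, x2, x1^2, x1*x2, x2^2)^T applied to every row of a data matrix.
import numpy as np

def quadratic_transform(X):
    """Map each row (x1, x2) to (x1, x2, x1^2, x1*x2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x1 * x2, x2**2])

X = np.array([[1.0, 2.0], [3.0, 4.0]])  # toy data: N = 2, m = 2
Z = quadratic_transform(X)              # shape (2, 5), so l = 5
```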
Overfitting

The feature transformation may cause the overfitting problem, e.g. the degree-10 polynomial transformation
$\phi(x_i) = (x_i, x_i^2, x_i^3, \ldots, x_i^{10})^T, \quad i = 1, 2, \ldots, N$

Figure: Polynomial transformation causes the overfitting problem, from [Wikipedia].
Linear Models

Linear Regression
Redefine $z_i \equiv (1, z_i)^T$; the linear regression model is
$h(z) = \omega^T z$
The error function of the model is
$E_{in}(h) = \frac{1}{N} \sum_{i=1}^{N} (h(z_i) - y_i)^2$
$E_{in}(\omega) = \frac{1}{N} (Z\omega - y)^T (Z\omega - y)$
The analytic solution for the optimal weight vector $\omega^*$ is
$\omega^* = (Z^T Z)^{-1} Z^T y$
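A minimal NumPy sketch of this closed-form solution on illustrative synthetic data (`Z` is assumed to already contain a leading column of ones):

```python
# A minimal sketch of the closed form w* = (Z^T Z)^{-1} Z^T y on synthetic data.
import numpy as np

def fit_linear_regression(Z, y):
    # Solving the normal equations is more stable than forming the inverse.
    return np.linalg.solve(Z.T @ Z, Z.T @ y)

rng = np.random.default_rng(0)
Z = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])  # leading 1s column
true_w = np.array([0.5, 1.0, -2.0, 3.0])
y = Z @ true_w + 0.1 * rng.normal(size=50)
w_star = fit_linear_regression(Z, y)    # close to true_w
```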
Linear Models

Logistic Regression
Redefine $z_i \equiv (1, z_i)^T$; the logistic regression model is
$h(z) = 1/(1 + \exp(-\omega^T z))$
$P(y \mid z) = h(z)$ if $y = +1$, and $P(y \mid z) = 1 - h(z)$ if $y = -1$.
The error function of the model comes from maximum likelihood:
$\max_h \prod_{i=1}^{N} h(y_i z_i) \iff \min_\omega \sum_{i=1}^{N} \ln(1 + \exp(-y_i \omega^T z_i))$
$E_{in}(\omega) = \frac{1}{N} \sum_{i=1}^{N} \ln(1 + \exp(-y_i \omega^T z_i))$
Linear Models

PLA and SVM
The PLA and SVM models have the same response function
$h_{PLA/SVM}(z) = \mathrm{sign}(\omega^T z + b)$
The error function of the model counts misclassifications:
$E_{in}(h) = \sum_{i=1}^{N} \chi_{h(z_i) \neq y_i}(z_i)$
$E_{in}(\omega) = \sum_{i=1}^{N} \chi_{\mathrm{sign}(\omega^T z_i + b) \neq y_i}(z_i)$
Regularization

Regularization: add constraints when seeking $\omega^*$ to avoid overfitting.

Constraint 1
$\min_{\omega \in \mathbb{R}^{11}} E_{in}(\omega) \quad \text{s.t. } \omega_3 = \cdots = \omega_{10} = 0$

Constraint 2
$\min_{\omega \in \mathbb{R}^{11}} E_{in}(\omega) \quad \text{s.t. } \sum_{i=0}^{10} \chi_{\omega_i \neq 0}(\omega_i) \leq 3$

Regularization by constraint
$\min_{\omega \in \mathbb{R}^{11}} E_{in}(\omega) \quad \text{s.t. } \sum_{i=0}^{10} \omega_i^2 \leq C$
Geometric Interpretation of L2-Regularization

L2-Regularization
The optimal $\omega^*$ satisfies:
$-\nabla E_{in}(\omega^*) \parallel \omega^*$
i.e.
$\nabla E_{in}(\omega^*) + \frac{2\lambda}{N} \omega^* = 0$
i.e.
$\min_\omega \; E_{in}(\omega) + \frac{\lambda}{N} \omega^T \omega$

Geometric Interpretation
Figure: Interpretation of L2-regularization, from [Lin].
L2-Regularized Linear Regression

Ridge Regression
The error function of the linear regression model is
$E_{in}(\omega) = \frac{1}{N} (Z\omega - y)^T (Z\omega - y)$
According to $\nabla E_{in}(\omega^*) + \frac{2\lambda}{N} \omega^* = 0$,
the analytic solution for the optimal weight vector $\omega^*$ is
$\omega^* = (Z^T Z + \lambda I)^{-1} Z^T y$
Compare with the optimal weight vector $\omega^*$ of the unregularized linear regression model:
$\omega^* = (Z^T Z)^{-1} Z^T y$
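A minimal NumPy sketch of the ridge closed form; the value of $\lambda$ is illustrative:

```python
# A minimal sketch of the ridge closed form w* = (Z^T Z + lambda*I)^{-1} Z^T y.
import numpy as np

def fit_ridge(Z, y, lam):
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

rng = np.random.default_rng(0)
Z = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
y = Z @ np.array([0.5, 1.0, -2.0, 3.0]) + 0.1 * rng.normal(size=50)
w_ridge = fit_ridge(Z, y, lam=1.0)      # shrunk relative to plain least squares
```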
L2-Regularized Logistic Regression

L2-Regularized Logistic Regression
The error function of the logistic regression model is
$E_{in}(\omega) = \frac{1}{N} \sum_{i=1}^{N} \ln(1 + \exp(-y_i \omega^T z_i))$
According to $\min_\omega \; E_{in}(\omega) + \frac{\lambda}{N} \omega^T \omega$,
the optimal weight vector $\omega^*$ is the solution to the unconstrained problem
$\min_\omega \; \frac{1}{N} \sum_{i=1}^{N} \ln(1 + \exp(-y_i \omega^T z_i)) + \frac{\lambda}{N} \omega^T \omega$
The problem is solved by gradient descent.
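A minimal gradient-descent sketch for this objective; the step size and iteration count are illustrative choices, not from the slides:

```python
# A minimal gradient-descent sketch for
#   (1/N) sum_i ln(1 + exp(-y_i w^T z_i)) + (lambda/N) w^T w.
# Step size eta and the number of steps are illustrative choices.
import numpy as np

def fit_l2_logistic(Z, y, lam, eta=0.1, steps=2000):
    N, d = Z.shape
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (Z @ w)                      # y_i * w^T z_i
        sigma = 1.0 / (1.0 + np.exp(margins))      # = sigma(-y_i w^T z_i)
        grad = -(Z * (y * sigma)[:, None]).sum(axis=0) / N + 2.0 * lam * w / N
        w -= eta * grad
    return w

rng = np.random.default_rng(0)
Z = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = np.sign(Z @ np.array([0.2, 1.0, -1.0]) + 0.3 * rng.normal(size=100))
w = fit_l2_logistic(Z, y, lam=1.0)
```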
L2-Regularizer

Weight-Decay Regularizer
The L2 regularizer $\frac{\lambda}{N} \omega^T \omega$ is also called the weight-decay regularizer.
Larger $\lambda$ $\iff$ shorter $\omega$ $\iff$ smaller $C$
Is switching the objective function and the constraint just a coincidence?

L2-Regularization
$\min_{\omega \in \mathbb{R}^{l+1}} E_{in}(\omega) \quad \text{s.t. } \sum_{i=0}^{l} \omega_i^2 \leq C$

Hard-Margin SVM
$\min_{b, \omega} \; \frac{1}{2} \omega^T \omega \quad \text{s.t. } y_n(\omega^T z_n + b) \geq 1, \quad n = 1, 2, \ldots, N$
Quadratic Programming [Burke]

QP Standard Form
The standard form of the quadratic program $\mathcal{Q}$ is
$\min_x \; \frac{1}{2} x^T Q x + c^T x \quad \text{s.t. } Ax \geq b$
where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, and the matrix $Q$ is symmetric. In this standard form, the number of unknown variables is $n$ and the number of constraints is $m$.
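As a small worked example, the sketch below solves an instance of this standard form with the cvxopt package (assumed installed); `cvxopt.solvers.qp` uses the convention $Gx \leq h$, so $Ax \geq b$ is passed as $-Ax \leq -b$:

```python
# A minimal sketch: solve  min 1/2 x^T Q x + c^T x  s.t.  A x >= b  with cvxopt
# (assumed installed). cvxopt.solvers.qp expects G x <= h, hence the sign flip.
import numpy as np
from cvxopt import matrix, solvers

Q = np.array([[2.0, 0.0], [0.0, 2.0]])   # symmetric (here positive definite)
c = np.array([-2.0, -5.0])
A = np.eye(2)                            # encodes x1 >= 0, x2 >= 0
b = np.zeros(2)

sol = solvers.qp(matrix(Q), matrix(c), matrix(-A), matrix(-b))
x_star = np.array(sol['x']).ravel()      # approximately (1.0, 2.5)
```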
Quadratic Programming

Lagrangian Function
The Lagrangian function is
$L(x, \lambda) = \frac{1}{2} x^T Q x + c^T x - \lambda^T (Ax - b)$
where $\lambda \geq 0$; $\lambda$ is the vector of Lagrange multipliers.

Karush-Kuhn-Tucker Conditions for $\mathcal{Q}$
A pair $(x, \lambda) \in \mathbb{R}^n \times \mathbb{R}^m$ is said to be a Karush-Kuhn-Tucker (KKT) pair for the quadratic program $\mathcal{Q}$ if and only if the following conditions are satisfied:
$Ax \geq b$ (primal feasibility)
$\lambda \geq 0, \quad c + Qx - A^T \lambda \geq 0$ (dual feasibility)
$\lambda^T (Ax - b) = 0, \quad x^T (c + Qx - A^T \lambda) = 0$ (complementarity)
Theorem

Theorem (First-Order Necessary Conditions for Optimality in QP)
If $x^*$ solves $\mathcal{Q}$, then there exists a vector $\lambda^* \in \mathbb{R}^m$ such that $(x^*, \lambda^*)$ is a KKT pair for $\mathcal{Q}$.

Theorem (Necessary and Sufficient Conditions for Optimality in Convex QP)
If $Q$ is symmetric and positive semi-definite, then $x^*$ solves $\mathcal{Q}$ if and only if there exists a vector $\lambda^* \in \mathbb{R}^m$ such that $(x^*, \lambda^*)$ is a KKT pair for $\mathcal{Q}$.
QP Solver [Hoppe]

Direct Solution
  - Symmetric Indefinite Factorization
  - Range-Space Approach
  - Null-Space Approach
Iterative Solution
  - Krylov Methods
  - Transforming Range-Space Iterations
  - Transforming Null-Space Iterations
Active Set Strategies for Convex QP Problems
  - Primal Active Set Strategies
  - Primal-Dual Active Set Strategies
Hard-Margin SVM Primal

Hard-Margin SVM Primal: solve $l + 1$ variables under $N$ constraints.

Hard-Margin SVM Primal
$\min_{b, \omega} \; \frac{1}{2} \omega^T \omega \quad \text{s.t. } y_n(\omega^T z_n + b) \geq 1, \quad n = 1, 2, \ldots, N$

QP Standard Form
$\min_x \; \frac{1}{2} x^T Q x + c^T x \quad \text{s.t. } Ax \geq b$

QP Solver
$x = \begin{pmatrix} b \\ \omega \end{pmatrix}, \quad Q = \begin{pmatrix} 0 & 0_{1 \times l} \\ 0_{l \times 1} & I_{l \times l} \end{pmatrix}, \quad c = 0$
$a_n = \begin{pmatrix} y_n \\ y_n z_n \end{pmatrix}, \quad A = \begin{pmatrix} a_1^T \\ \vdots \\ a_N^T \end{pmatrix}, \quad b_n = 1, \quad b = \begin{pmatrix} b_1 \\ \vdots \\ b_N \end{pmatrix}$
$x \leftarrow \mathrm{QP}(Q, c, A, b)$
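A minimal sketch of this construction, assuming cvxopt is available and the toy data are linearly separable; this illustrates the mapping above rather than a production SVM solver:

```python
# A minimal sketch of the mapping above (variable ordering x = (b, w)), assuming
# cvxopt is installed and the toy data are linearly separable.
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm_primal(Z, y):
    N, l = Z.shape
    Q = np.zeros((l + 1, l + 1))
    Q[1:, 1:] = np.eye(l)                     # no penalty on the bias b
    c = np.zeros(l + 1)
    A = np.column_stack([y, y[:, None] * Z])  # row a_n^T = y_n * (1, z_n^T)
    b_vec = np.ones(N)
    # cvxopt expects G x <= h, so A x >= b becomes (-A) x <= (-b).
    # (A tiny ridge, e.g. Q[0, 0] = 1e-8, can help the solver numerically.)
    sol = solvers.qp(matrix(Q), matrix(c), matrix(-A), matrix(-b_vec))
    x = np.array(sol['x']).ravel()
    return x[0], x[1:]                        # b, w

Z = np.array([[2.0, 2.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
b, w = hard_margin_svm_primal(Z, y)
```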
Hard-Margin SVM Primal Lagrangian Function

Hard-Margin SVM Primal Lagrangian Function
The Lagrangian function is
$L(b, \omega, \alpha) = \frac{1}{2} \omega^T \omega + \sum_{n=1}^{N} \alpha_n [1 - y_n(\omega^T z_n + b)]$
where $\alpha \geq 0$; $\alpha$ is the vector of Lagrange multipliers.

Hard-Margin SVM Primal Equivalent
Hard-Margin SVM Primal $\iff \min_{b, \omega} \left( \max_{\alpha} L(b, \omega, \alpha) \right)$
Duality

Weak Duality
$\min_{b, \omega} \left( \max_{\alpha} L(b, \omega, \alpha) \right) \geq \max_{\alpha} \left( \min_{b, \omega} L(b, \omega, \alpha) \right)$

Strong Duality
$\min_{b, \omega} \left( \max_{\alpha} L(b, \omega, \alpha) \right) = \max_{\alpha} \left( \min_{b, \omega} L(b, \omega, \alpha) \right)$

Conclusion for Strong Duality
Strong duality holds for the quadratic programming problem if
  - the objective function in the primal problem is convex;
  - the constraints in the primal problem are feasible;
  - the constraints in the primal problem are linear.
Hard-Margin SVM Dual

KKT Conditions for Hard-Margin SVM
Primal feasibility: $y_n(\omega^T z_n + b) \geq 1$
Dual feasibility: $\alpha \geq 0$
Primal-inner optimality: $\alpha_n (1 - y_n(\omega^T z_n + b)) = 0$
Dual-inner optimality: $\sum_{n=1}^{N} y_n \alpha_n = 0, \quad \omega = \sum_{n=1}^{N} \alpha_n y_n z_n$

Hard-Margin SVM Dual
$\min_\alpha \; \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y_n y_m z_n^T z_m - \sum_{n=1}^{N} \alpha_n$
$\text{s.t. } \sum_{n=1}^{N} y_n \alpha_n = 0, \quad \alpha \geq 0$
Hard-Margin SVM Dual

Hard-Margin SVM Dual: solve $N$ variables under $N + 1$ constraints.

Hard-Margin SVM Dual
$\min_\alpha \; \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y_n y_m z_n^T z_m - \sum_{n=1}^{N} \alpha_n$
$\text{s.t. } \sum_{n=1}^{N} y_n \alpha_n = 0, \quad \alpha \geq 0$

QP Solver
$\alpha \leftarrow \mathrm{QP}(Q, c, A, b)$
$\omega = \sum_{n=1}^{N} \alpha_n y_n z_n$
$b = y_n - \omega^T z_n$ (for any support vector $z_n$ with $\alpha_n > 0$)
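A minimal sketch of solving the dual with cvxopt (assumed installed) on toy separable data, then recovering $\omega$ and $b$ from $\alpha$:

```python
# A minimal sketch (cvxopt assumed installed, toy separable data): solve the
# dual, then recover w = sum_n alpha_n y_n z_n and b = y_s - w^T z_s for a
# support vector s with alpha_s > 0.
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm_dual(Z, y):
    N = Z.shape[0]
    Yz = y[:, None] * Z
    Q = Yz @ Yz.T                              # Q_nm = y_n y_m z_n^T z_m
    c = -np.ones(N)
    G, h = -np.eye(N), np.zeros(N)             # alpha >= 0  ->  -alpha <= 0
    A, b_eq = y.reshape(1, N), np.zeros(1)     # sum_n y_n alpha_n = 0
    sol = solvers.qp(matrix(Q), matrix(c), matrix(G), matrix(h),
                     matrix(A), matrix(b_eq))
    alpha = np.array(sol['x']).ravel()
    w = (alpha * y) @ Z
    s = int(np.argmax(alpha))                  # index of a support vector
    b = y[s] - w @ Z[s]
    return alpha, w, b

Z = np.array([[2.0, 2.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, w, b = hard_margin_svm_dual(Z, y)
```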
Soft-Margin SVM

Soft-Margin SVM Primal: solve $l + 1 + N$ variables under $2N$ constraints.
Soft-Margin SVM Dual: solve $N$ variables under $2N + 1$ constraints.

Soft-Margin SVM Primal
$\min_{b, \omega} \; \frac{1}{2} \omega^T \omega + C \sum_{n=1}^{N} \xi_n$
$\text{s.t. } y_n(\omega^T z_n + b) \geq 1 - \xi_n, \quad \xi_n \geq 0, \quad n = 1, 2, \ldots, N$

Soft-Margin SVM Dual
$\min_\alpha \; \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y_n y_m z_n^T z_m - \sum_{n=1}^{N} \alpha_n$
$\text{s.t. } \sum_{n=1}^{N} y_n \alpha_n = 0, \quad 0 \leq \alpha \leq C$
SVM and L2-Regularization

| Model | Minimize | Constraint |
| --- | --- | --- |
| Regularization by constraint | $E_{in}(h)$ | $\omega^T \omega \leq C$ |
| Hard-Margin SVM | $\omega^T \omega$ | $E_{in}(h) = 0$ |
| L2-Regularization | $E_{in}(h) + \frac{\lambda}{N} \omega^T \omega$ | |
| Soft-Margin SVM | $\frac{1}{2} \omega^T \omega + C N E_{in}(h)$ | |

Table: SVM as Regularized Model

View SVM as Regularized Model
  - large margin $\iff$ L2-regularization with short $\omega$
  - larger $C$ $\iff$ smaller $\lambda$ (less regularization)
Probabilistic SVM [Platt, 1999]

Platt's Model of Probabilistic SVM for Soft-Binary Classification
The probabilistic SVM model is
$\Theta(s) = 1/(1 + \exp(-s))$
$h(x) = \Theta(A \cdot (\omega_{SVM}^T \Phi(x) + b_{SVM}) + B)$
$P(y \mid x) = h(x)$ if $y = +1$, and $P(y \mid x) = 1 - h(x)$ if $y = -1$.
  - $A$ functions as scaling: often $A > 0$ if $\omega_{SVM}$ is reasonably good
  - $B$ functions as shifting: often $B \approx 0$ if $b_{SVM}$ is reasonably good
  - Two-level learning: logistic regression on SVM-transformed data
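A minimal sketch of the two-level idea using scikit-learn (assumed available): fit an SVM, then fit a one-dimensional logistic regression on its decision values. This is a simplification of Platt's procedure, which fits $A$ and $B$ on held-out scores with smoothed targets:

```python
# A minimal two-level-learning sketch with scikit-learn (assumed available).
# The fitted coefficient plays the role of A and the intercept the role of B.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

svm = SVC(kernel="rbf", C=1.0).fit(X, y)
scores = svm.decision_function(X).reshape(-1, 1)   # w_SVM^T Phi(x) + b_SVM

platt = LogisticRegression().fit(scores, y)
proba_pos = platt.predict_proba(scores)[:, 1]      # estimate of P(y = +1 | x)
```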
Kernel Trick

SVM Dual
$q_{n,m} = y_n y_m z_n^T z_m = y_n y_m \Phi(x_n)^T \Phi(x_m)$
$h_{SVM}(x) = \mathrm{sign}(\omega^T \Phi(x) + b) = \mathrm{sign}\left( \sum_{n=1}^{N} \alpha_n y_n \Phi(x_n)^T \Phi(x) + b \right)$

Kernel Trick
$K_\Phi : \mathcal{X} \times \mathcal{X} \to \mathbb{R}, \quad K_\Phi(x_n, x_m) = \Phi(x_n)^T \Phi(x_m)$
$q_{n,m} = y_n y_m z_n^T z_m = y_n y_m K_\Phi(x_n, x_m)$
$h_{SVM}(x) = \mathrm{sign}(\omega^T \Phi(x) + b) = \mathrm{sign}\left( \sum_{n=1}^{N} \alpha_n y_n K_\Phi(x_n, x) + b \right)$
Kernel Trick

General Polynomial Kernel
$K_Q(x_1, x_2) = (\xi + \gamma x_1^T x_2)^Q$
corresponding to a $Q$th-order polynomial transformation.

Gaussian Kernel
$K(x_1, x_2) = \exp(-\gamma \|x_1 - x_2\|^2)$
What is the corresponding transformation $\Phi$?
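A minimal NumPy sketch of the Gaussian kernel matrix and of the kernelized prediction $\mathrm{sign}(\sum_n \alpha_n y_n K(x_n, x) + b)$; the $\alpha$ and $b$ are assumed to come from a dual solver such as the one sketched earlier:

```python
# A minimal sketch of the Gaussian kernel matrix and of the kernelized SVM
# prediction sign(sum_n alpha_n y_n K(x_n, x) + b); alpha and b are assumed
# given (e.g. from a dual QP solver).
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    # ||x - x'||^2 expanded as x.x + x'.x' - 2 x.x' for all pairs at once
    sq = (X1**2).sum(1)[:, None] + (X2**2).sum(1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-gamma * sq)

def kernel_svm_predict(X_train, y, alpha, b, X_new, gamma=1.0):
    K = gaussian_kernel(X_train, X_new, gamma)     # shape (N, M)
    return np.sign((alpha * y) @ K + b)
```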
Mercer’s Condition [Wikipedia]
Mercer’sCondition
The function KΦ : X × X → IR, KΦ(xi , xj ) = Φ(xi )T Φ(xj ) is
symmetric
Define
Z =
Φ(x1)T
...
Φ(xN)T
K = ZZT
The matrix K is positive semi-definite.
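A minimal numerical check (up to floating-point tolerance) of these two properties for a kernel matrix built from data:

```python
# A minimal numerical check of symmetry and positive semi-definiteness for a
# kernel matrix K computed from data.
import numpy as np

def looks_like_valid_kernel(K, tol=1e-10):
    symmetric = np.allclose(K, K.T, atol=tol)
    eigenvalues = np.linalg.eigvalsh((K + K.T) / 2)   # eigvalsh assumes symmetry
    return symmetric and eigenvalues.min() >= -tol

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
sq = (X**2).sum(1)[:, None] + (X**2).sum(1)[None, :] - 2.0 * X @ X.T
K = np.exp(-1.0 * sq)                                 # Gaussian kernel matrix
print(looks_like_valid_kernel(K))                     # expected: True
```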
Representer Theorem
[Schölkopf, Herbrich and Smola, 2001]

Theorem (Representer Theorem)
For any L2-regularized linear model
$\min_\omega \; \frac{\lambda}{N} \omega^T \omega + \frac{1}{N} \sum_{n=1}^{N} E_{in}(y_n, \omega^T z_n)$
the optimal $\omega^*$ satisfies
$\omega^* = \sum_{n=1}^{N} \alpha_n z_n = Z^T \alpha$
Representer Theorem

Proof
Decompose $\omega^* = \omega_\parallel + \omega_\perp$, where
$\omega_\parallel \in \mathrm{span}(z_1, \ldots, z_N), \quad \omega_\perp \perp \mathrm{span}(z_1, \ldots, z_N)$
If $\omega_\perp \neq 0$, then
$E_{in}(y_n, \omega^{*T} z_n) = E_{in}(y_n, \omega_\parallel^T z_n)$
however
$\omega^{*T} \omega^* = \omega_\parallel^T \omega_\parallel + \omega_\perp^T \omega_\perp > \omega_\parallel^T \omega_\parallel$ (contradiction with the optimality of $\omega^*$!)
Representer Theorem

Conclusion
The optimal $\omega^*$ in an L2-regularized linear model is a linear combination of $z_n$, $n = 1, 2, \ldots, N$; i.e., the optimal $\omega^*$ can be represented by the data.

SVM Dual
$\omega_{SVM} = \sum_{n=1}^{N} \alpha_n y_n z_n$

PLA
$\omega_{PLA} = \sum_{n=1}^{N} \beta_n y_n z_n$ (if initialized with $\omega_0 = 0$)
Representer Theorem

Kernelized Linear Model
For any such linear model,
$\omega^{*T} z = \sum_{n=1}^{N} \alpha_n z_n^T z = \sum_{n=1}^{N} \alpha_n K_\Phi(x_n, x)$
so any L2-regularized linear model can be kernelized.

Kernelized Linear Models
  - Kernelized Ridge Regression
  - Kernelized Logistic Regression
  - Kernelized Support Vector Regression
Kernelized Ridge Regression

Kernelized Ridge Regression
$\min_\omega \; \frac{\lambda}{N} \omega^T \omega + \frac{1}{N} \sum_{n=1}^{N} (y_n - \omega^T z_n)^2$
$\Rightarrow \min_\alpha \; \frac{\lambda}{N} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m K(x_n, x_m) + \frac{1}{N} \sum_{n=1}^{N} \left( y_n - \sum_{m=1}^{N} \alpha_m K(x_m, x_n) \right)^2$
$\Rightarrow \min_\alpha \; \frac{\lambda}{N} \alpha^T K \alpha + \frac{1}{N} (y - K\alpha)^T (y - K\alpha) \equiv \min_\alpha E_{aug}(\alpha)$
$\Rightarrow \frac{\partial E_{aug}(\alpha)}{\partial \alpha} = \frac{2}{N} K^T \left( (\lambda I + K)\alpha - y \right) = 0$
Kernelized Ridge Regression

Kernelized Ridge Regression
The optimal $\alpha$ is
$\alpha = (\lambda I + K)^{-1} y$
The optimal $\omega^*$ is
$\omega^* = Z^T \alpha = Z^T (\lambda I + K)^{-1} y$
Compare with the optimal $\omega^*$ of ridge regression:
$\omega^* = (\lambda I + Z^T Z)^{-1} Z^T y$
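A minimal NumPy sketch of kernelized ridge regression with a Gaussian kernel; the $\lambda$ and $\gamma$ values are illustrative:

```python
# A minimal sketch of kernelized ridge regression with a Gaussian kernel:
# alpha = (lambda*I + K)^{-1} y, prediction y(x) = sum_n alpha_n K(x_n, x).
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    sq = (X1**2).sum(1)[:, None] + (X2**2).sum(1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-gamma * sq)

def fit_kernel_ridge(X, y, lam=1.0, gamma=1.0):
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(lam * np.eye(len(y)) + K, y)

def predict_kernel_ridge(X_train, alpha, X_new, gamma=1.0):
    return rbf_kernel(X_new, X_train, gamma) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)
alpha = fit_kernel_ridge(X, y, lam=0.1, gamma=0.5)
y_hat = predict_kernel_ridge(X, alpha, X, gamma=0.5)
```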
Kernelized L2-Regularized Logistic Regression

Kernelized L2-Regularized Logistic Regression
$\min_\omega \; \frac{\lambda}{N} \omega^T \omega + \frac{1}{N} \sum_{n=1}^{N} \ln(1 + \exp(-y_n \omega^T z_n))$
$\Rightarrow \min_\alpha \; \frac{\lambda}{N} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m K(x_n, x_m) + \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + \exp\left(-y_n \sum_{m=1}^{N} \alpha_m K(x_m, x_n)\right)\right)$
  - A linear model of $\omega$ with embedded-in-kernel transformation and L2-regularizer
  - A linear model of $\alpha$ with the kernel as transformation and a kernelized regularizer
Tube Regression

Tube Regression
The tube regression model is
$h(z) = \omega^T z + b$
The error function of the model is
$E_{in}(h) = \sum_{i=1}^{N} \max(0, |h(z_i) - y_i| - \epsilon)$

Figure: Interpretation of tube regression, from [ResearchGate].
L2-Regularized Tube Regression

L2-Regularized Tube Regression
The L2-regularized tube regression model is
$\min_{b, \omega} \; \sum_{i=1}^{N} \max(0, |h(z_i) - y_i| - \epsilon) + \frac{\lambda}{N} \omega^T \omega$

Remember?
Unconstrained Form of Soft-Margin SVM Primal
Soft-Margin SVM Primal $\iff \min_{b, \omega} \; \frac{1}{2} \omega^T \omega + C \sum_{n=1}^{N} \max(1 - y_n(\omega^T z_n + b), 0)$
Support Vector Regression [Welling, 2004]

SVR Primal: solve $l + 1 + 2N$ variables under $4N$ constraints.

Support Vector Regression Primal
$\min_{b, \omega, \xi} \; \frac{1}{2} \omega^T \omega + C \sum_{n=1}^{N} \xi_n$
$\text{s.t. } |\omega^T z_n + b - y_n| \leq \epsilon + \xi_n, \quad \xi_n \geq 0, \quad n = 1, 2, \ldots, N$

Support Vector Regression Primal Refinement
$\min_{b, \omega, \hat{\xi}, \check{\xi}} \; \frac{1}{2} \omega^T \omega + C \sum_{n=1}^{N} (\hat{\xi}_n + \check{\xi}_n)$
$\text{s.t. } -b - \omega^T z_n + \hat{\xi}_n \geq -\epsilon - y_n$
$\quad\;\; b + \omega^T z_n + \check{\xi}_n \geq -\epsilon + y_n$
$\quad\;\; \hat{\xi}_n \geq 0, \quad \check{\xi}_n \geq 0, \quad n = 1, 2, \ldots, N$
Support Vector Regression

SVR Dual: solve $2N$ variables under $4N + 1$ constraints.

Kernelized Support Vector Regression Dual
$\min_{\hat{\alpha}, \check{\alpha}} \; \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} (\hat{\alpha}_n - \check{\alpha}_n)(\hat{\alpha}_m - \check{\alpha}_m) k_{n,m} + \sum_{n=1}^{N} \left[ (\epsilon - y_n) \hat{\alpha}_n + (\epsilon + y_n) \check{\alpha}_n \right]$
$\text{s.t. } \sum_{n=1}^{N} (\hat{\alpha}_n - \check{\alpha}_n) = 0$
$\quad\;\; 0 \leq \hat{\alpha}_n \leq C, \quad 0 \leq \check{\alpha}_n \leq C, \quad n = 1, 2, \ldots, N$
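In practice this $\epsilon$-insensitive dual is solved by library implementations; a minimal usage sketch with scikit-learn's SVR (assumed available), with illustrative hyperparameter values:

```python
# A minimal usage sketch with scikit-learn's SVR (assumed available), which
# solves this epsilon-insensitive dual with an RBF kernel internally.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

svr = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=0.5).fit(X, y)
y_hat = svr.predict(X)
print(len(svr.support_))   # support vectors: points on or outside the tube
```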
Support Vector Regression

KKT Conditions of Support Vector Regression
Primal-inner optimality:
$\hat{\alpha}_n (\epsilon + \hat{\xi}_n - y_n + \omega^T z_n + b) = 0$
$\check{\alpha}_n (\epsilon + \check{\xi}_n + y_n - \omega^T z_n - b) = 0$
Dual-inner optimality:
$\omega = \sum_{n=1}^{N} (\hat{\alpha}_n - \check{\alpha}_n) z_n, \quad \sum_{n=1}^{N} (\hat{\alpha}_n - \check{\alpha}_n) = 0$
References

Platt, J. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3), 61-74.

Schölkopf, B., Herbrich, R. and Smola, A. J. (2001). A generalized representer theorem. In International Conference on Computational Learning Theory, 416-426.

Welling, M. (2004). Support vector regression.
Q & A