Linear Machine Learning Models with L2 Regularization and Kernel Tricks
These slides cover linear machine learning models, focusing on L2-regularization and the kernel trick. They discuss feature transformation, overfitting, linear regression, logistic regression, and the quadratic programming machinery behind support vector machines (SVMs). Key concepts such as regularization, duality in optimization, and popular kernel functions are also covered.
Linear Machine Learning Models with L2-Regularization and Kernel Tricks
Fengtao Wu
University of Pittsburgh
few14@pitt.edu
November 15, 2016
Outline
1. Feature Transformation
   - Overfitting
2. L2-Regularization for Linear Models
   - Linear Models
   - L2-Regularization
3. Quadratic Programming
   - Standard Form
   - QP Solver
   - Example: SVM
4. Kernel Trick and L2-Regularization
   - Representer Theorem
   - Kernelized Ridge Regression
   - Kernelized L2-Regularized Logistic Regression
   - Support Vector Regression
Feature Transformation

Original Dataset
After the data cleaning process, we have the original dataset
$(x_i, y_i), \quad i = 1, 2, \ldots, N$
The feature vector $x_i$ has $m$ dimensions:
$x_i = (x_{i1}, x_{i2}, \ldots, x_{im})^T, \quad i = 1, 2, \ldots, N$

Feature Transformation
The original feature vector $x_i$ is transformed into a new feature vector $z_i$ with $l$ dimensions by the transformation $\phi : x_i \mapsto z_i$:
$\phi(x_i) = z_i = (z_{i1}, z_{i2}, \ldots, z_{il})^T, \quad i = 1, 2, \ldots, N$
Feature Transformation

Example: Quadratic Transformation
The original feature vector $x_i$ is
$x_i = (x_{i1}, x_{i2})^T, \quad i = 1, 2, \ldots, N$
The new feature vector $\phi(x_i)$ is
$\phi(x_i) = (x_{i1}, x_{i2}, x_{i1}^2, x_{i1}x_{i2}, x_{i2}^2)^T, \quad i = 1, 2, \ldots, N$
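As a concrete illustration (not from the slides), a minimal NumPy sketch of this quadratic transformation applied row-wise to a data matrix might look like:

```python
# A minimal sketch (not from the slides): the quadratic transformation
# phi(x) = (x1, x2, x1^2, x1*x2, x2^2)^T applied to every row of a data matrix.
import numpy as np

def quadratic_transform(X):
    """Map each row (x1, x2) to (x1, x2, x1^2, x1*x2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x1 * x2, x2**2])

X = np.array([[1.0, 2.0], [3.0, 4.0]])  # toy data: N = 2, m = 2
Z = quadratic_transform(X)              # shape (2, 5), so l = 5
```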
Overfitting

The feature transformation may cause the overfitting problem, e.g. the degree-10 polynomial transformation
$\phi(x_i) = (x_i, x_i^2, x_i^3, \ldots, x_i^{10})^T, \quad i = 1, 2, \ldots, N$

Figure: Polynomial transformation causes the overfitting problem, from [Wikipedia].
Linear Models

Linear Regression
Redefine $z_i \equiv (1, z_i)^T$; the linear regression model is
$h(z) = \omega^T z$
The error function of the model is
$E_{in}(h) = \frac{1}{N} \sum_{i=1}^{N} (h(z_i) - y_i)^2$
$E_{in}(\omega) = \frac{1}{N} (Z\omega - y)^T (Z\omega - y)$
The analytic solution for the optimal weight vector $\omega^*$ is
$\omega^* = (Z^T Z)^{-1} Z^T y$
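A minimal NumPy sketch of this closed-form solution on illustrative synthetic data (`Z` is assumed to already contain a leading column of ones):

```python
# A minimal sketch of the closed form w* = (Z^T Z)^{-1} Z^T y on synthetic data.
import numpy as np

def fit_linear_regression(Z, y):
    # Solving the normal equations is more stable than forming the inverse.
    return np.linalg.solve(Z.T @ Z, Z.T @ y)

rng = np.random.default_rng(0)
Z = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])  # leading 1s column
true_w = np.array([0.5, 1.0, -2.0, 3.0])
y = Z @ true_w + 0.1 * rng.normal(size=50)
w_star = fit_linear_regression(Z, y)    # close to true_w
```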
Linear Models

Logistic Regression
Redefine $z_i \equiv (1, z_i)^T$; the logistic regression model is
$h(z) = 1/(1 + \exp(-\omega^T z))$
$P(y \mid z) = h(z)$ if $y = +1$, and $P(y \mid z) = 1 - h(z)$ if $y = -1$.
The error function of the model comes from maximum likelihood:
$\max_h \prod_{i=1}^{N} h(y_i z_i) \iff \min_\omega \sum_{i=1}^{N} \ln(1 + \exp(-y_i \omega^T z_i))$
$E_{in}(\omega) = \frac{1}{N} \sum_{i=1}^{N} \ln(1 + \exp(-y_i \omega^T z_i))$
Linear Models

PLA and SVM
The PLA and SVM models have the same response function
$h_{PLA/SVM}(z) = \mathrm{sign}(\omega^T z + b)$
The error function of the model counts misclassifications:
$E_{in}(h) = \sum_{i=1}^{N} \chi_{h(z_i) \neq y_i}(z_i)$
$E_{in}(\omega) = \sum_{i=1}^{N} \chi_{\mathrm{sign}(\omega^T z_i + b) \neq y_i}(z_i)$
Regularization

Regularization: add constraints when seeking $\omega^*$ to avoid overfitting.

Constraint 1
$\min_{\omega \in \mathbb{R}^{11}} E_{in}(\omega) \quad \text{s.t. } \omega_3 = \cdots = \omega_{10} = 0$

Constraint 2
$\min_{\omega \in \mathbb{R}^{11}} E_{in}(\omega) \quad \text{s.t. } \sum_{i=0}^{10} \chi_{\omega_i \neq 0}(\omega_i) \leq 3$

Regularization by constraint
$\min_{\omega \in \mathbb{R}^{11}} E_{in}(\omega) \quad \text{s.t. } \sum_{i=0}^{10} \omega_i^2 \leq C$
Geometric Interpretation of L2-Regularization

L2-Regularization
The optimal $\omega^*$ satisfies:
$-\nabla E_{in}(\omega^*) \parallel \omega^*$
i.e.
$\nabla E_{in}(\omega^*) + \frac{2\lambda}{N} \omega^* = 0$
i.e.
$\min_\omega \; E_{in}(\omega) + \frac{\lambda}{N} \omega^T \omega$

Geometric Interpretation
Figure: Interpretation of L2-regularization, from [Lin].
L2-Regularized Linear Regression

Ridge Regression
The error function of the linear regression model is
$E_{in}(\omega) = \frac{1}{N} (Z\omega - y)^T (Z\omega - y)$
According to $\nabla E_{in}(\omega^*) + \frac{2\lambda}{N} \omega^* = 0$,
the analytic solution for the optimal weight vector $\omega^*$ is
$\omega^* = (Z^T Z + \lambda I)^{-1} Z^T y$
Compare with the optimal weight vector $\omega^*$ of the unregularized linear regression model:
$\omega^* = (Z^T Z)^{-1} Z^T y$
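A minimal NumPy sketch of the ridge closed form; the value of $\lambda$ is illustrative:

```python
# A minimal sketch of the ridge closed form w* = (Z^T Z + lambda*I)^{-1} Z^T y.
import numpy as np

def fit_ridge(Z, y, lam):
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

rng = np.random.default_rng(0)
Z = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
y = Z @ np.array([0.5, 1.0, -2.0, 3.0]) + 0.1 * rng.normal(size=50)
w_ridge = fit_ridge(Z, y, lam=1.0)      # shrunk relative to plain least squares
```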
L2-Regularized Logistic Regression

L2-Regularized Logistic Regression
The error function of the logistic regression model is
$E_{in}(\omega) = \frac{1}{N} \sum_{i=1}^{N} \ln(1 + \exp(-y_i \omega^T z_i))$
According to $\min_\omega \; E_{in}(\omega) + \frac{\lambda}{N} \omega^T \omega$,
the optimal weight vector $\omega^*$ is the solution to the unconstrained problem
$\min_\omega \; \frac{1}{N} \sum_{i=1}^{N} \ln(1 + \exp(-y_i \omega^T z_i)) + \frac{\lambda}{N} \omega^T \omega$
The problem is solved by gradient descent.
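A minimal gradient-descent sketch for this objective; the step size and iteration count are illustrative choices, not from the slides:

```python
# A minimal gradient-descent sketch for
#   (1/N) sum_i ln(1 + exp(-y_i w^T z_i)) + (lambda/N) w^T w.
# Step size eta and the number of steps are illustrative choices.
import numpy as np

def fit_l2_logistic(Z, y, lam, eta=0.1, steps=2000):
    N, d = Z.shape
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (Z @ w)                      # y_i * w^T z_i
        sigma = 1.0 / (1.0 + np.exp(margins))      # = sigma(-y_i w^T z_i)
        grad = -(Z * (y * sigma)[:, None]).sum(axis=0) / N + 2.0 * lam * w / N
        w -= eta * grad
    return w

rng = np.random.default_rng(0)
Z = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = np.sign(Z @ np.array([0.2, 1.0, -1.0]) + 0.3 * rng.normal(size=100))
w = fit_l2_logistic(Z, y, lam=1.0)
```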
L2-Regularizer

Weight-Decay Regularizer
The L2 regularizer $\frac{\lambda}{N} \omega^T \omega$ is also called the weight-decay regularizer.
Larger $\lambda$ $\iff$ shorter $\omega$ $\iff$ smaller $C$
Is switching the objective function and the constraint just a coincidence?

L2-Regularization
$\min_{\omega \in \mathbb{R}^{l+1}} E_{in}(\omega) \quad \text{s.t. } \sum_{i=0}^{l} \omega_i^2 \leq C$

Hard-Margin SVM
$\min_{b, \omega} \; \frac{1}{2} \omega^T \omega \quad \text{s.t. } y_n(\omega^T z_n + b) \geq 1, \quad n = 1, 2, \ldots, N$
Quadratic Programming [Burke]

QP Standard Form
The standard form of the quadratic program $\mathcal{Q}$ is
$\min_x \; \frac{1}{2} x^T Q x + c^T x \quad \text{s.t. } Ax \geq b$
where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, and the matrix $Q$ is symmetric. In this standard form, the number of unknown variables is $n$ and the number of constraints is $m$.
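As a small worked example, the sketch below solves an instance of this standard form with the cvxopt package (assumed installed); `cvxopt.solvers.qp` uses the convention $Gx \leq h$, so $Ax \geq b$ is passed as $-Ax \leq -b$:

```python
# A minimal sketch: solve  min 1/2 x^T Q x + c^T x  s.t.  A x >= b  with cvxopt
# (assumed installed). cvxopt.solvers.qp expects G x <= h, hence the sign flip.
import numpy as np
from cvxopt import matrix, solvers

Q = np.array([[2.0, 0.0], [0.0, 2.0]])   # symmetric (here positive definite)
c = np.array([-2.0, -5.0])
A = np.eye(2)                            # encodes x1 >= 0, x2 >= 0
b = np.zeros(2)

sol = solvers.qp(matrix(Q), matrix(c), matrix(-A), matrix(-b))
x_star = np.array(sol['x']).ravel()      # approximately (1.0, 2.5)
```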
Quadratic Programming

Lagrangian Function
The Lagrangian function is
$L(x, \lambda) = \frac{1}{2} x^T Q x + c^T x - \lambda^T (Ax - b)$
where $\lambda \geq 0$; $\lambda$ is the vector of Lagrange multipliers.

Karush-Kuhn-Tucker Conditions for $\mathcal{Q}$
A pair $(x, \lambda) \in \mathbb{R}^n \times \mathbb{R}^m$ is said to be a Karush-Kuhn-Tucker (KKT) pair for the quadratic program $\mathcal{Q}$ if and only if the following conditions are satisfied:
$Ax \geq b$ (primal feasibility)
$\lambda \geq 0, \quad c + Qx - A^T \lambda \geq 0$ (dual feasibility)
$\lambda^T (Ax - b) = 0, \quad x^T (c + Qx - A^T \lambda) = 0$ (complementarity)
Theorem

Theorem (First-Order Necessary Conditions for Optimality in QP)
If $x^*$ solves $\mathcal{Q}$, then there exists a vector $\lambda^* \in \mathbb{R}^m$ such that $(x^*, \lambda^*)$ is a KKT pair for $\mathcal{Q}$.

Theorem (Necessary and Sufficient Conditions for Optimality in Convex QP)
If $Q$ is symmetric and positive semi-definite, then $x^*$ solves $\mathcal{Q}$ if and only if there exists a vector $\lambda^* \in \mathbb{R}^m$ such that $(x^*, \lambda^*)$ is a KKT pair for $\mathcal{Q}$.
QP Solver [Hoppe]

Direct Solution
  - Symmetric Indefinite Factorization
  - Range-Space Approach
  - Null-Space Approach
Iterative Solution
  - Krylov Methods
  - Transforming Range-Space Iterations
  - Transforming Null-Space Iterations
Active Set Strategies for Convex QP Problems
  - Primal Active Set Strategies
  - Primal-Dual Active Set Strategies
Hard-Margin SVM Primal

Hard-Margin SVM Primal: solve $l + 1$ variables under $N$ constraints.

Hard-Margin SVM Primal
$\min_{b, \omega} \; \frac{1}{2} \omega^T \omega \quad \text{s.t. } y_n(\omega^T z_n + b) \geq 1, \quad n = 1, 2, \ldots, N$

QP Standard Form
$\min_x \; \frac{1}{2} x^T Q x + c^T x \quad \text{s.t. } Ax \geq b$

QP Solver
$x = \begin{pmatrix} b \\ \omega \end{pmatrix}, \quad Q = \begin{pmatrix} 0 & 0_{1 \times l} \\ 0_{l \times 1} & I_{l \times l} \end{pmatrix}, \quad c = 0$
$a_n = \begin{pmatrix} y_n \\ y_n z_n \end{pmatrix}, \quad A = \begin{pmatrix} a_1^T \\ \vdots \\ a_N^T \end{pmatrix}, \quad b_n = 1, \quad b = \begin{pmatrix} b_1 \\ \vdots \\ b_N \end{pmatrix}$
$x \leftarrow \mathrm{QP}(Q, c, A, b)$
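A minimal sketch of this construction, assuming cvxopt is available and the toy data are linearly separable; this illustrates the mapping above rather than a production SVM solver:

```python
# A minimal sketch of the mapping above (variable ordering x = (b, w)), assuming
# cvxopt is installed and the toy data are linearly separable.
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm_primal(Z, y):
    N, l = Z.shape
    Q = np.zeros((l + 1, l + 1))
    Q[1:, 1:] = np.eye(l)                     # no penalty on the bias b
    c = np.zeros(l + 1)
    A = np.column_stack([y, y[:, None] * Z])  # row a_n^T = y_n * (1, z_n^T)
    b_vec = np.ones(N)
    # cvxopt expects G x <= h, so A x >= b becomes (-A) x <= (-b).
    # (A tiny ridge, e.g. Q[0, 0] = 1e-8, can help the solver numerically.)
    sol = solvers.qp(matrix(Q), matrix(c), matrix(-A), matrix(-b_vec))
    x = np.array(sol['x']).ravel()
    return x[0], x[1:]                        # b, w

Z = np.array([[2.0, 2.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
b, w = hard_margin_svm_primal(Z, y)
```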
Hard-Margin SVM Primal Lagrangian Function

Hard-Margin SVM Primal Lagrangian Function
The Lagrangian function is
$L(b, \omega, \alpha) = \frac{1}{2} \omega^T \omega + \sum_{n=1}^{N} \alpha_n [1 - y_n(\omega^T z_n + b)]$
where $\alpha \geq 0$; $\alpha$ is the vector of Lagrange multipliers.

Hard-Margin SVM Primal Equivalent
Hard-Margin SVM Primal $\iff \min_{b, \omega} \left( \max_{\alpha} L(b, \omega, \alpha) \right)$
Duality

Weak Duality
$\min_{b, \omega} \left( \max_{\alpha} L(b, \omega, \alpha) \right) \geq \max_{\alpha} \left( \min_{b, \omega} L(b, \omega, \alpha) \right)$

Strong Duality
$\min_{b, \omega} \left( \max_{\alpha} L(b, \omega, \alpha) \right) = \max_{\alpha} \left( \min_{b, \omega} L(b, \omega, \alpha) \right)$

Conclusion for Strong Duality
Strong duality holds for the quadratic programming problem if
  - the objective function in the primal problem is convex;
  - the constraints in the primal problem are feasible;
  - the constraints in the primal problem are linear.
Hard-Margin SVM Dual

KKT Conditions for Hard-Margin SVM
Primal feasibility: $y_n(\omega^T z_n + b) \geq 1$
Dual feasibility: $\alpha \geq 0$
Primal-inner optimality: $\alpha_n (1 - y_n(\omega^T z_n + b)) = 0$
Dual-inner optimality: $\sum_{n=1}^{N} y_n \alpha_n = 0, \quad \omega = \sum_{n=1}^{N} \alpha_n y_n z_n$

Hard-Margin SVM Dual
$\min_\alpha \; \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y_n y_m z_n^T z_m - \sum_{n=1}^{N} \alpha_n$
$\text{s.t. } \sum_{n=1}^{N} y_n \alpha_n = 0, \quad \alpha \geq 0$
Hard-Margin SVM Dual

Hard-Margin SVM Dual: solve $N$ variables under $N + 1$ constraints.

Hard-Margin SVM Dual
$\min_\alpha \; \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y_n y_m z_n^T z_m - \sum_{n=1}^{N} \alpha_n$
$\text{s.t. } \sum_{n=1}^{N} y_n \alpha_n = 0, \quad \alpha \geq 0$

QP Solver
$\alpha \leftarrow \mathrm{QP}(Q, c, A, b)$
$\omega = \sum_{n=1}^{N} \alpha_n y_n z_n$
$b = y_n - \omega^T z_n$ (for any support vector $z_n$ with $\alpha_n > 0$)
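A minimal sketch of solving the dual with cvxopt (assumed installed) on toy separable data, then recovering $\omega$ and $b$ from $\alpha$:

```python
# A minimal sketch (cvxopt assumed installed, toy separable data): solve the
# dual, then recover w = sum_n alpha_n y_n z_n and b = y_s - w^T z_s for a
# support vector s with alpha_s > 0.
import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm_dual(Z, y):
    N = Z.shape[0]
    Yz = y[:, None] * Z
    Q = Yz @ Yz.T                              # Q_nm = y_n y_m z_n^T z_m
    c = -np.ones(N)
    G, h = -np.eye(N), np.zeros(N)             # alpha >= 0  ->  -alpha <= 0
    A, b_eq = y.reshape(1, N), np.zeros(1)     # sum_n y_n alpha_n = 0
    sol = solvers.qp(matrix(Q), matrix(c), matrix(G), matrix(h),
                     matrix(A), matrix(b_eq))
    alpha = np.array(sol['x']).ravel()
    w = (alpha * y) @ Z
    s = int(np.argmax(alpha))                  # index of a support vector
    b = y[s] - w @ Z[s]
    return alpha, w, b

Z = np.array([[2.0, 2.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, w, b = hard_margin_svm_dual(Z, y)
```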
Soft-Margin SVM

Soft-Margin SVM Primal: solve $l + 1 + N$ variables under $2N$ constraints.
Soft-Margin SVM Dual: solve $N$ variables under $2N + 1$ constraints.

Soft-Margin SVM Primal
$\min_{b, \omega} \; \frac{1}{2} \omega^T \omega + C \sum_{n=1}^{N} \xi_n$
$\text{s.t. } y_n(\omega^T z_n + b) \geq 1 - \xi_n, \quad \xi_n \geq 0, \quad n = 1, 2, \ldots, N$

Soft-Margin SVM Dual
$\min_\alpha \; \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y_n y_m z_n^T z_m - \sum_{n=1}^{N} \alpha_n$
$\text{s.t. } \sum_{n=1}^{N} y_n \alpha_n = 0, \quad 0 \leq \alpha \leq C$
SVM and L2-Regularization

| Model | Minimize | Constraint |
| --- | --- | --- |
| Regularization by constraint | $E_{in}(h)$ | $\omega^T \omega \leq C$ |
| Hard-Margin SVM | $\omega^T \omega$ | $E_{in}(h) = 0$ |
| L2-Regularization | $E_{in}(h) + \frac{\lambda}{N} \omega^T \omega$ | |
| Soft-Margin SVM | $\frac{1}{2} \omega^T \omega + C N E_{in}(h)$ | |

Table: SVM as Regularized Model

View SVM as Regularized Model
  - large margin $\iff$ L2-regularization with short $\omega$
  - larger $C$ $\iff$ smaller $\lambda$ (less regularization)
Probabilistic SVM [Platt, 1999]

Platt's Model of Probabilistic SVM for Soft-Binary Classification
The probabilistic SVM model is
$\Theta(s) = 1/(1 + \exp(-s))$
$h(x) = \Theta(A \cdot (\omega_{SVM}^T \Phi(x) + b_{SVM}) + B)$
$P(y \mid x) = h(x)$ if $y = +1$, and $P(y \mid x) = 1 - h(x)$ if $y = -1$.
  - $A$ functions as scaling: often $A > 0$ if $\omega_{SVM}$ is reasonably good
  - $B$ functions as shifting: often $B \approx 0$ if $b_{SVM}$ is reasonably good
  - Two-level learning: logistic regression on SVM-transformed data
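A minimal sketch of the two-level idea using scikit-learn (assumed available): fit an SVM, then fit a one-dimensional logistic regression on its decision values. This is a simplification of Platt's procedure, which fits $A$ and $B$ on held-out scores with smoothed targets:

```python
# A minimal two-level-learning sketch with scikit-learn (assumed available).
# The fitted coefficient plays the role of A and the intercept the role of B.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

svm = SVC(kernel="rbf", C=1.0).fit(X, y)
scores = svm.decision_function(X).reshape(-1, 1)   # w_SVM^T Phi(x) + b_SVM

platt = LogisticRegression().fit(scores, y)
proba_pos = platt.predict_proba(scores)[:, 1]      # estimate of P(y = +1 | x)
```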
Kernel Trick

SVM Dual
$q_{n,m} = y_n y_m z_n^T z_m = y_n y_m \Phi(x_n)^T \Phi(x_m)$
$h_{SVM}(x) = \mathrm{sign}(\omega^T \Phi(x) + b) = \mathrm{sign}\left( \sum_{n=1}^{N} \alpha_n y_n \Phi(x_n)^T \Phi(x) + b \right)$

Kernel Trick
$K_\Phi : \mathcal{X} \times \mathcal{X} \to \mathbb{R}, \quad K_\Phi(x_n, x_m) = \Phi(x_n)^T \Phi(x_m)$
$q_{n,m} = y_n y_m z_n^T z_m = y_n y_m K_\Phi(x_n, x_m)$
$h_{SVM}(x) = \mathrm{sign}(\omega^T \Phi(x) + b) = \mathrm{sign}\left( \sum_{n=1}^{N} \alpha_n y_n K_\Phi(x_n, x) + b \right)$
Kernel Trick

General Polynomial Kernel
$K_Q(x_1, x_2) = (\xi + \gamma x_1^T x_2)^Q$
corresponding to a $Q$th-order polynomial transformation.

Gaussian Kernel
$K(x_1, x_2) = \exp(-\gamma \|x_1 - x_2\|^2)$
What is the corresponding transformation $\Phi$?
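A minimal NumPy sketch of the Gaussian kernel matrix and of the kernelized prediction $\mathrm{sign}(\sum_n \alpha_n y_n K(x_n, x) + b)$; the $\alpha$ and $b$ are assumed to come from a dual solver such as the one sketched earlier:

```python
# A minimal sketch of the Gaussian kernel matrix and of the kernelized SVM
# prediction sign(sum_n alpha_n y_n K(x_n, x) + b); alpha and b are assumed
# given (e.g. from a dual QP solver).
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    # ||x - x'||^2 expanded as x.x + x'.x' - 2 x.x' for all pairs at once
    sq = (X1**2).sum(1)[:, None] + (X2**2).sum(1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-gamma * sq)

def kernel_svm_predict(X_train, y, alpha, b, X_new, gamma=1.0):
    K = gaussian_kernel(X_train, X_new, gamma)     # shape (N, M)
    return np.sign((alpha * y) @ K + b)
```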
Mercer’s Condition [Wikipedia]
Mercer’sCondition
The function KΦ : X × X → IR, KΦ(xi , xj ) = Φ(xi )T Φ(xj ) is
symmetric
Define
Z =
Φ(x1)T
...
Φ(xN)T
K = ZZT
The matrix K is positive semi-definite.
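A minimal numerical check (up to floating-point tolerance) of these two properties for a kernel matrix built from data:

```python
# A minimal numerical check of symmetry and positive semi-definiteness for a
# kernel matrix K computed from data.
import numpy as np

def looks_like_valid_kernel(K, tol=1e-10):
    symmetric = np.allclose(K, K.T, atol=tol)
    eigenvalues = np.linalg.eigvalsh((K + K.T) / 2)   # eigvalsh assumes symmetry
    return symmetric and eigenvalues.min() >= -tol

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
sq = (X**2).sum(1)[:, None] + (X**2).sum(1)[None, :] - 2.0 * X @ X.T
K = np.exp(-1.0 * sq)                                 # Gaussian kernel matrix
print(looks_like_valid_kernel(K))                     # expected: True
```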
Representer Theorem
[Schölkopf, Herbrich and Smola, 2001]

Theorem (Representer Theorem)
For any L2-regularized linear model
$\min_\omega \; \frac{\lambda}{N} \omega^T \omega + \frac{1}{N} \sum_{n=1}^{N} E_{in}(y_n, \omega^T z_n)$
the optimal $\omega^*$ satisfies
$\omega^* = \sum_{n=1}^{N} \alpha_n z_n = Z^T \alpha$
Representer Theorem

Proof
Decompose $\omega^* = \omega_\parallel + \omega_\perp$, where
$\omega_\parallel \in \mathrm{span}(z_1, \ldots, z_N), \quad \omega_\perp \perp \mathrm{span}(z_1, \ldots, z_N)$
If $\omega_\perp \neq 0$, then
$E_{in}(y_n, \omega^{*T} z_n) = E_{in}(y_n, \omega_\parallel^T z_n)$
however
$\omega^{*T} \omega^* = \omega_\parallel^T \omega_\parallel + \omega_\perp^T \omega_\perp > \omega_\parallel^T \omega_\parallel$ (contradiction with the optimality of $\omega^*$!)
Representer Theorem

Conclusion
The optimal $\omega^*$ in an L2-regularized linear model is a linear combination of $z_n$, $n = 1, 2, \ldots, N$; i.e., the optimal $\omega^*$ can be represented by the data.

SVM Dual
$\omega_{SVM} = \sum_{n=1}^{N} \alpha_n y_n z_n$

PLA
$\omega_{PLA} = \sum_{n=1}^{N} \beta_n y_n z_n$ (if initialized with $\omega_0 = 0$)
Representer Theorem

Kernelized Linear Model
For any such linear model,
$\omega^{*T} z = \sum_{n=1}^{N} \alpha_n z_n^T z = \sum_{n=1}^{N} \alpha_n K_\Phi(x_n, x)$
so any L2-regularized linear model can be kernelized.

Kernelized Linear Models
  - Kernelized Ridge Regression
  - Kernelized Logistic Regression
  - Kernelized Support Vector Regression
Kernelized Ridge Regression

Kernelized Ridge Regression
$\min_\omega \; \frac{\lambda}{N} \omega^T \omega + \frac{1}{N} \sum_{n=1}^{N} (y_n - \omega^T z_n)^2$
$\Rightarrow \min_\alpha \; \frac{\lambda}{N} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m K(x_n, x_m) + \frac{1}{N} \sum_{n=1}^{N} \left( y_n - \sum_{m=1}^{N} \alpha_m K(x_m, x_n) \right)^2$
$\Rightarrow \min_\alpha \; \frac{\lambda}{N} \alpha^T K \alpha + \frac{1}{N} (y - K\alpha)^T (y - K\alpha) \equiv \min_\alpha E_{aug}(\alpha)$
$\Rightarrow \frac{\partial E_{aug}(\alpha)}{\partial \alpha} = \frac{2}{N} K^T \left( (\lambda I + K)\alpha - y \right) = 0$
Kernelized Ridge Regression

Kernelized Ridge Regression
The optimal $\alpha$ is
$\alpha = (\lambda I + K)^{-1} y$
The optimal $\omega^*$ is
$\omega^* = Z^T \alpha = Z^T (\lambda I + K)^{-1} y$
Compare with the optimal $\omega^*$ of ridge regression:
$\omega^* = (\lambda I + Z^T Z)^{-1} Z^T y$
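A minimal NumPy sketch of kernelized ridge regression with a Gaussian kernel; the $\lambda$ and $\gamma$ values are illustrative:

```python
# A minimal sketch of kernelized ridge regression with a Gaussian kernel:
# alpha = (lambda*I + K)^{-1} y, prediction y(x) = sum_n alpha_n K(x_n, x).
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    sq = (X1**2).sum(1)[:, None] + (X2**2).sum(1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-gamma * sq)

def fit_kernel_ridge(X, y, lam=1.0, gamma=1.0):
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(lam * np.eye(len(y)) + K, y)

def predict_kernel_ridge(X_train, alpha, X_new, gamma=1.0):
    return rbf_kernel(X_new, X_train, gamma) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)
alpha = fit_kernel_ridge(X, y, lam=0.1, gamma=0.5)
y_hat = predict_kernel_ridge(X, alpha, X, gamma=0.5)
```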
Kernelized L2-Regularized Logistic Regression

Kernelized L2-Regularized Logistic Regression
$\min_\omega \; \frac{\lambda}{N} \omega^T \omega + \frac{1}{N} \sum_{n=1}^{N} \ln(1 + \exp(-y_n \omega^T z_n))$
$\Rightarrow \min_\alpha \; \frac{\lambda}{N} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m K(x_n, x_m) + \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + \exp\left(-y_n \sum_{m=1}^{N} \alpha_m K(x_m, x_n)\right)\right)$
  - A linear model of $\omega$ with embedded-in-kernel transformation and L2-regularizer
  - A linear model of $\alpha$ with the kernel as transformation and a kernelized regularizer
Tube Regression

Tube Regression
The tube regression model is
$h(z) = \omega^T z + b$
The error function of the model is
$E_{in}(h) = \sum_{i=1}^{N} \max(0, |h(z_i) - y_i| - \epsilon)$

Figure: Interpretation of tube regression, from [ResearchGate].
L2-Regularized Tube Regression

L2-Regularized Tube Regression
The L2-regularized tube regression model is
$\min_{b, \omega} \; \sum_{i=1}^{N} \max(0, |h(z_i) - y_i| - \epsilon) + \frac{\lambda}{N} \omega^T \omega$

Remember?
Unconstrained Form of Soft-Margin SVM Primal
Soft-Margin SVM Primal $\iff \min_{b, \omega} \; \frac{1}{2} \omega^T \omega + C \sum_{n=1}^{N} \max(1 - y_n(\omega^T z_n + b), 0)$
Support Vector Regression [Welling, 2004]

SVR Primal: solve $l + 1 + 2N$ variables under $4N$ constraints.

Support Vector Regression Primal
$\min_{b, \omega, \xi} \; \frac{1}{2} \omega^T \omega + C \sum_{n=1}^{N} \xi_n$
$\text{s.t. } |\omega^T z_n + b - y_n| \leq \epsilon + \xi_n, \quad \xi_n \geq 0, \quad n = 1, 2, \ldots, N$

Support Vector Regression Primal Refinement
$\min_{b, \omega, \hat{\xi}, \check{\xi}} \; \frac{1}{2} \omega^T \omega + C \sum_{n=1}^{N} (\hat{\xi}_n + \check{\xi}_n)$
$\text{s.t. } -b - \omega^T z_n + \hat{\xi}_n \geq -\epsilon - y_n$
$\quad\;\; b + \omega^T z_n + \check{\xi}_n \geq -\epsilon + y_n$
$\quad\;\; \hat{\xi}_n \geq 0, \quad \check{\xi}_n \geq 0, \quad n = 1, 2, \ldots, N$
Support Vector Regression

SVR Dual: solve $2N$ variables under $4N + 1$ constraints.

Kernelized Support Vector Regression Dual
$\min_{\hat{\alpha}, \check{\alpha}} \; \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} (\hat{\alpha}_n - \check{\alpha}_n)(\hat{\alpha}_m - \check{\alpha}_m) k_{n,m} + \sum_{n=1}^{N} \left[ (\epsilon - y_n) \hat{\alpha}_n + (\epsilon + y_n) \check{\alpha}_n \right]$
$\text{s.t. } \sum_{n=1}^{N} (\hat{\alpha}_n - \check{\alpha}_n) = 0$
$\quad\;\; 0 \leq \hat{\alpha}_n \leq C, \quad 0 \leq \check{\alpha}_n \leq C, \quad n = 1, 2, \ldots, N$
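In practice this $\epsilon$-insensitive dual is solved by library implementations; a minimal usage sketch with scikit-learn's SVR (assumed available), with illustrative hyperparameter values:

```python
# A minimal usage sketch with scikit-learn's SVR (assumed available), which
# solves this epsilon-insensitive dual with an RBF kernel internally.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

svr = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=0.5).fit(X, y)
y_hat = svr.predict(X)
print(len(svr.support_))   # support vectors: points on or outside the tube
```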
Support Vector Regression

KKT Conditions of Support Vector Regression
Primal-inner optimality:
$\hat{\alpha}_n (\epsilon + \hat{\xi}_n - y_n + \omega^T z_n + b) = 0$
$\check{\alpha}_n (\epsilon + \check{\xi}_n + y_n - \omega^T z_n - b) = 0$
Dual-inner optimality:
$\omega = \sum_{n=1}^{N} (\hat{\alpha}_n - \check{\alpha}_n) z_n, \quad \sum_{n=1}^{N} (\hat{\alpha}_n - \check{\alpha}_n) = 0$
References

Platt, J. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3), 61-74.

Schölkopf, B., Herbrich, R. and Smola, A. J. (2001). A generalized representer theorem. In International Conference on Computational Learning Theory, 416-426.

Welling, M. (2004). Support vector regression.
Q & A