Linear Machine Learning Models with L2-Regularization
and Kernel Tricks
Fengtao Wu
University of Pittsburgh
few14@pitt.edu
November 15, 2016
Outline

1 Feature Transformation
    Overfitting
2 L2-Regularization for Linear Models
    Linear Models
    L2-Regularization
3 Quadratic Programming
    Standard Form
    QP Solver
    Example: SVM
4 Kernel Trick and L2-Regularization
    Representer Theorem
    Kernelized Ridge Regression
    Kernelized L2-Regularized Logistic Regression
    Support Vector Regression

Feature Transformation

Original Dataset
After the data cleaning process, we have the original dataset
$$(\mathbf{x}_i, y_i), \quad i = 1, 2, \ldots, N$$
The feature vector $\mathbf{x}_i$ has $m$ dimensions:
$$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{im})^T, \quad i = 1, 2, \ldots, N$$

Feature Transformation
The original feature vector $\mathbf{x}_i$ is transformed to a new feature vector $\mathbf{z}_i$ with $l$ dimensions by the transformation $\phi: \mathbf{x}_i \to \mathbf{z}_i$:
$$\phi(\mathbf{x}_i) = \mathbf{z}_i = (z_{i1}, z_{i2}, \ldots, z_{il})^T, \quad i = 1, 2, \ldots, N$$

Feature Transformation

Example: Quadratic Transformation
The original feature vector $\mathbf{x}_i$ is
$$\mathbf{x}_i = (x_{i1}, x_{i2})^T, \quad i = 1, 2, \ldots, N$$
The new feature vector $\phi(\mathbf{x}_i)$ is
$$\phi(\mathbf{x}_i) = (x_{i1}, x_{i2}, x_{i1}^2, x_{i1}x_{i2}, x_{i2}^2)^T, \quad i = 1, 2, \ldots, N$$

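As an illustration of the quadratic transformation above, the following NumPy sketch (not part of the original slides; the data are made up) maps two-dimensional inputs into the five-dimensional quadratic feature space.

```python
import numpy as np

def quadratic_transform(X):
    """Map rows (x1, x2) to (x1, x2, x1^2, x1*x2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x1 * x2, x2**2])

X = np.array([[1.0, 2.0], [3.0, -1.0]])
print(quadratic_transform(X))
```
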
Overfitting

The feature transformation may cause the overfitting problem, e.g. the degree-10 polynomial transformation
$$\phi(x_i) = (x_i, x_i^2, x_i^3, \ldots, x_i^{10})^T, \quad i = 1, 2, \ldots, N$$

Figure: A high-order polynomial transformation causes overfitting, from [Wikipedia].

Linear Models

Linear Regression
Redefine $z_i \equiv (1, z_i)^T$; the linear regression model is
$$h(z) = \omega^T z$$
The error function of the model is
$$E_{in}(h) = \frac{1}{N}\sum_{i=1}^{N}\left(h(z_i) - y_i\right)^2, \qquad E_{in}(\omega) = \frac{1}{N}(Z\omega - y)^T(Z\omega - y)$$
The analytic solution for the optimal weight vector $\omega^*$ is
$$\omega^* = (Z^T Z)^{-1} Z^T y$$

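A minimal NumPy sketch of the closed-form solution above; the data are synthetic and only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                         # transformed features (without bias)
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

Z = np.column_stack([np.ones(len(X)), X])             # prepend the constant feature z_0 = 1
w_star = np.linalg.solve(Z.T @ Z, Z.T @ y)            # (Z^T Z)^{-1} Z^T y, without an explicit inverse
print(w_star)
```
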
Linear Models

Logistic Regression
Redefine $z_i \equiv (1, z_i)^T$; the logistic regression model is
$$h(z) = \frac{1}{1 + \exp(-\omega^T z)}$$
$$P(y \mid z) = \begin{cases} h(z) & \text{if } y = +1 \\ 1 - h(z) & \text{if } y = -1 \end{cases}$$
Maximizing the likelihood is equivalent to minimizing the cross-entropy error:
$$\max_h \; \prod_{i=1}^{N} h(y_i z_i) \iff \min_\omega \; \sum_{i=1}^{N}\ln\left(1 + \exp(-y_i\omega^T z_i)\right)$$
$$E_{in}(\omega) = \frac{1}{N}\sum_{i=1}^{N}\ln\left(1 + \exp(-y_i\omega^T z_i)\right)$$

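To make the cross-entropy error concrete, a small NumPy helper (an illustrative sketch, not from the slides) that evaluates $E_{in}(\omega)$ for labels $y_i \in \{-1, +1\}$:

```python
import numpy as np

def logistic_error(w, Z, y):
    """Cross-entropy error E_in(w): mean of ln(1 + exp(-y_i * w^T z_i))."""
    margins = y * (Z @ w)
    # np.logaddexp(0, -m) = ln(1 + exp(-m)), computed in a numerically stable way
    return np.mean(np.logaddexp(0.0, -margins))

Z = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])   # first column is the constant feature
y = np.array([+1, -1, +1])
print(logistic_error(np.array([0.1, 0.3]), Z, y))
```
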
Linear Models

PLA and SVM
The PLA and SVM models share the same response function
$$h_{PLA/SVM}(z) = \mathrm{sign}(\omega^T z + b)$$
The error function of the model is the 0/1 error, counting misclassified points:
$$E_{in}(h) = \sum_{i=1}^{N}\chi_{h(z_i) \neq y_i}(z_i), \qquad E_{in}(\omega) = \sum_{i=1}^{N}\chi_{\mathrm{sign}(\omega^T z_i + b) \neq y_i}(z_i)$$

Regularization

Regularization: add constraints when seeking $\omega^*$ to avoid overfitting.

Constraint 1
$$\min_{\omega \in \mathbb{R}^{11}} E_{in}(\omega) \quad \text{s.t.} \quad \omega_3 = \cdots = \omega_{10} = 0$$

Constraint 2
$$\min_{\omega \in \mathbb{R}^{11}} E_{in}(\omega) \quad \text{s.t.} \quad \sum_{i=0}^{10}\chi_{\omega_i \neq 0}(\omega_i) \leq 3$$

Regularization by constraint
$$\min_{\omega \in \mathbb{R}^{11}} E_{in}(\omega) \quad \text{s.t.} \quad \sum_{i=0}^{10}\omega_i^2 \leq C$$

Geometric Interpretation of L2-Regularization

L2-Regularization
At the constrained optimum, the optimal $\omega^*$ satisfies
$$-\nabla E_{in}(\omega^*) \parallel \omega^*$$
i.e.
$$\nabla E_{in}(\omega^*) + \frac{2\lambda}{N}\omega^* = 0$$
i.e. $\omega^*$ solves the unconstrained problem
$$\min_\omega \; E_{in}(\omega) + \frac{\lambda}{N}\omega^T\omega$$

Geometric Interpretation
Figure: Interpretation of L2-regularization, from [Lin].

L2-Regularized Linear Regression

Ridge Regression
The error function of the linear regression model is
$$E_{in}(\omega) = \frac{1}{N}(Z\omega - y)^T(Z\omega - y)$$
Setting $\nabla E_{in}(\omega^*) + \frac{2\lambda}{N}\omega^* = 0$, the analytic solution for the optimal weight vector $\omega^*$ is
$$\omega^* = (Z^T Z + \lambda I)^{-1} Z^T y$$
compared to the optimal weight vector of plain linear regression,
$$\omega^* = (Z^T Z)^{-1} Z^T y$$

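A NumPy sketch of the ridge solution above, side by side with the unregularized one (toy data assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
Z = np.column_stack([np.ones(50), rng.normal(size=(50, 5))])
y = rng.normal(size=50)
lam = 0.1

w_ols   = np.linalg.solve(Z.T @ Z, Z.T @ y)                             # (Z^T Z)^{-1} Z^T y
w_ridge = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)  # (Z^T Z + lambda*I)^{-1} Z^T y
print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))                   # the ridge weights are shrunk
```
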
L2-Regularized Logistic Regression

L2-Regularized Logistic Regression
The error function of the logistic regression model is
$$E_{in}(\omega) = \frac{1}{N}\sum_{i=1}^{N}\ln\left(1 + \exp(-y_i\omega^T z_i)\right)$$
Applying $\min_\omega \; E_{in}(\omega) + \frac{\lambda}{N}\omega^T\omega$, the optimal weight vector $\omega^*$ is the solution to the unconstrained problem
$$\min_\omega \; \frac{1}{N}\sum_{i=1}^{N}\ln\left(1 + \exp(-y_i\omega^T z_i)\right) + \frac{\lambda}{N}\omega^T\omega$$
The problem is solved by the gradient descent approach.

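A sketch of the gradient-descent approach mentioned above; the toy data, step size, and iteration count are assumptions for illustration.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def grad(w, Z, y, lam):
    """Gradient of (1/N) sum ln(1+exp(-y_i w.z_i)) + (lam/N) w.w."""
    N = len(y)
    m = y * (Z @ w)                                  # margins y_i * w^T z_i
    g_err = -(Z.T @ (y * sigmoid(-m))) / N           # gradient of the cross-entropy term
    return g_err + (2.0 * lam / N) * w               # plus the weight-decay term

rng = np.random.default_rng(2)
Z = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = np.sign(Z @ np.array([0.2, 1.0, -1.0]) + 0.3 * rng.normal(size=200))

w, lam, eta = np.zeros(3), 1.0, 0.5
for _ in range(2000):                                # plain fixed-step gradient descent
    w -= eta * grad(w, Z, y, lam)
print(w)
```
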
L2-Regularizer

Weight-Decay Regularizer
The L2 regularizer $\frac{\lambda}{N}\omega^T\omega$ is also called the weight-decay regularizer.
Larger $\lambda$ $\iff$ shorter $\omega$ $\iff$ smaller $C$
Is switching the objective function and the constraints just a coincidence?

L2-Regularization
$$\min_{\omega \in \mathbb{R}^{l+1}} E_{in}(\omega) \quad \text{s.t.} \quad \sum_{i=0}^{l}\omega_i^2 \leq C$$

Hard-Margin SVM
$$\min_{b,\omega} \; \frac{1}{2}\omega^T\omega \quad \text{s.t.} \quad y_n(\omega^T z_n + b) \geq 1, \quad n = 1, 2, \ldots, N$$

Quadratic Programming [Burke]

QP Standard Form
The standard form of the quadratic program $\mathcal{Q}$ is
$$\min_x \; \frac{1}{2}x^T Q x + c^T x \quad \text{s.t.} \quad Ax \geq b$$
where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, and the matrix $Q$ is symmetric. In the QP standard form, the number of unknown variables is $n$ and the number of constraints is $m$.

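As an illustration of this standard form, the sketch below solves a tiny two-variable QP with SciPy's general-purpose SLSQP routine; the problem data are made up, and a dedicated QP solver (see the methods two slides ahead) would normally be used instead.

```python
import numpy as np
from scipy.optimize import minimize

# min_x 1/2 x^T Q x + c^T x   s.t.  A x >= b
Q = np.array([[2.0, 0.0], [0.0, 2.0]])
c = np.array([-2.0, -5.0])
A = np.array([[1.0, -2.0], [-1.0, -2.0], [-1.0, 2.0], [1.0, 0.0], [0.0, 1.0]])
b = np.array([-2.0, -6.0, -2.0, 0.0, 0.0])

obj = lambda x: 0.5 * x @ Q @ x + c @ x
jac = lambda x: Q @ x + c
cons = [{"type": "ineq", "fun": lambda x: A @ x - b, "jac": lambda x: A}]   # A x - b >= 0

res = minimize(obj, x0=np.zeros(2), jac=jac, constraints=cons, method="SLSQP")
print(res.x)   # optimal x
```
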
Quadratic Programming

Lagrangian Function
The Lagrangian function is
$$L(x, \lambda) = \frac{1}{2}x^T Q x + c^T x - \lambda^T(Ax - b)$$
where $\lambda \geq 0$; $\lambda$ is called the vector of Lagrangian multipliers.

Karush-Kuhn-Tucker Conditions for $\mathcal{Q}$
A pair $(x, \lambda) \in \mathbb{R}^n \times \mathbb{R}^m$ is said to be a Karush-Kuhn-Tucker pair for the quadratic program $\mathcal{Q}$ if and only if the following conditions are satisfied:
$$Ax \geq b \quad \text{(primal feasibility)}$$
$$\lambda \geq 0, \quad c + Qx - A^T\lambda \geq 0 \quad \text{(dual feasibility)}$$
$$\lambda^T(Ax - b) = 0, \quad x^T(c + Qx - A^T\lambda) = 0 \quad \text{(complementarity)}$$

Theorem

Theorem (First-Order Necessary Conditions for Optimality in QP)
If $x^*$ solves $\mathcal{Q}$, then there exists a vector $\lambda^* \in \mathbb{R}^m$ such that $(x^*, \lambda^*)$ is a KKT pair for $\mathcal{Q}$.

Theorem (Necessary and Sufficient Conditions for Optimality in Convex QP)
If $Q$ is symmetric and positive semi-definite, then $x^*$ solves $\mathcal{Q}$ if and only if there exists a vector $\lambda^* \in \mathbb{R}^m$ such that $(x^*, \lambda^*)$ is a KKT pair for $\mathcal{Q}$.

QP Solver [Hoppe]

Direct Solution
    Symmetric Indefinite Factorization
    Range-Space Approach
    Null-Space Approach
Iterative Solution
    Krylov Methods
    Transforming Range-Space Iterations
    Transforming Null-Space Iterations
Active Set Strategies for Convex QP Problems
    Primal Active Set Strategies
    Primal-Dual Active Set Strategies

Hard-Margin SVM Primal

Hard-Margin SVM Primal: solve $l + 1$ variables under $N$ constraints.

Hard-Margin SVM Primal
$$\min_{b,\omega} \; \frac{1}{2}\omega^T\omega \quad \text{s.t.} \quad y_n(\omega^T z_n + b) \geq 1, \quad n = 1, 2, \ldots, N$$

QP Standard Form
$$\min_x \; \frac{1}{2}x^T Q x + c^T x \quad \text{s.t.} \quad Ax \geq b$$

QP Solver
$$x = \begin{bmatrix} b \\ \omega \end{bmatrix}, \quad
Q = \begin{bmatrix} 0 & 0_{1 \times l} \\ 0_{l \times 1} & I_{l \times l} \end{bmatrix}, \quad
c = 0, \quad
a_n = \begin{bmatrix} y_n \\ y_n z_n \end{bmatrix}, \quad
A = \begin{bmatrix} a_1^T \\ \vdots \\ a_N^T \end{bmatrix}, \quad
b_n = 1, \quad
b = \begin{bmatrix} b_1 \\ \vdots \\ b_N \end{bmatrix}$$
$$x \leftarrow \mathrm{QP}(Q, c, A, b)$$

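A sketch of the mapping above on a toy, linearly separable dataset (the data are assumed, and SciPy's SLSQP stands in for the dedicated QP solver the slide denotes by QP(Q, c, A, b)):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data: z in R^2, labels in {-1, +1}
Zf = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0], [-2.0, -1.0], [-3.0, -2.0], [-1.0, -3.0]])
y  = np.array([+1, +1, +1, -1, -1, -1], dtype=float)
N, l = Zf.shape

# Build the QP data exactly as on the slide: x = (b, w), Q = diag(0, I), c = 0,
# rows of A are a_n = y_n * (1, z_n), right-hand side all ones.
Q = np.zeros((l + 1, l + 1)); Q[1:, 1:] = np.eye(l)
c = np.zeros(l + 1)
A = y[:, None] * np.column_stack([np.ones(N), Zf])
b = np.ones(N)

obj  = lambda x: 0.5 * x @ Q @ x + c @ x
cons = [{"type": "ineq", "fun": lambda x: A @ x - b}]
x0   = np.array([0.0, 1.0, 1.0])                     # a feasible starting point for this toy data
res  = minimize(obj, x0=x0, constraints=cons, method="SLSQP")

bias, w = res.x[0], res.x[1:]
print("b =", bias, "w =", w)
print("margins:", y * (Zf @ w + bias))               # all >= 1 for a feasible separator
```
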
Hard-Margin SVM Primal Lagrangian Function

Hard-Margin SVM Primal Lagrangian Function
The Lagrangian function is
$$L(b, \omega, \alpha) = \frac{1}{2}\omega^T\omega + \sum_{n=1}^{N}\alpha_n\left[1 - y_n(\omega^T z_n + b)\right]$$
where $\alpha \geq 0$; $\alpha$ is called the vector of Lagrangian multipliers.

Hard-Margin SVM Primal Equivalent
$$\text{Hard-Margin SVM Primal} \iff \min_{b,\omega}\left(\max_{\alpha} L(b, \omega, \alpha)\right)$$

Duality

Weak Duality
$$\min_{b,\omega}\left(\max_{\alpha} L(b, \omega, \alpha)\right) \geq \max_{\alpha}\left(\min_{b,\omega} L(b, \omega, \alpha)\right)$$

Strong Duality
$$\min_{b,\omega}\left(\max_{\alpha} L(b, \omega, \alpha)\right) = \max_{\alpha}\left(\min_{b,\omega} L(b, \omega, \alpha)\right)$$

Conclusion for Strong Duality
Strong duality holds for the quadratic programming problem if
the objective function of the primal problem is convex,
the constraints of the primal problem are feasible, and
the constraints of the primal problem are linear.

Hard-Margin SVM Dual

KKT Conditions for Hard-Margin SVM
Primal feasible: $y_n(\omega^T z_n + b) \geq 1$
Dual feasible: $\alpha \geq 0$
Primal-inner optimal: $\alpha_n\left(1 - y_n(\omega^T z_n + b)\right) = 0$
Dual-inner optimal: $\sum_{n=1}^{N} y_n\alpha_n = 0, \quad \omega = \sum_{n=1}^{N}\alpha_n y_n z_n$

Hard-Margin SVM Dual
$$\min_{\alpha} \; \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\alpha_n\alpha_m y_n y_m z_n^T z_m - \sum_{n=1}^{N}\alpha_n \quad \text{s.t.} \quad \sum_{n=1}^{N} y_n\alpha_n = 0, \quad \alpha \geq 0$$

Hard-Margin SVM Dual

Hard-Margin SVM Dual: solve $N$ variables under $N + 1$ constraints.

Hard-Margin SVM Dual
$$\min_{\alpha} \; \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\alpha_n\alpha_m y_n y_m z_n^T z_m - \sum_{n=1}^{N}\alpha_n \quad \text{s.t.} \quad \sum_{n=1}^{N} y_n\alpha_n = 0, \quad \alpha \geq 0$$

QP Solver
$$\alpha \leftarrow \mathrm{QP}(Q, c, A, b)$$
$$\omega = \sum_{n=1}^{N}\alpha_n y_n z_n, \qquad b = y_n - \omega^T z_n \;\text{ for any support vector } z_n \text{ (i.e. any } n \text{ with } \alpha_n > 0\text{)}$$

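A sketch of solving the dual on the same toy data as the primal example, again with SciPy's SLSQP standing in for a dedicated QP solver, and with $\omega$ and $b$ recovered as on the slide:

```python
import numpy as np
from scipy.optimize import minimize

Z = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0], [-2.0, -1.0], [-3.0, -2.0], [-1.0, -3.0]])
y = np.array([+1, +1, +1, -1, -1, -1], dtype=float)
N = len(y)

Q = (y[:, None] * y[None, :]) * (Z @ Z.T)           # Q[n, m] = y_n y_m z_n^T z_m
obj  = lambda a: 0.5 * a @ Q @ a - a.sum()
cons = [{"type": "eq", "fun": lambda a: y @ a}]      # sum_n y_n alpha_n = 0
bnds = [(0.0, None)] * N                             # alpha_n >= 0

res   = minimize(obj, x0=np.zeros(N), constraints=cons, bounds=bnds, method="SLSQP")
alpha = res.x
w  = (alpha * y) @ Z                                 # w = sum_n alpha_n y_n z_n
sv = int(np.argmax(alpha))                           # index of a support vector (alpha_n > 0)
b  = y[sv] - w @ Z[sv]
print("alpha =", np.round(alpha, 3), "w =", w, "b =", b)
```
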
Soft-Margin SVM

Soft-Margin SVM Primal: solve $l + 1 + N$ variables under $2N$ constraints.
Soft-Margin SVM Dual: solve $N$ variables under $2N + 1$ constraints.

Soft-Margin SVM Primal
$$\min_{b,\omega} \; \frac{1}{2}\omega^T\omega + C\sum_{n=1}^{N}\xi_n \quad \text{s.t.} \quad y_n(\omega^T z_n + b) \geq 1 - \xi_n, \quad \xi_n \geq 0, \quad n = 1, 2, \ldots, N$$

Soft-Margin SVM Dual
$$\min_{\alpha} \; \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\alpha_n\alpha_m y_n y_m z_n^T z_m - \sum_{n=1}^{N}\alpha_n \quad \text{s.t.} \quad \sum_{n=1}^{N} y_n\alpha_n = 0, \quad 0 \leq \alpha \leq C$$

Soft-Margin SVM

Unconstrained Form of the Soft-Margin SVM Primal
$$\text{Soft-Margin SVM Primal} \iff \min_{b,\omega} \; \frac{1}{2}\omega^T\omega + C\sum_{n=1}^{N}\max\left(1 - y_n(\omega^T z_n + b),\; 0\right)$$

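The unconstrained hinge-loss form lends itself to direct (sub)gradient descent; the following sketch minimizes it on made-up two-cluster data (data, step size, and iteration count are assumptions, not part of the slides).

```python
import numpy as np

rng = np.random.default_rng(3)
Z = np.vstack([rng.normal([2, 2], 1.0, size=(50, 2)), rng.normal([-2, -2], 1.0, size=(50, 2))])
y = np.hstack([np.ones(50), -np.ones(50)])

C, eta = 1.0, 0.001
w, b = np.zeros(2), 0.0
for _ in range(5000):
    margins = y * (Z @ w + b)
    active = margins < 1                          # points inside the margin or misclassified
    gw = w - C * (y[active, None] * Z[active]).sum(axis=0)   # subgradient w.r.t. w
    gb = -C * y[active].sum()                                # subgradient w.r.t. b
    w, b = w - eta * gw, b - eta * gb
print("w =", w, "b =", b,
      "hinge loss =", np.maximum(1 - y * (Z @ w + b), 0).sum())
```
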
SVM and L2-Regularization

                                Minimize                                        Constraint
Regularization by constraint    $E_{in}(h)$                                     $\omega^T\omega \leq C$
Hard-Margin SVM                 $\omega^T\omega$                                $E_{in}(h) = 0$
L2-Regularization               $E_{in}(h) + \frac{\lambda}{N}\omega^T\omega$
Soft-Margin SVM                 $\frac{1}{2}\omega^T\omega + C N E_{in}(h)$

Table: SVM as Regularized Model

View SVM as Regularized Model
large margin $\iff$ L2 regularization of short $\omega$
larger $C$ $\iff$ smaller $\lambda$ (less regularization)

Probabilistic SVM [Platt, 1999]

Platt's Model of Probabilistic SVM for Soft-Binary Classification
The probabilistic SVM model is
$$\Theta(s) = \frac{1}{1 + \exp(-s)}, \qquad h(x) = \Theta\left(A\left(\omega_{SVM}^T\Phi(x) + b_{SVM}\right) + B\right)$$
$$P(y \mid x) = \begin{cases} h(x) & \text{if } y = +1 \\ 1 - h(x) & \text{if } y = -1 \end{cases}$$
$A$ functions as scaling: often $A > 0$ if $\omega_{SVM}$ is reasonably good.
$B$ functions as shifting: often $B \approx 0$ if $b_{SVM}$ is reasonably good.
Two-level learning: logistic regression on SVM-transformed data.

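A simplified sketch of the two-level idea using scikit-learn (an assumption, not part of the slides): fit a kernel SVM, then fit a one-dimensional logistic regression on its decision values to learn $A$ and $B$. Platt's original method additionally calibrates on held-out data.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(1.5, 1.0, size=(100, 2)), rng.normal(-1.5, 1.0, size=(100, 2))])
y = np.hstack([np.ones(100), -np.ones(100)])

# Level 1: kernel SVM, raw decision values w_SVM^T Phi(x) + b_SVM
svm = SVC(kernel="rbf", C=1.0).fit(X, y)
scores = svm.decision_function(X).reshape(-1, 1)

# Level 2: logistic regression on the scores, playing the role of A (slope) and B (intercept)
platt = LogisticRegression().fit(scores, y)
print("A ~", platt.coef_[0, 0], "B ~", platt.intercept_[0])
print("P(y=+1|x) for the first point:", platt.predict_proba(scores[:1])[0, 1])  # column 1 is class +1
```
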
Kernel Trick

SVM Dual
$$q_{n,m} = y_n y_m z_n^T z_m = y_n y_m \Phi(x_n)^T\Phi(x_m)$$
$$h_{SVM}(x) = \mathrm{sign}\left(\omega^T\Phi(x) + b\right) = \mathrm{sign}\left(\sum_{n=1}^{N}\alpha_n y_n\Phi(x_n)^T\Phi(x) + b\right)$$

Kernel Trick
$$K_\Phi: \mathcal{X} \times \mathcal{X} \to \mathbb{R}, \qquad K_\Phi(x_n, x_m) = \Phi(x_n)^T\Phi(x_m)$$
$$q_{n,m} = y_n y_m z_n^T z_m = y_n y_m K_\Phi(x_n, x_m)$$
$$h_{SVM}(x) = \mathrm{sign}\left(\omega^T\Phi(x) + b\right) = \mathrm{sign}\left(\sum_{n=1}^{N}\alpha_n y_n K_\Phi(x_n, x) + b\right)$$

Kernel Trick

General Polynomial Kernel
$$K_Q(x_1, x_2) = \left(\xi + \gamma x_1^T x_2\right)^Q$$
corresponding to a $Q$th-order polynomial transformation.

Gaussian Kernel
$$K(x_1, x_2) = \exp\left(-\gamma\|x_1 - x_2\|^2\right)$$
What is the corresponding transformation $\Phi$?

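A small NumPy sketch of the two kernels and of the kernelized prediction rule from the previous slide (the dual coefficients below are placeholders, not the result of an actual SVM solve):

```python
import numpy as np

def poly_kernel(X1, X2, xi=1.0, gamma=1.0, Q=2):
    """K_Q(x1, x2) = (xi + gamma * x1^T x2)^Q for all pairs of rows."""
    return (xi + gamma * X1 @ X2.T) ** Q

def gaussian_kernel(X1, X2, gamma=1.0):
    """K(x1, x2) = exp(-gamma * ||x1 - x2||^2) for all pairs of rows."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

Xtrain = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
y      = np.array([-1.0, +1.0, +1.0])
alpha  = np.array([0.5, 0.3, 0.2])      # placeholder dual variables
b      = -0.1

def h_svm(x_new):
    k = gaussian_kernel(Xtrain, x_new[None, :]).ravel()   # K(x_n, x) for all n
    return np.sign((alpha * y) @ k + b)                   # sign(sum_n alpha_n y_n K(x_n, x) + b)

print(h_svm(np.array([1.5, 0.5])))
```
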
Gaussian Kernel

Gaussian Kernel
$$K(x_1, x_2) = \exp\left(-\gamma\|x_1 - x_2\|^2\right)$$
e.g. in one dimension with $\gamma = 1$:
$$K(x_1, x_2) = \exp\left(-(x_1 - x_2)^2\right)
= \exp(-x_1^2)\exp(-x_2^2)\exp(2x_1x_2)
= \exp(-x_1^2)\exp(-x_2^2)\left(\sum_{i=0}^{\infty}\frac{(2x_1x_2)^i}{i!}\right)
= \sum_{i=0}^{\infty}\left[\sqrt{\tfrac{2^i}{i!}}\exp(-x_1^2)\,x_1^i\right]\left[\sqrt{\tfrac{2^i}{i!}}\exp(-x_2^2)\,x_2^i\right]$$
$$\Phi(x) = \exp(-x^2)\left(1,\; \sqrt{\tfrac{2}{1!}}\,x,\; \sqrt{\tfrac{2^2}{2!}}\,x^2,\; \cdots\right)$$

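The infinite-dimensional feature map can be checked numerically by truncating the series; the sketch below (illustrative values only) compares the kernel value with the inner product of truncated feature vectors.

```python
import numpy as np
from math import factorial

def phi(x, terms=20):
    """Truncated Gaussian-kernel feature map: exp(-x^2) * sqrt(2^i / i!) * x^i."""
    return np.array([np.exp(-x**2) * np.sqrt(2.0**i / factorial(i)) * x**i for i in range(terms)])

x1, x2 = 0.7, -0.3
print(np.exp(-(x1 - x2) ** 2))   # kernel value
print(phi(x1) @ phi(x2))         # inner product of truncated feature maps (nearly equal)
```
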
Mercer's Condition [Wikipedia]

Mercer's Condition
The function $K_\Phi: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, $K_\Phi(x_i, x_j) = \Phi(x_i)^T\Phi(x_j)$, is symmetric.
Define
$$Z = \begin{bmatrix} \Phi(x_1)^T \\ \vdots \\ \Phi(x_N)^T \end{bmatrix}, \qquad K = ZZ^T$$
The matrix $K$ is positive semi-definite.

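Both properties are easy to verify numerically for a given kernel matrix; a quick sketch for the Gaussian kernel on random points (data assumed):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(6, 2))

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-1.0 * sq)                            # Gaussian kernel matrix

print(np.allclose(K, K.T))                       # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)     # eigenvalues (numerically) non-negative => PSD
```
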
Representer Theorem
[Schölkopf, Herbrich and Smola, 2001]

Theorem (Representer Theorem)
For any L2-regularized linear model
$$\min_\omega \; \frac{\lambda}{N}\omega^T\omega + \frac{1}{N}\sum_{n=1}^{N} E_{in}(y_n, \omega^T z_n)$$
the optimal $\omega^*$ satisfies
$$\omega^* = \sum_{n=1}^{N}\alpha_n z_n = Z^T\alpha$$

Representer Theorem

Proof
Decompose $\omega^* = \omega_\parallel + \omega_\perp$, where
$$\omega_\parallel \in \mathrm{span}(z_1, \ldots, z_N), \qquad \omega_\perp \perp \mathrm{span}(z_1, \ldots, z_N)$$
If $\omega_\perp \neq 0$, then
$$E_{in}(y_n, \omega^{*T} z_n) = E_{in}(y_n, \omega_\parallel^T z_n)$$
however
$$\omega^{*T}\omega^* = \omega_\parallel^T\omega_\parallel + \omega_\perp^T\omega_\perp > \omega_\parallel^T\omega_\parallel$$
so $\omega_\parallel$ would achieve a strictly smaller objective. (Contradiction!)

Representer Theorem

Conclusion
The optimal $\omega^*$ of an L2-regularized linear model is a linear combination of the $z_n$, $n = 1, 2, \cdots, N$, i.e. the optimal $\omega^*$ is represented by the data.

SVM Dual
$$\omega_{SVM} = \sum_{n=1}^{N}\alpha_n y_n z_n$$

PLA
$$\omega_{PLA} = \sum_{n=1}^{N}\beta_n y_n z_n \quad \text{if } \omega_0 = 0$$

Representer Theorem

Kernelized Linear Model
For any such linear model,
$$\omega^{*T} z = \sum_{n=1}^{N}\alpha_n z_n^T z = \sum_{n=1}^{N}\alpha_n K_\Phi(x_n, x)$$
Any L2-regularized linear model can be kernelized.

Kernelized Linear Model
Kernelized Ridge Regression
Kernelized Logistic Regression
Kernelized Support Vector Regression

Kernelized Ridge Regression

Kernelized Ridge Regression
$$\min_\omega \; \frac{\lambda}{N}\omega^T\omega + \frac{1}{N}\sum_{n=1}^{N}\left(y_n - \omega^T z_n\right)^2$$
$$\Rightarrow \min_\alpha \; \frac{\lambda}{N}\sum_{n=1}^{N}\sum_{m=1}^{N}\alpha_n\alpha_m K(x_n, x_m) + \frac{1}{N}\sum_{n=1}^{N}\left(y_n - \sum_{m=1}^{N}\alpha_m K(x_m, x_n)\right)^2$$
$$\Rightarrow \min_\alpha \; \frac{\lambda}{N}\alpha^T K\alpha + \frac{1}{N}(y - K\alpha)^T(y - K\alpha) \equiv \min_\alpha \; E_{aug}(\alpha)$$
$$\Rightarrow \frac{\partial E_{aug}(\alpha)}{\partial\alpha} = \frac{2}{N}K^T\left((\lambda I + K)\alpha - y\right) = 0$$

Kernelized Ridge Regression

Kernelized Ridge Regression
The optimal $\alpha$ is
$$\alpha = (\lambda I + K)^{-1} y$$
The optimal $\omega^*$ is
$$\omega^* = Z^T(\lambda I + K)^{-1} y$$
compared to the optimal $\omega^*$ of ridge regression,
$$\omega^* = (\lambda I + Z^T Z)^{-1} Z^T y$$

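A NumPy sketch of kernel ridge regression as derived above, using a Gaussian kernel on made-up one-dimensional data:

```python
import numpy as np

rng = np.random.default_rng(6)
X = np.sort(rng.uniform(-3, 3, size=(30, 1)), axis=0)
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)
lam, gamma = 0.1, 1.0

def gaussian_kernel(X1, X2, gamma):
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

K = gaussian_kernel(X, X, gamma)
alpha = np.linalg.solve(lam * np.eye(len(X)) + K, y)   # alpha = (lambda*I + K)^{-1} y

X_new = np.array([[0.5]])
y_hat = gaussian_kernel(X, X_new, gamma).T @ alpha     # prediction sum_m alpha_m K(x_m, x)
print(y_hat, np.sin(0.5))
```
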
Kernelized L2-Regularized Logistic Regression

Kernelized L2-Regularized Logistic Regression
$$\min_\omega \; \frac{\lambda}{N}\omega^T\omega + \frac{1}{N}\sum_{n=1}^{N}\ln\left(1 + \exp(-y_n\omega^T z_n)\right)$$
$$\Rightarrow \min_\alpha \; \frac{\lambda}{N}\sum_{n=1}^{N}\sum_{m=1}^{N}\alpha_n\alpha_m K(x_n, x_m) + \frac{1}{N}\sum_{n=1}^{N}\ln\left(1 + \exp\left(-y_n\sum_{m=1}^{N}\alpha_m K(x_m, x_n)\right)\right)$$
Two readings of the kernelized problem:
a linear model in $\omega$ with an embedded-in-kernel transformation and an L2-regularizer;
a linear model in $\alpha$ with the kernel as transformation and a kernelized regularizer.

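A sketch of optimizing the kernelized objective by gradient descent directly in $\alpha$ (data, kernel width, and step size are illustrative assumptions):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(1.0, 1.0, size=(40, 2)), rng.normal(-1.0, 1.0, size=(40, 2))])
y = np.hstack([np.ones(40), -np.ones(40)])
N, lam, gamma, eta = len(y), 0.1, 0.5, 0.1

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-gamma * sq)                        # Gaussian kernel matrix

alpha = np.zeros(N)
for _ in range(3000):                          # gradient descent on the objective in alpha
    s = K @ alpha                              # scores sum_m alpha_m K(x_m, x_n)
    grad = (2 * lam / N) * s - (K @ (y * sigmoid(-y * s))) / N
    alpha -= eta * grad
print("training accuracy:", np.mean(np.sign(K @ alpha) == y))
```
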
Tube Regression

Tube Regression
The tube regression model is
$$h(z) = \omega^T z + b$$
The error function of the model is the $\epsilon$-insensitive error
$$E_{in}(h) = \sum_{i=1}^{N}\max\left(0,\; |h(z_i) - y_i| - \epsilon\right)$$

Tube Regression
Figure: Interpretation of tube regression, from [ResearchGate].

L2-Regularized Tube Regression

L2-Regularized Tube Regression
The L2-regularized tube regression model is
$$\min_{b,\omega} \; \sum_{i=1}^{N}\max\left(0,\; |h(z_i) - y_i| - \epsilon\right) + \frac{\lambda}{N}\omega^T\omega$$

Remember?
Unconstrained Form of the Soft-Margin SVM Primal
$$\text{Soft-Margin SVM Primal} \iff \min_{b,\omega} \; \frac{1}{2}\omega^T\omega + C\sum_{n=1}^{N}\max\left(1 - y_n(\omega^T z_n + b),\; 0\right)$$

Support Vector Regression [Welling, 2004]

SVR Primal: solve $l + 1 + 2N$ variables under $4N$ constraints.

Support Vector Regression Primal
$$\min_{b,\omega,\xi} \; \frac{1}{2}\omega^T\omega + C\sum_{n=1}^{N}\xi_n \quad \text{s.t.} \quad |\omega^T z_n + b - y_n| \leq \epsilon + \xi_n, \quad \xi_n \geq 0, \quad n = 1, 2, \ldots, N$$

Support Vector Regression Primal Refinement
$$\min_{b,\omega,\hat{\xi},\check{\xi}} \; \frac{1}{2}\omega^T\omega + C\sum_{n=1}^{N}\left(\hat{\xi}_n + \check{\xi}_n\right)$$
$$\text{s.t.} \quad -b - \omega^T z_n + \hat{\xi}_n \geq -\epsilon - y_n, \quad b + \omega^T z_n + \check{\xi}_n \geq -\epsilon + y_n, \quad \hat{\xi}_n \geq 0, \quad \check{\xi}_n \geq 0, \quad n = 1, 2, \ldots, N$$

Support Vector Regression

SVR Dual: solve $2N$ variables under $4N + 1$ constraints.

Kernelized Support Vector Regression Dual
$$\min_{\hat{\alpha},\check{\alpha}} \; \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\left(\hat{\alpha}_n - \check{\alpha}_n\right)\left(\hat{\alpha}_m - \check{\alpha}_m\right)k_{n,m} + \sum_{n=1}^{N}\left[(\epsilon - y_n)\hat{\alpha}_n + (\epsilon + y_n)\check{\alpha}_n\right]$$
$$\text{s.t.} \quad \sum_{n=1}^{N}\left(\hat{\alpha}_n - \check{\alpha}_n\right) = 0, \quad 0 \leq \hat{\alpha}_n \leq C, \quad 0 \leq \check{\alpha}_n \leq C, \quad n = 1, 2, \ldots, N$$

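In practice the SVR dual is handed to a library QP/SMO solver; a sketch with scikit-learn's SVR on toy one-dimensional data (the library call and the data are assumptions, not part of the slides):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(8)
X = np.sort(rng.uniform(-3, 3, size=(60, 1)), axis=0)
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)

# epsilon is the half-width of the tube; C trades off flatness against tube violations
model = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=1.0).fit(X, y)
print("support vectors:", len(model.support_), "of", len(X))
print("prediction at x=0.5:", model.predict([[0.5]])[0], "vs sin(0.5) =", np.sin(0.5))
```
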
Support Vector Regression

KKT Conditions of Support Vector Regression
Primal-inner optimal:
$$\hat{\alpha}_n\left(\epsilon + \hat{\xi}_n - y_n + \omega^T z_n + b\right) = 0, \qquad \check{\alpha}_n\left(\epsilon + \check{\xi}_n + y_n - \omega^T z_n - b\right) = 0$$
Dual-inner optimal:
$$\omega = \sum_{n=1}^{N}\left(\hat{\alpha}_n - \check{\alpha}_n\right)z_n, \qquad \sum_{n=1}^{N}\left(\hat{\alpha}_n - \check{\alpha}_n\right) = 0$$

References

ResearchGate. Tube Regression.
Wikipedia. Mercer's Theorem.
Wikipedia. Overfitting.
Burke, J. V. Nonlinear Programming.
Lin, H.-T. Machine Learning Foundations.
Hoppe, R. W. Optimization I.

References

Platt, J. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3), 61-74.
Schölkopf, B., Herbrich, R. and Smola, A. J. (2001). A generalized representer theorem. In International Conference on Computational Learning Theory, 416-426.
Welling, M. (2004). Support Vector Regression.

Q & A
Thank You!