ESL 4.4.3-4.5: Logistic Reression (contd.) and Separating Hyperplane

ESL 4.4.3-4.5
Logistic Regression (contd.)
& Separating Hyperplane
June 8, 2015
Talk by Shinichi TAMURA
Mathematical Informatics Lab @ NAIST

Today's topics
¨  Logistic regression (contd.)"
¨  On the analogy with Least Squares Fitting"
¨  Logistic regression vs. LDA"
¨  Separating Hyperplane"
¨  Rosenblatt's Perceptron"
¨  Optimal Hyperplane"

On the analogy with Least Squares Fitting
[Review] Fitting LR Model
Parameters are ﬁtted by ML estimation, using
Newton-Raphson algorithm:"
"
"
"
βnew
← rg min
β
(z − Xβ) W(z − Xβ)
βnew
=(X WX)−1
X Wz.

[Review] Fitting LR Model
Parameters are ﬁtted by ML estimation, using
Newton-Raphson algorithm:"
"
"
"
It looks like least squares ﬁtting:"
βnew
← rg min
β
(z − Xβ) W(z − Xβ)
βnew
=(X WX)−1
X Wz.
β ← rg min
β
(y − Xβ) (y − Xβ)
β =(X X)−1
X y

Self-consistency
β depends on W and z, while W and z depend on β."
"
"
"
"
"
βnew
← rg min
β
(z − Xβ) W(z − Xβ)
βnew
=(X WX)−1
X Wz.

Self-consistency
"
"
"
"
"
βnew
← rg min
β
(z − Xβ) W(z − Xβ)
βnew
=(X WX)−1
X Wz.
z =
ˆβ +
y − ˆp
ˆp(1 − ˆp)
 = ˆp(1 − ˆp).

Self-consistency
"
"
"
"
→ “self-consistent” equation, needs iterative method
to solve"
"
βnew
← rg min
β
(z − Xβ) W(z − Xβ)
βnew
=(X WX)−1
X Wz.

Meaning of Weighted RSS (1)
RSS is used to check the goodness of ﬁt in least
squares ﬁtting."
"
"
N
=1
(y − ˆp)2

RSS is used to check the goodness of ﬁt in least
squares ﬁtting."
"
"
"
How about weighted RSS in logistic regression?"
N
=1
(y − ˆp)2
N
=1
(y − ˆp)2
ˆp(1 − ˆp)

Weighted RSS is interpreted as..."
Peason's χ-squared statistics"
χ2
=
N
=1
(y − ˆp)2
ˆp
+
(y − ˆp)2
1 − ˆp
=
N
=1
(1 − ˆp + ˆp)(y − ˆp)2
ˆp(1 − ˆp)
=
N
=1
(y − ˆp)2
ˆp(1 − ˆp)
.

or... as quadratic approximation of Deviance"
D = −2
N
=1
[y log ˆp + (1 − y) log(1 − ˆp)] −
N
=1
[y log y + (1 − y) log(1 − y)]
= 2
N
=1
y log
y
ˆp
+ (1 − y) log
1 − y
1 − ˆp
≈ 2
N
=1
(y − ˆp) +
(y − ˆp)2
2 ˆp
+ {(1 − y) − (1 − ˆp)} +
{(1 − y) − (1 − ˆp)}2
2(1 − ˆp)
=
N
=1
(y − ˆp)2
ˆp
+
(y − ˆp)2
1 − ˆp
=
N
=1
(y − ˆp)2
ˆp(1 − ˆp)
.

D = −2
N
=1
[y log ˆp + (1 − y) log(1 − ˆp)] −
N
=1
[y log y + (1 − y) log(1 − y)]
= 2
N
=1
y log
y
ˆp
+ (1 − y) log
1 − y
1 − ˆp
≈ 2
N
=1
(y − ˆp) +
(y − ˆp)2
2 ˆp
+ {(1 − y) − (1 − ˆp)} +
{(1 − y) − (1 − ˆp)}2
2(1 − ˆp)
=
N
=1
(y − ˆp)2
ˆp
+
(y − ˆp)2
1 − ˆp
=
N
=1
(y − ˆp)2
ˆp(1 − ˆp)
.
Maximum
likelihood

of
the
model

D = −2
N
=1
[y log ˆp + (1 − y) log(1 − ˆp)] −
N
=1
[y log y + (1 − y) log(1 − y)]
= 2
N
=1
y log
y
ˆp
+ (1 − y) log
1 − y
1 − ˆp
≈ 2
N
=1
(y − ˆp) +
(y − ˆp)2
2 ˆp
+ {(1 − y) − (1 − ˆp)} +
{(1 − y) − (1 − ˆp)}2
2(1 − ˆp)
=
N
=1
(y − ˆp)2
ˆp
+
(y − ˆp)2
1 − ˆp
=
N
=1
(y − ˆp)2
ˆp(1 − ˆp)
.
Maximum
likelihood

of
the
model

Likelihood
of
the
full
model

which
achieve
perfect
ﬁtting

D = −2
N
=1
[y log ˆp + (1 − y) log(1 − ˆp)] −
N
=1
[y log y + (1 − y) log(1 − y)]
= 2
N
=1
y log
y
ˆp
+ (1 − y) log
1 − y
1 − ˆp
≈ 2
N
=1
(y − ˆp) +
(y − ˆp)2
2 ˆp
+ {(1 − y) − (1 − ˆp)} +
{(1 − y) − (1 − ˆp)}2
2(1 − ˆp)
=
N
=1
(y − ˆp)2
ˆp
+
(y − ˆp)2
1 − ˆp
=
N
=1
(y − ˆp)2
ˆp(1 − ˆp)
.
00

D = −2
N
=1
[y log ˆp + (1 − y) log(1 − ˆp)] −
N
=1
[y log y + (1 − y) log(1 − y)]
= 2
N
=1
y log
y
ˆp
+ (1 − y) log
1 − y
1 − ˆp
≈ 2
N
=1
(y − ˆp) +
(y − ˆp)2
2 ˆp
+ {(1 − y) − (1 − ˆp)} +
{(1 − y) − (1 − ˆp)}2
2(1 − ˆp)
=
N
=1
(y − ˆp)2
ˆp
+
(y − ˆp)2
1 − ˆp
=
N
=1
(y − ˆp)2
ˆp(1 − ˆp)
.
00

 log


= ( − )
+
( − )2
2
−
( − )3
62
+ · · ·

Asymp. distribution of
The distribution of converges to "N β, (X WX)−1ˆβ
ˆβ

Asymp. distribution of
The distribution of converges to "
(See hand-out for the details)"
N β, (X WX)−1
y
i.i.d.
∼ Bern(Pr(; β)).
∴ E[y] = p, vr[y] = W.
∴ E ˆβ = E (X WX)−1
X Wz
= (X WX)−1
X WE Xβ + W−1
(y − p)
= (X WX)−1
X WXβ
= β,
vr ˆβ = (X WX)−1
X Wvr Xβ + W−1
(y − p) W X(X WX)−
= (X WX)−1
X W(W−1
WW−
)W X(X WX)−
= (X WX)−1
.
ˆβ
ˆβ

Test of models for LR
Once a model is obtained, Wald test or Rao's score test
can be used to decide which term to drop/add. It need
no recalculation of IRLS."
Figure from: "Statistics 111: Introduction to Theoretical Statistics" lecture note
by Kevin Andrew Rader, on Harvard College GSAS
http://isites.harvard.edu/icb/icb.do?keyword=k101665&pageid=icb.page651024

Test
by
the
gradient

of
log-‐likelihood

Test
by
the
diﬀerence

of
paremeter

L1-regularlized LR (1)
Just like lasso, L1-regularlizer is effective for LR."

Here the objective function will be:"
"
"
"
"
mx
β0,β
N
=1
log Pr(; β0, β) − λ β 1
= mx
β0,β
N
=1
y(β0 + β ) − log 1 + eβ0+β  − λ
p
j=1
|βj| .

Here the objective function will be:"
"
"
"
"
The resulting algorithm can be called “iterative
reweighted lasso” algorithm."
mx
β0,β
N
=1
log Pr(; β0, β) − λ β 1
= mx
β0,β
N
=1
y(β0 + β ) − log 1 + eβ0+β  − λ
p
j=1
|βj| .

By putting the gradient to 0, we get same score
equation as lasso algorithm:"
∂
∂βj
N
=1
y(β0 + β ) − log 1 + eβ0+β  − λ
p
j=1
|βj| = 0
∴
N
=1
 yj −
eβ0+β 
1 + eβ0+β 
− λ · sign(βj) = 0
∴ xj
(y − p) = λ · sign(βj) (where βj = 0)
00

∂
∂βj
N
=1
y(β0 + β ) − log 1 + eβ0+β  − λ
p
j=1
|βj| = 0
∴
N
=1
 yj −
eβ0+β 
1 + eβ0+β 
− λ · sign(βj) = 0
∴ xj

∂
∂βj
N
=1
y(β0 + β ) − log 1 + eβ0+β  − λ
p
j=1
|βj| = 0
∴
N
=1
 yj −
eβ0+β 
1 + eβ0+β 
− λ · sign(βj) = 0
∴ xj
xj
(y − Xβ) = λ · sign(βj)
Score
equation
of
lasso
is

Since the objective function is concave, the solution
can be obtained using optimization techniques."
"

Since the objective function is concave, the solution
can be obtained using optimization techniques."
"
However, the profiles of coefficients are not piece-
wise linear, and it is difficult to get the path."
Predictor-Corrector method for convex optimization or
coordinate descent algorithm will work in some situations."

Summary
LR is analogous to least squares ﬁtting"
"
and..."
•  LR requires iterative algorithm because of the self-consistency"
•  Weighted RSS can be seen as χ-squared or deviance"
•  The dist. of converges to "
•  Rao's score test or Wald test is useful for model selection"
•  L1-regularlized is analogous to lasso except for non-linearity"
βnew
= (X WX)−1
X Wz ↔ β = (X X)−1
X y
N β, (X WX)−1ˆβ

Today's topics
þ  On the analogy with Least Squares Fitting"
¨  Logistic regression vs. LDA"

Logistic regression vs. LDA
What is the different
LDA and logistic regression are very similar
methods."
Let us study the characteristics of these methods
through the difference of formal aspects."

Form of the log-odds
"
"
"

LDA"
"
"
"
log
Pr(G = k|X = )
Pr(G = K|X = )
= log
πk
πK
−
1
2
(μk + μK ) −1
(μk − μK )
+  −1
(μk − μK )
=αk0 + αk
,

LDA"
"
"
"
"
Logistic regression"
log
Pr(G = k|X = )
Pr(G = K|X = )
= log
πk
πK
−
1
2
(μk + μK ) −1
(μk − μK )
+  −1
(μk − μK )
=αk0 + αk
,
log
Pr(G = k|X = )
Pr(G = K|X = )
=βk0 + βk
.

LDA"
"
"
"
"
log
Pr(G = k|X = )
Pr(G = K|X = )
= log
πk
πK
−
1
2
(μk + μK ) −1
(μk − μK )
+  −1
(μk − μK )
=αk0 + αk
,
log
Pr(G = k|X = )
Pr(G = K|X = )
=βk0 + βk
.
Same
form

Criteria of estimations
"
"

LDA"
"
"
"
"
mx
N
=1
log Pr(G = g, X = )
= mx
N
=1
log Pr(G = g|X = ) log Pr(X = )

LDA"
"
"
"
"
mx
N
=1
log Pr(G = g, X = )
= mx
N
=1
log Pr(G = g|X = ) log Pr(X = )
mx
N
=1
log Pr(G = g|X = )

LDA"
"
"
"
"
mx
N
=1
log Pr(G = g, X = )
= mx
N
=1
log Pr(G = g|X = ) log Pr(X = )
mx
N
=1
log Pr(G = g|X = )
Marginal
likelihood

Form of the Pr(X)
"
"

Form of the Pr(X)
LDA"
"
"
Pr(X) =
K
k=1
πkϕ(X; μk, ).

Form of the Pr(X)
LDA"
"
"
"
Pr(X) =
K
k=1
πkϕ(X; μk, ).
Arbitrary
Pr(X)

Form of the Pr(X)
LDA"
"
"
"
Pr(X) =
K
k=1
πkϕ(X; μk, ).
Arbitrary
Pr(X)

Involves
parameters

Effects of the difference (1)
How these formal difference affect on the
character of the algorithm?"

The assumption of Gaussian and homoscedastic
can be strong constraint, which lead low variance."

In addition, LDA has the advantage that it can
make use of unlabelled observations; i.e. semi-
supervised is available."

In addition, LDA has the advantage that it can
make use of unlabelled observations; i.e. semi-
supervised is available."
"
On the other hand, LDA could be affected by
outliers."

With linear separable data,"

•  The coefﬁcients of LDA is deﬁned well; but training error
may occur."

may occur."
•  The coefﬁcients of LR can be inﬁnite; but true separating
hyperplane can be found"

may occur."
•  The coefﬁcients of LR can be inﬁnite; but true separating
hyperplane can be found"
Do
not
think
too
much
on
training
error;

what
is
important
is

generalization
error

The assumptions for LDA rarely hold in practical."

Nevertheless, it is known empirically that these
models give quite similar results, even when LDA is
used inappropriately, say with qualitative variables."
"

Nevertheless, it is known empirically that these
models give quite similar results, even when LDA is
used inappropriately, say with qualitative variables."
"
After all, however, if Gaussian assumption looks
to hold, use LDA. Otherwise, use logistic
regression."

Today's topics
þ  Logistic regression vs. LDA"

Today's topics
þ  Logistic regression (contd.)"

Separating Hyperplane: Overview
Another way of Classification
Both LDA and LR do classiﬁcation through the
probabilities using regression models."
"

Another way of Classification
Both LDA and LR do classiﬁcation through the
probabilities using regression models."
"
Classiﬁcation can be done by more explicit way:
modelling the decision boundary directly."

Properties of vector algebra
Let L be the afﬁne set deﬁned by"
"
"
"
"
and the signed distance from x to L is "
β0 + β  = 0
d± (, L) =
1
β
(β  + β0)
β0 + β  > 0 ⇔  is above L
β0 + β  = 0 ⇔  is on L
β0 + β  < 0 ⇔  is below L

Today's topics
p  Separating Hyperplane"
p  Rosenblatt's Perceptron"
p  Optimal Hyperplane"

Rosenblatt's Perceptron
Learning Criteria
The basic criteria of Rosenblatt's Perceptron learning
algorithm is to reduce (M is misclassiﬁed data)"
D(β, β0) =
∈M

β + β0
∝
∈M
d± (, L)
= −
∈M
y(
β + β0)
00

Learning Criteria
D(β, β0) =
∈M

β + β0
∝
∈M
d± (, L)
= −
∈M
y(
β + β0)

Learning Criteria
D(β, β0) =
∈M

β + β0
∝
∈M
d± (, L)
= −
∈M
y(
β + β0)
If
misclassiﬁed
yi=1
as
-‐1,

the
latter
part
is
negative

Learning Criteria
D(β, β0) =
∈M

β + β0
∝
∈M
d± (, L)
= −
∈M
y(
β + β0)
If
misclassiﬁed
yi=-‐1
as
1,

the
latter
part
is
positive

Learning Algorithm (1)
Instead of reducing D by batch learning,
“stochastic” gradient descent algorithm is adopted. "
The coefﬁcients are updated for each misclassiﬁed
observations like online learning."

Observations
classiﬁed
correctly

do
not
aﬀects
the
parameter,
so

it
is
robust
to
outliers.

"
Thus, coefﬁcients will be updated based not on D
but on single"D(β, β0) = −y(
β + β0)

Proceedings of the algorithm is as follows:"

1.  Take 1 observation xi and classify it"

2.  If the classiﬁcation was wrong, update coefﬁcients"
∂D(β, β0)
∂β
= − y,
∂D(β, β0)
∂β0
= − y.
∴
β
β0
←
β
β0
+ρ
y
y00

∂D(β, β0)
∂β
= − y,
∂D(β, β0)
∂β0
= − y.
∴
β
β0
←
β
β0
+ρ
y
y

∂D(β, β0)
∂β
= − y,
∂D(β, β0)
∂β0
= − y.
∴
β
β0
←
β
β0
+ρ
y
y
Learning
rate

Can
be
set
to
1
without

loss
of
generality

Updating parameter may lead misclassiﬁcations of
other correctly-classiﬁed observations."
Therefore, although each update reduces each Di ,
it can increase total D."

Convergence Theorem
If data is linear separable learning of perceptron
terminates in ﬁnite steps. "
Otherwise, learning never terminates."
"

Convergence Theorem

Convergence Theorem
"
However, in practical, it is difﬁcult to know if"
•  the data is not linear separable and never converge"
•  or the data is linear separable but time-consuming"
"
"

Convergence Theorem
"
However, in practical, it is difﬁcult to know if"
•  the data is not linear separable and never converge"
•  or the data is linear separable but time-consuming"
"
In addition, the solution is not unique depending on
the initial value or data order."
"

Today's topics
þ  Rosenblatt's Perceptron"

Optimal Hyperplane
Derivation of KKT cond. (1)
This section could be hard for some audience."
"
To make story bit clearer, let us study general on
optimization problem. The theme is:"

Optimal Hyperplane
This section could be hard for some audience."
"
To make story bit clearer, let us study general on
optimization problem. The theme is:"
Duality and KKT condition for optimization problem"

Optimal Hyperplane
Suppose we have an optimization problem:"
"
"
"
and let the feasible region be"
minimize ƒ()
subject to g() ≤ 0
C = {|g() ≤ 0}

Optimal Hyperplane
On the region of optimization, relaxation is the
technique often used to make problem easier."
"

Optimal Hyperplane
On the region of optimization, relaxation is the
technique often used to make problem easier."
"
Lagrange relaxation, as done below, is one of that:"
" minimize L(, y) = ƒ() +

yg()
subject to y ≥ 0.

Optimal Hyperplane
Concerning to the L(x,y), following inequality holds:"
"
"
and it requires yi or gi(x) to be equal to zero for all i
(this condition is called “complementary slackness” )."
min
∈C
ƒ() = min

sp
y≥0
L(, y) ≥ mx
y≥0
inf

L(, y)

Optimal Hyperplane
Concerning to the L(x,y), following inequality holds:"
"
"
and it requires yi or gi(x) to be equal to zero for all i
(this condition is called “complementary slackness” )."
"
According to the inequality, maximizing infx L(x,y) tells
us the lower boundary for the original problem."
min
∈C
ƒ() = min

sp
y≥0
L(, y) ≥ mx
y≥0
inf

L(, y)

Optimal Hyperplane
Therefore, we have the following maximizing problem:"
"
"
"
"
mximize L(, y)
subject to
∂
∂
L(, y) = 0
y ≥ 0

Optimal Hyperplane
"
"
"
"
subject to
∂
∂
L(, y) = 0
y ≥ 0 Condition
to

achieve
inf
L(x,y)

Optimal Hyperplane
"
"
"
"
This is called “Wolfe dual problem”, and strong duality
theory says the solutions for the primal and dual
problem are equivalent."
subject to
∂
∂
L(, y) = 0
y ≥ 0

Optimal Hyperplane
Thus, optimal solution must satisfy the conditions so
far. They are called the “KKT condition” altogether."



g() ≤ 0
∂
∂
L(, y) = 0
y ≥ 0
yg() = 0

Optimal Hyperplane



g() ≤ 0
∂
∂
L(, y) = 0
y ≥ 0
yg() = 0
Primal
constraint

Optimal Hyperplane



g() ≤ 0
∂
∂
L(, y) = 0
y ≥ 0
yg() = 0
Primal
constraint

Stationary
condition

Optimal Hyperplane



g() ≤ 0
∂
∂
L(, y) = 0
y ≥ 0
yg() = 0
Primal
constraint

Stationary
condition

Dual
constraint

Optimal Hyperplane



g() ≤ 0
∂
∂
L(, y) = 0
y ≥ 0
yg() = 0
Primal
constraint

Stationary
condition

Dual
constraint

Complementary
slackness

Optimal Hyperplane
KKT for Opt. Hyperplane (1)
We learned about the KKT conditions."
"
Then, get back to the original problem: ﬁnding
optimal hyperplane."

Optimal Hyperplane
The original ﬁtting criteria of the optimal hyperplane
is what is generalized of perceptron:"
" mximize
β,β0
M
subject to β = 1
y(
β + β0) ≥ M ( = 1, . . . , N)

Optimal Hyperplane
The original ﬁtting criteria of the optimal hyperplane
is what is generalized of perceptron:"
" mximize
β,β0
M
subject to β = 1
y(
β + β0) ≥ M ( = 1, . . . , N)
Criteria
of
maximizing
margin
is

theoretically
supported
using

distributions
with
no
assumption

Optimal Hyperplane
This is kind of mini-max problem which is difﬁcult to
solve, so convert it into more easier problem:"
"
"
"

Optimal Hyperplane
"
"
"
(See hand-out for the detailed transformation)"
minimize
β,β0
1
2
β 2
subject to y(
β + β0) ≥ 1 ( = 1, . . . , N)

Optimal Hyperplane
"
"
"
(See hand-out for the detailed transformation)"
"
This is quadratic programming problem."
minimize
β,β0
1
2
β 2
subject to y(
β + β0) ≥ 1 ( = 1, . . . , N)

Optimal Hyperplane
To make use of KKT condition, let's make object
function into Lagrange function:"
Lp =
1
2
β 2
−
N
=1
α y(
β + β0) − 1

Optimal Hyperplane
Thus, the KKT condition is:"



y(
β + β0) ≥ 1 ( = 1, . . . , N),
β =
N
=1
αy,
0 =
N
=1
αy,
α ≥ 0 ( = 1, . . . , N),
α y(
β + β0) − 1 = 0 ( = 1, . . . , N),

Optimal Hyperplane
Thus, the KKT condition is:"
"
"
"
"
"
"
"
"
Solution is obtained by solving this."



y(
β + β0) ≥ 1 ( = 1, . . . , N),
β =
N
=1
αy,
0 =
N
=1
αy,
α ≥ 0 ( = 1, . . . , N),
α y(
β + β0) − 1 = 0 ( = 1, . . . , N),

Optimal Hyperplane
Support points (1)
The KKT condition tell us"
"
"
"
α > 0 ⇔ y(
β + β0) = 1 ⇔ 
α = 0 ⇔ y(
β + β0) > 1 ⇔ 
is
on
edge
of
slab

is
oﬀ
edge
of
slab

Optimal Hyperplane
Support points (1)
The KKT condition tell us"
"
"
"
Those points on the edge of the slab is called
“support points” (or “support vectors” )."
α > 0 ⇔ y(
β + β0) = 1 ⇔ 
α = 0 ⇔ y(
β + β0) > 1 ⇔ 
is
on
edge
of
slab

is
oﬀ
edge
of
slab

Optimal Hyperplane
Support points (2)
β can be written as the linear combination of the
support points:"
"
"
"
where S is the indices of the support points."
β =
N
=1
αy
=
∈S
αy,

Optimal Hyperplane
Support points (3)
β0 can be obtained after β is obtained. For i S"
y(
β + β0) = 1
∴ β0 = 1/y − β 
= y −
j∈S
αjyjj

∴ β0 =
1
|S| ∈S
y −
j∈S
αjyjj

00

Optimal Hyperplane
Support points (3)
β0 can be obtained after β is obtained. For i S"
y(
β + β0) = 1
∴ β0 = 1/y − β 
= y −
j∈S
αjyjj

∴ β0 =
1
|S| ∈S
y −
j∈S
αjyjj

Took
average
to
avoid

computation
error

Optimal Hyperplane
Support points (4)
All coefﬁcients are deﬁned only through support points:"
"
"
"
"
thus, this is robust to outliers."
"
β =
∈S
αy,
β0 =
1
|S| ∈S
y −
j∈S
αjyjj


Optimal Hyperplane
Support points (4)
All coefficients are defined only through support points:"
"
"
"
"
thus, this is robust to outliers."
"
However, do not forget that which will be support points
is defined using all data points."
β =
∈S
αy,
β0 =
1
|S| ∈S
y −
j∈S
αjyjj


Today's topics
þ  Optimal Hyperplane"

Today's topics
þ  Separating Hyperplane"
þ  Optimal Hyperplane"

Summary
LDA
Logistic
Regression
Perceptron
Optimal
Hyperplane
With linear separable data
Training error
may occur
True separator
found, but coef.
may be inﬁnite
True separator
found, but not
unique
Best separator
found
With non-linear separable data
Work well Work well Algorithm never
stop
Not feasible
With outliers
Not robust Robust Robust Robust

ESL 4.4.3-4.5: Logistic Reression (contd.) and Separating Hyperplane

More Related Content

Viewers also liked

Similar to ESL 4.4.3-4.5: Logistic Reression (contd.) and Separating Hyperplane

Recently uploaded

ESL 4.4.3-4.5: Logistic Reression (contd.) and Separating Hyperplane