Linear Models for Classification
Sung-Yub Kim
Dept of IE, Seoul National University
February 18, 2017
1 Introduction
2 Discriminant Functions
3 Probabilistic Generative Models
4 Probabilistic Discriminative Models
5 The Laplace Approximation
6 Bayesian Logistic Regression
Introduction
Bishop, C. M. Pattern Recognition and Machine Learning. Information Science and Statistics, Springer, 2006.
Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machine Learning, MIT Press, 2012.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. Computer Science and Intelligent Systems, MIT Press, 2016.
Introduction
Goal: take an input vector x and assign it to one of K discrete classes C_k, where k = 1, ..., K.
The input space is divided into decision regions whose boundaries are called decision boundaries or decision surfaces.
Linearly separable → separating hyperplane theorem
1-of-K Coding Scheme
t = (0, 0, 1, 0, 0)^T    (1)
Each t_k can be interpreted as the probability that the class is C_k.
Generalized Linear Model
y(x) = f(w^T x + w_0)    (2)
The nonlinear function f(·), which is needed to produce a probabilistic output, is called the activation function, and its inverse is called the link function. The decision boundaries satisfy y(x) = c for some constant c, i.e. w^T x + w_0 = const, so they are linear in x even though f itself is nonlinear; such models are therefore called Generalized Linear Models (GLMs).
Discriminant Functions
Linear Discriminant Functions
Linear Discriminant Function
y(x) = w^T x + w_0 = \tilde{w}^T \tilde{x}    (3)
w is called the weight vector and w_0 the bias (−w_0 is called the threshold).
Decision Criterion
Assign x to C_1 if y(x) ≥ 0, and to C_2 if y(x) < 0.    (4)
Target Coding Scheme
There are one-versus-the-rest and one-versus-one schemes, but both leave ambiguously classified regions. Therefore, we use a single K-class discriminant comprising K linear functions of the form
y_k(x) = w_k^T x + w_{k0}    (5)
and assign x to class C_k if
y_k(x) > y_j(x) for all j ≠ k    (6)
The decision boundary between classes C_k and C_j is then
(w_k − w_j)^T x + (w_{k0} − w_{j0}) = 0    (7)
Discriminant Functions
Least Squares
Model
y(x) = \tilde{W}^T \tilde{x}    (8)
where the k-th column of \tilde{W} is \tilde{w}_k = (w_{k0}, w_k^T)^T and \tilde{x} = (1, x^T)^T.
SSE
E_D(\tilde{W}) = \frac{1}{2} \|\tilde{X}\tilde{W} − T\|_F^2 = \frac{1}{2} \mathrm{tr}\{(\tilde{X}\tilde{W} − T)^T (\tilde{X}\tilde{W} − T)\}    (9)
where the n-th row of \tilde{X} is \tilde{x}_n^T.
Closed-form Solution
\tilde{W} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T T = \tilde{X}^\dagger T    (10)
Therefore, the discriminant function is
y(x) = \tilde{W}^T \tilde{x} = T^T (\tilde{X}^\dagger)^T \tilde{x}    (11)
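As a concrete illustration, here is a minimal NumPy sketch of the closed-form solution (10) and the resulting decision rule (11); the function names, data shapes, and the one-hot target matrix T are assumptions made for the example, not part of the slides.

```python
import numpy as np

# Minimal sketch of least-squares classification (Eqs. 10-11); names are illustrative.
def fit_least_squares(X, T):
    """X: (N, D) inputs, T: (N, K) one-hot (1-of-K) targets."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])      # prepend the bias feature
    W_tilde, *_ = np.linalg.lstsq(X_tilde, T, rcond=None)   # pseudo-inverse solution (Eq. 10)
    return W_tilde                                          # shape (D+1, K)

def predict(W_tilde, X):
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    Y = X_tilde @ W_tilde                                   # y(x) = W~^T x~ for each row
    return np.argmax(Y, axis=1)                             # assign to the largest discriminant
```

Using `lstsq` rather than forming the normal equations explicitly is a numerical-stability choice; it computes the same pseudo-inverse solution.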
Discriminant Functions
Least Squares
Limitations
1 Output value cannot have probabilistic interpretation.
2 LS solutions lack robustness to outliers.
3 SSE function penalizes predictions that are ’too correct’.
Origin of Limitations
Least squares corresponds to maximum likelihood under the assumption of a Gaussian conditional distribution, whereas binary target vectors have a distribution that is far from Gaussian.
Discriminant Functions
Fisher's Linear Discriminant Analysis
Motivation: dimensionality reduction
Simple Model: choose w ∈ {w : \|w\| = 1} so as to maximize
m_2 − m_1 = w^T (\mathbf{m}_2 − \mathbf{m}_1)    (12)
where \mathbf{m}_1 = \frac{1}{N_1} \sum_{n \in C_1} x_n and \mathbf{m}_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n.
Revised Model: choose w to give a large separation between the projected class means while also giving a small variance within each class. Therefore, we need to maximize
J(w) = \frac{(m_2 − m_1)^2}{s_1^2 + s_2^2} = \frac{w^T S_B w}{w^T S_W w}    (13)
where s_k^2 = \sum_{n \in C_k} (y_n − m_k)^2 is the within-class variance of the projected data from class C_k (with y_n = w^T x_n and m_k = w^T \mathbf{m}_k),
S_B = (\mathbf{m}_2 − \mathbf{m}_1)(\mathbf{m}_2 − \mathbf{m}_1)^T
is the between-class covariance matrix, and
S_W = \sum_{n \in C_1} (x_n − \mathbf{m}_1)(x_n − \mathbf{m}_1)^T + \sum_{n \in C_2} (x_n − \mathbf{m}_2)(x_n − \mathbf{m}_2)^T
is the total within-class covariance matrix.
Discriminant Functions
Fisher's Linear Discriminant Analysis
Closed-form Solution
By a simple calculation,
w ∝ S_W^{-1} (\mathbf{m}_2 − \mathbf{m}_1)    (14)
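A minimal NumPy sketch of the two-class Fisher direction (14); the function and variable names are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the two-class Fisher direction (Eq. 14).
def fisher_direction(X1, X2):
    """X1, X2: (N1, D) and (N2, D) samples of the two classes."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter
    w = np.linalg.solve(S_W, m2 - m1)                        # w ∝ S_W^{-1}(m2 - m1)
    return w / np.linalg.norm(w)
```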
Relation to LS
By setting the target values to N/N_1 for class C_1 and −N/N_2 for class C_2, one can show that Fisher's LDA is equivalent to the least-squares solution.
Multiple Classes
One multi-class generalization of Fisher's criterion is to maximize
J(W) = \mathrm{tr}\{(W S_W W^T)^{-1} (W S_B W^T)\}    (15)
where S_W = \sum_{k=1}^K S_k, S_k = \sum_{n \in C_k} (x_n − \mathbf{m}_k)(x_n − \mathbf{m}_k)^T, and \mathbf{m}_k = \frac{1}{N_k} \sum_{n \in C_k} x_n.
The total covariance matrix S_T = \sum_{n=1}^N (x_n − \mathbf{m})(x_n − \mathbf{m})^T, where \mathbf{m} is the mean of all the data, can be decomposed as
S_T = S_W + S_B    (16)
where S_B = \sum_{k=1}^K N_k (\mathbf{m}_k − \mathbf{m})(\mathbf{m}_k − \mathbf{m})^T.
Discriminant Functions
The Perceptron Algorithm
Motivation: apply a fixed nonlinear feature transformation φ(x) before the linear classifier.
Model
y(x) = f(w^T φ(x))    (17)
where
f(a) = +1 if a ≥ 0, and −1 if a < 0    (18)
Error Function
E_P(w) = −\sum_{n \in M} w^T φ_n t_n    (19)
where M is the set of misclassified patterns.
SGD
Applying stochastic gradient descent, we get
w^{(τ+1)} = w^{(τ)} − η ∇E_P(w) = w^{(τ)} + η φ_n t_n    (20)
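A minimal sketch of the update rule (20) as a training loop; the function name, the epoch limit, and the {−1, +1} target encoding are assumptions for the example.

```python
import numpy as np

# Minimal sketch of the perceptron update (Eq. 20) on fixed features phi.
def train_perceptron(Phi, t, eta=1.0, max_epochs=100):
    """Phi: (N, M) feature matrix, t: (N,) targets in {-1, +1}."""
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        misclassified = 0
        for phi_n, t_n in zip(Phi, t):
            if t_n * (w @ phi_n) <= 0:        # pattern is treated as misclassified
                w += eta * phi_n * t_n        # w <- w + eta * phi_n * t_n
                misclassified += 1
        if misclassified == 0:                # a full pass with no mistakes: stop
            break
    return w
```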
Discriminant Functions
The Perceptron Algorithm
Perceptron Convergence Theorem
If the training data set is linearly separable, then the perceptron learning
algorithm is guaranteed to find an exact solution in a finite number of
steps.
Limitations
1 The convergence theorem says nothing about the rate of convergence, and in practice convergence can be very slow.
2 The perceptron is based on linear combinations of fixed basis functions. This limitation is addressed in chapters 5 and 6.
Probabilistic Generative Models
Introduction
Generative Approach
Model the class-conditional densities p(x|C_k) and the class priors p(C_k).
Posterior Probability and Activation Function
For two classes, the posterior probability can be written as
p(C_1|x) = \frac{p(x|C_1) p(C_1)}{p(x|C_1) p(C_1) + p(x|C_2) p(C_2)} = \frac{\exp(a)}{\exp(a) + 1} = σ(a)    (21)
where a = \ln \frac{p(x|C_1) p(C_1)}{p(x|C_2) p(C_2)} and σ is the logistic sigmoid function. The quantity a is called the log odds.
More generally, the posterior probability can be written as
p(C_k|x) = \frac{p(x|C_k) p(C_k)}{\sum_j p(x|C_j) p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}    (22)
where a_j = \ln(p(x|C_j) p(C_j)). The normalized exponential is called the softmax function; it acts as a smoothed version of the argmax function.
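A minimal sketch of the two activation functions (21) and (22); the max-subtraction trick is a standard numerical-stability choice added here, not something stated on the slide.

```python
import numpy as np

# Minimal sketch of the sigmoid (Eq. 21) and softmax (Eq. 22) activations.
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    """a: (..., K) vector of activations a_k = ln p(x|C_k)p(C_k)."""
    a = a - a.max(axis=-1, keepdims=True)          # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)
```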
Probabilistic Generative Models
Continuous Inputs
Model
If we assume Gaussian class-conditional densities with a shared covariance matrix,
p(x|C_k) = \frac{1}{(2π)^{D/2}} \frac{1}{|Σ|^{1/2}} \exp\{−\frac{1}{2} (x − μ_k)^T Σ^{-1} (x − μ_k)\}    (23)
for k = 1, 2, then we get
p(C_1|x) = σ(w^T x + w_0)    (24)
where w = Σ^{-1}(μ_1 − μ_2) and w_0 = −\frac{1}{2} μ_1^T Σ^{-1} μ_1 + \frac{1}{2} μ_2^T Σ^{-1} μ_2 + \ln \frac{p(C_1)}{p(C_2)}.
Similarly, in the multi-class case the model is
a_k(x) = w_k^T x + w_{k0}    (25)
where w_k = Σ^{-1} μ_k and w_{k0} = −\frac{1}{2} μ_k^T Σ^{-1} μ_k + \ln p(C_k).
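A minimal sketch of the two-class posterior parameters in (24); the function and argument names are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the two-class posterior parameters (Eq. 24).
def posterior_params(mu1, mu2, Sigma, prior1, prior2):
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)                              # w = Sigma^{-1}(mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(prior1 / prior2))
    return w, w0                                             # p(C1|x) = sigmoid(w @ x + w0)
```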
Probabilistic Generative Models
MLE
Gaussian Class-conditional Densities
The likelihood function is
p(t, X|π, μ_1, μ_2, Σ) = \prod_{n=1}^N [π N(x_n|μ_1, Σ)]^{t_n} [(1 − π) N(x_n|μ_2, Σ)]^{1 − t_n}    (26)
where π is the class prior probability, μ_k is the class mean, t_n = 1 if the n-th data point belongs to class C_1 and 0 otherwise, and the classes share a covariance matrix.
Closed-form Solution
This problem can be solved exactly:
π = \frac{N_1}{N_1 + N_2}    (27)
μ_1 = \frac{1}{N_1} \sum_{n=1}^N t_n x_n, \quad μ_2 = \frac{1}{N_2} \sum_{n=1}^N (1 − t_n) x_n    (28)
Σ = S = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2    (29)
where S_1 = \frac{1}{N_1} \sum_{n \in C_1} (x_n − μ_1)(x_n − μ_1)^T and S_2 = \frac{1}{N_2} \sum_{n \in C_2} (x_n − μ_2)(x_n − μ_2)^T.
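A minimal sketch of the closed-form maximum-likelihood estimates (27)-(29); the function name and the label encoding are assumptions for the example.

```python
import numpy as np

# Minimal sketch of the shared-covariance Gaussian MLE (Eqs. 27-29).
def fit_gaussian_generative(X, t):
    """X: (N, D) inputs, t: (N,) binary labels with 1 for class C1, 0 for class C2."""
    X1, X2 = X[t == 1], X[t == 0]
    N1, N2 = len(X1), len(X2)
    pi = N1 / (N1 + N2)                                   # Eq. 27
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)           # Eq. 28
    S1 = (X1 - mu1).T @ (X1 - mu1) / N1
    S2 = (X2 - mu2).T @ (X2 - mu2) / N2
    Sigma = (N1 * S1 + N2 * S2) / (N1 + N2)               # Eq. 29: shared covariance
    return pi, mu1, mu2, Sigma
```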
Probabilistic Generative Models
Discrete Features
Naive Bayes
In the general case, we would need to consider all possible combinations of discrete feature values. But under the naive Bayes assumption, i.e. feature values treated as independent conditioned on the class C_k, the class-conditional likelihood is easy to compute:
p(x|C_k) = \prod_{i=1}^D μ_{ki}^{x_i} (1 − μ_{ki})^{1 − x_i}    (30)
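A minimal sketch of the Bernoulli class-conditional log-likelihood (30); the function name and the small epsilon guard are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the Bernoulli class-conditional log-likelihood (Eq. 30).
def bernoulli_log_likelihood(x, mu_k):
    """x: (D,) binary feature vector, mu_k: (D,) Bernoulli parameters for class C_k."""
    eps = 1e-12                                           # avoid log(0)
    return np.sum(x * np.log(mu_k + eps) + (1 - x) * np.log(1 - mu_k + eps))
```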
Probabilistic Generative Models
Exponential Family
Likelihood of the Exponential Family
p(x|η) = h(x) g(η) \exp\{η^T u(x)\}    (31)
If we assume that u(x) = x and introduce a scaling parameter s, the density can be written as
p(x|η_k, s) = \frac{1}{s} h(\frac{1}{s} x) g(η_k) \exp\{\frac{1}{s} η_k^T x\}    (32)
Closed-form Solution for the Exponential Family
From the above, in binary classification we get
a(x) = \frac{1}{s} (η_1 − η_2)^T x + \ln \frac{g(η_1)}{g(η_2)} + \ln \frac{p(C_1)}{p(C_2)}    (33)
and in multi-class classification
a_k(x) = \frac{1}{s} η_k^T x + \ln g(η_k) + \ln p(C_k)    (34)
Probabilistic Discriminative Models
Introduction
Discriminative Approach
Use the functional form of the GLM explicitly and determine its parameters directly by maximum likelihood.
Advantages
1 Fewer adaptive parameters
2 No class-conditional density assumption is needed
Fixed Basis Functions
In the discriminative approach, we model the posterior probabilities directly and then apply standard decision theory. Since fixed basis functions have some limitations, we can later generalize to basis functions that adapt to the data.
Probabilistic Discriminative Models
Logistic Regression
Model
p(C_1|φ) = y(φ) = σ(w^T φ)    (35)
With this model we only need to find M adaptive parameters, far fewer than for the Gaussian generative model, where the covariance parameters must also be estimated.
Likelihood
We can write the likelihood as
p(t|w) = \prod_{n=1}^N y_n^{t_n} \{1 − y_n\}^{1 − t_n}    (36)
Taking the negative logarithm gives the cross-entropy error function
E(w) = −\ln p(t|w) = −\sum_{n=1}^N \{t_n \ln y_n + (1 − t_n) \ln(1 − y_n)\}    (37)
Probabilistic Discriminative Models
Logistic Regression
Gradient of the Cross-Entropy
Taking the gradient of the error function, we get
∇_w E(w) = \sum_{n=1}^N (y_n − t_n) φ_n    (38)
and each term of this gradient can be interpreted as
(error) × (basis function vector)    (39)
Stochastic Gradient Descent
We can use the above to obtain a sequential algorithm, in which the weight vector is updated using
∇_w E_n(w) = (y_n − t_n) φ_n    (40)
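A minimal sketch of sequential training with the per-pattern gradient (40); the function name, learning rate, and epoch count are assumptions for the example.

```python
import numpy as np

# Minimal sketch of SGD training for logistic regression (Eq. 40).
def sgd_logistic_regression(Phi, t, eta=0.1, epochs=50):
    """Phi: (N, M) design matrix of basis functions, t: (N,) targets in {0, 1}."""
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        for phi_n, t_n in zip(Phi, t):
            y_n = 1.0 / (1.0 + np.exp(-(w @ phi_n)))   # y_n = sigmoid(w^T phi_n)
            w -= eta * (y_n - t_n) * phi_n             # w <- w - eta * grad E_n(w)
    return w
```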
Probabilistic Discriminative Models
IRLS
Iterative Reweighted Least Squares
For logistic regression there is no longer a closed-form solution, due to the nonlinearity of the logistic sigmoid. Fortunately, the cross-entropy error function is convex, so we can find the global optimum by an iterative method.
Newton-Raphson Method
The Newton-Raphson method is an iterative update of the form
w^{(τ+1)} = w^{(τ)} − H^{-1} ∇_w E(w)    (41)
The gradient and Hessian of our error function are
∇_w E(w) = \sum_{n=1}^N (y_n − t_n) φ_n = Φ^T (y − t)    (42)
∇_w^2 E(w) = \sum_{n=1}^N y_n (1 − y_n) φ_n φ_n^T = Φ^T R Φ    (43)
where R = \mathrm{diag}(y_n (1 − y_n)).
Because the weighting matrix R depends on w, we must recompute it at every iteration.
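A minimal sketch of the IRLS iteration (41)-(43); the function name and the fixed iteration count are assumptions, and in practice one would add a convergence check (and possibly a small ridge term if Φ^T R Φ is ill-conditioned).

```python
import numpy as np

# Minimal sketch of IRLS / Newton-Raphson for logistic regression (Eqs. 41-43).
def irls_logistic_regression(Phi, t, n_iter=20):
    """Phi: (N, M) design matrix, t: (N,) targets in {0, 1}."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = 1.0 / (1.0 + np.exp(-(Phi @ w)))          # current predictions
        R = np.diag(y * (1 - y))                      # weighting matrix, recomputed each iteration
        grad = Phi.T @ (y - t)                        # Eq. 42
        H = Phi.T @ R @ Phi                           # Eq. 43
        w -= np.linalg.solve(H, grad)                 # Newton-Raphson step (Eq. 41)
    return w
```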
Probabilistic Discriminative Models
Multiclass Logistic Regression
Likelihood
As in the binary case, we can write the likelihood of the model as
p(T|w_1, ..., w_K) = \prod_{n=1}^N \prod_{k=1}^K y_{nk}^{t_{nk}}    (44)
Taking the negative logarithm gives the cross-entropy error function
E(w_1, ..., w_K) = −\ln p(T|w_1, ..., w_K) = −\sum_{n=1}^N \sum_{k=1}^K t_{nk} \ln y_{nk}    (45)
By an argument similar to the binary case, we get
∇_{w_j} E(w_1, ..., w_K) = \sum_{n=1}^N (y_{nj} − t_{nj}) φ_n    (46)
∇_{w_k} ∇_{w_j} E(w_1, ..., w_K) = \sum_{n=1}^N y_{nk} (I_{kj} − y_{nj}) φ_n φ_n^T    (47)
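A minimal sketch of batch gradient descent using the multiclass gradient (46); the function name, learning rate, and iteration count are assumptions, and a Newton/IRLS scheme using (47) would also be possible.

```python
import numpy as np

# Minimal sketch of gradient descent for multiclass logistic regression (Eq. 46).
def softmax_regression_gd(Phi, T, eta=0.1, n_iter=200):
    """Phi: (N, M) design matrix, T: (N, K) one-hot target matrix."""
    M, K = Phi.shape[1], T.shape[1]
    W = np.zeros((M, K))                                      # columns are w_1, ..., w_K
    for _ in range(n_iter):
        A = Phi @ W                                           # activations a_nk = w_k^T phi_n
        A -= A.max(axis=1, keepdims=True)                     # numerical stability
        Y = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)  # softmax outputs y_nk
        grad = Phi.T @ (Y - T)                                # column j is Eq. 46
        W -= eta * grad
    return W
```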
Probabilistic Discriminative Models
Probit Regression
Inverse Probit Function
The inverse probit function is defined as
Φ(a) = \int_{−∞}^{a} N(θ|0, 1) dθ    (48)
and the GLM based on an inverse probit activation function is known as probit regression.
Limitations
The logistic sigmoid decays asymptotically like \exp(−x) for x → ∞, whereas the tails of the inverse probit activation function decay like \exp(−x^2). Because the probit tails decay faster, misclassified points far from the decision boundary are penalized much more heavily, so the probit model is more sensitive to outliers.
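A minimal sketch of the inverse probit activation (48), using the standard identity relating the Gaussian CDF to the error function; the function name is an illustrative assumption.

```python
import math

# Minimal sketch of the inverse probit activation (Eq. 48) via the error function.
def inverse_probit(a):
    """Cumulative distribution function of the standard normal, Phi(a)."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))
```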
Probabilistic Discriminative Models
Canonical Link Functions
Canonical Link Function
Assume a conditional distribution for the target variable from the exponential family, together with a corresponding choice of activation function known as the canonical link function.
Likelihood Function
First, we assume a scaled exponential-family conditional distribution for the target,
p(t|η, s) = \frac{1}{s} h(\frac{t}{s}) g(η) \exp\{\frac{η t}{s}\}    (49)
From the first moment of this distribution, we get
y = E[t|η] = −s \frac{d}{dη} \ln g(η)    (50)
We denote this relation by η = ψ(y). In the definition of a GLM, f(·) is the activation function and its inverse f^{-1}(·) is the link function. The log-likelihood function is
\ln p(t|η, s) = \sum_{n=1}^N \{\ln g(η_n) + \frac{η_n t_n}{s}\} + \mathrm{const.}    (51)
Probabilistic Discriminative Models
Canonical Link Functions
Gradient of the Log-Likelihood
From the previous page, using the chain rule,
∇_w \ln p(t|η, s) = \sum_{n=1}^N \{\frac{d}{dη_n} \ln g(η_n) + \frac{t_n}{s}\} \frac{dη_n}{dy_n} \frac{dy_n}{da_n} ∇_w a_n = \sum_{n=1}^N \frac{1}{s} \{t_n − y_n\} ψ'(y_n) f'(a_n) φ_n    (52)
If we use the canonical link f^{-1} = ψ, then f'(a_n) ψ'(y_n) = 1, and the gradient of the error function E(w) = −\ln p(t|η, s) takes the simple form
∇_w E(w) = \frac{1}{s} \sum_{n=1}^N \{y_n − t_n\} φ_n
the same (error) × (basis function) form as in logistic and softmax regression.
The Laplace Approximation
Process of the Laplace Approximation
Motivation
Find a Gaussian approximation to a probability density defined over a set of continuous variables.
How?
1 Find the mode z_0 and evaluate the Hessian A = −∇∇ \ln f(z)|_{z = z_0}.
2 Using this information, we get
f(z) ≈ f(z_0) \exp\{−\frac{1}{2} (z − z_0)^T A (z − z_0)\}    (53)
3 Normalizing this approximation (a code sketch follows below), we get
q(z) = \frac{|A|^{1/2}}{(2π)^{M/2}} \exp\{−\frac{1}{2} (z − z_0)^T A (z − z_0)\} = N(z|z_0, A^{-1})    (54)
Limitation
Since it is based on a Gaussian, the approximation can fail for strongly non-Gaussian (e.g. multimodal) densities, which it only captures around a single mode.
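A minimal sketch of the procedure above for a density known only through \ln f(z); using SciPy to find the mode and a finite-difference Hessian are implementation assumptions, not part of the slides (in practice one would prefer an analytic Hessian when available).

```python
import numpy as np
from scipy.optimize import minimize

# Minimal sketch of the Laplace approximation: returns (z0, A) with q(z) = N(z | z0, A^{-1}).
def laplace_approximation(log_f, z_init, eps=1e-4):
    res = minimize(lambda z: -log_f(z), z_init)           # find the mode z0 of f
    z0 = res.x
    M = len(z0)
    A = np.zeros((M, M))                                  # A = -Hessian of ln f at z0 (finite differences)
    for i in range(M):
        for j in range(M):
            e_i, e_j = np.eye(M)[i] * eps, np.eye(M)[j] * eps
            A[i, j] = -(log_f(z0 + e_i + e_j) - log_f(z0 + e_i - e_j)
                        - log_f(z0 - e_i + e_j) + log_f(z0 - e_i - e_j)) / (4 * eps**2)
    return z0, A
```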
The Laplace Approximation
Model Comparison and BIC
Model Evidence
We have approximated the normalization constant Z by
Z ≈ f(z_0) \frac{(2π)^{M/2}}{|A|^{1/2}}    (55)
Therefore, for the model evidence
p(D) = \int p(D|θ) p(θ) dθ    (56)
with f(θ) = p(D|θ) p(θ) and Z = p(D), we get
\ln p(D) ≈ \ln p(D|θ_{MAP}) + \ln p(θ_{MAP}) + \frac{M}{2} \ln(2π) − \frac{1}{2} \ln|A|    (57)
If we assume the Gaussian prior is broad and the Hessian has full rank, then we obtain the Bayesian Information Criterion (BIC)
\ln p(D) ≈ \ln p(D|θ_{MAP}) − \frac{1}{2} M \ln N    (58)
A more accurate estimate of the model evidence is developed in chapter 5.
Bayesian Logistic Regression
Laplace Approximation
How?
1 First, set the prior as
p(w) = N(w|m_0, S_0)    (59)
2 Write the log-posterior as
\ln p(w|t) = −\frac{1}{2} (w − m_0)^T S_0^{-1} (w − m_0) + \sum_{n=1}^N \{t_n \ln y_n + (1 − t_n) \ln(1 − y_n)\} + \mathrm{const.}    (60)
3 Approximate the posterior using the Laplace approximation (as sketched in the code below)
q(w) = N(w|w_{MAP}, S_N)    (61)
where
S_N^{-1} = S_0^{-1} + \sum_{n=1}^N y_n (1 − y_n) φ_n φ_n^T    (62)
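A minimal sketch of steps 1-3, assuming the MAP estimate is found with a Newton (IRLS-style) iteration on the log-posterior (60); the function name and iteration count are illustrative.

```python
import numpy as np

# Minimal sketch of the Laplace approximation to the logistic-regression posterior (Eqs. 59-62).
def laplace_posterior(Phi, t, m0, S0, n_iter=20):
    """Phi: (N, M) design matrix, t: (N,) binary targets, prior N(w|m0, S0)."""
    S0_inv = np.linalg.inv(S0)
    w = m0.copy()
    for _ in range(n_iter):
        y = 1.0 / (1.0 + np.exp(-(Phi @ w)))
        grad = S0_inv @ (w - m0) + Phi.T @ (y - t)            # gradient of the negative log-posterior
        H = S0_inv + Phi.T @ np.diag(y * (1 - y)) @ Phi       # Hessian = S_N^{-1} (Eq. 62)
        w -= np.linalg.solve(H, grad)                         # Newton step towards w_MAP
    y = 1.0 / (1.0 + np.exp(-(Phi @ w)))
    S_N = np.linalg.inv(S0_inv + Phi.T @ np.diag(y * (1 - y)) @ Phi)
    return w, S_N                                             # q(w) = N(w | w_MAP, S_N)
```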
Bayesian Logistic Regression
Predictive Distribution
Predictive Distribution by Laplace Approximation
Using the Laplace approximation, we get
p(C_1|φ, t) = \int p(C_1|φ, w) p(w|t) dw ≈ \int σ(w^T φ) q(w) dw    (63)
Denoting a = w^T φ, we can write
σ(w^T φ) = \int δ(a − w^T φ) σ(a) da    (64)
Therefore we get
p(C_1|φ, t) ≈ \int σ(a) \left(\int δ(a − w^T φ) q(w) dw\right) da = \int σ(a) p(a) da    (65)
Since p(a) is the marginal of the Gaussian q(w) under a linear map, p(a) is also Gaussian, with mean and variance
μ_a = E[a] = \int a \, p(a) da = \int w^T φ \, q(w) dw = w_{MAP}^T φ    (66)
σ_a^2 = \int \{a^2 − E[a]^2\} p(a) da = \int \{(w^T φ)^2 − (w_{MAP}^T φ)^2\} q(w) dw = φ^T S_N φ    (67)
Bayesian Logistic Regression
Predictive Distribution
Approximate Convolution
This predictive distribution is the convolution of a Gaussian with a logistic sigmoid and cannot be evaluated analytically. Therefore we use a similar function, the inverse probit function, for which the convolution can be computed analytically:
\int Φ(λ a) N(a|μ, σ^2) da = Φ\left(\frac{μ}{(λ^{-2} + σ^2)^{1/2}}\right)    (68)
Therefore, we get
\int σ(a) N(a|μ, σ^2) da ≈ σ(κ(σ^2) μ)    (69)
where κ(σ^2) = (1 + πσ^2/8)^{-1/2}. This gives the approximate predictive distribution
p(C_1|φ, t) ≈ σ(κ(σ_a^2) μ_a)    (70)
