Linear Models for Classification
Sung-Yub Kim
Dept of IE, Seoul National University
February 18, 2017
1 Introduction
2 Discriminant Functions
3 Probabilistic Generative Models
4 Probabilistic Discriminative Models
5 The Laplace Approximation
6 Bayesian Logistic Regression
Introduction
Bishop, C. M. Pattern Recognition and Machine Learning. Information Science and Statistics, Springer, 2006.
Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machine Learning, MIT Press, 2012.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. Computer Science and Intelligent Systems, MIT Press, 2016.
Introduction
Goal: take an input vector x and assign it to one of K discrete classes C_k, where k = 1, ..., K.
The input space is divided into decision regions whose boundaries are called decision boundaries or decision surfaces.
Linearly separable → separating hyperplane theorem
1-of-K Coding Scheme
t = (0, 0, 1, 0, 0)^T    (1)
Each t_k can be interpreted as the probability that the class is C_k.
Generalized Linear Model
y(x) = f(w^T x + w_0)    (2)
The nonlinear function f(·), which is needed to produce a probabilistic output, is called the activation function, and its inverse is called the link function. The decision boundaries satisfy y(x) = c for some constant c, i.e. w^T x + w_0 = const, so they are linear in x even though f itself is nonlinear; such models are therefore called Generalized Linear Models (GLMs).
Discriminant Functions
Linear Discriminant Functions
Linear Discriminant Function
y(x) = w^T x + w_0 = \tilde{w}^T \tilde{x}    (3)
w is called the weight vector and w_0 the bias (−w_0 is called the threshold).
Decision Criterion
Assign x to C_1 if y(x) ≥ 0, and to C_2 if y(x) < 0.    (4)
Target Coding Scheme
There are one-versus-the-rest and one-versus-one schemes, but both leave ambiguously classified regions. Therefore, we use a single K-class discriminant comprising K linear functions of the form
y_k(x) = w_k^T x + w_{k0}    (5)
and assign x to class C_k if
y_k(x) > y_j(x) for all j ≠ k    (6)
The decision boundary between classes C_k and C_j is then
(w_k − w_j)^T x + (w_{k0} − w_{j0}) = 0    (7)
Discriminant Functions
Least Squares
Model
y(x) = \tilde{W}^T \tilde{x}    (8)
where the k-th column of \tilde{W} is \tilde{w}_k = (w_{k0}, w_k^T)^T and \tilde{x} = (1, x^T)^T.
SSE
E_D(\tilde{W}) = \frac{1}{2} \|\tilde{X}\tilde{W} − T\|_F^2 = \frac{1}{2} \mathrm{tr}\{(\tilde{X}\tilde{W} − T)^T (\tilde{X}\tilde{W} − T)\}    (9)
where the n-th row of \tilde{X} is \tilde{x}_n^T.
Closed-form Solution
\tilde{W} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T T = \tilde{X}^\dagger T    (10)
Therefore, the discriminant function is
y(x) = \tilde{W}^T \tilde{x} = T^T (\tilde{X}^\dagger)^T \tilde{x}    (11)
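As a concrete illustration, here is a minimal NumPy sketch of the closed-form solution (10) and the resulting decision rule (11); the function names, data shapes, and the one-hot target matrix T are assumptions made for the example, not part of the slides.

```python
import numpy as np

# Minimal sketch of least-squares classification (Eqs. 10-11); names are illustrative.
def fit_least_squares(X, T):
    """X: (N, D) inputs, T: (N, K) one-hot (1-of-K) targets."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])      # prepend the bias feature
    W_tilde, *_ = np.linalg.lstsq(X_tilde, T, rcond=None)   # pseudo-inverse solution (Eq. 10)
    return W_tilde                                          # shape (D+1, K)

def predict(W_tilde, X):
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    Y = X_tilde @ W_tilde                                   # y(x) = W~^T x~ for each row
    return np.argmax(Y, axis=1)                             # assign to the largest discriminant
```

Using `lstsq` rather than forming the normal equations explicitly is a numerical-stability choice; it computes the same pseudo-inverse solution.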
Discriminant Functions
Least Squares
Limitations
1 Output value cannot have probabilistic interpretation.
2 LS solutions lack robustness to outliers.
3 SSE function penalizes predictions that are ’too correct’.
Origin of Limitations
Least squares corresponds to maximum likelihood under the assumption of a Gaussian conditional distribution, whereas binary target vectors have a distribution that is far from Gaussian.
Discriminant Functions
Fisher's Linear Discriminant Analysis
Motivation: dimensionality reduction
Simple Model: choose w ∈ {w : \|w\| = 1} so as to maximize
m_2 − m_1 = w^T (\mathbf{m}_2 − \mathbf{m}_1)    (12)
where \mathbf{m}_1 = \frac{1}{N_1} \sum_{n \in C_1} x_n and \mathbf{m}_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n.
Revised Model: choose w to give a large separation between the projected class means while also giving a small variance within each class. Therefore, we need to maximize
J(w) = \frac{(m_2 − m_1)^2}{s_1^2 + s_2^2} = \frac{w^T S_B w}{w^T S_W w}    (13)
where s_k^2 = \sum_{n \in C_k} (y_n − m_k)^2 is the within-class variance of the projected data from class C_k (with y_n = w^T x_n and m_k = w^T \mathbf{m}_k),
S_B = (\mathbf{m}_2 − \mathbf{m}_1)(\mathbf{m}_2 − \mathbf{m}_1)^T
is the between-class covariance matrix, and
S_W = \sum_{n \in C_1} (x_n − \mathbf{m}_1)(x_n − \mathbf{m}_1)^T + \sum_{n \in C_2} (x_n − \mathbf{m}_2)(x_n − \mathbf{m}_2)^T
is the total within-class covariance matrix.
Discriminant Functions
Fisher's Linear Discriminant Analysis
Closed-form Solution
By a simple calculation,
w ∝ S_W^{-1} (\mathbf{m}_2 − \mathbf{m}_1)    (14)
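A minimal NumPy sketch of the two-class Fisher direction (14); the function and variable names are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the two-class Fisher direction (Eq. 14).
def fisher_direction(X1, X2):
    """X1, X2: (N1, D) and (N2, D) samples of the two classes."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter
    w = np.linalg.solve(S_W, m2 - m1)                        # w ∝ S_W^{-1}(m2 - m1)
    return w / np.linalg.norm(w)
```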
Relation to LS
By setting the target values to N/N_1 for class C_1 and −N/N_2 for class C_2, one can show that Fisher's LDA is equivalent to the least-squares solution.
Multiple Classes
One multi-class generalization of Fisher's criterion is to maximize
J(W) = \mathrm{tr}\{(W S_W W^T)^{-1} (W S_B W^T)\}    (15)
where S_W = \sum_{k=1}^K S_k, S_k = \sum_{n \in C_k} (x_n − \mathbf{m}_k)(x_n − \mathbf{m}_k)^T, and \mathbf{m}_k = \frac{1}{N_k} \sum_{n \in C_k} x_n.
The total covariance matrix S_T = \sum_{n=1}^N (x_n − \mathbf{m})(x_n − \mathbf{m})^T, where \mathbf{m} is the mean of all the data, can be decomposed as
S_T = S_W + S_B    (16)
where S_B = \sum_{k=1}^K N_k (\mathbf{m}_k − \mathbf{m})(\mathbf{m}_k − \mathbf{m})^T.
Discriminant Functions
The Perceptron Algorithm
Motivation: apply a fixed nonlinear feature transformation φ(x) before the linear classifier.
Model
y(x) = f(w^T φ(x))    (17)
where
f(a) = +1 if a ≥ 0, and −1 if a < 0    (18)
Error Function
E_P(w) = −\sum_{n \in M} w^T φ_n t_n    (19)
where M is the set of misclassified patterns.
SGD
Applying stochastic gradient descent, we get
w^{(τ+1)} = w^{(τ)} − η ∇E_P(w) = w^{(τ)} + η φ_n t_n    (20)
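A minimal sketch of the update rule (20) as a training loop; the function name, the epoch limit, and the {−1, +1} target encoding are assumptions for the example.

```python
import numpy as np

# Minimal sketch of the perceptron update (Eq. 20) on fixed features phi.
def train_perceptron(Phi, t, eta=1.0, max_epochs=100):
    """Phi: (N, M) feature matrix, t: (N,) targets in {-1, +1}."""
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        misclassified = 0
        for phi_n, t_n in zip(Phi, t):
            if t_n * (w @ phi_n) <= 0:        # pattern is treated as misclassified
                w += eta * phi_n * t_n        # w <- w + eta * phi_n * t_n
                misclassified += 1
        if misclassified == 0:                # a full pass with no mistakes: stop
            break
    return w
```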
Discriminant Functions
The Perceptron Algorithm
Perceptron Convergence Theorem
If the training data set is linearly separable, then the perceptron learning
algorithm is guaranteed to find an exact solution in a finite number of
steps.
Limitations
1 The convergence theorem says nothing about the rate of convergence, and in practice convergence can be very slow.
2 The perceptron is based on linear combinations of fixed basis functions. This limitation is addressed in chapters 5 and 6.
Probabilistic Generative Models
Introduction
Generative Approach
Model the class-conditional densities p(x|C_k) and the class priors p(C_k).
Posterior Probability and Activation Function
For two classes, the posterior probability can be written as
p(C_1|x) = \frac{p(x|C_1) p(C_1)}{p(x|C_1) p(C_1) + p(x|C_2) p(C_2)} = \frac{\exp(a)}{\exp(a) + 1} = σ(a)    (21)
where a = \ln \frac{p(x|C_1) p(C_1)}{p(x|C_2) p(C_2)} and σ is the logistic sigmoid function. The quantity a is called the log odds.
More generally, the posterior probability can be written as
p(C_k|x) = \frac{p(x|C_k) p(C_k)}{\sum_j p(x|C_j) p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}    (22)
where a_j = \ln(p(x|C_j) p(C_j)). The normalized exponential is called the softmax function; it acts as a smoothed version of the argmax function.
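A minimal sketch of the two activation functions (21) and (22); the max-subtraction trick is a standard numerical-stability choice added here, not something stated on the slide.

```python
import numpy as np

# Minimal sketch of the sigmoid (Eq. 21) and softmax (Eq. 22) activations.
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    """a: (..., K) vector of activations a_k = ln p(x|C_k)p(C_k)."""
    a = a - a.max(axis=-1, keepdims=True)          # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)
```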
Probabilistic Generative Models
Continuous Inputs
Model
If we assume Gaussian class-conditional densities with a shared covariance matrix,
p(x|C_k) = \frac{1}{(2π)^{D/2}} \frac{1}{|Σ|^{1/2}} \exp\{−\frac{1}{2} (x − μ_k)^T Σ^{-1} (x − μ_k)\}    (23)
for k = 1, 2, then we get
p(C_1|x) = σ(w^T x + w_0)    (24)
where w = Σ^{-1}(μ_1 − μ_2) and w_0 = −\frac{1}{2} μ_1^T Σ^{-1} μ_1 + \frac{1}{2} μ_2^T Σ^{-1} μ_2 + \ln \frac{p(C_1)}{p(C_2)}.
Similarly, in the multi-class case the model is
a_k(x) = w_k^T x + w_{k0}    (25)
where w_k = Σ^{-1} μ_k and w_{k0} = −\frac{1}{2} μ_k^T Σ^{-1} μ_k + \ln p(C_k).
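A minimal sketch of the two-class posterior parameters in (24); the function and argument names are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the two-class posterior parameters (Eq. 24).
def posterior_params(mu1, mu2, Sigma, prior1, prior2):
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)                              # w = Sigma^{-1}(mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(prior1 / prior2))
    return w, w0                                             # p(C1|x) = sigmoid(w @ x + w0)
```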
Probabilistic Generative Models
MLE
Gaussian Class-conditional Densities
The likelihood function is
p(t, X|π, μ_1, μ_2, Σ) = \prod_{n=1}^N [π N(x_n|μ_1, Σ)]^{t_n} [(1 − π) N(x_n|μ_2, Σ)]^{1 − t_n}    (26)
where π is the class prior probability, μ_k is the class mean, t_n = 1 if the n-th data point belongs to class C_1 and 0 otherwise, and the classes share a covariance matrix.
Closed-form Solution
This problem can be solved exactly:
π = \frac{N_1}{N_1 + N_2}    (27)
μ_1 = \frac{1}{N_1} \sum_{n=1}^N t_n x_n, \quad μ_2 = \frac{1}{N_2} \sum_{n=1}^N (1 − t_n) x_n    (28)
Σ = S = \frac{N_1}{N} S_1 + \frac{N_2}{N} S_2    (29)
where S_1 = \frac{1}{N_1} \sum_{n \in C_1} (x_n − μ_1)(x_n − μ_1)^T and S_2 = \frac{1}{N_2} \sum_{n \in C_2} (x_n − μ_2)(x_n − μ_2)^T.
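A minimal sketch of the closed-form maximum-likelihood estimates (27)-(29); the function name and the label encoding are assumptions for the example.

```python
import numpy as np

# Minimal sketch of the shared-covariance Gaussian MLE (Eqs. 27-29).
def fit_gaussian_generative(X, t):
    """X: (N, D) inputs, t: (N,) binary labels with 1 for class C1, 0 for class C2."""
    X1, X2 = X[t == 1], X[t == 0]
    N1, N2 = len(X1), len(X2)
    pi = N1 / (N1 + N2)                                   # Eq. 27
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)           # Eq. 28
    S1 = (X1 - mu1).T @ (X1 - mu1) / N1
    S2 = (X2 - mu2).T @ (X2 - mu2) / N2
    Sigma = (N1 * S1 + N2 * S2) / (N1 + N2)               # Eq. 29: shared covariance
    return pi, mu1, mu2, Sigma
```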
Probabilistic Generative Models
Discrete Features
Naive Bayes
In the general case, we would need to consider all possible combinations of discrete feature values. But under the naive Bayes assumption, i.e. feature values treated as independent conditioned on the class C_k, the class-conditional likelihood is easy to compute:
p(x|C_k) = \prod_{i=1}^D μ_{ki}^{x_i} (1 − μ_{ki})^{1 − x_i}    (30)
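A minimal sketch of the Bernoulli class-conditional log-likelihood (30); the function name and the small epsilon guard are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the Bernoulli class-conditional log-likelihood (Eq. 30).
def bernoulli_log_likelihood(x, mu_k):
    """x: (D,) binary feature vector, mu_k: (D,) Bernoulli parameters for class C_k."""
    eps = 1e-12                                           # avoid log(0)
    return np.sum(x * np.log(mu_k + eps) + (1 - x) * np.log(1 - mu_k + eps))
```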
Probabilistic Generative Models
Exponential Family
Likelihood of the Exponential Family
p(x|η) = h(x) g(η) \exp\{η^T u(x)\}    (31)
If we assume that u(x) = x and introduce a scaling parameter s, the density can be written as
p(x|η_k, s) = \frac{1}{s} h(\frac{1}{s} x) g(η_k) \exp\{\frac{1}{s} η_k^T x\}    (32)
Closed-form Solution for the Exponential Family
From the above, in binary classification we get
a(x) = \frac{1}{s} (η_1 − η_2)^T x + \ln \frac{g(η_1)}{g(η_2)} + \ln \frac{p(C_1)}{p(C_2)}    (33)
and in multi-class classification
a_k(x) = \frac{1}{s} η_k^T x + \ln g(η_k) + \ln p(C_k)    (34)
Probabilistic Discriminative Models
Introduction
Discriminative Approach
Use the functional form of the GLM explicitly and determine its parameters directly by maximum likelihood.
Advantages
1 Fewer adaptive parameters
2 No class-conditional density assumption is needed
Fixed Basis Functions
In the discriminative approach, we model the posterior probabilities directly and then apply standard decision theory. Since fixed basis functions have some limitations, we can later generalize to basis functions that adapt to the data.
Probabilistic Discriminative Models
Logistic Regression
Model
p(C_1|φ) = y(φ) = σ(w^T φ)    (35)
With this model we only need to find M adaptive parameters, far fewer than for the Gaussian generative model, where the covariance parameters must also be estimated.
Likelihood
We can write the likelihood as
p(t|w) = \prod_{n=1}^N y_n^{t_n} \{1 − y_n\}^{1 − t_n}    (36)
Taking the negative logarithm gives the cross-entropy error function
E(w) = −\ln p(t|w) = −\sum_{n=1}^N \{t_n \ln y_n + (1 − t_n) \ln(1 − y_n)\}    (37)
Probabilistic Discriminative Models
Logistic Regression
Gradient of the Cross-Entropy
Taking the gradient of the error function, we get
∇_w E(w) = \sum_{n=1}^N (y_n − t_n) φ_n    (38)
and each term of this gradient can be interpreted as
(error) × (basis function vector)    (39)
Stochastic Gradient Descent
We can use the above to obtain a sequential algorithm, in which the weight vector is updated using
∇_w E_n(w) = (y_n − t_n) φ_n    (40)
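A minimal sketch of sequential training with the per-pattern gradient (40); the function name, learning rate, and epoch count are assumptions for the example.

```python
import numpy as np

# Minimal sketch of SGD training for logistic regression (Eq. 40).
def sgd_logistic_regression(Phi, t, eta=0.1, epochs=50):
    """Phi: (N, M) design matrix of basis functions, t: (N,) targets in {0, 1}."""
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        for phi_n, t_n in zip(Phi, t):
            y_n = 1.0 / (1.0 + np.exp(-(w @ phi_n)))   # y_n = sigmoid(w^T phi_n)
            w -= eta * (y_n - t_n) * phi_n             # w <- w - eta * grad E_n(w)
    return w
```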
Probabilistic Discriminative Models
IRLS
Iterative Reweighted Least Squares
For logistic regression there is no longer a closed-form solution, due to the nonlinearity of the logistic sigmoid. Fortunately, the cross-entropy error function is convex, so we can find the global optimum by an iterative method.
Newton-Raphson Method
The Newton-Raphson method is an iterative update of the form
w^{(τ+1)} = w^{(τ)} − H^{-1} ∇_w E(w)    (41)
The gradient and Hessian of our error function are
∇_w E(w) = \sum_{n=1}^N (y_n − t_n) φ_n = Φ^T (y − t)    (42)
∇_w^2 E(w) = \sum_{n=1}^N y_n (1 − y_n) φ_n φ_n^T = Φ^T R Φ    (43)
where R = \mathrm{diag}(y_n (1 − y_n)).
Because the weighting matrix R depends on w, we must recompute it at every iteration.
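A minimal sketch of the IRLS iteration (41)-(43); the function name and the fixed iteration count are assumptions, and in practice one would add a convergence check (and possibly a small ridge term if Φ^T R Φ is ill-conditioned).

```python
import numpy as np

# Minimal sketch of IRLS / Newton-Raphson for logistic regression (Eqs. 41-43).
def irls_logistic_regression(Phi, t, n_iter=20):
    """Phi: (N, M) design matrix, t: (N,) targets in {0, 1}."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = 1.0 / (1.0 + np.exp(-(Phi @ w)))          # current predictions
        R = np.diag(y * (1 - y))                      # weighting matrix, recomputed each iteration
        grad = Phi.T @ (y - t)                        # Eq. 42
        H = Phi.T @ R @ Phi                           # Eq. 43
        w -= np.linalg.solve(H, grad)                 # Newton-Raphson step (Eq. 41)
    return w
```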
Probabilistic Discriminative Models
Multiclass Logistic Regression
Likelihood
As in the binary case, we can write the likelihood of the model as
p(T|w_1, ..., w_K) = \prod_{n=1}^N \prod_{k=1}^K y_{nk}^{t_{nk}}    (44)
Taking the negative logarithm gives the cross-entropy error function
E(w_1, ..., w_K) = −\ln p(T|w_1, ..., w_K) = −\sum_{n=1}^N \sum_{k=1}^K t_{nk} \ln y_{nk}    (45)
By an argument similar to the binary case, we get
∇_{w_j} E(w_1, ..., w_K) = \sum_{n=1}^N (y_{nj} − t_{nj}) φ_n    (46)
∇_{w_k} ∇_{w_j} E(w_1, ..., w_K) = \sum_{n=1}^N y_{nk} (I_{kj} − y_{nj}) φ_n φ_n^T    (47)
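A minimal sketch of batch gradient descent using the multiclass gradient (46); the function name, learning rate, and iteration count are assumptions, and a Newton/IRLS scheme using (47) would also be possible.

```python
import numpy as np

# Minimal sketch of gradient descent for multiclass logistic regression (Eq. 46).
def softmax_regression_gd(Phi, T, eta=0.1, n_iter=200):
    """Phi: (N, M) design matrix, T: (N, K) one-hot target matrix."""
    M, K = Phi.shape[1], T.shape[1]
    W = np.zeros((M, K))                                      # columns are w_1, ..., w_K
    for _ in range(n_iter):
        A = Phi @ W                                           # activations a_nk = w_k^T phi_n
        A -= A.max(axis=1, keepdims=True)                     # numerical stability
        Y = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)  # softmax outputs y_nk
        grad = Phi.T @ (Y - T)                                # column j is Eq. 46
        W -= eta * grad
    return W
```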
Probabilistic Discriminative Models
Probit Regression
Inverse Probit Function
The inverse probit function is defined as
Φ(a) = \int_{−∞}^{a} N(θ|0, 1) dθ    (48)
and the GLM based on an inverse probit activation function is known as probit regression.
Limitations
The logistic sigmoid decays asymptotically like \exp(−x) for x → ∞, whereas the tails of the inverse probit activation function decay like \exp(−x^2). Because the probit tails decay faster, misclassified points far from the decision boundary are penalized much more heavily, so the probit model is more sensitive to outliers.
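A minimal sketch of the inverse probit activation (48), using the standard identity relating the Gaussian CDF to the error function; the function name is an illustrative assumption.

```python
import math

# Minimal sketch of the inverse probit activation (Eq. 48) via the error function.
def inverse_probit(a):
    """Cumulative distribution function of the standard normal, Phi(a)."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))
```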
Probabilistic Discriminative Models
Canonical Link Functions
Canonical Link Function
Assume a conditional distribution for the target variable from the exponential family, together with a corresponding choice of activation function known as the canonical link function.
Likelihood Function
First, we assume a scaled exponential-family conditional distribution for the target,
p(t|η, s) = \frac{1}{s} h(\frac{t}{s}) g(η) \exp\{\frac{η t}{s}\}    (49)
From the first moment of this distribution, we get
y = E[t|η] = −s \frac{d}{dη} \ln g(η)    (50)
We denote this relation by η = ψ(y). In the definition of a GLM, f(·) is the activation function and its inverse f^{-1}(·) is the link function. The log-likelihood function is
\ln p(t|η, s) = \sum_{n=1}^N \{\ln g(η_n) + \frac{η_n t_n}{s}\} + \mathrm{const.}    (51)
Probabilistic Discriminative Models
Canonical Link Functions
Gradient of the Log-Likelihood
From the previous page, using the chain rule,
∇_w \ln p(t|η, s) = \sum_{n=1}^N \{\frac{d}{dη_n} \ln g(η_n) + \frac{t_n}{s}\} \frac{dη_n}{dy_n} \frac{dy_n}{da_n} ∇_w a_n = \sum_{n=1}^N \frac{1}{s} \{t_n − y_n\} ψ'(y_n) f'(a_n) φ_n    (52)
If we use the canonical link f^{-1} = ψ, then f'(a_n) ψ'(y_n) = 1, and the gradient of the error function E(w) = −\ln p(t|η, s) takes the simple form
∇_w E(w) = \frac{1}{s} \sum_{n=1}^N \{y_n − t_n\} φ_n
the same (error) × (basis function) form as in logistic and softmax regression.
The Laplace Approximation
Process of the Laplace Approximation
Motivation
Find a Gaussian approximation to a probability density defined over a set of continuous variables.
How?
1 Find the mode z_0 and evaluate the Hessian A = −∇∇ \ln f(z)|_{z = z_0}.
2 Using this information, we get
f(z) ≈ f(z_0) \exp\{−\frac{1}{2} (z − z_0)^T A (z − z_0)\}    (53)
3 Normalizing this approximation (a code sketch follows below), we get
q(z) = \frac{|A|^{1/2}}{(2π)^{M/2}} \exp\{−\frac{1}{2} (z − z_0)^T A (z − z_0)\} = N(z|z_0, A^{-1})    (54)
Limitation
Since it is based on a Gaussian, the approximation can fail for strongly non-Gaussian (e.g. multimodal) densities, which it only captures around a single mode.
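A minimal sketch of the procedure above for a density known only through \ln f(z); using SciPy to find the mode and a finite-difference Hessian are implementation assumptions, not part of the slides (in practice one would prefer an analytic Hessian when available).

```python
import numpy as np
from scipy.optimize import minimize

# Minimal sketch of the Laplace approximation: returns (z0, A) with q(z) = N(z | z0, A^{-1}).
def laplace_approximation(log_f, z_init, eps=1e-4):
    res = minimize(lambda z: -log_f(z), z_init)           # find the mode z0 of f
    z0 = res.x
    M = len(z0)
    A = np.zeros((M, M))                                  # A = -Hessian of ln f at z0 (finite differences)
    for i in range(M):
        for j in range(M):
            e_i, e_j = np.eye(M)[i] * eps, np.eye(M)[j] * eps
            A[i, j] = -(log_f(z0 + e_i + e_j) - log_f(z0 + e_i - e_j)
                        - log_f(z0 - e_i + e_j) + log_f(z0 - e_i - e_j)) / (4 * eps**2)
    return z0, A
```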
The Laplace Approximation
Model Comparison and BIC
Model Evidence
We have approximated the normalization constant Z by
Z ≈ f(z_0) \frac{(2π)^{M/2}}{|A|^{1/2}}    (55)
Therefore, for the model evidence
p(D) = \int p(D|θ) p(θ) dθ    (56)
with f(θ) = p(D|θ) p(θ) and Z = p(D), we get
\ln p(D) ≈ \ln p(D|θ_{MAP}) + \ln p(θ_{MAP}) + \frac{M}{2} \ln(2π) − \frac{1}{2} \ln|A|    (57)
If we assume the Gaussian prior is broad and the Hessian has full rank, then we obtain the Bayesian Information Criterion (BIC)
\ln p(D) ≈ \ln p(D|θ_{MAP}) − \frac{1}{2} M \ln N    (58)
A more accurate estimate of the model evidence is developed in chapter 5.
Bayesian Logistic Regression
Laplace Approximation
How?
1 First, set the prior as
p(w) = N(w|m_0, S_0)    (59)
2 Write the log-posterior as
\ln p(w|t) = −\frac{1}{2} (w − m_0)^T S_0^{-1} (w − m_0) + \sum_{n=1}^N \{t_n \ln y_n + (1 − t_n) \ln(1 − y_n)\} + \mathrm{const.}    (60)
3 Approximate the posterior using the Laplace approximation (as sketched in the code below)
q(w) = N(w|w_{MAP}, S_N)    (61)
where
S_N^{-1} = S_0^{-1} + \sum_{n=1}^N y_n (1 − y_n) φ_n φ_n^T    (62)
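A minimal sketch of steps 1-3, assuming the MAP estimate is found with a Newton (IRLS-style) iteration on the log-posterior (60); the function name and iteration count are illustrative.

```python
import numpy as np

# Minimal sketch of the Laplace approximation to the logistic-regression posterior (Eqs. 59-62).
def laplace_posterior(Phi, t, m0, S0, n_iter=20):
    """Phi: (N, M) design matrix, t: (N,) binary targets, prior N(w|m0, S0)."""
    S0_inv = np.linalg.inv(S0)
    w = m0.copy()
    for _ in range(n_iter):
        y = 1.0 / (1.0 + np.exp(-(Phi @ w)))
        grad = S0_inv @ (w - m0) + Phi.T @ (y - t)            # gradient of the negative log-posterior
        H = S0_inv + Phi.T @ np.diag(y * (1 - y)) @ Phi       # Hessian = S_N^{-1} (Eq. 62)
        w -= np.linalg.solve(H, grad)                         # Newton step towards w_MAP
    y = 1.0 / (1.0 + np.exp(-(Phi @ w)))
    S_N = np.linalg.inv(S0_inv + Phi.T @ np.diag(y * (1 - y)) @ Phi)
    return w, S_N                                             # q(w) = N(w | w_MAP, S_N)
```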
Bayesian Logistic Regression
Predictive Distribution
Predictive Distribution by Laplace Approximation
Using the Laplace approximation, we get
p(C_1|φ, t) = \int p(C_1|φ, w) p(w|t) dw ≈ \int σ(w^T φ) q(w) dw    (63)
Denoting a = w^T φ, we can write
σ(w^T φ) = \int δ(a − w^T φ) σ(a) da    (64)
Therefore we get
p(C_1|φ, t) ≈ \int σ(a) \left(\int δ(a − w^T φ) q(w) dw\right) da = \int σ(a) p(a) da    (65)
Since p(a) is the marginal of the Gaussian q(w) under a linear map, p(a) is also Gaussian, with mean and variance
μ_a = E[a] = \int a \, p(a) da = \int w^T φ \, q(w) dw = w_{MAP}^T φ    (66)
σ_a^2 = \int \{a^2 − E[a]^2\} p(a) da = \int \{(w^T φ)^2 − (w_{MAP}^T φ)^2\} q(w) dw = φ^T S_N φ    (67)
Bayesian Logistic Regression
Predictive Distribution
Approximate Convolution
This predictive distribution is the convolution of a Gaussian with a logistic sigmoid and cannot be evaluated analytically. Therefore we use a similar function, the inverse probit function, for which the convolution can be computed analytically:
\int Φ(λ a) N(a|μ, σ^2) da = Φ\left(\frac{μ}{(λ^{-2} + σ^2)^{1/2}}\right)    (68)
Therefore, we get
\int σ(a) N(a|μ, σ^2) da ≈ σ(κ(σ^2) μ)    (69)
where κ(σ^2) = (1 + πσ^2/8)^{-1/2}. This gives the approximate predictive distribution
p(C_1|φ, t) ≈ σ(κ(σ_a^2) μ_a)    (70)
