Linear Discriminant Analysis and Its Generalization
Chapter 4 and 12 of The Elements of Statistical Learning
Presented by Ilsang Ohn
Department of Statistics, Seoul National University
September 3, 2014
Contents
1 Linear Discriminant Analysis
2 Flexible Discriminant Analysis
3 Penalized Discriminant Analysis
4 Mixture Discriminant Analysis
Review of Linear Discriminant Analysis
LDA: Overview
• Linear discriminant analysis (LDA) does classification by assuming
that the data within each class are normally distributed:
fk(x) = P(X = x|G = k) = N(µk, Σ).
• We allow each class to have its own mean µk ∈ Rp, but we assume a
common variance matrix Σ ∈ Rp×p. Thus
$$f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x-\mu_k)^T \Sigma^{-1}(x-\mu_k)\right\}.$$
• We want to find k so that P(G = k|X = x) ∝ fk(x)πk is the largest.
LDA: Overview
• The linear discriminant functions are derived from the relation
$$\begin{aligned}
\log(f_k(x)\pi_k) &= -\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k) + \log(\pi_k) + C \\
&= x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log(\pi_k) + C',
\end{aligned}$$
and we denote
$$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log(\pi_k).$$
• The decision rule is $G(x) = \arg\max_k \delta_k(x)$.
• The Bayes classifier is a linear classifier.
LDA: Overview
• We need to estimate the parameters based on the training data
xi ∈ Rp and yi ∈ {1, · · · , K} by
• $\hat\pi_k = N_k/N$
• $\hat\mu_k = N_k^{-1}\sum_{y_i=k} x_i$, the centroid of class k
• $\hat\Sigma = \frac{1}{N-K}\sum_{k=1}^{K}\sum_{y_i=k}(x_i-\hat\mu_k)(x_i-\hat\mu_k)^T$, the pooled sample variance matrix
• The decision boundary between each pair of classes k and l is given by
{x : δk(x) = δl(x)}
which is equivalent to
$$(\hat\mu_k - \hat\mu_l)^T\hat\Sigma^{-1}x = \frac{1}{2}(\hat\mu_k + \hat\mu_l)^T\hat\Sigma^{-1}(\hat\mu_k - \hat\mu_l) - \log(\hat\pi_k/\hat\pi_l).$$
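The estimation and classification steps above can be written out in a few lines of NumPy. The following is a minimal sketch, assuming the training data are an (N, p) array X with integer labels y in {0, ..., K−1}; the names fit_lda and lda_predict are illustrative, not from the slides.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate class priors, centroids, and the pooled covariance matrix."""
    classes = np.unique(y)
    N, p = X.shape
    priors = np.array([np.mean(y == k) for k in classes])        # pi_k = N_k / N
    means = np.array([X[y == k].mean(axis=0) for k in classes])  # mu_k
    Sigma = np.zeros((p, p))
    for k, mu in zip(classes, means):
        Xc = X[y == k] - mu
        Sigma += Xc.T @ Xc
    Sigma /= N - len(classes)                                    # pooled estimate
    return priors, means, Sigma

def lda_predict(X, priors, means, Sigma):
    """Assign each row of X to argmax_k delta_k(x)."""
    Sigma_inv = np.linalg.inv(Sigma)
    # delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k
    linear = X @ Sigma_inv @ means.T
    const = -0.5 * np.einsum('kp,pq,kq->k', means, Sigma_inv, means) + np.log(priors)
    return np.argmax(linear + const, axis=1)
```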
Fisher’s discriminant analysis
• Fisher’s idea is to find a direction v maximizing the ratio
$$\max_v \; \frac{v^T B v}{v^T W v},$$
where
- $B = \sum_{k=1}^{K}(\bar x_k - \bar x)(\bar x_k - \bar x)^T$: between-class covariance matrix
- $W = \sum_{k=1}^{K}\sum_{y_i=k}(x_i - \bar x_k)(x_i - \bar x_k)^T$: within-class covariance matrix, previously denoted by $(N-K)\hat\Sigma$
• This ratio is maximized by $v_1 = e_1$, the eigenvector of $W^{-1}B$ with the largest eigenvalue. The linear combination $v_1^T X$ is called the first discriminant. Similarly one can find the next direction $v_2$, orthogonal in W to $v_1$.
• Fisher’s canonical discriminant analysis finds L ≤ K − 1 canonical
coordinates (or a rank-L subspace) that best separate the categories.
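A short sketch of the canonical directions, again assuming an (N, p) array X and labels y; it forms B and W exactly as defined on this slide and takes the leading eigenvectors of W^{-1}B.

```python
import numpy as np

def fisher_directions(X, y, L=None):
    """Leading eigenvectors of W^{-1} B: the canonical discriminant directions."""
    classes = np.unique(y)
    xbar = X.mean(axis=0)
    p = X.shape[1]
    B = np.zeros((p, p))
    W = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        d = (Xk.mean(axis=0) - xbar)[:, None]
        B += d @ d.T                              # between-class scatter
        C = Xk - Xk.mean(axis=0)
        W += C.T @ C                              # within-class scatter, (N-K) * Sigma_hat
    evals, evecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(-evals.real)
    L = len(classes) - 1 if L is None else L      # at most K-1 non-zero eigenvalues
    return evecs.real[:, order[:L]]               # columns v_1, ..., v_L
```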
Fisher’s discriminant analysis
• Consequently, we have $v_1, \ldots, v_L$, $L \le K-1$, which are the eigenvectors with non-zero eigenvalues.
• Fisher’s discriminant rule assigns x to the class closest in Mahalanobis distance, so the rule is given by
$$\begin{aligned}
G(x) &= \arg\min_k \sum_{l=1}^{L}\big[v_l^T(x-\bar x_k)\big]^2 \\
&= \arg\min_k (x-\bar x_k)^T\hat\Sigma^{-1}(x-\bar x_k) \\
&= \arg\min_k \left(-2\delta_k(x) + x^T\hat\Sigma^{-1}x + 2\log\pi_k\right) \\
&= \arg\max_k \left(\delta_k(x) - \log\pi_k\right).
\end{aligned}$$
• Thus Fisher’s rule is equivalent to the Gaussian classification rule with
equal prior probabilities.
LDA by optimal scoring
• The standard way of carrying out a (Fisher’s) canonical discriminant
analysis is by way of a suitable SVD.
• There is a somewhat different approach: optimal scoring.
• This method performs LDA using linear regression on derived responses.
LDA by optimal scoring
• Recall G = {1, · · · , K}.
• θ : G → R is a function that assigns scores to the classes such that
the transformed class labels are optimally predicted by linear
regression on X.
• We find L ≤ K − 1 sets of independent scorings for the class labels {θ1, · · · , θL}, and L corresponding linear maps $\eta_l(X) = X^T\beta_l$ chosen to be optimal for multiple regression in $\mathbb{R}^p$.
• θl and βl are chosen to minimize
$$\mathrm{ASR} = \frac{1}{N}\sum_{l=1}^{L}\sum_{i=1}^{N}\big(\theta_l(g_i) - x_i^T\beta_l\big)^2.$$
LDA by optimal scoring
Notation
• Y : N × K indicator matrix
• $P_X = X(X^TX)^{-1}X^T$: projection matrix onto the column space of the predictors
• Θ: K × L matrix of L score vectors for the K classes
• $\Theta^* = Y\Theta$: N × L matrix with $\Theta^*_{ij} = \theta_j(g_i)$
LDA by optimal scoring
Problem
• Minimize ASR by regressing Θ∗ on X. That is, find Θ that minimizes
$$\mathrm{ASR}(\Theta) = \mathrm{tr}\big(\Theta^{*T}(I-P_X)\Theta^*\big)/N = \mathrm{tr}\big(\Theta^T Y^T(I-P_X)Y\Theta\big)/N.$$
• ASR(Θ) is minimized by taking Θ to be the L largest eigenvectors of $Y^T P_X Y$, with normalization $\Theta^T D_p \Theta = I_L$.
• Here $D_p = Y^T Y/N$ is a diagonal matrix of the sample class proportions $N_j/N$.
LDA by optimal scoring
Way to the solution
1. Initialize: Form Y : N × K.
2. Multivariate regression: Set $\hat Y = P_X Y$ and denote the p × K coefficient matrix by B: $\hat Y = XB$.
3. Optimal scores: Obtain the eigenvector matrix Θ of $Y^T\hat Y = Y^T P_X Y$ with normalization $\Theta^T D_p \Theta = I$.
4. Update: Update the coefficient matrix in step 2 to reflect the optimal scores: B ← BΘ. The final optimally scaled regression fit is the (K − 1)-vector function $\eta(x) = B^T x$.
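The four steps above can be sketched in NumPy as follows, assuming an (N, p) array X and labels y in {0, ..., K−1}. An intercept column stands in for the trivial constant score, which is then dropped in step 3; the function name is illustrative.

```python
import numpy as np

def lda_optimal_scoring(X, y):
    """LDA via optimal scoring, following steps 1-4 above (a sketch)."""
    N, p = X.shape
    K = int(y.max()) + 1
    Y = np.eye(K)[y]                                 # step 1: N x K indicator matrix
    Xi = np.column_stack([np.ones(N), X])            # intercept absorbs the trivial score
    # step 2: multivariate regression of Y on X, Yhat = P_X Y = Xi B
    B, *_ = np.linalg.lstsq(Xi, Y, rcond=None)
    Yhat = Xi @ B
    # step 3: eigenvectors Theta of D_p^{-1} Y^T Yhat, normalized so Theta^T D_p Theta = I
    Dp = np.diag(Y.mean(axis=0))                     # class proportions N_k / N
    evals, Theta = np.linalg.eig(np.linalg.solve(Dp, Y.T @ Yhat / N))
    order = np.argsort(-evals.real)
    Theta = Theta.real[:, order[1:]]                 # drop the trivial constant score
    Theta /= np.sqrt(np.sum(Dp.diagonal()[:, None] * Theta**2, axis=0))
    # step 4: update the coefficient matrix; eta(x) = B^T x (first row is the intercept)
    return B @ Theta                                 # (p+1) x (K-1)
```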
LDA by optimal scoring
• The sequence of discriminant vectors $\nu_l$ in LDA is identical to the sequence $\beta_l$ up to a constant.
• That is, the coefficient matrix B is, up to a diagonal scale matrix, the same as the discriminant analysis coefficient matrix:
$$V^T x = D B^T x = D\,\eta(x),$$
where $D_{ll} = 1/[\alpha_l^2(1-\alpha_l^2)]$ and x is a test point. Here $\alpha_l$ is the lth largest eigenvalue of Θ.
• Then the Mahalanobis distance is given by
$$\delta_J(x, \hat\mu_k) = \sum_{l=1}^{K-1} w_l\big(\hat\eta_l(x) - \bar\eta_l^k\big)^2 + D(x),$$
where $\bar\eta_l^k = N_k^{-1}\sum_{g_i=k}\hat\eta_l(x_i)$ and $w_l = 1/[\alpha_l^2(1-\alpha_l^2)]$.
Generalization of LDA
• FDA: Allow non-linear decision boundary
• PDA: Expand the predictors into a large basis set, and then penalize
its coefficients to be smooth
• MDA: Model each class by a mixture of two or more Gaussians with
different centroids but same covariance, rather than a single Gaussian
distribution as in LDA
Flexible Discriminant Analysis
(Hastie et al., 1994)
FDA: Overview
• The optimal scoring method provides a starting point for generalizing LDA to a nonparametric version.
• We replace the linear projection operator PX by a nonparametric
regression procedure, which we denote by the linear operator S.
• One simple and effective approach toward this end is to expand X
into a larger set of basis variables h(X) and then simply use
S = Ph(X) in place of PX.
FDA: Overview
• These regression problems are defined via the criterion
$$\mathrm{ASR}(\{\theta_l,\eta_l\}_{l=1}^{L}) = \frac{1}{N}\sum_{l=1}^{L}\left[\sum_{i=1}^{N}\big(\theta_l(g_i) - \eta_l(x_i)\big)^2 + \lambda J(\eta_l)\right],$$
where J is a regularizer appropriate for some forms of nonparametric
regression (e.g., smoothing splines, additive splines and lower-order
ANOVA models).
FDA by optimal scoring
Way to the solution
1. Initialize: Form Y : N × K.
2. Multivariate nonparametric regression: Fit a multi-response adaptive nonparametric regression of Y on X, giving fitted values $\hat Y$. Let $S_\lambda$ be the linear operator that fits the final chosen model and let $\eta^*(x)$ be the vector of fitted regression functions.
3. Optimal scores: Compute the eigenvectors Θ of $Y^T\hat Y = Y^T S_\lambda Y$, where the eigenvectors are normalized: $\Theta^T D_p \Theta = I_K$.
4. Update: Update the final model from step 2 using the optimal scores: $\eta(x) \leftarrow \Theta^T \eta^*(x)$.
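A minimal sketch of this recipe, with a hand-rolled quadratic basis expansion playing the role of the nonparametric regression. It reuses the lda_optimal_scoring sketch given earlier; both the choice of basis and the function names are illustrative assumptions, not the slides' prescription.

```python
import numpy as np

def quadratic_basis(X):
    """Illustrative h(X): original coordinates plus all squares and cross products."""
    N, p = X.shape
    cross = [X[:, i] * X[:, j] for i in range(p) for j in range(i, p)]
    return np.column_stack([X] + cross)

def fda_optimal_scoring(X, y):
    H = quadratic_basis(X)
    H = H - H.mean(axis=0)               # center the derived variables
    return lda_optimal_scoring(H, y)     # S = P_{h(X)} in place of P_X
```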
Penalized Discriminant Analysis
(Hastie et al., 1995)
PDA: Overview
• Although FDA is motivated by generalizing optimal scoring, it can
also be viewed directly as a form of regularized discriminant analysis.
• Suppose the regression procedure used in FDA amounts to a linear
regression onto a basis expansion h(X), with a quadratic penalty on
the coefficients:
$$\mathrm{ASR}(\{\theta_l,\eta_l\}_{l=1}^{L}) = \frac{1}{N}\sum_{l=1}^{L}\left[\sum_{i=1}^{N}\big(\theta_l(g_i) - h^T(x_i)\beta_l\big)^2 + \lambda\beta_l^T\Omega\beta_l\right].$$
• The penalty matrix Ω penalizes coefficient vectors that correspond to “rough” functions.
• The steps in FDA can be viewed as a generalized form of LDA, which
we call PDA.
PDA: Overview
• Enlarge the set of predictors X via a basis expansion h(X).
• Use (penalized) LDA in the enlarged space, where the penalized
Mahalanobis distance is given by
$$D(x,\mu) = (h(x)-h(\mu))^T(\Sigma_W + \lambda\Omega)^{-1}(h(x)-h(\mu)),$$
where ΣW is the within-class covariance matrix of the derived
variables h(xi).
• Decompose the classification subspace using a penalized metric:
$$\max_u\; u^T \Sigma_{\mathrm{Bet}}\, u \quad \text{subject to } u^T(\Sigma_W + \lambda\Omega)u = 1,$$
where $\Sigma_{\mathrm{Bet}}$ is the between-class covariance matrix of the derived variables.
PDA by optimal scoring
Way to the solution
1. Initialize: Form Y and H = (hij) = (hj(xi)).
2. Multivariate nonparametric regression: Fit a penalized multi-response regression of Y on H, giving fitted values $\hat Y = S(\Omega)Y$. Let $S(\Omega) = H(H^TH + \Omega)^{-1}H^T$ be the smoother matrix of H regularized by Ω, and let $\beta = (H^TH + \Omega)^{-1}H^TY\theta$ be the penalized least squares estimate.
3. Optimal scores: Compute the eigenvectors Θ of $Y^T\hat Y = Y^T S(\Omega)Y$, where the eigenvectors are normalized: $\Theta^T D_p \Theta = I_K$.
4. Update: Update β using the optimal scores.
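The penalized regression in step 2 is a one-liner once H and Ω are available. A sketch, assuming a basis matrix H (N × M), the indicator matrix Y, a penalty matrix Omega, and an illustrative tuning parameter lam (the slide folds λ into Ω):

```python
import numpy as np

def pda_regression(H, Y, Omega, lam=1.0):
    """Penalized least squares: beta = (H^T H + lam*Omega)^{-1} H^T Y."""
    beta = np.linalg.solve(H.T @ H + lam * Omega, H.T @ Y)
    Yhat = H @ beta      # equals S(Omega) Y with S(Omega) = H (H^T H + lam*Omega)^{-1} H^T
    return beta, Yhat
```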
Mixture Discriminant Analysis
(Hastie and Tibshirani, 1996)
MDA: Overview
• Linear discriminant analysis can be viewed as a prototype classifier.
Each class is represented by its centroid, and we classify to the closest
using an appropriate metric.
• In many situations a single prototype is not sufficient to represent
inhomogeneous classes, and mixture models are more appropriate.
MDA: Overview
• A Gaussian mixture model for the kth class has density
$$P(X \mid G = k) = \sum_{r=1}^{R_k} \pi_{kr}\,\phi(X;\mu_{kr},\Sigma),$$
where the mixing proportions $\pi_{kr}$ sum to one and $R_k$ is the number of prototypes for the kth class.
• The class posterior probabilities are given by
$$P(G = k \mid X = x) = \frac{\sum_{r=1}^{R_k}\pi_{kr}\,\phi(x;\mu_{kr},\Sigma)\,\Pi_k}{\sum_{l=1}^{K}\sum_{r=1}^{R_l}\pi_{lr}\,\phi(x;\mu_{lr},\Sigma)\,\Pi_l},$$
where $\Pi_k$ represents the class prior probabilities.
MDA: Estimation
• We estimate the parameters by maximum likelihood, using the joint
log-likelihood based on P(G, X):
$$\sum_{k=1}^{K}\sum_{g_i=k}\log\left[\Pi_k\sum_{r=1}^{R_k}\pi_{kr}\,\phi(x_i;\mu_{kr},\Sigma)\right].$$
• We solve for the MLEs using the EM algorithm.
MDA: Estimation
• E-step: Given the current parameters, compute the responsibility of
subclass ckr within class k for each of the class-k observations
(gi = k):
$$\hat p(c_{kr}\mid x_i, g_i) = \frac{\pi_{kr}\,\phi(x_i;\mu_{kr},\Sigma)}{\sum_{l=1}^{R_k}\pi_{kl}\,\phi(x_i;\mu_{kl},\Sigma)}.$$
• M-step: Compute the weighted MLEs for the parameters of each of
the component Gaussians within each of the classes, using the
weights from the E-step.
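A sketch of one EM pass for the class-k mixture, assuming Xk holds the class-k rows of X and pi, mu, Sigma are the current subclass proportions, subclass means, and shared covariance; scipy's multivariate_normal supplies the Gaussian density φ. The update of the shared Σ is left to the full M-step, which pools across all classes.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mda_em_step(Xk, pi, mu, Sigma):
    """One EM pass for the R_k-component mixture of class k (shared covariance)."""
    # E-step: responsibilities p(c_kr | x_i, g_i = k), one column per subclass
    dens = np.column_stack(
        [multivariate_normal.pdf(Xk, mean=m, cov=Sigma) for m in mu])
    resp = pi * dens
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: weighted MLEs for the subclass mixing proportions and means
    Nk = resp.sum(axis=0)
    pi_new = Nk / Nk.sum()
    mu_new = (resp.T @ Xk) / Nk[:, None]
    return resp, pi_new, mu_new
```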
MDA: Estimation
• The M-step is a weighted version of LDA, with $R = \sum_{k=1}^{K} R_k$ classes and $\sum_{k=1}^{K} N_k R_k$ observations.
• We can use optimal scoring as before to solve the weighted LDA
problem, which allows us to use a weighted version of FDA or PDA at
this stage.
MDA: Estimation
• The indicator matrix YN×K collapses in this case to a blurred
response matrix ZN×R.
• For example,
        c11  c12  c13  c21  c22  c23  c31  c32  c33
g1 = 2    0    0    0  0.3  0.5  0.2    0    0    0
g2 = 1  0.9  0.1  0.0    0    0    0    0    0    0
g3 = 1  0.1  0.8  0.1    0    0    0    0    0    0
g4 = 3    0    0    0    0    0    0  0.5  0.4  0.1
...
gN = 3    0    0    0    0    0    0  0.5  0.4  0.1
where the entries in a class-k row correspond to ˆp(ckr|x, gi).
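Building the blurred response matrix Z from the responsibilities can be sketched as below, assuming resp_list[k] is the (N_k × R_k) responsibility matrix for class k and idx_list[k] gives the row indices of the class-k observations; the names are hypothetical helpers, not from the slides.

```python
import numpy as np

def blurred_matrix(N, resp_list, idx_list):
    """Assemble Z (N x R) from per-class responsibility blocks."""
    R = sum(r.shape[1] for r in resp_list)
    Z = np.zeros((N, R))
    col = 0
    for resp, idx in zip(resp_list, idx_list):
        Rk = resp.shape[1]
        Z[idx, col:col + Rk] = resp      # fill the class-k block; other entries stay 0
        col += Rk
    return Z
```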
MDA: Estimation by optimal scoring
Optimal scoring version of the EM steps of MDA:
1. Initialize: Start with a set of Rk subclasses ckr, and associated subclass probabilities ˆp(ckr|x, gi)
2. The blurred matrix: If gi = k, then fill the kth block of Rk entries in
the ith row with the values ˆp(ckr|x, gi), and the rest with 0s
3. Multivariate nonparametric regression: Fit a multi-response adaptive
nonparametric regression of Z on X, giving fitted values ˆZ. Let η∗(x)
be the vector of fitted regression functions.
4. Optimal scores: Let Θ be the K largest non-trivial eigenvectors of $Z^T\hat Z$, with normalization $\Theta^T D_p\Theta = I_K$.
5. Update: Update the final model from step 2 using the optimal scores:
η(x) ← ΘT η∗(x), and update ˆp(ckr|x, gi) and ˆπkr.
Performance