Linear Discriminant Analysis and Its Generalization
Chapter 4 and 12 of The Elements of Statistical Learning
Presented by Ilsang Ohn
Department of Statistics, Seoul National University
September 3, 2014
Contents
1 Linear Discriminant Analysis
2 Flexible Discriminant Analysis
3 Penalized Discriminant Analysis
4 Mixture Discriminant Analysis
Review of Linear Discriminant Analysis
LDA: Overview
• Linear discriminant analysis (LDA) does classification by assuming
that the data within each class are normally distributed:
fk(x) = P(X = x|G = k) = N(µk, Σ).
• We allow each class to have its own mean µk ∈ Rp, but we assume a
common variance matrix Σ ∈ Rp×p. Thus
$$f_k(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x-\mu_k)^T \Sigma^{-1}(x-\mu_k)\right\}.$$
• We want to find k so that P(G = k|X = x) ∝ fk(x)πk is the largest.
LDA: Overview
• The linear discriminant functions are derived from the relation
$$\begin{aligned}
\log(f_k(x)\pi_k) &= -\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k) + \log(\pi_k) + C \\
&= x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log(\pi_k) + C',
\end{aligned}$$
and we denote
$$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log(\pi_k).$$
• The decision rule is $G(x) = \arg\max_k \delta_k(x)$.
• The Bayes classifier is a linear classifier.
LDA: Overview
• We need to estimate the parameters based on the training data
xi ∈ Rp and yi ∈ {1, · · · , K} by
• $\hat\pi_k = N_k/N$
• $\hat\mu_k = N_k^{-1}\sum_{y_i=k} x_i$, the centroid of class k
• $\hat\Sigma = \frac{1}{N-K}\sum_{k=1}^{K}\sum_{y_i=k}(x_i-\hat\mu_k)(x_i-\hat\mu_k)^T$, the pooled sample variance matrix
• The decision boundary between each pair of classes k and l is given by
{x : δk(x) = δl(x)}
which is equivalent to
$$(\hat\mu_k - \hat\mu_l)^T\hat\Sigma^{-1}x = \frac{1}{2}(\hat\mu_k + \hat\mu_l)^T\hat\Sigma^{-1}(\hat\mu_k - \hat\mu_l) - \log(\hat\pi_k/\hat\pi_l).$$
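The estimation and classification steps above can be written out in a few lines of NumPy. The following is a minimal sketch, assuming the training data are an (N, p) array X with integer labels y in {0, ..., K−1}; the names fit_lda and lda_predict are illustrative, not from the slides.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate class priors, centroids, and the pooled covariance matrix."""
    classes = np.unique(y)
    N, p = X.shape
    priors = np.array([np.mean(y == k) for k in classes])        # pi_k = N_k / N
    means = np.array([X[y == k].mean(axis=0) for k in classes])  # mu_k
    Sigma = np.zeros((p, p))
    for k, mu in zip(classes, means):
        Xc = X[y == k] - mu
        Sigma += Xc.T @ Xc
    Sigma /= N - len(classes)                                    # pooled estimate
    return priors, means, Sigma

def lda_predict(X, priors, means, Sigma):
    """Assign each row of X to argmax_k delta_k(x)."""
    Sigma_inv = np.linalg.inv(Sigma)
    # delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k
    linear = X @ Sigma_inv @ means.T
    const = -0.5 * np.einsum('kp,pq,kq->k', means, Sigma_inv, means) + np.log(priors)
    return np.argmax(linear + const, axis=1)
```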
Fisher’s discriminant analysis
• Fisher’s idea is to find a direction v maximizing the ratio
$$\max_v \; \frac{v^T B v}{v^T W v},$$
where
- $B = \sum_{k=1}^{K}(\bar x_k - \bar x)(\bar x_k - \bar x)^T$: between-class covariance matrix
- $W = \sum_{k=1}^{K}\sum_{y_i=k}(x_i - \bar x_k)(x_i - \bar x_k)^T$: within-class covariance matrix, previously denoted by $(N-K)\hat\Sigma$
• This ratio is maximized by $v_1 = e_1$, the eigenvector of $W^{-1}B$ with the largest eigenvalue. The linear combination $v_1^T X$ is called the first discriminant. Similarly one can find the next direction $v_2$, orthogonal in W to $v_1$.
• Fisher’s canonical discriminant analysis finds L ≤ K − 1 canonical
coordinates (or a rank-L subspace) that best separate the categories.
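A short sketch of the canonical directions, again assuming an (N, p) array X and labels y; it forms B and W exactly as defined on this slide and takes the leading eigenvectors of W^{-1}B.

```python
import numpy as np

def fisher_directions(X, y, L=None):
    """Leading eigenvectors of W^{-1} B: the canonical discriminant directions."""
    classes = np.unique(y)
    xbar = X.mean(axis=0)
    p = X.shape[1]
    B = np.zeros((p, p))
    W = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        d = (Xk.mean(axis=0) - xbar)[:, None]
        B += d @ d.T                              # between-class scatter
        C = Xk - Xk.mean(axis=0)
        W += C.T @ C                              # within-class scatter, (N-K) * Sigma_hat
    evals, evecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(-evals.real)
    L = len(classes) - 1 if L is None else L      # at most K-1 non-zero eigenvalues
    return evecs.real[:, order[:L]]               # columns v_1, ..., v_L
```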
Fisher’s discriminant analysis
• Consequently, we have $v_1, \ldots, v_L$, $L \le K-1$, which are the eigenvectors with non-zero eigenvalues.
• Fisher’s discriminant rule assigns x to the class closest in Mahalanobis distance, so the rule is given by
$$\begin{aligned}
G(x) &= \arg\min_k \sum_{l=1}^{L}\big[v_l^T(x-\bar x_k)\big]^2 \\
&= \arg\min_k (x-\bar x_k)^T\hat\Sigma^{-1}(x-\bar x_k) \\
&= \arg\min_k \left(-2\delta_k(x) + x^T\hat\Sigma^{-1}x + 2\log\pi_k\right) \\
&= \arg\max_k \left(\delta_k(x) - \log\pi_k\right).
\end{aligned}$$
• Thus Fisher’s rule is equivalent to the Gaussian classification rule with
equal prior probabilities.
LDA by optimal scoring
• The standard way of carrying out a (Fisher’s) canonical discriminant
analysis is by way of a suitable SVD.
• There is a somewhat different approach: optimal scoring.
• This method performs LDA using linear regression on derived responses.
LDA by optimal scoring
• Recall G = {1, · · · , K}.
• θ : G → R is a function that assigns scores to the classes such that
the transformed class labels are optimally predicted by linear
regression on X.
• We find L ≤ K − 1 sets of independent scorings for the class labels {θ1, · · · , θL}, and L corresponding linear maps $\eta_l(X) = X^T\beta_l$ chosen to be optimal for multiple regression in $\mathbb{R}^p$.
• θl and βl are chosen to minimize
$$\mathrm{ASR} = \frac{1}{N}\sum_{l=1}^{L}\sum_{i=1}^{N}\big(\theta_l(g_i) - x_i^T\beta_l\big)^2.$$
LDA by optimal scoring
Notation
• Y : N × K indicator matrix
• $P_X = X(X^TX)^{-1}X^T$: projection matrix onto the column space of the predictors
• Θ: K × L matrix of L score vectors for the K classes
• $\Theta^* = Y\Theta$: N × L matrix with $\Theta^*_{ij} = \theta_j(g_i)$
LDA by optimal scoring
Problem
• Minimize ASR by regressing Θ∗ on X. That is, find Θ that minimizes
$$\mathrm{ASR}(\Theta) = \mathrm{tr}\big(\Theta^{*T}(I-P_X)\Theta^*\big)/N = \mathrm{tr}\big(\Theta^T Y^T(I-P_X)Y\Theta\big)/N.$$
• ASR(Θ) is minimized by taking Θ to be the L largest eigenvectors of $Y^T P_X Y$, with normalization $\Theta^T D_p \Theta = I_L$.
• Here $D_p = Y^T Y/N$ is a diagonal matrix of the sample class proportions $N_j/N$.
LDA by optimal scoring
Way to the solution
1. Initialize: Form Y : N × K.
2. Multivariate regression: Set $\hat Y = P_X Y$ and denote the p × K coefficient matrix by B: $\hat Y = XB$.
3. Optimal scores: Obtain the eigenvector matrix Θ of $Y^T\hat Y = Y^T P_X Y$ with normalization $\Theta^T D_p \Theta = I$.
4. Update: Update the coefficient matrix in step 2 to reflect the optimal scores: B ← BΘ. The final optimally scaled regression fit is the (K − 1)-vector function $\eta(x) = B^T x$.
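The four steps above can be sketched in NumPy as follows, assuming an (N, p) array X and labels y in {0, ..., K−1}. An intercept column stands in for the trivial constant score, which is then dropped in step 3; the function name is illustrative.

```python
import numpy as np

def lda_optimal_scoring(X, y):
    """LDA via optimal scoring, following steps 1-4 above (a sketch)."""
    N, p = X.shape
    K = int(y.max()) + 1
    Y = np.eye(K)[y]                                 # step 1: N x K indicator matrix
    Xi = np.column_stack([np.ones(N), X])            # intercept absorbs the trivial score
    # step 2: multivariate regression of Y on X, Yhat = P_X Y = Xi B
    B, *_ = np.linalg.lstsq(Xi, Y, rcond=None)
    Yhat = Xi @ B
    # step 3: eigenvectors Theta of D_p^{-1} Y^T Yhat, normalized so Theta^T D_p Theta = I
    Dp = np.diag(Y.mean(axis=0))                     # class proportions N_k / N
    evals, Theta = np.linalg.eig(np.linalg.solve(Dp, Y.T @ Yhat / N))
    order = np.argsort(-evals.real)
    Theta = Theta.real[:, order[1:]]                 # drop the trivial constant score
    Theta /= np.sqrt(np.sum(Dp.diagonal()[:, None] * Theta**2, axis=0))
    # step 4: update the coefficient matrix; eta(x) = B^T x (first row is the intercept)
    return B @ Theta                                 # (p+1) x (K-1)
```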
LDA by optimal scoring
• The sequence of discriminant vectors $\nu_l$ in LDA is identical to the sequence $\beta_l$ up to a constant.
• That is, the coefficient matrix B is, up to a diagonal scale matrix, the same as the discriminant analysis coefficient matrix:
$$V^T x = D B^T x = D\,\eta(x),$$
where $D_{ll} = 1/[\alpha_l^2(1-\alpha_l^2)]$ and x is a test point. Here $\alpha_l$ is the lth largest eigenvalue of Θ.
• Then the Mahalanobis distance is given by
$$\delta_J(x, \hat\mu_k) = \sum_{l=1}^{K-1} w_l\big(\hat\eta_l(x) - \bar\eta_l^k\big)^2 + D(x),$$
where $\bar\eta_l^k = N_k^{-1}\sum_{g_i=k}\hat\eta_l(x_i)$ and $w_l = 1/[\alpha_l^2(1-\alpha_l^2)]$.
Generalization of LDA
• FDA: Allow non-linear decision boundary
• PDA: Expand the predictors into a large basis set, and then penalize
its coefficients to be smooth
• MDA: Model each class by a mixture of two or more Gaussians with
different centroids but same covariance, rather than a single Gaussian
distribution as in LDA
Flexible Discriminant Analysis
(Hastie et al., 1994)
FDA: Overview
• The optimal scoring method provides a starting point for generalizing LDA to a nonparametric version.
• We replace the linear projection operator PX by a nonparametric
regression procedure, which we denote by the linear operator S.
• One simple and effective approach toward this end is to expand X
into a larger set of basis variables h(X) and then simply use
S = Ph(X) in place of PX.
FDA: Overview
• These regression problems are defined via the criterion
$$\mathrm{ASR}(\{\theta_l,\eta_l\}_{l=1}^{L}) = \frac{1}{N}\sum_{l=1}^{L}\left[\sum_{i=1}^{N}\big(\theta_l(g_i) - \eta_l(x_i)\big)^2 + \lambda J(\eta_l)\right],$$
where J is a regularizer appropriate for some forms of nonparametric
regression (e.g., smoothing splines, additive splines and lower-order
ANOVA models).
FDA by optimal scoring
Way to the solution
1. Initialize: Form Y : N × K.
2. Multivariate nonparametric regression: Fit a multi-response adaptive nonparametric regression of Y on X, giving fitted values $\hat Y$. Let $S_\lambda$ be the linear operator that fits the final chosen model and let $\eta^*(x)$ be the vector of fitted regression functions.
3. Optimal scores: Compute the eigenvectors Θ of $Y^T\hat Y = Y^T S_\lambda Y$, where the eigenvectors are normalized: $\Theta^T D_p \Theta = I_K$.
4. Update: Update the final model from step 2 using the optimal scores: $\eta(x) \leftarrow \Theta^T \eta^*(x)$.
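A minimal sketch of this recipe, with a hand-rolled quadratic basis expansion playing the role of the nonparametric regression. It reuses the lda_optimal_scoring sketch given earlier; both the choice of basis and the function names are illustrative assumptions, not the slides' prescription.

```python
import numpy as np

def quadratic_basis(X):
    """Illustrative h(X): original coordinates plus all squares and cross products."""
    N, p = X.shape
    cross = [X[:, i] * X[:, j] for i in range(p) for j in range(i, p)]
    return np.column_stack([X] + cross)

def fda_optimal_scoring(X, y):
    H = quadratic_basis(X)
    H = H - H.mean(axis=0)               # center the derived variables
    return lda_optimal_scoring(H, y)     # S = P_{h(X)} in place of P_X
```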
Penalized Discriminant Analysis
(Hastie et al., 1995)
PDA: Overview
• Although FDA is motivated by generalizing optimal scoring, it can
also be viewed directly as a form of regularized discriminant analysis.
• Suppose the regression procedure used in FDA amounts to a linear
regression onto a basis expansion h(X), with a quadratic penalty on
the coefficients:
$$\mathrm{ASR}(\{\theta_l,\eta_l\}_{l=1}^{L}) = \frac{1}{N}\sum_{l=1}^{L}\left[\sum_{i=1}^{N}\big(\theta_l(g_i) - h^T(x_i)\beta_l\big)^2 + \lambda\beta_l^T\Omega\beta_l\right].$$
• The penalty matrix Ω penalizes coefficient vectors that correspond to “rough” functions.
• The steps in FDA can be viewed as a generalized form of LDA, which
we call PDA.
PDA: Overview
• Enlarge the set of predictors X via a basis expansion h(X).
• Use (penalized) LDA in the enlarged space, where the penalized
Mahalanobis distance is given by
$$D(x,\mu) = (h(x)-h(\mu))^T(\Sigma_W + \lambda\Omega)^{-1}(h(x)-h(\mu)),$$
where ΣW is the within-class covariance matrix of the derived
variables h(xi).
• Decompose the classification subspace using a penalized metric:
$$\max_u\; u^T \Sigma_{\mathrm{Bet}}\, u \quad \text{subject to } u^T(\Sigma_W + \lambda\Omega)u = 1,$$
where $\Sigma_{\mathrm{Bet}}$ is the between-class covariance matrix of the derived variables.
PDA by optimal scoring
Way to the solution
1. Initialize: Form Y and H = (hij) = (hj(xi)).
2. Multivariate nonparametric regression: Fit a penalized multi-response regression of Y on H, giving fitted values $\hat Y = S(\Omega)Y$. Let $S(\Omega) = H(H^TH + \Omega)^{-1}H^T$ be the smoother matrix of H regularized by Ω, and let $\beta = (H^TH + \Omega)^{-1}H^TY\theta$ be the penalized least squares estimate.
3. Optimal scores: Compute the eigenvectors Θ of $Y^T\hat Y = Y^T S(\Omega)Y$, where the eigenvectors are normalized: $\Theta^T D_p \Theta = I_K$.
4. Update: Update β using the optimal scores.
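The penalized regression in step 2 is a one-liner once H and Ω are available. A sketch, assuming a basis matrix H (N × M), the indicator matrix Y, a penalty matrix Omega, and an illustrative tuning parameter lam (the slide folds λ into Ω):

```python
import numpy as np

def pda_regression(H, Y, Omega, lam=1.0):
    """Penalized least squares: beta = (H^T H + lam*Omega)^{-1} H^T Y."""
    beta = np.linalg.solve(H.T @ H + lam * Omega, H.T @ Y)
    Yhat = H @ beta      # equals S(Omega) Y with S(Omega) = H (H^T H + lam*Omega)^{-1} H^T
    return beta, Yhat
```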
Mixture Discriminant Analysis
(Hastie and Tibshirani, 1996)
MDA: Overview
• Linear discriminant analysis can be viewed as a prototype classifier.
Each class is represented by its centroid, and we classify to the closest
using an appropriate metric.
• In many situations a single prototype is not sufficient to represent
inhomogeneous classes, and mixture models are more appropriate.
MDA: Overview
• A Gaussian mixture model for the kth class has density
$$P(X \mid G = k) = \sum_{r=1}^{R_k} \pi_{kr}\,\phi(X;\mu_{kr},\Sigma),$$
where the mixing proportions $\pi_{kr}$ sum to one and $R_k$ is the number of prototypes for the kth class.
• The class posterior probabilities are given by
$$P(G = k \mid X = x) = \frac{\sum_{r=1}^{R_k}\pi_{kr}\,\phi(x;\mu_{kr},\Sigma)\,\Pi_k}{\sum_{l=1}^{K}\sum_{r=1}^{R_l}\pi_{lr}\,\phi(x;\mu_{lr},\Sigma)\,\Pi_l},$$
where $\Pi_k$ represents the class prior probabilities.
MDA: Estimation
• We estimate the parameters by maximum likelihood, using the joint
log-likelihood based on P(G, X):
$$\sum_{k=1}^{K}\sum_{g_i=k}\log\left[\Pi_k\sum_{r=1}^{R_k}\pi_{kr}\,\phi(x_i;\mu_{kr},\Sigma)\right].$$
• We solve for the MLEs using the EM algorithm.
MDA: Estimation
• E-step: Given the current parameters, compute the responsibility of
subclass ckr within class k for each of the class-k observations
(gi = k):
$$\hat p(c_{kr}\mid x_i, g_i) = \frac{\pi_{kr}\,\phi(x_i;\mu_{kr},\Sigma)}{\sum_{l=1}^{R_k}\pi_{kl}\,\phi(x_i;\mu_{kl},\Sigma)}.$$
• M-step: Compute the weighted MLEs for the parameters of each of
the component Gaussians within each of the classes, using the
weights from the E-step.
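A sketch of one EM pass for the class-k mixture, assuming Xk holds the class-k rows of X and pi, mu, Sigma are the current subclass proportions, subclass means, and shared covariance; scipy's multivariate_normal supplies the Gaussian density φ. The update of the shared Σ is left to the full M-step, which pools across all classes.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mda_em_step(Xk, pi, mu, Sigma):
    """One EM pass for the R_k-component mixture of class k (shared covariance)."""
    # E-step: responsibilities p(c_kr | x_i, g_i = k), one column per subclass
    dens = np.column_stack(
        [multivariate_normal.pdf(Xk, mean=m, cov=Sigma) for m in mu])
    resp = pi * dens
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: weighted MLEs for the subclass mixing proportions and means
    Nk = resp.sum(axis=0)
    pi_new = Nk / Nk.sum()
    mu_new = (resp.T @ Xk) / Nk[:, None]
    return resp, pi_new, mu_new
```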
MDA: Estimation
• The M-step is a weighted version of LDA, with $R = \sum_{k=1}^{K} R_k$ classes and $\sum_{k=1}^{K} N_k R_k$ observations.
• We can use optimal scoring as before to solve the weighted LDA
problem, which allows us to use a weighted version of FDA or PDA at
this stage.
MDA: Estimation
• The indicator matrix YN×K collapses in this case to a blurred
response matrix ZN×R.
• For example,
        c11  c12  c13  c21  c22  c23  c31  c32  c33
g1 = 2    0    0    0  0.3  0.5  0.2    0    0    0
g2 = 1  0.9  0.1  0.0    0    0    0    0    0    0
g3 = 1  0.1  0.8  0.1    0    0    0    0    0    0
g4 = 3    0    0    0    0    0    0  0.5  0.4  0.1
...
gN = 3    0    0    0    0    0    0  0.5  0.4  0.1
where the entries in a class-k row correspond to ˆp(ckr|x, gi).
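Building the blurred response matrix Z from the responsibilities can be sketched as below, assuming resp_list[k] is the (N_k × R_k) responsibility matrix for class k and idx_list[k] gives the row indices of the class-k observations; the names are hypothetical helpers, not from the slides.

```python
import numpy as np

def blurred_matrix(N, resp_list, idx_list):
    """Assemble Z (N x R) from per-class responsibility blocks."""
    R = sum(r.shape[1] for r in resp_list)
    Z = np.zeros((N, R))
    col = 0
    for resp, idx in zip(resp_list, idx_list):
        Rk = resp.shape[1]
        Z[idx, col:col + Rk] = resp      # fill the class-k block; other entries stay 0
        col += Rk
    return Z
```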
MDA: Estimation by optimal scoring
Optimal scoring version of the EM steps of MDA:
1. Initialize: Start with a set of Rk subclasses ckr, and associated subclass probabilities ˆp(ckr|x, gi)
2. The blurred matrix: If gi = k, then fill the kth block of Rk entries in
the ith row with the values ˆp(ckr|x, gi), and the rest with 0s
3. Multivariate nonparametric regression: Fit a multi-response adaptive
nonparametric regression of Z on X, giving fitted values ˆZ. Let η∗(x)
be the vector of fitted regression functions.
4. Optimal scores: Let Θ be the K largest non-trivial eigenvectors of $Z^T\hat Z$, with normalization $\Theta^T D_p\Theta = I_K$.
5. Update: Update the final model from step 2 using the optimal scores:
η(x) ← ΘT η∗(x), and update ˆp(ckr|x, gi) and ˆπkr.
Performance