Linear Discriminant Analysis under f-divergence Measures
Anmol Dwivedi, Sihui Wang, Ali Tajer
Department of Electrical, Computer, and Systems Engineering
Rensselaer Polytechnic Institute
ISIT 2021
Linear Discriminant Analysis
[Figure: two scatter plots of the same two-class data projected onto different directions; one choice of direction separates the classes far better.]
- Choice of the direction for projection so as to maximize the separation of the data, as in the figure¹

¹ Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer
Binary Classification: Discriminant Analysis
- Population sample $X \in \mathbb{R}^n$
- Objective: observe $X \implies$ classify between $P_A$ and $Q_A$, where $A \in \mathbb{R}^{r \times n}$, subject to error constraints
Motivation
- Motivation: inference in high dimensions requires forming high-dimensional statistics
- Example: consider the likelihood ratio test $\frac{dP}{dQ}(X)$ for classification
- Challenges:
  - The optimal test is computationally complex for large data dimension $n$
  - This renders a statistical-to-computational performance gap between information-theoretically viable tests (unbounded complexity) and tests with bounded computational power (bounded complexity)
Statistical Distinguishability under f-divergence Measures
- Classical objective function for LDA:
  $\arg\max_{a} \; \frac{a^\top S_B\, a}{a^\top S_W\, a}$
  where $S_B$ and $S_W$ are the between-class and within-class scatter matrices
- Optimizes a heuristic objective function
- Proposed objective for LDA:
  $\arg\max_{A} \; D_f(Q_A \,\|\, P_A)$
  where $P_A$ and $Q_A$ are the probability measures after dimensionality reduction
- Optimizes an information measure as the objective
Information measures represent the true performance limits in a wide range of inference problems
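
As a point of reference, the classical heuristic objective has a closed-form maximizer in the two-class case. A minimal sketch (the two-class setup and function name are illustrative, not from the talk):

```python
import numpy as np

def fisher_lda_direction(X0, X1):
    """Classical two-class Fisher LDA: the maximizer of the Rayleigh
    quotient (a^T S_B a)/(a^T S_W a) is proportional to S_W^{-1}(mu1 - mu0)."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # within-class scatter: sum of the per-class sample covariances
    S_W = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
    a = np.linalg.solve(S_W, mu1 - mu0)
    return a / np.linalg.norm(a)
```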
Data Model
- Consider $n$-dimensional zero-mean Gaussian models
  $P : X \sim \mathcal{N}(0, \Sigma_P) \quad \text{vs.} \quad Q : X \sim \mathcal{N}(0, \Sigma_Q)$
- Design $A \in \mathbb{R}^{r \times n}$ to maximally distinguish the $r$-dimensional models
  $P_A : Y \sim \mathcal{N}(0, A \Sigma_P A^\top) \quad \text{vs.} \quad Q_A : Y \sim \mathcal{N}(0, A \Sigma_Q A^\top)$
- WLOG, design $\bar{A}$ to maximally distinguish the models
  $P_{\bar{A}} : Y \sim \mathcal{N}(0, \bar{A} \bar{A}^\top) \quad \text{vs.} \quad Q_{\bar{A}} : Y \sim \mathcal{N}(0, \bar{A} \Sigma \bar{A}^\top)$
  where $\Sigma \triangleq \Sigma_P^{-1/2} \Sigma_Q \Sigma_P^{-1/2}$
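
The whitening step can be carried out numerically via an eigendecomposition of $\Sigma_P$. A sketch assuming $\Sigma_P$ is strictly positive definite (the helper name is mine):

```python
import numpy as np

def whiten(Sigma_P, Sigma_Q):
    """Return Sigma = Sigma_P^{-1/2} Sigma_Q Sigma_P^{-1/2}, reducing the
    pair (Sigma_P, Sigma_Q) to (I, Sigma); here Abar = A Sigma_P^{1/2}."""
    w, V = np.linalg.eigh(Sigma_P)              # Sigma_P = V diag(w) V^T
    P_inv_half = V @ np.diag(w ** -0.5) @ V.T   # symmetric inverse square root
    return P_inv_half @ Sigma_Q @ P_inv_half
```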
Problem Statement
- $f$-divergence between the $r$-dimensional data models $P_{\bar{A}}$ and $Q_{\bar{A}}$:
  $D_f(\bar{A}) \triangleq \mathbb{E}_{P_{\bar{A}}}\!\left[ f\!\left( \frac{dQ_{\bar{A}}}{dP_{\bar{A}}} \right) \right]$
- Design $\bar{A}$ such that
  $\mathcal{P} : \max_{\bar{A} \in \mathbb{R}^{r \times n}} D_f(\bar{A})$
  under the following four choices of $f$-divergence measure:
  - Kullback-Leibler divergence ($D_{\mathrm{KL}}$)
  - Squared Hellinger distance ($H^2$)
  - Chi-squared divergence ($\chi^2$)
  - Total variation distance ($d_{\mathrm{TV}}$)
Design Space for A
- Motivation: the large design space for $\bar{A}$ is a challenge

Theorem
Corresponding to any matrix $\bar{A}$ there exists a semi-orthogonal matrix $A$ such that $D_f(\bar{A}) = D_f(A)$.

- WLOG, problem $\mathcal{P}$ is equivalent to the constrained problem $\mathcal{Q}$:
  $\mathcal{Q} : \max_{A \in \mathbb{R}^{r \times n}} D_f(A) \quad \text{s.t.} \quad A A^\top = I_r$
- Interpretation: semi-orthogonality constraints limit the design space for $A$; a construction realizing the theorem is sketched below
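
An $f$-divergence is invariant under invertible transformations of $Y$, so $\bar{A}$ can be replaced by any matrix with the same row space; orthonormalizing the rows gives a feasible point of $\mathcal{Q}$. A sketch, with QR as my choice of orthonormalization and assuming $\bar{A}$ has full row rank $r$:

```python
import numpy as np

def semi_orthogonalize(Abar):
    """Return a semi-orthogonal A (A A^T = I_r) with the same row space as
    Abar; Y' = A X is an invertible transform of Y = Abar X, so D_f is equal."""
    Q, _ = np.linalg.qr(Abar.T)  # columns of Q: orthonormal basis of Abar's rows
    A = Q.T
    assert np.allclose(A @ A.T, np.eye(A.shape[0]))
    return A
```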
f-divergences of Interest
- Kullback-Leibler (KL) divergence, for $f(t) = t \log t$:
  $D_{\mathrm{KL}}(A) \triangleq \mathbb{E}_{Q_A}\!\left[ \log \frac{dQ_A}{dP_A} \right]$
- $\chi^2$-divergence, for $f(t) = (t - 1)^2$:
  $\chi^2(A) \triangleq \int_{\mathcal{Y}} \frac{(dQ_A - dP_A)^2}{dP_A}$
- Squared Hellinger distance, for $f(t) = (1 - \sqrt{t})^2$:
  $H^2(A) \triangleq \int_{\mathcal{Y}} \left( \sqrt{dQ_A} - \sqrt{dP_A} \right)^2$
- Total variation distance, for $f(t) = \frac{1}{2} |t - 1|$:
  $d_{\mathrm{TV}}(A) \triangleq \frac{1}{2} \int |dQ_A - dP_A|$
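
These definitions can be checked numerically against the closed forms in the theorems below. A plain Monte Carlo sketch for the zero-mean Gaussian pair (the estimator and all names are mine, not from the talk):

```python
import numpy as np

F = {  # the four generators f listed above
    "KL":        lambda t: t * np.log(t),
    "chi2":      lambda t: (t - 1.0) ** 2,
    "hellinger": lambda t: (1.0 - np.sqrt(t)) ** 2,
    "tv":        lambda t: 0.5 * np.abs(t - 1.0),
}

def df_monte_carlo(name, A, Sigma, num=200_000, seed=0):
    """Estimate D_f(A) = E_{P_A}[f(dQ_A/dP_A)] for P_A = N(0, A A^T) and
    Q_A = N(0, A Sigma A^T) by sampling from P_A."""
    rng = np.random.default_rng(seed)
    CP, CQ = A @ A.T, A @ Sigma @ A.T
    Y = rng.multivariate_normal(np.zeros(len(CP)), CP, size=num)
    iP, iQ = np.linalg.inv(CP), np.linalg.inv(CQ)
    # log (dQ_A/dP_A)(y) for two zero-mean Gaussians
    log_ratio = 0.5 * np.einsum("ni,ij,nj->n", Y, iP - iQ, Y) \
        + 0.5 * (np.linalg.slogdet(CP)[1] - np.linalg.slogdet(CQ)[1])
    return F[name](np.exp(log_ratio)).mean()
```

Note that for $f(t) = t \log t$ the expectation under $P_A$ equals $\mathbb{E}_{Q_A}[\log \frac{dQ_A}{dP_A}]$, matching the KL form above.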
Bounds on Eigenspace
- Motivation: the eigenspace of $A \Sigma A^\top$ characterizes the optimal solution for all $f$-measures

Theorem (Poincaré Separation Theorem)
Let the eigenvalues of $\Sigma \in \mathbb{R}^{n \times n}$, denoted $\{\lambda_i : i \in [n]\}$, satisfy $\lambda_1 \ge \cdots \ge \lambda_n$. Then the eigenvalues of $A \Sigma A^\top \in \mathbb{R}^{r \times r}$, denoted $\{\gamma_i : i \in [r]\}$ with $\gamma_1 \ge \cdots \ge \gamma_r$, satisfy
  $\lambda_{n-(r-i)} \le \gamma_i \le \lambda_i \quad \text{for all } i \in [r]$

- Interpretation: each $\gamma_i$ interlaces the spectrum of $\Sigma$. Example with $n = 5$, $r = 2$: $\gamma_1$ can lie anywhere in $[\lambda_4, \lambda_1]$ and $\gamma_2$ anywhere in $[\lambda_5, \lambda_2]$. A numerical check follows.
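
A quick verification of the interlacing for a semi-orthogonal $A$ (function name mine; a random $A$ can be produced with `semi_orthogonalize` from the earlier sketch):

```python
import numpy as np

def check_poincare(Sigma, A, tol=1e-9):
    """Check lambda_{n-(r-i)} <= gamma_i <= lambda_i (1-indexed, eigenvalues
    sorted in decreasing order) for a semi-orthogonal A."""
    n, r = Sigma.shape[0], A.shape[0]
    lam = np.sort(np.linalg.eigvalsh(Sigma))[::-1]            # lambda_1 >= ... >= lambda_n
    gam = np.sort(np.linalg.eigvalsh(A @ Sigma @ A.T))[::-1]  # gamma_1 >= ... >= gamma_r
    for i in range(1, r + 1):
        assert lam[n - (r - i) - 1] - tol <= gam[i - 1] <= lam[i - 1] + tol
    return gam
```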
Kullback-Leibler divergence: Motivation
[Figure: a sample path of the observations before and after a change point, and the detection delay incurred by the stopping time.]
- Inference Problem: quickest change-point detection (minimax setting)
  $\min_{\tau} \; \sup_{\kappa \ge 1} \; \mathbb{E}_{\kappa}\!\left[ \tau - \kappa \mid \tau \ge \kappa \right] \quad \text{subject to} \quad \mathrm{FAR}(\tau) \le \alpha$
- Figure of Merit: average detection delay (ADD) of the asymptotically optimal test statistic
  $\mathrm{ADD} \sim \frac{c}{D_{\mathrm{KL}}(Q \,\|\, P)}$
Kullback-Leibler divergence: Results
Theorem
Define the permutation $\pi^*_{\mathrm{KL}} : [n] \to [n]$ as a solution to $\pi^*_{\mathrm{KL}} \triangleq \arg\max_{\pi} D_{\mathrm{KL}}(\lambda_{\pi(i)})$. To maximize
  $D_{\mathrm{KL}}(A) = \sum_{i=1}^{r} \frac{1}{2} \left( \gamma_i - \log \gamma_i - 1 \right)$
1. The eigenvalues of $A \Sigma A^\top$ are given by $\gamma_i = \lambda_{\pi^*_{\mathrm{KL}}(i)}$.
2. Row $i$ of matrix $A$ is the eigenvector of $\Sigma$ associated with the eigenvalue $\gamma_i = \lambda_{\pi^*_{\mathrm{KL}}(i)}$.
[Figure: the per-eigenvalue KL term $\frac{1}{2}(\gamma - \log\gamma - 1)$ as a function of $\gamma \in (0, 5]$; it vanishes at $\gamma = 1$ and increases in both directions.]
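
The theorem reduces the design to sorting eigenvalues by a scalar score and keeping the top $r$. A sketch of that recipe (function names mine; `design_A` is reused for the other divergences below):

```python
import numpy as np

def kl_score(lam):
    """Per-eigenvalue contribution to D_KL(A): (gamma - log(gamma) - 1) / 2."""
    return 0.5 * (lam - np.log(lam) - 1.0)

def design_A(Sigma, r, score):
    """Rows of the optimal semi-orthogonal A are the eigenvectors of Sigma
    whose eigenvalues have the r largest per-mode scores."""
    lam, V = np.linalg.eigh(Sigma)          # eigenpairs of Sigma
    idx = np.argsort(score(lam))[::-1][:r]  # indices of the r best eigenvalues
    return V[:, idx].T                      # rows = selected eigenvectors

# usage: A = design_A(Sigma, r, kl_score)
```

Since $D_{\mathrm{KL}}(A)$ is a sum of independent per-mode terms, maximizing over permutations amounts exactly to this top-$r$ selection.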
Kullback-Leibler divergence: Observations
- Observation 1: if $\lambda_{\min} \triangleq \lambda_n \ge 1$, then $\gamma_i = \lambda_i$ for all $i \in [r]$:
  the rows of $A$ are the eigenvectors of $\Sigma$ associated with the $r$ largest eigenvalues $\{\lambda_i : i \in [r]\}$
  (example with $n = 5$, $r = 2$: $\gamma_1 = \lambda_1$, $\gamma_2 = \lambda_2$)
- Observation 2: if $\lambda_{\max} \triangleq \lambda_1 \le 1$, then $\gamma_i = \lambda_{n-r+i}$ for all $i \in [r]$:
  the rows of $A$ are the eigenvectors of $\Sigma$ associated with the $r$ smallest eigenvalues
  (example with $n = 5$, $r = 2$: $\gamma_1 = \lambda_4$, $\gamma_2 = \lambda_5$)
Chi-squared divergence: Motivation
- Latent variable $\theta \in \Theta$
- Estimator $\hat{\theta} = T(X_1, \ldots, X_s)$ with $X_i \sim P_\theta$
- Inference Problem: parameter estimation
- Figure of Merit: variance of an estimator under quadratic loss (the Hammersley-Chapman-Robbins bound)
  $\mathrm{Var}_\theta(\hat{\theta}) \ge \sup_{\theta' \ne \theta} \frac{\left( \mathbb{E}_{\theta'}[\hat{\theta}] - \mathbb{E}_{\theta}[\hat{\theta}] \right)^2}{\chi^2(Q \,\|\, P)}$
  where $\theta' \in \Theta$, $P = P_\theta$, $Q = P_{\theta'}$
Chi-squared divergence: Results
Theorem
Define the permutation $\pi^*_{\chi^2} : [n] \to [n]$ as a solution to $\pi^*_{\chi^2} \triangleq \arg\max_{\pi} \chi^2(\lambda_{\pi(i)})$. To maximize
  $\chi^2(A) = \prod_{i=1}^{r} \frac{1}{\sqrt{\gamma_i (2 - \gamma_i)}} \; - \; 1$
1. The eigenvalues of $A \Sigma A^\top$ are given by $\gamma_i = \lambda_{\pi^*_{\chi^2}(i)}$.
2. Row $i$ of matrix $A$ is the eigenvector of $\Sigma$ associated with the eigenvalue $\gamma_i = \lambda_{\pi^*_{\chi^2}(i)}$.
[Figure: the per-eigenvalue $\chi^2$ factor $1/\sqrt{\gamma(2-\gamma)}$ as a function of $\gamma \in (0, 2)$; it diverges as $\gamma$ approaches 0 or 2.]
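
Because $\chi^2(A) + 1$ factors over the modes and each factor $1/\sqrt{\gamma(2-\gamma)}$ is at least 1, the same top-$r$ selection applies with a log-factor score. A sketch reusing `design_A` from the KL slide (assuming all eigenvalues lie in $(0, 2)$, where the Gaussian $\chi^2$ divergence is finite):

```python
import numpy as np

def chi2_score(lam):
    """Log of the per-mode factor 1/sqrt(gamma*(2-gamma)) of chi^2(A) + 1;
    assumes all eigenvalues lie in (0, 2) so the divergence is finite."""
    return -0.5 * np.log(lam * (2.0 - lam))

# usage: A = design_A(Sigma, r, chi2_score)
```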
Total Variation Distance: Motivation
[Figure: overlapping likelihoods under $H_0$ and $H_1$, with the Type-I and Type-II error regions shaded.]
- Inference Problem: hypothesis testing
  $H_0 : X \sim P \quad \text{vs.} \quad H_1 : X \sim Q$
- Figure of Merit: probability of error; for decision rules $d : \mathcal{X} \to \{H_0, H_1\}$,
  $\inf_{d} \; \Big[ \underbrace{P_A(d = H_1)}_{\text{Type-I error}} + \underbrace{Q_A(d = H_0)}_{\text{Type-II error}} \Big] = 1 - d_{\mathrm{TV}}(P_A, Q_A)$
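
This identity is easy to sanity-check on a toy discrete pair, where the optimal rule decides $H_1$ exactly where $Q$ outweighs $P$ (the numbers below are arbitrary):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])  # P
q = np.array([0.2, 0.3, 0.5])  # Q
d_tv = 0.5 * np.abs(p - q).sum()
# Type-I + Type-II error of the likelihood-ratio rule (decide H1 where q > p)
err = p[q > p].sum() + q[q <= p].sum()
assert np.isclose(err, 1.0 - d_tv)
```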
Total Variation Distance: Results
- No closed-form expression for $d_{\mathrm{TV}}$ exists for Gaussian models
- Instead, maximize matching bounds on $d_{\mathrm{TV}}$

Theorem
Define the permutation $\pi^*_{d_{\mathrm{TV}}} : [n] \to [n]$ as a solution to $\pi^*_{d_{\mathrm{TV}}} \triangleq \arg\max_{\pi} d_{\mathrm{TV}}(\lambda_{\pi(i)})$. To maximize the matching bounds on $d_{\mathrm{TV}}$,
  $\frac{1}{100} \;\le\; \frac{d_{\mathrm{TV}}(A)}{\min\left\{ 1, \sqrt{\sum_{i=1}^{r} \left( \frac{1}{\gamma_i} - 1 \right)^2} \right\}} \;\le\; \frac{3}{2}$
1. The eigenvalues of $A \Sigma A^\top$ are given by $\gamma_i = \lambda_{\pi^*_{d_{\mathrm{TV}}}(i)}$.
2. Row $i$ of matrix $A$ is the eigenvector of $\Sigma$ associated with the eigenvalue $\gamma_i = \lambda_{\pi^*_{d_{\mathrm{TV}}}(i)}$.
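
The bound is monotone in the sum inside the square root, so the same selection recipe applies. A sketch, again reusing `design_A` from the KL slide:

```python
def tv_score(lam):
    """Per-mode term (1/gamma - 1)^2 of the matching d_TV bounds; the bounds
    grow with the sum of these terms, so pick the r largest."""
    return (1.0 / lam - 1.0) ** 2

# usage: A = design_A(Sigma, r, tv_score)
```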
Squared Hellinger Distance: Motivation
- Inference Problem: hypothesis testing
  $H_0 : X \sim P \quad \text{vs.} \quad H_1 : X \sim Q$
- Figure of Merit: bound on the probability of error ($P_e$) for equal priors
  $P_e \le \frac{1}{2} \cdot \left( 2 - H^2(P, Q) \right)$
Squared Hellinger Distance: Results
Theorem
Define the permutation $\pi^*_{H^2} : [n] \to [n]$ as a solution to $\pi^*_{H^2} \triangleq \arg\max_{\pi} H^2(\lambda_{\pi(i)})$. To maximize
  $H^2(A) = 2 - 2 \prod_{i=1}^{r} \sqrt[4]{\frac{4 \gamma_i}{(\gamma_i + 1)^2}}$
1. The eigenvalues of $A \Sigma A^\top$ are given by $\gamma_i = \lambda_{\pi^*_{H^2}(i)}$.
2. Row $i$ of matrix $A$ is the eigenvector of $\Sigma$ associated with the eigenvalue $\gamma_i = \lambda_{\pi^*_{H^2}(i)}$.
[Figure: the per-eigenvalue $H^2$ term as a function of $\gamma \in (0, 5]$; it vanishes at $\gamma = 1$.]
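
$H^2(A)$ is maximized by minimizing the product of per-mode factors, each of which is at most 1 (by AM-GM, $(\gamma+1)^2 \ge 4\gamma$), so the score is the negative log-factor. A sketch reusing `design_A`:

```python
import numpy as np

def hellinger_score(lam):
    """Negative log of the per-mode factor (4*gamma/(gamma+1)^2)^(1/4);
    the factor peaks at gamma = 1, where the mode is least informative."""
    return -0.25 * np.log(4.0 * lam / (lam + 1.0) ** 2)

# usage: A = design_A(Sigma, r, hellinger_score)
```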
Numerical Evaluations (Quickest change-point detection)
- Max LDA: rows of $A$ are the eigenvectors associated with the largest eigenvalues of $\Sigma$
- $D_{\mathrm{KL}}$ LDA: rows of $A$ are the eigenvectors associated with the eigenvalues of $\Sigma$ that maximize $D_{\mathrm{KL}}(A)$
- Average detection delay (ADD) vs. reduced dimension $r$, for data dimension $n$ and fixed FAR
[Figure: three panels of ADD vs. reduced dimension $r$, comparing Max LDA and $D_{\mathrm{KL}}$ LDA at fixed FAR.]
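
The gap between the two designs appears whenever the spectrum straddles 1: eigenvalues far below 1 can carry more KL information than moderately large ones. A toy illustration with an assumed spectrum:

```python
import numpy as np

lam = np.array([4.0, 2.5, 1.2, 0.9, 0.05])  # assumed spectrum of Sigma, n = 5
kl = 0.5 * (lam - np.log(lam) - 1.0)        # per-mode KL scores

r = 2
max_lda = lam[np.argsort(lam)[::-1][:r]]    # Max LDA keeps [4.0, 2.5]
kl_lda = lam[np.argsort(kl)[::-1][:r]]      # D_KL LDA keeps [0.05, 4.0]
print(max_lda, kl_lda)
```

Here the $D_{\mathrm{KL}}$ design attains a total divergence of about 1.83 versus 1.10 for Max LDA; since $\mathrm{ADD} \sim c / D_{\mathrm{KL}}$, this translates directly into a smaller detection delay.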
Conclusions
- Linear dimensionality reduction for statistical inference problems
- Optimal designs of linear transformations that optimize $f$-divergence measures for Gaussian models
- The row space of the linear map is associated with the eigenspace of the covariance matrix
- In certain regimes, the design of the linear map is independent of the inference problem
