On The Equivalent of Low-Rank Linear Regressions and Linear Discriminant Analysis Based Regressions
Xiao Cai, Chris Ding, Feiping Nie, Heng Huang CSE Department, The University of Texas at Arlington
xiao.cai@mavs.uta.edu, chqding@uta.edu, feipingnie@gmail.com, heng@uta.edu
Problem
Multivariate linear regression attempts to model the relationship between predictors and responses by fitting a linear equation to observed data. Such linear regression models suffer from two deficiencies. On one hand, linear regression models usually perform poorly on high-dimensional data. To perform accurate regression or classification tasks on such data, we have to collect an enormous number of samples. However, due to the difficulty of collecting data and labels, we often cannot obtain enough samples and suffer from the curse-of-dimensionality problem [1]. On the other hand, linear regression models do not exploit the correlations among different responses: standard least squares regression is equivalent to regressing each response on the predictors separately.
Our Key Contributions
(1) We prove that low-rank linear regression is equivalent to doing linear regression in the LDA subspace.
(2) We derive global and concise algorithms for low-rank regression models.
(3) We show the connection between low-rank regression and regularized LDA. Both theory and experiments indicate that low-rank ridge regression outperforms the low-rank linear regression used in many existing studies.
(4) To solve the related feature selection problem, we propose the sparse low-rank regression method, which exploits both class/task correlations and feature structures.
References
[1] D. Donoho. High-dimensional data analysis: The curses and blessings of dimensionality. AMS Math Challenges Lecture, pages 1-32, 2000.
[2] T. Anderson. Estimating linear restrictions on regression coefficients for multivariate normal distributions. AMS, pages 327-351, 1951.
Linear Low-Rank Regression And LDA + LR
The traditional linear regression model for classification solves the following problem:
$$\min_{W} \|Y - X^T W\|_F^2, \quad (1)$$
where $X = [x_1, x_2, \ldots, x_n] \in \Re^{d \times n}$ is the centered training data matrix and $Y \in \Re^{n \times k}$ is the normalized class indicator matrix, i.e. $Y_{ij} = 1/\sqrt{n_j}$ if the $i$-th data point belongs to the $j$-th class and $Y_{ij} = 0$ otherwise, where $n_j$ is the sample size of the $j$-th class.
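The construction of the normalized indicator matrix $Y$ and the full-rank solution of Eq. (1) can be sketched numerically; this is a minimal illustration in NumPy (the helper `class_indicator` and the toy data are mine, not from the poster):

```python
import numpy as np

def class_indicator(labels, k):
    """Build the normalized class indicator Y (n x k):
    Y[i, j] = 1/sqrt(n_j) if sample i is in class j, else 0."""
    n = len(labels)
    Y = np.zeros((n, k))
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        Y[idx, j] = 1.0 / np.sqrt(len(idx))
    return Y

# Toy data: n = 6 samples, d = 3 features, k = 2 classes.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 6))          # d x n data matrix
X -= X.mean(axis=1, keepdims=True)       # center the data
labels = np.array([0, 0, 0, 1, 1, 1])
Y = class_indicator(labels, k=2)

# Full-rank least squares: W = argmin ||Y - X^T W||_F^2
W, *_ = np.linalg.lstsq(X.T, Y, rcond=None)
print(W.shape)  # (3, 2), i.e. d x k
```

Each column of $Y$ has unit Euclidean norm by construction, which is what makes $Y Y^T$ behave like a class-averaging operator later in the derivation.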
When the class or task number is large, there are often underlying correlation structures between classes or tasks. To incorporate the response correlations into the regression method [2], we propose the following discriminant Low-Rank Linear Regression (LRLR) formulation:
$$\min_{A,B} \|Y - X^T A B\|_F^2, \quad (2)$$
where $A \in \Re^{d \times s}$, $B \in \Re^{s \times k}$, $s < \min(n, k)$. Thus $W = AB$ has low rank $s$.
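The factorization $W = AB$ enforces the rank constraint structurally, since a product of a $d \times s$ and an $s \times k$ matrix can have rank at most $s$. A quick numerical check (dimensions chosen arbitrarily for illustration):

```python
import numpy as np

# A (d x s) times B (s x k) constrains rank(W) <= s.
rng = np.random.default_rng(1)
d, s, k = 8, 2, 5
A = rng.standard_normal((d, s))
B = rng.standard_normal((s, k))
W = A @ B                                # d x k, but rank <= s
print(np.linalg.matrix_rank(W))  # 2
```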
Theorem 1 The low-rank linear regression method of Eq. (2) is identical to doing standard linear regression in the LDA subspace.
Proof: Denoting $J_1(A, B) = \|Y - X^T A B\|_F^2$ and taking its derivative w.r.t. $B$, we have
$$\frac{\partial J_1(A, B)}{\partial B} = -2 A^T X Y + 2 A^T X X^T A B. \quad (3)$$
Setting Eq. (3) to zero, we obtain
$$B = (A^T X X^T A)^{-1} A^T X Y. \quad (4)$$
Substituting Eq. (4) back into Eq. (2), we have
$$\min_{A} \|Y - X^T A (A^T X X^T A)^{-1} A^T X Y\|_F^2, \quad (5)$$
which is equivalent to
$$\max_{A} \mathrm{Tr}\left((A^T (X X^T) A)^{-1} A^T X Y Y^T X^T A\right). \quad (6)$$
Note that
$$S_t = X X^T, \quad S_b = X Y Y^T X^T, \quad (7)$$
where $S_t$ and $S_b$ are the total scatter matrix and the between-class scatter matrix defined in LDA, respectively. Therefore, the solution of Eq. (6) can be written as
$$A^* = \arg\max_{A} \mathrm{Tr}\left[(A^T S_t A)^{-1} A^T S_b A\right], \quad (8)$$
which is exactly the problem of LDA.
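The step from Eq. (5) to Eq. (6) rests on the identity $\|Y - X^T A B\|_F^2 = \|Y\|_F^2 - \mathrm{Tr}\left((A^T S_t A)^{-1} A^T S_b A\right)$ when $B$ takes the closed form of Eq. (4), so minimizing Eq. (5) over $A$ is the same as maximizing the trace term. This can be checked numerically; the sketch below uses arbitrary random data and an arbitrary $A$:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, k, s = 5, 20, 4, 2
X = rng.standard_normal((d, n))
X -= X.mean(axis=1, keepdims=True)       # centered data
Y = rng.standard_normal((n, k))
A = rng.standard_normal((d, s))

St = X @ X.T                             # Eq. (7): total scatter
Sb = X @ Y @ Y.T @ X.T                   # Eq. (7): between-class scatter

# Eq. (4): optimal B for this fixed A
M = np.linalg.inv(A.T @ St @ A)
B = M @ A.T @ X @ Y

# Identity behind the Eq. (5) -> Eq. (6) step:
# ||Y - X^T A B||_F^2 = ||Y||_F^2 - Tr((A^T St A)^{-1} A^T Sb A)
lhs = np.linalg.norm(Y - X.T @ A @ B, 'fro') ** 2
rhs = np.linalg.norm(Y, 'fro') ** 2 - np.trace(M @ A.T @ Sb @ A)
print(abs(lhs - rhs) < 1e-8)  # True
```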
Two Extensions: LRRR, SLRR
Theorem 2 The proposed Low-Rank Ridge Regression (LRRR) method $\min_{A,B} \|Y - X^T A B\|_F^2 + \lambda \|AB\|_F^2$ is equivalent to doing regularized regression in the regularized LDA subspace.
Theorem 3 The optimal solution of the proposed SLRR method $\min_{A,B} \|Y - X^T A B\|_F^2 + \lambda \|AB\|_{2,1}$ has the same column space as a special regularized LDA.
Algorithms
The algorithm for LRLR, LRRR, or SLRR
Input:
1. The centered training data $X \in \Re^{d \times n}$.
2. The normalized training indicator matrix $Y \in \Re^{n \times k}$.
3. The low-rank parameter $s$.
4. If LRRR or SLRR, the regularization parameter $\lambda$.
Output:
1. The matrices $A \in \Re^{d \times s}$ and $B \in \Re^{s \times k}$.
Process:
IF LRLR:
  Calculate $A$ by Eq. (8).
  Calculate $B$ by Eq. (4).
ELSE IF LRRR:
  Calculate $A$ by $A^* = \arg\max_{A} \mathrm{Tr}\left((A^T (S_t + \lambda I) A)^{-1} A^T S_b A\right)$.
  Calculate $B$ by $B = (A^T (X X^T + \lambda I) A)^{-1} A^T X Y$.
ELSE IF SLRR:
  Initialization:
  1. Set $t = 0$.
  2. Initialize $D^{(t)} = I \in \Re^{d \times d}$.
  Repeat:
  1. Calculate $A^{(t+1)}$ by $A^* = \arg\max_{A} \mathrm{Tr}\left((A^T (S_t + \lambda D^{(t)}) A)^{-1} A^T S_b A\right)$.
  2. Calculate $B^{(t+1)}$ by $B = (A^T (X X^T + \lambda D^{(t)}) A)^{-1} A^T X Y$.
  3. Update the diagonal matrix $D^{(t+1)} \in \Re^{d \times d}$, whose $i$-th diagonal element is $\frac{1}{2\|(A^{(t+1)} B^{(t+1)})^i\|_2}$, where $(AB)^i$ is the $i$-th row of $AB$.
  4. Update $t = t + 1$.
Until convergence.
END
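The LRLR and LRRR branches above reduce to one generalized eigenproblem plus one linear solve. A minimal sketch in NumPy/SciPy (the function name `lrlr` and the toy data are mine; the SLRR branch would wrap the same two steps in the reweighting loop over $D$):

```python
import numpy as np
from scipy.linalg import eigh

def lrlr(X, Y, s, lam=0.0):
    """Low-rank (ridge) regression via the (regularized) LDA subspace.

    X: centered data, d x n.  Y: normalized indicator, n x k.
    s: target rank.  lam: ridge parameter (0 gives plain LRLR,
    but then St = X X^T must be nonsingular, i.e. d <= n).
    Returns A (d x s) and B (s x k) with W = A @ B.
    """
    d = X.shape[0]
    St = X @ X.T + lam * np.eye(d)
    Sb = X @ Y @ Y.T @ X.T
    # A: top-s generalized eigenvectors of Sb v = w St v,
    # which maximizes Tr((A^T St A)^{-1} A^T Sb A).
    w, V = eigh(Sb, St)                  # eigenvalues ascending
    A = V[:, np.argsort(w)[::-1][:s]]
    # B from the closed form of Eq. (4), with the ridge term folded in.
    B = np.linalg.solve(A.T @ St @ A, A.T @ X @ Y)
    return A, B

# Toy usage: d = 4, n = 30, k = 3 classes of 10 samples each.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 30))
X -= X.mean(axis=1, keepdims=True)
labels = np.repeat([0, 1, 2], 10)
Y = np.zeros((30, 3))
for j in range(3):
    idx = labels == j
    Y[idx, j] = 1.0 / np.sqrt(idx.sum())
A, B = lrlr(X, Y, s=2, lam=1e-3)
print(A.shape, B.shape)  # (4, 2) (2, 3)
```

Note that `scipy.linalg.eigh(Sb, St)` requires `St` to be positive definite, which is why a small `lam` is used in the example.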
Experiment Data Summary
Dataset  k (#classes)  d (#features)  n (#samples)
UMIST    20            10304          575
BIN36    36            320            1404
BIN26    26            320            1014
VOWEL    11            10             990
MNIST    10            784            150
JAFFE    10            1024           213
Experiment Results
The average classification accuracy vs. the rank, using 5-fold cross-validation on six datasets; low rank is marked in red and full rank in blue. Left column: linear regression; middle column: ridge regression; right column: sparse regression.
[Figure: accuracy-vs-rank curves, x-axis: the rank s; y-axis: classification accuracy. Panels: (a) UMIST linear regression; (b) UMIST ridge regression; (c) UMIST sparse linear regression; (d) VOWEL linear regression; (e) VOWEL ridge regression; (f) VOWEL sparse linear regression; (g) MNIST linear regression; (h) MNIST ridge regression; (i) MNIST sparse linear regression; (j) JAFFE linear regression; (k) JAFFE ridge regression; (l) JAFFE sparse linear regression; (m) BINALPHA36 linear regression; (n) BINALPHA36 ridge regression; (o) BINALPHA36 sparse linear regression; (p) BINALPHA26 linear regression; (q) BINALPHA26 ridge regression; (r) BINALPHA26 sparse linear regression.]
Demonstration of the low-rank structure and sparse
structure found by our proposed SLRR method.
[Figure: for each dataset, the left panel plots the singular values of the learned weight matrix (x-axis: index of singular value; y-axis: singular value) and the right panel shows the absolute weight coefficients (x-axis: index of class; y-axis: feature index). Panels: (a) UMIST; (b) VOWEL; (c) MNIST; (d) JAFFE; (e) BINALPHA36; (f) BINALPHA26 low-rank structure and sparse structure.]