International Conference on Machine Learning (ICML), Long Beach, CA, June 10-15, 2019
Our Framework: Collaborative Learning
Data Batch

Given an affinity matrix W over N samples, the degree matrix D has entries d_{i,i} = ∑_j w_{i,j} and the graph Laplacian is L = D − W; spectral clustering then partitions the samples using the eigenvectors of L.
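As a small illustration of these definitions (not code from the paper), the Laplacian can be formed directly from W:

```python
import numpy as np

def graph_laplacian(W):
    """Unnormalized graph Laplacian L = D - W, with d_ii = sum_j w_ij."""
    D = np.diag(W.sum(axis=1))  # degree matrix
    return D - W
```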


min_C ∥C∥_p   s.t.   X = XC,  diag(C) = 0,

where X = [x_1, x_2, ⋯, x_N] ∈ ℝ^{D×N} stacks the N data points x_i ∈ ℝ^D as columns and C ∈ ℝ^{N×N} is the matrix of self-expression coefficients.
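For q = 1 this is a convex program. A minimal sketch (assuming noise-free data so the equality constraint is feasible, and using a generic cvxpy solver rather than any released implementation):

```python
import cvxpy as cp
import numpy as np

def sparse_self_expression(X):
    """min_C ||C||_1 s.t. X = XC, diag(C) = 0, for X in R^{D x N} (columns are points)."""
    N = X.shape[1]
    C = cp.Variable((N, N))
    objective = cp.Minimize(cp.sum(cp.abs(C)))      # entrywise l1 norm of C
    constraints = [X @ C == X, cp.diag(C) == 0]     # self-expression, no trivial identity solution
    cp.Problem(objective, constraints).solve()
    return C.value
```

For noisy data the equality constraint is typically replaced by a penalized residual term, as in the formulations used later in the deck.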
[Three panels plot the coefficient values of y_i over the 30 points for q = ∞, q = 2, and q = 1.]
Fig. 3. Three subspaces in R^3 with 10 data points in each subspace, ordered such that the first and the last 10 points belong to S1 and S3, respectively. The solution of the ℓq-minimization program in (3) for y_i lying in S1 is shown for q = 1, 2, ∞. Note that as the value of q decreases, the sparsity of the solution increases. For q = 1, the solution corresponds to choosing two other points lying in S1.
Different choices of q have different effects on the obtained solution. Typically, by decreasing the value of q from infinity toward zero, the sparsity of the solution increases, as shown in Figure 3. The extreme case of q = 0 corresponds to the general NP-hard problem [51] of finding the sparsest representation of the given point, as the ℓ0-norm counts the number of nonzero elements of the solution. Since we are interested in efficiently solving the problem, the convex ℓ1 relaxation (q = 1) is used, which, under appropriate conditions on the subspaces and the data, yields a subspace-sparse representation whose nonzero elements correspond to points from the same subspace as the given data point. This provides an immediate choice of the similarity matrix as W = |C| + |C|^⊤. In other words, each node i connects itself to a node j by an edge whose weight is equal to |c_ij| + |c_ji|. The reason for the symmetrization is that, in general, a data point y_i ∈ S_ℓ can write itself as a linear combination of some points including y_j ∈ S_ℓ; however, y_j need not choose y_i in its own representation, so C is not symmetric in general.
W = |C| + |C|^⊤, i.e., w_{i,j} = |c_{i,j}| + |c_{j,i}|, with W ∈ ℝ^{N×N}; W is symmetric even though, in general, C ≠ C^⊤.
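A compact sketch of this standard pipeline (symmetrize the coefficients into an affinity and run spectral clustering on it; the sklearn call is an illustration, not the implementation used in the papers discussed here):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_from_coefficients(C, n_clusters):
    """Build W = |C| + |C|^T and spectrally cluster the N points into n_clusters groups."""
    W = np.abs(C) + np.abs(C).T
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed").fit_predict(W)
    return labels
```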
Figure 2: Deep Subspace Clustering Networks: As an example, we show a deep subspace clustering network with three convolutional encoder layers, one self-expressive layer, and three deconvolutional decoder layers. During training, we first pre-train the deep auto-encoder without the self-expressive layer; we then fine-tune our entire network using this pre-trained model for initialization.

The self-expressive layer essentially lets us directly learn the affinity matrix via the network. Moreover, minimizing ∥C∥_p simply translates to adding a regularizer to the weights of the self-expressive layer. In this work, we consider two kinds of regularization on C: (i) the ℓ1 norm, resulting in a network denoted by DSC-Net-L1; (ii) the ℓ2 norm, resulting in a network denoted by DSC-Net-L2.

For notational consistency, let us denote the parameters of the self-expressive layer (which are just the elements of C) as Θ_s. As can be seen from Figure 2, we then take the input to the decoder part of our network to be the transformed latent representation Z_{Θ_e} Θ_s. This lets us rewrite our loss function as

L̃(Θ̃) = (1/2) ∥X − X̂_{Θ̃}∥_F^2 + λ_1 ∥Θ_s∥_p + (λ_2/2) ∥Z_{Θ_e} − Z_{Θ_e} Θ_s∥_F^2   s.t.  diag(Θ_s) = 0,   (4)

where the network parameters Θ̃ now consist of encoder parameters Θ_e, self-expressive layer parameters Θ_s, and decoder parameters Θ_d, and where the reconstructed data X̂ is now a function of {Θ_e, Θ_s, Θ_d} rather than just {Θ_e, Θ_d} in (3).
[Diagram: the latent codes z pass through the self-expressive layer to give z_C = zC; the learned C provides the affinity |C| + |C|^⊤.]
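A minimal PyTorch sketch of such a self-expressive layer and of the loss in Eq. (4); the initialization scale and the squared-Frobenius choice for p = 2 are assumptions, not the authors' exact settings:

```python
import torch
import torch.nn as nn

class SelfExpressiveLayer(nn.Module):
    """Theta_s = C: an N x N coefficient matrix applied to the latent codes of a batch of N samples."""
    def __init__(self, n_samples):
        super().__init__()
        self.C = nn.Parameter(1e-4 * torch.randn(n_samples, n_samples))

    def forward(self, z):
        # z: (d, N), samples as columns, so the self-expression is Z C as in Eq. (4)
        C = self.C - torch.diag(torch.diag(self.C))   # enforce diag(Theta_s) = 0
        return z @ C, C

def dsc_loss(x, x_hat, z, z_c, C, lam1=1.0, lam2=1.0, p=2):
    """Sketch of Eq. (4): reconstruction + lam1 * ||C||_p + (lam2/2) * self-expression residual."""
    recon = 0.5 * (x - x_hat).pow(2).sum()
    reg = C.abs().sum() if p == 1 else C.pow(2).sum()   # l1 or squared Frobenius (assumed)
    residual = 0.5 * (z - z_c).pow(2).sum()
    return recon + lam1 * reg + lam2 * residual
```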
Neural Collaborative Subspace Clustering


For each sample in the batch, the classification module outputs a prediction vector ν_i ∈ ℝ^k (k-dimensional, ℓ2-normalized after the softmax), and the classification affinity is

A_c(i, j) = ν_i ν_j^⊤.   (1)

Figure 2. By normalizing the feature vectors after the softmax function and computing their inner product, an affinity matrix can be generated to encode the clustering information.

Ideally, when ν_i is one-hot, A_c is a binary matrix encoding the confidence of data points belonging to the same cluster, so A_c ∈ ℝ^{B×B}, computed over a batch of B samples, can in turn be used to supervise the classifier.
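A one-function sketch of this affinity (the softmax-then-ℓ2-normalize order follows the caption above; this is an illustration, not the released code):

```python
import torch
import torch.nn.functional as F

def classification_affinity(logits):
    """A_c = V V^T, where each row of V is the l2-normalized softmax prediction of one sample."""
    v = F.normalize(F.softmax(logits, dim=1), p=2, dim=1)   # (B, k)
    return v @ v.t()                                        # (B, B), entries in [0, 1]
```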
Neural Collaborative Subspace Clustering
C* = argmin_C ∥C∥_p + (λ/2) ∥Z − ZC∥_F^2   s.t.  diag(C) = 0,

A_s(i, j) = (|c*_{i,j}| + |c*_{j,i}|) / (2 c_max)  if i ≠ j,   and  A_s(i, i) = 1,

where Z collects the latent representations of the batch, λ is a trade-off parameter, A_s ∈ ℝ^{B×B} is the subspace affinity, and c_max denotes the largest absolute off-diagonal entry of C*.
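A sketch of this normalization (taking c_max as the largest off-diagonal |c*| is an assumption, consistent with keeping A_s in [0, 1]):

```python
import torch

def subspace_affinity(C, eps=1e-12):
    """A_s(i,j) = (|c_ij| + |c_ji|) / (2 c_max) off the diagonal, A_s(i,i) = 1."""
    absC = C.abs()
    off = absC - torch.diag(torch.diag(absC))   # drop the (zero) diagonal
    c_max = off.max().clamp_min(eps)            # assumed: largest off-diagonal |c*|
    A = (off + off.t()) / (2.0 * c_max)
    A.fill_diagonal_(1.0)
    return A
```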

[Diagram: the two batch affinities, the subspace affinity A_s and the classification affinity A_c, are trained to supervise each other.]

Collaborative Learning
• The subspace affinity is more confident in identifying samples from the same class, so its high-confidence entries supervise the classification affinity on positive pairs; conversely, the low-confidence entries of the classification affinity supervise the subspace affinity on negative pairs:

min_{A_s, A_c} Ω(A_s, A_c, l, u) = L_pos(A_s, A_c, u) + α L_neg(A_c, A_s, l)

L_pos(A_s, A_c, u) = H(M_s ∥ A_c)   s.t.  M_s = 𝔼(A_s > u)

L_neg(A_c, A_s, l) = H(M_c ∥ (1 − A_s))   s.t.  M_c = 𝔼(A_c < l)

Here α weights the negative term, u and l are the upper and lower confidence thresholds, 𝔼(·) is applied entry-wise and returns 1 where the condition holds and 0 otherwise, and H is the cross-entropy

H(p ∥ q) = −∑_j p_j log(q_j).
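A sketch of this objective in PyTorch; the hard 0/1 masks, the clamping, and the default thresholds are implementation assumptions:

```python
import torch

def collaborative_loss(A_s, A_c, u=0.7, l=0.1, alpha=1.0, eps=1e-12):
    """Omega = L_pos + alpha * L_neg with entry-wise masks M_s = [A_s > u], M_c = [A_c < l]."""
    M_s = (A_s > u).float()   # confident positives taken from the subspace affinity
    M_c = (A_c < l).float()   # confident negatives taken from the classification affinity
    L_pos = -(M_s * torch.log(A_c.clamp_min(eps))).sum()
    L_neg = -(M_c * torch.log((1.0 - A_s).clamp_min(eps))).sum()
    return L_pos + alpha * L_neg
```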
The full training objective combines the subspace (auto-encoder plus self-expression) loss with the collaborative term:

𝕃(X; Θ) = L_sub(X; Θ, A_s) + λ_cl Ω(X, u, l; Θ, A_s, A_c)

L_sub(X; Θ, A_s) = ∥C∥_F^2 + (λ/2) ∥Z − ZC∥_F^2 + (1/2) ∥X − X̂∥_F^2   s.t.  diag(C) = 0

Ω(X, u, l; Θ, A_s, A_c) = L_pos(A_s, A_c, u) + α L_neg(A_c, A_s, l)

Here 𝕃 is the overall loss, Θ the network parameters, X the input batch and X̂ its reconstruction, Z the latent representation, C the self-expression coefficients, and λ_cl the weight of the collaborative term.
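Putting the pieces together, a self-contained sketch of 𝕃 (shapes and default weights are assumptions; it simply restates L_sub and Ω above):

```python
import torch

def ncsc_objective(x, x_hat, z, C, A_s, A_c,
                   lam=1.0, lam_cl=1.0, u=0.7, l=0.1, alpha=1.0, eps=1e-12):
    """L(X; Theta) = L_sub + lam_cl * Omega.
    Assumed shapes: x, x_hat (D, B); z (d, B); C, A_s, A_c (B, B)."""
    L_sub = C.pow(2).sum() \
          + 0.5 * lam * (z - z @ C).pow(2).sum() \
          + 0.5 * (x - x_hat).pow(2).sum()
    M_s = (A_s > u).float()
    M_c = (A_c < l).float()
    Omega = -(M_s * torch.log(A_c.clamp_min(eps))).sum() \
            - alpha * (M_c * torch.log((1.0 - A_s).clamp_min(eps))).sum()
    return L_sub + lam_cl * Omega
```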
The cluster assignment of sample i is read off the classifier output as s_i = argmax_h v_{i,h}, h = 1, 2, ⋯, k, where v_i ∈ ℝ^k is its prediction vector and k the number of clusters.
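In code this final step is a single argmax (a trivial sketch, shown for completeness):

```python
import torch

def cluster_assignments(v):
    """s_i = argmax_h v_{i,h} for a batch of prediction vectors v of shape (B, k)."""
    return v.argmax(dim=1)
```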
Figure 1. The Neural Collaborative Subspace Clustering framework. The affinity matrix generated by the self-expressive layer, A_s, and the affinity matrix produced by the classification module, A_c, supervise each other during training.
On MNIST, the model is a convolutional auto-encoder combined with the classification module; training finishes in about 5 minutes on a GPU.
ACC(%) NMI(%) ARI(%)
CAE-KM 51.00 44.87 33.52
SAE-KM 81.29 73.78 67.00
KM 53.00 50.00 37.00
DEC 84.30 80.00 75.00
DCN 83.31 80.86 74.87
SSC-CAE 43.03 56.81 28.58
LRR-CAE 55.18 66.54 40.57
KSSC-CAE 58.48 67.74 49.38
DSC-Net 65.92 73.00 57.09
k-SCN 87.14 78.15 75.81
Ours 94.09 86.12 87.52
Table 1. Clustering results of different methods on MNIST. For
all quantitative metrics, the larger the better. The best results are
shown in bold.
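The tables report accuracy (ACC), normalized mutual information (NMI), and adjusted Rand index (ARI). A sketch of how such metrics are commonly computed (standard definitions, not the paper's evaluation script): clustering accuracy uses the best label permutation via the Hungarian algorithm, and NMI/ARI come from sklearn.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one matching between predicted clusters and ground-truth labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    rows, cols = linear_sum_assignment(-cost)   # maximize matched counts
    return cost[rows, cols].sum() / y_true.size

def clustering_metrics(y_true, y_pred):
    return (clustering_accuracy(y_true, y_pred),
            normalized_mutual_info_score(y_true, y_pred),
            adjusted_rand_score(y_true, y_pred))
```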
The encoder is built without batch normalization, and the decoder mirrors it with a symmetric structure.
Figure 3. Data samples from the Fashion-MNIST dataset.
ACC(%) NMI(%) ARI(%)
SAE-KM 54.35 58.53 41.86
CAE-KM 39.84 39.80 25.93
KM 47.58 51.24 34.86
DEC 59.00 60.10 44.60
DCN 58.67 59.4 43.04
DAC 61.50 63.20 50.20
ClusterGAN 63.00 64.00 -
InfoGAN 61.00 59.00 44.20
SSC-CAE 35.87 18.10 13.46
LRR-CAE 34.48 25.41 10.33
KSSC-CAE 38.17 19.73 14.74
DSC-Net 60.62 61.71 48.20
k-SCN 63.78 62.04 48.04
Ours 72.14 68.60 59.17
Table 2. Clustering results of different methods on Fashion-MNIST. For all quantitative metrics, the larger the better. The best results are shown in bold.
Figure 4. Visualization of our latent space through dimension reduction by PCA.

Data samples from the Stanford Online Products dataset are shown in Fig. 5. The network for this dataset starts from a single convolutional layer with 10 channels, followed by three residual blocks without batch normalization, with 30 and 10 channels respectively. The performance of all algorithms on this dataset is reported in Table 3.

ACC (%) NMI (%) ARI (%)
ACC (%) NMI (%) ARI (%)
DEC 22.89 12.10 3.62
DCN 21.30 8.40 3.14
DAC 23.10 9.80 6.15
InfoGAN 19.76 8.15 3.79
SSC-CAE 12.66 0.73 0.19
LRR-CAE 22.35 17.36 4.04
KSSC-CAE 26.84 15.17 7.48
DSC-Net 26.87 14.56 8.75
k-SCN 22.91 16.57 7.27
Ours 27.5 13.78 7.69
Table 3. Clustering results of different algorithms on a subset of Stanford Online Products. The best results are shown in bold.
Figure 5. Data samples from the Stanford Online Products dataset.
In summary, compared to other deep learning methods, our framework is not sensitive to the architecture of neural networks, as long as the dimensionality meets the requirement of subspace self-expressiveness. Furthermore, the two modules in our network progressively improve the performance in a collaborative way, which is both effective and efficient.
