ENBIS 2018 presentation on Deep k-Means

Deep k-Means: Jointly Clustering with k-Means and Learning
Representations
Thibaut THONET
thibaut.thonet@univ-grenoble-alpes.fr
Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG
Joint work with Maziar MORADI FARD and Eric GAUSSIER
5 September 2018 @ ENBIS, Nancy
Thibaut Thonet Deep k-Means

Clustering
Clustering is the process of organizing unlabeled objects into groups (clusters)
whose members are similar in some way
Clustering approaches may be classified as:
Hard clustering: each object belongs to at most one cluster
Soft clustering: each object can belong to more than one cluster

k-Means clustering
k-Means is a centroid-based approach for hard clustering [MacQueen, 1967].
Given a set of objects X, k-Means clustering aims to group the objects into k clusters
of similar samples by minimizing the following loss function:
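$$\min_{R} \sum_{x \in X} \| x - c(x; R) \|_2^2$$
where R are the cluster centers and $c(x; R) = \arg\min_{r \in R} \| x - r \|_2$ is the nearest cluster
center to x
[Figure: the two alternating steps of k-Means with cluster centers r1 and r2: assign objects to clusters, update cluster centers]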

...But the input space is often high-dimensional, sparse, and/or contains redundant
dimensions
⟹ It may not be suitable for clustering
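The two alternating steps of k-Means named above (assign objects to clusters, update cluster centers) can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the code used in the talk; the function names, the toy `points`, and the choice to initialize the centers from randomly sampled data points are assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(points, k, n_iters=100, seed=0):
    """Alternate between assigning objects and updating cluster centers."""
    rng = random.Random(seed)
    # Assumption: initialize the centers R from k distinct data points.
    centers = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(n_iters):
        # Step 1: assign each object to its nearest cluster center c(x; R).
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(x, centers[j]))
            for x in points
        ]
        if new_assignments == assignments:
            break  # assignments are stable, so the centers would not move
        assignments = new_assignments
        # Step 2: move each center to the mean of the objects assigned to it.
        for j in range(k):
            members = [x for x, a in zip(points, assignments) if a == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignments


def k_means_loss(points, centers):
    """Sum over x of ||x - c(x; R)||_2^2, the loss minimized by k-Means."""
    return sum(min(squared_distance(x, c) for c in centers) for x in points)


# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assignments = k_means(points, k=2)
print(assignments, k_means_loss(points, centers))
```

Each of the two steps can only lower the loss or leave it unchanged, which is why the loop can stop as soon as the assignments no longer change.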

k-Means in an embedded space: Auto-Encoder + k-Means
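1. Train an auto-encoder on the dataset to learn object embeddings (e.g., for text,
low-dimensional dense representations):
$$\min_{\theta} \sum_{x} L_{rec}(x) = \min_{\theta} \sum_{x} \| x - \mathrm{Auto}(x) \|_2^2$$
2. Perform k-Means in the embedding space:
$$\min_{R} \sum_{x} L_{clust}(x) = \min_{R} \sum_{x} \| h_\theta(x) - c(h_\theta(x); R) \|_2^2$$
with $c(h_\theta(x); R) = \arg\min_{r \in R} \| h_\theta(x) - r \|_2$
[Figure: auto-encoder mapping x to its reconstruction Auto(x) through the embedding h_θ(x); k-Means with cluster representatives r1 and r2 is run on the embeddings]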

...But embeddings are not specifically learned for clustering purposes
⟹ They may still not be suitable for clustering

k-Means in an embedded space: Deep Clustering Network
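The Deep Clustering Network (DCN) [Yang+, 2017] alternately (i) learns cluster
representatives R and auto-encoder parameters θ using SGD and (ii) assigns data
points to the cluster with the nearest representative in the embedding space
$$L_{rec}(x) = \| x - \mathrm{Auto}(x) \|_2^2$$
$$L_{clust}(x) = \| h_\theta(x) - c(h_\theta(x); R) \|_2^2$$
with $c(h_\theta(x); R) = \arg\min_{r \in R} \| h_\theta(x) - r \|_2$
$$\min_{R, \theta} L = \sum_{x} \left[ L_{rec}(x) + \lambda L_{clust}(x) \right]$$
[Figure: DCN architecture: auto-encoder mapping x to Auto(x) through the embedding h_θ(x), with cluster representatives r1 and r2 in the embedding space]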

...But it is impossible to rely solely on SGD because of the discrete assignments (argmin)
⟹ Non-joint and less scalable training

Deep k-means: overview
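$$L_{clust}(x) = \sum_{k} \mathrm{closeness}(h_\theta(x), r_k) \times \| h_\theta(x) - r_k \|_2^2$$
$$\min_{R, \theta} L = \sum_{x} \left[ L_{rec}(x) + \lambda L_{clust}(x) \right]$$
[Figure: auto-encoder mapping x to Auto(x) through the embedding h_θ(x), with cluster representatives r1 and r2 in the embedding space]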

Deep k-means: a differentiable surrogate to DCN
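We propose to solve a fully differentiable surrogate of DCN’s problem [Moradi Fard+,
2018]:
$$P^{(\alpha)}_{DKM}: \quad \min_{R, \theta} L^{(\alpha)} = \sum_{x \in X} \left[ L_{rec}(x) + \lambda L^{(\alpha)}_{clust}(x) \right]$$
with
$$L^{(\alpha)}_{clust}(x) = \sum_{r \in R} \mathrm{closeness}(h_\theta(x), r; \alpha) \times \| h_\theta(x) - r \|_2^2$$
such that:
closeness(h_θ(x), r; α) is differentiable wrt both θ and r
$$\lim_{\alpha \to \infty} \mathrm{closeness}(h_\theta(x), r; \alpha) = \begin{cases} 1 & \text{if } r = \arg\min_{r' \in R} \| h_\theta(x) - r' \|_2 \\ 0 & \text{otherwise} \end{cases}$$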

Intuitively, closeness(h_θ(x), r; α) can be seen as a relaxation of DCN’s hard
clustering assignments such that $\lim_{\alpha \to \infty} P^{(\alpha)}_{DKM} = P_{DCN}$ holds
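The limiting behaviour of closeness can be checked numerically. The sketch below assumes one particular closeness function, a softmax over the negated squared distances scaled by α; the function name `closeness`, the toy point `h`, and the representatives `R` are illustrative, and subtracting the minimum distance before exponentiating is only a numerical-stability choice made here.

```python
import math


def closeness(h, representatives, alpha):
    """Softmax over -alpha * squared distances: one weight per representative."""
    dists = [sum((hi - ri) ** 2 for hi, ri in zip(h, r)) for r in representatives]
    d_min = min(dists)  # shifting by the minimum leaves the softmax unchanged
    exps = [math.exp(-alpha * (d - d_min)) for d in dists]
    total = sum(exps)
    return [e / total for e in exps]


h = (0.0, 0.0)                              # an embedded point h_theta(x)
R = [(1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]    # three cluster representatives

for alpha in (0.0, 1.0, 1000.0):
    print(alpha, [round(w, 3) for w in closeness(h, R, alpha)])
```

With α = 0 every representative receives the same weight, while for a large α the weight concentrates on the nearest representative, which recovers the hard (argmin) assignment.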

Deep k-means: choice of closeness and α
We define closeness as a parameterized softmax:
$$\mathrm{closeness}(h_\theta(x), r; \alpha) = \frac{\exp(-\alpha \| h_\theta(x) - r \|_2^2)}{\sum_{r' \in R} \exp(-\alpha \| h_\theta(x) - r' \|_2^2)}$$
where α can be either set as a constant or progressively increased (deterministic
annealing)

α plays two roles: (a) approximation of hard clustering and (b) inverse temperature
in a deterministic annealing scheme
DKM^a: random initialization of θ and R + annealing: sequence $(\alpha_n)_n$ with
$\alpha_1 = 0.1$
DKM^p: pretraining of θ and k-means-based initialization of R + no annealing:
constant α = 1000
where the sequence $(\alpha_n)_n$ is defined as $\alpha_{n+1} = 2^{1 / \log(n)^2} \times \alpha_n$
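The annealing sequence can be generated with a short loop. Taken literally, the recursion is undefined at n = 1 because log(1) = 0, so this sketch makes two assumptions that are not stated on the slide: the logarithm is the natural one, and the index inside the logarithm is shifted by one (log(n + 1)). The function name `annealing_schedule` and the number of levels are illustrative.

```python
import math


def annealing_schedule(n_levels, alpha_1=0.1):
    """Return n_levels increasing values of alpha, starting from alpha_1."""
    alphas = [alpha_1]
    for n in range(1, n_levels):
        # Assumption: use log(n + 1) rather than log(n) to avoid log(1) = 0.
        factor = 2.0 ** (1.0 / math.log(n + 1) ** 2)
        alphas.append(factor * alphas[-1])
    return alphas


alphas = annealing_schedule(n_levels=10)  # 10 levels is an arbitrary choice
for level, alpha in enumerate(alphas, start=1):
    print(level, alpha)
```

Because the exponent 1 / log(n)^2 shrinks as n grows, the multiplicative factor tends to 1: α rises quickly over the first levels and more slowly afterwards.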

Deep k-means: SGD-based training algorithm
Algorithm 1 Deep k-means
Input: data X, number of clusters K, trade-off hyperparameter λ, scheme for α,
number of epochs T, number of minibatches N, learning rate η
Output: autoencoder parameters θ, cluster representatives R
Initialize θ and r_k, 1 ≤ k ≤ K (randomly or through pretraining)
for each α do                      # α levels (if α not constant)
    for t = 1 to T do              # epochs per α
        for n = 1 to N do          # minibatches
            Draw a minibatch $\tilde{X} \subset X$
            Update $(\theta, R) \leftarrow (\theta, R) - \eta \frac{1}{|\tilde{X}|} \nabla_{(\theta, R)} \tilde{L}^{(\alpha)}$
        end for
    end for
end for
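The loop structure of Algorithm 1 can be illustrated with a deliberately simplified sketch. It is not the full method: the embeddings are held fixed (there is no auto-encoder, so θ and L_rec are dropped) and only the cluster representatives R are updated, using a hand-derived gradient of the softmax-weighted clustering loss. All function names, the toy data, and the hyperparameter values are assumptions made for the example.

```python
import math
import random


def soft_clustering_loss_and_grad(h, reps, alpha):
    """Return L_clust(h) and its gradient with respect to each representative."""
    dists = [sum((hi - ri) ** 2 for hi, ri in zip(h, r)) for r in reps]
    d_min = min(dists)
    exps = [math.exp(-alpha * (d - d_min)) for d in dists]
    total = sum(exps)
    weights = [e / total for e in exps]  # softmax closeness
    loss = sum(w * d for w, d in zip(weights, dists))
    grads = []
    for r, w, d in zip(reps, weights, dists):
        # Chain rule: dL/dd_k = w_k * (1 + alpha * (L - d_k)),
        # and dd_k/dr_k = -2 * (h - r_k).
        coeff = -2.0 * w * (1.0 + alpha * (loss - d))
        grads.append([coeff * (hi - ri) for hi, ri in zip(h, r)])
    return loss, grads


def train_representatives(data, reps, alphas, n_epochs, n_batches,
                          batch_size, lr, seed=0):
    """Minibatch gradient descent on the representatives only."""
    rng = random.Random(seed)
    reps = [list(r) for r in reps]
    for alpha in alphas:                      # alpha levels
        for _ in range(n_epochs):             # epochs per alpha
            for _ in range(n_batches):        # minibatches
                batch = rng.sample(data, batch_size)
                step = [[0.0] * len(r) for r in reps]
                for h in batch:
                    _, grads = soft_clustering_loss_and_grad(h, reps, alpha)
                    for k, g in enumerate(grads):
                        for i, gi in enumerate(g):
                            step[k][i] += gi / batch_size  # minibatch average
                for k in range(len(reps)):
                    reps[k] = [ri - lr * si for ri, si in zip(reps[k], step[k])]
    return reps


def total_loss(data, reps, alpha):
    """Sum of the soft clustering loss over all points."""
    return sum(soft_clustering_loss_and_grad(h, reps, alpha)[0] for h in data)


# Toy "embeddings": two well-separated groups of three points each.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
init = [(2.0, 2.0), (8.0, 8.0)]
alpha = 0.5
trained = train_representatives(data, init, [alpha], n_epochs=50,
                                n_batches=2, batch_size=3, lr=0.1)
print(total_loss(data, init, alpha), total_loss(data, trained, alpha))
```

In a full implementation the gradient would normally come from an automatic differentiation framework and would flow into the encoder and decoder parameters as well; the manual gradient here only serves to keep the sketch dependency-free.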

Experimental setup
AE architecture: encoder with d-500-500-2000-K neurons and mirrored decoder
Baselines:
    k-Means
    AE + k-Means
    Deep Clustering Network [Yang+, 2017]
    Improved Deep Embedded Clustering [Guo+, 2017]
Datasets:
    Text:
        20 Newsgroups: 20 classes, 18,846 samples
        RCV1: 4 classes, 10,000 samples
    Image:
        MNIST: 10 classes, 70,000 samples
        USPS: 10 classes, 9,298 samples
Clustering metrics:
    Clustering accuracy (ACC)
    Normalized Mutual Information (NMI)

Clustering performance
Mean ± std for ACC and NMI computed over 10 (seeded) runs. Bold (resp. underlined)
values correspond to results with no significant difference (p > 0.05) to the best
approach with (resp. without) pretraining for each dataset/metric pair

Model     MNIST               USPS                20NEWS              RCV1
          ACC       NMI       ACC       NMI       ACC       NMI       ACC       NMI
KM        53.5±0.3  49.8±0.5  67.3±0.1  61.4±0.1  23.2±1.5  21.6±1.8  50.8±2.9  31.3±5.4
AE-KM     80.8±1.8  75.2±1.1  72.9±0.8  71.7±1.2  49.0±2.9  44.5±1.5  56.7±3.6  31.5±4.3
Deep clustering approaches without pretraining
DCN^np    34.8±3.0  18.1±1.0  36.4±3.5  16.9±1.3  17.9±1.0  9.8±0.5   41.3±4.0  6.9±1.8
IDEC^np   61.8±3.0  62.4±1.6  53.9±5.1  50.0±3.8  22.3±1.5  22.3±1.5  56.7±5.3  31.4±2.8
DKM^a     82.3±3.2  78.0±1.9  75.5±6.8  73.0±2.3  44.8±2.4  42.8±1.1  53.8±5.5  28.0±5.8
Deep clustering approaches with pretraining
DCN^p     81.1±1.9  75.7±1.1  73.0±0.8  71.9±1.2  49.2±2.9  44.7±1.5  56.7±3.6  31.6±4.3
IDEC^p    85.7±2.4  86.4±1.0  75.2±0.5  74.9±0.6  40.5±1.3  38.2±1.0  59.5±5.7  34.7±5.0
DKM^p     84.0±2.2  79.6±0.9  75.7±1.3  77.6±1.1  51.2±2.8  46.7±1.2  58.3±3.8  33.1±4.9

‘k-Means-friendliness’ of learned representations
Mean ± std for ACC and NMI computed over 10 (seeded) runs. Bold values
correspond to results with no significant difference (p > 0.05) to the best

Model       MNIST               USPS                20NEWS              RCV1
            ACC       NMI       ACC       NMI       ACC       NMI       ACC       NMI
AE-KM       80.8±1.8  75.2±1.1  72.9±0.8  71.7±1.2  49.0±2.9  44.5±1.5  56.7±3.6  31.5±4.3
DCN^p + KM  84.9±3.1  79.4±1.5  73.9±0.7  74.1±1.1  50.5±3.1  46.5±1.6  57.3±3.6  32.3±4.4
DKM^a + KM  84.8±1.3  78.7±0.8  76.9±4.9  74.3±1.5  49.0±2.5  44.0±1.0  53.4±5.9  27.4±5.3
DKM^p + KM  85.1±3.0  79.9±1.5  75.7±1.3  77.6±1.1  52.1±2.7  47.1±1.3  58.3±3.8  33.0±4.9
[Figure: four 2-D plots, one per model: AE, DCN, DKM^a, DKM^p]

Conclusion
We proposed Deep k-Means (DKM), a new approach that jointly performs k-means
clustering and representation learning
Take-home messages:
Pretraining is clearly beneficial to deep clustering (e.g., on MNIST, ACC rises
from 34.8 to 81.1 for DCN and from 61.8 to 85.7 for IDEC)
The differentiable formulation of DKM enables fully joint SGD training and thus
efficient use of GPUs
k-Means-based approaches can perform on par with state-of-the-art deep
clustering approaches

Ongoing work: Constrained Deep k-Means
We wish to guide the clustering results in order to capture information that is relevant
to the user (e.g., expert knowledge on the classes). We consider here that this
information takes the form of lexical constraints with a set of keywords for document
clustering
[Figure: example keywords: engine, car, diet, food, novel, book]

Two approaches considered:
Constrain the document embeddings to put more emphasis on the keywords
Constrain the cluster representatives to be related to subsets of the keywords

Thank you!
Paper pre-print available at: https://arxiv.org/pdf/1806.10069.pdf

References
Guo, X., Gao, L., Liu, X., & Yin, J. (2017). Improved Deep Embedded Clustering
with Local Structure Preservation. In Proceedings of the 26th International Joint
Conference on Artificial Intelligence (pp. 1753–1759).
MacQueen, J. (1967). Some Methods for Classification and Analysis of
Multivariate Observations. In Proceedings of the 5th Berkeley Symposium on
Mathematical Statistics and Probability (pp. 281–297).
Moradi Fard, M., Thonet, T., & Gaussier, E. (2018). Deep k-Means: Jointly
Clustering with k-Means and Learning Representations. arXiv:1806.10069.
Yang, B., Fu, X., Sidiropoulos, N. D., & Hong, M. (2017). Towards
K-means-friendly Spaces: Simultaneous Deep Learning and Clustering. In
ICML ’17 (pp. 3861–3870).

Appendix: clustering metrics
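Given the ground-truth classes $S = \{S_1, \dots, S_K\}$, the obtained clusters
$C = \{C_1, \dots, C_K\}$, and the dataset X:
$$\mathrm{ACC}(C, S) = \max_{\phi} \frac{1}{|X|} \sum_{i=1}^{|X|} \mathbb{I}\{ s_i = \phi(c_i) \}$$
$$\mathrm{NMI}(C, S) = \frac{2 \, I(C, S)}{H(C) + H(S)}$$
with
$$I(C, S) = \sum_{j,k} \frac{|C_j \cap S_k|}{|X|} \log \frac{|X| \, |C_j \cap S_k|}{|C_j| \, |S_k|}$$
and
$$H(C) = - \sum_{j} \frac{|C_j|}{|X|} \log \frac{|C_j|}{|X|}$$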
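Both metrics can be computed directly from two label lists. The sketch below assumes that s_i and c_i denote the ground-truth class and the assigned cluster of object i, and that φ ranges over one-to-one mappings from clusters to classes. For simplicity the maximum over φ is found by brute force over permutations, which is only practical for a small number of clusters; an assignment solver such as the Hungarian algorithm is the usual substitute at scale. Function names and the toy labels are illustrative.

```python
import math
from collections import Counter
from itertools import permutations


def clustering_accuracy(clusters, classes):
    """ACC: best agreement over one-to-one mappings phi from clusters to classes."""
    cluster_ids = sorted(set(clusters))
    class_ids = sorted(set(classes))
    if len(cluster_ids) > len(class_ids):
        raise ValueError("this sketch assumes no more clusters than classes")
    counts = Counter(zip(clusters, classes))  # |C_j ∩ S_k| for each pair (j, k)
    best = 0
    for targets in permutations(class_ids, len(cluster_ids)):
        phi = dict(zip(cluster_ids, targets))
        hits = sum(counts[(c, phi[c])] for c in cluster_ids)
        best = max(best, hits)
    return best / len(classes)


def entropy(labels):
    """H = -sum_j (|C_j| / |X|) log(|C_j| / |X|)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(clusters, classes):
    """I(C, S) computed from the joint and marginal label counts."""
    n = len(classes)
    joint = Counter(zip(clusters, classes))
    size_c = Counter(clusters)
    size_s = Counter(classes)
    return sum(
        (njk / n) * math.log(n * njk / (size_c[j] * size_s[k]))
        for (j, k), njk in joint.items()
    )


def nmi(clusters, classes):
    """NMI = 2 I(C, S) / (H(C) + H(S))."""
    denom = entropy(clusters) + entropy(classes)
    # Assumed convention: two trivial single-group partitions agree perfectly.
    return 2.0 * mutual_information(clusters, classes) / denom if denom > 0 else 1.0


classes = [0, 0, 0, 1, 1, 1]    # ground-truth classes s_i
clusters = [1, 1, 1, 0, 0, 1]   # cluster assignments c_i (ids are arbitrary)
print(clustering_accuracy(clusters, classes), nmi(clusters, classes))
```

Both metrics ignore how the clusters are numbered: relabelling the clusters changes neither ACC (because of the maximum over φ) nor NMI (which depends only on the partition).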
