Slides for the presentation at ENBIS 2018 of "Deep k-Means: Jointly Clustering with k-Means and Learning Representations" by Thibaut Thonet. Joint work with Maziar Moradi Fard and Eric Gaussier.
1. Deep k-Means: Jointly Clustering with k-Means and Learning Representations
Thibaut THONET
thibaut.thonet@univ-grenoble-alpes.fr
Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG
Joint work with Maziar MORADI FARD and Eric GAUSSIER
5 September 2018 @ ENBIS, Nancy
Thibaut Thonet Deep k-Means
2. Clustering
Clustering is the process of organizing unlabeled objects into groups (clusters)
whose members are similar in some way
Clustering approaches may be classified as:
Hard clustering: each object belongs to at most one cluster
Soft clustering: each object can belong to more than one cluster
3. k-Means clustering
k-Means is a centroid-based approach for hard clustering [MacQueen, 1967].
Given a set of objects X, k-Means clustering aims to group the objects into k clusters
of similar samples by minimizing the following loss function:
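$$ \min_{R} \sum_{x \in X} \|x - c(x; R)\|_2^2 $$
where R are the cluster centers and $c(x; R) = \arg\min_{r \in R} \|x - r\|_2$ is the nearest cluster center to x
[Figure: k-Means alternates between two steps, "Assign objects to clusters" and "Update cluster centers"; cluster centers labeled r1 and r2]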
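The alternation between assigning objects to clusters and updating cluster centers can be sketched in a few lines of plain Python. This is only a minimal illustration under assumptions not stated on the slide: centers are initialized from randomly sampled data points, the loop runs for a fixed number of iterations, and the names `kmeans`, `sq_dist`, `n_iter` and `seed` are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, k, n_iter=20, seed=0):
    """Alternate the assignment and update steps for a fixed number of iterations."""
    rng = random.Random(seed)
    # Assumption: initialize the centers with k distinct data points.
    centers = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: c(x; R) is the center nearest to x.
        assign = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # Update step: each center becomes the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    # k-Means loss: sum of squared distances to the assigned centers.
    loss = sum(sq_dist(p, centers[a]) for p, a in zip(points, assign))
    return centers, assign, loss

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assign, loss = kmeans(points, k=2)
print(assign, round(loss, 3))
```

On this toy dataset the two groups are well separated, so the two steps settle on the same partition whatever the initial centers.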
4. k-Means clustering
...But the input space is often high-dimensional, sparse and/or with redundant
dimensions
=⇒ It may not be suitable for clustering
5. k-Means in an embedded space: Auto-Encoder + k-Means
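1. Train an auto-encoder on the dataset to learn object embeddings (e.g., for text, low-dimensional dense representations):
$$ \min_{\theta} \sum_{x} L_{\mathrm{rec}}(x), \qquad L_{\mathrm{rec}}(x) = \|x - \mathrm{Auto}(x)\|_2^2 $$
2. Perform k-Means in the embedding space:
$$ \min_{R} \sum_{x} L_{\mathrm{clust}}(x), \qquad L_{\mathrm{clust}}(x) = \|h_\theta(x) - c(h_\theta(x); R)\|_2^2 $$
with $c(h_\theta(x); R) = \arg\min_{r \in R} \|h_\theta(x) - r\|_2$
[Figure: auto-encoder mapping x to the embedding hθ(x) and to the reconstruction Auto(x), with cluster representatives r1 and r2 in the embedding space]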
6. k-Means in an embedded space: Auto-Encoder + k-Means
...But embeddings are not specifically learned for clustering purposes
=⇒ They may still not be suitable for clustering
7. k-Means in an embedded space: Deep Clustering Network
The Deep Clustering Network (DCN) [Yang+, 2017] alternately (i) learns cluster
representatives R and auto-encoder parameters θ using SGD and (ii) assigns data
points to the cluster with the nearest representative in the embedding space
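$$ L_{\mathrm{rec}}(x) = \|x - \mathrm{Auto}(x)\|_2^2 $$
$$ L_{\mathrm{clust}}(x) = \|h_\theta(x) - c(h_\theta(x); R)\|_2^2 $$
with $c(h_\theta(x); R) = \arg\min_{r \in R} \|h_\theta(x) - r\|_2$
$$ \min_{R, \theta} L = \sum_{x} \left[ L_{\mathrm{rec}}(x) + \lambda L_{\mathrm{clust}}(x) \right] $$
[Figure: auto-encoder mapping x to the embedding hθ(x) and to the reconstruction Auto(x), with cluster representatives r1 and r2 in the embedding space]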
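To make the combined objective concrete, the sketch below evaluates L for a toy dataset with hard (argmin) assignments. The `encode` and `decode` functions are hypothetical stand-ins for a trained auto-encoder (they simply drop and re-pad a coordinate), and `dcn_objective` is an illustrative name; nothing here is taken from the DCN implementation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def encode(x):
    """Hypothetical encoder h_theta: keeps only the first coordinate."""
    return (x[0],)

def decode(h):
    """Hypothetical decoder: pads the embedding back to two dimensions."""
    return (h[0], 0.0)

def dcn_objective(data, reps, lam):
    """Sum over x of L_rec(x) + lam * L_clust(x) with hard assignments."""
    total = 0.0
    for x in data:
        h = encode(x)
        rec = sq_dist(x, decode(h))               # L_rec(x)
        clust = min(sq_dist(h, r) for r in reps)  # squared distance to the nearest representative
        total += rec + lam * clust
    return total

data = [(0.0, 0.1), (1.0, -0.1), (5.0, 0.2)]
reps = [(0.5,), (5.0,)]
print(dcn_objective(data, reps, lam=1.0))
```

The `min` over representatives is the discrete assignment step: it is piecewise constant in which representative is selected, which is why the assignments are handled in a separate step rather than by SGD.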
8. k-Means in an embedded space: Deep Clustering Network
...But it is impossible to rely solely on SGD because of the discrete assignments (argmin)
=⇒ Non-joint and less scalable training
9. Deep k-means: overview
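$$ L_{\mathrm{rec}}(x) = \|x - \mathrm{Auto}(x)\|_2^2 $$
$$ L_{\mathrm{clust}}(x) = \sum_{k} \mathrm{closeness}(h_\theta(x), r_k) \times \|h_\theta(x) - r_k\|_2^2 $$
$$ \min_{R, \theta} L = \sum_{x} \left[ L_{\mathrm{rec}}(x) + \lambda L_{\mathrm{clust}}(x) \right] $$
[Figure: auto-encoder mapping x to the embedding hθ(x) and to the reconstruction Auto(x), with cluster representatives r1 and r2 in the embedding space]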
10. Deep k-means: a differentiable surrogate to DCN
We propose to solve a fully differentiable surrogate of DCN's problem [Moradi Fard+,
2018]:
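$$ P^{(\alpha)}_{\mathrm{DKM}}: \quad \min_{R, \theta} L^{(\alpha)} = \sum_{x \in X} \left[ L_{\mathrm{rec}}(x) + \lambda L^{(\alpha)}_{\mathrm{clust}}(x) \right] $$
with
$$ L^{(\alpha)}_{\mathrm{clust}}(x) = \sum_{r \in R} \mathrm{closeness}(h_\theta(x), r; \alpha) \times \|h_\theta(x) - r\|^2 $$
such that:
closeness(hθ(x), r; α) is differentiable wrt both θ and r
$$ \lim_{\alpha \to \infty} \mathrm{closeness}(h_\theta(x), r; \alpha) = \begin{cases} 1 & \text{if } r = \arg\min_{r' \in R} \|h_\theta(x) - r'\|^2 \\ 0 & \text{otherwise} \end{cases} $$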
11. Deep k-means: a differentiable surrogate to DCN
Intuitively, closeness(hθ(x), r; α) can be seen as a relaxation of DCN's hard
clustering assignments such that $\lim_{\alpha \to \infty} P^{(\alpha)}_{\mathrm{DKM}} = P_{\mathrm{DCN}}$ holds
12. Deep k-means: choice of closeness and α
We chose closeness to be defined based on a parameterized softmax:
$$ \mathrm{closeness}(h_\theta(x), r; \alpha) = \frac{\exp(-\alpha \|h_\theta(x) - r\|^2)}{\sum_{r' \in R} \exp(-\alpha \|h_\theta(x) - r'\|^2)} $$
where α can be either set as a constant or progressively increased (deterministic
annealing)
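The effect of α on the parameterized softmax can be checked numerically. The sketch below is an illustration only: it assumes squared Euclidean distances, works on a single fixed embedding `h`, and uses hypothetical names (`closeness`, `soft_clust_loss`). Subtracting the smallest distance before exponentiating is a standard numerical-stability trick added here; it does not change the softmax values.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def closeness(h, reps, alpha):
    """Softmax of -alpha * squared distance from h to each representative."""
    d = [sq_dist(h, r) for r in reps]
    d_min = min(d)
    # Shifting by d_min keeps the ratios unchanged and guarantees one weight equals 1.
    w = [math.exp(-alpha * (dk - d_min)) for dk in d]
    z = sum(w)
    return [wk / z for wk in w]

def soft_clust_loss(h, reps, alpha):
    """Closeness-weighted sum of squared distances (soft clustering loss for one point)."""
    s = closeness(h, reps, alpha)
    return sum(sk * sq_dist(h, r) for sk, r in zip(s, reps))

h = (0.0, 0.0)
reps = [(1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
for alpha in (0.0, 0.1, 1.0, 1000.0):
    weights = [round(v, 4) for v in closeness(h, reps, alpha)]
    print(alpha, weights, round(soft_clust_loss(h, reps, alpha), 4))
```

With α = 0 the weights are uniform, so the loss is the average squared distance to all representatives; with a large α almost all the weight goes to the nearest representative and the loss approaches the hard k-Means term.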
13. Deep k-means: choice of closeness and α
α plays two roles: (a) approximation of hard clustering and (b) inverse temperature
in a deterministic annealing scheme
DKM^a: random initialization of θ and R + annealing: sequence $(\alpha_n)_n$ with $\alpha_1 = 0.1$
DKM^p: pretraining of θ and k-means-based initialization of R + no annealing: constant α = 1000
where the sequence $(\alpha_n)_n$ is defined as $\alpha_{n+1} = 2^{1/\log(n)^2} \times \alpha_n$
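The annealing sequence can be generated with a short loop. One caveat: taken literally with n starting at 1, the recurrence involves log(1) = 0 and is undefined, so the sketch below assumes the multiplier is evaluated with log(n + 1) instead. That offset, the function name `alpha_schedule` and the parameter `n_levels` are assumptions made for illustration, not something stated on the slide.

```python
import math

def alpha_schedule(n_levels, alpha_1=0.1):
    """Return [alpha_1, ..., alpha_{n_levels}] for the annealing scheme.

    ASSUMPTION: the growth factor uses log(n + 1) rather than log(n),
    because log(1) = 0 would make the first step undefined.
    """
    alphas = [alpha_1]
    for n in range(1, n_levels):
        factor = 2.0 ** (1.0 / math.log(n + 1) ** 2)
        alphas.append(factor * alphas[-1])
    return alphas

for n, alpha in enumerate(alpha_schedule(10), start=1):
    print(n, round(alpha, 4))
```

Under this assumption α grows quickly at first and then more and more slowly, since the exponent 1/log(n + 1)^2 shrinks as n increases.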
14. Deep k-means: SGD-based training algorithm
Algorithm 1 Deep k-means
Input: data X, number of clusters K, trade-off hyperparameter λ, scheme for α, number of epochs T, number of minibatches N, learning rate η
Output: autoencoder parameters θ, cluster representatives R
Initialize θ and r_k, 1 ≤ k ≤ K (randomly or through pretraining)
for each α do  # α levels (if α not constant)
    for t = 1 to T do  # epochs per α
        for n = 1 to N do  # minibatches
            Draw a minibatch $\tilde{X} \subset X$
            Update $(\theta, R) \leftarrow (\theta, R) - \eta \frac{1}{|\tilde{X}|} \nabla_{(\theta, R)} \tilde{L}^{(\alpha)}$
        end for
    end for
end for
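The loop structure of the algorithm can be illustrated with a deliberately reduced sketch. It is not the full method: the encoder is assumed to be fixed (the data points are used directly as embeddings), the reconstruction loss is dropped, and only the representatives R are updated, using a hand-derived gradient of the softmax-weighted clustering loss with squared Euclidean distances. All function and parameter names are hypothetical.

```python
import math
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def clust_loss_and_grad(h, reps, alpha):
    """Soft clustering loss for one embedding h and its gradient wrt each representative."""
    d = [sq_dist(h, r) for r in reps]
    d_min = min(d)
    w = [math.exp(-alpha * (dk - d_min)) for dk in d]
    z = sum(w)
    s = [wk / z for wk in w]  # softmax closeness
    loss = sum(sk * dk for sk, dk in zip(s, d))
    grads = []
    for sk, dk, r in zip(s, d, reps):
        # Derivative of the loss wrt d_k, including the dependence of the softmax on d_k.
        coef = sk * (1.0 - alpha * (dk - loss))
        # Chain rule: d_k = ||h - r_k||^2, so its gradient wrt r_k is -2 (h - r_k).
        grads.append([coef * -2.0 * (hi - ri) for hi, ri in zip(h, r)])
    return loss, grads

def train_representatives(data, reps, alphas, n_epochs, batch_size, lr, seed=0):
    """Minibatch gradient descent on the representatives only (embeddings are fixed)."""
    rng = random.Random(seed)
    reps = [list(r) for r in reps]
    for alpha in alphas:              # alpha levels
        for _ in range(n_epochs):     # epochs per alpha
            order = list(range(len(data)))
            rng.shuffle(order)
            for start in range(0, len(order), batch_size):  # minibatches
                batch = [data[i] for i in order[start:start + batch_size]]
                total = [[0.0] * len(r) for r in reps]
                for h in batch:
                    _, grads = clust_loss_and_grad(h, reps, alpha)
                    for k, gk in enumerate(grads):
                        for j, v in enumerate(gk):
                            total[k][j] += v
                # Averaged gradient step, mirroring the 1/|minibatch| factor in the update rule.
                for k in range(len(reps)):
                    for j in range(len(reps[k])):
                        reps[k][j] -= lr * total[k][j] / len(batch)
    return reps

def total_loss(data, reps, alpha):
    return sum(clust_loss_and_grad(h, reps, alpha)[0] for h in data)

data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
init = [(1.0, 1.0), (9.0, 9.0)]
alphas = [0.1, 1.0]
reps = train_representatives(data, init, alphas, n_epochs=30, batch_size=3, lr=0.1)
loss_before = total_loss(data, init, alphas[-1])
loss_after = total_loss(data, reps, alphas[-1])
print(round(loss_before, 3), round(loss_after, 3))
```

In the full method θ receives gradients through the same loss via backpropagation, which is what an automatic-differentiation framework would provide; the manual gradient here only stands in for that step.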
15. Experimental setup
AE architecture: encoder with d-500-500-2000-K neurons and mirrored decoder
Baselines
    k-Means
    AE + k-Means
    Deep Clustering Network [Yang+, 2017]
    Improved Deep Embedded Clustering [Guo+, 2017]
Datasets
    Text
        20 Newsgroups: 20 classes, 18,846 samples
        RCV1: 4 classes, 10,000 samples
    Image
        MNIST: 10 classes, 70,000 samples
        USPS: 10 classes, 9,298 samples
Clustering metrics
    Clustering accuracy (ACC)
    Normalized Mutual Information (NMI)
16. Clustering performance
Mean ± std for ACC and NMI computed over 10 (seeded) runs. Bold (resp. underlined)
values correspond to results with no significant difference (p > 0.05) to the best
approach with (resp. without) pretraining for each dataset/metric pair
Model     MNIST                 USPS                  20NEWS                RCV1
          ACC        NMI        ACC        NMI        ACC        NMI        ACC        NMI
KM        53.5±0.3   49.8±0.5   67.3±0.1   61.4±0.1   23.2±1.5   21.6±1.8   50.8±2.9   31.3±5.4
AE-KM     80.8±1.8   75.2±1.1   72.9±0.8   71.7±1.2   49.0±2.9   44.5±1.5   56.7±3.6   31.5±4.3
Deep clustering approaches without pretraining
DCN^np    34.8±3.0   18.1±1.0   36.4±3.5   16.9±1.3   17.9±1.0   9.8±0.5    41.3±4.0   6.9±1.8
IDEC^np   61.8±3.0   62.4±1.6   53.9±5.1   50.0±3.8   22.3±1.5   22.3±1.5   56.7±5.3   31.4±2.8
DKM^a     82.3±3.2   78.0±1.9   75.5±6.8   73.0±2.3   44.8±2.4   42.8±1.1   53.8±5.5   28.0±5.8
Deep clustering approaches with pretraining
DCN^p     81.1±1.9   75.7±1.1   73.0±0.8   71.9±1.2   49.2±2.9   44.7±1.5   56.7±3.6   31.6±4.3
IDEC^p    85.7±2.4   86.4±1.0   75.2±0.5   74.9±0.6   40.5±1.3   38.2±1.0   59.5±5.7   34.7±5.0
DKM^p     84.0±2.2   79.6±0.9   75.7±1.3   77.6±1.1   51.2±2.8   46.7±1.2   58.3±3.8   33.1±4.9
18. Conclusion
Proposal of Deep k-Means, a new approach that jointly performs k-means
clustering and representation learning
Take-home messages:
Pretraining is clearly beneficial to deep clustering
The differentiable formulation of DKM enables fully joint SGD training and thus
efficient use of GPUs
k-Means-based approaches can perform on par with state-of-the-art deep
clustering approaches
19. Ongoing work: Constrained Deep k-Means
We wish to guide the clustering results in order to capture information that is relevant
to the user (e.g., expert knowledge on the classes). We consider here that this
information takes the form of lexical constraints with a set of keywords for document
clustering
[Figure: example keywords: engine, car, diet, food, novel, book]
20. Ongoing work: Constrained Deep k-Means
Two approaches considered:
Constrain the document embeddings to put more emphasis on the keywords
Constrain the cluster representatives to be related to subsets of the keywords
21. Thank you!
Paper pre-print available at: https://arxiv.org/pdf/1806.10069.pdf
22. References
Guo, X., Gao, L., Liu, X., & Yin, J. (2017). Improved Deep Embedded Clustering
with Local Structure Preservation. In Proceedings of the 26th International Joint
Conference on Artificial Intelligence (pp. 1753–1759).
MacQueen, J. (1967). Some Methods for Classification and Analysis of
Multivariate Observations. In Proceedings of the 5th Berkeley Symposium on
Mathematical Statistics and Probability (pp. 281–297).
Moradi Fard, M., Thonet, T., & Gaussier, E. (2018). Deep k-Means: Jointly
Clustering with k-Means and Learning Representations. arXiv:1806.10069.
Yang, B., Fu, X., Sidiropoulos, N. D., & Hong, M. (2017). Towards
K-means-friendly Spaces: Simultaneous Deep Learning and Clustering. In
ICML ’17 (pp. 3861–3870).
23. Appendix: clustering metrics
Given the groundtruth classes S = {S1, . . . , SK }, the obtained clusters
C = {C1, . . . , CK }, and the dataset X:
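$$ \mathrm{ACC}(C, S) = \max_{\phi} \frac{1}{|X|} \sum_{i=1}^{|X|} \mathbb{I}\{s_i = \phi(c_i)\} $$
$$ \mathrm{NMI}(C, S) = \frac{2\, I(C, S)}{H(C) + H(S)} $$
with
$$ I(C, S) = \sum_{j,k} \frac{|C_j \cap S_k|}{|X|} \log \frac{|X|\, |C_j \cap S_k|}{|C_j|\, |S_k|} $$
and
$$ H(C) = -\sum_{j} \frac{|C_j|}{|X|} \log \frac{|C_j|}{|X|} $$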
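Both metrics can be computed directly from two label lists. The sketch below assumes that c_i and s_i denote the cluster and class of object i, that φ ranges over one-to-one mappings from cluster ids to class ids, and that both label sets are integers in 0..K-1. It searches over all permutations, which is only practical for small K (an assignment solver such as the Hungarian algorithm is the usual choice for larger K). The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations

def clustering_accuracy(clusters, classes, k):
    """Best fraction of matches over all one-to-one mappings phi: cluster id -> class id."""
    joint = Counter(zip(clusters, classes))  # joint[(j, c)] = |C_j ∩ S_c|
    best = max(sum(joint[(j, phi[j])] for j in range(k))
               for phi in permutations(range(k)))
    return best / len(classes)

def entropy(labels):
    """H(C) = -sum_j |C_j|/|X| * log(|C_j|/|X|)."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def mutual_information(clusters, classes):
    """I(C, S) computed from the contingency counts |C_j ∩ S_k|."""
    n = len(classes)
    joint = Counter(zip(clusters, classes))
    size_c = Counter(clusters)
    size_s = Counter(classes)
    return sum(c / n * math.log(n * c / (size_c[j] * size_s[k]))
               for (j, k), c in joint.items())

def nmi(clusters, classes):
    """NMI = 2 I(C, S) / (H(C) + H(S)); undefined if both partitions are trivial."""
    return 2.0 * mutual_information(clusters, classes) / (entropy(clusters) + entropy(classes))

classes = [0, 0, 0, 1, 1, 1]
swapped = [1, 1, 1, 0, 0, 0]   # same partition, cluster ids permuted
noisy = [0, 0, 1, 1, 1, 1]     # one object in the wrong cluster
print(clustering_accuracy(swapped, classes, k=2), nmi(swapped, classes))
print(clustering_accuracy(noisy, classes, k=2), nmi(noisy, classes))
```

Because ACC maximizes over φ and NMI depends only on the partitions, both metrics are unaffected by how the cluster ids happen to be numbered.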