Deep k-Means: Jointly Clustering with k-Means and Learning
Representations
Thibaut THONET
thibaut.thonet@univ-grenoble-alpes.fr
Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG
Joint work with Maziar MORADI FARD and Eric GAUSSIER
5 September 2018 @ ENBIS, Nancy
Clustering
Clustering is the process of organizing unlabeled objects into groups (clusters)
whose members are similar in some way
Clustering approaches may be classified as:
Hard clustering: each object belongs to at most one cluster
Soft clustering: each object can belong to more than one cluster
k-Means clustering

k-Means is a centroid-based approach for hard clustering [MacQueen, 1967].
Given a set of objects X, k-Means clustering aims to group the objects into k clusters of similar samples by minimizing the following loss function:

\[
\min_{R} \sum_{x \in X} \| x - c(x; R) \|_2^2
\]

where R is the set of cluster centers and \( c(x; R) = \arg\min_{r \in R} \| x - r \|_2 \) is the nearest cluster center to x.

[Figure: k-Means alternates between two steps: assign objects to clusters, then update the cluster centers r1, r2, ...]
...But the input space is often high-dimensional, sparse and/or with redundant
dimensions
=⇒ It may not be suitable for clustering
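As a concrete illustration of the two alternating steps, here is a minimal NumPy sketch of Lloyd's algorithm for the above objective; the function name, initialization scheme, and iteration count are illustrative choices, not taken from the slides.

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Minimal Lloyd's algorithm: alternate assignments and center updates."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize the centers R with k distinct points drawn from the data
    R = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Assignment step: c(x; R) = argmin_{r in R} ||x - r||_2
        dists = np.linalg.norm(X[:, None, :] - R[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned objects
        for j in range(k):
            if np.any(assign == j):
                R[j] = X[assign == j].mean(axis=0)
    return R, assign
```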
k-Means in an embedded space: Auto-Encoder + k-Means

1. Train an auto-encoder on the dataset to learn object embeddings (e.g., for text, low-dimensional dense representations)
2. Perform k-Means in the embedding space

\[
\min_{\theta} \sum_{x} L_{\mathrm{rec}}(x) = \sum_{x} \| x - \mathrm{Auto}(x) \|_2^2
\]
\[
\min_{R} \sum_{x} L_{\mathrm{clust}}(x) = \sum_{x} \| h_{\theta}(x) - c(h_{\theta}(x); R) \|_2^2
\]

with \( c(h_{\theta}(x); R) = \arg\min_{r \in R} \| h_{\theta}(x) - r \|_2 \)

[Figure: auto-encoder mapping x to its embedding h_θ(x) and reconstruction Auto(x), with k-Means centers r1, r2, ... in the embedding space]
...But embeddings are not specifically learned for clustering purposes
=⇒ They may still not be suitable for clustering
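A compact PyTorch sketch of this two-stage pipeline is given below; the layer sizes, activation choice, and optimizer are assumptions for illustration (the architecture actually used in the experiments appears later), and the final step reuses the k_means routine sketched above.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Simple fully connected auto-encoder; the encoder output is h_theta(x)."""
    def __init__(self, d, emb_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d, 500), nn.ReLU(),
                                     nn.Linear(500, emb_dim))
        self.decoder = nn.Sequential(nn.Linear(emb_dim, 500), nn.ReLU(),
                                     nn.Linear(500, d))
    def forward(self, x):
        h = self.encoder(x)           # h_theta(x)
        return h, self.decoder(h)     # (embedding, Auto(x))

def ae_then_kmeans(X, k, emb_dim=10, epochs=50, lr=1e-3):
    ae = AutoEncoder(X.shape[1], emb_dim)
    opt = torch.optim.Adam(ae.parameters(), lr=lr)
    for _ in range(epochs):           # Stage 1: train on reconstruction only
        h, x_rec = ae(X)
        loss = ((X - x_rec) ** 2).sum(dim=1).mean()   # L_rec(x)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():             # Stage 2: k-Means on the frozen embeddings
        H = ae.encoder(X).numpy()
    return ae, k_means(H, k)
```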
k-Means in an embedded space: Deep Clustering Network

The Deep Clustering Network (DCN) [Yang+, 2017] alternately (i) learns cluster representatives R and auto-encoder parameters θ using SGD and (ii) assigns data points to the cluster with the nearest representative in the embedding space:

\[
\min_{R, \theta} L = \sum_{x} L_{\mathrm{rec}}(x) + \lambda L_{\mathrm{clust}}(x)
\]
\[
L_{\mathrm{rec}}(x) = \| x - \mathrm{Auto}(x) \|_2^2, \qquad
L_{\mathrm{clust}}(x) = \| h_{\theta}(x) - c(h_{\theta}(x); R) \|_2^2
\]

with \( c(h_{\theta}(x); R) = \arg\min_{r \in R} \| h_{\theta}(x) - r \|_2 \)

[Figure: DCN architecture, an auto-encoder trained with the joint reconstruction and clustering losses in the embedding space]
...But it is impossible to rely solely on SGD due to the discrete assignments (argmin)
=⇒ Non-joint and less scalable training
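The alternation can be sketched as follows, under stated simplifying assumptions: full-batch updates, and mean-based center recomputation instead of DCN's count-based incremental updates; ae and R are as in the sketches above.

```python
def dcn_epoch(ae, R, X, opt, lam=0.1):
    """One simplified DCN-style alternation (not the paper's exact updates)."""
    # (ii) Hard assignments in the embedding space; no gradient flows through argmin
    with torch.no_grad():
        assign = torch.cdist(ae.encoder(X), R).argmin(dim=1)
    # (i) SGD step on theta with the assignments held fixed
    h, x_rec = ae(X)
    loss = ((X - x_rec) ** 2).sum(dim=1).mean() \
           + lam * ((h - R[assign]) ** 2).sum(dim=1).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    # Recompute each representative as the mean of its assigned embeddings
    with torch.no_grad():
        H = ae.encoder(X)
        for j in range(len(R)):
            if (assign == j).any():
                R[j] = H[assign == j].mean(dim=0)
    return loss.item()
```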
Deep k-means: overview

\[
\min_{R, \theta} L = \sum_{x} L_{\mathrm{rec}}(x) + \lambda L_{\mathrm{clust}}(x)
\]
\[
L_{\mathrm{rec}}(x) = \| x - \mathrm{Auto}(x) \|_2^2, \qquad
L_{\mathrm{clust}}(x) = \sum_{k} \mathrm{closeness}(h_{\theta}(x), r_k) \times \| h_{\theta}(x) - r_k \|_2^2
\]

[Figure: Deep k-Means architecture, an auto-encoder whose clustering loss softly weights all representatives r1, ..., rK by their closeness to h_θ(x)]
Deep k-means: a differentiable surrogate to DCN

We propose to solve a fully differentiable surrogate of DCN's problem [Moradi Fard+, 2018]:

\[
P^{(\alpha)}_{\mathrm{DKM}}: \quad \min_{R, \theta} L^{(\alpha)} = \sum_{x \in X} L_{\mathrm{rec}}(x) + \lambda L^{(\alpha)}_{\mathrm{clust}}(x)
\]
\[
\text{with} \quad L^{(\alpha)}_{\mathrm{clust}}(x) = \sum_{r \in R} \mathrm{closeness}(h_{\theta}(x), r; \alpha) \times \| h_{\theta}(x) - r \|_2^2
\]

such that:
closeness(h_θ(x), r; α) is differentiable w.r.t. both θ and r
\[
\lim_{\alpha \to \infty} \mathrm{closeness}(h_{\theta}(x), r; \alpha) =
\begin{cases}
1 & \text{if } r = \arg\min_{r' \in R} \| h_{\theta}(x) - r' \|_2 \\
0 & \text{otherwise}
\end{cases}
\]

Intuitively, closeness(h_θ(x), r; α) can be seen as a relaxation of DCN's hard clustering assignments such that \( \lim_{\alpha \to \infty} P^{(\alpha)}_{\mathrm{DKM}} = P_{\mathrm{DCN}} \) holds
Deep k-means: choice of closeness and α

We chose closeness to be defined based on a parameterized softmax:

\[
\mathrm{closeness}(h_{\theta}(x), r; \alpha) =
\frac{\exp(-\alpha \, \| h_{\theta}(x) - r \|_2^2)}
     {\sum_{r' \in R} \exp(-\alpha \, \| h_{\theta}(x) - r' \|_2^2)}
\]

where α can either be set as a constant or progressively increased (deterministic annealing)

α plays two roles: (a) approximation of hard clustering and (b) inverse temperature in a deterministic annealing scheme

DKM_a: random initialization of θ and R + annealing with the sequence (α_n)_n defined by α_1 = 0.1 and \( \alpha_{n+1} = 2^{1 / \log(n)^2} \times \alpha_n \)
DKM_p: pretraining of θ and k-Means-based initialization of R + no annealing: constant α = 1000
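Both ingredients are short to write down; the sketch below assumes squared Euclidean distances inside the softmax (matching the clustering loss) and starts the annealing recursion at n = 2, since log(1) = 0. These are implementation choices the slides do not spell out.

```python
import math
import torch

def closeness(h, R, alpha):
    """Parameterized softmax over distances to the representatives.
    h: (batch, emb_dim) embeddings; R: (K, emb_dim) representatives."""
    d2 = torch.cdist(h, R) ** 2               # ||h_theta(x) - r||^2 for every r
    return torch.softmax(-alpha * d2, dim=1)  # rows sum to 1; one-hot as alpha -> inf

def alpha_schedule(alpha1=0.1, n_terms=40):
    """Annealing sequence alpha_{n+1} = 2^(1/log(n)^2) * alpha_n with alpha_1 = 0.1."""
    alphas = [alpha1]
    for n in range(2, n_terms + 1):
        alphas.append(2 ** (1 / math.log(n) ** 2) * alphas[-1])
    return alphas
```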
Deep k-means: SGD-based training algorithm

Algorithm 1 Deep k-Means
Input: data X, number of clusters K, trade-off hyperparameter λ, scheme for α, number of epochs T, number of minibatches N, learning rate η
Output: auto-encoder parameters θ, cluster representatives R
Initialize θ and r_k, 1 ≤ k ≤ K (randomly or through pretraining)
for each α do                      # α levels (if α is not constant)
    for t = 1 to T do              # epochs per α
        for n = 1 to N do          # minibatches
            Draw a minibatch X̃ ⊂ X
            Update (θ, R) ← (θ, R) − η (1/|X̃|) ∇_(θ,R) L̃^(α)
        end for
    end for
end for
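A compact PyTorch sketch of this loop, reusing the AutoEncoder and closeness helpers from the earlier sketches; the minibatch size and the use of Adam rather than plain SGD are illustrative assumptions.

```python
def train_dkm(X, K, alphas, T=10, batch_size=256, lam=0.1, lr=1e-3, emb_dim=10):
    """alphas: [1000.0] for DKM_p, or alpha_schedule() for DKM_a."""
    ae = AutoEncoder(X.shape[1], emb_dim)
    R = torch.randn(K, emb_dim, requires_grad=True)   # cluster representatives
    opt = torch.optim.Adam(list(ae.parameters()) + [R], lr=lr)
    for alpha in alphas:                 # alpha levels (a single level if constant)
        for _ in range(T):               # epochs per alpha
            perm = torch.randperm(len(X))
            for i in range(0, len(X), batch_size):    # minibatches X_tilde
                xb = X[perm[i:i + batch_size]]
                h, x_rec = ae(xb)
                L_rec = ((xb - x_rec) ** 2).sum(dim=1)
                w = closeness(h, R, alpha)             # soft assignments
                d2 = torch.cdist(h, R) ** 2
                L_clust = (w * d2).sum(dim=1)
                loss = (L_rec + lam * L_clust).mean()  # (1/|X_tilde|) L_tilde^(alpha)
                opt.zero_grad(); loss.backward(); opt.step()
    return ae, R
```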
Experimental setup
AE architecture: encoder with d-500-500-2000-K neurons and a mirrored decoder (sketched in code after the metrics list below)
Baselines
k-Means
AE + k-Means
Deep Clustering Network [Yang+, 2017]
Improved Deep Embedded Clustering [Guo+, 2017]
Datasets
Text
20 Newsgroups: 20 classes, 18,846 samples
RCV1: 4 classes, 10,000 samples
Image
MNIST: 10 classes, 70,000 samples
USPS: 10 classes, 9,298 samples
Clustering metrics
Clustering accuracy (ACC)
Normalized Mutual Information (NMI)
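For concreteness, the stated d-500-500-2000-K encoder and mirrored decoder could look as follows; the ReLU activations are an assumption, since the slides do not specify the nonlinearity.

```python
import torch.nn as nn

def make_dkm_autoencoder(d, K):
    """Encoder d-500-500-2000-K with a mirrored decoder K-2000-500-500-d."""
    encoder = nn.Sequential(
        nn.Linear(d, 500), nn.ReLU(),
        nn.Linear(500, 500), nn.ReLU(),
        nn.Linear(500, 2000), nn.ReLU(),
        nn.Linear(2000, K),
    )
    decoder = nn.Sequential(
        nn.Linear(K, 2000), nn.ReLU(),
        nn.Linear(2000, 500), nn.ReLU(),
        nn.Linear(500, 500), nn.ReLU(),
        nn.Linear(500, d),
    )
    return encoder, decoder
```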
Clustering performance
Mean ± std for ACC and NMI computed over 10 (seeded) runs. Bold (resp. underlined)
values correspond to results with no significant difference (p > 0.05) to the best
approach with (resp. without) pretraining for each dataset/metric pair
Model    | MNIST ACC | MNIST NMI | USPS ACC | USPS NMI | 20NEWS ACC | 20NEWS NMI | RCV1 ACC | RCV1 NMI
KM       | 53.5±0.3  | 49.8±0.5  | 67.3±0.1 | 61.4±0.1 | 23.2±1.5   | 21.6±1.8   | 50.8±2.9 | 31.3±5.4
AE-KM    | 80.8±1.8  | 75.2±1.1  | 72.9±0.8 | 71.7±1.2 | 49.0±2.9   | 44.5±1.5   | 56.7±3.6 | 31.5±4.3
Deep clustering approaches without pretraining
DCN_np   | 34.8±3.0  | 18.1±1.0  | 36.4±3.5 | 16.9±1.3 | 17.9±1.0   | 9.8±0.5    | 41.3±4.0 | 6.9±1.8
IDEC_np  | 61.8±3.0  | 62.4±1.6  | 53.9±5.1 | 50.0±3.8 | 22.3±1.5   | 22.3±1.5   | 56.7±5.3 | 31.4±2.8
DKM_a    | 82.3±3.2  | 78.0±1.9  | 75.5±6.8 | 73.0±2.3 | 44.8±2.4   | 42.8±1.1   | 53.8±5.5 | 28.0±5.8
Deep clustering approaches with pretraining
DCN_p    | 81.1±1.9  | 75.7±1.1  | 73.0±0.8 | 71.9±1.2 | 49.2±2.9   | 44.7±1.5   | 56.7±3.6 | 31.6±4.3
IDEC_p   | 85.7±2.4  | 86.4±1.0  | 75.2±0.5 | 74.9±0.6 | 40.5±1.3   | 38.2±1.0   | 59.5±5.7 | 34.7±5.0
DKM_p    | 84.0±2.2  | 79.6±0.9  | 75.7±1.3 | 77.6±1.1 | 51.2±2.8   | 46.7±1.2   | 58.3±3.8 | 33.1±4.9
‘k-Means-friendliness’ of learned representations
Mean ± std for ACC and NMI computed over 10 (seeded) runs. Bold values
correspond to results with no significant difference (p > 0.05) to the best
Model       | MNIST ACC | MNIST NMI | USPS ACC | USPS NMI | 20NEWS ACC | 20NEWS NMI | RCV1 ACC | RCV1 NMI
AE-KM       | 80.8±1.8  | 75.2±1.1  | 72.9±0.8 | 71.7±1.2 | 49.0±2.9   | 44.5±1.5   | 56.7±3.6 | 31.5±4.3
DCN_p + KM  | 84.9±3.1  | 79.4±1.5  | 73.9±0.7 | 74.1±1.1 | 50.5±3.1   | 46.5±1.6   | 57.3±3.6 | 32.3±4.4
DKM_a + KM  | 84.8±1.3  | 78.7±0.8  | 76.9±4.9 | 74.3±1.5 | 49.0±2.5   | 44.0±1.0   | 53.4±5.9 | 27.4±5.3
DKM_p + KM  | 85.1±3.0  | 79.9±1.5  | 75.7±1.3 | 77.6±1.1 | 52.1±2.7   | 47.1±1.3   | 58.3±3.8 | 33.0±4.9
[Figure: 2D visualizations of the embedding spaces learned by AE, DCN, DKM_a, and DKM_p]
Conclusion
We proposed Deep k-Means, a new approach that jointly performs k-Means clustering and representation learning
Take-home messages:
Pretraining is clearly beneficial to deep clustering
The differentiable formulation of DKM enables fully joint SGD training and thus efficient use of GPUs
k-Means-based approaches can perform on par with state-of-the-art deep
clustering approaches
Ongoing work: Constrained Deep k-Means

We wish to guide the clustering results so that they capture information that is relevant to the user (e.g., expert knowledge on the classes). We consider here that this information takes the form of lexical constraints, i.e., sets of keywords, for document clustering.

[Figure: example keyword sets for document clustering: {car, engine}, {diet, food}, {novel, book}]
Two approaches considered:
Constrain the document embeddings to put more emphasis on the keywords
Constrain the cluster representatives to be related to subsets of the keywords
Thank you!
Paper pre-print available at: https://arxiv.org/pdf/1806.10069.pdf
References
Guo, X., Gao, L., Liu, X., & Yin, J. (2017). Improved Deep Embedded Clustering
with Local Structure Preservation. In Proceedings of the 26th International Joint
Conference on Artificial Intelligence (pp. 1753–1759).
MacQueen, J. (1967). Some Methods for Classification and Analysis of
Multivariate Observations. In Proceedings of the 5th Berkeley Symposium on
Mathematical Statistics and Probability (pp. 281–297).
Moradi Fard, M., Thonet, T., & Gaussier, E. (2018). Deep k-Means: Jointly
Clustering with k-Means and Learning Representations. arXiv:1806.10069.
Yang, B., Fu, X., Sidiropoulos, N. D., & Hong, M. (2017). Towards K-means-friendly Spaces: Simultaneous Deep Learning and Clustering. In Proceedings of the 34th International Conference on Machine Learning (pp. 3861–3870).
Appendix: clustering metrics

Given the ground-truth classes S = {S_1, ..., S_K}, the obtained clusters C = {C_1, ..., C_K}, and the dataset X:

\[
\mathrm{ACC}(C, S) = \max_{\phi} \frac{1}{|X|} \sum_{i=1}^{|X|} \mathbb{1}\{ s_i = \phi(c_i) \}
\]
\[
\mathrm{NMI}(C, S) = \frac{2 \, I(C, S)}{H(C) + H(S)}
\quad \text{with} \quad
I(C, S) = \sum_{j,k} \frac{|C_j \cap S_k|}{|X|} \log \frac{|X| \, |C_j \cap S_k|}{|C_j| \, |S_k|}
\quad \text{and} \quad
H(C) = -\sum_{j} \frac{|C_j|}{|X|} \log \frac{|C_j|}{|X|}
\]
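Both metrics are easy to compute with standard tooling; below is a short sketch using scipy and scikit-learn, assuming integer-encoded labels (the max over φ in ACC is solved with the Hungarian algorithm).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: accuracy under the best one-to-one mapping phi from clusters to classes."""
    K = max(y_true.max(), y_pred.max()) + 1
    counts = np.zeros((K, K), dtype=int)
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1                        # contingency counts |C_j ∩ S_k|
    rows, cols = linear_sum_assignment(-counts)  # maximize the matched counts
    return counts[rows, cols].sum() / len(y_true)

# NMI, with arithmetic averaging matching 2 I(C,S) / (H(C) + H(S)):
# nmi = normalized_mutual_info_score(y_true, y_pred)
```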