Determining the k in k-means
with MapReduce
Thibault Debatty, Pietro Michiardi,
Wim Mees & Olivier Thonnard
Algorithms for MapReduce and Beyond 2014
Clustering & k-means
● Clustering
● K-means
[Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28:129–137, 1982.]
  – 1982 (a great year!)
  – But still widely used
  – Drawbacks (amongst others):
    ● May converge to a local minimum
    ● K is a parameter!
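As a reminder of what k-means does, here is a minimal sketch of Lloyd's algorithm in plain Python. The function name `kmeans`, the `max_iter` parameter and the choice to pass the initial centers explicitly are assumptions made for illustration; the slides do not prescribe an implementation. Note that k (the number of initial centers) has to be supplied by the caller, and the result depends on where those centers start.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Lloyd's algorithm: alternate an assignment step and an update step."""
    centers = [tuple(float(x) for x in c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else c
            for members, c in zip(clusters, centers)
        ]
        if new_centers == centers:  # nothing moved: converged
            break
        centers = new_centers
    return centers
```

For example, `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], [(0, 0), (10, 10)])` settles on one center per group of points.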
Clustering & k-means
● Determine k:
  – VERY difficult
    [Anil K. Jain. Data Clustering: 50 Years Beyond K-Means. Pattern Recognition Letters, 2009]
  – Using cluster evaluation metrics: Dunn's index, Elbow, Silhouette, “jump method” (based on information theory), “Gap statistic”, ...
  – Cost: O(k²) (each candidate k needs its own clustering)
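One way to read the O(k²) figure, assuming that k-means is run once for every candidate value from 1 to k and that one iteration with i centers on n points costs n·i distance computations (both assumptions, not stated on the slide):

```latex
\sum_{i=1}^{k} n \cdot i \;=\; n \, \frac{k(k+1)}{2} \;=\; O(n k^{2})
```

For a fixed dataset size n, the work therefore grows quadratically with the largest k that is tried.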
G-means
● G-means
[Greg Hamerly and Charles Elkan. Learning the k in k-means. In Neural Information Processing Systems. MIT Press, 2003]
● K-means assumption: points in each cluster are spherically distributed around the center
(Figure source: scikit-learn)
● Hence G-means: normality test & recursion
G-means, step by step (illustrated on a sample dataset)
1. Pick 2 centers
2. Run k-means with these 2 centers
3. Project the points on the vector joining the 2 centers
4. Normality test on the projections
   Normal? No => recursion
5. Recursion on each of the 2 new clusters
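Steps 3 and 4 can be sketched in plain Python as follows. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, the small-sample correction factor and the default critical value 1.8692 (significance level 0.0001) are the ones reported in the Hamerly and Elkan paper, and the normal CDF is computed with `math.erf`.

```python
import math

def project(points, c1, c2):
    """Project each point on the vector v = c1 - c2 joining the two centers."""
    v = [a - b for a, b in zip(c1, c2)]
    norm2 = sum(a * a for a in v)
    return [sum(x * y for x, y in zip(p, v)) / norm2 for p in points]

def anderson_darling(xs):
    """Corrected Anderson-Darling statistic for normality (mean and variance estimated)."""
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    z = sorted((x - mean) / std for x in xs)  # standardise, then sort

    def cdf(t):
        # Standard normal CDF, clamped so that log() never receives 0.
        p = 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
        return min(max(p, 1e-12), 1.0 - 1e-12)

    s = sum((2 * i + 1) * (math.log(cdf(z[i])) + math.log(1.0 - cdf(z[n - 1 - i])))
            for i in range(n))
    a2 = -n - s / n
    return a2 * (1.0 + 4.0 / n - 25.0 / (n * n))

def looks_gaussian(points, c1, c2, critical=1.8692):
    """True if the projected cluster passes the test, i.e. it is not split further."""
    return anderson_darling(project(points, c1, c2)) <= critical
```

A cluster for which `looks_gaussian` returns `False` is replaced by its two children, and the procedure recurses on each of them.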
MapReduce G-means
● Challenges:
1. Reduce I/O operations
2. Reduce number of jobs
3. Maximize parallelism
4. Limit memory usage
MapReduce G-means
2. Reduce number of jobs
  PickInitialCenters
  while not ClusteringCompleted do
    KMeans
    KMeansAndFindNewCenters
    TestClusters
  end while
MapReduce G-means
TestClusters
  procedure Map(key, point)
    Find cluster
    Find vector
    Project point on vector
    Emit(cluster, projection)
  end procedure
  procedure Reduce(cluster, projections)
    Build a vector
    ADtest(vector)
    if normal then
      Mark cluster
    end if
  end procedure
3. Maximize parallelism
4. Limit memory usage
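The TestClusters job can be simulated on a single machine to make the data flow concrete. In this sketch the shuffle phase is replaced by a dictionary, `vectors[i]` is assumed to hold the projection direction of cluster i, and `is_normal` is a caller-supplied stand-in for the Anderson-Darling test; all names are hypothetical.

```python
from collections import defaultdict

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def map_point(point, centers, vectors):
    """Map: emit (cluster, projection) for one input point."""
    # Find cluster: index of the nearest center.
    cluster = min(range(len(centers)),
                  key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))
    # Find vector, then project the point on it.
    v = vectors[cluster]
    yield cluster, dot(point, v) / dot(v, v)

def reduce_cluster(cluster, projections, is_normal):
    """Reduce: all projections of one cluster are held in memory at once."""
    vector = list(projections)
    return cluster, is_normal(vector)

def run_test_clusters(points, centers, vectors, is_normal):
    groups = defaultdict(list)  # stands in for the shuffle phase
    for point in points:
        for cluster, projection in map_point(point, centers, vectors):
            groups[cluster].append(projection)
    return dict(reduce_cluster(c, projs, is_normal) for c, projs in groups.items())
```

The sketch makes two properties visible: one record is emitted per input point, and each reduce call needs every projection of its cluster in memory.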
Bottleneck: TestClusters
3. Maximize parallelism
4. Limit memory usage (risk of crash)
MapReduce G-means
TestFewClusters (in-memory combiner)
  procedure Map(key, point)
    Find cluster
    Find vector
    Project point on vector
    Add projection to list
  end procedure
  procedure Close()
    for each list do
      Build a vector
      A2 = ADtest(vector)
      Emit(cluster, A2)
    end for each
  end procedure
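A minimal sketch of the in-memory combiner pattern used by TestFewClusters: the mapper keeps one list of projections per cluster, emits nothing from `map`, and only emits one value per cluster when it is closed. The class name, the `statistic` callback (standing in for `ADtest`) and the `vectors` layout are assumptions for illustration.

```python
from collections import defaultdict

class FewClustersMapper:
    def __init__(self, centers, vectors, statistic):
        self.centers = centers
        self.vectors = vectors      # vectors[i]: projection direction of cluster i
        self.statistic = statistic  # hypothetical stand-in for ADtest
        self.lists = defaultdict(list)

    def map(self, key, point):
        # Find cluster: index of the nearest center.
        cluster = min(range(len(self.centers)),
                      key=lambda i: sum((p - c) ** 2
                                        for p, c in zip(point, self.centers[i])))
        # Find vector, project the point, and keep the projection in memory.
        v = self.vectors[cluster]
        projection = sum(x * y for x, y in zip(point, v)) / sum(x * x for x in v)
        self.lists[cluster].append(projection)

    def close(self):
        # Called once after the last point of the input split.
        for cluster, projections in self.lists.items():
            yield cluster, self.statistic(projections)
```

Each mapper thus emits one record per cluster for its input split, instead of one record per point.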
Use TestClusters when:
  #clusters > #reducers
  & estimated required memory < Java heap
    (experimentally: 64 bytes / point)
Otherwise: TestFewClusters
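The choice between the two jobs can be written as a small decision function. This sketch rests on two assumptions that the slide does not spell out: that the condition selects TestClusters (with TestFewClusters as the fallback), and that the required memory is estimated as 64 bytes times the number of points in the largest cluster. Function and parameter names are hypothetical.

```python
BYTES_PER_POINT = 64  # experimental figure quoted on the slide

def choose_test_job(n_clusters, n_reducers, largest_cluster_points, heap_bytes,
                    bytes_per_point=BYTES_PER_POINT):
    """Return the name of the job to run for the next normality test."""
    # Assumption: a reducer must hold every projection of its cluster in memory.
    estimated_memory = largest_cluster_points * bytes_per_point
    if n_clusters > n_reducers and estimated_memory < heap_bytes:
        return "TestClusters"
    return "TestFewClusters"
```

With a 1 GiB heap, a single cluster of 10M points (about 640 MB at 64 bytes per point) would fit in memory but still fails the first condition as long as there are fewer clusters than reducers.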
Comparison
MR multi-k-means (all possible values of k in a single job)
  Speed: O(nk²) computations
MR G-means
  Speed: O(nk) computations
    But:
    ● more iterations
    ● more dataset reads
    ● log₂(k)
  Quality: new centers added if and where needed
    But: tends to overestimate k!
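The two cost figures can be illustrated with a back-of-the-envelope count of distance computations. The sketch assumes the worst case for G-means, where every cluster is split at every round starting from 2 centers, and counts a single k-means pass per value of k or per round; these simplifications and the function names are assumptions, not measurements.

```python
def multi_kmeans_distance_evals(n, k_max):
    """One pass for every k in 1..k_max: n * (1 + 2 + ... + k_max)."""
    return n * k_max * (k_max + 1) // 2

def gmeans_center_counts(k_target):
    """Number of centers at each round when every cluster splits: 2, 4, 8, ..."""
    counts = []
    centers = 2
    while True:
        counts.append(centers)
        if centers >= k_target:
            break
        centers *= 2
    return counts

def gmeans_distance_evals(n, k_target):
    """One pass per round: n * (2 + 4 + ... ), which stays below 4 * n * k_target."""
    return n * sum(gmeans_center_counts(k_target))
```

For k = 400 the doubling schedule needs 9 rounds (2, 4, ..., 512), which is where the log₂(k) term comes from, while the per-point work is 1022 distance computations against 80200 for the exhaustive approach.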
Experimental results : Speed
● Hadoop
● Synthetic dataset
● 10M points in R^10
● Euclidean distance
● 8 machines
Experimental results : Quality
(same setup as for speed)
                                   k = 100   k = 200   k = 400
k found by MR G-means (~1.5 x k)     150       279       639
Within Cluster Sum of Squares (less is better):
MR G-means                           3.34      3.33      3.23
multi-k-means (with same k)          3.71      3.6       3.39
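For reference, the within-cluster sum of squares can be computed as below. This is a plain definition of the metric with a hypothetical function name; the slides do not say how the reported values were scaled, so the absolute numbers in the table cannot be reproduced from this sketch alone.

```python
def wcss(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(
        min(sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers)
        for point in points
    )
```

Adding centers can only keep this value equal or lower it, which is why the comparison in the table is made with the same k for both algorithms.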
Conclusions & future work...
● MapReduce algorithm to determine k
● Running time proportional to k
● Future:
– Overestimation of k
– Test on real data
– Test scalability
– Reduce I/O (using Spark)
– Consider skewed data
– Consider impact of machine failure
Thank you!
