Determining the k in k-means with MapReduce

1,239 views

Published on

Presentation given at Algorithms for MapReduce and Beyond, 2014-03-28

Published in: Science
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
1,239
On SlideShare
0
From Embeds
0
Number of Embeds
10
Actions
Shares
0
Downloads
58
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Determining the k in k-means with MapReduce

  1. 1. Determining the k in k-means with MapReduce Thibault Debatty, Pietro Michiardi, Wim Mees & Olivier Thonnard Algorithms for MapReduce and Beyond 2014
  2. 2. Determining the k in k-means with MapReduce 2 Clustering & k-means ● Clustering ● K-means [Stuart P. Lloyd. Least squares quantization in pcm. IEEE Transactions on Information Theory, 28:129–137, 1982.] – 1982 (a great year!) – But still largely used – Drawbacks (amongst others): ● Local minimum ● K is a parameter!
  3. 3. Determining the k in k-means with MapReduce 3 Clustering & k-means ● Determine k: – VERY difficult [Anil K Jain. Data Clustering : 50 Years Beyond K-Means. Pattern Recognition Letters, 2009] – Using cluster evaluation metrics: Dunn's index, Elbow, Silhouette, “jump method” (based on information theory), “Gap statistic”,... O(k²)
  4. 4. Determining the k in k-means with MapReduce 4 G-means ● G-means [Greg Hamerly and Charles Elkan. Learning the k in k- means. In Neural Information Processing Systems. MIT Press, 2003] ● K-means : points in each cluster are spherically distributed around the center Source:scikit-learn
  5. 5. Determining the k in k-means with MapReduce 5 G-means ● G-means [Greg Hamerly and Charles Elkan. Learning the k in k- means. In Neural Information Processing Systems. MIT Press, 2003] ● K-means : points in each cluster are spherically distributed around the center normality test & recursion
  6. 6. Determining the k in k-means with MapReduce 6 G-means Dataset
  7. 7. Determining the k in k-means with MapReduce 7 G-means 1. Pick 2 centers
  8. 8. Determining the k in k-means with MapReduce 8 G-means 2. k-means
  9. 9. Determining the k in k-means with MapReduce 9 G-means 3. Project
  10. 10. Determining the k in k-means with MapReduce 10 G-means 3. Project
  11. 11. Determining the k in k-means with MapReduce 11 G-means Normal? No => recursion 4. Normality test
  12. 12. Determining the k in k-means with MapReduce 12 G-means 5. Recursion
  13. 13. Determining the k in k-means with MapReduce 13 MapReduce G-means ● Challenges: 1. Reduce I/O operations 2. Reduce number of jobs 3. Maximize parallelism 4. Limit memory usage
  14. 14. Determining the k in k-means with MapReduce 14 MapReduce G-means ● Challenges: 1. Reduce I/O operations 2. Reduce number of jobs 3. Maximize parallelism 4. Limit memory usage
  15. 15. Determining the k in k-means with MapReduce 15 MapReduce G-means 2. Reduce number of jobs PickInitialCenters while Not ClusteringCompleted do KMeans KMeansAndFindNewCenters TestClusters end while
  16. 16. Determining the k in k-means with MapReduce 16 MapReduce G-means TestClusters Map(key, point) Find cluster Find vector Project point on vector Emit(cluster, projection) end procedure Reduce(cluster, projections) Build a vector ADtest(vector) if normal then Mark cluster end if end procedure 3. Maximize parallelism 4. Limit memory usage
  17. 17. Determining the k in k-means with MapReduce 17 MapReduce G-means TestClusters Map(key, point) Find cluster Find vector Project point on vector Emit(cluster, projection) end procedure Reduce(cluster, projections) Build a vector ADtest(vector) if normal then Mark cluster end if end procedure Bottleneck 3. Maximize parallelism 4. Limit memory usage (risk of crash)
  18. 18. Determining the k in k-means with MapReduce 18 MapReduce G-means TestClusters Map(key, point) Find cluster Find vector Project point on vector Emit(cluster, projection) end procedure Reduce(cluster, projections) Build a vector ADtest(vector) if normal then Mark cluster end if end procedure TestFewClusters Map(key, point) Find cluster Find vector Project point on vector Add projection to list end procedure Close() For each list do Build a vector A2 = ADtest(vector) Emit(cluster, A2) End for each end procedure In memory combiner
  19. 19. Determining the k in k-means with MapReduce 19 MapReduce G-means TestClusters Map(key, point) Find cluster Find vector Project point on vector Emit(cluster, projection) end procedure Reduce(cluster, projections) Build a vector ADtest(vector) if normal then Mark cluster end if end procedure TestFewClusters Map(key, point) Find cluster Find vector Project point on vector Add projection to list end procedure Close() For each list do Build a vector A2 = ADtest(vector) Emit(cluster, A2) End for each end procedure #clusters > #reducers & Estimated required memory < Java heap
  20. 20. Determining the k in k-means with MapReduce 20 MapReduce G-means TestClusters Map(key, point) Find cluster Find vector Project point on vector Emit(cluster, projection) end procedure Reduce(cluster, projections) Build a vector ADtest(vector) if normal then Mark cluster end if end procedure TestFewClusters Map(key, point) Find cluster Find vector Project point on vector Add projection to list end procedure Close() For each list do Build a vector A2 = ADtest(vector) Emit(cluster, A2) End for each end procedure #clusters > #reducers & Estimated required memory < Java heap Experimentally: 64 Bytes / point
  21. 21. Determining the k in k-means with MapReduce 21 Comparison MR multi-k-means MR G-means Speed Quality all possible values of k in a single job
  22. 22. Determining the k in k-means with MapReduce 22 Comparison MR multi-k-means MR G-means Speed O(nk²) computations O(nk) computations But: ● more iterations ● more dataset reads ● log2 (k) Quality New centers added if and where needed But: tends to overestimate k!
  23. 23. Determining the k in k-means with MapReduce 23 Experimental results : Speed ● Hadoop ● Synthetic dataset ● 10M points in R10 ● Euclidean distance ● 8 machines
  24. 24. Determining the k in k-means with MapReduce 24 Experimental results : Quality ● Hadoop ● Synthetic dataset ● 10M points in R10 ● Euclidean distance ● 8 machines k 100 200 400 kfound 150 279 639 Within Cluster Sum of Square (less is better) MR G-means 3.34 3.33 3.23 multi-k-means 3.71 3.6 3.39 (with same k) x ~1.5
  25. 25. Determining the k in k-means with MapReduce 25 Conclusions & future work... ● MapReduce algorithm to determine k ● Running time proportional to k ● Future: – Overestimation of k – Test on real data – Test scalability – Reduce I/O (using Spark) – Consider skewed data – Consider impact of machine failure
  26. 26. Determining the k in k-means with MapReduce 26 Thank you!

×