Clustering large-scale data TU Berlin talk


Talk I gave at TU Berlin for Sebastian Schelter's data mining class.

Published in: Technology

  1. Clustering large-scale data
     Dan Filimon, Politehnica University Bucharest
     Ted Dunning, MapR Technologies
     Buzzwords talk at http://goo.gl/9FxbN
     Saturday, June 1, 13
  2. whoami
     • Soon-to-be graduate of Politehnica University of Bucharest
     • Apache Mahout committer
     • Contact me at
       • dfilimon@apache.org
       • dangeorge.filimon@gmail.com
  3. The problem
     Group n d-dimensional datapoints into k disjoint sets to minimize
     $$\sum_{i=1}^{k} \left( \sum_{x_{ij} \in X_i} \mathrm{dist}(x_{ij}, c_i) \right)$$
     where $X_i$ is the i-th cluster, $c_i$ is the centroid of the i-th cluster, $x_{ij}$ is the j-th point from the i-th cluster, and $\mathrm{dist}(x, y) = \lVert x - y \rVert_2$.
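
A minimal sketch of the objective on the problem slide, assuming dist is the plain Euclidean norm (the transcript does not make clear whether the trailing 2 is a subscript or a square) and that each centroid is the mean of its cluster. The function names are illustrative and are not taken from Mahout.

```python
import math


def dist(x, y):
    """Euclidean distance ||x - y||_2 between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def clustering_cost(clusters):
    """Sum over clusters of the distances from each point to its centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(dist(x, c) for x in points)
    return total


# Two small 2-D clusters, chosen so the arithmetic is easy to follow.
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],    # centroid (0, 1), each point at distance 1
    [(10.0, 0.0), (14.0, 0.0)],  # centroid (12, 0), each point at distance 2
]
cost = clustering_cost(clusters)
print(cost)
```

For this toy input the cost is 1 + 1 + 2 + 2 = 6.
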
  4. k-means: points to cluster
  5. k-means: initial centroids
  6. k-means: first iteration assignment
  7. k-means: adjusted centroids
  8. k-means: second iteration assignment
  9. k-means: adjusted centroids
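
The assign-then-adjust loop pictured on the k-means slides can be sketched as follows. This is an illustrative version of the standard Lloyd iteration, not Mahout's code. Picking the initial centroids as random input points and stopping when assignments no longer change are assumptions; the talk lists both as open details.

```python
import math
import random


def dist(x, y):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """Alternate assignment and centroid adjustment until nothing changes."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialization: k random points
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda i: dist(p, centroids[i])) for p in points
        ]
        if new_assignment == assignment:
            break  # assumed stopping rule: assignments are stable
        assignment = new_assignment
        # Adjustment step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # leave a centroid in place if its cluster is empty
                centroids[i] = mean(members)
    return centroids, assignment


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignment = kmeans(points, k=2)
print(centroids, assignment)
```
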
  10. Details
      • Quality
        • How to initialize the centroids
        • When to stop iterating
        • How to deal with outliers
      • Speed
        • Complexity of cluster assignment
  11. k-means as a MapReduce
      • Can’t split k-means loop directly
      • Must express a single k-means step as a MapReduce
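
One way to read "a single k-means step as a MapReduce" is sketched below in plain Python, with the shuffle simulated by a dictionary. The mapper emits the index of the nearest centroid for each point and the reducer averages the points that share an index. The function names are hypothetical and no Hadoop or Mahout API is used. A driver would run one such job per iteration.

```python
import math
from collections import defaultdict


def dist(x, y):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def map_point(point, centroids):
    """Mapper: emit (index of nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def reduce_cluster(values):
    """Reducer: average all points emitted under one centroid index."""
    total = None
    count = 0
    for point, n in values:
        if total is None:
            total = list(point)
        else:
            total = [t + p for t, p in zip(total, point)]
        count += n
    return tuple(t / count for t in total)


def kmeans_step(points, centroids):
    """Simulate one map, shuffle, and reduce pass on a single machine."""
    groups = defaultdict(list)
    for p in points:
        key, value = map_point(p, centroids)
        groups[key].append(value)  # the "shuffle": group values by key
    new_centroids = list(centroids)  # centroids with no points stay put
    for key, values in groups.items():
        new_centroids[key] = reduce_cluster(values)
    return new_centroids


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (14.0, 0.0)]
centroids = [(1.0, 1.0), (9.0, 1.0)]
centroids = kmeans_step(points, centroids)
print(centroids)
```
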
  12. Now for something completely different
      • Attempt to build a “sketch” of the data in one pass
      • O(k log n) intermediate clusters
      • Can fit into memory
      • Ball k-means on the sketch for k final clusters
  13. streaming k-means: first point
  14. streaming k-means: becomes centroid
  15. streaming k-means: second point, far away
  16. streaming k-means: becomes centroid (far away)
  17. streaming k-means: third point, close
  18. streaming k-means: centroid is updated
  19. streaming k-means: close, but becomes a new centroid
  20. streaming k-means
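
The walkthrough on the streaming k-means slides (a far point becomes a centroid, a close point usually updates the nearest centroid but sometimes becomes a new one) can be sketched in one pass as below. The slides do not give the exact rule, so several details here are assumptions: a point becomes a new centroid with probability min(1, d / cutoff), where d is its distance to the nearest centroid, and when the sketch grows past `max_centroids` the cutoff is multiplied by `growth` and the centroids are re-inserted. All names and parameters are hypothetical and this is not Mahout's implementation.

```python
import math
import random


def dist(x, y):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def insert(centroids, vec, weight, cutoff, rng):
    """Add one weighted point to the sketch (a list of [vector, weight])."""
    if not centroids:
        centroids.append([list(vec), weight])  # first point becomes a centroid
        return
    nearest = min(centroids, key=lambda c: dist(vec, c[0]))
    d = dist(vec, nearest[0])
    # Assumed rule: far points always split, close points split only sometimes.
    if rng.random() < min(1.0, d / cutoff):
        centroids.append([list(vec), weight])
    else:
        # Merge: move the nearest centroid to the weighted mean.
        total = nearest[1] + weight
        nearest[0] = [
            (c * nearest[1] + v * weight) / total for c, v in zip(nearest[0], vec)
        ]
        nearest[1] = total


def streaming_kmeans(points, max_centroids, cutoff=1.0, growth=1.5, seed=0):
    """Build a weighted sketch of the data in a single pass."""
    rng = random.Random(seed)
    centroids = []
    for p in points:
        insert(centroids, p, 1.0, cutoff, rng)
        # Too many centroids: loosen the cutoff and collapse the sketch.
        while len(centroids) > max_centroids:
            cutoff *= growth
            old, centroids = centroids, []
            for vec, weight in old:
                insert(centroids, vec, weight, cutoff, rng)
    return centroids


rng = random.Random(42)
points = [
    (rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
    for cx, cy in [(0.0, 0.0), (8.0, 8.0)]
    for _ in range(100)
]
sketch = streaming_kmeans(points, max_centroids=20)
print(len(sketch), sum(w for _, w in sketch))
```

The sketch keeps weights so that a later clustering pass over the centroids can account for how many original points each one represents.
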
  21. The big picture: MapReduce
      • Can cluster all the points with just 1 MapReduce:
        • m mappers run streaming k-means
        • 1 reducer runs ball k-means to get k clusters
  22. Quality
      • Compared the quality on various small-medium UCI datasets
        • iris, seeds, movement, control, power
      • k-means vs streaming k-means + ball k-means is comparable quality-wise
  23. iris randplot-2 [figure: "Confusion matrix run 2", Adjusted Rand Index 1; axes: Ball/Streaming k-means centroids vs. Mahout k-means centroids]
  24. seeds compareplot [figure: "Clustering techniques compared all"; x-axis: Clustering type / overall mean cluster distance (bskm, km); y-axis: Mean cluster distance; labelled values 4.3116448 and 4.22126213333333]
  25. movement compareplot [figure: "km vs bskm"; x-axis: Cluster index/Run; y-axis: Average cluster distance; legend: km distances, bskm distances, km mean, bskm mean; labelled values 1.403378 and 1.50878]
  26. Overall (avg. 5 runs)
      Dataset   Clustering  Avg. Dunn  Avg. DB  Avg. Cost  Avg. ARI
      iris      km          9.161      0.265    124.146    0.905
      iris      bskm        6.454      0.336    117.859    0.905
      seeds     km          7.432      0.453    909.875    0.980
      seeds     bskm        6.886      0.505    916.511    0.980
      movement  km          0.457      1.843    336.456    0.650
      movement  bskm        0.436      2.003    347.078    0.650
      control   km          0.553      1.700    1014313    0.630
      control   bskm        0.753      1.434    1004917    0.630
      power     km          0.107      1.380    73339083   0.605
      power     bskm        1.953      1.080    54422758   0.605
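
For readers unfamiliar with the ARI column, the sketch below computes the adjusted Rand index between two labelings using the standard pair-counting formula. The slides do not say which two labelings were compared or which tool produced the numbers, so this only illustrates the metric. The function name is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index of two labelings of the same items."""
    n = len(labels_a)
    # Pairs of items that are together in both labelings.
    index = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs that are together in each labeling on its own.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # expected index under chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (index - expected) / (max_index - expected)


# Same partition with the label names swapped: perfect agreement.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Partitions that never agree on a pair: worse than chance.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same, crossed)
```

An ARI of 1 means the two clusterings are identical up to renaming the clusters, and values near 0 mean agreement no better than chance.
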
  27. Speed (Threaded)
  28. Speed (MapReduce) [bar chart: KMeans vs StreamingKMeans, 20 clusters, 1 iteration; axis from 0 to 70; annotations: "iteration times comparable", "~2.4x faster"]
  29. Speed (MapReduce) [bar chart: KMeans vs StreamingKMeans, 1000 clusters, 1 iteration; axis from 0 to 300; annotations: "benefits from approximate nearest-neighbor search", "~8x faster"]
  30. My take: Contribute to open source!
  31. My take: Respect language conventions!
  32. My take: Learn math! Statistics and Linear Algebra especially
  33. My take: Write tests! They let you know when you break things
  34. My take: Profile your code! Especially when it’s oddly slow :)
  35. My take: Use source control effectively! Merge often, have multiple branches
  36. My take: Microbenchmarks are tricky in Java. Use frameworks like Caliper
  37. My take: Ask lots of questions! Don’t be afraid of saying something stupid
  38. Thank you! Questions?
      • dfilimon@apache.org
      • dangeorge.filimon@gmail.com
