1.
Clustering data at scale
Dan Filimon, Politehnica University Bucharest
Ted Dunning, MapR Technologies
Saturday, June 1, 13
2.
whoami
• Soon-to-be graduate of Politehnica University of Bucharest
• Apache Mahout committer
• This work is my senior project
• Contact me at
  • dfilimon@apache.org
  • dangeorge.filimon@gmail.com
3.
Agenda
• Data
• Clustering
• k-means
• Improvements
• Large scale
• k-means as a map-reduce
• streaming k-means
• MapReduce & Storm
• Results
Full version at http://goo.gl/n3n8S
4.
Data
• Real-valued vectors (or anything that can be encoded as such)
• Think of rows in a database table
• Can be: documents, web pages, images, videos, users, DNA
5.
The problem
Group n d-dimensional datapoints into k disjoint sets to minimize
$$\sum_{i=1}^{k} \left( \sum_{x_{ij} \in X_i} \mathrm{dist}(x_{ij}, c_i) \right)$$
$X_i$ is the i-th cluster
$c_i$ is the centroid of the i-th cluster
$x_{ij}$ is the j-th point from the i-th cluster
$\mathrm{dist}(x, y) = \lVert x - y \rVert_2$
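To make the objective concrete, here is a minimal sketch that evaluates it for a given grouping. It assumes dist is the plain Euclidean distance and that each centroid is the component-wise mean of its cluster; the names dist, centroid and total_cost are illustrative only.

```python
import math

def dist(x, y):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def total_cost(clusters):
    """Sum, over all clusters, of the distances from each point to its centroid."""
    cost = 0.0
    for points in clusters:
        c = centroid(points)
        cost += sum(dist(x, c) for x in points)
    return cost

clusters = [
    [[0.0, 0.0], [2.0, 0.0]],  # centroid (1, 0), each point at distance 1
    [[5.0, 5.0], [5.0, 9.0]],  # centroid (5, 7), each point at distance 2
]
print(total_cost(clusters))  # 6.0
```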
12.
Details
• Quality
  • How to initialize the centroids
  • When to stop iterating
  • How to deal with outliers
• Speed
  • Complexity of cluster assignment
13.
Quality?
• Clustering is in the eye of the beholder
• Total clustering cost and:
  • compact
  • well-separated
• Dunn Index, Davies-Bouldin Index, etc.
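As an illustration of a "compact and well-separated" score, the sketch below computes one common variant of the Dunn Index: the smallest distance between points in different clusters divided by the largest cluster diameter. Other variants (for example centroid-based ones) exist, and this is not claimed to match the definition used by any particular library.

```python
import math
from itertools import combinations

def dist(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dunn_index(clusters):
    """Minimum between-cluster point distance over maximum cluster diameter."""
    min_separation = min(
        dist(x, y)
        for a, b in combinations(clusters, 2)
        for x in a
        for y in b
    )
    max_diameter = max(
        dist(x, y)
        for points in clusters
        for x, y in combinations(points, 2)
    )
    return min_separation / max_diameter

tight = [[[0.0, 0.0], [1.0, 0.0]], [[10.0, 0.0], [11.0, 0.0]]]
loose = [[[0.0, 0.0], [4.0, 0.0]], [[6.0, 0.0], [10.0, 0.0]]]
print(dunn_index(tight))  # 9.0
print(dunn_index(loose))  # 0.5
```

Higher values mean tighter, better-separated clusters. The sketch needs at least two clusters and at least one cluster with two or more points.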
14.
Centroid initialization
• Important for quick convergence and quality
• Randomly select k points as centroids
• Clustering fails if two centroids are in the same cluster
• k-means++ addresses this
[Figure annotations: "2 seeds here", "1 seed here"]
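A minimal sketch of k-means++ seeding follows. The first seed is picked uniformly at random, and each later seed is drawn with probability proportional to the squared distance to its nearest already-chosen seed, which makes two seeds in the same cluster unlikely. The function name and the rng parameter are illustrative, and the sketch assumes the input holds at least k distinct points.

```python
import random

def squared_dist(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans_pp_seeds(points, k, rng):
    """Choose k seeds from points using k-means++ weighting."""
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        # Weight of each point: squared distance to its closest seed so far.
        weights = [min(squared_dist(p, s) for s in seeds) for p in points]
        seeds.append(rng.choices(points, weights=weights, k=1)[0])
    return seeds

points = [[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]]
print(kmeans_pp_seeds(points, 2, random.Random(0)))
```

A point that is already a seed has weight zero, so it cannot be drawn again.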
15.
Outliers
• Real data is messy
• Outliers can affect k-means centroids
[Figure: "Effect of outliers", scatter plot of y against x with a "better centroid" annotated]
16.
Closest cluster: reducing k
• Avoid computing the distance from a point to every cluster
• Random Projections
  • projection search
  • locality-sensitive hashing search
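One way a projection search can avoid comparing a point against every centroid is sketched below. This is an illustration under assumptions, not Mahout's ProjectionSearch: centroids are sorted by their scalar projection onto a single random unit vector, and only the few centroids whose projections are nearest to the query's projection are checked with a real distance computation. The class name ProjectionIndex and the window parameter are hypothetical, and the result is approximate unless the window covers all centroids.

```python
import bisect
import math
import random

def dist(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def random_unit_vector(d, rng):
    """Unit-length vector with normally distributed components."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

class ProjectionIndex:
    """Approximate nearest-centroid lookup using one random projection."""

    def __init__(self, centroids, rng, window=2):
        self.direction = random_unit_vector(len(centroids[0]), rng)
        self.window = window
        # Centroids sorted by their projection onto the random direction.
        self.entries = sorted((dot(c, self.direction), c) for c in centroids)
        self.keys = [key for key, _ in self.entries]

    def closest(self, point):
        pos = bisect.bisect_left(self.keys, dot(point, self.direction))
        lo = max(0, pos - self.window)
        hi = min(len(self.entries), pos + self.window)
        # Full distances are computed only for the candidates in the window.
        candidates = [c for _, c in self.entries[lo:hi]]
        return min(candidates, key=lambda c: dist(point, c))

centroids = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0], [0.0, 10.0]]
index = ProjectionIndex(centroids, random.Random(1), window=2)
print(index.closest([4.0, 6.0]))
```

A larger window trades speed for accuracy; using several independent projections is another common way to improve recall.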
17.
Random Projections
[Figure: plot with X Axis and Y Axis]
• Unit length vectors with normally distributed components
18.
Closest cluster: reducing d
• Principal Component Analysis
  • compute SVD
• Random Projections
  • multiply data by projection matrix
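The random-projection route to a smaller d can be sketched as follows, assuming the projection matrix is built from unit-length rows with normally distributed components. The helper names and the target dimension are illustrative.

```python
import math
import random

def random_projection_matrix(d, target_d, rng):
    """Build target_d rows, each a unit-length vector with Gaussian components."""
    rows = []
    for _ in range(target_d):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows

def project(point, matrix):
    """Multiply a d-dimensional point by the projection matrix."""
    return [sum(a * b for a, b in zip(row, point)) for row in matrix]

rng = random.Random(0)
matrix = random_projection_matrix(d=5, target_d=2, rng=rng)
print(project([1.0, 2.0, 3.0, 4.0, 5.0], matrix))  # a 2-dimensional vector
```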
19.
k-means as a MapReduce
• Can't split the k-means loop directly
• Must express a single k-means step as a MapReduce
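A single k-means step maps naturally onto the two phases, as this in-memory sketch shows: the map emits each point keyed by its closest centroid, and the reduce averages the points that share a key. The dictionary stands in for the shuffle phase; no Hadoop API is used, and the function names are illustrative.

```python
import math
from collections import defaultdict

def dist(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def map_point(point, centroids):
    """Map: emit (index of the closest centroid, point)."""
    index = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return index, point

def reduce_cluster(points):
    """Reduce: the new centroid is the mean of the points assigned to it."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans_step(points, centroids):
    groups = defaultdict(list)  # stands in for the shuffle phase
    for point in points:
        index, value = map_point(point, centroids)
        groups[index].append(value)
    # A centroid that attracted no points is kept unchanged.
    return [
        reduce_cluster(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]

points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
centroids = [[1.0, 1.0], [8.0, 1.0]]
print(kmeans_step(points, centroids))  # [[0.0, 1.0], [10.0, 1.0]]
```

The outer loop (repeat until the centroids stop moving) stays in the driver, which launches one such job per iteration.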
20.
Now for something completely different
• Attempt to build a "sketch" of the data in one pass
• O(k log n) intermediate clusters
• Can fit into memory
• Ball k-means on the sketch for k final clusters
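The slides do not define ball k-means, so the following is a sketch under an assumed definition: a k-means update in which each centroid is recomputed only from the weighted points that fall inside a ball around it, with the ball radius set to a fraction of the distance to the nearest other centroid. The trim_fraction parameter and its default value are assumptions for illustration.

```python
import math

def dist(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def ball_kmeans_step(weighted_points, centroids, trim_fraction=0.9):
    """One update in which far-away points do not influence their centroid."""
    # Ball radius: a fraction of the distance to the nearest other centroid.
    radii = [
        trim_fraction * min(dist(c, o) for j, o in enumerate(centroids) if j != i)
        for i, c in enumerate(centroids)
    ]
    sums = [[0.0] * len(c) for c in centroids]
    weights = [0.0] * len(centroids)
    for point, w in weighted_points:
        i = min(range(len(centroids)), key=lambda j: dist(point, centroids[j]))
        if dist(point, centroids[i]) <= radii[i]:  # skip points outside the ball
            weights[i] += w
            sums[i] = [s + w * p for s, p in zip(sums[i], point)]
    return [
        [s / weights[i] for s in sums[i]] if weights[i] > 0 else centroids[i]
        for i in range(len(centroids))
    ]

sketch = [
    ([0.0, 0.0], 1.0), ([0.0, 2.0], 1.0), ([0.0, 50.0], 1.0),  # last one is an outlier
    ([10.0, 0.0], 1.0), ([10.0, 2.0], 1.0),
]
print(ball_kmeans_step(sketch, [[0.0, 1.0], [10.0, 1.0]]))  # [[0.0, 1.0], [10.0, 1.0]]
```

In the example the outlier at (0, 50) is assigned to the first centroid but lies outside its ball, so the centroid stays at (0, 1) instead of being dragged upward. The sketch needs at least two centroids.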
21.
streaming k-means
first point
22.
streaming k-means
becomes centroid
23.
streaming k-means
second point
far away
24.
streaming k-means
becomes centroid
far away
25.
streaming k-means
third point
close
26.
streaming k-means
centroid is updated
27.
streaming k-means
close
but becomes a new centroid
29.
streaming k-means
• for each point p, with weight w
  • find the closest centroid to p, call it c and let d be the distance between p and c
  • if an event with probability proportional to d * w / distanceCutoff occurs
    • create a new cluster with p as its centroid
  • else, merge p into c
• if there are too many clusters, increase distanceCutoff and cluster recursively
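The steps above can be sketched in a few lines. This is an illustration rather than the Mahout implementation: the probability is taken as d * w / distanceCutoff capped at 1, merging is a weighted mean, and the growth factor beta applied to the cutoff is an assumed parameter. "Cluster recursively" is read as running the same procedure over the current centroids with the larger cutoff.

```python
import math
import random

def dist(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def merge(centroid, weight, point, w):
    """Weighted mean of a centroid and a new point, plus the combined weight."""
    total = weight + w
    return [(c * weight + p * w) / total for c, p in zip(centroid, point)], total

def streaming_kmeans(weighted_points, max_clusters, cutoff, rng, beta=1.5):
    """One pass over (point, weight) pairs; returns (centroids, final cutoff)."""
    centroids = []  # each entry is [vector, weight]
    for point, w in weighted_points:
        if not centroids:
            centroids.append([list(point), w])  # the first point becomes a centroid
            continue
        i = min(range(len(centroids)), key=lambda j: dist(point, centroids[j][0]))
        d = dist(point, centroids[i][0])
        if rng.random() < d * w / cutoff:
            centroids.append([list(point), w])  # far away: start a new cluster
        else:
            centroids[i] = list(merge(centroids[i][0], centroids[i][1], point, w))
        if len(centroids) > max_clusters:
            # Too many clusters: raise the cutoff and re-cluster the centroids.
            cutoff *= beta
            centroids, cutoff = streaming_kmeans(
                [(c, cw) for c, cw in centroids], max_clusters, cutoff, rng, beta
            )
    return centroids, cutoff

rng = random.Random(0)
stream = [
    ([rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)], 1.0)
    for _ in range(100)
    for cx, cy in [(0, 0), (10, 0), (0, 10)]
]
sketch, final_cutoff = streaming_kmeans(stream, max_clusters=20, cutoff=0.1, rng=rng)
print(len(sketch), sum(w for _, w in sketch))
```

The total weight of the sketch equals the total weight of the input, so a later weighted clustering step sees each intermediate centroid in proportion to the points it absorbed.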
30.
The big picture
MapReduce
• Can cluster all the points with just 1 MapReduce:
  • m mappers run streaming k-means
  • 1 reducer runs ball k-means to get k clusters
31.
The big picture
Storm
• Storm streaming topology
  • streaming k-means bolt
  • Release the sketch when notified (e.g. tick tuples)
• Trident
  • streaming k-means partition aggregator
  • ball k-means aggregator
33.
Quality
• Compared the quality on various small-medium UCI datasets
  • iris, seeds, movement, control, power
• Computed the following quality measures:
  • Dunn Index (higher is better)
  • Davies-Bouldin Index (lower is better)
  • Adjusted Rand Index (higher is better)
  • Total cost (lower is better)
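For reference, the Adjusted Rand Index compares two labelings of the same points by counting pairs that are grouped together in both, corrected for chance. The sketch below uses the standard pair-counting formula; it is not the code used to produce the results, and it is undefined (division by zero) when both labelings put every point in a single cluster.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two labelings of the same points."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))  # contingency table cells
    index = sum(comb(c, 2) for c in joint.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (index - expected) / (max_index - expected)

# Same grouping with the labels renamed: perfect agreement.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

A value of 1 means the two clusterings agree exactly up to label names; values near 0 mean agreement no better than chance.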
34.
iris randplot-2
[Figure: Confusion matrix run 2, Adjusted Rand Index 1. Axes: Ball/Streaming k-means centroids, Mahout k-means centroids]
35.
iris randplot-4
[Figure: Confusion matrix run 4, Adjusted Rand Index 0.546899. Axes: Ball/Streaming k-means centroids, Mahout k-means centroids]
36.
seeds compareplot
[Figure: "Clustering techniques compared all". X axis: Clustering type / overall mean cluster distance (bskm, km). Y axis: Mean cluster distance. Values shown: 4.3116448, 4.22126213333333]
37.
movement compareplot
[Figure: "km vs bskm". X axis: Cluster index/Run. Y axis: Average cluster distance. Legend: km distances, bskm distances, km mean, bskm mean. Values shown: 1.403378, 1.50878]
38.
control randplot-3
[Figure: Confusion matrix run 3, Adjusted Rand Index 0.724136. Axes: Ball/Streaming k-means centroids, Mahout k-means centroids]
39.
power allplot
[Figure: "Clustering techniques compared all". X axis: Clustering type / overall mean cluster distance (bskm, km). Y axis: Mean cluster distance. Values shown: 142.101769766667, 149.3591672]
40.
power compareplot
[Figure: "km vs bskm". X axis: Cluster index/Run. Y axis: Average cluster distance. Legend: km distances, bskm distances, km mean, bskm mean. Values shown: 149.3592, 142.1018]
45.
Code
• Now available in Apache Mahout trunk
• Prototype for Storm: http://github.com/dfilimon/streaming-storm
Clustering algorithms: BallKMeans, StreamingKMeans
Fast nearest-neighbor search: ProjectionSearch
Quality metrics: ClusteringUtils
MapReduce classes: StreamingKMeansMapper, StreamingKMeansReducer
Storm classes: StreamingKMeansBolt
46.
Thank you!
Questions?
dfilimon@apache.org
dangeorge.filimon@gmail.com