Clustering data at scale(full)Dan Filimon, Politehnica University BucharestTed Dunning, MapR TechnologiesSaturday, June 1, 13
whoami• Soon-to-be graduate of Politehnica Univerity of Bucharest• Apache Mahout committer• This work is my senior project• Contact me at• dﬁlimon@apache.org• dangeorge.ﬁlimon@gmail.comSaturday, June 1, 13
Agenda• Data• Clustering• k-means• challenges• (partial) solutions• ball k-means• Large-scale• map-reduce• k-means as a map-reduce• streaming k-means• the big picture• ResultsSaturday, June 1, 13
Data• Real-valued vectors (or anything that can be encoded as such)• Think of rows in a database table• Can be: documents, web pages, images, videos, users, DNASaturday, June 1, 13
ClusteringSaturday, June 1, 13
Rationale• Want to ﬁnd clumps of higher density in the data• Widely used for:• Similarity detection• Accelerating nearest-neighbor search• Data approximation• As a step in other algorithms (k-nearest-neighbor, radial basis functions, etc.)Saturday, June 1, 13
The problemGroup n d-dimensional datapoints into k disjoint sets to minimizeckXc10@Xxij2Xidist(xij, ci)1AXi is the ithclusterci is the centroid of the ithclusterxij is the jthpoint from the ithclusterdist(x, y) = ||x y||2Saturday, June 1, 13
But wait!• NP-hard to solve in the general case• Can only ﬁnd locally optimal solutions• Additional assumptions (sigma-separability) prove why we get good resultsin practiceSaturday, June 1, 13
Centroid-based clustering:k-means• CS folklore since the 50s, also known as Lloyd’s method• Arbitrarily bad examples exist• Very effective in practiceSaturday, June 1, 13
ExampleSaturday, June 1, 13
Exampleinitial centroidsSaturday, June 1, 13
Exampleﬁrst iteration assignmentSaturday, June 1, 13
Exampleadjusted centroidsSaturday, June 1, 13
Examplesecond iterationassignmentSaturday, June 1, 13
Exampleadjusted centroidsSaturday, June 1, 13
k-means pseudocode• initialize k points as centroids• while not done• assign each point to the cluster whose centroid it’s closest to• set the mean of the points in each cluster as the new centroidSaturday, June 1, 13
ChallengesSaturday, June 1, 13
Details• Quality• Need to clarify (from the pseudocode):• How to initialize the centroids• When to stop iterating• How to deal with outliers• Speed• Complexity of cluster assignmentSaturday, June 1, 13
Centroid initialization• Most important factor for quickconvergence and quality clusters• Randomly select k points as centroids• Clustering fails if two centroids are in thesame cluster• Likely needs multiple restarts to get a goodsolution2 seeds here1 seed hereSaturday, June 1, 13
Stop condition• After a ﬁxed number of iterations• Once the clustering converges (one of):• No points are assigned to a different cluster than in the previousiteration• The “quality” of the clustering plateausSaturday, June 1, 13
Quality?• Clustering is in the eye of the beholder• Total clustering cost (this is what we optimize for)• Would also clusters that are:• compact (small intra-cluster distances)• well-separated (large inter-cluster distances)• Dunn Index, Davies-Bouldin Index, other metricsSaturday, June 1, 13
Outliers• Real data is messy: algorithm thatcollected the data has a bug, theperson who curated the datamade a mistake• Outliers can affect k-meanscentroids●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●● ●●●● ●●● ●●● ●●●●●●●●●●●●●● ●●●●●●●●●● ●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●−5 0 5−6−4−20246xy●●Effect of outliersbetter centroidSaturday, June 1, 13
Speed• If, n = # of points to cluster, k = # of clusters, d = dimensionality of thedata• The complexity of the cluster assignment step for a basic implementationis O(n * k * d)• The complexity of the cluster adjustment step is O(k * d)• Can’t get rid of n, but can:• search approximately and reduce the effective size of k• reduce the size of the vectors d (PCA, random projections)Saturday, June 1, 13
(Partial) SolutionsSaturday, June 1, 13
Centroid initialization revisited• Smart initialization: k-means++• Select the ﬁrst 2 centroids with probability proportional to the distancebetween them• While there are fewer than k centroids• Select the point with probability proportional to it’s shortest distanceto an existing centroid• Good quality guarantees, practically never fails• Takes longer, more complicated data structureSaturday, June 1, 13
Outliers revisited• Consider just the points that are “close” to the cluster when computingthe new mean• “Close” typically means a fraction of the distance to the other closestcentroid• This is known as ball k-meansSaturday, June 1, 13
Closest cluster: reducing k• Avoid computing the distance from a point to every cluster• Random Projections• projection search• locality-sensitive hashing searchSaturday, June 1, 13
Projections• Projecting every vector on to a “projection basis” (a randomly sampled,unit length vector) induces a total ordering• Points that are close together in the original space will project closetogether• The converse is not trueSaturday, June 1, 13
Example1.5-2 -1.5 -1 -0.5 0.5 13-3-2-112X AxisYAxisSaturday, June 1, 13
Random Projections• Widely used for nearest neighbor searches• Unit length vectors with normally distributed componentsSaturday, June 1, 13
How to search?• Project all the vectors to search among on to the basis• Keep the results sorted (binary search tree, sorted arrays)• Search for the projected query (log k)• Compute actual distances for elements in a ball around the resulting index• Use more than one projection vector for increased accuracySaturday, June 1, 13
How many projections?10004 10 1000.500.10.20.30.4DimensionProbabilityofﬁndingnearestneighborProbability of ﬁnding nearest neighbor usingsingle projection search against 10,000 pointsSaturday, June 1, 13
Locality Sensitive Hashing• Based on multiple random projections• Use the signs of the projections to set bits in a hash (32 or 64 bits) to1 or 0• The Hamming Distance between two hashes is correlated with the cosinedistance between their corresponding vectorsSaturday, June 1, 13
How to search?• Compute the hashes for each vector to search• Keep the results sorted (binary search tree, sorted arrays)• Search for the projected query (log k)• Compute Hamming Distance for elements in a ball around the resultingindex• If it’s low enough, compute the actual distanceSaturday, June 1, 13
Closest cluster: reducing d• Principal Component Analysis• High quality and known guarantees• Select on the ﬁrst d’ singular vectors of the SVD of the data• Random Projections• Create a d x d’ projection matrix• Multiply the data matrix (n x d) to the right by the projection matrixSaturday, June 1, 13
Large scale clusteringSaturday, June 1, 13
Big Data• Have to use large-scale techniques because there is too much data• doesn’t ﬁt in RAM• doesn’t ﬁt on disk• distributed across many machines in a cluster• Paradigm of choice: MapReduce, implemented in HadoopSaturday, June 1, 13
MapReduce Overview• Awesome illustrations at CS Illustrated Berkeleyhttp://csillustrated.berkeley.edu/illustrations.php• http://csillustrated.berkeley.edu/PDFs/handouts/mapreduce-1-handout.pdf• http://csillustrated.berkeley.edu/PDFs/handouts/mapreduce-2-example-handout.pdfSaturday, June 1, 13
k-means as a MapReduce• Can’t split k-means loop directly:• An iteration depends on the previous one’s results• Must express a single k-means step as a MapReduce• The job driver runs the outer loop submitting a job for every iteration,where the output of the last one is the input of the current one• Centroids at each step are read from and written to ﬁles on diskSaturday, June 1, 13
k-means as a MapReduce• initialize k points as centroids• the driver samples the centroids from disk using reservoir sampling• while not done (multiple map-reduce steps)• assign each point to the cluster whose centroid it’s closest to• each mapper gets a chunk of the datapoints and all the clusters emitting(id, cluster) pairs• set the mean of the points in each cluster as the new centroid• each reducer aggregates its cluster indices and the chunks they’re made up ofSaturday, June 1, 13
Issues• Initial centroid selection requires iterating through all the datapoints (accessingremote chunks, loading into memory, etc.)• Cannot feasibly implement k-means++ initialization• Each map step requires an iteration through the entire data set• Needs multiple MapReduces (and intermediate data is written to disk)• Could implement faster nearest-neighbor searching and ball k-means(but Mahout doesn’t)Saturday, June 1, 13
Now for something completely differentStreaming k-means• Attempt to build a “sketch” of the data (can build a provably good sketch inone pass)• Won’t get k clusters, will get O(k log n) clusters• Can collapse the results down to k with ball k-means• The sketch can be small enough to ﬁt into memory so the clustering can useall the previous tricks• Fast and Accurate k-means for Large Datasets, M. Schindler,A.Wonghttp://books.nips.cc/papers/ﬁles/nips24/NIPS2011_1271.pdfSaturday, June 1, 13
Streaming k-means pseudocode• for each point p, with weight w• ﬁnd the closest centroid to p, call it c and let d be the distance between p and c• if an event with probability proportional to d * w / (distanceCutoff) occurs• create a new cluster with p as its centroid• else, merge p into c• if there are too many clusters, increase distanceCutoff and cluster recursively(“collapse” the centroids)Saturday, June 1, 13
Exampleﬁrst pointSaturday, June 1, 13
Examplebecomes centroidSaturday, June 1, 13
Examplesecond pointfar awaySaturday, June 1, 13
Examplebecomes centroidfar awaySaturday, June 1, 13
Examplethird pointcloseSaturday, June 1, 13
Examplecentroid is updatedSaturday, June 1, 13
Exampleclosebut becomes anew centroidSaturday, June 1, 13
ExampleSaturday, June 1, 13
Discussion• The resulting O(k log n) clusters are a good sketch of the data• The number of clusters in the sketch will not always be the same• For some applications (notably approximation), these clusters are ﬁne• Most important parameter is the initial distanceCutoff (which is data dependent)• Too large: loose clusters, poor quality• Too small (less bad): frequent collapses (and it will be increased anyway)• Rule of thumb: on the order of the minimum distance between two pointsSaturday, June 1, 13
The big picture• Can cluster all the points with just 1 MapReduce:• m mappers run streaming k-means to get O(k log (n/m)) points• 1 reducer runs ball k-means to get k clusters• The reducer can perform another streaming k-means step before ball k-means to collapse the clusters further so they ﬁt into memory• Reducer k-means step supports k-means++, multiple restarts with a train/test split, ball k-means and fast nearest neighbor searchingSaturday, June 1, 13
Results & ImplementationSaturday, June 1, 13
Quality (vs KMeans)• Compared the quality on various small-medium UCI data sets• iris, seeds, movement, control, power• Computed the following quality measures:• Dunn Index (higher is better)• Davies-Bouldin Index (lower is better)• Adjusted Rand Index (higher is better)• Total cost (lower is better)Saturday, June 1, 13
iris randplot-20.5 1.0 1.5 2.0 2.5 3.0 3.50.51.01.52.02.53.03.5051062000038Confusion matrix run 2Adjusted Rand Index 1Ball/Streaming k−means centroidsMahoutk−meanscentroidsSaturday, June 1, 13
iris randplot-40.5 1.0 1.5 2.0 2.5 3.0 3.50.51.01.52.02.53.03.5501000383059Confusion matrix run 4Adjusted Rand Index 0.546899Ball/Streaming k−means centroidsMahoutk−meanscentroidsSaturday, June 1, 13
seeds compareplotbskm km45674.3116448 4.22126213333333Clustering techniques compared allClustering type / overall mean cluster distanceMeanclusterdistanceSaturday, June 1, 13
movement compareplot●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●0 20 40 601.01.52.02.51.403378●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●1.50878●●km distancesbskm distanceskm meanbskm meankm vs bskmCluster index/RunAverageclusterdistanceSaturday, June 1, 13
control randplot-31 2 3 4 5 612345601320000039000000463934019710000001804590310000Confusion matrix run 3Adjusted Rand Index 0.724136Ball/Streaming k−means centroidsMahoutk−meanscentroidsSaturday, June 1, 13
power allplot●●●●bskm km102050100200500142.101769766667 149.3591672Clustering techniques compared allClustering type / overall mean cluster distanceMeanclusterdistanceSaturday, June 1, 13
power compareplot●●●●●●●●●●●●●●●●● ●●●●●● ●●● ● ●●●0 5 10 15 20 25 300200400600800149.3592● ●●●● ●●● ●●●●●●●●●●●●●●●●●●●●●●142.1018●●km distancesbskm distanceskm meanbskm meankm vs bskmCluster index/RunAverageclusterdistanceSaturday, June 1, 13
Speed (MapReduce)• Synthetic data• 500-dimensional, ~120 million points, 6.7 GB• 20 corners of a hypercube as means of a multivariate normal distribution(90%) + uniform noise (10%)• 5-node MapR cluster on Amazon EC2• KMeans with 20 clusters, 10 iterations• StreamingKMeans with 20 clusters, 100 map clustersSaturday, June 1, 13
Speed (MapReduce)017.53552.57020 clusters 1 iterationKMeans StreamingKMeansiteration times comparable~2.4x fasterSaturday, June 1, 13
Speed (MapReduce)0751502253001000 clusters 1 iterationKMeans StreamingKMeansbeneﬁts from approximatenearest-neighbor search~8x fasterSaturday, June 1, 13
Code• Now in trunk, part of Apache Mahout 0.8• Easy to integrate with the rest of the Hadoop ecosystem (i.e. cluster vectors fromApache Lucene/Solr)Clustering algorithms BallKMeans, StreamingKMeansMapReduce classesStreamingKMeansDriver, StreamingKMeansMapper,StreamingKMeansReducer,StreamingKMeansThreadFast nearest-neighbor searchProjectionSearch, FastProjectionSearch,LocalitySensitiveHashSearchQuality metrics ClusteringUtilsSaturday, June 1, 13
Thank you!Questions?dﬁlimon@apache.orgdangeorge.ﬁlimon@gmail.comSaturday, June 1, 13
Be the first to comment