Clustering large-scale data, Buzzwords 2013 (full)
1. Clustering data at scale
(full)
Dan Filimon, Politehnica University Bucharest
Ted Dunning, MapR Technologies
2. whoami
• Soon-to-be graduate of Politehnica University of Bucharest
• Apache Mahout committer
• This work is my senior project
• Contact me at
• dfilimon@apache.org
• dangeorge.filimon@gmail.com
3. Agenda
• Data
• Clustering
• k-means
• challenges
• (partial) solutions
• ball k-means
• Large-scale
• map-reduce
• k-means as a map-reduce
• streaming k-means
• the big picture
• Results
4. Data
• Real-valued vectors (or anything that can be encoded as such)
• Think of rows in a database table
• Can be: documents, web pages, images, videos, users, DNA
6. Rationale
• Want to find clumps of higher density in the data
• Widely used for:
• Similarity detection
• Accelerating nearest-neighbor search
• Data approximation
• As a step in other algorithms (k-nearest-neighbor, radial basis functions, etc.)
7. The problem
Group n d-dimensional datapoints into k disjoint sets to minimize
$$\sum_{i=1}^{k} \Big( \sum_{x_{ij} \in X_i} \mathrm{dist}(x_{ij}, c_i) \Big)$$
where:
• $X_i$ is the $i$-th cluster
• $c_i$ is the centroid of the $i$-th cluster
• $x_{ij}$ is the $j$-th point from the $i$-th cluster
• $\mathrm{dist}(x, y) = \lVert x - y \rVert^2$
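A minimal worked version of this cost in plain Java (the names points, assignments and centroids are illustrative, not Mahout API):

```java
// Total clustering cost: sum, over every cluster, of squared Euclidean
// distances from each point to the centroid of the cluster it belongs to.
public class ClusteringCost {

    static double squaredDistance(double[] x, double[] y) {
        double sum = 0.0;
        for (int d = 0; d < x.length; d++) {
            double diff = x[d] - y[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** points[j] belongs to cluster assignments[j]; centroids[i] is c_i. */
    static double totalCost(double[][] points, int[] assignments, double[][] centroids) {
        double cost = 0.0;
        for (int j = 0; j < points.length; j++) {
            cost += squaredDistance(points[j], centroids[assignments[j]]);
        }
        return cost;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {10, 10}, {10, 12}};
        double[][] centroids = {{0, 1}, {10, 11}};
        int[] assignments = {0, 0, 1, 1};
        System.out.println(totalCost(points, assignments, centroids)); // prints 4.0
    }
}
```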
8. But wait!
• NP-hard to solve in the general case
• Can only find locally optimal solutions
• Under additional assumptions (sigma-separability) one can prove why we get good results in practice
9. Centroid-based clustering:
k-means
• CS folklore since the 50s, also known as Lloyd’s method
• Arbitrarily bad examples exist
• Very effective in practice
16. k-means pseudocode
• initialize k points as centroids
• while not done
• assign each point to the cluster whose centroid it’s closest to
• set the mean of the points in each cluster as the new centroid
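A compact in-memory sketch of that loop (Lloyd's algorithm) in plain Java; the random initialization and the simple stop condition are the basic variants discussed on the next slides, not Mahout's implementation:

```java
import java.util.Arrays;
import java.util.Random;

// Plain Lloyd's algorithm: assign each point to its closest centroid, then
// move each centroid to the mean of its assigned points, until nothing moves.
public class LloydKMeans {

    static double squaredDistance(double[] x, double[] y) {
        double s = 0;
        for (int d = 0; d < x.length; d++) { double t = x[d] - y[d]; s += t * t; }
        return s;
    }

    static double[][] cluster(double[][] points, int k, int maxIterations, long seed) {
        Random rng = new Random(seed);
        int dim = points[0].length;
        // initialize k centroids by picking k distinct points at random
        double[][] centroids = new double[k][];
        int[] chosen = rng.ints(0, points.length).distinct().limit(k).toArray();
        for (int i = 0; i < k; i++) centroids[i] = points[chosen[i]].clone();

        int[] assignments = new int[points.length];
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // assignment step: each point goes to the cluster with the closest centroid
            for (int j = 0; j < points.length; j++) {
                int best = 0;
                double bestDist = squaredDistance(points[j], centroids[0]);
                for (int i = 1; i < k; i++) {
                    double d = squaredDistance(points[j], centroids[i]);
                    if (d < bestDist) { bestDist = d; best = i; }
                }
                if (assignments[j] != best) { assignments[j] = best; changed = true; }
            }
            if (!changed && iter > 0) break;  // converged: no point changed cluster
            // update step: new centroid = mean of the points assigned to it
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int j = 0; j < points.length; j++) {
                counts[assignments[j]]++;
                for (int d = 0; d < dim; d++) sums[assignments[j]][d] += points[j][d];
            }
            for (int i = 0; i < k; i++)
                if (counts[i] > 0)
                    for (int d = 0; d < dim; d++) centroids[i][d] = sums[i][d] / counts[i];
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {1, 0}, {9, 9}, {9, 10}, {10, 9}};
        System.out.println(Arrays.deepToString(cluster(points, 2, 20, 42)));
    }
}
```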
18. Details
• Quality
• Need to clarify (from the pseudocode):
• How to initialize the centroids
• When to stop iterating
• How to deal with outliers
• Speed
• Complexity of cluster assignment
19. Centroid initialization
• Most important factor for quick convergence and quality clusters
• Randomly select k points as centroids
• Clustering fails if two centroids are in the same cluster
• Likely needs multiple restarts to get a good solution
[Figure: random seeding that puts 2 seeds in one true cluster and only 1 seed in another]
20. Stop condition
• After a fixed number of iterations
• Once the clustering converges (one of):
• No points are assigned to a different cluster than in the previous iteration
• The “quality” of the clustering plateaus
21. Quality?
• Clustering is in the eye of the beholder
• Total clustering cost (this is what we optimize for)
• Would also like clusters that are:
• compact (small intra-cluster distances)
• well-separated (large inter-cluster distances)
• Dunn Index, Davies-Bouldin Index, other metrics
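As one example, a small sketch of the Dunn Index in its classic form (smallest distance between points of different clusters divided by the largest within-cluster diameter; centroid-based variants also exist):

```java
import java.util.List;

// Dunn Index: higher values mean compact, well-separated clusters.
public class DunnIndex {

    static double distance(double[] x, double[] y) {
        double s = 0;
        for (int d = 0; d < x.length; d++) { double t = x[d] - y[d]; s += t * t; }
        return Math.sqrt(s);
    }

    /** clusters.get(i) holds the points assigned to cluster i. */
    static double dunnIndex(List<double[][]> clusters) {
        double minSeparation = Double.POSITIVE_INFINITY;
        double maxDiameter = 0;
        for (int i = 0; i < clusters.size(); i++) {
            double[][] a = clusters.get(i);
            // diameter of cluster i: largest intra-cluster distance
            for (int p = 0; p < a.length; p++)
                for (int q = p + 1; q < a.length; q++)
                    maxDiameter = Math.max(maxDiameter, distance(a[p], a[q]));
            // separation: smallest distance to any point of any other cluster
            for (int j = i + 1; j < clusters.size(); j++)
                for (double[] p : a)
                    for (double[] q : clusters.get(j))
                        minSeparation = Math.min(minSeparation, distance(p, q));
        }
        return minSeparation / maxDiameter;
    }

    public static void main(String[] args) {
        double[][] c0 = {{0, 0}, {0, 1}};
        double[][] c1 = {{10, 10}, {10, 11}};
        System.out.println(dunnIndex(List.of(c0, c1))); // well separated, so the index is large
    }
}
```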
22. Outliers
• Real data is messy: the algorithm that collected the data has a bug, the person who curated the data made a mistake
• Outliers can affect k-means centroids
[Figure: “Effect of outliers” scatter plot (x from −5 to 5, y from −6 to 6): a single outlying point drags the computed k-means centroid away from the better centroid of the main cluster]
23. Speed
• If n = # of points to cluster, k = # of clusters, and d = dimensionality of the data:
• The complexity of the cluster assignment step for a basic implementation is O(n * k * d)
• The complexity of the cluster adjustment step is O(k * d)
• Can’t get rid of n, but can:
• search approximately and reduce the effective size of k
• reduce the size of the vectors d (PCA, random projections)
25. Centroid initialization revisited
• Smart initialization: k-means++
• Select the first 2 centroids with probability proportional to the distance between them
• While there are fewer than k centroids:
• Select the next point with probability proportional to its shortest distance to an existing centroid
• Good quality guarantees, practically never fails
• Takes longer, needs a more complicated data structure
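A sketch of k-means++ seeding in plain Java; the standard formulation picks the first centroid uniformly at random and weights each later candidate by its squared distance to the nearest centroid chosen so far:

```java
import java.util.Random;

// k-means++ style seeding: far-away points are much more likely to become
// the next centroid, so seeds rarely land in the same true cluster.
public class KMeansPlusPlus {

    static double squaredDistance(double[] x, double[] y) {
        double s = 0;
        for (int d = 0; d < x.length; d++) { double t = x[d] - y[d]; s += t * t; }
        return s;
    }

    static double[][] seed(double[][] points, int k, Random rng) {
        double[][] centroids = new double[k][];
        centroids[0] = points[rng.nextInt(points.length)].clone();
        double[] minDist = new double[points.length];   // D(x)^2 to the nearest chosen centroid
        java.util.Arrays.fill(minDist, Double.POSITIVE_INFINITY);
        for (int c = 1; c < k; c++) {
            double total = 0;
            for (int j = 0; j < points.length; j++) {
                minDist[j] = Math.min(minDist[j], squaredDistance(points[j], centroids[c - 1]));
                total += minDist[j];
            }
            // sample the next centroid with probability proportional to minDist[j]
            double r = rng.nextDouble() * total;
            int next = 0;
            for (double acc = 0; next < points.length - 1; next++) {
                acc += minDist[next];
                if (acc >= r) break;
            }
            centroids[c] = points[next].clone();
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {9, 9}, {10, 9}, {20, 20}};
        for (double[] c : seed(points, 3, new Random(7)))
            System.out.println(java.util.Arrays.toString(c));
    }
}
```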
26. Outliers revisited
• Consider just the points that are “close” to the cluster when computing the new mean
• “Close” typically means a fraction of the distance to the other closest centroid
• This is known as ball k-means
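A sketch of the ball k-means update: a point only contributes to its centroid's new mean if it is within some fraction of its distance to the second-closest centroid (the trimFraction value below is illustrative, not Mahout's default):

```java
// One ball k-means update step: outliers, which sit far from every centroid,
// are trimmed and do not drag the means around.
public class BallKMeansUpdate {

    static double distance(double[] x, double[] y) {
        double s = 0;
        for (int d = 0; d < x.length; d++) { double t = x[d] - y[d]; s += t * t; }
        return Math.sqrt(s);
    }

    /** Recompute centroids from points, ignoring likely outliers. */
    static double[][] update(double[][] points, double[][] centroids, double trimFraction) {
        int k = centroids.length, dim = centroids[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (double[] p : points) {
            // distances to the closest and second-closest centroid
            int best = -1;
            double bestD = Double.POSITIVE_INFINITY, secondD = Double.POSITIVE_INFINITY;
            for (int i = 0; i < k; i++) {
                double d = distance(p, centroids[i]);
                if (d < bestD) { secondD = bestD; bestD = d; best = i; }
                else if (d < secondD) { secondD = d; }
            }
            // a point only votes for the new mean if it is "close" to its centroid
            if (bestD <= trimFraction * secondD) {
                counts[best]++;
                for (int d = 0; d < dim; d++) sums[best][d] += p[d];
            }
        }
        double[][] updated = new double[k][dim];
        for (int i = 0; i < k; i++)
            for (int d = 0; d < dim; d++)
                updated[i][d] = counts[i] > 0 ? sums[i][d] / counts[i] : centroids[i][d];
        return updated;
    }

    public static void main(String[] args) {
        // the (50, 50) outlier is trimmed and does not drag the second centroid
        double[][] points = {{0, 0}, {0, 1}, {1, 0}, {50, 50}, {10, 10}, {10, 11}};
        double[][] centroids = {{0.3, 0.3}, {10, 10.5}};
        System.out.println(java.util.Arrays.deepToString(update(points, centroids, 0.5)));
    }
}
```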
27. Closest cluster: reducing k
• Avoid computing the distance from a point to every cluster
• Random Projections
• projection search
• locality-sensitive hashing search
28. Projections
• Projecting every vector onto a “projection basis” (a randomly sampled, unit-length vector) induces a total ordering
• Points that are close together in the original space will project close together
• The converse is not true
30. Random Projections
• Widely used for nearest neighbor searches
• Unit length vectors with normally distributed components
31. How to search?
• Project all the vectors to search among onto the basis
• Keep the results sorted (binary search tree, sorted arrays)
• Search for the projected query (log k)
• Compute actual distances for elements in a ball around the resulting index
• Use more than one projection vector for increased accuracy
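A single-projection sketch of that search in plain Java; Mahout's ProjectionSearch combines several projection vectors, this uses one for brevity:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

// Projection search: sort the vectors by their projection onto a random unit
// vector, binary-search the query's projection, then compute true distances
// only for a small window ("ball") of candidates around that position.
public class ProjectionSearchSketch {

    final double[] basis;          // random unit-length projection vector
    final double[][] vectors;      // the vectors we search among (e.g. centroids)
    final Integer[] order;         // indices of vectors, sorted by projection value
    final double[] projections;    // projection of each vector onto the basis

    ProjectionSearchSketch(double[][] vectors, int dim, Random rng) {
        this.vectors = vectors;
        basis = new double[dim];
        double norm = 0;
        for (int d = 0; d < dim; d++) { basis[d] = rng.nextGaussian(); norm += basis[d] * basis[d]; }
        norm = Math.sqrt(norm);
        for (int d = 0; d < dim; d++) basis[d] /= norm;
        projections = new double[vectors.length];
        order = new Integer[vectors.length];
        for (int i = 0; i < vectors.length; i++) { projections[i] = dot(vectors[i], basis); order[i] = i; }
        Arrays.sort(order, Comparator.comparingDouble(i -> projections[i]));
    }

    static double dot(double[] x, double[] y) {
        double s = 0;
        for (int d = 0; d < x.length; d++) s += x[d] * y[d];
        return s;
    }

    static double squaredDistance(double[] x, double[] y) {
        double s = 0;
        for (int d = 0; d < x.length; d++) { double t = x[d] - y[d]; s += t * t; }
        return s;
    }

    /** Approximate nearest neighbor: only about 2 * ball candidates get an exact distance. */
    int search(double[] query, int ball) {
        double q = dot(query, basis);
        // binary search for where the query's projection lands in the sorted order
        int lo = 0, hi = order.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (projections[order[mid]] < q) lo = mid + 1; else hi = mid;
        }
        int bestIndex = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = Math.max(0, lo - ball); i < Math.min(order.length, lo + ball); i++) {
            double d = squaredDistance(query, vectors[order[i]]);
            if (d < bestDist) { bestDist = d; bestIndex = order[i]; }
        }
        return bestIndex;
    }

    public static void main(String[] args) {
        Random rng = new Random(1);
        double[][] centroids = new double[100][10];
        for (double[] c : centroids) for (int d = 0; d < c.length; d++) c[d] = rng.nextGaussian();
        ProjectionSearchSketch index = new ProjectionSearchSketch(centroids, 10, rng);
        System.out.println(index.search(centroids[42], 8)); // very likely prints 42
    }
}
```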
32. How many projections?
[Figure: “Probability of finding nearest neighbor using single projection search against 10,000 points”; x axis: Dimension (roughly 4 to 1000), y axis: probability of finding the nearest neighbor (0 to 0.5)]
33. Locality Sensitive Hashing
• Based on multiple random projections
• Use the signs of the projections to set bits in a hash (32 or 64 bits) to 1 or 0
• The Hamming Distance between two hashes is correlated with the cosine distance between their corresponding vectors
35. How to search?
• Compute the hashes for each vector to search
• Keep the results sorted (binary search tree, sorted arrays)
• Search for the projected query (log k)
• Compute Hamming Distance for elements in a ball around the resulting index
• If it’s low enough, compute the actual distance
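A sketch of the hashing and filtering idea (64 sign bits packed into a long, Hamming distance as a cheap pre-filter); not the Mahout LocalitySensitiveHashSearch implementation, just the mechanism:

```java
import java.util.Random;

// LSH via random hyperplanes: one sign bit per random projection, packed into
// a 64-bit hash. Hamming distance between hashes is a cheap filter; only
// candidates that pass it get an exact distance computation.
public class LshSketch {

    final double[][] projections;   // 64 random projection vectors

    LshSketch(int dim, Random rng) {
        projections = new double[64][dim];
        for (double[] p : projections)
            for (int d = 0; d < dim; d++) p[d] = rng.nextGaussian();
    }

    /** One bit per projection: 1 if the dot product is positive, else 0. */
    long hash(double[] v) {
        long h = 0;
        for (int b = 0; b < 64; b++) {
            double dot = 0;
            for (int d = 0; d < v.length; d++) dot += projections[b][d] * v[d];
            if (dot > 0) h |= 1L << b;
        }
        return h;
    }

    /** Number of differing bits; correlated with the angle between the vectors. */
    static int hammingDistance(long h1, long h2) {
        return Long.bitCount(h1 ^ h2);
    }

    public static void main(String[] args) {
        Random rng = new Random(3);
        LshSketch lsh = new LshSketch(20, rng);
        double[] a = new double[20], close = new double[20], far = new double[20];
        for (int d = 0; d < 20; d++) {
            a[d] = rng.nextGaussian();
            close[d] = a[d] + 0.01 * rng.nextGaussian();  // nearly the same direction
            far[d] = rng.nextGaussian();                   // unrelated direction
        }
        System.out.println(hammingDistance(lsh.hash(a), lsh.hash(close))); // small
        System.out.println(hammingDistance(lsh.hash(a), lsh.hash(far)));   // much larger
    }
}
```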
36. Closest cluster: reducing d
• Principal Component Analysis
• High quality and known guarantees
• Project onto the first d' singular vectors of the SVD of the data
• Random Projections
• Create a d x d’ projection matrix
• Multiply the data matrix (n x d) to the right by the projection matrix
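A sketch of the random-projection route: a d x d' matrix with Gaussian entries applied on the right of the data matrix (the 1/sqrt(d') scaling is one common choice):

```java
import java.util.Random;

// Dimensionality reduction by random projection: the n x d data matrix times
// a d x d' Gaussian matrix gives an n x d' matrix that approximately
// preserves pairwise distances (Johnson-Lindenstrauss).
public class RandomProjection {

    static double[][] projectionMatrix(int d, int dPrime, Random rng) {
        double[][] m = new double[d][dPrime];
        double scale = 1.0 / Math.sqrt(dPrime);   // one common scaling choice
        for (int i = 0; i < d; i++)
            for (int j = 0; j < dPrime; j++)
                m[i][j] = scale * rng.nextGaussian();
        return m;
    }

    /** (n x d) data times (d x d') projection = (n x d') reduced data. */
    static double[][] project(double[][] data, double[][] projection) {
        int n = data.length, d = projection.length, dPrime = projection[0].length;
        double[][] reduced = new double[n][dPrime];
        for (int row = 0; row < n; row++)
            for (int i = 0; i < d; i++)
                for (int j = 0; j < dPrime; j++)
                    reduced[row][j] += data[row][i] * projection[i][j];
        return reduced;
    }

    public static void main(String[] args) {
        Random rng = new Random(11);
        double[][] data = new double[5][500];
        for (double[] row : data) for (int i = 0; i < row.length; i++) row[i] = rng.nextGaussian();
        double[][] reduced = project(data, projectionMatrix(500, 20, rng));
        System.out.println(reduced.length + " x " + reduced[0].length); // 5 x 20
    }
}
```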
38. Big Data
• Have to use large-scale techniques because there is too much data
• doesn’t fit in RAM
• doesn’t fit on disk
• distributed across many machines in a cluster
• Paradigm of choice: MapReduce, implemented in Hadoop
39. MapReduce Overview
• Awesome illustrations at CS Illustrated Berkeley: http://csillustrated.berkeley.edu/illustrations.php
• http://csillustrated.berkeley.edu/PDFs/handouts/mapreduce-1-handout.pdf
• http://csillustrated.berkeley.edu/PDFs/handouts/mapreduce-2-example-handout.pdf
40. k-means as a MapReduce
• Can’t split k-means loop directly:
• An iteration depends on the previous one’s results
• Must express a single k-means step as a MapReduce
• The job driver runs the outer loop, submitting a job for every iteration, where the output of the last one is the input of the current one
• Centroids at each step are read from and written to files on disk
41. k-means as a MapReduce
• initialize k points as centroids
• the driver samples the centroids from disk using reservoir sampling
• while not done (multiple map-reduce steps)
• assign each point to the cluster whose centroid it’s closest to
• each mapper gets a chunk of the datapoints and all the clusters, emitting (id, cluster) pairs
• set the mean of the points in each cluster as the new centroid
• each reducer aggregates its cluster indices and the chunks they’re made up of
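The same decomposition sketched in plain Java, without Hadoop: each "mapper" call handles one chunk of points and emits per-cluster partial sums and counts, and the "reducer" merges them into new centroids. The class and method names are illustrative, not the Mahout driver classes:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One k-means iteration in map/reduce shape: mappers only need their chunk
// plus the current centroids; the reducer never sees the raw points, only
// (clusterId, partial sum, count) records.
public class KMeansIterationAsMapReduce {

    static class Partial {            // value emitted by a mapper for one cluster
        double[] sum; long count;
        Partial(int dim) { sum = new double[dim]; }
    }

    static double squaredDistance(double[] x, double[] y) {
        double s = 0;
        for (int d = 0; d < x.length; d++) { double t = x[d] - y[d]; s += t * t; }
        return s;
    }

    /** Map phase: one chunk of points in, a (clusterId -> partial sum) table out. */
    static Map<Integer, Partial> map(double[][] chunk, double[][] centroids) {
        Map<Integer, Partial> emitted = new HashMap<>();
        for (double[] p : chunk) {
            int best = 0;
            for (int i = 1; i < centroids.length; i++)
                if (squaredDistance(p, centroids[i]) < squaredDistance(p, centroids[best])) best = i;
            Partial partial = emitted.computeIfAbsent(best, id -> new Partial(p.length));
            partial.count++;
            for (int d = 0; d < p.length; d++) partial.sum[d] += p[d];
        }
        return emitted;
    }

    /** Reduce phase: merge all partials for each cluster and compute the mean. */
    static double[][] reduce(List<Map<Integer, Partial>> mapperOutputs, double[][] centroids) {
        double[][] updated = new double[centroids.length][];
        for (int i = 0; i < centroids.length; i++) updated[i] = centroids[i].clone();
        for (int i = 0; i < centroids.length; i++) {
            double[] sum = new double[centroids[i].length];
            long count = 0;
            for (Map<Integer, Partial> output : mapperOutputs) {
                Partial p = output.get(i);
                if (p != null) { count += p.count; for (int d = 0; d < sum.length; d++) sum[d] += p.sum[d]; }
            }
            if (count > 0) for (int d = 0; d < sum.length; d++) updated[i][d] = sum[d] / count;
        }
        return updated;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0, 0}, {10, 10}};
        double[][] chunk1 = {{0, 1}, {1, 0}};
        double[][] chunk2 = {{9, 9}, {11, 11}};
        List<Map<Integer, Partial>> outputs = new ArrayList<>();
        outputs.add(map(chunk1, centroids));   // runs on mapper 1
        outputs.add(map(chunk2, centroids));   // runs on mapper 2
        double[][] updated = reduce(outputs, centroids);
        System.out.println(java.util.Arrays.deepToString(updated)); // [[0.5, 0.5], [10.0, 10.0]]
    }
}
```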
42. Issues
• Initial centroid selection requires iterating through all the datapoints (accessing remote chunks, loading into memory, etc.)
• Cannot feasibly implement k-means++ initialization
• Each map step requires an iteration through the entire data set
• Needs multiple MapReduces (and intermediate data is written to disk)
• Could implement faster nearest-neighbor searching and ball k-means (but Mahout doesn’t)
43. Now for something completely different
Streaming k-means
• Attempt to build a “sketch” of the data (can build a provably good sketch in one pass)
• Won’t get k clusters, will get O(k log n) clusters
• Can collapse the results down to k with ball k-means
• The sketch can be small enough to fit into memory so the clustering can use all the previous tricks
• Fast and Accurate k-means for Large Datasets, M. Shindler, A. Wong, A. Meyerson: http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf
44. Streaming k-means pseudocode
• for each point p, with weight w
• find the closest centroid to p, call it c and let d be the distance between p and c
• if an event with probability proportional to d * w / (distanceCutoff) occurs
• create a new cluster with p as its centroid
• else, merge p into c
• if there are too many clusters, increase distanceCutoff and cluster recursively (“collapse” the centroids)
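A minimal in-memory sketch of this loop (not Mahout's StreamingKMeans); maxClusters and the cutoff growth factor are illustrative choices:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Streaming k-means step: points either merge into the closest weighted
// centroid or, with probability proportional to d * w / distanceCutoff,
// start a new centroid; when the sketch grows too large, the cutoff is raised
// and the centroids themselves are re-clustered ("collapsed").
public class StreamingKMeansSketch {

    static class Centroid {
        double[] mean; double weight;
        Centroid(double[] m, double w) { mean = m.clone(); weight = w; }
    }

    final List<Centroid> centroids = new ArrayList<>();
    final Random rng = new Random(42);
    double distanceCutoff;
    final int maxClusters;

    StreamingKMeansSketch(double initialCutoff, int maxClusters) {
        this.distanceCutoff = initialCutoff;
        this.maxClusters = maxClusters;
    }

    static double distance(double[] x, double[] y) {
        double s = 0;
        for (int d = 0; d < x.length; d++) { double t = x[d] - y[d]; s += t * t; }
        return Math.sqrt(s);
    }

    void add(double[] point, double weight) {
        if (centroids.isEmpty()) { centroids.add(new Centroid(point, weight)); return; }
        // find the closest existing centroid
        Centroid closest = centroids.get(0);
        double best = distance(point, closest.mean);
        for (Centroid c : centroids) {
            double d = distance(point, c.mean);
            if (d < best) { best = d; closest = c; }
        }
        // with probability proportional to d * w / distanceCutoff, start a new cluster
        if (rng.nextDouble() < best * weight / distanceCutoff) {
            centroids.add(new Centroid(point, weight));
        } else {
            // otherwise merge the point into the closest centroid (weighted mean)
            double total = closest.weight + weight;
            for (int d = 0; d < point.length; d++)
                closest.mean[d] = (closest.mean[d] * closest.weight + point[d] * weight) / total;
            closest.weight = total;
        }
        // too many clusters: raise the cutoff and re-cluster the centroids themselves
        if (centroids.size() > maxClusters) collapse();
    }

    void collapse() {
        distanceCutoff *= 1.5;                         // illustrative growth factor
        List<Centroid> old = new ArrayList<>(centroids);
        centroids.clear();
        for (Centroid c : old) add(c.mean, c.weight);  // recursively cluster the sketch
    }

    public static void main(String[] args) {
        StreamingKMeansSketch sketch = new StreamingKMeansSketch(0.5, 10);
        Random data = new Random(1);
        for (int i = 0; i < 10000; i++) {
            double center = (i % 2 == 0) ? 0 : 10;     // two well-separated clumps
            sketch.add(new double[]{center + data.nextGaussian(), center + data.nextGaussian()}, 1.0);
        }
        System.out.println(sketch.centroids.size() + " centroids in the sketch");
    }
}
```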
53. Discussion
• The resulting O(k log n) clusters are a good sketch of the data
• The number of clusters in the sketch will not always be the same
• For some applications (notably approximation), these clusters are fine
• Most important parameter is the initial distanceCutoff (which is data dependent)
• Too large: loose clusters, poor quality
• Too small (less bad): frequent collapses (and it will be increased anyway)
• Rule of thumb: on the order of the minimum distance between two points
54. The big picture
• Can cluster all the points with just 1 MapReduce:
• m mappers run streaming k-means to get O(k log (n/m)) points
• 1 reducer runs ball k-means to get k clusters
• The reducer can perform another streaming k-means step before ball k-means to collapse the clusters further so they fit into memory
• The reducer’s k-means step supports k-means++, multiple restarts with a train/test split, ball k-means and fast nearest-neighbor searching
56. Quality (vs KMeans)
• Compared the quality on various small-medium UCI data sets
• iris, seeds, movement, control, power
• Computed the following quality measures:
• Dunn Index (higher is better)
• Davies-Bouldin Index (lower is better)
• Adjusted Rand Index (higher is better)
• Total cost (lower is better)
59. seeds compare plot
[Figure: “Clustering techniques compared (all)” for the seeds data set; overall mean cluster distance by clustering type: bskm ≈ 4.31, km ≈ 4.22]
66. Speed (MapReduce)
• Synthetic data
• 500-dimensional, ~120 million points, 6.7 GB
• 20 corners of a hypercube as means of a multivariate normal distribution (90%) + uniform noise (10%)
• 5-node MapR cluster on Amazon EC2
• KMeans with 20 clusters, 10 iterations
• StreamingKMeans with 20 clusters, 100 map clusters
69. Code
• Now in trunk, part of Apache Mahout 0.8
• Easy to integrate with the rest of the Hadoop ecosystem (e.g. cluster vectors from Apache Lucene/Solr)
• Clustering algorithms: BallKMeans, StreamingKMeans
• MapReduce classes: StreamingKMeansDriver, StreamingKMeansMapper, StreamingKMeansReducer, StreamingKMeansThread
• Fast nearest-neighbor search: ProjectionSearch, FastProjectionSearch, LocalitySensitiveHashSearch
• Quality metrics: ClusteringUtils