Clustering large-scale data, Buzzwords 2013 (full)
1. Clustering data at scale
(full)
Dan Filimon, Politehnica University Bucharest
Ted Dunning, MapR Technologies
2. whoami
• Soon-to-be graduate of Politehnica University of Bucharest
• Apache Mahout committer
• This work is my senior project
• Contact me at
• dfilimon@apache.org
• dangeorge.filimon@gmail.com
3. Agenda
• Data
• Clustering
• k-means
• challenges
• (partial) solutions
• ball k-means
• Large-scale
• map-reduce
• k-means as a map-reduce
• streaming k-means
• the big picture
• Results
4. Data
• Real-valued vectors (or anything that can be encoded as such)
• Think of rows in a database table
• Can be: documents, web pages, images, videos, users, DNA
6. Rationale
• Want to find clumps of higher density in the data
• Widely used for:
• Similarity detection
• Accelerating nearest-neighbor search
• Data approximation
• As a step in other algorithms (k-nearest-neighbor, radial basis functions, etc.)
7. The problem
Group n d-dimensional datapoints into k disjoint sets to minimize
$$\sum_{i=1}^{k} \Big( \sum_{x_{ij} \in X_i} \mathrm{dist}(x_{ij}, c_i) \Big)$$
where:
• $X_i$ is the $i$-th cluster
• $c_i$ is the centroid of the $i$-th cluster
• $x_{ij}$ is the $j$-th point from the $i$-th cluster
• $\mathrm{dist}(x, y) = \lVert x - y \rVert^2$
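A minimal worked version of this cost in plain Java (the names points, assignments and centroids are illustrative, not Mahout API):

```java
// Total clustering cost: sum, over every cluster, of squared Euclidean
// distances from each point to the centroid of the cluster it belongs to.
public class ClusteringCost {

    static double squaredDistance(double[] x, double[] y) {
        double sum = 0.0;
        for (int d = 0; d < x.length; d++) {
            double diff = x[d] - y[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** points[j] belongs to cluster assignments[j]; centroids[i] is c_i. */
    static double totalCost(double[][] points, int[] assignments, double[][] centroids) {
        double cost = 0.0;
        for (int j = 0; j < points.length; j++) {
            cost += squaredDistance(points[j], centroids[assignments[j]]);
        }
        return cost;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 2}, {10, 10}, {10, 12}};
        double[][] centroids = {{0, 1}, {10, 11}};
        int[] assignments = {0, 0, 1, 1};
        System.out.println(totalCost(points, assignments, centroids)); // prints 4.0
    }
}
```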
8. But wait!
• NP-hard to solve in the general case
• Can only find locally optimal solutions
• Under additional assumptions (sigma-separability) one can prove why we get good results in practice
9. Centroid-based clustering:
k-means
• CS folklore since the 50s, also known as Lloyd’s method
• Arbitrarily bad examples exist
• Very effective in practice
16. k-means pseudocode
• initialize k points as centroids
• while not done
• assign each point to the cluster whose centroid it’s closest to
• set the mean of the points in each cluster as the new centroid
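A compact in-memory sketch of that loop (Lloyd's algorithm) in plain Java; the random initialization and the simple stop condition are the basic variants discussed on the next slides, not Mahout's implementation:

```java
import java.util.Arrays;
import java.util.Random;

// Plain Lloyd's algorithm: assign each point to its closest centroid, then
// move each centroid to the mean of its assigned points, until nothing moves.
public class LloydKMeans {

    static double squaredDistance(double[] x, double[] y) {
        double s = 0;
        for (int d = 0; d < x.length; d++) { double t = x[d] - y[d]; s += t * t; }
        return s;
    }

    static double[][] cluster(double[][] points, int k, int maxIterations, long seed) {
        Random rng = new Random(seed);
        int dim = points[0].length;
        // initialize k centroids by picking k distinct points at random
        double[][] centroids = new double[k][];
        int[] chosen = rng.ints(0, points.length).distinct().limit(k).toArray();
        for (int i = 0; i < k; i++) centroids[i] = points[chosen[i]].clone();

        int[] assignments = new int[points.length];
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // assignment step: each point goes to the cluster with the closest centroid
            for (int j = 0; j < points.length; j++) {
                int best = 0;
                double bestDist = squaredDistance(points[j], centroids[0]);
                for (int i = 1; i < k; i++) {
                    double d = squaredDistance(points[j], centroids[i]);
                    if (d < bestDist) { bestDist = d; best = i; }
                }
                if (assignments[j] != best) { assignments[j] = best; changed = true; }
            }
            if (!changed && iter > 0) break;  // converged: no point changed cluster
            // update step: new centroid = mean of the points assigned to it
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int j = 0; j < points.length; j++) {
                counts[assignments[j]]++;
                for (int d = 0; d < dim; d++) sums[assignments[j]][d] += points[j][d];
            }
            for (int i = 0; i < k; i++)
                if (counts[i] > 0)
                    for (int d = 0; d < dim; d++) centroids[i][d] = sums[i][d] / counts[i];
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {1, 0}, {9, 9}, {9, 10}, {10, 9}};
        System.out.println(Arrays.deepToString(cluster(points, 2, 20, 42)));
    }
}
```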
18. Details
• Quality
• Need to clarify (from the pseudocode):
• How to initialize the centroids
• When to stop iterating
• How to deal with outliers
• Speed
• Complexity of cluster assignment
19. Centroid initialization
• Most important factor for quick convergence and quality clusters
• Randomly select k points as centroids
• Clustering fails if two centroids are in the same cluster
• Likely needs multiple restarts to get a good solution
[Figure: random seeding that puts 2 seeds in one true cluster and only 1 seed in another]
20. Stop condition
• After a fixed number of iterations
• Once the clustering converges (one of):
• No points are assigned to a different cluster than in the previous iteration
• The “quality” of the clustering plateaus
21. Quality?
• Clustering is in the eye of the beholder
• Total clustering cost (this is what we optimize for)
• Would also like clusters that are:
• compact (small intra-cluster distances)
• well-separated (large inter-cluster distances)
• Dunn Index, Davies-Bouldin Index, other metrics
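As one example, a small sketch of the Dunn Index in its classic form (smallest distance between points of different clusters divided by the largest within-cluster diameter; centroid-based variants also exist):

```java
import java.util.List;

// Dunn Index: higher values mean compact, well-separated clusters.
public class DunnIndex {

    static double distance(double[] x, double[] y) {
        double s = 0;
        for (int d = 0; d < x.length; d++) { double t = x[d] - y[d]; s += t * t; }
        return Math.sqrt(s);
    }

    /** clusters.get(i) holds the points assigned to cluster i. */
    static double dunnIndex(List<double[][]> clusters) {
        double minSeparation = Double.POSITIVE_INFINITY;
        double maxDiameter = 0;
        for (int i = 0; i < clusters.size(); i++) {
            double[][] a = clusters.get(i);
            // diameter of cluster i: largest intra-cluster distance
            for (int p = 0; p < a.length; p++)
                for (int q = p + 1; q < a.length; q++)
                    maxDiameter = Math.max(maxDiameter, distance(a[p], a[q]));
            // separation: smallest distance to any point of any other cluster
            for (int j = i + 1; j < clusters.size(); j++)
                for (double[] p : a)
                    for (double[] q : clusters.get(j))
                        minSeparation = Math.min(minSeparation, distance(p, q));
        }
        return minSeparation / maxDiameter;
    }

    public static void main(String[] args) {
        double[][] c0 = {{0, 0}, {0, 1}};
        double[][] c1 = {{10, 10}, {10, 11}};
        System.out.println(dunnIndex(List.of(c0, c1))); // well separated, so the index is large
    }
}
```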
22. Outliers
• Real data is messy: the algorithm that collected the data has a bug, the person who curated the data made a mistake
• Outliers can affect k-means centroids
[Figure: “Effect of outliers” scatter plot (x from −5 to 5, y from −6 to 6): a single outlying point drags the computed k-means centroid away from the better centroid of the main cluster]
23. Speed
• If n = # of points to cluster, k = # of clusters, and d = dimensionality of the data:
• The complexity of the cluster assignment step for a basic implementation is O(n * k * d)
• The complexity of the cluster adjustment step is O(k * d)
• Can’t get rid of n, but can:
• search approximately and reduce the effective size of k
• reduce the size of the vectors d (PCA, random projections)
25. Centroid initialization revisited
• Smart initialization: k-means++
• Select the first 2 centroids with probability proportional to the distance between them
• While there are fewer than k centroids:
• Select the next point with probability proportional to its shortest distance to an existing centroid
• Good quality guarantees, practically never fails
• Takes longer, needs a more complicated data structure
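A sketch of k-means++ seeding in plain Java; the standard formulation picks the first centroid uniformly at random and weights each later candidate by its squared distance to the nearest centroid chosen so far:

```java
import java.util.Random;

// k-means++ style seeding: far-away points are much more likely to become
// the next centroid, so seeds rarely land in the same true cluster.
public class KMeansPlusPlus {

    static double squaredDistance(double[] x, double[] y) {
        double s = 0;
        for (int d = 0; d < x.length; d++) { double t = x[d] - y[d]; s += t * t; }
        return s;
    }

    static double[][] seed(double[][] points, int k, Random rng) {
        double[][] centroids = new double[k][];
        centroids[0] = points[rng.nextInt(points.length)].clone();
        double[] minDist = new double[points.length];   // D(x)^2 to the nearest chosen centroid
        java.util.Arrays.fill(minDist, Double.POSITIVE_INFINITY);
        for (int c = 1; c < k; c++) {
            double total = 0;
            for (int j = 0; j < points.length; j++) {
                minDist[j] = Math.min(minDist[j], squaredDistance(points[j], centroids[c - 1]));
                total += minDist[j];
            }
            // sample the next centroid with probability proportional to minDist[j]
            double r = rng.nextDouble() * total;
            int next = 0;
            for (double acc = 0; next < points.length - 1; next++) {
                acc += minDist[next];
                if (acc >= r) break;
            }
            centroids[c] = points[next].clone();
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {9, 9}, {10, 9}, {20, 20}};
        for (double[] c : seed(points, 3, new Random(7)))
            System.out.println(java.util.Arrays.toString(c));
    }
}
```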
26. Outliers revisited
• Consider just the points that are “close” to the cluster when computing the new mean
• “Close” typically means a fraction of the distance to the other closest centroid
• This is known as ball k-means
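A sketch of the ball k-means update: a point only contributes to its centroid's new mean if it is within some fraction of its distance to the second-closest centroid (the trimFraction value below is illustrative, not Mahout's default):

```java
// One ball k-means update step: outliers, which sit far from every centroid,
// are trimmed and do not drag the means around.
public class BallKMeansUpdate {

    static double distance(double[] x, double[] y) {
        double s = 0;
        for (int d = 0; d < x.length; d++) { double t = x[d] - y[d]; s += t * t; }
        return Math.sqrt(s);
    }

    /** Recompute centroids from points, ignoring likely outliers. */
    static double[][] update(double[][] points, double[][] centroids, double trimFraction) {
        int k = centroids.length, dim = centroids[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (double[] p : points) {
            // distances to the closest and second-closest centroid
            int best = -1;
            double bestD = Double.POSITIVE_INFINITY, secondD = Double.POSITIVE_INFINITY;
            for (int i = 0; i < k; i++) {
                double d = distance(p, centroids[i]);
                if (d < bestD) { secondD = bestD; bestD = d; best = i; }
                else if (d < secondD) { secondD = d; }
            }
            // a point only votes for the new mean if it is "close" to its centroid
            if (bestD <= trimFraction * secondD) {
                counts[best]++;
                for (int d = 0; d < dim; d++) sums[best][d] += p[d];
            }
        }
        double[][] updated = new double[k][dim];
        for (int i = 0; i < k; i++)
            for (int d = 0; d < dim; d++)
                updated[i][d] = counts[i] > 0 ? sums[i][d] / counts[i] : centroids[i][d];
        return updated;
    }

    public static void main(String[] args) {
        // the (50, 50) outlier is trimmed and does not drag the second centroid
        double[][] points = {{0, 0}, {0, 1}, {1, 0}, {50, 50}, {10, 10}, {10, 11}};
        double[][] centroids = {{0.3, 0.3}, {10, 10.5}};
        System.out.println(java.util.Arrays.deepToString(update(points, centroids, 0.5)));
    }
}
```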
27. Closest cluster: reducing k
• Avoid computing the distance from a point to every cluster
• Random Projections
• projection search
• locality-sensitive hashing search
28. Projections
• Projecting every vector onto a “projection basis” (a randomly sampled, unit-length vector) induces a total ordering
• Points that are close together in the original space will project close together
• The converse is not true
30. Random Projections
• Widely used for nearest neighbor searches
• Unit length vectors with normally distributed components
31. How to search?
• Project all the vectors to search among onto the basis
• Keep the results sorted (binary search tree, sorted arrays)
• Search for the projected query (log k)
• Compute actual distances for elements in a ball around the resulting index
• Use more than one projection vector for increased accuracy
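A single-projection sketch of that search in plain Java; Mahout's ProjectionSearch combines several projection vectors, this uses one for brevity:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

// Projection search: sort the vectors by their projection onto a random unit
// vector, binary-search the query's projection, then compute true distances
// only for a small window ("ball") of candidates around that position.
public class ProjectionSearchSketch {

    final double[] basis;          // random unit-length projection vector
    final double[][] vectors;      // the vectors we search among (e.g. centroids)
    final Integer[] order;         // indices of vectors, sorted by projection value
    final double[] projections;    // projection of each vector onto the basis

    ProjectionSearchSketch(double[][] vectors, int dim, Random rng) {
        this.vectors = vectors;
        basis = new double[dim];
        double norm = 0;
        for (int d = 0; d < dim; d++) { basis[d] = rng.nextGaussian(); norm += basis[d] * basis[d]; }
        norm = Math.sqrt(norm);
        for (int d = 0; d < dim; d++) basis[d] /= norm;
        projections = new double[vectors.length];
        order = new Integer[vectors.length];
        for (int i = 0; i < vectors.length; i++) { projections[i] = dot(vectors[i], basis); order[i] = i; }
        Arrays.sort(order, Comparator.comparingDouble(i -> projections[i]));
    }

    static double dot(double[] x, double[] y) {
        double s = 0;
        for (int d = 0; d < x.length; d++) s += x[d] * y[d];
        return s;
    }

    static double squaredDistance(double[] x, double[] y) {
        double s = 0;
        for (int d = 0; d < x.length; d++) { double t = x[d] - y[d]; s += t * t; }
        return s;
    }

    /** Approximate nearest neighbor: only about 2 * ball candidates get an exact distance. */
    int search(double[] query, int ball) {
        double q = dot(query, basis);
        // binary search for where the query's projection lands in the sorted order
        int lo = 0, hi = order.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (projections[order[mid]] < q) lo = mid + 1; else hi = mid;
        }
        int bestIndex = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = Math.max(0, lo - ball); i < Math.min(order.length, lo + ball); i++) {
            double d = squaredDistance(query, vectors[order[i]]);
            if (d < bestDist) { bestDist = d; bestIndex = order[i]; }
        }
        return bestIndex;
    }

    public static void main(String[] args) {
        Random rng = new Random(1);
        double[][] centroids = new double[100][10];
        for (double[] c : centroids) for (int d = 0; d < c.length; d++) c[d] = rng.nextGaussian();
        ProjectionSearchSketch index = new ProjectionSearchSketch(centroids, 10, rng);
        System.out.println(index.search(centroids[42], 8)); // very likely prints 42
    }
}
```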
32. How many projections?
[Figure: “Probability of finding nearest neighbor using single projection search against 10,000 points”; x axis: Dimension (roughly 4 to 1000), y axis: probability of finding the nearest neighbor (0 to 0.5)]
33. Locality Sensitive Hashing
• Based on multiple random projections
• Use the signs of the projections to set bits in a hash (32 or 64 bits) to 1 or 0
• The Hamming Distance between two hashes is correlated with the cosine distance between their corresponding vectors
35. How to search?
• Compute the hashes for each vector to search
• Keep the results sorted (binary search tree, sorted arrays)
• Search for the projected query (log k)
• Compute Hamming Distance for elements in a ball around the resulting index
• If it’s low enough, compute the actual distance
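A sketch of the hashing and filtering idea (64 sign bits packed into a long, Hamming distance as a cheap pre-filter); not the Mahout LocalitySensitiveHashSearch implementation, just the mechanism:

```java
import java.util.Random;

// LSH via random hyperplanes: one sign bit per random projection, packed into
// a 64-bit hash. Hamming distance between hashes is a cheap filter; only
// candidates that pass it get an exact distance computation.
public class LshSketch {

    final double[][] projections;   // 64 random projection vectors

    LshSketch(int dim, Random rng) {
        projections = new double[64][dim];
        for (double[] p : projections)
            for (int d = 0; d < dim; d++) p[d] = rng.nextGaussian();
    }

    /** One bit per projection: 1 if the dot product is positive, else 0. */
    long hash(double[] v) {
        long h = 0;
        for (int b = 0; b < 64; b++) {
            double dot = 0;
            for (int d = 0; d < v.length; d++) dot += projections[b][d] * v[d];
            if (dot > 0) h |= 1L << b;
        }
        return h;
    }

    /** Number of differing bits; correlated with the angle between the vectors. */
    static int hammingDistance(long h1, long h2) {
        return Long.bitCount(h1 ^ h2);
    }

    public static void main(String[] args) {
        Random rng = new Random(3);
        LshSketch lsh = new LshSketch(20, rng);
        double[] a = new double[20], close = new double[20], far = new double[20];
        for (int d = 0; d < 20; d++) {
            a[d] = rng.nextGaussian();
            close[d] = a[d] + 0.01 * rng.nextGaussian();  // nearly the same direction
            far[d] = rng.nextGaussian();                   // unrelated direction
        }
        System.out.println(hammingDistance(lsh.hash(a), lsh.hash(close))); // small
        System.out.println(hammingDistance(lsh.hash(a), lsh.hash(far)));   // much larger
    }
}
```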
36. Closest cluster: reducing d
• Principal Component Analysis
• High quality and known guarantees
• Project onto the first d' singular vectors of the SVD of the data
• Random Projections
• Create a d x d’ projection matrix
• Multiply the data matrix (n x d) to the right by the projection matrix
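A sketch of the random-projection route: a d x d' matrix with Gaussian entries applied on the right of the data matrix (the 1/sqrt(d') scaling is one common choice):

```java
import java.util.Random;

// Dimensionality reduction by random projection: the n x d data matrix times
// a d x d' Gaussian matrix gives an n x d' matrix that approximately
// preserves pairwise distances (Johnson-Lindenstrauss).
public class RandomProjection {

    static double[][] projectionMatrix(int d, int dPrime, Random rng) {
        double[][] m = new double[d][dPrime];
        double scale = 1.0 / Math.sqrt(dPrime);   // one common scaling choice
        for (int i = 0; i < d; i++)
            for (int j = 0; j < dPrime; j++)
                m[i][j] = scale * rng.nextGaussian();
        return m;
    }

    /** (n x d) data times (d x d') projection = (n x d') reduced data. */
    static double[][] project(double[][] data, double[][] projection) {
        int n = data.length, d = projection.length, dPrime = projection[0].length;
        double[][] reduced = new double[n][dPrime];
        for (int row = 0; row < n; row++)
            for (int i = 0; i < d; i++)
                for (int j = 0; j < dPrime; j++)
                    reduced[row][j] += data[row][i] * projection[i][j];
        return reduced;
    }

    public static void main(String[] args) {
        Random rng = new Random(11);
        double[][] data = new double[5][500];
        for (double[] row : data) for (int i = 0; i < row.length; i++) row[i] = rng.nextGaussian();
        double[][] reduced = project(data, projectionMatrix(500, 20, rng));
        System.out.println(reduced.length + " x " + reduced[0].length); // 5 x 20
    }
}
```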
38. Big Data
• Have to use large-scale techniques because there is too much data
• doesn’t fit in RAM
• doesn’t fit on disk
• distributed across many machines in a cluster
• Paradigm of choice: MapReduce, implemented in Hadoop
39. MapReduce Overview
• Awesome illustrations at CS Illustrated Berkeley: http://csillustrated.berkeley.edu/illustrations.php
• http://csillustrated.berkeley.edu/PDFs/handouts/mapreduce-1-handout.pdf
• http://csillustrated.berkeley.edu/PDFs/handouts/mapreduce-2-example-handout.pdf
40. k-means as a MapReduce
• Can’t split k-means loop directly:
• An iteration depends on the previous one’s results
• Must express a single k-means step as a MapReduce
• The job driver runs the outer loop, submitting a job for every iteration, where the output of the last one is the input of the current one
• Centroids at each step are read from and written to files on disk
41. k-means as a MapReduce
• initialize k points as centroids
• the driver samples the centroids from disk using reservoir sampling
• while not done (multiple map-reduce steps)
• assign each point to the cluster whose centroid it’s closest to
• each mapper gets a chunk of the datapoints and all the clusters, emitting (id, cluster) pairs
• set the mean of the points in each cluster as the new centroid
• each reducer aggregates its cluster indices and the chunks they’re made up of
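The same decomposition sketched in plain Java, without Hadoop: each "mapper" call handles one chunk of points and emits per-cluster partial sums and counts, and the "reducer" merges them into new centroids. The class and method names are illustrative, not the Mahout driver classes:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One k-means iteration in map/reduce shape: mappers only need their chunk
// plus the current centroids; the reducer never sees the raw points, only
// (clusterId, partial sum, count) records.
public class KMeansIterationAsMapReduce {

    static class Partial {            // value emitted by a mapper for one cluster
        double[] sum; long count;
        Partial(int dim) { sum = new double[dim]; }
    }

    static double squaredDistance(double[] x, double[] y) {
        double s = 0;
        for (int d = 0; d < x.length; d++) { double t = x[d] - y[d]; s += t * t; }
        return s;
    }

    /** Map phase: one chunk of points in, a (clusterId -> partial sum) table out. */
    static Map<Integer, Partial> map(double[][] chunk, double[][] centroids) {
        Map<Integer, Partial> emitted = new HashMap<>();
        for (double[] p : chunk) {
            int best = 0;
            for (int i = 1; i < centroids.length; i++)
                if (squaredDistance(p, centroids[i]) < squaredDistance(p, centroids[best])) best = i;
            Partial partial = emitted.computeIfAbsent(best, id -> new Partial(p.length));
            partial.count++;
            for (int d = 0; d < p.length; d++) partial.sum[d] += p[d];
        }
        return emitted;
    }

    /** Reduce phase: merge all partials for each cluster and compute the mean. */
    static double[][] reduce(List<Map<Integer, Partial>> mapperOutputs, double[][] centroids) {
        double[][] updated = new double[centroids.length][];
        for (int i = 0; i < centroids.length; i++) updated[i] = centroids[i].clone();
        for (int i = 0; i < centroids.length; i++) {
            double[] sum = new double[centroids[i].length];
            long count = 0;
            for (Map<Integer, Partial> output : mapperOutputs) {
                Partial p = output.get(i);
                if (p != null) { count += p.count; for (int d = 0; d < sum.length; d++) sum[d] += p.sum[d]; }
            }
            if (count > 0) for (int d = 0; d < sum.length; d++) updated[i][d] = sum[d] / count;
        }
        return updated;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0, 0}, {10, 10}};
        double[][] chunk1 = {{0, 1}, {1, 0}};
        double[][] chunk2 = {{9, 9}, {11, 11}};
        List<Map<Integer, Partial>> outputs = new ArrayList<>();
        outputs.add(map(chunk1, centroids));   // runs on mapper 1
        outputs.add(map(chunk2, centroids));   // runs on mapper 2
        double[][] updated = reduce(outputs, centroids);
        System.out.println(java.util.Arrays.deepToString(updated)); // [[0.5, 0.5], [10.0, 10.0]]
    }
}
```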
42. Issues
• Initial centroid selection requires iterating through all the datapoints (accessing remote chunks, loading into memory, etc.)
• Cannot feasibly implement k-means++ initialization
• Each map step requires an iteration through the entire data set
• Needs multiple MapReduces (and intermediate data is written to disk)
• Could implement faster nearest-neighbor searching and ball k-means (but Mahout doesn’t)
43. Now for something completely different
Streaming k-means
• Attempt to build a “sketch” of the data (can build a provably good sketch in one pass)
• Won’t get k clusters, will get O(k log n) clusters
• Can collapse the results down to k with ball k-means
• The sketch can be small enough to fit into memory so the clustering can use all the previous tricks
• Fast and Accurate k-means for Large Datasets, M. Shindler, A. Wong, A. Meyerson: http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf
44. Streaming k-means pseudocode
• for each point p, with weight w
• find the closest centroid to p, call it c and let d be the distance between p and c
• if an event with probability proportional to d * w / (distanceCutoff) occurs
• create a new cluster with p as its centroid
• else, merge p into c
• if there are too many clusters, increase distanceCutoff and cluster recursively (“collapse” the centroids)
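A minimal in-memory sketch of this loop (not Mahout's StreamingKMeans); maxClusters and the cutoff growth factor are illustrative choices:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Streaming k-means step: points either merge into the closest weighted
// centroid or, with probability proportional to d * w / distanceCutoff,
// start a new centroid; when the sketch grows too large, the cutoff is raised
// and the centroids themselves are re-clustered ("collapsed").
public class StreamingKMeansSketch {

    static class Centroid {
        double[] mean; double weight;
        Centroid(double[] m, double w) { mean = m.clone(); weight = w; }
    }

    final List<Centroid> centroids = new ArrayList<>();
    final Random rng = new Random(42);
    double distanceCutoff;
    final int maxClusters;

    StreamingKMeansSketch(double initialCutoff, int maxClusters) {
        this.distanceCutoff = initialCutoff;
        this.maxClusters = maxClusters;
    }

    static double distance(double[] x, double[] y) {
        double s = 0;
        for (int d = 0; d < x.length; d++) { double t = x[d] - y[d]; s += t * t; }
        return Math.sqrt(s);
    }

    void add(double[] point, double weight) {
        if (centroids.isEmpty()) { centroids.add(new Centroid(point, weight)); return; }
        // find the closest existing centroid
        Centroid closest = centroids.get(0);
        double best = distance(point, closest.mean);
        for (Centroid c : centroids) {
            double d = distance(point, c.mean);
            if (d < best) { best = d; closest = c; }
        }
        // with probability proportional to d * w / distanceCutoff, start a new cluster
        if (rng.nextDouble() < best * weight / distanceCutoff) {
            centroids.add(new Centroid(point, weight));
        } else {
            // otherwise merge the point into the closest centroid (weighted mean)
            double total = closest.weight + weight;
            for (int d = 0; d < point.length; d++)
                closest.mean[d] = (closest.mean[d] * closest.weight + point[d] * weight) / total;
            closest.weight = total;
        }
        // too many clusters: raise the cutoff and re-cluster the centroids themselves
        if (centroids.size() > maxClusters) collapse();
    }

    void collapse() {
        distanceCutoff *= 1.5;                         // illustrative growth factor
        List<Centroid> old = new ArrayList<>(centroids);
        centroids.clear();
        for (Centroid c : old) add(c.mean, c.weight);  // recursively cluster the sketch
    }

    public static void main(String[] args) {
        StreamingKMeansSketch sketch = new StreamingKMeansSketch(0.5, 10);
        Random data = new Random(1);
        for (int i = 0; i < 10000; i++) {
            double center = (i % 2 == 0) ? 0 : 10;     // two well-separated clumps
            sketch.add(new double[]{center + data.nextGaussian(), center + data.nextGaussian()}, 1.0);
        }
        System.out.println(sketch.centroids.size() + " centroids in the sketch");
    }
}
```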
53. Discussion
• The resulting O(k log n) clusters are a good sketch of the data
• The number of clusters in the sketch will not always be the same
• For some applications (notably approximation), these clusters are fine
• Most important parameter is the initial distanceCutoff (which is data dependent)
• Too large: loose clusters, poor quality
• Too small (less bad): frequent collapses (and it will be increased anyway)
• Rule of thumb: on the order of the minimum distance between two points
54. The big picture
• Can cluster all the points with just 1 MapReduce:
• m mappers run streaming k-means to get O(k log (n/m)) points
• 1 reducer runs ball k-means to get k clusters
• The reducer can perform another streaming k-means step before ball k-means to collapse the clusters further so they fit into memory
• The reducer’s k-means step supports k-means++, multiple restarts with a train/test split, ball k-means and fast nearest-neighbor searching
56. Quality (vs KMeans)
• Compared the quality on various small-medium UCI data sets
• iris, seeds, movement, control, power
• Computed the following quality measures:
• Dunn Index (higher is better)
• Davies-Bouldin Index (lower is better)
• Adjusted Rand Index (higher is better)
• Total cost (lower is better)
59. seeds compare plot
[Figure: “Clustering techniques compared (all)” for the seeds data set; overall mean cluster distance by clustering type: bskm ≈ 4.31, km ≈ 4.22]
66. Speed (MapReduce)
• Synthetic data
• 500-dimensional, ~120 million points, 6.7 GB
• 20 corners of a hypercube as means of a multivariate normal distribution (90%) + uniform noise (10%)
• 5-node MapR cluster on Amazon EC2
• KMeans with 20 clusters, 10 iterations
• StreamingKMeans with 20 clusters, 100 map clusters
69. Code
• Now in trunk, part of Apache Mahout 0.8
• Easy to integrate with the rest of the Hadoop ecosystem (e.g. cluster vectors from Apache Lucene/Solr)
• Clustering algorithms: BallKMeans, StreamingKMeans
• MapReduce classes: StreamingKMeansDriver, StreamingKMeansMapper, StreamingKMeansReducer, StreamingKMeansThread
• Fast nearest-neighbor search: ProjectionSearch, FastProjectionSearch, LocalitySensitiveHashSearch
• Quality metrics: ClusteringUtils