Clustering large-scale data TU Berlin talk
1. Clustering large-scale data
Dan Filimon, Politehnica University Bucharest
Ted Dunning, MapR Technologies
Buzzwords talk at http://goo.gl/9FxbN
Saturday, June 1, 13
2. whoami
• Soon-to-be graduate of Politehnica University of Bucharest
• Apache Mahout committer
• Contact me at
• dfilimon@apache.org
• dangeorge.filimon@gmail.com
3. The problem
Group n d-dimensional datapoints into k disjoint sets to minimize
$$\sum_{c_1}^{c_k} \left( \sum_{x_{ij} \in X_i} \mathrm{dist}(x_{ij}, c_i) \right)$$
where
• $X_i$ is the $i$th cluster
• $c_i$ is the centroid of the $i$th cluster
• $x_{ij}$ is the $j$th point from the $i$th cluster
• $\mathrm{dist}(x, y) = \lVert x - y \rVert_2$
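As a small illustration of the objective above, the cost of a given clustering can be computed directly from its definition. This is a minimal sketch in plain Python; the function names (`dist`, `centroid`, `clustering_cost`) and the toy data are made up for the example, and the centroid is assumed to be the coordinate-wise mean of the cluster.

```python
import math

def dist(x, y):
    """Euclidean (L2) distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points (assumed centroid)."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def clustering_cost(clusters):
    """Sum, over all clusters, of the distances from each point to its centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(dist(x, c) for x in points)
    return total

# Two toy clusters in 2 dimensions.
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],    # centroid (0, 1), each point at distance 1
    [(10.0, 0.0), (14.0, 0.0)],  # centroid (12, 0), each point at distance 2
]
cost = clustering_cost(clusters)  # 1 + 1 + 2 + 2
```

Clustering then amounts to searching for the partition into k sets that makes this number as small as possible.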
10. Details
• Quality
• How to initialize the centroids
• When to stop iterating
• How to deal with outliers
• Speed
• Complexity of cluster assignment
11. k-means as a MapReduce
• Can’t split k-means loop directly
• Must express a single k-means step as a MapReduce
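One way to picture a single k-means step as a MapReduce is sketched below in plain Python, with no Hadoop or Mahout API involved. The map phase assigns each point to its nearest centroid, the grouping by key plays the role of the shuffle, and the reduce phase averages the points of each cluster. All names (`map_point`, `reduce_cluster`, `kmeans_step`) are hypothetical, and keeping an empty cluster's old centroid is an assumption made for the example.

```python
import math
from collections import defaultdict

def dist(x, y):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def map_point(point, centroids):
    """Map: emit (index of the nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)

def reduce_cluster(values):
    """Reduce: average all the points that were assigned to one centroid."""
    count = sum(n for _, n in values)
    dims = len(values[0][0])
    sums = [sum(p[d] for p, _ in values) for d in range(dims)]
    return tuple(s / count for s in sums)

def kmeans_step(points, centroids):
    """One k-means iteration: map, group by key (the shuffle), then reduce."""
    groups = defaultdict(list)
    for p in points:
        key, value = map_point(p, centroids)
        groups[key].append(value)
    # Assumption: a centroid with no assigned points keeps its old position.
    return [reduce_cluster(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))]

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (14.0, 0.0)]
new_centroids = kmeans_step(points, [(1.0, 1.0), (9.0, 1.0)])
```

A driver would call `kmeans_step` repeatedly, feeding each result back in as the next set of centroids, so every iteration costs one full pass over the data.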
12. Now for something completely different
• Attempt to build a “sketch” of the data in one pass
• O(k log n) intermediate clusters
• Can fit into memory
• Ball k-means on the sketch for k final clusters
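The one-pass sketch idea can be illustrated roughly as follows. This is a simplified sketch under stated assumptions, not the Mahout implementation: the rule for opening a new cluster (with probability proportional to the weighted distance to the nearest centroid divided by a cutoff), the collapse step that raises the cutoff when there are too many clusters, and all names (`add_weighted`, `sketch`, `cutoff`, `growth`, `max_clusters`) are choices made for this example. In practice `max_clusters` would be set on the order of k log n.

```python
import math
import random

def dist(x, y):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def add_weighted(centroids, point, weight, cutoff, rng):
    """Insert one weighted point: open a new cluster or merge into the nearest."""
    if not centroids:
        centroids.append((list(point), weight))
        return
    i = min(range(len(centroids)), key=lambda j: dist(point, centroids[j][0]))
    center, w = centroids[i]
    d = dist(point, center)
    # Assumed rule: far-away points are more likely to start a new cluster.
    if rng.random() < min(1.0, weight * d / cutoff):
        centroids.append((list(point), weight))
    else:
        total = w + weight
        merged = [(c * w + p * weight) / total for c, p in zip(center, point)]
        centroids[i] = (merged, total)

def sketch(points, max_clusters, cutoff=1.0, growth=1.5, seed=0):
    """Single pass over the points, keeping at most max_clusters weighted centroids."""
    if max_clusters < 1:
        raise ValueError("max_clusters must be at least 1")
    rng = random.Random(seed)
    centroids = []
    for p in points:
        add_weighted(centroids, p, 1.0, cutoff, rng)
        # Too many clusters: raise the cutoff and re-cluster the centroids themselves.
        while len(centroids) > max_clusters:
            cutoff *= growth
            old, centroids = centroids, []
            for center, w in old:
                add_weighted(centroids, center, w, cutoff, rng)
    return centroids

points = [(float(i % 7), float(i % 11)) for i in range(200)]
result = sketch(points, max_clusters=10)
```

Each entry of `result` is a `(centroid, weight)` pair, where the weight counts how many original points were folded into that centroid. The final clustering step then only has to process this small weighted set rather than all n points.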
21. The big picture
MapReduce
• Can cluster all the points with just 1 MapReduce:
• m mappers run streaming k-means
• 1 reducer runs ball k-means to get k clusters
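The data flow of that single MapReduce can be sketched as below. Only the shape of the pipeline is the point here: both stages are deliberately simple stand-ins, labeled as such. `mapper_sketch` replaces streaming k-means with a coarse grid bucketing, and `reducer_cluster` replaces ball k-means with a few weighted Lloyd iterations. All names and parameters (`grid`, `iterations`, `cluster_in_one_pass`) are hypothetical.

```python
import math

def dist(x, y):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mapper_sketch(chunk, grid=5.0):
    """Stand-in for streaming k-means: one weighted centroid per coarse grid cell."""
    cells = {}
    for p in chunk:
        key = tuple(math.floor(c / grid) for c in p)
        sums, weight = cells.get(key, ([0.0] * len(p), 0))
        cells[key] = ([s + c for s, c in zip(sums, p)], weight + 1)
    return [(tuple(s / w for s in sums), w) for sums, w in cells.values()]

def reducer_cluster(weighted, k, iterations=10):
    """Stand-in for ball k-means: weighted Lloyd iterations on the centroids."""
    # Seed with the k heaviest intermediate centroids (an arbitrary choice).
    centers = [c for c, _ in sorted(weighted, key=lambda cw: -cw[1])[:k]]
    for _ in range(iterations):
        sums = [[0.0] * len(c) for c in centers]
        weights = [0.0] * len(centers)
        for c, w in weighted:
            i = min(range(len(centers)), key=lambda j: dist(c, centers[j]))
            weights[i] += w
            sums[i] = [s + w * x for s, x in zip(sums[i], c)]
        # A center that attracted no weight keeps its old position.
        centers = [tuple(s / wt for s in sm) if wt else old
                   for sm, wt, old in zip(sums, weights, centers)]
    return centers

def cluster_in_one_pass(points, m, k):
    """Split the input across m mappers, then cluster all sketches in one reducer."""
    chunks = [points[i::m] for i in range(m)]
    weighted = [cw for chunk in chunks for cw in mapper_sketch(chunk)]
    return reducer_cluster(weighted, k)

points = [(0.0, 0.0), (100.0, 100.0), (1.0, 1.0),
          (101.0, 101.0), (2.0, 0.0), (102.0, 100.0)]
final_centers = cluster_in_one_pass(points, m=2, k=2)
```

The reducer only ever sees the weighted centroids emitted by the mappers, which is what lets a single reducer handle the final step.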
22. Quality
• Compared the quality on various small to medium-sized UCI data sets
• iris, seeds, movement, control, power
• Streaming k-means followed by ball k-means gives clusters of comparable quality to plain k-means
24. seeds compareplot
[Plot: "Clustering techniques compared all". x-axis: clustering type (bskm, km); y-axis: mean cluster distance. Overall mean cluster distance: bskm 4.3116448, km 4.22126213333333.]