Clustering
• Group a set of objects
• Objects in the same group should be similar to each other
• Each group has a representative point called the centre
• Minimise the distance from each point to its group's centre

• Unsupervised learning:
  • Un-labelled data
  • No training data
Lloyd's K-means algorithm
• Centres ← randomly pick k points
• Iterate (see the sketch below):
  • Assign each point to the closest centre
  • Calculate the new centre points: the centroid of each cluster

• Problems:
  • Every iteration passes over the whole list of points -> not suitable for vast amounts of data
  • Sensitive to a bad initialization
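A minimal sketch of Lloyd's iteration in Python (NumPy, Euclidean distance); the function and parameter names are illustrative, not from the slides:

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: random initialisation, then assign / recompute until stable."""
    rng = np.random.default_rng(seed)
    # Centres <- randomly pick k points
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to the closest centre
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Calculate the new centre points: centroids of each cluster
        new_centres = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels
```

Each iteration above touches every point, which is exactly the scalability problem the rest of the deck addresses; the purely random initialisation is the other weakness.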
K-means++
• Centres ← randomly pick ONE point from X
• Until we have enough centres:
  • Choose the next centre p from X with probability
    P(p) = D(p)² / Σ_{x ∈ X} D(x)²
    where D(x) is the distance from x to its closest already-chosen centre
• The probability of being picked increases when the distance to the closest centre is high (see the sketch below)
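A minimal sketch of the k-means++ seeding rule above (Python/NumPy, Euclidean distance; names are illustrative):

```python
import numpy as np

def kmeanspp_seed(points, k, seed=0):
    """Pick k initial centres, each with probability proportional to D(x)^2."""
    rng = np.random.default_rng(seed)
    # Centres <- randomly pick ONE point from X
    centres = [points[rng.integers(len(points))]]
    while len(centres) < k:
        # D(x): distance from each point to its closest already-chosen centre
        d2 = np.min(
            np.linalg.norm(points[:, None, :] - np.array(centres)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        # Choose the next centre with probability D(x)^2 / sum_x D(x)^2
        centres.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    return np.array(centres)
```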
K-means#
• Centres ← randomly pick 3 log k points from X
• Until we have enough centres:
  • Choose the next 3 log k centres from X, each with probability
    P(p) = D(p)² / Σ_{x ∈ X} D(x)²
• Oversampling like this improves the coverage of the clusters of the optimal solution (see the sketch below)
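K-means# differs from k-means++ only in drawing a batch of 3 log k centres per round; a sketch under that reading (the total centre budget `n_centres` and the log base are assumptions, since the slide does not fix them):

```python
import numpy as np

def kmeans_sharp_seed(points, k, n_centres, seed=0):
    """k-means#: like k-means++, but sample 3*log2(k) centres per round
    until n_centres (a caller-chosen budget) have been collected."""
    rng = np.random.default_rng(seed)
    batch = max(1, int(np.ceil(3 * np.log2(k))))   # log base 2 assumed
    # First round: 3 log k points picked uniformly at random from X
    centres = list(points[rng.choice(len(points), size=batch, replace=False)])
    while len(centres) < n_centres:
        # D(x)^2 weights w.r.t. the closest already-chosen centre
        d2 = np.min(
            np.linalg.norm(points[:, None, :] - np.array(centres)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        # Next round: 3 log k more centres, D(x)^2-weighted
        idx = rng.choice(len(points), size=batch, replace=False, p=d2 / d2.sum())
        centres.extend(points[idx])
    return np.array(centres[:n_centres])
```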
Divide and conquer
• Split the points into chunks: POINTS1, POINTS2, POINTS3, ...
• Run k-means# on each chunk, producing weighted centres: CENTERS1/WEIGHTS1, CENTERS2/WEIGHTS2, CENTERS3/WEIGHTS3
• Run k-means++ on the union of the weighted centres to obtain the final CENTERS (see the sketch below)
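A sketch of the divide-and-conquer flow pictured on this slide, reusing `kmeans_sharp_seed` from above; the chunk count, the per-chunk centre budget, and the weighted k-means++ step over the merged centres are assumptions read off the diagram:

```python
import numpy as np

def divide_and_conquer_kmeans(points, k, n_chunks=3, seed=0):
    """Cluster each chunk with k-means#, then run a weighted k-means++ pass
    over the merged centres. Reuses kmeans_sharp_seed() from the sketch above."""
    rng = np.random.default_rng(seed)
    all_centres, all_weights = [], []
    for chunk in np.array_split(points, n_chunks):                 # POINTS1, POINTS2, POINTS3
        centres = kmeans_sharp_seed(chunk, k, n_centres=3 * k, seed=seed)   # CENTERS_i (budget assumed)
        labels = np.linalg.norm(chunk[:, None, :] - centres[None, :, :], axis=2).argmin(axis=1)
        weights = np.bincount(labels, minlength=len(centres))               # WEIGHTS_i
        all_centres.append(centres)
        all_weights.append(weights)
    centres = np.concatenate(all_centres)
    weights = np.concatenate(all_weights).astype(float)

    # Weighted k-means++ seeding over the merged centres -> final CENTERS
    final = [centres[rng.choice(len(centres), p=weights / weights.sum())]]
    while len(final) < k:
        d2 = np.min(
            np.linalg.norm(centres[:, None, :] - np.array(final)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        probs = weights * d2
        final.append(centres[rng.choice(len(centres), p=probs / probs.sum())])
    return np.array(final), weights
```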
Fast streaming k-means
• One pass over the points, selecting as centres those that are far away from the already selected ones
• When there is not enough space, we remove the centres that are less interesting
• Finally, we run Lloyd's algorithm on the centres, using the weights (see the sketch below)
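A minimal sketch of the streaming selection idea: keep a point as a centre only if it is far from every centre seen so far, and collapse when the budget is exceeded. The distance threshold, its growth rule, and "least interesting = lowest weight" are simplifying assumptions, not from the slides:

```python
import numpy as np

def streaming_kmeans_pass(point_stream, max_centres, init_threshold=1.0):
    """One pass: a point far from all current centres becomes a new centre;
    otherwise it is folded into the closest centre. When the budget is
    exceeded, drop the lowest-weight centres and grow the threshold."""
    centres, weights = [], []
    threshold = init_threshold
    for p in point_stream:
        p = np.asarray(p, dtype=float)
        if centres:
            dists = np.linalg.norm(np.array(centres) - p, axis=1)
            j = int(dists.argmin())
        if not centres or dists[j] > threshold:
            centres.append(p)                 # far from everything seen so far -> new centre
            weights.append(1)
        else:
            weights[j] += 1                   # close to an existing centre -> fold it in
            centres[j] = centres[j] + (p - centres[j]) / weights[j]
        if len(centres) > max_centres:
            # not enough space: keep only the higher-weight ("more interesting") centres
            keep = np.argsort(weights)[len(centres) // 2:]
            centres = [centres[i] for i in keep]
            weights = [weights[i] for i in keep]
            threshold *= 2.0                  # assumed growth rule: be pickier from now on
    return np.array(centres), np.array(weights)
```

After the single pass, Lloyd's algorithm (or k-means++) is run on the returned centres, weighted by `weights`, as the slide describes.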
Basic Method
• Step 1: single-pass k-means (explained before)
  • Output: not-so-good clustering, but a good candidate set

• Step 2: use the weighted centres / facilities from Step 1
  • Output: good clustering with fewer clusters

• Finding the nearest neighbour is the most time-consuming step
  • NN based on random projection: simple (see the sketch below)
  • Compact Projection: Simple and Efficient Near Neighbor Search with Practical Memory Requirements [1]
  • Empirically, projection search is a bit better than 64-bit LSH [4]
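A minimal sketch of nearest-neighbour search via random projection: project the centres onto one random direction, keep them sorted, and scan only a small window around the query's projected position. The single-projection choice and the window size are illustrative assumptions, deliberately simpler than the Compact Projection method cited in [1]:

```python
import numpy as np

def build_projection_index(centres, seed=0):
    """Project centres onto one random direction and keep them sorted by projection."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=centres.shape[1])
    direction /= np.linalg.norm(direction)
    proj = centres @ direction
    order = np.argsort(proj)
    return direction, proj[order], order

def approx_nearest(query, centres, index, window=8):
    """Approximate NN: binary-search the query's projection, check only nearby candidates."""
    direction, sorted_proj, order = index
    q = float(query @ direction)
    pos = int(np.searchsorted(sorted_proj, q))
    lo, hi = max(0, pos - window), min(len(order), pos + window)
    candidates = order[lo:hi]
    dists = np.linalg.norm(centres[candidates] - query, axis=1)
    return candidates[int(dists.argmin())]
```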
Scaling
• Map:
  • Roughly cluster the input data using streaming k-means
  • Output: weighted centres (each cluster's centre and the number of points it contains)

• Reduce:
  • All centres are passed to a single reducer
  • Apply batch k-means, or another one-pass run if there are too many centres
• A Combiner can be used, but it is not necessary (see the sketch below)
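A minimal sketch of the map/reduce split, with plain Python functions standing in for a MapReduce framework and reusing `streaming_kmeans_pass` and `lloyd_kmeans` from the sketches above; the key name and the unweighted reduce step are assumptions:

```python
import numpy as np

def kmeans_mapper(points_partition, max_centres=1000):
    """Map: roughly cluster one input partition with streaming k-means,
    emit weighted centres under one key so a single reducer collects them all."""
    centres, weights = streaming_kmeans_pass(points_partition, max_centres)
    for c, w in zip(centres, weights):
        yield ("centres", (c, w))

def kmeans_reducer(key, weighted_centres, k):
    """Reduce: batch k-means over all weighted centres (weights are ignored
    here for brevity; a weighted Lloyd's step would scale each centroid
    update by w, or another one-pass run could be used if there are too many)."""
    centres = np.array([c for c, _ in weighted_centres])
    final_centres, _ = lloyd_kmeans(centres, k)
    return final_centres
```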
References
• Compact Projection: Simple and Efficient Near Neighbor Search with Practical Memory Requirements, by Kerui Min et al.
• Fast and Accurate k-means for Large Datasets, by Shindler et al.
• Streaming k-Means Approximation, by Jaiswal et al.
• Large Scale Single-Pass k-Means Clustering at Scale, by Ted Dunning
• Apache Mahout
Questions?
