Clustering
• Group a set of objects
• Objects in the same group should be similar to each other
• Each group has a representative point called the centre
• Minimise the distance from each point to its group's centre

• Unsupervised learning:
  • Un-labelled data
  • No training data
Lloyd's K-means algorithm
• Centres ← randomly pick k points
• Iterate (see the sketch below):
  • Assign each point to the closest centre
  • Calculate the new centre points: the centroid of each cluster

• Problems:
  • Every iteration passes over the whole list of points -> not suitable for vast amounts of data
  • Sensitive to a bad initialization
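A minimal sketch of Lloyd's iteration in Python (NumPy, Euclidean distance); the function and parameter names are illustrative, not from the slides:

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: random initialisation, then assign / recompute until stable."""
    rng = np.random.default_rng(seed)
    # Centres <- randomly pick k points
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to the closest centre
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Calculate the new centre points: centroids of each cluster
        new_centres = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels
```

Each iteration above touches every point, which is exactly the scalability problem the rest of the deck addresses; the purely random initialisation is the other weakness.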
K-means++
• Centres ← randomly pick ONE point from X
• Until we have enough centres:
  • Choose the next centre p from X with probability
    P(p) = D(p)² / Σ_{x ∈ X} D(x)²
    where D(x) is the distance from x to its closest already-chosen centre
• The probability of being picked increases when the distance to the closest centre is high (see the sketch below)
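A minimal sketch of the k-means++ seeding rule above (Python/NumPy, Euclidean distance; names are illustrative):

```python
import numpy as np

def kmeanspp_seed(points, k, seed=0):
    """Pick k initial centres, each with probability proportional to D(x)^2."""
    rng = np.random.default_rng(seed)
    # Centres <- randomly pick ONE point from X
    centres = [points[rng.integers(len(points))]]
    while len(centres) < k:
        # D(x): distance from each point to its closest already-chosen centre
        d2 = np.min(
            np.linalg.norm(points[:, None, :] - np.array(centres)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        # Choose the next centre with probability D(x)^2 / sum_x D(x)^2
        centres.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    return np.array(centres)
```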
K-means#
• Centres ← randomly pick 3 log k points from X
• Until we have enough centres:
  • Choose the next 3 log k centres from X, each with probability
    P(p) = D(p)² / Σ_{x ∈ X} D(x)²
• Oversampling like this improves the coverage of the clusters of the optimal solution (see the sketch below)
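K-means# differs from k-means++ only in drawing a batch of 3 log k centres per round; a sketch under that reading (the total centre budget `n_centres` and the log base are assumptions, since the slide does not fix them):

```python
import numpy as np

def kmeans_sharp_seed(points, k, n_centres, seed=0):
    """k-means#: like k-means++, but sample 3*log2(k) centres per round
    until n_centres (a caller-chosen budget) have been collected."""
    rng = np.random.default_rng(seed)
    batch = max(1, int(np.ceil(3 * np.log2(k))))   # log base 2 assumed
    # First round: 3 log k points picked uniformly at random from X
    centres = list(points[rng.choice(len(points), size=batch, replace=False)])
    while len(centres) < n_centres:
        # D(x)^2 weights w.r.t. the closest already-chosen centre
        d2 = np.min(
            np.linalg.norm(points[:, None, :] - np.array(centres)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        # Next round: 3 log k more centres, D(x)^2-weighted
        idx = rng.choice(len(points), size=batch, replace=False, p=d2 / d2.sum())
        centres.extend(points[idx])
    return np.array(centres[:n_centres])
```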
Divide and conquer
• Split the points into chunks: POINTS1, POINTS2, POINTS3, ...
• Run k-means# on each chunk, producing weighted centres: CENTERS1/WEIGHTS1, CENTERS2/WEIGHTS2, CENTERS3/WEIGHTS3
• Run k-means++ on the union of the weighted centres to obtain the final CENTERS (see the sketch below)
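A sketch of the divide-and-conquer flow pictured on this slide, reusing `kmeans_sharp_seed` from above; the chunk count, the per-chunk centre budget, and the weighted k-means++ step over the merged centres are assumptions read off the diagram:

```python
import numpy as np

def divide_and_conquer_kmeans(points, k, n_chunks=3, seed=0):
    """Cluster each chunk with k-means#, then run a weighted k-means++ pass
    over the merged centres. Reuses kmeans_sharp_seed() from the sketch above."""
    rng = np.random.default_rng(seed)
    all_centres, all_weights = [], []
    for chunk in np.array_split(points, n_chunks):                 # POINTS1, POINTS2, POINTS3
        centres = kmeans_sharp_seed(chunk, k, n_centres=3 * k, seed=seed)   # CENTERS_i (budget assumed)
        labels = np.linalg.norm(chunk[:, None, :] - centres[None, :, :], axis=2).argmin(axis=1)
        weights = np.bincount(labels, minlength=len(centres))               # WEIGHTS_i
        all_centres.append(centres)
        all_weights.append(weights)
    centres = np.concatenate(all_centres)
    weights = np.concatenate(all_weights).astype(float)

    # Weighted k-means++ seeding over the merged centres -> final CENTERS
    final = [centres[rng.choice(len(centres), p=weights / weights.sum())]]
    while len(final) < k:
        d2 = np.min(
            np.linalg.norm(centres[:, None, :] - np.array(final)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        probs = weights * d2
        final.append(centres[rng.choice(len(centres), p=probs / probs.sum())])
    return np.array(final), weights
```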
Fast streaming k-means
• One pass over the points, selecting as centres those that are far away from the already selected ones
• When there is not enough space, we remove the centres that are less interesting
• Finally, we run Lloyd's algorithm on the centres, using the weights (see the sketch below)
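A minimal sketch of the streaming selection idea: keep a point as a centre only if it is far from every centre seen so far, and collapse when the budget is exceeded. The distance threshold, its growth rule, and "least interesting = lowest weight" are simplifying assumptions, not from the slides:

```python
import numpy as np

def streaming_kmeans_pass(point_stream, max_centres, init_threshold=1.0):
    """One pass: a point far from all current centres becomes a new centre;
    otherwise it is folded into the closest centre. When the budget is
    exceeded, drop the lowest-weight centres and grow the threshold."""
    centres, weights = [], []
    threshold = init_threshold
    for p in point_stream:
        p = np.asarray(p, dtype=float)
        if centres:
            dists = np.linalg.norm(np.array(centres) - p, axis=1)
            j = int(dists.argmin())
        if not centres or dists[j] > threshold:
            centres.append(p)                 # far from everything seen so far -> new centre
            weights.append(1)
        else:
            weights[j] += 1                   # close to an existing centre -> fold it in
            centres[j] = centres[j] + (p - centres[j]) / weights[j]
        if len(centres) > max_centres:
            # not enough space: keep only the higher-weight ("more interesting") centres
            keep = np.argsort(weights)[len(centres) // 2:]
            centres = [centres[i] for i in keep]
            weights = [weights[i] for i in keep]
            threshold *= 2.0                  # assumed growth rule: be pickier from now on
    return np.array(centres), np.array(weights)
```

After the single pass, Lloyd's algorithm (or k-means++) is run on the returned centres, weighted by `weights`, as the slide describes.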
Basic Method
• Step 1: single-pass k-means (explained before)
  • Output: not-so-good clustering, but a good candidate set

• Step 2: use the weighted centres / facilities from Step 1
  • Output: good clustering with fewer clusters

• Finding the nearest neighbour is the most time-consuming step
  • NN based on random projection: simple (see the sketch below)
  • Compact Projection: Simple and Efficient Near Neighbor Search with Practical Memory Requirements [1]
  • Empirically, projection search is a bit better than 64-bit LSH [4]
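A minimal sketch of nearest-neighbour search via random projection: project the centres onto one random direction, keep them sorted, and scan only a small window around the query's projected position. The single-projection choice and the window size are illustrative assumptions, deliberately simpler than the Compact Projection method cited in [1]:

```python
import numpy as np

def build_projection_index(centres, seed=0):
    """Project centres onto one random direction and keep them sorted by projection."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=centres.shape[1])
    direction /= np.linalg.norm(direction)
    proj = centres @ direction
    order = np.argsort(proj)
    return direction, proj[order], order

def approx_nearest(query, centres, index, window=8):
    """Approximate NN: binary-search the query's projection, check only nearby candidates."""
    direction, sorted_proj, order = index
    q = float(query @ direction)
    pos = int(np.searchsorted(sorted_proj, q))
    lo, hi = max(0, pos - window), min(len(order), pos + window)
    candidates = order[lo:hi]
    dists = np.linalg.norm(centres[candidates] - query, axis=1)
    return candidates[int(dists.argmin())]
```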
Scaling
• Map:
  • Roughly cluster the input data using streaming k-means
  • Output: weighted centres (each cluster's centre and the number of points it contains)

• Reduce:
  • All centres are passed to a single reducer
  • Apply batch k-means, or another one-pass run if there are too many centres
• A Combiner can be used, but it is not necessary (see the sketch below)
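A minimal sketch of the map/reduce split, with plain Python functions standing in for a MapReduce framework and reusing `streaming_kmeans_pass` and `lloyd_kmeans` from the sketches above; the key name and the unweighted reduce step are assumptions:

```python
import numpy as np

def kmeans_mapper(points_partition, max_centres=1000):
    """Map: roughly cluster one input partition with streaming k-means,
    emit weighted centres under one key so a single reducer collects them all."""
    centres, weights = streaming_kmeans_pass(points_partition, max_centres)
    for c, w in zip(centres, weights):
        yield ("centres", (c, w))

def kmeans_reducer(key, weighted_centres, k):
    """Reduce: batch k-means over all weighted centres (weights are ignored
    here for brevity; a weighted Lloyd's step would scale each centroid
    update by w, or another one-pass run could be used if there are too many)."""
    centres = np.array([c for c, _ in weighted_centres])
    final_centres, _ = lloyd_kmeans(centres, k)
    return final_centres
```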
References
• Compact Projection: Simple and Efficient Near Neighbor Search with Practical Memory Requirements, by Kerui Min et al.
• Fast and Accurate k-means for Large Datasets, by Shindler et al.
• Streaming k-Means Approximation, by Jaiswal et al.
• Large Scale Single-Pass k-Means Clustering at Scale, by Ted Dunning
• Apache Mahout
Questions?
