
# Distributed streaming k-means


This presentation is part of my work for the course 'Big Data Analytics Projects' at TU Berlin within the IT4BI (Information Technology for Business Intelligence) master programme.

Published in: Technology

### Distributed streaming k-means

1. **Clustering**
   - Group a set of objects so that objects in the same group are similar.
   - Each group has a representative object called its centre.
   - Goal: minimise the distance from each point to its centre.
   - Unsupervised learning: un-labelled data, no training set.
2. **Lloyd's k-means algorithm**
   - Centres ← randomly pick k points.
   - Iterate:
     - Assign each point to the closest centre.
     - Recompute each centre as the centroid of its cluster.
   - Problems:
     - Every iteration passes over the whole list of points, so it is not suitable for vast amounts of data.
     - A bad initialisation can produce a poor clustering.
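The two alternating steps above can be sketched in a few lines of Python. This is an illustrative toy implementation, not code from the presentation; function names and the fixed iteration count are assumptions.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lloyd_kmeans(points, k, iters=10, seed=0):
    """Lloyd's k-means: random initialisation, then alternate the
    assignment step and the centroid-update step."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its closest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: dist2(p, centres[i]))
            clusters[j].append(p)
        # Update step: each centre becomes the centroid of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centres[i] = tuple(sum(xs) / len(cluster) for xs in zip(*cluster))
    return centres
```

Note how every iteration of the outer loop touches all of `points`, which is exactly the scalability problem the slide calls out.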
3. **k-means++**
   - Centres ← randomly pick ONE point from X.
   - Until we have k centres:
     - Choose the next centre x from X with probability D(x)² / Σ_{y∈X} D(y)², where D(x) is the distance from x to the closest centre chosen so far.
   - A point's probability of being chosen grows with its distance to the closest existing centre.
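A minimal sketch of the D²-weighted seeding in Python (illustrative; the inverse-CDF sampling loop is one standard way to draw from the weights):

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_init(points, k, seed=0):
    """k-means++ seeding: the first centre is uniform; every further
    centre x is drawn with probability D(x)^2 / sum over y in X of D(y)^2,
    where D(x) is the distance to the closest centre chosen so far."""
    rng = random.Random(seed)
    centres = [rng.choice(points)]
    while len(centres) < k:
        # D(x)^2 for every point, against the centres picked so far.
        d2 = [min(dist2(p, c) for c in centres) for p in points]
        total = sum(d2)
        r = rng.uniform(0, total)  # inverse-CDF sampling over the D^2 weights
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centres.append(p)
                break
    return centres
```

Because an already-chosen centre has D(x)² = 0, it can never be drawn again, and far-away points dominate the distribution.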
4. **k-means#**
   - Centres ← randomly pick 3 log k points from X.
   - Until we have enough centres:
     - Choose the next 3 log k centres from X, each with probability D(x)² / Σ_{y∈X} D(y)².
   - Sampling batches of 3 log k points improves the coverage of the clusters of the optimal solution.
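The batched variant can be sketched like this. The batch size (3·⌈log₂ k⌉) and round count (k rounds) are simplifying assumptions for illustration, not the paper's exact constants:

```python
import math
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_sharp_init(points, k, seed=0):
    """k-means# seeding sketch: instead of one point per round, draw a
    batch of ~3 log k points per round, each with probability
    proportional to D(x)^2. Produces O(k log k) candidate centres."""
    rng = random.Random(seed)
    m = 3 * max(1, math.ceil(math.log2(k)))
    centres = [rng.choice(points) for _ in range(m)]  # first batch: uniform
    for _ in range(k):                                # assumed round count
        d2 = [min(dist2(p, c) for c in centres) for p in points]
        total = sum(d2)
        if total == 0:  # every point already coincides with a centre
            break
        batch = []
        for _ in range(m):  # D^2-weighted batch, drawn independently
            r = rng.uniform(0, total)
            acc = 0.0
            for p, w in zip(points, d2):
                acc += w
                if acc >= r:
                    batch.append(p)
                    break
        centres.extend(batch)
    return centres
```

Drawing several points per round makes it far more likely that every cluster of the optimal solution receives at least one candidate centre.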
5. **Divide and conquer with k-means#**
   - Split the points into partitions (POINTS1, POINTS2, POINTS3).
   - Run k-means# on each partition, producing weighted centres (CENTERS1/WEIGHTS1, CENTERS2/WEIGHTS2, CENTERS3/WEIGHTS3).
   - Run k-means++ on the union of the weighted centres to obtain the final CENTERS.
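A simplified sketch of the pipeline. Uniform sampling stands in for k-means# on each partition, and keeping the heaviest centres stands in for the weighted k-means++ reduction; both stand-ins are assumptions to keep the sketch short:

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def weighted_centres(chunk, k, rng):
    """Stand-in for k-means# on one partition: sample k centres and
    weight each by the size of its Voronoi cell within the chunk."""
    centres = rng.sample(chunk, min(k, len(chunk)))
    weights = [0] * len(centres)
    for p in chunk:
        j = min(range(len(centres)), key=lambda i: dist2(p, centres[i]))
        weights[j] += 1
    return list(zip(centres, weights))

def divide_and_conquer(points, k, chunks=3, seed=0):
    """Cluster each partition independently, then reduce the merged
    weighted centres down to k."""
    rng = random.Random(seed)
    size = (len(points) + chunks - 1) // chunks
    merged = []
    for i in range(0, len(points), size):
        merged.extend(weighted_centres(points[i:i + size], k, rng))
    # Crude proxy for weighted k-means++: keep the heaviest centres.
    merged.sort(key=lambda cw: -cw[1])
    return [c for c, _ in merged[:k]]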
6. **Fast streaming k-means**
   - One pass over the points, keeping those that are far away from the centres already selected.
   - When there is not enough space, remove the centres that are least interesting.
   - Finally, run Lloyd's algorithm on the surviving centres, using their weights.
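A one-pass sketch in the style of online facility location. The initial threshold `f`, the doubling rule, and the merge criterion are illustrative assumptions, not the presentation's exact algorithm:

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def merge_close(centres, weights, f):
    """Fold each centre lying within distance f of an already-kept
    centre into it, adding their weights."""
    kept, kept_w = [], []
    for c, w in zip(centres, weights):
        for i, kc in enumerate(kept):
            if dist2(c, kc) < f:
                kept_w[i] += w
                break
        else:
            kept.append(c)
            kept_w.append(w)
    return kept, kept_w

def streaming_kmeans(stream, max_centres, seed=0):
    """One pass: a point opens a new centre with probability
    proportional to its distance from the nearest centre; otherwise it
    adds weight to that centre. On overflow, the threshold f doubles
    and close centres are merged away."""
    rng = random.Random(seed)
    centres, weights = [], []
    f = 1.0  # assumed initial facility cost / distance threshold
    for p in stream:
        if not centres:
            centres.append(p)
            weights.append(1)
            continue
        j = min(range(len(centres)), key=lambda i: dist2(p, centres[i]))
        if rng.random() < min(dist2(p, centres[j]) / f, 1.0):
            centres.append(p)   # far-away point becomes a new centre
            weights.append(1)
        else:
            weights[j] += 1     # nearby point just adds weight
        if len(centres) > max_centres:
            f *= 2.0
            centres, weights = merge_close(centres, weights, f)
    return list(zip(centres, weights))
```

The weighted centres this returns are exactly what the final Lloyd's pass runs on.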
7. **Basic method**
   - Step 1: single-pass k-means (explained before). Output: a not-so-good clustering, but a good set of candidates.
   - Step 2: cluster the weighted centres/facilities from step 1. Output: a good clustering with fewer clusters.
   - Finding the nearest neighbour is the most time-consuming step:
     - NN search based on random projection is simple.
     - "Compact Projection: Simple and Efficient Near Neighbor Search with Practical Memory Requirements" [1].
     - Empirically, projection search is slightly better than 64-bit LSH [4].
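The random-projection idea can be sketched with a single projection: sort points by their dot product with one random direction, then only compare the query against points whose projections land nearby. This is a deliberate simplification, not the compact-projection scheme of [1]:

```python
import bisect
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_projection_index(points, seed=0):
    """Project every point onto one random Gaussian direction and sort
    by the projected value (single-projection simplification)."""
    rng = random.Random(seed)
    dim = len(points[0])
    u = [rng.gauss(0, 1) for _ in range(dim)]
    proj = sorted((sum(x * c for x, c in zip(p, u)), p) for p in points)
    return u, proj

def nearest(u, proj, q, window=4):
    """Approximate NN: exact-distance check only over the points whose
    projections fall in a small window around the query's projection."""
    qv = sum(x * c for x, c in zip(q, u))
    i = bisect.bisect_left(proj, (qv,))
    lo, hi = max(0, i - window), min(len(proj), i + window)
    return min((p for _, p in proj[lo:hi]), key=lambda p: dist2(p, q))
```

Real schemes use several projections (or hash the projected bits) to keep the failure probability low; one projection alone can miss the true neighbour.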
8. **Scaling with MapReduce**
   - Map:
     - Roughly cluster the input split using streaming k-means.
     - Output: weighted centres (each cluster's centre and the number of points it contains).
   - Reduce:
     - All centres are passed to a single reducer.
     - Apply batch k-means, or another one-pass round if there are too many centres.
   - A combiner can be used, but it is not necessary.
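The map and reduce roles can be sketched as two plain functions. Uniform sampling stands in for the streaming pass in the mapper, and the weighted Lloyd loop in the reducer is an assumption about what "batch k-means" on weighted centres looks like:

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def map_phase(shard, max_centres, seed=0):
    """Map: roughly cluster one input shard and emit weighted centres
    (uniform sampling stands in for the streaming k-means pass)."""
    rng = random.Random(seed)
    centres = rng.sample(shard, min(max_centres, len(shard)))
    weights = [0] * len(centres)
    for p in shard:
        j = min(range(len(centres)), key=lambda i: dist2(p, centres[i]))
        weights[j] += 1
    return list(zip(centres, weights))

def reduce_phase(weighted, k, iters=10):
    """Reduce: batch weighted k-means over every mapper's centres."""
    centres = [c for c, _ in weighted[:k]]
    for _ in range(iters):
        sums = [[0.0] * len(centres[0]) for _ in centres]
        totals = [0.0] * len(centres)
        for c, w in weighted:
            j = min(range(len(centres)), key=lambda i: dist2(c, centres[i]))
            totals[j] += w
            sums[j] = [s + w * x for s, x in zip(sums[j], c)]
        centres = [tuple(s / t for s in ss) if t else centres[i]
                   for i, (ss, t) in enumerate(zip(sums, totals))]
    return centres
```

The single reducer works because each mapper emits only `max_centres` weighted points, however large its shard was.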
9. **Scaling** *(diagram slide; no text to transcribe)*
10. **References**
    - Kerui Min et al., "Compact Projection: Simple and Efficient Near Neighbor Search with Practical Memory Requirements".
    - Shindler et al., "Fast and Accurate k-means for Large Datasets".
    - Jaiswal et al., "Streaming k-Means Approximation".
    - Ted Dunning, "Large Scale Single Pass k-Means Clustering at Scale".
    - Apache Mahout.
11. **Questions?**