Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16

Teaching k-Means New Tricks
Sergei Vassilvitskii
Google

k-Means Algorithm
The k-Means Algorithm [Lloyd ’57]
– Clusters points into groups
– Remains a workhorse of machine learning even in the age of deep networks

Lloyd’s Method: k-means
Initialize with random clusters
Assign each point to nearest center
Recompute optimum centers (means)
Repeat: Assign points to nearest center
Repeat: Recompute centers
Repeat... until clustering does not change

Total error is reduced at every step, so the method is guaranteed to converge.
Minimizes:
$\phi(X, C) = \sum_{x \in X} d(x, C)^2$
where $d(x, C)$ is the distance from $x$ to its nearest center in $C$.
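
The assign/recompute loop can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's code: the names `sq_dist`, `cost` and `lloyd`, the `max_iters` cap, and the choice to leave an empty cluster's center where it was are all assumptions of this sketch.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def cost(points, centers):
    """phi(X, C): sum over points of the squared distance to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def lloyd(points, centers, max_iters=100):
    """Alternate assignment and mean-update steps until the clustering stops changing."""
    centers = list(centers)
    assignment = None
    for _ in range(max_iters):
        # Assign each point to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # clustering did not change
        assignment = new_assignment
        # Recompute each center as the mean of its assigned points.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centers, assignment


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
random.seed(0)
initial = random.sample(points, 2)  # random initialization
centers, assignment = lloyd(points, initial)
```

Because neither step can increase $\phi$, `cost(points, centers)` is never larger than `cost(points, initial)`, although the final value still depends on where the initial centers landed.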

New Tricks for k-Means
Initialization:
– Is random initialization a good idea?
Large data:
– Clustering many points (in parallel)
– Clustering into many clusters

k-means Initialization
Random? A bad idea
Even with many random restarts!

Easy Fix
Select centers using a furthest point algorithm (2-approximation to k-Center clustering).

Sensitive to Outliers

k-means++
Interpolate between two methods. Give preference to further points.
Let $D(p)$ be the distance between $p$ and the nearest cluster center.
Sample next center proportionally to $D^\alpha(p)$.
– Original Lloyd’s: $\alpha = 0$
– Furthest Point: $\alpha = \infty$
– k-means++: $\alpha = 2$
kmeans++:
Select first point uniformly at random
for (int i=1; i < k; ++i){
  Select next point p with probability $\frac{D^\alpha(p)}{\sum_x D^\alpha(x)}$;
  UpdateDistances();
}
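
The seeding loop above could be written roughly as follows. This is a sketch under assumptions: `d_alpha_seeding` and its `rng` parameter are illustrative names, distances are Euclidean, and `random.choices` does the weighted draw. Setting `alpha=0` gives uniform sampling and `alpha=2` gives k-means++.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def d_alpha_seeding(points, k, alpha=2.0, rng=None):
    """Pick k centers; each next center is sampled with probability
    proportional to D(p) ** alpha."""
    rng = rng or random.Random()
    # Select the first point uniformly at random.
    centers = [rng.choice(points)]
    # d[i] = D(p_i): distance from points[i] to its nearest chosen center
    d = [sq_dist(p, centers[0]) ** 0.5 for p in points]
    for _ in range(1, k):
        weights = [di ** alpha for di in d]
        if sum(weights) == 0:
            break  # every point coincides with a chosen center
        nxt = rng.choices(points, weights=weights, k=1)[0]
        centers.append(nxt)
        # UpdateDistances(): only the new center can reduce D(p).
        d = [min(di, sq_dist(p, nxt) ** 0.5) for di, p in zip(d, points)]
    return centers


pts = [(float(i), 0.0) for i in range(20)]
seeds = d_alpha_seeding(pts, 5, alpha=2.0, rng=random.Random(1))
```

With `alpha=2` an already chosen point has weight zero, so the same point is not drawn twice. Each draw depends on the distances updated by the previous one.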

k-means++
Theorem [AV ’07]: k-means++ guarantees a $\Theta(\log k)$ approximation (in expectation)

Dealing with large data
The new initialization approach:
– Leads to very good clusterings
– But is very sequential!
• Must select one center at a time, then update the distribution we are sampling from
– How to adapt it to the world of parallel computing?

Speeding up initialization
Initialization:
kmeans++:
for (int i=1; i < k; ++i) {
  Select next point p with probability $\frac{D^2(p)}{\sum_x D^2(x)}$;
  UpdateDistances();
}
Improving the speed:
– Instead of selecting a single point, sample many points at a time
– Oversample: select more than k centers, and then select the best k out of them.

k-means||
kmeans||:
Select first point c uniformly at random
for (int i=1; i < $\log_\ell(\phi(X, c))$; ++i){
  Select each point p independently with probability $k \cdot \ell \cdot \frac{D^\alpha(p)}{\sum_x D^\alpha(x)}$
  UpdateDistances();
}
Prune to k points total by clustering the selected centers
– Independent selection: Easy MR
– $\ell$: Oversampling Parameter
– Pruning: Re-clustering step
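
A single-machine sketch of the oversample-then-prune idea follows. Several choices here are assumptions rather than the talk's method: the function name `kmeans_parallel_init`, a fixed `rounds` argument in place of the logarithmic bound, $\alpha = 2$, and pruning by weighted $D^2$ sampling over the candidates (each candidate weighted by the number of points nearest to it).

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans_parallel_init(points, k, ell=2.0, rounds=5, rng=None):
    """Oversample candidates for a few rounds, then prune them down to k centers."""
    rng = rng or random.Random()
    candidates = [rng.choice(points)]
    d2 = [sq_dist(p, candidates[0]) for p in points]
    for _ in range(rounds):
        total = sum(d2)
        if total == 0:
            break
        # Every point is kept independently of the others, so in a
        # distributed setting this pass needs no coordination between workers.
        picked = [p for p, w in zip(points, d2)
                  if rng.random() < min(1.0, k * ell * w / total)]
        for c in picked:
            d2 = [min(w, sq_dist(p, c)) for w, p in zip(d2, points)]
        candidates.extend(picked)
    # Re-clustering step: weight each candidate by how many points it serves.
    weights = [0] * len(candidates)
    for p in points:
        j = min(range(len(candidates)), key=lambda i: sq_dist(p, candidates[i]))
        weights[j] += 1
    # Prune to k with weighted D^2 sampling (one possible pruning choice).
    centers = [rng.choices(candidates, weights=weights, k=1)[0]]
    while len(centers) < k:
        w = [wt * min(sq_dist(c, s) for s in centers)
             for c, wt in zip(candidates, weights)]
        if sum(w) == 0:
            break  # fewer than k distinct candidates were collected
        centers.append(rng.choices(candidates, weights=w, k=1)[0])
    return centers


pts = [(float(i), float(i % 7)) for i in range(200)]
seeds = kmeans_parallel_init(pts, 5, ell=2.0, rounds=5, rng=random.Random(0))
```

Each round draws about `k * ell` candidates in expectation, so the number of sequential passes over the data is `rounds` rather than `k`.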

k-means||: Analysis
How Many Rounds?
– Theorem: After $O(\log_\ell(n\Delta))$ rounds, guarantee $O(1)$ approximation
– In practice: fewer iterations are needed
– Need to re-cluster $O(k\ell \log_\ell(n\Delta))$ intermediate centers
Discussion:
– Number of rounds independent of k
– Tradeoff between number of rounds and memory

How well does this work?
[Figure: clustering cost vs. log # rounds on the KDD dataset, k=65, for k-means|| with l/k = 1, 2, 4, compared against random initialization and k-means++]

Performance vs. k-means++
– Even better on small datasets: 4600 points, 50 dimensions (SPAM)
– Accuracy:
– Time (iterations):

Large k
How do you run k-means when k is large?
– For every point, need to find the nearest center
– Naive approach: linear scan
– Better approach [Elkan]:
• Use triangle inequality to see if the center could have possibly gotten closer
• Still expensive when k is large
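
The triangle-inequality test can be illustrated with a simplified, single-pass assignment. This is not Elkan's full algorithm, which also carries distance bounds across iterations. The function name `assign_with_pruning` and the `skipped` counter are assumptions added for illustration.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def assign_with_pruning(points, centers):
    """Nearest-center assignment that skips centers ruled out by the triangle inequality."""
    k = len(centers)
    # Center-to-center distances, computed once: O(k^2) distance evaluations.
    cc = [[dist(centers[i], centers[j]) for j in range(k)] for i in range(k)]
    assignment, skipped = [], 0
    for x in points:
        best, best_d = 0, dist(x, centers[0])
        for j in range(1, k):
            # If d(c_best, c_j) >= 2 * d(x, c_best), then d(x, c_j) >= d(x, c_best),
            # so center j cannot be strictly closer and its distance is not computed.
            if cc[best][j] >= 2 * best_d:
                skipped += 1
                continue
            dj = dist(x, centers[j])
            if dj < best_d:
                best, best_d = j, dj
        assignment.append(best)
    return assignment, skipped


points = [(0.0, 0.0), (0.5, 0.0), (10.0, 0.0), (10.5, 0.0), (20.0, 0.0)]
centers = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
assignment, skipped = assign_with_pruning(points, centers)
```

On this input 6 of the 10 candidate point-to-center distances are skipped. The `cc` table costs O(k^2) distance computations, which grows quickly as k increases.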

Using Nearest Neighbor Data Structures
Expensive step of k-Means:
– For every point, find the nearest center
But we have many algorithms for nearest neighbors!

First idea:
– Index the centers. Then do a query into this data structure for every point
– Need to rebuild the NN data structure every time
Better idea:
– Index the points!
– For every center, query the nearest points
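
One way the "index the points" assignment step could look is sketched below. Everything here is an assumption made for illustration: `query_nearest` is a brute-force stand-in for a real nearest-neighbor index, `m` is a hypothetical per-center retrieval size, and leaving unretrieved points on their previous assignment is a choice of this sketch.

```python
import heapq


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def query_nearest(points, center, m):
    """Stand-in for a real index lookup: ids of the m points closest to center."""
    return heapq.nsmallest(m, range(len(points)),
                           key=lambda i: sq_dist(points[i], center))


def assign_by_center_queries(points, centers, m, previous):
    """Each center retrieves its m nearest points; a retrieved point goes to the
    closest center that retrieved it. Unretrieved points keep their previous
    assignment (an assumption of this sketch)."""
    assignment = list(previous)
    best = {}  # point id -> (squared distance, center id)
    for j, c in enumerate(centers):
        for i in query_nearest(points, c, m):
            cand = (sq_dist(points[i], c), j)
            if i not in best or cand < best[i]:
                best[i] = cand
    for i, (_, j) in best.items():
        assignment[i] = j
    return assignment


points = [(0.0, 0.0), (1.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]
assignment = assign_by_center_queries(points, centers, m=2, previous=[0, 0, 0, 0])
```

The points do not move between iterations, so an index over them can be built once, and each iteration then issues k queries instead of one per point.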

Performance
Two large datasets:
– 1M points in each
– 7-25M features in each (very high dimensionality)
– Clustering into k=1000 clusters.
Index based k-means:
– Simple implementation: 2-7x faster than traditional k-means
– No degradation in quality (same objective function value)
– More complex implementation:
• An additional 8-50x speed improvement!

K-Means Algorithm
Almost 60 years on, still an incredibly popular and useful approach
It has gotten better with age:
– Better initialization approaches that are fast and accurate
– Parallel implementations to handle large datasets
– New implementations that handle points in many dimensions and clustering into many clusters
– New approaches for online clustering
More work remains!
– Non-spherical clusters
– Other metric spaces
– Dealing with outliers

Thank You.
Arthur, D., Vassilvitskii, S. k-means++: The advantages of careful seeding. SODA 2007.
Bahmani, B., Moseley, B., Vattani, A., Kumar, R., Vassilvitskii, S. Scalable k-means++. VLDB 2012.
Broder, A., Garcia-Pueyo, L., Josifovski, V., Vassilvitskii, S., Venkatesan, S. Scalable k-means by ranked retrieval. WSDM 2014.
