Large-scale Single-pass k-Means Clustering at Scale
Ted Dunning


Goals

- Cluster very large data sets
- Facilitate large-scale nearest-neighbor search
- Allow a very large number of clusters
- Achieve good quality
  – low average distance to nearest centroid on held-out data
- Based on Mahout Math
- Runs on a Hadoop (really MapR) cluster
- FAST – cluster tens of millions of points in minutes




Non-goals

- Use map-reduce (but it is there)
- Minimize the number of clusters
- Support metrics other than L2




Anti-goals

- Multiple passes over the original data
- Scaling as O(kn)




Why?




K-nearest Neighbor with Super Fast k-means




What’s that?

- Find the k nearest training examples
- Use the average value of the target variable from them

- This is easy … but hard
  – easy because it is so conceptually simple and you don't have knobs to turn or models to build
  – hard because of the stunning amount of math
  – also hard because we need the top 50,000 results

- Initial prototype was massively too slow
  – 3K queries x 200K examples took hours
  – needed 20M x 25M in the same time
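
A minimal sketch of the scoring step described above, in plain Java with brute-force search (all names are illustrative, not the hackathon code; the rest of the talk is about replacing this inner loop with something much faster):

    import java.util.PriorityQueue;

    // k-NN regression: average the target values of the k training
    // examples nearest (by L2 distance) to the query point.
    public class KnnSketch {
        static double knnScore(double[][] train, double[] targets, double[] query, int k) {
            // max-heap on distance, so the root is the worst of the current k best
            PriorityQueue<double[]> best =
                new PriorityQueue<>((a, b) -> Double.compare(b[0], a[0]));
            for (int i = 0; i < train.length; i++) {
                double d = 0;
                for (int j = 0; j < query.length; j++) {
                    double diff = train[i][j] - query[j];
                    d += diff * diff;
                }
                best.add(new double[] {d, targets[i]});
                if (best.size() > k) {
                    best.poll();                   // drop the farthest candidate
                }
            }
            double sum = 0;
            for (double[] entry : best) {
                sum += entry[1];
            }
            return sum / best.size();              // average target of the k nearest
        }
    }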

How We Did It

- 2-week hackathon with 6 developers from a customer bank
- Agile-ish development
- To avoid IP issues
  – all code is Apache licensed (no ownership question)
  – all data is synthetic (no question of private data)
  – all development done on individual machines, hosted on Github
  – open is easier than closed (in this case)
- Goal is new open technology to facilitate new closed solutions

- Ambitious goal of ~1,000,000x speedup
  – well, really only 100-1000x after basic hygiene
What We Did

- Mechanism for extending Mahout Vectors
  – DelegatingVector, WeightedVector, Centroid

- Shared-memory matrix
  – FileBasedMatrix uses mmap to share very large dense matrices (see the sketch below)

- Searcher interface
  – ProjectionSearch, KmeansSearch, LshSearch, Brute

- Super-fast clustering
  – Kmeans, StreamingKmeans
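
A minimal sketch of the memory-mapping idea behind FileBasedMatrix, assuming doubles stored row-major in a file (this is illustrative, not the actual Mahout class):

    import java.io.RandomAccessFile;
    import java.nio.DoubleBuffer;
    import java.nio.channels.FileChannel;

    // Expose a large dense matrix stored row-major in a file via mmap, so
    // several processes can share one copy through the OS page cache.
    public class MmapMatrixSketch {
        private final DoubleBuffer data;
        private final int columns;

        public MmapMatrixSketch(String path, int rows, int columns) throws Exception {
            this.columns = columns;
            try (RandomAccessFile raf = new RandomAccessFile(path, "r");
                 FileChannel channel = raf.getChannel()) {
                // a single mapping is limited to 2 GB, so a real implementation
                // maps the file in chunks; one chunk keeps this sketch short
                long bytes = (long) rows * columns * Double.BYTES;
                data = channel.map(FileChannel.MapMode.READ_ONLY, 0, bytes).asDoubleBuffer();
            }
        }

        public double get(int row, int col) {
            return data.get(row * columns + col);  // no copy; paged in on demand
        }
    }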

Projection Search

- Project the data onto a few random vectors
- Keep each projection sorted (a java.util.TreeSet works); the candidates for a query are the points whose projections fall nearest the query's projection, re-ranked by true distance
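
A minimal sketch of one such projection (names are illustrative; a TreeMap stands in for the TreeSet so each projection value can carry its point, and a real searcher unions the candidate windows from several projections):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.NavigableMap;
    import java.util.Random;
    import java.util.TreeMap;

    // Projection search: order points by their dot product with a random
    // direction, then answer queries by scanning a small window around the
    // query's projection and re-ranking those candidates by true distance.
    public class ProjectionSearchSketch {
        private final double[] direction;                  // one random projection
        private final TreeMap<Double, double[]> index = new TreeMap<>();

        public ProjectionSearchSketch(int dim, Random rand) {
            direction = new double[dim];
            for (int i = 0; i < dim; i++) {
                direction[i] = rand.nextGaussian();
            }
        }

        static double dot(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) {
                s += a[i] * b[i];
            }
            return s;
        }

        public void add(double[] point) {
            // sorted by projection value; ties would need a multimap in real code
            index.put(dot(direction, point), point);
        }

        // Return up to 2 * window candidates near the query along this line.
        public List<double[]> candidates(double[] query, int window) {
            double key = dot(direction, query);
            List<double[]> result = new ArrayList<>();
            NavigableMap<Double, double[]> below = index.headMap(key, true).descendingMap();
            for (double[] p : below.values()) {
                if (result.size() >= window) break;
                result.add(p);
            }
            for (double[] p : index.tailMap(key, false).values()) {
                if (result.size() >= 2 * window) break;
                result.add(p);
            }
            return result;                                 // caller re-ranks by true distance
        }
    }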




How Many Projections?

[Chart not reproduced in this text version.]
K-means Search

- Simple idea
  – pre-cluster the data
  – to find the nearest points, search only the nearest clusters

- Recursive application
  – to search a cluster, use a Searcher! (see the sketch below)
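
A minimal sketch of the two-level form of this idea (names are illustrative; a real implementation probes several nearby clusters and can recurse by making the inner loop another Searcher):

    import java.util.Arrays;
    import java.util.List;

    // Two-level nearest-neighbor search: find the closest centroids first,
    // then look only at the points assigned to those clusters.
    public class KmeansSearchSketch {
        private final double[][] centroids;          // centers from a pre-clustering pass
        private final List<List<double[]>> members;  // points assigned to each cluster

        public KmeansSearchSketch(double[][] centroids, List<List<double[]>> members) {
            this.centroids = centroids;
            this.members = members;
        }

        static double dist2(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                s += d * d;
            }
            return s;
        }

        // Search only the `probes` clusters nearest the query.
        public double[] nearest(double[] query, int probes) {
            Integer[] order = new Integer[centroids.length];
            for (int i = 0; i < order.length; i++) order[i] = i;
            Arrays.sort(order, (a, b) ->
                Double.compare(dist2(centroids[a], query), dist2(centroids[b], query)));

            double best = Double.POSITIVE_INFINITY;
            double[] bestPoint = null;
            for (int i = 0; i < Math.min(probes, order.length); i++) {
                for (double[] p : members.get(order[i])) {     // brute force per cluster;
                    double d = dist2(p, query);                // recursively, this inner
                    if (d < best) { best = d; bestPoint = p; } // loop could be a Searcher
                }
            }
            return bestPoint;
        }
    }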




[Slides 16-20: step-by-step illustrations of k-means search; the figures are not reproduced in this text version.]
But This Requires k-means!

- Need a new k-means algorithm to get speed
  – Hadoop is very slow at iterative map-reduce
  – maybe Pregel clones like Giraph would be better
  – or maybe not

- Streaming k-means is
  – one pass (through the original data)
  – very fast (20 μs per data point with threads)
  – very parallelizable




Basic Method

- Use a single pass of k-means with very many clusters
  – output is a bad-ish clustering but a good surrogate for the data
- Use the weighted centroids from step 1 to do in-memory clustering
  – output is a good clustering with fewer clusters




Algorithmic Details

For each data point x_n:
    compute the distance to the nearest centroid, d
    sample u ~ Uniform(0, 1); if u > d/β, add x_n to the nearest centroid,
    else create a new centroid

    if the number of centroids > 10 log n:
        recursively cluster the centroids
        set β = 1.5 β if the number of centroids did not decrease
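
A minimal single-threaded sketch of that loop in plain Java (the u > d/β rule, the 10 log n limit, and the 1.5 β growth follow the slide; everything else, including the class and method names and the starting β, is illustrative):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // One-pass streaming k-means: each point either joins its nearest
    // centroid or seeds a new one; too many centroids trigger a collapse
    // that reruns the same rule over the centroids themselves.
    public class StreamingKmeansSketch {
        static class Centroid {
            double[] mean;
            double weight;
            Centroid(double[] p, double w) { mean = p.clone(); weight = w; }
            void add(double[] p, double w) {           // weighted running mean
                weight += w;
                for (int i = 0; i < mean.length; i++) {
                    mean[i] += (p[i] - mean[i]) * w / weight;
                }
            }
        }

        static double dist(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) {
                double d = a[i] - b[i];
                s += d * d;
            }
            return Math.sqrt(s);
        }

        // Assign-or-create rule from the slide: merge with probability
        // rising as the distance d shrinks relative to the cutoff beta.
        static void assign(List<Centroid> centroids, double[] x, double w,
                           double beta, Random rand) {
            Centroid nearest = null;
            double d = Double.POSITIVE_INFINITY;
            for (Centroid c : centroids) {
                double dc = dist(c.mean, x);
                if (dc < d) { d = dc; nearest = c; }
            }
            if (nearest != null && rand.nextDouble() > d / beta) {
                nearest.add(x, w);                     // close enough: merge
            } else {
                centroids.add(new Centroid(x, w));     // far away: new centroid
            }
        }

        // One pass of the same rule over weighted centroids (collapse step).
        static List<Centroid> onePass(List<Centroid> points, double beta, Random rand) {
            List<Centroid> out = new ArrayList<>();
            for (Centroid p : points) {
                assign(out, p.mean, p.weight, beta, rand);
            }
            return out;
        }

        static List<Centroid> cluster(List<double[]> data, Random rand) {
            List<Centroid> centroids = new ArrayList<>();
            double beta = 1;                           // initial cutoff (not fixed by the slide)
            long n = 0;
            for (double[] x : data) {
                n++;
                assign(centroids, x, 1, beta, rand);
                // too many centroids: recursively cluster the centroids,
                // growing beta whenever that fails to shrink them
                // (the +1 guards the n = 1 edge case where log n = 0)
                while (centroids.size() > 10 * Math.log(n) + 1) {
                    int before = centroids.size();
                    centroids = onePass(centroids, beta, rand);
                    if (centroids.size() >= before) {
                        beta *= 1.5;
                    }
                }
            }
            return centroids;
        }
    }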




How It Works

- Result is a large set of centroids
  – these provide an approximation of the original distribution
  – we can cluster the centroids to get a close approximation of clustering the original data
  – or we can just use the result directly




Parallel Speedup?

[Figure not reproduced: time per data point (μs) versus number of threads, 1-20. The non-threaded version sits near 200 μs per point; the threaded version approaches 20 μs per point at high thread counts, tracking a perfect-scaling reference line.]
Warning, Recursive Descent

- Inner loop requires finding the nearest centroid

- With lots of centroids, this is slow

- But wait, we have classes to accelerate that!

                   (Let's not use the k-means searcher, though)

- Empirically, projection search beats 64-bit LSH by a bit



Moving to Scale

- Map-reduce implementation is nearly trivial

- Map: rough-cluster the input data; output β and the weighted centroids (see the sketch below)

- Reduce:
  – a single reducer gets all the centroids
  – if there are too many centroids, merge them using recursive clustering
  – optionally do the final clustering in-memory

- A combiner is possible, but essentially never important
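
A minimal sketch of that map/reduce shape, written as plain Java functions rather than against the Hadoop API (it reuses the hypothetical StreamingKmeansSketch from the algorithm slide; all names are illustrative):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Map/reduce shape of the scale-out version: each mapper rough-clusters
    // its split; one reducer merges all the weighted centroids by clustering
    // them again with the same assign-or-create rule.
    public class ScaleOutSketch {
        // Map: one-pass cluster a split and emit its weighted centroids.
        static List<StreamingKmeansSketch.Centroid> map(List<double[]> split) {
            return StreamingKmeansSketch.cluster(split, new Random());
        }

        // Reduce: concatenate every mapper's centroids and, if there are too
        // many, collapse them by clustering the weighted centroids. (The slide
        // also forwards each mapper's beta; this sketch just restarts its own.)
        static List<StreamingKmeansSketch.Centroid> reduce(
                List<List<StreamingKmeansSketch.Centroid>> perMapper, int maxCentroids) {
            List<StreamingKmeansSketch.Centroid> all = new ArrayList<>();
            for (List<StreamingKmeansSketch.Centroid> part : perMapper) {
                all.addAll(part);
            }
            Random rand = new Random();
            double beta = 1;
            while (all.size() > maxCentroids) {
                int before = all.size();
                all = StreamingKmeansSketch.onePass(all, beta, rand);
                if (all.size() >= before) {
                    beta *= 1.5;              // grow the cutoff until merging succeeds
                }
            }
            return all;                       // optionally: final in-memory k-means here
        }
    }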



Contact:
  – tdunning@maprtech.com
  – @ted_dunning

Slides and such:
  – http://info.mapr.com/ted-mlconf

Hash tags: #mlconf #mahout #mapr




