Clustering data at scale (full)
Dan Filimon, Politehnica University Bucharest
Ted Dunning, MapR Technologies
whoami
• Soon-to-be graduate of Politehnica University of Bucharest
• Apache Mahout committer
• This work is my senior project
• Contact me at
• dfilimon@apache.org
• dangeorge.filimon@gmail.com
Agenda
• Data
• Clustering
• k-means
• challenges
• (partial) solutions
• ball k-means
• Large-scale
• map-reduce
• k-means as a map-reduce
• streaming k-means
• the big picture
• Results
Data
• Real-valued vectors (or anything that can be encoded as such)
• Think of rows in a database table
• Can be: documents, web pages, images, videos, users, DNA
Clustering
Rationale
• Want to find clumps of higher density in the data
• Widely used for:
• Similarity detection
• Accelerating nearest-neighbor search
• Data approximation
• As a step in other algorithms (k-nearest-neighbor, radial basis functions, etc.)
The problem
Group n d-dimensional datapoints into k disjoint sets X_1, ..., X_k to minimize

$$\sum_{i=1}^{k} \left( \sum_{x_{ij} \in X_i} \mathrm{dist}(x_{ij}, c_i) \right)$$

where X_i is the i-th cluster, c_i is the centroid of the i-th cluster, x_ij is the j-th point in the i-th cluster, and dist(x, y) = ||x − y||².
But wait!
• NP-hard to solve in the general case
• Can only find locally optimal solutions
• Additional assumptions (sigma-separability) explain why we get good results in practice
Centroid-based clustering:
k-means
• CS folklore since the 50s, also known as Lloyd’s method
• Arbitrarily bad examples exist
• Very effective in practice
Example
(animation, over several slides: initial centroids → first iteration assignment → adjusted centroids → second iteration assignment → adjusted centroids)
k-means pseudocode
• initialize k points as centroids
• while not done
• assign each point to the cluster whose centroid it’s closest to
• set the mean of the points in each cluster as the new centroid
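To make the pseudocode concrete, here is a minimal single-machine sketch of Lloyd's iteration in Java (plain double[] points, squared Euclidean distance, random initialization, a fixed iteration count). The class and helper names are mine, and this is an illustration rather than the Mahout implementation; the later sketches in these notes reuse its squaredDistance helper.

```java
import java.util.Arrays;
import java.util.Random;

// Minimal sketch of Lloyd's k-means on dense double[] vectors.
// Not the Mahout implementation; initialization and stopping are simplified.
public class LloydKMeans {
  static double squaredDistance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  static double[][] cluster(double[][] points, int k, int iterations, Random random) {
    int d = points[0].length;
    // Initialize k centroids by picking k random points.
    double[][] centroids = new double[k][];
    for (int i = 0; i < k; i++) {
      centroids[i] = points[random.nextInt(points.length)].clone();
    }
    for (int iter = 0; iter < iterations; iter++) {
      double[][] sums = new double[k][d];
      int[] counts = new int[k];
      // Assignment step: each point goes to the closest centroid.
      for (double[] p : points) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < k; c++) {
          double dist = squaredDistance(p, centroids[c]);
          if (dist < bestDist) {
            bestDist = dist;
            best = c;
          }
        }
        counts[best]++;
        for (int j = 0; j < d; j++) {
          sums[best][j] += p[j];
        }
      }
      // Update step: the new centroid is the mean of its assigned points.
      for (int c = 0; c < k; c++) {
        if (counts[c] > 0) {
          for (int j = 0; j < d; j++) {
            centroids[c][j] = sums[c][j] / counts[c];
          }
        }
      }
    }
    return centroids;
  }

  public static void main(String[] args) {
    double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
    double[][] centroids = cluster(points, 2, 10, new Random(42));
    for (double[] c : centroids) {
      System.out.println(Arrays.toString(c));
    }
  }
}
```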
Challenges
Details
• Quality
• Need to clarify (from the pseudocode):
• How to initialize the centroids
• When to stop iterating
• How to deal with outliers
• Speed
• Complexity of cluster assignment
Centroid initialization
• Most important factor for quick convergence and quality clusters
• Randomly select k points as centroids
• Clustering fails if two centroids are in the same cluster
• Likely needs multiple restarts to get a good solution
(figure: an example random seeding with 2 seeds in one cluster and 1 seed in another)
Stop condition
• After a fixed number of iterations
• Once the clustering converges (one of):
• No points are assigned to a different cluster than in the previous iteration
• The “quality” of the clustering plateaus
Quality?
• Clustering is in the eye of the beholder
• Total clustering cost (this is what we optimize for)
• Would also like clusters that are:
• compact (small intra-cluster distances)
• well-separated (large inter-cluster distances)
• Dunn Index, Davies-Bouldin Index, other metrics
Outliers
• Real data is messy: the algorithm that collected the data has a bug, or the person who curated the data made a mistake
• Outliers can affect k-means centroids
(figure: “Effect of outliers” scatter plot showing an outlier pulling a centroid away from the better centroid)
Speed
• With n = # of points to cluster, k = # of clusters, d = dimensionality of the data:
• The complexity of the cluster assignment step for a basic implementation is O(n * k * d)
• The complexity of the cluster adjustment step is O(k * d)
• Can’t get rid of n, but can:
• search approximately and reduce the effective size of k
• reduce the dimensionality d of the vectors (PCA, random projections)
(Partial) Solutions
Centroid initialization revisited
• Smart initialization: k-means++
• Select the first 2 centroids with probability proportional to the distance between them
• While there are fewer than k centroids
• Select the next point with probability proportional to its shortest distance to an existing centroid
• Good quality guarantees, practically never fails
• Takes longer, more complicated data structure
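A minimal sketch of k-means++-style seeding, under the assumptions that the first centroid is picked uniformly at random and that “distance” here means the squared Euclidean distance defined earlier. It reuses the squaredDistance helper from the LloydKMeans sketch above and is not Mahout's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of k-means++ style seeding. Distances are squared, matching
// dist(x, y) = ||x - y||^2 from the problem statement.
public class KMeansPlusPlusSeeding {
  static List<double[]> seed(double[][] points, int k, Random random) {
    List<double[]> centroids = new ArrayList<>();
    centroids.add(points[random.nextInt(points.length)].clone());  // first: uniform
    double[] minDist = new double[points.length];
    while (centroids.size() < k) {
      // Update each point's distance to its nearest chosen centroid.
      double[] latest = centroids.get(centroids.size() - 1);
      double total = 0;
      for (int i = 0; i < points.length; i++) {
        double d = LloydKMeans.squaredDistance(points[i], latest);
        if (centroids.size() == 1 || d < minDist[i]) {
          minDist[i] = d;
        }
        total += minDist[i];
      }
      // Pick the next centroid with probability proportional to that distance.
      double r = random.nextDouble() * total;
      int chosen = points.length - 1;
      for (int i = 0; i < points.length; i++) {
        r -= minDist[i];
        if (r <= 0) {
          chosen = i;
          break;
        }
      }
      centroids.add(points[chosen].clone());
    }
    return centroids;
  }
}
```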
Outliers revisited
• Consider just the points that are “close” to the cluster when computing the new mean
• “Close” typically means a fraction of the distance to the other closest centroid
• This is known as ball k-means
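A sketch of the ball k-means update for one cluster's centroid. The trimFraction parameter and the method shape are illustrative assumptions rather than Mahout's exact rule; it reuses squaredDistance from the LloydKMeans sketch.

```java
// Sketch of the ball k-means centroid update for one cluster.
public class BallKMeansUpdate {
  // distanceToRunnerUp[i] is the (squared) distance from assignedPoints[i] to its
  // second-closest centroid, computed during the assignment step.
  static double[] update(double[][] assignedPoints, double[] ownCentroid,
                         double[] distanceToRunnerUp, double trimFraction) {
    double[] sum = new double[ownCentroid.length];
    int count = 0;
    for (int i = 0; i < assignedPoints.length; i++) {
      double dOwn = LloydKMeans.squaredDistance(assignedPoints[i], ownCentroid);
      // Keep only points inside the "ball": much closer to us than to the runner-up.
      if (dOwn < trimFraction * distanceToRunnerUp[i]) {
        for (int j = 0; j < sum.length; j++) {
          sum[j] += assignedPoints[i][j];
        }
        count++;
      }
    }
    if (count == 0) {
      return ownCentroid;  // no close points: keep the old centroid
    }
    for (int j = 0; j < sum.length; j++) {
      sum[j] /= count;
    }
    return sum;
  }
}
```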
Closest cluster: reducing k
• Avoid computing the distance from a point to every cluster
• Random Projections
• projection search
• locality-sensitive hashing search
Projections
• Projecting every vector on to a “projection basis” (a randomly sampled, unit length vector) induces a total ordering
• Points that are close together in the original space will project close together
• The converse is not true
Example
(figure: 2-D scatter plot illustrating points projected onto a projection vector)
Random Projections
• Widely used for nearest neighbor searches
• Unit length vectors with normally distributed components
How to search?
• Project all the vectors to search among on to the basis
• Keep the results sorted (binary search tree, sorted arrays)
• Search for the projected query (log k)
• Compute actual distances for elements in a ball around the resulting index
• Use more than one projection vector for increased accuracy
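A sketch of single-projection search over a set of candidates (for example, the current centroids). The class and method names are mine, and this is not Mahout's ProjectionSearch; it reuses squaredDistance from the LloydKMeans sketch. Running several such searchers with different bases and keeping the best result gives the multi-projection variant mentioned above.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

// Sketch of single-projection approximate nearest-neighbor search.
public class ProjectionSearchSketch {
  final double[] basis;          // random unit-length projection vector
  final double[][] candidates;   // candidate vectors, sorted by projection
  final double[] projections;    // projection of each candidate onto the basis

  ProjectionSearchSketch(double[][] candidates, int dimension, Random random) {
    // Random unit vector with normally distributed components.
    basis = new double[dimension];
    double norm = 0;
    for (int i = 0; i < dimension; i++) {
      basis[i] = random.nextGaussian();
      norm += basis[i] * basis[i];
    }
    norm = Math.sqrt(norm);
    for (int i = 0; i < dimension; i++) {
      basis[i] /= norm;
    }
    // Sort candidates by their projection so we can binary-search later.
    this.candidates = candidates.clone();
    Arrays.sort(this.candidates, Comparator.comparingDouble(this::project));
    projections = new double[candidates.length];
    for (int i = 0; i < candidates.length; i++) {
      projections[i] = project(this.candidates[i]);
    }
  }

  double project(double[] v) {
    double dot = 0;
    for (int i = 0; i < v.length; i++) {
      dot += v[i] * basis[i];
    }
    return dot;
  }

  // Binary search on the projection, then check a small ball of candidates
  // around that index with true distances.
  double[] searchApproximate(double[] query, int ballSize) {
    int idx = Arrays.binarySearch(projections, project(query));
    if (idx < 0) {
      idx = -idx - 1;                       // insertion point
    }
    int lo = Math.max(0, idx - ballSize);
    int hi = Math.min(candidates.length - 1, idx + ballSize);
    double[] best = candidates[lo];
    double bestDist = Double.POSITIVE_INFINITY;
    for (int i = lo; i <= hi; i++) {
      double dist = LloydKMeans.squaredDistance(query, candidates[i]);
      if (dist < bestDist) {
        bestDist = dist;
        best = candidates[i];
      }
    }
    return best;
  }
}
```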
How many projections?
(figure: probability of finding the nearest neighbor using single projection search against 10,000 points, plotted against dimension from 4 to 1000)
Locality Sensitive Hashing
• Based on multiple random projections
• Use the signs of the projections to set bits in a hash (32 or 64 bits) to 1 or 0
• The Hamming Distance between two hashes is correlated with the cosine distance between their corresponding vectors
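A sketch of the hashing step, assuming a 64-bit hash with one random projection per bit; the names are mine and this is not Mahout's LocalitySensitiveHashSearch.

```java
import java.util.Random;

// Sketch of a 64-bit locality-sensitive hash based on random projections.
public class LshSketch {
  final double[][] bases;   // 64 random projection vectors

  LshSketch(int dimension, Random random) {
    bases = new double[64][dimension];
    for (double[] basis : bases) {
      for (int i = 0; i < dimension; i++) {
        basis[i] = random.nextGaussian();
      }
    }
  }

  // One bit per projection: 1 if the dot product is positive, 0 otherwise.
  long hash(double[] v) {
    long h = 0;
    for (int b = 0; b < 64; b++) {
      double dot = 0;
      for (int i = 0; i < v.length; i++) {
        dot += v[i] * bases[b][i];
      }
      if (dot > 0) {
        h |= 1L << b;
      }
    }
    return h;
  }

  // Hamming distance between two hashes; correlated with the cosine distance
  // between the original vectors.
  static int hammingDistance(long h1, long h2) {
    return Long.bitCount(h1 ^ h2);
  }
}
```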
Correlation plot
(figure: correlation between the Hamming distance of 64-bit hashes and the cosine distance of the underlying vectors)
How to search?
• Compute the hashes for each vector to search
• Keep the results sorted (binary search tree, sorted arrays)
• Search for the projected query (log k)
• Compute Hamming Distance for elements in a ball around the resulting index
• If it’s low enough, compute the actual distance
Closest cluster: reducing d
• Principal Component Analysis
• High quality and known guarantees
• Select the first d’ singular vectors from the SVD of the data
• Random Projections
• Create a d x d’ projection matrix
• Multiply the data matrix (n x d) on the right by the projection matrix
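A sketch of the random-projection route, assuming Gaussian entries and a 1/sqrt(d') scaling (a common but not universal choice); the names are illustrative.

```java
import java.util.Random;

// Sketch of random projection from d dimensions down to dPrime dimensions.
public class RandomProjection {
  // Build a d x dPrime projection matrix with normally distributed entries.
  static double[][] projectionMatrix(int d, int dPrime, Random random) {
    double[][] m = new double[d][dPrime];
    for (int i = 0; i < d; i++) {
      for (int j = 0; j < dPrime; j++) {
        m[i][j] = random.nextGaussian() / Math.sqrt(dPrime);  // common scaling choice
      }
    }
    return m;
  }

  // Project one row of the (n x d) data matrix: the result is 1 x dPrime.
  static double[] project(double[] row, double[][] projection) {
    int dPrime = projection[0].length;
    double[] result = new double[dPrime];
    for (int i = 0; i < row.length; i++) {
      for (int j = 0; j < dPrime; j++) {
        result[j] += row[i] * projection[i][j];
      }
    }
    return result;
  }
}
```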
Large scale clustering
Big Data
• Have to use large-scale techniques because there is too much data
• doesn’t fit in RAM
• doesn’t fit on disk
• distributed across many machines in a cluster
• Paradigm of choice: MapReduce, implemented in Hadoop
MapReduce Overview
• Awesome illustrations at CS Illustrated Berkeley http://csillustrated.berkeley.edu/illustrations.php
• http://csillustrated.berkeley.edu/PDFs/handouts/mapreduce-1-handout.pdf
• http://csillustrated.berkeley.edu/PDFs/handouts/mapreduce-2-example-handout.pdf
k-means as a MapReduce
• Can’t split the k-means loop directly:
• An iteration depends on the previous one’s results
• Must express a single k-means step as a MapReduce
• The job driver runs the outer loop, submitting a job for every iteration, where the output of the last one is the input of the current one
• Centroids at each step are read from and written to files on disk
k-means as a MapReduce
• initialize k points as centroids
• the driver samples the centroids from disk using reservoir sampling
• while not done (multiple map-reduce steps)
• assign each point to the cluster whose centroid it’s closest to
• each mapper gets a chunk of the datapoints and all the centroids, emitting (cluster id, point) pairs
• set the mean of the points in each cluster as the new centroid
• each reducer aggregates its cluster indices and the chunks they’re made up of
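A conceptual sketch of one iteration expressed as map and reduce functions. It deliberately uses plain Java collections instead of the Hadoop API so the shape of the computation is visible, and it reuses squaredDistance from the LloydKMeans sketch; the class and method names are mine.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch of one k-means iteration as a map and a reduce function.
// The driver would broadcast `centroids` to every mapper and collect the
// reducer output as the next iteration's centroids (read from/written to disk).
public class KMeansIterationSketch {
  // Map: for each point in this mapper's chunk, emit (closest cluster id, point).
  static Map<Integer, List<double[]>> map(double[][] chunk, double[][] centroids) {
    Map<Integer, List<double[]>> emitted = new HashMap<>();
    for (double[] p : chunk) {
      int best = 0;
      double bestDist = Double.POSITIVE_INFINITY;
      for (int c = 0; c < centroids.length; c++) {
        double dist = LloydKMeans.squaredDistance(p, centroids[c]);
        if (dist < bestDist) {
          bestDist = dist;
          best = c;
        }
      }
      emitted.computeIfAbsent(best, id -> new ArrayList<>()).add(p);
    }
    return emitted;
  }

  // Reduce: for one cluster id, average all points assigned to it.
  static double[] reduce(List<double[]> pointsForCluster) {
    int d = pointsForCluster.get(0).length;
    double[] mean = new double[d];
    for (double[] p : pointsForCluster) {
      for (int j = 0; j < d; j++) {
        mean[j] += p[j];
      }
    }
    for (int j = 0; j < d; j++) {
      mean[j] /= pointsForCluster.size();
    }
    return mean;
  }
}
```

A real Hadoop job would typically also use a combiner to sum points per cluster locally, so each reducer receives partial sums and counts rather than raw points.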
Issues
• Initial centroid selection requires iterating through all the datapoints (accessing remote chunks, loading into memory, etc.)
• Cannot feasibly implement k-means++ initialization
• Each map step requires an iteration through the entire data set
• Needs multiple MapReduces (and intermediate data is written to disk)
• Could implement faster nearest-neighbor searching and ball k-means (but Mahout doesn’t)
Now for something completely different:
Streaming k-means
• Attempt to build a “sketch” of the data (can build a provably good sketch in one pass)
• Won’t get k clusters, will get O(k log n) clusters
• Can collapse the results down to k with ball k-means
• The sketch can be small enough to fit into memory, so the clustering can use all the previous tricks
• Fast and Accurate k-means for Large Datasets, M. Shindler, A. Wong, A. Meyerson
http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf
Streaming k-means pseudocode
• for each point p, with weight w
• find the closest centroid to p, call it c, and let d be the distance between p and c
• if an event with probability proportional to d * w / distanceCutoff occurs
• create a new cluster with p as its centroid
• else, merge p into c
• if there are too many clusters, increase distanceCutoff and cluster recursively (“collapse” the centroids)
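A sketch of this sketch-building pass. The parameter names (distanceCutoff, maxClusters), the growth factor, and the probability clamp are my assumptions, not the paper's or Mahout's exact choices, and the collapse step is simplified; it reuses squaredDistance from the LloydKMeans sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch of the streaming k-means one-pass sketch builder.
public class StreamingKMeansSketch {
  static class Centroid {
    double[] center;
    double weight;
    Centroid(double[] center, double weight) {
      this.center = center.clone();
      this.weight = weight;
    }
    // Merge a point into this centroid as a weighted mean.
    void merge(double[] p, double w) {
      for (int j = 0; j < center.length; j++) {
        center[j] = (center[j] * weight + p[j] * w) / (weight + w);
      }
      weight += w;
    }
  }

  final List<Centroid> centroids = new ArrayList<>();
  double distanceCutoff;
  final int maxClusters;
  final Random random;

  StreamingKMeansSketch(double distanceCutoff, int maxClusters, Random random) {
    this.distanceCutoff = distanceCutoff;
    this.maxClusters = maxClusters;
    this.random = random;
  }

  void add(double[] p, double w) {
    if (centroids.isEmpty()) {
      centroids.add(new Centroid(p, w));
      return;
    }
    // Find the closest existing centroid.
    Centroid closest = centroids.get(0);
    double bestDist = Double.POSITIVE_INFINITY;
    for (Centroid c : centroids) {
      double dist = LloydKMeans.squaredDistance(p, c.center);
      if (dist < bestDist) {
        bestDist = dist;
        closest = c;
      }
    }
    // With probability proportional to d * w / distanceCutoff, start a new cluster.
    if (random.nextDouble() < Math.min(1.0, bestDist * w / distanceCutoff)) {
      centroids.add(new Centroid(p, w));
    } else {
      closest.merge(p, w);
    }
    // Too many clusters: raise the cutoff and collapse the centroids.
    if (centroids.size() > maxClusters) {
      distanceCutoff *= 1.5;   // illustrative growth factor
      List<Centroid> old = new ArrayList<>(centroids);
      centroids.clear();
      for (Centroid c : old) {
        add(c.center, c.weight);   // re-cluster the centroids themselves
      }
    }
  }
}
```

The Mahout implementation additionally uses the fast nearest-neighbor searchers described earlier instead of the linear scan in this sketch.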
Example
(animation, over several slides: the first point becomes a centroid → a second, far-away point becomes a new centroid → a third, close point updates an existing centroid → a later point that is close may still become a new centroid)
Discussion
• The resulting O(k log n) clusters are a good sketch of the data
• The number of clusters in the sketch will not always be the same
• For some applications (notably approximation), these clusters are fine
• Most important parameter is the initial distanceCutoff (which is data dependent)
• Too large: loose clusters, poor quality
• Too small (less bad): frequent collapses (and it will be increased anyway)
• Rule of thumb: on the order of the minimum distance between two points
The big picture
• Can cluster all the points with just 1 MapReduce:
• m mappers run streaming k-means to get O(k log (n/m)) points
• 1 reducer runs ball k-means to get k clusters
• The reducer can perform another streaming k-means step before ball k-means to collapse the clusters further so they fit into memory
• Reducer k-means step supports k-means++, multiple restarts with a train/test split, ball k-means and fast nearest neighbor searching
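A conceptual sketch of that pipeline, reusing the hypothetical StreamingKMeansSketch and LloydKMeans classes above. The reducer side is reduced to weighted Lloyd iterations here, whereas the real reducer would use k-means++ seeding and ball k-means; the distanceCutoff and maxClusters values are arbitrary placeholders.

```java
import java.util.List;
import java.util.Random;

// Conceptual sketch of the one-MapReduce pipeline: mappers summarize their
// chunks into weighted sketch centroids, one reducer clusters the union down to k.
public class BigPictureSketch {
  // "Map" side: summarize one chunk into a small set of weighted centroids.
  static List<StreamingKMeansSketch.Centroid> mapChunk(double[][] chunk, int k, Random random) {
    StreamingKMeansSketch sketch = new StreamingKMeansSketch(1e-3, 20 * k, random);
    for (double[] p : chunk) {
      sketch.add(p, 1.0);
    }
    return sketch.centroids;
  }

  // "Reduce" side: weighted Lloyd iterations over all sketch centroids.
  static double[][] reduce(List<StreamingKMeansSketch.Centroid> all, int k, int iters, Random random) {
    double[][] centers = new double[k][];
    for (int i = 0; i < k; i++) {
      centers[i] = all.get(random.nextInt(all.size())).center.clone();
    }
    for (int iter = 0; iter < iters; iter++) {
      double[][] sums = new double[k][centers[0].length];
      double[] weights = new double[k];
      for (StreamingKMeansSketch.Centroid c : all) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int j = 0; j < k; j++) {
          double dist = LloydKMeans.squaredDistance(c.center, centers[j]);
          if (dist < bestDist) {
            bestDist = dist;
            best = j;
          }
        }
        weights[best] += c.weight;
        for (int d = 0; d < c.center.length; d++) {
          sums[best][d] += c.center[d] * c.weight;
        }
      }
      for (int j = 0; j < k; j++) {
        if (weights[j] > 0) {
          for (int d = 0; d < sums[j].length; d++) {
            centers[j][d] = sums[j][d] / weights[j];
          }
        }
      }
    }
    return centers;
  }
}
```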
Results & Implementation
Quality (vs KMeans)
• Compared the quality on various small-medium UCI data sets
• iris, seeds, movement, control, power
• Computed the following quality measures:
• Dunn Index (higher is better)
• Davies-Bouldin Index (lower is better)
• Adjusted Rand Index (higher is better)
• Total cost (lower is better)
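For reference, a sketch of one of these metrics, the Dunn index, in the variant that divides the smallest inter-cluster point distance by the largest intra-cluster diameter (other variants exist); it reuses squaredDistance from the LloydKMeans sketch and is not Mahout's ClusteringUtils.

```java
import java.util.List;

// Sketch of the Dunn index: smallest distance between points in different
// clusters divided by the largest distance between points in the same cluster.
// Higher values mean compact, well-separated clusters.
public class DunnIndexSketch {
  static double dunnIndex(List<List<double[]>> clusters) {
    double minSeparation = Double.POSITIVE_INFINITY;
    double maxDiameter = 0;
    for (int a = 0; a < clusters.size(); a++) {
      for (double[] p : clusters.get(a)) {
        // Diameter: largest intra-cluster distance.
        for (double[] q : clusters.get(a)) {
          maxDiameter = Math.max(maxDiameter, Math.sqrt(LloydKMeans.squaredDistance(p, q)));
        }
        // Separation: smallest inter-cluster distance.
        for (int b = a + 1; b < clusters.size(); b++) {
          for (double[] q : clusters.get(b)) {
            minSeparation = Math.min(minSeparation, Math.sqrt(LloydKMeans.squaredDistance(p, q)));
          }
        }
      }
    }
    return minSeparation / maxDiameter;
  }
}
```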
iris randplot-2
(figure: confusion matrix for run 2, Ball/Streaming k-means centroids vs. Mahout k-means centroids; Adjusted Rand Index 1)
iris randplot-4
(figure: confusion matrix for run 4, Ball/Streaming k-means centroids vs. Mahout k-means centroids; Adjusted Rand Index 0.546899)
seeds compareplot
(figure: overall mean cluster distance, bskm 4.31 vs. km 4.22)
movement compareplot
(figure: per-cluster average distances, km vs. bskm; km mean 1.403378, bskm mean 1.50878)
control randplot-3
(figure: confusion matrix for run 3, Ball/Streaming k-means centroids vs. Mahout k-means centroids; Adjusted Rand Index 0.724136)
power allplot
(figure: overall mean cluster distance, bskm 142.10 vs. km 149.36)
power compareplot
(figure: per-cluster average distances, km vs. bskm; km mean 149.3592, bskm mean 142.1018)
Overall (avg. 5 runs)

Dataset    Clustering   Avg. Dunn   Avg. DB   Avg. Cost   Avg. ARI
iris       km           9.161       0.265     124.146     0.905
iris       bskm         6.454       0.336     117.859     0.905
seeds      km           7.432       0.453     909.875     0.980
seeds      bskm         6.886       0.505     916.511     0.980
movement   km           0.457       1.843     336.456     0.650
movement   bskm         0.436       2.003     347.078     0.650
control    km           0.553       1.700     1014313     0.630
control    bskm         0.753       1.434     1004917     0.630
power      km           0.107       1.380     73339083    0.605
power      bskm         1.953       1.080     54422758    0.605
Speed (Threaded)
Speed (MapReduce)
• Synthetic data
• 500-dimensional, ~120 million points, 6.7 GB
• 20 corners of a hypercube as means of a multivariate normal distribution (90%) + uniform noise (10%)
• 5-node MapR cluster on Amazon EC2
• KMeans with 20 clusters, 10 iterations
• StreamingKMeans with 20 clusters, 100 map clusters
Speed (MapReduce)
(figure: running time for 20 clusters, 1 iteration, KMeans vs. StreamingKMeans; iteration times are comparable, StreamingKMeans ~2.4x faster)
Speed (MapReduce)
(figure: running time for 1000 clusters, 1 iteration, KMeans vs. StreamingKMeans; StreamingKMeans benefits from approximate nearest-neighbor search, ~8x faster)
Code
• Now in trunk, part of Apache Mahout 0.8
• Easy to integrate with the rest of the Hadoop ecosystem (e.g. cluster vectors from Apache Lucene/Solr)

Clustering algorithms:         BallKMeans, StreamingKMeans
MapReduce classes:             StreamingKMeansDriver, StreamingKMeansMapper, StreamingKMeansReducer, StreamingKMeansThread
Fast nearest-neighbor search:  ProjectionSearch, FastProjectionSearch, LocalitySensitiveHashSearch
Quality metrics:               ClusteringUtils
Thank you!
Questions?
dfilimon@apache.org
dangeorge.filimon@gmail.com