Clustering with Spark 
Sandy Ryza / Data Science / Cloudera
Me 
● Data scientist at Cloudera 
● Recently led Apache Spark development at 
Cloudera 
● Before that, committing on Apache Hadoop 
● Before that, studying combinatorial 
optimization and distributed systems at 
Brown
Sometimes you find yourself 
with lots of stuff
Large Scale Learning 
● Network Packets → Detect Network Intrusions 
● Credit Card Transactions → Detect Fraud 
● Movie Viewings → Recommend Movies
Unsupervised Learning 
● Learn hidden structure of your data 
● Interpret new data as it relates to this 
structure
Two Main Problems 
● Designing a system for processing huge 
data in parallel 
● Taking advantage of it with algorithms that 
work well in parallel
MapReduce 
[Diagram: many parallel Map tasks feeding into a handful of Reduce tasks]
Key advances by MapReduce: 
•Data Locality: Automatic input split computation and launch of mappers where the data lives 
•Fault tolerance: Writing out intermediate results and restartable mappers made it possible 
to run on commodity hardware 
•Linear scalability: Combination of locality + programming model that forces developers 
to write generally scalable solutions to problems 
MapReduce 
Limitations of MapReduce 
•Each job reads its input from HDFS 
•No concept of a session 
•Jobs are rigid: map, then reduce
Spark is a general-purpose computation framework geared towards massive 
data - more flexible than MapReduce 
Extra properties: 
•Leverages distributed memory 
•Full directed acyclic graph (DAG) expressions for data-parallel computations 
•Improved developer experience 
Yet retains: 
linear scalability, fault tolerance, and data locality 
RDDs 

val lines = sc.textFile("bigfile.txt") 
val numbers = lines.map((x) => x.toDouble) 
numbers.sum() 

[Diagram: bigfile.txt in HDFS is read into the lines RDD and mapped into the numbers RDD, each split across partitions; sum() returns a single value to the Driver]
RDDs 

val lines = sc.textFile("bigfile.txt") 
val numbers = lines.map((x) => x.toInt) 
numbers.cache().sum() 

[Diagram: the same pipeline, with the numbers partitions cached in distributed memory; sum() returns its result to the Driver]
numbers.sum() 

[Diagram: with numbers cached, sum() runs over the in-memory partitions and returns to the Driver without rereading bigfile.txt from HDFS]
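
Putting those pieces together into one runnable program (a minimal sketch; the app name, local master, and input path are illustrative assumptions):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // numeric RDD implicits on pre-1.3 Spark

object RddSumExample {
  def main(args: Array[String]): Unit = {
    // Local master and input path are placeholders for illustration
    val conf = new SparkConf().setAppName("rdd-sum-example").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("bigfile.txt")    // one partition per HDFS block
    val numbers = lines.map(_.toDouble)       // lazy transformation
    numbers.cache()                           // keep the partitions in distributed memory

    val first = numbers.sum()                 // first action: reads from HDFS
    val second = numbers.sum()                // second action: served from the cache

    println("sum = " + first + " (from cache: " + second + ")")
    sc.stop()
  }
}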
Spark MLlib 

Supervised / Discrete - Classification 
● Logistic regression (and regularized variants) 
● Linear SVM 
● Naive Bayes 
● Random Decision Forests (soon) 

Supervised / Continuous - Regression 
● Linear regression (and regularized variants) 

Unsupervised / Discrete - Clustering 
● K-means 

Unsupervised / Continuous - Dimensionality reduction, matrix factorization 
● Principal component analysis / singular value decomposition 
● Alternating least squares
Using it 

val data = sc.textFile("kmeans_data.txt") 
val parsedData = data.map(_.split(' ').map(_.toDouble)) 

// Cluster the data into two classes using KMeans 
val numIterations = 20 
val numClusters = 2 
val clusters = KMeans.train(parsedData, numClusters, numIterations)
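
The returned KMeansModel can then be used to evaluate and apply the clustering. A small follow-up sketch, assuming the pre-1.0 MLlib API used above where points are Array[Double] (from Spark 1.0 on, points are mllib.linalg Vectors):

// Sum of squared distances from each point to its nearest center
val wssse = clusters.computeCost(parsedData)
println("Within-set sum of squared errors = " + wssse)

// Assign a new point to its nearest learned center; (0.5, 0.5) is just an illustrative point
val clusterId = clusters.predict(Array(0.5, 0.5))
println("Point (0.5, 0.5) falls in cluster " + clusterId)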
K-Means 
● Choose some initial centers 
● Then alternate between two steps: 
○ Assign each point to a cluster based on 
existing centers 
○ Recompute cluster centers from the 
points in each cluster
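
To make the two steps concrete, here is a minimal single-machine sketch in plain Scala (squared Euclidean distance; the random initialization is only a placeholder, since choosing good initial centers is the subject of the later slides):

type Point = Array[Double]

def squaredDistance(a: Point, b: Point): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def closest(centers: Array[Point], p: Point): Int =
  centers.indices.minBy(i => squaredDistance(centers(i), p))

def kMeans(points: Array[Point], k: Int, iterations: Int): Array[Point] = {
  // Choose some initial centers: here, k random points from the data
  var centers = scala.util.Random.shuffle(points.toList).take(k).toArray
  for (_ <- 1 to iterations) {
    // Step 1: assign each point to the cluster of its nearest existing center
    val clusters = points.groupBy(p => closest(centers, p))
    // Step 2: recompute each center as the mean of the points assigned to it
    centers = centers.indices.map { i =>
      clusters.get(i) match {
        case Some(members) =>
          Array.tabulate(members.head.length)(d => members.map(_(d)).sum / members.length)
        case None => centers(i)   // empty cluster: keep the old center
      }
    }.toArray
  }
  centers
}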
K-Means - very parallelizable 
● Alternate between two steps: 
○ Assign each point to a cluster based on 
existing centers 
■ Process each data point independently 
○ Recompute cluster centers from the 
points in each cluster 
■ Average across partitions
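
Expressed against an RDD, the same two steps become a map (each point assigned independently) and a reduceByKey that merges per-center sums and counts across partitions. This is a simplified sketch of the approach, not MLlib's implementation, which appears on the next slides; Point, squaredDistance, and closest are the helpers from the sketch above.

import org.apache.spark.SparkContext._   // pair-RDD implicits on pre-1.3 Spark
import org.apache.spark.rdd.RDD

// One k-means iteration over an RDD of points
def step(points: RDD[Point], centers: Array[Point]): Array[Point] = {
  val contribs = points
    // Assignment: each point is processed independently
    .map(p => (closest(centers, p), (p, 1L)))
    // Recomputation: sum the points and counts for each center across all partitions
    .reduceByKey { (a, b) =>
      (a._1.zip(b._1).map { case (x, y) => x + y }, a._2 + b._2)
    }
    .collectAsMap()

  centers.indices.map { i =>
    contribs.get(i) match {
      case Some((sum, count)) => sum.map(_ / count)   // new center = mean of assigned points
      case None               => centers(i)           // no points assigned: keep the old center
    }
  }.toArray
}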
// Find the sum and count of points mapping to each center 
val totalContribs = data.mapPartitions { points => 
  val k = centers.length 
  val dims = centers(0).vector.length 
  val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]]) 
  val counts = Array.fill(k)(0L) 
  points.foreach { point => 
    val (bestCenter, cost) = KMeans.findClosest(centers, point) 
    costAccum += cost 
    sums(bestCenter) += point.vector 
    counts(bestCenter) += 1 
  } 
  val contribs = for (j <- 0 until k) yield { 
    (j, (sums(j), counts(j))) 
  } 
  contribs.iterator 
}.reduceByKey(mergeContribs).collectAsMap()
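
The excerpt calls mergeContribs without showing it; in the MLlib source it is essentially a function that adds two (sum, count) contributions pairwise, along these lines (BV is the Breeze vector alias used above):

// Roughly what mergeContribs looks like: add the per-partition Breeze
// sum vectors in place and add the counts.
val mergeContribs: ((BV[Double], Long), (BV[Double], Long)) => (BV[Double], Long) =
  (p1, p2) => (p1._1 += p2._1, p1._2 + p2._2)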
// Update the cluster centers and costs 
var changed = false 
var j = 0 
while (j < k) { 
  val (sum, count) = totalContribs(j) 
  if (count != 0) { 
    sum /= count.toDouble 
    val newCenter = new BreezeVectorWithNorm(sum) 
    if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) { 
      changed = true 
    } 
    centers(j) = newCenter 
  } 
  j += 1 
} 
if (!changed) { 
  logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations") 
} 
cost = costAccum.value
The Problem 
● K-Means is very sensitive to the initial set of 
centers chosen. 
● The best existing algorithm for choosing centers 
is highly sequential.
K-Means++ 
● Start with a random point from the dataset 
● Pick another point at random, with probability 
proportional to its squared distance from the 
closest center already chosen 
● Repeat until k initial centers are chosen (see the sketch below)
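
A compact single-machine sketch of this seeding procedure, reusing the Point alias and squaredDistance helper from the earlier Lloyd's sketch:

import scala.util.Random

def kMeansPlusPlus(points: Array[Point], k: Int, rand: Random = new Random): Array[Point] = {
  // Start with a random point from the dataset
  val centers = scala.collection.mutable.ArrayBuffer(points(rand.nextInt(points.length)))
  while (centers.length < k) {
    // Weight each point by its squared distance to the closest chosen center
    val weights = points.map(p => centers.map(c => squaredDistance(c, p)).min)
    // Draw one point with probability proportional to its weight
    var r = rand.nextDouble() * weights.sum
    var i = 0
    while (i < points.length - 1 && r >= weights(i)) { r -= weights(i); i += 1 }
    centers += points(i)
  }
  centers.toArray
}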
K-Means++ 
● The resulting initial clustering is, in expectation, 
within an O(log k) factor of the optimal cost
K-Means++ 
● Requires k passes over the data
K-Means|| 
● Do only a few (~5) passes 
● Sample m points on each pass 
● Oversample: collect more than k candidate centers in total 
● Run K-Means++ on the sampled points to find the 
initial centers (sketched below)
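
A rough sketch of the sampling idea, again reusing the earlier helpers and the kMeansPlusPlus function above. One simplification to note: the real algorithm also weights each candidate by the number of points closest to it before the final K-Means++ step; that weighting is omitted here for brevity.

import scala.util.Random

def kMeansParallel(points: Array[Point], k: Int, passes: Int = 5, l: Double = 2.0,
                   rand: Random = new Random): Array[Point] = {
  // Start from one random point, as in K-Means++
  val candidates = scala.collection.mutable.ArrayBuffer(points(rand.nextInt(points.length)))
  for (_ <- 1 to passes) {
    // Cost of each point = squared distance to its nearest candidate so far
    val costs = points.map(p => candidates.map(c => squaredDistance(c, p)).min)
    val totalCost = costs.sum
    // Keep each point independently with probability proportional to its cost,
    // so each pass adds roughly l * k new candidates (the oversampling)
    points.indices.foreach { i =>
      if (totalCost > 0 && rand.nextDouble() < l * k * costs(i) / totalCost)
        candidates += points(i)
    }
  }
  // Reduce the oversampled candidate set back down to k initial centers
  kMeansPlusPlus(candidates.toArray, k, rand)
}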
Then on the full data...