Scalable Machine Learning with Apache Spark
Evan Casey
@ev_ancasey
Who am I? 
● Engineer at Tapad 
● HackNY 2014 Fellow 
● Things I work on: 
○ Scala 
○ Distributed systems 
○ Hadoop/Spark
Overview 
● Apache Spark 
○ Dataflow model 
○ Spark vs Hadoop MapReduce 
○ Programming with Spark 
● Machine Learning with Spark 
○ MLlib overview 
○ Gradient descent example 
○ Distributed implementation on Apache Spark 
○ Lessons learned
Apache Spark
● Distributed data-processing framework built on top of HDFS
● Use cases:
○ Interactive analytics
○ Graph processing
○ Stream processing
○ Scalable ML
Why Spark?
● Up to 100x faster than Hadoop MapReduce for in-memory workloads
● Built on top of Akka
● Expressive APIs in Scala, Java, and Python
● Active open-source community
Spark vs Hadoop MapReduce
● In-memory data flow model optimized for multi-stage jobs
● Novel approach to fault tolerance: lost RDD partitions are recomputed from their lineage rather than restored from replicas
● Similar programming style to Scalding/Cascading
Programming Model
● Resilient Distributed Dataset (RDD)
○ Created with textFile, parallelize
● Parallel Operations
○ map, groupBy, filter, join, etc.
● Optimizations
○ Caching, shared variables
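A minimal sketch tying these pieces together, assuming an existing SparkContext named sc; the data and names are made up:

// Two ways to create an RDD
val lines = sc.textFile("hdfs://...")      // from a file in HDFS, one element per line
val nums  = sc.parallelize(1 to 1000)      // from a local Scala collection

// Parallel operations are chained as lazy transformations
val evens = nums.filter(_ % 2 == 0)
val byMod = evens.map(n => (n % 10, n)).groupByKey()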
Wordcount Example
val sc = new SparkContext()
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
// counts.cache
// sc.broadcast(counts)
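The two commented lines point back at the optimizations from the previous slide. A sketch of how each might be used here (collecting the counts assumes they fit in driver memory; countsByWord is a made-up name):

// Cache: keep the RDD in cluster memory so repeated actions skip recomputation
counts.cache()

// Broadcast: ship a read-only copy of a value to every worker once,
// instead of once per task
val countsByWord = sc.broadcast(counts.collectAsMap())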
Machine Learning in Spark
Algorithms:
- classification: logistic regression, linear SVM, naive Bayes, random forests
- regression: generalized linear models, regression tree
- collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
- clustering: k-means
- decomposition: singular value decomposition (SVD), principal component analysis (PCA)
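As one example from this list, a minimal collaborative-filtering sketch against the MLlib ALS API; the input path, line format, and hyperparameters are illustrative:

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Each input line: "userId,productId,rating"
val ratings = sc.textFile("hdfs://...").map { line =>
  val Array(user, product, rating) = line.split(',')
  Rating(user.toInt, product.toInt, rating.toDouble)
}

// 10 latent factors, 10 iterations, regularization 0.01
val model = ALS.train(ratings, 10, 10, 0.01)
val predicted = model.predict(1, 42)   // predicted rating for user 1, product 42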
K-Means Clustering
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("hdfs://...")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
// Cluster the data into two classes
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)
// Compute the sum of squared errors
val cost = clusters.computeCost(parsedData)
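Once trained, the model can assign new points to their nearest cluster center; a small follow-on sketch (the sample vector is made up):

// Returns the index of the closest cluster center
val clusterId = clusters.predict(Vectors.dense(1.0, 2.0, 3.0))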
Gradient Descent Example
val file = sc.textFile("hdfs://...")
val points = file.map(parsePoint).cache()
var w = Vector.zeros(d)
for (i <- 1 to numIterations) {
  // gradient of the logistic loss over all points, summed across the cluster
  val gradient = points.map { p =>
    (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
  }.reduce(_ + _)
  w -= alpha * gradient
}
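The hand-rolled loop above is essentially what MLlib's logistic regression does internally; a minimal sketch of the off-the-shelf route against the Spark 1.x API (input path, line format, and iteration count are illustrative):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Each input line: "label feature1 feature2 ..."
val data = sc.textFile("hdfs://...").map { line =>
  val parts = line.split(' ').map(_.toDouble)
  LabeledPoint(parts.head, Vectors.dense(parts.tail))
}.cache()

// Runs distributed stochastic gradient descent internally
val model = LogisticRegressionWithSGD.train(data, 100)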
About Tapad 
● 350k QPS 
● Ingest multiple TBs daily 
● Kafka, Scalding, Spark, Zookeeper, Aerospike 
● We’re hiring! :)
Thanks! 
@ev_ancasey 
Questions?
