Scalable Machine Learning with Apache Spark
Evan Casey
@ev_ancasey
Who am I? 
● Engineer at Tapad 
● HackNY 2014 Fellow 
● Things I work on: 
○ Scala 
○ Distributed systems 
○ Hadoop/Spark
Overview 
● Apache Spark 
○ Dataflow model 
○ Spark vs Hadoop MapReduce 
○ Programming with Spark 
● Machine Learning with Spark 
○ MLlib overview 
○ Gradient descent example 
○ Distributed implementation on Apache Spark 
○ Lessons learned
Apache Spark
● Distributed data-processing framework built on top of HDFS
● Use cases:
○ Interactive analytics
○ Graph processing
○ Stream processing
○ Scalable ML
Why Spark?
● Up to 100x faster than Hadoop MapReduce for in-memory workloads
● Built on top of Akka
● Expressive APIs in Scala, Java, and Python
● Active open-source community
Spark vs Hadoop MapReduce
● In-memory data flow model optimized for multi-stage jobs
● Novel approach to fault tolerance: lost RDD partitions are recomputed from their lineage rather than restored from replicas
● Similar programming style to Scalding/Cascading
Programming Model
● Resilient Distributed Dataset (RDD)
○ Created with textFile, parallelize
● Parallel Operations
○ map, groupBy, filter, join, etc.
● Optimizations
○ Caching, shared variables
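A minimal sketch tying these pieces together, assuming an existing SparkContext named sc; the data and names are made up:

// Two ways to create an RDD
val lines = sc.textFile("hdfs://...")      // from a file in HDFS, one element per line
val nums  = sc.parallelize(1 to 1000)      // from a local Scala collection

// Parallel operations are chained as lazy transformations
val evens = nums.filter(_ % 2 == 0)
val byMod = evens.map(n => (n % 10, n)).groupByKey()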
Wordcount Example
val sc = new SparkContext()
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
// counts.cache
// sc.broadcast(counts)
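The two commented lines point back at the optimizations from the previous slide. A sketch of how each might be used here (collecting the counts assumes they fit in driver memory; countsByWord is a made-up name):

// Cache: keep the RDD in cluster memory so repeated actions skip recomputation
counts.cache()

// Broadcast: ship a read-only copy of a value to every worker once,
// instead of once per task
val countsByWord = sc.broadcast(counts.collectAsMap())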
Machine Learning in Spark
Algorithms:
- classification: logistic regression, linear SVM, naive Bayes, random forests
- regression: generalized linear models, regression tree
- collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
- clustering: k-means
- decomposition: singular value decomposition (SVD), principal component analysis (PCA)
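As one example from this list, a minimal collaborative-filtering sketch against the MLlib ALS API; the input path, line format, and hyperparameters are illustrative:

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Each input line: "userId,productId,rating"
val ratings = sc.textFile("hdfs://...").map { line =>
  val Array(user, product, rating) = line.split(',')
  Rating(user.toInt, product.toInt, rating.toDouble)
}

// 10 latent factors, 10 iterations, regularization 0.01
val model = ALS.train(ratings, 10, 10, 0.01)
val predicted = model.predict(1, 42)   // predicted rating for user 1, product 42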
K-Means Clustering
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("hdfs://...")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
// Cluster the data into two classes
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)
// Compute the sum of squared errors
val cost = clusters.computeCost(parsedData)
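Once trained, the model can assign new points to their nearest cluster center; a small follow-on sketch (the sample vector is made up):

// Returns the index of the closest cluster center
val clusterId = clusters.predict(Vectors.dense(1.0, 2.0, 3.0))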
Gradient Descent Example
val file = sc.textFile("hdfs://...")
val points = file.map(parsePoint).cache()
var w = Vector.zeros(d)
for (i <- 1 to numIterations) {
  // gradient of the logistic loss over all points, summed across the cluster
  val gradient = points.map { p =>
    (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
  }.reduce(_ + _)
  w -= alpha * gradient
}
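The hand-rolled loop above is essentially what MLlib's logistic regression does internally; a minimal sketch of the off-the-shelf route against the Spark 1.x API (input path, line format, and iteration count are illustrative):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Each input line: "label feature1 feature2 ..."
val data = sc.textFile("hdfs://...").map { line =>
  val parts = line.split(' ').map(_.toDouble)
  LabeledPoint(parts.head, Vectors.dense(parts.tail))
}.cache()

// Runs distributed stochastic gradient descent internally
val model = LogisticRegressionWithSGD.train(data, 100)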
About Tapad 
● 350k QPS 
● Ingest multiple TBs daily 
● Kafka, Scalding, Spark, Zookeeper, Aerospike 
● We’re hiring! :)
Thanks! 
@ev_ancasey 
Questions?
