2. Who am I?
● Engineer at Tapad
● HackNY 2014 Fellow
● Things I work on:
○ Scala
○ Distributed systems
○ Hadoop/Spark
3. Overview
● Apache Spark
○ Dataflow model
○ Spark vs Hadoop MapReduce
○ Programming with Spark
● Machine Learning with Spark
○ MLlib overview
○ Gradient descent example
○ Distributed implementation on Apache Spark
○ Lessons learned
4. Apache Spark
● Distributed data-processing
framework built on top of HDFS
● Use cases:
○ Interactive analytics
○ Graph processing
○ Stream processing
○ Scalable ML
5. Why Spark?
● Up to 100x faster than
Hadoop MapReduce for
in-memory workloads
● Built on top of Akka
● Expressive APIs in Scala,
Java, and Python
● Active open-source
community
6. Spark vs Hadoop MapReduce
● In-memory data flow model
optimized for multi-stage
jobs
● Novel approach to fault
tolerance
● Similar programming style
to Scalding/Cascading
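A rough single-machine analogy for the in-memory dataflow model, using Scala's lazy collection views (plain Scala, not Spark's API): transformations only build a plan, and an action forces execution, which is what lets Spark pipeline multi-stage jobs without writing intermediate results to disk.

```scala
// Transformations are lazy: they describe stages, they don't run them.
var evaluations = 0
val stage1 = (1 to 5).view.map { x => evaluations += 1; x * 2 } // lazy "map" stage
val stage2 = stage1.filter(_ > 4)                               // lazy "filter" stage

assert(evaluations == 0)          // nothing has run yet; only a plan exists

val result = stage2.toList        // the "action" forces both stages at once
// result == List(6, 8, 10), and only now evaluations == 5

// rdd.cache() is loosely analogous to materializing a stage once for reuse:
val cached = stage1.toList        // compute once, reuse without re-running map
```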
8. Wordcount Example
val sc = new SparkContext()
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
// counts.cache()        // keep counts in memory for reuse
// sc.broadcast(counts)  // ship a read-only copy to each worker
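The same flatMap/map/reduce-by-key shape works on plain Scala collections, which is part of why Spark's API feels familiar. A local sketch with hypothetical input, using `groupBy` as a stand-in for the shuffle that `reduceByKey` performs across the cluster:

```scala
// Local word count with the same transformation shape as the Spark version.
val lines = Seq("to be or not to be", "to think")
val counts = lines
  .flatMap(line => line.split(" "))   // one record per word
  .map(word => (word, 1))             // pair each word with a count of 1
  .groupBy(_._1)                      // local stand-in for the shuffle
  .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // like reduceByKey(_ + _)
// counts("to") == 3, counts("be") == 2
```

On Spark, `cache()` keeps the counts in executor memory for reuse, and `sc.broadcast` distributes a read-only copy to every worker.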
9. Machine Learning in Spark
Algorithms:
- classification: logistic regression, linear SVM, naive Bayes, random forests
- regression: generalized linear models, regression tree
- collaborative filtering: alternating least squares (ALS), non-negative
matrix factorization (NMF)
- clustering: k-means
- decomposition: singular value decomposition (SVD), principal component analysis (PCA)
10. K-Means Clustering
val data = sc.textFile("hdfs://...")
val parsedData = data.map(_.split(' ').map(_.toDouble)).cache()
// Cluster the data into two classes, 20 iterations
val clusters = KMeans.train(parsedData, 2, 20)
// Compute the sum of squared errors
val cost = clusters.computeCost(parsedData)
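What `KMeans.train` does per iteration can be sketched in plain Scala as Lloyd's algorithm on 1-D points (the data and starting centroids here are hypothetical; MLlib distributes this work and adds smarter initialization):

```scala
// One pass per iteration: assign each point to its nearest centroid,
// then move each centroid to the mean of the points assigned to it.
def kmeans(points: Seq[Double], centroids: Seq[Double], iterations: Int): Seq[Double] =
  if (iterations == 0) centroids
  else {
    val assigned = points.groupBy(p => centroids.minBy(c => math.abs(c - p)))
    val moved = centroids.map(c =>
      assigned.get(c).map(ps => ps.sum / ps.size).getOrElse(c))
    kmeans(points, moved, iterations - 1)
  }

val data = Seq(1.0, 1.2, 0.8, 9.0, 9.3, 8.7)
val clusters = kmeans(data, Seq(0.0, 10.0), iterations = 20)
// centroids converge near 1.0 and 9.0
```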
11. Gradient Descent Example
val file = sc.textFile("hdfs://...")
val points = file.map(parsePoint).cache()
var w = Vector.zeros(d)  // d = number of features
for (i <- 1 to numIterations) {
  val gradient = points.map { p =>
    (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
  }.reduce(_ + _)
  w -= alpha * gradient
}
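The RDD map/reduce above mirrors plain Scala collections, so the same logistic-gradient update can be run locally. A minimal sketch on a tiny, trivially separable 1-D dataset (the data, `alpha`, and iteration count are hand-picked for illustration):

```scala
import scala.math.exp

case class Point(x: Double, y: Double)  // y is the label, +1 or -1

// Positive x implies label +1, negative x implies label -1.
val points = Seq(Point(2.0, 1.0), Point(1.5, 1.0),
                 Point(-2.0, -1.0), Point(-1.0, -1.0))
val alpha = 0.1
var w = 0.0                             // single weight for 1-D features

for (i <- 1 to 100) {
  // Same per-point logistic gradient as the Spark version, reduced locally
  val gradient = points.map { p =>
    (1 / (1 + exp(-p.y * w * p.x)) - 1) * p.y * p.x
  }.reduce(_ + _)
  w -= alpha * gradient
}
// w ends up positive, so sign(w * x) recovers each label
```

Each iteration reuses `points`, which is exactly why the Spark version calls `cache()`: without it, the input would be re-read from HDFS on every pass.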