There are lots of reasons why you might want to implement your own machine learning algorithms on Spark: you might want to experiment with a new idea, try to reproduce results from a recent research paper, or simply use an existing technique that isn't implemented in MLlib. In this talk, we'll walk through the process of developing a new machine learning model for Spark. We'll start with the basics, considering how we'd design a parallel implementation of a particular unsupervised learning technique. The bulk of the talk will focus on the details you need to know to turn an algorithm design into an efficient parallel implementation on Spark: we'll review a simple RDD-based implementation, show some improvements, point out some pitfalls to avoid, and iteratively extend the implementation to support contemporary Spark features like ML pipelines and structured query processing. You'll leave this talk with everything you need to build a new machine learning technique that runs on Spark.
#EUds5
Forecast
Introducing our case study: self-organizing maps
Parallel implementations for partitioned collections (in particular, RDDs)
Beyond the RDD: data frames and ML pipelines
Practical considerations and key takeaways
Training self-organizing maps

while t < maxupdates:
    # process the training set in random order
    random.shuffle(examples)
    for ex in examples:
        t = t + 1
        if t == maxupdates:
            break
        bestMatch = closest(som[t], ex)
        # sigma(t): the neighborhood size controls how much of the map
        #           around the BMU (best-matching unit) is affected
        # alpha(t): the learning rate controls how much closer to the
        #           example each unit gets
        for (unit, wt) in neighborhood(bestMatch, sigma(t)):
            som[t+1][unit] = som[t][unit] + (ex - som[t][unit]) * alpha(t) * wt
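To make the pseudocode above concrete, here is a minimal pure-Python sketch for a one-dimensional map with a Gaussian neighborhood. The map layout, helper names (closest, neighborhood, train), and the constant schedules in the example are assumptions for illustration, not the implementation from the talk:

```python
import math
import random

def closest(som, ex):
    """Index of the best-matching unit (BMU): the unit nearest to ex."""
    return min(range(len(som)), key=lambda u: abs(som[u] - ex))

def neighborhood(center, sigma, n_units):
    """Gaussian neighborhood weights around the BMU."""
    return [(u, math.exp(-((u - center) ** 2) / (2.0 * sigma ** 2)))
            for u in range(n_units)]

def train(examples, som, maxupdates, sigma, alpha):
    """Sequential SOM training: one update per example, in random order."""
    t = 0
    while t < maxupdates:
        random.shuffle(examples)
        for ex in examples:
            t += 1
            if t == maxupdates:
                break
            bmu = closest(som, ex)
            # pull each unit toward the example, weighted by its
            # distance from the BMU and the learning rate
            for unit, wt in neighborhood(bmu, sigma(t), len(som)):
                som[unit] = som[unit] + (ex - som[unit]) * alpha(t) * wt
    return som
```

With a stream of identical examples, every unit converges toward that value, which is an easy sanity check for the update rule.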
How can we fix these?

Make the operation ⊕ that merges partial results commutative and
associative, so that per-partition states can be combined in any order:

a ⊕ b = b ⊕ a
(a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)

SGD        L-BFGS

There will be examples of each of these approaches for many
problems in the literature and in open-source code!
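The point of requiring these laws is that partial results can then be merged in any order, however Spark happens to schedule the partitions. A tiny illustrative check, using a hypothetical (matches, hoods) state merged by elementwise addition:

```python
def combine(a, b):
    """⊕ for a hypothetical partial training state (matches, hoods):
    elementwise sums are commutative and associative."""
    return (a[0] + b[0], a[1] + b[1])

a, b, c = (1.0, 2.0), (3.0, 4.0), (5.0, 6.0)

assert combine(a, b) == combine(b, a)                          # a ⊕ b = b ⊕ a
assert combine(combine(a, b), c) == combine(a, combine(b, c))  # (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
```

(Strictly speaking, floating-point addition is only approximately associative; for model-merging purposes that is almost always acceptable.)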
Implementing atop RDDs

We'll start with a batch implementation of our technique:

for t in (1 to iterations):
    state = newState()
    # fold each partition of examples into a partial state
    for ex in examples:
        bestMatch = closest(som[t-1], ex)
        hood = neighborhood(bestMatch, sigma(t))
        state.matches += ex * hood
        state.hoods += hood
    som[t] = newSOM(state.matches / state.hoods)

Each batch (partition) produces a model that can be averaged with
other models. This won't always work!
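As a sketch of why this batch formulation parallelizes, here is a pure-Python simulation in which each "partition" produces a partial state, the states are merged, and the merged state yields a new map. The one-dimensional map and the box neighborhood are simplifying assumptions for the example:

```python
def local_state(som, partition, sigma):
    """Fold one partition of examples into a partial (matches, hoods) state."""
    matches = [0.0] * len(som)
    hoods = [0.0] * len(som)
    for ex in partition:
        bmu = min(range(len(som)), key=lambda u: abs(som[u] - ex))
        for u in range(len(som)):
            wt = 1.0 if abs(u - bmu) <= sigma else 0.0  # box neighborhood
            matches[u] += ex * wt
            hoods[u] += wt
    return matches, hoods

def merge(s1, s2):
    """Combine partial states from two partitions (elementwise addition)."""
    return ([a + b for a, b in zip(s1[0], s2[0])],
            [a + b for a, b in zip(s1[1], s2[1])])

def new_som(state, old_som):
    """Turn the merged state into a new map; keep old units that saw no data."""
    return [m / h if h > 0 else o
            for m, h, o in zip(state[0], state[1], old_som)]
```

Because merge is commutative and associative, the per-partition states can be combined in any order and still produce the same new map.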
An implementation template

var nextModel = initialModel
for (i <- 0 until iterations) {
  val newState = examples.aggregate(ModelState.empty())(
    // "fold": update the state for this partition with a single new example
    { case (state: ModelState, example: Example) =>
        state.update(nextModel.lookup(example, i), example) },
    // "reduce": combine the states from two partitions
    { case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
  )
  nextModel = modelFromState(newState)
}

Referring to nextModel inside the closures causes the model object to be
serialized with the closure for every task. Broadcasting the model instead
ships it to each executor only once per iteration:

var nextModel = initialModel
for (i <- 0 until iterations) {
  // broadcast the current working model for this iteration
  val current = sc.broadcast(nextModel)
  val newState = examples.aggregate(ModelState.empty())(
    // get the value of the broadcast variable on each executor
    { case (state: ModelState, example: Example) =>
        state.update(current.value.lookup(example, i), example) },
    { case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
  )
  nextModel = modelFromState(newState)
  // remove the stale broadcast model
  current.unpersist()
}

Take care, though: it's easy to end up with the wrong implementation
of the right interface.
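The aggregate pattern itself is easy to simulate outside Spark. The following pure-Python sketch is a stand-in for RDD.aggregate, not Spark code, and the sum-and-count example is hypothetical:

```python
from functools import reduce

def aggregate(partitions, zero, seq_op, comb_op):
    """Pure-Python stand-in for RDD.aggregate: fold each partition with
    seq_op, then merge the per-partition states with comb_op.
    (Spark ships a fresh copy of the zero value to every partition;
    here we rely on zero being immutable instead.)"""
    states = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, states, zero)

# hypothetical use: compute (sum, count) across "partitions"
partitions = [[1, 2, 3], [4, 5]]
state = aggregate(partitions, (0, 0),
                  lambda s, x: (s[0] + x, s[1] + 1),        # seqOp / "fold"
                  lambda a, b: (a[0] + b[0], a[1] + b[1]))  # combOp / "reduce"
```

Since the combOp is commutative and associative, the result does not depend on the order in which partition states arrive.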
RDDs: some good parts

val rdd: RDD[String] = /* ... */
rdd.map(_ * 3.0).collect()     // doesn't compile: no * on String

val df: DataFrame = /* data frame with one String-valued column */
df.select($"_1" * 3.0).show()  // compiles, but crashes at runtime
RDDs: some good parts

rdd.map {
  vec => (vec, model.value.closestWithSimilarity(vec))
}

val predict = udf((vec: SV) =>
  model.value.closestWithSimilarity(vec))

df.withColumn("predictions", predict($"features"))
RDDs versus query planning

val numbers1 = sc.parallelize(1 to 100000000)
val numbers2 = sc.parallelize(1 to 1000000000)

// naive: generate every pair, do the expensive work, then filter
numbers1.cartesian(numbers2)
  .map { case (x, y) => (x, y, expensive(x, y)) }
  .filter { case (x, y, _) => isPrime(x) && isPrime(y) }

// better: filter first, so far fewer pairs are ever generated
numbers1.filter(isPrime(_))
  .cartesian(numbers2.filter(isPrime(_)))
  .map { case (x, y) => (x, y, expensive(x, y)) }

With RDDs, nobody makes this transformation for you; a query planner could.
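The effect of pushing filters ahead of the Cartesian product is easy to see at small scale. This pure-Python sketch (itertools instead of RDDs; the is_prime helper is an assumption standing in for the talk's isPrime) produces the same pairs while materializing far fewer candidates:

```python
from itertools import product

def is_prime(n):
    # trial division; fine at this scale
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

n1, n2 = range(1, 51), range(1, 51)

# naive plan: materialize every pair, then filter (50 * 50 = 2500 candidates)
naive = [(x, y) for x, y in product(n1, n2)
         if is_prime(x) and is_prime(y)]

# hand-optimized plan: filter each side first (15 * 15 = 225 candidates)
p1 = [x for x in n1 if is_prime(x)]
p2 = [y for y in n2 if is_prime(y)]
planned = [(x, y) for x, y in product(p1, p2)]

assert sorted(naive) == sorted(planned)  # same result, far less work
```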
RDDs and the JVM heap

val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))

On the JVM heap this is three objects: an outer array holding element
pointers, and two inner double arrays, each with its own header (class
pointer, flags, size, lock word). That's 32 bytes of data... and 64
bytes of overhead!
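A similar effect is easy to observe in CPython, where every float is a boxed object with its own header. This sketch is an analogy to the JVM layout described above, not a measurement of it:

```python
import sys

mat = [[1.0, 2.0], [3.0, 4.0]]

raw = 4 * 8  # four IEEE-754 doubles: 32 bytes of actual data

# the boxed representation: an outer list of pointers, two inner lists,
# and four float objects, each carrying its own object header
boxed = (sys.getsizeof(mat)
         + sum(sys.getsizeof(row) for row in mat)
         + sum(sys.getsizeof(v) for row in mat for v in row))

assert boxed > raw  # the overhead dominates the data
```

This is one reason flat, off-heap, columnar layouts (as used by Tungsten and data frames) beat nested on-heap objects for numeric workloads.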
ML pipelines: a quick example

from pyspark.ml.clustering import KMeans

K, SEED = 100, 0xdea110c8

randomDF = make_random_df()
kmeans = KMeans().setK(K).setSeed(SEED).setFeaturesCol("features")
model = kmeans.fit(randomDF)
withPredictions = model.transform(randomDF).select("x", "y", "prediction")
Working with ML pipelines

estimator.fit(df)        model.transform(df)

Both are configured by params such as inputCol, epochs, seed, and
outputCol.
Defining parameters

private[som] trait SOMParams extends Params
    with DefaultParamsWritable {

  final val x: IntParam =
    new IntParam(this, "x", "width of self-organizing map (>= 1)",
      ParamValidators.gtEq(1))

  final def getX: Int = $(x)

  final def setX(value: Int): this.type = set(x, value)

  // ...
Don’t repeat yourself
/**
* Common params for KMeans and KMeansModel
*/
private[clustering] trait KMeansParams extends Params
with HasMaxIter with HasFeaturesCol
with HasSeed with HasPredictionCol with HasTol { /* ... */ }
Validate and transform at once

def transformSchema(schema: StructType): StructType = {
  // check that the input columns exist...
  require(schema.fieldNames.contains($(featuresCol)))

  // ...and are the proper type
  schema($(featuresCol)) match {
    case sf: StructField => require(sf.dataType.equals(VectorType))
  }

  // ...and that the output columns don't exist
  require(!schema.fieldNames.contains($(predictionCol)))
  require(!schema.fieldNames.contains($(similarityCol)))

  // ...and then make a new schema
  schema.add($(predictionCol), "int")
        .add($(similarityCol), "double")
}
Training on data frames

def fit(examples: DataFrame) = {
  import examples.sparkSession.implicits._
  import org.apache.spark.ml.linalg.{Vector => SV}

  val dfexamples = examples.select($(exampleCol)).rdd.map {
    case Row(sv: SV) => sv
  }

  /* construct a model object with the result of training */
  new SOMModel(train(dfexamples, $(x), $(y)))
}
Retire your visibility hacks

package org.apache.spark.ml.hacks

object Hacks {
  import org.apache.spark.ml.linalg.VectorUDT
  val vectorUDT = new VectorUDT
}

Since Spark 2.0, the public developer API makes this hack unnecessary:

package org.apache.spark.ml.linalg

/* imports, etc., are elided ... */

@Since("2.0.0")
@DeveloperApi
object SQLDataTypes {
  val VectorType: DataType = new VectorUDT
  val MatrixType: DataType = new MatrixUDT
}
Caching training data

val wasUncached = examples.storageLevel == StorageLevel.NONE
if (wasUncached) { examples.cache() }

/* actually train here */

if (wasUncached) { examples.unpersist() }
Improve serial execution times

Are you repeatedly comparing training data to a model that only changes
once per iteration? Consider caching norms.

Are you doing a lot of dot products in a for loop? Consider replacing
these loops with a matrix-vector multiplication.

Seek to limit the number of library invocations you make, and thus the
time you spend copying data to and from your linear algebra library.
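The norm-caching trick relies on the identity ||a - b||^2 = ||a||^2 - 2 a·b + ||b||^2, which lets you precompute the squared norm of every map unit once per iteration instead of recomputing full differences per example. A pure-Python sketch (the unit values are hypothetical):

```python
def sqnorm(v):
    return sum(x * x for x in v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

units = [[0.0, 0.0], [3.0, 4.0]]
unit_norms = [sqnorm(u) for u in units]  # computed once per model, not per example

def sq_distances(ex):
    """Squared distance from ex to every unit, via the expanded identity."""
    ex_norm = sqnorm(ex)
    return [ex_norm - 2.0 * dot(ex, u) + n for u, n in zip(units, unit_norms)]

# matches the direct computation
ex = [3.0, 0.0]
direct = [sqnorm([a - b for a, b in zip(ex, u)]) for u in units]
assert sq_distances(ex) == direct
```

In a real implementation the per-unit dot products would also be batched into a single matrix-vector multiplication, for exactly the reasons given above.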
Key takeaways

There are several techniques you can use to develop parallel
implementations of machine learning algorithms.

The RDD API may not be your favorite way to interact with Spark as a
user, but it can be extremely valuable if you're developing libraries
for Spark.

As a library developer, you might need to rely on developer APIs and
dive into Spark's source code, but things are getting easier with each
release!