Building machine learning
algorithms on Apache Spark
William Benton (@willb)
Red Hat, Inc.
Session hashtag: #EUds5
#EUds5
Motivation
2
#EUds5
Motivation
3
#EUds5
Motivation
4
#EUds5
Motivation
5
#EUds5
Forecast
Introducing our case study: self-organizing maps
Parallel implementations for partitioned collections (in particular, RDDs)
Beyond the RDD: data frames and ML pipelines
Practical considerations and key takeaways
6
#EUds5
Introducing self-organizing maps
#EUds58
#EUds58
#EUds5
Training self-organizing maps
9
#EUds5
Training self-organizing maps
10
#EUds5
Training self-organizing maps
11
#EUds5
Training self-organizing maps
11
#EUds5
Training self-organizing maps
12
while t < maxupdates:
random.shuffle(examples)
for ex in examples:
t = t + 1
if t == maxupdates:
break
bestMatch = closest(somt, ex)
for (unit, wt) in neighborhood(bestMatch, sigma(t)):
somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt
#EUds5
Training self-organizing maps
12
while t < maxupdates:
random.shuffle(examples)
for ex in examples:
t = t + 1
if t == maxupdates:
break
bestMatch = closest(somt, ex)
for (unit, wt) in neighborhood(bestMatch, sigma(t)):
somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt
process the training
set in random order
#EUds5
Training self-organizing maps
12
while t < maxupdates:
random.shuffle(examples)
for ex in examples:
t = t + 1
if t == maxupdates:
break
bestMatch = closest(somt, ex)
for (unit, wt) in neighborhood(bestMatch, sigma(t)):
somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt
process the training
set in random order
the neighborhood size controls
how much of the map around
the BMU is affected
#EUds5
Training self-organizing maps
12
while t < maxupdates:
random.shuffle(examples)
for ex in examples:
t = t + 1
if t == maxupdates:
break
bestMatch = closest(somt, ex)
for (unit, wt) in neighborhood(bestMatch, sigma(t)):
somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt
process the training
set in random order
the neighborhood size controls
how much of the map around
the BMU is affected
the learning rate controls
how much closer to the
example each unit gets
#EUds5
Parallel implementations for
partitioned collections
#EUds5
Historical aside: Amdahl’s Law
14
1
1 - p
lim So =sp —> ∞
#EUds5
What forces serial execution?
15
#EUds5
What forces serial execution?
16
#EUds5
What forces serial execution?
17
state[t+1] =
combine(state[t], x)
#EUds5
What forces serial execution?
18
state[t+1] =
combine(state[t], x)
#EUds5
What forces serial execution?
19
f1: (T, T) => T
f2: (T, U) => T
#EUds5
What forces serial execution?
19
f1: (T, T) => T
f2: (T, U) => T
#EUds5
What forces serial execution?
19
f1: (T, T) => T
f2: (T, U) => T
#EUds5
How can we fix these?
20
a ⊕ b = b ⊕ a
(a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
#EUds5
How can we fix these?
21
a ⊕ b = b ⊕ a
(a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
#EUds5
How can we fix these?
22
a ⊕ b = b ⊕ a
(a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
#EUds5
How can we fix these?
23
a ⊕ b = b ⊕ a
(a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
#EUds5
How can we fix these?
24
L-BGFSSGD
a ⊕ b = b ⊕ a
(a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
#EUds5
How can we fix these?
25
SGD L-BGFS
a ⊕ b = b ⊕ a
(a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
There will be examples of each
of these approaches for many
problems in the literature and
in open-source code!
#EUds5
Implementing atop RDDs
We’ll start with a batch implementation of our technique:
26
for t in (1 to iterations):
state = newState()
for ex in examples:
bestMatch = closest(somt-1, ex)
hood = neighborhood(bestMatch, sigma(t))
state.matches += ex * hood
state.hoods += hood
somt = newSOM(state.matches / state.hoods)
#EUds5
Implementing atop RDDs
27
for t in (1 to iterations):
state = newState()
for ex in examples:
bestMatch = closest(somt-1, ex)
hood = neighborhood(bestMatch, sigma(t))
state.matches += ex * hood
state.hoods += hood
somt = newSOM(state.matches / state.hoods)
Each batch produces a model that
can be averaged with other models
#EUds5
Implementing atop RDDs
28
for t in (1 to iterations):
state = newState()
for ex in examples:
bestMatch = closest(somt-1, ex)
hood = neighborhood(bestMatch, sigma(t))
state.matches += ex * hood
state.hoods += hood
somt = newSOM(state.matches / state.hoods)
Each batch produces a model that
can be averaged with other models
partition
#EUds5
Implementing atop RDDs
29
for t in (1 to iterations):
state = newState()
for ex in examples:
bestMatch = closest(somt-1, ex)
hood = neighborhood(bestMatch, sigma(t))
state.matches += ex * hood
state.hoods += hood
somt = newSOM(state.matches / state.hoods)
This won’t always work!
#EUds5
An implementation template
30
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(nextModel.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
}
#EUds5
An implementation template
31
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(nextModel.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
}
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(nextModel.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
}
“fold”: update the state
for this partition with a
single new example
#EUds5
An implementation template
32
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(nextModel.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
}
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(nextModel.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
}
“reduce”: combine the
states from two partitions
#EUds5
An implementation template
33
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(nextModel.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
}
this will cause the model object
to be serialized with the closure
nextModel
#EUds5
An implementation template
34
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val current = sc.broadcast(nextModel)
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(current.value.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
current.unpersist
}
broadcast the current working
model for this iterationvar nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val current = sc.broadcast(nextModel)
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(current.value.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
current.unpersist
} get the value of the
broadcast variable
#EUds5
An implementation template
35
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val current = sc.broadcast(nextModel)
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(current.value.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
current.unpersist
} remove the stale
broadcasted model
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val current = sc.broadcast(nextModel)
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(current.value.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
current.unpersist
#EUds5
An implementation template
36
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val current = sc.broadcast(nextModel)
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(current.value.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
current.unpersist
}
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val current = sc.broadcast(nextModel)
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(current.value.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
current.unpersist
}
the wrong implementation
of the right interface
#EUds5
workersdriver (aggregate)
Implementing on RDDs
37
⊕ ⊕ ⊕ ⊕ ⊕ ⊕
⊕ ⊕ ⊕ ⊕ ⊕
⊕ ⊕ ⊕ ⊕
#EUds5
workersdriver (aggregate)
Implementing on RDDs
38
⊕ ⊕ ⊕ ⊕ ⊕ ⊕
⊕ ⊕ ⊕ ⊕ ⊕
⊕ ⊕ ⊕ ⊕
#EUds5
workersdriver (aggregate)
Implementing on RDDs
38
#EUds5
workersdriver (treeAggregate)
Implementing on RDDs
39
⊕
⊕
⊕
⊕
⊕ ⊕
⊕ ⊕
#EUds5
workersdriver (treeAggregate)
Implementing on RDDs
40
⊕
⊕
⊕
⊕
⊕ ⊕
⊕ ⊕
#EUds5
workersdriver (treeAggregate)
Implementing on RDDs
41
⊕
⊕
⊕
⊕
⊕ ⊕
⊕ ⊕
⊕
⊕
⊕
⊕
#EUds5
workersdriver (treeAggregate)
Implementing on RDDs
42
⊕
⊕
⊕
⊕
⊕
⊕
⊕ ⊕
driver (treeAggregate)
⊕
⊕
⊕
⊕
#EUds5
workersdriver (treeAggregate)
Implementing on RDDs
43
⊕
⊕
⊕
⊕
⊕ ⊕
⊕ ⊕
⊕ ⊕
#EUds5
workersdriver (treeAggregate)
Implementing on RDDs
44
⊕ ⊕
#EUds5
driver (treeAggregate) workers
Implementing on RDDs
45
⊕ ⊕
#EUds5
driver (treeAggregate) workers
Implementing on RDDs
45
⊕ ⊕⊕
#EUds5
Beyond the RDD: Data frames and
ML Pipelines
#EUds5
RDDs: some good parts
47
val rdd: RDD[String] = /* ... */
rdd.map(_ * 3.0).collect()
val df: DataFrame = /* data frame with one String-valued column */
df.select($"_1" * 3.0).show()
#EUds5
RDDs: some good parts
48
val rdd: RDD[String] = /* ... */
rdd.map(_ * 3.0).collect()
val df: DataFrame = /* data frame with one String-valued column */
df.select($"_1" * 3.0).show()
doesn’t compile
#EUds5
RDDs: some good parts
49
val rdd: RDD[String] = /* ... */
rdd.map(_ * 3.0).collect()
val df: DataFrame = /* data frame with one String-valued column */
df.select($"_1" * 3.0).show()
doesn’t compile
#EUds5
RDDs: some good parts
50
val rdd: RDD[String] = /* ... */
rdd.map(_ * 3.0).collect()
val df: DataFrame = /* data frame with one String-valued column */
df.select($"_1" * 3.0).show()
doesn’t compile
crashes at runtime
#EUds5
RDDs: some good parts
51
rdd.map {
vec => (vec, model.value.closestWithSimilarity(vec))
}
val predict = udf ((vec: SV) =>
model.value.closestWithSimilarity(vec))
df.withColumn($"predictions", predict($"features"))
#EUds5
RDDs: some good parts
52
rdd.map {
vec => (vec, model.value.closestWithSimilarity(vec))
}
val predict = udf ((vec: SV) =>
model.value.closestWithSimilarity(vec))
df.withColumn($"predictions", predict($"features"))
#EUds5
RDDs versus query planning
53
val numbers1 = sc.parallelize(1 to 100000000)
val numbers2 = sc.parallelize(1 to 1000000000)
numbers1.cartesian(numbers2)
.map((x, y) => (x, y, expensive(x, y)))
.filter((x, y, _) => isPrime(x), isPrime(y))
#EUds5
RDDs versus query planning
54
val numbers1 = sc.parallelize(1 to 100000000)
val numbers2 = sc.parallelize(1 to 1000000000)
numbers1.filter(isPrime(_))
.cartesian(numbers2.filter(isPrime(_)))
.map((x, y) => (x, y, expensive(x, y)))
#EUds5
RDDs and the JVM heap
55
val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))
#EUds5
RDDs and the Java heap
56
val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))
class
pointer flags size locks element pointer element pointer
class
pointer flags size locks 1.0
class
pointer flags size locks 3.0 4.0
2.0
#EUds5
RDDs and the Java heap
57
val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))
class
pointer flags size locks element pointer element pointer
class
pointer flags size locks 1.0
class
pointer flags size locks 3.0 4.0
2.0 32 bytes of data…
#EUds5
RDDs and the Java heap
58
val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))
class
pointer flags size locks element pointer element pointer
class
pointer flags size locks 1.0
class
pointer flags size locks 3.0 4.0
2.0
…and 64 bytes
of overhead!
32 bytes of data…
#EUds5
ML pipelines: a quick example
59
from pyspark.ml.clustering import KMeans
K, SEED = 100, 0xdea110c8
randomDF = make_random_df()
kmeans = KMeans().setK(K).setSeed(SEED).setFeaturesCol("features")
model = kmeans.fit(randomDF)
withPredictions = model.transform(randomDF).select("x", "y", "prediction")
#EUds5
Working with ML pipelines
60
estimator.fit(df)
#EUds5
Working with ML pipelines
61
estimator.fit(df) model.transform(df)
#EUds5
Working with ML pipelines
62
model.transform(df)
#EUds5
Working with ML pipelines
63
model.transform(df)
#EUds5
Working with ML pipelines
64
estimator.fit(df) model.transform(df)
#EUds5
Working with ML pipelines
64
estimator.fit(df) model.transform(df)
inputCol
epochs
seed
#EUds5
Working with ML pipelines
64
estimator.fit(df) model.transform(df)
inputCol
epochs
seed
outputCol
#EUds5
Defining parameters
65
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
#EUds5
Defining parameters
66
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
#EUds5
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
Defining parameters
67
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
#EUds5
Defining parameters
68
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
#EUds5
Defining parameters
69
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
#EUds5
Defining parameters
70
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
#EUds5
Don’t repeat yourself
71
/**
* Common params for KMeans and KMeansModel
*/
private[clustering] trait KMeansParams extends Params
with HasMaxIter with HasFeaturesCol
with HasSeed with HasPredictionCol with HasTol { /* ... */ }
#EUds5
Estimators and transformers
72
estimator.fit(df)
#EUds5
Estimators and transformers
72
estimator.fit(df)
#EUds5
Estimators and transformers
72
estimator.fit(df) model.transform(df)
#EUds5
Estimators and transformers
72
estimator.fit(df) model.transform(df)
#EUds5
Estimators and transformers
72
estimator.fit(df) model.transform(df)
#EUds5
Validate and transform at once
73
def transformSchema(schema: StructType):
StructType = {
// check that the input columns exist...
// ...and are the proper type
// ...and that the output columns don’t exist
// ...and then make a new schema
}
#EUds5
Validate and transform at once
74
def transformSchema(schema: StructType):
StructType = {
// check that the input columns exist…
require(schema.fieldNames.contains($(featuresCol)))
// ...and are the proper type
// ...and that the output columns don’t exist
// ...and then make a new schema
}
#EUds5
Validate and transform at once
75
def transformSchema(schema: StructType):
StructType = {
// check that the input columns exist...
// ...and are the proper type
schema($(featuresCol)) match {
case sf: StructField => require(sf.dataType.equals(VectorType))
}
// ...and that the output columns don’t exist
// ...and then make a new schema
}
#EUds5
Validate and transform at once
76
def transformSchema(schema: StructType):
StructType = {
// check that the input columns exist…
// ...and are the proper type
// ...and that the output columns don’t exist
require(!schema.fieldNames.contains($(predictionCol)))
require(!schema.fieldNames.contains($(similarityCol)))
// ...and then make a new schema
}
#EUds5
Validate and transform at once
77
def transformSchema(schema: StructType):
StructType = {
// check that the input columns exist…
// ...and are the proper type
// ...and that the output columns don’t exist
// ...and then make a new schema
schema.add($(predictionCol), "int")
.add($(similarityCol), "double")
}
#EUds5
Training on data frames
78
def fit(examples: DataFrame) = {
import examples.sparkSession.implicits._
import org.apache.spark.ml.linalg.{Vector=>SV}
val dfexamples = examples.select($(exampleCol)).rdd.map {
case Row(sv: SV) => sv
}
/* construct a model object with the result of training */
new SOMModel(train(dfexamples, $(x), $(y)))
}
#EUds5
Practical considerations

and key takeaways
#EUds5
Retire your visibility hacks
80
package org.apache.spark.ml.hacks
object Hacks {
import org.apache.spark.ml.linalg.VectorUDT
val vectorUDT = new VectorUDT
}
#EUds5
Retire your visibility hacks
81
package org.apache.spark.ml.linalg
/* imports, etc., are elided ... */
@Since("2.0.0")
@DeveloperApi
object SQLDataTypes {
val VectorType: DataType = new VectorUDT
val MatrixType: DataType = new MatrixUDT
}
#EUds5
Caching training data
82
val wasUncached = examples.storageLevel == StorageLevel.NONE
if (wasUncached) { examples.cache() }
/* actually train here */
if (wasUncached) { examples.unpersist() }
#EUds5
Improve serial execution times
Are you repeatedly comparing training data to a model that only changes
once per iteration? Consider caching norms.
Are you doing a lot of dot products in a for loop? Consider replacing
these loops with a matrix-vector multiplication.
Seek to limit the number of library invocations you make and thus the
time you spend copying data to and from your linear algebra library.
83
#EUds5
Key takeaways
There are several techniques you can use to develop parallel
implementations of machine learning algorithms.
The RDD API may not be your favorite way to interact with Spark as a user,
but it can be extremely valuable if you’re developing libraries for Spark.
As a library developer, you might need to rely on developer APIs and dive
in to Spark’s source code, but things are getting easier with each release!
84
#EUds5
@willb • willb@redhat.com
https://chapeau.freevariable.com
https://radanalytics.io
Thanks!

Building Machine Learning Algorithms on Apache Spark with William Benton

  • 1.
    Building machine learning algorithmson Apache Spark William Benton (@willb) Red Hat, Inc. Session hashtag: #EUds5
  • 2.
  • 3.
  • 4.
  • 5.
  • 6.
    #EUds5 Forecast Introducing our casestudy: self-organizing maps Parallel implementations for partitioned collections (in particular, RDDs) Beyond the RDD: data frames and ML pipelines Practical considerations and key takeaways 6
  • 7.
  • 8.
  • 9.
  • 10.
  • 11.
  • 12.
  • 13.
  • 14.
    #EUds5 Training self-organizing maps 12 whilet < maxupdates: random.shuffle(examples) for ex in examples: t = t + 1 if t == maxupdates: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt
  • 15.
    #EUds5 Training self-organizing maps 12 whilet < maxupdates: random.shuffle(examples) for ex in examples: t = t + 1 if t == maxupdates: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt process the training set in random order
  • 16.
    #EUds5 Training self-organizing maps 12 whilet < maxupdates: random.shuffle(examples) for ex in examples: t = t + 1 if t == maxupdates: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt process the training set in random order the neighborhood size controls how much of the map around the BMU is affected
  • 17.
    #EUds5 Training self-organizing maps 12 whilet < maxupdates: random.shuffle(examples) for ex in examples: t = t + 1 if t == maxupdates: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt process the training set in random order the neighborhood size controls how much of the map around the BMU is affected the learning rate controls how much closer to the example each unit gets
  • 18.
  • 19.
    #EUds5 Historical aside: Amdahl’sLaw 14 1 1 - p lim So =sp —> ∞
  • 20.
  • 21.
  • 22.
    #EUds5 What forces serialexecution? 17 state[t+1] = combine(state[t], x)
  • 23.
    #EUds5 What forces serialexecution? 18 state[t+1] = combine(state[t], x)
  • 24.
    #EUds5 What forces serialexecution? 19 f1: (T, T) => T f2: (T, U) => T
  • 25.
    #EUds5 What forces serialexecution? 19 f1: (T, T) => T f2: (T, U) => T
  • 26.
    #EUds5 What forces serialexecution? 19 f1: (T, T) => T f2: (T, U) => T
  • 27.
    #EUds5 How can wefix these? 20 a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
  • 28.
    #EUds5 How can wefix these? 21 a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
  • 29.
    #EUds5 How can wefix these? 22 a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
  • 30.
    #EUds5 How can wefix these? 23 a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
  • 31.
    #EUds5 How can wefix these? 24 L-BGFSSGD a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
  • 32.
    #EUds5 How can wefix these? 25 SGD L-BGFS a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c) There will be examples of each of these approaches for many problems in the literature and in open-source code!
  • 33.
    #EUds5 Implementing atop RDDs We’llstart with a batch implementation of our technique: 26 for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods)
  • 34.
    #EUds5 Implementing atop RDDs 27 fort in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods) Each batch produces a model that can be averaged with other models
  • 35.
    #EUds5 Implementing atop RDDs 28 fort in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods) Each batch produces a model that can be averaged with other models partition
  • 36.
    #EUds5 Implementing atop RDDs 29 fort in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods) This won’t always work!
  • 37.
    #EUds5 An implementation template 30 varnextModel = initialModel for (int i = 0; i < iterations; i++) { val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) }
  • 38.
    #EUds5 An implementation template 31 varnextModel = initialModel for (int i = 0; i < iterations; i++) { val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) } var nextModel = initialModel for (int i = 0; i < iterations; i++) { val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) } “fold”: update the state for this partition with a single new example
  • 39.
    #EUds5 An implementation template 32 varnextModel = initialModel for (int i = 0; i < iterations; i++) { val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) } var nextModel = initialModel for (int i = 0; i < iterations; i++) { val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) } “reduce”: combine the states from two partitions
  • 40.
    #EUds5 An implementation template 33 varnextModel = initialModel for (int i = 0; i < iterations; i++) { val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) } this will cause the model object to be serialized with the closure nextModel
  • 41.
    #EUds5 An implementation template 34 varnextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist } broadcast the current working model for this iterationvar nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist } get the value of the broadcast variable
  • 42.
    #EUds5 An implementation template 35 varnextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist } remove the stale broadcasted model var nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist
  • 43.
    #EUds5 An implementation template 36 varnextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist } var nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist } the wrong implementation of the right interface
  • 44.
    #EUds5 workersdriver (aggregate) Implementing onRDDs 37 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕
  • 45.
    #EUds5 workersdriver (aggregate) Implementing onRDDs 38 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕
  • 46.
  • 47.
    #EUds5 workersdriver (treeAggregate) Implementing onRDDs 39 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕
  • 48.
    #EUds5 workersdriver (treeAggregate) Implementing onRDDs 40 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕
  • 49.
    #EUds5 workersdriver (treeAggregate) Implementing onRDDs 41 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕
  • 50.
    #EUds5 workersdriver (treeAggregate) Implementing onRDDs 42 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ driver (treeAggregate) ⊕ ⊕ ⊕ ⊕
  • 51.
    #EUds5 workersdriver (treeAggregate) Implementing onRDDs 43 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕
  • 52.
  • 53.
  • 54.
  • 55.
    #EUds5 Beyond the RDD:Data frames and ML Pipelines
  • 56.
    #EUds5 RDDs: some goodparts 47 val rdd: RDD[String] = /* ... */ rdd.map(_ * 3.0).collect() val df: DataFrame = /* data frame with one String-valued column */ df.select($"_1" * 3.0).show()
  • 57.
    #EUds5 RDDs: some goodparts 48 val rdd: RDD[String] = /* ... */ rdd.map(_ * 3.0).collect() val df: DataFrame = /* data frame with one String-valued column */ df.select($"_1" * 3.0).show() doesn’t compile
  • 58.
    #EUds5 RDDs: some goodparts 49 val rdd: RDD[String] = /* ... */ rdd.map(_ * 3.0).collect() val df: DataFrame = /* data frame with one String-valued column */ df.select($"_1" * 3.0).show() doesn’t compile
  • 59.
    #EUds5 RDDs: some goodparts 50 val rdd: RDD[String] = /* ... */ rdd.map(_ * 3.0).collect() val df: DataFrame = /* data frame with one String-valued column */ df.select($"_1" * 3.0).show() doesn’t compile crashes at runtime
  • 60.
    #EUds5 RDDs: some goodparts 51 rdd.map { vec => (vec, model.value.closestWithSimilarity(vec)) } val predict = udf ((vec: SV) => model.value.closestWithSimilarity(vec)) df.withColumn($"predictions", predict($"features"))
  • 61.
    #EUds5 RDDs: some goodparts 52 rdd.map { vec => (vec, model.value.closestWithSimilarity(vec)) } val predict = udf ((vec: SV) => model.value.closestWithSimilarity(vec)) df.withColumn($"predictions", predict($"features"))
  • 62.
    #EUds5 RDDs versus queryplanning 53 val numbers1 = sc.parallelize(1 to 100000000) val numbers2 = sc.parallelize(1 to 1000000000) numbers1.cartesian(numbers2) .map((x, y) => (x, y, expensive(x, y))) .filter((x, y, _) => isPrime(x), isPrime(y))
  • 63.
    #EUds5 RDDs versus queryplanning 54 val numbers1 = sc.parallelize(1 to 100000000) val numbers2 = sc.parallelize(1 to 1000000000) numbers1.filter(isPrime(_)) .cartesian(numbers2.filter(isPrime(_))) .map((x, y) => (x, y, expensive(x, y)))
  • 64.
    #EUds5 RDDs and theJVM heap 55 val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))
  • 65.
    #EUds5 RDDs and theJava heap 56 val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0)) class pointer flags size locks element pointer element pointer class pointer flags size locks 1.0 class pointer flags size locks 3.0 4.0 2.0
  • 66.
    #EUds5 RDDs and theJava heap 57 val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0)) class pointer flags size locks element pointer element pointer class pointer flags size locks 1.0 class pointer flags size locks 3.0 4.0 2.0 32 bytes of data…
  • 67.
    #EUds5 RDDs and theJava heap 58 val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0)) class pointer flags size locks element pointer element pointer class pointer flags size locks 1.0 class pointer flags size locks 3.0 4.0 2.0 …and 64 bytes of overhead! 32 bytes of data…
  • 68.
    #EUds5 ML pipelines: aquick example 59 from pyspark.ml.clustering import KMeans K, SEED = 100, 0xdea110c8 randomDF = make_random_df() kmeans = KMeans().setK(K).setSeed(SEED).setFeaturesCol("features") model = kmeans.fit(randomDF) withPredictions = model.transform(randomDF).select("x", "y", "prediction")
  • 69.
    #EUds5 Working with MLpipelines 60 estimator.fit(df)
  • 70.
    #EUds5 Working with MLpipelines 61 estimator.fit(df) model.transform(df)
  • 71.
    #EUds5 Working with MLpipelines 62 model.transform(df)
  • 72.
    #EUds5 Working with MLpipelines 63 model.transform(df)
  • 73.
    #EUds5 Working with MLpipelines 64 estimator.fit(df) model.transform(df)
  • 74.
    #EUds5 Working with MLpipelines 64 estimator.fit(df) model.transform(df) inputCol epochs seed
  • 75.
    #EUds5 Working with MLpipelines 64 estimator.fit(df) model.transform(df) inputCol epochs seed outputCol
  • 76.
    #EUds5 Defining parameters 65 private[som] traitSOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
  • 77.
    #EUds5 Defining parameters 66 private[som] traitSOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ... private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
  • 78.
    #EUds5 private[som] trait SOMParamsextends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ... Defining parameters 67 private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
  • 79.
    #EUds5 Defining parameters 68 private[som] traitSOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ... private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
  • 80.
    #EUds5 Defining parameters 69 private[som] traitSOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ... private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
  • 81.
    #EUds5 Defining parameters 70 private[som] traitSOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ... private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
  • 82.
    #EUds5 Don’t repeat yourself 71 /** *Common params for KMeans and KMeansModel */ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFeaturesCol with HasSeed with HasPredictionCol with HasTol { /* ... */ }
  • 83.
  • 84.
  • 85.
  • 86.
  • 87.
  • 88.
    #EUds5 Validate and transformat once 73 def transformSchema(schema: StructType): StructType = { // check that the input columns exist... // ...and are the proper type // ...and that the output columns don’t exist // ...and then make a new schema }
  • 89.
    #EUds5 Validate and transformat once 74 def transformSchema(schema: StructType): StructType = { // check that the input columns exist… require(schema.fieldNames.contains($(featuresCol))) // ...and are the proper type // ...and that the output columns don’t exist // ...and then make a new schema }
  • 90.
    #EUds5 Validate and transformat once 75 def transformSchema(schema: StructType): StructType = { // check that the input columns exist... // ...and are the proper type schema($(featuresCol)) match { case sf: StructField => require(sf.dataType.equals(VectorType)) } // ...and that the output columns don’t exist // ...and then make a new schema }
  • 91.
    #EUds5 Validate and transformat once 76 def transformSchema(schema: StructType): StructType = { // check that the input columns exist… // ...and are the proper type // ...and that the output columns don’t exist require(!schema.fieldNames.contains($(predictionCol))) require(!schema.fieldNames.contains($(similarityCol))) // ...and then make a new schema }
  • 92.
    #EUds5 Validate and transformat once 77 def transformSchema(schema: StructType): StructType = { // check that the input columns exist… // ...and are the proper type // ...and that the output columns don’t exist // ...and then make a new schema schema.add($(predictionCol), "int") .add($(similarityCol), "double") }
  • 93.
    #EUds5 Training on dataframes 78 def fit(examples: DataFrame) = { import examples.sparkSession.implicits._ import org.apache.spark.ml.linalg.{Vector=>SV} val dfexamples = examples.select($(exampleCol)).rdd.map { case Row(sv: SV) => sv } /* construct a model object with the result of training */ new SOMModel(train(dfexamples, $(x), $(y))) }
  • 94.
  • 95.
    #EUds5 Retire your visibilityhacks 80 package org.apache.spark.ml.hacks object Hacks { import org.apache.spark.ml.linalg.VectorUDT val vectorUDT = new VectorUDT }
  • 96.
    #EUds5 Retire your visibilityhacks 81 package org.apache.spark.ml.linalg /* imports, etc., are elided ... */ @Since("2.0.0") @DeveloperApi object SQLDataTypes { val VectorType: DataType = new VectorUDT val MatrixType: DataType = new MatrixUDT }
  • 97.
    #EUds5 Caching training data 82 valwasUncached = examples.storageLevel == StorageLevel.NONE if (wasUncached) { examples.cache() } /* actually train here */ if (wasUncached) { examples.unpersist() }
  • 98.
    #EUds5 Improve serial executiontimes Are you repeatedly comparing training data to a model that only changes once per iteration? Consider caching norms. Are you doing a lot of dot products in a for loop? Consider replacing these loops with a matrix-vector multiplication. Seek to limit the number of library invocations you make and thus the time you spend copying data to and from your linear algebra library. 83
  • 99.
    #EUds5 Key takeaways There areseveral techniques you can use to develop parallel implementations of machine learning algorithms. The RDD API may not be your favorite way to interact with Spark as a user, but it can be extremely valuable if you’re developing libraries for Spark. As a library developer, you might need to rely on developer APIs and dive in to Spark’s source code, but things are getting easier with each release! 84
  • 100.