Building Machine Learning Algorithms on Apache Spark with William Benton

Spark Summit
Spark SummitSpark Summit
Building machine learning
algorithms on Apache Spark
William Benton (@willb)
Red Hat, Inc.
Session hashtag: #EUds5
#EUds5
Motivation
2
#EUds5
Motivation
3
#EUds5
Motivation
4
#EUds5
Motivation
5
#EUds5
Forecast
Introducing our case study: self-organizing maps
Parallel implementations for partitioned collections (in particular, RDDs)
Beyond the RDD: data frames and ML pipelines
Practical considerations and key takeaways
6
#EUds5
Introducing self-organizing maps
#EUds58
#EUds58
#EUds5
Training self-organizing maps
9
#EUds5
Training self-organizing maps
10
#EUds5
Training self-organizing maps
11
#EUds5
Training self-organizing maps
11
#EUds5
Training self-organizing maps
12
while t < maxupdates:
random.shuffle(examples)
for ex in examples:
t = t + 1
if t == maxupdates:
break
bestMatch = closest(somt, ex)
for (unit, wt) in neighborhood(bestMatch, sigma(t)):
somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt
#EUds5
Training self-organizing maps
12
while t < maxupdates:
random.shuffle(examples)
for ex in examples:
t = t + 1
if t == maxupdates:
break
bestMatch = closest(somt, ex)
for (unit, wt) in neighborhood(bestMatch, sigma(t)):
somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt
process the training
set in random order
#EUds5
Training self-organizing maps
12
while t < maxupdates:
random.shuffle(examples)
for ex in examples:
t = t + 1
if t == maxupdates:
break
bestMatch = closest(somt, ex)
for (unit, wt) in neighborhood(bestMatch, sigma(t)):
somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt
process the training
set in random order
the neighborhood size controls
how much of the map around
the BMU is affected
#EUds5
Training self-organizing maps
12
while t < maxupdates:
random.shuffle(examples)
for ex in examples:
t = t + 1
if t == maxupdates:
break
bestMatch = closest(somt, ex)
for (unit, wt) in neighborhood(bestMatch, sigma(t)):
somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt
process the training
set in random order
the neighborhood size controls
how much of the map around
the BMU is affected
the learning rate controls
how much closer to the
example each unit gets
#EUds5
Parallel implementations for
partitioned collections
#EUds5
Historical aside: Amdahl’s Law
14
1
1 - p
lim So =sp —> ∞
#EUds5
What forces serial execution?
15
#EUds5
What forces serial execution?
16
#EUds5
What forces serial execution?
17
state[t+1] =
combine(state[t], x)
#EUds5
What forces serial execution?
18
state[t+1] =
combine(state[t], x)
#EUds5
What forces serial execution?
19
f1: (T, T) => T
f2: (T, U) => T
#EUds5
What forces serial execution?
19
f1: (T, T) => T
f2: (T, U) => T
#EUds5
What forces serial execution?
19
f1: (T, T) => T
f2: (T, U) => T
#EUds5
How can we fix these?
20
a ⊕ b = b ⊕ a
(a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
#EUds5
How can we fix these?
21
a ⊕ b = b ⊕ a
(a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
#EUds5
How can we fix these?
22
a ⊕ b = b ⊕ a
(a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
#EUds5
How can we fix these?
23
a ⊕ b = b ⊕ a
(a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
#EUds5
How can we fix these?
24
L-BGFSSGD
a ⊕ b = b ⊕ a
(a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
#EUds5
How can we fix these?
25
SGD L-BGFS
a ⊕ b = b ⊕ a
(a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
There will be examples of each
of these approaches for many
problems in the literature and
in open-source code!
#EUds5
Implementing atop RDDs
We’ll start with a batch implementation of our technique:
26
for t in (1 to iterations):
state = newState()
for ex in examples:
bestMatch = closest(somt-1, ex)
hood = neighborhood(bestMatch, sigma(t))
state.matches += ex * hood
state.hoods += hood
somt = newSOM(state.matches / state.hoods)
#EUds5
Implementing atop RDDs
27
for t in (1 to iterations):
state = newState()
for ex in examples:
bestMatch = closest(somt-1, ex)
hood = neighborhood(bestMatch, sigma(t))
state.matches += ex * hood
state.hoods += hood
somt = newSOM(state.matches / state.hoods)
Each batch produces a model that
can be averaged with other models
#EUds5
Implementing atop RDDs
28
for t in (1 to iterations):
state = newState()
for ex in examples:
bestMatch = closest(somt-1, ex)
hood = neighborhood(bestMatch, sigma(t))
state.matches += ex * hood
state.hoods += hood
somt = newSOM(state.matches / state.hoods)
Each batch produces a model that
can be averaged with other models
partition
#EUds5
Implementing atop RDDs
29
for t in (1 to iterations):
state = newState()
for ex in examples:
bestMatch = closest(somt-1, ex)
hood = neighborhood(bestMatch, sigma(t))
state.matches += ex * hood
state.hoods += hood
somt = newSOM(state.matches / state.hoods)
This won’t always work!
#EUds5
An implementation template
30
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(nextModel.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
}
#EUds5
An implementation template
31
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(nextModel.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
}
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(nextModel.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
}
“fold”: update the state
for this partition with a
single new example
#EUds5
An implementation template
32
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(nextModel.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
}
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(nextModel.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
}
“reduce”: combine the
states from two partitions
#EUds5
An implementation template
33
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(nextModel.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
}
this will cause the model object
to be serialized with the closure
nextModel
#EUds5
An implementation template
34
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val current = sc.broadcast(nextModel)
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(current.value.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
current.unpersist
}
broadcast the current working
model for this iterationvar nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val current = sc.broadcast(nextModel)
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(current.value.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
current.unpersist
} get the value of the
broadcast variable
#EUds5
An implementation template
35
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val current = sc.broadcast(nextModel)
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(current.value.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
current.unpersist
} remove the stale
broadcasted model
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val current = sc.broadcast(nextModel)
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(current.value.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
current.unpersist
#EUds5
An implementation template
36
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val current = sc.broadcast(nextModel)
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(current.value.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
current.unpersist
}
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val current = sc.broadcast(nextModel)
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(current.value.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
current.unpersist
}
the wrong implementation
of the right interface
#EUds5
workersdriver (aggregate)
Implementing on RDDs
37
⊕ ⊕ ⊕ ⊕ ⊕ ⊕
⊕ ⊕ ⊕ ⊕ ⊕
⊕ ⊕ ⊕ ⊕
#EUds5
workersdriver (aggregate)
Implementing on RDDs
38
⊕ ⊕ ⊕ ⊕ ⊕ ⊕
⊕ ⊕ ⊕ ⊕ ⊕
⊕ ⊕ ⊕ ⊕
#EUds5
workersdriver (aggregate)
Implementing on RDDs
38
#EUds5
workersdriver (treeAggregate)
Implementing on RDDs
39
⊕
⊕
⊕
⊕
⊕ ⊕
⊕ ⊕
#EUds5
workersdriver (treeAggregate)
Implementing on RDDs
40
⊕
⊕
⊕
⊕
⊕ ⊕
⊕ ⊕
#EUds5
workersdriver (treeAggregate)
Implementing on RDDs
41
⊕
⊕
⊕
⊕
⊕ ⊕
⊕ ⊕
⊕
⊕
⊕
⊕
#EUds5
workersdriver (treeAggregate)
Implementing on RDDs
42
⊕
⊕
⊕
⊕
⊕
⊕
⊕ ⊕
driver (treeAggregate)
⊕
⊕
⊕
⊕
#EUds5
workersdriver (treeAggregate)
Implementing on RDDs
43
⊕
⊕
⊕
⊕
⊕ ⊕
⊕ ⊕
⊕ ⊕
#EUds5
workersdriver (treeAggregate)
Implementing on RDDs
44
⊕ ⊕
#EUds5
driver (treeAggregate) workers
Implementing on RDDs
45
⊕ ⊕
#EUds5
driver (treeAggregate) workers
Implementing on RDDs
45
⊕ ⊕⊕
#EUds5
Beyond the RDD: Data frames and
ML Pipelines
#EUds5
RDDs: some good parts
47
val rdd: RDD[String] = /* ... */
rdd.map(_ * 3.0).collect()
val df: DataFrame = /* data frame with one String-valued column */
df.select($"_1" * 3.0).show()
#EUds5
RDDs: some good parts
48
val rdd: RDD[String] = /* ... */
rdd.map(_ * 3.0).collect()
val df: DataFrame = /* data frame with one String-valued column */
df.select($"_1" * 3.0).show()
doesn’t compile
#EUds5
RDDs: some good parts
49
val rdd: RDD[String] = /* ... */
rdd.map(_ * 3.0).collect()
val df: DataFrame = /* data frame with one String-valued column */
df.select($"_1" * 3.0).show()
doesn’t compile
#EUds5
RDDs: some good parts
50
val rdd: RDD[String] = /* ... */
rdd.map(_ * 3.0).collect()
val df: DataFrame = /* data frame with one String-valued column */
df.select($"_1" * 3.0).show()
doesn’t compile
crashes at runtime
#EUds5
RDDs: some good parts
51
rdd.map {
vec => (vec, model.value.closestWithSimilarity(vec))
}
val predict = udf ((vec: SV) =>
model.value.closestWithSimilarity(vec))
df.withColumn($"predictions", predict($"features"))
#EUds5
RDDs: some good parts
52
rdd.map {
vec => (vec, model.value.closestWithSimilarity(vec))
}
val predict = udf ((vec: SV) =>
model.value.closestWithSimilarity(vec))
df.withColumn($"predictions", predict($"features"))
#EUds5
RDDs versus query planning
53
val numbers1 = sc.parallelize(1 to 100000000)
val numbers2 = sc.parallelize(1 to 1000000000)
numbers1.cartesian(numbers2)
.map((x, y) => (x, y, expensive(x, y)))
.filter((x, y, _) => isPrime(x), isPrime(y))
#EUds5
RDDs versus query planning
54
val numbers1 = sc.parallelize(1 to 100000000)
val numbers2 = sc.parallelize(1 to 1000000000)
numbers1.filter(isPrime(_))
.cartesian(numbers2.filter(isPrime(_)))
.map((x, y) => (x, y, expensive(x, y)))
#EUds5
RDDs and the JVM heap
55
val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))
#EUds5
RDDs and the Java heap
56
val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))
class
pointer flags size locks element pointer element pointer
class
pointer flags size locks 1.0
class
pointer flags size locks 3.0 4.0
2.0
#EUds5
RDDs and the Java heap
57
val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))
class
pointer flags size locks element pointer element pointer
class
pointer flags size locks 1.0
class
pointer flags size locks 3.0 4.0
2.0 32 bytes of data…
#EUds5
RDDs and the Java heap
58
val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))
class
pointer flags size locks element pointer element pointer
class
pointer flags size locks 1.0
class
pointer flags size locks 3.0 4.0
2.0
…and 64 bytes
of overhead!
32 bytes of data…
#EUds5
ML pipelines: a quick example
59
from pyspark.ml.clustering import KMeans
K, SEED = 100, 0xdea110c8
randomDF = make_random_df()
kmeans = KMeans().setK(K).setSeed(SEED).setFeaturesCol("features")
model = kmeans.fit(randomDF)
withPredictions = model.transform(randomDF).select("x", "y", "prediction")
#EUds5
Working with ML pipelines
60
estimator.fit(df)
#EUds5
Working with ML pipelines
61
estimator.fit(df) model.transform(df)
#EUds5
Working with ML pipelines
62
model.transform(df)
#EUds5
Working with ML pipelines
63
model.transform(df)
#EUds5
Working with ML pipelines
64
estimator.fit(df) model.transform(df)
#EUds5
Working with ML pipelines
64
estimator.fit(df) model.transform(df)
inputCol
epochs
seed
#EUds5
Working with ML pipelines
64
estimator.fit(df) model.transform(df)
inputCol
epochs
seed
outputCol
#EUds5
Defining parameters
65
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
#EUds5
Defining parameters
66
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
#EUds5
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
Defining parameters
67
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
#EUds5
Defining parameters
68
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
#EUds5
Defining parameters
69
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
#EUds5
Defining parameters
70
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
#EUds5
Don’t repeat yourself
71
/**
* Common params for KMeans and KMeansModel
*/
private[clustering] trait KMeansParams extends Params
with HasMaxIter with HasFeaturesCol
with HasSeed with HasPredictionCol with HasTol { /* ... */ }
#EUds5
Estimators and transformers
72
estimator.fit(df)
#EUds5
Estimators and transformers
72
estimator.fit(df)
#EUds5
Estimators and transformers
72
estimator.fit(df) model.transform(df)
#EUds5
Estimators and transformers
72
estimator.fit(df) model.transform(df)
#EUds5
Estimators and transformers
72
estimator.fit(df) model.transform(df)
#EUds5
Validate and transform at once
73
def transformSchema(schema: StructType):
StructType = {
// check that the input columns exist...
// ...and are the proper type
// ...and that the output columns don’t exist
// ...and then make a new schema
}
#EUds5
Validate and transform at once
74
def transformSchema(schema: StructType):
StructType = {
// check that the input columns exist…
require(schema.fieldNames.contains($(featuresCol)))
// ...and are the proper type
// ...and that the output columns don’t exist
// ...and then make a new schema
}
#EUds5
Validate and transform at once
75
def transformSchema(schema: StructType):
StructType = {
// check that the input columns exist...
// ...and are the proper type
schema($(featuresCol)) match {
case sf: StructField => require(sf.dataType.equals(VectorType))
}
// ...and that the output columns don’t exist
// ...and then make a new schema
}
#EUds5
Validate and transform at once
76
def transformSchema(schema: StructType):
StructType = {
// check that the input columns exist…
// ...and are the proper type
// ...and that the output columns don’t exist
require(!schema.fieldNames.contains($(predictionCol)))
require(!schema.fieldNames.contains($(similarityCol)))
// ...and then make a new schema
}
#EUds5
Validate and transform at once
77
def transformSchema(schema: StructType):
StructType = {
// check that the input columns exist…
// ...and are the proper type
// ...and that the output columns don’t exist
// ...and then make a new schema
schema.add($(predictionCol), "int")
.add($(similarityCol), "double")
}
#EUds5
Training on data frames
78
def fit(examples: DataFrame) = {
import examples.sparkSession.implicits._
import org.apache.spark.ml.linalg.{Vector=>SV}
val dfexamples = examples.select($(exampleCol)).rdd.map {
case Row(sv: SV) => sv
}
/* construct a model object with the result of training */
new SOMModel(train(dfexamples, $(x), $(y)))
}
#EUds5
Practical considerations

and key takeaways
#EUds5
Retire your visibility hacks
80
package org.apache.spark.ml.hacks
object Hacks {
import org.apache.spark.ml.linalg.VectorUDT
val vectorUDT = new VectorUDT
}
#EUds5
Retire your visibility hacks
81
package org.apache.spark.ml.linalg
/* imports, etc., are elided ... */
@Since("2.0.0")
@DeveloperApi
object SQLDataTypes {
val VectorType: DataType = new VectorUDT
val MatrixType: DataType = new MatrixUDT
}
#EUds5
Caching training data
82
val wasUncached = examples.storageLevel == StorageLevel.NONE
if (wasUncached) { examples.cache() }
/* actually train here */
if (wasUncached) { examples.unpersist() }
#EUds5
Improve serial execution times
Are you repeatedly comparing training data to a model that only changes
once per iteration? Consider caching norms.
Are you doing a lot of dot products in a for loop? Consider replacing
these loops with a matrix-vector multiplication.
Seek to limit the number of library invocations you make and thus the
time you spend copying data to and from your linear algebra library.
83
#EUds5
Key takeaways
There are several techniques you can use to develop parallel
implementations of machine learning algorithms.
The RDD API may not be your favorite way to interact with Spark as a user,
but it can be extremely valuable if you’re developing libraries for Spark.
As a library developer, you might need to rely on developer APIs and dive
in to Spark’s source code, but things are getting easier with each release!
84
#EUds5
@willb • willb@redhat.com
https://chapeau.freevariable.com
https://radanalytics.io
Thanks!
1 of 100

Recommended

Building Machine Learning Algorithms on Apache Spark: Scaling Out and Up with... by
Building Machine Learning Algorithms on Apache Spark: Scaling Out and Up with...Building Machine Learning Algorithms on Apache Spark: Scaling Out and Up with...
Building Machine Learning Algorithms on Apache Spark: Scaling Out and Up with...Databricks
442 views113 slides
Matlab plotting by
Matlab plottingMatlab plotting
Matlab plottingshahid sultan
207 views78 slides
Matlab plotting by
Matlab plottingMatlab plotting
Matlab plottingAmr Rashed
2.5K views40 slides
Three dimensional geometric transformations by
Three dimensional geometric transformationsThree dimensional geometric transformations
Three dimensional geometric transformationsshanthishyam
34 views28 slides
Tall-and-skinny Matrix Computations in MapReduce (ICME MR 2013) by
Tall-and-skinny Matrix Computations in MapReduce (ICME MR 2013)Tall-and-skinny Matrix Computations in MapReduce (ICME MR 2013)
Tall-and-skinny Matrix Computations in MapReduce (ICME MR 2013)Austin Benson
403 views63 slides
Computer graphics lab manual by
Computer graphics lab manualComputer graphics lab manual
Computer graphics lab manualUma mohan
362 views60 slides

More Related Content

What's hot

ECMAScript 6 major changes by
ECMAScript 6 major changesECMAScript 6 major changes
ECMAScript 6 major changeshayato
704 views14 slides
SE Computer, Programming Laboratory(210251) University of Pune by
SE Computer, Programming Laboratory(210251) University of PuneSE Computer, Programming Laboratory(210251) University of Pune
SE Computer, Programming Laboratory(210251) University of PuneBhavesh Shah
15.4K views38 slides
Monadologie by
MonadologieMonadologie
Monadologieleague
3.6K views58 slides
How to extend map? Or why we need collections redesign? - Scalar 2017 by
How to extend map? Or why we need collections redesign? - Scalar 2017How to extend map? Or why we need collections redesign? - Scalar 2017
How to extend map? Or why we need collections redesign? - Scalar 2017Szymon Matejczyk
554 views30 slides
Row Pattern Matching 12c MATCH_RECOGNIZE OOW14 by
Row Pattern Matching 12c MATCH_RECOGNIZE OOW14Row Pattern Matching 12c MATCH_RECOGNIZE OOW14
Row Pattern Matching 12c MATCH_RECOGNIZE OOW14stewashton
2.1K views34 slides
Datastructures asignment by
Datastructures asignmentDatastructures asignment
Datastructures asignmentsreekanth3dce
899 views16 slides

What's hot(20)

ECMAScript 6 major changes by hayato
ECMAScript 6 major changesECMAScript 6 major changes
ECMAScript 6 major changes
hayato704 views
SE Computer, Programming Laboratory(210251) University of Pune by Bhavesh Shah
SE Computer, Programming Laboratory(210251) University of PuneSE Computer, Programming Laboratory(210251) University of Pune
SE Computer, Programming Laboratory(210251) University of Pune
Bhavesh Shah15.4K views
Monadologie by league
MonadologieMonadologie
Monadologie
league3.6K views
How to extend map? Or why we need collections redesign? - Scalar 2017 by Szymon Matejczyk
How to extend map? Or why we need collections redesign? - Scalar 2017How to extend map? Or why we need collections redesign? - Scalar 2017
How to extend map? Or why we need collections redesign? - Scalar 2017
Szymon Matejczyk554 views
Row Pattern Matching 12c MATCH_RECOGNIZE OOW14 by stewashton
Row Pattern Matching 12c MATCH_RECOGNIZE OOW14Row Pattern Matching 12c MATCH_RECOGNIZE OOW14
Row Pattern Matching 12c MATCH_RECOGNIZE OOW14
stewashton2.1K views
Datastructures asignment by sreekanth3dce
Datastructures asignmentDatastructures asignment
Datastructures asignment
sreekanth3dce899 views
S1 3 derivadas_resueltas by jesquerrev1
S1 3 derivadas_resueltasS1 3 derivadas_resueltas
S1 3 derivadas_resueltas
jesquerrev1187 views
R data mining-Time Series Analysis with R by Dr. Volkan OBAN
R data mining-Time Series Analysis with RR data mining-Time Series Analysis with R
R data mining-Time Series Analysis with R
Dr. Volkan OBAN361 views
Gremlin's Graph Traversal Machinery by Marko Rodriguez
Gremlin's Graph Traversal MachineryGremlin's Graph Traversal Machinery
Gremlin's Graph Traversal Machinery
Marko Rodriguez8.3K views
Optimal Budget Allocation: Theoretical Guarantee and Efficient Algorithm by Tasuku Soma
Optimal Budget Allocation: Theoretical Guarantee and Efficient AlgorithmOptimal Budget Allocation: Theoretical Guarantee and Efficient Algorithm
Optimal Budget Allocation: Theoretical Guarantee and Efficient Algorithm
Tasuku Soma3.9K views
Maximizing Submodular Function over the Integer Lattice by Tasuku Soma
Maximizing Submodular Function over the Integer LatticeMaximizing Submodular Function over the Integer Lattice
Maximizing Submodular Function over the Integer Lattice
Tasuku Soma700 views

Viewers also liked

Low Touch Machine Learning with Leah McGuire (Salesforce) by
Low Touch Machine Learning with Leah McGuire (Salesforce)Low Touch Machine Learning with Leah McGuire (Salesforce)
Low Touch Machine Learning with Leah McGuire (Salesforce)Spark Summit
1.3K views23 slides
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval... by
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...Spark Summit
1.6K views30 slides
Feature Hashing for Scalable Machine Learning with Nick Pentreath by
Feature Hashing for Scalable Machine Learning with Nick PentreathFeature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick PentreathSpark Summit
2.1K views33 slides
Experimental Design for Distributed Machine Learning with Myles Baker by
Experimental Design for Distributed Machine Learning with Myles BakerExperimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles BakerDatabricks
2.8K views27 slides
Art of Feature Engineering for Data Science with Nabeel Sarwar by
Art of Feature Engineering for Data Science with Nabeel SarwarArt of Feature Engineering for Data Science with Nabeel Sarwar
Art of Feature Engineering for Data Science with Nabeel SarwarSpark Summit
2.1K views20 slides
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ... by
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
1.7K views90 slides

Viewers also liked(20)

Low Touch Machine Learning with Leah McGuire (Salesforce) by Spark Summit
Low Touch Machine Learning with Leah McGuire (Salesforce)Low Touch Machine Learning with Leah McGuire (Salesforce)
Low Touch Machine Learning with Leah McGuire (Salesforce)
Spark Summit1.3K views
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval... by Spark Summit
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Spark Summit1.6K views
Feature Hashing for Scalable Machine Learning with Nick Pentreath by Spark Summit
Feature Hashing for Scalable Machine Learning with Nick PentreathFeature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick Pentreath
Spark Summit2.1K views
Experimental Design for Distributed Machine Learning with Myles Baker by Databricks
Experimental Design for Distributed Machine Learning with Myles BakerExperimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles Baker
Databricks2.8K views
Art of Feature Engineering for Data Science with Nabeel Sarwar by Spark Summit
Art of Feature Engineering for Data Science with Nabeel SarwarArt of Feature Engineering for Data Science with Nabeel Sarwar
Art of Feature Engineering for Data Science with Nabeel Sarwar
Spark Summit2.1K views
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ... by Spark Summit
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
Spark Summit1.7K views
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem... by Spark Summit
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
Spark Summit3.4K views
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with... by Spark Summit
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
Spark Summit2.3K views
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra by Spark Summit
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit3.1K views
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa... by Spark Summit
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Spark Summit1.1K views
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library... by Spark Summit
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit2.8K views
One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl... by Spark Summit
One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl...One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl...
One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl...
Spark Summit840 views
Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K... by Spark Summit
Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...
Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...
Spark Summit2.8K views
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D... by Spark Summit
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
Spark Summit1.5K views
Histogram Equalized Heat Maps from Log Data via Apache Spark with Arvind Rao by Spark Summit
Histogram Equalized Heat Maps from Log Data via Apache Spark with Arvind RaoHistogram Equalized Heat Maps from Log Data via Apache Spark with Arvind Rao
Histogram Equalized Heat Maps from Log Data via Apache Spark with Arvind Rao
Spark Summit1.1K views
Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ... by Spark Summit
Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ...Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ...
Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ...
Spark Summit1.7K views
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning by Spark Summit
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningHandling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Spark Summit10.2K views
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa... by Spark Summit
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Spark Summit2.7K views
Apache Spark and Tensorflow as a Service with Jim Dowling by Spark Summit
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit2.1K views
Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang by Databricks
Optimal Strategies for Large Scale Batch ETL Jobs with Emma TangOptimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Databricks2.5K views

Similar to Building Machine Learning Algorithms on Apache Spark with William Benton

C++11 - A Change in Style - v2.0 by
C++11 - A Change in Style - v2.0C++11 - A Change in Style - v2.0
C++11 - A Change in Style - v2.0Yaser Zhian
1K views51 slides
JDBC for CSQL Database by
JDBC for CSQL DatabaseJDBC for CSQL Database
JDBC for CSQL Databasejitendral
1.4K views27 slides
Monads - Dublin Scala meetup by
Monads - Dublin Scala meetupMonads - Dublin Scala meetup
Monads - Dublin Scala meetupMikhail Girkin
764 views43 slides
Compiler Construction | Lecture 9 | Constraint Resolution by
Compiler Construction | Lecture 9 | Constraint ResolutionCompiler Construction | Lecture 9 | Constraint Resolution
Compiler Construction | Lecture 9 | Constraint ResolutionEelco Visser
447 views38 slides
Whats new in_csharp4 by
Whats new in_csharp4Whats new in_csharp4
Whats new in_csharp4Abed Bukhari
942 views26 slides
User Defined Aggregation in Apache Spark: A Love Story by
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryDatabricks
251 views50 slides

Similar to Building Machine Learning Algorithms on Apache Spark with William Benton(20)

C++11 - A Change in Style - v2.0 by Yaser Zhian
C++11 - A Change in Style - v2.0C++11 - A Change in Style - v2.0
C++11 - A Change in Style - v2.0
Yaser Zhian1K views
JDBC for CSQL Database by jitendral
JDBC for CSQL DatabaseJDBC for CSQL Database
JDBC for CSQL Database
jitendral1.4K views
Monads - Dublin Scala meetup by Mikhail Girkin
Monads - Dublin Scala meetupMonads - Dublin Scala meetup
Monads - Dublin Scala meetup
Mikhail Girkin764 views
Compiler Construction | Lecture 9 | Constraint Resolution by Eelco Visser
Compiler Construction | Lecture 9 | Constraint ResolutionCompiler Construction | Lecture 9 | Constraint Resolution
Compiler Construction | Lecture 9 | Constraint Resolution
Eelco Visser447 views
Whats new in_csharp4 by Abed Bukhari
Whats new in_csharp4Whats new in_csharp4
Whats new in_csharp4
Abed Bukhari942 views
User Defined Aggregation in Apache Spark: A Love Story by Databricks
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love Story
Databricks251 views
User Defined Aggregation in Apache Spark: A Love Story by Databricks
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love Story
Databricks402 views
Rewriting Java In Scala by Skills Matter
Rewriting Java In ScalaRewriting Java In Scala
Rewriting Java In Scala
Skills Matter1.5K views
The TclQuadcode Compiler by Donal Fellows
The TclQuadcode CompilerThe TclQuadcode Compiler
The TclQuadcode Compiler
Donal Fellows1.1K views
Stata Cheat Sheets (all) by Laura Hughes
Stata Cheat Sheets (all)Stata Cheat Sheets (all)
Stata Cheat Sheets (all)
Laura Hughes6.6K views
Gssl by ncct
GsslGssl
Gssl
ncct259 views
Scala @ TomTom by Eric Bowman
Scala @ TomTomScala @ TomTom
Scala @ TomTom
Eric Bowman1.3K views
Laziness, trampolines, monoids and other functional amenities: this is not yo... by Mario Fusco
Laziness, trampolines, monoids and other functional amenities: this is not yo...Laziness, trampolines, monoids and other functional amenities: this is not yo...
Laziness, trampolines, monoids and other functional amenities: this is not yo...
Mario Fusco7.1K views
Lecture#6 functions in c++ by NUST Stuff
Lecture#6 functions in c++Lecture#6 functions in c++
Lecture#6 functions in c++
NUST Stuff518 views
Uncertain Volatility Models by Swati Mital
Uncertain Volatility ModelsUncertain Volatility Models
Uncertain Volatility Models
Swati Mital2.9K views
Lecture 02: Preliminaries of Data structure by Nurjahan Nipa
Lecture 02: Preliminaries of Data structureLecture 02: Preliminaries of Data structure
Lecture 02: Preliminaries of Data structure
Nurjahan Nipa918 views

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang by
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
6.8K views32 slides
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M... by
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
4.3K views34 slides
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu by
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
3.1K views21 slides
Apache Spark and Tensorflow as a Service with Jim Dowling by
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
1.1K views44 slides
Next CERN Accelerator Logging Service with Jakub Wozniak by
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
2.3K views36 slides
Powering a Startup with Apache Spark with Kevin Kim by
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
787 views30 slides

More from Spark Summit(20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang by Spark Summit
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
Spark Summit6.8K views
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M... by Spark Summit
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
Spark Summit4.3K views
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu by Spark Summit
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit3.1K views
Apache Spark and Tensorflow as a Service with Jim Dowling by Spark Summit
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
Spark Summit1.1K views
Next CERN Accelerator Logging Service with Jakub Wozniak by Spark Summit
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
Spark Summit2.3K views
Powering a Startup with Apache Spark with Kevin Kim by Spark Summit
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
Spark Summit787 views
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra by Spark Summit
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit730 views
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—... by Spark Summit
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Spark Summit1.1K views
How Nielsen Utilized Databricks for Large-Scale Research and Development with... by Spark Summit
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
Spark Summit999 views
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov... by Spark Summit
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spark Summit2.3K views
Goal Based Data Production with Sim Simeonov by Spark Summit
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
Spark Summit746 views
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le... by Spark Summit
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Spark Summit1.5K views
Getting Ready to Use Redis with Apache Spark with Dvir Volk by Spark Summit
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Spark Summit3K views
Deduplication and Author-Disambiguation of Streaming Records via Supervised M... by Spark Summit
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit915 views
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization... by Spark Summit
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit1.7K views
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit... by Spark Summit
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
Spark Summit3K views
Variant-Apache Spark for Bioinformatics with Piotr Szul by Spark Summit
Variant-Apache Spark for Bioinformatics with Piotr SzulVariant-Apache Spark for Bioinformatics with Piotr Szul
Variant-Apache Spark for Bioinformatics with Piotr Szul
Spark Summit807 views
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed by Spark Summit
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Spark Summit2.2K views
Best Practices for Using Alluxio with Apache Spark with Gene Pang by Spark Summit
Best Practices for Using Alluxio with Apache Spark with Gene PangBest Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene Pang
Spark Summit2K views
Smack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad by Spark Summit
Smack Stack and Beyond—Building Fast Data Pipelines with Jorg SchadSmack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad
Smack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad
Spark Summit2.1K views

Recently uploaded

Introduction to Microsoft Fabric.pdf by
Introduction to Microsoft Fabric.pdfIntroduction to Microsoft Fabric.pdf
Introduction to Microsoft Fabric.pdfishaniuudeshika
24 views16 slides
Advanced_Recommendation_Systems_Presentation.pptx by
Advanced_Recommendation_Systems_Presentation.pptxAdvanced_Recommendation_Systems_Presentation.pptx
Advanced_Recommendation_Systems_Presentation.pptxneeharikasingh29
5 views9 slides
3196 The Case of The East River by
3196 The Case of The East River3196 The Case of The East River
3196 The Case of The East RiverErickANDRADE90
11 views4 slides
Understanding Hallucinations in LLMs - 2023 09 29.pptx by
Understanding Hallucinations in LLMs - 2023 09 29.pptxUnderstanding Hallucinations in LLMs - 2023 09 29.pptx
Understanding Hallucinations in LLMs - 2023 09 29.pptxGreg Makowski
13 views18 slides
Chapter 3b- Process Communication (1) (1)(1) (1).pptx by
Chapter 3b- Process Communication (1) (1)(1) (1).pptxChapter 3b- Process Communication (1) (1)(1) (1).pptx
Chapter 3b- Process Communication (1) (1)(1) (1).pptxayeshabaig2004
5 views30 slides
RIO GRANDE SUPPLY COMPANY INC, JAYSON.docx by
RIO GRANDE SUPPLY COMPANY INC, JAYSON.docxRIO GRANDE SUPPLY COMPANY INC, JAYSON.docx
RIO GRANDE SUPPLY COMPANY INC, JAYSON.docxJaysonGarabilesEspej
6 views3 slides

Recently uploaded(20)

Introduction to Microsoft Fabric.pdf by ishaniuudeshika
Introduction to Microsoft Fabric.pdfIntroduction to Microsoft Fabric.pdf
Introduction to Microsoft Fabric.pdf
ishaniuudeshika24 views
Advanced_Recommendation_Systems_Presentation.pptx by neeharikasingh29
Advanced_Recommendation_Systems_Presentation.pptxAdvanced_Recommendation_Systems_Presentation.pptx
Advanced_Recommendation_Systems_Presentation.pptx
3196 The Case of The East River by ErickANDRADE90
3196 The Case of The East River3196 The Case of The East River
3196 The Case of The East River
ErickANDRADE9011 views
Understanding Hallucinations in LLMs - 2023 09 29.pptx by Greg Makowski
Understanding Hallucinations in LLMs - 2023 09 29.pptxUnderstanding Hallucinations in LLMs - 2023 09 29.pptx
Understanding Hallucinations in LLMs - 2023 09 29.pptx
Greg Makowski13 views
Chapter 3b- Process Communication (1) (1)(1) (1).pptx by ayeshabaig2004
Chapter 3b- Process Communication (1) (1)(1) (1).pptxChapter 3b- Process Communication (1) (1)(1) (1).pptx
Chapter 3b- Process Communication (1) (1)(1) (1).pptx
ayeshabaig20045 views
Launch of the Knowledge Exchange Platform - Romina Boarini - 21 November 2023 by StatsCommunications
Launch of the Knowledge Exchange Platform - Romina Boarini - 21 November 2023Launch of the Knowledge Exchange Platform - Romina Boarini - 21 November 2023
Launch of the Knowledge Exchange Platform - Romina Boarini - 21 November 2023
Short Story Assignment by Kelly Nguyen by kellynguyen01
Short Story Assignment by Kelly NguyenShort Story Assignment by Kelly Nguyen
Short Story Assignment by Kelly Nguyen
kellynguyen0118 views
Building Real-Time Travel Alerts by Timothy Spann
Building Real-Time Travel AlertsBuilding Real-Time Travel Alerts
Building Real-Time Travel Alerts
Timothy Spann109 views
Vikas 500 BIG DATA TECHNOLOGIES LAB.pdf by vikas12611618
Vikas 500 BIG DATA TECHNOLOGIES LAB.pdfVikas 500 BIG DATA TECHNOLOGIES LAB.pdf
Vikas 500 BIG DATA TECHNOLOGIES LAB.pdf
vikas126116188 views
Organic Shopping in Google Analytics 4.pdf by GA4 Tutorials
Organic Shopping in Google Analytics 4.pdfOrganic Shopping in Google Analytics 4.pdf
Organic Shopping in Google Analytics 4.pdf
GA4 Tutorials10 views
Data structure and algorithm. by Abdul salam
Data structure and algorithm. Data structure and algorithm.
Data structure and algorithm.
Abdul salam 18 views
RuleBookForTheFairDataEconomy.pptx by noraelstela1
RuleBookForTheFairDataEconomy.pptxRuleBookForTheFairDataEconomy.pptx
RuleBookForTheFairDataEconomy.pptx
noraelstela167 views
Cross-network in Google Analytics 4.pdf by GA4 Tutorials
Cross-network in Google Analytics 4.pdfCross-network in Google Analytics 4.pdf
Cross-network in Google Analytics 4.pdf
GA4 Tutorials6 views
Survey on Factuality in LLM's.pptx by NeethaSherra1
Survey on Factuality in LLM's.pptxSurvey on Factuality in LLM's.pptx
Survey on Factuality in LLM's.pptx
NeethaSherra15 views

Building Machine Learning Algorithms on Apache Spark with William Benton

  • 1. Building machine learning algorithms on Apache Spark William Benton (@willb) Red Hat, Inc. Session hashtag: #EUds5
  • 6. #EUds5 Forecast Introducing our case study: self-organizing maps Parallel implementations for partitioned collections (in particular, RDDs) Beyond the RDD: data frames and ML pipelines Practical considerations and key takeaways 6
  • 14. #EUds5 Training self-organizing maps 12 while t < maxupdates: random.shuffle(examples) for ex in examples: t = t + 1 if t == maxupdates: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt
  • 15. #EUds5 Training self-organizing maps 12 while t < maxupdates: random.shuffle(examples) for ex in examples: t = t + 1 if t == maxupdates: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt process the training set in random order
  • 16. #EUds5 Training self-organizing maps 12 while t < maxupdates: random.shuffle(examples) for ex in examples: t = t + 1 if t == maxupdates: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt process the training set in random order the neighborhood size controls how much of the map around the BMU is affected
  • 17. #EUds5 Training self-organizing maps 12 while t < maxupdates: random.shuffle(examples) for ex in examples: t = t + 1 if t == maxupdates: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt process the training set in random order the neighborhood size controls how much of the map around the BMU is affected the learning rate controls how much closer to the example each unit gets
  • 19. #EUds5 Historical aside: Amdahl’s Law 14 1 1 - p lim So =sp —> ∞
  • 20. #EUds5 What forces serial execution? 15
  • 21. #EUds5 What forces serial execution? 16
  • 22. #EUds5 What forces serial execution? 17 state[t+1] = combine(state[t], x)
  • 23. #EUds5 What forces serial execution? 18 state[t+1] = combine(state[t], x)
  • 24. #EUds5 What forces serial execution? 19 f1: (T, T) => T f2: (T, U) => T
  • 25. #EUds5 What forces serial execution? 19 f1: (T, T) => T f2: (T, U) => T
  • 26. #EUds5 What forces serial execution? 19 f1: (T, T) => T f2: (T, U) => T
  • 27. #EUds5 How can we fix these? 20 a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
  • 28. #EUds5 How can we fix these? 21 a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
  • 29. #EUds5 How can we fix these? 22 a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
  • 30. #EUds5 How can we fix these? 23 a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
  • 31. #EUds5 How can we fix these? 24 L-BGFSSGD a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
  • 32. #EUds5 How can we fix these? 25 SGD L-BGFS a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c) There will be examples of each of these approaches for many problems in the literature and in open-source code!
  • 33. #EUds5 Implementing atop RDDs We’ll start with a batch implementation of our technique: 26 for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods)
  • 34. #EUds5 Implementing atop RDDs 27 for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods) Each batch produces a model that can be averaged with other models
  • 35. #EUds5 Implementing atop RDDs 28 for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods) Each batch produces a model that can be averaged with other models partition
  • 36. #EUds5 Implementing atop RDDs 29 for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods) This won’t always work!
  • 37. #EUds5 An implementation template 30 var nextModel = initialModel for (int i = 0; i < iterations; i++) { val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) }
  • 38. #EUds5 An implementation template 31 var nextModel = initialModel for (int i = 0; i < iterations; i++) { val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) } var nextModel = initialModel for (int i = 0; i < iterations; i++) { val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) } “fold”: update the state for this partition with a single new example
  • 39. #EUds5 An implementation template 32 var nextModel = initialModel for (int i = 0; i < iterations; i++) { val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) } var nextModel = initialModel for (int i = 0; i < iterations; i++) { val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) } “reduce”: combine the states from two partitions
  • 40. #EUds5 An implementation template 33 var nextModel = initialModel for (int i = 0; i < iterations; i++) { val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) } this will cause the model object to be serialized with the closure nextModel
  • 41. #EUds5 An implementation template 34 var nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist } broadcast the current working model for this iterationvar nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist } get the value of the broadcast variable
  • 42. #EUds5 An implementation template 35 var nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist } remove the stale broadcasted model var nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist
  • 43. #EUds5 An implementation template 36 var nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist } var nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist } the wrong implementation of the right interface
  • 44. #EUds5 workersdriver (aggregate) Implementing on RDDs 37 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕
  • 45. #EUds5 workersdriver (aggregate) Implementing on RDDs 38 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕
  • 47. #EUds5 workersdriver (treeAggregate) Implementing on RDDs 39 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕
  • 48. #EUds5 workersdriver (treeAggregate) Implementing on RDDs 40 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕
  • 49. #EUds5 workersdriver (treeAggregate) Implementing on RDDs 41 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕
  • 50. #EUds5 workersdriver (treeAggregate) Implementing on RDDs 42 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ driver (treeAggregate) ⊕ ⊕ ⊕ ⊕
  • 51. #EUds5 workersdriver (treeAggregate) Implementing on RDDs 43 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕
  • 55. #EUds5 Beyond the RDD: Data frames and ML Pipelines
  • 56. #EUds5 RDDs: some good parts 47 val rdd: RDD[String] = /* ... */ rdd.map(_ * 3.0).collect() val df: DataFrame = /* data frame with one String-valued column */ df.select($"_1" * 3.0).show()
  • 57. #EUds5 RDDs: some good parts 48 val rdd: RDD[String] = /* ... */ rdd.map(_ * 3.0).collect() val df: DataFrame = /* data frame with one String-valued column */ df.select($"_1" * 3.0).show() doesn’t compile
  • 58. #EUds5 RDDs: some good parts 49 val rdd: RDD[String] = /* ... */ rdd.map(_ * 3.0).collect() val df: DataFrame = /* data frame with one String-valued column */ df.select($"_1" * 3.0).show() doesn’t compile
  • 59. #EUds5 RDDs: some good parts 50 val rdd: RDD[String] = /* ... */ rdd.map(_ * 3.0).collect() val df: DataFrame = /* data frame with one String-valued column */ df.select($"_1" * 3.0).show() doesn’t compile crashes at runtime
  • 60. #EUds5 RDDs: some good parts 51 rdd.map { vec => (vec, model.value.closestWithSimilarity(vec)) } val predict = udf ((vec: SV) => model.value.closestWithSimilarity(vec)) df.withColumn($"predictions", predict($"features"))
  • 61. #EUds5 RDDs: some good parts 52 rdd.map { vec => (vec, model.value.closestWithSimilarity(vec)) } val predict = udf ((vec: SV) => model.value.closestWithSimilarity(vec)) df.withColumn($"predictions", predict($"features"))
  • 62. #EUds5 RDDs versus query planning 53 val numbers1 = sc.parallelize(1 to 100000000) val numbers2 = sc.parallelize(1 to 1000000000) numbers1.cartesian(numbers2) .map((x, y) => (x, y, expensive(x, y))) .filter((x, y, _) => isPrime(x), isPrime(y))
  • 63. #EUds5 RDDs versus query planning 54 val numbers1 = sc.parallelize(1 to 100000000) val numbers2 = sc.parallelize(1 to 1000000000) numbers1.filter(isPrime(_)) .cartesian(numbers2.filter(isPrime(_))) .map((x, y) => (x, y, expensive(x, y)))
  • 64. #EUds5 RDDs and the JVM heap 55 val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))
  • 65. #EUds5 RDDs and the Java heap 56 val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0)) class pointer flags size locks element pointer element pointer class pointer flags size locks 1.0 class pointer flags size locks 3.0 4.0 2.0
  • 66. #EUds5 RDDs and the Java heap 57 val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0)) class pointer flags size locks element pointer element pointer class pointer flags size locks 1.0 class pointer flags size locks 3.0 4.0 2.0 32 bytes of data…
  • 67. #EUds5 RDDs and the Java heap 58 val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0)) class pointer flags size locks element pointer element pointer class pointer flags size locks 1.0 class pointer flags size locks 3.0 4.0 2.0 …and 64 bytes of overhead! 32 bytes of data…
  • 68. #EUds5 ML pipelines: a quick example 59 from pyspark.ml.clustering import KMeans K, SEED = 100, 0xdea110c8 randomDF = make_random_df() kmeans = KMeans().setK(K).setSeed(SEED).setFeaturesCol("features") model = kmeans.fit(randomDF) withPredictions = model.transform(randomDF).select("x", "y", "prediction")
  • 69. #EUds5 Working with ML pipelines 60 estimator.fit(df)
  • 70. #EUds5 Working with ML pipelines 61 estimator.fit(df) model.transform(df)
  • 71. #EUds5 Working with ML pipelines 62 model.transform(df)
  • 72. #EUds5 Working with ML pipelines 63 model.transform(df)
  • 73. #EUds5 Working with ML pipelines 64 estimator.fit(df) model.transform(df)
  • 74. #EUds5 Working with ML pipelines 64 estimator.fit(df) model.transform(df) inputCol epochs seed
  • 75. #EUds5 Working with ML pipelines 64 estimator.fit(df) model.transform(df) inputCol epochs seed outputCol
  • 76. #EUds5 Defining parameters 65 private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
  • 77. #EUds5 Defining parameters 66 private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ... private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
  • 78. #EUds5 private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ... Defining parameters 67 private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
  • 79. #EUds5 Defining parameters 68 private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ... private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
  • 80. #EUds5 Defining parameters 69 private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ... private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
  • 81. #EUds5 Defining parameters 70 private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ... private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
  • 82. #EUds5 Don’t repeat yourself 71 /** * Common params for KMeans and KMeansModel */ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFeaturesCol with HasSeed with HasPredictionCol with HasTol { /* ... */ }
  • 88. #EUds5 Validate and transform at once 73 def transformSchema(schema: StructType): StructType = { // check that the input columns exist... // ...and are the proper type // ...and that the output columns don’t exist // ...and then make a new schema }
  • 89. #EUds5 Validate and transform at once 74 def transformSchema(schema: StructType): StructType = { // check that the input columns exist… require(schema.fieldNames.contains($(featuresCol))) // ...and are the proper type // ...and that the output columns don’t exist // ...and then make a new schema }
  • 90. #EUds5 Validate and transform at once 75 def transformSchema(schema: StructType): StructType = { // check that the input columns exist... // ...and are the proper type schema($(featuresCol)) match { case sf: StructField => require(sf.dataType.equals(VectorType)) } // ...and that the output columns don’t exist // ...and then make a new schema }
  • 91. #EUds5 Validate and transform at once 76 def transformSchema(schema: StructType): StructType = { // check that the input columns exist… // ...and are the proper type // ...and that the output columns don’t exist require(!schema.fieldNames.contains($(predictionCol))) require(!schema.fieldNames.contains($(similarityCol))) // ...and then make a new schema }
  • 92. #EUds5 Validate and transform at once 77 def transformSchema(schema: StructType): StructType = { // check that the input columns exist… // ...and are the proper type // ...and that the output columns don’t exist // ...and then make a new schema schema.add($(predictionCol), "int") .add($(similarityCol), "double") }
  • 93. #EUds5 Training on data frames 78 def fit(examples: DataFrame) = { import examples.sparkSession.implicits._ import org.apache.spark.ml.linalg.{Vector=>SV} val dfexamples = examples.select($(exampleCol)).rdd.map { case Row(sv: SV) => sv } /* construct a model object with the result of training */ new SOMModel(train(dfexamples, $(x), $(y))) }
  • 95. #EUds5 Retire your visibility hacks 80 package org.apache.spark.ml.hacks object Hacks { import org.apache.spark.ml.linalg.VectorUDT val vectorUDT = new VectorUDT }
  • 96. #EUds5 Retire your visibility hacks 81 package org.apache.spark.ml.linalg /* imports, etc., are elided ... */ @Since("2.0.0") @DeveloperApi object SQLDataTypes { val VectorType: DataType = new VectorUDT val MatrixType: DataType = new MatrixUDT }
  • 97. #EUds5 Caching training data 82 val wasUncached = examples.storageLevel == StorageLevel.NONE if (wasUncached) { examples.cache() } /* actually train here */ if (wasUncached) { examples.unpersist() }
  • 98. #EUds5 Improve serial execution times Are you repeatedly comparing training data to a model that only changes once per iteration? Consider caching norms. Are you doing a lot of dot products in a for loop? Consider replacing these loops with a matrix-vector multiplication. Seek to limit the number of library invocations you make and thus the time you spend copying data to and from your linear algebra library. 83
  • 99. #EUds5 Key takeaways There are several techniques you can use to develop parallel implementations of machine learning algorithms. The RDD API may not be your favorite way to interact with Spark as a user, but it can be extremely valuable if you’re developing libraries for Spark. As a library developer, you might need to rely on developer APIs and dive in to Spark’s source code, but things are getting easier with each release! 84