SlideShare a Scribd company logo
1 of 100
Download to read offline
Building machine learning
algorithms on Apache Spark
William Benton (@willb)
Red Hat, Inc.
Session hashtag: #EUds5
#EUds5
Motivation
2
#EUds5
Motivation
3
#EUds5
Motivation
4
#EUds5
Motivation
5
#EUds5
Forecast
Introducing our case study: self-organizing maps
Parallel implementations for partitioned collections (in particular, RDDs)
Beyond the RDD: data frames and ML pipelines
Practical considerations and key takeaways
6
#EUds5
Introducing self-organizing maps
#EUds58
#EUds58
#EUds5
Training self-organizing maps
9
#EUds5
Training self-organizing maps
10
#EUds5
Training self-organizing maps
11
#EUds5
Training self-organizing maps
11
#EUds5
Training self-organizing maps
12
while t < maxupdates:
random.shuffle(examples)
for ex in examples:
t = t + 1
if t == maxupdates:
break
bestMatch = closest(somt, ex)
for (unit, wt) in neighborhood(bestMatch, sigma(t)):
somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt
#EUds5
Training self-organizing maps
12
while t < maxupdates:
random.shuffle(examples)
for ex in examples:
t = t + 1
if t == maxupdates:
break
bestMatch = closest(somt, ex)
for (unit, wt) in neighborhood(bestMatch, sigma(t)):
somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt
process the training
set in random order
#EUds5
Training self-organizing maps
12
while t < maxupdates:
random.shuffle(examples)
for ex in examples:
t = t + 1
if t == maxupdates:
break
bestMatch = closest(somt, ex)
for (unit, wt) in neighborhood(bestMatch, sigma(t)):
somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt
process the training
set in random order
the neighborhood size controls
how much of the map around
the BMU is affected
#EUds5
Training self-organizing maps
12
while t < maxupdates:
random.shuffle(examples)
for ex in examples:
t = t + 1
if t == maxupdates:
break
bestMatch = closest(somt, ex)
for (unit, wt) in neighborhood(bestMatch, sigma(t)):
somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt
process the training
set in random order
the neighborhood size controls
how much of the map around
the BMU is affected
the learning rate controls
how much closer to the
example each unit gets
#EUds5
Parallel implementations for
partitioned collections
#EUds5
Historical aside: Amdahl’s Law
14
1
1 - p
lim So =sp —> ∞
#EUds5
What forces serial execution?
15
#EUds5
What forces serial execution?
16
#EUds5
What forces serial execution?
17
state[t+1] =
combine(state[t], x)
#EUds5
What forces serial execution?
18
state[t+1] =
combine(state[t], x)
#EUds5
What forces serial execution?
19
f1: (T, T) => T
f2: (T, U) => T
#EUds5
What forces serial execution?
19
f1: (T, T) => T
f2: (T, U) => T
#EUds5
What forces serial execution?
19
f1: (T, T) => T
f2: (T, U) => T
#EUds5
How can we fix these?
20
a ⊕ b = b ⊕ a
(a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
#EUds5
How can we fix these?
21
a ⊕ b = b ⊕ a
(a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
#EUds5
How can we fix these?
22
a ⊕ b = b ⊕ a
(a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
#EUds5
How can we fix these?
23
a ⊕ b = b ⊕ a
(a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
#EUds5
How can we fix these?
24
L-BGFSSGD
a ⊕ b = b ⊕ a
(a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
#EUds5
How can we fix these?
25
SGD L-BGFS
a ⊕ b = b ⊕ a
(a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
There will be examples of each
of these approaches for many
problems in the literature and
in open-source code!
#EUds5
Implementing atop RDDs
We’ll start with a batch implementation of our technique:
26
for t in (1 to iterations):
state = newState()
for ex in examples:
bestMatch = closest(somt-1, ex)
hood = neighborhood(bestMatch, sigma(t))
state.matches += ex * hood
state.hoods += hood
somt = newSOM(state.matches / state.hoods)
#EUds5
Implementing atop RDDs
27
for t in (1 to iterations):
state = newState()
for ex in examples:
bestMatch = closest(somt-1, ex)
hood = neighborhood(bestMatch, sigma(t))
state.matches += ex * hood
state.hoods += hood
somt = newSOM(state.matches / state.hoods)
Each batch produces a model that
can be averaged with other models
#EUds5
Implementing atop RDDs
28
for t in (1 to iterations):
state = newState()
for ex in examples:
bestMatch = closest(somt-1, ex)
hood = neighborhood(bestMatch, sigma(t))
state.matches += ex * hood
state.hoods += hood
somt = newSOM(state.matches / state.hoods)
Each batch produces a model that
can be averaged with other models
partition
#EUds5
Implementing atop RDDs
29
for t in (1 to iterations):
state = newState()
for ex in examples:
bestMatch = closest(somt-1, ex)
hood = neighborhood(bestMatch, sigma(t))
state.matches += ex * hood
state.hoods += hood
somt = newSOM(state.matches / state.hoods)
This won’t always work!
#EUds5
An implementation template
30
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(nextModel.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
}
#EUds5
An implementation template
31
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(nextModel.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
}
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(nextModel.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
}
“fold”: update the state
for this partition with a
single new example
#EUds5
An implementation template
32
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(nextModel.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
}
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(nextModel.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
}
“reduce”: combine the
states from two partitions
#EUds5
An implementation template
33
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(nextModel.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
}
this will cause the model object
to be serialized with the closure
nextModel
#EUds5
An implementation template
34
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val current = sc.broadcast(nextModel)
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(current.value.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
current.unpersist
}
broadcast the current working
model for this iterationvar nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val current = sc.broadcast(nextModel)
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(current.value.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
current.unpersist
} get the value of the
broadcast variable
#EUds5
An implementation template
35
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val current = sc.broadcast(nextModel)
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(current.value.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
current.unpersist
} remove the stale
broadcasted model
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val current = sc.broadcast(nextModel)
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(current.value.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
current.unpersist
#EUds5
An implementation template
36
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val current = sc.broadcast(nextModel)
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(current.value.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
current.unpersist
}
var nextModel = initialModel
for (int i = 0; i < iterations; i++) {
val current = sc.broadcast(nextModel)
val newState = examples.aggregate(ModelState.empty()) {
{ case (state: ModelState, example: Example) =>
state.update(current.value.lookup(example, i), example) }
{ case (s1: ModelState, s2: ModelState) => s1.combine(s2) }
}
nextModel = modelFromState(newState)
current.unpersist
}
the wrong implementation
of the right interface
#EUds5
workersdriver (aggregate)
Implementing on RDDs
37
⊕ ⊕ ⊕ ⊕ ⊕ ⊕
⊕ ⊕ ⊕ ⊕ ⊕
⊕ ⊕ ⊕ ⊕
#EUds5
workersdriver (aggregate)
Implementing on RDDs
38
⊕ ⊕ ⊕ ⊕ ⊕ ⊕
⊕ ⊕ ⊕ ⊕ ⊕
⊕ ⊕ ⊕ ⊕
#EUds5
workersdriver (aggregate)
Implementing on RDDs
38
#EUds5
workersdriver (treeAggregate)
Implementing on RDDs
39
⊕
⊕
⊕
⊕
⊕ ⊕
⊕ ⊕
#EUds5
workersdriver (treeAggregate)
Implementing on RDDs
40
⊕
⊕
⊕
⊕
⊕ ⊕
⊕ ⊕
#EUds5
workersdriver (treeAggregate)
Implementing on RDDs
41
⊕
⊕
⊕
⊕
⊕ ⊕
⊕ ⊕
⊕
⊕
⊕
⊕
#EUds5
workersdriver (treeAggregate)
Implementing on RDDs
42
⊕
⊕
⊕
⊕
⊕
⊕
⊕ ⊕
driver (treeAggregate)
⊕
⊕
⊕
⊕
#EUds5
workersdriver (treeAggregate)
Implementing on RDDs
43
⊕
⊕
⊕
⊕
⊕ ⊕
⊕ ⊕
⊕ ⊕
#EUds5
workersdriver (treeAggregate)
Implementing on RDDs
44
⊕ ⊕
#EUds5
driver (treeAggregate) workers
Implementing on RDDs
45
⊕ ⊕
#EUds5
driver (treeAggregate) workers
Implementing on RDDs
45
⊕ ⊕⊕
#EUds5
Beyond the RDD: Data frames and
ML Pipelines
#EUds5
RDDs: some good parts
47
val rdd: RDD[String] = /* ... */
rdd.map(_ * 3.0).collect()
val df: DataFrame = /* data frame with one String-valued column */
df.select($"_1" * 3.0).show()
#EUds5
RDDs: some good parts
48
val rdd: RDD[String] = /* ... */
rdd.map(_ * 3.0).collect()
val df: DataFrame = /* data frame with one String-valued column */
df.select($"_1" * 3.0).show()
doesn’t compile
#EUds5
RDDs: some good parts
49
val rdd: RDD[String] = /* ... */
rdd.map(_ * 3.0).collect()
val df: DataFrame = /* data frame with one String-valued column */
df.select($"_1" * 3.0).show()
doesn’t compile
#EUds5
RDDs: some good parts
50
val rdd: RDD[String] = /* ... */
rdd.map(_ * 3.0).collect()
val df: DataFrame = /* data frame with one String-valued column */
df.select($"_1" * 3.0).show()
doesn’t compile
crashes at runtime
#EUds5
RDDs: some good parts
51
rdd.map {
vec => (vec, model.value.closestWithSimilarity(vec))
}
val predict = udf ((vec: SV) =>
model.value.closestWithSimilarity(vec))
df.withColumn($"predictions", predict($"features"))
#EUds5
RDDs: some good parts
52
rdd.map {
vec => (vec, model.value.closestWithSimilarity(vec))
}
val predict = udf ((vec: SV) =>
model.value.closestWithSimilarity(vec))
df.withColumn($"predictions", predict($"features"))
#EUds5
RDDs versus query planning
53
val numbers1 = sc.parallelize(1 to 100000000)
val numbers2 = sc.parallelize(1 to 1000000000)
numbers1.cartesian(numbers2)
.map((x, y) => (x, y, expensive(x, y)))
.filter((x, y, _) => isPrime(x), isPrime(y))
#EUds5
RDDs versus query planning
54
val numbers1 = sc.parallelize(1 to 100000000)
val numbers2 = sc.parallelize(1 to 1000000000)
numbers1.filter(isPrime(_))
.cartesian(numbers2.filter(isPrime(_)))
.map((x, y) => (x, y, expensive(x, y)))
#EUds5
RDDs and the JVM heap
55
val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))
#EUds5
RDDs and the Java heap
56
val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))
class
pointer flags size locks element pointer element pointer
class
pointer flags size locks 1.0
class
pointer flags size locks 3.0 4.0
2.0
#EUds5
RDDs and the Java heap
57
val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))
class
pointer flags size locks element pointer element pointer
class
pointer flags size locks 1.0
class
pointer flags size locks 3.0 4.0
2.0 32 bytes of data…
#EUds5
RDDs and the Java heap
58
val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))
class
pointer flags size locks element pointer element pointer
class
pointer flags size locks 1.0
class
pointer flags size locks 3.0 4.0
2.0
…and 64 bytes
of overhead!
32 bytes of data…
#EUds5
ML pipelines: a quick example
59
from pyspark.ml.clustering import KMeans
K, SEED = 100, 0xdea110c8
randomDF = make_random_df()
kmeans = KMeans().setK(K).setSeed(SEED).setFeaturesCol("features")
model = kmeans.fit(randomDF)
withPredictions = model.transform(randomDF).select("x", "y", "prediction")
#EUds5
Working with ML pipelines
60
estimator.fit(df)
#EUds5
Working with ML pipelines
61
estimator.fit(df) model.transform(df)
#EUds5
Working with ML pipelines
62
model.transform(df)
#EUds5
Working with ML pipelines
63
model.transform(df)
#EUds5
Working with ML pipelines
64
estimator.fit(df) model.transform(df)
#EUds5
Working with ML pipelines
64
estimator.fit(df) model.transform(df)
inputCol
epochs
seed
#EUds5
Working with ML pipelines
64
estimator.fit(df) model.transform(df)
inputCol
epochs
seed
outputCol
#EUds5
Defining parameters
65
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
#EUds5
Defining parameters
66
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
#EUds5
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
Defining parameters
67
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
#EUds5
Defining parameters
68
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
#EUds5
Defining parameters
69
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
#EUds5
Defining parameters
70
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
private[som] trait SOMParams extends Params
with DefaultParamsWritable {
final val x: IntParam =
new IntParam(this, "x", "width of self-organizing map (>= 1)",
ParamValidators.gtEq(1))
final def getX: Int = $(x)
final def setX(value: Int): this.type = set(x, value)
// ...
#EUds5
Don’t repeat yourself
71
/**
* Common params for KMeans and KMeansModel
*/
private[clustering] trait KMeansParams extends Params
with HasMaxIter with HasFeaturesCol
with HasSeed with HasPredictionCol with HasTol { /* ... */ }
#EUds5
Estimators and transformers
72
estimator.fit(df)
#EUds5
Estimators and transformers
72
estimator.fit(df)
#EUds5
Estimators and transformers
72
estimator.fit(df) model.transform(df)
#EUds5
Estimators and transformers
72
estimator.fit(df) model.transform(df)
#EUds5
Estimators and transformers
72
estimator.fit(df) model.transform(df)
#EUds5
Validate and transform at once
73
def transformSchema(schema: StructType):
StructType = {
// check that the input columns exist...
// ...and are the proper type
// ...and that the output columns don’t exist
// ...and then make a new schema
}
#EUds5
Validate and transform at once
74
def transformSchema(schema: StructType):
StructType = {
// check that the input columns exist…
require(schema.fieldNames.contains($(featuresCol)))
// ...and are the proper type
// ...and that the output columns don’t exist
// ...and then make a new schema
}
#EUds5
Validate and transform at once
75
def transformSchema(schema: StructType):
StructType = {
// check that the input columns exist...
// ...and are the proper type
schema($(featuresCol)) match {
case sf: StructField => require(sf.dataType.equals(VectorType))
}
// ...and that the output columns don’t exist
// ...and then make a new schema
}
#EUds5
Validate and transform at once
76
def transformSchema(schema: StructType):
StructType = {
// check that the input columns exist…
// ...and are the proper type
// ...and that the output columns don’t exist
require(!schema.fieldNames.contains($(predictionCol)))
require(!schema.fieldNames.contains($(similarityCol)))
// ...and then make a new schema
}
#EUds5
Validate and transform at once
77
def transformSchema(schema: StructType):
StructType = {
// check that the input columns exist…
// ...and are the proper type
// ...and that the output columns don’t exist
// ...and then make a new schema
schema.add($(predictionCol), "int")
.add($(similarityCol), "double")
}
#EUds5
Training on data frames
78
def fit(examples: DataFrame) = {
import examples.sparkSession.implicits._
import org.apache.spark.ml.linalg.{Vector=>SV}
val dfexamples = examples.select($(exampleCol)).rdd.map {
case Row(sv: SV) => sv
}
/* construct a model object with the result of training */
new SOMModel(train(dfexamples, $(x), $(y)))
}
#EUds5
Practical considerations

and key takeaways
#EUds5
Retire your visibility hacks
80
package org.apache.spark.ml.hacks
object Hacks {
import org.apache.spark.ml.linalg.VectorUDT
val vectorUDT = new VectorUDT
}
#EUds5
Retire your visibility hacks
81
package org.apache.spark.ml.linalg
/* imports, etc., are elided ... */
@Since("2.0.0")
@DeveloperApi
object SQLDataTypes {
val VectorType: DataType = new VectorUDT
val MatrixType: DataType = new MatrixUDT
}
#EUds5
Caching training data
82
val wasUncached = examples.storageLevel == StorageLevel.NONE
if (wasUncached) { examples.cache() }
/* actually train here */
if (wasUncached) { examples.unpersist() }
#EUds5
Improve serial execution times
Are you repeatedly comparing training data to a model that only changes
once per iteration? Consider caching norms.
Are you doing a lot of dot products in a for loop? Consider replacing
these loops with a matrix-vector multiplication.
Seek to limit the number of library invocations you make and thus the
time you spend copying data to and from your linear algebra library.
83
#EUds5
Key takeaways
There are several techniques you can use to develop parallel
implementations of machine learning algorithms.
The RDD API may not be your favorite way to interact with Spark as a user,
but it can be extremely valuable if you’re developing libraries for Spark.
As a library developer, you might need to rely on developer APIs and dive
in to Spark’s source code, but things are getting easier with each release!
84
#EUds5
@willb • willb@redhat.com
https://chapeau.freevariable.com
https://radanalytics.io
Thanks!

More Related Content

What's hot

ECMAScript 6 major changes
ECMAScript 6 major changesECMAScript 6 major changes
ECMAScript 6 major changeshayato
 
SE Computer, Programming Laboratory(210251) University of Pune
SE Computer, Programming Laboratory(210251) University of PuneSE Computer, Programming Laboratory(210251) University of Pune
SE Computer, Programming Laboratory(210251) University of PuneBhavesh Shah
 
Monadologie
MonadologieMonadologie
Monadologieleague
 
How to extend map? Or why we need collections redesign? - Scalar 2017
How to extend map? Or why we need collections redesign? - Scalar 2017How to extend map? Or why we need collections redesign? - Scalar 2017
How to extend map? Or why we need collections redesign? - Scalar 2017Szymon Matejczyk
 
Row Pattern Matching 12c MATCH_RECOGNIZE OOW14
Row Pattern Matching 12c MATCH_RECOGNIZE OOW14Row Pattern Matching 12c MATCH_RECOGNIZE OOW14
Row Pattern Matching 12c MATCH_RECOGNIZE OOW14stewashton
 
Datastructures asignment
Datastructures asignmentDatastructures asignment
Datastructures asignmentsreekanth3dce
 
Taking your side effects aside
Taking your side effects asideTaking your side effects aside
Taking your side effects aside💡 Tomasz Kogut
 
Table of Useful R commands.
Table of Useful R commands.Table of Useful R commands.
Table of Useful R commands.Dr. Volkan OBAN
 
S1 3 derivadas_resueltas
S1 3 derivadas_resueltasS1 3 derivadas_resueltas
S1 3 derivadas_resueltasjesquerrev1
 
R data mining-Time Series Analysis with R
R data mining-Time Series Analysis with RR data mining-Time Series Analysis with R
R data mining-Time Series Analysis with RDr. Volkan OBAN
 
Gremlin's Graph Traversal Machinery
Gremlin's Graph Traversal MachineryGremlin's Graph Traversal Machinery
Gremlin's Graph Traversal MachineryMarko Rodriguez
 
Effector: we need to go deeper
Effector: we need to go deeperEffector: we need to go deeper
Effector: we need to go deeperVictor Didenko
 
Optimal Budget Allocation: Theoretical Guarantee and Efficient Algorithm
Optimal Budget Allocation: Theoretical Guarantee and Efficient AlgorithmOptimal Budget Allocation: Theoretical Guarantee and Efficient Algorithm
Optimal Budget Allocation: Theoretical Guarantee and Efficient AlgorithmTasuku Soma
 
Maximizing Submodular Function over the Integer Lattice
Maximizing Submodular Function over the Integer LatticeMaximizing Submodular Function over the Integer Lattice
Maximizing Submodular Function over the Integer LatticeTasuku Soma
 

What's hot (20)

ECMAScript 6 major changes
ECMAScript 6 major changesECMAScript 6 major changes
ECMAScript 6 major changes
 
SE Computer, Programming Laboratory(210251) University of Pune
SE Computer, Programming Laboratory(210251) University of PuneSE Computer, Programming Laboratory(210251) University of Pune
SE Computer, Programming Laboratory(210251) University of Pune
 
Monadologie
MonadologieMonadologie
Monadologie
 
How to extend map? Or why we need collections redesign? - Scalar 2017
How to extend map? Or why we need collections redesign? - Scalar 2017How to extend map? Or why we need collections redesign? - Scalar 2017
How to extend map? Or why we need collections redesign? - Scalar 2017
 
Row Pattern Matching 12c MATCH_RECOGNIZE OOW14
Row Pattern Matching 12c MATCH_RECOGNIZE OOW14Row Pattern Matching 12c MATCH_RECOGNIZE OOW14
Row Pattern Matching 12c MATCH_RECOGNIZE OOW14
 
Datastructures asignment
Datastructures asignmentDatastructures asignment
Datastructures asignment
 
Taking your side effects aside
Taking your side effects asideTaking your side effects aside
Taking your side effects aside
 
Table of Useful R commands.
Table of Useful R commands.Table of Useful R commands.
Table of Useful R commands.
 
S1 3 derivadas_resueltas
S1 3 derivadas_resueltasS1 3 derivadas_resueltas
S1 3 derivadas_resueltas
 
R data mining-Time Series Analysis with R
R data mining-Time Series Analysis with RR data mining-Time Series Analysis with R
R data mining-Time Series Analysis with R
 
2 d matrices
2 d matrices2 d matrices
2 d matrices
 
numdoc
numdocnumdoc
numdoc
 
QUEUES
QUEUESQUEUES
QUEUES
 
Reactive Collections
Reactive CollectionsReactive Collections
Reactive Collections
 
Gremlin's Graph Traversal Machinery
Gremlin's Graph Traversal MachineryGremlin's Graph Traversal Machinery
Gremlin's Graph Traversal Machinery
 
Effector: we need to go deeper
Effector: we need to go deeperEffector: we need to go deeper
Effector: we need to go deeper
 
C programs
C programsC programs
C programs
 
Matlab file
Matlab file Matlab file
Matlab file
 
Optimal Budget Allocation: Theoretical Guarantee and Efficient Algorithm
Optimal Budget Allocation: Theoretical Guarantee and Efficient AlgorithmOptimal Budget Allocation: Theoretical Guarantee and Efficient Algorithm
Optimal Budget Allocation: Theoretical Guarantee and Efficient Algorithm
 
Maximizing Submodular Function over the Integer Lattice
Maximizing Submodular Function over the Integer LatticeMaximizing Submodular Function over the Integer Lattice
Maximizing Submodular Function over the Integer Lattice
 

Viewers also liked

Low Touch Machine Learning with Leah McGuire (Salesforce)
Low Touch Machine Learning with Leah McGuire (Salesforce)Low Touch Machine Learning with Leah McGuire (Salesforce)
Low Touch Machine Learning with Leah McGuire (Salesforce)Spark Summit
 
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...Spark Summit
 
Feature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick PentreathFeature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick PentreathSpark Summit
 
Experimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles BakerExperimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles BakerDatabricks
 
Art of Feature Engineering for Data Science with Nabeel Sarwar
Art of Feature Engineering for Data Science with Nabeel SarwarArt of Feature Engineering for Data Science with Nabeel Sarwar
Art of Feature Engineering for Data Science with Nabeel SarwarSpark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl...
One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl...One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl...
One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl...Spark Summit
 
Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...
Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...
Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...Spark Summit
 
Histogram Equalized Heat Maps from Log Data via Apache Spark with Arvind Rao
Histogram Equalized Heat Maps from Log Data via Apache Spark with Arvind RaoHistogram Equalized Heat Maps from Log Data via Apache Spark with Arvind Rao
Histogram Equalized Heat Maps from Log Data via Apache Spark with Arvind RaoSpark Summit
 
Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ...
Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ...Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ...
Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ...Spark Summit
 
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningHandling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningSpark Summit
 
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Optimal Strategies for Large Scale Batch ETL Jobs with Emma TangOptimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Optimal Strategies for Large Scale Batch ETL Jobs with Emma TangDatabricks
 

Viewers also liked (20)

Low Touch Machine Learning with Leah McGuire (Salesforce)
Low Touch Machine Learning with Leah McGuire (Salesforce)Low Touch Machine Learning with Leah McGuire (Salesforce)
Low Touch Machine Learning with Leah McGuire (Salesforce)
 
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval...
 
Feature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick PentreathFeature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick Pentreath
 
Experimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles BakerExperimental Design for Distributed Machine Learning with Myles Baker
Experimental Design for Distributed Machine Learning with Myles Baker
 
Art of Feature Engineering for Data Science with Nabeel Sarwar
Art of Feature Engineering for Data Science with Nabeel SarwarArt of Feature Engineering for Data Science with Nabeel Sarwar
Art of Feature Engineering for Data Science with Nabeel Sarwar
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
How to Share State Across Multiple Apache Spark Jobs using Apache Ignite with...
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
Indicium: Interactive Querying at Scale Using Apache Spark, Zeppelin, and Spa...
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl...
One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl...One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl...
One-Pass Data Science In Apache Spark With Generative T-Digests with Erik Erl...
 
Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...
Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...
Working with Skewed Data: The Iterative Broadcast with Fokko Driesprong Rob K...
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with Roger Menezes and D...
 
Histogram Equalized Heat Maps from Log Data via Apache Spark with Arvind Rao
Histogram Equalized Heat Maps from Log Data via Apache Spark with Arvind RaoHistogram Equalized Heat Maps from Log Data via Apache Spark with Arvind Rao
Histogram Equalized Heat Maps from Log Data via Apache Spark with Arvind Rao
 
Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ...
Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ...Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ...
Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and ...
 
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningHandling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
 
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
Apache Spark—Apache HBase Connector: Feature Rich and Efficient Access to HBa...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Optimal Strategies for Large Scale Batch ETL Jobs with Emma TangOptimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
 

Similar to Building Machine Learning Algorithms on Apache Spark with William Benton

C++11 - A Change in Style - v2.0
C++11 - A Change in Style - v2.0C++11 - A Change in Style - v2.0
C++11 - A Change in Style - v2.0Yaser Zhian
 
JDBC for CSQL Database
JDBC for CSQL DatabaseJDBC for CSQL Database
JDBC for CSQL Databasejitendral
 
Monads - Dublin Scala meetup
Monads - Dublin Scala meetupMonads - Dublin Scala meetup
Monads - Dublin Scala meetupMikhail Girkin
 
Compiler Construction | Lecture 9 | Constraint Resolution
Compiler Construction | Lecture 9 | Constraint ResolutionCompiler Construction | Lecture 9 | Constraint Resolution
Compiler Construction | Lecture 9 | Constraint ResolutionEelco Visser
 
Whats new in_csharp4
Whats new in_csharp4Whats new in_csharp4
Whats new in_csharp4Abed Bukhari
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryDatabricks
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryDatabricks
 
Cheat Sheet for Stata v15.00 PDF Complete
Cheat Sheet for Stata v15.00 PDF CompleteCheat Sheet for Stata v15.00 PDF Complete
Cheat Sheet for Stata v15.00 PDF CompleteTsamaraLuthfia1
 
Rewriting Java In Scala
Rewriting Java In ScalaRewriting Java In Scala
Rewriting Java In ScalaSkills Matter
 
The TclQuadcode Compiler
The TclQuadcode CompilerThe TclQuadcode Compiler
The TclQuadcode CompilerDonal Fellows
 
Stata Cheat Sheets (all)
Stata Cheat Sheets (all)Stata Cheat Sheets (all)
Stata Cheat Sheets (all)Laura Hughes
 
Gssl
GsslGssl
Gsslncct
 
Laziness, trampolines, monoids and other functional amenities: this is not yo...
Laziness, trampolines, monoids and other functional amenities: this is not yo...Laziness, trampolines, monoids and other functional amenities: this is not yo...
Laziness, trampolines, monoids and other functional amenities: this is not yo...Mario Fusco
 
Lecture#6 functions in c++
Lecture#6 functions in c++Lecture#6 functions in c++
Lecture#6 functions in c++NUST Stuff
 
Uncertain Volatility Models
Uncertain Volatility ModelsUncertain Volatility Models
Uncertain Volatility ModelsSwati Mital
 
Pointcuts and Analysis
Pointcuts and AnalysisPointcuts and Analysis
Pointcuts and AnalysisWiwat Ruengmee
 

Similar to Building Machine Learning Algorithms on Apache Spark with William Benton (20)

C++11 - A Change in Style - v2.0
C++11 - A Change in Style - v2.0C++11 - A Change in Style - v2.0
C++11 - A Change in Style - v2.0
 
JDBC for CSQL Database
JDBC for CSQL DatabaseJDBC for CSQL Database
JDBC for CSQL Database
 
Monads - Dublin Scala meetup
Monads - Dublin Scala meetupMonads - Dublin Scala meetup
Monads - Dublin Scala meetup
 
Compiler Construction | Lecture 9 | Constraint Resolution
Compiler Construction | Lecture 9 | Constraint ResolutionCompiler Construction | Lecture 9 | Constraint Resolution
Compiler Construction | Lecture 9 | Constraint Resolution
 
Whats new in_csharp4
Whats new in_csharp4Whats new in_csharp4
Whats new in_csharp4
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love Story
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love Story
 
Cheat Sheet for Stata v15.00 PDF Complete
Cheat Sheet for Stata v15.00 PDF CompleteCheat Sheet for Stata v15.00 PDF Complete
Cheat Sheet for Stata v15.00 PDF Complete
 
Rewriting Java In Scala
Rewriting Java In ScalaRewriting Java In Scala
Rewriting Java In Scala
 
Quality Python Homework Help
Quality Python Homework HelpQuality Python Homework Help
Quality Python Homework Help
 
The TclQuadcode Compiler
The TclQuadcode CompilerThe TclQuadcode Compiler
The TclQuadcode Compiler
 
Stata Cheat Sheets (all)
Stata Cheat Sheets (all)Stata Cheat Sheets (all)
Stata Cheat Sheets (all)
 
Gssl
GsslGssl
Gssl
 
Scala @ TomTom
Scala @ TomTomScala @ TomTom
Scala @ TomTom
 
Laziness, trampolines, monoids and other functional amenities: this is not yo...
Laziness, trampolines, monoids and other functional amenities: this is not yo...Laziness, trampolines, monoids and other functional amenities: this is not yo...
Laziness, trampolines, monoids and other functional amenities: this is not yo...
 
04_AJMS_453_22_compressed.pdf
04_AJMS_453_22_compressed.pdf04_AJMS_453_22_compressed.pdf
04_AJMS_453_22_compressed.pdf
 
Lecture#6 functions in c++
Lecture#6 functions in c++Lecture#6 functions in c++
Lecture#6 functions in c++
 
Uncertain Volatility Models
Uncertain Volatility ModelsUncertain Volatility Models
Uncertain Volatility Models
 
Pointcuts and Analysis
Pointcuts and AnalysisPointcuts and Analysis
Pointcuts and Analysis
 
Hidden loops
Hidden loopsHidden loops
Hidden loops
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...Spark Summit
 
Variant-Apache Spark for Bioinformatics with Piotr Szul
Variant-Apache Spark for Bioinformatics with Piotr SzulVariant-Apache Spark for Bioinformatics with Piotr Szul
Variant-Apache Spark for Bioinformatics with Piotr SzulSpark Summit
 
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Spark Summit
 
Best Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene PangBest Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene PangSpark Summit
 
Smack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad
Smack Stack and Beyond—Building Fast Data Pipelines with Jorg SchadSmack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad
Smack Stack and Beyond—Building Fast Data Pipelines with Jorg SchadSpark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
Apache Spark-Bench: Simulate, Test, Compare, Exercise, and Yes, Benchmark wit...
 
Variant-Apache Spark for Bioinformatics with Piotr Szul
Variant-Apache Spark for Bioinformatics with Piotr SzulVariant-Apache Spark for Bioinformatics with Piotr Szul
Variant-Apache Spark for Bioinformatics with Piotr Szul
 
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
Running Spark Inside Containers with Haohai Ma and Khalid Ahmed
 
Best Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene PangBest Practices for Using Alluxio with Apache Spark with Gene Pang
Best Practices for Using Alluxio with Apache Spark with Gene Pang
 
Smack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad
Smack Stack and Beyond—Building Fast Data Pipelines with Jorg SchadSmack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad
Smack Stack and Beyond—Building Fast Data Pipelines with Jorg Schad
 

Recently uploaded

April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 

Recently uploaded (20)

April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 

Building Machine Learning Algorithms on Apache Spark with William Benton

  • 1. Building machine learning algorithms on Apache Spark William Benton (@willb) Red Hat, Inc. Session hashtag: #EUds5
  • 6. #EUds5 Forecast Introducing our case study: self-organizing maps Parallel implementations for partitioned collections (in particular, RDDs) Beyond the RDD: data frames and ML pipelines Practical considerations and key takeaways 6
  • 14. #EUds5 Training self-organizing maps 12 while t < maxupdates: random.shuffle(examples) for ex in examples: t = t + 1 if t == maxupdates: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt
  • 15. #EUds5 Training self-organizing maps 12 while t < maxupdates: random.shuffle(examples) for ex in examples: t = t + 1 if t == maxupdates: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt process the training set in random order
  • 16. #EUds5 Training self-organizing maps 12 while t < maxupdates: random.shuffle(examples) for ex in examples: t = t + 1 if t == maxupdates: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt process the training set in random order the neighborhood size controls how much of the map around the BMU is affected
  • 17. #EUds5 Training self-organizing maps 12 while t < maxupdates: random.shuffle(examples) for ex in examples: t = t + 1 if t == maxupdates: break bestMatch = closest(somt, ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): somt+1[unit] = somt[unit] + (ex - somt[unit]) * alpha(t) * wt process the training set in random order the neighborhood size controls how much of the map around the BMU is affected the learning rate controls how much closer to the example each unit gets
  • 19. #EUds5 Historical aside: Amdahl’s Law 14 1 1 - p lim So =sp —> ∞
  • 20. #EUds5 What forces serial execution? 15
  • 21. #EUds5 What forces serial execution? 16
  • 22. #EUds5 What forces serial execution? 17 state[t+1] = combine(state[t], x)
  • 23. #EUds5 What forces serial execution? 18 state[t+1] = combine(state[t], x)
  • 24. #EUds5 What forces serial execution? 19 f1: (T, T) => T f2: (T, U) => T
  • 25. #EUds5 What forces serial execution? 19 f1: (T, T) => T f2: (T, U) => T
  • 26. #EUds5 What forces serial execution? 19 f1: (T, T) => T f2: (T, U) => T
  • 27. #EUds5 How can we fix these? 20 a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
  • 28. #EUds5 How can we fix these? 21 a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
  • 29. #EUds5 How can we fix these? 22 a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
  • 30. #EUds5 How can we fix these? 23 a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
  • 31. #EUds5 How can we fix these? 24 L-BGFSSGD a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
  • 32. #EUds5 How can we fix these? 25 SGD L-BGFS a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c) There will be examples of each of these approaches for many problems in the literature and in open-source code!
  • 33. #EUds5 Implementing atop RDDs We’ll start with a batch implementation of our technique: 26 for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods)
  • 34. #EUds5 Implementing atop RDDs 27 for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods) Each batch produces a model that can be averaged with other models
  • 35. #EUds5 Implementing atop RDDs 28 for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods) Each batch produces a model that can be averaged with other models partition
  • 36. #EUds5 Implementing atop RDDs 29 for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(somt-1, ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood somt = newSOM(state.matches / state.hoods) This won’t always work!
  • 37. #EUds5 An implementation template 30 var nextModel = initialModel for (int i = 0; i < iterations; i++) { val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) }
  • 38. #EUds5 An implementation template 31 var nextModel = initialModel for (int i = 0; i < iterations; i++) { val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) } var nextModel = initialModel for (int i = 0; i < iterations; i++) { val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) } “fold”: update the state for this partition with a single new example
  • 39. #EUds5 An implementation template 32 var nextModel = initialModel for (int i = 0; i < iterations; i++) { val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) } var nextModel = initialModel for (int i = 0; i < iterations; i++) { val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) } “reduce”: combine the states from two partitions
  • 40. #EUds5 An implementation template 33 var nextModel = initialModel for (int i = 0; i < iterations; i++) { val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) } this will cause the model object to be serialized with the closure nextModel
  • 41. #EUds5 An implementation template 34 var nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist } broadcast the current working model for this iterationvar nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist } get the value of the broadcast variable
  • 42. #EUds5 An implementation template 35 var nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist } remove the stale broadcasted model var nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist
  • 43. #EUds5 An implementation template 36 var nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist } var nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty()) { { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) } { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } } nextModel = modelFromState(newState) current.unpersist } the wrong implementation of the right interface
  • 44. #EUds5 workersdriver (aggregate) Implementing on RDDs 37 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕
  • 45. #EUds5 workersdriver (aggregate) Implementing on RDDs 38 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕
  • 47. #EUds5 workersdriver (treeAggregate) Implementing on RDDs 39 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕
  • 48. #EUds5 workersdriver (treeAggregate) Implementing on RDDs 40 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕
  • 49. #EUds5 workersdriver (treeAggregate) Implementing on RDDs 41 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕
  • 50. #EUds5 workersdriver (treeAggregate) Implementing on RDDs 42 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ driver (treeAggregate) ⊕ ⊕ ⊕ ⊕
  • 51. #EUds5 workersdriver (treeAggregate) Implementing on RDDs 43 ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕ ⊕
  • 55. #EUds5 Beyond the RDD: Data frames and ML Pipelines
  • 56. #EUds5 RDDs: some good parts 47 val rdd: RDD[String] = /* ... */ rdd.map(_ * 3.0).collect() val df: DataFrame = /* data frame with one String-valued column */ df.select($"_1" * 3.0).show()
  • 57. #EUds5 RDDs: some good parts 48 val rdd: RDD[String] = /* ... */ rdd.map(_ * 3.0).collect() val df: DataFrame = /* data frame with one String-valued column */ df.select($"_1" * 3.0).show() doesn’t compile
  • 58. #EUds5 RDDs: some good parts 49 val rdd: RDD[String] = /* ... */ rdd.map(_ * 3.0).collect() val df: DataFrame = /* data frame with one String-valued column */ df.select($"_1" * 3.0).show() doesn’t compile
  • 59. #EUds5 RDDs: some good parts 50 val rdd: RDD[String] = /* ... */ rdd.map(_ * 3.0).collect() val df: DataFrame = /* data frame with one String-valued column */ df.select($"_1" * 3.0).show() doesn’t compile crashes at runtime
  • 60. #EUds5 RDDs: some good parts 51 rdd.map { vec => (vec, model.value.closestWithSimilarity(vec)) } val predict = udf ((vec: SV) => model.value.closestWithSimilarity(vec)) df.withColumn($"predictions", predict($"features"))
  • 61. #EUds5 RDDs: some good parts 52 rdd.map { vec => (vec, model.value.closestWithSimilarity(vec)) } val predict = udf ((vec: SV) => model.value.closestWithSimilarity(vec)) df.withColumn($"predictions", predict($"features"))
  • 62. #EUds5 RDDs versus query planning 53 val numbers1 = sc.parallelize(1 to 100000000) val numbers2 = sc.parallelize(1 to 1000000000) numbers1.cartesian(numbers2) .map((x, y) => (x, y, expensive(x, y))) .filter((x, y, _) => isPrime(x), isPrime(y))
  • 63. #EUds5 RDDs versus query planning 54 val numbers1 = sc.parallelize(1 to 100000000) val numbers2 = sc.parallelize(1 to 1000000000) numbers1.filter(isPrime(_)) .cartesian(numbers2.filter(isPrime(_))) .map((x, y) => (x, y, expensive(x, y)))
  • 64. #EUds5 RDDs and the JVM heap 55 val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))
  • 65. #EUds5 RDDs and the Java heap 56 val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0)) class pointer flags size locks element pointer element pointer class pointer flags size locks 1.0 class pointer flags size locks 3.0 4.0 2.0
  • 66. #EUds5 RDDs and the Java heap 57 val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0)) class pointer flags size locks element pointer element pointer class pointer flags size locks 1.0 class pointer flags size locks 3.0 4.0 2.0 32 bytes of data…
  • 67. #EUds5 RDDs and the Java heap 58 val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0)) class pointer flags size locks element pointer element pointer class pointer flags size locks 1.0 class pointer flags size locks 3.0 4.0 2.0 …and 64 bytes of overhead! 32 bytes of data…
  • 68. #EUds5 ML pipelines: a quick example 59 from pyspark.ml.clustering import KMeans K, SEED = 100, 0xdea110c8 randomDF = make_random_df() kmeans = KMeans().setK(K).setSeed(SEED).setFeaturesCol("features") model = kmeans.fit(randomDF) withPredictions = model.transform(randomDF).select("x", "y", "prediction")
  • 69. #EUds5 Working with ML pipelines 60 estimator.fit(df)
  • 70. #EUds5 Working with ML pipelines 61 estimator.fit(df) model.transform(df)
  • 71. #EUds5 Working with ML pipelines 62 model.transform(df)
  • 72. #EUds5 Working with ML pipelines 63 model.transform(df)
  • 73. #EUds5 Working with ML pipelines 64 estimator.fit(df) model.transform(df)
  • 74. #EUds5 Working with ML pipelines 64 estimator.fit(df) model.transform(df) inputCol epochs seed
  • 75. #EUds5 Working with ML pipelines 64 estimator.fit(df) model.transform(df) inputCol epochs seed outputCol
  • 76. #EUds5 Defining parameters 65 private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
  • 77. #EUds5 Defining parameters 66 private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ... private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
  • 78. #EUds5 private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ... Defining parameters 67 private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
  • 79. #EUds5 Defining parameters 68 private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ... private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
  • 80. #EUds5 Defining parameters 69 private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ... private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
  • 81. #EUds5 Defining parameters 70 private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ... private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
  • 82. #EUds5 Don’t repeat yourself 71 /** * Common params for KMeans and KMeansModel */ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFeaturesCol with HasSeed with HasPredictionCol with HasTol { /* ... */ }
  • 88. #EUds5 Validate and transform at once 73 def transformSchema(schema: StructType): StructType = { // check that the input columns exist... // ...and are the proper type // ...and that the output columns don’t exist // ...and then make a new schema }
  • 89. #EUds5 Validate and transform at once 74 def transformSchema(schema: StructType): StructType = { // check that the input columns exist… require(schema.fieldNames.contains($(featuresCol))) // ...and are the proper type // ...and that the output columns don’t exist // ...and then make a new schema }
  • 90. #EUds5 Validate and transform at once 75 def transformSchema(schema: StructType): StructType = { // check that the input columns exist... // ...and are the proper type schema($(featuresCol)) match { case sf: StructField => require(sf.dataType.equals(VectorType)) } // ...and that the output columns don’t exist // ...and then make a new schema }
  • 91. #EUds5 Validate and transform at once 76 def transformSchema(schema: StructType): StructType = { // check that the input columns exist… // ...and are the proper type // ...and that the output columns don’t exist require(!schema.fieldNames.contains($(predictionCol))) require(!schema.fieldNames.contains($(similarityCol))) // ...and then make a new schema }
  • 92. #EUds5 Validate and transform at once 77 def transformSchema(schema: StructType): StructType = { // check that the input columns exist… // ...and are the proper type // ...and that the output columns don’t exist // ...and then make a new schema schema.add($(predictionCol), "int") .add($(similarityCol), "double") }
  • 93. #EUds5 Training on data frames 78 def fit(examples: DataFrame) = { import examples.sparkSession.implicits._ import org.apache.spark.ml.linalg.{Vector=>SV} val dfexamples = examples.select($(exampleCol)).rdd.map { case Row(sv: SV) => sv } /* construct a model object with the result of training */ new SOMModel(train(dfexamples, $(x), $(y))) }
  • 95. #EUds5 Retire your visibility hacks 80 package org.apache.spark.ml.hacks object Hacks { import org.apache.spark.ml.linalg.VectorUDT val vectorUDT = new VectorUDT }
  • 96. #EUds5 Retire your visibility hacks 81 package org.apache.spark.ml.linalg /* imports, etc., are elided ... */ @Since("2.0.0") @DeveloperApi object SQLDataTypes { val VectorType: DataType = new VectorUDT val MatrixType: DataType = new MatrixUDT }
  • 97. #EUds5 Caching training data 82 val wasUncached = examples.storageLevel == StorageLevel.NONE if (wasUncached) { examples.cache() } /* actually train here */ if (wasUncached) { examples.unpersist() }
  • 98. #EUds5 Improve serial execution times Are you repeatedly comparing training data to a model that only changes once per iteration? Consider caching norms. Are you doing a lot of dot products in a for loop? Consider replacing these loops with a matrix-vector multiplication. Seek to limit the number of library invocations you make and thus the time you spend copying data to and from your linear algebra library. 83
  • 99. #EUds5 Key takeaways There are several techniques you can use to develop parallel implementations of machine learning algorithms. The RDD API may not be your favorite way to interact with Spark as a user, but it can be extremely valuable if you’re developing libraries for Spark. As a library developer, you might need to rely on developer APIs and dive in to Spark’s source code, but things are getting easier with each release! 84