
Building Machine Learning Algorithms on Apache Spark with William Benton

There are lots of reasons why you might want to implement your own machine learning algorithms on Spark: you might want to experiment with a new idea, reproduce results from a recent research paper, or use an existing technique that isn’t implemented in MLlib. In this talk, we’ll walk through the process of developing a new machine learning model for Spark. We’ll start with the basics, by considering how we’d design a parallel implementation of a particular unsupervised learning technique. The bulk of the talk will focus on the details you need to know to turn an algorithm design into an efficient parallel implementation on Spark: we’ll start by reviewing a simple RDD-based implementation, show some improvements, point out some pitfalls to avoid, and iteratively extend our implementation to support contemporary Spark features like ML Pipelines and structured query processing. You’ll leave this talk with everything you need to build a new machine learning technique that runs on Spark.



  1. 1. Building machine learning algorithms on Apache Spark William Benton (@willb) Red Hat, Inc. Session hashtag: #EUds5
  2. 2. #EUds5 Motivation 2
  3. 3. #EUds5 Motivation 3
  4. 4. #EUds5 Motivation 4
  5. 5. #EUds5 Motivation 5
  6. 6. #EUds5 Forecast Introducing our case study: self-organizing maps Parallel implementations for partitioned collections (in particular, RDDs) Beyond the RDD: data frames and ML pipelines Practical considerations and key takeaways 6
  7. 7. #EUds5 Introducing self-organizing maps
  8. 8. #EUds5 8
  9. 9. #EUds5 8
  10. 10. #EUds5 Training self-organizing maps 9
  11. 11. #EUds5 Training self-organizing maps 10
  12. 12. #EUds5 Training self-organizing maps 11
  13. 13. #EUds5 Training self-organizing maps 11
  14. 14. #EUds5 Training self-organizing maps 12 while t < maxupdates: random.shuffle(examples) for ex in examples: t = t + 1 if t == maxupdates: break bestMatch = closest(som[t], ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): som[t+1][unit] = som[t][unit] + (ex - som[t][unit]) * alpha(t) * wt
  15. 15. #EUds5 Training self-organizing maps 12 while t < maxupdates: random.shuffle(examples) for ex in examples: t = t + 1 if t == maxupdates: break bestMatch = closest(som[t], ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): som[t+1][unit] = som[t][unit] + (ex - som[t][unit]) * alpha(t) * wt process the training set in random order
  16. 16. #EUds5 Training self-organizing maps 12 while t < maxupdates: random.shuffle(examples) for ex in examples: t = t + 1 if t == maxupdates: break bestMatch = closest(som[t], ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): som[t+1][unit] = som[t][unit] + (ex - som[t][unit]) * alpha(t) * wt process the training set in random order the neighborhood size controls how much of the map around the BMU is affected
  17. 17. #EUds5 Training self-organizing maps 12 while t < maxupdates: random.shuffle(examples) for ex in examples: t = t + 1 if t == maxupdates: break bestMatch = closest(som[t], ex) for (unit, wt) in neighborhood(bestMatch, sigma(t)): som[t+1][unit] = som[t][unit] + (ex - som[t][unit]) * alpha(t) * wt process the training set in random order the neighborhood size controls how much of the map around the BMU is affected the learning rate controls how much closer to the example each unit gets
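
For reference, here is a minimal sequential sketch of that update loop in Scala (the language used for the rest of this deck's code). The map is stored as an array of weight vectors; dist2 and the hoodWeight function are illustrative stand-ins for the distance and neighborhood functions on the slides.

    // squared Euclidean distance between an example and a unit's weights
    def dist2(a: Array[Double], b: Array[Double]): Double =
      a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

    // the best-matching unit (BMU) is the unit closest to the example
    def closest(som: Array[Array[Double]], ex: Array[Double]): Int =
      som.indices.minBy { u => dist2(som(u), ex) }

    // one update: move every unit toward the example, scaled by the learning
    // rate alpha and by its neighborhood weight relative to the BMU
    def step(som: Array[Array[Double]], ex: Array[Double],
             alpha: Double, hoodWeight: (Int, Int) => Double): Unit = {
      val bmu = closest(som, ex)
      for (u <- som.indices; wt = hoodWeight(bmu, u); d <- som(u).indices)
        som(u)(d) += (ex(d) - som(u)(d)) * alpha * wt
    }
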
  18. 18. #EUds5 Parallel implementations for partitioned collections
  19. 19. #EUds5 Historical aside: Amdahl’s Law 14 lim (s → ∞) S = 1 / (1 − p) (with parallel fraction p, the speedup S is bounded by the serial fraction no matter how many processors s you add)
  20. 20. #EUds5 What forces serial execution? 15
  21. 21. #EUds5 What forces serial execution? 16
  22. 22. #EUds5 What forces serial execution? 17 state[t+1] = combine(state[t], x)
  23. 23. #EUds5 What forces serial execution? 18 state[t+1] = combine(state[t], x)
  24. 24. #EUds5 What forces serial execution? 19 f1: (T, T) => T f2: (T, U) => T
  25. 25. #EUds5 What forces serial execution? 19 f1: (T, T) => T f2: (T, U) => T
  26. 26. #EUds5 What forces serial execution? 19 f1: (T, T) => T f2: (T, U) => T
  27. 27. #EUds5 How can we fix these? 20 a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
  28. 28. #EUds5 How can we fix these? 21 a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
  29. 29. #EUds5 How can we fix these? 22 a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
  30. 30. #EUds5 How can we fix these? 23 a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
  31. 31. #EUds5 How can we fix these? 24 SGD L-BFGS a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c)
  32. 32. #EUds5 How can we fix these? 25 SGD L-BFGS a ⊕ b = b ⊕ a (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c) There will be examples of each of these approaches for many problems in the literature and in open-source code!
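
To make this concrete, here is a hedged sketch of a combine step that satisfies both laws: if each partition's state is a pair of elementwise sums (as in the batch formulation on the next slides), merging two states is just elementwise addition, which is commutative and associative, so partition states can be merged in any order.

    // per-partition state for SOM training: weighted sums of examples and the
    // corresponding neighborhood weights (names follow the next few slides)
    case class SOMState(matches: Array[Double], hoods: Array[Double]) {
      def combine(other: SOMState): SOMState =
        SOMState(matches.zip(other.matches).map { case (a, b) => a + b },
                 hoods.zip(other.hoods).map { case (a, b) => a + b })
    }
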
  33. 33. #EUds5 Implementing atop RDDs We’ll start with a batch implementation of our technique: 26 for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(som[t-1], ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood som[t] = newSOM(state.matches / state.hoods)
  34. 34. #EUds5 Implementing atop RDDs 27 for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(som[t-1], ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood som[t] = newSOM(state.matches / state.hoods) Each batch produces a model that can be averaged with other models
  35. 35. #EUds5 Implementing atop RDDs 28 for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(som[t-1], ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood som[t] = newSOM(state.matches / state.hoods) Each batch produces a model that can be averaged with other models partition
  36. 36. #EUds5 Implementing atop RDDs 29 for t in (1 to iterations): state = newState() for ex in examples: bestMatch = closest(som[t-1], ex) hood = neighborhood(bestMatch, sigma(t)) state.matches += ex * hood state.hoods += hood som[t] = newSOM(state.matches / state.hoods) This won’t always work!
  37. 37. #EUds5 An implementation template 30 var nextModel = initialModel for (i <- 0 until iterations) { val newState = examples.aggregate(ModelState.empty())( { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) }, { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } ) nextModel = modelFromState(newState) }
  38. 38. #EUds5 An implementation template 31 var nextModel = initialModel for (i <- 0 until iterations) { val newState = examples.aggregate(ModelState.empty())( { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) }, { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } ) nextModel = modelFromState(newState) } “fold”: update the state for this partition with a single new example
  39. 39. #EUds5 An implementation template 32 var nextModel = initialModel for (i <- 0 until iterations) { val newState = examples.aggregate(ModelState.empty())( { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) }, { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } ) nextModel = modelFromState(newState) } “reduce”: combine the states from two partitions
  40. 40. #EUds5 An implementation template 33 var nextModel = initialModel for (i <- 0 until iterations) { val newState = examples.aggregate(ModelState.empty())( { case (state: ModelState, example: Example) => state.update(nextModel.lookup(example, i), example) }, { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } ) nextModel = modelFromState(newState) } this will cause the model object (nextModel) to be serialized with the closure
  41. 41. #EUds5 An implementation template 34 var nextModel = initialModel for (i <- 0 until iterations) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty())( { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) }, { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } ) nextModel = modelFromState(newState) current.unpersist() } broadcast the current working model for this iteration; get the value of the broadcast variable
  42. 42. #EUds5 An implementation template 35 var nextModel = initialModel for (i <- 0 until iterations) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty())( { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) }, { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } ) nextModel = modelFromState(newState) current.unpersist() } remove the stale broadcasted model
  43. 43. #EUds5 An implementation template 36 var nextModel = initialModel for (i <- 0 until iterations) { val current = sc.broadcast(nextModel) val newState = examples.aggregate(ModelState.empty())( { case (state: ModelState, example: Example) => state.update(current.value.lookup(example, i), example) }, { case (s1: ModelState, s2: ModelState) => s1.combine(s2) } ) nextModel = modelFromState(newState) current.unpersist() } the wrong implementation of the right interface
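
To see the template end to end, here is a hedged, runnable toy instance: instead of a SOM, the "model" is just two 1-D cluster centers refined Lloyd-style. The shape is the same: broadcast the working model, fold examples into per-partition states, combine states, and rebuild the model on the driver.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    case class ToyState(sums: Array[Double], counts: Array[Long]) {
      def update(cluster: Int, x: Double): ToyState = {
        sums(cluster) += x; counts(cluster) += 1; this
      }
      def combine(o: ToyState): ToyState =
        ToyState(sums.zip(o.sums).map { case (a, b) => a + b },
                 counts.zip(o.counts).map { case (a, b) => a + b })
    }

    def fit(sc: SparkContext, examples: RDD[Double], iterations: Int): Array[Double] = {
      var model = Array(0.0, 1.0)
      for (_ <- 0 until iterations) {
        val current = sc.broadcast(model)        // ship the working model once
        val state = examples.aggregate(ToyState(Array(0.0, 0.0), Array(0L, 0L)))(
          (s, x) => {                            // "fold": assign x to its closest center
            val c = current.value.zipWithIndex.minBy { case (ctr, _) => math.abs(ctr - x) }._2
            s.update(c, x)
          },
          (s1, s2) => s1.combine(s2)             // "reduce": merge partition states
        )
        model = state.sums.zip(state.counts).map { case (s, n) => if (n > 0) s / n else 0.0 }
        current.unpersist()                      // drop the stale broadcast
      }
      model
    }
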
  44. 44. #EUds5 Implementing on RDDs 37 [diagram, slides 44–46: with aggregate, every partition ships its partial state straight to the driver, which performs all of the combines (⊕) itself]
  47. 47. #EUds5 Implementing on RDDs 39 [diagram, slides 47–54: with treeAggregate, partial states are first combined (⊕) on the workers in a tree, so the driver only merges the last few results]
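
Switching the toy example above from the driver-heavy version to the tree-shaped one is a one-word change plus an optional depth parameter; a hedged sketch:

    // same zero, same fold, same combine as before, but treeAggregate runs the
    // combines on the workers in a tree before the driver sees anything
    val state = examples.treeAggregate(ToyState(Array(0.0, 0.0), Array(0L, 0L)))(
      (s, x) => s.update(0, x),          // same fold as above (cluster choice elided)
      (s1, s2) => s1.combine(s2),        // combines now also run on the workers
      depth = 2                          // 2 is the default; raise it for many partitions
    )
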
  55. 55. #EUds5 Beyond the RDD: Data frames and ML Pipelines
  56. 56. #EUds5 RDDs: some good parts 47 val rdd: RDD[String] = /* ... */ rdd.map(_ * 3.0).collect() val df: DataFrame = /* data frame with one String-valued column */ df.select($"_1" * 3.0).show()
  57. 57. #EUds5 RDDs: some good parts 48 val rdd: RDD[String] = /* ... */ rdd.map(_ * 3.0).collect() val df: DataFrame = /* data frame with one String-valued column */ df.select($"_1" * 3.0).show() doesn’t compile
  58. 58. #EUds5 RDDs: some good parts 49 val rdd: RDD[String] = /* ... */ rdd.map(_ * 3.0).collect() val df: DataFrame = /* data frame with one String-valued column */ df.select($"_1" * 3.0).show() doesn’t compile
  59. 59. #EUds5 RDDs: some good parts 50 val rdd: RDD[String] = /* ... */ rdd.map(_ * 3.0).collect() val df: DataFrame = /* data frame with one String-valued column */ df.select($"_1" * 3.0).show() doesn’t compile crashes at runtime
  60. 60. #EUds5 RDDs: some good parts 51 rdd.map { vec => (vec, model.value.closestWithSimilarity(vec)) } val predict = udf((vec: SV) => model.value.closestWithSimilarity(vec)) df.withColumn("predictions", predict($"features"))
  61. 61. #EUds5 RDDs: some good parts 52 rdd.map { vec => (vec, model.value.closestWithSimilarity(vec)) } val predict = udf((vec: SV) => model.value.closestWithSimilarity(vec)) df.withColumn("predictions", predict($"features"))
  62. 62. #EUds5 RDDs versus query planning 53 val numbers1 = sc.parallelize(1 to 100000000) val numbers2 = sc.parallelize(1 to 1000000000) numbers1.cartesian(numbers2).map { case (x, y) => (x, y, expensive(x, y)) }.filter { case (x, y, _) => isPrime(x) && isPrime(y) }
  63. 63. #EUds5 RDDs versus query planning 54 val numbers1 = sc.parallelize(1 to 100000000) val numbers2 = sc.parallelize(1 to 1000000000) numbers1.filter(isPrime(_)).cartesian(numbers2.filter(isPrime(_))).map { case (x, y) => (x, y, expensive(x, y)) }
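
For contrast, a hedged sketch of the same computation expressed against DataFrames: because the filter and the join are declarative, Catalyst can push the deterministic isPrime predicates below the join on its own, rather than relying on the author to reorder operations by hand (isPrime and expensive are the same hypothetical helpers as above).

    import org.apache.spark.sql.functions.udf
    import spark.implicits._   // assumes a SparkSession named spark

    val isPrimeUdf   = udf((n: Long) => isPrime(n))
    val expensiveUdf = udf((x: Long, y: Long) => expensive(x, y))

    val xs = spark.range(1L, 100000000L).toDF("x")
    val ys = spark.range(1L, 1000000000L).toDF("y")

    xs.crossJoin(ys)
      .filter(isPrimeUdf($"x") && isPrimeUdf($"y"))   // the optimizer may reorder this
      .withColumn("z", expensiveUdf($"x", $"y"))
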
  64. 64. #EUds5 RDDs and the JVM heap 55 val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))
  65. 65. #EUds5 RDDs and the Java heap 56 val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0)) [diagram: the outer array is a JVM object with a header (class pointer, flags, size, lock word) and two element pointers; each inner array is another object with the same header ahead of its two doubles]
  66. 66. #EUds5 RDDs and the Java heap 57 [same diagram] 32 bytes of data…
  67. 67. #EUds5 RDDs and the Java heap 58 [same diagram] 32 bytes of data… …and 64 bytes of overhead!
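
If you want to check figures like these yourself, Spark ships a (developer-API, approximate) size estimator; a quick sketch:

    import org.apache.spark.util.SizeEstimator

    val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))
    SizeEstimator.estimate(mat)                         // ~96 bytes for 32 bytes of doubles
    SizeEstimator.estimate(Array(1.0, 2.0, 3.0, 4.0))  // a flat array pays one header, not three
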
  68. 68. #EUds5 ML pipelines: a quick example 59 from pyspark.ml.clustering import KMeans K, SEED = 100, 0xdea110c8 randomDF = make_random_df() kmeans = KMeans().setK(K).setSeed(SEED).setFeaturesCol("features") model = kmeans.fit(randomDF) withPredictions = model.transform(randomDF).select("x", "y", "prediction")
  69. 69. #EUds5 Working with ML pipelines 60 estimator.fit(df)
  70. 70. #EUds5 Working with ML pipelines 61 estimator.fit(df) model.transform(df)
  71. 71. #EUds5 Working with ML pipelines 62 model.transform(df)
  72. 72. #EUds5 Working with ML pipelines 63 model.transform(df)
  73. 73. #EUds5 Working with ML pipelines 64 estimator.fit(df) model.transform(df)
  74. 74. #EUds5 Working with ML pipelines 64 estimator.fit(df) model.transform(df) inputCol epochs seed
  75. 75. #EUds5 Working with ML pipelines 64 estimator.fit(df) model.transform(df) inputCol epochs seed outputCol
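
From the user's side, those params turn into fluent setters on the estimator. A hedged sketch using the SOM estimator this talk is building (the exact names are illustrative):

    val som = new SOM()
      .setX(12).setY(12)               // map dimensions
      .setFeaturesCol("features")      // input column
      .setPredictionCol("prediction")  // output column
      .setSeed(0xdea110c8L)
    val model  = som.fit(df)           // an Estimator produces a Model
    val scored = model.transform(df)   // a Model appends output columns
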
  76. 76. #EUds5 Defining parameters 65 private[som] trait SOMParams extends Params with DefaultParamsWritable { final val x: IntParam = new IntParam(this, "x", "width of self-organizing map (>= 1)", ParamValidators.gtEq(1)) final def getX: Int = $(x) final def setX(value: Int): this.type = set(x, value) // ...
  77. 77. #EUds5 Defining parameters 66–70 [slides 77–81 repeat the SOMParams code from the previous slide]
  82. 82. #EUds5 Don’t repeat yourself 71 /** * Common params for KMeans and KMeansModel */ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFeaturesCol with HasSeed with HasPredictionCol with HasTol { /* ... */ }
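
One wrinkle: Spark's shared traits like HasMaxIter live in org.apache.spark.ml.param.shared and are private[ml], so an external library can't mix them in directly. The same idea still applies, though; a hedged sketch of defining your own small shared trait once:

    import org.apache.spark.ml.param.{Param, Params}

    // reusable across every estimator and model that reports BMU similarity
    private[som] trait HasSimilarityCol extends Params {
      final val similarityCol: Param[String] =
        new Param[String](this, "similarityCol", "column name for similarity to the BMU")
      final def getSimilarityCol: String = $(similarityCol)
      final def setSimilarityCol(value: String): this.type = set(similarityCol, value)
    }
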
  83. 83. #EUds5 Estimators and transformers 72 estimator.fit(df)
  84. 84. #EUds5 Estimators and transformers 72 estimator.fit(df)
  85. 85. #EUds5 Estimators and transformers 72 estimator.fit(df) model.transform(df)
  86. 86. #EUds5 Estimators and transformers 72 estimator.fit(df) model.transform(df)
  87. 87. #EUds5 Estimators and transformers 72 estimator.fit(df) model.transform(df)
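
Concretely, the two halves of that contract look like this; a hedged skeleton assuming the SOM types from this talk (bodies elided with ???):

    import org.apache.spark.ml.{Estimator, Model}
    import org.apache.spark.ml.param.ParamMap
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.{DataFrame, Dataset}
    import org.apache.spark.sql.types.StructType

    class SOM(override val uid: String) extends Estimator[SOMModel] with SOMParams {
      def this() = this(Identifiable.randomUID("som"))
      override def fit(dataset: Dataset[_]): SOMModel = ???   // train, return a SOMModel
      override def transformSchema(schema: StructType): StructType = schema // see below
      override def copy(extra: ParamMap): SOM = defaultCopy(extra)
    }

    class SOMModel(override val uid: String) extends Model[SOMModel] with SOMParams {
      override def transform(dataset: Dataset[_]): DataFrame = ??? // append predictions
      override def transformSchema(schema: StructType): StructType = schema // see below
      override def copy(extra: ParamMap): SOMModel = defaultCopy(extra)
    }
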
  88. 88. #EUds5 Validate and transform at once 73 def transformSchema(schema: StructType): StructType = { // check that the input columns exist... // ...and are the proper type // ...and that the output columns don’t exist // ...and then make a new schema }
  89. 89. #EUds5 Validate and transform at once 74 def transformSchema(schema: StructType): StructType = { // check that the input columns exist… require(schema.fieldNames.contains($(featuresCol))) // ...and are the proper type // ...and that the output columns don’t exist // ...and then make a new schema }
  90. 90. #EUds5 Validate and transform at once 75 def transformSchema(schema: StructType): StructType = { // check that the input columns exist... // ...and are the proper type schema($(featuresCol)) match { case sf: StructField => require(sf.dataType.equals(VectorType)) } // ...and that the output columns don’t exist // ...and then make a new schema }
  91. 91. #EUds5 Validate and transform at once 76 def transformSchema(schema: StructType): StructType = { // check that the input columns exist… // ...and are the proper type // ...and that the output columns don’t exist require(!schema.fieldNames.contains($(predictionCol))) require(!schema.fieldNames.contains($(similarityCol))) // ...and then make a new schema }
  92. 92. #EUds5 Validate and transform at once 77 def transformSchema(schema: StructType): StructType = { // check that the input columns exist… // ...and are the proper type // ...and that the output columns don’t exist // ...and then make a new schema schema.add($(predictionCol), "int") .add($(similarityCol), "double") }
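
Assembled, the four pieces above make one method; a hedged sketch (VectorType comes from the SQLDataTypes object shown near the end of this talk):

    import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
    import org.apache.spark.sql.types.StructType

    def transformSchema(schema: StructType): StructType = {
      require(schema.fieldNames.contains($(featuresCol)))          // input exists...
      require(schema($(featuresCol)).dataType.equals(VectorType))  // ...and is a vector
      require(!schema.fieldNames.contains($(predictionCol)))       // outputs don't exist yet
      require(!schema.fieldNames.contains($(similarityCol)))
      schema.add($(predictionCol), "int").add($(similarityCol), "double")
    }
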
  93. 93. #EUds5 Training on data frames 78 def fit(examples: DataFrame) = { import examples.sparkSession.implicits._ import org.apache.spark.ml.linalg.{Vector=>SV} val dfexamples = examples.select($(exampleCol)).rdd.map { case Row(sv: SV) => sv } /* construct a model object with the result of training */ new SOMModel(train(dfexamples, $(x), $(y))) }
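
The matching transform reuses the UDF pattern from the "good parts" slides; a hedged sketch (closestWithSimilarity and the column names are assumptions):

    import org.apache.spark.ml.linalg.{Vector => SV}
    import org.apache.spark.sql.{DataFrame, Dataset}
    import org.apache.spark.sql.functions.udf

    def transform(dataset: Dataset[_]): DataFrame = {
      transformSchema(dataset.schema)   // validate before touching any data
      val predict = udf((vec: SV) => closestWithSimilarity(vec))
      dataset.withColumn($(predictionCol), predict(dataset($(featuresCol))))
    }
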
  94. 94. #EUds5 Practical considerations and key takeaways
  95. 95. #EUds5 Retire your visibility hacks 80 package org.apache.spark.ml.hacks object Hacks { import org.apache.spark.ml.linalg.VectorUDT val vectorUDT = new VectorUDT }
  96. 96. #EUds5 Retire your visibility hacks 81 package org.apache.spark.ml.linalg /* imports, etc., are elided ... */ @Since("2.0.0") @DeveloperApi object SQLDataTypes { val VectorType: DataType = new VectorUDT val MatrixType: DataType = new MatrixUDT }
  97. 97. #EUds5 Caching training data 82 val wasUncached = examples.storageLevel == StorageLevel.NONE if (wasUncached) { examples.cache() } /* actually train here */ if (wasUncached) { examples.unpersist() }
  98. 98. #EUds5 Improve serial execution times Are you repeatedly comparing training data to a model that only changes once per iteration? Consider caching norms. Are you doing a lot of dot products in a for loop? Consider replacing these loops with a matrix-vector multiplication. Seek to limit the number of library invocations you make and thus the time you spend copying data to and from your linear algebra library. 83
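
A hedged sketch of that last point, using Breeze (which ships with Spark): one matrix-vector multiply replaces a loop of per-unit dot products, so the data crosses into the linear algebra library once instead of once per unit.

    import breeze.linalg.{DenseMatrix, DenseVector}

    val units   = DenseMatrix.rand(100, 64)   // 100 map units x 64 features
    val example = DenseVector.rand(64)

    // instead of: val dots = (0 until 100).map { u => units(u, ::).t dot example }
    val dots: DenseVector[Double] = units * example   // one BLAS call, one copy
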
  99. 99. #EUds5 Key takeaways There are several techniques you can use to develop parallel implementations of machine learning algorithms. The RDD API may not be your favorite way to interact with Spark as a user, but it can be extremely valuable if you’re developing libraries for Spark. As a library developer, you might need to rely on developer APIs and dive in to Spark’s source code, but things are getting easier with each release! 84
  100. 100. #EUds5 @willb • willb@redhat.com https://chapeau.freevariable.com https://radanalytics.io Thanks!
