
Apache Spark for Library Developers with Erik Erlandson and William Benton


As a developer, data engineer, or data scientist, you’ve seen how Apache Spark is expressive enough to let you solve problems elegantly and efficient enough to let you scale out to handle more data. However, if you’re solving the same problems again and again, you probably want to capture and distribute your solutions so that you can focus on new problems and so other people can reuse and remix them: you want to develop a library that extends Spark.

You faced a learning curve when you first started using Spark, and you’ll face a different learning curve as you start to develop reusable abstractions atop Spark. In this talk, two experienced Spark library developers will give you the background and context you’ll need to turn your code into a library that you can share with the world. We’ll cover: issues to consider when developing parallel algorithms with Spark; designing generic, robust functions that operate on data frames and datasets; extending data frames with user-defined functions (UDFs) and user-defined aggregates (UDAFs); best practices around caching and broadcasting, and why these are especially important for library developers; integrating with ML pipelines; exposing key functionality in both Python and Scala; and how to test, build, and publish your library for the community.

We’ll back up our advice with concrete examples from real packages built atop Spark. You’ll leave this talk informed and inspired to take your Spark proficiency to the next level and develop and publish an awesome library of your own.


  1. 1. Apache Spark for library developers. William Benton (willb@redhat.com, @willb) and Erik Erlandson (eje@redhat.com, @manyangled)
  2. 2. About Will
  3. 3. #SAISDD6 The Silex and Isarn libraries: reusable open-source code that works with Spark, factored from internal apps. We’ve tracked Spark releases since Spark 1.3.0. See https://silex.radanalytics.io and http://isarnproject.org
  4. 4. #SAISDD6 Forecast: basic considerations for reusable Spark code; generic functions for parallel collections; extending data frames with custom aggregates; exposing JVM libraries to Python; sharing your work with the world
  5. 5. Basic considerations
  15. 15. #SAISDD6 Today’s main themes
  16. 16. #SAISDD6 Cross-building for Scala. In your SBT build definition:
      scalaVersion := "2.11.11"
      crossScalaVersions := Seq("2.10.6", "2.11.11")
    In your shell:
      $ sbt +compile              # or test, package, publish, etc.
      $ sbt "++ 2.11.11" compile
  19. 19. #SAISDD6 “Bring-your-own Spark.” In your SBT build definition, mark the Spark artifacts as Provided so that applications use whatever Spark distribution they already have on the classpath:
      libraryDependencies ++= Seq(
        "org.apache.spark" %% "spark-core"  % "2.3.0" % Provided,
        "org.apache.spark" %% "spark-sql"   % "2.3.0" % Provided,
        "org.apache.spark" %% "spark-mllib" % "2.3.0" % Provided,
        "org.scalatest"    %% "scalatest"   % "2.2.4" % Test)
  21. 21. #SAISDD6 “Bring-your-own Spark,” with a dependency your library actually needs at runtime (here, joda-time) declared without Provided:
      libraryDependencies ++= Seq(
        "org.apache.spark" %% "spark-core"  % "2.3.0" % Provided,
        "org.apache.spark" %% "spark-sql"   % "2.3.0" % Provided,
        "org.apache.spark" %% "spark-mllib" % "2.3.0" % Provided,
        "joda-time"        %  "joda-time"   % "2.7",
        "org.scalatest"    %% "scalatest"   % "2.2.4" % Test)
  23. 23. #SAISDD6 Taking care with resources
  26. 26. #SAISDD6 Caching when necessary: cache the input RDD only if the caller hasn’t already cached it, and unpersist it only if you were the one who cached it.
      def step(rdd: RDD[_]) = {
        val wasUncached = rdd.storageLevel == StorageLevel.NONE
        if (wasUncached) { rdd.cache() }
        val result = trainModel(rdd)
        if (wasUncached) { rdd.unpersist() }
        result
      }
  31. 31. #SAISDD6 Broadcasting the model on each iteration and unpersisting the broadcast afterward:
      var nextModel = initialModel
      for (i <- 0 until iterations) {
        val current = sc.broadcast(nextModel)
        val newState = …  // compute the next state using the broadcast model (current.value)
        nextModel = modelFromState(newState)
        current.unpersist
      }
  34. 34. #SAISDD6 Minding the JVM heap
      val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))
    [diagram: each array object carries a class pointer, flags, a size field, and a lock word in addition to its elements] That’s 32 bytes of data… …and 64 bytes of overhead!
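    A minimal sketch (my own illustration, not from the deck) of one way to avoid that per-row object overhead: store the matrix in a single flat Array[Double] and index it manually in row-major order.
      // One array object header instead of three; the values are the same as mat above.
      val rows = 2
      val cols = 2
      val flat = Array(1.0, 2.0, 3.0, 4.0)
      def at(i: Int, j: Int): Double = flat(i * cols + j)
      assert(at(1, 0) == 3.0)  // same element as mat(1)(0)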
  38. 38. Continuous integration for Spark libraries and apps
  39. 39. #SAISDD6 local[*]
  40. 40. #SAISDD6 [diagram: CPU and memory]
  44. 44. #SAISDD6 local[2]
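    A hedged sketch of a CI test fixture along these lines, assuming ScalaTest (which the build definition shown earlier pulls in). Running against local[2] keeps CPU and memory use modest on shared CI workers while still exercising parallelism.
      import org.apache.spark.sql.SparkSession
      import org.scalatest.{BeforeAndAfterAll, Suite}

      // Mix into a test suite to share one small local session across its tests.
      trait LocalSparkSession extends BeforeAndAfterAll { this: Suite =>
        @transient lazy val spark: SparkSession =
          SparkSession.builder
            .master("local[2]")
            .appName("library-tests")
            .getOrCreate()

        override def afterAll(): Unit = {
          spark.stop()
          super.afterAll()
        }
      }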
  47. 47. Writing generic code for Spark’s parallel collections
  48. 48. #SAISDD6 The RDD is invariant: T <: U does not imply RDD[T] <: RDD[U] (a dog is an animal, but an RDD of dogs is not an RDD of animals)
  50. 50. #SAISDD6 
      trait HasUserId { val userid: Int }

      case class Transaction(override val userid: Int, timestamp: Int, amount: Double)
        extends HasUserId {}

      def badKeyByUserId(r: RDD[HasUserId]) = r.map(x => (x.userid, x))
  52. 52. #SAISDD6 
      val xacts = spark.parallelize(Array(
        Transaction(1, 1, 1.0),
        Transaction(2, 2, 1.0)
      ))

      badKeyByUserId(xacts)

      <console>: error: type mismatch;
       found   : org.apache.spark.rdd.RDD[Transaction]
       required: org.apache.spark.rdd.RDD[HasUserId]
      Note: Transaction <: HasUserId, but class RDD is invariant in type T.
      You may wish to define T as +T instead. (SLS 4.5)
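    One way to make this function generic (a minimal sketch of my own, not transcribed from the deck) is to parameterize over the element type instead of fixing it to the supertype; the name keyByUserId is illustrative.
      // T is inferred at the call site, so keyByUserId(xacts) compiles
      // and the concrete element type is preserved in the result.
      def keyByUserId[T <: HasUserId](r: RDD[T]): RDD[(Int, T)] =
        r.map(x => (x.userid, x))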
  55. 55. #SAISDD6 An example: natural join. [diagram: joining a frame with columns A B C D E to a frame with columns A B E X Y on their shared columns yields a frame with columns A B C D E X Y]
  58. 58. #SAISDD6 Ad-hoc natural join df1.join(df2, df1("a") === df2("a") && df1("b") === df2("b") && df1("e") === df2("e"))
  59. 59. #SAISDD6 
      def natjoin(left: DataFrame, right: DataFrame): DataFrame = {
        val lcols = left.columns
        val rcols = right.columns
        val ccols = lcols.toSet intersect rcols.toSet

        if (ccols.isEmpty)
          left.limit(0).crossJoin(right.limit(0))
        else
          left
            .join(right, ccols.map { col => left(col) === right(col) }.reduce(_ && _))
            .select(lcols.collect { case c if ccols.contains(c) => left(c) } ++
                    lcols.collect { case c if !ccols.contains(c) => left(c) } ++
                    rcols.collect { case c if !ccols.contains(c) => right(c) } : _*)
      }
    Key ideas: introspecting over column names; dynamically constructing the join expression (the sequence [left.a === right.a, left.b === right.b, …] is reduced to left.a === right.a && left.b === right.b && …); and dynamically constructing the output column list.
  72. 72. #SAISDD6 
      case class DFWithNatJoin(df: DataFrame) extends NaturalJoining {
        def natjoin(other: DataFrame): DataFrame = super.natjoin(df, other)
      }

      object NaturalJoin extends NaturalJoining {
        object implicits {
          implicit def dfWithNatJoin(df: DataFrame) = DFWithNatJoin(df)
        }
      }

      import NaturalJoin.implicits._
      df.natjoin(otherdf)
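    A minimal sketch of the NaturalJoining trait referenced above (a reconstruction of my own; presumably it simply hosts the shared natjoin implementation from the earlier slide as a mixin):
      trait NaturalJoining {
        // in the real library this would hold the natjoin body shown earlier;
        // ??? is a placeholder here
        def natjoin(left: DataFrame, right: DataFrame): DataFrame = ???
      }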
  75. 75. #SAISDD6 User-defined functions: extracting selected fields from JSON strings into a structured column.
      {"a": 1, "b": "wilma",  ..., "x": "club"}     ->  (wilma,  club)
      {"a": 2, "b": "betty",  ..., "x": "diamond"}  ->  (betty,  diamond)
      {"a": 3, "b": "fred",   ..., "x": "heart"}    ->  (fred,   heart)
      {"a": 4, "b": "barney", ..., "x": "spade"}    ->  (barney, spade)
  77. 77. #SAISDD6 
      import json
      from pyspark.sql.types import *
      from pyspark.sql.functions import udf

      def selectively_structure(fields):
          resultType = StructType([StructField(f, StringType(), nullable=True) for f in fields])
          def impl(js):
              try:
                  d = json.loads(js)
                  return [str(d.get(f)) for f in fields]
              except:
                  return [None] * len(fields)
          return udf(impl, resultType)

      extract_bx = selectively_structure(["b", "x"])
      structured_df = df.withColumn("result", extract_bx("json"))
  83. 83. #SAISDD6 Spark’s ML pipelines: a model transforms a data frame into a new data frame (model.transform(df)); an estimator is fit to a data frame to produce a model (estimator.fit(df)); both are configured through parameters such as inputCol, outputCol, seed, and epochs.
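    For a library to plug into these pipelines, it typically ships its own Transformer or Estimator. A hedged sketch of a minimal custom Transformer follows; the ColumnCopier class is my own illustration, not from the deck.
      import org.apache.spark.ml.Transformer
      import org.apache.spark.ml.param.{Param, ParamMap}
      import org.apache.spark.ml.util.Identifiable
      import org.apache.spark.sql.{DataFrame, Dataset}
      import org.apache.spark.sql.functions.col
      import org.apache.spark.sql.types.StructType

      // Copies inputCol to outputCol; a real library would do something useful here.
      class ColumnCopier(override val uid: String) extends Transformer {
        def this() = this(Identifiable.randomUID("columnCopier"))

        final val inputCol  = new Param[String](this, "inputCol", "input column name")
        final val outputCol = new Param[String](this, "outputCol", "output column name")
        def setInputCol(value: String): this.type  = set(inputCol, value)
        def setOutputCol(value: String): this.type = set(outputCol, value)

        def transform(df: Dataset[_]): DataFrame =
          df.withColumn($(outputCol), col($(inputCol)))

        def transformSchema(schema: StructType): StructType =
          schema.add($(outputCol), schema($(inputCol)).dataType)

        def copy(extra: ParamMap): ColumnCopier = defaultCopy(extra)
      }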
  94. 94. #SAISDD6 Forecast: basic considerations for reusable Spark code; generic functions for parallel collections; extending data frames with custom aggregates; exposing JVM libraries to Python; sharing your work with the world
  95. 95. About Erik
  96. 96. User-defined aggregates: the fundamentals
  97. 97. #SAISDD6 Three components
  102. 102. User-defined aggregates: the implementation
  103. 103. #SAISDD6 
      case class TDigestUDAF[N](deltaV: Double, maxDiscreteV: Int)
        (implicit num: Numeric[N], dataTpe: TDigestUDAFDataType[N])
        extends UserDefinedAggregateFunction {

        def deterministic: Boolean = false
        def inputSchema: StructType = StructType(StructField("x", dataTpe.tpe) :: Nil)
        def bufferSchema: StructType = StructType(StructField("tdigest", TDigestUDT) :: Nil)
        def dataType: DataType = TDigestUDT
  111. 111. #SAISDD6 Four main functions: initialize and evaluate
      def initialize(buf: MutableAggregationBuffer): Unit = {
        buf(0) = TDigestSQL(TDigest.empty(deltaV, maxDiscreteV))
      }

      def evaluate(buf: Row): Any = buf.getAs[TDigestSQL](0)
  119. 119. #SAISDD6 Four main functions: update and merge
      def update(buf: MutableAggregationBuffer, input: Row): Unit = {
        if (!input.isNullAt(0)) {
          buf(0) = TDigestSQL(buf.getAs[TDigestSQL](0).tdigest +
                              num.toDouble(input.getAs[N](0)))
        }
      }

      def merge(buf1: MutableAggregationBuffer, buf2: Row): Unit = {
        buf1(0) = TDigestSQL(buf1.getAs[TDigestSQL](0).tdigest ++
                             buf2.getAs[TDigestSQL](0).tdigest)
      }
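    A hedged sketch of how an aggregate like this might be applied, assuming the library’s implicit TDigestUDAFDataType[Double] is in scope; the constructor arguments and the data frame df with numeric column "x" are illustrative, not from the deck.
      import org.apache.spark.sql.functions.col

      // A UserDefinedAggregateFunction instance can be applied to columns directly.
      val sketch = TDigestUDAF[Double](deltaV = 0.5, maxDiscreteV = 0)
      val sketched = df.agg(sketch(col("x")).alias("tdigest"))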
  132. 132. User-defined aggregates: User-defined types
  133. 133. #SAISDD6 User-defined types (note that the UDT lives under the org.apache.spark package):
      package org.apache.spark.isarnproject.sketches.udt

      @SQLUserDefinedType(udt = classOf[TDigestUDT])
      case class TDigestSQL(tdigest: TDigest)

      class TDigestUDT extends UserDefinedType[TDigestSQL] {
        def userClass: Class[TDigestSQL] = classOf[TDigestSQL]
        // ....
  136. 136. #SAISDD6 Implementing custom types
      class TDigestUDT extends UserDefinedType[TDigestSQL] {
        def userClass: Class[TDigestSQL] = classOf[TDigestSQL]

        override def pyUDT: String = "isarnproject.sketches.udt.tdigest.TDigestUDT"
        override def typeName: String = "tdigest"

        def sqlType: DataType = StructType(
          StructField("delta", DoubleType, false) ::
          /* ... */
          StructField("clustM", ArrayType(DoubleType, false), false) ::
          Nil)
  140. 140. #SAISDD6 
      def serialize(tdsql: TDigestSQL): Any = serializeTD(tdsql.tdigest)

      private[sketches] def serializeTD(td: TDigest): InternalRow = {
        val TDigest(delta, maxDiscrete, nclusters, clusters) = td
        val row = new GenericInternalRow(5)
        row.setDouble(0, delta)
        row.setInt(1, maxDiscrete)
        row.setInt(2, nclusters)
        val clustX = clusters.keys.toArray
        val clustM = clusters.values.toArray
        row.update(3, UnsafeArrayData.fromPrimitiveArray(clustX))
        row.update(4, UnsafeArrayData.fromPrimitiveArray(clustM))
        row
      }
144. 144. #SAISDD6
def deserialize(td: Any): TDigestSQL = TDigestSQL(deserializeTD(td))

private[sketches] def deserializeTD(datum: Any): TDigest = datum match {
  case row: InternalRow =>
    val delta = row.getDouble(0)
    val maxDiscrete = row.getInt(1)
    val nclusters = row.getInt(2)
    val clustX = row.getArray(3).toDoubleArray()
    val clustM = row.getArray(4).toDoubleArray()
    val clusters = clustX.zip(clustM)
      .foldLeft(TDigestMap.empty) { case (td, e) => td + e }
    TDigest(delta, maxDiscrete, nclusters, clusters)
}
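Because the serializer and deserializer above are written by hand, they can silently drift apart; a round-trip check in the test suite catches that early. The following is a minimal sketch, assuming the test lives in the same sketches package (so the private[sketches] helpers are visible) and that a TDigest value td has been built elsewhere in the test, for example by sketching some sample data:

// minimal round-trip sanity check (sketch, not the library's actual test code)
// assumes: same `sketches` package, and a TDigest `td` constructed elsewhere
val roundTripped: TDigest = deserializeTD(serializeTD(td))
assert(roundTripped == td, "UDT round trip should preserve the sketch")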
  147. 147. Extending PySpark with your Scala library
152. 152. #SAISDD6
# class to access the active Spark context for Python
from pyspark.context import SparkContext

# gateway to the JVM from py4j
sparkJVM = SparkContext._active_spark_context._jvm

# use the gateway to access JVM objects and classes
thisThing = sparkJVM.com.path.to.this.thing
155. 155. #SAISDD6 A Python-friendly wrapper
package org.isarnproject.sketches.udaf

object pythonBindings {
  def tdigestDoubleUDAF(delta: Double, maxDiscrete: Int) =
    TDigestUDAF[Double](delta, maxDiscrete)
}
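Py4J can instantiate JVM objects for you, but it cannot supply Scala type parameters, which is why the binding above pins the element type to Double in its name. A sketch of how the same object might grow one monomorphic entry point per supported element type (the extra method names here are illustrative, not necessarily the library's exact API):

object pythonBindings {
  def tdigestDoubleUDAF(delta: Double, maxDiscrete: Int) =
    TDigestUDAF[Double](delta, maxDiscrete)
  // hypothetical additional bindings, one per concrete element type,
  // so Python callers never have to name a Scala type parameter
  def tdigestIntUDAF(delta: Double, maxDiscrete: Int) =
    TDigestUDAF[Int](delta, maxDiscrete)
  def tdigestLongUDAF(delta: Double, maxDiscrete: Int) =
    TDigestUDAF[Long](delta, maxDiscrete)
}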
158. 158. #SAISDD6
from pyspark.sql.column import Column, _to_java_column, _to_seq
from pyspark.context import SparkContext

# one of these for each type parameter Double, Int, Long, etc
def tdigestDoubleUDAF(col, delta=0.5, maxDiscrete=0):
    sc = SparkContext._active_spark_context
    pb = sc._jvm.org.isarnproject.sketches.udaf.pythonBindings
    tdapply = pb.tdigestDoubleUDAF(delta, maxDiscrete).apply
    return Column(tdapply(_to_seq(sc, [col], _to_java_column)))
161. 161. #SAISDD6
class TDigestUDT(UserDefinedType):
    @classmethod
    def sqlType(cls):
        return StructType([
            StructField("delta", DoubleType(), False),
            StructField("maxDiscrete", IntegerType(), False),
            StructField("nclusters", IntegerType(), False),
            StructField("clustX", ArrayType(DoubleType(), False), False),
            StructField("clustM", ArrayType(DoubleType(), False), False)])
    # ...
163. 163. #SAISDD6
class TDigestUDT(UserDefinedType):
    # ...
    @classmethod
    def module(cls):
        return "isarnproject.sketches.udt.tdigest"

    @classmethod
    def scalaUDT(cls):
        return "org.apache.spark.isarnproject.sketches.udt.TDigestUDT"

    def simpleString(self):
        return "tdigest"
166. 166. #SAISDD6
class TDigestUDT(UserDefinedType):
    # ...
    def serialize(self, obj):
        return (obj.delta, obj.maxDiscrete, obj.nclusters,
                [float(v) for v in obj.clustX],
                [float(v) for v in obj.clustM])

    def deserialize(self, datum):
        return TDigest(datum[0], datum[1], datum[2], datum[3], datum[4])
168. 168. #SAISDD6
class TDigestUDT extends UserDefinedType[TDigestSQL] {
  // ...
  override def pyUDT: String =
    "isarnproject.sketches.udt.tdigest.TDigestUDT"
}
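The Scala UDT and its Python counterpart have to agree on the on-wire schema: the same field names, types, and ordering used by serialize and deserialize on both sides. A sketch of what the Scala side's sqlType would look like if it mirrors the Python schema shown above (a class-body fragment, assuming org.apache.spark.sql.types._ is imported; treat it as illustrative rather than the library's exact source):

// sketch: Scala-side schema mirroring the Python TDigestUDT.sqlType above;
// if the two drift apart, cross-language (de)serialization breaks
override def sqlType: DataType = StructType(
  StructField("delta", DoubleType, nullable = false) ::
  StructField("maxDiscrete", IntegerType, nullable = false) ::
  StructField("nclusters", IntegerType, nullable = false) ::
  StructField("clustX", ArrayType(DoubleType, containsNull = false), nullable = false) ::
  StructField("clustM", ArrayType(DoubleType, containsNull = false), nullable = false) :: Nil)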
169. 169. #SAISDD6 Python code in JAR files
mappings in (Compile, packageBin) ++= Seq(
  (baseDirectory.value / "python" / "isarnproject" / "__init__.pyc") ->
    "isarnproject/__init__.pyc",
  (baseDirectory.value / "python" / "isarnproject" / "sketches" / "__init__.pyc") ->
    "isarnproject/sketches/__init__.pyc",
  (baseDirectory.value / "python" / "isarnproject" / "sketches" / "udaf" / "__init__.pyc") ->
    "isarnproject/sketches/udaf/__init__.pyc",
  (baseDirectory.value / "python" / "isarnproject" / "sketches" / "udaf" / "tdigest.pyc") ->
    "isarnproject/sketches/udaf/tdigest.pyc",
  (baseDirectory.value / "python" / "isarnproject" / "sketches" / "udt" / "__init__.pyc") ->
    "isarnproject/sketches/udt/__init__.pyc",
  (baseDirectory.value / "python" / "isarnproject" / "sketches" / "udt" / "tdigest.pyc") ->
    "isarnproject/sketches/udt/tdigest.pyc"
)
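Listing every .pyc by hand works, but it is easy to forget a file as the Python package grows. A more compact alternative is sketched below, assuming the same python/ directory layout as above: it globs every compiled file under python/ into the JAR while preserving each file's relative path.

// sketch: map every .pyc under python/ into the JAR, keyed by its
// path relative to the python/ directory
mappings in (Compile, packageBin) ++= {
  val pyBase = baseDirectory.value / "python"
  (pyBase ** "*.pyc").get.map { f =>
    f -> f.relativeTo(pyBase).get.getPath
  }
}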
173. 173. #SAISDD6 Cross-building for Python
lazy val compilePython = taskKey[Unit]("Compile python files")

compilePython := {
  val s: TaskStreams = streams.value
  s.log.info("compiling python...")
  val stat = (Seq(pythonCMD, "-m", "compileall", "python/") !)
  if (stat != 0) {
    throw new IllegalStateException("python compile failed")
  }
}

(packageBin in Compile) <<= (packageBin in Compile).dependsOn(compilePython)
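The task above refers to a pythonCMD value that the slide doesn't show. One plausible way to define it (an assumption for illustration, not the project's actual build code) is to make the interpreter overridable from the environment and default to whatever python is on the PATH:

// assumed helper: which Python interpreter to invoke for compileall;
// override with e.g. PYTHON_CMD=python2.7 when cross-building for Python 2.7
lazy val pythonCMD = sys.env.getOrElse("PYTHON_CMD", "python")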
176. 176. #SAISDD6 Using versioned JAR files
$ pyspark --packages 'org.isarnproject:isarn-sketches-spark_2.11:0.3.0-sp2.2-py2.7'
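That coordinate encodes the Scala binary version (_2.11) in the artifact name and the Spark and Python versions (sp2.2, py2.7) in the version string, so users can pick the build that matches their cluster. One way to assemble such a version string in sbt is sketched here; the sparkVersion and pythonVersion settings are assumptions, not the project's actual build code.

// sketch: bake Spark and Python versions into the published version,
// producing e.g. "0.3.0-sp2.2-py2.7"
val sparkVersion  = "2.2.1"   // assumed setting
val pythonVersion = "2.7"     // assumed setting
val sparkSuffix   = sparkVersion.split('.').take(2).mkString(".")

version := s"0.3.0-sp${sparkSuffix}-py${pythonVersion}"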
179. 179. Show your work: publishing results
180. 180. #SAISDD6 Developing with git-flow
$ brew install git-flow     # macOS
$ dnf install git-flow      # Fedora
$ yum install git-flow      # CentOS
$ apt-get install git-flow  # Debian and friends
(Search the internet for “git flow” to learn more!)
181. 181. #SAISDD6
# Set up git-flow in this repository
$ git flow init

# Start work on my-awesome-feature; create
# and switch to a feature branch
$ git flow feature start my-awesome-feature
$ ...

# Finish work on my-awesome-feature; merge
# feature/my-awesome-feature to develop
$ git flow feature finish my-awesome-feature
182. 182. #SAISDD6
# Start work on a release branch
$ git flow release start 0.1.0

# Hack and bump version numbers
$ ...

# Finish work on v0.1.0; merge
# release/0.1.0 to develop and master;
# tag v0.1.0
$ git flow release finish 0.1.0
183. 183. #SAISDD6
                                          Maven Central    Bintray
easy to set up for library developers     not really       trivial
easy to set up for library users          trivial          mostly
easy to publish                           yes, via sbt     yes, via sbt + plugins
easy to resolve artifacts                 yes              mostly
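Whichever host you choose, the publishing settings live in the sbt build. Below is a minimal sketch for Bintray using the sbt-bintray plugin; the organization and repository names are placeholders, and the plugin is assumed to already be on the build. Publishing to Maven Central would instead point publishTo at a Sonatype staging repository and add the POM metadata (licenses, SCM, developers) that Central requires.

// sketch: Bintray publish settings via sbt-bintray
// (organization/repository names below are placeholders)
licenses += ("Apache-2.0", url("https://www.apache.org/licenses/LICENSE-2.0"))
publishMavenStyle := true
bintrayOrganization := Some("my-org")
bintrayRepository := "maven"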
  184. 184. Conclusions and takeaways
  189. 189. #SAISDD6 https://radanalytics.io eje@redhat.com • @manyangled willb@redhat.com • @willb KEEP IN TOUCH
