Apache Spark for Library Developers with Erik Erlandson and William Benton

As a developer, data engineer, or data scientist, you’ve seen how Apache Spark is expressive enough to let you solve problems elegantly and efficient enough to let you scale out to handle more data. However, if you’re solving the same problems again and again, you probably want to capture and distribute your solutions so that you can focus on new problems and so other people can reuse and remix them: you want to develop a library that extends Spark.

You faced a learning curve when you first started using Spark, and you’ll face a different learning curve as you start to develop reusable abstractions atop Spark. In this talk, two experienced Spark library developers will give you the background and context you’ll need to turn your code into a library that you can share with the world. We’ll cover: Issues to consider when developing parallel algorithms with Spark, Designing generic, robust functions that operate on data frames and datasets, Extending data frames with user-defined functions (UDFs) and user-defined aggregates (UDAFs), Best practices around caching and broadcasting, and why these are especially important for library developers, Integrating with ML pipelines, Exposing key functionality in both Python and Scala, and How to test, build, and publish your library for the community.

We’ll back up our advice with concrete examples from real packages built atop Spark. You’ll leave this talk informed and inspired to take your Spark proficiency to the next level and develop and publish an awesome library of your own.


  1. 1. Apache Spark for library developers. William Benton willb@redhat.com @willb, Erik Erlandson eje@redhat.com @manyangled
  2. 2. About Will
  3. 3. #SAISDD6 The Silex and Isarn libraries. Reusable open-source code that works with Spark, factored from internal apps. We’ve tracked Spark releases since Spark 1.3.0. See https://silex.radanalytics.io and http://isarnproject.org
  4. 4. #SAISDD6 Forecast Basic considerations for reusable Spark code Generic functions for parallel collections Extending data frames with custom aggregates Exposing JVM libraries to Python Sharing your work with the world
  5. 5. Basic considerations
  6. 6. #SAISDD6
  7. 7. #SAISDD6
  8. 8. #SAISDD6
  9. 9. #SAISDD6
  10. 10. #SAISDD6
  11. 11. #SAISDD6
  12. 12. #SAISDD6
  13. 13. #SAISDD6
  14. 14. #SAISDD6
  15. 15. #SAISDD6 Today’s main themes
  16. 16. #SAISDD6 in your SBT build definition: Cross-building for Scala scalaVersion := "2.11.11" crossScalaVersions := Seq("2.10.6", "2.11.11") in your shell: $ sbt +compile $ sbt "++ 2.11.11" compile scalaVersion := "2.11.11" crossScalaVersions := Seq("2.10.6", "2.11.11")
  17. 17. #SAISDD6 in your SBT build definition: Cross-building for Scala scalaVersion := "2.11.11" crossScalaVersions := Seq("2.10.6", "2.11.11") in your shell: $ sbt +compile $ sbt "++ 2.11.11" compile $ sbt +compile # or test, package, publish, etc. $ sbt "++ 2.11.11" compile scalaVersion := "2.11.11" crossScalaVersions := Seq("2.10.6", "2.11.11")
  18. 18. #SAISDD6 in your SBT build definition: Cross-building for Scala scalaVersion := "2.11.11" crossScalaVersions := Seq("2.10.6", "2.11.11") in your shell: $ sbt +compile $ sbt "++ 2.11.11" compile $ sbt +compile # or test, package, publish, etc. $ sbt "++ 2.11.11" compile scalaVersion := "2.11.11" crossScalaVersions := Seq("2.10.6", "2.11.11")
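For context, here is a minimal sketch of a cross-built library project end to end; the project name my-spark-lib is illustrative, not from the slides:

    // build.sbt (sketch): declare every Scala version the library should build against
    name := "my-spark-lib"
    scalaVersion := "2.11.11"
    crossScalaVersions := Seq("2.10.6", "2.11.11")

    // in the shell, prefixing a task with + runs it once per entry in crossScalaVersions:
    //   $ sbt +package        # produces my-spark-lib_2.10-<version>.jar and my-spark-lib_2.11-<version>.jar
    //   $ sbt +publishLocal

The _2.10 / _2.11 suffix in the artifact name is what lets downstream builds pick the matching binary with %%.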
  19. 19. #SAISDD6 in your SBT build definition: Bring-your-own Spark libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "2.3.0" % Provided, "org.apache.spark" %% "spark-sql" % "2.3.0" % Provided, "org.apache.spark" %% "spark-mllib" % "2.3.0" % Provided, "org.scalatest" %% "scalatest" % "2.2.4" % Test) libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "2.3.0" % Provided, "org.apache.spark" %% "spark-sql" % "2.3.0" % Provided, "org.apache.spark" %% "spark-mllib" % "2.3.0" % Provided, "org.scalatest" %% "scalatest" % "2.2.4" % Test)
  20. 20. #SAISDD6 in your SBT build definition: Bring-your-own Spark libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "2.3.0" % Provided, "org.apache.spark" %% "spark-sql" % "2.3.0" % Provided, "org.apache.spark" %% "spark-mllib" % "2.3.0" % Provided, "org.scalatest" %% "scalatest" % "2.2.4" % Test) libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "2.3.0" % Provided, "org.apache.spark" %% "spark-sql" % "2.3.0" % Provided, "org.apache.spark" %% "spark-mllib" % "2.3.0" % Provided, "org.scalatest" %% "scalatest" % "2.2.4" % Test)
  21. 21. #SAISDD6 in your SBT build definition: “Bring-your-own Spark” libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "2.3.0" % Provided, "org.apache.spark" %% "spark-sql" % "2.3.0" % Provided, "org.apache.spark" %% "spark-mllib" % "2.3.0" % Provided, "joda-time" % "joda-time" % "2.7", "org.scalatest" %% "scalatest" % "2.2.4" % Test) libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "2.3.0" % Provided, "org.apache.spark" %% "spark-sql" % "2.3.0" % Provided, "org.apache.spark" %% "spark-mllib" % "2.3.0" % Provided, "joda-time" % "joda-time" % "2.7", "org.scalatest" %% "scalatest" % "2.2.4" % Test)
  22. 22. #SAISDD6 in your SBT build definition: “Bring-your-own Spark” libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "2.3.0" % Provided, "org.apache.spark" %% "spark-sql" % "2.3.0" % Provided, "org.apache.spark" %% "spark-mllib" % "2.3.0" % Provided, "joda-time" % "joda-time" % "2.7", "org.scalatest" %% "scalatest" % "2.2.4" % Test) libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "2.3.0" % Provided, "org.apache.spark" %% "spark-sql" % "2.3.0" % Provided, "org.apache.spark" %% "spark-mllib" % "2.3.0" % Provided, "joda-time" % "joda-time" % "2.7", "org.scalatest" %% "scalatest" % "2.2.4" % Test)
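A sketch of the other side of the "bring-your-own Spark" arrangement, from the point of view of a hypothetical downstream application; the coordinates "org.example" %% "my-spark-lib" % "0.1.0" are invented for illustration:

    // the application's build.sbt: it depends on the library and supplies its own Spark,
    // because the library declared its Spark dependencies as Provided
    libraryDependencies ++= Seq(
      "org.example"      %% "my-spark-lib" % "0.1.0",
      "org.apache.spark" %% "spark-sql"    % "2.3.0"
    )

Provided dependencies are recorded in the published POM but are not pulled onto the downstream classpath, while ordinary dependencies such as the joda-time entry on the slide are.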
  23. 23. #SAISDD6 Taking care with resources
  24. 24. #SAISDD6 Taking care with resources
  25. 25. #SAISDD6 Taking care with resources
  26. 26. #SAISDD6 def step(rdd: RDD[_]) = { val wasUncached = rdd.storageLevel == StorageLevel.NONE if (wasUncached) { rdd.cache() } result = trainModel(rdd) if (wasUncached) { rdd.unpersist() } result } Caching when necessary def step(rdd: RDD[_]) = { val wasUncached = rdd.storageLevel == StorageLevel.NONE if (wasUncached) { rdd.cache() } result = trainModel(rdd) if (wasUncached) { rdd.unpersist() } result } rdd.cache()
  27. 27. #SAISDD6 def step(rdd: RDD[_]) = { val wasUncached = rdd.storageLevel == StorageLevel.NONE if (wasUncached) { rdd.cache() } result = trainModel(rdd) if (wasUncached) { rdd.unpersist() } result } Caching when necessary def step(rdd: RDD[_]) = { val wasUncached = rdd.storageLevel == StorageLevel.NONE if (wasUncached) { rdd.cache() } result = trainModel(rdd) if (wasUncached) { rdd.unpersist() } result } rdd.cache()
  28. 28. #SAISDD6 def step(rdd: RDD[_]) = { val wasUncached = rdd.storageLevel == StorageLevel.NONE if (wasUncached) { rdd.cache() } result = trainModel(rdd) if (wasUncached) { rdd.unpersist() } result } Caching when necessary def step(rdd: RDD[_]) = { val wasUncached = rdd.storageLevel == StorageLevel.NONE if (wasUncached) { rdd.cache() } result = trainModel(rdd) if (wasUncached) { rdd.unpersist() } result } rdd.cache() rdd.unpersist()
  29. 29. #SAISDD6 def step(rdd: RDD[_]) = { val wasUncached = rdd.storageLevel == StorageLevel.NONE if (wasUncached) { rdd.cache() } result = trainModel(rdd) result } Caching when necessary def step(rdd: RDD[_]) = { val wasUncached = rdd.storageLevel == StorageLevel.NONE if (wasUncached) { rdd.cache() } result = trainModel(rdd) if (wasUncached) { rdd.unpersist() } result }
  30. 30. #SAISDD6 def step(rdd: RDD[_]) = { val wasUncached = rdd.storageLevel == StorageLevel.NONE if (wasUncached) { rdd.cache() } result = trainModel(rdd) if (wasUncached) { rdd.unpersist() } result } Caching when necessary def step(rdd: RDD[_]) = { val wasUncached = rdd.storageLevel == StorageLevel.NONE if (wasUncached) { rdd.cache() } result = trainModel(rdd) if (wasUncached) { rdd.unpersist() } result }
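The "cache only if the caller has not already cached, and undo only what you did" discipline above can also be packaged as a small reusable helper. This is a sketch with an invented name (withCached), not code from Silex, and it uses RDD.getStorageLevel, the public accessor for an RDD's storage level:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // run body with rdd cached, but persist/unpersist only if the caller had not already
    // cached it, so the library never clobbers the application's own storage decisions
    def withCached[T, R](rdd: RDD[T])(body: RDD[T] => R): R = {
      val wasUncached = rdd.getStorageLevel == StorageLevel.NONE
      if (wasUncached) rdd.cache()
      try body(rdd) finally { if (wasUncached) rdd.unpersist() }
    }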
  31. 31. #SAISDD6 nextModel = modelFromState(newState) current.unpersist } var nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = current.unpersist sc.broadcast(nextModel)
  32. 32. #SAISDD6 nextModel = modelFromState(newState) current.unpersist } var nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = current.unpersist sc.broadcast(nextModel)
  33. 33. #SAISDD6 nextModel = modelFromState(newState) current.unpersist } var nextModel = initialModel for (int i = 0; i < iterations; i++) { val current = sc.broadcast(nextModel) val newState = current.unpersist sc.broadcast(nextModel)
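Written out in Scala (the loop header on the slide is C/Java-style), the iterative broadcast pattern looks roughly like the sketch below; data, trainingStep, combine, and modelFromState are hypothetical stand-ins for whatever the algorithm computes each pass:

    var nextModel = initialModel
    for (i <- 0 until iterations) {
      val current = sc.broadcast(nextModel)   // ship the current model to the executors once
      val newState =
        data.map(x => trainingStep(x, current.value)).reduce(combine)
      nextModel = modelFromState(newState)
      current.unpersist()                     // release the stale broadcast before the next pass
    }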
  34. 34. #SAISDD6 Minding the JVM heap val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0))
  35. 35. #SAISDD6 Minding the JVM heap val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0)) (diagram: the outer Array object carries a header of class pointer, flags, size, and locks plus two element pointers; each inner Array carries its own header plus the doubles 1.0, 2.0 and 3.0, 4.0)
  36. 36. #SAISDD6 Minding the JVM heap val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0)) (diagram, continued) 32 bytes of data…
  37. 37. #SAISDD6 Minding the JVM heap val mat = Array(Array(1.0, 2.0), Array(3.0, 4.0)) (diagram, continued) …and 64 bytes of overhead! 32 bytes of data…
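One common response to that overhead, sketched here rather than taken from the slides, is to store numeric data in a single flat primitive array and index it explicitly, so there is one object header for the whole matrix instead of one per row:

    // the same 2x2 matrix as one Array[Double]: a single header and contiguous unboxed doubles
    val rows = 2
    val cols = 2
    val flat = Array(1.0, 2.0, 3.0, 4.0)
    def at(i: Int, j: Int): Double = flat(i * cols + j)   // row-major indexing
    // at(1, 0) == 3.0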
  38. 38. Continuous integration for Spark libraries and apps
  39. 39. #SAISDD6 local[*]
  40. 40. #SAISDD6 CPU Memory
  41. 41. #SAISDD6
  42. 42. #SAISDD6
  43. 43. #SAISDD6
  44. 44. #SAISDD6 local[2]
  45. 45. #SAISDD6
  46. 46. #SAISDD6
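To make the local[2]-style CI setup concrete, here is a sketch of a per-suite test fixture that starts a small local Spark session; the trait name and the particular settings are illustrative, though SparkSession.builder and ScalaTest's BeforeAndAfterAll are the real APIs:

    import org.apache.spark.sql.SparkSession
    import org.scalatest.{BeforeAndAfterAll, Suite}

    // share one modest SparkSession per suite so CI machines with little CPU and memory cope
    trait LocalSparkSession extends BeforeAndAfterAll { this: Suite =>
      @transient var spark: SparkSession = _

      override def beforeAll(): Unit = {
        super.beforeAll()
        spark = SparkSession.builder()
          .master("local[2]")                           // two local threads, as on the slide
          .appName("library-tests")
          .config("spark.sql.shuffle.partitions", "4")  // keep shuffles tiny in tests
          .getOrCreate()
      }

      override def afterAll(): Unit = {
        try spark.stop() finally super.afterAll()
      }
    }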
  47. 47. Writing generic code for Spark’s parallel collections
  48. 48. #SAISDD6 The RDD is invariant: T <: U does not imply RDD[T] <: RDD[U]
  49. 49. #SAISDD6 The RDD is invariant: T <: U does not imply RDD[T] <: RDD[U] (example: dog <: animal, yet RDD[dog] is not an RDD[animal])
  50. 50. #SAISDD6 T <: U RDD[T] <: RDD[U] trait HasUserId { val userid: Int } case class Transaction(override val userid: Int, timestamp: Int, amount: Double) extends HasUserId {} def badKeyByUserId(r: RDD[HasUserId]) = r.map(x => (x.userid, x)) trait HasUserId { val userid: Int } case class Transaction(override val userid: Int, timestamp: Int, amount: Double) extends HasUserId {} def badKeyByUserId(r: RDD[HasUserId]) = r.map(x => (x.userid, x))
  51. 51. #SAISDD6 T <: U RDD[T] <: RDD[U] trait HasUserId { val userid: Int } case class Transaction(override val userid: Int, timestamp: Int, amount: Double) extends HasUserId {} def badKeyByUserId(r: RDD[HasUserId]) = r.map(x => (x.userid, x)) trait HasUserId { val userid: Int } case class Transaction(override val userid: Int, timestamp: Int, amount: Double) extends HasUserId {} def badKeyByUserId(r: RDD[HasUserId]) = r.map(x => (x.userid, x)) trait HasUserId { val userid: Int } case class Transaction(override val userid: Int, timestamp: Int, amount: Double) extends HasUserId {} def badKeyByUserId(r: RDD[HasUserId]) = r.map(x => (x.userid, x)) trait HasUserId { val userid: Int } case class Transaction(override val userid: Int, timestamp: Int, amount: Double) extends HasUserId {} def badKeyByUserId(r: RDD[HasUserId]) = r.map(x => (x.userid, x))
  52. 52. #SAISDD6 val xacts = spark.parallelize(Array( Transaction(1, 1, 1.0), Transaction(2, 2, 1.0) )) badKeyByUserId(xacts) <console>: error: type mismatch; found : org.apache.spark.rdd.RDD[Transaction] required: org.apache.spark.rdd.RDD[HasUserId] Note: Transaction <: HasUserID, but class RDD is invariant in type T. You may wish to define T as +T instead. (SLS 4.5) badKeyByUserId(xacts)
  53. 53. #SAISDD6 val xacts = spark.parallelize(Array( Transaction(1, 1, 1.0), Transaction(2, 2, 1.0) )) badKeyByUserId(xacts) <console>: error: type mismatch; found : org.apache.spark.rdd.RDD[Transaction] required: org.apache.spark.rdd.RDD[HasUserId] Note: Transaction <: HasUserID, but class RDD is invariant in type T. You may wish to define T as +T instead. (SLS 4.5) badKeyByUserId(xacts)
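A standard way out of that invariance error, sketched here (the deck's own fix may differ in detail), is to make the function generic over the element type with an upper bound, so it accepts any RDD whose elements are HasUserIds without needing RDD to be covariant:

    import org.apache.spark.rdd.RDD

    // T is inferred at the call site, so an RDD[Transaction] is accepted as-is
    def keyByUserId[T <: HasUserId](r: RDD[T]): RDD[(Int, T)] =
      r.map(x => (x.userid, x))

    keyByUserId(xacts)   // RDD[(Int, Transaction)]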
  54. 54. #SAISDD6
  55. 55. #SAISDD6 An example: natural join (diagram: a left frame with columns A B C D E and a right frame with columns A E B X Y)
  56. 56. #SAISDD6 An example: natural join (diagram, continued)
  57. 57. #SAISDD6 An example: natural join (diagram: the joined result has columns A B C D E X Y)
  58. 58. #SAISDD6 Ad-hoc natural join df1.join(df2, df1("a") === df2("a") && df1("b") === df2("b") && df1("e") === df2("e"))
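For context on why the ad-hoc version is worth replacing: the join above keeps both copies of every shared column, so callers end up selecting the duplicates away by hand. A small illustration, with df1 and df2 as on the slide:

    val joined = df1.join(df2, df1("a") === df2("a") &&
                               df1("b") === df2("b") &&
                               df1("e") === df2("e"))
    // joined.columns now contains "a", "b", and "e" twice, once from each side;
    // the natjoin on the following slides builds the predicate from the shared column
    // names and selects each column exactly once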
  59. 59. #SAISDD6 = { val lcols = left.columns val rcols = right.columns val ccols = lcols.toSet intersect rcols.toSet if(ccols.isEmpty) left.limit(0).crossJoin(right.limit(0)) else left .join(right, ccols.map {col => left(col) === right(col) }.reduce(_ && _)) .select(lcols.collect { case c if ccols.contains(c) => left(c) } ++ lcols.collect { case c if !ccols.contains(c) => left(c) } ++ rcols.collect { case c if !ccols.contains(c) => right(c) } : _*) } def natjoin(left: DataFrame, right: DataFrame): DataFrame
  60. 60. #SAISDD6 def natjoin(left: DataFrame, right: DataFrame): DataFrame = { val lcols = left.columns val rcols = right.columns val ccols = lcols.toSet intersect rcols.toSet if(ccols.isEmpty) left.limit(0).crossJoin(right.limit(0)) else left .join(right, ccols.map {col => left(col) === right(col) }.reduce(_ && _)) .select(lcols.collect { case c if ccols.contains(c) => left(c) } ++ lcols.collect { case c if !ccols.contains(c) => left(c) } ++ rcols.collect { case c if !ccols.contains(c) => right(c) } : _*) } def natjoin(left: DataFrame, right: DataFrame): DataFrame = { val lcols = left.columns val rcols = right.columns val ccols = lcols.toSet intersect rcols.toSet if(ccols.isEmpty) left.limit(0).crossJoin(right.limit(0)) else left .join(right, ccols.map {col => left(col) === right(col) }.reduce(_ && _)) .select(lcols.collect { case c if ccols.contains(c) => left(c) } ++ lcols.collect { case c if !ccols.contains(c) => left(c) } ++ rcols.collect { case c if !ccols.contains(c) => right(c) } : _*) }
  61. 61. #SAISDD6 def natjoin(left: DataFrame, right: DataFrame): DataFrame = { val lcols = left.columns val rcols = right.columns val ccols = lcols.toSet intersect rcols.toSet if(ccols.isEmpty) left.limit(0).crossJoin(right.limit(0)) else left .join(right, ccols.map {col => left(col) === right(col) }.reduce(_ && _)) .select(lcols.collect { case c if ccols.contains(c) => left(c) } ++ lcols.collect { case c if !ccols.contains(c) => left(c) } ++ rcols.collect { case c if !ccols.contains(c) => right(c) } : _*) } def natjoin(left: DataFrame, right: DataFrame): DataFrame = { val lcols = left.columns val rcols = right.columns val ccols = lcols.toSet intersect rcols.toSet if(ccols.isEmpty) left.limit(0).crossJoin(right.limit(0)) else left .join(right, ccols.map {col => left(col) === right(col) }.reduce(_ && _)) .select(lcols.collect { case c if ccols.contains(c) => left(c) } ++ lcols.collect { case c if !ccols.contains(c) => left(c) } ++ rcols.collect { case c if !ccols.contains(c) => right(c) } : _*) } introspecting over column names
  62. 62. #SAISDD6 def natjoin(left: DataFrame, right: DataFrame): DataFrame = { val lcols = left.columns val rcols = right.columns val ccols = lcols.toSet intersect rcols.toSet if(ccols.isEmpty) left.limit(0).crossJoin(right.limit(0)) else left .join(right, ccols.map {col => left(col) === right(col) }.reduce(_ && _)) .select(lcols.collect { case c if ccols.contains(c) => left(c) } ++ lcols.collect { case c if !ccols.contains(c) => left(c) } ++ rcols.collect { case c if !ccols.contains(c) => right(c) } : _*) } def natjoin(left: DataFrame, right: DataFrame): DataFrame = { val lcols = left.columns val rcols = right.columns val ccols = lcols.toSet intersect rcols.toSet if(ccols.isEmpty) left.limit(0).crossJoin(right.limit(0)) else left .join(right, ccols.map {col => left(col) === right(col) }.reduce(_ && _)) .select(lcols.collect { case c if ccols.contains(c) => left(c) } ++ lcols.collect { case c if !ccols.contains(c) => left(c) } ++ rcols.collect { case c if !ccols.contains(c) => right(c) } : _*) }
  63. 63. #SAISDD6 def natjoin(left: DataFrame, right: DataFrame): DataFrame = { val lcols = left.columns val rcols = right.columns val ccols = lcols.toSet intersect rcols.toSet if(ccols.isEmpty) left.limit(0).crossJoin(right.limit(0)) else left .join(right, ccols.map {col => left(col) === right(col) }.reduce(_ && _)) .select(lcols.collect { case c if ccols.contains(c) => left(c) } ++ lcols.collect { case c if !ccols.contains(c) => left(c) } ++ rcols.collect { case c if !ccols.contains(c) => right(c) } : _*) } def natjoin(left: DataFrame, right: DataFrame): DataFrame = { val lcols = left.columns val rcols = right.columns val ccols = lcols.toSet intersect rcols.toSet if(ccols.isEmpty) left.limit(0).crossJoin(right.limit(0)) else left .join(right, ccols.map {col => left(col) === right(col) }.reduce(_ && _)) .select(lcols.collect { case c if ccols.contains(c) => left(c) } ++ lcols.collect { case c if !ccols.contains(c) => left(c) } ++ rcols.collect { case c if !ccols.contains(c) => right(c) } : _*) } dynamically constructing expressions
  64. 64. #SAISDD6 def natjoin(left: DataFrame, right: DataFrame): DataFrame = { val lcols = left.columns val rcols = right.columns val ccols = lcols.toSet intersect rcols.toSet if(ccols.isEmpty) left.limit(0).crossJoin(right.limit(0)) else left .join(right, ccols.map {col => left(col) === right(col) }.reduce(_ && _)) .select(lcols.collect { case c if ccols.contains(c) => left(c) } ++ lcols.collect { case c if !ccols.contains(c) => left(c) } ++ rcols.collect { case c if !ccols.contains(c) => right(c) } : _*) } def natjoin(left: DataFrame, right: DataFrame): DataFrame = { val lcols = left.columns val rcols = right.columns val ccols = lcols.toSet intersect rcols.toSet if(ccols.isEmpty) left.limit(0).crossJoin(right.limit(0)) else left .join(right, ccols.map {col => left(col) === right(col) }.reduce(_ && _)) .select(lcols.collect { case c if ccols.contains(c) => left(c) } ++ lcols.collect { case c if !ccols.contains(c) => left(c) } ++ rcols.collect { case c if !ccols.contains(c) => right(c) } : _*) } dynamically constructing expressions
  65. 65. #SAISDD6 def natjoin(left: DataFrame, right: DataFrame): DataFrame = { val lcols = left.columns val rcols = right.columns val ccols = lcols.toSet intersect rcols.toSet if(ccols.isEmpty) left.limit(0).crossJoin(right.limit(0)) else left .join(right, ccols.map {col => left(col) === right(col) }.reduce(_ && _)) .select(lcols.collect { case c if ccols.contains(c) => left(c) } ++ lcols.collect { case c if !ccols.contains(c) => left(c) } ++ rcols.collect { case c if !ccols.contains(c) => right(c) } : _*) } def natjoin(left: DataFrame, right: DataFrame): DataFrame = { val lcols = left.columns val rcols = right.columns val ccols = lcols.toSet intersect rcols.toSet if(ccols.isEmpty) left.limit(0).crossJoin(right.limit(0)) else left .join(right, ccols.map {col => left(col) === right(col) }.reduce(_ && _)) .select(lcols.collect { case c if ccols.contains(c) => left(c) } ++ lcols.collect { case c if !ccols.contains(c) => left(c) } ++ rcols.collect { case c if !ccols.contains(c) => right(c) } : _*) } dynamically constructing expressions
66-71. #SAISDD6
  def natjoin(left: DataFrame, right: DataFrame): DataFrame = {
    val lcols = left.columns
    val rcols = right.columns
    val ccols = lcols.toSet intersect rcols.toSet

    if (ccols.isEmpty)
      left.limit(0).crossJoin(right.limit(0))
    else
      left
        .join(right, ccols.map { col => left(col) === right(col) }.reduce(_ && _))
        .select(lcols.collect { case c if ccols.contains(c) => left(c) } ++
                lcols.collect { case c if !ccols.contains(c) => left(c) } ++
                rcols.collect { case c if !ccols.contains(c) => right(c) } : _*)
  }

  ccols.map { col => left(col) === right(col) } builds [left.a === right.a, left.b === right.b, ...]
  .reduce(_ && _) combines them into left.a === right.a && left.b === right.b && ...
  the final select dynamically constructs the output column list
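A minimal usage sketch of natjoin (the frames and column names below are invented for illustration, not taken from the slides): the two frames share only the userid column, so the join key is inferred and kept once in the result.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.master("local[*]").getOrCreate()
  import spark.implicits._

  // two frames whose only shared column is "userid"
  val names     = Seq((1, "wilma"), (2, "betty")).toDF("userid", "name")
  val purchases = Seq((1, "club"),  (2, "diamond")).toDF("userid", "suit")

  // natjoin is the function from the slide above
  natjoin(names, purchases).show()
  // columns: userid, name, suit -- the shared column appears only once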
72-74. #SAISDD6
  case class DFWithNatJoin(df: DataFrame) extends NaturalJoining {
    def natjoin(other: DataFrame): DataFrame = super.natjoin(df, other)
  }

  object NaturalJoin extends NaturalJoining {
    object implicits {
      implicit def dfWithNatJoin(df: DataFrame) = DFWithNatJoin(df)
    }
  }

  import NaturalJoin.implicits._
  df.natjoin(otherdf)
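The NaturalJoining trait itself is not shown on these slides; a plausible minimal sketch follows. It simply hosts natjoin so that both the standalone NaturalJoin object and the implicit DFWithNatJoin wrapper can share one implementation. For brevity this sketch leans on Spark's usingColumns join overload rather than the hand-built join condition from the previous slides.

  import org.apache.spark.sql.DataFrame

  trait NaturalJoining {
    def natjoin(left: DataFrame, right: DataFrame): DataFrame = {
      val ccols = left.columns.toSet intersect right.columns.toSet
      if (ccols.isEmpty) left.limit(0).crossJoin(right.limit(0))
      else left.join(right, ccols.toSeq, "inner")  // shared columns are kept once
    }
  }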
75-76. #SAISDD6 User-defined functions
  {"a": 1, "b": "wilma", ..., "x": "club"}
  {"a": 2, "b": "betty", ..., "x": "diamond"}
  {"a": 3, "b": "fred", ..., "x": "heart"}
  {"a": 4, "b": "barney", ..., "x": "spade"}

  Goal: pull just the "b" and "x" fields out of each raw JSON record:

  wilma   club
  betty   diamond
  fred    heart
  barney  spade
77-82. #SAISDD6
  import json
  from pyspark.sql.types import *
  from pyspark.sql.functions import udf

  def selectively_structure(fields):
      resultType = StructType([StructField(f, StringType(), nullable=True) for f in fields])
      def impl(js):
          try:
              d = json.loads(js)
              return [str(d.get(f)) for f in fields]
          except:
              return [None] * len(fields)
      return udf(impl, resultType)

  extract_bx = selectively_structure(["b", "x"])
  structured_df = df.withColumn("result", extract_bx("json"))
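Applied to the JSON records shown above, the result column is a struct with string fields b and x, so structured_df.select("result.b", "result.x") recovers the (wilma, club), (betty, diamond), ... pairs. Records that fail to parse yield a row of nulls rather than failing the whole job, which is what the bare except clause is buying.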
83-92. #SAISDD6 Spark's ML pipelines
  estimator.fit(df)     (an Estimator is fit to a data frame and yields a Model)
  model.transform(df)   (a Model transforms a data frame into a new data frame)

  Pipeline stages are configured through params such as inputCol, outputCol, seed, and the number of training epochs.
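As a concrete sketch of the fit/transform pattern, here is a stock pipeline stage configured with the kinds of params named above; the column names and param values are illustrative, and this particular stage spells its epoch count maxIter.

  import org.apache.spark.ml.feature.Word2Vec
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.master("local[*]").getOrCreate()
  import spark.implicits._

  // toy input: a single column of tokenized text
  val df = Seq("spark is fun", "libraries are fun too")
    .map(_.split(" ").toSeq).toDF("words")

  // an Estimator: holds configuration, has seen no data yet
  val estimator = new Word2Vec()
    .setInputCol("words")
    .setOutputCol("features")
    .setSeed(42L)
    .setMinCount(1)
    .setVectorSize(8)
    .setMaxIter(5)

  val model  = estimator.fit(df)     // Estimator => Model (learned state)
  val result = model.transform(df)   // Model => data frame with a new "features" column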
  93. 93. #SAISDD6
94. #SAISDD6 Forecast
  Basic considerations for reusable Spark code
  Generic functions for parallel collections
  Extending data frames with custom aggregates
  Exposing JVM libraries to Python
  Sharing your work with the world
  95. 95. About Erik
  96. 96. User-defined aggregates: the fundamentals
  97. 97. #SAISDD6 Three components
  102. 102. User-defined aggregates: the implementation
103-110. #SAISDD6
  case class TDigestUDAF[N](deltaV: Double, maxDiscreteV: Int)
    (implicit num: Numeric[N], dataTpe: TDigestUDAFDataType[N])
      extends UserDefinedAggregateFunction {

    def deterministic: Boolean = false

    def inputSchema: StructType =
      StructType(StructField("x", dataTpe.tpe) :: Nil)

    def bufferSchema: StructType =
      StructType(StructField("tdigest", TDigestUDT) :: Nil)

    def dataType: DataType = TDigestUDT
111-118. #SAISDD6 Four main functions: initialize and evaluate
  def initialize(buf: MutableAggregationBuffer): Unit = {
    buf(0) = TDigestSQL(TDigest.empty(deltaV, maxDiscreteV))
  }

  def evaluate(buf: Row): Any = buf.getAs[TDigestSQL](0)
119-131. #SAISDD6 Four main functions: update and merge
  def update(buf: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buf(0) = TDigestSQL(buf.getAs[TDigestSQL](0).tdigest +
                          num.toDouble(input.getAs[N](0)))
    }
  }

  def merge(buf1: MutableAggregationBuffer, buf2: Row): Unit = {
    buf1(0) = TDigestSQL(buf1.getAs[TDigestSQL](0).tdigest ++
                         buf2.getAs[TDigestSQL](0).tdigest)
  }
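Once those pieces are in place, the aggregate is applied like any built-in one. A hedged usage sketch, assuming the UDAF lives in the org.isarnproject.sketches.udaf package shown later and that the library supplies the implicit TDigestUDAFDataType[Double] instance:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.col
  import org.isarnproject.sketches.udaf._   // assumed package, per the later slides

  val spark = SparkSession.builder.master("local[*]").getOrCreate()
  import spark.implicits._

  val df = (1 to 1000).map(_.toDouble).toDF("x")

  val sketch = TDigestUDAF[Double](deltaV = 0.5, maxDiscreteV = 0)

  // one t-digest sketch summarizing the whole column
  val sketched = df.agg(sketch(col("x")).alias("tdigest"))
  sketched.show()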
  132. 132. User-defined aggregates: User-defined types
133-135. #SAISDD6 User-defined types
  package org.apache.spark.isarnproject.sketches.udt

  @SQLUserDefinedType(udt = classOf[TDigestUDT])
  case class TDigestSQL(tdigest: TDigest)

  class TDigestUDT extends UserDefinedType[TDigestSQL] {
    def userClass: Class[TDigestSQL] = classOf[TDigestSQL]
    // ....

  (note the org.apache.spark package prefix)
136-139. #SAISDD6 Implementing custom types
  class TDigestUDT extends UserDefinedType[TDigestSQL] {
    def userClass: Class[TDigestSQL] = classOf[TDigestSQL]

    override def pyUDT: String = "isarnproject.sketches.udt.tdigest.TDigestUDT"

    override def typeName: String = "tdigest"

    def sqlType: DataType = StructType(
      StructField("delta", DoubleType, false) ::
      /* ... */
      StructField("clustM", ArrayType(DoubleType, false), false) :: Nil)
140-143. #SAISDD6
  def serialize(tdsql: TDigestSQL): Any = serializeTD(tdsql.tdigest)

  private[sketches] def serializeTD(td: TDigest): InternalRow = {
    val TDigest(delta, maxDiscrete, nclusters, clusters) = td
    val row = new GenericInternalRow(5)
    row.setDouble(0, delta)
    row.setInt(1, maxDiscrete)
    row.setInt(2, nclusters)
    val clustX = clusters.keys.toArray
    val clustM = clusters.values.toArray
    row.update(3, UnsafeArrayData.fromPrimitiveArray(clustX))
    row.update(4, UnsafeArrayData.fromPrimitiveArray(clustM))
    row
  }
144-146. #SAISDD6
  def deserialize(td: Any): TDigestSQL = TDigestSQL(deserializeTD(td))

  private[sketches] def deserializeTD(datum: Any): TDigest = datum match {
    case row: InternalRow =>
      val delta = row.getDouble(0)
      val maxDiscrete = row.getInt(1)
      val nclusters = row.getInt(2)
      val clustX = row.getArray(3).toDoubleArray()
      val clustM = row.getArray(4).toDoubleArray()
      val clusters = clustX.zip(clustM)
        .foldLeft(TDigestMap.empty) { case (td, e) => td + e }
      TDigest(delta, maxDiscrete, nclusters, clusters)
  }
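Because serialize and deserialize round-trip the sketch through the flat sqlType shown above, a t-digest column behaves like any other column: it can flow through shuffles and caching and be collected back to the driver as TDigestSQL values.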
  147. 147. Extending PySpark with your Scala library
152-154. #SAISDD6
  # class to access the active Spark context for Python
  from pyspark.context import SparkContext

  # gateway to the JVM from py4j
  sparkJVM = SparkContext._active_spark_context._jvm

  # use the gateway to access JVM objects and classes
  thisThing = sparkJVM.com.path.to.this.thing
155-157. #SAISDD6 A Python-friendly wrapper
  package org.isarnproject.sketches.udaf

  object pythonBindings {
    def tdigestDoubleUDAF(delta: Double, maxDiscrete: Int) =
      TDigestUDAF[Double](delta, maxDiscrete)
  }

  The binding pins the type parameter to Double, since Python callers cannot supply a Scala type parameter.
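Following that pattern, one binding per concrete element type keeps the Python side free of Scala type parameters. The extra names below (tdigestIntUDAF, tdigestLongUDAF) are illustrative rather than taken from the library, and they assume TDigestUDAFDataType instances exist for those types:

  object pythonBindings {
    def tdigestDoubleUDAF(delta: Double, maxDiscrete: Int) = TDigestUDAF[Double](delta, maxDiscrete)
    def tdigestIntUDAF(delta: Double, maxDiscrete: Int)    = TDigestUDAF[Int](delta, maxDiscrete)
    def tdigestLongUDAF(delta: Double, maxDiscrete: Int)   = TDigestUDAF[Long](delta, maxDiscrete)
  }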
158-160. #SAISDD6
  from pyspark.sql.column import Column, _to_java_column, _to_seq
  from pyspark.context import SparkContext

  # one of these for each type parameter Double, Int, Long, etc
  def tdigestDoubleUDAF(col, delta=0.5, maxDiscrete=0):
      sc = SparkContext._active_spark_context
      pb = sc._jvm.org.isarnproject.sketches.udaf.pythonBindings
      tdapply = pb.tdigestDoubleUDAF(delta, maxDiscrete).apply
      return Column(tdapply(_to_seq(sc, [col], _to_java_column)))
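Because the wrapper returns a pyspark Column wrapping the JVM-side aggregate, it composes with the rest of the DataFrame API: the result can be passed to df.agg(...) or groupBy(...).agg(...) exactly like a built-in aggregate function.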
161-162. #SAISDD6
  class TDigestUDT(UserDefinedType):
      @classmethod
      def sqlType(cls):
          return StructType([
              StructField("delta", DoubleType(), False),
              StructField("maxDiscrete", IntegerType(), False),
              StructField("nclusters", IntegerType(), False),
              StructField("clustX", ArrayType(DoubleType(), False), False),
              StructField("clustM", ArrayType(DoubleType(), False), False)])
      # ...
163-165. #SAISDD6
  class TDigestUDT(UserDefinedType):
      # ...
      @classmethod
      def module(cls):
          return "isarnproject.sketches.udt.tdigest"

      @classmethod
      def scalaUDT(cls):
          return "org.apache.spark.isarnproject.sketches.udt.TDigestUDT"

      def simpleString(self):
          return "tdigest"
166-167. #SAISDD6
  class TDigestUDT(UserDefinedType):
      # ...
      def serialize(self, obj):
          return (obj.delta, obj.maxDiscrete, obj.nclusters,
                  [float(v) for v in obj.clustX],
                  [float(v) for v in obj.clustM])

      def deserialize(self, datum):
          return TDigest(datum[0], datum[1], datum[2], datum[3], datum[4])
168. #SAISDD6
  class TDigestUDT extends UserDefinedType[TDigestSQL] {
    // ...
    override def pyUDT: String = "isarnproject.sketches.udt.tdigest.TDigestUDT"
  }
169-172. #SAISDD6 Python code in JAR files
  mappings in (Compile, packageBin) ++= Seq(
    (baseDirectory.value / "python" / "isarnproject" / "__init__.pyc") ->
      "isarnproject/__init__.pyc",
    (baseDirectory.value / "python" / "isarnproject" / "sketches" / "__init__.pyc") ->
      "isarnproject/sketches/__init__.pyc",
    (baseDirectory.value / "python" / "isarnproject" / "sketches" / "udaf" / "__init__.pyc") ->
      "isarnproject/sketches/udaf/__init__.pyc",
    (baseDirectory.value / "python" / "isarnproject" / "sketches" / "udaf" / "tdigest.pyc") ->
      "isarnproject/sketches/udaf/tdigest.pyc",
    (baseDirectory.value / "python" / "isarnproject" / "sketches" / "udt" / "__init__.pyc") ->
      "isarnproject/sketches/udt/__init__.pyc",
    (baseDirectory.value / "python" / "isarnproject" / "sketches" / "udt" / "tdigest.pyc") ->
      "isarnproject/sketches/udt/tdigest.pyc"
  )
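Packaging the compiled Python modules inside the JAR means that a single artifact, and so a single --packages coordinate, delivers both the JVM classes and the Python bindings to a PySpark session.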
173-175. #SAISDD6 Cross-building for Python
  lazy val compilePython = taskKey[Unit]("Compile python files")

  compilePython := {
    val s: TaskStreams = streams.value
    s.log.info("compiling python...")
    val stat = (Seq(pythonCMD, "-m", "compileall", "python/") !)
    if (stat != 0) { throw new IllegalStateException("python compile failed") }
  }

  (packageBin in Compile) <<= (packageBin in Compile).dependsOn(compilePython)
176-178. #SAISDD6 Using versioned JAR files
  $ pyspark --packages 'org.isarnproject:isarn-sketches-spark_2.11:0.3.0-sp2.2-py2.7'
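The artifact coordinate encodes everything a user needs to match their environment: the Scala cross-version (_2.11), the library version (0.3.0), the Spark version the bindings were built against (sp2.2), and the Python version the bundled .pyc files were compiled with (py2.7).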
179. Show your work: publishing results
  180. 180. #SAISDD6 Developing with git-flow $ brew install git-flow # macOS $ dnf install git-flow # Fedora $ yum install git-flow # CentOS $ apt-get install git-flow # Debian and friends (Search the internet for “git flow” to learn more!)
  181. 181. #SAISDD6 # Set up git-flow in this repository $ git flow init # Start work on my-awesome-feature; create # and switch to a feature branch $ git flow feature start my-awesome-feature $ ... # Finish work on my-awesome-feature; merge # feature/my-awesome-feature to develop $ git flow feature finish my-awesome-feature
  182. 182. #SAISDD6 # Start work on a release branch $ git flow release start 0.1.0 # Hack and bump version numbers $ ... # Finish work on v0.1.0; merge # release/0.1.0 to develop and master; # tag v0.1.0 $ git flow release finish 0.1.0
183. #SAISDD6
                                          Maven Central    Bintray
  easy to set up for library developers   not really       trivial
  easy to set up for library users        trivial          mostly
  easy to publish                         yes, via sbt     yes, via sbt + plugins
  easy to resolve artifacts               yes              mostly
  184. 184. Conclusions and takeaways
  185. 185. #SAISDD6
  186. 186. #SAISDD6
  187. 187. #SAISDD6
  188. 188. #SAISDD6
  189. 189. #SAISDD6 https://radanalytics.io eje@redhat.com • @manyangled willb@redhat.com • @willb KEEP IN TOUCH

As a developer, data engineer, or data scientist, you’ve seen how Apache Spark is expressive enough to let you solve problems elegantly and efficient enough to let you scale out to handle more data. However, if you’re solving the same problems again and again, you probably want to capture and distribute your solutions so that you can focus on new problems and so other people can reuse and remix them: you want to develop a library that extends Spark. You faced a learning curve when you first started using Spark, and you’ll face a different learning curve as you start to develop reusable abstractions atop Spark. In this talk, two experienced Spark library developers will give you the background and context you’ll need to turn your code into a library that you can share with the world. We’ll cover: Issues to consider when developing parallel algorithms with Spark, Designing generic, robust functions that operate on data frames and datasets, Extending data frames with user-defined functions (UDFs) and user-defined aggregates (UDAFs), Best practices around caching and broadcasting, and why these are especially important for library developers, Integrating with ML pipelines, Exposing key functionality in both Python and Scala, and How to test, build, and publish your library for the community. We’ll back up our advice with concrete examples from real packages built atop Spark. You’ll leave this talk informed and inspired to take your Spark proficiency to the next level and develop and publish an awesome library of your own.
