User Defined Aggregation in Apache Spark: A Love Story

Databricks
Erik Erlandson
Principal Software Engineer
All Love Stories Are The Same
Hero Meets Aggregators
Hero Files Spark JIRA
Hero Merges Spark PR
Establish The Plot
Spark’s Scale-Out World
[Figure: the same nine values shown two ways. Logical view: a single column 2, 3, 2, 5, 3, 5, 2, 3, 5. Physical view: three partitions, (2 3 2), (5 3 5), and (2 3 5).]
Scale-Out Sum
Within one partition (2 3 5):
  s = 0
  s = s + 2   (2)
  s = s + 3   (5)
  s = s + 5   (10)
Across partitions:
  2 3 5   10
  5 3 5   13
  2 3 2    7
Merge:
  13 + 7 = 20
  10 + 20 = 30
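The walk-through above can be sketched in plain Scala. This is only a model of the idea, not Spark code: the `ScaleOutSum` object and its use of nested `Seq` values as stand-ins for partitions are assumptions made for illustration.

```scala
// Hypothetical stand-in for Spark partitions: plain Scala collections.
object ScaleOutSum {
  // The three partitions from the slides.
  val partitions: Seq[Seq[Int]] = Seq(Seq(2, 3, 5), Seq(5, 3, 5), Seq(2, 3, 2))

  // Update phase: each partition starts at zero and folds in its own values.
  val partials: Seq[Int] = partitions.map(_.foldLeft(0)((s, x) => s + x))

  // Merge phase: the partial sums are combined into one result.
  val total: Int = partials.reduce(_ + _)
}
```

Because addition is associative and 0 is its identity, the order in which partial sums are merged does not change the total.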
Spark Aggregators
Operation  Data     Accumulator   Zero    Update                  Merge
Sum        Numbers  Number        0       a + x                   a1 + a2
Max        Numbers  Number        -∞      max(a, x)               max(a1, a2)
Average    Numbers  (sum, count)  (0, 0)  (sum + x, count + 1)    (s1 + s2, c1 + c2)
Present (for Average): sum / count
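The columns of the table map onto a small interface. The sketch below is a Spark-free model under that assumption: `SimpleAggregator`, `AverageAgg`, `MaxAgg`, and `run` are hypothetical names chosen for illustration and are not part of any Spark API.

```scala
// Hypothetical model of the aggregator contract: zero, update, merge, present.
trait SimpleAggregator[In, Acc, Out] {
  def zero: Acc
  def update(acc: Acc, x: In): Acc
  def merge(a1: Acc, a2: Acc): Acc
  def present(acc: Acc): Out

  // Fold each partition from zero, merge the partial accumulators, then present.
  def run(partitions: Seq[Seq[In]]): Out =
    present(partitions.map(_.foldLeft(zero)(update)).foldLeft(zero)(merge))
}

// Average: the accumulator is a (sum, count) pair, presented as sum / count.
object AverageAgg extends SimpleAggregator[Double, (Double, Long), Double] {
  def zero: (Double, Long) = (0.0, 0L)
  def update(acc: (Double, Long), x: Double): (Double, Long) = (acc._1 + x, acc._2 + 1)
  def merge(a1: (Double, Long), a2: (Double, Long)): (Double, Long) =
    (a1._1 + a2._1, a1._2 + a2._2)
  def present(acc: (Double, Long)): Double = acc._1 / acc._2
}

// Max: the zero is negative infinity and the accumulator is presented unchanged.
object MaxAgg extends SimpleAggregator[Double, Double, Double] {
  def zero: Double = Double.NegativeInfinity
  def update(acc: Double, x: Double): Double = math.max(acc, x)
  def merge(a1: Double, a2: Double): Double = math.max(a1, a2)
  def present(acc: Double): Double = acc
}
```

Calling `AverageAgg.run` on the three partitions from the earlier slides averages all nine values, regardless of how they are split across partitions.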
Love Interest
Data Sketching: T-Digest
[Figure: a CDF curve with a vertical axis from 0 to 1 and the point (x, q) marked on it. For q = 0.9, x is the 90th percentile.]
Is T-Digest an Aggregator?
Data Type:         Numeric
Accumulator Type:  T-Digest Sketch
Zero:              Empty T-Digest
Update:            tdigest + x
Merge:             tdigest1 + tdigest2
Present:           tdigest.cdfInverse(quantile)
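To make the shape of this contract concrete, here is a toy sketch. `ExactSketch` is a hypothetical stand-in, not a T-Digest: it stores every value, so it is exact but not compact. It borrows the `+`, `++`, and `cdfInverse` names from the code shown in these slides; the rank rule inside `cdfInverse` is an assumption made for the example.

```scala
// Hypothetical stand-in for a T-Digest: exact, but it keeps every value.
final case class ExactSketch(values: Vector[Double]) {
  // Update: add one data point.
  def +(x: Double): ExactSketch = ExactSketch(values :+ x)
  // Merge: combine two sketches.
  def ++(that: ExactSketch): ExactSketch = ExactSketch(values ++ that.values)
  // Present: smallest stored value whose empirical CDF is at least q.
  def cdfInverse(q: Double): Double = {
    require(values.nonEmpty && q > 0.0 && q <= 1.0)
    val sorted = values.sorted
    val rank = math.ceil(q * sorted.length).toInt // 1-based rank
    sorted(rank - 1)
  }
}

object SketchDemo {
  val zero: ExactSketch = ExactSketch(Vector.empty)
  val partitions: Seq[Seq[Double]] =
    Seq(Seq(2.0, 3.0, 2.0), Seq(5.0, 3.0, 5.0), Seq(2.0, 3.0, 5.0))
  // Build one sketch per partition, then merge them.
  val merged: ExactSketch = partitions.map(_.foldLeft(zero)(_ + _)).reduce(_ ++ _)
  val p50: Double = merged.cdfInverse(0.5)
  val p90: Double = merged.cdfInverse(0.9)
}
```

A real T-Digest replaces the stored vector with a bounded set of clusters, which is what makes it practical as an accumulator at scale.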
Romantic Chemistry
val sketchCDF = tdigestUDAF[Double]
spark.udf.register("p50",
  (c: Any) => c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.5))
spark.udf.register("p90",
  (c: Any) => c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.9))
val query = records
  .writeStream //...

+---------+
|wordcount|
+---------+
|       12|
|        5|
|        9|
|       18|
|       12|
+---------+

val r = records.withColumn("time", current_timestamp())
  .groupBy(window($"time", "30 seconds"))
  .agg(sketchCDF($"wordcount").alias("CDF"))
  .select(callUDF("p50", $"CDF").alias("p50"),
          callUDF("p90", $"CDF").alias("p90"))
val query = r.writeStream //...

+----+----+
| p50| p90|
+----+----+
|15.6|31.0|
|16.0|30.8|
|15.8|30.0|
|15.7|31.0|
|16.0|31.0|
+----+----+
Romantic Montage
Sketching Data with T-Digest In Apache Spark
Smart Scalable Feature Reduction With Random Forests
One-Pass Data Science In Apache Spark With Generative T-Digests
Apache Spark for Library Developers
Extending Structured Streaming Made Easy with Algebra
Conflict!
UDAF Anatomy
class TDigestUDAF(deltaV: Double, maxDiscreteV: Int) extends
    UserDefinedAggregateFunction {
  def initialize(buf: MutableAggregationBuffer): Unit =
    buf(0) = TDigestSQL(TDigest.empty(deltaV, maxDiscreteV))
  def update(buf: MutableAggregationBuffer, input: Row): Unit =
    buf(0) = TDigestSQL(buf.getAs[TDigestSQL](0).tdigest + input.getDouble(0))
  def merge(buf1: MutableAggregationBuffer, buf2: Row): Unit =
    buf1(0) = TDigestSQL(buf1.getAs[TDigestSQL](0).tdigest ++
                         buf2.getAs[TDigestSQL](0).tdigest)
  def bufferSchema: StructType = StructType(StructField("tdigest", TDigestUDT) :: Nil)
  // yada yada yada ...
}
User Defined Type Anatomy
class TDigestUDT extends UserDefinedType[TDigestSQL] {
  def sqlType: DataType = StructType(
    StructField("delta", DoubleType, false) ::
    StructField("maxDiscrete", IntegerType, false) ::
    StructField("nclusters", IntegerType, false) ::
    StructField("clustX", ArrayType(DoubleType, false), false) ::
    StructField("clustM", ArrayType(DoubleType, false), false) ::
    Nil)
  def serialize(tdsql: TDigestSQL): Any = { /* pack T-Digest */ }
  def deserialize(datum: Any): TDigestSQL = { /* unpack T-Digest */ }
  // yada yada yada ...
}
(Slide callout on serialize and deserialize: Expensive)
What Could Go Wrong?
class TDigestUDT extends UserDefinedType[TDigestSQL] {
  def serialize(tdsql: TDigestSQL): Any = {
    println("In serialize")
    // ...
  }
  def deserialize(datum: Any): TDigestSQL = {
    println("In deserialize")
    // ...
  }
  // yada yada yada ...
}
What Could Go Wrong?
[Diagram: each partition, (2 3 2), (5 3 5), and (2 3 5), runs Init, Updates, Serialize; the serialized results then feed a single Merge.]
Wait What?
val sketchCDF = tdigestUDAF[Double]
val data = /* data frame with 1000 rows of data */
val sketch = data.agg(sketchCDF($"column").alias("sketch")).first

In deserialize
In serialize
In deserialize
In serialize
… 997 more times!
In deserialize
In serialize
Oh No
def update(buf: MutableAggregationBuffer, input: Row): Unit =
  buf(0) = TDigestSQL(buf.getAs[TDigestSQL](0).tdigest +
                      input.getDouble(0))

// is equivalent to ...

def update(buf: MutableAggregationBuffer, input: Row): Unit = {
  val tdigest = buf.getAs[TDigestSQL](0).tdigest   // deserialize
  val updated = tdigest + input.getDouble(0)       // do the actual update
  buf(0) = TDigestSQL(updated)                     // re-serialize
}
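The cost pattern can be modeled without Spark. In the sketch below everything is hypothetical: `SerdeCounter` stands in for the user defined type's `serialize` and `deserialize`, a `Vector[Double]` stands in for the T-Digest, and `perRow` and `perPartition` are illustrative names for the two buffer-handling styles. It only counts calls; it does not measure real Spark behavior.

```scala
// Hypothetical counter standing in for a UDT's serialize/deserialize methods.
final class SerdeCounter {
  var serialized = 0
  var deserialized = 0
  def serialize(v: Vector[Double]): Array[Double] = { serialized += 1; v.toArray }
  def deserialize(a: Array[Double]): Vector[Double] = { deserialized += 1; a.toVector }
}

object SerdeModel {
  // The buffer only ever holds the serialized form, so every update
  // must deserialize it, apply the update, and serialize it again.
  def perRow(data: Seq[Double]): SerdeCounter = {
    val c = new SerdeCounter
    var buf = Array.empty[Double]
    data.foreach { x => buf = c.serialize(c.deserialize(buf) :+ x) }
    c
  }

  // The accumulator stays a live object during updates and is
  // serialized once, after the last update.
  def perPartition(data: Seq[Double]): SerdeCounter = {
    val c = new SerdeCounter
    val acc = data.foldLeft(Vector.empty[Double])(_ :+ _)
    c.serialize(acc)
    c
  }
}
```

With 1000 input rows, `perRow` records 1000 deserialize calls and 1000 serialize calls, matching the printed output on the earlier slide, while `perPartition` records a single serialize call.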
SPARK-27296
Resolution
#25024
Aggregator Anatomy
class TDigestAggregator(deltaV: Double, maxDiscreteV: Int) extends
    Aggregator[Double, TDigestSQL, TDigestSQL] {
  def zero: TDigestSQL = TDigestSQL(TDigest.empty(deltaV, maxDiscreteV))
  def reduce(b: TDigestSQL, a: Double): TDigestSQL = TDigestSQL(b.tdigest + a)
  def merge(b1: TDigestSQL, b2: TDigestSQL): TDigestSQL =
    TDigestSQL(b1.tdigest ++ b2.tdigest)
  def finish(b: TDigestSQL): TDigestSQL = b
  val serde = ExpressionEncoder[TDigestSQL]()
  def bufferEncoder: Encoder[TDigestSQL] = serde
  def outputEncoder: Encoder[TDigestSQL] = serde
}
Intuitive Serialization
[Diagram: each partition, (2 3 2), (5 3 5), and (2 3 5), runs Init, Updates, Serialize; the serialized results then feed a single Merge.]
Custom Aggregation in Spark 3.0
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udaf

val sketchAgg = TDigestAggregator(0.5, 0)
val sketchCDF: UserDefinedFunction = udaf(sketchAgg)
val sketch = data.agg(sketchCDF($"column")).first
Performance
scala> val sketchOld = TDigestUDAF(0.5, 0)
sketchOld: org.apache.spark.tdigest.TDigestUDAF = // ...

scala> Benchmark.sample(5) { data.agg(sketchOld($"x1")).first }
res4: Array[Double] = Array(6.655, 7.044, 8.875, 9.425, 9.846)

scala> val sketchNew = udaf(TDigestAggregator(0.5, 0))
sketchNew: org.apache.spark.sql.expressions.UserDefinedFunction = // ...

scala> Benchmark.sample(5) { data.agg(sketchNew($"x1")).first }
res5: Array[Double] = Array(0.128, 0.12, 0.118, 0.12, 0.112)

70x Faster
Epilogue
Don’t Give Up
Patience
Respect
Erik Erlandson
Principal Software Engineer
eje@redhat.com
@ManyAngled
User Defined Aggregation in Apache Spark: A Love Story

  • 1. User Defined Aggregation In Apache Spark A Love Story Erik Erlandson Principal Software Engineer
  • 2. All Love Stories Are The Same Hero Meets Aggregators Hero Files Spark JIRA Hero Merges Spark PR
  • 5. Spark’s Scale-Out World 2 3 2 5 3 5 2 3 5 2 3 2 5 3 5 2 3 5 physical logical
  • 7. Scale-Out Sum 2 3 5 s = s + 2 (2)
  • 8. Scale-Out Sum 2 3 5 s = s + 3 (5)
  • 9. Scale-Out Sum 2 3 5 s = s + 5 (10)
  • 11. Scale-Out Sum 2 3 5 10 5 3 5 13
  • 12. Scale-Out Sum 2 3 5 10 5 3 5 13 2 3 2 7
  • 13. Scale-Out Sum 2 3 5 10 5 3 5 13 + 7 = 20 2 3 2
  • 14. Scale-Out Sum 2 3 5 10 + 20 = 30 5 3 5 2 3 2
  • 15. Spark Aggregators Operation Data Accumulator Zero Update Merge Sum Numbers Number 0 a + x a1 + a2
  • 16. Spark Aggregators Operation Data Accumulator Zero Update Merge Sum Numbers Number 0 a + x a1 + a2 Max Numbers Number -∞ max(a, x) max(a1, a2)
  • 17. Spark Aggregators Operation Data Accumulator Zero Update Merge Sum Numbers Number 0 a + x a1 + a2 Max Numbers Number -∞ max(a, x) max(a1, a2) Average Numbers (sum, count) (0, 0) (sum + x, count + 1) (s1 + s2, c1 + c2) Present sum / count
  • 19. Data Sketching: T-Digest q = 0.9 x is the 90th %-ile 0 1 (x,q) CDF
  • 20. Data Sketching: T-Digest q = 0.9 x is the 90th %-ile 0 1 (x,q) CDF
  • 21-22. Is T-Digest an Aggregator?

      Data Type        | Numeric
      Accumulator Type | T-Digest Sketch
      Zero             | Empty T-Digest
      Update           | tdigest + x
      Merge            | tdigest1 + tdigest2
      Present          | tdigest.cdfInverse(quantile)
  • 23. Romantic Chemistry

      val sketchCDF = tdigestUDAF[Double]
      spark.udf.register("p50",
        (c: Any) => c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.5))
      spark.udf.register("p90",
        (c: Any) => c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.9))
  • 24. Romantic Chemistry

      val query = records
        .writeStream //...

      +---------+
      |wordcount|
      +---------+
      |       12|
      |        5|
      |        9|
      |       18|
      |       12|
      +---------+

      val r = records.withColumn("time", current_timestamp())
        .groupBy(window($"time", "30 seconds"))
        .agg(sketchCDF($"wordcount").alias("CDF"))
        .select(callUDF("p50", $"CDF").alias("p50"),
                callUDF("p90", $"CDF").alias("p90"))
      val query = r.writeStream //...

      +----+----+
      | p50| p90|
      +----+----+
      |15.6|31.0|
      |16.0|30.8|
      |15.8|30.0|
      |15.7|31.0|
      |16.0|31.0|
      +----+----+
  • 25. Romantic Montage: Sketching Data with T-Digest In Apache Spark; Smart Scalable Feature Reduction With Random Forests; One-Pass Data Science In Apache Spark With Generative T-Digests; Apache Spark for Library Developers; Extending Structured Streaming Made Easy with Algebra
  • 27-28. UDAF Anatomy

      class TDigestUDAF(deltaV: Double, maxDiscreteV: Int) extends
          UserDefinedAggregateFunction {
        def initialize(buf: MutableAggregationBuffer): Unit =
          buf(0) = TDigestSQL(TDigest.empty(deltaV, maxDiscreteV))
        def update(buf: MutableAggregationBuffer, input: Row): Unit =
          buf(0) = TDigestSQL(buf.getAs[TDigestSQL](0).tdigest + input.getDouble(0))
        def merge(buf1: MutableAggregationBuffer, buf2: Row): Unit =
          buf1(0) = TDigestSQL(buf1.getAs[TDigestSQL](0).tdigest ++
                               buf2.getAs[TDigestSQL](0).tdigest)
        def bufferSchema: StructType =
          StructType(StructField("tdigest", TDigestUDT) :: Nil)
        // yada yada yada ...
      }
  • 29-30. User Defined Type Anatomy (slide 30 adds the callout "Expensive")

      class TDigestUDT extends UserDefinedType[TDigestSQL] {
        def sqlType: DataType = StructType(
          StructField("delta", DoubleType, false) ::
          StructField("maxDiscrete", IntegerType, false) ::
          StructField("nclusters", IntegerType, false) ::
          StructField("clustX", ArrayType(DoubleType, false), false) ::
          StructField("clustM", ArrayType(DoubleType, false), false) ::
          Nil)
        def serialize(tdsql: TDigestSQL): Any = { /* pack T-Digest */ }
        def deserialize(datum: Any): TDigestSQL = { /* unpack T-Digest */ }
        // yada yada yada ...
      }
  • 31. What Could Go Wrong?

      class TDigestUDT extends UserDefinedType[TDigestSQL] {
        def serialize(tdsql: TDigestSQL): Any = {
          print("In serialize")
          // ...
        }
        def deserialize(datum: Any): TDigestSQL = {
          print("In deserialize")
          // ...
        }
        // yada yada yada ...
      }
  • 32. What Could Go Wrong? [Diagram: three partitions (2 3 2), (5 3 5), (2 3 5); each runs Init, Updates, Serialize; the three results feed one Merge]
  • 33. Wait What?

      val sketchCDF = tdigestUDAF[Double]
      val data = /* data frame with 1000 rows of data */
      val sketch = data.agg(sketchCDF($"column").alias("sketch")).first

      In deserialize
      In serialize
      In deserialize
      In serialize
      … 997 more times!
      In deserialize
      In serialize
  • 34-37. Oh No

      def update(buf: MutableAggregationBuffer, input: Row): Unit =
        buf(0) = TDigestSQL(buf.getAs[TDigestSQL](0).tdigest + input.getDouble(0))

      // is equivalent to ...
      def update(buf: MutableAggregationBuffer, input: Row): Unit = {
        val tdigest = buf.getAs[TDigestSQL](0).tdigest  // deserialize
        val updated = tdigest + input.getDouble(0)      // do the actual update
        buf(0) = TDigestSQL(updated)                    // re-serialize
      }
  • 41. Aggregator Anatomy

      class TDigestAggregator(deltaV: Double, maxDiscreteV: Int) extends
          Aggregator[Double, TDigestSQL, TDigestSQL] {
        def zero: TDigestSQL = TDigestSQL(TDigest.empty(deltaV, maxDiscreteV))
        def reduce(b: TDigestSQL, a: Double): TDigestSQL =
          TDigestSQL(b.tdigest + a)
        def merge(b1: TDigestSQL, b2: TDigestSQL): TDigestSQL =
          TDigestSQL(b1.tdigest ++ b2.tdigest)
        def finish(b: TDigestSQL): TDigestSQL = b
        val serde = ExpressionEncoder[TDigestSQL]()
        def bufferEncoder: Encoder[TDigestSQL] = serde
        def outputEncoder: Encoder[TDigestSQL] = serde
      }
  • 42. Intuitive Serialization. [Diagram: three partitions (2 3 2), (5 3 5), (2 3 5); each runs Init, Updates, Serialize; the three results feed one Merge]
  • 43. Custom Aggregation in Spark 3.0

      import org.apache.spark.sql.functions.udaf
      val sketchAgg = TDigestAggregator(0.5, 0)
      val sketchCDF: UserDefinedFunction = udaf(sketchAgg)
      val sketch = data.agg(sketchCDF($"column")).first
  • 44-45. Performance (slide 45 adds the callout "70x Faster")

      scala> val sketchOld = TDigestUDAF(0.5, 0)
      sketchOld: org.apache.spark.tdigest.TDigestUDAF = // ...
      scala> Benchmark.sample(5) { data.agg(sketchOld($"x1")).first }
      res4: Array[Double] = Array(6.655, 7.044, 8.875, 9.425, 9.846)

      scala> val sketchNew = udaf(TDigestAggregator(0.5, 0))
      sketchNew: org.apache.spark.sql.expressions.UserDefinedFunction = // ...
      scala> Benchmark.sample(5) { data.agg(sketchNew($"x1")).first }
      res5: Array[Double] = Array(0.128, 0.12, 0.118, 0.12, 0.112)
  • 50. Erik Erlandson, Principal Software Engineer. eje@redhat.com, @ManyAngled
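
The zero / update / merge / present pattern from the "Spark Aggregators" slides can be tried out without Spark. What follows is a minimal plain-Scala sketch, not Spark's API: the `Agg` trait, the `run` helper, and the `AggregatorSketch` object are hypothetical names invented for this illustration (Spark's own `Aggregator` class names the same steps `zero`, `reduce`, `merge`, and `finish`). The partitions reuse the numbers from the "Scale-Out Sum" slides.

```scala
// Hypothetical trait for illustration only; it mirrors the columns of the
// "Spark Aggregators" table (Zero, Update, Merge, Present).
trait Agg[In, Acc, Out] {
  def zero: Acc
  def update(a: Acc, x: In): Acc
  def merge(a1: Acc, a2: Acc): Acc
  def present(a: Acc): Out
}

object AggregatorSketch {
  // Sum: the accumulator is a number, zero is 0.
  val sum: Agg[Int, Int, Int] = new Agg[Int, Int, Int] {
    def zero: Int = 0
    def update(a: Int, x: Int): Int = a + x
    def merge(a1: Int, a2: Int): Int = a1 + a2
    def present(a: Int): Int = a
  }

  // Average: the accumulator is a (sum, count) pair, presented as sum / count.
  val average: Agg[Int, (Int, Int), Double] = new Agg[Int, (Int, Int), Double] {
    def zero: (Int, Int) = (0, 0)
    def update(a: (Int, Int), x: Int): (Int, Int) = (a._1 + x, a._2 + 1)
    def merge(a1: (Int, Int), a2: (Int, Int)): (Int, Int) =
      (a1._1 + a2._1, a1._2 + a2._2)
    def present(a: (Int, Int)): Double = a._1.toDouble / a._2
  }

  // Fold each partition on its own starting from zero, then merge the
  // partial accumulators and present the final result.
  def run[In, Acc, Out](agg: Agg[In, Acc, Out], partitions: Seq[Seq[In]]): Out = {
    val partials = partitions.map(_.foldLeft(agg.zero)(agg.update))
    agg.present(partials.foldLeft(agg.zero)(agg.merge))
  }

  // The three partitions from the "Scale-Out Sum" slides.
  val partitions: Seq[Seq[Int]] = Seq(Seq(2, 3, 5), Seq(5, 3, 5), Seq(2, 3, 2))
}
```

Calling `AggregatorSketch.run(AggregatorSketch.sum, AggregatorSketch.partitions)` reproduces the slides' partial sums of 10, 13, and 7 and the merged total of 30. A T-Digest fits the same shape, with an empty sketch as zero and sketch union as merge.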
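
The serialization cost described in the "Wait What?" and "Oh No" slides, and the behavior pictured in "Intuitive Serialization", can be mimicked with a toy model. This is a plain-Scala simulation, not Spark code: `Sketch`, `serialize`, `deserialize`, the call counters, and both `...Style` functions are made up for illustration, and a running sum stands in for the T-Digest.

```scala
object SerdeCountSketch {
  // Stand-in for a T-Digest (assumption): just a running total.
  final case class Sketch(total: Double) {
    def +(x: Double): Sketch = Sketch(total + x)
  }

  // Counters playing the role of the print statements in "What Could Go Wrong?".
  var serializeCalls = 0
  var deserializeCalls = 0
  def reset(): Unit = { serializeCalls = 0; deserializeCalls = 0 }

  def serialize(s: Sketch): Array[Double] = { serializeCalls += 1; Array(s.total) }
  def deserialize(raw: Array[Double]): Sketch = { deserializeCalls += 1; Sketch(raw(0)) }

  // UDAF-style: the buffer only holds the serialized form, so every update
  // must deserialize, update, and re-serialize.
  def udafStyle(rows: Seq[Double]): Array[Double] = {
    var buf = serialize(Sketch(0.0)) // initialize
    for (x <- rows) buf = serialize(deserialize(buf) + x)
    buf
  }

  // Aggregator-style: keep the live object while updating and serialize
  // once when the partition is done.
  def aggregatorStyle(rows: Seq[Double]): Array[Double] =
    serialize(rows.foldLeft(Sketch(0.0))(_ + _))
}
```

In this toy model, 1000 rows cost 1000 `deserialize` calls on the UDAF-style path and none on the Aggregator-style path, which is the same shape as the per-row round trips shown in the printed output on the "Wait What?" slide.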