User Defined Aggregation
In Apache Spark
A Love Story
Erik Erlandson
Principal Software Engineer
All Love Stories
Are The Same
Hero Meets Aggregators
Hero Files Spark JIRA
Hero Merges Spark PR
Establish
The Plot
Spark’s Scale-Out World
[Figure: logical view, a single collection of nine values: 2, 3, 2, 5, 3, 5, 2, 3, 5]
[Figure: physical view, the same values split across three partitions: (2 3 2), (5 3 5), (2 3 5)]
Scale-Out Sum
Within one partition (2 3 5):
  s = 0
  s = s + 2   (2)
  s = s + 3   (5)
  s = s + 5   (10)
Per-partition results:
  2 3 5 -> 10
  5 3 5 -> 13
  2 3 2 -> 7
Merging the partial results:
  13 + 7 = 20
  10 + 20 = 30
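The walkthrough above can be sketched in plain Scala, without Spark. This is only an illustration of the update-then-merge pattern; the object name `ScaleOutSum` and its fields are made up for this sketch, and the partitions are the ones shown on the slides.

```scala
object ScaleOutSum {
  // The three physical partitions from the slides.
  val partitions: Seq[Seq[Int]] = Seq(Seq(2, 3, 5), Seq(5, 3, 5), Seq(2, 3, 2))

  // Update phase: each partition folds its own elements, starting from zero.
  val partials: Seq[Int] = partitions.map(p => p.foldLeft(0)((s, x) => s + x))

  // Merge phase: the partial sums are combined into one result.
  val total: Int = partials.reduce((a1, a2) => a1 + a2)

  def main(args: Array[String]): Unit =
    println(s"partials = $partials, total = $total")
}
```

The partial sums come out as 10, 13 and 7, and merging them gives 30, matching the slides.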
Spark Aggregators
Operation | Data    | Accumulator  | Zero   | Update               | Merge              | Present
Sum       | Numbers | Number       | 0      | a + x                | a1 + a2            |
Max       | Numbers | Number       | -∞     | max(a, x)            | max(a1, a2)        |
Average   | Numbers | (sum, count) | (0, 0) | (sum + x, count + 1) | (s1 + s2, c1 + c2) | sum / count
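One way to read the table is as an interface with four operations. The sketch below uses a hypothetical `SimpleAggregator` trait (not Spark's own `Aggregator` class) and assumes that Sum and Max present their accumulator unchanged, which the table leaves blank.

```scala
// Hypothetical trait for illustration; this is not Spark's Aggregator class.
trait SimpleAggregator[In, Acc, Out] {
  def zero: Acc
  def update(a: Acc, x: In): Acc
  def merge(a1: Acc, a2: Acc): Acc
  def present(a: Acc): Out

  // Fold each partition from zero, merge the partial accumulators, then present.
  def run(partitions: Seq[Seq[In]]): Out = {
    val partials = partitions.map(p => p.foldLeft(zero)((a, x) => update(a, x)))
    present(partials.foldLeft(zero)((a1, a2) => merge(a1, a2)))
  }
}

object AggregatorTable {
  val sum: SimpleAggregator[Double, Double, Double] =
    new SimpleAggregator[Double, Double, Double] {
      def zero: Double = 0.0
      def update(a: Double, x: Double): Double = a + x
      def merge(a1: Double, a2: Double): Double = a1 + a2
      def present(a: Double): Double = a // assumption: identity
    }

  val max: SimpleAggregator[Double, Double, Double] =
    new SimpleAggregator[Double, Double, Double] {
      def zero: Double = Double.NegativeInfinity
      def update(a: Double, x: Double): Double = math.max(a, x)
      def merge(a1: Double, a2: Double): Double = math.max(a1, a2)
      def present(a: Double): Double = a // assumption: identity
    }

  val average: SimpleAggregator[Double, (Double, Long), Double] =
    new SimpleAggregator[Double, (Double, Long), Double] {
      def zero: (Double, Long) = (0.0, 0L)
      def update(a: (Double, Long), x: Double): (Double, Long) = (a._1 + x, a._2 + 1)
      def merge(a1: (Double, Long), a2: (Double, Long)): (Double, Long) =
        (a1._1 + a2._1, a1._2 + a2._2)
      def present(a: (Double, Long)): Double = a._1 / a._2
    }

  // The partitions used in the scale-out slides.
  val data: Seq[Seq[Double]] =
    Seq(Seq(2.0, 3.0, 5.0), Seq(5.0, 3.0, 5.0), Seq(2.0, 3.0, 2.0))

  def main(args: Array[String]): Unit = {
    println(s"sum = ${sum.run(data)}")
    println(s"max = ${max.run(data)}")
    println(s"average = ${average.run(data)}")
  }
}
```

Average is the row that needs a separate present step: its accumulator is a (sum, count) pair, and only the final division turns it into the answer.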
Love Interest
Data Sketching: T-Digest
[Figure: a CDF curve with a vertical axis from 0 to 1; the point (x, q) at q = 0.9 marks x as the 90th percentile]
Is T-Digest an Aggregator?
Data Type:        Numeric
Accumulator Type: T-Digest Sketch
Zero:             Empty T-Digest
Update:           tdigest + x
Merge:            tdigest1 + tdigest2
Present:          tdigest.cdfInverse(quantile)
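To make the zero / update / merge / present shape concrete without depending on a T-Digest library, here is a deliberately naive stand-in. `ToySketch` is hypothetical: it keeps every value in sorted order, so it is exact but unbounded in size, whereas a real T-Digest keeps a compact summary. Only the operator names `+`, `++` and `cdfInverse` are borrowed from the slides; the nearest-rank quantile rule is an assumption of this sketch.

```scala
// Toy stand-in for a T-Digest (hypothetical): exact, but stores every value.
final case class ToySketch(values: Vector[Double]) {
  // Update: insert one value, keeping the vector sorted.
  def +(x: Double): ToySketch = {
    val (lo, hi) = values.span(_ <= x)
    ToySketch((lo :+ x) ++ hi)
  }

  // Merge: insert every value of the other sketch into this one.
  def ++(that: ToySketch): ToySketch =
    that.values.foldLeft(this)((s, x) => s + x)

  // Present: nearest-rank quantile, standing in for cdfInverse.
  def cdfInverse(q: Double): Double = {
    require(values.nonEmpty && q > 0.0 && q <= 1.0, "need data and 0 < q <= 1")
    values(math.ceil(q * values.length).toInt - 1)
  }
}

object ToySketchDemo {
  // Zero: the empty sketch.
  val empty: ToySketch = ToySketch(Vector.empty)

  // The partitions used in the scale-out slides.
  val partitions: Seq[Seq[Double]] =
    Seq(Seq(2.0, 3.0, 5.0), Seq(5.0, 3.0, 5.0), Seq(2.0, 3.0, 2.0))

  // Update within each partition, then merge across partitions.
  val sketch: ToySketch =
    partitions
      .map(p => p.foldLeft(empty)((s, x) => s + x))
      .foldLeft(empty)((s1, s2) => s1 ++ s2)

  def main(args: Array[String]): Unit =
    println(s"p50 = ${sketch.cdfInverse(0.5)}, p90 = ${sketch.cdfInverse(0.9)}")
}
```

Because merge produces the same sketch regardless of how the data was partitioned, the quantiles read from the merged sketch match what a single pass over all nine values would give.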
Romantic Chemistry
val sketchCDF = tdigestUDAF[Double]
spark.udf.register("p50",
  (c: Any) => c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.5))
spark.udf.register("p90",
  (c: Any) => c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.9))
Romantic Chemistry
val query = records
.writeStream //...
+---------+
|wordcount|
+---------+
| 12|
| 5|
| 9|
| 18|
| 12|
+---------+
val r = records.withColumn("time", current_timestamp())
  .groupBy(window($"time", "30 seconds"))
  .agg(sketchCDF($"wordcount").alias("CDF"))
  .select(callUDF("p50", $"CDF").alias("p50"),
          callUDF("p90", $"CDF").alias("p90"))
val query = r.writeStream //...
+----+----+
| p50| p90|
+----+----+
|15.6|31.0|
|16.0|30.8|
|15.8|30.0|
|15.7|31.0|
|16.0|31.0|
+----+----+
Romantic Montage
Sketching Data with T-Digest In Apache Spark
Smart Scalable Feature Reduction With Random Forests
One-Pass Data Science In Apache Spark With Generative T-Digests
Apache Spark for Library Developers
Extending Structured Streaming Made Easy with Algebra
Conflict!
UDAF Anatomy
class TDigestUDAF(deltaV: Double, maxDiscreteV: Int) extends
    UserDefinedAggregateFunction {
  def initialize(buf: MutableAggregationBuffer): Unit =
    buf(0) = TDigestSQL(TDigest.empty(deltaV, maxDiscreteV))
  def update(buf: MutableAggregationBuffer, input: Row): Unit =
    buf(0) = TDigestSQL(buf.getAs[TDigestSQL](0).tdigest + input.getDouble(0))
  def merge(buf1: MutableAggregationBuffer, buf2: Row): Unit =
    buf1(0) = TDigestSQL(buf1.getAs[TDigestSQL](0).tdigest ++
      buf2.getAs[TDigestSQL](0).tdigest)
  def bufferSchema: StructType = StructType(StructField("tdigest", TDigestUDT) :: Nil)
  // yada yada yada ...
}
User Defined Type Anatomy
class TDigestUDT extends UserDefinedType[TDigestSQL] {
  def sqlType: DataType = StructType(
    StructField("delta", DoubleType, false) ::
    StructField("maxDiscrete", IntegerType, false) ::
    StructField("nclusters", IntegerType, false) ::
    StructField("clustX", ArrayType(DoubleType, false), false) ::
    StructField("clustM", ArrayType(DoubleType, false), false) ::
    Nil)
  def serialize(tdsql: TDigestSQL): Any = { /* pack T-Digest */ }
  def deserialize(datum: Any): TDigestSQL = { /* unpack T-Digest */ }
  // yada yada yada ...
}
serialize and deserialize: Expensive
What Could Go Wrong?
class TDigestUDT extends UserDefinedType[TDigestSQL] {
  def serialize(tdsql: TDigestSQL): Any = {
    print("In serialize")
    // ...
  }
  def deserialize(datum: Any): TDigestSQL = {
    print("In deserialize")
    // ...
  }
  // yada yada yada ...
}
What Could Go Wrong?
2 3 2 : Init -> Updates -> Serialize
5 3 5 : Init -> Updates -> Serialize
2 3 5 : Init -> Updates -> Serialize
Merge
Wait What?
val sketchCDF = tdigestUDAF[Double]
val data = /* data frame with 1000 rows of data */
val sketch = data.agg(sketchCDF($"column").alias("sketch")).first
In deserialize
In serialize
In deserialize
In serialize
... 997 more times!
In deserialize
In serialize
Oh No
def update(buf: MutableAggregationBuffer, input: Row): Unit =
  buf(0) = TDigestSQL(buf.getAs[TDigestSQL](0).tdigest +
    input.getDouble(0))
// is equivalent to ...
def update(buf: MutableAggregationBuffer, input: Row): Unit = {
  val tdigest = buf.getAs[TDigestSQL](0).tdigest   // deserialize
  val updated = tdigest + input.getDouble(0)       // do the actual update
  buf(0) = TDigestSQL(updated)                     // re-serialize
}
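The cost pattern can be imitated outside Spark by counting conversions. The sketch below is a simulation under stated assumptions, not Spark's internals: `Acc`, `Counters`, `udafStyle` and `aggregatorStyle` are hypothetical names, the accumulator is a simple (sum, count) pair rather than a T-Digest, and the 1000 rows mirror the example above.

```scala
object SerdeCost {
  // Stand-in accumulator (hypothetical); the talk's accumulator is a T-Digest.
  final case class Acc(sum: Double, n: Long)

  // Counts how often the pretend-expensive conversions run.
  final class Counters { var serialized = 0; var deserialized = 0 }

  def serialize(a: Acc, c: Counters): Vector[Double] = {
    c.serialized += 1
    Vector(a.sum, a.n.toDouble)
  }

  def deserialize(d: Vector[Double], c: Counters): Acc = {
    c.deserialized += 1
    Acc(d(0), d(1).toLong)
  }

  // UDAF-style: the buffer only holds the serialized form, so every row
  // pays one deserialize and one serialize.
  def udafStyle(rows: Seq[Double], c: Counters): Vector[Double] = {
    var buf = serialize(Acc(0.0, 0L), c) // initialize
    for (x <- rows) {
      val acc = deserialize(buf, c)
      buf = serialize(Acc(acc.sum + x, acc.n + 1), c)
    }
    buf
  }

  // Aggregator-style: update the in-memory object, serialize once at the end.
  def aggregatorStyle(rows: Seq[Double], c: Counters): Vector[Double] = {
    val acc = rows.foldLeft(Acc(0.0, 0L))((a, x) => Acc(a.sum + x, a.n + 1))
    serialize(acc, c)
  }

  val rows: Seq[Double] = (1 to 1000).map(_.toDouble)
  val udafCounters = new Counters
  val aggCounters = new Counters
  val udafResult: Vector[Double] = udafStyle(rows, udafCounters)
  val aggResult: Vector[Double] = aggregatorStyle(rows, aggCounters)

  def main(args: Array[String]): Unit = {
    println(s"UDAF-style: ${udafCounters.serialized} serialize, ${udafCounters.deserialized} deserialize")
    println(s"Aggregator-style: ${aggCounters.serialized} serialize, ${aggCounters.deserialized} deserialize")
  }
}
```

Both paths produce the same serialized result, but the per-row path converts on every row while the in-memory path converts once, which is the difference the printed trace exposes.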
SPARK-27296
Resolution
#25024
Aggregator Anatomy
class TDigestAggregator(deltaV: Double, maxDiscreteV: Int) extends
    Aggregator[Double, TDigestSQL, TDigestSQL] {
  def zero: TDigestSQL = TDigestSQL(TDigest.empty(deltaV, maxDiscreteV))
  def reduce(b: TDigestSQL, a: Double): TDigestSQL = TDigestSQL(b.tdigest + a)
  def merge(b1: TDigestSQL, b2: TDigestSQL): TDigestSQL =
    TDigestSQL(b1.tdigest ++ b2.tdigest)
  def finish(b: TDigestSQL): TDigestSQL = b
  val serde = ExpressionEncoder[TDigestSQL]()
  def bufferEncoder: Encoder[TDigestSQL] = serde
  def outputEncoder: Encoder[TDigestSQL] = serde
}
Intuitive Serialization
2 3 2 : Init -> Updates -> Serialize
5 3 5 : Init -> Updates -> Serialize
2 3 5 : Init -> Updates -> Serialize
Merge
Custom Aggregation in Spark 3.0
import org.apache.spark.sql.functions.udaf
val sketchAgg = TDigestAggregator(0.5, 0)
val sketchCDF: UserDefinedFunction = udaf(sketchAgg)
val sketch = data.agg(sketchCDF($"column")).first
Performance
scala> val sketchOld = TDigestUDAF(0.5, 0)
sketchOld: org.apache.spark.tdigest.TDigestUDAF = // ...
scala> Benchmark.sample(5) { data.agg(sketchOld($"x1")).first }
res4: Array[Double] = Array(6.655, 7.044, 8.875, 9.425, 9.846)
scala> val sketchNew = udaf(TDigestAggregator(0.5, 0))
sketchNew: org.apache.spark.sql.expressions.UserDefinedFunction = // ...
scala> Benchmark.sample(5) { data.agg(sketchNew($"x1")).first }
res5: Array[Double] = Array(0.128, 0.12, 0.118, 0.12, 0.112)
70x Faster
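The 70x figure can be cross-checked from the two sample arrays. The slides do not say how the ratio was derived; comparing the means of the five samples is an assumption of this sketch, and `Speedup` is a made-up name.

```scala
object Speedup {
  // Timings copied from the REPL output above.
  val oldTimes: Seq[Double] = Seq(6.655, 7.044, 8.875, 9.425, 9.846)
  val newTimes: Seq[Double] = Seq(0.128, 0.12, 0.118, 0.12, 0.112)

  def mean(xs: Seq[Double]): Double = xs.sum / xs.length

  // Ratio of mean timings (assumed method; the slides only state the result).
  val speedup: Double = mean(oldTimes) / mean(newTimes)

  def main(args: Array[String]): Unit =
    println(f"speedup = $speedup%.1fx")
}
```

With these numbers the ratio of means rounds to 70, consistent with the slide.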
Epilogue
Don’t Give Up
Patience
Respect
Erik Erlandson
Principal Software Engineer
eje@redhat.com
@ManyAngled

Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
correoyaya
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
theahmadsaood
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
alex933524
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 

Recently uploaded (20)

【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
tapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive datatapal brand analysis PPT slide for comptetive data
tapal brand analysis PPT slide for comptetive data
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 

User Defined Aggregation in Apache Spark: A Love Story

  • 1. User Defined Aggregation in Apache Spark: A Love Story. Erik Erlandson, Principal Software Engineer
  • 2. All Love Stories Are the Same: Hero Meets Aggregators; Hero Files Spark JIRA; Hero Merges Spark PR
  • 5. Spark’s Scale-Out World. Logical view: a single column of values (2, 3, 2, 5, 3, 5, 2, 3, 5). Physical view: three partitions, [2, 3, 2], [5, 3, 5], and [2, 3, 5].
  • 7. Scale-Out Sum, partition [2, 3, 5]: s = s + 2 (s is now 2)
  • 8. Scale-Out Sum, partition [2, 3, 5]: s = s + 3 (s is now 5)
  • 9. Scale-Out Sum, partition [2, 3, 5]: s = s + 5 (s is now 10)
  • 11. Scale-Out Sum, partial results: [2, 3, 5] sums to 10; [5, 3, 5] sums to 13
  • 12. Scale-Out Sum, partial results: [2, 3, 5] sums to 10; [5, 3, 5] sums to 13; [2, 3, 2] sums to 7
  • 13. Scale-Out Sum, merge: 13 + 7 = 20
  • 14. Scale-Out Sum, merge: 10 + 20 = 30
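
The scale-out sum on the preceding slides can be sketched in plain Scala, with no Spark dependency. This is only an illustration of the update-then-merge pattern; the object name `ScaleOutSum` and the in-memory `Seq` partitions are assumptions made for the example, not part of the talk.

```scala
object ScaleOutSum {
  // Three partitions, with the same values as on the slides.
  val partitions: Seq[Seq[Int]] = Seq(Seq(2, 3, 5), Seq(5, 3, 5), Seq(2, 3, 2))

  // Update phase: each partition folds its own values, starting from the zero value 0.
  val partials: Seq[Int] = partitions.map(_.foldLeft(0)((s, x) => s + x))

  // Merge phase: the per-partition partial sums are combined into one result.
  val total: Int = partials.reduce(_ + _)

  def main(args: Array[String]): Unit = {
    println(partials) // partial sums 10, 13 and 7
    println(total)    // 30
  }
}
```

Because addition is associative and commutative, the partial sums can be merged in any order and still give the same total.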
  • 17. Spark Aggregators
      Operation | Data    | Accumulator  | Zero   | Update               | Merge              | Present
      Sum       | Numbers | Number       | 0      | a + x                | a1 + a2            |
      Max       | Numbers | Number       | -∞     | max(a, x)            | max(a1, a2)        |
      Average   | Numbers | (sum, count) | (0, 0) | (sum + x, count + 1) | (s1 + s2, c1 + c2) | sum / count
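
The rows of the aggregator table map directly onto a small interface. The sketch below is plain Scala; the trait `Agg` and its method names are hypothetical and chosen to mirror the table columns, and it is not Spark's own `Aggregator` API.

```scala
// Hypothetical trait for illustration only; the method names follow the table columns.
trait Agg[In, Acc, Out] {
  def zero: Acc
  def update(a: Acc, x: In): Acc
  def merge(a1: Acc, a2: Acc): Acc
  def present(a: Acc): Out

  // Fold each partition from zero, merge the partial accumulators, then present.
  def run(partitions: Seq[Seq[In]]): Out =
    present(partitions.map(_.foldLeft(zero)(update)).foldLeft(zero)(merge))
}

object SumAgg extends Agg[Double, Double, Double] {
  def zero: Double = 0.0
  def update(a: Double, x: Double): Double = a + x
  def merge(a1: Double, a2: Double): Double = a1 + a2
  def present(a: Double): Double = a
}

object MaxAgg extends Agg[Double, Double, Double] {
  def zero: Double = Double.NegativeInfinity
  def update(a: Double, x: Double): Double = math.max(a, x)
  def merge(a1: Double, a2: Double): Double = math.max(a1, a2)
  def present(a: Double): Double = a
}

object AvgAgg extends Agg[Double, (Double, Long), Double] {
  def zero: (Double, Long) = (0.0, 0L)
  def update(a: (Double, Long), x: Double): (Double, Long) = (a._1 + x, a._2 + 1)
  def merge(a1: (Double, Long), a2: (Double, Long)): (Double, Long) =
    (a1._1 + a2._1, a1._2 + a2._2)
  // Only Average needs a real "present" step: sum / count.
  def present(a: (Double, Long)): Double = a._1 / a._2
}
```

With the partitions `Seq(Seq(2.0, 3.0, 5.0), Seq(5.0, 3.0, 5.0), Seq(2.0, 3.0, 2.0))`, `SumAgg.run` gives 30.0 and `MaxAgg.run` gives 5.0. Average cannot merge two averages directly, which is why its accumulator carries the pair (sum, count).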
  • 19. Data Sketching: T-Digest. [Figure: CDF curve with a vertical axis from 0 to 1; the point (x, q) at q = 0.9 marks x as the 90th percentile.]
  • 21. Is T-Digest an Aggregator? Data Type: Numeric; Accumulator Type: T-Digest Sketch; Zero: Empty T-Digest; Update: tdigest + x; Merge: tdigest1 + tdigest2; Present: tdigest.cdfInverse(quantile)
  • 23. Romantic Chemistry
      val sketchCDF = tdigestUDAF[Double]
      spark.udf.register("p50",
        (c: Any) => c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.5))
      spark.udf.register("p90",
        (c: Any) => c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.9))
  • 24. Romantic Chemistry
      val query = records
        .writeStream //...
      +---------+
      |wordcount|
      +---------+
      |       12|
      |        5|
      |        9|
      |       18|
      |       12|
      +---------+
      val r = records.withColumn("time", current_timestamp())
        .groupBy(window($"time", "30 seconds"))
        .agg(sketchCDF($"wordcount").alias("CDF"))
        .select(callUDF("p50", $"CDF").alias("p50"), callUDF("p90", $"CDF").alias("p90"))
      val query = r.writeStream //...
      +----+----+
      | p50| p90|
      +----+----+
      |15.6|31.0|
      |16.0|30.8|
      |15.8|30.0|
      |15.7|31.0|
      |16.0|31.0|
      +----+----+
  • 25. Romantic Montage: Sketching Data with T-Digest In Apache Spark; Smart Scalable Feature Reduction With Random Forests; One-Pass Data Science In Apache Spark With Generative T-Digests; Apache Spark for Library Developers; Extending Structured Streaming Made Easy with Algebra
  • 27. UDAF Anatomy
      class TDigestUDAF(deltaV: Double, maxDiscreteV: Int) extends UserDefinedAggregateFunction {
        def initialize(buf: MutableAggregationBuffer): Unit =
          buf(0) = TDigestSQL(TDigest.empty(deltaV, maxDiscreteV))
        def update(buf: MutableAggregationBuffer, input: Row): Unit =
          buf(0) = TDigestSQL(buf.getAs[TDigestSQL](0).tdigest + input.getDouble(0))
        def merge(buf1: MutableAggregationBuffer, buf2: Row): Unit =
          buf1(0) = TDigestSQL(buf1.getAs[TDigestSQL](0).tdigest ++ buf2.getAs[TDigestSQL](0).tdigest)
        def bufferSchema: StructType = StructType(StructField("tdigest", TDigestUDT) :: Nil)
        // yada yada yada ...
      }
  • 30. User Defined Type Anatomy
      class TDigestUDT extends UserDefinedType[TDigestSQL] {
        def sqlType: DataType = StructType(
          StructField("delta", DoubleType, false) ::
          StructField("maxDiscrete", IntegerType, false) ::
          StructField("nclusters", IntegerType, false) ::
          StructField("clustX", ArrayType(DoubleType, false), false) ::
          StructField("clustM", ArrayType(DoubleType, false), false) ::
          Nil)
        def serialize(tdsql: TDigestSQL): Any = { /* pack T-Digest */ }
        def deserialize(datum: Any): TDigestSQL = { /* unpack T-Digest */ }
        // yada yada yada ...
      }
      [Slide callout: Expensive]
  • 31. What Could Go Wrong?
      class TDigestUDT extends UserDefinedType[TDigestSQL] {
        def serialize(tdsql: TDigestSQL): Any = {
          print("In serialize")
          // ...
        }
        def deserialize(datum: Any): TDigestSQL = {
          print("In deserialize")
          // ...
        }
        // yada yada yada ...
      }
  • 32. What Could Go Wrong? [Figure: each of the three partitions (2 3 2, 5 3 5, 2 3 5) runs Init, then Updates, then Serialize; the serialized results are then combined by Merge.]
  • 33. Wait What?
      val sketchCDF = tdigestUDAF[Double]
      val data = /* data frame with 1000 rows of data */
      val sketch = data.agg(sketchCDF($"column").alias("sketch")).first
      Output:
      In deserialize
      In serialize
      In deserialize
      In serialize
      … 997 more times!
      In deserialize
      In serialize
  • 37. Oh No
      def update(buf: MutableAggregationBuffer, input: Row): Unit =
        buf(0) = TDigestSQL(buf.getAs[TDigestSQL](0).tdigest + input.getDouble(0))
      // is equivalent to ...
      def update(buf: MutableAggregationBuffer, input: Row): Unit = {
        val tdigest = buf.getAs[TDigestSQL](0).tdigest // deserialize
        val updated = tdigest + input.getDouble(0)     // do the actual update
        buf(0) = TDigestSQL(updated)                   // re-serialize
      }
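
The cost pattern described on these slides can be reproduced without Spark. The sketch below is a plain-Scala simulation under stated assumptions: `Sketch` is a stand-in that only keeps a running sum (it is not a T-Digest), and `Codec` is a hypothetical helper that counts calls. It compares a buffer that stores the serialized form and round-trips on every row against an accumulator that stays in memory and is serialized once at the end.

```scala
import java.nio.ByteBuffer

object SerdeCost {
  // Stand-in accumulator: a running sum plays the role of the sketch.
  final case class Sketch(sum: Double)

  // Hypothetical codec that counts how often each direction is invoked.
  final class Codec {
    var serialized = 0
    var deserialized = 0
    def serialize(s: Sketch): Array[Byte] = {
      serialized += 1
      ByteBuffer.allocate(8).putDouble(s.sum).array()
    }
    def deserialize(bytes: Array[Byte]): Sketch = {
      deserialized += 1
      Sketch(ByteBuffer.wrap(bytes).getDouble)
    }
  }

  // The buffer holds serialized bytes, so every update pays a full round trip.
  def perRowRoundTrip(data: Seq[Double], codec: Codec): Sketch = {
    var buf = codec.serialize(Sketch(0.0))            // initialize
    data.foreach { x =>
      val s = codec.deserialize(buf)                  // deserialize
      buf = codec.serialize(Sketch(s.sum + x))        // update, then re-serialize
    }
    codec.deserialize(buf)
  }

  // The accumulator stays a live object; serialization happens once per partition.
  def serializeOnce(data: Seq[Double], codec: Codec): Sketch = {
    val s = data.foldLeft(Sketch(0.0))((acc, x) => Sketch(acc.sum + x))
    codec.deserialize(codec.serialize(s))
  }
}
```

For 1000 input rows, the first variant performs on the order of 1000 serialize and 1000 deserialize calls, while the second performs one of each. When packing and unpacking the accumulator is expensive, that difference dominates the run time.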
  • 41. Aggregator Anatomy
      class TDigestAggregator(deltaV: Double, maxDiscreteV: Int) extends Aggregator[Double, TDigestSQL, TDigestSQL] {
        def zero: TDigestSQL = TDigestSQL(TDigest.empty(deltaV, maxDiscreteV))
        def reduce(b: TDigestSQL, a: Double): TDigestSQL = TDigestSQL(b.tdigest + a)
        def merge(b1: TDigestSQL, b2: TDigestSQL): TDigestSQL = TDigestSQL(b1.tdigest ++ b2.tdigest)
        def finish(b: TDigestSQL): TDigestSQL = b
        val serde = ExpressionEncoder[TDigestSQL]()
        def bufferEncoder: Encoder[TDigestSQL] = serde
        def outputEncoder: Encoder[TDigestSQL] = serde
      }
  • 42. Intuitive Serialization [Figure: each of the three partitions (2 3 2, 5 3 5, 2 3 5) runs Init, then Updates, then Serialize exactly once; the serialized results are then combined by Merge.]
  • 43. Custom Aggregation in Spark 3.0
      import org.apache.spark.sql.functions.udaf
      val sketchAgg = TDigestAggregator(0.5, 0)
      val sketchCDF: UserDefinedFunction = udaf(sketchAgg)
      val sketch = data.agg(sketchCDF($"column")).first
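
The same registration pattern can be tried with a much simpler accumulator. The sketch below computes an average instead of a T-Digest; it assumes Spark 3.0 or later (`spark-sql`) is on the classpath, and the names `AvgBuf`, `MeanAgg`, and `UdafDemo` are hypothetical. It is meant as an illustration of `Aggregator` plus `functions.udaf`, not as the talk's implementation.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf

// Accumulator for an average: the (sum, count) pair from the aggregator table.
case class AvgBuf(sum: Double, count: Long)

object MeanAgg extends Aggregator[Double, AvgBuf, Double] {
  def zero: AvgBuf = AvgBuf(0.0, 0L)
  def reduce(b: AvgBuf, x: Double): AvgBuf = AvgBuf(b.sum + x, b.count + 1)
  def merge(b1: AvgBuf, b2: AvgBuf): AvgBuf = AvgBuf(b1.sum + b2.sum, b1.count + b2.count)
  // Present step; an empty input yields NaN rather than dividing by zero.
  def finish(b: AvgBuf): Double = if (b.count == 0) Double.NaN else b.sum / b.count
  def bufferEncoder: Encoder[AvgBuf] = Encoders.product[AvgBuf]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

object UdafDemo {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation only.
    val spark = SparkSession.builder().master("local[2]").appName("udaf-demo").getOrCreate()
    import spark.implicits._

    val data = Seq(2.0, 3.0, 5.0, 5.0, 3.0, 5.0, 2.0, 3.0, 2.0).toDF("x")
    val mean = udaf(MeanAgg) // wraps the typed Aggregator as a column function
    println(data.agg(mean($"x")).first.getDouble(0))
    spark.stop()
  }
}
```

The `Aggregator` methods work on ordinary Scala values, so `reduce` and `merge` never touch a serialized buffer; encoding is left to the two `Encoder` members.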
  • 45. Performance
      scala> val sketchOld = TDigestUDAF(0.5, 0)
      sketchOld: org.apache.spark.tdigest.TDigestUDAF = // ...
      scala> Benchmark.sample(5) { data.agg(sketchOld($"x1")).first }
      res4: Array[Double] = Array(6.655, 7.044, 8.875, 9.425, 9.846)
      scala> val sketchNew = udaf(TDigestAggregator(0.5, 0))
      sketchNew: org.apache.spark.sql.expressions.UserDefinedFunction = // ...
      scala> Benchmark.sample(5) { data.agg(sketchNew($"x1")).first }
      res5: Array[Double] = Array(0.128, 0.12, 0.118, 0.12, 0.112)
      70x Faster
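
The "70x Faster" figure can be checked from the two sample arrays on the slide. The snippet below is a small arithmetic check in plain Scala; the object name `Speedup` and the choice of comparing mean timings are assumptions, since the slide does not say how the ratio was derived.

```scala
object Speedup {
  // Timings copied from the slide: old UDAF versus the new Aggregator-based udaf.
  val oldTimes: Seq[Double] = Seq(6.655, 7.044, 8.875, 9.425, 9.846)
  val newTimes: Seq[Double] = Seq(0.128, 0.12, 0.118, 0.12, 0.112)

  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Ratio of mean timings: about 8.37 divided by about 0.12.
  val ratio: Double = mean(oldTimes) / mean(newTimes)

  def main(args: Array[String]): Unit =
    println(f"speedup: $ratio%.1fx")
}
```

The means are roughly 8.369 and 0.1196, so the ratio comes out very close to 70.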
  • 50. Erik Erlandson, Principal Software Engineer. eje@redhat.com, @ManyAngled