
User Defined Aggregation in Apache Spark: A Love Story

Defining custom, scalable aggregation logic is one of Apache Spark's most powerful features.

  1. User Defined Aggregation in Apache Spark: A Love Story. Erik Erlandson, Principal Software Engineer
  2. All Love Stories Are The Same: Hero Meets Aggregators. Hero Files Spark JIRA. Hero Merges Spark PR.
  3. Establish The Plot
  4. Spark's Scale-Out World. [Diagram: one logical dataset of values 2 3 2 5 3 5 2 3 5]
  5. Spark's Scale-Out World. [Diagram: the same logical dataset distributed across three physical partitions; sketch below]
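
  To make the logical/physical split concrete, here is a minimal sketch (assuming a SparkSession named spark is in scope); glom() exposes the physical partitions of a single logical collection:

      val logical = Seq(2, 3, 2, 5, 3, 5, 2, 3, 5)
      val physical = spark.sparkContext.parallelize(logical, numSlices = 3)
      // Each inner array is one physical partition of the same logical data,
      // e.g. Array(Array(2, 3, 2), Array(5, 3, 5), Array(2, 3, 5))
      physical.glom().collect()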
  6. Scale-Out Sum: a partition holds (2, 3, 5); initialize s = 0
  7. Scale-Out Sum: s = s + 2 (s = 2)
  8. Scale-Out Sum: s = s + 3 (s = 5)
  9. Scale-Out Sum: s = s + 5 (s = 10)
  10. Scale-Out Sum: first partition result is 10
  11. Scale-Out Sum: second partition (5, 3, 5) yields 13
  12. Scale-Out Sum: third partition (2, 3, 2) yields 7
  13. Scale-Out Sum: merge partial sums 13 + 7 = 20
  14. Scale-Out Sum: merge 10 + 20 = 30 (see the sketch below)
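
  The walkthrough above is nothing more than zero/update/merge. A minimal plain-Scala sketch of the same computation (partition contents taken from the slides; the names are illustrative):

      val partitions = Seq(Seq(2, 3, 5), Seq(5, 3, 5), Seq(2, 3, 2))

      val zero = 0
      def update(s: Int, x: Int): Int = s + x     // fold one value into the accumulator
      def merge(s1: Int, s2: Int): Int = s1 + s2  // combine two partial sums

      val partials = partitions.map(_.foldLeft(zero)(update)) // Seq(10, 13, 7)
      val total = partials.reduce(merge)                      // 30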
  15-17. Spark Aggregators (the table builds up one operation per slide; a sketch of the Average row follows):

      Operation | Data    | Accumulator  | Zero   | Update               | Merge              | Present
      Sum       | Numbers | Number       | 0      | a + x                | a1 + a2            |
      Max       | Numbers | Number       | -∞     | max(a, x)            | max(a1, a2)        |
      Average   | Numbers | (sum, count) | (0, 0) | (sum + x, count + 1) | (s1 + s2, c1 + c2) | sum / count
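
  Average is the row that needs the extra "present" step, because its accumulator is not the final answer. A minimal plain-Scala sketch under the same pattern (all names illustrative):

      case class Avg(sum: Double, count: Long)

      val zero = Avg(0.0, 0L)
      def update(a: Avg, x: Double): Avg = Avg(a.sum + x, a.count + 1)
      def merge(a1: Avg, a2: Avg): Avg = Avg(a1.sum + a2.sum, a1.count + a2.count)
      def present(a: Avg): Double = a.sum / a.count  // divide only at the end

      present(Seq(2.0, 3.0, 5.0).foldLeft(zero)(update)) // 10.0 / 3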
  18. Love Interest
  19-20. Data Sketching: T-Digest. [Diagram: a CDF curve from 0 to 1; the point (x, q) with q = 0.9 marks x as the 90th percentile]
  21-22. Is T-Digest an Aggregator? (a sketch follows)

      Data Type        | Numeric
      Accumulator Type | T-Digest Sketch
      Zero             | Empty T-Digest
      Update           | tdigest + x
      Merge            | tdigest1 + tdigest2
      Present          | tdigest.cdfInverse(quantile)
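
  The mapping in the table can be sketched directly with the isarn-sketches operations the later slides use (TDigest.empty, +, ++, cdfInverse); the exact signatures here are assumptions based on those slides:

      import org.isarnproject.sketches.TDigest

      val zero = TDigest.empty(0.5, 0)                           // Zero: empty sketch
      val sketch = Seq(1.0, 2.0, 3.0, 4.0).foldLeft(zero)(_ + _) // Update: tdigest + x
      val merged = sketch ++ TDigest.empty(0.5, 0)               // Merge: tdigest1 ++ tdigest2
      val median = merged.cdfInverse(0.5)                        // Present: inverse CDF at a quantile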
  23. Romantic Chemistry:

      val sketchCDF = tdigestUDAF[Double]
      spark.udf.register("p50",
        (c: Any) => c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.5))
      spark.udf.register("p90",
        (c: Any) => c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.9))
  24. Romantic Chemistry:

      val query = records.writeStream // ...

      +---------+
      |wordcount|
      +---------+
      |       12|
      |        5|
      |        9|
      |       18|
      |       12|
      +---------+

      val r = records
        .withColumn("time", current_timestamp())
        .groupBy(window($"time", "30 seconds"))
        .agg(sketchCDF($"wordcount").alias("CDF"))
        .select(
          callUDF("p50", $"CDF").alias("p50"),
          callUDF("p90", $"CDF").alias("p90"))

      val query = r.writeStream // ...

      +----+----+
      | p50| p90|
      +----+----+
      |15.6|31.0|
      |16.0|30.8|
      |15.8|30.0|
      |15.7|31.0|
      |16.0|31.0|
      +----+----+
  25. Romantic Montage (related talks): "Sketching Data with T-Digest in Apache Spark"; "Smart Scalable Feature Reduction with Random Forests"; "One-Pass Data Science in Apache Spark with Generative T-Digests"; "Apache Spark for Library Developers"; "Extending Structured Streaming Made Easy with Algebra"
  26. Conflict!
  27-28. UDAF Anatomy:

      class TDigestUDAF(deltaV: Double, maxDiscreteV: Int)
          extends UserDefinedAggregateFunction {

        def initialize(buf: MutableAggregationBuffer): Unit =
          buf(0) = TDigestSQL(TDigest.empty(deltaV, maxDiscreteV))

        def update(buf: MutableAggregationBuffer, input: Row): Unit =
          buf(0) = TDigestSQL(buf.getAs[TDigestSQL](0).tdigest + input.getDouble(0))

        def merge(buf1: MutableAggregationBuffer, buf2: Row): Unit =
          buf1(0) = TDigestSQL(buf1.getAs[TDigestSQL](0).tdigest ++
            buf2.getAs[TDigestSQL](0).tdigest)

        def bufferSchema: StructType =
          StructType(StructField("tdigest", TDigestUDT) :: Nil)

        // yada yada yada ...
      }
  29-30. User Defined Type Anatomy (serialize and deserialize are expensive):

      class TDigestUDT extends UserDefinedType[TDigestSQL] {
        def sqlType: DataType = StructType(
          StructField("delta", DoubleType, false) ::
          StructField("maxDiscrete", IntegerType, false) ::
          StructField("nclusters", IntegerType, false) ::
          StructField("clustX", ArrayType(DoubleType, false), false) ::
          StructField("clustM", ArrayType(DoubleType, false), false) ::
          Nil)

        def serialize(tdsql: TDigestSQL): Any = { /* pack T-Digest */ }
        def deserialize(datum: Any): TDigestSQL = { /* unpack T-Digest */ }
        // yada yada yada ...
      }
  31. What Could Go Wrong?

      class TDigestUDT extends UserDefinedType[TDigestSQL] {
        def serialize(tdsql: TDigestSQL): Any = {
          print("In serialize")
          // ...
        }
        def deserialize(datum: Any): TDigestSQL = {
          print("In deserialize")
          // ...
        }
        // yada yada yada ...
      }
  32. What Could Go Wrong? [Diagram: the expected flow over the three partitions of 2 3 2 5 3 5 2 3 5: each partition runs Init and its Updates, serializes once, then the results are Merged]
  33. Wait What?

      val sketchCDF = tdigestUDAF[Double]
      val data = /* data frame with 1000 rows of data */
      val sketch = data.agg(sketchCDF($"column").alias("sketch")).first

      In deserialize
      In serialize
      In deserialize
      In serialize
      ... 997 more times!
      In deserialize
      In serialize
  34-37. Oh No (the one-liner hides a deserialize and re-serialize on every row):

      def update(buf: MutableAggregationBuffer, input: Row): Unit =
        buf(0) = TDigestSQL(buf.getAs[TDigestSQL](0).tdigest + input.getDouble(0))

      // is equivalent to ...

      def update(buf: MutableAggregationBuffer, input: Row): Unit = {
        val tdigest = buf.getAs[TDigestSQL](0).tdigest  // deserialize
        val updated = tdigest + input.getDouble(0)      // do the actual update
        buf(0) = TDigestSQL(updated)                    // re-serialize
      }
  38. SPARK-27296
  39. Resolution
  40. #25024
  41. Aggregator Anatomy (a simpler self-contained example follows):

      class TDigestAggregator(deltaV: Double, maxDiscreteV: Int)
          extends Aggregator[Double, TDigestSQL, TDigestSQL] {
        def zero: TDigestSQL = TDigestSQL(TDigest.empty(deltaV, maxDiscreteV))
        def reduce(b: TDigestSQL, a: Double): TDigestSQL = TDigestSQL(b.tdigest + a)
        def merge(b1: TDigestSQL, b2: TDigestSQL): TDigestSQL =
          TDigestSQL(b1.tdigest ++ b2.tdigest)
        def finish(b: TDigestSQL): TDigestSQL = b
        val serde = ExpressionEncoder[TDigestSQL]()
        def bufferEncoder: Encoder[TDigestSQL] = serde
        def outputEncoder: Encoder[TDigestSQL] = serde
      }
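
  The same four methods work for any accumulator type. A minimal, self-contained sketch of the Aggregator API with a trivial buffer, mirroring the Max row from the earlier table (the object name is illustrative):

      import org.apache.spark.sql.{Encoder, Encoders}
      import org.apache.spark.sql.expressions.Aggregator

      object MaxAggregator extends Aggregator[Double, Double, Double] {
        def zero: Double = Double.NegativeInfinity                // -∞, as in the table
        def reduce(b: Double, x: Double): Double = math.max(b, x) // fold in one value
        def merge(b1: Double, b2: Double): Double = math.max(b1, b2)
        def finish(b: Double): Double = b                         // buffer is the answer
        def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
        def outputEncoder: Encoder[Double] = Encoders.scalaDouble
      }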
  42. Intuitive Serialization. [Diagram: same three partitions of 2 3 2 5 3 5 2 3 5, but now each partition runs Init and its Updates in memory and serializes only once before the Merge]
  43. Custom Aggregation in Spark 3.0 (a registration sketch follows):

      import org.apache.spark.sql.functions.udaf

      val sketchAgg = TDigestAggregator(0.5, 0)
      val sketchCDF: UserDefinedFunction = udaf(sketchAgg)
      val sketch = data.agg(sketchCDF($"column")).first
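
  Because udaf() yields an ordinary UserDefinedFunction, it can also be registered for SQL. A hedged usage sketch, assuming the sketchCDF above and an illustrative temp view name:

      spark.udf.register("sketch_cdf", sketchCDF)  // now callable from SQL too
      data.createOrReplaceTempView("records")
      val viaSQL = spark.sql("SELECT sketch_cdf(`column`) AS sketch FROM records")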
  44-45. Performance (the new Aggregator is roughly 70x faster; a plausible sketch of Benchmark.sample follows):

      scala> val sketchOld = TDigestUDAF(0.5, 0)
      sketchOld: org.apache.spark.tdigest.TDigestUDAF = // ...

      scala> Benchmark.sample(5) { data.agg(sketchOld($"x1")).first }
      res4: Array[Double] = Array(6.655, 7.044, 8.875, 9.425, 9.846)

      scala> val sketchNew = udaf(TDigestAggregator(0.5, 0))
      sketchNew: org.apache.spark.sql.expressions.UserDefinedFunction = // ...

      scala> Benchmark.sample(5) { data.agg(sketchNew($"x1")).first }
      res5: Array[Double] = Array(0.128, 0.12, 0.118, 0.12, 0.112)
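
  The talk does not show the definition of Benchmark.sample; this sketch of such a helper is purely an assumption: run a block n times and report elapsed wall-clock seconds per run.

      object Benchmark {
        // Run the block n times; return the elapsed seconds for each run.
        def sample(n: Int)(block: => Any): Array[Double] =
          Array.fill(n) {
            val t0 = System.nanoTime()
            block
            (System.nanoTime() - t0) / 1e9
          }
      }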
  46. Epilogue
  47. Don't Give Up
  48. Patience
  49. Respect
  50. Erik Erlandson, Principal Software Engineer. eje@redhat.com, @ManyAngled
