Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Extending Structured Streaming Made Easy with Algebra with Erik Erlandson

49 views

Published on

Apache Spark’s Structured Streaming library provides a powerful set of primitives for building streaming pipelines for data processing. However, it is not always obvious how to take full advantage of this power in a way that works naturally with your application’s unique business logic. If you associate algebra with solving equations while wishing you were doing something else, think again: we’ll see how we can apply the properties of operations we all understand — like addition, multiplication, and set union — to reason about our data engineering pipelines.

Attendees will learn easy techniques for exploiting algebraic patterns in their data processing logic that work seamlessly with Spark’s Structured Streaming constructs, effectively extending Spark’s native primitives with your customized data processing operations. These simple yet powerful ideas will be illustrated with real world examples.

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

Extending Structured Streaming Made Easy with Algebra with Erik Erlandson

  1. 1. Erik Erlandson Red Hat, Inc. Extending Structured Streaming Made Easy with Algebra #SAISDev2 eje@redhat.com @manyangled
  2. 2. #SAISDev2 In The Beginning 2 D A T A
  3. 3. #SAISDev2 In The Beginning 3 D A T A 1 Data Set
  4. 4. #SAISDev2 In The Beginning 4 D A T A 1 File 1 Data Set
  5. 5. #SAISDev2 In The Beginning 5 D A T A 1 Machine 1 Data Set 1 File
  6. 6. #SAISDev2 Sum 6 2 3 5
  7. 7. #SAISDev2 Sum 7 2 3 5 s = 0
  8. 8. #SAISDev2 Sum 8 2 3 5 s = s + 2 (2)
  9. 9. #SAISDev2 Sum 9 2 3 5 s = s + 3 (5)
  10. 10. #SAISDev2 Sum 10 2 3 5 s = s + 5 (10)
  11. 11. #SAISDev2 Commodity Scale-Out 11 2003 Google FS 2004 MapReduce
  12. 12. #SAISDev2 Commodity Scale-Out 12 2003 Google FS 2004 MapReduce 2006 Hadoop & HDFS
  13. 13. #SAISDev2 Commodity Scale-Out 13 2003 Google FS 2004 MapReduce 2006 Hadoop & HDFS 2009 Spark & RDD
  14. 14. #SAISDev2 Commodity Scale-Out 14 2003 Google FS 2004 MapReduce 2006 Hadoop & HDFS 2009 Spark & RDD 2015 DataFrame 2016 Structured Streaming
  15. 15. #SAISDev2 Our Scale-Out World 15 2 3 2 5 3 5 2 3 5 logical
  16. 16. #SAISDev2 Our Scale-Out World 16 2 3 2 5 3 5 2 3 5 2 3 2 5 3 5 2 3 5 physical logical
  17. 17. #SAISDev2 Scale-Out Sum 17 2 3 5 s = 0
  18. 18. #SAISDev2 Scale-Out Sum 18 2 3 5 s = s + 2 (2)
  19. 19. #SAISDev2 Scale-Out Sum 19 2 3 5 s = s + 3 (5)
  20. 20. #SAISDev2 Scale-Out Sum 20 2 3 5 s = s + 5 (10)
  21. 21. #SAISDev2 Scale-Out Sum 21 2 3 5 10
  22. 22. #SAISDev2 Scale-Out Sum 22 5 3 5 2 3 5 10 13
  23. 23. #SAISDev2 Scale-Out Sum 23 2 3 2 5 3 5 2 3 5 10 13 7
  24. 24. #SAISDev2 Scale-Out Sum 24 2 3 2 5 3 5 2 3 5 10 13 + 7 = 20
  25. 25. #SAISDev2 Scale-Out Sum 25 2 3 2 5 3 5 2 3 5 10 + 20 = 30
  26. 26. #SAISDev2 Unique 26 2 3 5 {}
  27. 27. #SAISDev2 Unique 27 2 3 5 {}+ 2 = {2}
  28. 28. #SAISDev2 Unique 28 2 3 5 {2}+ 3 = {2,3}
  29. 29. #SAISDev2 Unique 29 2 3 5 {2,3}+ 5 = {2,3,5}
  30. 30. #SAISDev2 Unique 30 2 3 5 {2,3,5}
  31. 31. #SAISDev2 Unique 31 5 3 5 2 3 5 {2,3,5} {3,5}
  32. 32. #SAISDev2 Unique 32 2 3 2 5 3 5 2 3 5 {2,3,5} {3,5} {2,3}
  33. 33. #SAISDev2 Unique 33 2 3 2 5 3 5 2 3 5 {3,5} U {2,3} = {2,3,5} {2,3,5}
  34. 34. #SAISDev2 Unique 34 2 3 2 5 3 5 2 3 5 {2,3,5} U {2,3,5} = {2,3,5}
  35. 35. #SAISDev2 Patterns 35 Examples Sum Unique Pattern s = 0 s = {} 0 { } zero (aka identity)
  36. 36. #SAISDev2 Patterns 36 Examples Sum Unique Pattern s = 0 s = {} 0 { } zero (aka identity) 2 + 3 = 5 {2} + 3 = {2, 3} addition set insertion update (aka reduce)
  37. 37. #SAISDev2 Patterns 37 Examples Sum Unique Pattern s = 0 s = {} 0 { } zero (aka identity) 2 + 3 = 5 {2} + 3 = {2, 3} addition set insertion update (aka reduce) 13 + 7 = 20 {3,5} U {2,3} = {2,3,5} addition set union merge (aka combine)
  38. 38. #SAISDev2 Spark Operators 38 Operation Data Accumulator Zero Update Merge Sum Numbers Number 0 a + x a1 + a2
  39. 39. #SAISDev2 Spark Operators 39 Operation Data Accumulator Zero Update Merge Sum Numbers Number 0 a + x a1 + a2 Max Numbers Number -∞ max(a, x) max(a1, a2)
  40. 40. #SAISDev2 Spark Operators 40 Operation Data Accumulator Zero Update Merge Sum Numbers Number 0 a + x a1 + a2 Max Numbers Number -∞ max(a, x) max(a1, a2) Average Numbers (sum, count) (0, 0) (sum + x, count + 1) (s1 + s2, c1 + c2) Present sum / count
  41. 41. #SAISDev2 Shhhhh... 41 Operation Data Accumulator Zero Update Merge Sum Numbers Number 0 a + x a1 + a2 Max Numbers Number -∞ max(a, x) max(a1, a2) Average Numbers (sum, count) (0, 0) (sum + x, count + 1) (s1 + s2, c1 + c2) We’re secretly algebras!
  42. 42. #SAISDev2 Algebras are Pattern Checklists 42 Object Sets (data types) ✔
  43. 43. #SAISDev2 Algebras are Pattern Checklists 43 Object Sets (data types) ✔ Operations ✔
  44. 44. #SAISDev2 Algebras are Pattern Checklists 44 Object Sets (data types) ✔ Operations ✔ Properties ✔
  45. 45. #SAISDev2 DataFrame Aggregations... records.show(5) +----------+---------+ | user_id|wordcount| +----------+---------+ |6458791872| 12| |7699035787| 5| |2509155359| 9| |9914782373| 18| |7816616846| 12| +----------+---------+ 45 records.groupBy($"user_id") .agg(avg($"wordcount").alias("avg")) .orderBy($"avg".desc) .show(5) +----------+----+ | user_id| avg| +----------+----+ |9438801796|42.0| |0837938601|41.0| |0004926696|40.0| |7439949213|39.0| |2505585758|39.0| +----------+----+
  46. 46. #SAISDev2 DataFrame Aggregations... records.show(5) +----------+---------+ | user_id|wordcount| +----------+---------+ |6458791872| 12| |7699035787| 5| |2509155359| 9| |9914782373| 18| |7816616846| 12| +----------+---------+ 46 records.groupBy($"user_id") .agg(avg($"wordcount").alias("avg")) .orderBy($"avg".desc) .show(5) +----------+----+ | user_id| avg| +----------+----+ |9438801796|42.0| |0837938601|41.0| |0004926696|40.0| |7439949213|39.0| |2505585758|39.0| +----------+----+ Are Algebras!
  47. 47. #SAISDev2 Aggregator Algebra 47 Data Type
  48. 48. #SAISDev2 Aggregator Algebra 48 Data Type Accumulator Type
  49. 49. #SAISDev2 Aggregator Algebra 49 Data Type Accumulator Type Zero
  50. 50. #SAISDev2 Aggregator Algebra 50 Data Type Accumulator Type Zero Update
  51. 51. #SAISDev2 Aggregator Algebra 51 Data Type Accumulator Type Zero Update Merge
  52. 52. #SAISDev2 Aggregator Algebra 52 Data Type Accumulator Type Zero Update Merge Present
  53. 53. #SAISDev2 Merge Is Associative 53 10 13 7 2 3 5 5 3 5 2 3 2 (10 + 13) + 7 = 30
  54. 54. #SAISDev2 Merge Is Associative 54 10 13 7 2 3 5 5 3 5 2 3 2 (10 + 13) + 7 = 30 10 + (13 + 7) = 30
  55. 55. #SAISDev2 Merge Is (usually) Commutative 55 10 13 7 2 3 5 5 3 5 2 3 2 10 + 13 + 7 = 30
  56. 56. #SAISDev2 Merge Is (usually) Commutative 56 10 13 7 2 3 5 5 3 5 2 3 2 10 + 13 + 7 = 30 13 + 7 + 10 = 30
  57. 57. #SAISDev2 Merge Is (usually) Commutative 57 10 13 7 2 3 5 5 3 5 2 3 2 sum a1 + a2 = a2 + a1 max max(a1, a2) = max(a2, a1) avg (s1+s2, n1+n2) = (s2+s1, n2+n1)
  58. 58. #SAISDev2 Structured Streaming 58 a 5 c 7 a 5 c 7 user wordcount a 5 c 7 Aggregations Logical
  59. 59. #SAISDev2 Structured Streaming 59 a 5 c 7 b 8 c 10 a 5 c 7 a 5 c 7 b 8 c 10 user wordcount a 5 b 8 c 17 a 5 c 7 Aggregations Logical Discarded
  60. 60. #SAISDev2 Structured Streaming 60 a 5 c 7 b 8 c 10 a 9 b 6 a 5 c 7 a 5 c 7 b 8 c 10 user wordcount a 5 c 7 b 8 c 10 a 9 b 6 a 14 b 14 c 17 a 5 b 8 c 17 a 5 c 7 Aggregations Logical Discarded
  61. 61. #SAISDev2 Structured Streaming 61 a 5 c 7 b 4 a 8 c 10 a 3 user wordcount
  62. 62. #SAISDev2 Structured Streaming 62 a 5 c 7 b 4 a 8 c 10 a 3 user wordcount a 5, 8, 3 b 4 c 7, 10 group-by
  63. 63. #SAISDev2 Structured Streaming 63 a 5 c 7 b 4 a 8 c 10 a 3 user wordcount a 5, 8, 3 b 4 c 7, 10 a 16 b 4 c 17 group-by update
  64. 64. #SAISDev2 Structured Streaming 64 a 5 c 7 b 4 a 8 c 10 a 3 user wordcount a 5, 8, 3 b 4 c 7, 10 a 5 b 8 c 20 a 16 b 4 c 17 a 21 b 12 c 37 group-by update merge
  65. 65. #SAISDev2 Structured Streaming 65 a 5 c 7 b 4 a 8 c 10 a 3 user wordcount a 5, 8, 3 b 4 c 7, 10 a 5 b 8 c 20 a 16 b 4 c 17 a 21 b 12 c 37 group-by update merge S B Algebras
  66. 66. #SAISDev2 Algebras Around Us 66 Business Logic
  67. 67. #SAISDev2 Algebras Around Us 67 Business Logic Data Type Accumulator Zero Update Merge Present
  68. 68. #SAISDev2 Algebras Around Us 68 Business Logic Data Type Accumulator Zero Update Merge Present Structured Streaming Query
  69. 69. #SAISDev2 Aggregating Quantiles 69 10 9 12 27 . . . ? aggregating query The median of my data is 11 The 90th percentile of my data is 25
  70. 70. #SAISDev2 Distribution Sketch: T-Digest 70 0 1 CDF
  71. 71. #SAISDev2 Distribution Sketch: T-Digest 71 q = 0.9 0 1 CDF
  72. 72. #SAISDev2 Distribution Sketch: T-Digest 72 q = 0.9 0 1 (x,q) CDF
  73. 73. #SAISDev2 Distribution Sketch: T-Digest 73 q = 0.9 x is the 90th %-ile 0 1 (x,q) CDF
  74. 74. #SAISDev2 Distribution Sketch: T-Digest 74 q = 0.9 x is the 90th %-ile 0 1 (x,q) CDF
  75. 75. #SAISDev2 Is T-Digest an Algebra? 75 Data Type Numeric
  76. 76. #SAISDev2 Is T-Digest an Algebra? 76 Data Type Numeric Accumulator Type T-Digest Sketch
  77. 77. #SAISDev2 Is T-Digest an Algebra? 77 Data Type Numeric Accumulator Type T-Digest Sketch Zero Empty T-Digest
  78. 78. #SAISDev2 Is T-Digest an Algebra? 78 Data Type Numeric Accumulator Type T-Digest Sketch Zero Empty T-Digest Update tdigest + x
  79. 79. #SAISDev2 Is T-Digest an Algebra? 79 Data Type Numeric Accumulator Type T-Digest Sketch Zero Empty T-Digest Update tdigest + x Merge tdigest1 + tdigest2
  80. 80. #SAISDev2 Is T-Digest an Algebra? 80 Data Type Numeric Accumulator Type T-Digest Sketch Zero Empty T-Digest Update tdigest + x Merge tdigest1 + tdigest2 Present tdigest.cdfInverse(quantile)
  81. 81. #SAISDev2 Is T-Digest an Algebra? 81 Data Type Numeric Accumulator Type T-Digest Sketch Zero Empty T-Digest Update tdigest + x Merge tdigest1 + tdigest2 Present tdigest.cdfInverse(quantile) YES!
  82. 82. #SAISDev2 User Defined Aggregator Function 82 val sketchCDF = tdigestUDAF[Double] spark.udf.register("p50", (c:Any)=>c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.5)) spark.udf.register("p90", (c:Any)=>c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.9))
  83. 83. #SAISDev2 User Defined Aggregator Function 83 val sketchCDF = tdigestUDAF[Double] spark.udf.register("p50", (c:Any)=>c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.5)) spark.udf.register("p90", (c:Any)=>c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.9))
  84. 84. #SAISDev2 Streaming Percentiles 84 val query = records .writeStream //... +---------+ |wordcount| +---------+ | 12| | 5| | 9| | 18| | 12| +---------+ val r = records.withColumn("time", current_timestamp()) .groupBy(window($”time”, “30 seconds”)) .agg(sketchCDF($"wordcount").alias("CDF")) .select(callUDF("p50", $"CDF").alias("p50"), callUDF("p90", $"CDF").alias("p90")) val query = r.writeStream //... +----+----+ | p50| p90| +----+----+ |15.6|31.0| |16.0|30.8| |15.8|30.0| |15.7|31.0| |16.0|31.0| +----+----+
  85. 85. #SAISDev2 Streaming Percentiles 85 val query = records .writeStream //... +---------+ |wordcount| +---------+ | 12| | 5| | 9| | 18| | 12| +---------+ val r = records.withColumn("time", current_timestamp()) .groupBy(window($”time”, “30 seconds”)) .agg(sketchCDF($"wordcount").alias("CDF")) .select(callUDF("p50", $"CDF").alias("p50"), callUDF("p90", $"CDF").alias("p90")) val query = r.writeStream //... +----+----+ | p50| p90| +----+----+ |15.6|31.0| |16.0|30.8| |15.8|30.0| |15.7|31.0| |16.0|31.0| +----+----+
  86. 86. #SAISDev2 Most-Frequent Items 86 #DogRates #YOLO #SAIRocks #TruthBomb #Mondays ? aggregating query #tag frequency #DogRates 1000000000 #Blockchain 780000 #TaylorSwift 650000
  87. 87. #SAISDev2 Heavy-Hitter Sketch 87 CountMin Heap I estimate frequencies of objects! I store objects in sorted order!
  88. 88. #SAISDev2 Heavy-Hitter Sketch 88 #DogRates update CountMin Heap
  89. 89. #SAISDev2 Heavy-Hitter Sketch 89 #DogRates update query CountMin Heap
  90. 90. #SAISDev2 Heavy-Hitter Sketch 90 #DogRates update query insert CountMin Heap
  91. 91. #SAISDev2 Heavy-Hitter Sketch 91 #DogRates #tag frequency #DogRates 1000000000 #Blockchain 780000 #TaylorSwift 650000 update query insert top k CountMin Heap
  92. 92. #SAISDev2 Heavy-Hitter Sketch 92 #DogRates #tag frequency #DogRates 1000000000 #Blockchain 780000 #TaylorSwift 650000 update query insert top k CountMin Heap
  93. 93. #SAISDev2 Is Heavy-Hitter an Algebra? 93 Data Type Items
  94. 94. #SAISDev2 Is Heavy-Hitter an Algebra? 94 Data Type Items Accumulator Type countmin sketch heap
  95. 95. #SAISDev2 Is Heavy-Hitter an Algebra? 95 Data Type Items Accumulator Type countmin sketch heap Zero empty countmin empty heap
  96. 96. #SAISDev2 Is Heavy-Hitter an Algebra? 96 Data Type Items Accumulator Type countmin sketch heap Zero empty countmin empty heap Update countmin.update(x) heap.insert(x, frequency)
  97. 97. #SAISDev2 Is Heavy-Hitter an Algebra? 97 Data Type Items Accumulator Type countmin sketch heap Zero empty countmin empty heap Update countmin.update(x) heap.insert(x, frequency) Merge countmin1 + countmin2 update(heap1 + heap2)
  98. 98. #SAISDev2 Is Heavy-Hitter an Algebra? 98 Data Type Items Accumulator Type countmin sketch heap Zero empty countmin empty heap Update countmin.update(x) heap.insert(x, frequency) Merge countmin1 + countmin2 update(heap1 + heap2) Present heap.top(k)
  99. 99. #SAISDev2 Is Heavy-Hitter an Algebra? 99 Data Type Items Accumulator Type countmin sketch heap Zero empty countmin empty heap Update countmin.update(x) heap.insert(x, frequency) Merge countmin1 + countmin2 update(heap1 + heap2) Present heap.top(k) YES!
  100. 100. #SAISDev2 Streaming Most-Frequent Items 100 val query = records .writeStream //... +----------+ | hashtag| +----------+ | #Gardiner| | #PETE| | #Nutiva| | #Fairfax| | #Darcy| +----------+ val windowBy60 = windowing.windowBy[(Timestamp, String)](_._1, 60) val top3 = new TopKAgg[(Timestamp, String)](_._2) val r = records.withColumn("time", current_timestamp()) .as[(Timestamp, String)] .groupByKey(windowBy60).agg(top3.toColumn) .map { case (_, tk) => tk.toList.toString } val query = r.writeStream //... +------------------------------------------------------+ + value | +------------------------------------------------------+ |List((#Iddesleigh,3), (#Gardiner,2), (#Willoughby,2)) | |List((#Elizabeth,3), (#HERPRESENT,1), (#upforhours,1))| |List((#PETE,4), (#Nutiva,1), (#Fairfax,1)) | +------------------------------------------------------+
  101. 101. #SAISDev2 Review 101
  102. 102. #SAISDev2 Review 102
  103. 103. #SAISDev2 Review 103
  104. 104. #SAISDev2 Review 104
  105. 105. #SAISDev2 Review 105 eje@redhat.com @manyangled

×