Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Erik Erlandson
Red Hat, Inc.
Extending Structured
Streaming Made Easy
with Algebra
#SAISDev2
eje@redhat.com
@manyangled
#SAISDev2
In The Beginning
2
D
A
T
A
#SAISDev2
In The Beginning
3
D
A
T
A
1 Data Set
#SAISDev2
In The Beginning
4
D
A
T
A
1 File
1 Data Set
#SAISDev2
In The Beginning
5
D
A
T
A
1 Machine
1 Data Set
1 File
#SAISDev2
Sum
6
2
3
5
#SAISDev2
Sum
7
2
3
5
s = 0
#SAISDev2
Sum
8
2
3
5
s = s + 2 (2)
#SAISDev2
Sum
9
2
3
5
s = s + 3 (5)
#SAISDev2
Sum
10
2
3
5 s = s + 5 (10)
#SAISDev2
Commodity Scale-Out
11
2003 Google FS
2004 MapReduce
#SAISDev2
Commodity Scale-Out
12
2003 Google FS
2004 MapReduce
2006 Hadoop & HDFS
#SAISDev2
Commodity Scale-Out
13
2003 Google FS
2004 MapReduce
2006 Hadoop & HDFS
2009 Spark & RDD
#SAISDev2
Commodity Scale-Out
14
2003 Google FS
2004 MapReduce
2006 Hadoop & HDFS
2009 Spark & RDD
2015 DataFrame
2016 Str...
#SAISDev2
Our Scale-Out World
15
2
3
2
5
3
5
2
3
5
logical
#SAISDev2
Our Scale-Out World
16
2 3 2
5 3 5
2 3 5
2
3
2
5
3
5
2
3
5
physical
logical
#SAISDev2
Scale-Out Sum
17
2 3 5
s
=
0
#SAISDev2
Scale-Out Sum
18
2 3 5
s
=
s
+
2
(2)
#SAISDev2
Scale-Out Sum
19
2 3 5
s
=
s
+
3
(5)
#SAISDev2
Scale-Out Sum
20
2 3 5
s
=
s
+
5
(10)
#SAISDev2
Scale-Out Sum
21
2 3 5 10
#SAISDev2
Scale-Out Sum
22
5 3 5
2 3 5 10
13
#SAISDev2
Scale-Out Sum
23
2 3 2
5 3 5
2 3 5 10
13
7
#SAISDev2
Scale-Out Sum
24
2 3 2
5 3 5
2 3 5 10
13 + 7 = 20
#SAISDev2
Scale-Out Sum
25
2 3 2
5 3 5
2 3 5 10 + 20 = 30
#SAISDev2
Unique
26
2 3 5
{}
#SAISDev2
Unique
27
2 3 5
{}+
2
=
{2}
#SAISDev2
Unique
28
2 3 5
{2}+
3
=
{2,3}
#SAISDev2
Unique
29
2 3 5
{2,3}+
5
=
{2,3,5}
#SAISDev2
Unique
30
2 3 5 {2,3,5}
#SAISDev2
Unique
31
5 3 5
2 3 5 {2,3,5}
{3,5}
#SAISDev2
Unique
32
2 3 2
5 3 5
2 3 5 {2,3,5}
{3,5}
{2,3}
#SAISDev2
Unique
33
2 3 2
5 3 5
2 3 5
{3,5} U {2,3} = {2,3,5}
{2,3,5}
#SAISDev2
Unique
34
2 3 2
5 3 5
2 3 5 {2,3,5} U {2,3,5} = {2,3,5}
#SAISDev2
Patterns
35
Examples Sum Unique Pattern
s = 0
s = {}
0 { }
zero
(aka identity)
#SAISDev2
Patterns
36
Examples Sum Unique Pattern
s = 0
s = {}
0 { }
zero
(aka identity)
2 + 3 = 5
{2} + 3 = {2, 3}
additi...
#SAISDev2
Patterns
37
Examples Sum Unique Pattern
s = 0
s = {}
0 { }
zero
(aka identity)
2 + 3 = 5
{2} + 3 = {2, 3}
additi...
#SAISDev2
Spark Operators
38
Operation Data Accumulator Zero Update Merge
Sum Numbers Number 0 a + x a1 + a2
#SAISDev2
Spark Operators
39
Operation Data Accumulator Zero Update Merge
Sum Numbers Number 0 a + x a1 + a2
Max Numbers N...
#SAISDev2
Spark Operators
40
Operation Data Accumulator Zero Update Merge
Sum Numbers Number 0 a + x a1 + a2
Max Numbers N...
#SAISDev2
Shhhhh...
41
Operation Data Accumulator Zero Update Merge
Sum Numbers Number 0 a + x a1 + a2
Max Numbers Number ...
#SAISDev2
Algebras are Pattern Checklists
42
Object Sets
(data types) ✔
#SAISDev2
Algebras are Pattern Checklists
43
Object Sets
(data types) ✔
Operations ✔
#SAISDev2
Algebras are Pattern Checklists
44
Object Sets
(data types) ✔
Operations ✔
Properties ✔
#SAISDev2
DataFrame Aggregations...
records.show(5)
+----------+---------+
| user_id|wordcount|
+----------+---------+
|64...
#SAISDev2
DataFrame Aggregations...
records.show(5)
+----------+---------+
| user_id|wordcount|
+----------+---------+
|64...
#SAISDev2
Aggregator Algebra
47
Data Type
#SAISDev2
Aggregator Algebra
48
Data Type
Accumulator Type
#SAISDev2
Aggregator Algebra
49
Data Type
Accumulator Type
Zero
#SAISDev2
Aggregator Algebra
50
Data Type
Accumulator Type
Zero
Update
#SAISDev2
Aggregator Algebra
51
Data Type
Accumulator Type
Zero
Update
Merge
#SAISDev2
Aggregator Algebra
52
Data Type
Accumulator Type
Zero
Update
Merge
Present
#SAISDev2
Merge Is Associative
53
10 13 7
2
3
5
5
3
5
2
3
2
(10 + 13) + 7 = 30
#SAISDev2
Merge Is Associative
54
10 13 7
2
3
5
5
3
5
2
3
2
(10 + 13) + 7 = 30
10 + (13 + 7) = 30
#SAISDev2
Merge Is (usually) Commutative
55
10 13 7
2
3
5
5
3
5
2
3
2
10 + 13 + 7 = 30
#SAISDev2
Merge Is (usually) Commutative
56
10 13 7
2
3
5
5
3
5
2
3
2
10 + 13 + 7 = 30
13 + 7 + 10 = 30
#SAISDev2
Merge Is (usually) Commutative
57
10 13 7
2
3
5
5
3
5
2
3
2
sum a1 + a2 = a2 + a1
max max(a1, a2) = max(a2, a1)
...
#SAISDev2
Structured Streaming
58
a 5
c 7
a 5
c 7
user wordcount
a 5
c 7
Aggregations
Logical
#SAISDev2
Structured Streaming
59
a 5
c 7
b 8
c 10
a 5
c 7
a 5
c 7
b 8
c 10
user wordcount
a 5
b 8
c 17
a 5
c 7 Aggregatio...
#SAISDev2
Structured Streaming
60
a 5
c 7
b 8
c 10
a 9
b 6
a 5
c 7
a 5
c 7
b 8
c 10
user wordcount
a 5
c 7
b 8
c 10
a 9
b ...
#SAISDev2
Structured Streaming
61
a 5
c 7
b 4
a 8
c 10
a 3
user wordcount
#SAISDev2
Structured Streaming
62
a 5
c 7
b 4
a 8
c 10
a 3
user wordcount
a 5, 8, 3
b 4
c 7, 10
group-by
#SAISDev2
Structured Streaming
63
a 5
c 7
b 4
a 8
c 10
a 3
user wordcount
a 5, 8, 3
b 4
c 7, 10
a 16
b 4
c 17
group-by
upd...
#SAISDev2
Structured Streaming
64
a 5
c 7
b 4
a 8
c 10
a 3
user wordcount
a 5, 8, 3
b 4
c 7, 10
a 5
b 8
c 20
a 16
b 4
c 17...
#SAISDev2
Structured Streaming
65
a 5
c 7
b 4
a 8
c 10
a 3
user wordcount
a 5, 8, 3
b 4
c 7, 10
a 5
b 8
c 20
a 16
b 4
c 17...
#SAISDev2
Algebras Around Us
66
Business
Logic
#SAISDev2
Algebras Around Us
67
Business
Logic
Data Type
Accumulator
Zero
Update
Merge
Present
#SAISDev2
Algebras Around Us
68
Business
Logic
Data Type
Accumulator
Zero
Update
Merge
Present
Structured
Streaming
Query
#SAISDev2
Aggregating Quantiles
69
10
9
12
27
.
.
.
?
aggregating
query
The median of
my data is 11
The 90th
percentile of...
#SAISDev2
Distribution Sketch: T-Digest
70
0
1 CDF
#SAISDev2
Distribution Sketch: T-Digest
71
q = 0.9
0
1 CDF
#SAISDev2
Distribution Sketch: T-Digest
72
q = 0.9
0
1
(x,q)
CDF
#SAISDev2
Distribution Sketch: T-Digest
73
q = 0.9
x is the 90th %-ile
0
1
(x,q)
CDF
#SAISDev2
Distribution Sketch: T-Digest
74
q = 0.9
x is the 90th %-ile
0
1
(x,q)
CDF
#SAISDev2
Is T-Digest an Algebra?
75
Data Type Numeric
#SAISDev2
Is T-Digest an Algebra?
76
Data Type Numeric
Accumulator Type T-Digest Sketch
#SAISDev2
Is T-Digest an Algebra?
77
Data Type Numeric
Accumulator Type T-Digest Sketch
Zero Empty T-Digest
#SAISDev2
Is T-Digest an Algebra?
78
Data Type Numeric
Accumulator Type T-Digest Sketch
Zero Empty T-Digest
Update tdigest...
#SAISDev2
Is T-Digest an Algebra?
79
Data Type Numeric
Accumulator Type T-Digest Sketch
Zero Empty T-Digest
Update tdigest...
#SAISDev2
Is T-Digest an Algebra?
80
Data Type Numeric
Accumulator Type T-Digest Sketch
Zero Empty T-Digest
Update tdigest...
#SAISDev2
Is T-Digest an Algebra?
81
Data Type Numeric
Accumulator Type T-Digest Sketch
Zero Empty T-Digest
Update tdigest...
#SAISDev2
User Defined Aggregator Function
82
val sketchCDF = tdigestUDAF[Double]
spark.udf.register("p50",
(c:Any)=>c.asI...
#SAISDev2
User Defined Aggregator Function
83
val sketchCDF = tdigestUDAF[Double]
spark.udf.register("p50",
(c:Any)=>c.asI...
#SAISDev2
Streaming Percentiles
84
val query = records
.writeStream //...
+---------+
|wordcount|
+---------+
| 12|
| 5|
|...
#SAISDev2
Streaming Percentiles
85
val query = records
.writeStream //...
+---------+
|wordcount|
+---------+
| 12|
| 5|
|...
#SAISDev2
Most-Frequent Items
86
#DogRates
#YOLO
#SAIRocks
#TruthBomb
#Mondays
?
aggregating
query
#tag frequency
#DogRate...
#SAISDev2
Heavy-Hitter Sketch
87
CountMin
Heap
I estimate
frequencies of
objects!
I store objects
in sorted order!
#SAISDev2
Heavy-Hitter Sketch
88
#DogRates
update
CountMin
Heap
#SAISDev2
Heavy-Hitter Sketch
89
#DogRates
update query
CountMin
Heap
#SAISDev2
Heavy-Hitter Sketch
90
#DogRates
update query
insert
CountMin
Heap
#SAISDev2
Heavy-Hitter Sketch
91
#DogRates
#tag frequency
#DogRates 1000000000
#Blockchain 780000
#TaylorSwift 650000
upda...
#SAISDev2
Heavy-Hitter Sketch
92
#DogRates
#tag frequency
#DogRates 1000000000
#Blockchain 780000
#TaylorSwift 650000
upda...
#SAISDev2
Is Heavy-Hitter an Algebra?
93
Data Type Items
#SAISDev2
Is Heavy-Hitter an Algebra?
94
Data Type Items
Accumulator Type
countmin sketch
heap
#SAISDev2
Is Heavy-Hitter an Algebra?
95
Data Type Items
Accumulator Type
countmin sketch
heap
Zero
empty countmin
empty h...
#SAISDev2
Is Heavy-Hitter an Algebra?
96
Data Type Items
Accumulator Type
countmin sketch
heap
Zero
empty countmin
empty h...
#SAISDev2
Is Heavy-Hitter an Algebra?
97
Data Type Items
Accumulator Type
countmin sketch
heap
Zero
empty countmin
empty h...
#SAISDev2
Is Heavy-Hitter an Algebra?
98
Data Type Items
Accumulator Type
countmin sketch
heap
Zero
empty countmin
empty h...
#SAISDev2
Is Heavy-Hitter an Algebra?
99
Data Type Items
Accumulator Type
countmin sketch
heap
Zero
empty countmin
empty h...
#SAISDev2
Streaming Most-Frequent Items
100
val query = records
.writeStream //...
+----------+
| hashtag|
+----------+
| ...
#SAISDev2
Review
101
#SAISDev2
Review
102
#SAISDev2
Review
103
#SAISDev2
Review
104
#SAISDev2
Review
105
eje@redhat.com
@manyangled
Upcoming SlideShare
Loading in …5
×

Extending Structured Streaming Made Easy with Algebra with Erik Erlandson

95 views

Published on

Apache Spark’s Structured Streaming library provides a powerful set of primitives for building streaming pipelines for data processing. However, it is not always obvious how to take full advantage of this power in a way that works naturally with your application’s unique business logic. If you associate algebra with solving equations while wishing you were doing something else, think again: we’ll see how we can apply the properties of operations we all understand — like addition, multiplication, and set union — to reason about our data engineering pipelines.

Attendees will learn easy techniques for exploiting algebraic patterns in their data processing logic that work seamlessly with Spark’s Structured Streaming constructs, effectively extending Spark’s native primitives with your customized data processing operations. These simple yet powerful ideas will be illustrated with real world examples.

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

Extending Structured Streaming Made Easy with Algebra with Erik Erlandson

  1. 1. Erik Erlandson Red Hat, Inc. Extending Structured Streaming Made Easy with Algebra #SAISDev2 eje@redhat.com @manyangled
  2. 2. #SAISDev2 In The Beginning 2 D A T A
  3. 3. #SAISDev2 In The Beginning 3 D A T A 1 Data Set
  4. 4. #SAISDev2 In The Beginning 4 D A T A 1 File 1 Data Set
  5. 5. #SAISDev2 In The Beginning 5 D A T A 1 Machine 1 Data Set 1 File
  6. 6. #SAISDev2 Sum 6 2 3 5
  7. 7. #SAISDev2 Sum 7 2 3 5 s = 0
  8. 8. #SAISDev2 Sum 8 2 3 5 s = s + 2 (2)
  9. 9. #SAISDev2 Sum 9 2 3 5 s = s + 3 (5)
  10. 10. #SAISDev2 Sum 10 2 3 5 s = s + 5 (10)
  11. 11. #SAISDev2 Commodity Scale-Out 11 2003 Google FS 2004 MapReduce
  12. 12. #SAISDev2 Commodity Scale-Out 12 2003 Google FS 2004 MapReduce 2006 Hadoop & HDFS
  13. 13. #SAISDev2 Commodity Scale-Out 13 2003 Google FS 2004 MapReduce 2006 Hadoop & HDFS 2009 Spark & RDD
  14. 14. #SAISDev2 Commodity Scale-Out 14 2003 Google FS 2004 MapReduce 2006 Hadoop & HDFS 2009 Spark & RDD 2015 DataFrame 2016 Structured Streaming
  15. 15. #SAISDev2 Our Scale-Out World 15 2 3 2 5 3 5 2 3 5 logical
  16. 16. #SAISDev2 Our Scale-Out World 16 2 3 2 5 3 5 2 3 5 2 3 2 5 3 5 2 3 5 physical logical
  17. 17. #SAISDev2 Scale-Out Sum 17 2 3 5 s = 0
  18. 18. #SAISDev2 Scale-Out Sum 18 2 3 5 s = s + 2 (2)
  19. 19. #SAISDev2 Scale-Out Sum 19 2 3 5 s = s + 3 (5)
  20. 20. #SAISDev2 Scale-Out Sum 20 2 3 5 s = s + 5 (10)
  21. 21. #SAISDev2 Scale-Out Sum 21 2 3 5 10
  22. 22. #SAISDev2 Scale-Out Sum 22 5 3 5 2 3 5 10 13
  23. 23. #SAISDev2 Scale-Out Sum 23 2 3 2 5 3 5 2 3 5 10 13 7
  24. 24. #SAISDev2 Scale-Out Sum 24 2 3 2 5 3 5 2 3 5 10 13 + 7 = 20
  25. 25. #SAISDev2 Scale-Out Sum 25 2 3 2 5 3 5 2 3 5 10 + 20 = 30
  26. 26. #SAISDev2 Unique 26 2 3 5 {}
  27. 27. #SAISDev2 Unique 27 2 3 5 {}+ 2 = {2}
  28. 28. #SAISDev2 Unique 28 2 3 5 {2}+ 3 = {2,3}
  29. 29. #SAISDev2 Unique 29 2 3 5 {2,3}+ 5 = {2,3,5}
  30. 30. #SAISDev2 Unique 30 2 3 5 {2,3,5}
  31. 31. #SAISDev2 Unique 31 5 3 5 2 3 5 {2,3,5} {3,5}
  32. 32. #SAISDev2 Unique 32 2 3 2 5 3 5 2 3 5 {2,3,5} {3,5} {2,3}
  33. 33. #SAISDev2 Unique 33 2 3 2 5 3 5 2 3 5 {3,5} U {2,3} = {2,3,5} {2,3,5}
  34. 34. #SAISDev2 Unique 34 2 3 2 5 3 5 2 3 5 {2,3,5} U {2,3,5} = {2,3,5}
  35. 35. #SAISDev2 Patterns 35 Examples Sum Unique Pattern s = 0 s = {} 0 { } zero (aka identity)
  36. 36. #SAISDev2 Patterns 36 Examples Sum Unique Pattern s = 0 s = {} 0 { } zero (aka identity) 2 + 3 = 5 {2} + 3 = {2, 3} addition set insertion update (aka reduce)
  37. 37. #SAISDev2 Patterns 37 Examples Sum Unique Pattern s = 0 s = {} 0 { } zero (aka identity) 2 + 3 = 5 {2} + 3 = {2, 3} addition set insertion update (aka reduce) 13 + 7 = 20 {3,5} U {2,3} = {2,3,5} addition set union merge (aka combine)
  38. 38. #SAISDev2 Spark Operators 38 Operation Data Accumulator Zero Update Merge Sum Numbers Number 0 a + x a1 + a2
  39. 39. #SAISDev2 Spark Operators 39 Operation Data Accumulator Zero Update Merge Sum Numbers Number 0 a + x a1 + a2 Max Numbers Number -∞ max(a, x) max(a1, a2)
  40. 40. #SAISDev2 Spark Operators 40 Operation Data Accumulator Zero Update Merge Sum Numbers Number 0 a + x a1 + a2 Max Numbers Number -∞ max(a, x) max(a1, a2) Average Numbers (sum, count) (0, 0) (sum + x, count + 1) (s1 + s2, c1 + c2) Present sum / count
  41. 41. #SAISDev2 Shhhhh... 41 Operation Data Accumulator Zero Update Merge Sum Numbers Number 0 a + x a1 + a2 Max Numbers Number -∞ max(a, x) max(a1, a2) Average Numbers (sum, count) (0, 0) (sum + x, count + 1) (s1 + s2, c1 + c2) We’re secretly algebras!
  42. 42. #SAISDev2 Algebras are Pattern Checklists 42 Object Sets (data types) ✔
  43. 43. #SAISDev2 Algebras are Pattern Checklists 43 Object Sets (data types) ✔ Operations ✔
  44. 44. #SAISDev2 Algebras are Pattern Checklists 44 Object Sets (data types) ✔ Operations ✔ Properties ✔
  45. 45. #SAISDev2 DataFrame Aggregations... records.show(5) +----------+---------+ | user_id|wordcount| +----------+---------+ |6458791872| 12| |7699035787| 5| |2509155359| 9| |9914782373| 18| |7816616846| 12| +----------+---------+ 45 records.groupBy($"user_id") .agg(avg($"wordcount").alias("avg")) .orderBy($"avg".desc) .show(5) +----------+----+ | user_id| avg| +----------+----+ |9438801796|42.0| |0837938601|41.0| |0004926696|40.0| |7439949213|39.0| |2505585758|39.0| +----------+----+
  46. 46. #SAISDev2 DataFrame Aggregations... records.show(5) +----------+---------+ | user_id|wordcount| +----------+---------+ |6458791872| 12| |7699035787| 5| |2509155359| 9| |9914782373| 18| |7816616846| 12| +----------+---------+ 46 records.groupBy($"user_id") .agg(avg($"wordcount").alias("avg")) .orderBy($"avg".desc) .show(5) +----------+----+ | user_id| avg| +----------+----+ |9438801796|42.0| |0837938601|41.0| |0004926696|40.0| |7439949213|39.0| |2505585758|39.0| +----------+----+ Are Algebras!
  47. 47. #SAISDev2 Aggregator Algebra 47 Data Type
  48. 48. #SAISDev2 Aggregator Algebra 48 Data Type Accumulator Type
  49. 49. #SAISDev2 Aggregator Algebra 49 Data Type Accumulator Type Zero
  50. 50. #SAISDev2 Aggregator Algebra 50 Data Type Accumulator Type Zero Update
  51. 51. #SAISDev2 Aggregator Algebra 51 Data Type Accumulator Type Zero Update Merge
  52. 52. #SAISDev2 Aggregator Algebra 52 Data Type Accumulator Type Zero Update Merge Present
  53. 53. #SAISDev2 Merge Is Associative 53 10 13 7 2 3 5 5 3 5 2 3 2 (10 + 13) + 7 = 30
  54. 54. #SAISDev2 Merge Is Associative 54 10 13 7 2 3 5 5 3 5 2 3 2 (10 + 13) + 7 = 30 10 + (13 + 7) = 30
  55. 55. #SAISDev2 Merge Is (usually) Commutative 55 10 13 7 2 3 5 5 3 5 2 3 2 10 + 13 + 7 = 30
  56. 56. #SAISDev2 Merge Is (usually) Commutative 56 10 13 7 2 3 5 5 3 5 2 3 2 10 + 13 + 7 = 30 13 + 7 + 10 = 30
  57. 57. #SAISDev2 Merge Is (usually) Commutative 57 10 13 7 2 3 5 5 3 5 2 3 2 sum a1 + a2 = a2 + a1 max max(a1, a2) = max(a2, a1) avg (s1+s2, n1+n2) = (s2+s1, n2+n1)
  58. 58. #SAISDev2 Structured Streaming 58 a 5 c 7 a 5 c 7 user wordcount a 5 c 7 Aggregations Logical
  59. 59. #SAISDev2 Structured Streaming 59 a 5 c 7 b 8 c 10 a 5 c 7 a 5 c 7 b 8 c 10 user wordcount a 5 b 8 c 17 a 5 c 7 Aggregations Logical Discarded
  60. 60. #SAISDev2 Structured Streaming 60 a 5 c 7 b 8 c 10 a 9 b 6 a 5 c 7 a 5 c 7 b 8 c 10 user wordcount a 5 c 7 b 8 c 10 a 9 b 6 a 14 b 14 c 17 a 5 b 8 c 17 a 5 c 7 Aggregations Logical Discarded
  61. 61. #SAISDev2 Structured Streaming 61 a 5 c 7 b 4 a 8 c 10 a 3 user wordcount
  62. 62. #SAISDev2 Structured Streaming 62 a 5 c 7 b 4 a 8 c 10 a 3 user wordcount a 5, 8, 3 b 4 c 7, 10 group-by
  63. 63. #SAISDev2 Structured Streaming 63 a 5 c 7 b 4 a 8 c 10 a 3 user wordcount a 5, 8, 3 b 4 c 7, 10 a 16 b 4 c 17 group-by update
  64. 64. #SAISDev2 Structured Streaming 64 a 5 c 7 b 4 a 8 c 10 a 3 user wordcount a 5, 8, 3 b 4 c 7, 10 a 5 b 8 c 20 a 16 b 4 c 17 a 21 b 12 c 37 group-by update merge
  65. 65. #SAISDev2 Structured Streaming 65 a 5 c 7 b 4 a 8 c 10 a 3 user wordcount a 5, 8, 3 b 4 c 7, 10 a 5 b 8 c 20 a 16 b 4 c 17 a 21 b 12 c 37 group-by update merge S B Algebras
  66. 66. #SAISDev2 Algebras Around Us 66 Business Logic
  67. 67. #SAISDev2 Algebras Around Us 67 Business Logic Data Type Accumulator Zero Update Merge Present
  68. 68. #SAISDev2 Algebras Around Us 68 Business Logic Data Type Accumulator Zero Update Merge Present Structured Streaming Query
  69. 69. #SAISDev2 Aggregating Quantiles 69 10 9 12 27 . . . ? aggregating query The median of my data is 11 The 90th percentile of my data is 25
  70. 70. #SAISDev2 Distribution Sketch: T-Digest 70 0 1 CDF
  71. 71. #SAISDev2 Distribution Sketch: T-Digest 71 q = 0.9 0 1 CDF
  72. 72. #SAISDev2 Distribution Sketch: T-Digest 72 q = 0.9 0 1 (x,q) CDF
  73. 73. #SAISDev2 Distribution Sketch: T-Digest 73 q = 0.9 x is the 90th %-ile 0 1 (x,q) CDF
  74. 74. #SAISDev2 Distribution Sketch: T-Digest 74 q = 0.9 x is the 90th %-ile 0 1 (x,q) CDF
  75. 75. #SAISDev2 Is T-Digest an Algebra? 75 Data Type Numeric
  76. 76. #SAISDev2 Is T-Digest an Algebra? 76 Data Type Numeric Accumulator Type T-Digest Sketch
  77. 77. #SAISDev2 Is T-Digest an Algebra? 77 Data Type Numeric Accumulator Type T-Digest Sketch Zero Empty T-Digest
  78. 78. #SAISDev2 Is T-Digest an Algebra? 78 Data Type Numeric Accumulator Type T-Digest Sketch Zero Empty T-Digest Update tdigest + x
  79. 79. #SAISDev2 Is T-Digest an Algebra? 79 Data Type Numeric Accumulator Type T-Digest Sketch Zero Empty T-Digest Update tdigest + x Merge tdigest1 + tdigest2
  80. 80. #SAISDev2 Is T-Digest an Algebra? 80 Data Type Numeric Accumulator Type T-Digest Sketch Zero Empty T-Digest Update tdigest + x Merge tdigest1 + tdigest2 Present tdigest.cdfInverse(quantile)
  81. 81. #SAISDev2 Is T-Digest an Algebra? 81 Data Type Numeric Accumulator Type T-Digest Sketch Zero Empty T-Digest Update tdigest + x Merge tdigest1 + tdigest2 Present tdigest.cdfInverse(quantile) YES!
  82. 82. #SAISDev2 User Defined Aggregator Function 82 val sketchCDF = tdigestUDAF[Double] spark.udf.register("p50", (c:Any)=>c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.5)) spark.udf.register("p90", (c:Any)=>c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.9))
  83. 83. #SAISDev2 User Defined Aggregator Function 83 val sketchCDF = tdigestUDAF[Double] spark.udf.register("p50", (c:Any)=>c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.5)) spark.udf.register("p90", (c:Any)=>c.asInstanceOf[TDigestSQL].tdigest.cdfInverse(0.9))
  84. 84. #SAISDev2 Streaming Percentiles 84 val query = records .writeStream //... +---------+ |wordcount| +---------+ | 12| | 5| | 9| | 18| | 12| +---------+ val r = records.withColumn("time", current_timestamp()) .groupBy(window($”time”, “30 seconds”)) .agg(sketchCDF($"wordcount").alias("CDF")) .select(callUDF("p50", $"CDF").alias("p50"), callUDF("p90", $"CDF").alias("p90")) val query = r.writeStream //... +----+----+ | p50| p90| +----+----+ |15.6|31.0| |16.0|30.8| |15.8|30.0| |15.7|31.0| |16.0|31.0| +----+----+
  85. 85. #SAISDev2 Streaming Percentiles 85 val query = records .writeStream //... +---------+ |wordcount| +---------+ | 12| | 5| | 9| | 18| | 12| +---------+ val r = records.withColumn("time", current_timestamp()) .groupBy(window($”time”, “30 seconds”)) .agg(sketchCDF($"wordcount").alias("CDF")) .select(callUDF("p50", $"CDF").alias("p50"), callUDF("p90", $"CDF").alias("p90")) val query = r.writeStream //... +----+----+ | p50| p90| +----+----+ |15.6|31.0| |16.0|30.8| |15.8|30.0| |15.7|31.0| |16.0|31.0| +----+----+
  86. 86. #SAISDev2 Most-Frequent Items 86 #DogRates #YOLO #SAIRocks #TruthBomb #Mondays ? aggregating query #tag frequency #DogRates 1000000000 #Blockchain 780000 #TaylorSwift 650000
  87. 87. #SAISDev2 Heavy-Hitter Sketch 87 CountMin Heap I estimate frequencies of objects! I store objects in sorted order!
  88. 88. #SAISDev2 Heavy-Hitter Sketch 88 #DogRates update CountMin Heap
  89. 89. #SAISDev2 Heavy-Hitter Sketch 89 #DogRates update query CountMin Heap
  90. 90. #SAISDev2 Heavy-Hitter Sketch 90 #DogRates update query insert CountMin Heap
  91. 91. #SAISDev2 Heavy-Hitter Sketch 91 #DogRates #tag frequency #DogRates 1000000000 #Blockchain 780000 #TaylorSwift 650000 update query insert top k CountMin Heap
  92. 92. #SAISDev2 Heavy-Hitter Sketch 92 #DogRates #tag frequency #DogRates 1000000000 #Blockchain 780000 #TaylorSwift 650000 update query insert top k CountMin Heap
  93. 93. #SAISDev2 Is Heavy-Hitter an Algebra? 93 Data Type Items
  94. 94. #SAISDev2 Is Heavy-Hitter an Algebra? 94 Data Type Items Accumulator Type countmin sketch heap
  95. 95. #SAISDev2 Is Heavy-Hitter an Algebra? 95 Data Type Items Accumulator Type countmin sketch heap Zero empty countmin empty heap
  96. 96. #SAISDev2 Is Heavy-Hitter an Algebra? 96 Data Type Items Accumulator Type countmin sketch heap Zero empty countmin empty heap Update countmin.update(x) heap.insert(x, frequency)
  97. 97. #SAISDev2 Is Heavy-Hitter an Algebra? 97 Data Type Items Accumulator Type countmin sketch heap Zero empty countmin empty heap Update countmin.update(x) heap.insert(x, frequency) Merge countmin1 + countmin2 update(heap1 + heap2)
  98. 98. #SAISDev2 Is Heavy-Hitter an Algebra? 98 Data Type Items Accumulator Type countmin sketch heap Zero empty countmin empty heap Update countmin.update(x) heap.insert(x, frequency) Merge countmin1 + countmin2 update(heap1 + heap2) Present heap.top(k)
  99. 99. #SAISDev2 Is Heavy-Hitter an Algebra? 99 Data Type Items Accumulator Type countmin sketch heap Zero empty countmin empty heap Update countmin.update(x) heap.insert(x, frequency) Merge countmin1 + countmin2 update(heap1 + heap2) Present heap.top(k) YES!
  100. 100. #SAISDev2 Streaming Most-Frequent Items 100 val query = records .writeStream //... +----------+ | hashtag| +----------+ | #Gardiner| | #PETE| | #Nutiva| | #Fairfax| | #Darcy| +----------+ val windowBy60 = windowing.windowBy[(Timestamp, String)](_._1, 60) val top3 = new TopKAgg[(Timestamp, String)](_._2) val r = records.withColumn("time", current_timestamp()) .as[(Timestamp, String)] .groupByKey(windowBy60).agg(top3.toColumn) .map { case (_, tk) => tk.toList.toString } val query = r.writeStream //... +------------------------------------------------------+ + value | +------------------------------------------------------+ |List((#Iddesleigh,3), (#Gardiner,2), (#Willoughby,2)) | |List((#Elizabeth,3), (#HERPRESENT,1), (#upforhours,1))| |List((#PETE,4), (#Nutiva,1), (#Fairfax,1)) | +------------------------------------------------------+
  101. 101. #SAISDev2 Review 101
  102. 102. #SAISDev2 Review 102
  103. 103. #SAISDev2 Review 103
  104. 104. #SAISDev2 Review 104
  105. 105. #SAISDev2 Review 105 eje@redhat.com @manyangled

×