"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
Online Approximate OLAP in SparkSQL
1. Online Approximate
OLAP in SparkSQL
Kai Zengand Sameer Agarwal
Hadoop Summit | San Jose, CA | June 9th 2015
2. Hard Disks
(dailyreports)
½ - 1 Hour 1 - 5 Minutes 1 second
?
Memory
(hourly reports)
100 TB on 1000 machines
Need for Quick & Continuous Feedback
(exploratoryqueries)
3. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Online Aggregation1
What is the average latency in
the table?
34.6667
1Joseph M. Hellerstein, Peter J. Haas, Helen J.
Wang. Online Aggregation. In SIGMOD 1997.
4. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Online Aggregation
What is the average latency in
the table?
35
5. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Online Aggregation
What is the average latency in
the table?
33.83
35
6. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Online Aggregation
What is the average latency in
the table?
33.83
34.6667
35
7. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Online Aggregation
What is the average latency in
the table?
33.83 ± 1.3
34.6667 ± 0.0
35 ± 2.1
19. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Part 1: Generalized Online Aggregation (G-OLA)
What is the average latency in
the table?
33.83 ± 1.3
34.6667 ± 0.0
35 ± 2.1
20. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
What is the average latency in
the table?
33.83 ± 1.3
34.6667 ± 0.0
35 ± 2.1
Part 2: Error Estimation in BlinkDB
21. Key Idea:
Introduce Delta Maintenance as a
First Class Citizen in Query
Execution
Part 1: Generalized Online Aggregation (G-OLA)
22. Query Execution: Under The Hood
SQL Query Query Plan
AGGR
SCAN
AVG(latency)
SELECT avg(latency)
FROM log
log
23. Query Execution: Under The Hood
SQL Query Query Plan
FILTER
JOIN
AGGR SCAN
SCAN
AGGR
latency > AVG(latency)
AVG(latency)
AVG(latency)SELECT avg(latency)
FROM log
WHERE latency >
(
SELECT
avg(latency)
FROM log
)
log
log
33. Intermediate State
SELECT avg(latency)
FROM log
WHERE latency >
(
SELECT avg(latency)
FROM log
)
G-OLA: Generalized Online Aggregation for
Interactive Analysis on Big Data. In SIGMOD 2015
34. Intermediate State
SELECT avg(latency)
FROM log
WHERE latency >
(
SELECT avg(latency)
FROM log
)
G-OLA: Generalized Online Aggregation for
Interactive Analysis on Big Data. In SIGMOD 2015
35. 35
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
SELECT avg(latency)
FROM log
WHERE latency > SELECT avg(latency)
FROM log
Intermediate State
36. 35
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
SELECT avg(latency)
FROM log
WHERE latency > SELECT avg(latency)
FROM log
Intermediate State
33.83
37. 35
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
SELECT avg(latency)
FROM log
WHERE latency > SELECT avg(latency)
FROM log
Intermediate State
33.83
38. 35 ± 1.5
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
SELECT avg(latency)
FROM log
WHERE latency > SELECT avg(latency)
FROM log
Intermediate State
33.83
39. 35 ± 1.5
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
SELECT avg(latency)
FROM log
WHERE latency > SELECT avg(latency)
FROM log
Intermediate State
33.83
40. 35 ± 1.5
Uncertain Tuples
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
SELECT avg(latency)
FROM log
WHERE latency > SELECT avg(latency)
FROM log
Intermediate State
33.83
42. Goal – Minimize Intermediate State
Each delta-update query
• Optimized using Catalyst in Spark SQL
– Capable of using all existing optimizations
• Compiled into a distributed Spark Job
Online Execution Model
43. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
What is the average latency in
the table?
33.83 ± 1.3
34.6667 ± 0.0
35 ± 2.1
Part 2: Error Estimation in BlinkDB
44. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation on a Sample of Data
What is the average latency in
the table?
34.6667
45. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation on a Sample of Data
What is the average latency in
the table?
34.5
34.6667
46. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation on a Sample of Data
What is the average latency in
the table?
34.5 ± 2.0
34.6667
47. Focused on estimating aggregate errors given representative
samples
Central LimitTheorem (CLT) Error Estimation using Bootstrap
HOE:ASTAT63,BIL: WILEY86, CGL:ASTAT83,PH:IBM96 EFRON:JAS82, EFRON:JAS87,VP:TPMS80, FGK:IJCAI99, ET:CH93
47
Error Estimation on a Sample of Data
d
predicate for the query)
The following results are (asymptotically in sample size) true, but not di-
rectly useful, since they depend on unknown properties of the underlying dis-
tribution. In all cases we just plug in the sample values. For example, instead
of µ we use 1
n
Pn
i=1 Xi where Xi is the ith sample value.
Note that for estimators other than sum and count, I assume no filtering
(p = 1). Filtering will increase variance a bit, or potentially a lot for extremely
selective queries (p = 0). I can compute the filtering-adjusted values if you like.
1. Count: N(np, n(1 p)p)
2. Sum: N(npµ, np( 2
+ (1 p)µ2
))
3. Mean: N(µ, 2
/n)
4. Variance: N( 2
, (µ4
4
)/n)
5. Stddev: N( , (µ4
4
)/(4 2
n))
Sampling!
…
Resampling!
D
S100
S1
S
θ(S1
)
θ(S100
)θ(S) 95%
confidence
interval!
…
48. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
What is the average latency in
the table?
49. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
50. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
θ1 = 34
51. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
ID City Latency
1 NYC 30
2 NYC 30
3 SLC 34
4 LA 36
θ1 = 34 θ2 = 32.5
52. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
ID City Latency
1 NYC 30
2 NYC 30
3 SLC 34
4 LA 36
ID City Latency
1 SLC 34
2 LA 36
3 SLC 34
4 LA 36
...
θ1 = 34 ...θ2 = 32.5 θ100 = 35
53. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
ID City Latency
1 NYC 30
2 NYC 30
3 SLC 34
4 LA 36
ID City Latency
1 SLC 34
2 LA 36
3 SLC 34
4 LA 36
...
θ1 = 34 ...
34.5 ± 2
θ2 = 32.5 θ100 = 35
54. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
ID City Latency
1 NYC 30
2 NYC 30
3 SLC 34
4 LA 36
ID City Latency
1 SLC 34
2 LA 36
3 SLC 34
4 LA 36
...
θ1 = 34 ...
34.5 ± 2
θ2 = 32.5 θ100 = 35
55. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
5 SLC 37
ID City Latency
1 SLC 37
2 NYC 30
3 SLC 34
4 LA 36
5 NYC 30
ID City Latency
1 SLC 34
2 SLC 37
3 SLC 34
4 LA 36
5 LA 36
...
θ1 = 34.6
...
35 ± 1.6
θ2 = 33.4 θ100 = 35.4
56. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
What is the average latency in
the table?
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
5 SLC 37
Error Estimation in BlinkDB
Leverage Poissonized
Resampling to generate
samples with replacement
Knowing When You’re Wrong: Building Fast
and Reliable Approximate Query Processing
Systems. In SIGMOD 2014.
57. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
What is the average latency in
the table?
ID City Latency #1
1 NYC 30 2
2 NYC 38 1
3 SLC 34 0
4 SLC 34 1
5 SLC 37 1
Sample from a
Poisson (1) Distribution
θ1 = 33.8
Error Estimation in BlinkDB
58. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
What is the average latency in
the table?
ID City Latency #1
1 NYC 30 2
2 NYC 38 1
3 SLC 34 0
4 SLC 34 1
5 SLC 37 1
6 SF 28 2
Incremental Error
Estimation
Error Estimation in BlinkDB
59. ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
What is the average latency in
the table?
ID City Latency #1 #2
1 NYC 30 2 1
2 NYC 38 1 0
3 SLC 34 0 2
4 SLC 34 1 2
5 SLC 37 1 0
6 SF 28 2 1
Construct all
Resamples in
a Single Pass
Error Estimation in BlinkDB
60. Bootstrap Performance
Compute Sum(latency)
rSumi = rSumi + latency * #i
Virtual call to Add.eval()
Virtual call to rSum.eval()
Return boxed Double
Virtual call to Multiply.eval()
Virtual call to latency.eval()
Return boxed Double
Virtual call to #.eval()
……
Add
rSum Multiply
latency #
61. Using Scala Quasiquotes/ Runtime Reflection
def generateCode(e: Expression): Tree = e match {
case Attribute(ordinal) =>
q"inputRow.getInt($ordinal)"
case Add(left, right) =>
q"""
{
val leftResult = ${generateCode(left)}
val rightResult = ${generateCode(right)}
leftResult + rightResult
}
"""
}
62. Code Generation Bootstrap
Compute Sum(latency)
rSumi = rSumi + latency * #i
Consolidate all bootstrap in a tight for loop
val left0 = inputRow.getDouble(j) // latency
for (i from 0 until n)
val right0 = inputRow.getInt(k+i) // #i
val left1 = inputRow.getDouble(0+i) // rSumi
val right1 = left0*right0
val result1 = left1 + right1
resultRow.setDouble(0, result1)
62
2-5x Faster!
63. Conclusion
1. Online Approximation is an important
means to achieve interactivityin
processing large datasets
2. New SparkSQL Libraries:
- G-OLA for Continuous Answers
- BlinkDB for Continuous Error Bars
64. Check out our code!
1. Early Adopter Preview: http://github.com/amplab/bootstrap-
sql. Send us an email to kaizeng@cs.berkeley.eduand
sameer@databricks.comto get access!
2. Spark Package coming soon
3. GradualNative SparkSQL Integration in 1.5, 1.6 and beyond