SlideShare a Scribd company logo
1 of 65
Online Approximate
OLAP in SparkSQL
Kai Zengand Sameer Agarwal
Hadoop Summit | San Jose, CA | June 9th 2015
Hard Disks
(dailyreports)
½ - 1 Hour 1 - 5 Minutes 1 second
?
Memory
(hourly reports)
100 TB on 1000 machines
Need for Quick & Continuous Feedback
(exploratoryqueries)
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Online Aggregation1
What is the average latency in
the table?
34.6667
1Joseph M. Hellerstein, Peter J. Haas, Helen J.
Wang. Online Aggregation. In SIGMOD 1997.
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Online Aggregation
What is the average latency in
the table?
35
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Online Aggregation
What is the average latency in
the table?
33.83
35
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Online Aggregation
What is the average latency in
the table?
33.83
34.6667
35
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Online Aggregation
What is the average latency in
the table?
33.83 ± 1.3
34.6667 ± 0.0
35 ± 2.1
System Design
8
SparkSQL
System Design
9
SparkSQL
G-OLA
(Generalized OLA)
Continuous Answers Continuous Error Bars
SparkSQL Interface
val dataFrame =  
sqlCtx.sql(“select  avg(latency)  from  log”)
//  batch  processing
val result  =  dataFrame.collect()  //  34.6667
SparkSQL + G-OLA + BlinkDB
val dataFrame =  
sqlCtx.sql(“select  avg(latency)  from  log”)
//  online  processing
val onlineDataFrame =  dataFrame.online
onlineDataFrame.collectNext() //  35  ± 2.1
onlineDataFrame.collectNext() //  33.83  ± 1.3
val dataFrame =  
sqlCtx.sql(“select  avg(latency)  from  log”)
//  online  processing
val onlineDataFrame =  dataFrame.online
while (onlineDataFrame.hasNext())  {
onlineDataFrame.collectNext()
}
SparkSQL + G-OLA + BlinkDB
val dataFrame =  
sqlCtx.sql(“select  avg(latency)  from  log”)
//  online  processing
val onlineDataFrame =  dataFrame.online
while (onlineDataFrame.hasNext()  &&
responseTime <=  10.seconds)  {
onlineDataFrame.collectNext()
}
SparkSQL + G-OLA + BlinkDB
val dataFrame =  
sqlCtx.sql(“select  avg(latency)  from  log”)
//  online  processing
val onlineDataFrame =  dataFrame.online
while (onlineDataFrame.hasNext()  &&
errorBound >=  0.01)  {
onlineDataFrame.collectNext()
}
SparkSQL + G-OLA + BlinkDB
val dataFrame =  
sqlCtx.sql(“select  avg(latency)  from  log”)
//  online  processing
val onlineDataFrame =  dataFrame.online
while (onlineDataFrame.hasNext()  &&
userEvent.cancelled())  {
onlineDataFrame.collectNext()
}
SparkSQL + G-OLA + BlinkDB
val dataFrame =  
sqlCtx.sql(“select  avg(latency)  from  log”)
//  online  processing
val onlineDataFrame =  dataFrame.online
while (onlineDataFrame.hasNext()  &&
userEvent.cancelled())  {
onlineDataFrame.collectNext()
}
SparkSQL + G-OLA + BlinkDB
val dataFrame =  
sqlCtx.sql(“select  avg(latency)  from  log”)
//  online  processing
val onlineDataFrame =  dataFrame.online
while (onlineDataFrame.hasNext()  &&
userEvent.cancelled())  {
onlineDataFrame.collectNext()
}
AGGREGATES/  UDAFs
JOINS/GROUP  BYs
NESTED  QUERIES
SparkSQL + G-OLA + BlinkDB
This Talk ...
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Part 1: Generalized Online Aggregation (G-OLA)
What is the average latency in
the table?
33.83 ± 1.3
34.6667 ± 0.0
35 ± 2.1
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
What is the average latency in
the table?
33.83 ± 1.3
34.6667 ± 0.0
35 ± 2.1
Part 2: Error Estimation in BlinkDB
Key Idea:
Introduce Delta Maintenance as a
First Class Citizen in Query
Execution
Part 1: Generalized Online Aggregation (G-OLA)
Query Execution: Under The Hood
SQL Query Query Plan
AGGR
SCAN
AVG(latency)
SELECT  avg(latency)
FROM  log
log
Query Execution: Under The Hood
SQL Query Query Plan
FILTER
JOIN
AGGR SCAN
SCAN
AGGR
latency > AVG(latency)
AVG(latency)
AVG(latency)SELECT  avg(latency)
FROM  log
WHERE  latency  >
(
SELECT  
avg(latency)
FROM  log
)
log
log
Query Execution: Under The Hood
log
Query Execution: Under The Hood
log
Query Execution: Under The Hood
collect()
log
Query
Query Execution: Under The Hood
collect() collectNext()
Query Execution: Under The Hood
collect() collectNext()
Delta
Input 1
Delta Input: New data + Previous Intermediate State
Query Execution: Under The Hood
Delta
Input 2
Delta Input: New data + Previous Intermediate State
Intermediate State
SELECT  avg(latency)
FROM  log
Intermediate State
?SELECT  avg(latency)
FROM  log
WHERE  latency  >
(
SELECT  avg(latency)
FROM  log
)
Intermediate State
SELECT  avg(latency)
FROM  log
WHERE  latency  >
(
SELECT  avg(latency)
FROM  log
)
Intermediate State
SELECT  avg(latency)
FROM  log
WHERE  latency  >
(
SELECT  avg(latency)
FROM  log
)
G-OLA: Generalized Online Aggregation for
Interactive Analysis on Big Data. In SIGMOD 2015
Intermediate State
SELECT  avg(latency)
FROM  log
WHERE  latency  >
(
SELECT  avg(latency)
FROM  log
)
G-OLA: Generalized Online Aggregation for
Interactive Analysis on Big Data. In SIGMOD 2015
35
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
SELECT  avg(latency)
FROM  log
WHERE  latency  >  SELECT  avg(latency)
FROM  log
Intermediate State
35
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
SELECT  avg(latency)
FROM  log
WHERE  latency  >  SELECT  avg(latency)
FROM  log
Intermediate State
33.83
35
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
SELECT  avg(latency)
FROM  log
WHERE  latency  >  SELECT  avg(latency)
FROM  log
Intermediate State
33.83
35 ± 1.5
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
SELECT  avg(latency)
FROM  log
WHERE  latency  >  SELECT  avg(latency)
FROM  log
Intermediate State
33.83
35 ± 1.5
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
SELECT  avg(latency)
FROM  log
WHERE  latency  >  SELECT  avg(latency)
FROM  log
Intermediate State
33.83
35 ± 1.5
Uncertain Tuples
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
SELECT  avg(latency)
FROM  log
WHERE  latency  >  SELECT  avg(latency)
FROM  log
Intermediate State
33.83
FILTER
JOIN
AGGR SCAN
SCAN
AGGR
latency > AVG(latency)
AVG(latency)
AVG(latency)
FILTER
JOIN
AGGR SCAN
SCAN
AGGR
latency > AVG(latency)
AVG(latency)
AVG(latency)
Uncertain Tuples:
𝑇", 𝑇$
Running sum/count of:
𝑇%, 𝑇&  
Running sum/count of:
𝑇(, 𝑇%, 𝑇", 𝑇$, 𝑇&  
  
Intermediate State:
Intermediate State
Goal – Minimize Intermediate State
Each delta-update query
• Optimized using Catalyst in Spark SQL
– Capable of using all existing optimizations
• Compiled into a distributed Spark Job
Online Execution Model
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
What is the average latency in
the table?
33.83 ± 1.3
34.6667 ± 0.0
35 ± 2.1
Part 2: Error Estimation in BlinkDB
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation on a Sample of Data
What is the average latency in
the table?
34.6667
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation on a Sample of Data
What is the average latency in
the table?
34.5
34.6667
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation on a Sample of Data
What is the average latency in
the table?
34.5 ± 2.0
34.6667
Focused on estimating aggregate errors given representative
samples
Central LimitTheorem (CLT) Error Estimation using Bootstrap
HOE:ASTAT63,BIL: WILEY86, CGL:ASTAT83,PH:IBM96 EFRON:JAS82, EFRON:JAS87,VP:TPMS80, FGK:IJCAI99, ET:CH93
47
Error Estimation on a Sample of Data
d
predicate for the query)
The following results are (asymptotically in sample size) true, but not di-
rectly useful, since they depend on unknown properties of the underlying dis-
tribution. In all cases we just plug in the sample values. For example, instead
of µ we use 1
n
Pn
i=1 Xi where Xi is the ith sample value.
Note that for estimators other than sum and count, I assume no filtering
(p = 1). Filtering will increase variance a bit, or potentially a lot for extremely
selective queries (p = 0). I can compute the filtering-adjusted values if you like.
1. Count: N(np, n(1 p)p)
2. Sum: N(npµ, np( 2
+ (1 p)µ2
))
3. Mean: N(µ, 2
/n)
4. Variance: N( 2
, (µ4
4
)/n)
5. Stddev: N( , (µ4
4
)/(4 2
n))
Sampling!
…
Resampling!
D
S100
S1
S
θ(S1
)
θ(S100
)θ(S) 95%
confidence
interval!
…
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
What is the average latency in
the table?
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
θ1 = 34
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
ID City Latency
1 NYC 30
2 NYC 30
3 SLC 34
4 LA 36
θ1 = 34 θ2 = 32.5
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
ID City Latency
1 NYC 30
2 NYC 30
3 SLC 34
4 LA 36
ID City Latency
1 SLC 34
2 LA 36
3 SLC 34
4 LA 36
...
θ1 = 34 ...θ2 = 32.5 θ100 = 35
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
ID City Latency
1 NYC 30
2 NYC 30
3 SLC 34
4 LA 36
ID City Latency
1 SLC 34
2 LA 36
3 SLC 34
4 LA 36
...
θ1 = 34 ...
34.5 ± 2
θ2 = 32.5 θ100 = 35
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
ID City Latency
1 NYC 30
2 NYC 30
3 SLC 34
4 LA 36
ID City Latency
1 SLC 34
2 LA 36
3 SLC 34
4 LA 36
...
θ1 = 34 ...
34.5 ± 2
θ2 = 32.5 θ100 = 35
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
5 SLC 37
ID City Latency
1 SLC 37
2 NYC 30
3 SLC 34
4 LA 36
5 NYC 30
ID City Latency
1 SLC 34
2 SLC 37
3 SLC 34
4 LA 36
5 LA 36
...
θ1 = 34.6
...
35 ± 1.6
θ2 = 33.4 θ100 = 35.4
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
What is the average latency in
the table?
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
5 SLC 37
Error Estimation in BlinkDB
Leverage Poissonized
Resampling to generate
samples with replacement
Knowing When You’re Wrong: Building Fast
and Reliable Approximate Query Processing
Systems. In SIGMOD 2014.
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
What is the average latency in
the table?
ID City Latency #1
1 NYC 30 2
2 NYC 38 1
3 SLC 34 0
4 SLC 34 1
5 SLC 37 1
Sample from a
Poisson (1) Distribution
θ1 = 33.8
Error Estimation in BlinkDB
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
What is the average latency in
the table?
ID City Latency #1
1 NYC 30 2
2 NYC 38 1
3 SLC 34 0
4 SLC 34 1
5 SLC 37 1
6 SF 28 2
Incremental Error
Estimation
Error Estimation in BlinkDB
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
What is the average latency in
the table?
ID City Latency #1 #2
1 NYC 30 2 1
2 NYC 38 1 0
3 SLC 34 0 2
4 SLC 34 1 2
5 SLC 37 1 0
6 SF 28 2 1
Construct all
Resamples in
a Single Pass
Error Estimation in BlinkDB
Bootstrap Performance
Compute  Sum(latency)
rSumi =  rSumi +  latency   *  #i
Virtual  call  to  Add.eval()
Virtual  call  to  rSum.eval()
Return  boxed  Double
Virtual  call  to  Multiply.eval()
Virtual  call  to  latency.eval()
Return  boxed  Double
Virtual  call  to  #.eval()
……
Add
rSum Multiply
latency #
Using Scala Quasiquotes/ Runtime Reflection
def generateCode(e:   Expression):   Tree =  e  match {
case Attribute(ordinal)   =>
q"inputRow.getInt($ordinal)"
case Add(left,   right)   =>
q"""
{
val leftResult =  ${generateCode(left)}
val rightResult =  ${generateCode(right)}
leftResult +  rightResult
}
"""
}
Code Generation Bootstrap
Compute  Sum(latency)
rSumi =  rSumi +  latency   *  #i
Consolidate  all  bootstrap  in  a  tight  for  loop
val left0  =  inputRow.getDouble(j)   //  latency
for  (i from  0  until  n)
val right0  =  inputRow.getInt(k+i)   //  #i
val left1  =  inputRow.getDouble(0+i)   //  rSumi
val right1  =  left0*right0
val result1   =  left1  +  right1
resultRow.setDouble(0,   result1)
62
2-5x Faster!
Conclusion
1. Online Approximation is an important
means to achieve interactivityin
processing large datasets
2. New SparkSQL Libraries:
- G-OLA for Continuous Answers
- BlinkDB for Continuous Error Bars
Check out our code!
1. Early Adopter Preview: http://github.com/amplab/bootstrap-
sql. Send us an email to kaizeng@cs.berkeley.eduand
sameer@databricks.comto get access!
2. Spark Package coming soon
3. GradualNative SparkSQL Integration in 1.5, 1.6 and beyond
Thank you.

More Related Content

What's hot

Update on C++ Core Guidelines Lifetime Analysis. Gábor Horváth. CoreHard Spri...
Update on C++ Core Guidelines Lifetime Analysis. Gábor Horváth. CoreHard Spri...Update on C++ Core Guidelines Lifetime Analysis. Gábor Horváth. CoreHard Spri...
Update on C++ Core Guidelines Lifetime Analysis. Gábor Horváth. CoreHard Spri...corehard_by
 
nosymbols - defcon russia 20
nosymbols - defcon russia 20nosymbols - defcon russia 20
nosymbols - defcon russia 20DefconRussia
 
Apache Flink @ NYC Flink Meetup
Apache Flink @ NYC Flink MeetupApache Flink @ NYC Flink Meetup
Apache Flink @ NYC Flink MeetupStephan Ewen
 
Segmentation Faults, Page Faults, Processes, Threads, and Tasks
Segmentation Faults, Page Faults, Processes, Threads, and TasksSegmentation Faults, Page Faults, Processes, Threads, and Tasks
Segmentation Faults, Page Faults, Processes, Threads, and TasksDavid Evans
 
DARPA CGC and DEFCON CTF: Automatic Attack and Defense Technique
DARPA CGC and DEFCON CTF: Automatic Attack and Defense TechniqueDARPA CGC and DEFCON CTF: Automatic Attack and Defense Technique
DARPA CGC and DEFCON CTF: Automatic Attack and Defense TechniqueChong-Kuan Chen
 
Dynamic Filtering: Multi-Purpose Architecture Support for Language Runtime Sy...
Dynamic Filtering: Multi-Purpose Architecture Support for Language Runtime Sy...Dynamic Filtering: Multi-Purpose Architecture Support for Language Runtime Sy...
Dynamic Filtering: Multi-Purpose Architecture Support for Language Runtime Sy...Yuanxuan Wang
 
Making a Process (Virtualizing Memory)
Making a Process (Virtualizing Memory)Making a Process (Virtualizing Memory)
Making a Process (Virtualizing Memory)David Evans
 
Weakpass - defcon russia 23
Weakpass - defcon russia 23Weakpass - defcon russia 23
Weakpass - defcon russia 23DefconRussia
 
[Defcon Russia #29] Алексей Тюрин - Spring autobinding
[Defcon Russia #29] Алексей Тюрин - Spring autobinding[Defcon Russia #29] Алексей Тюрин - Spring autobinding
[Defcon Russia #29] Алексей Тюрин - Spring autobindingDefconRussia
 
EdSketch: Execution-Driven Sketching for Java
EdSketch: Execution-Driven Sketching for JavaEdSketch: Execution-Driven Sketching for Java
EdSketch: Execution-Driven Sketching for JavaLisa Hua
 
Вениамин Гвоздиков: Особенности использования DTrace
Вениамин Гвоздиков: Особенности использования DTrace Вениамин Гвоздиков: Особенности использования DTrace
Вениамин Гвоздиков: Особенности использования DTrace Yandex
 
Hash Endereçamento Quadrático - Bean
Hash Endereçamento Quadrático - BeanHash Endereçamento Quadrático - Bean
Hash Endereçamento Quadrático - BeanElaine Cecília Gatto
 
VHdl lab report
VHdl lab reportVHdl lab report
VHdl lab reportJinesh Kb
 
Triton and Symbolic execution on GDB@DEF CON China
Triton and Symbolic execution on GDB@DEF CON ChinaTriton and Symbolic execution on GDB@DEF CON China
Triton and Symbolic execution on GDB@DEF CON ChinaWei-Bo Chen
 
Story of static code analyzer development
Story of static code analyzer developmentStory of static code analyzer development
Story of static code analyzer developmentAndrey Karpov
 
Digital system design practical file
Digital system design practical fileDigital system design practical file
Digital system design practical fileArchita Misra
 

What's hot (20)

Update on C++ Core Guidelines Lifetime Analysis. Gábor Horváth. CoreHard Spri...
Update on C++ Core Guidelines Lifetime Analysis. Gábor Horváth. CoreHard Spri...Update on C++ Core Guidelines Lifetime Analysis. Gábor Horváth. CoreHard Spri...
Update on C++ Core Guidelines Lifetime Analysis. Gábor Horváth. CoreHard Spri...
 
Managing Memory
Managing MemoryManaging Memory
Managing Memory
 
nosymbols - defcon russia 20
nosymbols - defcon russia 20nosymbols - defcon russia 20
nosymbols - defcon russia 20
 
Apache Flink @ NYC Flink Meetup
Apache Flink @ NYC Flink MeetupApache Flink @ NYC Flink Meetup
Apache Flink @ NYC Flink Meetup
 
Segmentation Faults, Page Faults, Processes, Threads, and Tasks
Segmentation Faults, Page Faults, Processes, Threads, and TasksSegmentation Faults, Page Faults, Processes, Threads, and Tasks
Segmentation Faults, Page Faults, Processes, Threads, and Tasks
 
DARPA CGC and DEFCON CTF: Automatic Attack and Defense Technique
DARPA CGC and DEFCON CTF: Automatic Attack and Defense TechniqueDARPA CGC and DEFCON CTF: Automatic Attack and Defense Technique
DARPA CGC and DEFCON CTF: Automatic Attack and Defense Technique
 
Dynamic Filtering: Multi-Purpose Architecture Support for Language Runtime Sy...
Dynamic Filtering: Multi-Purpose Architecture Support for Language Runtime Sy...Dynamic Filtering: Multi-Purpose Architecture Support for Language Runtime Sy...
Dynamic Filtering: Multi-Purpose Architecture Support for Language Runtime Sy...
 
Making a Process (Virtualizing Memory)
Making a Process (Virtualizing Memory)Making a Process (Virtualizing Memory)
Making a Process (Virtualizing Memory)
 
Weakpass - defcon russia 23
Weakpass - defcon russia 23Weakpass - defcon russia 23
Weakpass - defcon russia 23
 
JVM Mechanics
JVM MechanicsJVM Mechanics
JVM Mechanics
 
[Defcon Russia #29] Алексей Тюрин - Spring autobinding
[Defcon Russia #29] Алексей Тюрин - Spring autobinding[Defcon Russia #29] Алексей Тюрин - Spring autobinding
[Defcon Russia #29] Алексей Тюрин - Spring autobinding
 
EdSketch: Execution-Driven Sketching for Java
EdSketch: Execution-Driven Sketching for JavaEdSketch: Execution-Driven Sketching for Java
EdSketch: Execution-Driven Sketching for Java
 
Вениамин Гвоздиков: Особенности использования DTrace
Вениамин Гвоздиков: Особенности использования DTrace Вениамин Гвоздиков: Особенности использования DTrace
Вениамин Гвоздиков: Особенности использования DTrace
 
Hash Endereçamento Quadrático - Bean
Hash Endereçamento Quadrático - BeanHash Endereçamento Quadrático - Bean
Hash Endereçamento Quadrático - Bean
 
VHdl lab report
VHdl lab reportVHdl lab report
VHdl lab report
 
Triton and Symbolic execution on GDB@DEF CON China
Triton and Symbolic execution on GDB@DEF CON ChinaTriton and Symbolic execution on GDB@DEF CON China
Triton and Symbolic execution on GDB@DEF CON China
 
Story of static code analyzer development
Story of static code analyzer developmentStory of static code analyzer development
Story of static code analyzer development
 
Mutation @ Spotify
Mutation @ Spotify Mutation @ Spotify
Mutation @ Spotify
 
Pattern Matching in Java 14
Pattern Matching in Java 14Pattern Matching in Java 14
Pattern Matching in Java 14
 
Digital system design practical file
Digital system design practical fileDigital system design practical file
Digital system design practical file
 

Viewers also liked

How to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentHow to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentBlueData, Inc.
 
Carpe Datum: Building Big Data Analytical Applications with HP Haven
Carpe Datum: Building Big Data Analytical Applications with HP HavenCarpe Datum: Building Big Data Analytical Applications with HP Haven
Carpe Datum: Building Big Data Analytical Applications with HP HavenDataWorks Summit
 
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...DataWorks Summit
 
Big Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNBig Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNDataWorks Summit
 
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...DataWorks Summit
 
Karta an ETL Framework to process high volume datasets
Karta an ETL Framework to process high volume datasets Karta an ETL Framework to process high volume datasets
Karta an ETL Framework to process high volume datasets DataWorks Summit
 
Inspiring Travel at Airbnb [WIP]
Inspiring Travel at Airbnb [WIP]Inspiring Travel at Airbnb [WIP]
Inspiring Travel at Airbnb [WIP]DataWorks Summit
 
Realistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure DevelopmentRealistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure DevelopmentDataWorks Summit
 
HBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQLHBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQLDataWorks Summit
 
Hadoop in Validated Environment - Data Governance Initiative
Hadoop in Validated Environment - Data Governance InitiativeHadoop in Validated Environment - Data Governance Initiative
Hadoop in Validated Environment - Data Governance InitiativeDataWorks Summit
 
Practical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopPractical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopDataWorks Summit
 
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...DataWorks Summit
 
One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)DataWorks Summit
 
Running Spark and MapReduce together in Production
Running Spark and MapReduce together in ProductionRunning Spark and MapReduce together in Production
Running Spark and MapReduce together in ProductionDataWorks Summit
 
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARNDeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARNDataWorks Summit
 
Open Source SQL for Hadoop: Where are we and Where are we Going?
Open Source SQL for Hadoop: Where are we and Where are we Going?Open Source SQL for Hadoop: Where are we and Where are we Going?
Open Source SQL for Hadoop: Where are we and Where are we Going?DataWorks Summit
 
Spark Application Development Made Easy
Spark Application Development Made EasySpark Application Development Made Easy
Spark Application Development Made EasyDataWorks Summit
 

Viewers also liked (20)

How to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentHow to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environment
 
Carpe Datum: Building Big Data Analytical Applications with HP Haven
Carpe Datum: Building Big Data Analytical Applications with HP HavenCarpe Datum: Building Big Data Analytical Applications with HP Haven
Carpe Datum: Building Big Data Analytical Applications with HP Haven
 
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
 
50 Shades of SQL
50 Shades of SQL50 Shades of SQL
50 Shades of SQL
 
Big Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNBig Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeN
 
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
 
Karta an ETL Framework to process high volume datasets
Karta an ETL Framework to process high volume datasets Karta an ETL Framework to process high volume datasets
Karta an ETL Framework to process high volume datasets
 
Inspiring Travel at Airbnb [WIP]
Inspiring Travel at Airbnb [WIP]Inspiring Travel at Airbnb [WIP]
Inspiring Travel at Airbnb [WIP]
 
Realistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure DevelopmentRealistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure Development
 
HBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQLHBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQL
 
Hadoop for Genomics__HadoopSummit2010
Hadoop for Genomics__HadoopSummit2010Hadoop for Genomics__HadoopSummit2010
Hadoop for Genomics__HadoopSummit2010
 
Hadoop in Validated Environment - Data Governance Initiative
Hadoop in Validated Environment - Data Governance InitiativeHadoop in Validated Environment - Data Governance Initiative
Hadoop in Validated Environment - Data Governance Initiative
 
Practical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopPractical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on Hadoop
 
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
 
One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)
 
Running Spark and MapReduce together in Production
Running Spark and MapReduce together in ProductionRunning Spark and MapReduce together in Production
Running Spark and MapReduce together in Production
 
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARNDeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
 
NoSQL Needs SomeSQL
NoSQL Needs SomeSQLNoSQL Needs SomeSQL
NoSQL Needs SomeSQL
 
Open Source SQL for Hadoop: Where are we and Where are we Going?
Open Source SQL for Hadoop: Where are we and Where are we Going?Open Source SQL for Hadoop: Where are we and Where are we Going?
Open Source SQL for Hadoop: Where are we and Where are we Going?
 
Spark Application Development Made Easy
Spark Application Development Made EasySpark Application Development Made Easy
Spark Application Development Made Easy
 

Similar to Online Approximate OLAP in SparkSQL

Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...Data Con LA
 
Delta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the HoodDelta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the HoodDatabricks
 
A Brief Introduction of TiDB (Percona Live)
A Brief Introduction of TiDB (Percona Live)A Brief Introduction of TiDB (Percona Live)
A Brief Introduction of TiDB (Percona Live)PingCAP
 
3.ASSEMBLERS.pptx
3.ASSEMBLERS.pptx3.ASSEMBLERS.pptx
3.ASSEMBLERS.pptxGaganaP13
 
design-compiler.pdf
design-compiler.pdfdesign-compiler.pdf
design-compiler.pdfFrangoCamila
 
BlinkDB - Approximate Queries on Very Large Data
BlinkDB - Approximate Queries on Very Large DataBlinkDB - Approximate Queries on Very Large Data
BlinkDB - Approximate Queries on Very Large DataKnoldus Inc.
 
Open Source SQL databases enter millions queries per second era
Open Source SQL databases enter millions queries per second eraOpen Source SQL databases enter millions queries per second era
Open Source SQL databases enter millions queries per second eraAlexander Korotkov
 
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdfHailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdfcookie1969
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
Open Source SQL databases enters millions queries per second era
Open Source SQL databases enters millions queries per second eraOpen Source SQL databases enters millions queries per second era
Open Source SQL databases enters millions queries per second eraSveta Smirnova
 
Applied Econometrics assignment3
Applied Econometrics assignment3Applied Econometrics assignment3
Applied Econometrics assignment3Chenguang Li
 
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo InterlandiLazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo InterlandiDatabricks
 
Debugging Ruby
Debugging RubyDebugging Ruby
Debugging RubyAman Gupta
 
Data Acquisition
Data AcquisitionData Acquisition
Data Acquisitionazhar557
 
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data ArtisansApache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data ArtisansEvention
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internalsKostas Tzoumas
 
Spark Streaming Tips for Devs and Ops by Fran perez y federico fernández
Spark Streaming Tips for Devs and Ops by Fran perez y federico fernándezSpark Streaming Tips for Devs and Ops by Fran perez y federico fernández
Spark Streaming Tips for Devs and Ops by Fran perez y federico fernándezJ On The Beach
 

Similar to Online Approximate OLAP in SparkSQL (20)

Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
 
Delta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the HoodDelta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the Hood
 
A Brief Introduction of TiDB (Percona Live)
A Brief Introduction of TiDB (Percona Live)A Brief Introduction of TiDB (Percona Live)
A Brief Introduction of TiDB (Percona Live)
 
3.ASSEMBLERS.pptx
3.ASSEMBLERS.pptx3.ASSEMBLERS.pptx
3.ASSEMBLERS.pptx
 
design-compiler.pdf
design-compiler.pdfdesign-compiler.pdf
design-compiler.pdf
 
BlinkDB - Approximate Queries on Very Large Data
BlinkDB - Approximate Queries on Very Large DataBlinkDB - Approximate Queries on Very Large Data
BlinkDB - Approximate Queries on Very Large Data
 
Open Source SQL databases enter millions queries per second era
Open Source SQL databases enter millions queries per second eraOpen Source SQL databases enter millions queries per second era
Open Source SQL databases enter millions queries per second era
 
Xbfs HPDC'2019
Xbfs HPDC'2019Xbfs HPDC'2019
Xbfs HPDC'2019
 
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdfHailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Open Source SQL databases enters millions queries per second era
Open Source SQL databases enters millions queries per second eraOpen Source SQL databases enters millions queries per second era
Open Source SQL databases enters millions queries per second era
 
Applied Econometrics assignment3
Applied Econometrics assignment3Applied Econometrics assignment3
Applied Econometrics assignment3
 
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo InterlandiLazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
 
GCC
GCCGCC
GCC
 
Debugging Ruby
Debugging RubyDebugging Ruby
Debugging Ruby
 
Data Acquisition
Data AcquisitionData Acquisition
Data Acquisition
 
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data ArtisansApache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Spark Streaming Tips for Devs and Ops
Spark Streaming Tips for Devs and OpsSpark Streaming Tips for Devs and Ops
Spark Streaming Tips for Devs and Ops
 
Spark Streaming Tips for Devs and Ops by Fran perez y federico fernández
Spark Streaming Tips for Devs and Ops by Fran perez y federico fernándezSpark Streaming Tips for Devs and Ops by Fran perez y federico fernández
Spark Streaming Tips for Devs and Ops by Fran perez y federico fernández
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 

Recently uploaded (20)

Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 

Online Approximate OLAP in SparkSQL

  • 1. Online Approximate OLAP in SparkSQL Kai Zengand Sameer Agarwal Hadoop Summit | San Jose, CA | June 9th 2015
  • 2. Hard Disks (dailyreports) ½ - 1 Hour 1 - 5 Minutes 1 second ? Memory (hourly reports) 100 TB on 1000 machines Need for Quick & Continuous Feedback (exploratoryqueries)
  • 3. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Online Aggregation1 What is the average latency in the table? 34.6667 1Joseph M. Hellerstein, Peter J. Haas, Helen J. Wang. Online Aggregation. In SIGMOD 1997.
  • 4. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Online Aggregation What is the average latency in the table? 35
  • 5. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Online Aggregation What is the average latency in the table? 33.83 35
  • 6. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Online Aggregation What is the average latency in the table? 33.83 34.6667 35
  • 7. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Online Aggregation What is the average latency in the table? 33.83 ± 1.3 34.6667 ± 0.0 35 ± 2.1
  • 10. SparkSQL Interface val dataFrame =   sqlCtx.sql(“select  avg(latency)  from  log”) //  batch  processing val result  =  dataFrame.collect()  //  34.6667
  • 11. SparkSQL + G-OLA + BlinkDB val dataFrame =   sqlCtx.sql(“select  avg(latency)  from  log”) //  online  processing val onlineDataFrame =  dataFrame.online onlineDataFrame.collectNext() //  35  ± 2.1 onlineDataFrame.collectNext() //  33.83  ± 1.3
  • 12. val dataFrame =   sqlCtx.sql(“select  avg(latency)  from  log”) //  online  processing val onlineDataFrame =  dataFrame.online while (onlineDataFrame.hasNext())  { onlineDataFrame.collectNext() } SparkSQL + G-OLA + BlinkDB
  • 13. val dataFrame =   sqlCtx.sql(“select  avg(latency)  from  log”) //  online  processing val onlineDataFrame =  dataFrame.online while (onlineDataFrame.hasNext()  && responseTime <=  10.seconds)  { onlineDataFrame.collectNext() } SparkSQL + G-OLA + BlinkDB
  • 14. val dataFrame =   sqlCtx.sql(“select  avg(latency)  from  log”) //  online  processing val onlineDataFrame =  dataFrame.online while (onlineDataFrame.hasNext()  && errorBound >=  0.01)  { onlineDataFrame.collectNext() } SparkSQL + G-OLA + BlinkDB
  • 15. val dataFrame =   sqlCtx.sql(“select  avg(latency)  from  log”) //  online  processing val onlineDataFrame =  dataFrame.online while (onlineDataFrame.hasNext()  && userEvent.cancelled())  { onlineDataFrame.collectNext() } SparkSQL + G-OLA + BlinkDB
  • 16. val dataFrame =   sqlCtx.sql(“select  avg(latency)  from  log”) //  online  processing val onlineDataFrame =  dataFrame.online while (onlineDataFrame.hasNext()  && userEvent.cancelled())  { onlineDataFrame.collectNext() } SparkSQL + G-OLA + BlinkDB
  • 17. val dataFrame =   sqlCtx.sql(“select  avg(latency)  from  log”) //  online  processing val onlineDataFrame =  dataFrame.online while (onlineDataFrame.hasNext()  && userEvent.cancelled())  { onlineDataFrame.collectNext() } AGGREGATES/  UDAFs JOINS/GROUP  BYs NESTED  QUERIES SparkSQL + G-OLA + BlinkDB
  • 19. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Part 1: Generalized Online Aggregation (G-OLA) What is the average latency in the table? 33.83 ± 1.3 34.6667 ± 0.0 35 ± 2.1
  • 20. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 What is the average latency in the table? 33.83 ± 1.3 34.6667 ± 0.0 35 ± 2.1 Part 2: Error Estimation in BlinkDB
  • 21. Key Idea: Introduce Delta Maintenance as a First Class Citizen in Query Execution Part 1: Generalized Online Aggregation (G-OLA)
  • 22. Query Execution: Under The Hood SQL Query Query Plan AGGR SCAN AVG(latency) SELECT  avg(latency) FROM  log log
  • 23. Query Execution: Under The Hood SQL Query Query Plan FILTER JOIN AGGR SCAN SCAN AGGR latency > AVG(latency) AVG(latency) AVG(latency)SELECT  avg(latency) FROM  log WHERE  latency  > ( SELECT   avg(latency) FROM  log ) log log
  • 24. Query Execution: Under The Hood log
  • 25. Query Execution: Under The Hood log
  • 26. Query Execution: Under The Hood collect() log Query
  • 27. Query Execution: Under The Hood collect() collectNext()
  • 28. Query Execution: Under The Hood collect() collectNext() Delta Input 1 Delta Input: New data + Previous Intermediate State
  • 29. Query Execution: Under The Hood Delta Input 2 Delta Input: New data + Previous Intermediate State
  • 31. Intermediate State ?SELECT  avg(latency) FROM  log WHERE  latency  > ( SELECT  avg(latency) FROM  log )
  • 32. Intermediate State SELECT  avg(latency) FROM  log WHERE  latency  > ( SELECT  avg(latency) FROM  log )
  • 33. Intermediate State SELECT  avg(latency) FROM  log WHERE  latency  > ( SELECT  avg(latency) FROM  log ) G-OLA: Generalized Online Aggregation for Interactive Analysis on Big Data. In SIGMOD 2015
  • 34. Intermediate State SELECT  avg(latency) FROM  log WHERE  latency  > ( SELECT  avg(latency) FROM  log ) G-OLA: Generalized Online Aggregation for Interactive Analysis on Big Data. In SIGMOD 2015
  • 35. 35 ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 SELECT  avg(latency) FROM  log WHERE  latency  >  SELECT  avg(latency) FROM  log Intermediate State
  • 36. 35 ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 SELECT  avg(latency) FROM  log WHERE  latency  >  SELECT  avg(latency) FROM  log Intermediate State 33.83
  • 37. 35 ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 SELECT  avg(latency) FROM  log WHERE  latency  >  SELECT  avg(latency) FROM  log Intermediate State 33.83
  • 38. 35 ± 1.5 ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 SELECT  avg(latency) FROM  log WHERE  latency  >  SELECT  avg(latency) FROM  log Intermediate State 33.83
  • 39. 35 ± 1.5 ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 SELECT  avg(latency) FROM  log WHERE  latency  >  SELECT  avg(latency) FROM  log Intermediate State 33.83
  • 40. 35 ± 1.5 Uncertain Tuples ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 SELECT  avg(latency) FROM  log WHERE  latency  >  SELECT  avg(latency) FROM  log Intermediate State 33.83
  • 41. FILTER JOIN AGGR SCAN SCAN AGGR latency > AVG(latency) AVG(latency) AVG(latency) FILTER JOIN AGGR SCAN SCAN AGGR latency > AVG(latency) AVG(latency) AVG(latency) Uncertain Tuples: 𝑇", 𝑇$ Running sum/count of: 𝑇%, 𝑇&   Running sum/count of: 𝑇(, 𝑇%, 𝑇", 𝑇$, 𝑇&     Intermediate State: Intermediate State
  • 42. Goal – Minimize Intermediate State Each delta-update query • Optimized using Catalyst in Spark SQL – Capable of using all existing optimizations • Compiled into a distributed Spark Job Online Execution Model
  • 43. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 What is the average latency in the table? 33.83 ± 1.3 34.6667 ± 0.0 35 ± 2.1 Part 2: Error Estimation in BlinkDB
  • 44. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Error Estimation on a Sample of Data What is the average latency in the table? 34.6667
  • 45. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Error Estimation on a Sample of Data What is the average latency in the table? 34.5 34.6667
  • 46. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Error Estimation on a Sample of Data What is the average latency in the table? 34.5 ± 2.0 34.6667
  • 47. Focused on estimating aggregate errors given representative samples Central LimitTheorem (CLT) Error Estimation using Bootstrap HOE:ASTAT63,BIL: WILEY86, CGL:ASTAT83,PH:IBM96 EFRON:JAS82, EFRON:JAS87,VP:TPMS80, FGK:IJCAI99, ET:CH93 47 Error Estimation on a Sample of Data d predicate for the query) The following results are (asymptotically in sample size) true, but not di- rectly useful, since they depend on unknown properties of the underlying dis- tribution. In all cases we just plug in the sample values. For example, instead of µ we use 1 n Pn i=1 Xi where Xi is the ith sample value. Note that for estimators other than sum and count, I assume no filtering (p = 1). Filtering will increase variance a bit, or potentially a lot for extremely selective queries (p = 0). I can compute the filtering-adjusted values if you like. 1. Count: N(np, n(1 p)p) 2. Sum: N(npµ, np( 2 + (1 p)µ2 )) 3. Mean: N(µ, 2 /n) 4. Variance: N( 2 , (µ4 4 )/n) 5. Stddev: N( , (µ4 4 )/(4 2 n)) Sampling! … Resampling! D S100 S1 S θ(S1 ) θ(S100 )θ(S) 95% confidence interval! …
  • 48. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Error Estimation using Bootstrap What is the average latency in the table?
  • 49. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Error Estimation using Bootstrap ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 SLC 34
  • 50. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Error Estimation using Bootstrap ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 SLC 34 θ1 = 34
  • 51. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Error Estimation using Bootstrap ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 SLC 34 ID City Latency 1 NYC 30 2 NYC 30 3 SLC 34 4 LA 36 θ1 = 34 θ2 = 32.5
  • 52. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Error Estimation using Bootstrap ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 SLC 34 ID City Latency 1 NYC 30 2 NYC 30 3 SLC 34 4 LA 36 ID City Latency 1 SLC 34 2 LA 36 3 SLC 34 4 LA 36 ... θ1 = 34 ...θ2 = 32.5 θ100 = 35
  • 53. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Error Estimation using Bootstrap ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 SLC 34 ID City Latency 1 NYC 30 2 NYC 30 3 SLC 34 4 LA 36 ID City Latency 1 SLC 34 2 LA 36 3 SLC 34 4 LA 36 ... θ1 = 34 ... 34.5 ± 2 θ2 = 32.5 θ100 = 35
  • 54. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Error Estimation using Bootstrap ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 SLC 34 ID City Latency 1 NYC 30 2 NYC 30 3 SLC 34 4 LA 36 ID City Latency 1 SLC 34 2 LA 36 3 SLC 34 4 LA 36 ... θ1 = 34 ... 34.5 ± 2 θ2 = 32.5 θ100 = 35
  • 55. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Error Estimation using Bootstrap ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 SLC 34 5 SLC 37 ID City Latency 1 SLC 37 2 NYC 30 3 SLC 34 4 LA 36 5 NYC 30 ID City Latency 1 SLC 34 2 SLC 37 3 SLC 34 4 LA 36 5 LA 36 ... θ1 = 34.6 ... 35 ± 1.6 θ2 = 33.4 θ100 = 35.4
  • 56. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 What is the average latency in the table? ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 SLC 34 5 SLC 37 Error Estimation in BlinkDB Leverage Poissonized Resampling to generate samples with replacement Knowing When You’re Wrong: Building Fast and Reliable Approximate Query Processing Systems. In SIGMOD 2014.
  • 57. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 What is the average latency in the table? ID City Latency #1 1 NYC 30 2 2 NYC 38 1 3 SLC 34 0 4 SLC 34 1 5 SLC 37 1 Sample from a Poisson (1) Distribution θ1 = 33.8 Error Estimation in BlinkDB
  • 58. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 What is the average latency in the table? ID City Latency #1 1 NYC 30 2 2 NYC 38 1 3 SLC 34 0 4 SLC 34 1 5 SLC 37 1 6 SF 28 2 Incremental Error Estimation Error Estimation in BlinkDB
  • 59. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 What is the average latency in the table? ID City Latency #1 #2 1 NYC 30 2 1 2 NYC 38 1 0 3 SLC 34 0 2 4 SLC 34 1 2 5 SLC 37 1 0 6 SF 28 2 1 Construct all Resamples in a Single Pass Error Estimation in BlinkDB
  • 60. Bootstrap Performance Compute  Sum(latency) rSumi =  rSumi +  latency   *  #i Virtual  call  to  Add.eval() Virtual  call  to  rSum.eval() Return  boxed  Double Virtual  call  to  Multiply.eval() Virtual  call  to  latency.eval() Return  boxed  Double Virtual  call  to  #.eval() …… Add rSum Multiply latency #
  • 61. Using Scala Quasiquotes/ Runtime Reflection def generateCode(e:   Expression):   Tree =  e  match { case Attribute(ordinal)   => q"inputRow.getInt($ordinal)" case Add(left,   right)   => q""" { val leftResult =  ${generateCode(left)} val rightResult =  ${generateCode(right)} leftResult +  rightResult } """ }
  • 62. Code Generation Bootstrap Compute  Sum(latency) rSumi =  rSumi +  latency   *  #i Consolidate  all  bootstrap  in  a  tight  for  loop val left0  =  inputRow.getDouble(j)   //  latency for  (i from  0  until  n) val right0  =  inputRow.getInt(k+i)   //  #i val left1  =  inputRow.getDouble(0+i)   //  rSumi val right1  =  left0*right0 val result1   =  left1  +  right1 resultRow.setDouble(0,   result1) 62 2-5x Faster!
  • 63. Conclusion 1. Online Approximation is an important means to achieve interactivityin processing large datasets 2. New SparkSQL Libraries: - G-OLA for Continuous Answers - BlinkDB for Continuous Error Bars
  • 64. Check out our code! 1. Early Adopter Preview: http://github.com/amplab/bootstrap- sql. Send us an email to kaizeng@cs.berkeley.eduand sameer@databricks.comto get access! 2. Spark Package coming soon 3. GradualNative SparkSQL Integration in 1.5, 1.6 and beyond