SlideShare a Scribd company logo
Online Approximate
OLAP in SparkSQL
Kai Zengand Sameer Agarwal
Hadoop Summit | San Jose, CA | June 9th 2015
Hard Disks
(dailyreports)
½ - 1 Hour 1 - 5 Minutes 1 second
?
Memory
(hourly reports)
100 TB on 1000 machines
Need for Quick & Continuous Feedback
(exploratoryqueries)
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Online Aggregation1
What is the average latency in
the table?
34.6667
1Joseph M. Hellerstein, Peter J. Haas, Helen J.
Wang. Online Aggregation. In SIGMOD 1997.
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Online Aggregation
What is the average latency in
the table?
35
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Online Aggregation
What is the average latency in
the table?
33.83
35
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Online Aggregation
What is the average latency in
the table?
33.83
34.6667
35
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Online Aggregation
What is the average latency in
the table?
33.83 ± 1.3
34.6667 ± 0.0
35 ± 2.1
System Design
8
SparkSQL
System Design
9
SparkSQL
G-OLA
(Generalized OLA)
Continuous Answers Continuous Error Bars
SparkSQL Interface
val dataFrame =  
sqlCtx.sql(“select  avg(latency)  from  log”)
//  batch  processing
val result  =  dataFrame.collect()  //  34.6667
SparkSQL + G-OLA + BlinkDB
val dataFrame =  
sqlCtx.sql(“select  avg(latency)  from  log”)
//  online  processing
val onlineDataFrame =  dataFrame.online
onlineDataFrame.collectNext() //  35  ± 2.1
onlineDataFrame.collectNext() //  33.83  ± 1.3
val dataFrame =  
sqlCtx.sql(“select  avg(latency)  from  log”)
//  online  processing
val onlineDataFrame =  dataFrame.online
while (onlineDataFrame.hasNext())  {
onlineDataFrame.collectNext()
}
SparkSQL + G-OLA + BlinkDB
val dataFrame =  
sqlCtx.sql(“select  avg(latency)  from  log”)
//  online  processing
val onlineDataFrame =  dataFrame.online
while (onlineDataFrame.hasNext()  &&
responseTime <=  10.seconds)  {
onlineDataFrame.collectNext()
}
SparkSQL + G-OLA + BlinkDB
val dataFrame =  
sqlCtx.sql(“select  avg(latency)  from  log”)
//  online  processing
val onlineDataFrame =  dataFrame.online
while (onlineDataFrame.hasNext()  &&
errorBound >=  0.01)  {
onlineDataFrame.collectNext()
}
SparkSQL + G-OLA + BlinkDB
val dataFrame =  
sqlCtx.sql(“select  avg(latency)  from  log”)
//  online  processing
val onlineDataFrame =  dataFrame.online
while (onlineDataFrame.hasNext()  &&
userEvent.cancelled())  {
onlineDataFrame.collectNext()
}
SparkSQL + G-OLA + BlinkDB
val dataFrame =  
sqlCtx.sql(“select  avg(latency)  from  log”)
//  online  processing
val onlineDataFrame =  dataFrame.online
while (onlineDataFrame.hasNext()  &&
userEvent.cancelled())  {
onlineDataFrame.collectNext()
}
SparkSQL + G-OLA + BlinkDB
val dataFrame =  
sqlCtx.sql(“select  avg(latency)  from  log”)
//  online  processing
val onlineDataFrame =  dataFrame.online
while (onlineDataFrame.hasNext()  &&
userEvent.cancelled())  {
onlineDataFrame.collectNext()
}
AGGREGATES/  UDAFs
JOINS/GROUP  BYs
NESTED  QUERIES
SparkSQL + G-OLA + BlinkDB
This Talk ...
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Part 1: Generalized Online Aggregation (G-OLA)
What is the average latency in
the table?
33.83 ± 1.3
34.6667 ± 0.0
35 ± 2.1
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
What is the average latency in
the table?
33.83 ± 1.3
34.6667 ± 0.0
35 ± 2.1
Part 2: Error Estimation in BlinkDB
Key Idea:
Introduce Delta Maintenance as a
First Class Citizen in Query
Execution
Part 1: Generalized Online Aggregation (G-OLA)
Query Execution: Under The Hood
SQL Query Query Plan
AGGR
SCAN
AVG(latency)
SELECT  avg(latency)
FROM  log
log
Query Execution: Under The Hood
SQL Query Query Plan
FILTER
JOIN
AGGR SCAN
SCAN
AGGR
latency > AVG(latency)
AVG(latency)
AVG(latency)SELECT  avg(latency)
FROM  log
WHERE  latency  >
(
SELECT  
avg(latency)
FROM  log
)
log
log
Query Execution: Under The Hood
log
Query Execution: Under The Hood
log
Query Execution: Under The Hood
collect()
log
Query
Query Execution: Under The Hood
collect() collectNext()
Query Execution: Under The Hood
collect() collectNext()
Delta
Input 1
Delta Input: New data + Previous Intermediate State
Query Execution: Under The Hood
Delta
Input 2
Delta Input: New data + Previous Intermediate State
Intermediate State
SELECT  avg(latency)
FROM  log
Intermediate State
?SELECT  avg(latency)
FROM  log
WHERE  latency  >
(
SELECT  avg(latency)
FROM  log
)
Intermediate State
SELECT  avg(latency)
FROM  log
WHERE  latency  >
(
SELECT  avg(latency)
FROM  log
)
Intermediate State
SELECT  avg(latency)
FROM  log
WHERE  latency  >
(
SELECT  avg(latency)
FROM  log
)
G-OLA: Generalized Online Aggregation for
Interactive Analysis on Big Data. In SIGMOD 2015
Intermediate State
SELECT  avg(latency)
FROM  log
WHERE  latency  >
(
SELECT  avg(latency)
FROM  log
)
G-OLA: Generalized Online Aggregation for
Interactive Analysis on Big Data. In SIGMOD 2015
35
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
SELECT  avg(latency)
FROM  log
WHERE  latency  >  SELECT  avg(latency)
FROM  log
Intermediate State
35
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
SELECT  avg(latency)
FROM  log
WHERE  latency  >  SELECT  avg(latency)
FROM  log
Intermediate State
33.83
35
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
SELECT  avg(latency)
FROM  log
WHERE  latency  >  SELECT  avg(latency)
FROM  log
Intermediate State
33.83
35 ± 1.5
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
SELECT  avg(latency)
FROM  log
WHERE  latency  >  SELECT  avg(latency)
FROM  log
Intermediate State
33.83
35 ± 1.5
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
SELECT  avg(latency)
FROM  log
WHERE  latency  >  SELECT  avg(latency)
FROM  log
Intermediate State
33.83
35 ± 1.5
Uncertain Tuples
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
SELECT  avg(latency)
FROM  log
WHERE  latency  >  SELECT  avg(latency)
FROM  log
Intermediate State
33.83
FILTER
JOIN
AGGR SCAN
SCAN
AGGR
latency > AVG(latency)
AVG(latency)
AVG(latency)
FILTER
JOIN
AGGR SCAN
SCAN
AGGR
latency > AVG(latency)
AVG(latency)
AVG(latency)
Uncertain Tuples:
𝑇", 𝑇$
Running sum/count of:
𝑇%, 𝑇&  
Running sum/count of:
𝑇(, 𝑇%, 𝑇", 𝑇$, 𝑇&  
  
Intermediate State:
Intermediate State
Goal – Minimize Intermediate State
Each delta-update query
• Optimized using Catalyst in Spark SQL
– Capable of using all existing optimizations
• Compiled into a distributed Spark Job
Online Execution Model
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
What is the average latency in
the table?
33.83 ± 1.3
34.6667 ± 0.0
35 ± 2.1
Part 2: Error Estimation in BlinkDB
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation on a Sample of Data
What is the average latency in
the table?
34.6667
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation on a Sample of Data
What is the average latency in
the table?
34.5
34.6667
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation on a Sample of Data
What is the average latency in
the table?
34.5 ± 2.0
34.6667
Focused on estimating aggregate errors given representative
samples
Central LimitTheorem (CLT) Error Estimation using Bootstrap
HOE:ASTAT63,BIL: WILEY86, CGL:ASTAT83,PH:IBM96 EFRON:JAS82, EFRON:JAS87,VP:TPMS80, FGK:IJCAI99, ET:CH93
47
Error Estimation on a Sample of Data
d
predicate for the query)
The following results are (asymptotically in sample size) true, but not di-
rectly useful, since they depend on unknown properties of the underlying dis-
tribution. In all cases we just plug in the sample values. For example, instead
of µ we use 1
n
Pn
i=1 Xi where Xi is the ith sample value.
Note that for estimators other than sum and count, I assume no filtering
(p = 1). Filtering will increase variance a bit, or potentially a lot for extremely
selective queries (p = 0). I can compute the filtering-adjusted values if you like.
1. Count: N(np, n(1 p)p)
2. Sum: N(npµ, np( 2
+ (1 p)µ2
))
3. Mean: N(µ, 2
/n)
4. Variance: N( 2
, (µ4
4
)/n)
5. Stddev: N( , (µ4
4
)/(4 2
n))
Sampling!
…
Resampling!
D
S100
S1
S
θ(S1
)
θ(S100
)θ(S) 95%
confidence
interval!
…
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
What is the average latency in
the table?
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
θ1 = 34
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
ID City Latency
1 NYC 30
2 NYC 30
3 SLC 34
4 LA 36
θ1 = 34 θ2 = 32.5
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
ID City Latency
1 NYC 30
2 NYC 30
3 SLC 34
4 LA 36
ID City Latency
1 SLC 34
2 LA 36
3 SLC 34
4 LA 36
...
θ1 = 34 ...θ2 = 32.5 θ100 = 35
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
ID City Latency
1 NYC 30
2 NYC 30
3 SLC 34
4 LA 36
ID City Latency
1 SLC 34
2 LA 36
3 SLC 34
4 LA 36
...
θ1 = 34 ...
34.5 ± 2
θ2 = 32.5 θ100 = 35
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
ID City Latency
1 NYC 30
2 NYC 30
3 SLC 34
4 LA 36
ID City Latency
1 SLC 34
2 LA 36
3 SLC 34
4 LA 36
...
θ1 = 34 ...
34.5 ± 2
θ2 = 32.5 θ100 = 35
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
Error Estimation using Bootstrap
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
5 SLC 37
ID City Latency
1 SLC 37
2 NYC 30
3 SLC 34
4 LA 36
5 NYC 30
ID City Latency
1 SLC 34
2 SLC 37
3 SLC 34
4 LA 36
5 LA 36
...
θ1 = 34.6
...
35 ± 1.6
θ2 = 33.4 θ100 = 35.4
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
What is the average latency in
the table?
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 SLC 34
5 SLC 37
Error Estimation in BlinkDB
Leverage Poissonized
Resampling to generate
samples with replacement
Knowing When You’re Wrong: Building Fast
and Reliable Approximate Query Processing
Systems. In SIGMOD 2014.
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
What is the average latency in
the table?
ID City Latency #1
1 NYC 30 2
2 NYC 38 1
3 SLC 34 0
4 SLC 34 1
5 SLC 37 1
Sample from a
Poisson (1) Distribution
θ1 = 33.8
Error Estimation in BlinkDB
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
What is the average latency in
the table?
ID City Latency #1
1 NYC 30 2
2 NYC 38 1
3 SLC 34 0
4 SLC 34 1
5 SLC 37 1
6 SF 28 2
Incremental Error
Estimation
Error Estimation in BlinkDB
ID City Latency
1 NYC 30
2 NYC 38
3 SLC 34
4 LA 36
5 SLC 37
6 SF 28
7 NYC 32
8 NYC 38
9 LA 36
10 SF 35
11 NYC 38
12 LA 34
What is the average latency in
the table?
ID City Latency #1 #2
1 NYC 30 2 1
2 NYC 38 1 0
3 SLC 34 0 2
4 SLC 34 1 2
5 SLC 37 1 0
6 SF 28 2 1
Construct all
Resamples in
a Single Pass
Error Estimation in BlinkDB
Bootstrap Performance
Compute  Sum(latency)
rSumi =  rSumi +  latency   *  #i
Virtual  call  to  Add.eval()
Virtual  call  to  rSum.eval()
Return  boxed  Double
Virtual  call  to  Multiply.eval()
Virtual  call  to  latency.eval()
Return  boxed  Double
Virtual  call  to  #.eval()
……
Add
rSum Multiply
latency #
Using Scala Quasiquotes/ Runtime Reflection
def generateCode(e:   Expression):   Tree =  e  match {
case Attribute(ordinal)   =>
q"inputRow.getInt($ordinal)"
case Add(left,   right)   =>
q"""
{
val leftResult =  ${generateCode(left)}
val rightResult =  ${generateCode(right)}
leftResult +  rightResult
}
"""
}
Code Generation Bootstrap
Compute  Sum(latency)
rSumi =  rSumi +  latency   *  #i
Consolidate  all  bootstrap  in  a  tight  for  loop
val left0  =  inputRow.getDouble(j)   //  latency
for  (i from  0  until  n)
val right0  =  inputRow.getInt(k+i)   //  #i
val left1  =  inputRow.getDouble(0+i)   //  rSumi
val right1  =  left0*right0
val result1   =  left1  +  right1
resultRow.setDouble(0,   result1)
62
2-5x Faster!
Conclusion
1. Online Approximation is an important
means to achieve interactivityin
processing large datasets
2. New SparkSQL Libraries:
- G-OLA for Continuous Answers
- BlinkDB for Continuous Error Bars
Check out our code!
1. Early Adopter Preview: http://github.com/amplab/bootstrap-
sql. Send us an email to kaizeng@cs.berkeley.eduand
sameer@databricks.comto get access!
2. Spark Package coming soon
3. GradualNative SparkSQL Integration in 1.5, 1.6 and beyond
Thank you.

More Related Content

What's hot

Update on C++ Core Guidelines Lifetime Analysis. Gábor Horváth. CoreHard Spri...
Update on C++ Core Guidelines Lifetime Analysis. Gábor Horváth. CoreHard Spri...Update on C++ Core Guidelines Lifetime Analysis. Gábor Horváth. CoreHard Spri...
Update on C++ Core Guidelines Lifetime Analysis. Gábor Horváth. CoreHard Spri...
corehard_by
 
Managing Memory
Managing MemoryManaging Memory
Managing Memory
David Evans
 
nosymbols - defcon russia 20
nosymbols - defcon russia 20nosymbols - defcon russia 20
nosymbols - defcon russia 20
DefconRussia
 
Apache Flink @ NYC Flink Meetup
Apache Flink @ NYC Flink MeetupApache Flink @ NYC Flink Meetup
Apache Flink @ NYC Flink Meetup
Stephan Ewen
 
Segmentation Faults, Page Faults, Processes, Threads, and Tasks
Segmentation Faults, Page Faults, Processes, Threads, and TasksSegmentation Faults, Page Faults, Processes, Threads, and Tasks
Segmentation Faults, Page Faults, Processes, Threads, and Tasks
David Evans
 
DARPA CGC and DEFCON CTF: Automatic Attack and Defense Technique
DARPA CGC and DEFCON CTF: Automatic Attack and Defense TechniqueDARPA CGC and DEFCON CTF: Automatic Attack and Defense Technique
DARPA CGC and DEFCON CTF: Automatic Attack and Defense Technique
Chong-Kuan Chen
 
Dynamic Filtering: Multi-Purpose Architecture Support for Language Runtime Sy...
Dynamic Filtering: Multi-Purpose Architecture Support for Language Runtime Sy...Dynamic Filtering: Multi-Purpose Architecture Support for Language Runtime Sy...
Dynamic Filtering: Multi-Purpose Architecture Support for Language Runtime Sy...
Yuanxuan Wang
 
Making a Process (Virtualizing Memory)
Making a Process (Virtualizing Memory)Making a Process (Virtualizing Memory)
Making a Process (Virtualizing Memory)
David Evans
 
Weakpass - defcon russia 23
Weakpass - defcon russia 23Weakpass - defcon russia 23
Weakpass - defcon russia 23
DefconRussia
 
JVM Mechanics
JVM MechanicsJVM Mechanics
JVM Mechanics
Doug Hawkins
 
[Defcon Russia #29] Алексей Тюрин - Spring autobinding
[Defcon Russia #29] Алексей Тюрин - Spring autobinding[Defcon Russia #29] Алексей Тюрин - Spring autobinding
[Defcon Russia #29] Алексей Тюрин - Spring autobinding
DefconRussia
 
EdSketch: Execution-Driven Sketching for Java
EdSketch: Execution-Driven Sketching for JavaEdSketch: Execution-Driven Sketching for Java
EdSketch: Execution-Driven Sketching for Java
Lisa Hua
 
Вениамин Гвоздиков: Особенности использования DTrace
Вениамин Гвоздиков: Особенности использования DTrace Вениамин Гвоздиков: Особенности использования DTrace
Вениамин Гвоздиков: Особенности использования DTrace
Yandex
 
Hash Endereçamento Quadrático - Bean
Hash Endereçamento Quadrático - BeanHash Endereçamento Quadrático - Bean
Hash Endereçamento Quadrático - Bean
Elaine Cecília Gatto
 
VHdl lab report
VHdl lab reportVHdl lab report
VHdl lab report
Jinesh Kb
 
Triton and Symbolic execution on GDB@DEF CON China
Triton and Symbolic execution on GDB@DEF CON ChinaTriton and Symbolic execution on GDB@DEF CON China
Triton and Symbolic execution on GDB@DEF CON China
Wei-Bo Chen
 
Story of static code analyzer development
Story of static code analyzer developmentStory of static code analyzer development
Story of static code analyzer development
Andrey Karpov
 
Mutation @ Spotify
Mutation @ Spotify Mutation @ Spotify
Mutation @ Spotify
STAMP Project
 
Pattern Matching in Java 14
Pattern Matching in Java 14Pattern Matching in Java 14
Pattern Matching in Java 14
GlobalLogic Ukraine
 
Digital system design practical file
Digital system design practical fileDigital system design practical file
Digital system design practical file
Archita Misra
 

What's hot (20)

Update on C++ Core Guidelines Lifetime Analysis. Gábor Horváth. CoreHard Spri...
Update on C++ Core Guidelines Lifetime Analysis. Gábor Horváth. CoreHard Spri...Update on C++ Core Guidelines Lifetime Analysis. Gábor Horváth. CoreHard Spri...
Update on C++ Core Guidelines Lifetime Analysis. Gábor Horváth. CoreHard Spri...
 
Managing Memory
Managing MemoryManaging Memory
Managing Memory
 
nosymbols - defcon russia 20
nosymbols - defcon russia 20nosymbols - defcon russia 20
nosymbols - defcon russia 20
 
Apache Flink @ NYC Flink Meetup
Apache Flink @ NYC Flink MeetupApache Flink @ NYC Flink Meetup
Apache Flink @ NYC Flink Meetup
 
Segmentation Faults, Page Faults, Processes, Threads, and Tasks
Segmentation Faults, Page Faults, Processes, Threads, and TasksSegmentation Faults, Page Faults, Processes, Threads, and Tasks
Segmentation Faults, Page Faults, Processes, Threads, and Tasks
 
DARPA CGC and DEFCON CTF: Automatic Attack and Defense Technique
DARPA CGC and DEFCON CTF: Automatic Attack and Defense TechniqueDARPA CGC and DEFCON CTF: Automatic Attack and Defense Technique
DARPA CGC and DEFCON CTF: Automatic Attack and Defense Technique
 
Dynamic Filtering: Multi-Purpose Architecture Support for Language Runtime Sy...
Dynamic Filtering: Multi-Purpose Architecture Support for Language Runtime Sy...Dynamic Filtering: Multi-Purpose Architecture Support for Language Runtime Sy...
Dynamic Filtering: Multi-Purpose Architecture Support for Language Runtime Sy...
 
Making a Process (Virtualizing Memory)
Making a Process (Virtualizing Memory)Making a Process (Virtualizing Memory)
Making a Process (Virtualizing Memory)
 
Weakpass - defcon russia 23
Weakpass - defcon russia 23Weakpass - defcon russia 23
Weakpass - defcon russia 23
 
JVM Mechanics
JVM MechanicsJVM Mechanics
JVM Mechanics
 
[Defcon Russia #29] Алексей Тюрин - Spring autobinding
[Defcon Russia #29] Алексей Тюрин - Spring autobinding[Defcon Russia #29] Алексей Тюрин - Spring autobinding
[Defcon Russia #29] Алексей Тюрин - Spring autobinding
 
EdSketch: Execution-Driven Sketching for Java
EdSketch: Execution-Driven Sketching for JavaEdSketch: Execution-Driven Sketching for Java
EdSketch: Execution-Driven Sketching for Java
 
Вениамин Гвоздиков: Особенности использования DTrace
Вениамин Гвоздиков: Особенности использования DTrace Вениамин Гвоздиков: Особенности использования DTrace
Вениамин Гвоздиков: Особенности использования DTrace
 
Hash Endereçamento Quadrático - Bean
Hash Endereçamento Quadrático - BeanHash Endereçamento Quadrático - Bean
Hash Endereçamento Quadrático - Bean
 
VHdl lab report
VHdl lab reportVHdl lab report
VHdl lab report
 
Triton and Symbolic execution on GDB@DEF CON China
Triton and Symbolic execution on GDB@DEF CON ChinaTriton and Symbolic execution on GDB@DEF CON China
Triton and Symbolic execution on GDB@DEF CON China
 
Story of static code analyzer development
Story of static code analyzer developmentStory of static code analyzer development
Story of static code analyzer development
 
Mutation @ Spotify
Mutation @ Spotify Mutation @ Spotify
Mutation @ Spotify
 
Pattern Matching in Java 14
Pattern Matching in Java 14Pattern Matching in Java 14
Pattern Matching in Java 14
 
Digital system design practical file
Digital system design practical fileDigital system design practical file
Digital system design practical file
 

Viewers also liked

How to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentHow to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environment
BlueData, Inc.
 
Carpe Datum: Building Big Data Analytical Applications with HP Haven
Carpe Datum: Building Big Data Analytical Applications with HP HavenCarpe Datum: Building Big Data Analytical Applications with HP Haven
Carpe Datum: Building Big Data Analytical Applications with HP Haven
DataWorks Summit
 
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
DataWorks Summit
 
50 Shades of SQL
50 Shades of SQL50 Shades of SQL
50 Shades of SQL
DataWorks Summit
 
Big Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNBig Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeN
DataWorks Summit
 
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
DataWorks Summit
 
Karta an ETL Framework to process high volume datasets
Karta an ETL Framework to process high volume datasets Karta an ETL Framework to process high volume datasets
Karta an ETL Framework to process high volume datasets
DataWorks Summit
 
Inspiring Travel at Airbnb [WIP]
Inspiring Travel at Airbnb [WIP]Inspiring Travel at Airbnb [WIP]
Inspiring Travel at Airbnb [WIP]
DataWorks Summit
 
Realistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure DevelopmentRealistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure Development
DataWorks Summit
 
HBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQLHBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQL
DataWorks Summit
 
Hadoop for Genomics__HadoopSummit2010
Hadoop for Genomics__HadoopSummit2010Hadoop for Genomics__HadoopSummit2010
Hadoop for Genomics__HadoopSummit2010
Yahoo Developer Network
 
Hadoop in Validated Environment - Data Governance Initiative
Hadoop in Validated Environment - Data Governance InitiativeHadoop in Validated Environment - Data Governance Initiative
Hadoop in Validated Environment - Data Governance Initiative
DataWorks Summit
 
Practical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopPractical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on Hadoop
DataWorks Summit
 
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
DataWorks Summit
 
One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)
DataWorks Summit
 
Running Spark and MapReduce together in Production
Running Spark and MapReduce together in ProductionRunning Spark and MapReduce together in Production
Running Spark and MapReduce together in Production
DataWorks Summit
 
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARNDeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DataWorks Summit
 
NoSQL Needs SomeSQL
NoSQL Needs SomeSQLNoSQL Needs SomeSQL
NoSQL Needs SomeSQL
DataWorks Summit
 
Open Source SQL for Hadoop: Where are we and Where are we Going?
Open Source SQL for Hadoop: Where are we and Where are we Going?Open Source SQL for Hadoop: Where are we and Where are we Going?
Open Source SQL for Hadoop: Where are we and Where are we Going?
DataWorks Summit
 
Spark Application Development Made Easy
Spark Application Development Made EasySpark Application Development Made Easy
Spark Application Development Made Easy
DataWorks Summit
 

Viewers also liked (20)

How to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environmentHow to deploy Apache Spark in a multi-tenant, on-premises environment
How to deploy Apache Spark in a multi-tenant, on-premises environment
 
Carpe Datum: Building Big Data Analytical Applications with HP Haven
Carpe Datum: Building Big Data Analytical Applications with HP HavenCarpe Datum: Building Big Data Analytical Applications with HP Haven
Carpe Datum: Building Big Data Analytical Applications with HP Haven
 
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
Can you Re-Platform your Teradata, Oracle, Netezza and SQL Server Analytic Wo...
 
50 Shades of SQL
50 Shades of SQL50 Shades of SQL
50 Shades of SQL
 
Big Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeNBig Data Simplified - Is all about Ab'strakSHeN
Big Data Simplified - Is all about Ab'strakSHeN
 
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
The Most Valuable Customer on Earth-1298: Comic Book Analysis with Oracel's B...
 
Karta an ETL Framework to process high volume datasets
Karta an ETL Framework to process high volume datasets Karta an ETL Framework to process high volume datasets
Karta an ETL Framework to process high volume datasets
 
Inspiring Travel at Airbnb [WIP]
Inspiring Travel at Airbnb [WIP]Inspiring Travel at Airbnb [WIP]
Inspiring Travel at Airbnb [WIP]
 
Realistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure DevelopmentRealistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure Development
 
HBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQLHBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQL
 
Hadoop for Genomics__HadoopSummit2010
Hadoop for Genomics__HadoopSummit2010Hadoop for Genomics__HadoopSummit2010
Hadoop for Genomics__HadoopSummit2010
 
Hadoop in Validated Environment - Data Governance Initiative
Hadoop in Validated Environment - Data Governance InitiativeHadoop in Validated Environment - Data Governance Initiative
Hadoop in Validated Environment - Data Governance Initiative
 
Practical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopPractical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on Hadoop
 
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
Coexistence and Migration of Vendor HPC based infrastructure to Hadoop Ecosys...
 
One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)One Click Hadoop Clusters - Anywhere (Using Docker)
One Click Hadoop Clusters - Anywhere (Using Docker)
 
Running Spark and MapReduce together in Production
Running Spark and MapReduce together in ProductionRunning Spark and MapReduce together in Production
Running Spark and MapReduce together in Production
 
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARNDeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
 
NoSQL Needs SomeSQL
NoSQL Needs SomeSQLNoSQL Needs SomeSQL
NoSQL Needs SomeSQL
 
Open Source SQL for Hadoop: Where are we and Where are we Going?
Open Source SQL for Hadoop: Where are we and Where are we Going?Open Source SQL for Hadoop: Where are we and Where are we Going?
Open Source SQL for Hadoop: Where are we and Where are we Going?
 
Spark Application Development Made Easy
Spark Application Development Made EasySpark Application Development Made Easy
Spark Application Development Made Easy
 

Similar to Online Approximate OLAP in SparkSQL

Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Data Con LA
 
Delta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the HoodDelta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the Hood
Databricks
 
A Brief Introduction of TiDB (Percona Live)
A Brief Introduction of TiDB (Percona Live)A Brief Introduction of TiDB (Percona Live)
A Brief Introduction of TiDB (Percona Live)
PingCAP
 
3.ASSEMBLERS.pptx
3.ASSEMBLERS.pptx3.ASSEMBLERS.pptx
3.ASSEMBLERS.pptx
GaganaP13
 
design-compiler.pdf
design-compiler.pdfdesign-compiler.pdf
design-compiler.pdf
FrangoCamila
 
BlinkDB - Approximate Queries on Very Large Data
BlinkDB - Approximate Queries on Very Large DataBlinkDB - Approximate Queries on Very Large Data
BlinkDB - Approximate Queries on Very Large Data
Knoldus Inc.
 
Open Source SQL databases enter millions queries per second era
Open Source SQL databases enter millions queries per second eraOpen Source SQL databases enter millions queries per second era
Open Source SQL databases enter millions queries per second era
Alexander Korotkov
 
Xbfs HPDC'2019
Xbfs HPDC'2019Xbfs HPDC'2019
Xbfs HPDC'2019
Anil Gaihre
 
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdfHailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
cookie1969
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
Databricks
 
Open Source SQL databases enters millions queries per second era
Open Source SQL databases enters millions queries per second eraOpen Source SQL databases enters millions queries per second era
Open Source SQL databases enters millions queries per second era
Sveta Smirnova
 
Applied Econometrics assignment3
Applied Econometrics assignment3Applied Econometrics assignment3
Applied Econometrics assignment3Chenguang Li
 
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo InterlandiLazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Databricks
 
GCC
GCCGCC
Debugging Ruby
Debugging RubyDebugging Ruby
Debugging Ruby
Aman Gupta
 
Data Acquisition
Data AcquisitionData Acquisition
Data Acquisition
azhar557
 
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data ArtisansApache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
Evention
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
Kostas Tzoumas
 
Spark Streaming Tips for Devs and Ops
Spark Streaming Tips for Devs and OpsSpark Streaming Tips for Devs and Ops
Spark Streaming Tips for Devs and Ops
Francisco Pérez Paradas
 
Spark Streaming Tips for Devs and Ops by Fran perez y federico fernández
Spark Streaming Tips for Devs and Ops by Fran perez y federico fernándezSpark Streaming Tips for Devs and Ops by Fran perez y federico fernández
Spark Streaming Tips for Devs and Ops by Fran perez y federico fernández
J On The Beach
 

Similar to Online Approximate OLAP in SparkSQL (20)

Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Data Provenance Support in...
 
Delta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the HoodDelta Lake Streaming: Under the Hood
Delta Lake Streaming: Under the Hood
 
A Brief Introduction of TiDB (Percona Live)
A Brief Introduction of TiDB (Percona Live)A Brief Introduction of TiDB (Percona Live)
A Brief Introduction of TiDB (Percona Live)
 
3.ASSEMBLERS.pptx
3.ASSEMBLERS.pptx3.ASSEMBLERS.pptx
3.ASSEMBLERS.pptx
 
design-compiler.pdf
design-compiler.pdfdesign-compiler.pdf
design-compiler.pdf
 
BlinkDB - Approximate Queries on Very Large Data
BlinkDB - Approximate Queries on Very Large DataBlinkDB - Approximate Queries on Very Large Data
BlinkDB - Approximate Queries on Very Large Data
 
Open Source SQL databases enter millions queries per second era
Open Source SQL databases enter millions queries per second eraOpen Source SQL databases enter millions queries per second era
Open Source SQL databases enter millions queries per second era
 
Xbfs HPDC'2019
Xbfs HPDC'2019Xbfs HPDC'2019
Xbfs HPDC'2019
 
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdfHailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Open Source SQL databases enters millions queries per second era
Open Source SQL databases enters millions queries per second eraOpen Source SQL databases enters millions queries per second era
Open Source SQL databases enters millions queries per second era
 
Applied Econometrics assignment3
Applied Econometrics assignment3Applied Econometrics assignment3
Applied Econometrics assignment3
 
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo InterlandiLazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi
 
GCC
GCCGCC
GCC
 
Debugging Ruby
Debugging RubyDebugging Ruby
Debugging Ruby
 
Data Acquisition
Data AcquisitionData Acquisition
Data Acquisition
 
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data ArtisansApache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data Artisans
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Spark Streaming Tips for Devs and Ops
Spark Streaming Tips for Devs and OpsSpark Streaming Tips for Devs and Ops
Spark Streaming Tips for Devs and Ops
 
Spark Streaming Tips for Devs and Ops by Fran perez y federico fernández
Spark Streaming Tips for Devs and Ops by Fran perez y federico fernándezSpark Streaming Tips for Devs and Ops by Fran perez y federico fernández
Spark Streaming Tips for Devs and Ops by Fran perez y federico fernández
 

More from DataWorks Summit

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
g2nightmarescribd
 

Recently uploaded (20)

Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 

Online Approximate OLAP in SparkSQL

  • 1. Online Approximate OLAP in SparkSQL Kai Zengand Sameer Agarwal Hadoop Summit | San Jose, CA | June 9th 2015
  • 2. Hard Disks (dailyreports) ½ - 1 Hour 1 - 5 Minutes 1 second ? Memory (hourly reports) 100 TB on 1000 machines Need for Quick & Continuous Feedback (exploratoryqueries)
  • 3. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Online Aggregation1 What is the average latency in the table? 34.6667 1Joseph M. Hellerstein, Peter J. Haas, Helen J. Wang. Online Aggregation. In SIGMOD 1997.
  • 4. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Online Aggregation What is the average latency in the table? 35
  • 5. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Online Aggregation What is the average latency in the table? 33.83 35
  • 6. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Online Aggregation What is the average latency in the table? 33.83 34.6667 35
  • 7. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Online Aggregation What is the average latency in the table? 33.83 ± 1.3 34.6667 ± 0.0 35 ± 2.1
  • 10. SparkSQL Interface val dataFrame =   sqlCtx.sql(“select  avg(latency)  from  log”) //  batch  processing val result  =  dataFrame.collect()  //  34.6667
  • 11. SparkSQL + G-OLA + BlinkDB val dataFrame =   sqlCtx.sql(“select  avg(latency)  from  log”) //  online  processing val onlineDataFrame =  dataFrame.online onlineDataFrame.collectNext() //  35  ± 2.1 onlineDataFrame.collectNext() //  33.83  ± 1.3
  • 12. val dataFrame =   sqlCtx.sql(“select  avg(latency)  from  log”) //  online  processing val onlineDataFrame =  dataFrame.online while (onlineDataFrame.hasNext())  { onlineDataFrame.collectNext() } SparkSQL + G-OLA + BlinkDB
  • 13. val dataFrame =   sqlCtx.sql(“select  avg(latency)  from  log”) //  online  processing val onlineDataFrame =  dataFrame.online while (onlineDataFrame.hasNext()  && responseTime <=  10.seconds)  { onlineDataFrame.collectNext() } SparkSQL + G-OLA + BlinkDB
  • 14. val dataFrame =   sqlCtx.sql(“select  avg(latency)  from  log”) //  online  processing val onlineDataFrame =  dataFrame.online while (onlineDataFrame.hasNext()  && errorBound >=  0.01)  { onlineDataFrame.collectNext() } SparkSQL + G-OLA + BlinkDB
  • 15. val dataFrame =   sqlCtx.sql(“select  avg(latency)  from  log”) //  online  processing val onlineDataFrame =  dataFrame.online while (onlineDataFrame.hasNext()  && userEvent.cancelled())  { onlineDataFrame.collectNext() } SparkSQL + G-OLA + BlinkDB
  • 16. val dataFrame =   sqlCtx.sql(“select  avg(latency)  from  log”) //  online  processing val onlineDataFrame =  dataFrame.online while (onlineDataFrame.hasNext()  && userEvent.cancelled())  { onlineDataFrame.collectNext() } SparkSQL + G-OLA + BlinkDB
  • 17. val dataFrame =   sqlCtx.sql(“select  avg(latency)  from  log”) //  online  processing val onlineDataFrame =  dataFrame.online while (onlineDataFrame.hasNext()  && userEvent.cancelled())  { onlineDataFrame.collectNext() } AGGREGATES/  UDAFs JOINS/GROUP  BYs NESTED  QUERIES SparkSQL + G-OLA + BlinkDB
  • 19. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Part 1: Generalized Online Aggregation (G-OLA) What is the average latency in the table? 33.83 ± 1.3 34.6667 ± 0.0 35 ± 2.1
  • 20. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 What is the average latency in the table? 33.83 ± 1.3 34.6667 ± 0.0 35 ± 2.1 Part 2: Error Estimation in BlinkDB
  • 21. Key Idea: Introduce Delta Maintenance as a First Class Citizen in Query Execution Part 1: Generalized Online Aggregation (G-OLA)
  • 22. Query Execution: Under The Hood SQL Query Query Plan AGGR SCAN AVG(latency) SELECT  avg(latency) FROM  log log
  • 23. Query Execution: Under The Hood SQL Query Query Plan FILTER JOIN AGGR SCAN SCAN AGGR latency > AVG(latency) AVG(latency) AVG(latency)SELECT  avg(latency) FROM  log WHERE  latency  > ( SELECT   avg(latency) FROM  log ) log log
  • 24. Query Execution: Under The Hood log
  • 25. Query Execution: Under The Hood log
  • 26. Query Execution: Under The Hood collect() log Query
  • 27. Query Execution: Under The Hood collect() collectNext()
  • 28. Query Execution: Under The Hood collect() collectNext() Delta Input 1 Delta Input: New data + Previous Intermediate State
  • 29. Query Execution: Under The Hood Delta Input 2 Delta Input: New data + Previous Intermediate State
  • 31. Intermediate State ?SELECT  avg(latency) FROM  log WHERE  latency  > ( SELECT  avg(latency) FROM  log )
  • 32. Intermediate State SELECT  avg(latency) FROM  log WHERE  latency  > ( SELECT  avg(latency) FROM  log )
  • 33. Intermediate State SELECT  avg(latency) FROM  log WHERE  latency  > ( SELECT  avg(latency) FROM  log ) G-OLA: Generalized Online Aggregation for Interactive Analysis on Big Data. In SIGMOD 2015
  • 34. Intermediate State SELECT  avg(latency) FROM  log WHERE  latency  > ( SELECT  avg(latency) FROM  log ) G-OLA: Generalized Online Aggregation for Interactive Analysis on Big Data. In SIGMOD 2015
  • 35. 35 ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 SELECT  avg(latency) FROM  log WHERE  latency  >  SELECT  avg(latency) FROM  log Intermediate State
  • 36. 35 ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 SELECT  avg(latency) FROM  log WHERE  latency  >  SELECT  avg(latency) FROM  log Intermediate State 33.83
  • 37. 35 ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 SELECT  avg(latency) FROM  log WHERE  latency  >  SELECT  avg(latency) FROM  log Intermediate State 33.83
  • 38. 35 ± 1.5 ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 SELECT  avg(latency) FROM  log WHERE  latency  >  SELECT  avg(latency) FROM  log Intermediate State 33.83
  • 39. 35 ± 1.5 ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 SELECT  avg(latency) FROM  log WHERE  latency  >  SELECT  avg(latency) FROM  log Intermediate State 33.83
  • 40. 35 ± 1.5 Uncertain Tuples ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 SELECT  avg(latency) FROM  log WHERE  latency  >  SELECT  avg(latency) FROM  log Intermediate State 33.83
  • 41. FILTER JOIN AGGR SCAN SCAN AGGR latency > AVG(latency) AVG(latency) AVG(latency) FILTER JOIN AGGR SCAN SCAN AGGR latency > AVG(latency) AVG(latency) AVG(latency) Uncertain Tuples: 𝑇", 𝑇$ Running sum/count of: 𝑇%, 𝑇&   Running sum/count of: 𝑇(, 𝑇%, 𝑇", 𝑇$, 𝑇&     Intermediate State: Intermediate State
  • 42. Goal – Minimize Intermediate State Each delta-update query • Optimized using Catalyst in Spark SQL – Capable of using all existing optimizations • Compiled into a distributed Spark Job Online Execution Model
  • 43. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 What is the average latency in the table? 33.83 ± 1.3 34.6667 ± 0.0 35 ± 2.1 Part 2: Error Estimation in BlinkDB
  • 44. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Error Estimation on a Sample of Data What is the average latency in the table? 34.6667
  • 45. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Error Estimation on a Sample of Data What is the average latency in the table? 34.5 34.6667
  • 46. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Error Estimation on a Sample of Data What is the average latency in the table? 34.5 ± 2.0 34.6667
  • 47. Focused on estimating aggregate errors given representative samples Central LimitTheorem (CLT) Error Estimation using Bootstrap HOE:ASTAT63,BIL: WILEY86, CGL:ASTAT83,PH:IBM96 EFRON:JAS82, EFRON:JAS87,VP:TPMS80, FGK:IJCAI99, ET:CH93 47 Error Estimation on a Sample of Data d predicate for the query) The following results are (asymptotically in sample size) true, but not di- rectly useful, since they depend on unknown properties of the underlying dis- tribution. In all cases we just plug in the sample values. For example, instead of µ we use 1 n Pn i=1 Xi where Xi is the ith sample value. Note that for estimators other than sum and count, I assume no filtering (p = 1). Filtering will increase variance a bit, or potentially a lot for extremely selective queries (p = 0). I can compute the filtering-adjusted values if you like. 1. Count: N(np, n(1 p)p) 2. Sum: N(npµ, np( 2 + (1 p)µ2 )) 3. Mean: N(µ, 2 /n) 4. Variance: N( 2 , (µ4 4 )/n) 5. Stddev: N( , (µ4 4 )/(4 2 n)) Sampling! … Resampling! D S100 S1 S θ(S1 ) θ(S100 )θ(S) 95% confidence interval! …
  • 48. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Error Estimation using Bootstrap What is the average latency in the table?
  • 49. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Error Estimation using Bootstrap ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 SLC 34
  • 50. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Error Estimation using Bootstrap ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 SLC 34 θ1 = 34
  • 51. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Error Estimation using Bootstrap ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 SLC 34 ID City Latency 1 NYC 30 2 NYC 30 3 SLC 34 4 LA 36 θ1 = 34 θ2 = 32.5
  • 52. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Error Estimation using Bootstrap ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 SLC 34 ID City Latency 1 NYC 30 2 NYC 30 3 SLC 34 4 LA 36 ID City Latency 1 SLC 34 2 LA 36 3 SLC 34 4 LA 36 ... θ1 = 34 ...θ2 = 32.5 θ100 = 35
  • 53. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Error Estimation using Bootstrap ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 SLC 34 ID City Latency 1 NYC 30 2 NYC 30 3 SLC 34 4 LA 36 ID City Latency 1 SLC 34 2 LA 36 3 SLC 34 4 LA 36 ... θ1 = 34 ... 34.5 ± 2 θ2 = 32.5 θ100 = 35
  • 54. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Error Estimation using Bootstrap ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 SLC 34 ID City Latency 1 NYC 30 2 NYC 30 3 SLC 34 4 LA 36 ID City Latency 1 SLC 34 2 LA 36 3 SLC 34 4 LA 36 ... θ1 = 34 ... 34.5 ± 2 θ2 = 32.5 θ100 = 35
  • 55. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 Error Estimation using Bootstrap ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 SLC 34 5 SLC 37 ID City Latency 1 SLC 37 2 NYC 30 3 SLC 34 4 LA 36 5 NYC 30 ID City Latency 1 SLC 34 2 SLC 37 3 SLC 34 4 LA 36 5 LA 36 ... θ1 = 34.6 ... 35 ± 1.6 θ2 = 33.4 θ100 = 35.4
  • 56. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 What is the average latency in the table? ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 SLC 34 5 SLC 37 Error Estimation in BlinkDB Leverage Poissonized Resampling to generate samples with replacement Knowing When You’re Wrong: Building Fast and Reliable Approximate Query Processing Systems. In SIGMOD 2014.
  • 57. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 What is the average latency in the table? ID City Latency #1 1 NYC 30 2 2 NYC 38 1 3 SLC 34 0 4 SLC 34 1 5 SLC 37 1 Sample from a Poisson (1) Distribution θ1 = 33.8 Error Estimation in BlinkDB
  • 58. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 What is the average latency in the table? ID City Latency #1 1 NYC 30 2 2 NYC 38 1 3 SLC 34 0 4 SLC 34 1 5 SLC 37 1 6 SF 28 2 Incremental Error Estimation Error Estimation in BlinkDB
  • 59. ID City Latency 1 NYC 30 2 NYC 38 3 SLC 34 4 LA 36 5 SLC 37 6 SF 28 7 NYC 32 8 NYC 38 9 LA 36 10 SF 35 11 NYC 38 12 LA 34 What is the average latency in the table? ID City Latency #1 #2 1 NYC 30 2 1 2 NYC 38 1 0 3 SLC 34 0 2 4 SLC 34 1 2 5 SLC 37 1 0 6 SF 28 2 1 Construct all Resamples in a Single Pass Error Estimation in BlinkDB
  • 60. Bootstrap Performance Compute  Sum(latency) rSumi =  rSumi +  latency   *  #i Virtual  call  to  Add.eval() Virtual  call  to  rSum.eval() Return  boxed  Double Virtual  call  to  Multiply.eval() Virtual  call  to  latency.eval() Return  boxed  Double Virtual  call  to  #.eval() …… Add rSum Multiply latency #
  • 61. Using Scala Quasiquotes/ Runtime Reflection def generateCode(e:   Expression):   Tree =  e  match { case Attribute(ordinal)   => q"inputRow.getInt($ordinal)" case Add(left,   right)   => q""" { val leftResult =  ${generateCode(left)} val rightResult =  ${generateCode(right)} leftResult +  rightResult } """ }
  • 62. Code Generation Bootstrap Compute  Sum(latency) rSumi =  rSumi +  latency   *  #i Consolidate  all  bootstrap  in  a  tight  for  loop val left0  =  inputRow.getDouble(j)   //  latency for  (i from  0  until  n) val right0  =  inputRow.getInt(k+i)   //  #i val left1  =  inputRow.getDouble(0+i)   //  rSumi val right1  =  left0*right0 val result1   =  left1  +  right1 resultRow.setDouble(0,   result1) 62 2-5x Faster!
  • 63. Conclusion 1. Online Approximation is an important means to achieve interactivityin processing large datasets 2. New SparkSQL Libraries: - G-OLA for Continuous Answers - BlinkDB for Continuous Error Bars
  • 64. Check out our code! 1. Early Adopter Preview: http://github.com/amplab/bootstrap- sql. Send us an email to kaizeng@cs.berkeley.eduand sameer@databricks.comto get access! 2. Spark Package coming soon 3. GradualNative SparkSQL Integration in 1.5, 1.6 and beyond