BlinkDB and G-OLA: Supporting Continuous Answers with Error Bars in SparkSQL (Sameer Agarwal and Kai Zeng, Databricks and AMPLab, UC Berkeley)


Presentation at Spark Summit 2015

Published in: Data & Analytics


  1. BlinkDB and G-OLA: Supporting Approximate Answers in SparkSQL. Sameer Agarwal and Kai Zeng. Spark Summit | San Francisco, CA | June 15th 2015
  2. About Us. (1) Sameer Agarwal: Software Engineer at Databricks; PhD in Databases (UC Berkeley); research on Approximate Query Processing (BlinkDB). (2) Kai Zeng: Post-doc in AMP Lab / Intern at Databricks; PhD in Databases (UCLA); research on Approximate Query Processing (ABM).
  3. 100 TB on 1000 machines. Hard Disks: ½ - 1 Hour; Memory: 1 - 5 Minutes; ?: 1 second. Continuous Query Execution on Samples of Data
  4. Continuous Query Execution on Samples. What is the average latency in the table? 34.6667

       ID  City  Latency
        1  NYC   30
        2  NYC   38
        3  SLC   34
        4  LA    36
        5  SLC   37
        6  SF    28
        7  NYC   32
        8  NYC   38
        9  LA    36
       10  SF    35
       11  NYC   38
       12  LA    34

  5. Continuous Query Execution on Samples (same table as slide 4). What is the average latency in the table? 35
  6. Continuous Query Execution on Samples (same table as slide 4). What is the average latency in the table? 35 ± 2.1
  7. Continuous Query Execution on Samples (same table as slide 4). What is the average latency in the table? 35 ± 2.1; 33.83 ± 1.3
  8. Continuous Query Execution on Samples (same table as slide 4). What is the average latency in the table? 35 ± 2.1; 33.83 ± 1.3; 34.6667 ± 0.0
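The point estimates on slides 4 to 8 can be checked by hand. The sketch below assumes that the successive answers come from the first 5, the first 6, and then all 12 rows of the table; the slides do not say which rows were sampled, but the arithmetic agrees with that reading. The object and function names are illustrative only.

```scala
object PrefixAverages {
  // Latency column of the 12-row example table from slide 4.
  val latency: Vector[Double] =
    Vector(30, 38, 34, 36, 37, 28, 32, 38, 36, 35, 38, 34).map(_.toDouble)

  // Average over the first n rows, i.e. the answer after seeing a sample of size n.
  def meanOfFirst(n: Int): Double = latency.take(n).sum / n

  def main(args: Array[String]): Unit =
    for (n <- Seq(5, 6, 12))
      println(f"n = $n%2d  avg = ${meanOfFirst(n)}%.4f")
}
```

The three averages are 35, 33.83 and 34.6667 (rounded), which match the point estimates on the slides. The error bars (± 2.1, ± 1.3) are not reproduced here.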
  9. Demo
  10. Continuous Query Execution on Samples. Query: SELECT foo(*) FROM TABLE. Answer: A ± ε. Layers: Error Estimation; Query Execution; Data Storage.
  11. Continuous Query Execution on Samples. Query: SELECT foo(*) FROM TABLE. Answer: A ± ε. Layers: Error Estimation; Query Execution; Data Storage. Added label: G-OLA.
  12. Interface

        val dataFrame =
          sqlCtx.sql("select avg(latency) from log")
        // batch processing
        val result = dataFrame.collect() // 34.6667

  13. Interface

        val dataFrame =
          sqlCtx.sql("select avg(latency) from log")
        // online processing
        val onlineDataFrame = dataFrame.online
        onlineDataFrame.collectNext() // 35 ± 2.1
        onlineDataFrame.collectNext() // 33.83 ± 1.3

  14. Interface (same setup as slide 13): refine until the data is exhausted.

        while (onlineDataFrame.hasNext()) {
          onlineDataFrame.collectNext()
        }

  15. Interface (same setup as slide 13): stop on a response-time budget.

        while (onlineDataFrame.hasNext() && responseTime <= 10.seconds) {
          onlineDataFrame.collectNext()
        }

  16. Interface (same setup as slide 13): stop once the error bound is small enough.

        while (onlineDataFrame.hasNext() && errorBound >= 0.01) {
          onlineDataFrame.collectNext()
        }

  17. Interface (same setup as slide 13): stop when the user cancels. The slide writes the condition as userEvent.cancelled(); it is negated here so that the loop runs until cancellation.

        while (onlineDataFrame.hasNext() && !userEvent.cancelled()) {
          onlineDataFrame.collectNext()
        }

  18. Interface (same code as slide 17). Supported query classes: AGGREGATES / UDAFs; JOINS / GROUP BYs; NESTED QUERIES.
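The loops on slides 14 to 17 all share one shape: keep calling collectNext() while more data is available and a stopping rule has not fired. The sketch below mimics that shape in plain Scala, without Spark. `OnlineAverage`, `Estimate` and `runUntil` are hypothetical stand-ins and not the G-OLA API, and the CLT-style 95% half-width is an assumed error metric chosen only to keep the example self-contained.

```scala
object OnlineLoopSketch {
  // One progressively refined answer: a point estimate and its error bound.
  final case class Estimate(value: Double, errorBound: Double)

  // Hypothetical stand-in for the slides' online DataFrame: each collectNext()
  // call consumes one mini-batch and refines a running average.
  final class OnlineAverage(batches: Iterator[Seq[Double]]) {
    private var n = 0L
    private var sum = 0.0
    private var sumSq = 0.0

    def hasNext: Boolean = batches.hasNext

    def collectNext(): Estimate = {
      for (x <- batches.next()) { n += 1; sum += x; sumSq += x * x }
      val mean = sum / n
      // Plug-in variance and an assumed CLT-style 95% half-width.
      val variance = math.max(0.0, sumSq / n - mean * mean)
      Estimate(mean, 1.96 * math.sqrt(variance / n))
    }
  }

  // Refine until the data runs out or the error bound drops below the target.
  // Assumes at least one non-empty batch is available.
  def runUntil(online: OnlineAverage, targetError: Double): Estimate = {
    var latest = online.collectNext()
    while (online.hasNext && latest.errorBound >= targetError)
      latest = online.collectNext()
    latest
  }

  def main(args: Array[String]): Unit = {
    val latency = Seq(30, 38, 34, 36, 37, 28, 32, 38, 36, 35, 38, 34).map(_.toDouble)
    // Three mini-batches of four rows each.
    val result = runUntil(new OnlineAverage(latency.grouped(4)), targetError = 1.0)
    println(f"${result.value}%.4f ± ${result.errorBound}%.2f")
  }
}
```

A response-time budget or a cancellation flag would slot into the same `while` condition in place of the error-bound check.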
  19. Continuous Query Execution on Samples. Query: SELECT foo(*) FROM TABLE. Answer: A ± ε. Layers: Error Estimation; Query Execution; Data Storage.
  20. Continuous Query Execution on Samples. Query: SELECT foo(*) FROM TABLE. Answer: A ± ε. Layers: Query Interface; Error Estimation; Query Execution; Data Storage. References:
        - Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, Ion Stoica. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. In ACM EuroSys 2013.
        - Ariel Kleiner, Ameet Talwalkar, Sameer Agarwal, Ion Stoica, Michael Jordan. A General Bootstrap Performance Diagnostic. In ACM KDD 2013.
        - Sameer Agarwal, Henry Milner, Ariel Kleiner, Ameet Talwalkar, Michael Jordan, Samuel Madden, Barzan Mozafari, Ion Stoica. Knowing When You're Wrong: Building Fast and Reliable Approximate Query Processing Systems. In ACM SIGMOD 2014.
  21. Error Estimation on a Sample of Data. Focused on estimating aggregate errors given representative samples.

      Central Limit Theorem (CLT): HOE:ASTAT63, BIL:WILEY86, CGL:ASTAT83, PH:IBM96
      Error Estimation using Bootstrap: EFRON:JAS82, EFRON:JAS87, VP:TPMS80, FGK:IJCAI99, ET:CH93

      CLT excerpt (its beginning is cut off on the slide): "... d predicate for the query) The following results are (asymptotically in sample size) true, but not directly useful, since they depend on unknown properties of the underlying distribution. In all cases we just plug in the sample values. For example, instead of $\mu$ we use $\frac{1}{n}\sum_{i=1}^{n} X_i$ where $X_i$ is the $i$th sample value. Note that for estimators other than sum and count, I assume no filtering ($p = 1$). Filtering will increase variance a bit, or potentially a lot for extremely selective queries ($p = 0$). I can compute the filtering-adjusted values if you like."

        1. Count: $N(np,\ n(1-p)p)$
        2. Sum: $N(np\mu,\ np(\sigma^2 + (1-p)\mu^2))$
        3. Mean: $N(\mu,\ \sigma^2/n)$
        4. Variance: $N(\sigma^2,\ (\mu_4 - \sigma^4)/n)$
        5. Stddev: $N(\sigma,\ (\mu_4 - \sigma^4)/(4\sigma^2 n))$

      Bootstrap diagram: sampling draws S from the data D; resampling draws S1, ..., S100 from S; the estimates θ(S), θ(S1), ..., θ(S100) give a 95% confidence interval.
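As a rough illustration of the plug-in idea on slide 21, the sketch below evaluates the Count, Sum and Mean approximations with sample moments in place of the unknown $\mu$ and $\sigma^2$. All names are illustrative, the variance uses the simple 1/n plug-in form, and the 95% half-width of 1.96 standard deviations is an assumption. It is not meant to reproduce the error bars printed on the earlier slides.

```scala
object CltErrorBars {
  // Normal approximation N(center, variance) of an estimator's sampling distribution.
  final case class Normal(center: Double, variance: Double) {
    // Half-width of a two-sided 95% interval under the normal approximation.
    def halfWidth95: Double = 1.96 * math.sqrt(variance)
  }

  // Plug-in sample moments (1/n form).
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size
  def variance(xs: Seq[Double]): Double = {
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / xs.size
  }

  // Mean: N(mu, sigma^2 / n), assuming no filtering (p = 1).
  def meanEstimate(xs: Seq[Double]): Normal =
    Normal(mean(xs), variance(xs) / xs.size)

  // Count: N(np, n(1 - p)p), where p is the fraction of the n sampled rows
  // that pass the query's filter.
  def countEstimate(n: Int, p: Double): Normal =
    Normal(n * p, n * (1 - p) * p)

  // Sum: N(n p mu, n p (sigma^2 + (1 - p) mu^2)), where `passing` holds the
  // values of the rows that passed the filter out of n sampled rows.
  def sumEstimate(passing: Seq[Double], n: Int): Normal = {
    val p = passing.size.toDouble / n
    val mu = mean(passing)
    Normal(n * p * mu, n * p * (variance(passing) + (1 - p) * mu * mu))
  }

  def main(args: Array[String]): Unit = {
    val est = meanEstimate(Seq(30.0, 38.0, 34.0, 36.0, 37.0))
    println(f"${est.center}%.2f ± ${est.halfWidth95}%.2f")
  }
}
```

With no filtering (p = 1) the Sum variance reduces to $n\sigma^2$, as expected for a sum of n independent values.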
  22. Error Estimation using Bootstrap (12-row latency table from slide 4). What is the average latency in the table?
  23. Error Estimation using Bootstrap (12-row latency table from slide 4). Resamples of 4 rows:

        Resample 1:   NYC 30, NYC 38, SLC 34, SLC 34    θ1 = 34
        Resample 2:   NYC 30, NYC 30, SLC 34, LA 36     θ2 = 32.5
        ...
        Resample 100: SLC 34, LA 36, SLC 34, LA 36      θ100 = 35

      Answer: 34.5 ± 2
  24. Error Estimation using Bootstrap. Identical to slide 23.
  25. Error Estimation using Bootstrap (12-row latency table from slide 4). Resamples of 5 rows:

        Resample 1:   NYC 30, NYC 38, SLC 34, SLC 34, SLC 37    θ1 = 34.6
        Resample 2:   SLC 37, NYC 30, SLC 34, LA 36, NYC 30     θ2 = 33.4
        ...
        Resample 100: SLC 34, SLC 37, SLC 34, LA 36, LA 36      θ100 = 35.4

      Answer: 35 ± 1.6
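The bootstrap procedure on slides 22 to 25 can be sketched as follows: draw many resamples with replacement from the current sample, compute the average θ of each, and read an error bar off the spread of the θ values. The percentile-based half-width, the seed and all names below are assumptions for illustration. The slides do not state how ± 2 and ± 1.6 were derived, so no attempt is made to reproduce them.

```scala
import scala.util.Random

object BootstrapSketch {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // One resample of the same size as `sample`, drawn with replacement.
  def resample(sample: IndexedSeq[Double], rng: Random): IndexedSeq[Double] =
    IndexedSeq.fill(sample.size)(sample(rng.nextInt(sample.size)))

  // Returns (estimate, halfWidth). The half-width is half the distance between
  // the 2.5th and 97.5th percentiles of the resampled means (an assumed choice).
  def bootstrapMean(sample: IndexedSeq[Double], resamples: Int, rng: Random): (Double, Double) = {
    val thetas = Array.fill(resamples)(mean(resample(sample, rng)))
    java.util.Arrays.sort(thetas)
    val lo = thetas((0.025 * (resamples - 1)).round.toInt)
    val hi = thetas((0.975 * (resamples - 1)).round.toInt)
    (mean(sample), (hi - lo) / 2)
  }

  def main(args: Array[String]): Unit = {
    // First five rows of the example table; 100 resamples as on the slides.
    val (estimate, halfWidth) =
      bootstrapMean(Vector(30.0, 38.0, 34.0, 36.0, 37.0), 100, new Random(42))
    println(f"$estimate%.2f ± $halfWidth%.2f")
  }
}
```

Because each resample is a full copy-sized draw, this naive version touches every sampled row once per resample, which is the cost the Poissonized variant on the following slides avoids.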
  26. Error Estimation in BlinkDB (12-row latency table from slide 4). What is the average latency in the table? Leverage Poissonized Resampling to generate samples with replacement. Sample shown:

        ID  City  Latency
         1  NYC   30
         2  NYC   38
         3  SLC   34
         4  SLC   34
         5  SLC   37

  27. Error Estimation in BlinkDB. Sample from a Poisson(1) distribution: column #1 holds each row's count in resample 1.

        ID  City  Latency  #1
         1  NYC   30       2
         2  NYC   38       1
         3  SLC   34       0
         4  SLC   34       1
         5  SLC   37       1

      θ1 = 33.8
  28. Error Estimation in BlinkDB. Incremental Error Estimation: a new row only needs its own count.

        ID  City  Latency  #1
         1  NYC   30       2
         2  NYC   38       1
         3  SLC   34       0
         4  SLC   34       1
         5  SLC   37       1
         6  SF    28       2

  29. Error Estimation in BlinkDB. Construct all Resamples in a Single Pass. 0.2-0.5% additional overhead.

        ID  City  Latency  #1  #2
         1  NYC   30       2   1
         2  NYC   38       1   0
         3  SLC   34       0   2
         4  SLC   34       1   2
         5  SLC   37       1   0
         6  SF    28       2   1
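The weighting scheme on slides 27 to 29 can be sketched in plain Scala: every incoming row draws one Poisson(1) count per resample, and each resample keeps only a running weighted sum and total weight, so all resamples are built in a single pass. The Poisson sampler (Knuth's multiplication method), the class layout and all names are assumptions for illustration, not BlinkDB's implementation.

```scala
import scala.util.Random

object PoissonResampling {
  // Knuth's multiplication method for Poisson(lambda); adequate for small lambda.
  def poisson(lambda: Double, rng: Random): Int = {
    val limit = math.exp(-lambda)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) { k += 1; p *= rng.nextDouble() }
    k
  }

  // Weighted average of one resample: each row counts `weight` times.
  def weightedMean(values: Seq[Double], weights: Seq[Int]): Double =
    values.zip(weights).map { case (v, w) => v * w }.sum / weights.sum

  // Running state of b resamples: one weighted sum and one total weight each.
  final class Resamples(b: Int, rng: Random) {
    private val sums = Array.fill(b)(0.0)
    private val weights = Array.fill(b)(0L)

    // Fold one new row into every resample; earlier rows are never revisited.
    def add(value: Double): Unit =
      for (i <- 0 until b) {
        val w = poisson(1.0, rng)
        sums(i) += w * value
        weights(i) += w
      }

    // Current average of each resample, skipping resamples that are still empty.
    def estimates: Seq[Double] =
      (0 until b).collect { case i if weights(i) > 0 => sums(i) / weights(i) }
  }

  def main(args: Array[String]): Unit = {
    // Counts from column #1 on slide 27.
    println(weightedMean(Seq(30.0, 38.0, 34.0, 34.0, 37.0), Seq(2, 1, 0, 1, 1)))
  }
}
```

With the counts from slide 27 the weighted sum is 169 over a total weight of 5, which gives the slide's θ1 = 33.8.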
  30. High Level Take-away: Bootstrap and Poissonized Resampling techniques are the key to achieving quick and continuous error bars for a general set of queries
  31. Continuous Query Execution on Samples. Query: SELECT foo(*) FROM TABLE. Answer: A ± ε. Layers: Query Interface; Error Estimation; Query Execution; Data Storage. Added label: G-OLA. Reference: Kai Zeng, Sameer Agarwal, Ankur Dave, Michael Armbrust and Ion Stoica. G-OLA: Generalized Online Aggregation for Interactive Analysis on Big Data. In SIGMOD 2015.
  32. Query Execution: Under The Hood. Data, Query, Answer ± ε. A: 10 sec
  33. Query Execution: Under The Hood. Data, Query, Answer ± ε. A: 10 sec; A': 10+10 sec
  34. Query Execution: Under The Hood. Data, Query, Answer ± ε. A: 10 sec; A': 10+10 sec; A'': 10+10+10 sec
  35. Query Execution: Under The Hood. Data, Query, Answer ± ε. A: 10 sec; A': 10+10 sec; A'': 10+10+10 sec. Overall Quadratic Cost!
  36. Query Execution: Under The Hood. Data, Query, Answer ± ε. A: 10 sec; A': 10+10 sec
  37. Query Execution: Under The Hood. Identical to slide 36.
  38. Query Execution: Under The Hood. Data, Query, Answer ± ε. A: 10 sec; A': 10+10 sec. A is shown a second time.
  39. Query Execution: Under The Hood. Identical to slide 38.
  40. Query Execution: Under The Hood. Data, Query, Answer ± ε. A: 10 sec; A': 10+10 sec. A is shown a second time, labelled Delta Update Query.
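The "Overall Quadratic Cost!" remark follows from simple arithmetic: if each refinement rescans every batch seen so far, the k-th answer costs k times the per-batch time, and the total grows as k(k+1)/2. The sketch below compares that with processing only the new batch each time. The function names are illustrative, and the 10-second batch time is taken from the slides.

```scala
object RefinementCost {
  // Total time if every refinement rescans all batches seen so far:
  // t + 2t + ... + kt = t * k * (k + 1) / 2.
  def recomputeCost(batches: Int, secondsPerBatch: Int): Int =
    (1 to batches).map(_ * secondsPerBatch).sum

  // Total time if every refinement only processes the newly arrived batch.
  def deltaCost(batches: Int, secondsPerBatch: Int): Int =
    batches * secondsPerBatch

  def main(args: Array[String]): Unit =
    for (k <- Seq(3, 10, 100))
      println(s"k = $k: recompute = ${recomputeCost(k, 10)} s, delta = ${deltaCost(k, 10)} s")
}
```

For the three answers on slide 34 this gives 10 + 20 + 30 = 60 seconds in total, against 30 seconds when each step handles only its own batch.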
  41-46. Delta Update Queries. Data, Query, Answer ± ε. (Six slides with identical text.)
  47. Delta Update: Simple Queries. Query: SELECT avg(latency) FROM log. Plan operators: AVG, SCAN. Answer: A.
  48. Delta Update: Simple Queries. Identical to slide 47.
  49. Delta Update: Simple Queries. Query: SELECT avg(latency) FROM log. Plan operators: AVG, SCAN. Answer: A. A second AVG, SCAN plan producing A is shown alongside.
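For a simple aggregate such as avg(latency), a delta update only needs a small running state. The sketch below assumes that state is the pair (sum, count); the slides do not spell out the representation, and `AvgState` is an illustrative name.

```scala
object DeltaAverage {
  // Sufficient state for AVG: all that a later batch needs from earlier ones.
  final case class AvgState(sum: Double, count: Long) {
    // Fold a newly arrived batch into the state without rescanning old rows.
    def update(batch: Seq[Double]): AvgState =
      AvgState(sum + batch.sum, count + batch.size)

    def value: Double = sum / count
  }

  val empty: AvgState = AvgState(0.0, 0L)

  def main(args: Array[String]): Unit = {
    val afterFive = empty.update(Seq(30.0, 38.0, 34.0, 36.0, 37.0))
    val afterSix = afterFive.update(Seq(28.0))
    println(f"${afterFive.value}%.2f then ${afterSix.value}%.2f")
  }
}
```

Each update costs time proportional to the new batch only, which is what removes the quadratic total cost of recomputing from scratch.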
  50. Delta Update: Nested Queries. Plan operators: FILTER, JOIN, AVG, SCAN, SCAN, AVG. Annotations: A; latency > A; (I). Query:

        SELECT avg(latency)
        FROM log
        WHERE latency > (
          SELECT avg(latency)
          FROM log
        )

  51. Delta Update: Nested Queries (same query and plan as slide 50). Annotations: A; latency > A; A'; A'; (I); (II).
  52. Delta Update: Nested Queries (same query and plan as slide 50). Annotations: A; latency > A; (I).
  53. Delta Update: Nested Queries (same query and plan as slide 50). Annotations: latency > A; A ± ε; (I).
  54. Delta Update: Nested Queries (same query and plan as slide 50). Annotations: latency > A; 10 ± 2; (I).
  55. Delta Update: Nested Queries (same query and plan as slide 50). Annotations: latency > A; 10 ± 2; latency < 8; (I).
  56. Delta Update: Nested Queries (same query and plan as slide 50). Annotations: latency > A; 10 ± 2; latency > 12; (I).
  57. Delta Update: Nested Queries (same query and plan as slide 50). Annotations: latency > A; 10 ± 2; 8 < latency < 12; (I).
  58. Delta Update: Nested Queries (same query and plan as slide 50). Annotations: latency > A; 10 ± 2; 8 < latency < 12; (I); (II).
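One reading of slides 53 to 58 is that the inner average's error bar (10 ± 2 in the example) splits rows into three groups: rows below 8 cannot pass the filter, rows above 12 certainly pass, and rows between 8 and 12 stay undecided until the inner estimate tightens. The sketch below encodes only that classification. It is an interpretation of the slides and not G-OLA's implementation; the names and the treatment of rows exactly on a boundary are assumptions.

```scala
object NestedDeltaSketch {
  sealed trait Decision
  case object Rejected extends Decision  // latency < A - eps: cannot satisfy latency > A
  case object Accepted extends Decision  // latency > A + eps: satisfies latency > A
  case object Uncertain extends Decision // inside the error band: revisit later

  // Classify one row against the inner estimate a ± eps.
  // Rows exactly on a boundary are treated as uncertain (an assumption).
  def classify(latency: Double, a: Double, eps: Double): Decision =
    if (latency < a - eps) Rejected
    else if (latency > a + eps) Accepted
    else Uncertain

  // Split a batch into (settled rows, rows to keep for re-evaluation).
  def partition(batch: Seq[Double], a: Double, eps: Double): (Seq[Double], Seq[Double]) =
    batch.partition(x => classify(x, a, eps) != Uncertain)

  def main(args: Array[String]): Unit =
    println(partition(Seq(7.0, 9.0, 11.0, 13.0), a = 10.0, eps = 2.0))
}
```

Under this reading, only the undecided rows need to be kept and rechecked when a later batch refines the inner average, while settled rows contribute to the outer aggregate once.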
  59. High Level Take-away: Introduce Delta Update Queries as a First Class Citizen in Query Execution
  60. Check out our code! 1. Code Preview: http://github.com/amplab/bootstrap-sql. Send us an email at kaizeng@cs.berkeley.edu and sameer@databricks.com to get access! 2. Spark Package in July '15. 3. Gradual Native SparkSQL Integration in 1.5, 1.6 and beyond
  61. Conclusion. 1. Continuous Query Execution on Samples of Data is an important means to achieve interactivity in processing large datasets. 2. New SparkSQL Libraries: BlinkDB for Continuous Error Bars; G-OLA for Continuous Partial Answers
  62. Thank you. Sameer Agarwal (sameer@databricks.com), Kai Zeng (kaizeng@cs.berkeley.edu)
