Big datasets are growing exponentially, but the need for quick, interactive responses to queries remains as important as ever. This talk gives an overview of the components in BlinkDB used to incrementally process massive amounts of data on clusters of tens, hundreds, or thousands of machines while returning approximate answers. More precisely, this new execution model enables SparkSQL to present the user with meaningful approximate results (with error bars) that are continuously refined and updated, at a speed comfortable to the user, while it crunches larger and larger fractions of the whole dataset in the background. This not only alleviates the need to pre-process the data for a wide range of queries, but also lets users observe the progress of a query and control its execution on the fly, enabling a smooth time/accuracy trade-off.
1. Approximate Queries on Very Large Data
UC Berkeley
Sameer Agarwal
Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar,
Michael Jordan, Samuel Madden, Ion Stoica
MIT
2. About Me
1. Software Engineer at Databricks in San Francisco, CA
2. PhD in Databases from University of California, Berkeley, 2014.
3. Actively work as part of the open source community and the AMPLab to create BDAS (Berkeley Data Analytics Stack), which comprises Apache Spark, Tachyon, BlinkDB, Mesos, etc.
3. Query Execution on Samples
10TB on 100 machines
Hard Disks: ½ - 1 Hour
Memory: 1 - 5 Minutes
?: 1 second
4. Query Execution on Samples
ID  City      Buff Ratio
1   NYC       0.78
2   NYC       0.13
3   Berkeley  0.25
4   NYC       0.19
5   NYC       0.11
6   Berkeley  0.09
7   NYC       0.18
8   NYC       0.15
9   Berkeley  0.13
10  Berkeley  0.49
11  NYC       0.19
12  Berkeley  0.10
What is the average buffering ratio in the table?
0.2325
5. Query Execution on Samples
What is the average buffering ratio in the table?
Uniform Sample
ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/4
6   Berkeley  0.25        1/4
8   NYC       0.19        1/4
Estimate from sample: 0.19
Exact answer: 0.2325
6. Query Execution on Samples
What is the average buffering ratio in the table?
Uniform Sample
ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/4
6   Berkeley  0.25        1/4
8   NYC       0.19        1/4
Estimate from sample: 0.19 +/- 0.05
Exact answer: 0.2325
7. Query Execution on Samples
What is the average buffering ratio in the table?
Uniform Sample
ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/2
3   Berkeley  0.25        1/2
5   NYC       0.19        1/2
6   Berkeley  0.09        1/2
8   NYC       0.18        1/2
12  Berkeley  0.49        1/2
Estimate from sample (rate 1/2): 0.22 +/- 0.02
Earlier estimate (rate 1/4): 0.19 +/- 0.05
Exact answer: 0.2325
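The idea on these slides can be sketched in a few lines of Python. This is a minimal illustration, not BlinkDB code: the helper name `estimate_mean`, the 95% z-value, and the random seed are all assumptions, and the normal approximation is crude for samples this small, so the printed error bars will not necessarily match the ones on the slides.

```python
import math
import random
import statistics

# Buffering ratios for the 12 rows of the example table (IDs 1..12).
BUFF_RATIO = [0.78, 0.13, 0.25, 0.19, 0.11, 0.09,
              0.18, 0.15, 0.13, 0.49, 0.19, 0.10]


def estimate_mean(sample, z=1.96):
    """Return (mean, half_width) of an approximate 95% interval.

    Hypothetical helper: uses the normal approximation
    mean +/- z * s / sqrt(n) and ignores the finite-population
    correction, so it is only a rough illustration.
    """
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean, z * std_err


exact = statistics.fmean(BUFF_RATIO)

rng = random.Random(0)  # fixed seed so the run is repeatable
for k in (3, 6):        # sampling rates 1/4 and 1/2 of the 12 rows
    sample = rng.sample(BUFF_RATIO, k)
    mean, half_width = estimate_mean(sample)
    print(f"n={k}: {mean:.3f} +/- {half_width:.3f} (exact {exact:.4f})")
```

The larger sample will usually, though not always, give a tighter interval around the exact answer.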
9. Sampling Vs. No Sampling
[Figure: query response time (seconds) vs. fraction of full data]
Fraction of full data   Query Response Time (Seconds)
1                       1020
10^-1                   103
10^-2                   18
10^-3                   13
10^-4                   10
10^-5                   8
Annotation (1020 to 103 seconds): 10x, as response time is dominated by I/O
10. Sampling Vs. No Sampling
[Figure: query response time (seconds) vs. fraction of full data, with error bars]
Fraction of full data   Query Response Time (Seconds)   Error Bars
1                       1020                            -
10^-1                   103                             (0.02%)
10^-2                   18                              (0.07%)
10^-3                   13                              (1.1%)
10^-4                   10                              (3.4%)
10^-5                   8                               (11%)
11. Sampling Error
Typically, error depends on sample size (n)
and not on original data size, i.e., error is
proportional to (1/sqrt(n))*
12. Sampling Error
Typically, error depends on sample size (n)
and not on original data size, i.e., error is
proportional to (1/sqrt(n))*
* Conditions Apply
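A small simulation can make the 1/sqrt(n) claim concrete. The sketch below is illustrative only: the uniformly distributed population, the sample sizes, and the seed are assumptions chosen for the demo, not anything from the talk.

```python
import math
import random
import statistics


def standard_error(values):
    """Estimated standard error of the sample mean: s / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


rng = random.Random(42)  # fixed seed for repeatability

# Hypothetical population of 200,000 values drawn uniformly from [0, 1).
population = [rng.random() for _ in range(200_000)]

errors = {}
for n in (1_000, 4_000, 16_000):
    sample = rng.sample(population, n)
    errors[n] = standard_error(sample)
    print(f"n={n:>6}: standard error ~ {errors[n]:.5f}")

# Quadrupling the sample size should roughly halve the error,
# regardless of how large the population is.
ratio = errors[1_000] / errors[4_000]
print(f"error(1000) / error(4000) ~ {ratio:.2f}")
```

The usual conditions behind this scaling include independent draws and finite variance; heavy-tailed data or very rare groups can break it.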
21. What is BlinkDB?
A framework built on Spark that …
- creates and maintains a variety of uniform and
stratified samples from underlying data
- returns fast, approximate answers with error bars
by executing queries on samples of data
- verifies the correctness of the error bars that it
returns at runtime
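To illustrate the difference between the two kinds of samples in the first bullet, here is a minimal sketch using the example table from the earlier slides. The function names, the per-group cap, and the seed are hypothetical, and this is not how BlinkDB itself builds samples; it only shows why a stratified sample keeps every group represented.

```python
import random
from collections import defaultdict

# (ID, City, Buff Ratio) rows copied from the example table.
ROWS = [
    (1, "NYC", 0.78), (2, "NYC", 0.13), (3, "Berkeley", 0.25),
    (4, "NYC", 0.19), (5, "NYC", 0.11), (6, "Berkeley", 0.09),
    (7, "NYC", 0.18), (8, "NYC", 0.15), (9, "Berkeley", 0.13),
    (10, "Berkeley", 0.49), (11, "NYC", 0.19), (12, "Berkeley", 0.10),
]


def uniform_sample(rows, k, rng):
    """Pick k rows uniformly at random, without replacement."""
    return rng.sample(rows, k)


def stratified_sample(rows, key, cap, rng):
    """Keep at most `cap` rows per distinct key value (stratum).

    Each kept row is paired with its sampling rate so that later
    aggregates can be re-weighted.
    """
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group in strata.values():
        kept = rng.sample(group, min(cap, len(group)))
        rate = len(kept) / len(group)
        sample.extend((row, rate) for row in kept)
    return sample


rng = random.Random(7)  # fixed seed for repeatability
by_city = stratified_sample(ROWS, key=lambda r: r[1], cap=2, rng=rng)
for row, rate in by_city:
    print(row, f"rate={rate:.2f}")
```

A uniform sample of the same size could miss a small group entirely, whereas the stratified sample always contains rows for every city.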
37. Error Estimation
Closed Form Aggregate Functions
- Central Limit Theorem
- Applicable to AVG, COUNT, SUM,
VARIANCE and STDEV
38. Error Estimation
Closed Form Aggregate Functions
The following results are (asymptotically in sample [...] directly useful, since they depend on unknown properties [...] distribution. In all cases we just plug in the sample values. [...] of $\mu$ we use $\frac{1}{n}\sum_{i=1}^{n} X_i$, where $X_i$ is the $i$th sample value.
Note that for estimators other than sum and count, [...] ($p = 1$). Filtering will increase variance a bit, or potentially [...] selective queries ($p = 0$). I can compute the filtering-adjusted [...]
1. Count: $\mathcal{N}(np,\ n(1-p)p)$
2. Sum: $\mathcal{N}(np\mu,\ np(\sigma^2 + (1-p)\mu^2))$
3. Mean: $\mathcal{N}(\mu,\ \sigma^2/n)$
4. Variance: $\mathcal{N}(\sigma^2,\ (\mu_4 - \sigma^4)/n)$
5. Stddev: $\mathcal{N}(\sigma,\ (\mu_4 - \sigma^4)/(4\sigma^2 n))$
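The closed forms for count, sum, and mean can be turned into error bars by plugging in sample values, as the slide describes. The following is a sketch under stated assumptions: the function name `closed_form_errors`, its signature, and the 95% z-value are hypothetical, results are on the sample's scale (not scaled up to the full table), and the mean uses the number of matching rows, which equals n when there is no filter (p = 1).

```python
import math
import statistics


def closed_form_errors(matched, n, z=1.96):
    """CLT-based estimates from a sample of n rows.

    `matched` holds the values of the sampled rows that passed the
    query's filter, so p is estimated as len(matched) / n. Unknown
    population quantities (p, mu, sigma^2) are replaced by their
    sample values. Returns {aggregate: (estimate, half_width)}.
    """
    if not matched:
        raise ValueError("no sampled rows matched the filter")
    m = len(matched)
    p = m / n
    mu = statistics.fmean(matched)
    var = statistics.pvariance(matched, mu)  # plug-in sigma^2

    count_var = n * (1 - p) * p                  # Count: n(1-p)p
    sum_var = n * p * (var + (1 - p) * mu ** 2)  # Sum: np(s^2 + (1-p)mu^2)
    mean_var = var / m                           # Mean: s^2 / n when p = 1

    return {
        "count": (n * p, z * math.sqrt(count_var)),
        "sum": (n * p * mu, z * math.sqrt(sum_var)),
        "mean": (mu, z * math.sqrt(mean_var)),
    }


result = closed_form_errors([0.13, 0.25, 0.19], n=3)
for name, (estimate, half_width) in result.items():
    print(f"{name}: {estimate:.3f} +/- {half_width:.3f}")
```

Variance and standard deviation follow the same pattern but need the fourth central moment, which is omitted here to keep the sketch short.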
42. What is BlinkDB?
A framework built on Spark that …
- creates and maintains a variety of random and
stratified samples from underlying data
- returns fast, approximate answers with error bars
by executing queries on samples of data
- verifies the correctness of the error bars that it
returns at runtime
45. What is BlinkDB?
A framework built on Spark that …
- creates and maintains a variety of uniform and
stratified samples from underlying data
- returns fast, approximate answers with error bars
by executing queries on samples of data
- verifies the correctness of the error bars that it
returns at runtime
[Offline Process]
46. What is BlinkDB?
A framework built on Spark that …
- creates and maintains a variety of uniform and
stratified samples from underlying data
- returns fast, approximate answers with error bars
by executing queries on samples of data
- verifies the correctness of the error bars that it
returns at runtime
[Online Process]
49. BlinkDB Architecture
HiveQL/SQL Query:
SELECT foo (*)
FROM TABLE
WITHIN 2
[Diagram components: HiveQL/SQL Query, Query Plan, Sample Selection, Sampling Module, TABLE, Original Data, In-Memory Samples, On-Disk Samples]
50. BlinkDB Architecture
HiveQL/SQL Query:
SELECT foo (*)
FROM TABLE
WITHIN 2
[Diagram components: HiveQL/SQL Query, Query Plan, Sample Selection, Sampling Module, TABLE, Original Data, In-Memory Samples, On-Disk Samples]
Sample Selection: online sample selection to pick best sample(s) based on query latency and accuracy requirements
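One way to picture online sample selection is as a lookup over a catalog of pre-built samples. The sketch below is purely illustrative and is not BlinkDB's algorithm: the `Sample` class, both selection functions, the assumed scan throughput, and the pilot-run numbers are hypothetical, and it assumes the query's `WITHIN 2` bound is a latency limit in seconds.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    name: str  # hypothetical label for a pre-built sample
    rows: int  # number of rows in the sample


def pick_by_time(samples, time_bound_s, rows_per_second):
    """Largest sample whose estimated scan time fits the time bound."""
    fitting = [s for s in samples
               if s.rows / rows_per_second <= time_bound_s]
    return max(fitting, key=lambda s: s.rows, default=None)


def pick_by_error(samples, error_bound, pilot_rows, pilot_error):
    """Smallest sample predicted to meet the error bound.

    Extrapolates the error seen on a small pilot run using the
    rule of thumb that error scales as 1/sqrt(n).
    """
    def predicted(s):
        return pilot_error * math.sqrt(pilot_rows / s.rows)

    fitting = [s for s in samples if predicted(s) <= error_bound]
    return min(fitting, key=lambda s: s.rows, default=None)


catalog = [
    Sample("s_1e4", 10_000),
    Sample("s_1e6", 1_000_000),
    Sample("s_1e8", 100_000_000),
]

# Assumed throughput of 5 million rows per second, for the demo only.
fastest_ok = pick_by_time(catalog, time_bound_s=2.0,
                          rows_per_second=5_000_000)
accurate_ok = pick_by_error(catalog, error_bound=0.01,
                            pilot_rows=10_000, pilot_error=0.05)
print(fastest_ok, accurate_ok)
```

A time bound pushes toward the largest sample that still finishes in time, while an error bound pushes toward the smallest sample that is accurate enough.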
57. BlinkDB Prototype
1. Alpha 0.2.0 released and available at http://blinkdb.org
2. Allows you to create samples on native tables and materialized
views
3. Adds approximate aggregate functions with statistical closed
forms to HiveQL
4. Compatible with Apache Hive, Spark and Facebook’s Presto
(storage, serdes, UDFs, types, metadata)