BlinkDB: Querying Petabytes of Data in Seconds using Sampling

  • Now, with Shark and Spark really pushing the limits of in-memory computation, a natural question comes to mind: "Can we do better than in-memory?" Being better than in-memory could mean one or both of two things:
  • If we focus on this error, it has an amazing statistical property: the error decreases with Moore's Law, halving every 36 months!
  • Original data 2TB-2PB
  • Transcript

    • 1. Querying Petabytes of Data in Seconds using Sampling UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael Jordan, Samuel Madden, Ion Stoica MIT
    • 2. Can we do better than in-memory?
    • 3. Can we get more with less?
    • 4. Can fast get faster?
    • 5. Query Execution on Samples: 10 TB on 100 machines. Hard Disks: ½-1 Hour; Memory: 1-5 Minutes; Samples: 1 second?
    • 6. ID City Buff Ratio 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 Query Execution on Samples What is the average buffering ratio in the table? 0.2325
    • 7. ID City Buff Ratio 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 Query Execution on Samples What is the average buffering ratio in the table? ID City Buff Ratio Sampling Rate 2 NYC 0.13 1/4 6 Berkeley 0.25 1/4 8 NYC 0.19 1/4 Uniform Sample 0.19 0.2325
    • 8. ID City Buff Ratio 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 Query Execution on Samples What is the average buffering ratio in the table? ID City Buff Ratio Sampling Rate 2 NYC 0.13 1/4 6 Berkeley 0.25 1/4 8 NYC 0.19 1/4 Uniform Sample 0.19 +/- 0.05 0.2325
    • 9. ID City Buff Ratio 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 Query Execution on Samples What is the average buffering ratio in the table? ID City Buff Ratio Sampling Rate 2 NYC 0.13 1/2 3 Berkeley 0.25 1/2 5 NYC 0.19 1/2 6 Berkeley 0.09 1/2 8 NYC 0.18 1/2 12 Berkeley 0.49 1/2 Uniform Sample 0.22 +/- 0.02 0.2325 0.19 +/- 0.05
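The arithmetic on the toy table above can be checked with a few lines of Python: the exact average over all 12 buffering ratios is 0.2325, and the ¼ uniform sample shown on the earlier slide yields the estimate 0.19.

```python
# Full table of buffering ratios from the slide (IDs 1-12).
ratios = [0.78, 0.13, 0.25, 0.19, 0.11, 0.09,
          0.18, 0.15, 0.13, 0.49, 0.19, 0.10]

exact = sum(ratios) / len(ratios)
print(exact)  # 0.2325 -- the answer on the slide

# The 1/4 uniform sample shown on the slide (values as printed there).
sample = [0.13, 0.25, 0.19]
estimate = sum(sample) / len(sample)
print(round(estimate, 2))  # 0.19 -- the approximate answer
```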
    • 10. Speed/Accuracy Trade-off: Error vs. Execution Time (Sample Size). Executing on the entire dataset takes 30 mins; interactive queries run in 2 sec.
    • 11. Sampling vs. No Sampling: query response time (seconds) against the fraction of full data used (from 1 down to 10^-5). Sampling gives a ~10x speedup, as response time is dominated by I/O.
    • 12. Sampling vs. No Sampling: the same plot with error bars, which grow from 0.02% to 11% as the sample fraction shrinks from 10^-1 to 10^-5.
    • 13. Sampling Error Typically, error depends on sample size (n) and not on original data size, i.e., error is proportional to (1/sqrt(n))*
    • 14. Sampling Error Typically, error depends on sample size (n) and not on original data size, i.e., error is proportional to (1/sqrt(n))* * Conditions Apply
    • 15. Sampling Error Typically, error depends on sample size (n) and not on original data size, i.e., error is proportional to (1/sqrt(n))* * Conditions Apply
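The 1/sqrt(n) behaviour above is easy to demonstrate empirically. The sketch below is illustrative only (the population and sample sizes are invented): quadrupling the sample size roughly halves the estimation error, regardless of how large the population is.

```python
import random
import statistics

random.seed(0)
population = [random.random() for _ in range(100_000)]

def estimation_error(n, trials=200):
    """Empirical std. dev. of the sample-mean estimator at sample size n."""
    means = [statistics.fmean(random.sample(population, n))
             for _ in range(trials)]
    return statistics.pstdev(means)

e_small = estimation_error(100)
e_large = estimation_error(400)
# 4x the sample size -> roughly half the error (1/sqrt(n) scaling).
print(e_small / e_large)
```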
    • 16. Speed/Accuracy Trade-off SELECT avg(sessionTime) FROM Table WHERE city=‘San Francisco’ WITHIN 1 SECONDS 234.23 ± 15.32
    • 17. Speed/Accuracy Trade-off SELECT avg(sessionTime) FROM Table WHERE city=‘San Francisco’ WITHIN 2 SECONDS 234.23 ± 15.32 239.46 ± 4.96
    • 18. Speed/Accuracy Trade-off SELECT avg(sessionTime) FROM Table WHERE city=‘San Francisco’ WITHIN 1 SECONDS AVG, COUNT, SUM, STDEV, PERCENTILE etc.
    • 19. Speed/Accuracy Trade-off SELECT avg(sessionTime) FROM Table WHERE city=‘San Francisco’ WITHIN 1 SECONDS FILTERS, GROUP BY clauses
    • 20. Speed/Accuracy Trade-off SELECT avg(sessionTime) FROM Table WHERE city=‘San Francisco’ LEFT OUTER JOIN logs2 ON very_big_log.id = logs.id WITHIN 1 SECONDS JOINS, Nested Queries etc.
    • 21. Speed/Accuracy Trade-off SELECT my_function(sessionTime) FROM Table WHERE city=‘San Francisco’ LEFT OUTER JOIN logs2 ON very_big_log.id = logs.id WITHIN 1 SECONDS ML Primitives, User Defined Functions
    • 22. Speed/Accuracy Trade-off SELECT avg(sessionTime) FROM Table WHERE city=‘San Francisco’ ERROR 0.1 CONFIDENCE 95.0%
    • 23. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime
    • 24. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime
    • 25. Uniform Samples 2 4 1 3
    • 26. Uniform Samples 2 4 1 3 U
    • 27. Uniform Samples 2 4 1 3 U ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10
    • 28. Uniform Samples 2 4 1 3 U 1. FILTER rand() < 1/3 2. Adds per-row weights 3. In-memory Shuffle ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10
    • 29. Uniform Samples 2 4 1 3 U ID City Data Weight 2 NYC 0.13 1/3 8 NYC 0.25 1/3 6 Berkeley 0.09 1/3 11 NYC 0.19 1/3 ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 Doesn’t change Spark RDD Semantics
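The construction on the slides above (FILTER rand() < p, then attach the sampling rate as a per-row weight) can be sketched in plain Python. This is an illustration of the idea, not BlinkDB's Spark implementation; the rows and p are made up. The stored weight is what lets an aggregate like SUM be estimated by scaling each sampled row by 1/weight.

```python
import random

random.seed(1)
rows = [{"id": i, "val": random.random()} for i in range(1, 10_001)]

# Uniform sample: keep each row with probability p = 1/3 and record
# the sampling rate as the row's weight.
p = 1 / 3
sample = [dict(r, weight=p) for r in rows if random.random() < p]

# SUM is estimated by scaling each sampled value by 1/weight.
approx_sum = sum(r["val"] / r["weight"] for r in sample)
exact_sum = sum(r["val"] for r in rows)
print(abs(approx_sum - exact_sum) / exact_sum)  # small relative error
```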
    • 30. Stratified Samples 2 4 1 3
    • 31. Stratified Samples 2 4 1 3 S
    • 32. Stratified Samples 2 4 1 3 S ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10
    • 33. Stratified Samples 2 4 1 3 ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 S1 ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 SPLIT
    • 34. Stratified Samples 2 4 1 3 S1 ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 S2 City Count NYC 7 Berkeley 5 GROUP
    • 35. Stratified Samples 2 4 1 3 S1 ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 S2 City Count Ratio NYC 7 2/7 Berkeley 5 2/5 GROUP
    • 36. Stratified Samples 2 4 1 3 S1 ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 S2 City Count Ratio NYC 7 2/7 Berkeley 5 2/5 S2 JOIN
    • 37. Stratified Samples 2 4 1 3 S1 S2 S2 U ID City Data Weight 2 NYC 0.13 2/7 8 NYC 0.25 2/7 6 Berkeley 0.09 2/5 12 Berkeley 0.49 2/5 Doesn’t change Shark RDD Semantics
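The stratified construction above (GROUP by the column, cap each group, record the per-group ratio as the weight) can be sketched as follows. The helper `stratified_sample` is hypothetical, not a BlinkDB API; with the slide's 7 NYC and 5 Berkeley rows and a cap of 2, it reproduces the weights 2/7 and 2/5 shown on slide 35.

```python
import random
from collections import defaultdict

random.seed(2)
cities = ["NYC"] * 7 + ["Berkeley"] * 5
rows = [{"city": c, "val": random.random()} for c in cities]

def stratified_sample(rows, key, k):
    """Cap each group at k rows; weight = rows kept / group size."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r)
    out = []
    for members in groups.values():
        chosen = random.sample(members, min(k, len(members)))
        ratio = len(chosen) / len(members)
        out.extend(dict(r, weight=ratio) for r in chosen)
    return out

s = stratified_sample(rows, "city", k=2)
# Every city is represented, even rare ones -- unlike a uniform sample.
print(sorted({r["city"] for r in s}))  # ['Berkeley', 'NYC']
```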
    • 38. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime
    • 39. Error Estimation Closed Form Aggregate Functions - Central Limit Theorem - Applicable to AVG, COUNT, SUM, VARIANCE and STDEV
    • 40. Error Estimation Closed Form Aggregate Functions - Central Limit Theorem - Applicable to AVG, COUNT, SUM, VARIANCE and STDEV
    • 41. Error Estimation Closed Form Aggregate Functions - Central Limit Theorem - Applicable to AVG, COUNT, SUM, VARIANCE and STDEV A 1 2 Sample AVG SUM COUNT STDEV VARIANCE A 1 2 Sample A ±ε A
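For the closed-form aggregates above, the error bar is the textbook Central Limit Theorem interval: mean ± 1.96 · s/sqrt(n) at 95% confidence. A minimal sketch, with made-up session times standing in for the sample:

```python
import math
import random
import statistics

random.seed(3)
# Hypothetical sample of session times (true mean 200, stdev 40).
sample = [random.gauss(200, 40) for _ in range(400)]

# 95% error bar for AVG via the CLT: mean +/- 1.96 * s / sqrt(n).
n = len(sample)
mean = statistics.fmean(sample)
half_width = 1.96 * statistics.stdev(sample) / math.sqrt(n)
print(f"{mean:.2f} +/- {half_width:.2f}")
```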
    • 42. Error Estimation Generalized Aggregate Functions - Statistical Bootstrap - Applicable to complex and nested queries, UDFs, joins etc.
    • 43. Error Estimation Generalized Aggregate Functions - Statistical Bootstrap - Applicable to complex and nested queries, UDFs, joins etc. Sample A Sample A A1 A2 A100 … … B ±ε
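For aggregates with no closed-form error formula, the statistical bootstrap recomputes the query on ~100 resamples drawn with replacement and reads the error off their spread. A sketch, with `my_function` standing in for an arbitrary UDF (here the median, chosen for illustration):

```python
import random
import statistics

random.seed(4)
sample = [random.expovariate(1.0) for _ in range(500)]

def my_function(xs):
    """Stand-in for an arbitrary UDF with no closed-form error formula."""
    return statistics.median(xs)

# Bootstrap: run the aggregate on 100 with-replacement resamples and
# use the spread of the results as the error estimate.
boot = [my_function(random.choices(sample, k=len(sample)))
        for _ in range(100)]
estimate = my_function(sample)
error = 1.96 * statistics.pstdev(boot)
print(f"{estimate:.3f} +/- {error:.3f}")
```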
    • 44. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of random and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime
    • 45. Error Verification: error vs. sample size. More data → higher accuracy; 300 data points → 97% accuracy. [KDD’13] [SIGMOD’14]
    • 46. Single Pass Execution Sample A Approximate Query on a Sample
    • 47. Single Pass Execution Sample A State Age Metric Weight CA 20 1971 1/4 CA 22 2819 1/4 MA 22 3819 1/4 MA 30 3091 1/4
    • 48. Single Pass Execution Sample R A State Age Metric Weight CA 20 1971 1/4 CA 22 2819 1/4 MA 22 3819 1/4 MA 30 3091 1/4 Resampling Operator
    • 49. Single Pass Execution Sample A R Sample “Pushdown” State Age Metric Weight CA 20 1971 1/4 CA 22 2819 1/4 MA 22 3819 1/4 MA 30 3091 1/4
    • 50. Single Pass Execution Sample A R Metric Weight 1971 1/4 3819 1/4 Sample “Pushdown”
    • 51. Single Pass Execution Sample A R Metric Weight 1971 1/4 3819 1/4 Resampling Operator
    • 52. Single Pass Execution Sample A R Metric Weight 1971 1/4 3819 1/4 Metric Weight 1971 1/4 1971 1/4 Metric Weight 3819 1/4 3819 1/4 Resampling Operator
    • 53. Single Pass Execution Sample A R Metric Weight 1971 1/4 3819 1/4 Metric Weight 1971 1/4 1971 1/4 Metric Weight 3819 1/4 3819 1/4 A A1 An …
    • 54. Single Pass Execution Sample A R Metric Weight 1971 1/4 3819 1/4 Metric Weight 1971 1/4 1971 1/4 Metric Weight 3819 1/4 3819 1/4 A A1 An …
    • 55. Sample A Metric Weight 1971 1/4 3819 1/4 Leverage Poissonized Resampling to generate samples with replacement Single Pass Execution
    • 56. Sample A Metric Weight S1 1971 1/4 2 3819 1/4 1 Sample from a Poisson(1) Distribution Single Pass Execution A1
    • 57. Sample A Metric Weight S1 Sk 1971 1/4 2 1 3819 1/4 1 0 Construct all Resamples in a Single Pass Single Pass Execution A1 Ak
    • 58. S S1 S100 … Da1 Da100 … Db1 Db100 … Dc1 Dc100 … SAMPLE BOOTSTRAP WEIGHTS DIAGNOSTICS WEIGHTS Single Pass Execution Sample A
    • 59. S S1 S100 … Da1 Da100 … Db1 Db100 … Dc1 Dc100 … SAMPLE BOOTSTRAP WEIGHTS DIAGNOSTICS WEIGHTS Single Pass Execution Sample A Additional Overhead: 200 bytes/row
    • 60. S S1 S100 … Da1 Da100 … Db1 Db100 … Dc1 Dc100 … Single Pass Execution Sample A Embarrassingly Parallel SAMPLE BOOTSTRAP WEIGHTS DIAGNOSTICS WEIGHTS
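The Poissonized-resampling trick in the slides above can be sketched in one pass: each row contributes a Poisson(1) count to each of the K resamples, which approximates sampling with replacement without materialising K copies of the sample. Illustrative sketch only (the sample, K, and the aggregate are invented; the Poisson sampler uses Knuth's method since the stdlib has none):

```python
import random
import statistics

random.seed(5)
sample = [random.random() for _ in range(1_000)]
K = 100  # number of bootstrap resamples

def poisson1():
    """Draw from Poisson(lambda=1) via Knuth's method."""
    L, k, p = 2.718281828459045 ** -1, 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

sums = [0.0] * K
counts = [0] * K
for x in sample:              # single pass over the sample
    for j in range(K):
        c = poisson1()        # this row's multiplicity in resample j
        sums[j] += c * x
        counts[j] += c

means = [s / n for s, n in zip(sums, counts)]
error = 1.96 * statistics.pstdev(means)  # bootstrap error bar for AVG
print(round(error, 4))
```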
    • 61. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of random and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime
    • 62. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime [Offline Process]
    • 63. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime [Online Process]
    • 64. TABLE Sampling Module Original Data Offline sampling: creates an optimal set of samples on native tables and materialized views based on query history. BlinkDB Architecture
    • 65. TABLE Sampling Module In-Memory Samples On-Disk Samples Original Data Sample Placement: samples striped over 100s or 1,000s of machines, both on disk and in-memory. BlinkDB Architecture
    • 66. SELECT foo (*) FROM TABLE WITHIN 2 Query Plan HiveQL/SQL Query Sample Selection TABLE Sampling Module In-Memory Samples On-Disk Samples Original Data BlinkDB Architecture 66
    • 67. SELECT foo (*) FROM TABLE WITHIN 2 Query Plan HiveQL/SQL Query Sample Selection TABLE Sampling Module In-Memory Samples On-Disk Samples Original Data Online sample selection to pick best sample(s) based on query latency and accuracy requirements BlinkDB Architecture 67
    • 68. TABLE Sampling Module In-Memory Samples On-Disk Samples Original Data Hive/Shark/Presto SELECT foo (*) FROM TABLE WITHIN 2 New Query Plan HiveQL/SQL Query Sample Selection Error Bars & Confidence Intervals Result 182.23 ± 5.56 (95% confidence) Parallel query execution on multiple samples striped across machines. BlinkDB Architecture
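The online sample-selection step above can be approximated with the CLT: a target error bound implies a minimum sample size, n ≈ (z·σ/ε)², and the planner can pick the smallest stored sample at least that large. A hedged sketch, not BlinkDB's actual planner; the stored sample sizes and σ are invented for illustration:

```python
import math

def required_sample_size(target_error, sigma, confidence_z=1.96):
    """CLT-based rows needed so the error bar is +/- target_error."""
    return math.ceil((confidence_z * sigma / target_error) ** 2)

stored_samples = [10_000, 100_000, 1_000_000]  # hypothetical sizes

def pick_sample(target_error, sigma):
    """Pick the smallest stored sample meeting the error requirement."""
    n = required_sample_size(target_error, sigma)
    return next((s for s in stored_samples if s >= n), None)

n_req = required_sample_size(0.1, 15.32)  # rows needed for +/- 0.1
print(n_req)
print(pick_sample(0.1, 15.32))
```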
    • 69. BlinkDB is Fast! - 5 Queries, 5 machines - 20 GB samples (0.001%-1% of original data) - 1-5% Error
    • 70. Overall Query Execution: response time (s) of query execution alone.
    • 71. Overall Query Execution: response time (s) including error-estimation overhead.
    • 72. Overall Query Execution: response time (s) including error-verification overhead.
    • 73. Coming Soon: Native Spark Integration
    • 74. BlinkDB Prototype 1. Alpha 0.2.0 released and available at http://blinkdb.org 2. Allows you to create samples on native tables and materialized views 3. Adds approximate aggregate functions with statistical closed forms to HiveQL 4. Compatible with Apache Hive, Spark and Facebook’s Presto (storage, serdes, UDFs, types, metadata)
    • 75. An Open Question We still haven’t figured out the right user interface for approximate queries: - Time/Error Bounds? - Continuous Error Bars? - Hide Errors Altogether? - UI/UX Specific? - Application Specific? - …
    • 76. http://blinkdb.org Native Spark Integration Coming Soon!