BlinkDB: Querying Petabytes of Data in Seconds using Sampling
1. Querying Petabytes of Data in Seconds using Sampling
Sameer Agarwal
Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael Jordan, Samuel Madden, Ion Stoica
UC Berkeley, MIT
5. 10 TB on 100 machines
Hard Disks: ½ - 1 Hour
Memory: 1 - 5 Minutes
Query Execution on Samples: 1 second
6. Query Execution on Samples
ID  City      Buff Ratio
1   NYC       0.78
2   NYC       0.13
3   Berkeley  0.25
4   NYC       0.19
5   NYC       0.11
6   Berkeley  0.09
7   NYC       0.18
8   NYC       0.15
9   Berkeley  0.13
10  Berkeley  0.49
11  NYC       0.19
12  Berkeley  0.10
What is the average buffering ratio in the table?
Exact answer: 0.2325
7. Query Execution on Samples
What is the average buffering ratio in the table?
Uniform Sample:
ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/4
6   Berkeley  0.25        1/4
8   NYC       0.19        1/4
Estimate from sample: 0.19 (exact answer: 0.2325)
8. Query Execution on Samples
What is the average buffering ratio in the table?
Uniform Sample (Sampling Rate 1/4), now with an error bar
Estimate from sample: 0.19 +/- 0.05 (exact answer: 0.2325)
9. Query Execution on Samples
What is the average buffering ratio in the table?
Uniform Sample:
ID  City      Buff Ratio  Sampling Rate
2   NYC       0.13        1/2
3   Berkeley  0.25        1/2
5   NYC       0.19        1/2
6   Berkeley  0.09        1/2
8   NYC       0.18        1/2
12  Berkeley  0.49        1/2
Estimate from sample: 0.22 +/- 0.02 (1/4 sample: 0.19 +/- 0.05; exact answer: 0.2325)
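The point estimates on these slides are plain averages, and the arithmetic can be checked with a short Python sketch. It uses only the Buff Ratio values printed on the slides and does not attempt to reproduce the error bars. The variable names are illustrative.

```python
# Buffering ratios for the 12 rows of the full table, in ID order.
buff_ratio = [0.78, 0.13, 0.25, 0.19, 0.11, 0.09,
              0.18, 0.15, 0.13, 0.49, 0.19, 0.10]

# Buff Ratio values listed in the two uniform samples on the slides.
sample_quarter = [0.13, 0.25, 0.19]                  # sampling rate 1/4
sample_half = [0.13, 0.25, 0.19, 0.09, 0.18, 0.49]   # sampling rate 1/2


def average(values):
    """Plain arithmetic mean."""
    return sum(values) / len(values)


exact = average(buff_ratio)
estimate_quarter = average(sample_quarter)
estimate_half = average(sample_half)

print(f"exact answer:        {exact:.4f}")
print(f"1/4 sample estimate: {estimate_quarter:.2f}")
print(f"1/2 sample estimate: {estimate_half:.2f}")
```

The larger sample lands closer to the exact answer, which is the trend the slides illustrate.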
11. Sampling Vs. No Sampling
[Figure: Query Response Time (Seconds) vs. Fraction of full data]
Fraction of full data:         1     10^-1  10^-2  10^-3  10^-4  10^-5
Query response time (seconds): 1020  103    18     13     10     8
10x speedup from full data to the 10^-1 sample, as response time is dominated by I/O
12. Sampling Vs. No Sampling
[Figure: Query Response Time (Seconds) vs. Fraction of full data, with Error Bars]
Fraction of full data:         1     10^-1    10^-2    10^-3   10^-4   10^-5
Query response time (seconds): 1020  103      18       13      10      8
Error bars:                          (0.02%)  (0.07%)  (1.1%)  (3.4%)  (11%)
13. Sampling Error
Typically, error depends on sample size (n) and not on original data size, i.e., error is proportional to 1/sqrt(n)*
* Conditions Apply
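A small Python sketch of the 1/sqrt(n) relationship, using the textbook standard error of a sample mean, sigma / sqrt(n). It assumes simple random sampling and a column with finite variance. The value of `sigma` is hypothetical.

```python
import math


def standard_error(sigma, n):
    """Standard error of a sample mean: sigma / sqrt(n).

    The size of the original data set does not appear in the formula.
    """
    return sigma / math.sqrt(n)


sigma = 0.2  # hypothetical standard deviation of the column being averaged

for n in (100, 400, 1600, 6400):
    print(f"n = {n:>5}: standard error = {standard_error(sigma, n):.4f}")
```

Each fourfold increase in n halves the error, regardless of how large the underlying table is.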
23. What is BlinkDB?
A framework built on Spark that …
- creates and maintains a variety of uniform and stratified samples from underlying data
- returns fast, approximate answers with error bars by executing queries on samples of data
- verifies the correctness of the error bars that it returns at runtime
39. Error Estimation
Closed Form Aggregate Functions
- Central Limit Theorem
- Applicable to AVG, COUNT, SUM, VARIANCE and STDEV
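For AVG, a closed-form error bar based on the Central Limit Theorem can be sketched as below. This is a minimal illustration and not BlinkDB's code. The sample values, the function name `avg_with_error`, and the choice of z = 1.96 (roughly 95% confidence under a normal approximation) are all assumptions.

```python
import math
import statistics


def avg_with_error(sample, z=1.96):
    """Return (mean, half_width) of a CLT-based confidence interval.

    z = 1.96 is an illustrative choice (about 95% confidence under a
    normal approximation); it is not taken from the slides.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    std = statistics.stdev(sample)        # sample standard deviation (n - 1)
    half_width = z * std / math.sqrt(n)   # z times the standard error
    return mean, half_width


# Hypothetical sample of buffering ratios.
sample = [0.12, 0.20, 0.18, 0.25, 0.15, 0.22, 0.19, 0.17]
mean, eps = avg_with_error(sample)
print(f"AVG = {mean:.3f} +/- {eps:.3f}")
```

With only a handful of rows the normal approximation is rough; it becomes more reliable as the sample grows.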
41. Error Estimation
Closed Form Aggregate Functions
[Figure: a query A using AVG, SUM, COUNT, STDEV or VARIANCE is evaluated on the Sample and returns A ± ε]
43. Error Estimation
Generalized Aggregate Functions
- Statistical Bootstrap
- Applicable to complex and nested queries, UDFs, joins etc.
[Figure: query A is evaluated on resamples of the Sample, giving A1, A2, …, A100, which are combined (B) into an error bar ±ε]
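The bootstrap idea can be sketched in a few lines of Python: resample the sample with replacement, re-run the aggregate on each resample, and use the spread of the results as the error estimate. This is a generic textbook bootstrap under stated assumptions, not BlinkDB's implementation. The function name, the sample values, and the use of the median as the aggregate are illustrative; 100 resamples mirrors the A1 … A100 on the slide.

```python
import random
import statistics


def bootstrap_error(sample, estimator, num_resamples=100, seed=0):
    """Estimate the spread of `estimator` by resampling with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = []
    for _ in range(num_resamples):
        resample = rng.choices(sample, k=n)   # draw n rows with replacement
        estimates.append(estimator(resample))
    # The standard deviation of the resample estimates approximates the error.
    return statistics.stdev(estimates)


# Hypothetical sample; the median has no simple closed-form error bar.
sample = [0.12, 0.20, 0.18, 0.25, 0.15, 0.22, 0.19, 0.17, 0.49, 0.09]
point = statistics.median(sample)
eps = bootstrap_error(sample, statistics.median)
print(f"MEDIAN = {point:.3f} +/- {eps:.3f}")
```

Because the estimator is passed in as a function, the same routine works for any aggregate that can be evaluated on a list of rows.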
57. Single Pass Execution
Construct all Resamples in Single Pass
Metric  Weight  S1  Sk
1971    1/4     2   1
3819    1/4     1   0
[Figure: query A runs over the Sample once and yields A1 … Ak, one result per resample column]
58. Single Pass Execution
[Figure: query A runs over the Sample; each row carries columns S (SAMPLE), S1 … S100 (BOOTSTRAP WEIGHTS), and Da1 … Da100, Db1 … Db100, Dc1 … Dc100, … (DIAGNOSTICS WEIGHTS)]
59. Single Pass Execution
[Figure: Sample with BOOTSTRAP WEIGHTS and DIAGNOSTICS WEIGHTS columns]
Additional Overhead: 200 bytes/row
60. Single Pass Execution
[Figure: Sample with BOOTSTRAP WEIGHTS and DIAGNOSTICS WEIGHTS columns]
Embarrassingly Parallel
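One way to build all resamples in a single pass is to attach k integer multiplicities (the S1 … Sk columns) to each row and accumulate k weighted aggregates while scanning the sample once. The sketch below assumes the multiplicities are drawn independently from Poisson(1), a common approximation to resampling with replacement. The slides do not state the distribution, so treat that as an assumption. It also ignores the Weight column, which cancels out of AVG when every row has the same sampling rate. All names and data values are hypothetical.

```python
import math
import random
import statistics


def poisson1(rng):
    """Draw from Poisson(1) using Knuth's multiplication method."""
    limit = math.exp(-1.0)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def single_pass_resample_avgs(rows, k, seed=0):
    """Compute AVG on k resamples while scanning the rows once."""
    rng = random.Random(seed)
    weighted_sums = [0.0] * k
    weight_totals = [0] * k
    for value in rows:                 # the only pass over the sample
        for j in range(k):
            w = poisson1(rng)          # multiplicity of this row in resample j
            weighted_sums[j] += w * value
            weight_totals[j] += w
    # Skip the (rare) resamples that happened to receive no rows at all.
    return [s / t for s, t in zip(weighted_sums, weight_totals) if t > 0]


# Hypothetical metric values for the sampled rows.
rows = [1971.0, 3819.0, 2500.0, 3100.0, 2750.0, 1800.0, 3300.0, 2200.0]
avgs = single_pass_resample_avgs(rows, k=100)
print(f"AVG = {statistics.fmean(rows):.1f} +/- {statistics.stdev(avgs):.1f}")
```

Each row's contribution depends only on that row and its own weights, so partitions of the sample can be processed independently and their running sums merged afterwards.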
62. What is BlinkDB?
A framework built on Spark that …
- creates and maintains a variety of uniform and stratified samples from underlying data [Offline Process]
- returns fast, approximate answers with error bars by executing queries on samples of data [Online Process]
- verifies the correctness of the error bars that it returns at runtime
66. BlinkDB Architecture
HiveQL/SQL Query: SELECT foo(*) FROM TABLE WITHIN 2
[Figure components: HiveQL/SQL Query, Query Plan, Sample Selection, TABLE, Sampling Module, In-Memory Samples, On-Disk Samples, Original Data]
67. BlinkDB Architecture
HiveQL/SQL Query: SELECT foo(*) FROM TABLE WITHIN 2
Online sample selection to pick best sample(s) based on query latency and accuracy requirements
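To make the selection step concrete, here is a hypothetical Python sketch that picks a sample from a table of per-sample profiles, given either a time bound or an error bound. It is not BlinkDB's actual selection algorithm. The profile figures are the response times and error bars from the "Sampling Vs. No Sampling" slides, and pairing each error bar with a sample fraction is an assumption. `SampleProfile`, `pick_for_time_bound` and `pick_for_error_bound` are illustrative names.

```python
from typing import List, NamedTuple, Optional


class SampleProfile(NamedTuple):
    fraction: float    # fraction of the full data held by the sample
    seconds: float     # expected query response time
    error_pct: float   # expected relative error, in percent


# Assumed pairing of the figures shown on the earlier slides.
PROFILES = [
    SampleProfile(1e-1, 103, 0.02),
    SampleProfile(1e-2, 18, 0.07),
    SampleProfile(1e-3, 13, 1.1),
    SampleProfile(1e-4, 10, 3.4),
    SampleProfile(1e-5, 8, 11.0),
]


def pick_for_time_bound(profiles: List[SampleProfile],
                        max_seconds: float) -> Optional[SampleProfile]:
    """Most accurate sample whose expected time fits the bound."""
    fitting = [p for p in profiles if p.seconds <= max_seconds]
    return min(fitting, key=lambda p: p.error_pct) if fitting else None


def pick_for_error_bound(profiles: List[SampleProfile],
                         max_error_pct: float) -> Optional[SampleProfile]:
    """Fastest sample whose expected error fits the bound."""
    fitting = [p for p in profiles if p.error_pct <= max_error_pct]
    return min(fitting, key=lambda p: p.seconds) if fitting else None


print(pick_for_time_bound(PROFILES, max_seconds=15))
print(pick_for_error_bound(PROFILES, max_error_pct=1.0))
```

Both helpers return `None` when no sample satisfies the bound, leaving the caller to decide whether to fall back to a larger sample or to the original data.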
74. BlinkDB Prototype
1. Alpha 0.2.0 released and available at http://blinkdb.org
2. Allows you to create samples on native tables and materialized views
3. Adds approximate aggregate functions with statistical closed forms to HiveQL
4. Compatible with Apache Hive, Spark and Facebook’s Presto (storage, serdes, UDFs, types, metadata)
75. An Open Question
We still haven’t figured out the right user interface for approximate queries:
- Time/Error Bounds?
- Continuous Error Bars?
- Hide Errors Altogether?
- UI/UX Specific?
- Application Specific?
- …
Now, with Shark and Spark really pushing the limits of in-memory computation, one natural question that comes to mind is: “Can we do better than in-memory?”
And being better than in-memory could mean one or both of these two things:
And if we focus on this error, the error has an amazing statistical property.
Error Decreases with Moore’s law: Halves every 36 months!