Querying Petabytes of Data in
Seconds using Sampling
UC
Berkeley
Sameer Agarwal
Joint work with Ariel Kleiner, Henry Milne...
Can we do better than in-
memory?
Can we get more with less?
Can fast get faster?
Hard Disks
½ - 1 Hour 1 - 5 Minutes 1 second
?
Memory
10 TB on 100 machines
Query Execution on Samples
ID City Buff Ratio
1 NYC 0.78
2 NYC 0.13
3 Berkeley 0.25
4 NYC 0.19
5 NYC 0.11
6 Berkeley 0.09
7 NYC 0.18
8 NYC 0.15
9 Ber...
ID City Buff Ratio
1 NYC 0.78
2 NYC 0.13
3 Berkeley 0.25
4 NYC 0.19
5 NYC 0.11
6 Berkeley 0.09
7 NYC 0.18
8 NYC 0.15
9 Ber...
ID City Buff Ratio
1 NYC 0.78
2 NYC 0.13
3 Berkeley 0.25
4 NYC 0.19
5 NYC 0.11
6 Berkeley 0.09
7 NYC 0.18
8 NYC 0.15
9 Ber...
ID City Buff Ratio
1 NYC 0.78
2 NYC 0.13
3 Berkeley 0.25
4 NYC 0.19
5 NYC 0.11
6 Berkeley 0.09
7 NYC 0.18
8 NYC 0.15
9 Ber...
Speed/Accuracy Trade-offError
30 mins
Time to
Execute on
Entire
Dataset
Interactive
Queries
2 sec
Execution Time (Sample S...
Sampling Vs. No Sampling
0
200
400
600
800
1000
1 10-1 10-2 10-3 10-4 10-5
Fraction of full data
QueryResponseTime(Seconds...
Sampling Vs. No Sampling
0
200
400
600
800
1000
1 10-1 10-2 10-3 10-4 10-5
Fraction of full data
QueryResponseTime(Seconds...
Sampling Error
Typically, error depends on sample size
(n) and not on original data size, i.e.,
error is proportional to (...
Sampling Error
Typically, error depends on sample size
(n) and not on original data size, i.e.,
error is proportional to (...
Sampling Error
Typically, error depends on sample size
(n) and not on original data size, i.e.,
error is proportional to (...
Speed/Accuracy Trade-off
SELECT avg(sessionTime)
FROM Table
WHERE city=‘San Francisco’
WITHIN 1 SECONDS 234.23 ± 15.32
Speed/Accuracy Trade-off
SELECT avg(sessionTime)
FROM Table
WHERE city=‘San Francisco’
WITHIN 2 SECONDS 234.23 ± 15.32
239...
Speed/Accuracy Trade-off
SELECT avg(sessionTime)
FROM Table
WHERE city=‘San Francisco’
WITHIN 1 SECONDS
AVG, COUNT, SUM,
S...
Speed/Accuracy Trade-off
SELECT avg(sessionTime)
FROM Table
WHERE city=‘San Francisco’
WITHIN 1 SECONDS FILTERS, GROUP BY
...
Speed/Accuracy Trade-off
SELECT avg(sessionTime)
FROM Table
WHERE city=‘San Francisco’
LEFT OUTER JOIN logs2
ON very_big_l...
Speed/Accuracy Trade-off
SELECT my_function(sessionTime)
FROM Table
WHERE city=‘San Francisco’
LEFT OUTER JOIN logs2
ON ve...
Speed/Accuracy Trade-off
SELECT avg(sessionTime)
FROM Table
WHERE city=‘San Francisco’
ERROR 0.1 CONFIDENCE 95.0%
What is BlinkDB?
A framework built on Spark that …
- creates and maintains a variety of uniform and
stratified samples fro...
What is BlinkDB?
A framework built on Spark that …
- creates and maintains a variety of uniform and
stratified samples fro...
Uniform Samples
2
4
1
3
Uniform Samples
2
4
1
3
U
Uniform Samples
2
4
1
3
U
ID City Data
1 NYC 0.78
2 NYC 0.13
3 Berkeley 0.25
4 NYC 0.19
5 NYC 0.11
6 Berkeley 0.09
7 NYC 0...
Uniform Samples
2
4
1
3
U
1. FILTER rand() < 1/3
2. Adds per-row weights
3. In-memory Shuffle
ID City Data
1 NYC 0.78
2 NY...
Uniform Samples
2
4
1
3
U
ID City Data Weight
2 NYC 0.13 1/3
8 NYC 0.25 1/3
6 Berkeley 0.09 1/3
11 NYC 0.19 1/3
ID City Da...
Stratified Samples
2
4
1
3
Stratified Samples
2
4
1
3
S
Stratified Samples
2
4
1
3
S
ID City Data
1 NYC 0.78
2 NYC 0.13
3 Berkeley 0.25
4 NYC 0.19
5 NYC 0.11
6 Berkeley 0.09
7 NY...
Stratified Samples
2
4
1
3
ID City Data
1 NYC 0.78
2 NYC 0.13
3 Berkeley 0.25
4 NYC 0.19
5 NYC 0.11
6 Berkeley 0.09
7 NYC ...
Stratified Samples
2
4
1
3
S1
ID City Data
1 NYC 0.78
2 NYC 0.13
3 Berkeley 0.25
4 NYC 0.19
5 NYC 0.11
6 Berkeley 0.09
7 N...
Stratified Samples
2
4
1
3
S1
ID City Data
1 NYC 0.78
2 NYC 0.13
3 Berkeley 0.25
4 NYC 0.19
5 NYC 0.11
6 Berkeley 0.09
7 N...
Stratified Samples
2
4
1
3
S1
ID City Data
1 NYC 0.78
2 NYC 0.13
3 Berkeley 0.25
4 NYC 0.19
5 NYC 0.11
6 Berkeley 0.09
7 N...
Stratified Samples
2
4
1
3
S1
S2
S2
U
ID City Data Weight
2 NYC 0.13 2/7
8 NYC 0.25 2/7
6 Berkeley 0.09 2/5
12 Berkeley 0....
What is BlinkDB?
A framework built on Spark that …
- creates and maintains a variety of uniform and
stratified samples fro...
Error Estimation
Closed Form Aggregate Functions
- Central Limit Theorem
- Applicable to AVG, COUNT, SUM,
VARIANCE and STD...
Error Estimation
Closed Form Aggregate Functions
- Central Limit Theorem
- Applicable to AVG, COUNT, SUM,
VARIANCE and STD...
Error Estimation
Closed Form Aggregate Functions
- Central Limit Theorem
- Applicable to AVG, COUNT, SUM,
VARIANCE and STD...
Error Estimation
Generalized Aggregate Functions
- Statistical Bootstrap
- Applicable to complex and nested queries,
UDFs,...
Error Estimation
Generalized Aggregate Functions
- Statistical Bootstrap
- Applicable to complex and nested
queries, UDFs,...
What is BlinkDB?
A framework built on Spark that …
- creates and maintains a variety of random
and stratified samples from...
Error VerificationError
Sample Size
More Data  Higher
Accuracy300 Data Points  97%
Accuracy [KDD’13] [SIGMOD’14]
Single Pass Execution
Sampl
e
A
Approximate Query on a Sample
46
Single Pass Execution
Sampl
e
A
State Age Metric Weight
CA 20 1971 1/4
CA 22 2819 1/4
MA 22 3819 1/4
MA 30 3091 1/4
47
Single Pass Execution
Sampl
e
R
A
State Age Metric Weight
CA 20 1971 1/4
CA 22 2819 1/4
MA 22 3819 1/4
MA 30 3091 1/4
Resa...
Single Pass Execution
Sampl
e
A
R
Sample “Pushdown”
State Age Metric Weight
CA 20 1971 1/4
CA 22 2819 1/4
MA 22 3819 1/4
M...
Single Pass Execution
Sampl
e
A
R
Metric Weight
1971 1/4
3819 1/4
Sample “Pushdown”
50
Single Pass Execution
Sampl
e
A
R
Metric Weight
1971 1/4
3819 1/4
Resampling Operator
51
Single Pass Execution
Sampl
e
A
R
Metric Weight
1971 1/4
3819 1/4
Metric Weight
1971 1/4
1971 1/4
Metric Weight
3819 1/4
3...
Single Pass Execution
Sampl
e
A
R
Metric Weight
1971 1/4
3819 1/4
Metric Weight
1971 1/4
1971 1/4
Metric Weight
3819 1/4
3...
Single Pass Execution
Sampl
e
A
R
Metric Weight
1971 1/4
3819 1/4
Metric Weight
1971 1/4
1971 1/4
Metric Weight
3819 1/4
3...
Sampl
e
A
Metric Weight
1971 1/4
3819 1/4
Leverage Poissonized
Resampling to generate
samples with
replacement
Single Pass...
Sampl
e
A
Metric Weight S1
1971 1/4 2
3819 1/4 1
Sample from a
Poisson (1) Distribution
Single Pass Execution
A1
56
Sampl
e
A
Metric Weight S1 Sk
1971 1/4 2 1
3819 1/4 1 0
Construct all Resamples
in Single Pass
Single Pass Execution
A1 Ak...
S S1 S100
… Da1 Da100
… Db1 Db100
… Dc1 Dc100
…
SAMPLE BOOTSTRAP
WEIGHTS
DIAGNOSTICS
WEIGHTS
Single Pass Execution
Sampl
e...
S S1 S100
… Da1 Da100
… Db1 Db100
… Dc1 Dc100
…
SAMPLE BOOTSTRAP
WEIGHTS
DIAGNOSTICS
WEIGHTS
Single Pass Execution
Sampl
e...
S S1 S100
… Da1 Da100
… Db1 Db100
… Dc1 Dc100
…
Single Pass Execution
Sampl
e
A
Embarrassingly Parallel
SAMPLE BOOTSTRAP
W...
What is BlinkDB?
A framework built on Spark that …
- creates and maintains a variety of random
and stratified samples from...
What is BlinkDB?
A framework built on Spark that …
- creates and maintains a variety of uniform and
stratified samples fro...
What is BlinkDB?
A framework built on Spark that …
- creates and maintains a variety of uniform and
stratified samples fro...
TABLE
Sampling
Module
Original
Data
Offline-sampling:
Creates an optimal
set of samples on
native tables and
materialized ...
TABLE
Sampling
Module
In-Memory
Samples
On-Disk
Samples
Original
Data
Sample
Placement:
Samples striped
over 100s or 1,000...
SELECT
foo (*)
FROM
TABLE
WITHIN 2
Query Plan
HiveQL/SQL
Query
Sample Selection
TABLE
Sampling
Module
In-Memory
Samples
On...
SELECT
foo (*)
FROM
TABLE
WITHIN 2
Query Plan
HiveQL/SQL
Query
Sample Selection
TABLE
Sampling
Module
In-Memory
Samples
On...
TABLE
Sampling
Module
In-Memory
Samples
On-Disk
Samples
Original
Data
Hive/Shark/Prest
o
SELECT
foo (*)
FROM
TABLE
WITHIN ...
BlinkDB is Fast!
- 5 Queries, 5 machines
- 20 GB samples (0.001%-1% of original data)
- 1-5% Error
ResponseTime(s)
Query Execution
Overall Query Execution
Overall Query ExecutionResponseTime(s)
Error Estimation
Overhead
Overall Query ExecutionResponseTime(s)
Error Verification
Overhead
Coming Soon: Native Spark
Integration
BlinkDB Prototype
1. Alpha 0.2.0 released and available at http://blinkdb.org
2. Allows you to create samples on native ta...
An Open Question
We still haven’t figured out the right user-
interface for approximate queries:
- Time/Error Bounds?
- Co...
http://blinkdb.org
Native Spark Integration Coming
Soon!
Upcoming SlideShare
Loading in …5
×

BlinkDB: Qureying Petabytes of Data in Seconds using Sampling

2,525 views

Published on

0 Comments
12 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
2,525
On SlideShare
0
From Embeds
0
Number of Embeds
195
Actions
Shares
0
Downloads
0
Comments
0
Likes
12
Embeds 0
No embeds

No notes for slide
  • Now with shark and spark really pushing the limits of in-memory computations, one of the natural questions that comes to mind is– “Can we do better than in-memory?”
    And being better than in memory, could mean one or both of these 2 things--
  • And if we focus on this error, the error has an amazing statistical property.

    Error Decreases with Moore’s law: Halves every 36 months!
  • And if we focus on this error, the error has an amazing statistical property.

    Error Decreases with Moore’s law: Halves every 36 months!
  • And if we focus on this error, the error has an amazing statistical property.

    Error Decreases with Moore’s law: Halves every 36 months!
  • Original data 2TB-2PB
  • BlinkDB: Qureying Petabytes of Data in Seconds using Sampling

    1. 1. Querying Petabytes of Data in Seconds using Sampling UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael Jordan, Samuel Madden, Ion Stoica M I T 1
    2. 2. Can we do better than in- memory?
    3. 3. Can we get more with less?
    4. 4. Can fast get faster?
    5. 5. Hard Disks ½ - 1 Hour 1 - 5 Minutes 1 second ? Memory 10 TB on 100 machines Query Execution on Samples
    6. 6. ID City Buff Ratio 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 Query Execution on Samples What is the average buffering ratio in the table? 0.2325
    7. 7. ID City Buff Ratio 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 Query Execution on Samples What is the average buffering ratio in the table? ID City Buff Ratio Sampling Rate 2 NYC 0.13 1/4 6 Berkeley 0.25 1/4 8 NYC 0.19 1/4 Uniform Sample 0.19 0.2325
    8. 8. ID City Buff Ratio 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 Query Execution on Samples What is the average buffering ratio in the table? ID City Buff Ratio Sampling Rate 2 NYC 0.13 1/4 6 Berkeley 0.25 1/4 8 NYC 0.19 1/4 Uniform Sample 0.19 +/- 0.05 0.2325
    9. 9. ID City Buff Ratio 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 Query Execution on Samples What is the average buffering ratio in the table? ID City Buff Ratio Sampling Rate 2 NYC 0.13 1/2 3 Berkeley 0.25 1/2 5 NYC 0.19 1/2 6 Berkeley 0.09 1/2 8 NYC 0.18 1/2 12 Berkeley 0.49 1/2 Uniform Sample $0.22 +/- 0.02 0.2325 0.19 +/- 0.05
    10. 10. Speed/Accuracy Trade-offError 30 mins Time to Execute on Entire Dataset Interactive Queries 2 sec Execution Time (Sample Size)
    11. 11. Sampling Vs. No Sampling 0 200 400 600 800 1000 1 10-1 10-2 10-3 10-4 10-5 Fraction of full data QueryResponseTime(Seconds) 103 1020 18 13 10 8 10x as response time is dominated by I/O
    12. 12. Sampling Vs. No Sampling 0 200 400 600 800 1000 1 10-1 10-2 10-3 10-4 10-5 Fraction of full data QueryResponseTime(Seconds) 103 1020 18 13 10 8 (0.02%) (0.07%) (1.1%) (3.4%) (11%) Error Bars
    13. 13. Sampling Error Typically, error depends on sample size (n) and not on original data size, i.e., error is proportional to (1/sqrt(n))*
    14. 14. Sampling Error Typically, error depends on sample size (n) and not on original data size, i.e., error is proportional to (1/sqrt(n))* * Conditions Apply
    15. 15. Sampling Error Typically, error depends on sample size (n) and not on original data size, i.e., error is proportional to (1/sqrt(n))* * Conditions Apply
    16. 16. Speed/Accuracy Trade-off SELECT avg(sessionTime) FROM Table WHERE city=‘San Francisco’ WITHIN 1 SECONDS 234.23 ± 15.32
    17. 17. Speed/Accuracy Trade-off SELECT avg(sessionTime) FROM Table WHERE city=‘San Francisco’ WITHIN 2 SECONDS 234.23 ± 15.32 239.46 ± 4.96
    18. 18. Speed/Accuracy Trade-off SELECT avg(sessionTime) FROM Table WHERE city=‘San Francisco’ WITHIN 1 SECONDS AVG, COUNT, SUM, STDEV, PERCENTILE etc.
    19. 19. Speed/Accuracy Trade-off SELECT avg(sessionTime) FROM Table WHERE city=‘San Francisco’ WITHIN 1 SECONDS FILTERS, GROUP BY clauses
    20. 20. Speed/Accuracy Trade-off SELECT avg(sessionTime) FROM Table WHERE city=‘San Francisco’ LEFT OUTER JOIN logs2 ON very_big_log.id = logs.id WITHIN 1 SECONDS JOINS, Nested Queries etc.
    21. 21. Speed/Accuracy Trade-off SELECT my_function(sessionTime) FROM Table WHERE city=‘San Francisco’ LEFT OUTER JOIN logs2 ON very_big_log.id = logs.id WITHIN 1 SECONDS ML Primitives, User Defined Functions
    22. 22. Speed/Accuracy Trade-off SELECT avg(sessionTime) FROM Table WHERE city=‘San Francisco’ ERROR 0.1 CONFIDENCE 95.0%
    23. 23. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime
    24. 24. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime
    25. 25. Uniform Samples 2 4 1 3
    26. 26. Uniform Samples 2 4 1 3 U
    27. 27. Uniform Samples 2 4 1 3 U ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10
    28. 28. Uniform Samples 2 4 1 3 U 1. FILTER rand() < 1/3 2. Adds per-row weights 3. In-memory Shuffle ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10
    29. 29. Uniform Samples 2 4 1 3 U ID City Data Weight 2 NYC 0.13 1/3 8 NYC 0.25 1/3 6 Berkeley 0.09 1/3 11 NYC 0.19 1/3 ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 Doesn’t change Spark RDD Semantics
    30. 30. Stratified Samples 2 4 1 3
    31. 31. Stratified Samples 2 4 1 3 S
    32. 32. Stratified Samples 2 4 1 3 S ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10
    33. 33. Stratified Samples 2 4 1 3 ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 S1 ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 SPLIT
    34. 34. Stratified Samples 2 4 1 3 S1 ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 S2 City Count NYC 7 Berkeley 5 GROUP
    35. 35. Stratified Samples 2 4 1 3 S1 ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 S2 City Count Ratio NYC 7 2/7 Berkeley 5 2/5 GROUP
    36. 36. Stratified Samples 2 4 1 3 S1 ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 S2 City Count Ratio NYC 7 2/7 Berkeley 5 2/5 S2 JOIN
    37. 37. Stratified Samples 2 4 1 3 S1 S2 S2 U ID City Data Weight 2 NYC 0.13 2/7 8 NYC 0.25 2/7 6 Berkeley 0.09 2/5 12 Berkeley 0.49 2/5 Doesn’t change Shark RDD Semantics
    38. 38. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime
    39. 39. Error Estimation Closed Form Aggregate Functions - Central Limit Theorem - Applicable to AVG, COUNT, SUM, VARIANCE and STDEV
    40. 40. Error Estimation Closed Form Aggregate Functions - Central Limit Theorem - Applicable to AVG, COUNT, SUM, VARIANCE and STDEV
    41. 41. Error Estimation Closed Form Aggregate Functions - Central Limit Theorem - Applicable to AVG, COUNT, SUM, VARIANCE and STDEV A 1 2 Sampl e AVG SUM COUNT STDEV VARIANCE A 1 2 Sampl e A ±ε A
    42. 42. Error Estimation Generalized Aggregate Functions - Statistical Bootstrap - Applicable to complex and nested queries, UDFs, joins etc.
    43. 43. Error Estimation Generalized Aggregate Functions - Statistical Bootstrap - Applicable to complex and nested queries, UDFs, joins etc. Sampl e A Sampl e AA1A2A100 … … B ±ε
    44. 44. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of random and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime
    45. 45. Error VerificationError Sample Size More Data  Higher Accuracy300 Data Points  97% Accuracy [KDD’13] [SIGMOD’14]
    46. 46. Single Pass Execution Sampl e A Approximate Query on a Sample 46
    47. 47. Single Pass Execution Sampl e A State Age Metric Weight CA 20 1971 1/4 CA 22 2819 1/4 MA 22 3819 1/4 MA 30 3091 1/4 47
    48. 48. Single Pass Execution Sampl e R A State Age Metric Weight CA 20 1971 1/4 CA 22 2819 1/4 MA 22 3819 1/4 MA 30 3091 1/4 Resampling Operator 48
    49. 49. Single Pass Execution Sampl e A R Sample “Pushdown” State Age Metric Weight CA 20 1971 1/4 CA 22 2819 1/4 MA 22 3819 1/4 MA 30 3091 1/4 49
    50. 50. Single Pass Execution Sampl e A R Metric Weight 1971 1/4 3819 1/4 Sample “Pushdown” 50
    51. 51. Single Pass Execution Sampl e A R Metric Weight 1971 1/4 3819 1/4 Resampling Operator 51
    52. 52. Single Pass Execution Sampl e A R Metric Weight 1971 1/4 3819 1/4 Metric Weight 1971 1/4 1971 1/4 Metric Weight 3819 1/4 3819 1/4 Resampling Operator 52
    53. 53. Single Pass Execution Sampl e A R Metric Weight 1971 1/4 3819 1/4 Metric Weight 1971 1/4 1971 1/4 Metric Weight 3819 1/4 3819 1/4 A A1 An … 53
    54. 54. Single Pass Execution Sampl e A R Metric Weight 1971 1/4 3819 1/4 Metric Weight 1971 1/4 1971 1/4 Metric Weight 3819 1/4 3819 1/4 A A1 An … 54
    55. 55. Sampl e A Metric Weight 1971 1/4 3819 1/4 Leverage Poissonized Resampling to generate samples with replacement Single Pass Execution 55
    56. 56. Sampl e A Metric Weight S1 1971 1/4 2 3819 1/4 1 Sample from a Poisson (1) Distribution Single Pass Execution A1 56
    57. 57. Sampl e A Metric Weight S1 Sk 1971 1/4 2 1 3819 1/4 1 0 Construct all Resamples in Single Pass Single Pass Execution A1 Ak 57
    58. 58. S S1 S100 … Da1 Da100 … Db1 Db100 … Dc1 Dc100 … SAMPLE BOOTSTRAP WEIGHTS DIAGNOSTICS WEIGHTS Single Pass Execution Sampl e A 58
    59. 59. S S1 S100 … Da1 Da100 … Db1 Db100 … Dc1 Dc100 … SAMPLE BOOTSTRAP WEIGHTS DIAGNOSTICS WEIGHTS Single Pass Execution Sampl e A Additional Overhead: 200 bytes/row 59
    60. 60. S S1 S100 … Da1 Da100 … Db1 Db100 … Dc1 Dc100 … Single Pass Execution Sampl e A Embarrassingly Parallel SAMPLE BOOTSTRAP WEIGHTS DIAGNOSTICS WEIGHTS 60
    61. 61. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of random and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime
    62. 62. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime [Offline Process]
    63. 63. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime [Online Process]
    64. 64. TABLE Sampling Module Original Data Offline-sampling: Creates an optimal set of samples on native tables and materialized views based on query history and BlinkDB Architecture 64
    65. 65. TABLE Sampling Module In-Memory Samples On-Disk Samples Original Data Sample Placement: Samples striped over 100s or 1,000s of machines both on disks and in- memory. BlinkDB Architecture 65
    66. 66. SELECT foo (*) FROM TABLE WITHIN 2 Query Plan HiveQL/SQL Query Sample Selection TABLE Sampling Module In-Memory Samples On-Disk Samples Original Data BlinkDB Architecture 66
    67. 67. SELECT foo (*) FROM TABLE WITHIN 2 Query Plan HiveQL/SQL Query Sample Selection TABLE Sampling Module In-Memory Samples On-Disk Samples Original Data Online sample selection to pick best sample(s) based on query latency and accuracy requirements BlinkDB Architecture 67
    68. 68. TABLE Sampling Module In-Memory Samples On-Disk Samples Original Data Hive/Shark/Prest o SELECT foo (*) FROM TABLE WITHIN 2 New Query Plan HiveQL/SQL Query Sample Selection Error Bars & Confidence Intervals Result 182.23 ± 5.56 (95% confidence) Parallel query execution on multiple samples striped across BlinkDB Architecture 68
    69. 69. BlinkDB is Fast! - 5 Queries, 5 machines - 20 GB samples (0.001%-1% of original data) - 1-5% Error
    70. 70. ResponseTime(s) Query Execution Overall Query Execution
    71. 71. Overall Query ExecutionResponseTime(s) Error Estimation Overhead
    72. 72. Overall Query ExecutionResponseTime(s) Error Verification Overhead
    73. 73. Coming Soon: Native Spark Integration
    74. 74. BlinkDB Prototype 1. Alpha 0.2.0 released and available at http://blinkdb.org 2. Allows you to create samples on native tables and materialized views 3. Adds approximate aggregate functions with statistical closed forms to HiveQL 4. Compatible with Apache Hive, Spark and Facebook’s Presto (storage, serdes, UDFs, types, metadata)
    75. 75. An Open Question We still haven’t figured out the right user- interface for approximate queries: - Time/Error Bounds? - Continuous Error Bars? - Hide Errors Altogether? - UI/UX Specific? - Application Specific? - … 75
    76. 76. http://blinkdb.org Native Spark Integration Coming Soon!

    ×