BlinkDB: Querying Petabytes of Data in Seconds using Sampling

  • Now, with Shark and Spark really pushing the limits of in-memory computation, a natural question comes to mind: "Can we do better than in-memory?" Being better than in-memory could mean one or both of two things:
  • If we focus on this error, it has an amazing statistical property: the error decreases with Moore's Law, halving every 36 months!
  • Original data 2TB-2PB
  • Transcript

    • 1. Querying Petabytes of Data in Seconds using Sampling UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael Jordan, Samuel Madden, Ion Stoica MIT
    • 2. Can we do better than in-memory?
    • 3. Can we get more with less?
    • 4. Can fast get faster?
    • 5. Query Execution on Samples: 10 TB on 100 machines. Hard Disks: ½-1 Hour; Memory: 1-5 Minutes; Samples: 1 second?
    • 6. ID City Buff Ratio 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 Query Execution on Samples What is the average buffering ratio in the table? 0.2325
    • 7. ID City Buff Ratio 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 Query Execution on Samples What is the average buffering ratio in the table? ID City Buff Ratio Sampling Rate 2 NYC 0.13 1/4 6 Berkeley 0.25 1/4 8 NYC 0.19 1/4 Uniform Sample 0.19 0.2325
    • 8. ID City Buff Ratio 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 Query Execution on Samples What is the average buffering ratio in the table? ID City Buff Ratio Sampling Rate 2 NYC 0.13 1/4 6 Berkeley 0.25 1/4 8 NYC 0.19 1/4 Uniform Sample 0.19 +/- 0.05 0.2325
    • 9. ID City Buff Ratio 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 Query Execution on Samples What is the average buffering ratio in the table? ID City Buff Ratio Sampling Rate 2 NYC 0.13 1/2 3 Berkeley 0.25 1/2 5 NYC 0.19 1/2 6 Berkeley 0.09 1/2 8 NYC 0.18 1/2 12 Berkeley 0.49 1/2 Uniform Sample 0.22 +/- 0.02 0.2325 0.19 +/- 0.05
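The arithmetic on the toy table above can be checked with a few lines of Python: the exact average over all 12 buffering ratios is 0.2325, and the ¼ uniform sample shown on the earlier slide yields the estimate 0.19.

```python
# Full table of buffering ratios from the slide (IDs 1-12).
ratios = [0.78, 0.13, 0.25, 0.19, 0.11, 0.09,
          0.18, 0.15, 0.13, 0.49, 0.19, 0.10]

exact = sum(ratios) / len(ratios)
print(exact)  # 0.2325 -- the answer on the slide

# The 1/4 uniform sample shown on the slide (values as printed there).
sample = [0.13, 0.25, 0.19]
estimate = sum(sample) / len(sample)
print(round(estimate, 2))  # 0.19 -- the approximate answer
```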
    • 10. Speed/Accuracy Trade-off: Error vs. Execution Time (Sample Size). Executing on the entire dataset takes 30 mins; interactive queries run in 2 sec.
    • 11. Sampling vs. No Sampling: query response time (seconds) against the fraction of full data used (from 1 down to 10^-5). Sampling gives a ~10x speedup, as response time is dominated by I/O.
    • 12. Sampling vs. No Sampling: the same plot with error bars, which grow from 0.02% to 11% as the sample fraction shrinks from 10^-1 to 10^-5.
    • 13. Sampling Error Typically, error depends on sample size (n) and not on original data size, i.e., error is proportional to (1/sqrt(n))*
    • 14. Sampling Error Typically, error depends on sample size (n) and not on original data size, i.e., error is proportional to (1/sqrt(n))* * Conditions Apply
    • 15. Sampling Error Typically, error depends on sample size (n) and not on original data size, i.e., error is proportional to (1/sqrt(n))* * Conditions Apply
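The 1/sqrt(n) behaviour above is easy to demonstrate empirically. The sketch below is illustrative only (the population and sample sizes are invented): quadrupling the sample size roughly halves the estimation error, regardless of how large the population is.

```python
import random
import statistics

random.seed(0)
population = [random.random() for _ in range(100_000)]

def estimation_error(n, trials=200):
    """Empirical std. dev. of the sample-mean estimator at sample size n."""
    means = [statistics.fmean(random.sample(population, n))
             for _ in range(trials)]
    return statistics.pstdev(means)

e_small = estimation_error(100)
e_large = estimation_error(400)
# 4x the sample size -> roughly half the error (1/sqrt(n) scaling).
print(e_small / e_large)
```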
    • 16. Speed/Accuracy Trade-off SELECT avg(sessionTime) FROM Table WHERE city=‘San Francisco’ WITHIN 1 SECONDS 234.23 ± 15.32
    • 17. Speed/Accuracy Trade-off SELECT avg(sessionTime) FROM Table WHERE city=‘San Francisco’ WITHIN 2 SECONDS 234.23 ± 15.32 239.46 ± 4.96
    • 18. Speed/Accuracy Trade-off SELECT avg(sessionTime) FROM Table WHERE city=‘San Francisco’ WITHIN 1 SECONDS AVG, COUNT, SUM, STDEV, PERCENTILE etc.
    • 19. Speed/Accuracy Trade-off SELECT avg(sessionTime) FROM Table WHERE city=‘San Francisco’ WITHIN 1 SECONDS FILTERS, GROUP BY clauses
    • 20. Speed/Accuracy Trade-off SELECT avg(sessionTime) FROM Table WHERE city=‘San Francisco’ LEFT OUTER JOIN logs2 ON very_big_log.id = logs.id WITHIN 1 SECONDS JOINS, Nested Queries etc.
    • 21. Speed/Accuracy Trade-off SELECT my_function(sessionTime) FROM Table WHERE city=‘San Francisco’ LEFT OUTER JOIN logs2 ON very_big_log.id = logs.id WITHIN 1 SECONDS ML Primitives, User Defined Functions
    • 22. Speed/Accuracy Trade-off SELECT avg(sessionTime) FROM Table WHERE city=‘San Francisco’ ERROR 0.1 CONFIDENCE 95.0%
    • 23. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime
    • 24. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime
    • 25. Uniform Samples 2 4 1 3
    • 26. Uniform Samples 2 4 1 3 U
    • 27. Uniform Samples 2 4 1 3 U ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10
    • 28. Uniform Samples 2 4 1 3 U 1. FILTER rand() < 1/3 2. Adds per-row weights 3. In-memory Shuffle ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10
    • 29. Uniform Samples 2 4 1 3 U ID City Data Weight 2 NYC 0.13 1/3 8 NYC 0.25 1/3 6 Berkeley 0.09 1/3 11 NYC 0.19 1/3 ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 Doesn’t change Spark RDD Semantics
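The construction on the slides above (FILTER rand() < p, then attach the sampling rate as a per-row weight) can be sketched in plain Python. This is an illustration of the idea, not BlinkDB's Spark implementation; the rows and p are made up. The stored weight is what lets an aggregate like SUM be estimated by scaling each sampled row by 1/weight.

```python
import random

random.seed(1)
rows = [{"id": i, "val": random.random()} for i in range(1, 10_001)]

# Uniform sample: keep each row with probability p = 1/3 and record
# the sampling rate as the row's weight.
p = 1 / 3
sample = [dict(r, weight=p) for r in rows if random.random() < p]

# SUM is estimated by scaling each sampled value by 1/weight.
approx_sum = sum(r["val"] / r["weight"] for r in sample)
exact_sum = sum(r["val"] for r in rows)
print(abs(approx_sum - exact_sum) / exact_sum)  # small relative error
```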
    • 30. Stratified Samples 2 4 1 3
    • 31. Stratified Samples 2 4 1 3 S
    • 32. Stratified Samples 2 4 1 3 S ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10
    • 33. Stratified Samples 2 4 1 3 ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 S1 ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 SPLIT
    • 34. Stratified Samples 2 4 1 3 S1 ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 S2 City Count NYC 7 Berkeley 5 GROUP
    • 35. Stratified Samples 2 4 1 3 S1 ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 S2 City Count Ratio NYC 7 2/7 Berkeley 5 2/5 GROUP
    • 36. Stratified Samples 2 4 1 3 S1 ID City Data 1 NYC 0.78 2 NYC 0.13 3 Berkeley 0.25 4 NYC 0.19 5 NYC 0.11 6 Berkeley 0.09 7 NYC 0.18 8 NYC 0.15 9 Berkeley 0.13 10 Berkeley 0.49 11 NYC 0.19 12 Berkeley 0.10 S2 City Count Ratio NYC 7 2/7 Berkeley 5 2/5 S2 JOIN
    • 37. Stratified Samples 2 4 1 3 S1 S2 S2 U ID City Data Weight 2 NYC 0.13 2/7 8 NYC 0.25 2/7 6 Berkeley 0.09 2/5 12 Berkeley 0.49 2/5 Doesn’t change Shark RDD Semantics
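The stratified construction above (GROUP by the column, cap each group, record the per-group ratio as the weight) can be sketched as follows. The helper `stratified_sample` is hypothetical, not a BlinkDB API; with the slide's 7 NYC and 5 Berkeley rows and a cap of 2, it reproduces the weights 2/7 and 2/5 shown on slide 35.

```python
import random
from collections import defaultdict

random.seed(2)
cities = ["NYC"] * 7 + ["Berkeley"] * 5
rows = [{"city": c, "val": random.random()} for c in cities]

def stratified_sample(rows, key, k):
    """Cap each group at k rows; weight = rows kept / group size."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r)
    out = []
    for members in groups.values():
        chosen = random.sample(members, min(k, len(members)))
        ratio = len(chosen) / len(members)
        out.extend(dict(r, weight=ratio) for r in chosen)
    return out

s = stratified_sample(rows, "city", k=2)
# Every city is represented, even rare ones -- unlike a uniform sample.
print(sorted({r["city"] for r in s}))  # ['Berkeley', 'NYC']
```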
    • 38. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime
    • 39. Error Estimation Closed Form Aggregate Functions - Central Limit Theorem - Applicable to AVG, COUNT, SUM, VARIANCE and STDEV
    • 40. Error Estimation Closed Form Aggregate Functions - Central Limit Theorem - Applicable to AVG, COUNT, SUM, VARIANCE and STDEV
    • 41. Error Estimation Closed Form Aggregate Functions - Central Limit Theorem - Applicable to AVG, COUNT, SUM, VARIANCE and STDEV A 1 2 Sample AVG SUM COUNT STDEV VARIANCE A 1 2 Sample A ±ε A
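For the closed-form aggregates above, the error bar is the textbook Central Limit Theorem interval: mean ± 1.96 · s/sqrt(n) at 95% confidence. A minimal sketch, with made-up session times standing in for the sample:

```python
import math
import random
import statistics

random.seed(3)
# Hypothetical sample of session times (true mean 200, stdev 40).
sample = [random.gauss(200, 40) for _ in range(400)]

# 95% error bar for AVG via the CLT: mean +/- 1.96 * s / sqrt(n).
n = len(sample)
mean = statistics.fmean(sample)
half_width = 1.96 * statistics.stdev(sample) / math.sqrt(n)
print(f"{mean:.2f} +/- {half_width:.2f}")
```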
    • 42. Error Estimation Generalized Aggregate Functions - Statistical Bootstrap - Applicable to complex and nested queries, UDFs, joins etc.
    • 43. Error Estimation Generalized Aggregate Functions - Statistical Bootstrap - Applicable to complex and nested queries, UDFs, joins etc. Sample A Sample A A1 A2 A100 … … B ±ε
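For aggregates with no closed-form error formula, the statistical bootstrap recomputes the query on ~100 resamples drawn with replacement and reads the error off their spread. A sketch, with `my_function` standing in for an arbitrary UDF (here the median, chosen for illustration):

```python
import random
import statistics

random.seed(4)
sample = [random.expovariate(1.0) for _ in range(500)]

def my_function(xs):
    """Stand-in for an arbitrary UDF with no closed-form error formula."""
    return statistics.median(xs)

# Bootstrap: run the aggregate on 100 with-replacement resamples and
# use the spread of the results as the error estimate.
boot = [my_function(random.choices(sample, k=len(sample)))
        for _ in range(100)]
estimate = my_function(sample)
error = 1.96 * statistics.pstdev(boot)
print(f"{estimate:.3f} +/- {error:.3f}")
```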
    • 44. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of random and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime
    • 45. Error Verification: error vs. sample size. More data → higher accuracy; 300 data points → 97% accuracy. [KDD’13] [SIGMOD’14]
    • 46. Single Pass Execution Sample A Approximate Query on a Sample
    • 47. Single Pass Execution Sample A State Age Metric Weight CA 20 1971 1/4 CA 22 2819 1/4 MA 22 3819 1/4 MA 30 3091 1/4
    • 48. Single Pass Execution Sample R A State Age Metric Weight CA 20 1971 1/4 CA 22 2819 1/4 MA 22 3819 1/4 MA 30 3091 1/4 Resampling Operator
    • 49. Single Pass Execution Sample A R Sample “Pushdown” State Age Metric Weight CA 20 1971 1/4 CA 22 2819 1/4 MA 22 3819 1/4 MA 30 3091 1/4
    • 50. Single Pass Execution Sample A R Metric Weight 1971 1/4 3819 1/4 Sample “Pushdown”
    • 51. Single Pass Execution Sample A R Metric Weight 1971 1/4 3819 1/4 Resampling Operator
    • 52. Single Pass Execution Sample A R Metric Weight 1971 1/4 3819 1/4 Metric Weight 1971 1/4 1971 1/4 Metric Weight 3819 1/4 3819 1/4 Resampling Operator
    • 53. Single Pass Execution Sample A R Metric Weight 1971 1/4 3819 1/4 Metric Weight 1971 1/4 1971 1/4 Metric Weight 3819 1/4 3819 1/4 A A1 An …
    • 54. Single Pass Execution Sample A R Metric Weight 1971 1/4 3819 1/4 Metric Weight 1971 1/4 1971 1/4 Metric Weight 3819 1/4 3819 1/4 A A1 An …
    • 55. Sample A Metric Weight 1971 1/4 3819 1/4 Leverage Poissonized Resampling to generate samples with replacement Single Pass Execution
    • 56. Sample A Metric Weight S1 1971 1/4 2 3819 1/4 1 Sample from a Poisson(1) Distribution Single Pass Execution A1
    • 57. Sample A Metric Weight S1 Sk 1971 1/4 2 1 3819 1/4 1 0 Construct all Resamples in a Single Pass Single Pass Execution A1 Ak
    • 58. S S1 S100 … Da1 Da100 … Db1 Db100 … Dc1 Dc100 … SAMPLE BOOTSTRAP WEIGHTS DIAGNOSTICS WEIGHTS Single Pass Execution Sample A
    • 59. S S1 S100 … Da1 Da100 … Db1 Db100 … Dc1 Dc100 … SAMPLE BOOTSTRAP WEIGHTS DIAGNOSTICS WEIGHTS Single Pass Execution Sample A Additional Overhead: 200 bytes/row
    • 60. S S1 S100 … Da1 Da100 … Db1 Db100 … Dc1 Dc100 … Single Pass Execution Sample A Embarrassingly Parallel SAMPLE BOOTSTRAP WEIGHTS DIAGNOSTICS WEIGHTS
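The Poissonized-resampling trick in the slides above can be sketched in one pass: each row contributes a Poisson(1) count to each of the K resamples, which approximates sampling with replacement without materialising K copies of the sample. Illustrative sketch only (the sample, K, and the aggregate are invented; the Poisson sampler uses Knuth's method since the stdlib has none):

```python
import random
import statistics

random.seed(5)
sample = [random.random() for _ in range(1_000)]
K = 100  # number of bootstrap resamples

def poisson1():
    """Draw from Poisson(lambda=1) via Knuth's method."""
    L, k, p = 2.718281828459045 ** -1, 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

sums = [0.0] * K
counts = [0] * K
for x in sample:              # single pass over the sample
    for j in range(K):
        c = poisson1()        # this row's multiplicity in resample j
        sums[j] += c * x
        counts[j] += c

means = [s / n for s, n in zip(sums, counts)]
error = 1.96 * statistics.pstdev(means)  # bootstrap error bar for AVG
print(round(error, 4))
```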
    • 61. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of random and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime
    • 62. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime [Offline Process]
    • 63. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime [Online Process]
    • 64. TABLE Sampling Module Original Data Offline sampling: creates an optimal set of samples on native tables and materialized views based on query history. BlinkDB Architecture
    • 65. TABLE Sampling Module In-Memory Samples On-Disk Samples Original Data Sample Placement: samples striped over 100s or 1,000s of machines, both on disk and in-memory. BlinkDB Architecture
    • 66. SELECT foo (*) FROM TABLE WITHIN 2 Query Plan HiveQL/SQL Query Sample Selection TABLE Sampling Module In-Memory Samples On-Disk Samples Original Data BlinkDB Architecture 66
    • 67. SELECT foo (*) FROM TABLE WITHIN 2 Query Plan HiveQL/SQL Query Sample Selection TABLE Sampling Module In-Memory Samples On-Disk Samples Original Data Online sample selection to pick best sample(s) based on query latency and accuracy requirements BlinkDB Architecture 67
    • 68. TABLE Sampling Module In-Memory Samples On-Disk Samples Original Data Hive/Shark/Presto SELECT foo (*) FROM TABLE WITHIN 2 New Query Plan HiveQL/SQL Query Sample Selection Error Bars & Confidence Intervals Result 182.23 ± 5.56 (95% confidence) Parallel query execution on multiple samples striped across machines. BlinkDB Architecture
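The online sample-selection step above can be approximated with the CLT: a target error bound implies a minimum sample size, n ≈ (z·σ/ε)², and the planner can pick the smallest stored sample at least that large. A hedged sketch, not BlinkDB's actual planner; the stored sample sizes and σ are invented for illustration:

```python
import math

def required_sample_size(target_error, sigma, confidence_z=1.96):
    """CLT-based rows needed so the error bar is +/- target_error."""
    return math.ceil((confidence_z * sigma / target_error) ** 2)

stored_samples = [10_000, 100_000, 1_000_000]  # hypothetical sizes

def pick_sample(target_error, sigma):
    """Pick the smallest stored sample meeting the error requirement."""
    n = required_sample_size(target_error, sigma)
    return next((s for s in stored_samples if s >= n), None)

n_req = required_sample_size(0.1, 15.32)  # rows needed for +/- 0.1
print(n_req)
print(pick_sample(0.1, 15.32))
```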
    • 69. BlinkDB is Fast! - 5 Queries, 5 machines - 20 GB samples (0.001%-1% of original data) - 1-5% Error
    • 70. Overall Query Execution: response time (s) of query execution alone.
    • 71. Overall Query Execution: response time (s) including error-estimation overhead.
    • 72. Overall Query Execution: response time (s) including error-verification overhead.
    • 73. Coming Soon: Native Spark Integration
    • 74. BlinkDB Prototype 1. Alpha 0.2.0 released and available at http://blinkdb.org 2. Allows you to create samples on native tables and materialized views 3. Adds approximate aggregate functions with statistical closed forms to HiveQL 4. Compatible with Apache Hive, Spark and Facebook’s Presto (storage, serdes, UDFs, types, metadata)
    • 75. An Open Question We still haven’t figured out the right user interface for approximate queries: - Time/Error Bounds? - Continuous Error Bars? - Hide Errors Altogether? - UI/UX Specific? - Application Specific? - …
    • 76. http://blinkdb.org Native Spark Integration Coming Soon!