Querying Petabytes of Data in Seconds using Sampling
Sameer Agarwal
Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael Jordan, Samuel Madden, Ion Stoica
UC Berkeley, MIT
Can we do better than in-memory?
Can we get more with less?
Can fast get faster?
10 TB on 100 machines: Hard Disks, ½ - 1 Hour; Memory, 1 - 5 Minutes; ?, 1 second
Query Execution on Samples
ID City Buff Ratio
1 NYC 0.78
2 NYC 0.13
3 Berkeley 0.25
4 NYC 0.19
5 NYC 0.11
6 Berkeley 0.09
7 NYC 0.18
8 NYC 0.15
9 Berkeley 0.13
10 Berkeley 0.49
11 NYC 0.19
12 Berkeley 0.10
What is the average buffering ratio in the table?
Answer on the full table: 0.2325
Uniform Sample (sampling rate 1/4):
ID City Buff Ratio Sampling Rate
2 NYC 0.13 1/4
6 Berkeley 0.25 1/4
8 NYC 0.19 1/4
Estimate from the sample: 0.19 +/- 0.05 (answer on the full table: 0.2325)
Uniform Sample (sampling rate 1/2):
ID City Buff Ratio Sampling Rate
2 NYC 0.13 1/2
3 Berkeley 0.25 1/2
5 NYC 0.19 1/2
6 Berkeley 0.09 1/2
8 NYC 0.18 1/2
12 Berkeley 0.49 1/2
Estimate from the sample: 0.22 +/- 0.02 (the 1/4 sample gave 0.19 +/- 0.05; answer on the full table: 0.2325)
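The arithmetic behind these slides can be checked with a short sketch. It uses the Buff Ratio values exactly as they are listed in the full table and in the two sample tables; the variable and function names are illustrative only and are not part of BlinkDB.

```python
# Buffering ratios from the 12-row table, keyed by ID.
full_table = {
    1: 0.78, 2: 0.13, 3: 0.25, 4: 0.19, 5: 0.11, 6: 0.09,
    7: 0.18, 8: 0.15, 9: 0.13, 10: 0.49, 11: 0.19, 12: 0.10,
}

# Buff Ratio values as listed in the two uniform sample tables.
sample_quarter = [0.13, 0.25, 0.19]
sample_half = [0.13, 0.25, 0.19, 0.09, 0.18, 0.49]


def average(values):
    """Plain arithmetic mean."""
    values = list(values)
    return sum(values) / len(values)


exact = average(full_table.values())
estimate_quarter = average(sample_quarter)
estimate_half = average(sample_half)

print(f"full table: {exact:.4f}")
print(f"1/4 sample: {estimate_quarter:.2f}")
print(f"1/2 sample: {estimate_half:.2f}")
```

The larger sample lands closer to the exact answer, which is the trade-off the rest of the deck builds on.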
Speed/Accuracy Trade-off
[Figure: error vs. execution time (sample size). Interactive queries: 2 sec. Time to execute on entire dataset: 30 mins.]
Sampling Vs. No Sampling
[Figure: query response time (seconds) vs. fraction of full data (1, 10^-1, 10^-2, 10^-3, 10^-4, 10^-5). Response times: 1020, 103, 18, 13, 10, 8 seconds. Error bars on the sampled runs: 0.02%, 0.07%, 1.1%, 3.4%, 11%.]
10x, as response time is dominated by I/O
Sampling Error
Typically, error depends on sample size
(n) and not on original data size, i.e.,
error is proportional to (1/sqrt(n))*
* Conditions Apply
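A small sketch of the 1/sqrt(n) relationship, assuming independent rows with a known standard deviation (one of the conditions the footnote alludes to). The value of `sigma` is made up for illustration.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of a sample mean over n independent rows with std dev sigma."""
    return sigma / math.sqrt(n)


sigma = 0.2  # assumed spread of the underlying values, for illustration only
for n in (100, 400, 1600):
    # Quadrupling the sample size halves the error; the original data size never appears.
    print(n, round(standard_error(sigma, n), 4))
```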
Speed/Accuracy Trade-off
SELECT avg(sessionTime)
FROM Table
WHERE city='San Francisco'
WITHIN 1 SECONDS
Result: 234.23 ± 15.32
With WITHIN 2 SECONDS: 239.46 ± 4.96
Speed/Accuracy Trade-off
- AVG, COUNT, SUM, STDEV, PERCENTILE etc.
- FILTERS, GROUP BY clauses
- JOINS, Nested Queries etc.
- ML Primitives, User Defined Functions
SELECT my_function(sessionTime)
FROM Table
LEFT OUTER JOIN logs2
ON Table.id = logs2.id
WHERE city='San Francisco'
WITHIN 1 SECONDS
Speed/Accuracy Trade-off
SELECT avg(sessionTime)
FROM Table
WHERE city='San Francisco'
ERROR 0.1 CONFIDENCE 95.0%
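One way to read an ERROR/CONFIDENCE clause is as a requirement on sample size. The sketch below assumes the error is an absolute half-width on an average, that the Central Limit Theorem applies, and that the standard deviation `sigma` is known; the function name is hypothetical and this is not claimed to be how BlinkDB picks its samples.

```python
import math
from statistics import NormalDist


def required_sample_size(sigma: float, error: float, confidence: float) -> int:
    """Smallest n for which z * sigma / sqrt(n) <= error at the given confidence."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * sigma / error) ** 2)


# Assumed sigma of 1.0, purely for illustration.
print(required_sample_size(sigma=1.0, error=0.1, confidence=0.95))
```

Halving the error bound roughly quadruples the rows needed, which is why tighter bounds cost more time.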
What is BlinkDB?
A framework built on Spark that …
- creates and maintains a variety of uniform and stratified samples from underlying data
- returns fast, approximate answers with error bars by executing queries on samples of data
- verifies the correctness of the error bars that it returns at runtime
Uniform Samples
ID City Data
1 NYC 0.78
2 NYC 0.13
3 Berkeley 0.25
4 NYC 0.19
5 NYC 0.11
6 Berkeley 0.09
7 NYC 0.18
8 NYC 0.15
9 Berkeley 0.13
10 Berkeley 0.49
11 NYC 0.19
12 Berkeley 0.10
U:
1. FILTER rand() < 1/3
2. Adds per-row weights
3. In-memory Shuffle
ID City Data Weight
2 NYC 0.13 1/3
8 NYC 0.25 1/3
6 Berkeley 0.09 1/3
11 NYC 0.19 1/3
Doesn’t change Spark RDD Semantics
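The three steps of the uniform sampling operator can be sketched in plain Python. This is an illustration on an in-memory list, not BlinkDB's Spark implementation; `uniform_sample` and its parameters are hypothetical names.

```python
import random
from fractions import Fraction

# (ID, City, Data) rows from the 12-row table.
rows = [
    (1, "NYC", 0.78), (2, "NYC", 0.13), (3, "Berkeley", 0.25),
    (4, "NYC", 0.19), (5, "NYC", 0.11), (6, "Berkeley", 0.09),
    (7, "NYC", 0.18), (8, "NYC", 0.15), (9, "Berkeley", 0.13),
    (10, "Berkeley", 0.49), (11, "NYC", 0.19), (12, "Berkeley", 0.10),
]


def uniform_sample(rows, rate, seed=0):
    rng = random.Random(seed)
    # 1. FILTER rand() < rate: keep each row independently with probability `rate`.
    kept = [row for row in rows if rng.random() < float(rate)]
    # 2. Attach the sampling rate to every kept row as its weight.
    weighted = [row + (rate,) for row in kept]
    # 3. Shuffle the sampled rows in memory.
    rng.shuffle(weighted)
    return weighted


sample = uniform_sample(rows, Fraction(1, 3), seed=42)
for row in sample:
    print(row)
```

Because each row is kept independently, the sample size varies from run to run; it is about a third of the table only on average.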
Stratified Samples
ID City Data
1 NYC 0.78
2 NYC 0.13
3 Berkeley 0.25
4 NYC 0.19
5 NYC 0.11
6 Berkeley 0.09
7 NYC 0.18
8 NYC 0.15
9 Berkeley 0.13
10 Berkeley 0.49
11 NYC 0.19
12 Berkeley 0.10
S1: SPLIT
S2: GROUP
City Count Ratio
NYC 7 2/7
Berkeley 5 2/5
S2: JOIN, then U
ID City Data Weight
2 NYC 0.13 2/7
8 NYC 0.25 2/7
6 Berkeley 0.09 2/5
12 Berkeley 0.49 2/5
Doesn’t change Shark RDD Semantics
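A minimal sketch of the group, ratio, and sample steps, assuming a target of two rows per city as in the Ratio column. Unlike the pipeline on the slides, which applies the uniform operator with a per-group ratio, this simplified version draws exactly `per_group` rows from each group; `stratified_sample` is a hypothetical name.

```python
import random
from collections import defaultdict
from fractions import Fraction

# (ID, City, Data) rows from the 12-row table.
rows = [
    (1, "NYC", 0.78), (2, "NYC", 0.13), (3, "Berkeley", 0.25),
    (4, "NYC", 0.19), (5, "NYC", 0.11), (6, "Berkeley", 0.09),
    (7, "NYC", 0.18), (8, "NYC", 0.15), (9, "Berkeley", 0.13),
    (10, "Berkeley", 0.49), (11, "NYC", 0.19), (12, "Berkeley", 0.10),
]


def stratified_sample(rows, key_index, per_group, seed=0):
    rng = random.Random(seed)
    # GROUP: bucket the rows by the stratification column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    sample = []
    for members in groups.values():
        k = min(per_group, len(members))
        # Ratio = rows kept / rows in the group, e.g. 2/7 for NYC.
        ratio = Fraction(k, len(members))
        # JOIN the ratio back onto the chosen rows as their weight.
        for row in rng.sample(members, k):
            sample.append(row + (ratio,))
    return sample


sample = stratified_sample(rows, key_index=1, per_group=2)
for row in sample:
    print(row)
```

Rare groups keep a higher fraction of their rows, so a query filtered on a small city still has data to work with.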
Error Estimation
Closed Form Aggregate Functions
- Central Limit Theorem
- Applicable to AVG, COUNT, SUM, VARIANCE and STDEV
[Diagram: the aggregate A (AVG, SUM, COUNT, STDEV, VARIANCE) runs on the Sample and returns A ± ε]
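For AVG, a closed-form error bar based on the Central Limit Theorem can be sketched as follows. The input values are taken from the 1/4 sample table earlier in the deck; with only three rows the normal approximation is rough, so treat the output as illustrative and not as a reproduction of the error bars printed on the slides. `clt_interval` is a hypothetical helper.

```python
import math
from statistics import NormalDist, mean, stdev


def clt_interval(values, confidence=0.95):
    """Return (estimate, half_width) for the mean, using a normal approximation."""
    n = len(values)
    estimate = mean(values)
    # Standard error of the mean: sample standard deviation over sqrt(n).
    standard_error = stdev(values) / math.sqrt(n)
    # Two-sided critical value for the requested confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return estimate, z * standard_error


estimate, half_width = clt_interval([0.13, 0.25, 0.19])
print(f"{estimate:.2f} ± {half_width:.3f}")
```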
Error Estimation
Generalized Aggregate Functions
- Statistical Bootstrap
- Applicable to complex and nested queries, UDFs, joins etc.
[Diagram: Sample → A, A1, A2, …, A100 → B → ± ε]
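A minimal sketch of the statistical bootstrap for an arbitrary aggregate, here the median, for which no simple closed form applies. It resamples the sample with replacement 100 times, mirroring A1 … A100 in the diagram; `bootstrap_error` is a hypothetical name and the input values come from the 1/2 sample table earlier in the deck.

```python
import random
from statistics import median, stdev


def bootstrap_error(values, aggregate, resamples=100, seed=0):
    """Return (estimate, error), where error is the spread of the aggregate over resamples."""
    rng = random.Random(seed)
    n = len(values)
    estimates = []
    for _ in range(resamples):
        # Draw n rows with replacement from the sample.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        estimates.append(aggregate(resample))
    # The standard deviation of the resampled aggregates approximates the error.
    return aggregate(values), stdev(estimates)


estimate, error = bootstrap_error([0.13, 0.25, 0.19, 0.09, 0.18, 0.49], median)
print(f"median = {estimate:.3f}, bootstrap error = {error:.3f}")
```

The aggregate is treated as a black box, which is what makes the approach applicable to UDFs and nested queries.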
Error Verification
[Figure: error vs. sample size]
More Data → Higher Accuracy
300 Data Points → 97% Accuracy [KDD’13] [SIGMOD’14]
Single Pass Execution
Approximate Query on a Sample (Sample → A)
State Age Metric Weight
CA 20 1971 1/4
CA 22 2819 1/4
MA 22 3819 1/4
MA 30 3091 1/4
Resampling Operator (R)
Sample “Pushdown”
Metric Weight
1971 1/4
3819 1/4
Resampling Operator (R)
Resampling Operator (R), input:
Metric Weight
1971 1/4
3819 1/4
Resamples:
Metric Weight
1971 1/4
1971 1/4
Metric Weight
3819 1/4
3819 1/4
A, A1, …, An
Leverage Poissonized Resampling to generate samples with replacement
Sample from a Poisson (1) Distribution
Construct all Resamples in Single Pass (A1, …, Ak)
Metric Weight S1 Sk
1971 1/4 2 1
3819 1/4 1 0
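The idea of building every resample in one pass can be sketched as follows: each row draws one Poisson(1) count per resample, and each resample's running sums are updated immediately, so no resample is ever materialized. The Poisson sampler uses Knuth's multiplication method; the function names are hypothetical and the metric values are the four rows of the sample table above.

```python
import math
import random


def poisson1(rng):
    """Draw from a Poisson(1) distribution using Knuth's multiplication method."""
    threshold = math.exp(-1.0)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def poissonized_averages(metrics, resamples=100, seed=0):
    rng = random.Random(seed)
    weighted_sums = [0.0] * resamples
    weight_totals = [0] * resamples
    for metric in metrics:  # single pass over the sample
        for j in range(resamples):
            w = poisson1(rng)  # number of times this row appears in resample j
            weighted_sums[j] += w * metric
            weight_totals[j] += w
    # A resample that drew no rows at all has no average, so it is skipped.
    return [s / t for s, t in zip(weighted_sums, weight_totals) if t > 0]


averages = poissonized_averages([1971, 2819, 3819, 3091])
print(len(averages), min(averages), max(averages))
```

The per-row counts play the role of the S1 … Sk columns, and the per-resample averages correspond to A1 … Ak.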
Per-row columns: S (SAMPLE); S1 … S100 (BOOTSTRAP WEIGHTS); Da1 … Da100, Db1 … Db100, Dc1 … Dc100, … (DIAGNOSTICS WEIGHTS)
- Additional Overhead: 200 bytes/row
- Embarrassingly Parallel
What is BlinkDB?
A framework built on Spark that …
- creates and maintains a variety of uniform and stratified samples from underlying data [Offline Process]
- returns fast, approximate answers with error bars by executing queries on samples of data [Online Process]
- verifies the correctness of the error bars that it returns at runtime [Online Process]
BlinkDB Architecture
- Offline-sampling (Sampling Module, on the Original Data): Creates an optimal set of samples on native tables and materialized views based on query history and …
- Sample Placement (In-Memory Samples, On-Disk Samples): Samples striped over 100s or 1,000s of machines both on disks and in-memory.
- Sample Selection (Query Plan for a HiveQL/SQL Query such as SELECT foo (*) FROM TABLE WITHIN 2): Online sample selection to pick best sample(s) based on query latency and accuracy requirements
- New Query Plan (Hive/Shark/Presto): Parallel query execution on multiple samples striped across …
- Result with Error Bars & Confidence Intervals: 182.23 ± 5.56 (95% confidence)
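Online sample selection under a time bound can be pictured with a toy rule: among the samples that can be scanned in time, take the largest. The `Sample` fields, the throughput numbers and the rule itself are assumptions made for this sketch, not BlinkDB's actual selection logic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    name: str
    rows: int
    rows_per_second: float  # assumed scan throughput where this sample is stored


def pick_sample(samples: List[Sample], time_bound_seconds: float) -> Optional[Sample]:
    """Largest sample whose estimated scan time fits within the bound, if any."""
    feasible = [s for s in samples if s.rows / s.rows_per_second <= time_bound_seconds]
    # More rows means a smaller error, so prefer the biggest feasible sample.
    return max(feasible, key=lambda s: s.rows, default=None)


# Hypothetical catalog of samples.
samples = [
    Sample("small", 1_000_000, 1_000_000.0),    # about 1 s to scan
    Sample("medium", 5_000_000, 2_500_000.0),   # about 2 s to scan
    Sample("large", 100_000_000, 10_000_000.0), # about 10 s to scan
]

chosen = pick_sample(samples, time_bound_seconds=2)
print(chosen.name if chosen else "no sample fits")
```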
BlinkDB is Fast!
- 5 Queries, 5 machines
- 20 GB samples (0.001%-1% of original data)
- 1-5% Error
[Figures: response time (s) for overall query execution, showing query execution, error estimation overhead and error verification overhead]
Coming Soon: Native Spark Integration
BlinkDB Prototype
1. Alpha 0.2.0 released and available at http://blinkdb.org
2. Allows you to create samples on native tables and materialized views
3. Adds approximate aggregate functions with statistical closed forms to HiveQL
4. Compatible with Apache Hive, Spark and Facebook’s Presto (storage, serdes, UDFs, types, metadata)
An Open Question
We still haven’t figured out the right user interface for approximate queries:
- Time/Error Bounds?
- Continuous Error Bars?
- Hide Errors Altogether?
- UI/UX Specific?
- Application Specific?
- …
http://blinkdb.org
Native Spark Integration Coming Soon!

 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 

Recently uploaded (20)

Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 

BlinkDB: Querying Petabytes of Data in Seconds using Sampling

  • 1. Querying Petabytes of Data in Seconds using Sampling. Sameer Agarwal, UC Berkeley. Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael Jordan, Samuel Madden, Ion Stoica. UC Berkeley, MIT
  • 2. Can we do better than in-memory?
  • 3. Can we get more with less?
  • 4. Can fast get faster?
  • 5. 10 TB on 100 machines. Hard Disks: ½ - 1 Hour. Memory: 1 - 5 Minutes. Query Execution on Samples: 1 second?
  • 6. Query Execution on Samples. What is the average buffering ratio in the table? Exact answer: 0.2325

        ID  City      Buff Ratio
        1   NYC       0.78
        2   NYC       0.13
        3   Berkeley  0.25
        4   NYC       0.19
        5   NYC       0.11
        6   Berkeley  0.09
        7   NYC       0.18
        8   NYC       0.15
        9   Berkeley  0.13
        10  Berkeley  0.49
        11  NYC       0.19
        12  Berkeley  0.10

  • 7. Query Execution on Samples (same table and question). Uniform Sample: 0.19 (exact answer: 0.2325)

        ID  City      Buff Ratio  Sampling Rate
        2   NYC       0.13        1/4
        6   Berkeley  0.25        1/4
        8   NYC       0.19        1/4

  • 8. Query Execution on Samples (same table, same Uniform Sample at Sampling Rate 1/4). Answer with error bar: 0.19 +/- 0.05 (exact answer: 0.2325)
  • 9. Query Execution on Samples (same table and question). Uniform Sample: 0.22 +/- 0.02, compared with 0.19 +/- 0.05 at Sampling Rate 1/4 (exact answer: 0.2325)

        ID  City      Buff Ratio  Sampling Rate
        2   NYC       0.13        1/2
        3   Berkeley  0.25        1/2
        5   NYC       0.19        1/2
        6   Berkeley  0.09        1/2
        8   NYC       0.18        1/2
        12  Berkeley  0.49        1/2
  • 10. Speed/Accuracy Trade-off. [Plot: Error vs. Execution Time (Sample Size). Interactive Queries: 2 sec. Time to Execute on Entire Dataset: 30 mins]
  • 11. Sampling Vs. No Sampling. [Plot: Query Response Time (Seconds) vs. Fraction of full data. Fraction 1: 1020; 10^-1: 103; 10^-2: 18; 10^-3: 13; 10^-4: 10; 10^-5: 8] 10x as response time is dominated by I/O
  • 12. Sampling Vs. No Sampling. [Same plot with Error Bars. 10^-1: 0.02%; 10^-2: 0.07%; 10^-3: 1.1%; 10^-4: 3.4%; 10^-5: 11%]
  • 13. Sampling Error: Typically, error depends on sample size (n) and not on original data size, i.e., error is proportional to 1/sqrt(n)*
  • 14. Sampling Error: Typically, error depends on sample size (n) and not on original data size, i.e., error is proportional to 1/sqrt(n)* (* Conditions Apply)
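
The 1/sqrt(n) claim on the Sampling Error slides can be checked numerically. What follows is a minimal Python sketch, not BlinkDB code: the synthetic population, the function names, and the sample sizes are assumptions made purely for illustration. It draws uniform samples of growing size and reports the estimated standard error of the mean.

```python
import math
import random
import statistics


def standard_error(sample):
    """Estimated standard error of the sample mean: s / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


def error_vs_sample_size(population, sizes, seed=0):
    """Map each sample size n to the standard error of a uniform sample of size n."""
    rng = random.Random(seed)
    return {n: standard_error(rng.sample(population, n)) for n in sizes}


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical population of session times in seconds (not data from the talk).
    population = [rng.expovariate(1 / 240.0) for _ in range(200_000)]
    for n, err in error_vs_sample_size(population, [100, 400, 1600, 6400]).items():
        print(f"n={n:5d}  standard error ~ {err:.2f}")
```

Under these assumptions, each fourfold increase in n should roughly halve the reported error, and the population size never enters the formula.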
  • 16. Speed/Accuracy Trade-off: SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' WITHIN 1 SECONDS. Result: 234.23 ± 15.32
  • 17. Speed/Accuracy Trade-off: SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' WITHIN 2 SECONDS. Result: 239.46 ± 4.96 (previous result: 234.23 ± 15.32)
  • 18. Speed/Accuracy Trade-off: SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' WITHIN 1 SECONDS (AVG, COUNT, SUM, STDEV, PERCENTILE etc.)
  • 19. Speed/Accuracy Trade-off: SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' WITHIN 1 SECONDS (FILTERS, GROUP BY clauses)
  • 20. Speed/Accuracy Trade-off: SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' LEFT OUTER JOIN logs2 ON very_big_log.id = logs.id WITHIN 1 SECONDS (JOINS, Nested Queries etc.)
  • 21. Speed/Accuracy Trade-off: SELECT my_function(sessionTime) FROM Table WHERE city='San Francisco' LEFT OUTER JOIN logs2 ON very_big_log.id = logs.id WITHIN 1 SECONDS (ML Primitives, User Defined Functions)
  • 22. Speed/Accuracy Trade-off: SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' ERROR 0.1 CONFIDENCE 95.0%
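
An error bound such as ERROR 0.1 CONFIDENCE 95.0% can be turned into a sample size if the error is estimated with the Central Limit Theorem. The sketch below is an illustration only: it assumes the bound is a relative error on an average, and that rough estimates of the column's mean and standard deviation are already available (for example from a small pilot sample). The function name and the example numbers are hypothetical and do not come from BlinkDB.

```python
import math
from statistics import NormalDist


def required_sample_size(mean, stdev, rel_error, confidence):
    """Smallest n whose CLT half-width z * stdev / sqrt(n) is <= rel_error * |mean|."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return math.ceil((z * stdev / (rel_error * abs(mean))) ** 2)


if __name__ == "__main__":
    # Hypothetical pilot estimates: mean session time 240 s, standard deviation 120 s.
    print(required_sample_size(mean=240.0, stdev=120.0, rel_error=0.1, confidence=0.95))
```

With these made-up pilot numbers the function returns 97 rows. The required size depends on the spread of the data and the requested bound, not on the size of the full table.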
  • 23. What is BlinkDB? A framework built on Spark that: (1) creates and maintains a variety of uniform and stratified samples from underlying data; (2) returns fast, approximate answers with error bars by executing queries on samples of data; (3) verifies the correctness of the error bars that it returns at runtime
  • 27. Uniform Samples. Input table (ID, City, Data): the same 12 rows as on slide 6
  • 28. Uniform Samples. Steps: 1. FILTER rand() < 1/3; 2. Adds per-row weights; 3. In-memory Shuffle
  • 29. Uniform Samples. Resulting sample (doesn’t change Spark RDD Semantics):

        ID  City      Data  Weight
        2   NYC       0.13  1/3
        8   NYC       0.25  1/3
        6   Berkeley  0.09  1/3
        11  NYC       0.19  1/3
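
The three steps listed on the Uniform Samples slides (filter on rand(), attach a per-row weight, shuffle) can be sketched in plain Python. This is an illustrative sketch rather than BlinkDB's Spark implementation: the function names, the dictionary row layout, and the choice to store the sampling rate as the weight (mirroring the Weight column on the slide) are assumptions.

```python
import random

# The 12-row example table from the slides: (id, city, data).
TABLE = [
    {"id": i, "city": c, "data": d}
    for i, c, d in [
        (1, "NYC", 0.78), (2, "NYC", 0.13), (3, "Berkeley", 0.25), (4, "NYC", 0.19),
        (5, "NYC", 0.11), (6, "Berkeley", 0.09), (7, "NYC", 0.18), (8, "NYC", 0.15),
        (9, "Berkeley", 0.13), (10, "Berkeley", 0.49), (11, "NYC", 0.19), (12, "Berkeley", 0.10),
    ]
]


def uniform_sample(rows, rate, seed=None):
    """Keep each row with probability `rate`, tag it with that rate, then shuffle."""
    rng = random.Random(seed)
    sample = [dict(row, weight=rate) for row in rows if rng.random() < rate]  # steps 1 and 2
    rng.shuffle(sample)                                                        # step 3
    return sample


def weighted_avg(sample, column):
    """Average of `column`, scaling each row by 1 / weight (its inverse sampling rate)."""
    total = sum(row[column] / row["weight"] for row in sample)
    return total / sum(1.0 / row["weight"] for row in sample)


if __name__ == "__main__":
    sample = uniform_sample(TABLE, rate=1 / 3, seed=7)
    if sample:
        print(len(sample), "rows sampled; estimated average:", weighted_avg(sample, "data"))
```

Because the filter is an independent coin flip per row, the sample size varies from run to run; only its expected value is rate times the table size.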
  • 32. Stratified Samples. Input table (ID, City, Data): the same 12 rows as on slide 6
  • 33. Stratified Samples. SPLIT: the input table is duplicated; one copy is S1 (all 12 rows)
  • 34. Stratified Samples. GROUP: S2 holds the row Count per City: NYC 7, Berkeley 5
  • 35. Stratified Samples. GROUP: S2 gains a Ratio column: NYC 7, 2/7; Berkeley 5, 2/5
  • 36. Stratified Samples. JOIN: S1 is joined with S2
  • 37. Stratified Samples. U: resulting sample (doesn’t change Shark RDD Semantics):

        ID  City      Data  Weight
        2   NYC       0.13  2/7
        8   NYC       0.25  2/7
        6   Berkeley  0.09  2/5
        12  Berkeley  0.49  2/5
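
The SPLIT, GROUP, JOIN, and filter sequence on the Stratified Samples slides can be sketched as follows. This is a plain-Python illustration, not the Spark/Shark operators from the talk: the function name, the `cap` parameter (the target of 2 rows per city implied by the ratios 2/7 and 2/5), and the use of an independent coin flip per row are assumptions.

```python
import random
from collections import Counter

# The 12-row example table from the slides: (id, city, data).
TABLE = [
    {"id": i, "city": c, "data": d}
    for i, c, d in [
        (1, "NYC", 0.78), (2, "NYC", 0.13), (3, "Berkeley", 0.25), (4, "NYC", 0.19),
        (5, "NYC", 0.11), (6, "Berkeley", 0.09), (7, "NYC", 0.18), (8, "NYC", 0.15),
        (9, "Berkeley", 0.13), (10, "Berkeley", 0.49), (11, "NYC", 0.19), (12, "Berkeley", 0.10),
    ]
]


def stratified_sample(rows, key, cap, seed=None):
    """Sample each group at rate min(1, cap / group_size) and record that rate as the weight."""
    rng = random.Random(seed)
    counts = Counter(row[key] for row in rows)                        # GROUP: count per key
    ratios = {k: min(1.0, cap / n) for k, n in counts.items()}        # Ratio column
    sample = []
    for row in rows:                                                  # JOIN rows with ratios
        ratio = ratios[row[key]]
        if rng.random() < ratio:                                      # per-row filter
            sample.append(dict(row, weight=ratio))
    return sample


if __name__ == "__main__":
    for row in stratified_sample(TABLE, key="city", cap=2, seed=3):
        print(row)
```

Rare groups get a higher sampling rate than common ones, so a query that filters on a small city still finds rows in the sample. A variant that returns exactly `cap` rows per group would need a different selection step than the coin flip used here.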
  • 39. Error Estimation: Closed Form Aggregate Functions. Central Limit Theorem; applicable to AVG, COUNT, SUM, VARIANCE and STDEV
  • 41. Error Estimation: Closed Form Aggregate Functions. [Diagram: a plan computing aggregate A (AVG, SUM, COUNT, STDEV, VARIANCE) over the sample is rewritten to return A ± ε]
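
For AVG, the closed-form error bar follows from the Central Limit Theorem: the half-width is z times the sample standard deviation divided by sqrt(n). The sketch below illustrates that formula only; the function name is hypothetical, and the talk does not say which confidence level or variance estimator produced the ± values on its slides, so the numbers here are not expected to match them.

```python
import math
from statistics import NormalDist, mean, stdev


def avg_with_error(values, confidence=0.95):
    """Return (sample mean, CLT half-width) for the average of `values`."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)   # two-sided critical value
    half_width = z * stdev(values) / math.sqrt(len(values))
    return mean(values), half_width


if __name__ == "__main__":
    # Buff Ratio values of the 3-row uniform sample shown earlier in the deck.
    estimate, err = avg_with_error([0.13, 0.25, 0.19])
    print(f"{estimate:.2f} +/- {err:.3f}")
```

With only three rows the normal approximation is crude; the CLT argument becomes reliable as the sample grows, which is one reason the talk also verifies error bars at runtime.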
  • 42. Error Estimation: Generalized Aggregate Functions. Statistical Bootstrap; applicable to complex and nested queries, UDFs, joins etc.
  • 43. Error Estimation: Generalized Aggregate Functions. [Diagram: aggregate A is run on 100 resamples of the sample (A1, A2, …, A100), and B combines the results into an estimate ± ε]
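
The bootstrap idea on these slides is to rerun the query on many resamples drawn with replacement from the sample and to use the spread of the results as the error. Below is a minimal Python sketch under stated assumptions: the function name is hypothetical, the query is any Python callable over a list, and the error is reported as the standard deviation of the resample estimates (one common choice; the talk does not specify which summary B computes).

```python
import random
import statistics


def bootstrap_error(sample, query, num_resamples=100, seed=None):
    """Return (query(sample), standard deviation of query over bootstrap resamples)."""
    rng = random.Random(seed)
    n = len(sample)
    # Each resample draws n rows from the sample with replacement.
    estimates = [query(rng.choices(sample, k=n)) for _ in range(num_resamples)]
    return query(sample), statistics.stdev(estimates)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical sample of 500 session times; the median has no simple closed form.
    sample = [rng.expovariate(1 / 240.0) for _ in range(500)]
    estimate, err = bootstrap_error(sample, statistics.median, seed=1)
    print(f"median ~ {estimate:.1f} +/- {err:.1f}")
```

The cost is the motivation for the single-pass execution slides that follow: run naively, 100 resamples mean 100 extra executions of the query.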
  • 45. Error Verification. [Plot: Error vs. Sample Size] More Data → Higher Accuracy; 300 Data Points → 97% Accuracy [KDD’13] [SIGMOD’14]
  • 47. Single Pass Execution. Plan: aggregate A over the Sample:

        State  Age  Metric  Weight
        CA     20   1971    1/4
        CA     22   2819    1/4
        MA     22   3819    1/4
        MA     30   3091    1/4

  • 48. Single Pass Execution. A Resampling Operator (R) is added to the plan, over the same Sample
  • 49. Single Pass Execution. Sample “Pushdown”, over the same Sample
  • 50. Single Pass Execution. After Sample “Pushdown”, the rows reaching R are (Metric, Weight): (1971, 1/4), (3819, 1/4)
  • 51. Single Pass Execution. Resampling Operator applied to (Metric, Weight): (1971, 1/4), (3819, 1/4)
  • 52. Single Pass Execution. The Resampling Operator emits resamples of (Metric, Weight): {(1971, 1/4), (1971, 1/4)} and {(3819, 1/4), (3819, 1/4)}
  • 53. Single Pass Execution. The aggregate runs on the sample and on each resample: A, A1, …, An
  • 55. Single Pass Execution. Leverage Poissonized Resampling to generate samples with replacement. Sample (Metric, Weight): (1971, 1/4), (3819, 1/4)
  • 56. Single Pass Execution. Sample from a Poisson (1) Distribution: column S1 holds the per-row draws for A1: 1971 → 2, 3819 → 1
  • 57. Single Pass Execution. Construct all Resamples in Single Pass: columns S1 … Sk feed A1 … Ak: 1971 → (S1 = 2, Sk = 1), 3819 → (S1 = 1, Sk = 0)
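
Poissonized resampling replaces "draw n rows with replacement" by "give every row an independent Poisson(1) multiplicity per resample", which lets all resamples be built in one scan. The sketch below illustrates this for an average. It is not BlinkDB's operator: the function names are hypothetical, the Poisson draw uses Knuth's multiplication method because the Python standard library has no Poisson sampler, and per-row sampling weights are omitted for brevity.

```python
import math
import random
import statistics


def poisson1(rng):
    """Draw from Poisson(1) using Knuth's multiplication method."""
    threshold = math.exp(-1.0)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def poissonized_avg(values, num_resamples=100, seed=None):
    """One pass over `values`; returns (mean, stdev) of the resampled averages."""
    rng = random.Random(seed)
    sums = [0.0] * num_resamples
    counts = [0] * num_resamples
    for value in values:                       # single scan of the sample
        for k in range(num_resamples):
            c = poisson1(rng)                  # multiplicity of this row in resample k
            sums[k] += c * value
            counts[k] += c
    # Skip the (very unlikely) resamples that ended up empty.
    estimates = [s / c for s, c in zip(sums, counts) if c > 0]
    return statistics.mean(estimates), statistics.stdev(estimates)


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [rng.expovariate(1 / 240.0) for _ in range(1000)]   # hypothetical data
    print(poissonized_avg(sample, num_resamples=100, seed=1))
```

Each row only needs its own k counters, so the work for different rows is independent, which is consistent with the "Embarrassingly Parallel" remark later in the deck.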
  • 58. Single Pass Execution. Each row carries SAMPLE (S), BOOTSTRAP WEIGHTS (S1 … S100) and DIAGNOSTICS WEIGHTS (Da1 … Da100, Db1 … Db100, Dc1 … Dc100)
  • 59. Single Pass Execution. Additional Overhead: 200 bytes/row
  • 60. Single Pass Execution. Embarrassingly Parallel
  • 62. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime [Offline Process]
  • 63. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime [Online Process]
  • 64. BlinkDB Architecture. Components: TABLE, Sampling Module, Original Data. Offline-sampling: Creates an optimal set of samples on native tables and materialized views based on query history and …
  • 65. BlinkDB Architecture. Components: TABLE, Sampling Module, In-Memory Samples, On-Disk Samples, Original Data. Sample Placement: Samples striped over 100s or 1,000s of machines both on disks and in-memory.
  • 66. BlinkDB Architecture. A HiveQL/SQL Query (SELECT foo (*) FROM TABLE WITHIN 2) passes through Sample Selection to produce a Query Plan. Components: TABLE, Sampling Module, In-Memory Samples, On-Disk Samples, Original Data
  • 67. BlinkDB Architecture. Same components and query as slide 66. Online sample selection to pick best sample(s) based on query latency and accuracy requirements
  • 68. BlinkDB Architecture. The New Query Plan runs on Hive/Shark/Presto and returns a Result with Error Bars & Confidence Intervals: 182.23 ± 5.56 (95% confidence). Parallel query execution on multiple samples striped across …
  • 69. BlinkDB is Fast! - 5 Queries, 5 machines - 20 GB samples (0.001%-1% of original data) - 1-5% Error
  • 73. Coming Soon: Native Spark Integration
  • 74. BlinkDB Prototype 1. Alpha 0.2.0 released and available at http://blinkdb.org 2. Allows you to create samples on native tables and materialized views 3. Adds approximate aggregate functions with statistical closed forms to HiveQL 4. Compatible with Apache Hive, Spark and Facebook’s Presto (storage, serdes, UDFs, types, metadata)
  • 75. An Open Question: We still haven’t figured out the right user interface for approximate queries: Time/Error Bounds? Continuous Error Bars? Hide Errors Altogether? UI/UX Specific? Application Specific? …

Editor's Notes

  1. Now with Shark and Spark really pushing the limits of in-memory computations, one of the natural questions that comes to mind is: “Can we do better than in-memory?” And being better than in-memory could mean one or both of these two things.
  2. And if we focus on this error, the error has an amazing statistical property. Error decreases with Moore’s law: it halves every 36 months!
  3. Original data: 2TB-2PB