Approximate Queries on Very Large Data
UC Berkeley
Sameer Agarwal
Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar,
Michael Jordan, Samuel Madden, Ion Stoica
MIT
About Me
1. Software Engineer at Databricks in San Francisco, CA
2. PhD in Databases from University of California,
Berkeley 2014.
3. Actively work as part of the open source community
and the AMPLab to create BDAS (Berkeley Data
Analytics Stack), which comprises Apache Spark,
Tachyon, BlinkDB, Mesos etc.
10 TB on 100 machines
Hard Disks: ½ - 1 Hour
Memory: 1 - 5 Minutes
?: 1 second
Query Execution on Samples
ID City Buff Ratio
1 NYC 0.78
2 NYC 0.13
3 Berkeley 0.25
4 NYC 0.19
5 NYC 0.11
6 Berkeley 0.09
7 NYC 0.18
8 NYC 0.15
9 Berkeley 0.13
10 Berkeley 0.49
11 NYC 0.19
12 Berkeley 0.10
What is the average buffering ratio in the table?
Exact answer: 0.2325
Uniform sample (sampling rate 1/4):
ID City Buff Ratio Sampling Rate
2 NYC 0.13 1/4
3 Berkeley 0.25 1/4
4 NYC 0.19 1/4
Estimate from the sample: 0.19 (exact answer: 0.2325)
Estimate with error bar: 0.19 ± 0.05 (exact answer: 0.2325)
Uniform sample (sampling rate 1/2):
ID City Buff Ratio Sampling Rate
2 NYC 0.13 1/2
3 Berkeley 0.25 1/2
4 NYC 0.19 1/2
6 Berkeley 0.09 1/2
7 NYC 0.18 1/2
10 Berkeley 0.49 1/2
Estimate with error bar: 0.22 ± 0.02 (exact answer: 0.2325; the 1/4 sample gave 0.19 ± 0.05)
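The arithmetic behind these slides can be checked with a minimal sketch. It only averages the buffering ratios shown in the table and in the two uniform samples. The variable names are illustrative, and the error bars on the slides are not reproduced here.

```python
from statistics import mean

# Buffering ratios of the 12 rows in the example table, in ID order.
full_table = [0.78, 0.13, 0.25, 0.19, 0.11, 0.09,
              0.18, 0.15, 0.13, 0.49, 0.19, 0.10]

# Buffering ratios listed in the two uniform samples on the slides.
quarter_sample = [0.13, 0.25, 0.19]                   # sampling rate 1/4
half_sample = [0.13, 0.25, 0.19, 0.09, 0.18, 0.49]    # sampling rate 1/2

exact = mean(full_table)
quarter_estimate = mean(quarter_sample)
half_estimate = mean(half_sample)

print(round(exact, 4), round(quarter_estimate, 2), round(half_estimate, 2))
```

The larger sample lands closer to the exact answer, which is the trade-off the following slides explore.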
Speed/Accuracy Trade-off
[Figure: error versus execution time (sample size). Interactive queries: 2 sec. Time to execute on entire dataset: 30 mins.]
Sampling vs. No Sampling
[Figure: query response time (seconds) versus fraction of full data (1, 10^-1, 10^-2, 10^-3, 10^-4, 10^-5). Response times: 1020, 103, 18, 13, 10, 8. Annotation: 10x as response time is dominated by I/O.]
[Figure: the same plot with error bars on the sampled runs, in the order shown: (0.02%), (0.07%), (1.1%), (3.4%), (11%).]
Sampling Error
Typically, error depends on sample size (n)
and not on original data size, i.e., error is
proportional to (1/sqrt(n))*
* Conditions Apply
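A minimal sketch of the 1/sqrt(n) relationship follows, assuming the aggregate is a mean and the rows are sampled independently. The standard deviation `sigma` is a made-up value used only for illustration.

```python
import math

def standard_error(stddev, n):
    """Standard error of a sample mean: stddev / sqrt(n)."""
    return stddev / math.sqrt(n)

sigma = 0.2  # hypothetical standard deviation of the underlying column
errors = {n: standard_error(sigma, n) for n in (100, 400, 1600)}

for n, err in errors.items():
    print(n, round(err, 4))
```

Quadrupling the sample size halves the error, whatever the size of the original table.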
Speed/Accuracy Trade-off
SELECT avg(sessionTime)
FROM Table
WHERE city='San Francisco'
WITHIN 1 SECONDS
Result: 234.23 ± 15.32
SELECT avg(sessionTime)
FROM Table
WHERE city='San Francisco'
WITHIN 2 SECONDS
Result: 239.46 ± 4.96 (compared with 234.23 ± 15.32 WITHIN 1 SECONDS)
SELECT avg(sessionTime)
FROM Table
WHERE city='San Francisco'
WITHIN 1 SECONDS
Aggregates: AVG, COUNT, SUM, STDEV, PERCENTILE etc.
Also: FILTERS, GROUP BY clauses
SELECT avg(sessionTime)
FROM Table
LEFT OUTER JOIN logs2
ON very_big_log.id = logs.id
WHERE city='San Francisco'
WITHIN 1 SECONDS
Also: JOINS, Nested Queries etc.
SELECT my_function(sessionTime)
FROM Table
LEFT OUTER JOIN logs2
ON very_big_log.id = logs.id
WHERE city='San Francisco'
WITHIN 1 SECONDS
Also: ML Primitives, User Defined Functions
SELECT avg(sessionTime)
FROM Table
WHERE city='San Francisco'
ERROR 0.1 CONFIDENCE 95.0%
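To give a feel for what an error bound with a confidence level implies, here is a rough sketch of the textbook sample-size calculation for a mean. It is not BlinkDB's planner. It treats ERROR as an absolute half-width, which is an assumption, and `required_sample_size` and the standard deviation of 2.0 are hypothetical.

```python
import math
from statistics import NormalDist

def required_sample_size(stddev, error, confidence):
    """Smallest n with z * stddev / sqrt(n) <= error (normal approximation)."""
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return math.ceil((z * stddev / error) ** 2)

# Hypothetical column with standard deviation 2.0, ERROR 0.1 CONFIDENCE 95.0%.
n = required_sample_size(stddev=2.0, error=0.1, confidence=0.95)
print(n)
```

Halving the error bound roughly quadruples the number of rows needed, which mirrors the time-bound form of the query: more time allows a larger sample and a tighter answer.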
What is BlinkDB?
A framework built on Spark that …
- creates and maintains a variety of uniform and
stratified samples from underlying data
- returns fast, approximate answers with error bars
by executing queries on samples of data
- verifies the correctness of the error bars that it
returns at runtime
Uniform Samples
ID City Data
1 NYC 0.78
2 NYC 0.13
3 Berkeley 0.25
4 NYC 0.19
5 NYC 0.11
6 Berkeley 0.09
7 NYC 0.18
8 NYC 0.15
9 Berkeley 0.13
10 Berkeley 0.49
11 NYC 0.19
12 Berkeley 0.10
1. FILTER rand() < 1/3
2. Adds per-row weights
3. In-memory shuffle
ID City Data Weight
2 NYC 0.13 1/3
8 NYC 0.15 1/3
6 Berkeley 0.09 1/3
11 NYC 0.19 1/3
Doesn’t change Spark RDD semantics
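The three steps for a uniform sample (filter on rand(), add per-row weights, shuffle) can be sketched in plain Python. This is an illustration on the 12-row example table, not BlinkDB or Spark code. The function name `uniform_sample` and the seed are assumptions.

```python
import random
from fractions import Fraction

# (ID, City, Data) rows of the example table.
rows = [
    (1, "NYC", 0.78), (2, "NYC", 0.13), (3, "Berkeley", 0.25),
    (4, "NYC", 0.19), (5, "NYC", 0.11), (6, "Berkeley", 0.09),
    (7, "NYC", 0.18), (8, "NYC", 0.15), (9, "Berkeley", 0.13),
    (10, "Berkeley", 0.49), (11, "NYC", 0.19), (12, "Berkeley", 0.10),
]

def uniform_sample(rows, rate, rng):
    """Keep each row with probability `rate`, attach the rate as its weight, shuffle."""
    kept = [row + (rate,) for row in rows if rng.random() < rate]  # steps 1 and 2
    rng.shuffle(kept)                                              # step 3
    return kept

sample = uniform_sample(rows, Fraction(1, 3), random.Random(0))
for row in sample:
    print(row)
```

Because each row is kept independently, the sample size varies from run to run. Only the expected size is one third of the table.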
Stratified Samples
ID City Data
1 NYC 0.78
2 NYC 0.13
3 Berkeley 0.25
4 NYC 0.19
5 NYC 0.11
6 Berkeley 0.09
7 NYC 0.18
8 NYC 0.15
9 Berkeley 0.13
10 Berkeley 0.49
11 NYC 0.19
12 Berkeley 0.10
S1: SPLIT
S2: GROUP
City Count Ratio
NYC 7 2/7
Berkeley 5 2/5
S2: JOIN
U: uniform sample of each group at its ratio
ID City Data Weight
2 NYC 0.13 2/7
8 NYC 0.15 2/7
6 Berkeley 0.09 2/5
12 Berkeley 0.10 2/5
Doesn’t change Shark RDD semantics
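The stratified pipeline (group by city, derive a per-group ratio, join the ratio back onto the rows, then sample each row at its group's ratio) can be sketched as follows. This is plain Python rather than BlinkDB code. The cap of 2 rows per group is inferred from the ratios 2/7 and 2/5 on the slide and is an assumption, as is the name `stratified_sample`.

```python
import random
from collections import Counter
from fractions import Fraction

# (ID, City, Data) rows of the example table.
rows = [
    (1, "NYC", 0.78), (2, "NYC", 0.13), (3, "Berkeley", 0.25),
    (4, "NYC", 0.19), (5, "NYC", 0.11), (6, "Berkeley", 0.09),
    (7, "NYC", 0.18), (8, "NYC", 0.15), (9, "Berkeley", 0.13),
    (10, "Berkeley", 0.49), (11, "NYC", 0.19), (12, "Berkeley", 0.10),
]

def stratified_sample(rows, cap, rng):
    """Sample each city at ratio cap / count, tagging rows with that ratio."""
    # GROUP: number of rows per city.
    counts = Counter(city for _, city, _ in rows)
    # Ratio per city, never above 1 for groups smaller than the cap.
    ratios = {city: Fraction(min(cap, n), n) for city, n in counts.items()}
    # JOIN the ratio onto every row, then filter uniformly at that ratio.
    kept = [(i, city, value, ratios[city])
            for i, city, value in rows
            if rng.random() < ratios[city]]
    return ratios, kept

ratios, sample = stratified_sample(rows, cap=2, rng=random.Random(0))
print(ratios)
```

Rare groups get a higher ratio than common ones, so a small city is less likely to vanish from the sample than it would be under a single uniform rate.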
Error Estimation
Closed Form Aggregate Functions
- Central Limit Theorem
- Applicable to AVG, COUNT, SUM,
VARIANCE and STDEV
The following results are (asymptotically in sample […] directly useful, since they depend on unknown properties […] distribution. In all cases we just plug in the sample values. […] of $\mu$ we use $\frac{1}{n}\sum_{i=1}^{n} X_i$ where $X_i$ is the $i$th sample value.
Note that for estimators other than sum and count, […] ($p = 1$). Filtering will increase variance a bit, or potentia[…] selective queries ($p = 0$). I can compute the filtering-adju[…]
1. Count: $\mathcal{N}\left(np,\; n(1-p)p\right)$
2. Sum: $\mathcal{N}\left(np\mu,\; np\left(\sigma^2 + (1-p)\mu^2\right)\right)$
3. Mean: $\mathcal{N}\left(\mu,\; \sigma^2/n\right)$
4. Variance: $\mathcal{N}\left(\sigma^2,\; (\mu_4 - \sigma^4)/n\right)$
5. Stddev: $\mathcal{N}\left(\sigma,\; (\mu_4 - \sigma^4)/(4\sigma^2 n)\right)$
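As a rough illustration of how the Count and Mean closed forms turn into error bars, the sketch below plugs sample values into the normal approximations N(np, np(1-p)) and N(mu, sigma^2/n). The function names are hypothetical, and the six input values are the ones from the 1/2 sample shown earlier in the deck.

```python
import math
from statistics import NormalDist, mean, stdev

def z_score(confidence):
    """Two-sided z-score for a confidence level such as 0.95."""
    return NormalDist().inv_cdf(0.5 + confidence / 2.0)

def mean_with_error(values, confidence=0.95):
    """Sample mean and half-width z * s / sqrt(n), using the sample stddev for sigma."""
    n = len(values)
    return mean(values), z_score(confidence) * stdev(values) / math.sqrt(n)

def count_with_error(n, p, confidence=0.95):
    """Expected count n * p and half-width z * sqrt(n * p * (1 - p))."""
    return n * p, z_score(confidence) * math.sqrt(n * p * (1 - p))

avg, avg_err = mean_with_error([0.13, 0.25, 0.19, 0.09, 0.18, 0.49])
cnt, cnt_err = count_with_error(n=100, p=0.5)
print(round(avg, 4), round(cnt, 1), round(cnt_err, 2))
```

With only six rows the normal approximation is crude, so the resulting half-width should be read as indicative rather than as a faithful reproduction of the numbers on the slides.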
[Figure: an aggregate A (AVG, SUM, COUNT, STDEV, VARIANCE) is computed on the sample and returned as A ± ε.]
Error Estimation
Generalized Aggregate Functions
- Statistical Bootstrap
- Applicable to complex and nested queries, UDFs,
joins etc.
[Figure: the aggregate A is computed on the sample and on resamples of it (A1, A2, …, A100); the bootstrap (B) turns these into A ± ε.]
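A minimal sketch of the statistical bootstrap follows, assuming the error bar is taken as the standard deviation of the aggregate over 100 resamples. The aggregate here is a median, which has no simple closed form. The function name and the seed are illustrative, and this is not BlinkDB's implementation.

```python
import random
from statistics import median, stdev

def bootstrap_error(sample, aggregate, resamples, rng):
    """Standard deviation of `aggregate` over resamples drawn with replacement."""
    estimates = [
        aggregate(rng.choices(sample, k=len(sample)))  # one resample, same size
        for _ in range(resamples)
    ]
    return stdev(estimates)

sample = [0.13, 0.25, 0.19, 0.09, 0.18, 0.49]
estimate = median(sample)
error = bootstrap_error(sample, median, resamples=100, rng=random.Random(0))
print(estimate, error > 0)
```

Any function of the sample can be passed as `aggregate`, which is why the approach extends to UDFs and nested queries at the cost of running the aggregate many times.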
Error Verification
[Figure: error versus sample size]
More Data → Higher Accuracy
300 Data Points → 97% Accuracy
[KDD’13] [SIGMOD’14]
What is BlinkDB?
A framework built on Spark that …
- [Offline Process] creates and maintains a variety of uniform and
stratified samples from underlying data
- [Online Process] returns fast, approximate answers with error bars
by executing queries on samples of data
- [Online Process] verifies the correctness of the error bars that it
returns at runtime
BlinkDB Architecture
[Diagram: TABLE, Sampling Module, Original Data]
Offline sampling: creates an optimal set of samples on native tables and materialized views based on query history and workload characteristics.
[Diagram adds In-Memory Samples and On-Disk Samples]
Sample placement: samples striped over 100s or 1,000s of machines, both on disk and in memory.
[Diagram adds a HiveQL/SQL query (SELECT foo(*) FROM TABLE WITHIN 2), Sample Selection, and a Query Plan]
Online sample selection: picks the best sample(s) based on query latency and accuracy requirements.
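One way to picture online sample selection under a time bound is sketched below. It is a deliberately simplified assumption, not BlinkDB's actual selection logic: it picks the largest sample that a fixed scan rate could read within the bound. The function `pick_sample`, the sample sizes, and the throughput figure are all hypothetical.

```python
def pick_sample(sample_rows, rows_per_second, time_bound_seconds):
    """Return the largest sample size scannable within the time bound, or None."""
    budget = rows_per_second * time_bound_seconds
    fitting = [n for n in sample_rows if n <= budget]
    return max(fitting) if fitting else None

# Hypothetical pre-built samples (row counts) and scan throughput.
available = [10_000, 1_000_000, 100_000_000]
chosen = pick_sample(available, rows_per_second=2_000_000, time_bound_seconds=2)
print(chosen)
```

A larger sample gives a tighter error bar, so under a latency bound the largest sample that still fits is the natural choice in this simplified model.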
[Diagram adds Hive/Spark/Presto, a New Query Plan, and Error Bars & Confidence Intervals]
Parallel query execution on multiple samples striped across multiple machines.
Result: 182.23 ± 5.56 (95% confidence)
BlinkDB is Fast!
- 5 Queries, 5 machines
- 20 GB samples (0.001%-1% of original data)
- 1-5% Error
[Figure: response time (s), showing query execution, error estimation overhead, and error verification overhead within overall query execution]
Coming Soon: Native Spark Integration
BlinkDB Prototype
1. Alpha 0.2.0 released and available at http://blinkdb.org
2. Allows you to create samples on native tables and materialized
views
3. Adds approximate aggregate functions with statistical closed
forms to HiveQL
4. Compatible with Apache Hive, Spark and Facebook’s Presto
(storage, serdes, UDFs, types, metadata)
http://blinkdb.org
Native Spark Integration Coming Soon!

Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 

BlinkDB - Approximate Queries on Very Large Data

  • 1. Approximate Queries on Very Large Data. Sameer Agarwal, UC Berkeley. Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael Jordan, Samuel Madden, Ion Stoica (UC Berkeley and MIT).
  • 2. About Me 1. Software Engineer at Databricks in San Francisco, CA 2. PhD in Databases from the University of California, Berkeley (2014) 3. Actively work as part of the open source community and the AMPLab to create BDAS (Berkeley Data Analytics Stack), which comprises Apache Spark, Tachyon, BlinkDB, Mesos, etc.
  • 3. 10 TB on 100 machines. Hard Disks: ½ - 1 hour. Memory: 1 - 5 minutes. ? (Query Execution on Samples): 1 second.
  • 4. Query Execution on Samples. What is the average buffering ratio in the table? 0.2325
        ID  City      Buff Ratio
        1   NYC       0.78
        2   NYC       0.13
        3   Berkeley  0.25
        4   NYC       0.19
        5   NYC       0.11
        6   Berkeley  0.09
        7   NYC       0.18
        8   NYC       0.15
        9   Berkeley  0.13
        10  Berkeley  0.49
        11  NYC       0.19
        12  Berkeley  0.10
  • 5. Query Execution on Samples. What is the average buffering ratio in the table? Same 12-row table as slide 4, with a uniform sample. Uniform sample: 0.19; full table: 0.2325
        ID  City      Buff Ratio  Sampling Rate
        2   NYC       0.13        1/4
        6   Berkeley  0.25        1/4
        8   NYC       0.19        1/4
  • 6. Query Execution on Samples. What is the average buffering ratio in the table? Same table and uniform sample as slide 5. Uniform sample: 0.19 +/- 0.05; full table: 0.2325
  • 7. Query Execution on Samples. What is the average buffering ratio in the table? Same 12-row table as slide 4, with a larger uniform sample. 1/2 sample: 0.22 +/- 0.02; 1/4 sample: 0.19 +/- 0.05; full table: 0.2325
        ID  City      Buff Ratio  Sampling Rate
        2   NYC       0.13        1/2
        3   Berkeley  0.25        1/2
        5   NYC       0.19        1/2
        6   Berkeley  0.09        1/2
        8   NYC       0.18        1/2
        12  Berkeley  0.49        1/2
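The estimate-plus-error-bar idea in slides 4-7 can be sketched in Python. This is a minimal illustration, not BlinkDB code: the function name `approx_avg`, the fixed seed, and the 95% normal-approximation half-width (`z = 1.96`) are assumptions, since the slides do not say how the +/- values were computed. With only 3 or 6 rows the normal approximation is crude, so the printed numbers will not match the slide.

```python
import math
import random
import statistics

# Buffering ratios from the 12-row table on slide 4.
BUFF_RATIO = [0.78, 0.13, 0.25, 0.19, 0.11, 0.09,
              0.18, 0.15, 0.13, 0.49, 0.19, 0.10]

def approx_avg(values, rate, seed=0, z=1.96):
    """Estimate the mean from a uniform sample.

    Returns (estimate, half_width, n), where half_width is a
    normal-approximation error bar: z * s / sqrt(n).
    """
    rng = random.Random(seed)
    n = max(2, round(len(values) * rate))  # at least 2 rows so stdev is defined
    sample = rng.sample(values, n)         # uniform sample without replacement
    estimate = statistics.fmean(sample)
    half_width = z * statistics.stdev(sample) / math.sqrt(n)
    return estimate, half_width, n

exact = statistics.fmean(BUFF_RATIO)       # exact answer over the full table
for rate in (1 / 4, 1 / 2):
    est, err, n = approx_avg(BUFF_RATIO, rate)
    print(f"rate={rate:.2f} n={n} estimate={est:.3f} +/- {err:.3f} (exact {exact:.4f})")
```

Doubling the sampling rate reads twice as many rows and, on average, tightens the error bar, which is the trade-off the following slides explore.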
  • 8. Speed/Accuracy Trade-off [Figure: Error vs. Execution Time (Sample Size). Marked points: Interactive Queries at 2 sec; Time to Execute on Entire Dataset at 30 mins]
  • 9. Sampling Vs. No Sampling [Figure: Query Response Time (Seconds) vs. Fraction of full data, for fractions 1, 10^-1, 10^-2, 10^-3, 10^-4, 10^-5. Bar labels: 1020, 103, 18, 13, 10, 8 seconds. Annotation: 10x, as response time is dominated by I/O]
  • 10. Sampling Vs. No Sampling [Figure: same chart as slide 9, with Error Bars labels (0.02%), (0.07%), (1.1%), (3.4%), (11%)]
  • 11. Sampling Error Typically, error depends on sample size (n) and not on original data size, i.e., error is proportional to (1/sqrt(n))*
  • 12. Sampling Error Typically, error depends on sample size (n) and not on original data size, i.e., error is proportional to (1/sqrt(n))* * Conditions Apply
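A small simulation can make the 1/sqrt(n) claim concrete. The sketch below is illustrative only: the populations are synthetic uniform random numbers, and the population sizes, trial count, and seeds are arbitrary assumptions. It measures the spread of the sample mean for n = 100 and n = 400 on two populations of very different size.

```python
import random
import statistics

def empirical_error(population, n, trials=300, seed=1):
    """Standard deviation of the sample mean over repeated uniform samples of size n."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(trials)]
    return statistics.stdev(means)

gen = random.Random(0)
populations = {
    "10k rows": [gen.random() for _ in range(10_000)],
    "200k rows": [gen.random() for _ in range(200_000)],
}

results = {}
for name, population in populations.items():
    results[name] = {n: empirical_error(population, n) for n in (100, 400)}
    ratio = results[name][100] / results[name][400]
    print(f"{name}: error(n=100)={results[name][100]:.4f} "
          f"error(n=400)={results[name][400]:.4f} ratio={ratio:.2f}")
```

Quadrupling n should roughly halve the error (a ratio near 2), and the error at a given n should be about the same for both populations. The "conditions apply" caveat still holds: heavy-tailed data or very selective filters can break this behaviour.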
  • 14. Speed/Accuracy Trade-off: SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' WITHIN 1 SECONDS. Result: 234.23 ± 15.32
  • 15. Speed/Accuracy Trade-off: SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' WITHIN 2 SECONDS. Results: 234.23 ± 15.32 (1 second), 239.46 ± 4.96 (2 seconds)
  • 16. Speed/Accuracy Trade-off: SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' WITHIN 1 SECONDS. AVG, COUNT, SUM, STDEV, PERCENTILE, etc.
  • 17. Speed/Accuracy Trade-off: SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' WITHIN 1 SECONDS. FILTERS, GROUP BY clauses
  • 18. Speed/Accuracy Trade-off: SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' LEFT OUTER JOIN logs2 ON very_big_log.id = logs.id WITHIN 1 SECONDS. JOINS, Nested Queries, etc.
  • 19. Speed/Accuracy Trade-off: SELECT my_function(sessionTime) FROM Table WHERE city='San Francisco' LEFT OUTER JOIN logs2 ON very_big_log.id = logs.id WITHIN 1 SECONDS. ML Primitives, User Defined Functions
  • 20. Speed/Accuracy Trade-off: SELECT avg(sessionTime) FROM Table WHERE city='San Francisco' ERROR 0.1 CONFIDENCE 95.0%
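An error-bounded query implies a minimum sample size. The Python sketch below shows the textbook calculation for a mean under a normal approximation. It is not taken from BlinkDB: the function name `required_sample_size` is hypothetical, `sigma` is assumed to be known or estimated from a pilot sample, and treating `ERROR 0.1` as an absolute half-width is an assumption, because the slide does not define it.

```python
import math
from statistics import NormalDist

def required_sample_size(sigma, error, confidence=0.95):
    """Smallest n with z * sigma / sqrt(n) <= error (normal approximation).

    sigma:      standard deviation of the queried column (assumed known)
    error:      target half-width of the confidence interval (absolute)
    confidence: two-sided confidence level, e.g. 0.95
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    return math.ceil((z * sigma / error) ** 2)

# Halving the target error roughly quadruples the rows needed.
for err in (0.1, 0.05):
    print(err, required_sample_size(sigma=1.0, error=err))
```

The required n depends on the column's variance and the target error, not on the size of the underlying table, which is why a time bound and an error bound are two views of the same trade-off.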
  • 21. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime
  • 25. Uniform Samples [Diagram: blocks 2, 4, 1, 3 and operator U] Input table:
        ID  City      Data
        1   NYC       0.78
        2   NYC       0.13
        3   Berkeley  0.25
        4   NYC       0.19
        5   NYC       0.11
        6   Berkeley  0.09
        7   NYC       0.18
        8   NYC       0.15
        9   Berkeley  0.13
        10  Berkeley  0.49
        11  NYC       0.19
        12  Berkeley  0.10
  • 26. Uniform Samples [Diagram: blocks 2, 4, 1, 3 and operator U; same input table as slide 25] 1. FILTER rand() < 1/3 2. Adds per-row weights 3. In-memory Shuffle
  • 27. Uniform Samples [Diagram: blocks 2, 4, 1, 3 and operator U; same input table as slide 25] Doesn’t change Spark RDD Semantics. Output:
        ID  City      Data  Weight
        2   NYC       0.13  1/3
        8   NYC       0.25  1/3
        6   Berkeley  0.09  1/3
        11  NYC       0.19  1/3
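The three steps on slide 26 can be sketched in plain Python. This is an illustration on in-memory tuples rather than Spark RDDs; the function names are hypothetical, and using the weight to scale SUM and COUNT back up (dividing each row by its sampling rate) is an assumption about how the per-row weights are meant to be consumed.

```python
import random

# (ID, City, Data) rows from the table on slide 25.
ROWS = [(1, "NYC", 0.78), (2, "NYC", 0.13), (3, "Berkeley", 0.25),
        (4, "NYC", 0.19), (5, "NYC", 0.11), (6, "Berkeley", 0.09),
        (7, "NYC", 0.18), (8, "NYC", 0.15), (9, "Berkeley", 0.13),
        (10, "Berkeley", 0.49), (11, "NYC", 0.19), (12, "Berkeley", 0.10)]

def uniform_sample(rows, rate, seed=0):
    """1. FILTER rand() < rate  2. add per-row weight  3. shuffle in memory."""
    rng = random.Random(seed)
    kept = [row + (rate,) for row in rows if rng.random() < rate]
    rng.shuffle(kept)
    return kept

def estimate_count(sample):
    """Scale each sampled row by 1/weight to estimate the full-table COUNT."""
    return sum(1 / weight for *_, weight in sample)

def estimate_sum(sample):
    """Scale each sampled value by 1/weight to estimate the full-table SUM."""
    return sum(value / weight for _id, _city, value, weight in sample)

sample = uniform_sample(ROWS, 1 / 3)
print(len(sample), "rows kept")
print("estimated COUNT:", estimate_count(sample))
print("estimated SUM:  ", estimate_sum(sample))
```

Because the filter is an independent coin flip per row, the number of rows kept varies from run to run; only its expectation equals rate times the table size.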
  • 30. Stratified Samples [Diagram: blocks 2, 4, 1, 3 and operator S] Input: the 12-row ID / City / Data table from slide 25.
  • 31. Stratified Samples. SPLIT: the input table alongside S1, a copy with the same 12 rows.
  • 32. Stratified Samples. GROUP: S1 (the 12-row table) and S2:
        City      Count
        NYC       7
        Berkeley  5
  • 33. Stratified Samples. GROUP: S1 (the 12-row table) and S2, now with a Ratio column:
        City      Count  Ratio
        NYC       7      2/7
        Berkeley  5      2/5
  • 34. Stratified Samples. JOIN: S1 (the 12-row table) with S2 (City, Count, Ratio as on slide 33).
  • 35. Stratified Samples [Diagram: S1, S2, S2, U] Doesn’t change Shark RDD Semantics. Output:
        ID  City      Data  Weight
        2   NYC       0.13  2/7
        8   NYC       0.25  2/7
        6   Berkeley  0.09  2/5
        12  Berkeley  0.49  2/5
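The GROUP, ratio, JOIN, and filter steps in slides 32-35 can be sketched as follows. This is a plain-Python illustration with hypothetical names, not the BlinkDB operator: it assumes the ratio is a per-stratum cap divided by the group count (2/7 and 2/5 on the slide correspond to a cap of 2), and that the final U step is a per-row `rand() < ratio` filter.

```python
import random
from collections import Counter

# (ID, City, Data) rows from the table on slide 25.
ROWS = [(1, "NYC", 0.78), (2, "NYC", 0.13), (3, "Berkeley", 0.25),
        (4, "NYC", 0.19), (5, "NYC", 0.11), (6, "Berkeley", 0.09),
        (7, "NYC", 0.18), (8, "NYC", 0.15), (9, "Berkeley", 0.13),
        (10, "Berkeley", 0.49), (11, "NYC", 0.19), (12, "Berkeley", 0.10)]

def stratified_sample(rows, cap, seed=0):
    """Sample about `cap` rows per city; return (sample, ratios)."""
    rng = random.Random(seed)
    # GROUP: count rows per stratum (S2 on the slides).
    counts = Counter(city for _id, city, _value in rows)
    # Ratio column: cap / count, never more than 1.
    ratios = {city: min(1.0, cap / count) for city, count in counts.items()}
    sample = []
    for row_id, city, value in rows:
        ratio = ratios[city]        # JOIN: look up the row's stratum ratio
        if rng.random() < ratio:    # U: per-row filter with that ratio
            sample.append((row_id, city, value, ratio))  # ratio kept as weight
    return sample, ratios

sample, ratios = stratified_sample(ROWS, cap=2)
print(ratios)
for row in sample:
    print(row)
```

Rare groups get a higher sampling ratio than common ones, so a query that filters or groups on City still sees rows from every city. With a coin-flip filter the per-city row count is 2 only in expectation.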
  • 36. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime
  • 37. Error Estimation: Closed Form Aggregate Functions - Central Limit Theorem - Applicable to AVG, COUNT, SUM, VARIANCE and STDEV
  • 38. Error Estimation: Closed Form Aggregate Functions. "The following results are (asymptotically in sample […] rectly useful, since they depend on unknown properties […] tribution. In all cases we just plug in the sample values. […] of $\mu$ we use $\frac{1}{n}\sum_{i=1}^{n} X_i$ where $X_i$ is the $i$th sample value. Note that for estimators other than sum and count, […] ($p = 1$). Filtering will increase variance a bit, or potentia[…] selective queries ($p = 0$). I can compute the filtering-adju[…]"
        1. Count: $N\big(np,\ n(1-p)p\big)$
        2. Sum: $N\big(np\mu,\ np(\sigma^2 + (1-p)\mu^2)\big)$
        3. Mean: $N\big(\mu,\ \sigma^2/n\big)$
        4. Variance: $N\big(\sigma^2,\ (\mu_4 - \sigma^4)/n\big)$
        5. Stddev: $N\big(\sigma,\ (\mu_4 - \sigma^4)/(4\sigma^2 n)\big)$
  • 39. Error Estimation: Closed Form Aggregate Functions [Diagram: aggregate A (AVG, SUM, COUNT, STDEV, VARIANCE) evaluated on a Sample, producing A ± ε]
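The plug-in approach on slide 38 can be sketched for the unfiltered case (p = 1). The Python below computes 95% half-widths for mean, variance, and stddev from the asymptotic normal variances sigma^2/n, (mu_4 - sigma^4)/n, and (mu_4 - sigma^4)/(4 sigma^2 n), where mu_4 is the fourth central moment. The function name and `z = 1.96` are assumptions, and the six-value sample from slide 7 is far too small for the asymptotics to be trustworthy; it only shows the mechanics.

```python
import math
import statistics

# The six sampled values shown on slide 7.
SAMPLE = [0.13, 0.25, 0.19, 0.09, 0.18, 0.49]

def closed_form_intervals(sample, z=1.96):
    """Return {aggregate: (estimate, half_width)} using plug-in sample moments."""
    n = len(sample)
    mu = statistics.fmean(sample)
    var = statistics.pvariance(sample, mu)                 # plug-in sigma^2
    mu4 = statistics.fmean([(x - mu) ** 4 for x in sample])  # 4th central moment
    spread = mu4 - var ** 2                                 # mu_4 - sigma^4
    return {
        "mean": (mu, z * math.sqrt(var / n)),
        "variance": (var, z * math.sqrt(spread / n)),
        "stddev": (math.sqrt(var), z * math.sqrt(spread / (4 * var * n))),
    }

for name, (estimate, half_width) in closed_form_intervals(SAMPLE).items():
    print(f"{name}: {estimate:.4f} +/- {half_width:.4f}")
```

The sketch assumes the sample has non-zero variance; a constant column would make the stddev formula divide by zero.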
  • 40. Error Estimation Generalized Aggregate Functions - Statistical Bootstrap - Applicable to complex and nested queries, UDFs, joins etc.
  • 41. Error Estimation: Generalized Aggregate Functions [Diagram: query A on the Sample; queries A1, A2, …, A100 on resamples of the Sample; B combines them into ± ε]
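The bootstrap idea in slides 40-41 can be sketched in a few lines of Python. This is the generic statistical bootstrap, not BlinkDB's implementation: the function name is hypothetical, the 100 resamples mirror A1 … A100 on the slide, and the normal-style half-width `z * stdev(replicates)` is one common choice among several. The median stands in for an aggregate without a simple closed form.

```python
import random
import statistics

# The six sampled values shown on slide 7.
SAMPLE = [0.13, 0.25, 0.19, 0.09, 0.18, 0.49]

def bootstrap_error(sample, statistic, resamples=100, seed=0, z=1.96):
    """Return (estimate, half_width) for an arbitrary statistic.

    Each replicate recomputes the statistic on a resample drawn with
    replacement from the sample, at the same size as the sample.
    """
    rng = random.Random(seed)
    n = len(sample)
    replicates = [statistic(rng.choices(sample, k=n)) for _ in range(resamples)]
    return statistic(sample), z * statistics.stdev(replicates)

estimate, half_width = bootstrap_error(SAMPLE, statistics.median)
print(f"median: {estimate:.3f} +/- {half_width:.3f}")
```

Any callable that maps a list of values to a number can be passed as `statistic`, which is what makes the approach applicable to UDFs and nested queries, at the cost of running the query once per resample.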
  • 42. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of random and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime
  • 43. Error Verification [Figure: Error vs. Sample Size] More Data → Higher Accuracy. 300 Data Points → 97% Accuracy [KDD’13] [SIGMOD’14]
  • 44. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of random and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime
  • 45. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime [Offline Process]
  • 46. What is BlinkDB? A framework built on Spark that … - creates and maintains a variety of uniform and stratified samples from underlying data - returns fast, approximate answers with error bars by executing queries on samples of data - verifies the correctness of the error bars that it returns at runtime [Online Process]
  • 47. BlinkDB Architecture [Diagram: TABLE, Sampling Module, Original Data] Offline-sampling: Creates an optimal set of samples on native tables and materialized views based on query history and workload characteristics
  • 48. BlinkDB Architecture [Diagram: TABLE, Sampling Module, In-Memory Samples, On-Disk Samples, Original Data] Sample Placement: Samples striped over 100s or 1,000s of machines both on disks and in-memory.
  • 49. BlinkDB Architecture [Diagram: HiveQL/SQL Query, Query Plan, Sample Selection, TABLE, Sampling Module, In-Memory Samples, On-Disk Samples, Original Data] SELECT foo (*) FROM TABLE WITHIN 2
  • 50. BlinkDB Architecture [Diagram: as on slide 49] SELECT foo (*) FROM TABLE WITHIN 2. Online sample selection to pick best sample(s) based on query latency and accuracy requirements
  • 51. BlinkDB Architecture [Diagram: HiveQL/SQL Query, New Query Plan, Sample Selection, Hive/Spark/Presto, TABLE, Sampling Module, In-Memory Samples, On-Disk Samples, Original Data, Error Bars & Confidence Intervals] SELECT foo (*) FROM TABLE WITHIN 2. Result: 182.23 ± 5.56 (95% confidence). Parallel query execution on multiple samples striped across multiple machines.
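Online sample selection under a time bound can be illustrated with a toy rule: pick the largest pre-built sample whose predicted latency fits the budget. The Python sketch below is purely hypothetical. The `Sample` class, the linear latency model (`rows / rows_per_second`), and all numbers are assumptions for illustration; the slides do not describe the actual selection logic at this level of detail.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample:
    name: str
    rows: int

def pick_sample(samples, time_budget_s, rows_per_second):
    """Pick the largest sample predicted to finish within the time budget.

    Latency is modelled as rows / rows_per_second (an assumed linear model).
    If nothing fits, fall back to the smallest sample.
    """
    feasible = [s for s in samples if s.rows / rows_per_second <= time_budget_s]
    if not feasible:
        return min(samples, key=lambda s: s.rows)
    return max(feasible, key=lambda s: s.rows)

# Hypothetical samples of increasing size.
samples = [Sample("s_small", 1_000), Sample("s_medium", 100_000),
           Sample("s_large", 10_000_000)]
chosen = pick_sample(samples, time_budget_s=2, rows_per_second=1_000_000)
print(chosen.name)
```

A larger sample gives tighter error bars, so choosing the largest one that fits the budget is the natural greedy choice for a `WITHIN` query under this simple model.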
  • 52. BlinkDB is Fast! - 5 Queries, 5 machines - 20 GB samples (0.001%-1% of original data) - 1-5% Error
  • 56. Coming Soon: Native Spark Integration
  • 57. BlinkDB Prototype 1. Alpha 0.2.0 released and available at http://blinkdb.org 2. Allows you to create samples on native tables and materialized views 3. Adds approximate aggregate functions with statistical closed forms to HiveQL 4. Compatible with Apache Hive, Spark and Facebook’s Presto (storage, serdes, UDFs, types, metadata)