Lazy Join Optimizations without
Upfront Statistics
MATTEO INTERLANDI
Cloud Computing Programs
Open-source Data-Intensive Scalable Computing (DISC) platforms: Hadoop MapReduce and Spark
◦ Functional API
◦ map and reduce user-defined functions (UDFs)
◦ RDD transformations (filter, flatMap, zipPartitions, etc.)
Several years later: introduction of high-level, SQL-like declarative query languages (and systems)
◦ Conciseness
◦ The system picks a physical execution plan from a number of alternatives
Query Optimization
Two-step process
◦ Logical optimizations (e.g., filter pushdown)
◦ Physical optimizations (e.g., join order and join implementation)
Physical optimizer in an RDBMS:
◦ Cost-based
◦ Data statistics (e.g., predicate selectivities, cost of data access, etc.)
The role of the cost-based optimizer is to
(1) enumerate some set of equivalent plans
(2) estimate the cost of each plan
(3) select a sufficiently good plan
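The enumerate / estimate / select loop of a cost-based optimizer can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: the row counts, the selectivities, the restriction to left-deep join orders, and the cost function (sum of estimated intermediate result sizes) are all hypothetical.

```python
from itertools import permutations

# Hypothetical statistics: row counts per relation, one selectivity per join edge.
ROWS = {"A": 1_000, "B": 1_000_000, "C": 50_000}
SELECTIVITY = {
    frozenset(("A", "B")): 1e-6,
    frozenset(("B", "C")): 1e-5,
    frozenset(("A", "C")): 1e-3,
}


def join_selectivity(joined, right):
    """Multiply the selectivities of every join edge between `joined` and `right`."""
    sel = 1.0
    for rel in joined:
        sel *= SELECTIVITY.get(frozenset((rel, right)), 1.0)
    return sel


def plan_cost(order):
    """(2) Estimate: cost of a left-deep plan = sum of intermediate result sizes."""
    joined = [order[0]]
    size = float(ROWS[order[0]])
    cost = 0.0
    for rel in order[1:]:
        size = size * ROWS[rel] * join_selectivity(joined, rel)
        cost += size
        joined.append(rel)
    return cost


def best_left_deep_plan(relations):
    """(1) Enumerate all left-deep orders, then (3) select the cheapest one."""
    return min(permutations(relations), key=plan_cost)


best = best_left_deep_plan(["A", "B", "C"])
print(best, plan_cost(best))
```

With these made-up statistics the cheapest orders join A and B first, because that intermediate result is far smaller than A joined with C or B joined with C.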
Query Optimization: Why Important?
[Figure: execution time in seconds (log scale), Scale Factor = 10; values paired with systems in extraction order]
System      W                             R
Spark       1082                          70
AsterixDB   343                           21
Hive        unable to finish in 5+ hours  276
Pig         15102                         954
Bad plans over Big Data can be disastrous!
Cost-based Optimizer in DISC
No cost-based join enumeration
◦ Rely on the order of relations in the FROM clause
◦ Left-deep plans
No upfront statistics:
◦ Data often sits in HDFS and is unstructured
Even if input statistics are available:
◦ Correlations between predicates
◦ Exponential error propagation in joins
◦ Arbitrary UDFs
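The point about exponential error propagation in joins can be made concrete with a little arithmetic. The sketch below assumes a chain of joins whose output size is the product of input sizes and join selectivities, and a hypothetical 2x overestimate of every selectivity; none of these numbers come from the slides.

```python
def cardinality(rows, selectivities):
    """Estimated output size of a join chain: product of sizes and selectivities."""
    size = float(rows[0])
    for r, s in zip(rows[1:], selectivities):
        size *= r * s
    return size


rows = [10_000] * 5                    # five relations (hypothetical sizes)
true_sel = [1e-4] * 4                  # true selectivity of each of the four joins
est_sel = [2 * s for s in true_sel]    # every estimate is off by a factor of 2

# Ratio between estimated and true cardinality after 1, 2, 3 and 4 joins.
ratios = [
    cardinality(rows[: k + 1], est_sel[:k]) / cardinality(rows[: k + 1], true_sel[:k])
    for k in range(1, 5)
]
print(ratios)
```

Under these assumptions a constant per-join error factor of 2 compounds to roughly 2, 4, 8, and 16 after one to four joins, which is why small estimation errors can mislead the choice of join order in deep plans.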
Cost-based Optimizer in DISC: State of the Art
Bad statistics
◦ Adaptive query planning
◦ RoPe [NSDI 12, VLDB 2013]
◦ Assumption is that some initial statistics exist
No upfront statistics
◦ Pilot runs (samples)
◦ DynO [SIGMOD 2014]
◦ Samples are expensive
◦ Only foreign-key joins
◦ No runtime plan revision
Lazy Cost-based Optimizer for Spark
Key idea: interleave query planning and execution
◦ Query plans are lazily executed
◦ Statistics are gathered at runtime
◦ Joins are greedily scheduled
◦ The next join can be changed dynamically if a bad decision was made
◦ Execute-Gather-Aggregate-Plan strategy (EGAP)
Neither upfront statistics nor pilot runs are required
◦ Raw dataset size is used for the initial guess
Support for non-foreign-key joins
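A toy version of the Execute-Gather-Aggregate-Plan loop is sketched below. It is not the optimizer's actual implementation: it runs on in-memory Python lists rather than Spark, the only statistic gathered is a row count, and the greedy rule (join the pair with the smallest combined input) is an assumption made for illustration. All function names are hypothetical.

```python
def hash_join(left, right, key):
    """Equi-join two lists of dict rows on `key` (toy stand-in for a distributed join)."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def gather_stats(rows):
    """Gather + Aggregate: here the only statistic is the row count."""
    return {"rows": len(rows)}


def find_key(left_names, right_names, join_keys):
    """Return a join column linking the two groups of relations, or None."""
    for a in left_names:
        for b in right_names:
            key = join_keys.get(frozenset((a, b)))
            if key is not None:
                return key
    return None


def lazy_join(relations, join_keys):
    # The inputs are already materialized; collect their statistics first.
    data = {(name,): rows for name, rows in relations.items()}
    stats = {names: gather_stats(rows) for names, rows in data.items()}
    order = []
    while len(data) > 1:
        # Plan: among joinable pairs, greedily pick the smallest combined input.
        candidates = []
        groups = list(data)
        for i, left in enumerate(groups):
            for right in groups[i + 1:]:
                key = find_key(left, right, join_keys)
                if key is not None:
                    cost = stats[left]["rows"] + stats[right]["rows"]
                    candidates.append((cost, left, right, key))
        if not candidates:
            raise ValueError("remaining relations share no join key")
        _, left, right, key = min(candidates)
        # Execute only the chosen join, then gather statistics on its result
        # so that the next planning round sees real sizes.
        joined = hash_join(data.pop(left), data.pop(right), key)
        merged = tuple(sorted(left + right))
        data[merged] = joined
        stats[merged] = gather_stats(joined)
        order.append((left, right))
    (result,) = data.values()
    return result, order


relations = {
    "A": [{"k": 1, "a": "a1"}, {"k": 2, "a": "a2"}],
    "B": [{"k": n % 3, "j": n} for n in range(6)],
    "C": [{"j": n, "c": f"c{n}"} for n in range(4)],
}
join_keys = {frozenset(("A", "B")): "k", frozenset(("B", "C")): "j"}
result, order = lazy_join(relations, join_keys)
print(order, len(result))
```

Because the plan is recomputed after every executed join, a join that turns out larger than expected only affects the decisions that follow it, which is the late-binding behaviour the slide describes.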
Lazy Optimizer: an Example
Assumption: A < C
[Animated diagram over relations A, B, and C; only the step titles and a few labels are recoverable]
◦ Execute step
◦ Gather step
◦ Aggregate step (label: Driver)
◦ Plan step
◦ Execute step (label: AB)
◦ Gather step
◦ Plan step
◦ Execute step (label: ABC)
Lazy Optimizer: Wrong Guess
Assumption: A < σ(C)
σ(A) > σ(C)
[Animated diagram over relations A, B, and σ(C); recoverable labels, in order of appearance: Repartition step (B), BC, ABC]
Runtime Integrated Optimizer for Spark
Spark batch execution model allows late binding of joins
Set of Statistics:
◦ Join estimations (based on sampling or sketches)
◦ Number of records
◦ Average size of each record
Statistics are aggregated using a Spark job or accumulators
Join implementations are picked based on thresholds
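The statistics listed above (number of records, average record size) and a threshold-based choice of join implementation can be sketched as follows. This is a plain-Python illustration, not Spark code: `PartitionStats`, `aggregate`, `pick_join`, and the 10 MB threshold are hypothetical names and values, and the per-partition merge only mimics what an accumulator or a small aggregation job would do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionStats:
    records: int
    total_bytes: int

    def merge(self, other):
        """Combine two partial statistics (associative and commutative)."""
        return PartitionStats(
            self.records + other.records,
            self.total_bytes + other.total_bytes,
        )

    @property
    def avg_record_size(self):
        return self.total_bytes / self.records if self.records else 0.0


def aggregate(partition_stats):
    """Fold per-partition statistics into one dataset-level summary."""
    total = PartitionStats(0, 0)
    for stats in partition_stats:
        total = total.merge(stats)
    return total


BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024  # hypothetical threshold


def pick_join(left, right, threshold=BROADCAST_THRESHOLD_BYTES):
    """Broadcast the smaller side if it fits under the threshold, else shuffle."""
    smaller = min(left.total_bytes, right.total_bytes)
    return "broadcast" if smaller <= threshold else "shuffle"


small = aggregate([PartitionStats(100, 4_000), PartitionStats(300, 12_000)])
big = PartitionStats(5_000_000, 50 * 1024 * 1024)
print(small, small.avg_record_size, pick_join(small, big), pick_join(big, big))
```

Keeping the merge associative and commutative is what allows the partial statistics to be combined in any order, whether on the executors or at the driver.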
Challenges and Optimizations
Execute - Block and revise execution plans without wasting
computation
Gather - Asynchronous generation of statistics
Aggregate - Efficient accumulation of statistics
Plan - Try to schedule as many broadcast joins as possible
Experiments
Q1: Is RIOS able to generate good query plans?
Q2: How does RIOS perform compared to regular Spark and pilot runs?
Q3: How expensive are wrong guesses?
Minibenchmark with 3 Fact Tables
[Figure: execution time in seconds (log scale) by scale factor; series paired with values in legend order]
Scale Factor       1     10     100    1000
spark good-order   40    66     115    997
RIOS R2R           41    61     111    868
RIOS W2R           45    66     123    1140
spark wrong-order  136   4194   unable to finish in 5+ hours (at 100 and 1000)
pilot-run          143   4230   unable to finish in 5+ hours (at 100 and 1000)
Q1: RIOS is able to avoid bad plans
Q2: RIOS is always faster than the pilot-run approach
Q3: Bad guesses cost around 15% in the worst case
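TPCDS and TPCH Queries
[Figure: execution time in seconds (log scale) by scale factor for four queries; series paired with values in legend order]
Query     Series            SF 1   SF 10   SF 100   SF 1000
Query 17  spark good-order  38     49      140      2298
Query 17  RIOS              37     41      67       899
Query 17  pilot-run         40     43      69       930
Query 17  spark bad-order   45     55      198      3898
Query 50  spark good-order  38     55      87       1185
Query 50  RIOS              38     41      54       843
Query 50  pilot-run         41     52      70       1128
Query 50  spark bad-order   47     105     291      6069
Query 28  spark good-order  37     41      107      617
Query 28  RIOS              34     35      39       464
Query 28  pilot-run         37     38      42       490
Query 28  spark bad-order   37     44      109      712
Query 9   spark good-order  55     80      326      8511
Query 9   RIOS              46     47      137      7250
Query 9   pilot-run         50     50      215      7831
Query 9   spark bad-order   70     153     633      unable to finish in 5+ hours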
Q1: RIOS generates optimal plans
Q2: RIOS is always the fastest approach
Conclusions
RIOS: cost-based query optimizer for Spark
Statistics are gathered at runtime (no need for initial statistics or pilot runs)
Late binding of joins
Up to 2x faster than the best left-deep plans (Spark), and more than 100x faster than previous approaches for fact-table joins.
Future Work
More flexible shuffle operations:
◦ Efficient switch from shuffle-based joins to broadcast joins
◦ Allow records to be partitioned in different ways
Take into consideration interesting orders and partitions
Add aggregation and additional statistics (I/O and network cost)
Thank you
Experiment Configuration
Datasets:
◦ TPCDS
◦ TPCH
Configuration:
◦ 16 machines, each with 4 cores (2 hyperthreads per core), 32 GB of RAM, and a 1 TB disk
◦ Spark 1.6.3
◦ Scale factor from 1 to 1000 (~1 TB)

More Related Content

What's hot

Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer Agarwal
Spark Summit
 
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Databricks
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Databricks
 
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Spark Summit
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
Jen Aman
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Spark Summit
 
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Databricks
 
Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™
Databricks
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
Cloudera, Inc.
 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Spark Summit
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2
Databricks
 
Spark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena LazovikSpark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena Lazovik
Spark Summit
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
Databricks
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Spark Summit
 
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Databricks
 
Assessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache SparkAssessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache Spark
Databricks
 
Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and Dataset
Kazuaki Ishizaki
 
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
Spark Summit
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Flink Forward
 

What's hot (20)

Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
Accumulo Summit 2015: Using D4M for rapid prototyping of analytics for Apache...
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer Agarwal
 
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
 
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
 
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
 
Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™
 
Large Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache SparkLarge Scale Machine Learning with Apache Spark
Large Scale Machine Learning with Apache Spark
 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone ML
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2
 
Spark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena LazovikSpark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena Lazovik
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
 
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
 
Assessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache SparkAssessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache Spark
 
Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and Dataset
 
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in ...
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
 

Similar to Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi

List intersection for web search: Algorithms, Cost Models, and Optimizations
List intersection for web search: Algorithms, Cost Models, and OptimizationsList intersection for web search: Algorithms, Cost Models, and Optimizations
List intersection for web search: Algorithms, Cost Models, and Optimizations
Sunghwan Kim
 
Relaxing Join and Selection Queries - VLDB 2006 Slides
Relaxing Join and Selection Queries - VLDB 2006 SlidesRelaxing Join and Selection Queries - VLDB 2006 Slides
Relaxing Join and Selection Queries - VLDB 2006 Slides
rvernica
 
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdfHailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
cookie1969
 
Keeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache SparkKeeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache Spark
Databricks
 
Maximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL AnywhereMaximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL Anywhere
SAP Technology
 
A Fast Intro to Fast Query with ClickHouse, by Robert Hodges
A Fast Intro to Fast Query with ClickHouse, by Robert HodgesA Fast Intro to Fast Query with ClickHouse, by Robert Hodges
A Fast Intro to Fast Query with ClickHouse, by Robert Hodges
Altinity Ltd
 
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
Edge AI and Vision Alliance
 
C-SAW: A Framework for Graph Sampling and Random Walk on GPUs
C-SAW: A Framework for Graph Sampling and Random Walk on GPUsC-SAW: A Framework for Graph Sampling and Random Walk on GPUs
C-SAW: A Framework for Graph Sampling and Random Walk on GPUs
Pandey_G
 
Se notes
Se notesSe notes
WITS Presentation - 6 Dec 2013
WITS Presentation - 6 Dec 2013WITS Presentation - 6 Dec 2013
WITS Presentation - 6 Dec 2013Magdalene Tan
 
HTAP By Accident: Getting More From PostgreSQL Using Hardware Acceleration
HTAP By Accident: Getting More From PostgreSQL Using Hardware AccelerationHTAP By Accident: Getting More From PostgreSQL Using Hardware Acceleration
HTAP By Accident: Getting More From PostgreSQL Using Hardware Acceleration
EDB
 
Adaptive Query Optimization
Adaptive Query OptimizationAdaptive Query Optimization
Adaptive Query Optimization
Anju Garg
 
Using Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsUsing Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning Models
Scott Clark
 
Using Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsUsing Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning Models
SigOpt
 
OPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACKOPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACK
InfluxData
 
Macy's: Changing Engines in Mid-Flight
Macy's: Changing Engines in Mid-FlightMacy's: Changing Engines in Mid-Flight
Macy's: Changing Engines in Mid-Flight
DataStax Academy
 
Effectively Migrating to Cassandra from a Relational Database
Effectively Migrating to Cassandra from a Relational DatabaseEffectively Migrating to Cassandra from a Relational Database
Effectively Migrating to Cassandra from a Relational Database
Todd McGrath
 
Cansat 2008: University of Michigan Maizesat Final Presentation
Cansat 2008: University of Michigan Maizesat Final PresentationCansat 2008: University of Michigan Maizesat Final Presentation
Cansat 2008: University of Michigan Maizesat Final Presentation
American Astronautical Society
 
Uav Stability Augmentation System Usas
Uav Stability Augmentation System   UsasUav Stability Augmentation System   Usas
Uav Stability Augmentation System Usas
ahmad bassiouny
 

Similar to Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi (20)

List intersection for web search: Algorithms, Cost Models, and Optimizations
List intersection for web search: Algorithms, Cost Models, and OptimizationsList intersection for web search: Algorithms, Cost Models, and Optimizations
List intersection for web search: Algorithms, Cost Models, and Optimizations
 
Relaxing Join and Selection Queries - VLDB 2006 Slides
Relaxing Join and Selection Queries - VLDB 2006 SlidesRelaxing Join and Selection Queries - VLDB 2006 Slides
Relaxing Join and Selection Queries - VLDB 2006 Slides
 
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdfHailey_Database_Performance_Made_Easy_through_Graphics.pdf
Hailey_Database_Performance_Made_Easy_through_Graphics.pdf
 
Keeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache SparkKeeping Identity Graphs In Sync With Apache Spark
Keeping Identity Graphs In Sync With Apache Spark
 
Maximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL AnywhereMaximizing Database Tuning in SAP SQL Anywhere
Maximizing Database Tuning in SAP SQL Anywhere
 
A Fast Intro to Fast Query with ClickHouse, by Robert Hodges
A Fast Intro to Fast Query with ClickHouse, by Robert HodgesA Fast Intro to Fast Query with ClickHouse, by Robert Hodges
A Fast Intro to Fast Query with ClickHouse, by Robert Hodges
 
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
“Efficiently Map AI and Vision Applications onto Multi-core AI Processors Usi...
 
C-SAW: A Framework for Graph Sampling and Random Walk on GPUs
C-SAW: A Framework for Graph Sampling and Random Walk on GPUsC-SAW: A Framework for Graph Sampling and Random Walk on GPUs
C-SAW: A Framework for Graph Sampling and Random Walk on GPUs
 
Se notes
Se notesSe notes
Se notes
 
WITS Presentation - 6 Dec 2013
WITS Presentation - 6 Dec 2013WITS Presentation - 6 Dec 2013
WITS Presentation - 6 Dec 2013
 
HTAP By Accident: Getting More From PostgreSQL Using Hardware Acceleration
HTAP By Accident: Getting More From PostgreSQL Using Hardware AccelerationHTAP By Accident: Getting More From PostgreSQL Using Hardware Acceleration
HTAP By Accident: Getting More From PostgreSQL Using Hardware Acceleration
 
Adaptive Query Optimization
Adaptive Query OptimizationAdaptive Query Optimization
Adaptive Query Optimization
 
Using Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsUsing Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning Models
 
Using Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsUsing Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning Models
 
Performance Risk Management
Performance Risk ManagementPerformance Risk Management
Performance Risk Management
 
OPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACKOPTIMIZING THE TICK STACK
OPTIMIZING THE TICK STACK
 
Macy's: Changing Engines in Mid-Flight
Macy's: Changing Engines in Mid-FlightMacy's: Changing Engines in Mid-Flight
Macy's: Changing Engines in Mid-Flight
 
Effectively Migrating to Cassandra from a Relational Database
Effectively Migrating to Cassandra from a Relational DatabaseEffectively Migrating to Cassandra from a Relational Database
Effectively Migrating to Cassandra from a Relational Database
 
Cansat 2008: University of Michigan Maizesat Final Presentation
Cansat 2008: University of Michigan Maizesat Final PresentationCansat 2008: University of Michigan Maizesat Final Presentation
Cansat 2008: University of Michigan Maizesat Final Presentation
 
Uav Stability Augmentation System Usas
Uav Stability Augmentation System   UsasUav Stability Augmentation System   Usas
Uav Stability Augmentation System Usas
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi

  • 1. Lazy Join Optimizations without Upfront Statistics MATTEO INTERLANDI
  • 2. Cloud Computing Programs. Open Source Data-Intensive Scalable Computing (DISC) platforms: Hadoop MapReduce and Spark ◦ functional API ◦ map and reduce User-Defined Functions ◦ RDD transformations (filter, flatMap, zipPartitions, etc.). Several years later, introduction of high-level SQL-like declarative query languages (and systems) ◦ Conciseness ◦ Pick a physical execution plan from a number of alternatives
  • 3. Query Optimization. Two-step process ◦ Logical optimizations (e.g., filter pushdown) ◦ Physical optimizations (e.g., join order and implementation). Physical optimizer in an RDBMS: ◦ Cost-based ◦ Data statistics (e.g., predicate selectivities, cost of data access, etc.). The role of the cost-based optimizer is to (1) enumerate some set of equivalent plans, (2) estimate the cost of each plan, and (3) select a sufficiently good plan.
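The three roles of a cost-based optimizer listed on slide 3 can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the table names, cardinalities, selectivities, and the cost model (sum of estimated intermediate result sizes over left-deep plans) are hypothetical and do not come from the talk.

```python
from itertools import permutations

# Hypothetical base-table cardinalities (rows); illustrative only.
CARD = {"A": 1_000, "B": 1_000_000, "C": 50_000}

# Hypothetical join selectivities for each joinable pair of tables.
SEL = {frozenset("AB"): 1e-6, frozenset("BC"): 1e-5, frozenset("AC"): 1e-3}


def join_selectivity(joined, new):
    """Product of the selectivities linking `new` to the tables already joined."""
    sel = 1.0
    for rel in joined:
        sel *= SEL.get(frozenset((rel, new)), 1.0)
    return sel


def plan_cost(order):
    """Assumed cost model: sum of estimated intermediate result sizes."""
    joined = [order[0]]
    rows = CARD[order[0]]
    cost = 0.0
    for rel in order[1:]:
        rows = rows * CARD[rel] * join_selectivity(joined, rel)
        cost += rows
        joined.append(rel)
    return cost


def best_left_deep_plan(relations):
    plans = permutations(relations)               # (1) enumerate equivalent plans
    costed = [(plan_cost(p), p) for p in plans]   # (2) estimate the cost of each plan
    return min(costed)                            # (3) select the cheapest one


cost, order = best_left_deep_plan(["A", "B", "C"])
print(order, cost)
```

With these made-up numbers the cheapest plans join A and B first, because that pair has the smallest estimated intermediate result. The estimate is only as good as the statistics fed into it, which is the weakness the rest of the talk addresses.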
  • 4. Query Optimization: Why Important? [Figure: bar chart of execution time in seconds (log scale, 0.25 to 16384) at Scale Factor = 10 for Spark, AsterixDB, Hive, and Pig, with a W bar and an R bar per system. Bar labels as extracted: 1082, 70, 343, 21, unable to finish in 5+ hours, 276, 15102, 954.]
  • 5. Query Optimization: Why Important? [Same figure as slide 4.] Bad plans over Big Data can be disastrous!
  • 6. Cost-based Optimizer in DISC. No cost-based join enumeration ◦ Rely on order of relations in FROM clause ◦ Left-deep plans. No upfront statistics: ◦ Data often sits in HDFS and is unstructured. Even if input statistics are available: ◦ Correlations between predicates ◦ Exponential error propagation in joins ◦ Arbitrary UDFs
  • 7. Cost-based Optimizer in DISC: State of the Art. Bad statistics ◦ Adaptive query planning ◦ RoPe [NSDI 12, VLDB 2013]. No upfront statistics ◦ Pilot runs (samples) ◦ DynO [SIGMOD 2014]
  • 8. Cost-based Optimizer in DISC: State of the Art. Bad statistics ◦ Adaptive query planning ◦ RoPe [NSDI 12, VLDB 2013]. No upfront statistics
  • 9. Cost-based Optimizer in DISC: State of the Art. Bad statistics ◦ Adaptive query planning ◦ RoPe [NSDI 12, VLDB 2013]. No upfront statistics. Assumption is that some initial statistics exist.
  • 10. Cost-based Optimizer in DISC: State of the Art. Bad statistics ◦ Adaptive query planning ◦ RoPe [NSDI 12, VLDB 2013]. No upfront statistics ◦ Pilot runs (samples) ◦ DynO [SIGMOD 2014]. Assumption is that some initial statistics exist.
  • 11. Cost-based Optimizer in DISC: State of the Art. Bad statistics ◦ Adaptive query planning ◦ RoPe [NSDI 12, VLDB 2013]. No upfront statistics ◦ Pilot runs (samples) ◦ DynO [SIGMOD 2014]. Assumption is that some initial statistics exist. ◦ Samples are expensive ◦ Only foreign-key joins ◦ No runtime plan revision
  • 12. Lazy Cost-based Optimizer for Spark. Key idea: interleave query planning and execution ◦ Query plans are lazily executed ◦ Statistics are gathered at runtime ◦ Joins are greedily scheduled ◦ The next join can be dynamically changed if a bad decision was made ◦ Execute-Gather-Aggregate-Plan (EGAP) strategy. Neither upfront statistics nor pilot runs are required ◦ Raw dataset size is used for the initial guess. Support for non-foreign-key joins.
  • 13. Lazy Optimizer: an Example. [Diagram; assumption: A < C]
  • 14. Lazy Optimizer: Execute step. [Diagram adds nodes A, B, C]
  • 15. Lazy Optimizer: Gather step. [Diagram adds four S labels]
  • 16. Lazy Optimizer: Aggregate step. [Diagram adds U and Driver]
  • 17. Lazy Optimizer: Plan step. [Diagram unchanged]
  • 18. Lazy Optimizer: Execute step. [Diagram adds AB]
  • 19. Lazy Optimizer: Gather step. [Diagram adds another S label]
  • 20. Lazy Optimizer: Plan step. [Diagram unchanged]
  • 21. Lazy Optimizer: Execute step. [Diagram adds ABC]
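The Execute-Gather-Aggregate-Plan cycle walked through in slides 13 to 21 can be imitated in plain Python. This is a minimal sketch under stated assumptions, not the RIOS implementation: relations are in-memory lists of partitions, all inputs share one join key, the only statistic is the record count, and the greedy rule (join the two smallest relations currently available) is a stand-in chosen for illustration. The function names `gather`, `aggregate`, `hash_join`, and `egap_join` are hypothetical.

```python
def gather(partition):
    """Gather: statistics computed locally on one partition."""
    return {"records": len(partition)}


def aggregate(per_partition):
    """Aggregate: merge partition-level statistics (on the driver, in Spark terms)."""
    return {"records": sum(s["records"] for s in per_partition)}


def hash_join(left, right, key):
    """Simple in-memory equi-join on `key`."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def egap_join(inputs, key):
    """inputs: name -> list of partitions, each partition a list of dict rows."""
    pool = {}
    for name, partitions in inputs.items():
        rows = [row for part in partitions for row in part]        # Execute
        stats = aggregate([gather(part) for part in partitions])   # Gather + Aggregate
        pool[name] = (rows, stats)
    steps = []
    while len(pool) > 1:
        # Plan: greedily pick the two smallest relations seen so far.
        a, b = sorted(pool, key=lambda n: pool[n][1]["records"])[:2]
        (rows_a, _), (rows_b, _) = pool.pop(a), pool.pop(b)
        joined = hash_join(rows_a, rows_b, key)                     # Execute
        name = f"({a} JOIN {b})"
        pool[name] = (joined, aggregate([gather(joined)]))          # Gather + Aggregate
        steps.append(name)
    result, _ = next(iter(pool.values()))
    return steps, result


inputs = {
    "A": [[{"k": 1, "a": "a1"}], [{"k": 2, "a": "a2"}]],
    "B": [[{"k": 1, "b": "b1"}, {"k": 2, "b": "b2"}],
          [{"k": 3, "b": "b3"}, {"k": 4, "b": "b4"}]],
    "C": [[{"k": 1, "c": "c1"}, {"k": 2, "c": "c2"}, {"k": 3, "c": "c3"}]],
}
steps, result = egap_join(inputs, "k")
print(steps, len(result))
```

Because each intermediate result goes back into the pool with freshly gathered statistics, every planning decision uses sizes that were actually observed rather than estimated up front.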
  • 22. Lazy Optimizer: Wrong Guess. [Diagram; assumption: A < C, but σ(A) > σ(C)]
  • 23. Lazy Optimizer: Wrong Guess. [Diagram adds nodes A, B, C and four S labels]
  • 24. Lazy Optimizer: Wrong Guess. [Diagram adds B; Repartition step]
  • 25. Lazy Optimizer: Wrong Guess. [Diagram without the Repartition step label]
  • 26. Lazy Optimizer: Wrong Guess. [Diagram adds BC and an S label]
  • 27. Lazy Optimizer: Wrong Guess. [Diagram unchanged]
  • 28. Lazy Optimizer: Wrong Guess. [Diagram adds ABC]
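One way to read the wrong-guess slides is as a comparison between the choice made from raw dataset sizes and the choice made once post-filter sizes are observed. The sketch below follows that reading only; the sizes are invented, and `first_join_partner` and `revise` are hypothetical helper names rather than anything from RIOS.

```python
def first_join_partner(sizes, candidates):
    """Greedy choice: join with the candidate that currently looks smallest."""
    return min(candidates, key=lambda name: sizes[name])


def revise(raw_sizes, observed_sizes, candidates):
    """Return (initial guess, revised choice, whether a repartition is needed)."""
    guess = first_join_partner(raw_sizes, candidates)
    actual = first_join_partner(observed_sizes, candidates)
    return guess, actual, guess != actual


raw = {"A": 100, "C": 500}      # raw dataset sizes suggest A < C
observed = {"A": 80, "C": 5}    # after the filters, sigma(A) > sigma(C)
print(revise(raw, observed, ["A", "C"]))
```

When the guess and the revised choice differ, the relation that was prepared for the wrong partner has to be repartitioned, which corresponds to the Repartition step shown on slide 24.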
  • 29. Runtime Integrated Optimizer for Spark (RIOS). Spark's batch execution model allows late binding of joins. Set of statistics: ◦ Join estimations (based on sampling or sketches) ◦ Number of records ◦ Average size of each record. Statistics are aggregated using a Spark job or accumulators. Join implementations are picked based on thresholds.
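To make the statistics on slide 29 concrete, here is a hedged sketch of how a record count and an average record size could be merged across partitions and then compared against a threshold to pick a join implementation. The `Stats` class, the `choose_join` function, and the 10 MB threshold are assumptions for illustration; the value mirrors the default of Spark SQL's `spark.sql.autoBroadcastJoinThreshold`, but the slides do not state which thresholds RIOS uses.

```python
from dataclasses import dataclass

# Assumed threshold of 10 MB; the slides do not give the values RIOS uses.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


@dataclass(frozen=True)
class Stats:
    records: int             # number of records
    avg_record_bytes: float  # average size of each record

    @property
    def total_bytes(self) -> float:
        return self.records * self.avg_record_bytes

    def merge(self, other: "Stats") -> "Stats":
        """Combine two partitions' statistics (weighted average of record size)."""
        records = self.records + other.records
        if records == 0:
            return Stats(0, 0.0)
        return Stats(records, (self.total_bytes + other.total_bytes) / records)


def choose_join(left: Stats, right: Stats,
                threshold: float = BROADCAST_THRESHOLD_BYTES) -> str:
    """Pick a broadcast join when the smaller side fits under the threshold."""
    smaller = min(left.total_bytes, right.total_bytes)
    return "broadcast" if smaller <= threshold else "shuffle"


small = Stats(1000, 100.0).merge(Stats(3000, 200.0))
large = Stats(10_000_000, 200.0)
print(small, choose_join(small, large), choose_join(large, large))
```

Keeping the merge associative means partial results can be combined in any order, which suits accumulator-style aggregation.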
  • 30. Challenges and Optimizations. Execute: block and revise execution plans without wasting computation. Gather: asynchronous generation of statistics. Aggregate: efficient accumulation of statistics. Plan: try to schedule as many broadcast joins as possible.
  • 31. Experiments. Q1: Is RIOS able to generate good query plans? Q2: How does RIOS perform compared to regular Spark and pilot runs? Q3: How expensive are wrong guesses?
  • 32. Minibenchmark with 3 Fact Tables. [Figure: bar chart of execution time in seconds (log scale, 16 to 16384) versus Scale Factor (1, 10, 100, 1000); series: spark good-order, RIOS R2R, RIOS W2R, spark wrong-order, pilot-run. Bar labels as extracted: 40, 66, 115, 997, 41, 61, 111, 868, 45, 66, 123, 1140, 136, 4194, unable to finish in 5+ hours, unable to finish in 5+ hours, 143, 4230, unable to finish in 5+ hours, unable to finish in 5+ hours.]
  • 33. Minibenchmark with 3 Fact Tables. [Same figure as slide 32.] Q1: RIOS is able to avoid bad plans.
  • 34. Minibenchmark with 3 Fact Tables. [Same figure as slide 32.] Q2: RIOS is always faster than the pilot-run approach.
  • 35. Minibenchmark with 3 Fact Tables. [Same figure as slide 32.] Q3: Bad guesses cost around 15% in the worst case.
  • 36. TPC-DS and TPC-H Queries. [Figure: four bar-chart panels (Query 17, Query 50, Query 28, Query 9) of execution time in seconds (log scale, 16 to 8192) versus Scale Factor (1, 10, 100, 1000); series: spark good-order, RIOS, pilot-run, spark bad-order. Bar labels as extracted: 38, 49, 140, 2298, 38, 55, 87, 1185, 37, 41, 107, 617, 55, 80, 326, 8511, 37, 41, 67, 899, 38, 41, 54, 843, 34, 35, 39, 464, 46, 47, 137, 7250, 40, 43, 69, 930, 41, 52, 70, 1128, 37, 38, 42, 490, 50, 50, 215, 7831, 45, 55, 198, 3898, 47, 105, 291, 6069, 37, 44, 109, 712, 70, 153, 633, unable to finish in 5+ hours.]
  • 37. TPC-DS and TPC-H Queries. [Same figure as slide 36.] Q1: RIOS generates optimal plans.
  • 38. TPC-DS and TPC-H Queries. [Same figure as slide 36.] Q2: RIOS is always the fastest approach.
  • 39. Conclusions. RIOS: a cost-based query optimizer for Spark. Statistics are gathered at runtime (no need for initial statistics or pilot runs). Late binding of joins. Up to 2x faster than the best left-deep plans (Spark), and > 100x faster than previous approaches for fact table joins.
  • 40. Future Work. More flexible shuffle operations: ◦ Efficient switch from shuffle-based joins to broadcast joins ◦ Allow records to be partitioned in different ways. Take into consideration interesting orders and partitions. Add aggregation and additional statistics (I/O and network cost).
  • 42. Experiment Configuration. Datasets: ◦ TPC-DS ◦ TPC-H. Configuration: ◦ 16 machines, each with 4 cores (2 hyper-threads per core), 32 GB of RAM, and a 1 TB disk ◦ Spark 1.6.3 ◦ Scale factor from 1 to 1000 (~1 TB)