Lazy Join Optimizations without
Upfront Statistics
MATTEO INTERLANDI
Cloud Computing Programs
Open Source Data-Intensive Scalable Computing (DISC)
Platforms: Hadoop MapReduce and Spark
◦ functional API
◦ map and reduce User-Defined Functions
◦ RDD transformations (filter, flatMap, zipPartitions, etc.)
Several years later, introduction of high-level SQL-like
declarative query languages (and systems)
◦ Conciseness
◦ Pick a physical execution plan from a number of alternatives
Query Optimization
Two-step process
◦ Logical optimizations (e.g., filter pushdown)
◦ Physical optimizations (e.g., join order and implementation)
Physical optimizer in RDBMS:
◦ Cost-based
◦ Data statistics (e.g., predicate selectivities, cost of data access, etc.)
The role of the cost-based optimizer is to
(1) enumerate some set of equivalent plans
(2) estimate the cost of each plan
(3) select a sufficiently good plan
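The three roles of a cost-based optimizer (enumerate, estimate, select) can be sketched as follows. This is a minimal illustration, not any particular system's optimizer: the table sizes, the selectivities, and the cost model (sum of estimated intermediate result sizes over left-deep plans) are all assumptions made up for the example.

```python
from itertools import permutations

# Hypothetical statistics (illustrative values only).
ROWS = {"A": 1_000, "B": 50_000, "C": 200}
SELECTIVITY = {
    frozenset("AB"): 0.001,
    frozenset("BC"): 0.01,
    frozenset("AC"): 0.05,
}


def join_selectivity(joined, new_table):
    """Combine the selectivities of all predicates linking new_table to the joined set."""
    sel = 1.0
    for t in joined:
        sel *= SELECTIVITY.get(frozenset((t, new_table)), 1.0)
    return sel


def plan_cost(order):
    """Assumed cost model: sum of estimated intermediate sizes of a left-deep plan."""
    joined = [order[0]]
    rows = ROWS[order[0]]
    cost = 0.0
    for table in order[1:]:
        rows = rows * ROWS[table] * join_selectivity(joined, table)
        joined.append(table)
        cost += rows
    return cost


def best_plan(tables):
    candidates = list(permutations(tables))  # (1) enumerate equivalent plans
    return min(candidates, key=plan_cost)    # (2) estimate each, (3) select the cheapest


if __name__ == "__main__":
    print(best_plan(("A", "B", "C")))
```

Exhaustive enumeration is only feasible for a handful of relations; it is used here purely to keep the three steps visible.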
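Query Optimization: Why Important?
[Figure: bar chart of execution time in seconds (log scale, 0.25 to 16384) at scale factor 10 for Spark, AsterixDB, Hive, and Pig, with bars labeled W and R. Bar labels in extraction order: 1082, 70, 343, 21, unable to finish in 5+ hours, 276, 15102, 954.]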
Bad plans over Big Data can be disastrous!
Cost-based Optimizer in DISC
No cost-based join enumeration
◦ Rely on order of relations in FROM clause
◦ Left-deep plans
No upfront statistics:
◦ Data often sits in HDFS and is unstructured
Even if input statistics are available:
◦ Correlations between predicates
◦ Exponential error propagation in joins
◦ Arbitrary UDFs
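A small worked example of why estimation errors propagate exponentially through joins. The sketch assumes a chain of joins in which every selectivity estimate is off by the same factor; the table size, selectivity, and error factor are hypothetical values chosen for illustration.

```python
def chain_cardinality(base_rows, joins):
    """Estimated output rows of a join chain: |R0| * prod(|Ri| * selectivity_i)."""
    rows = float(base_rows)
    for table_rows, selectivity in joins:
        rows *= table_rows * selectivity
    return rows


def misestimation_factor(num_joins, per_join_error=2.0,
                         table_rows=1_000, true_selectivity=0.001):
    """Ratio between the guessed and the true cardinality after num_joins joins."""
    true_joins = [(table_rows, true_selectivity)] * num_joins
    guessed_joins = [(table_rows, true_selectivity * per_join_error)] * num_joins
    return (chain_cardinality(table_rows, guessed_joins)
            / chain_cardinality(table_rows, true_joins))


if __name__ == "__main__":
    for k in (1, 3, 5):
        # A 2x error per join compounds to roughly 2**k after k joins.
        print(k, round(misestimation_factor(k), 3))
```

Under these assumptions a modest 2x error on each selectivity becomes a 32x error in the estimated size after five joins, which is enough to flip the choice of join order or join implementation.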
Cost-based Optimizer in DISC: State of the Art
Bad statistics
◦ Adaptive Query planning
◦ RoPe [NSDI 12, VLDB 2013]
No upfront statistics
◦ Pilot runs (samples)
◦ DynO [SIGMOD 2014]
Assumption is that some initial
statistics exist
• Samples are expensive
• Only foreign-key joins
• No runtime plan revision
Lazy Cost-based Optimizer for Spark
Key idea: interleave query planning and execution
◦ Query plans are lazily executed
◦ Statistics are gathered at runtime
◦ Joins are greedily scheduled
◦ Next join can be dynamically changed if a bad decision was made
◦ Execute-Gather-Aggregate-Plan strategy (EGAP)
Neither upfront statistics nor pilot runs are required
◦ Raw dataset size for initial guess
Support for non-foreign-key joins
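The interleaving of planning and execution can be sketched as a loop. This is a simplified illustration under stated assumptions, not the talk's implementation: the greedy rule (join the two relations currently believed to be smallest), the function name `lazy_join_order`, and the `execute_join` callback (which stands in for the Execute, Gather, and Aggregate steps and returns an observed result size) are all hypothetical. The sketch also ignores the join graph, so it does not avoid cross products.

```python
def lazy_join_order(raw_sizes, execute_join):
    """Greedy, interleaved planning: plan one join, run it, measure, repeat.

    raw_sizes:    relation name -> initial size guess (e.g. raw dataset size)
    execute_join: callable(left, right) -> observed size of the join result
    """
    stats = dict(raw_sizes)  # current best knowledge about each relation
    executed = []
    while len(stats) > 1:
        # Plan: pick the two relations currently believed to be smallest.
        left, right = sorted(stats, key=stats.get)[:2]
        # Execute + Gather + Aggregate: run the join and record what was observed.
        observed = execute_join(left, right)
        executed.append((left, right))
        del stats[left], stats[right]
        # The observed size replaces any guess, so the next Plan step uses runtime data.
        stats[left + right] = observed
    return executed


if __name__ == "__main__":
    # Hypothetical observed sizes for the intermediate results.
    observed_sizes = {("A", "C"): 20, ("AC", "B"): 5}
    order = lazy_join_order({"A": 10, "B": 1000, "C": 50},
                            lambda l, r: observed_sizes[(l, r)])
    print(order)
```

Only the first decision relies on raw sizes; every later decision is made against sizes measured during execution, which is what makes a bad early guess correctable.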
Lazy Optimizer: an Example
Assumption: A < C
[Diagram sequence over relations A, B, and C, one slide per step, in this order: Execute, Gather, Aggregate, Plan, Execute, Gather, Plan, Execute. Nodes labeled S and U appear from the first Gather step on, and the Aggregate step shows a Driver node. The label AB first appears in the second Execute step and ABC in the final Execute step.]
Lazy Optimizer: Wrong Guess
Assumption: A < C
σ(A) > σ(C)
[Diagram sequence over B(A), A, and σ(C), one slide per build: the S nodes appear, a "Repartition step" is shown, the label BC then appears, and ABC appears in the final slide.]
Runtime Integrated Optimizer for Spark
Spark batch execution model allows late binding of joins
Set of Statistics:
◦ Join estimations (based on sampling or sketches)
◦ Number of records
◦ Average size of each record
Statistics are aggregated using a Spark job or accumulators
Join implementations are picked based on thresholds
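Picking a join implementation from thresholds over the statistics listed above (number of records and average record size) might look like the sketch below. The class and function names, the two join labels, and the 10 MB threshold are illustrative assumptions, not values taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationStats:
    num_records: int
    avg_record_bytes: float

    @property
    def size_bytes(self) -> float:
        # Estimated size = number of records * average record size.
        return self.num_records * self.avg_record_bytes


# Hypothetical threshold: relations at or below this size are broadcast.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def pick_join(left: RelationStats, right: RelationStats,
              threshold: float = BROADCAST_THRESHOLD_BYTES) -> str:
    """Return 'broadcast' if the smaller side fits under the threshold, else 'shuffle'."""
    smaller = min(left.size_bytes, right.size_bytes)
    return "broadcast" if smaller <= threshold else "shuffle"


if __name__ == "__main__":
    small = RelationStats(num_records=1_000, avg_record_bytes=100.0)
    big = RelationStats(num_records=1_000_000_000, avg_record_bytes=100.0)
    print(pick_join(small, big))  # smaller side is about 100 KB
    print(pick_join(big, big))    # both sides are about 100 GB
```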
Challenges and Optimizations
Execute - Block and revise execution plans without wasting
computation
Gather - Asynchronous generation of statistics
Aggregate - Efficient accumulation of statistics
Plan - Try to schedule as many broadcast joins as possible
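The Gather and Aggregate steps can be illustrated with per-partition statistics that are merged associatively, in the way a counter-style accumulator would combine partial results. This is a plain-Python sketch with hypothetical function names; it does not use or reproduce Spark's accumulator API, and partitions are modeled as lists of byte strings.

```python
from functools import reduce


def partition_stats(records):
    """Gather: one pass over a partition, returning (record count, total bytes)."""
    count = 0
    total_bytes = 0
    for record in records:
        count += 1
        total_bytes += len(record)
    return (count, total_bytes)


def merge(a, b):
    """Aggregate: associative and commutative, so partial results merge in any order."""
    return (a[0] + b[0], a[1] + b[1])


def aggregate(partitions):
    """Return (number of records, average record size in bytes) over all partitions."""
    count, total_bytes = reduce(merge, map(partition_stats, partitions), (0, 0))
    avg = total_bytes / count if count else 0.0
    return count, avg


if __name__ == "__main__":
    print(aggregate([[b"ab", b"cdef"], [b"xyz"], []]))
```

Because `merge` only adds small tuples, the amount of data sent to the driver is independent of the partition size.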
Experiments
Q1: Is RIOS able to generate good query plans?
Q2: How does RIOS perform compared to regular
Spark and pilot runs?
Q3: How expensive are wrong guesses?
Minibenchmark with 3 Fact Tables
[Figure: execution time in seconds (log scale, 16 to 16384) versus scale factor (1, 10, 100, 1000) for five series: spark good-order, RIOS R2R, RIOS W2R, spark wrong-order, pilot-run. Data labels in extraction order: 40, 66, 115, 997, 41, 61, 111, 868, 45, 66, 123, 1140, 136, 4194, unable to finish in 5+ hours (twice), 143, 4230, unable to finish in 5+ hours (twice).]
Q1: RIOS is able to avoid bad plans
Q2: RIOS is always faster than the pilot-run approach
Q3: Bad guesses cost around 15% in the worst case
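The "around 15%" figure can be checked against the chart labels under one assumption: that the data labels were extracted series by series in legend order, so that "spark good-order" reads 40, 66, 115, 997 and "RIOS W2R" reads 45, 66, 123, 1140 at scale factors 1, 10, 100, 1000. That mapping is an inference from the extraction order, not something the slides state, and the baseline (Spark with the good join order) is likewise assumed.

```python
SCALE_FACTORS = [1, 10, 100, 1000]

# Assumed mapping of chart labels to series (see the note above).
SPARK_GOOD_ORDER_S = [40, 66, 115, 997]
RIOS_W2R_S = [45, 66, 123, 1140]

# Relative overhead of the wrong-guess run over the good-order baseline.
overheads = [w / g - 1 for g, w in zip(SPARK_GOOD_ORDER_S, RIOS_W2R_S)]
worst = max(overheads)

if __name__ == "__main__":
    for sf, o in zip(SCALE_FACTORS, overheads):
        print(f"scale factor {sf}: {o:.1%}")
    print(f"worst case: {worst:.1%}")
```

Under that reading the worst-case overhead is about 14%, at scale factor 1000, which is consistent with the slide's claim.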
TPCDS and TPCH Queries
[Figure: execution time in seconds (log scale, 16 to 8192) versus scale factor (1, 10, 100, 1000) for Query 17, Query 50, Query 28, and Query 9, with four series: spark good-order, RIOS, pilot-run, spark bad-order. Data labels in extraction order: 38, 49, 140, 2298, 38, 55, 87, 1185, 37, 41, 107, 617, 55, 80, 326, 8511, 37, 41, 67, 899, 38, 41, 54, 843, 34, 35, 39, 464, 46, 47, 137, 7250, 40, 43, 69, 930, 41, 52, 70, 1128, 37, 38, 42, 490, 50, 50, 215, 7831, 45, 55, 198, 3898, 47, 105, 291, 6069, 37, 44, 109, 712, 70, 153, 633, unable to finish in 5+ hours.]
Q1: RIOS generates optimal plans
Q2: RIOS is always the fastest approach
Conclusions
RIOS: cost-based query optimizer for Spark
Statistics are gathered at runtime (no need for initial
statistics or pilot runs)
Late binding of joins
Up to 2x faster than the best left-deep plans (Spark), and more
than 100x faster than previous approaches for fact table joins.
Future Work
More flexible shuffle operations:
◦ Efficient switch from shuffle-based joins to broadcast joins
◦ Allow records to be partitioned in different ways
Take into consideration interesting orders and partitions
Add aggregation and additional statistics (IO and network
cost)
Thank you
Experiment Configuration
๏ Datasets:
• TPCDS
• TPCH
๏ Configuration:
• 16 machines, each with 4 cores (2 hyper-threads per core), 32GB of RAM, 1TB disk
• Spark 1.6.3
• Scale factor from 1 to 1000 (~1TB)

Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfNicoChristianSunaryo
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfWorld Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfsimulationsindia
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...Dr Arash Najmaei ( Phd., MBA, BSc)
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfnikeshsingh56
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
Non Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfNon Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfPratikPatil591646
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etclalithasri22
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
knowledge representation in artificial intelligence
knowledge representation in artificial intelligenceknowledge representation in artificial intelligence
knowledge representation in artificial intelligencePriyadharshiniG41
 

Recently uploaded (20)

IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
Digital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdfDigital Indonesia Report 2024 by We Are Social .pdf
Digital Indonesia Report 2024 by We Are Social .pdf
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdfWorld Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
World Economic Forum Metaverse Ecosystem By Utpal Chakraborty.pdf
 
2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use2023 Survey Shows Dip in High School E-Cigarette Use
2023 Survey Shows Dip in High School E-Cigarette Use
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Statistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdfStatistics For Management by Richard I. Levin 8ed.pdf
Statistics For Management by Richard I. Levin 8ed.pdf
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
Non Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdfNon Text Magic Studio Magic Design for Presentations L&P.pdf
Non Text Magic Studio Magic Design for Presentations L&P.pdf
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
DATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etcDATA ANALYSIS using various data sets like shoping data set etc
DATA ANALYSIS using various data sets like shoping data set etc
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
knowledge representation in artificial intelligence
knowledge representation in artificial intelligenceknowledge representation in artificial intelligence
knowledge representation in artificial intelligence
 

Lazy Join Optimizations Without Upfront Statistics with Matteo Interlandi

  • 1. Lazy Join Optimizations without Upfront Statistics MATTEO INTERLANDI
  • 2. Cloud Computing Programs
      Open Source Data-Intensive Scalable Computing (DISC) platforms: Hadoop MapReduce and Spark
        ◦ functional API
        ◦ map and reduce User-Defined Functions
        ◦ RDD transformations (filter, flatMap, zipPartitions, etc.)
      Several years later, introduction of high-level SQL-like declarative query languages (and systems)
        ◦ Conciseness
        ◦ Pick a physical execution plan from a number of alternatives
  • 3. Query Optimization
      Two-step process
        ◦ Logical optimizations (e.g., filter pushdown)
        ◦ Physical optimizations (e.g., join order and implementation)
      Physical optimizer in RDBMS:
        ◦ Cost-based
        ◦ Data statistics (e.g., predicate selectivities, cost of data access, etc.)
      The role of the cost-based optimizer is to
        (1) enumerate some set of equivalent plans
        (2) estimate the cost of each plan
        (3) select a sufficiently good plan
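The three roles of a cost-based optimizer listed on this slide can be illustrated with a toy sketch. Everything below is an assumption made for illustration and does not come from the talk: the table names, the cardinalities, the selectivities, and the cost model (sum of intermediate result sizes over left-deep plans).

```python
from itertools import permutations

# Hypothetical cardinalities and join selectivities, invented for illustration.
CARDINALITY = {"A": 1_000, "B": 1_000_000, "C": 50_000}
SELECTIVITY = {
    frozenset({"A", "B"}): 1e-6,
    frozenset({"B", "C"}): 1e-5,
    frozenset({"A", "C"}): 1.0,  # no join predicate between A and C: cross product
}


def plan_cost(order):
    """Toy cost of a left-deep plan: the sum of its intermediate result sizes."""
    joined = {order[0]}
    rows = CARDINALITY[order[0]]
    cost = 0.0
    for table in order[1:]:
        # Combine the selectivities of all predicates linking the new table
        # to the tables already joined (independence is assumed).
        sel = 1.0
        for other in joined:
            sel *= SELECTIVITY[frozenset({other, table})]
        rows = rows * CARDINALITY[table] * sel
        cost += rows
        joined.add(table)
    return cost


def choose_plan(tables):
    plans = list(permutations(tables))           # (1) enumerate equivalent plans
    costed = [(plan_cost(p), p) for p in plans]  # (2) estimate the cost of each plan
    return min(costed)                           # (3) select the cheapest one


if __name__ == "__main__":
    cost, plan = choose_plan(["A", "B", "C"])
    print(plan, cost)
```

With these made-up numbers, plans that join A and B first are several orders of magnitude cheaper than plans that start with the A and C cross product, which is the kind of gap a cost-based optimizer is meant to detect.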
  • 4. Query Optimization: Why Important? [Bar chart: time in seconds (log scale) at scale factor 10 for Spark, AsterixDB, Hive, and Pig, with W and R bars for each system; one run was unable to finish in 5+ hours.]
  • 5. Query Optimization: Why Important? [Same chart.] Bad plans over Big Data can be disastrous!
  • 6. Cost-based Optimizer in DISC
      No cost-based join enumeration
        ◦ Rely on order of relations in FROM clause
        ◦ Left-deep plans
      No upfront statistics:
        ◦ Often data sits in HDFS and is unstructured
      Even if input statistics are available:
        ◦ Correlations between predicates
        ◦ Exponential error propagation in joins
        ◦ Arbitrary UDFs
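The point about exponential error propagation can be made concrete with a small calculation. This is a simplified sketch under stated assumptions, not something taken from the slides: every join's selectivity estimate is off by the same multiplicative factor, and the errors compound multiplicatively along the plan. The function name is hypothetical.

```python
def worst_case_error(per_join_error: float, num_joins: int) -> float:
    """Multiplicative error of the final cardinality estimate.

    Assumes each join's selectivity estimate is off by `per_join_error`
    (for example 2.0 means "off by a factor of two") and that the errors
    multiply along a chain of `num_joins` joins.
    """
    return per_join_error ** num_joins


if __name__ == "__main__":
    for joins in (1, 3, 5):
        print(joins, worst_case_error(2.0, joins))
```

Under these assumptions, a factor-of-two error per join grows to a factor of 32 after five joins, which is why reasonable per-table statistics can still lead to a poor join order.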
  • 7. Cost-based Optimizer in DISC: State of the Art
      Bad statistics
        ◦ Adaptive query planning
        ◦ RoPe [NSDI 12, VLDB 2013]
      No upfront statistics
        ◦ Pilot runs (samples)
        ◦ DynO [SIGMOD 2014]
  • 11. Cost-based Optimizer in DISC: State of the Art (limitations)
      Bad statistics (adaptive query planning, RoPe): the assumption is that some initial statistics exist
      No upfront statistics (pilot runs, DynO):
        ◦ Samples are expensive
        ◦ Only foreign-key joins
        ◦ No runtime plan revision
  • 12. Lazy Cost-based Optimizer for Spark
      Key idea: interleave query planning and execution
        ◦ Query plans are lazily executed
        ◦ Statistics are gathered at runtime
        ◦ Joins are greedily scheduled
        ◦ The next join can be dynamically changed if a bad decision was made
        ◦ Execute-Gather-Aggregate-Plan strategy (EGAP)
      Neither upfront statistics nor pilot runs are required
        ◦ Raw dataset size for initial guess
      Support for non-foreign-key joins
  • 13. Lazy Optimizer: an Example [Diagram over relations A, B, and C; assumption: A < C. Slides 14 to 21 step through the same diagram.]
  • 14. Lazy Optimizer: Execute step
  • 15. Lazy Optimizer: Gather step
  • 16. Lazy Optimizer: Aggregate step (at the driver)
  • 17. Lazy Optimizer: Plan step
  • 18. Lazy Optimizer: Execute step (the diagram now shows AB)
  • 19. Lazy Optimizer: Gather step
  • 20. Lazy Optimizer: Plan step
  • 21. Lazy Optimizer: Execute step (the diagram now shows ABC)
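The Execute-Gather-Aggregate-Plan cycle stepped through in slides 13 to 21 can be sketched in plain Python, without Spark. This is a toy model under loud assumptions: relations are in-memory lists of (key, payload) rows sharing one join key, the only statistic is the row count, and the greedy rule (join the two relations that currently have the fewest rows) is an illustrative choice, not necessarily the rule used by the system in the talk. The initial guess from raw dataset sizes is omitted, and all names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[int, str]  # (join key, payload): a toy schema assumed for illustration


def hash_join(left: List[Row], right: List[Row]) -> List[Row]:
    """Equi-join on the key, concatenating the payloads of matching rows."""
    index: Dict[int, List[str]] = {}
    for key, payload in right:
        index.setdefault(key, []).append(payload)
    return [(key, lp + rp) for key, lp in left for rp in index.get(key, [])]


def egap(inputs: Dict[str, Callable[[], List[Row]]]):
    # Execute: lazily materialize each input (for example a scan plus a filter).
    materialized = {name: load() for name, load in inputs.items()}
    order = []
    while len(materialized) > 1:
        # Gather + Aggregate: collect runtime statistics (here only row counts).
        stats = {name: len(rows) for name, rows in materialized.items()}
        # Plan: greedily pick the two smallest relations as the next join.
        a, b = sorted(stats, key=lambda n: (stats[n], n))[:2]
        order.append((a, b))
        # Execute: run the chosen join; its result becomes a new relation.
        joined = hash_join(materialized.pop(a), materialized.pop(b))
        materialized[f"({a} JOIN {b})"] = joined
    (result,) = materialized.values()
    return result, order


if __name__ == "__main__":
    inputs = {
        "A": lambda: [(k, "a") for k in range(100)],
        "B": lambda: [(k, "b") for k in range(1000)],
        # C is large on disk but small after its filter is applied.
        "C": lambda: [(k, "c") for k in range(5000) if k % 500 == 0],
    }
    print(egap(inputs))
```

Because the plan is chosen only after the filtered inputs have been measured, the sketch joins the filtered C first even though its raw size is the largest, which mirrors the idea of revising a guess based on statistics observed at runtime.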
  • 22. Lazy Optimizer: Wrong Guess [Diagram over relations A, B, and C with selections σ; assumption: A < C, but σ(A) > σ(C). Slides 23 to 28 step through the same diagram.]
  • 24. Lazy Optimizer: Wrong Guess (repartition step)
  • 26. Lazy Optimizer: Wrong Guess (the diagram now shows BC)
  • 28. Lazy Optimizer: Wrong Guess (the diagram now shows ABC)
  • 29. Runtime Integrated Optimizer for Spark
      Spark batch execution model allows late binding of joins
      Set of statistics:
        ◦ Join estimations (based on sampling or sketches)
        ◦ Number of records
        ◦ Average size of each record
      Statistics are aggregated using a Spark job or accumulators
      Join implementations are picked based on thresholds
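A minimal sketch of how per-partition statistics might be merged and then used to pick a join implementation by threshold. It is plain Python rather than Spark code, and the class, the function names, and the 10 MB threshold are assumptions chosen for illustration; the slides do not state the actual thresholds.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RelationStats:
    num_records: int
    avg_record_bytes: float

    @property
    def size_bytes(self) -> float:
        return self.num_records * self.avg_record_bytes


# Illustrative threshold only; the value used by the system is not given.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def merge_stats(parts: Iterable[RelationStats]) -> RelationStats:
    """Aggregate step: combine per-partition statistics into one summary."""
    parts = list(parts)
    total_records = sum(p.num_records for p in parts)
    total_bytes = sum(p.size_bytes for p in parts)
    avg = total_bytes / total_records if total_records else 0.0
    return RelationStats(total_records, avg)


def pick_join(left: RelationStats, right: RelationStats) -> str:
    """Plan step: broadcast the smaller side if it fits under the threshold."""
    smaller = min(left.size_bytes, right.size_bytes)
    return "broadcast" if smaller <= BROADCAST_THRESHOLD_BYTES else "shuffle"


if __name__ == "__main__":
    dim = merge_stats([RelationStats(600, 100.0), RelationStats(400, 100.0)])
    fact = RelationStats(10_000_000, 200.0)
    print(pick_join(dim, fact))
```

The record count and average record size are the two statistics named on the slide; multiplying them gives the size estimate that the threshold is compared against.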
  • 30. Challenges and Optimizations
        ◦ Execute: block and revise execution plans without wasting computation
        ◦ Gather: asynchronous generation of statistics
        ◦ Aggregate: efficient accumulation of statistics
        ◦ Plan: try to schedule as many broadcast joins as possible
  • 31. Experiments
        ◦ Q1: Is RIOS able to generate good query plans?
        ◦ Q2: How does the performance of RIOS compare to regular Spark and pilot runs?
        ◦ Q3: How expensive are wrong guesses?
  • 32. Minibenchmark with 3 Fact Tables [Bar chart: time in seconds (log scale) against scale factors 1, 10, 100, and 1000 for spark good-order, RIOS R2R, RIOS W2R, spark wrong-order, and pilot-run; some runs were unable to finish in 5+ hours.]
  • 33. Minibenchmark with 3 Fact Tables. Q1: RIOS is able to avoid bad plans
  • 34. Minibenchmark with 3 Fact Tables. Q2: RIOS is always faster than the pilot-run approach
  • 35. Minibenchmark with 3 Fact Tables. Q3: Bad guesses cost around 15% in the worst case
  • 36. TPCDS and TPCH Queries [Bar charts, one panel each for Query 17, Query 50, Query 28, and Query 9: time in seconds (log scale) against scale factors 1, 10, 100, and 1000 for spark good-order, RIOS, pilot-run, and spark bad-order; one run was unable to finish in 5+ hours.]
  • 37. TPCDS and TPCH Queries. Q1: RIOS generates optimal plans
  • 38. TPCDS and TPCH Queries. Q2: RIOS is always the faster approach
  • 39. Conclusions
        ◦ RIOS: cost-based query optimizer for Spark
        ◦ Statistics are gathered at runtime (no need for initial statistics or pilot runs)
        ◦ Late binding of joins
        ◦ Up to 2x faster than the best left-deep plans (Spark), and more than 100x faster than previous approaches for fact table joins
  • 40. Future Work
      More flexible shuffle operations:
        ◦ Efficient switch from shuffle-based joins to broadcast joins
        ◦ Allow records to be partitioned in different ways
      Take into consideration interesting orders and partitions
      Add aggregation and additional statistics (IO and network cost)
  • 42. Experiment Configuration
      Datasets:
        ◦ TPCDS
        ◦ TPCH
      Configuration:
        ◦ 16 machines, each with 4 cores (2 hyper-threads per core), 32 GB of RAM, and a 1 TB disk
        ◦ Spark 1.6.3
        ◦ Scale factor from 1 to 1000 (~1 TB)