Lazy Join Optimizations without Upfront Statistics
MATTEO INTERLANDI
Cloud Computing Programs
Open Source Data-Intensive Scalable Computing (DISC) platforms: Hadoop MapReduce and Spark
◦ functional API
◦ map and reduce User-Defined Functions
◦ RDD transformations (filter, flatMap, zipPartitions, etc.)
Several years later: introduction of high-level, SQL-like declarative query languages (and systems)
◦ Conciseness
◦ Pick a physical execution plan from a number of alternatives
Query Optimization
Two-step process:
◦ Logical optimizations (e.g., filter pushdown)
◦ Physical optimizations (e.g., join order and implementation)
Physical optimizer in an RDBMS:
◦ Cost-based
◦ Data statistics (e.g., predicate selectivities, cost of data access, etc.)
The role of the cost-based optimizer is to
(1) enumerate some set of equivalent plans
(2) estimate the cost of each plan
(3) select a sufficiently good plan
(A minimal sketch of this loop follows below.)
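To make the three steps concrete, here is a minimal, hypothetical sketch in Scala; Plan, enumerate, and estimate are illustrative placeholders, not Spark or RIOS APIs.

    // Hypothetical sketch of a cost-based optimizer's core loop.
    case class Plan(joinOrder: List[String], estimatedCost: Double)

    def enumerate(query: String): Seq[Plan] = Seq.empty       // (1) enumerate equivalent plans
    def estimate(plan: Plan): Double = plan.estimatedCost     // (2) estimate the cost of each plan

    def choosePlan(query: String): Option[Plan] = {
      val candidates = enumerate(query)
      if (candidates.isEmpty) None
      else Some(candidates.minBy(estimate))                   // (3) select a sufficiently good plan
    }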
Query Optimization: Why Important?
[Figure: write (W) and read (R) running times in seconds (log scale) at scale factor 10 for Spark, AsterixDB, Hive, and Pig; reported times range from about 21 s to over 15,000 s, and one run was unable to finish in 5+ hours.]
Bad plans over Big Data can be disastrous!
Cost-based Optimizer in DISC
No cost-based join enumeration:
◦ Rely on the order of relations in the FROM clause
◦ Left-deep plans
No upfront statistics:
◦ Data often sits in HDFS and is unstructured
Even if input statistics are available:
◦ Correlations between predicates
◦ Exponential error propagation in joins
◦ Arbitrary UDFs
Cost-based Optimizer in DISC: State of the Art
Bad statistics:
◦ Adaptive query planning
◦ RoPe [NSDI 12, VLDB 2013]
◦ Assumption is that some initial statistics exist
No upfront statistics:
◦ Pilot runs (samples)
◦ DynO [SIGMOD 2014]
◦ But: samples are expensive, only foreign-key joins are supported, and there is no runtime plan revision
Lazy Cost-based Optimizer for Spark
Key idea: interleave query planning and execution
◦ Query plans are lazily executed
◦ Statistics are gathered at runtime
◦ Joins are greedily scheduled
◦ The next join can be dynamically changed if a bad decision was made
◦ Execute-Gather-Aggregate-Plan (EGAP) strategy
Neither upfront statistics nor pilot runs are required
◦ Raw dataset sizes are used for the initial guess
Support for non-foreign-key joins
(A sketch of the EGAP loop follows below.)
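A minimal sketch, with hypothetical types and helpers (this is not the RIOS code), of how the EGAP cycle interleaves planning and execution: the next join is always chosen greedily from the statistics gathered while the previous one ran, and the very first choice falls back to raw dataset sizes.

    // Hypothetical sketch of the Execute-Gather-Aggregate-Plan (EGAP) loop.
    case class Rel(name: String, rows: Long)   // the initial `rows` guess is the raw dataset size

    // `runJoin` stands in for Execute + Gather + Aggregate: it computes one join and
    // returns the intermediate result together with its measured cardinality.
    def egap(inputs: List[Rel], runJoin: (Rel, Rel) => Rel): Rel = {
      var pending = inputs.sortBy(_.rows)                     // Plan: initial greedy order
      while (pending.size > 1) {
        val joined = runJoin(pending.head, pending(1))        // Execute, Gather, Aggregate
        pending = (joined :: pending.drop(2)).sortBy(_.rows)  // Plan: revise with fresh statistics
      }
      pending.head
    }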
Lazy Optimizer: an Example
[Figure sequence: a three-way join over relations A, B, and C, executed one join at a time under the initial assumption A < C. The slides step through the EGAP cycle:]
◦ Execute step: scans of A, B, and C are launched.
◦ Gather step: per-partition statistics (S) are collected while the scans run.
◦ Aggregate step: the driver aggregates the partition statistics.
◦ Plan step: the next join is chosen from the aggregated statistics.
◦ Execute step: A join B is computed.
◦ Gather / Plan steps: statistics on the intermediate result (AB) drive the choice of the final join.
◦ Execute step: the final join with C produces the result (ABC).
Lazy Optimizer: Wrong Guess
[Figure sequence: the same query, now with selections σ(A) and σ(C). The initial assumption A < σ(C) turns out to be wrong: the statistics gathered at runtime show σ(A) > σ(C). A Repartition step repartitions B so that B join C can be scheduled first instead, and the remaining join with A still produces the correct result (ABC).]
Runtime Integrated Optimizer for Spark
Spark's batch execution model allows late binding of joins
Set of statistics:
◦ Join estimations (based on sampling or sketches)
◦ Number of records
◦ Average size of each record
Statistics are aggregated using a Spark job or accumulators
Join implementations are picked based on thresholds
(A small sketch of both mechanisms follows below.)
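A minimal sketch of the two mechanisms above, written against the current Spark accumulator API (the experiments in this talk used Spark 1.6) and with an assumed 10 MB broadcast threshold; the input path and the threshold are placeholders, not the RIOS implementation.

    // Gather record count and total size with accumulators while an RDD is
    // materialized, then pick the join implementation from a size threshold.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("rios-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val numRecords = sc.longAccumulator("numRecords")
    val totalBytes = sc.longAccumulator("totalBytes")

    val table = sc.textFile("hdfs:///path/to/table")          // placeholder input
      .map { line => numRecords.add(1); totalBytes.add(line.length); line }

    table.count()                                             // materialize once; statistics fill in as a side effect

    val rows: Long  = numRecords.value
    val bytes: Long = totalBytes.value
    val avgRecordSize = if (rows > 0) bytes.toDouble / rows else 0.0

    val broadcastThreshold = 10L * 1024 * 1024                // assumed 10 MB cutoff
    val useBroadcastJoin = bytes < broadcastThreshold         // otherwise fall back to a shuffle join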
Challenges and Optimizations
Execute - Block and revise execution plans without wasting computation
Gather - Asynchronous generation of statistics (sketched below)
Aggregate - Efficient accumulation of statistics
Plan - Try to schedule as many broadcast joins as possible
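A minimal sketch (helper name assumed, not a RIOS API) of the asynchronous Gather step: statistics for an input that is not needed yet can run as a separate background Spark job while the current join executes, and the Plan step blocks on the result only when it actually needs it.

    // Start a statistics job in the background on a separate thread; Spark can
    // schedule it concurrently with the job that executes the current join.
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration
    import ExecutionContext.Implicits.global
    import org.apache.spark.rdd.RDD

    def gatherAsync(rdd: RDD[_]): Future[Long] =
      Future { rdd.count() }

    // Usage (hypothetical): kick off statistics on relation `c`, execute the current
    // join, then wait for the count only when the Plan step needs it:
    //   val cRowsF = gatherAsync(c)
    //   ... run the A join B stage ...
    //   val cRows = Await.result(cRowsF, Duration.Inf)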
Experiments
Q1: Is RIOS able to generate good query plans?
Q2: How does the performance of RIOS compare to regular Spark and to pilot runs?
Q3: How expensive are wrong guesses?
Minibenchmark with 3 Fact Tables
[Figure: running time in seconds (log scale) at scale factors 1, 10, 100, and 1000 for spark good-order, RIOS R2R, RIOS W2R, spark wrong-order, and pilot-run; spark wrong-order and pilot-run were unable to finish in 5+ hours at the two largest scale factors.]
Q1: RIOS is able to avoid bad plans
Q2: RIOS is always faster than the pilot-run approach
Q3: Bad guesses cost around 15% in the worst case
TPC-DS and TPC-H Queries
[Figure: running time in seconds (log scale) for Query 17, Query 50, Query 28, and Query 9 at scale factors 1, 10, 100, and 1000, comparing spark good-order, RIOS, pilot-run, and spark bad-order; one run was unable to finish in 5+ hours.]
Q1: RIOS generates optimal plans
Q2: RIOS is always the fastest approach
Conclusions
RIOS: a cost-based query optimizer for Spark
Statistics are gathered at runtime (no need for initial statistics or pilot runs)
Late binding of joins
Up to 2x faster than the best left-deep plans (Spark), and more than 100x faster than previous approaches for fact-table joins
Future Work
More flexible shuffle operations:
◦ Efficient switching from shuffle-based joins to broadcast joins
◦ Allow records to be partitioned in different ways
Take interesting orders and partitions into consideration
Add aggregation and additional statistics (IO and network cost)
Thank you
Experiment Configuration
๏ Datasets:
• TPC-DS
• TPC-H
๏ Configuration:
• 16 machines, 4 cores each (2 hyperthreads per core), 32 GB of RAM, 1 TB disk
• Spark 1.6.3
• Scale factors from 1 to 1000 (~1 TB)
