This document discusses methods for optimizing query performance in a query optimizer called Scope by selecting alternative rule configurations. It proposes using rule signatures to group similar queries and generate candidate rule configurations to execute for each group. A learning model is then trained on execution results to select the best configuration for future queries in each group. The goal is to improve upon the default configuration by adapting to workloads and addressing inaccuracies in cardinality estimation that can lead to suboptimal plans.
[Paper Reading] Steering Query Optimizers: A Practical Take on Big Data Workloads
1. RMIT Classification: Trusted
Steering Query Optimizers: A Practical Take on Big Data Workloads
Parimarjan Negi, Matteo Interlandi, Ryan Marcus, Mohammad Alizadeh, Tim Kraska,
Marc Friedman, Alekh Jindal
Microsoft, MIT & Intel Lab.
SIGMOD’21
Presented by Hai Lan
Outline
• Background
  • Optimizer & Learned methods on optimizer
  • Bao (SIGMOD’21)
• To-be-shared work
  • Scope optimizer & workload
  • Motivation & Goal
  • Method
  • Discussions
19/8/21 Group meeting 2
Background – Optimizer
Life of a Query.
• Query -> Parser -> AST -> Query Rewrite -> AST’ -> Optimizer -> Phy. Plan -> Executor -> Results
• Inside the optimizer: Logical Opt. -> Log. Plan -> Physical Opt. -> Phy. Plan
Two Representative Architectures.
• Volcano
• Cascades
Background – Keys in Optimizer
Cardinality Estimation.
• A structure to store the table statistics, e.g., sample, histogram, sketch.
• An evaluation model, e.g., evaluate on sample, assumptions when using histogram.
Cost Model.
• Predefined parameters, which are related to the physical operators and the running environment.
Plan Enumeration (Join Order).
• Large join query
"The root of all evil, the Achilles Heel of query optimization, is the estimation of the size of intermediate results, known as cardinalities." -- Guy Lohman
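The "assumptions when using histogram" point can be made concrete with a small worked example. The sketch below uses made-up data (the `make` and `model` columns and their values are hypothetical, not from the talk) and per-column frequency counts as a stand-in for histograms. It shows how multiplying per-column selectivities, i.e. assuming the columns are independent, underestimates the cardinality of a conjunctive predicate on correlated columns.

```python
from collections import Counter

# Hypothetical table: `model` fully determines `make`, so the columns are correlated.
rows = [("Honda", "Civic")] * 50 + [("Toyota", "Corolla")] * 50


def estimate_independent(rows, make, model):
    """Estimate |make = ? AND model = ?| from per-column frequencies only,
    multiplying the two selectivities as if the columns were independent."""
    n = len(rows)
    make_freq = Counter(r[0] for r in rows)
    model_freq = Counter(r[1] for r in rows)
    sel_make = make_freq[make] / n
    sel_model = model_freq[model] / n
    return n * sel_make * sel_model


def true_cardinality(rows, make, model):
    """Count the rows that actually satisfy both predicates."""
    return sum(1 for r in rows if r == (make, model))


estimated = estimate_independent(rows, "Honda", "Civic")  # 100 * 0.5 * 0.5 = 25.0
actual = true_cardinality(rows, "Honda", "Civic")         # 50
```

Here the estimate is off by a factor of two with a single predicate pair; errors of this kind compound across joins, which is the motivation for the learned and steering approaches that follow.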
• Learned models to estimate the cardinality: query-driven [1, 2, 3], data-driven [4, 5, 6], hybrid [7].
• Learned models to get the cost [8, 9].
• Reinforcement learning methods to obtain the join order [10, 11].
1. Andreas Kipf et al.: Learned Cardinalities: Estimating Correlated Joins with Deep Learning. CIDR 2019
2. Anshuman Dutt et al.: Selectivity Estimation for Range Predicates using Lightweight Models. Proc. VLDB Endow. 12(9): 1044-1057 (2019)
3. Chenggang Wu et al.: Towards a Learning Optimizer for Shared Clouds. Proc. VLDB Endow. 12(3): 210-222 (2018)
4. Zongheng Yang et al.: Deep Unsupervised Cardinality Estimation. Proc. VLDB Endow. 13(3): 279-292 (2019)
5. Benjamin Hilprecht et al.: DeepDB: Learn from Data, not from Queries! Proc. VLDB Endow. 13(7): 992-1005 (2020)
6. Rong Zhu et al.: FLAT: Fast, Lightweight and Accurate Method for Cardinality Estimation. Proc. VLDB Endow. 14(9): 1489-1502 (2021)
7. Peizhi Wu, Gao Cong: A Unified Deep Model of Learning from both Data and Queries for Cardinality Estimation. SIGMOD Conference 2021: 2009-2022
8. Ji Sun, Guoliang Li: An End-to-End Learning-based Cost Estimator. Proc. VLDB Endow. 13(3): 307-319 (2019)
9. Tarique Siddiqui et al.: Cost Models for Big Data Query Processing: Learning, Retrofitting, and Our Findings. SIGMOD Conference 2020: 99-113
10. Sanjay Krishnan et al.: Learning to Optimize Join Queries With Deep Reinforcement Learning. CoRR abs/1808.03196 (2018)
11. Xiang Yu, Guoliang Li, Chengliang Chai, Nan Tang: Reinforcement Learning with Tree-LSTM for Join Order Selection. ICDE 2020: 1297-1308
Background – Bao [1] (Bandit Optimizer)
Motivations.
• Due to inaccurate cardinality estimation, the wrong physical operators may be selected.
• Databases support hints [2] to specify some operators.
Bao’s Work.
• It automatically and adaptively determines the right hint set to use for an incoming query.
• Instead of the optimizer's `cost`, users can specify a metric, such as the running time used in the paper.
• 48 (26) hint sets
1. Ryan Marcus, Parimarjan Negi, Hongzi Mao, Nesime Tatbul, Mohammad Alizadeh, Tim Kraska: Bao: Making Learned Query Optimization Practical. SIGMOD Conference 2021: 1275-1288
2. Here, `hint` does not mean the same thing as a hint in TiDB or MySQL.
Background – Bao
Method.
• Train a predictive model for the metric.
• When a query arrives, select the plan with the lowest predicted cost under the metric.
Pros & Cons.
• Pros
  • Handles dynamic situations
  • Integrates with a real system
  • Training time
• Cons
  • It cannot specify hints at the subplan level.
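The two Method bullets can be sketched as a small selection loop. This is a simplified illustration, not Bao's implementation: the predictive model is replaced by a hypothetical `predict_metric` callable, the optimizer by a hypothetical `plan_for` callable, and the choice is a plain argmin over predictions. The hint names are invented for the example.

```python
from typing import Callable, FrozenSet, Hashable, List

HintSet = FrozenSet[str]


def choose_hint_set(
    query: str,
    hint_sets: List[HintSet],
    plan_for: Callable[[str, HintSet], Hashable],
    predict_metric: Callable[[Hashable], float],
) -> HintSet:
    """Return the hint set whose plan has the lowest predicted metric value."""
    if not hint_sets:
        raise ValueError("need at least one hint set")
    # Plan the query once per hint set, score each plan, keep the best.
    return min(hint_sets, key=lambda hints: predict_metric(plan_for(query, hints)))


# Toy stand-ins (all hypothetical): a "plan" is just the (query, hints) pair,
# and the "model" is a lookup table of predicted running times in seconds.
toy_runtime = {
    frozenset(): 12.0,
    frozenset({"no_nested_loop"}): 7.5,
    frozenset({"no_hash_join"}): 30.0,
}

chosen = choose_hint_set(
    "SELECT ...",
    list(toy_runtime),
    plan_for=lambda query, hints: (query, hints),
    predict_metric=lambda plan: toy_runtime[plan[1]],
)
```

In this toy run `chosen` is the `no_nested_loop` hint set, because its predicted running time is the lowest of the three.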
Scope Overview
Scope Optimizer.
• Belongs to the Cascades family.
• 256 rules in total
  • Required rules, e.g., EnforceExchange
  • Implementation rules, e.g., HashJoinImpl1
  • On-by-default rules, e.g., various rewrite rules
  • Off-by-default rules, e.g., CorrelatedJoinOnUnion
Default Rules
Workload in Scope.
• Recurring jobs: the same template with different variables.
• Short- and long-running jobs.
• 10% of jobs run for over 5 min but consume 90% of containers.
• Metrics
  • Runtime
  • CPU time
  • Total I/O time
Motivations & Goal
Motivations.
• Due to inaccurate cardinality estimation, the wrong rules may be selected.
• Hints can specify which rules to use.
Relationship with Bao.
• Directly apply Bao on Scope?
  • Hint -> Rule; Hint Set -> Rule configuration
• However …
  • Many more rules (200+ vs. 6) -> too many possible rule configurations
  • Large workload -> long running times & hundreds of operator nodes.
Goal.
• Output an alternative rule configuration that is better for optimizing this particular job, for a given metric.
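The "200+ vs. 6" comparison is easier to appreciate as arithmetic. The snippet below counts raw on/off combinations, assuming every rule or hint can be toggled independently; that is an upper bound chosen for illustration, and 200 is simply the lower end of the slide's "200+".

```python
bao_hints = 6
scope_rules = 200  # lower bound taken from "200+" above

bao_configs = 2 ** bao_hints      # 64 on/off combinations
scope_configs = 2 ** scope_rules  # about 1.6e60 on/off combinations

print(bao_configs)
print(f"{scope_configs:.2e}")
```

Trying every combination is feasible with 6 switches and hopeless with 200, which is why the rest of the talk narrows the candidates per job before anything is executed.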
Rule Signature & Job Span
Rule Signature.
• The rule signature is a bit vector specifying which rules directly contribute to the final query plan produced by the optimizer.
• The default rule signature is the rule signature of a query optimized using the default rule configuration.
Job Span.
• Given a job, its span contains all non-required rules which, if enabled or disabled, can affect the final query plan.
• Heuristics are used to generate the span.
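A rule signature can be pictured as one bit per rule in a fixed rule ordering. The sketch below is only an illustration of that encoding: the three-plus-one rule list is a tiny stand-in for the 256 Scope rules, `HypotheticalRewriteRule` is an invented name, and which rules "contributed" to the plan is made up.

```python
from typing import Iterable, List, Tuple


def rule_signature(all_rules: List[str], rules_in_final_plan: Iterable[str]) -> Tuple[int, ...]:
    """Return one bit per rule, in `all_rules` order: 1 if the rule
    contributed to the final plan, 0 otherwise."""
    used = set(rules_in_final_plan)
    unknown = used - set(all_rules)
    if unknown:
        raise ValueError(f"unknown rules: {sorted(unknown)}")
    return tuple(1 if rule in used else 0 for rule in all_rules)


# Tiny stand-in for the optimizer's rule list (the last name is hypothetical).
all_rules = [
    "EnforceExchange",
    "HashJoinImpl1",
    "CorrelatedJoinOnUnion",
    "HypotheticalRewriteRule",
]

signature = rule_signature(all_rules, ["HashJoinImpl1", "EnforceExchange"])
```

With this input the signature is `(1, 1, 0, 0)`. Returning a tuple keeps the signature hashable, so it can later serve as a dictionary key when jobs are grouped by signature.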
Which rules to try?
Randomized Configuration Search.
• Enable all the rules that are not in the span of the given job.
• For each rule category, independently sample a subset of rules from the job span. Disable these rules, and enable all others. This gives us a new rule configuration.
• If the rule configuration has not been seen before, add it to the candidate list. Repeat until 𝑀 configurations are generated (𝑀 = 1000).
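The randomized configuration search above can be sketched as follows. Several details are assumptions made for the example and are not stated in the talk: a configuration is represented as the set of enabled rules, a subset is drawn by first picking its size uniformly, and a `max_attempts` cap is added so the loop ends when the span admits fewer than `m` distinct configurations.

```python
import random
from typing import Dict, FrozenSet, List, Set


def randomized_configuration_search(
    all_rules: Set[str],
    span_by_category: Dict[str, List[str]],
    m: int,
    rng: random.Random,
    max_attempts: int = 100_000,
) -> List[FrozenSet[str]]:
    """Return up to `m` distinct configurations, each given as its set of ENABLED rules."""
    seen: Set[FrozenSet[str]] = set()
    candidates: List[FrozenSet[str]] = []
    attempts = 0
    while len(candidates) < m and attempts < max_attempts:
        attempts += 1
        disabled: Set[str] = set()
        for rules in span_by_category.values():
            # Independently sample a subset of this category's span rules to disable.
            subset_size = rng.randint(0, len(rules))
            disabled.update(rng.sample(rules, subset_size))
        # Rules outside the span are never sampled, so they always stay enabled.
        config = frozenset(all_rules - disabled)
        if config not in seen:
            seen.add(config)
            candidates.append(config)
    return candidates


# Toy input with hypothetical rule names: R1-R3 are in the span, R4 and Required are not.
toy_rules = {"R1", "R2", "R3", "R4", "Required"}
toy_span = {"rewrite": ["R1", "R2"], "implementation": ["R3"]}
configs = randomized_configuration_search(toy_rules, toy_span, m=5, rng=random.Random(0))
```

The toy span has three rules, so at most eight distinct configurations exist; asking for five returns five, each of which still contains `R4` and `Required`.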
Which jobs to try?
Choose Jobs & Configurations to Execute.
• Select Jobs.
  • Jobs whose recompiled plans have clearly lower costs under the default cost model.
  • Jobs with low cost but high runtimes under the default configuration (the cost model is wrong).
• Select Configurations.
  • Select the 10 cheapest (by the cost model) alternative rule configurations and execute them.
Workload B (compared to the default configuration)
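The two selection steps can be sketched with a small record per job. The thresholds (`cost_gain`, `low_cost`, `high_runtime`) and the `JobStats` fields are hypothetical placeholders; the talk says only "clearly lower", "low cost", and "high runtimes" without giving numbers.

```python
import heapq
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JobStats:
    name: str
    default_cost: float          # cost estimate under the default configuration
    default_runtime: float       # observed runtime (seconds) under the default configuration
    alt_costs: Dict[str, float]  # configuration id -> cost estimate after recompilation


def is_worth_executing(job: JobStats, cost_gain: float = 0.5,
                       low_cost: float = 100.0, high_runtime: float = 300.0) -> bool:
    """Apply the two job-selection heuristics with placeholder thresholds."""
    best_alt = min(job.alt_costs.values(), default=float("inf"))
    clearly_cheaper = best_alt < cost_gain * job.default_cost
    cost_model_wrong = job.default_cost < low_cost and job.default_runtime > high_runtime
    return clearly_cheaper or cost_model_wrong


def cheapest_configs(job: JobStats, k: int = 10) -> List[str]:
    """Return the ids of the k alternative configurations with the lowest estimated cost."""
    return heapq.nsmallest(k, job.alt_costs, key=job.alt_costs.get)


job = JobStats("job_a", default_cost=1000.0, default_runtime=60.0,
               alt_costs={"c1": 400.0, "c2": 900.0, "c3": 1200.0})
selected = is_worth_executing(job)
top_two = cheapest_configs(job, k=2)
```

For `job_a` the cheapest alternative costs less than half the default, so the job is selected, and the two cheapest configurations are `c1` and `c2`.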
Different metrics
The metrics cannot all be improved together.
One option is to adopt a different model for each metric.
Extrapolating to other jobs
Idea.
• Use the rule signature as the level of granularity across which the same set of rule configurations could be useful.
• Rule signature job group
  • The set of jobs whose default rule signatures map to the same bit vector.
Methods.
• Case 1: simply apply a previously seen rule configuration.
• Case 2: find a set of interesting configurations for each job group and adopt a model to choose one at compile time.
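Grouping jobs by default rule signature, and the Case 1 lookup, can be sketched with a dictionary keyed by the bit vector. The job ids, signatures, and the `"config_17"` label are invented for the example, and falling back to `"default"` for an unseen group is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Signature = Tuple[int, ...]


def group_by_default_signature(default_signatures: Dict[str, Signature]) -> Dict[Signature, List[str]]:
    """Map each default rule signature to the list of jobs that produced it."""
    groups: Dict[Signature, List[str]] = defaultdict(list)
    for job_id, signature in default_signatures.items():
        groups[signature].append(job_id)
    return dict(groups)


def previously_seen_config(signature: Signature, known: Dict[Signature, str]) -> str:
    """Case 1: reuse the configuration already found for this job group, if any."""
    return known.get(signature, "default")


default_signatures = {
    "job_a": (1, 0, 1),
    "job_b": (1, 0, 1),
    "job_c": (0, 1, 1),
}
groups = group_by_default_signature(default_signatures)

# Hypothetical result of earlier exploration for one job group.
known_configs = {(1, 0, 1): "config_17"}
reused = previously_seen_config(default_signatures["job_b"], known_configs)
fallback = previously_seen_config(default_signatures["job_c"], known_configs)
```

Here `job_a` and `job_b` fall into the same group, so `job_b` reuses `config_17`, while `job_c` has no known configuration and keeps the default.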
Learning Rule Configurations
Training Set.
• Select S rule signatures from the workload.
• Collect the jobs whose default rule signature maps to these rule signatures.
• Obtain 𝐾 candidate configurations for each job group.
• Sample 𝑀 jobs from all the jobs mapping to these job groups.
• Execute each of the 𝐾 configurations for every job.
• Each training sample is a tuple (Job, RuleConf, Running Time).
Learning Problem.
• Treat the dataset of samples in each job group as an independent learning problem.
• The goal is to select one of the 𝐾 candidate configurations for a given query.
• Use supervised learning to estimate the running time of a query under a configuration.
Featurization.
• Job-level features, e.g., input cardinality, hash of template.
• Rule configuration features, e.g., cost of plan, bit vector of RuleDiff.
• Query graph features, e.g., operators’ id, cost.
Learned Models.
• For each job group, a fully connected neural network with one hidden layer of size 1024.
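The shape of the per-group model and the compile-time choice among the 𝐾 candidates can be sketched in plain Python. This is an untrained illustration: the weights are random, the ReLU activation is an assumption (the talk does not name one), `RuleDiff` is assumed to mark the bits where a configuration differs from the default, and the query graph features are left out. A real model would be fitted to the (Job, RuleConf, Running Time) samples of its job group.

```python
import math
import random
from typing import List, Sequence

HIDDEN_SIZE = 1024  # hidden layer width stated in the talk


class OneHiddenLayerNet:
    """Fully connected net: input -> ReLU hidden layer -> scalar runtime prediction."""

    def __init__(self, input_size: int, rng: random.Random):
        s1 = 1.0 / math.sqrt(input_size)
        s2 = 1.0 / math.sqrt(HIDDEN_SIZE)
        # Random placeholder weights; no training happens in this sketch.
        self.w1 = [[rng.uniform(-s1, s1) for _ in range(input_size)] for _ in range(HIDDEN_SIZE)]
        self.b1 = [0.0] * HIDDEN_SIZE
        self.w2 = [rng.uniform(-s2, s2) for _ in range(HIDDEN_SIZE)]
        self.b2 = 0.0

    def predict(self, x: Sequence[float]) -> float:
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        return sum(w * h for w, h in zip(self.w2, hidden)) + self.b2


def featurize(job_features: Sequence[float], config_bits: Sequence[int],
              default_bits: Sequence[int]) -> List[float]:
    """Concatenate job-level features with the assumed RuleDiff bit vector."""
    rule_diff = [float(c != d) for c, d in zip(config_bits, default_bits)]
    return list(job_features) + rule_diff


def pick_configuration(net: OneHiddenLayerNet, job_features: Sequence[float],
                       candidates: Sequence[Sequence[int]], default_bits: Sequence[int]) -> int:
    """Return the index of the candidate with the lowest predicted running time."""
    predictions = [net.predict(featurize(job_features, c, default_bits)) for c in candidates]
    return min(range(len(candidates)), key=predictions.__getitem__)


default_bits = (1, 1, 0, 0)
candidates = [(1, 1, 0, 0), (1, 0, 1, 0), (0, 1, 1, 1)]  # K = 3 toy configurations
net = OneHiddenLayerNet(input_size=2 + len(default_bits), rng=random.Random(0))
choice = pick_configuration(net, [0.3, 0.7], candidates, default_bits)
```

Because one network is kept per job group, the input size only has to cover that group's features and its rule bit vector, and prediction is run once per candidate configuration at compile time.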
Discussion
Future work.
• Methods to generate the job span & interesting rule configurations.
• Use feedback from the execution results to guide future iterations of the configuration search.
• Other configurable options in Scope.
Summary.
• How to choose the right rule configuration for an incoming query.
• Proposes rule signatures, job spans, and several heuristic algorithms to obtain the candidate rule configurations.
• Adopts a learned model to choose the rule configuration for each job group.
Discussion.
• Papers.
• Methods.
• Model for each group.
• Parameters.