[Paper Reading] Steering Query Optimizers: A Practical Take on Big Data Workloads

Steering Query Optimizers: A Practical Take on Big Data Workloads
Parimarjan Negi, Matteo Interlandi, Ryan Marcus, Mohammad Alizadeh, Tim Kraska,
Marc Friedman, Alekh Jindal
Microsoft, MIT & Intel Labs
SIGMOD’21
Presented by Hai Lan

Outline
• Background
  • Optimizer & Learned methods on optimizer
  • Bao (SIGMOD’21)
• To-be-shared work
  • Scope optimizer & workload
  • Motivation & Goal
  • Method
• Discussions
19/8/21 Group meeting

Background

Background – Optimizer
Life of A Query
Query -> Parser -> AST -> Query Rewrite -> AST’ -> Optimizer -> Phy. Plan -> Executor -> Results
Inside the Optimizer: Logical Opt. -> Log. Plan -> Physical Opt. -> Phy. Plan
Two Representative Arch.
• Volcano
• Cascades

Background – Keys in Optimizer
Cardinality Estimation
• A structure to store the table statistics, e.g., sample, histogram, sketch.
• Evaluation model, e.g., evaluate on sample, assumptions when using histogram.
Plan Enumeration (Join Order)
• Large join query
Cost Model
• Predefined parameters, which are related to physical operators and the running environment.
The root of all evil, the Achilles Heel of query optimization, is the
estimation of the size of intermediate results, known as cardinalities.
-- Guy Lohman
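To illustrate why the assumptions behind statistics-based estimation matter, here is a minimal sketch of a conjunctive predicate estimated under the attribute-independence assumption. The toy table, its columns, and the helper names are made up for illustration and are not taken from the paper.

```python
# Hypothetical toy table: the two columns (make, model) are perfectly correlated.
rows = [("honda", "civic")] * 50 + [("toyota", "corolla")] * 50

def selectivity(rows, col, value):
    """Fraction of rows whose column `col` equals `value` (a per-column statistic)."""
    return sum(1 for r in rows if r[col] == value) / len(rows)

def estimate_independent(rows, predicates):
    """Estimate a conjunction's cardinality by multiplying per-column selectivities."""
    sel = 1.0
    for col, value in predicates:
        sel *= selectivity(rows, col, value)
    return sel * len(rows)

def true_cardinality(rows, predicates):
    """Count the rows that actually satisfy every predicate."""
    return sum(1 for r in rows if all(r[c] == v for c, v in predicates))

preds = [(0, "honda"), (1, "civic")]
estimated = estimate_independent(rows, preds)  # 0.5 * 0.5 * 100 rows
actual = true_cardinality(rows, preds)         # every honda row is a civic
print(estimated, actual)
```

Because the columns are correlated, the independence assumption underestimates the result size by a factor of two here; errors of this kind compound across joins.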

• Learned model to estimate the cardinality.
  • Query-driven [1, 2, 3]
  • Data-driven [4, 5, 6]
  • Hybrid [7]
• Learned model to get the cost [8, 9].
• Reinforcement learning methods to obtain the join order [10, 11].
1. Andreas Kipf et al. : Learned Cardinalities: Estimating Correlated Joins with Deep Learning. CIDR 2019
2. Anshuman Dutt et al. : Selectivity Estimation for Range Predicates using Lightweight Models. Proc. VLDB Endow. 12(9): 1044-1057 (2019)
3. Chenggang Wu et al. : Towards a Learning Optimizer for Shared Clouds. Proc. VLDB Endow. 12(3): 210-222 (2018)
4. Zongheng Yang et al. : Deep Unsupervised Cardinality Estimation. Proc. VLDB Endow. 13(3): 279-292 (2019)
5. Benjamin Hilprecht et al. : DeepDB: Learn from Data, not from Queries! Proc. VLDB Endow. 13(7): 992-1005 (2020)
6. Rong Zhu et al. : FLAT: Fast, Lightweight and Accurate Method for Cardinality Estimation. Proc. VLDB Endow. 14(9): 1489-1502 (2021)
7. Peizhi Wu, Gao Cong: A Unified Deep Model of Learning from both Data and Queries for Cardinality Estimation. SIGMOD Conference 2021: 2009-2022
8. Ji Sun, Guoliang Li: An End-to-End Learning-based Cost Estimator. Proc. VLDB Endow. 13(3): 307-319 (2019)
9. Tarique Siddiqui et al. : Cost Models for Big Data Query Processing: Learning, Retrofitting, and Our Findings. SIGMOD Conference 2020: 99-113
10. Sanjay Krishnan et al. : Learning to Optimize Join Queries With Deep Reinforcement Learning. CoRR abs/1808.03196 (2018)
11. Xiang Yu, Guoliang Li, Chengliang Chai, Nan Tang: Reinforcement Learning with Tree-LSTM for Join Order Selection. ICDE 2020: 1297-1308

Background – Bao [1] (Bandit Optimizer)
Motivations.
• Due to inaccurate cardinality estimation, the wrong physical operators may be selected.
• Databases support hints [2] to specify some operators.
Bao’s Work.
• It automatically and adaptively determines the right hint set to use for an incoming query.
• Instead of the optimizer’s `cost`, users can specify a metric, such as the running time used in the paper.
48 (2^6) hint sets
1. Ryan Marcus, Parimarjan Negi, Hongzi Mao, Nesime Tatbul, Mohammad Alizadeh, Tim Kraska: Bao: Making Learned Query Optimization Practical. SIGMOD Conference 2021: 1275-1288
2. Here, a `hint` is not the same as a hint in TiDB or MySQL.

Background – Bao
Method.
• Train a predictive model for the metric.
• When a query arrives, select the plan with the lowest predicted cost under the metric.
Pros & Cons.
• Pros
  • Handles dynamic situations
  • Integrates with a real system
  • Short training time
• Cons
  • It cannot specify hints at the subplan level.
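The selection step described in the Method bullets can be sketched as follows. This is only an illustration under stated assumptions: `choose_hint_set`, `plan_for`, and `predict_metric` are hypothetical names, the "model" is a lookup table, and the sketch leaves out how the predictive model is trained.

```python
def choose_hint_set(query, hint_sets, plan_for, predict_metric):
    """Return (hint_set, predicted_metric) for the lowest predicted metric."""
    best = None
    for hint_set in hint_sets:
        plan = plan_for(query, hint_set)   # plan the optimizer emits under these hints
        predicted = predict_metric(plan)   # model's prediction, e.g. running time
        if best is None or predicted < best[1]:
            best = (hint_set, predicted)
    if best is None:
        raise ValueError("hint_sets must not be empty")
    return best

# Toy stand-ins (hypothetical): a "plan" is just the hint set's name, and the
# "model" is a table of predicted running times in seconds.
toy_predictions = {"default": 12.0, "no_nested_loop": 7.5, "no_hash_join": 20.0}
chosen, predicted = choose_hint_set(
    "SELECT ...",
    toy_predictions.keys(),
    plan_for=lambda query, hint_set: hint_set,
    predict_metric=toy_predictions.__getitem__,
)
print(chosen, predicted)
```

The optimizer still produces every candidate plan; the learned part only ranks the plans by the user-chosen metric.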

Steer Scope Optimizer

Scope Overview
Scope Optimizer.
• Belongs to the Cascades family.
• 256 rules in total
  • Required rules, e.g., EnforceExchange
  • Implementation rules, e.g., HashJoinImpl1
  • On-by-default rules, e.g., various rewrite rules
  • Off-by-default rules, e.g., CorrelatedJoinOnUnion

Scope Overview
Scope Optimizer.
[Figure: Default Rules]
Workload in Scope.
• Recurrent jobs: the same template with different variables.
• Short & long running jobs.
  • 10% of jobs run for over 5 min but consume 90% of containers.
• Metrics
  • Runtime
  • CPU time
  • Total I/O time

Motivations & Goal
Motivations.
• Due to inaccurate cardinality estimation, the wrong rules may be selected.
• Hints can specify which rules to use.
Relationship with Bao.
• Directly apply Bao on Scope?
  • Hint -> Rule; Hint Set -> Rule configuration
• However …
  • Many more rules (200+ vs. 6) -> too many possible rule configurations
  • Large workload -> long running times & hundreds of operator nodes.
Goal.
• Output an alternative rule configuration that is better for optimizing this
particular job, for a given metric.
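A quick back-of-the-envelope calculation shows why the rule count matters. It assumes, purely for illustration, that each rule or hint is an independent on/off switch and uses 200 as a round number for "200+".

```python
# Assumption: every hint/rule can be toggled independently of the others.
bao_hints = 6
scope_rules = 200  # round number standing in for "200+"

bao_configurations = 2 ** bao_hints
scope_configurations = 2 ** scope_rules

print(bao_configurations)                 # small enough to enumerate
print(f"{scope_configurations:.2e}")      # far too many to enumerate
```

Exhaustively trying every configuration is feasible for six switches but not for hundreds, which is why the configuration space has to be pruned and sampled.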

Rule Signature & Job Span
Rule Signature.
• Rule signature: a bit vector specifying which rules directly contributed to the final query
plan produced by the optimizer.
• Default rule signature: the rule signature of a query optimized using the default rule
configuration.
Job Span.
• Given a job, its span contains all non-required rules which, if enabled or disabled, can
affect the final query plan.
• Heuristics are used to generate the span.
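A rule signature can be pictured as one bit per rule in a fixed catalog. The sketch below is a minimal illustration: the catalog order, the name `SomeRewriteRule`, and the function `rule_signature` are assumptions, not part of Scope.

```python
# Hypothetical rule catalog; the position in this list is the bit position.
RULE_CATALOG = [
    "EnforceExchange",
    "HashJoinImpl1",
    "CorrelatedJoinOnUnion",
    "SomeRewriteRule",  # made-up name for illustration
]

def rule_signature(rules_used, catalog=RULE_CATALOG):
    """Return a tuple of 0/1 bits: 1 if the rule contributed to the final plan."""
    used = set(rules_used)
    unknown = used - set(catalog)
    if unknown:
        raise ValueError(f"unknown rules: {sorted(unknown)}")
    return tuple(1 if rule in used else 0 for rule in catalog)

# Signature of a job whose final plan was produced by two of the rules.
default_signature = rule_signature(["EnforceExchange", "HashJoinImpl1"])
print(default_signature)
```

A tuple is used so the signature is hashable and can serve directly as a dictionary key.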

Which rules to try?
Randomized Configuration Search.
• Enable all the rules that are not in the span of the given job.
• For each rule category, independently sample a subset of rules from the job span. Disable
these rules, and enable all others. This gives us a new rule configuration.
• If the rule configuration has not been seen before, add it to the candidate list. Repeat until 𝑀
configurations are generated (𝑀 = 1000).
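The randomized search steps can be sketched as below. Several details are assumptions made for the sketch: each span rule is disabled with probability 0.5, a configuration is represented as the set of enabled rules, and `max_attempts` is added so the loop stops when the span admits fewer than 𝑀 distinct configurations.

```python
import random

def random_configurations(all_rules, span_by_category, m, seed=0, max_attempts=None):
    """Sample up to `m` distinct rule configurations (sets of enabled rules)."""
    rng = random.Random(seed)
    limit = max_attempts if max_attempts is not None else 100 * m
    seen, configs, attempts = set(), [], 0
    while len(configs) < m and attempts < limit:
        attempts += 1
        disabled = set()
        # Sample a subset of span rules independently for each rule category.
        for category in sorted(span_by_category):
            for rule in sorted(span_by_category[category]):
                if rng.random() < 0.5:  # assumed sampling probability
                    disabled.add(rule)
        # Rules outside the span are never sampled, so they stay enabled.
        config = frozenset(all_rules) - disabled
        if config not in seen:  # keep only configurations not seen before
            seen.add(config)
            configs.append(config)
    return configs

# Hypothetical rule names and categories.
all_rules = {"R1", "R2", "R3", "R4", "R5"}
span = {"implementation": {"R2", "R3"}, "off_by_default": {"R5"}}
candidates = random_configurations(all_rules, span, m=5)
```

With a span of three rules there are only eight possible configurations, so asking for more than eight returns fewer than requested instead of looping forever.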

Which jobs to try?
Choose Jobs & Configurations to Execute.
• Select Jobs.
  • Jobs whose recompiled plans have a clearly lower cost under the default cost model.
  • Jobs with low cost but high runtime under the default configuration (i.e., the cost model is wrong).
• Select Configurations.
  • Select the 10 cheapest alternative rule configurations (according to the cost model) and execute them.
Workload B (compared to the default configuration)
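The selection heuristics above can be sketched with two small helpers. The names, the toy costs, and the `min_gain` threshold used to interpret "clearly lower" are assumptions made for this illustration.

```python
import heapq

def cheapest_configurations(estimated_cost, k=10):
    """Return the k configuration names with the lowest estimated cost."""
    return heapq.nsmallest(k, estimated_cost, key=estimated_cost.get)

def is_promising(default_cost, alternative_costs, min_gain=0.2):
    """True if some alternative is cheaper by at least `min_gain` (assumed threshold)."""
    costs = list(alternative_costs)
    return bool(costs) and min(costs) <= (1.0 - min_gain) * default_cost

# Toy cost-model estimates for one job (hypothetical numbers).
costs = {"cfg_a": 90.0, "cfg_b": 40.0, "cfg_c": 120.0, "cfg_d": 75.0}
to_execute = cheapest_configurations(costs, k=2)
worth_running = is_promising(default_cost=100.0, alternative_costs=costs.values())
print(to_execute, worth_running)
```

Only the shortlisted configurations are actually executed, which keeps the expensive step (running jobs) small relative to the number of sampled configurations.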

Different metrics
Other metrics sometimes see regressions.
Not all metrics can be improved together.
Potentially adopt a different model for each one.

Extrapolating to other jobs
Idea.
• Use the rule signature as the level of granularity across which the same set of rule
configurations could be useful.
• Rule signature job group
  • The set of jobs whose default rule signatures map to the same bit vector.
Methods.
• Case 1: simply apply a previously seen rule configuration.
• Case 2: find a set of interesting configurations for each job group and adopt a
model to choose one at compile time.
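Grouping jobs by default rule signature, and Case 1 of the methods, can be sketched as a pair of dictionary operations. The job IDs, signatures, and configuration names below are hypothetical.

```python
from collections import defaultdict

def group_by_signature(default_signatures):
    """Map each default rule signature (bit vector) to the jobs that share it."""
    groups = defaultdict(list)
    for job_id, signature in default_signatures.items():
        groups[tuple(signature)].append(job_id)
    return dict(groups)

def configuration_for(signature, best_known, default_config):
    """Case 1: reuse a previously seen configuration for the job's group, if any."""
    return best_known.get(tuple(signature), default_config)

# Hypothetical jobs with their default rule signatures.
signatures = {"job1": (1, 1, 0), "job2": (1, 0, 0), "job3": (1, 1, 0)}
job_groups = group_by_signature(signatures)

# Hypothetical table of configurations that helped a group in the past.
best_known = {(1, 1, 0): "cfg_b"}
new_job_config = configuration_for((1, 1, 0), best_known, "default")
unseen_group_config = configuration_for((0, 0, 1), best_known, "default")
```

Falling back to the default configuration for unseen groups keeps the behaviour unchanged wherever no evidence has been collected.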

Learning Rule Configurations
Training Set.
• Select S rule signatures from the workload.
• Collect the jobs whose default rule signature maps to these rule signatures.
• Obtain 𝐾 candidate configurations for each job group.
• Sample 𝑀 jobs from all the jobs mapping to these job groups.
• Execute each of the 𝐾 configurations for every job.
• Each training sample: (Job, RuleConf, Running Time)
Learning Problem.
• Treat the dataset of samples in each job group as an independent learning problem.
• Goal is to select one of the 𝐾 candidate configurations for a given query.
• Supervised learning to estimate the running time of a query under a configuration.
Featurization.
• Job level features, e.g., input cardinality size, hash of template.
• Rule configuration features, e.g., cost of plan, bit vector of RuleDiff.
• Query graph features, e.g., operators’ id, cost.
Learned Models.
• For each job group, a fully connected neural network with one hidden layer of size 1024.
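The shape of the per-group model and the compile-time choice can be sketched in plain Python. This is not the authors' implementation: the ReLU activation, the random initialization, and the feature layout are assumptions, the network is untrained, and the class and function names are hypothetical.

```python
import math
import random

HIDDEN = 1024  # hidden layer size stated on the slide

class RuntimeModel:
    """One-hidden-layer fully connected network predicting a running time."""

    def __init__(self, n_features, hidden=HIDDEN, seed=0):
        rng = random.Random(seed)
        s1 = 1.0 / math.sqrt(n_features)
        s2 = 1.0 / math.sqrt(hidden)
        self.w1 = [[rng.uniform(-s1, s1) for _ in range(n_features)]
                   for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [rng.uniform(-s2, s2) for _ in range(hidden)]
        self.b2 = 0.0

    def predict(self, x):
        # Hidden layer with ReLU (assumed activation), then a linear output.
        h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        return sum(w * hi for w, hi in zip(self.w2, h)) + self.b2

def choose_configuration(model, job_features, config_features):
    """Pick the candidate configuration with the lowest predicted running time."""
    return min(config_features,
               key=lambda name: model.predict(job_features + config_features[name]))

# Hypothetical features: 2 job-level values plus 2 values per configuration.
model = RuntimeModel(n_features=4)
candidates = {"default": [0.0, 1.0], "cfg_a": [1.0, 0.5], "cfg_b": [0.3, 0.2]}
choice = choose_configuration(model, [0.7, 0.1], candidates)
```

In practice the weights would be fitted on the (Job, RuleConf, Running Time) samples of one job group, with one such model per group.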


Discussion
Summary.
• How to choose the right rule configuration for an incoming query.
• Proposes the rule signature, the job span, and several heuristic algorithms to obtain the candidate rule configurations.
• Adopts a learned model to choose the rule configuration for each job group.
Future work.
• Methods to generate the job span & interesting rule configurations.
• Use feedback from the execution results to guide future iterations of the configuration search.
• Other configurable options in Scope.
Discussion.
• Papers.
• Methods.
• Model for each group.
• Parameters.

Q & A
