Paper Reading
Orca: A Modular Query Optimizer Architecture for Big Data
Presented by Weizhen Wang
Background
In this paper we present the architecture of Orca, the new query
optimizer for all Pivotal data management products, including
Pivotal Greenplum Database and Pivotal HAWQ.
ORCA = GPORCA = Greenplum ORCA = ...
Jul 2015, PQO: Pivotal Query Optimizer (Released in GPDB 4.3.5.0)
Jun 2014, ORCA: SIGMOD 2014 Paper
https://github.com/greenplum-db/gporca
In source code GPOPT
src/backend/gpopt
psql: select gp_opt_version();
High level GPDB architecture
• Storage and processing of large amounts of data are handled by distributing the load across several
servers or hosts to create an array of individual databases, all working together to present a single
database image.
• The master is the entry point to GPDB, where clients connect and submit SQL statements. The
master coordinates work with other database instances, called segments, to handle data processing
and storage.
• When a query is submitted to the master, it is optimized and broken into smaller components
dispatched to segments to work together on delivering the final results.
• The interconnect is the networking layer responsible for inter-process communication between the
segments. The interconnect uses a standard Gigabit Ethernet switching fabric.
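The division of labor between master and segments can be illustrated with a small sketch. It assumes rows are spread over segments by hashing a distribution key (one of the distribution policies GPDB supports); the segment count, the toy hash function, and all function names are hypothetical and only serve the illustration.

```python
from collections import defaultdict

NUM_SEGMENTS = 4  # hypothetical cluster size


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Pick a segment from the distribution key (toy, deterministic hash)."""
    return sum(str(key).encode()) % num_segments


def distribute(rows, key_index=0):
    """Spread rows over segments by hashing the distribution key."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(row[key_index])].append(row)
    return segments


def segment_partial_sum(rows, value_index=1):
    """Work done locally on a single segment."""
    return sum(row[value_index] for row in rows)


def master_sum(segments):
    """The master combines the partial results returned by the segments."""
    return sum(segment_partial_sum(rows) for rows in segments.values())


rows = [(i, i * 10) for i in range(1, 9)]
segments = distribute(rows)
total = master_sum(segments)
```

Each segment only touches its own rows; the master sees partial results, not the raw data.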
High level GPDB architecture
GPDB Using Planner (`set optimizer=off`)
GPDB Using GPORCA (`set optimizer=on`)
Orca architecture
DXL: Data eXchange Language
Interaction of Orca with database system
Five Optimization Steps in Orca
● [Step 0] Pre-Process: e.g., predicate pushdown
● [Step 1] Exploration: All equivalent logical plans
● [Step 2] Statistics Derivation: histograms
● [Step 3] Implementation: Logical to Physical
● [Step 4] MPP Optimization: Enforcing and Costing
Step 0 Pre-Process with 25 iterations (TL;DR)
Step 1 Exploration 1/2: Memoization
Compact in-memory data structure capturing plan space:
● Group: Container of equivalent expressions
● Group Expression: Operator that has other groups as its children
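A minimal sketch of such a structure, assuming a simplified Memo that only tracks operator names and child group ids. The class and method names are hypothetical and are not taken from the Orca source code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GroupExpression:
    op: str               # operator name, e.g. "InnerJoin"
    children: tuple = ()  # ids of child groups, not concrete subtrees


@dataclass
class Group:
    gid: int
    expressions: list = field(default_factory=list)


class Memo:
    def __init__(self):
        self.groups = []
        self._index = {}  # GroupExpression -> group id, for duplicate detection

    def insert(self, gexpr, target_gid=None):
        """Add gexpr and return the id of the group that holds it."""
        if gexpr in self._index:
            return self._index[gexpr]  # already known: do not store it twice
        if target_gid is None:
            target_gid = len(self.groups)
            self.groups.append(Group(target_gid))
        self.groups[target_gid].expressions.append(gexpr)
        self._index[gexpr] = target_gid
        return target_gid


memo = Memo()
g_t1 = memo.insert(GroupExpression("Get(T1)"))
g_t2 = memo.insert(GroupExpression("Get(T2)"))
g_join = memo.insert(GroupExpression("InnerJoin", (g_t1, g_t2)))
# A logically equivalent expression is added to the same group.
memo.insert(GroupExpression("InnerJoin", (g_t2, g_t1)), target_gid=g_join)
```

Because children are group ids rather than subtrees, one group expression stands for every combination of plans drawn from its child groups, which is what keeps the structure compact.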
Step 1 Exploration 2/2: Transformation
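As a rough illustration of a transformation rule, the sketch below applies join commutativity to a group until no new expressions appear. A group is modeled as a plain set of `(operator, child_group_ids)` tuples; this representation and the function names are assumptions made for the example, not Orca's actual interfaces.

```python
def join_commutativity(gexpr):
    """InnerJoin[a, b] -> InnerJoin[b, a]; yields nothing for other operators."""
    op, children = gexpr
    if op == "InnerJoin" and len(children) == 2:
        left, right = children
        yield (op, (right, left))


def explore_group(group, rules):
    """Apply rules to every expression until the group stops growing."""
    pending = list(group)
    while pending:
        gexpr = pending.pop()
        for rule in rules:
            for new_gexpr in rule(gexpr):
                if new_gexpr not in group:  # duplicate detection
                    group.add(new_gexpr)
                    pending.append(new_gexpr)
    return group


group = {("InnerJoin", (1, 2))}
explore_group(group, [join_commutativity])
```

Applying the rule to `InnerJoin[2, 1]` regenerates `InnerJoin[1, 2]`, which is already present, so exploration terminates.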
Step 2 Statistics Derivation
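To give a feel for how histograms feed cardinality estimation, here is a small sketch of an equi-join estimate. It uses the textbook per-bucket formula `rows_a * rows_b / max(ndv_a, ndv_b)` and assumes both histograms share the same bucket boundaries; this is an illustrative assumption, not necessarily the estimator Orca implements.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    lo: int      # inclusive lower bound
    hi: int      # exclusive upper bound
    rows: float  # number of rows falling into the bucket
    ndv: float   # number of distinct values in the bucket


def join_cardinality(hist_a, hist_b):
    """Estimate |A JOIN B ON A.x = B.x| from two aligned histograms."""
    total = 0.0
    for a, b in zip(hist_a, hist_b):
        assert (a.lo, a.hi) == (b.lo, b.hi), "sketch assumes aligned buckets"
        if a.ndv and b.ndv:  # an empty bucket contributes no join rows
            total += a.rows * b.rows / max(a.ndv, b.ndv)
    return total


hist_t1 = [Bucket(0, 10, 100, 10), Bucket(10, 20, 50, 5)]
hist_t2 = [Bucket(0, 10, 200, 10), Bucket(10, 20, 0, 0)]
estimate = join_cardinality(hist_t1, hist_t2)
```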
Step 3 Implementation
Step 4 MPP Optimization 1/3 Requirement
Step 4 MPP Optimization 2/3 Cost
Step 4 MPP Optimization 3/3 Enforcement
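The three sub-steps can be sketched together: a request states required properties, each physical alternative delivers some properties at some cost, and enforcers are added where the two differ. Property names, operator names, and all cost numbers below are made up for illustration; the real cost model and property framework are considerably richer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Props:
    distribution: str  # e.g. "Hashed(T1.a)" or "Singleton"
    order: tuple = ()  # sort columns; empty tuple means no order


@dataclass
class Plan:
    op: str
    delivered: Props
    cost: float
    child: object = None


ENFORCER_COST = {"Gather": 3.0, "Redistribute": 4.0, "Sort": 5.0}  # made-up


def enforce(plan, required):
    """Wrap plan in enforcers until it delivers the required properties."""
    if plan.delivered.distribution != required.distribution:
        motion = "Gather" if required.distribution == "Singleton" else "Redistribute"
        # A plain motion does not preserve order, so the order is reset.
        plan = Plan(motion, Props(required.distribution, ()),
                    plan.cost + ENFORCER_COST[motion], plan)
    if required.order and plan.delivered.order != required.order:
        plan = Plan("Sort", Props(plan.delivered.distribution, required.order),
                    plan.cost + ENFORCER_COST["Sort"], plan)
    return plan


def optimize(alternatives, required):
    """Return the cheapest alternative after enforcement."""
    return min((enforce(p, required) for p in alternatives), key=lambda p: p.cost)


required = Props("Singleton", ("T1.a",))
alternatives = [
    Plan("HashJoin", Props("Hashed(T1.a)"), 10.0),
    Plan("MergeJoin", Props("Singleton", ("T1.a",)), 20.0),
]
best = optimize(alternatives, required)
```

Here the hash join wins even though it needs two enforcers, because its total cost with `Gather` and `Sort` is still below that of the alternative that already satisfies the request.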
Processing optimization requests in the Memo
Parallel Query Optimization
Optimization process is broken into small work units called optimization jobs.
Orca currently has seven different types of optimization jobs:
● Exp(g): Generate logically equivalent expressions of all group expressions in group g.
● Exp(gexpr): Generate logically equivalent expressions of a group expression gexpr.
● Imp(g): Generate implementations of all group expressions in group g.
● Imp(gexpr): Generate implementation alternatives of a group expression gexpr.
● Opt(g, req): Return the plan with the least estimated cost that is rooted by an operator in
group g and satisfies optimization request req.
● Opt(gexpr, req): Return the plan with the least estimated cost that is rooted by gexpr and
satisfies optimization request req.
● Xform(gexpr, t): Transform group expression gexpr using rule t.
Parallel Query Optimization
Optimization jobs dependency graph
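A simplified, single-threaded sketch of dependency-driven job execution: a job that spawns child jobs is suspended and resumes only after all of its children have finished. The `children_of` mapping and the example graph are hypothetical, and the sketch assumes every job has at most one parent; a real scheduler would also run independent jobs in parallel.

```python
from collections import deque


def run_jobs(root, children_of):
    """Return jobs in completion order; parents complete after their children."""
    finished = []
    pending_children = {}  # suspended job -> number of unfinished children
    parent_of = {}
    queue = deque([root])
    while queue:
        job = queue.popleft()
        kids = children_of.get(job, [])
        if kids and job not in pending_children:
            # First visit: spawn the child jobs and suspend this job.
            pending_children[job] = len(kids)
            for kid in kids:
                parent_of[kid] = job
                queue.append(kid)
            continue
        # Leaf job, or a resumed job whose children are all done.
        finished.append(job)
        parent = parent_of.get(job)
        if parent is not None:
            pending_children[parent] -= 1
            if pending_children[parent] == 0:
                queue.append(parent)  # resume the suspended parent
    return finished


children_of = {
    "Exp(g0)": ["Exp(gexpr0)"],
    "Exp(gexpr0)": ["Xform(gexpr0, t1)", "Xform(gexpr0, t2)"],
}
order = run_jobs("Exp(g0)", children_of)
```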
Test
Minimal Repros
AMPERe is a tool for Automatic capture of Minimal Portable and Executable Repros.
Replay of AMPERe dump
TAQO
MEASURING ACCURACY
• Discordance of plan pairs.
• Relevance of plan
• Pairwise distance
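One way to combine these three ingredients into a single number is sketched below. A pair of plans is discordant when the optimizer's estimates rank it in the opposite order to the actual costs; each discordant pair is weighted by relevance (cheap plans matter more) and by the distance between the two plans. The specific weighting formulas are assumptions chosen for illustration and are not claimed to match TAQO's exact definition.

```python
from itertools import combinations
from math import hypot


def accuracy_score(plans):
    """plans: list of (estimated_cost, actual_cost), all costs positive.

    Returns 0.0 when estimates rank every pair correctly; larger is worse.
    """
    best_actual = min(a for _, a in plans)
    max_est = max(e for e, _ in plans)
    max_act = max(a for _, a in plans)
    score = 0.0
    for (e1, a1), (e2, a2) in combinations(plans, 2):
        if (e1 - e2) * (a1 - a2) >= 0:
            continue  # concordant pair: estimates agree with reality
        # Relevance: pairs involving plans close to the best one weigh more.
        relevance = (best_actual / a1) * (best_actual / a2)
        # Distance: confusing two very different plans is a bigger mistake.
        distance = hypot((e1 - e2) / max_est, (a1 - a2) / max_act)
        score += relevance * distance
    return score


well_ranked = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
badly_ranked = [(3.0, 10.0), (2.0, 20.0), (1.0, 30.0)]
good_score = accuracy_score(well_ranked)
bad_score = accuracy_score(badly_ranked)
```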
MEASURING ACCURACY
PingCAP.com
TAQO components
● Configuration Generator
● Execution Tracker
● Plan Deduplicator
● Ranker
Great Query Performance of Orca vs Planner
(TPC-DS 10TB)
Thanks
