DB reading group
May 16, 2018 Keisuke Suzuki
Today’s paper
Efficiently Compiling Efficient Query Plans for Modern Hardware
● Thomas Neumann, VLDB 2011
○ Creator of HyPer
■ Main memory RDBMS for mixed OLTP and OLAP workloads
● Topic: query execution on modern CPUs
Query processing on RDBMS
Scope of this paper
Executing relational algebraic plans
Ri: relation
σ: selection
Γ: aggregation
⋈: natural join
● Variation of executor
○ Compiled VS Interpreted
○ Pipelining VS Block processing
○ Pull VS Push
● ref: CMU Advanced Database
Systems - 03 Query Compilation
Volcano style execution
● interpreted + pipelining + pull
● Pros
○ easy to implement
○ no materialization
● Cons
○ poor cache locality
○ high-cost virtual function calls
● popular in disk-based DBMS
○ e.g. PostgreSQL
● performance much worse than
hand-written code on modern systems
(figure: 1–2. next() calls flow down the operator tree; 3–4. single tuples flow back up)
Related work: MonetDB/X100
● interpreted + block processing + pull
● Pros
○ better locality than Volcano
● Cons
○ virtual function calls
○ unnecessary tuple copies can happen
at operator boundaries
(e.g., tuples with x <> 7 at step 3)
● still slower than hand-written code
(figure: 1–2. next_chunk() calls flow down the operator tree; 3–4. chunks of tuples flow back up)
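The chunk-at-a-time pull model above can be sketched as follows. This is a minimal illustration, not MonetDB/X100's actual API: the operator names, the `nextChunk()` signature, and the `int[]` tuple layout are all assumptions for the example. Note how `Select` copies passing tuples into a fresh list — exactly the kind of copy at operator boundaries the slide criticizes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of block processing: operators exchange chunks of
// tuples, so the virtual-call cost is amortized over CHUNK_SIZE tuples.
public class BlockPipeline {
    static final int CHUNK_SIZE = 1024;

    interface Operator {
        List<int[]> nextChunk(); // null when exhausted
    }

    // Scan: emits chunks of tuples from an in-memory relation.
    static class Scan implements Operator {
        private final List<int[]> relation;
        private int pos = 0;
        Scan(List<int[]> relation) { this.relation = relation; }
        public List<int[]> nextChunk() {
            if (pos >= relation.size()) return null;
            int end = Math.min(pos + CHUNK_SIZE, relation.size());
            List<int[]> chunk = new ArrayList<>(relation.subList(pos, end));
            pos = end;
            return chunk;
        }
    }

    // Select: filters a chunk on x = 7 (tuple layout {x, id} assumed).
    // Copying survivors into a new list is the boundary copy the slide notes.
    static class Select implements Operator {
        private final Operator child;
        Select(Operator child) { this.child = child; }
        public List<int[]> nextChunk() {
            List<int[]> in = child.nextChunk();
            if (in == null) return null;
            List<int[]> out = new ArrayList<>();
            for (int[] t : in) if (t[0] == 7) out.add(t);
            return out;
        }
    }

    public static void main(String[] args) {
        List<int[]> r1 = new ArrayList<>();
        for (int i = 0; i < 10; i++) r1.add(new int[]{i % 8, i});
        Operator root = new Select(new Scan(r1));
        int count = 0;
        for (List<int[]> c = root.nextChunk(); c != null; c = root.nextChunk())
            count += c.size();
        System.out.println(count); // only i = 7 has x = 7 -> prints 1
    }
}
```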
Proposed method
● compiled + block processing + push
○ tuples are pushed to the next pipeline
breaker (e.g. hash, aggregation, …)
● Pros
○ good locality
○ no virtual function calls
○ generated query execution code is
easy to parallelize
■ SIMD
■ multi threading
(figure: 1. the filter loops directly over R1's tuples)
Translate the algebraic plan into code fragments
Translation: Pull based
interface Node { Tuple next(); }
class JoinNode implements Node {
    Node left, right;
    Tuple next() { .. }
}
class SelectNode implements Node {
    Node child;
    Tuple next() { .. }
}
class ScanNode implements Node {
    Tuple next() { .. }
}
● Simple pipelining of operator nodes
● Tree structure
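The skeleton above can be fleshed out into a small runnable pipeline. This is a sketch under assumptions: tuples are `int[]` with layout `{x, id}`, the predicate is the slide's x = 7, and the join node is omitted since its body is elided in the skeleton.

```java
import java.util.Iterator;
import java.util.List;

// Runnable sketch of the Volcano-style pull interface:
// each next() call pulls exactly one tuple through the pipeline.
public class PullPipeline {
    interface Node { int[] next(); } // returns null when exhausted

    static class ScanNode implements Node {
        private final Iterator<int[]> it;
        ScanNode(List<int[]> relation) { this.it = relation.iterator(); }
        public int[] next() { return it.hasNext() ? it.next() : null; }
    }

    static class SelectNode implements Node {
        private final Node child;
        SelectNode(Node child) { this.child = child; }
        public int[] next() {
            // keep pulling from the child until a tuple passes x = 7
            for (int[] t = child.next(); t != null; t = child.next())
                if (t[0] == 7) return t;
            return null;
        }
    }

    public static void main(String[] args) {
        List<int[]> r1 = List.of(new int[]{7, 1}, new int[]{3, 2}, new int[]{7, 3});
        Node root = new SelectNode(new ScanNode(r1));
        int count = 0;
        for (int[] t = root.next(); t != null; t = root.next()) count++;
        System.out.println(count); // two tuples have x = 7 -> prints 2
    }
}
```

Every tuple crosses each operator boundary through a (virtual) `next()` call, which is the per-tuple overhead the Volcano slide criticizes.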
Translation: Proposed method
● Not a tree structure
● Operator boundaries become ambiguous
Producer / Consumer interface
● produce()
○ asks the operator to produce results
● consume(attributes, source)
○ called to push results toward the
operator
● Flow
1. call produce() of root operator
2. recursively call produce() until
reaching leaf operator
3. leaf operator generates results
4. recursively call consume()
until reaching root operator
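The four-step flow above can be sketched like this. One caveat: in the paper, produce() and consume() *generate* code rather than process tuples; to keep the example runnable, this sketch executes the push directly. Operator names and the `int[]` tuple layout `{x, id}` are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the produce()/consume() flow: produce() requests go down the
// tree, then tuples are pushed back up via consume().
public class PushPipeline {
    interface Operator {
        void produce();                              // ask for results
        void consume(int[] tuple, Operator source);  // push one tuple upward
    }

    static class Scan implements Operator {
        private final List<int[]> relation;
        Operator parent;
        Scan(List<int[]> relation) { this.relation = relation; }
        public void produce() {                      // step 3: generate results
            for (int[] t : relation) parent.consume(t, this); // step 4
        }
        public void consume(int[] t, Operator s) { throw new UnsupportedOperationException(); }
    }

    static class Select implements Operator {
        private final Operator child;
        Operator parent;
        Select(Operator child) { this.child = child; }
        public void produce() { child.produce(); }   // step 2: recurse down
        public void consume(int[] t, Operator s) {
            if (t[0] == 7) parent.consume(t, this);  // predicate x = 7
        }
    }

    static class Collect implements Operator {       // stand-in root operator
        final List<int[]> results = new ArrayList<>();
        private final Operator child;
        Collect(Operator child) { this.child = child; }
        public void produce() { child.produce(); }   // step 1: start at root
        public void consume(int[] t, Operator s) { results.add(t); }
    }

    public static void main(String[] args) {
        Scan scan = new Scan(List.of(new int[]{7, 1}, new int[]{3, 2}, new int[]{7, 3}));
        Select sel = new Select(scan);
        Collect root = new Collect(sel);
        scan.parent = sel; sel.parent = root;
        root.produce();
        System.out.println(root.results.size()); // prints 2
    }
}
```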
Example
⋈{a=b}.produce
-> σ{x=7}.produce
-> scan{R1}.produce
(read tuples from R1)
-> σ{x=7}.consume
(select tuples with x = 7)
-> ⋈{a=b}.consume
(materialize tuples in hash table)
Materialization breaks the loop: the hash join is a pipeline breaker
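The code generated for the trace above roughly fuses each pipeline into one tight loop, with the hash-join build as the only materialization point. The sketch below shows the shape of that generated code (in Java rather than LLVM IR, with an assumed tuple layout R1 = {x, a}, R2 = {b, y}); it is an illustration of the technique, not HyPer's actual output.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

// Shape of generated code for  join{a=b}( select{x=7}(R1), R2 ):
// pipeline 1 ends at the pipeline breaker (hash table build),
// pipeline 2 probes it and emits join results.
public class GeneratedJoin {
    public static List<int[]> run(List<int[]> r1, List<int[]> r2) {
        // pipeline 1: scan R1, filter x = 7, materialize into hash table on a
        HashMap<Integer, List<int[]>> ht = new HashMap<>();
        for (int[] t : r1)
            if (t[0] == 7)
                ht.computeIfAbsent(t[1], k -> new ArrayList<>()).add(t);

        // pipeline 2: scan R2, probe hash table on b, emit joined tuples
        List<int[]> out = new ArrayList<>();
        for (int[] s : r2) {
            List<int[]> matches = ht.get(s[0]);
            if (matches != null)
                for (int[] t : matches)
                    out.add(new int[]{t[0], t[1], s[0], s[1]});
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> r1 = List.of(new int[]{7, 1}, new int[]{3, 1}, new int[]{7, 2});
        List<int[]> r2 = List.of(new int[]{1, 10}, new int[]{2, 20}, new int[]{5, 50});
        System.out.println(run(r1, r2).size()); // two matches -> prints 2
    }
}
```

Within each loop, a tuple stays in registers from scan to breaker; no virtual calls or per-operator copies occur inside the pipeline.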
Generating Machine Code
● At first: generate C++ code -> compile -> load as a shared library
○ their system (HyPer) is written in C++
○ Bad: slow compilation (multiple seconds)
○ Bad: C++ does not offer total control over the generated code
● Next: mixed LLVM IR and C++ code
○ drive and connect operators with LLVM, calling pre-compiled C++
functions for complex processing (e.g. disk I/O, memory allocation)
○ good: fast compilation (a few milliseconds)
○ good: LLVM produces more robust assembly than manual writing
Performance Tuning
● Branch prediction
○ a branch that is nearly 0% or 100% true is cheap
○ a branch that is ~50% true is expensive
● example: when probing a hash table, an entry for the hash value
usually exists, but collisions are rare
-> restructure the probe so the 1st iteration's check is almost always
true and the 2nd iteration's almost always false (~20% faster)
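One way to picture this branch restructuring is the hash-chain probe below. This is a hedged sketch of the general idea, not the paper's exact code: splitting one mixed loop condition into an `if` (almost always true: an entry exists) plus a `do/while` back-edge (almost always false: no collision chain) gives the predictor two near-deterministic branches instead of one 50/50 branch. Both variants compute the same result.

```java
// Sketch of branch-layout tuning for a hash-chain probe.
public class BranchLayout {
    static class Entry { int key, value; Entry next; }

    // naive: a single while loop; its loop branch mixes the frequent
    // "entry exists" case with the rare "collision chain" case
    static int probeNaive(Entry e, int key) {
        int sum = 0;
        while (e != null) {
            if (e.key == key) sum += e.value;
            e = e.next;
        }
        return sum;
    }

    // tuned: the outer if is almost always true (hash value exists);
    // the do/while back-edge is almost always false (collisions are rare)
    static int probeTuned(Entry e, int key) {
        int sum = 0;
        if (e != null) {
            do {
                if (e.key == key) sum += e.value;
                e = e.next;
            } while (e != null);
        }
        return sum;
    }

    public static void main(String[] args) {
        Entry a = new Entry(); a.key = 5; a.value = 10;
        Entry b = new Entry(); b.key = 9; b.value = 20;
        a.next = b; // a small collision chain
        System.out.println(probeNaive(a, 5) == probeTuned(a, 5)); // prints true
    }
}
```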
Performance on OLTP / OLAP
● OLTP: small performance improvement
○ queries touch only a small number of tuples
● OLAP: big performance improvement
Criticism: Maintainability of operator template
● Template expansion easily becomes too complex
○ the code base grows as more and more optimizations are added
○ one of the major reasons the pull (iterator) model is preferred
● the generated code is in a low-level language (LLVM IR), which is hard to read and maintain
Follow-up studies address this problem
● e.g. Building Efficient Query Engines in a High-Level Language
