Tempura: A General Cost-Based Optimizer Framework for Incremental Data Processing
Zuozhi Wang², Kai Zeng¹, Botong Huang¹, Wei Chen¹, Xiaozong Cui¹, Bo Wang¹, Ji Liu¹, Liya Fan¹, Dachuan Qu¹, Zhenyu Hou¹, Tao Guan¹, Chen Li², Jingren Zhou¹
1. Alibaba Group 2. UC Irvine
Zuozhi Wang
Incremental Computation
• Widely Used in Many Scenarios
  • Discretized Stream Processing
  • Progressive Data Warehouse
  • Late Data Processing
  • Incremental View Maintenance
  • …
• Different Scenarios Have Different Characteristics
Total Income of the Day over Time
Discretized Stream Processing
[Figure: timeline from 1AM to 12 midnight; a new batch of Sales data arrives at each hour.]
Data arrives continuously at 1-hour intervals.
Total Income of the Day over Time
Discretized Stream Processing
[Figure: timeline from 1AM to 12 midnight. Each hourly Sales batch ($100) is summed (∑) and added (+) to the running total: $100 at 1AM, $200 at 2AM, ……, $2300 at 11PM, $2400 at 12 midnight.]
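The running total on this slide can be sketched in a few lines. This is only an illustration, assuming each hourly batch is a list of sale amounts; the function name `running_totals` and the sample amounts are hypothetical and not taken from the talk.

```python
from itertools import accumulate


def running_totals(hourly_batches):
    """Return the total income after each hourly batch has arrived."""
    # Sum each batch on its own, then add it to the total of all earlier batches.
    batch_sums = (sum(batch) for batch in hourly_batches)
    return list(accumulate(batch_sums))


# Hypothetical data: three hourly batches worth $100 each.
batches = [[60, 40], [100], [30, 70]]
totals = running_totals(batches)
```

Each entry of `totals` depends only on the previous total and the new batch, which is what makes the computation incremental.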
End-of-Day Report
Data Warehouse
[Figure: timeline from 1AM to 12 midnight with hourly Sales batches; an "Accumulate All Data" span covers the day, shown alongside a Resource Usage curve.]
Traditional Batch Computation: Accumulate all data and compute at the end.
End-of-Day Report
Data Warehouse
[Figure: timeline from 1AM to 12 midnight with hourly Sales batches; an "Accumulate All Data" span covers the day, shown alongside a Resource Usage curve.]
Problem: Many routine daily analytical queries run at around the same time, causing high cluster resource load at midnight.
End-of-Day Report
Progressive Data Warehouse
[Figure: timeline from 1AM to 12 midnight with hourly Sales batches, shown alongside a Cluster Usage curve.]
Incremental Computation: Update the result as soon as new data arrives.
Problem: Some jobs still execute during rush hours.
End-of-Day Report
Progressive Data Warehouse
[Figure: timeline from 1AM to 12 midnight with hourly Sales batches, shown alongside a Cluster Usage curve.]
Observation: The user only cares about the final result.
Intermediate incremental jobs therefore have more flexibility.
End-of-Day Report
Progressive Data Warehouse
[Figure: timeline from 1AM to 12 midnight with hourly Sales batches, shown alongside a predictable resource usage pattern.]
Cost Vector: [ 0.9 0.3 …… 0.2 1.0 ]
Assign a cost factor to each time point based on resource usage.
The optimizer can choose to skip the execution at 1AM.
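To make the idea of a cost vector concrete, here is a deliberately simplified sketch. It assumes a toy model in which processing a batch costs its size times the cost factor of the time point where it runs, and each batch can be deferred to any later time point. This is not Tempura's actual cost model; the four-element vector and the names `cheapest_slot` and `schedule` are hypothetical.

```python
def cheapest_slot(arrival, cost_factors):
    """Return the time point t >= arrival with the smallest cost factor."""
    candidates = range(arrival, len(cost_factors))
    return min(candidates, key=lambda t: cost_factors[t])


def schedule(cost_factors):
    """Map each chosen execution time point to the batches it processes."""
    plan = {}
    for arrival in range(len(cost_factors)):
        slot = cheapest_slot(arrival, cost_factors)
        plan.setdefault(slot, []).append(arrival)
    return plan


# Hypothetical cost vector for four time points (index 0 plays the role of 1AM).
cost_factors = [0.9, 0.3, 0.2, 1.0]
plan = schedule(cost_factors)
```

Under these assumptions no work is scheduled at index 0, where the cost factor is high; that batch is deferred to a cheaper time point, while the last batch still runs at the final time point so that the result can be delivered.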
Late Data Processing
[Figure: timeline from 12 midnight to 6AM. The initial Sales data is followed by small amounts of late Sales data that arrive continuously; the output moves from a Partial Result to the Complete Result.]
Operator1: Low Incremental Computation Overhead (filters, aggregations, ...)
Operator2: High Incremental Computation Overhead (outer joins, nested queries, …)
Late Data Processing
[Figure: timeline from 12 midnight to 6AM. The initial Sales data is followed by late Sales data at 1AM, ……, 5AM, 6AM; the output moves from a Partial Result to the Complete Result.]
Only operator1 is computed incrementally.
Operator2 (high overhead) is computed only when the user needs the output result.
Tempura: A General Cost-Based Optimizer Framework for Incremental Data Processing
• Incremental Computation in its Most General Form
• Single Optimizer Framework for Many Scenarios
Inputs
T = [ T1, T2, …, Tk ]  k time points
D = [ D1, D2, …, Dk ]  Input data at each time
Q = [ Query(D1), Ø, …, Query(Dk) ]  Expected delivered result
C = [ c1, c2, …, ck ]  Cost function
(Optimizer Output)
P = [ … Ø ]  Incremental Plan
How to do Incremental Computation?
• Many Incremental Computation Algorithms
  • Retractable
  • Non-Retractable
  • Outer Join View Maintenance
  • Higher Order View Maintenance
  • …
• Best Algorithm is Data Dependent
How to do Incremental Computation?
Compute gross income based on profit from sales and loss from returns.

SELECT SUM(income) AS gross FROM
  (SELECT
     sales.id,
     CASE
       WHEN loss IS NOT NULL THEN -loss
       ELSE profit END AS income
   FROM sales LEFT OUTER JOIN returns
     ON sales.id = returns.id
  ) AS sales_status

Sales
id   profit
o1   100
o2   100
o3   100

Returns
id   loss
o1   10
o2   20

Left Outer Join
id   income
o1   -10
o2   -20
o3   100

Sum
gross
70
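The example query can be checked end to end with an in-memory SQLite database. This is a minimal sketch for verification only: the table definitions (column types, primary keys) are assumptions, and SQLite stands in for whatever engine the talk has in mind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema; the slide only shows column names and values.
conn.executescript("""
    CREATE TABLE sales (id TEXT PRIMARY KEY, profit INTEGER);
    CREATE TABLE returns (id TEXT PRIMARY KEY, loss INTEGER);
    INSERT INTO sales VALUES ('o1', 100), ('o2', 100), ('o3', 100);
    INSERT INTO returns VALUES ('o1', 10), ('o2', 20);
""")

QUERY = """
SELECT SUM(income) AS gross FROM
  (SELECT
     sales.id,
     CASE
       WHEN loss IS NOT NULL THEN -loss
       ELSE profit END AS income
   FROM sales LEFT OUTER JOIN returns
     ON sales.id = returns.id
  ) AS sales_status
"""

# Orders o1 and o2 were returned, so they contribute -10 and -20; o3 contributes 100.
gross = conn.execute(QUERY).fetchone()[0]
print(gross)
```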
Retractable Incremental Computation
time   new input                                  income (id: value)                   gross
t1     Sales: (o1, 100), (o2, 100), (o3, 100)     o1: 100, o2: 100, o3: 100            300
t2     Returns: (o1, 10)                          o1: 100 → -10, o2: 100, o3: 100      190
t3     Returns: (o2, 20)                          o1: -10, o2: 100 → -20, o3: 100      70
t4     (no new input)                             o1: -10, o2: -20, o3: 100            70
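The retractable strategy on this slide can be simulated as follows. This is a sketch under assumptions, not the paper's algorithm: each order has at most one return, events are grouped per time point, and the function name `retractable_run` and the event encoding are hypothetical.

```python
def retractable_run(events):
    """Emit income eagerly and retract it when a matching return arrives.

    events: one list per time point; each item is
    ("sale", order_id, profit) or ("return", order_id, loss).
    Returns (gross after each time point, number of retractions).
    """
    income = {}   # order_id -> income row currently in the output
    losses = {}   # order_id -> loss of a return seen so far
    gross, retractions = 0, 0
    history = []
    for batch in events:
        for kind, order_id, amount in batch:
            if kind == "sale":
                value = -losses[order_id] if order_id in losses else amount
                income[order_id] = value
                gross += value
            else:
                losses[order_id] = amount
                if order_id in income:
                    gross -= income[order_id]   # retract the old income row
                    retractions += 1
                    income[order_id] = -amount  # insert the corrected row
                    gross += -amount
        history.append(gross)
    return history, retractions


events = [
    [("sale", "o1", 100), ("sale", "o2", 100), ("sale", "o3", 100)],  # t1
    [("return", "o1", 10)],                                           # t2
    [("return", "o2", 20)],                                           # t3
    [],                                                               # t4
]
history, retractions = retractable_run(events)
```

Every return that matches an already emitted sale forces one retraction, so the overhead of this strategy grows with the number of returns.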
Non-Retractable Incremental Computation
time   new input                                  income output so far                 gross
t1     Sales: (o1, 100), (o2, 100), (o3, 100)     (empty)                              (empty)
t2     Returns: (o1, 10)                          o1: -10                              -10
t3     Returns: (o2, 20)                          o1: -10, o2: -20                     -30
t4     (no new input)                             o1: -10, o2: -20, o3: 100            70
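The non-retractable strategy can be simulated in the same style. Again this is only a sketch under assumptions: each order has at most one return, the last time point marks the end of the input, and the name `non_retractable_run` and the event encoding are hypothetical.

```python
def non_retractable_run(events):
    """Hold each sale back until its income is final, so nothing is retracted.

    events: one list per time point; each item is
    ("sale", order_id, profit) or ("return", order_id, loss).
    Returns the gross after each time point (None while nothing was emitted).
    """
    pending = {}   # order_id -> profit of sales that are held back
    losses = {}    # order_id -> loss of a return seen so far
    gross = None
    history = []
    last = len(events) - 1
    for t, batch in enumerate(events):
        emitted = []
        for kind, order_id, amount in batch:
            if kind == "sale":
                if order_id in losses:
                    emitted.append(-losses[order_id])
                else:
                    pending[order_id] = amount
            else:
                losses[order_id] = amount
                if order_id in pending:
                    del pending[order_id]
                    emitted.append(-amount)
        if t == last:
            # No more returns can arrive: unmatched sales are now final.
            emitted.extend(pending.values())
            pending.clear()
        if emitted:
            gross = (gross or 0) + sum(emitted)
        history.append(gross)
    return history


events = [
    [("sale", "o1", 100), ("sale", "o2", 100), ("sale", "o3", 100)],  # t1
    [("return", "o1", 10)],                                           # t2
    [("return", "o2", 20)],                                           # t3
    [],                                                               # t4
]
history = non_retractable_run(events)
```

The price of avoiding retractions is the `pending` state: unmatched sales are kept until the end, and their income is only added then.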
Retractable vs Non-Retractable Algorithm
• Retractable Incremental Computation
  • Needs to retract output whenever new return orders arrive.
  • Better when return orders are rare (less computation at the end).
• Non-Retractable Incremental Computation
  • Holds back more data to ensure no retractions (less computation overhead).
  • Better when return orders are frequent.
• Let an optimizer automatically find the best algorithm!
Tempura: Contributions
• Propose a New Model for Incremental Computation
• Provide a Rewrite-Rule Framework
• Describes and unifies many incremental computation techniques
• Integrate with a Volcano/Cascades-style Optimizer
TVR-Based Incremental Processing
• Time-Varying Relation (TVR)
  • Mapping from a time domain to relations
• Snapshot and Delta
• Merge Operation (+#): Snapshot(t1) +# Delta(t1, t2) → Snapshot(t2)

Snapshot(t1)
id   income
o1   100
o2   100
o3   100

Delta(t1, t2)
id   income   #
o1   100      -1
o1   -10      +1
o4   170      +1

Snapshot(t2)
id   income
o1   -10
o2   100
o3   100
o4   170
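The merge of a snapshot with a delta can be sketched with multisets. This is an illustrative reading of the slide, assuming that the `#` column is a signed multiplicity (-1 deletes a row, +1 inserts one); the function name `merge` is hypothetical.

```python
from collections import Counter


def merge(snapshot, delta):
    """Apply signed multiplicities from a delta to a snapshot (a multiset of rows)."""
    result = Counter(snapshot)
    for row, multiplicity in delta:
        result[row] += multiplicity
    # Rows whose multiplicity dropped to zero are no longer in the relation.
    return {row: count for row, count in result.items() if count != 0}


snapshot_t1 = {("o1", 100): 1, ("o2", 100): 1, ("o3", 100): 1}
delta_t1_t2 = [(("o1", 100), -1), (("o1", -10), +1), (("o4", 170), +1)]

snapshot_t2 = merge(snapshot_t1, delta_t1_t2)
```

With the values from the slide, the merged result contains the rows of Snapshot(t2): o1 with income -10, o2 and o3 unchanged, and the new row o4.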
Query Optimization on TVR
[Figure: operator tree of the query over TVRs. The inputs S and R feed a left outer join (R ⋈lo S), followed by an aggregation, Σ(R ⋈lo S).]

SELECT SUM(income) AS gross FROM
  (SELECT
     sales.id,
     CASE
       WHEN loss IS NOT NULL THEN -loss
       ELSE profit END AS income
   FROM sales LEFT OUTER JOIN returns
     ON sales.id = returns.id
  ) AS sales_status

Each horizontal line is a TVR.
TVR-Generating Rules
[Figure: the operator tree over the TVRs S, R, R ⋈lo S, and Σ(R ⋈lo S), expanded at Snapshot(1), Delta(1,2), and Snapshot(2). Operators that are not yet derived are marked "? ? ?".]
With 2 time points: expand the operator tree into 3 trees: Snapshot(1), Delta(1,2), and Snapshot(2).
TVR-Generating Rules
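[Figure: the operator tree over the TVRs S, R, R ⋈lo S, and Σ(R ⋈lo S), expanded at Snapshot(1), Delta(1,2), and Snapshot(2). The delta of the left outer join is now produced by a Δ⋈lo operator; the remaining operator is still marked "? ? ?".]
Compute the delta of left outer join.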
TVR-Generating Rules
[Figure: the operator tree over the TVRs S, R, R ⋈lo S, and Σ(R ⋈lo S), expanded at Snapshot(1), Delta(1,2), and Snapshot(2). The Delta(1,2) tree now contains both Δ⋈lo and an aggregation Σ.]
Compute the delta of aggregation.
Intra-TVR Rules
[Figure: the operator tree over the TVRs S, R, R ⋈lo S, and Σ(R ⋈lo S), expanded at Snapshot(1), Delta(1,2), and Snapshot(2). A +Union operator merges a snapshot with a delta.]
How to merge snapshot and delta?
Intra-TVR Rules
[Figure: the operator tree over the TVRs S, R, R ⋈lo S, and Σ(R ⋈lo S), expanded at Snapshot(1), Delta(1,2), and Snapshot(2). +Union and +Sum operators merge a snapshot with a delta.]
How to merge snapshot and delta?
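For an aggregate such as SUM, merging a snapshot with a delta amounts to adding the aggregated delta to the old aggregate. The sketch below illustrates this with the Snapshot/Delta values from the TVR slide; treating the `#` column as a signed multiplicity and the name `sum_delta` are assumptions made for illustration.

```python
def sum_delta(delta):
    """Aggregate a delta: each row contributes income times its signed multiplicity."""
    return sum(income * multiplicity for _order_id, income, multiplicity in delta)


snapshot_t1 = [("o1", 100), ("o2", 100), ("o3", 100)]
delta_t1_t2 = [("o1", 100, -1), ("o1", -10, +1), ("o4", 170, +1)]
snapshot_t2 = [("o1", -10), ("o2", 100), ("o3", 100), ("o4", 170)]

gross_t1 = sum(income for _order_id, income in snapshot_t1)
# Incremental path: old aggregate plus the aggregated delta.
gross_t2_incremental = gross_t1 + sum_delta(delta_t1_t2)
# Reference path: recompute the aggregate from the full new snapshot.
gross_t2_recomputed = sum(income for _order_id, income in snapshot_t2)
```

Both paths give the same value, but the incremental one only touches the three delta rows instead of the whole snapshot.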
Inter-TVR Rules: Non-Retractable Algorithm
[Figure: the operator tree over the TVRs S, R, R ⋈lo S, and Σ(R ⋈lo S), expanded at Snapshot(1), Delta(1,2), and Snapshot(2). Two additional TVRs, R ⋈ S (inner join) and R ⋈la S (left anti join), are combined by Union.]
Decompose Left Outer Join into:
Left Anti Join: Retractable
Inner Join: Insertion Only
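The decomposition can be checked on the running example. This is a small sketch, assuming order ids are unique so that each relation fits in a dictionary keyed by id; the helper names are hypothetical.

```python
def inner_join(sales, returns):
    """Sales that have a matching return."""
    return {oid: (profit, returns[oid]) for oid, profit in sales.items() if oid in returns}


def left_anti_join(sales, returns):
    """Sales without a matching return, padded with None for the loss."""
    return {oid: (profit, None) for oid, profit in sales.items() if oid not in returns}


def left_outer_join(sales, returns):
    """All sales, with the loss of the matching return or None."""
    return {oid: (profit, returns.get(oid)) for oid, profit in sales.items()}


sales = {"o1": 100, "o2": 100, "o3": 100}
returns_t2 = {"o1": 10}             # returns known at t2
returns_t3 = {"o1": 10, "o2": 20}   # returns known at t3

# The union of the two parts equals the left outer join.
decomposed = {**inner_join(sales, returns_t3), **left_anti_join(sales, returns_t3)}
```

As more returns arrive, the inner-join part only gains rows, while the anti-join part loses rows; this is why only the anti-join part needs retractions.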
Full Search Space
[Figure: the full search space over the TVRs S, R, R ⋈ S, R ⋈la S, R ⋈lo S, Σ(R ⋈ S), Σ(R ⋈la S), and Σ(R ⋈lo S) at Snapshot(1), Delta(1,2), and Snapshot(2), connected by ⋈, ⋈la, ⋈lo, Δ⋈, Δ⋈lo, Σ, Union, +Sum, and +Union operators.]
Any path that can reach the final output is a valid execution plan.
Plan: Retractable Algorithm
[Figure: the full search space diagram over S, R, R ⋈ S, R ⋈la S, R ⋈lo S, and their aggregations at Snapshot(1), Delta(1,2), and Snapshot(2), showing the plan that corresponds to the retractable algorithm.]
Plan: Non-Retractable Algorithm
[Figure: the full search space diagram over S, R, R ⋈ S, R ⋈la S, R ⋈lo S, and their aggregations at Snapshot(1), Delta(1,2), and Snapshot(2), showing the plan that corresponds to the non-retractable algorithm.]
Tempura: More Details in the Paper
• Integration with Volcano/Cascades-style Optimizer
• Speed up Optimization Process
• How to choose optimal plan?
• Dynamic re-optimization
• Statistics Estimation
• ……
Experimental Study
• TPC-DS
• 4 basic incremental computation algorithms:
• IM1 (Retractable Incremental Computation)
• IM2 (Non-retractable Incremental Computation)
• OJV (Outer-Join View Maintenance)
• HOV (Higher-Order View Maintenance)
• Tempura
• Unify all 4 algorithms.
Effectiveness
• 5 queries, 4 data arrival patterns, and 2 cost functions.
• Different methods are good at different scenarios.
• Tempura is always the best.
Query Optimization Performance
• For 80% of the TPC-DS queries, optimization finished within 3 seconds.
• Slower than traditional optimizers, but can generate much better plans.
Tempura
• Open Source
• Built on top of Apache Calcite
• https://github.com/alibaba/cost-based-incremental-optimizer
• https://issues.apache.org/jira/browse/CALCITE-4568
