Big data analytics using a custom SQL engine
Andrii Tsvielodub, Zoomdata
2018
Agenda
1. What is Zoomdata and why we need SQL
2. What a typical Query Planner looks like and what it does
3. What is Apache Calcite and how it works
4. How we started using Apache Calcite in Zoomdata
Who am I
● Java Developer
● 8+ years of experience
● 3+ years in Zoomdata
● Work in Query Engine team
What is Zoomdata
Zoomdata Architecture
[Diagram: Cross-Database Queries; labels: DB, DB, Source, Zoomdata Query Engine]
SQL
Relational Algebra | Query Plan Optimization
Query Pushdown | Federated Queries
Apache Calcite
A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources
Apache Calcite - Features
● Standard SQL
○ Industry-standard SQL parser, validator and JDBC driver.
● Query optimization
○ Represent your query in relational algebra, transform using planning rules, and optimize according to a cost model using smart algorithms.
● Any data, anywhere
○ Connect to third-party data sources, browse metadata, and optimize by pushing the computation to the data.
Apache Calcite - Used By
Query Planner
● Client Communication Manager
○ Local Client Protocols
○ Remote Client Protocols
● Process Manager
○ Admission Control
○ Dispatch & Scheduling
● Relational Query Processor
○ Query Parsing & Authorization
○ Query Rewrite
○ Query Optimizer
○ Plan Executor
○ DDL & Utility Processing
● Transactional Storage Manager
○ Access Methods
○ Buffer Manager
○ Lock Manager
○ Log Manager
● Shared Components and Utilities
○ Catalog Manager
○ Memory Manager
○ Admin., Monitoring & Utilities
○ Replication & Loading Services
○ Batch Utilities
Joseph M. Hellerstein, Michael Stonebraker, James Hamilton. Architecture of a Database System. Foundations and Trends in Databases, 1, 2 (2007).
Query Planner Overview Agenda
● Parse SQL query
● Create an optimal logical plan of a query
● Create an optimal physical plan and execute
Sales (tx_id, product_id, sale_date, qty, seller_id, buyer_id)
Products (product_id, name, description, price, manufacturer_id)
Relationship: Sales (*) to Products (1) via product_id
What the duck?
select
  CEIL(s.sale_date TO DAY) sale_day,
  sum(s.qty*p.price) sale_sum
from sales s join product p
  on p.product_id = s.product_id
where
  p.name = 'Giant Rubber Duck'
group by sale_day
The query is turned into a tree of relational operators, built one operator at a time: Scan, Join, Filter, Project, Aggregate. Data flows from the scans at the bottom up to the aggregate at the top:
LogicalAggregate (group = sale_day, agg = sum(qty * price))
  LogicalProject (sale_date, qty, price)
    LogicalFilter (name = '...')
      LogicalJoin (product_id)
        LogicalScan (sales)
        LogicalScan (product)
import org.apache.calcite.avatica.util.TimeUnitRange;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.core.JoinRelType;
import org.apache.calcite.sql.fun.SqlStdOperatorTable;
import org.apache.calcite.tools.RelBuilder;

RelBuilder b = ...;
RelNode query = b
    .scan("sales").as("s")
    .scan("product").as("p")
    // Inner join on the column that both inputs share
    .join(JoinRelType.INNER, "product_id")
    // Fields are addressed by (table alias, field name)
    .filter(
        b.call(SqlStdOperatorTable.EQUALS,
            b.field("p", "name"),
            b.literal("Giant Rubber Duck")))
    // Each projected expression gets its alias via b.alias(...)
    .project(
        b.alias(
            b.call(SqlStdOperatorTable.CEIL,
                b.field("s", "sale_date"),
                b.getRexBuilder().makeFlag(TimeUnitRange.DAY)),
            "sale_day"),
        b.alias(
            b.call(SqlStdOperatorTable.MULTIPLY,
                b.field("s", "qty"), b.field("p", "price")),
            "sale"))
    .aggregate(
        b.groupKey("sale_day"),
        b.sum(false, "sale_sum", b.field("sale")))
    .build();
Query Planner Overview Agenda
● Parse SQL query
○ SQL query is parsed into a logical plan
○ Logical plan is an abstract model of a query
○ Calcite has classes to build a logical plan
○ You can create the plan even without a SQL query
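To make "an abstract model of a query" concrete, here is a minimal sketch in plain Java. PlanNode and DuckPlan are hypothetical names invented for this illustration and are not Calcite classes; the sketch only shows that a logical plan is a tree of operators that can be assembled without any SQL text.

```java
import java.util.Arrays;
import java.util.List;

// Toy stand-in for a logical plan node; NOT Calcite's RelNode.
final class PlanNode {
    final String op;            // operator name, e.g. "Join"
    final String detail;        // human-readable arguments
    final List<PlanNode> inputs;

    PlanNode(String op, String detail, PlanNode... inputs) {
        this.op = op;
        this.detail = detail;
        this.inputs = Arrays.asList(inputs);
    }

    // Renders the tree top-down, one operator per line, inputs indented.
    String explain() {
        StringBuilder sb = new StringBuilder();
        explain(sb, 0);
        return sb.toString();
    }

    private void explain(StringBuilder sb, int depth) {
        for (int i = 0; i < depth; i++) sb.append("  ");
        sb.append(op).append(" (").append(detail).append(")\n");
        for (PlanNode in : inputs) in.explain(sb, depth + 1);
    }
}

final class DuckPlan {
    // Builds the duck query plan bottom-up, without parsing any SQL.
    static PlanNode build() {
        PlanNode sales = new PlanNode("Scan", "sales");
        PlanNode product = new PlanNode("Scan", "product");
        PlanNode join = new PlanNode("Join", "product_id", sales, product);
        PlanNode filter = new PlanNode("Filter", "name = 'Giant Rubber Duck'", join);
        PlanNode project = new PlanNode("Project", "sale_date, qty, price", filter);
        return new PlanNode("Aggregate", "group = sale_day, agg = sum(qty * price)", project);
    }

    public static void main(String[] args) {
        System.out.print(build().explain());
    }
}
```

Running DuckPlan prints the operators from the aggregate down to the two scans, which is the same shape as the plan diagram.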
Optimized logical plan for the same query, with the filter and the projections pushed below the join:
Aggregate (group = sale_day, agg = sum(qty * price))
  Project (sale_date, qty, price)
    Join (product_id)
      Project (p_id, qty, sale_date)
        Scan (sales)
      Project (p_id, name, price)
        Filter (name = '...')
          Scan (product)
Apache Calcite - Optimization Rules
● org.apache.calcite.plan.RelOptRule
● Matches a pattern in the plan tree
● Can check additional conditions, then modifies the plan
● Rules can be specified dynamically, based on query properties
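The match-then-rewrite idea can be sketched without Calcite. Everything below (Node, Rule, FilterIntoJoinRule, HeuristicPlanner) is a hypothetical toy written for this illustration; it is not RelOptRule and does not use Calcite's API. The rule matches a Filter directly above a Join, checks which join input the predicate reads, and moves the filter onto that input.

```java
import java.util.Arrays;
import java.util.List;

// Toy plan node; a stand-in for illustration, NOT Calcite's RelNode.
final class Node {
    final String op;      // "Scan", "Join" or "Filter"
    final String arg;     // table name, join key, or predicate text
    final String table;   // for Filter: the only table the predicate reads
    final List<Node> inputs;

    Node(String op, String arg, String table, Node... inputs) {
        this.op = op;
        this.arg = arg;
        this.table = table;
        this.inputs = Arrays.asList(inputs);
    }

    // True if some Scan at or below this node reads the given table.
    boolean reads(String t) {
        if (op.equals("Scan")) return arg.equals(t);
        for (Node in : inputs) if (in.reads(t)) return true;
        return false;
    }

    @Override public String toString() {
        return inputs.isEmpty() ? op + "(" + arg + ")" : op + "(" + arg + ")" + inputs;
    }
}

// Hypothetical rule interface: a pattern check plus a rewrite.
interface Rule {
    boolean matches(Node n);
    Node apply(Node n);
}

// Rewrites Filter(Join(l, r)) into a Join whose matching input is filtered.
final class FilterIntoJoinRule implements Rule {
    public boolean matches(Node n) {
        return n.op.equals("Filter") && n.inputs.get(0).op.equals("Join");
    }

    public Node apply(Node filter) {
        Node join = filter.inputs.get(0);
        Node l = join.inputs.get(0);
        Node r = join.inputs.get(1);
        if (l.reads(filter.table)) {
            l = new Node("Filter", filter.arg, filter.table, l);
        } else {
            r = new Node("Filter", filter.arg, filter.table, r);
        }
        return new Node("Join", join.arg, null, l, r);
    }
}

final class HeuristicPlanner {
    // Rebuilds the tree bottom-up, applying the rule wherever it matches.
    static Node optimize(Node n, Rule rule) {
        Node[] kids = new Node[n.inputs.size()];
        for (int i = 0; i < kids.length; i++) kids[i] = optimize(n.inputs.get(i), rule);
        Node rebuilt = new Node(n.op, n.arg, n.table, kids);
        return rule.matches(rebuilt) ? rule.apply(rebuilt) : rebuilt;
    }

    public static void main(String[] args) {
        Node plan = new Node("Filter", "name = 'Giant Rubber Duck'", "product",
            new Node("Join", "product_id", null,
                new Node("Scan", "sales", null),
                new Node("Scan", "product", null)));
        System.out.println(optimize(plan, new FilterIntoJoinRule()));
    }
}
```

A real rule would inspect the predicate's field references instead of relying on a precomputed table tag; the tag keeps the sketch short.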
Apache Calcite - Query Optimizer
Query Optimizer makes all the difference in a DB.
Two general categories:
● Naive heuristic optimizer
● Cost-based optimizer
Apache Calcite - Query Optimizer
● Rule-based/Naive heuristic
○ Fast
○ Good fit for rules that guarantee improvement
○ Limited in plan space exploration
○ Not all aspects of a query can be optimised, e.g. join reordering
Apache Calcite - Query Optimizer
● Cost-based
○ Iteratively explores all possible changes to the plan
○ Estimates the cost of plans
○ Chooses the cheapest one
○ Stops when the search criteria are met
❗ Can take a long time
❗ For some queries, it might be faster to
execute it than to find the optimal plan
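The explore, estimate, choose loop can be sketched as a brute-force search over join orders. This is a toy under stated assumptions, not Calcite's cost-based planner: the class JoinOrderSearch, the cost model (sum of intermediate result sizes), the single fixed selectivity, the row counts and the "manufacturer" table are all invented for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy cost-based search over left-deep join orders; NOT Calcite's planner.
final class JoinOrderSearch {
    // Assumed cost model: sum of intermediate result sizes, where every join
    // keeps a fixed fraction (selectivity) of the cross product of its inputs.
    static double cost(List<String> order, List<String> names, long[] rows, double selectivity) {
        double card = rows[names.indexOf(order.get(0))];
        double total = 0;
        for (int i = 1; i < order.size(); i++) {
            card = card * rows[names.indexOf(order.get(i))] * selectivity;
            total += card;
        }
        return total;
    }

    // Enumerates every join order and keeps the cheapest one.
    static List<String> cheapest(List<String> names, long[] rows, double selectivity) {
        List<String> best = new ArrayList<>();
        search(new ArrayList<String>(), names, rows, selectivity, best);
        return best;
    }

    private static void search(List<String> prefix, List<String> names, long[] rows,
                               double selectivity, List<String> best) {
        if (prefix.size() == names.size()) {
            if (best.isEmpty()
                    || cost(prefix, names, rows, selectivity) < cost(best, names, rows, selectivity)) {
                best.clear();
                best.addAll(prefix);
            }
            return;
        }
        for (String t : names) {
            if (prefix.contains(t)) continue;
            prefix.add(t);
            search(prefix, names, rows, selectivity, best);
            prefix.remove(prefix.size() - 1);
        }
    }

    public static void main(String[] args) {
        // Hypothetical table sizes, chosen only to make the orders differ in cost.
        List<String> names = Arrays.asList("sales", "product", "manufacturer");
        long[] rows = {1_000_000, 1_000, 50};
        System.out.println(cheapest(names, rows, 0.001));
    }
}
```

The number of join orders grows factorially with the number of tables, which is why exhaustive search like this can cost more than running the query.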
Query Planner Overview Agenda
● Parse SQL query
● Create an optimal logical plan of a query
○ Relational Algebra is not that scary
○ Rules can make your plan better (but not always)
○ Calcite has lots of rules and you can write your own
○ There are multiple planning algorithms - simple and cost-based
○ Choose wisely
Physical plan for the same query: logical operators are replaced with concrete algorithms, here Join becomes SortMergeJoin and Aggregate becomes HashAggregate:
HashAggregate (group = sale_day, agg = sum(qty * price))
  Project (sale_date, qty, price)
    SortMergeJoin (product_id)
      Project (p_id, qty, sale_date)
        Scan (sales)
      Project (p_id, name, price)
        Filter (name = '...')
          Scan (product)
Execution in DB
And then the physical plan is executed
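As a rough illustration of what executing a physical plan means, the sketch below implements the two physical operators named in the plan over in-memory lists. Sale, Product and ToyExecutor are hypothetical names and the sample rows are made up; a real engine works on pages and streams, not on Java lists.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy rows for the two tables; only the columns the query needs.
final class Sale {
    final int productId; final String day; final int qty;
    Sale(int productId, String day, int qty) {
        this.productId = productId; this.day = day; this.qty = qty;
    }
}

final class Product {
    final int productId; final String name; final double price;
    Product(int productId, String name, double price) {
        this.productId = productId; this.name = name; this.price = price;
    }
}

final class ToyExecutor {
    // SortMergeJoin on product_id, emitting {day, qty * price} per match.
    // Assumes product_id is unique in products (the "1" side of the relation).
    static List<Object[]> sortMergeJoin(List<Sale> sales, List<Product> products) {
        List<Sale> s = new ArrayList<>(sales);
        List<Product> p = new ArrayList<>(products);
        s.sort(Comparator.comparingInt((Sale x) -> x.productId));
        p.sort(Comparator.comparingInt((Product x) -> x.productId));
        List<Object[]> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < s.size() && j < p.size()) {
            int cmp = Integer.compare(s.get(i).productId, p.get(j).productId);
            if (cmp < 0) {
                i++;
            } else if (cmp > 0) {
                j++;
            } else {
                out.add(new Object[] {s.get(i).day, s.get(i).qty * p.get(j).price});
                i++;  // keep j: more sales may match the same product
            }
        }
        return out;
    }

    // HashAggregate: group by day and sum the amount column.
    static Map<String, Double> hashAggregate(List<Object[]> rows) {
        Map<String, Double> sums = new LinkedHashMap<>();
        for (Object[] r : rows) sums.merge((String) r[0], (Double) r[1], Double::sum);
        return sums;
    }

    public static void main(String[] args) {
        List<Product> products = new ArrayList<>();
        products.add(new Product(1, "Giant Rubber Duck", 10.0));
        products.add(new Product(2, "Tiny Rubber Duck", 1.0));
        // Filter below the join, as in the optimized plan.
        products.removeIf(p -> !p.name.equals("Giant Rubber Duck"));
        List<Sale> sales = new ArrayList<>();
        sales.add(new Sale(1, "2018-05-01", 2));
        sales.add(new Sale(2, "2018-05-01", 5));
        sales.add(new Sale(1, "2018-05-02", 1));
        System.out.println(hashAggregate(sortMergeJoin(sales, products)));
    }
}
```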
The same plan when sales is read from MongoDB and product from a JDBC database. Each node is prefixed with its calling convention, and converters join the conventions:
EnumerableAggregate (group = sale_day, agg = sum(qty * price))
  EnumerableCalc (sale_date, qty, price)
    EnumerableJoin (product_id)
      MongoToEnumerableConverter
        MongoProject (p_id, qty, sale_date)
          MongoTableScan (sales)
      JdbcToEnumerableConverter
        JdbcProject (p_id, name, price)
          JdbcFilter (name = '...')
            JdbcTableScan (product)
Apache Calcite - Plan Execution
● Standard JDBC ResultSet
● Specialized Iterable that returns the data
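The "Iterable that returns the data" idea can be pictured as pull-based operators, each wrapping its input and producing rows only when the consumer asks for them. The sketch below is a plain-Java toy with a hypothetical Operators class; it is not Calcite's own interface and only illustrates the lazy, composable style.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Function;
import java.util.function.Predicate;

// Toy pull-based operators; each one is an Iterable over its input.
final class Operators {
    // Filter: yields only rows that satisfy the predicate, evaluated lazily.
    static <T> Iterable<T> filter(final Iterable<T> input, final Predicate<T> pred) {
        return () -> new Iterator<T>() {
            final Iterator<T> it = input.iterator();
            T buffered;
            boolean ready;

            public boolean hasNext() {
                while (!ready && it.hasNext()) {
                    T candidate = it.next();
                    if (pred.test(candidate)) {
                        buffered = candidate;
                        ready = true;
                    }
                }
                return ready;
            }

            public T next() {
                if (!hasNext()) throw new NoSuchElementException();
                ready = false;
                return buffered;
            }
        };
    }

    // Project: maps each row at the moment it is pulled.
    static <T, R> Iterable<R> project(final Iterable<T> input, final Function<T, R> fn) {
        return () -> new Iterator<R>() {
            final Iterator<T> it = input.iterator();

            public boolean hasNext() { return it.hasNext(); }

            public R next() { return fn.apply(it.next()); }
        };
    }
}
```

A consumer simply loops over the outermost Iterable; no row is filtered or projected until the loop requests it.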
Query Planner Overview Agenda
● Parse SQL query
● Create an optimal logical plan of a query
● Create an optimal physical plan and execute
○ It’s rules again
○ Calling Conventions in Calcite are cool
○ You can convert between conventions
○ Calling Convention knows how to execute
Stop. Demo time!
Apache Calcite - Stuff To Remember
● Calcite knows how to parse SQL and
optimize queries
● Physical nodes have Calling Convention,
it says how to execute the plan
● You can mix Calling Conventions in one
plan using Converter Nodes
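One way to picture converter nodes: walk the physical plan and, wherever an input's convention differs from its parent's, wrap that input in a converter. PhysNode and ConverterInserter below are hypothetical toy classes; Calcite arrives at the same shape through its planner rather than a tree walk like this one.

```java
import java.util.Arrays;
import java.util.List;

// Toy physical node tagged with a calling convention; NOT Calcite's classes.
final class PhysNode {
    final String convention;   // e.g. "Jdbc", "Mongo", "Enumerable"
    final String name;         // e.g. "JdbcFilter"
    final List<PhysNode> inputs;

    PhysNode(String convention, String name, PhysNode... inputs) {
        this.convention = convention;
        this.name = name;
        this.inputs = Arrays.asList(inputs);
    }

    // Renders the tree with two spaces of indentation per level.
    String explain(int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) sb.append("  ");
        sb.append(name).append('\n');
        for (PhysNode in : inputs) sb.append(in.explain(depth + 1));
        return sb.toString();
    }
}

final class ConverterInserter {
    // Wraps every input whose convention differs from its parent's in a
    // converter node named <From>To<To>Converter.
    static PhysNode insert(PhysNode n) {
        PhysNode[] kids = new PhysNode[n.inputs.size()];
        for (int i = 0; i < kids.length; i++) {
            PhysNode kid = insert(n.inputs.get(i));
            kids[i] = kid.convention.equals(n.convention)
                ? kid
                : new PhysNode(n.convention,
                    kid.convention + "To" + n.convention + "Converter", kid);
        }
        return new PhysNode(n.convention, n.name, kids);
    }

    public static void main(String[] args) {
        PhysNode plan = new PhysNode("Enumerable", "EnumerableJoin",
            new PhysNode("Mongo", "MongoProject",
                new PhysNode("Mongo", "MongoTableScan")),
            new PhysNode("Jdbc", "JdbcProject",
                new PhysNode("Jdbc", "JdbcFilter",
                    new PhysNode("Jdbc", "JdbcTableScan"))));
        System.out.print(insert(plan).explain(0));
    }
}
```

Everything below a converter runs in the source system's own convention, and everything above it runs in the engine.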
Back to Zoomdata
Query Engine Architecture
[Diagram labels: Query Engine, Spark adapter, Connector adapter, JDBC, ZD Rx]
So what?
● New features - faster
● Optimisations - more reliable
● SQL interop - cool
Apache Calcite - Contra
● Steep learning curve
● Big complex codebase
● Obsolete/unfinished code
Apache Calcite - Pro
○ It really works
○ Friendly community
○ Constantly improving and evolving
○ Good place to start
Takeaways
The Theory
● SQL Query is parsed into a Logical Plan, an abstract model of the query
● Logical Plan is optimised with special rules, based on relational algebra
● Logical Plan is then converted to a Physical Plan
● Physical Plan can be executed to return the data
The Theory
● Query optimization is fun
● Relational Algebra is not scary at all
● Solid basis for optimization and execution
● At least you can mention this in interviews
The Practice
● Apache Calcite knows how to parse SQL and optimize queries
● Has lots of optimization rules and state-of-the-art optimization algorithms
● Physical nodes have a Calling Convention, which says how to execute the plan
● You can mix Calling Conventions in one plan to query multiple sources
The Practice
● When you want to do analytics over federated data
○ Consider using an existing product
○ Consider using an existing engine, e.g. Calcite
● But be careful
○ This is not your regular ORM
○ Use it for SQL, not instead of SQL
Where to start?
● org.apache.calcite.schema.ScannableTable
● org.apache.calcite.rel.RelNode
● org.apache.calcite.plan.RelOptRule
Further reading
● Apache Calcite Overview
○ https://arxiv.org/pdf/1802.10233.pdf
● Apache Calcite Documentation
○ https://calcite.apache.org/docs/
● Readings in Database Systems, 5th Edition
○ http://www.redbook.io/
select *
from questions
limit 5