Although the beginnings of SQL date back to the 1970s, the language is more relevant than ever. You can keep your data in good old PostgreSQL, a fancy new NoSQL database, or even some in-house storage, but everyone will want to use SQL to query it.
At Zoomdata we build a modern BI platform that works natively with regular DBs and big data alike. We went through several implementations, employing different approaches and frameworks, and in the end concluded that the best way to execute analytical queries is to use an engine that natively understands SQL and relational algebra.
In this talk I will introduce Apache Calcite – an open source framework that can help you build your own database, execute queries over distributed data sources, and much more.
1. Big data analytics using a custom SQL engine
Andrii Tsvielodub, Zoomdata
2018
2. Agenda
1. What is Zoomdata and why we need SQL
2. What a typical Query Planner looks like and what it does
3. What is Apache Calcite and how it works
4. How we started using Apache Calcite in Zoomdata
3. Who am I
● Java Developer
● 8+ years of experience
● 3+ years at Zoomdata
● Work on the Query Engine team
18. Apache Calcite - Features
● Standard SQL
○ Industry-standard SQL parser, validator and JDBC driver.
● Query optimization
○ Represent your query in relational algebra, transform using planning rules, and optimize according to a cost model using smart algorithms.
● Any data, anywhere
○ Connect to third-party data sources, browse metadata, and optimize by pushing the computation to the data.
26. What the duck?
select
CEIL(s.sale_date TO DAY) sale_day,
sum(s.qty*p.price) sale_sum
from sales s join product p
on p.product_id = s.product_id
where
p.name = 'Giant Rubber Duck'
group by sale_day
28. select
CEIL(s.sale_date TO DAY) sale_day,
sum(s.qty*p.price) sale_sum
from sales s join product p
on p.product_id = s.product_id
where
p.name = 'Giant Rubber Duck'
group by sale_day
Scan (sales)
Scan (product)
29. select
CEIL(s.sale_date TO DAY) sale_day,
sum(s.qty*p.price) sale_sum
from sales s join product p
on p.product_id = s.product_id
where
p.name = 'Giant Rubber Duck'
group by sale_day
Dataflow:
Scan (sales)
Scan (product)
Join (product_id)
30. select
CEIL(s.sale_date TO DAY) sale_day,
sum(s.qty*p.price) sale_sum
from sales s join product p
on p.product_id = s.product_id
where
p.name = 'Giant Rubber Duck'
group by sale_day
Dataflow:
Scan (sales)
Scan (product)
Join (product_id)
Filter (name = '...')
31. select
CEIL(s.sale_date TO DAY) sale_day,
sum(s.qty*p.price) sale_sum
from sales s join product p
on p.product_id = s.product_id
where
p.name = 'Giant Rubber Duck'
group by sale_day
Dataflow:
Scan (sales)
Scan (product)
Join (product_id)
Filter (name = '...')
Project (sale_date, qty, price)
32. select
CEIL(s.sale_date TO DAY) sale_day,
sum(s.qty*p.price) sale_sum
from sales s join product p
on p.product_id = s.product_id
where
p.name = 'Giant Rubber Duck'
group by sale_day
Dataflow:
Scan (sales)
Scan (product)
Join (product_id)
Filter (name = '...')
Project (sale_date, qty, price)
Aggregate (group = sale_day, agg = sum(qty * price))
33. select
CEIL(s.sale_date TO DAY) sale_day,
sum(s.qty*p.price) sale_sum
from sales s join product p
on p.product_id = s.product_id
where
p.name = 'Giant Rubber Duck'
group by sale_day
Dataflow:
LogicalScan (sales)
LogicalScan (product)
LogicalJoin (product_id)
LogicalFilter (name = '...')
LogicalProject (sale_date, qty, price)
LogicalAggregate (group = sale_day, agg = sum(qty * price))
35. Query Planner Overview Agenda
● Parse SQL query
○ SQL query is parsed into a logical plan
○ Logical plan is an abstract model of a query
○ Calcite has classes to build a logical plan
○ You can create the plan even without a SQL query
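The last two bullets can be illustrated with a toy model. The sketch below is not Calcite's API: `PlanNode` and `duckPlan` are hypothetical names, introduced only to show that a logical plan is a plain tree of operators that can be assembled in code with no SQL text involved. Calcite's own builder for this purpose is `RelBuilder`, which is not used here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy logical plan node: an operator name, a description, and child inputs.
// Hypothetical class for illustration only; Calcite models this with RelNode.
final class PlanNode {
    final String op;
    final String detail;
    final List<PlanNode> inputs;

    PlanNode(String op, String detail, PlanNode... inputs) {
        this.op = op;
        this.detail = detail;
        this.inputs = new ArrayList<PlanNode>(Arrays.asList(inputs));
    }

    // Renders the tree root-first, indenting each level by two spaces.
    String explain() {
        StringBuilder sb = new StringBuilder();
        explain(sb, 0);
        return sb.toString();
    }

    private void explain(StringBuilder sb, int depth) {
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        sb.append(op).append(" (").append(detail).append(")\n");
        for (PlanNode input : inputs) {
            input.explain(sb, depth + 1);
        }
    }

    // Builds the plan for the rubber duck query without parsing any SQL.
    static PlanNode duckPlan() {
        PlanNode sales = new PlanNode("Scan", "sales");
        PlanNode product = new PlanNode("Scan", "product");
        PlanNode join = new PlanNode("Join", "product_id", sales, product);
        PlanNode filter = new PlanNode("Filter", "name = 'Giant Rubber Duck'", join);
        PlanNode project = new PlanNode("Project", "sale_date, qty, price", filter);
        return new PlanNode("Aggregate", "group = sale_day, agg = sum(qty * price)", project);
    }

    public static void main(String[] args) {
        System.out.print(duckPlan().explain());
    }
}
```

Running `main` prints the Aggregate node first and the two scans last, each level indented under its consumer, which is the same tree the slides build up step by step.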
36. Query Planner Overview Agenda
● Parse SQL query
● Create an optimal logical plan of a query
● Create an optimal physical plan and execute
38. select
CEIL(s.sale_date TO DAY) sale_day,
sum(s.qty*p.price) sale_sum
from sales s join product p
on p.product_id = s.product_id
where
p.name = 'Giant Rubber Duck'
group by sale_day
Scan (sales)
Scan (product)
Join (product_id)
Filter (name = '...')
Project (sale_date, qty, price)
Aggregate (group = sale_day, agg = sum(qty * price))
40. select
CEIL(s.sale_date TO DAY) sale_day,
sum(s.qty*p.price) sale_sum
from sales s join product p
on p.product_id = s.product_id
where
p.name = 'Giant Rubber Duck'
group by sale_day
Scan (sales)
Scan (product)
Join (product_id)
Filter (name = '...')
Project (sale_date, qty, price)
Aggregate (group = sale_day, agg = sum(qty * price))
Project (p_id, name, price)
Project (p_id, qty, sale_date)
41. Apache Calcite - Optimization Rules
● org.apache.calcite.plan.RelOptRule
● Matches a pattern in the plan tree
● Can check additional conditions and modify the plan
● Rules can be specified dynamically, based on query properties
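To make "match a pattern, check a condition, modify the plan" concrete, here is a minimal sketch of a filter push-down rule over a toy plan tree. It does not use Calcite's `RelOptRule`: `RuleNode` and `FilterPushDownSketch` are hypothetical names, and the table that the filter refers to is passed in by hand, whereas a real planner would derive it from the condition.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical toy plan node; Calcite's real counterpart is RelNode.
final class RuleNode {
    final String op;   // e.g. "Filter", "Join", "Scan"
    final String arg;  // table name, join key, or filter condition
    final List<RuleNode> inputs;

    RuleNode(String op, String arg, RuleNode... inputs) {
        this.op = op;
        this.arg = arg;
        this.inputs = Arrays.asList(inputs);
    }

    @Override
    public String toString() {
        return inputs.isEmpty() ? op + "(" + arg + ")" : op + "(" + arg + ")" + inputs;
    }
}

// Toy rule: moves a Filter below a Join when it only touches one join input.
final class FilterPushDownSketch {
    // Pattern: a Filter sitting directly on top of a two-input Join.
    static boolean matches(RuleNode node) {
        return node.op.equals("Filter")
            && node.inputs.size() == 1
            && node.inputs.get(0).op.equals("Join")
            && node.inputs.get(0).inputs.size() == 2;
    }

    // Additional condition plus rewrite. Returns the same instance when the
    // rule does not apply, so callers can detect "no change".
    static RuleNode apply(RuleNode filter, String filteredTable) {
        if (!matches(filter)) {
            return filter;
        }
        RuleNode join = filter.inputs.get(0);
        RuleNode left = join.inputs.get(0);
        RuleNode right = join.inputs.get(1);
        if (left.op.equals("Scan") && left.arg.equals(filteredTable)) {
            return new RuleNode("Join", join.arg, new RuleNode("Filter", filter.arg, left), right);
        }
        if (right.op.equals("Scan") && right.arg.equals(filteredTable)) {
            return new RuleNode("Join", join.arg, left, new RuleNode("Filter", filter.arg, right));
        }
        return filter; // condition not met: leave the plan unchanged
    }

    public static void main(String[] args) {
        RuleNode plan = new RuleNode("Filter", "name = 'Giant Rubber Duck'",
            new RuleNode("Join", "product_id",
                new RuleNode("Scan", "sales"), new RuleNode("Scan", "product")));
        System.out.println(apply(plan, "product"));
    }
}
```

The rule never mutates the input tree; it returns a new tree, so the planner can keep both alternatives around and compare them.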
43. Apache Calcite - Query Optimizer
Query Optimizer makes all the difference in a DB.
Two general categories:
● Naive heuristic optimizer
● Cost-based optimizer
44. Apache Calcite - Query Optimizer
● Rule-based/Naive heuristic
○ Fast
○ Good fit for rules that guarantee improvement
○ Limited in plan space exploration
○ Not all aspects of a query can be optimised, e.g. join reordering
45. Apache Calcite - Query Optimizer
● Cost-based
○ Iteratively explores all possible changes to the plan
○ Estimates the cost of plans
○ Chooses the cheapest one
○ Stops when the search criteria are met
❗ Can take a long time
❗ For some queries, it might be faster to execute the query than to find the optimal plan
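The explore, estimate, choose, stop loop can be sketched on join reordering, the case a purely heuristic optimizer handles poorly. Everything below is an illustrative assumption rather than Calcite's planner: the class name `JoinOrderSearch`, the cost formula (sum of intermediate result sizes), the fixed selectivity of one matching row pair in 100, and the plan budget used as the stop criterion.

```java
import java.util.ArrayList;
import java.util.List;

// Toy cost-based search over left-deep join orders (hypothetical model).
final class JoinOrderSearch {
    static final long PAIRS_PER_MATCH = 100; // assumption: 1 in 100 row pairs joins

    // Cost of a join order = sum of the sizes of all intermediate results.
    static long cost(List<Integer> order, long[] rowCounts) {
        long rows = rowCounts[order.get(0)];
        long total = 0;
        for (int i = 1; i < order.size(); i++) {
            rows = rows * rowCounts[order.get(i)] / PAIRS_PER_MATCH;
            total += rows;
        }
        return total;
    }

    // Returns the cheapest order found after costing at most maxPlans plans.
    static List<Integer> cheapest(long[] rowCounts, int maxPlans) {
        List<Integer> best = new ArrayList<Integer>();
        int[] budget = {maxPlans};
        explore(new ArrayList<Integer>(), rowCounts, budget, best);
        return best;
    }

    // Depth-first enumeration of table orders; stops once the budget is spent.
    private static void explore(List<Integer> prefix, long[] rowCounts,
                                int[] budget, List<Integer> best) {
        if (prefix.size() == rowCounts.length) {
            budget[0]--;
            if (best.isEmpty() || cost(prefix, rowCounts) < cost(best, rowCounts)) {
                best.clear();
                best.addAll(prefix);
            }
            return;
        }
        for (int t = 0; t < rowCounts.length && budget[0] > 0; t++) {
            if (!prefix.contains(t)) {
                prefix.add(t);
                explore(prefix, rowCounts, budget, best);
                prefix.remove(prefix.size() - 1);
            }
        }
    }

    public static void main(String[] args) {
        long[] rows = {100000L, 1000L, 10L}; // made-up table sizes
        List<Integer> best = cheapest(rows, 100);
        System.out.println(best + " cost=" + cost(best, rows));
    }
}
```

With these made-up table sizes, a budget of one plan settles for the order as written, while a larger budget finds an order that starts from the two small tables. That is the trade-off from the slide: more search time buys a cheaper plan, up to the point where searching costs more than executing.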
46. Query Planner Overview Agenda
● Parse SQL query
● Create an optimal logical plan of a query
○ Relational Algebra is not that scary
○ Rules can make your plan better (but not always)
○ Calcite has lots of rules and you can write your own
○ There are multiple planning algorithms - simple and cost-based
○ Choose wisely
47. Query Planner Overview Agenda
● Parse SQL query
● Create an optimal logical plan of a query
● Create an optimal physical plan and execute
48. select
CEIL(s.sale_date TO DAY) sale_day,
sum(s.qty*p.price) sale_sum
from sales s join product p
on p.product_id = s.product_id
where
p.name = 'Giant Rubber Duck'
group by sale_day
Scan (sales)
Scan (product)
Join (product_id)
Filter (name = '...')
Project (sale_date, qty, price)
Aggregate (group = sale_day, agg = sum(qty * price))
Project (p_id, name, price)
Project (p_id, qty, sale_date)
49. select
CEIL(s.sale_date TO DAY) sale_day,
sum(s.qty*p.price) sale_sum
from sales s join product p
on p.product_id = s.product_id
where
p.name = 'Giant Rubber Duck'
group by sale_day
Scan (sales)
Scan (product)
SortMergeJoin (product_id)
Filter (name = '...')
Project (sale_date, qty, price)
HashAggregate (group = sale_day, agg = sum(qty * price))
Project (p_id, name, price)
Project (p_id, qty, sale_date)
52. select
CEIL(s.sale_date TO DAY) sale_day,
sum(s.qty*p.price) sale_sum
from sales s join product p
on p.product_id = s.product_id
where
p.name = 'Giant Rubber Duck'
group by sale_day
Scan (sales)
Scan (product)
Join (product_id)
Filter (name = '...')
Project (sale_date, qty, price)
Aggregate (group = sale_day, agg = sum(qty * price))
Project (p_id, name, price)
Project (p_id, qty, sale_date)
53. select
CEIL(s.sale_date TO DAY) sale_day,
sum(s.qty*p.price) sale_sum
from sales s join product p
on p.product_id = s.product_id
where
p.name = 'Giant Rubber Duck'
group by sale_day
MongoTableScan (sales)
JdbcTableScan (product)
EnumerableJoin (product_id)
JdbcFilter (name = '...')
EnumerableCalc (sale_date, qty, price)
EnumerableAggregate (group = sale_day, agg = sum(qty * price))
JdbcProject (p_id, name, price)
MongoProject (p_id, qty, sale_date)
MongoToEnumerableConverter
JdbcToEnumerableConverter
55. Apache Calcite - Plan Execution
● Standard JDBC ResultSet
● Specialized Iterable that returns the data
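As a rough picture of what "an Iterable that returns the data" means, the sketch below chains two toy operators that each expose their output as an `Iterable` of rows, so rows are pulled through the plan one at a time. The names `IterablePlan`, `scan` and `filter` are hypothetical and the sample rows are made up; Calcite's Enumerable convention has its own interfaces, which are not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Toy physical operators that return their result as an Iterable of rows.
final class IterablePlan {
    // Leaf operator: iterates over rows held in memory.
    static Iterable<Object[]> scan(List<Object[]> table) {
        return table;
    }

    // Filter operator: lazily pulls rows from its input and keeps the matches.
    // Rows are assumed to be non-null, since null marks "nothing buffered".
    static Iterable<Object[]> filter(Iterable<Object[]> input, Predicate<Object[]> condition) {
        return () -> new Iterator<Object[]>() {
            private final Iterator<Object[]> source = input.iterator();
            private Object[] next;

            @Override
            public boolean hasNext() {
                while (next == null && source.hasNext()) {
                    Object[] row = source.next();
                    if (condition.test(row)) {
                        next = row;
                    }
                }
                return next != null;
            }

            @Override
            public Object[] next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                Object[] row = next;
                next = null;
                return row;
            }
        };
    }

    public static void main(String[] args) {
        // Made-up rows: product_id, name, price.
        List<Object[]> product = new ArrayList<Object[]>();
        product.add(new Object[] {1, "Giant Rubber Duck", 25});
        product.add(new Object[] {2, "Tiny Rubber Duck", 3});
        for (Object[] row : filter(scan(product), r -> "Giant Rubber Duck".equals(r[1]))) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Nothing is computed until the caller starts iterating, which is what lets a consumer such as a JDBC ResultSet fetch rows on demand.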
56. Query Planner Overview Agenda
● Parse SQL query
● Create an optimal logical plan of a query
● Create an optimal physical plan and execute
○ It’s rules again
○ Calling Conventions in Calcite are cool
○ You can convert between conventions
○ A Calling Convention knows how to execute the plan
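The idea of converting between conventions can be sketched as follows. This is a simplified toy, not Calcite's `Convention` or converter-rule machinery: `PhysNode` and `withConverters` are hypothetical names, and the converters are added in a single pass over a finished plan, whereas in Calcite they come out of the planning process itself.

```java
import java.util.ArrayList;
import java.util.List;

// Toy physical node tagged with a calling convention (where it executes).
final class PhysNode {
    final String convention; // e.g. "Jdbc", "Mongo", "Enumerable"
    final String name;       // display name, e.g. "JdbcTableScan(product)"
    final List<PhysNode> inputs = new ArrayList<PhysNode>();

    PhysNode(String convention, String name, PhysNode... inputs) {
        this.convention = convention;
        this.name = name;
        for (PhysNode input : inputs) {
            this.inputs.add(input);
        }
    }

    // Returns a copy of the plan in which every edge that crosses from one
    // convention to another goes through an explicit converter node.
    static PhysNode withConverters(PhysNode node) {
        PhysNode copy = new PhysNode(node.convention, node.name);
        for (PhysNode input : node.inputs) {
            PhysNode child = withConverters(input);
            if (!child.convention.equals(node.convention)) {
                String converter = child.convention + "To" + node.convention + "Converter";
                child = new PhysNode(node.convention, converter, child);
            }
            copy.inputs.add(child);
        }
        return copy;
    }

    // Renders the tree root-first, two spaces of indentation per level.
    String explain() {
        StringBuilder sb = new StringBuilder();
        explain(sb, "");
        return sb.toString();
    }

    private void explain(StringBuilder sb, String indent) {
        sb.append(indent).append(name).append('\n');
        for (PhysNode input : inputs) {
            input.explain(sb, indent + "  ");
        }
    }

    public static void main(String[] args) {
        PhysNode plan = new PhysNode("Enumerable", "EnumerableJoin(product_id)",
            new PhysNode("Mongo", "MongoTableScan(sales)"),
            new PhysNode("Jdbc", "JdbcTableScan(product)"));
        System.out.print(withConverters(plan).explain());
    }
}
```

Each subtree below a converter stays in a single convention, so it can be handed to that source as a whole; only the converted rows cross into the convention above it.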
58. Apache Calcite - Stuff To Remember
● Calcite knows how to parse SQL and optimize queries
● Physical nodes have a Calling Convention, which says how to execute the plan
● You can mix Calling Conventions in one plan using Converter Nodes
67. The Theory
● SQL Query is parsed into a Logical Plan - an abstract model of the query
● Logical Plan is optimised with special rules, based on relational algebra
● Logical Plan is then converted to a Physical Plan
● Physical Plan can be executed to return the data
68. The Theory
● Query optimization is fun
● Relational Algebra is not scary at all
● Solid basis for optimization and execution
● At least you can mention this in interviews
69. The Practice
● Apache Calcite knows how to parse SQL and optimize queries
● Has lots of optimization rules and state-of-the-art optimization algorithms
● Physical nodes have a Calling Convention, which says how to execute the plan
● You can mix Calling Conventions in one plan to query multiple sources
70. The Practice
● When you want to do analytics over federated data
○ Consider using an existing product
○ Consider using an existing engine, e.g. Calcite
● But be careful
○ This is not your regular ORM
○ Use it FOR SQL, not instead of SQL
71. Where to start?
● org.apache.calcite.schema.ScannableTable
● org.apache.calcite.rel.RelNode
● org.apache.calcite.plan.RelOptRule