Big data analytics using a custom SQL engine
1
Andrii Tsvielodub, Zoomdata
2018
Agenda
1. What is Zoomdata and why we need SQL
2. What a typical Query Planner looks like and what it does
3. What is Apache Calcite and how it works
4. How we started using Apache Calcite in Zoomdata
2
Who am I
● Java Developer
● 8+ years of experience
● 3+ years at Zoomdata
● Work on the Query Engine team
3
What is Zoomdata
4
Cross-Database Queries
Zoomdata Architecture
[Diagram: DB, DB, Zoomdata Query Engine]
8
Zoomdata Architecture
Source
SQL
Relational Algebra | Query Plan Optimization
Query Pushdown | Federated Queries
15
Zoomdata Architecture
Apache Calcite
A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources
17
Apache Calcite - Features
● Standard SQL
○ Industry-standard SQL parser, validator and JDBC driver.
● Query optimization
○ Represent your query in relational algebra, transform using planning rules, and optimize according to a cost model using smart algorithms.
● Any data, anywhere
○ Connect to third-party data sources, browse metadata, and optimize by pushing the computation to the data.
18
Apache Calcite - Used By
19
Query Planner
20
● Client Communication Manager
○ Local Client Protocols
○ Remote Client Protocols
● Process Manager
○ Admission Control
○ Dispatch & Scheduling
● Relational Query Processor
○ Query Parsing & Authorization
○ Query Rewrite
○ Query Optimizer
○ Plan Executor
○ DDL & Utility Processing
● Transactional Store Manager
○ Access Methods
○ Buffer Methods
○ Lock Manager
○ Log Manager
● Shared Components and Utilities
○ Catalog Manager
○ Memory Manager
○ Admin., Monitoring & Utilities
○ Replication & Loading Services
○ Batch Utilities
Joseph M. Hellerstein, Michael Stonebraker, James Hamilton. Architecture of a Database System. Foundations and Trends in Databases, 1, 2 (2007).
21
Query Planner Overview Agenda
● Parse SQL query
● Create an optimal logical plan of a query
● Create an optimal physical plan and execute
23
Sales (tx_id, product_id, sale_date, qty, seller_id, buyer_id)
Products (product_id, name, description, price, manufacturer_id)
Relationship: many Sales rows (*) to one Products row (1)
25
What the duck?
select
  CEIL(s.sale_date TO DAY) sale_day,
  sum(s.qty * p.price) sale_sum
from sales s
join product p on p.product_id = s.product_id
where p.name = 'Giant Rubber Duck'
group by sale_day
26
Aggregate (group = sale_day, agg = sum(qty * price))
  Project (sale_date, qty, price)
    Filter (name = '...')
      Join (product_id)
        Scan (sales)
        Scan (product)
Dataflow: bottom-up, from the Scans to the Aggregate
32
LogicalAggregate (group = sale_day, agg = sum(qty * price))
  LogicalProject (sale_date, qty, price)
    LogicalFilter (name = '...')
      LogicalJoin (product_id)
        LogicalScan (sales)
        LogicalScan (product)
Dataflow: bottom-up, from the LogicalScans to the LogicalAggregate
33
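import org.apache.calcite.avatica.util.TimeUnitRange;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.core.JoinRelType;
import org.apache.calcite.sql.fun.SqlStdOperatorTable;
import org.apache.calcite.tools.RelBuilder;

RelBuilder b = ...;
RelNode query = b
    .scan("sales").as("s")
    .scan("product").as("p")
    .join(JoinRelType.INNER, "product_id")
    .filter(
        b.call(SqlStdOperatorTable.EQUALS,
            b.field("p", "name"),  // field(alias, fieldName)
            b.literal("Giant Rubber Duck")))
    .project(
        // alias() names each projected expression
        b.alias(
            b.call(SqlStdOperatorTable.CEIL,
                b.field("s", "sale_date"),
                // the time unit is passed as a flag, not as a string literal
                b.getRexBuilder().makeFlag(TimeUnitRange.DAY)),
            "sale_day"),
        b.alias(
            b.call(SqlStdOperatorTable.MULTIPLY,
                b.field("s", "qty"), b.field("p", "price")),
            "sale"))
    .aggregate(
        b.groupKey("sale_day"),
        b.sum(false, "sale_sum", b.field("sale")))
    .build();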
34
Query Planner Overview Agenda
● Parse SQL query
○ SQL query is parsed into a logical plan
○ Logical plan is an abstract model of a query
○ Calcite has classes to build a logical plan
○ You can create the plan even without a SQL query
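To make the points above concrete, a logical plan can be modelled as a tree of operators built directly in code, with no SQL text involved. The sketch below is plain Java, not Calcite: the `Rel` interface, the record names and the `explain` helper are hypothetical stand-ins for the nodes drawn on the slides.

```java
import java.util.List;

/** Toy operator tree; the names mirror the slides, not Calcite's RelNode classes. */
final class LogicalPlanSketch {
    interface Rel {
        String label();
        List<Rel> inputs();
    }

    record Scan(String table) implements Rel {
        public String label() { return "Scan(" + table + ")"; }
        public List<Rel> inputs() { return List.of(); }
    }

    record Join(Rel left, Rel right, String key) implements Rel {
        public String label() { return "Join(" + key + ")"; }
        public List<Rel> inputs() { return List.of(left, right); }
    }

    record Filter(Rel input, String condition) implements Rel {
        public String label() { return "Filter(" + condition + ")"; }
        public List<Rel> inputs() { return List.of(input); }
    }

    record Project(Rel input, List<String> columns) implements Rel {
        public String label() { return "Project(" + String.join(", ", columns) + ")"; }
        public List<Rel> inputs() { return List.of(input); }
    }

    record Aggregate(Rel input, String groupKey, String agg) implements Rel {
        public String label() { return "Aggregate(group = " + groupKey + ", agg = " + agg + ")"; }
        public List<Rel> inputs() { return List.of(input); }
    }

    /** Builds the plan for the rubber-duck query by hand, without parsing any SQL. */
    static Rel duckQuery() {
        Rel join = new Join(new Scan("sales"), new Scan("product"), "product_id");
        Rel filter = new Filter(join, "name = 'Giant Rubber Duck'");
        Rel project = new Project(filter, List.of("sale_date", "qty", "price"));
        return new Aggregate(project, "sale_day", "sum(qty * price)");
    }

    /** Renders the tree top-down, indenting each input under its consumer. */
    static String explain(Rel rel) {
        StringBuilder out = new StringBuilder();
        explain(rel, 0, out);
        return out.toString();
    }

    private static void explain(Rel rel, int depth, StringBuilder out) {
        out.append("  ".repeat(depth)).append(rel.label()).append('\n');
        for (Rel input : rel.inputs()) {
            explain(input, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        System.out.print(explain(duckQuery()));
    }
}
```

Running `main` prints the Aggregate at the top and the two Scans as leaves; read bottom-up, that is the dataflow shown on the slides.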
35
Query Planner Overview Agenda
● Parse SQL query
● Create an optimal logical plan of a query
● Create an optimal physical plan and execute
36
37
[Plan diagram; nodes:]
Scan (sales)
Scan (product)
Join (product_id)
Filter (name = '...')
Project (sale_date, qty, price)
Aggregate (group = sale_day, agg = sum(qty * price))
Project (p_id, name, price)
Project (p_id, qty, sale_date)
40
Apache Calcite - Optimization Rules
● org.apache.calcite.plan.RelOptRule
● Matches a pattern in the plan tree
● Can check additional conditions, then modifies the plan
● Rules can be specified dynamically, based on query properties
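The "match a pattern, check a condition, rewrite the plan" idea can be sketched without Calcite. The example below is a toy in plain Java: `Rule`, `reads` and `PUSH_FILTER_BELOW_JOIN` are hypothetical names, and the `table` field on `Filter` is a simplification that assumes the predicate reads columns of exactly one table. It is not the `RelOptRule` API, only the shape of what such a rule does for an inner join.

```java
import java.util.Optional;

/** Toy filter-pushdown rule; an illustration only, not Calcite's RelOptRule API. */
final class FilterPushdownSketch {
    interface Rel {}
    record Scan(String table) implements Rel {}
    record Join(Rel left, Rel right, String key) implements Rel {}
    /** `table` names the single table whose columns the predicate reads (a simplification). */
    record Filter(Rel input, String table, String predicate) implements Rel {}

    /** A rule inspects one node and returns a rewritten node, or empty if it does not match. */
    interface Rule {
        Optional<Rel> apply(Rel rel);
    }

    /** True if the subtree reads from the given table. */
    static boolean reads(Rel rel, String table) {
        if (rel instanceof Scan s) return s.table().equals(table);
        if (rel instanceof Filter f) return reads(f.input(), table);
        if (rel instanceof Join j) return reads(j.left(), table) || reads(j.right(), table);
        return false;
    }

    /** Pattern: Filter on top of an inner Join. Rewrite: move the Filter onto the side it reads. */
    static final Rule PUSH_FILTER_BELOW_JOIN = rel -> {
        if (rel instanceof Filter f && f.input() instanceof Join j) {
            if (reads(j.left(), f.table())) {
                return Optional.of(
                    new Join(new Filter(j.left(), f.table(), f.predicate()), j.right(), j.key()));
            }
            if (reads(j.right(), f.table())) {
                return Optional.of(
                    new Join(j.left(), new Filter(j.right(), f.table(), f.predicate()), j.key()));
            }
        }
        return Optional.empty(); // pattern or condition not satisfied: leave the plan alone
    };

    public static void main(String[] args) {
        Rel plan = new Filter(
            new Join(new Scan("sales"), new Scan("product"), "product_id"),
            "product", "name = 'Giant Rubber Duck'");
        System.out.println(PUSH_FILTER_BELOW_JOIN.apply(plan).orElse(plan));
    }
}
```

After the rewrite the filter sits directly above the `product` scan, so the join sees only the matching product rows.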
41
Apache Calcite - Optimization Rules
42
Apache Calcite - Query Optimizer
Query Optimizer makes all the difference in a DB.
Two general categories:
● Naive heuristic optimizer
● Cost-based optimizer
43
Apache Calcite - Query Optimizer
● Rule-based/Naive heuristic
○ Fast
○ Good fit for rules that guarantee improvement
○ Limited in plan space exploration
○ Not all aspects of a query can be optimised, e.g. join reordering
44
Apache Calcite - Query Optimizer
● Cost-based
○ Iteratively explores all possible changes to the plan
○ Estimates the cost of plans
○ Chooses the cheapest one
○ Stops when the search criteria are met
❗ Can take a long time
❗ For some queries, executing the query might be faster than finding the optimal plan
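A minimal sketch of the cost-based idea, in plain Java rather than Calcite: estimate the rows each operator produces, sum them into a cost, and pick the cheapest of several equivalent plans. The row counts (1,000,000 sales, 1,000 products), the selectivity figures and the cost formula are made-up assumptions for illustration, and every name here is hypothetical.

```java
import java.util.Comparator;
import java.util.List;

/** Toy cost model: cost = rows produced by every operator in the plan. */
final class CostBasedSketch {
    interface Rel {}
    record Scan(String table, long rowCount) implements Rel {}
    /** Keeps roughly one row out of every `keepOneIn` input rows. */
    record Filter(Rel input, long keepOneIn) implements Rel {}
    /** Equi-join on a key that has `distinctKeys` distinct values. */
    record Join(Rel left, Rel right, long distinctKeys) implements Rel {}

    /** Estimated number of rows an operator outputs. */
    static long rows(Rel rel) {
        if (rel instanceof Scan s) return s.rowCount();
        if (rel instanceof Filter f) return Math.max(1, rows(f.input()) / f.keepOneIn());
        if (rel instanceof Join j) return rows(j.left()) * rows(j.right()) / j.distinctKeys();
        throw new IllegalArgumentException("unknown node: " + rel);
    }

    /** Cost of a plan: its own output rows plus the cost of its inputs. */
    static long cost(Rel rel) {
        long inputs = 0;
        if (rel instanceof Filter f) inputs = cost(f.input());
        if (rel instanceof Join j) inputs = cost(j.left()) + cost(j.right());
        return rows(rel) + inputs;
    }

    /** Picks the cheapest plan among equivalent alternatives. */
    static Rel cheapest(List<Rel> alternatives) {
        return alternatives.stream()
            .min(Comparator.comparingLong(CostBasedSketch::cost))
            .orElseThrow();
    }

    public static void main(String[] args) {
        Rel sales = new Scan("sales", 1_000_000);
        Rel product = new Scan("product", 1_000);
        Rel filterLate = new Filter(new Join(sales, product, 1_000), 1_000);
        Rel filterEarly = new Join(sales, new Filter(product, 1_000), 1_000);
        System.out.println("filter after join:  " + cost(filterLate));
        System.out.println("filter before join: " + cost(filterEarly));
        System.out.println("chosen: " + cheapest(List.of(filterLate, filterEarly)));
    }
}
```

A real planner generates the alternatives itself by applying rules; here the two candidates are written out by hand to keep the sketch short.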
45
Query Planner Overview Agenda
● Parse SQL query
● Create an optimal logical plan of a query
○ Relational Algebra is not that scary
○ Rules can make your plan better (but not always)
○ Calcite has lots of rules and you can write your own
○ There are multiple planning algorithms - simple and cost-based
○ Choose wisely
46
Query Planner Overview Agenda
● Parse SQL query
● Create an optimal logical plan of a query
● Create an optimal physical plan and execute
47
[Physical plan diagram; nodes:]
Scan (sales)
Scan (product)
SortMergeJoin (product_id)
Filter (name = '...')
Project (sale_date, qty, price)
HashAggregate (group = sale_day, agg = sum(qty * price))
Project (p_id, name, price)
Project (p_id, qty, sale_date)
49
Execution in DB
And then the physical plan is executed
50
[Federated physical plan diagram; nodes:]
MongoTableScan (sales)
JdbcTableScan (product)
EnumerableJoin (product_id)
JdbcFilter (name = '...')
EnumerableCalc (sale_date, qty, price)
EnumerableAggregate (group = sale_day, agg = sum(qty * price))
JdbcProject (p_id, name, price)
MongoProject (p_id, qty, sale_date)
MongoToEnumerableConverter
JdbcToEnumerableConverter
53
Apache Calcite - Plan Execution
● Standard JDBC ResultSet
● Specialized Iterable that returns the data
55
Query Planner Overview Agenda
● Parse SQL query
● Create an optimal logical plan of a query
● Create an optimal physical plan and execute
○ It’s rules again
○ Calling Conventions in Calcite are cool
○ You can convert between conventions
○ A Calling Convention knows how to execute the plan
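One way to picture mixed conventions is a plan whose nodes each carry a convention tag, with a converter node placed on every edge where parent and child disagree. The sketch below is plain Java with hypothetical names (`Node`, `Convention`, `addConverters`); Calcite inserts converters through planner rules rather than a post-processing pass like this one, and the order of the filter and project on the JDBC side is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of calling conventions; not Calcite's Convention or Converter classes. */
final class ConventionSketch {
    enum Convention { MONGO, JDBC, ENUMERABLE }

    record Node(String name, Convention convention, List<Node> inputs) {}

    static Node node(String name, Convention convention, Node... inputs) {
        return new Node(name, convention, List.of(inputs));
    }

    /** Rebuilds the tree, wrapping any input whose convention differs from its parent's. */
    static Node addConverters(Node parent) {
        List<Node> inputs = new ArrayList<>();
        for (Node input : parent.inputs()) {
            Node child = addConverters(input);
            if (child.convention() != parent.convention()) {
                String name = child.convention() + "->" + parent.convention() + " converter";
                child = new Node(name, parent.convention(), List.of(child));
            }
            inputs.add(child);
        }
        return new Node(parent.name(), parent.convention(), List.copyOf(inputs));
    }

    /** The rubber-duck plan with sales in MongoDB and product behind JDBC, no converters yet. */
    static Node federatedDuckPlan() {
        Node sales = node("Project(p_id, qty, sale_date)", Convention.MONGO,
            node("TableScan(sales)", Convention.MONGO));
        Node product = node("Filter(name = '...')", Convention.JDBC,
            node("Project(p_id, name, price)", Convention.JDBC,
                node("TableScan(product)", Convention.JDBC)));
        Node join = node("Join(product_id)", Convention.ENUMERABLE, sales, product);
        Node calc = node("Calc(sale_date, qty, price)", Convention.ENUMERABLE, join);
        return node("Aggregate(sale_day, sum)", Convention.ENUMERABLE, calc);
    }

    static void print(Node n, int depth) {
        System.out.println("  ".repeat(depth) + n.convention() + " " + n.name());
        for (Node input : n.inputs()) print(input, depth + 1);
    }

    public static void main(String[] args) {
        print(addConverters(federatedDuckPlan()), 0);
    }
}
```

Everything below a converter can be handed to the source it belongs to, while the nodes above it run in the engine itself.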
56
Stop. Demo time!
57
Apache Calcite - Stuff To Remember
● Calcite knows how to parse SQL and optimize queries
● Physical nodes have a Calling Convention, which says how to execute the plan
● You can mix Calling Conventions in one plan using Converter Nodes
58
Back to Zoomdata
59
Zoomdata Architecture
[Diagram: DB, DB, Zoomdata Query Engine]
60
Query Engine Architecture
[Diagram: Query Engine, Spark adapter, JDBC, ZD Rx]
61
Connector adapter
62
So what?
● New features - faster
● Optimisations - more reliable
● SQL interop - cool
63
Apache Calcite - Contra
● Steep learning curve
● Big complex codebase
● Obsolete/unfinished code
64
Apache Calcite - Pro
○ It really works
○ Friendly community
○ Constantly improving and evolving
○ Good place to start
65
Takeaways
66
The Theory
● SQL Query is parsed into a Logical Plan, an abstract model of the query
● Logical Plan is optimised with special rules, based on relational algebra
● Logical Plan is then converted to a Physical Plan
● Physical Plan can be executed to return the data
67
The Theory
● Query optimization is fun
● Relational Algebra is not scary at all
● Solid basis for optimization and execution
● At least you can mention this in interviews
68
The Practice
● Apache Calcite knows how to parse SQL and optimize queries
● Has lots of optimization rules and state-of-the-art optimization algorithms
● Physical nodes have a Calling Convention, which says how to execute the plan
● You can mix Calling Conventions in one plan to query multiple sources
69
The Practice
● When you want to do analytics over federated data
○ Consider using an existing product
○ Consider using an existing engine, e.g. Calcite
● But be careful
○ This is not your regular ORM
○ Use it for SQL, not instead of SQL
70
Where to start?
● org.apache.calcite.schema.ScannableTable
● org.apache.calcite.rel.RelNode
● org.apache.calcite.plan.RelOptRule
71
Further reading
● Apache Calcite Overview
○ https://arxiv.org/pdf/1802.10233.pdf
● Apache Calcite Documentation
○ https://calcite.apache.org/docs/
● Readings in Database Systems, 5th Edition
○ http://www.redbook.io/
72
select *
from questions
limit 5
73

More Related Content

Similar to Big data analytics using a custom SQL engine

Hug meetup impala 2.5 performance overview
Hug meetup impala 2.5 performance overviewHug meetup impala 2.5 performance overview
Hug meetup impala 2.5 performance overviewMostafa Mokhtar
 
Supercharge your data analytics with BigQuery
Supercharge your data analytics with BigQuerySupercharge your data analytics with BigQuery
Supercharge your data analytics with BigQueryMárton Kodok
 
Altitude San Francisco 2018: Logging at the Edge
Altitude San Francisco 2018: Logging at the Edge Altitude San Francisco 2018: Logging at the Edge
Altitude San Francisco 2018: Logging at the Edge Fastly
 
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product UpdatesGraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product UpdatesNeo4j
 
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world""Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world"Pavel Hardak
 
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS WorldLessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS WorldDatabricks
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Stamatis Zampetakis
 
Building Data Products with BigQuery for PPC and SEO (SMX 2022)
Building Data Products with BigQuery for PPC and SEO (SMX 2022)Building Data Products with BigQuery for PPC and SEO (SMX 2022)
Building Data Products with BigQuery for PPC and SEO (SMX 2022)Christopher Gutknecht
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame APIdatamantra
 
OLAP Basics and Fundamentals by Bharat Kalia
OLAP Basics and Fundamentals by Bharat Kalia OLAP Basics and Fundamentals by Bharat Kalia
OLAP Basics and Fundamentals by Bharat Kalia Bharat Kalia
 
NoSQL no more: SQL on Druid with Apache Calcite
NoSQL no more: SQL on Druid with Apache CalciteNoSQL no more: SQL on Druid with Apache Calcite
NoSQL no more: SQL on Druid with Apache Calcitegianmerlino
 
How to build an ETL pipeline with Apache Beam on Google Cloud Dataflow
How to build an ETL pipeline with Apache Beam on Google Cloud DataflowHow to build an ETL pipeline with Apache Beam on Google Cloud Dataflow
How to build an ETL pipeline with Apache Beam on Google Cloud DataflowLucas Arruda
 
TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...
TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...
TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...tdc-globalcode
 
Why Big Query is so Powerful - Trusted Conf
Why Big Query is so Powerful - Trusted ConfWhy Big Query is so Powerful - Trusted Conf
Why Big Query is so Powerful - Trusted ConfIn Marketing We Trust
 
Data Warehousing
Data WarehousingData Warehousing
Data WarehousingHeena Madan
 
SQL Query Optimization: Why Is It So Hard to Get Right?
SQL Query Optimization: Why Is It So Hard to Get Right?SQL Query Optimization: Why Is It So Hard to Get Right?
SQL Query Optimization: Why Is It So Hard to Get Right?Brent Ozar
 
Advanced query parsing techniques
Advanced query parsing techniquesAdvanced query parsing techniques
Advanced query parsing techniqueslucenerevolution
 
A Big (Query) Frog in a Small Pond, Jakub Motyl, BuffPanel
A Big (Query) Frog in a Small Pond, Jakub Motyl, BuffPanelA Big (Query) Frog in a Small Pond, Jakub Motyl, BuffPanel
A Big (Query) Frog in a Small Pond, Jakub Motyl, BuffPanelData Science Club
 

Similar to Big data analytics using a custom SQL engine (20)

Hug meetup impala 2.5 performance overview
Hug meetup impala 2.5 performance overviewHug meetup impala 2.5 performance overview
Hug meetup impala 2.5 performance overview
 
Supercharge your data analytics with BigQuery
Supercharge your data analytics with BigQuerySupercharge your data analytics with BigQuery
Supercharge your data analytics with BigQuery
 
Altitude San Francisco 2018: Logging at the Edge
Altitude San Francisco 2018: Logging at the Edge Altitude San Francisco 2018: Logging at the Edge
Altitude San Francisco 2018: Logging at the Edge
 
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product UpdatesGraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
GraphSummit Stockholm - Neo4j - Knowledge Graphs and Product Updates
 
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world""Lessons learned using Apache Spark for self-service data prep in SaaS world"
"Lessons learned using Apache Spark for self-service data prep in SaaS world"
 
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS WorldLessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
Lessons Learned Using Apache Spark for Self-Service Data Prep in SaaS World
 
Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21
 
Building Data Products with BigQuery for PPC and SEO (SMX 2022)
Building Data Products with BigQuery for PPC and SEO (SMX 2022)Building Data Products with BigQuery for PPC and SEO (SMX 2022)
Building Data Products with BigQuery for PPC and SEO (SMX 2022)
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
 
OLAP Basics and Fundamentals by Bharat Kalia
OLAP Basics and Fundamentals by Bharat Kalia OLAP Basics and Fundamentals by Bharat Kalia
OLAP Basics and Fundamentals by Bharat Kalia
 
NoSQL no more: SQL on Druid with Apache Calcite
NoSQL no more: SQL on Druid with Apache CalciteNoSQL no more: SQL on Druid with Apache Calcite
NoSQL no more: SQL on Druid with Apache Calcite
 
How to build an ETL pipeline with Apache Beam on Google Cloud Dataflow
How to build an ETL pipeline with Apache Beam on Google Cloud DataflowHow to build an ETL pipeline with Apache Beam on Google Cloud Dataflow
How to build an ETL pipeline with Apache Beam on Google Cloud Dataflow
 
TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...
TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...
TDC2017 | São Paulo - Trilha BigData How we figured out we had a SRE team at ...
 
Why Big Query is so Powerful - Trusted Conf
Why Big Query is so Powerful - Trusted ConfWhy Big Query is so Powerful - Trusted Conf
Why Big Query is so Powerful - Trusted Conf
 
Data Warehousing
Data WarehousingData Warehousing
Data Warehousing
 
Datawarehosuing
DatawarehosuingDatawarehosuing
Datawarehosuing
 
SQL Query Optimization: Why Is It So Hard to Get Right?
SQL Query Optimization: Why Is It So Hard to Get Right?SQL Query Optimization: Why Is It So Hard to Get Right?
SQL Query Optimization: Why Is It So Hard to Get Right?
 
Advanced Relevancy Ranking
Advanced Relevancy RankingAdvanced Relevancy Ranking
Advanced Relevancy Ranking
 
Advanced query parsing techniques
Advanced query parsing techniquesAdvanced query parsing techniques
Advanced query parsing techniques
 
A Big (Query) Frog in a Small Pond, Jakub Motyl, BuffPanel
A Big (Query) Frog in a Small Pond, Jakub Motyl, BuffPanelA Big (Query) Frog in a Small Pond, Jakub Motyl, BuffPanel
A Big (Query) Frog in a Small Pond, Jakub Motyl, BuffPanel
 

Recently uploaded

Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...rajkumar669520
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamtakuyayamamoto1800
 
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...Alluxio, Inc.
 
Studiovity film pre-production and screenwriting software
Studiovity film pre-production and screenwriting softwareStudiovity film pre-production and screenwriting software
Studiovity film pre-production and screenwriting softwareinfo611746
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024Ortus Solutions, Corp
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTier1 app
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownloadvrstrong314
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...informapgpstrackings
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns
 
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAlluxio, Inc.
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobus
 
Cyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyaniclab : Software Development Agency Portfolio.pdf
Cyaniclab : Software Development Agency Portfolio.pdfCyanic lab
 
Breaking the Code : A Guide to WhatsApp Business API.pdf
Breaking the Code : A Guide to WhatsApp Business API.pdfBreaking the Code : A Guide to WhatsApp Business API.pdf
Breaking the Code : A Guide to WhatsApp Business API.pdfMeon Technology
 
GraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysisGraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysisNeo4j
 
How to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesHow to Position Your Globus Data Portal for Success Ten Good Practices
How to Position Your Globus Data Portal for Success Ten Good PracticesGlobus
 
Using IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandUsing IESVE for Room Loads Analysis - Australia & New Zealand
Using IESVE for Room Loads Analysis - Australia & New ZealandIES VE
 
Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...
Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...
Abortion ^Clinic ^%[+971588192166''] Abortion Pill Al Ain (?@?) Abortion Pill...Abortion Clinic
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Globus
 
Understanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageUnderstanding Globus Data Transfers with NetSage
Understanding Globus Data Transfers with NetSageGlobus
 

Recently uploaded (20)

Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
Facemoji Keyboard released its 2023 State of Emoji report, outlining the most...
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
 
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
 
Studiovity film pre-production and screenwriting software
Studiovity film pre-production and screenwriting softwareStudiovity film pre-production and screenwriting software
Studiovity film pre-production and screenwriting software
 
BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024BoxLang: Review our Visionary Licenses of 2024
BoxLang: Review our Visionary Licenses of 2024
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
 
top nidhi software solution freedownload
top nidhi software solution freedownloadtop nidhi software solution freedownload
top nidhi software solution freedownload
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
 
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
 

Big data analytics using a custom SQL engine

  • 1. Big data analytics using a custom SQL engine. Andrii Tsvielodub, Zoomdata, 2018
  • 2. Agenda 1. What Zoomdata is and why we need SQL 2. What a typical query planner looks like and what it does 3. What Apache Calcite is and how it works 4. How we started using Apache Calcite in Zoomdata
  • 3. Who am I ● Java developer ● 8+ years of experience ● 3+ years at Zoomdata ● Work on the Query Engine team
  • 15. Zoomdata Architecture: SQL | Relational Algebra | Query Plan Optimization | Query Pushdown | Federated Queries
  • 17. Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources
  • 18. Apache Calcite - Features ● Standard SQL ○ Industry-standard SQL parser, validator and JDBC driver. ● Query optimization ○ Represent your query in relational algebra, transform using planning rules, and optimize according to a cost model using smart algorithms. ● Any data, anywhere ○ Connect to third-party data sources, browse metadata, and optimize by pushing the computation to the data.
  • 19. Apache Calcite - Used By
  • 21. Main components of a DBMS (diagram): Client Communication Manager (Local Client Protocols, Remote Client Protocols); Process Manager (Admission Control, Dispatch & Scheduling); Relational Query Processor (Query Parsing & Authorization, Query Rewrite, Query Optimizer, Plan Executor, DDL & Utility Processing); Transactional Storage Manager (Access Methods, Buffer Manager, Lock Manager, Log Manager); Shared Components and Utilities (Catalog Manager, Memory Manager, Admin., Monitoring & Utilities, Replication & Loading Services, Batch Utilities). Source: Joseph M. Hellerstein, Michael Stonebraker, James Hamilton. Architecture of a Database System. Foundations and Trends in Databases, 1, 2 (2007).
  • 23. Query Planner Overview ● Parse SQL query ● Create an optimal logical plan of a query ● Create an optimal physical plan and execute
  • 26. What the duck? select CEIL(s.sale_date TO DAY) sale_day, sum(s.qty*p.price) sale_sum from sales s join product p on p.product_id = s.product_id where p.name = 'Giant Rubber Duck' group by sale_day
  • 28. Same query as slide 26. Plan so far: Scan (sales), Scan (product)
  • 29. Same query. Plan so far, in dataflow order: Scan (sales), Scan (product) → Join (product_id)
  • 30. Same query. Plan so far, in dataflow order: Scan (sales), Scan (product) → Join (product_id) → Filter (name = ‘...’)
  • 31. Same query. Plan so far, in dataflow order: Scan (sales), Scan (product) → Join (product_id) → Filter (name = ‘...’) → Project (sale_date, qty, price)
  • 32. Same query. Complete plan, in dataflow order: Scan (sales), Scan (product) → Join (product_id) → Filter (name = ‘...’) → Project (sale_date, qty, price) → Aggregate (group = sale_day, agg = sum(qty * price))
  • 33. Same query. The same plan expressed with logical operators: LogicalScan (sales), LogicalScan (product) → LogicalJoin (product_id) → LogicalFilter (name = ‘...’) → LogicalProject (sale_date, qty, price) → LogicalAggregate (group = sale_day, agg = sum(qty * price))
  • 34. The same logical plan built programmatically with RelBuilder:
      RelBuilder b = ...;
      RelNode query = b
          .scan("sales").as("s")
          .scan("product").as("p")
          .join(JoinRelType.INNER, "product_id")
          // WHERE p.name = 'Giant Rubber Duck'
          .filter(
              b.call(SqlStdOperatorTable.EQUALS,
                  b.field("p", "name"),
                  b.literal("Giant Rubber Duck")))
          // CEIL(s.sale_date TO DAY) AS sale_day, s.qty * p.price AS sale
          .project(
              b.alias(
                  b.call(SqlStdOperatorTable.CEIL,
                      b.field("s", "sale_date"),
                      b.getRexBuilder().makeFlag(TimeUnitRange.DAY)),
                  "sale_day"),
              b.alias(
                  b.call(SqlStdOperatorTable.MULTIPLY,
                      b.field("s", "qty"),
                      b.field("p", "price")),
                  "sale"))
          // GROUP BY sale_day, SUM(sale) AS sale_sum
          .aggregate(
              b.groupKey("sale_day"),
              b.sum(false, "sale_sum", b.field("sale")))
          .build();
    Resulting plan, in dataflow order: LogicalScan (sales), LogicalScan (product) → LogicalJoin (product_id) → LogicalFilter (name = ‘...’) → LogicalProject (sale_date, qty, price) → LogicalAggregate (group = sale_day, agg = sum(qty * price))
  • 35. Query Planner Overview ● Parse SQL query ○ The SQL query is parsed into a logical plan ○ A logical plan is an abstract model of a query ○ Calcite has classes to build a logical plan ○ You can create the plan even without a SQL query
  • 36. Query Planner Overview ● Parse SQL query ● Create an optimal logical plan of a query ● Create an optimal physical plan and execute
  • 38. Same query as slide 26. Logical plan, in dataflow order: Scan (sales), Scan (product) → Join (product_id) → Filter (name = ‘...’) → Project (sale_date, qty, price) → Aggregate (group = sale_day, agg = sum(qty * price))
  • 40. Same query and plan, with two projections added: Project (p_id, name, price) and Project (p_id, qty, sale_date)
  • 41. Apache Calcite - Optimization Rules ● org.apache.calcite.plan.RelOptRule ● Matches a pattern in the plan tree ● Can check additional conditions, then modifies the plan ● Rules can be specified dynamically, based on query properties
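The rule contract on slide 41 (match a pattern, optionally check conditions, rewrite the plan) can be illustrated without Calcite. The sketch below is plain Java with hypothetical names (`RuleSketch`, `Node`, `Rule`, `MERGE_FILTERS`); it does not use `RelOptRule`, and only mimics the idea of a filter-merging rule on a toy plan tree.

```java
import java.util.Arrays;
import java.util.List;

final class RuleSketch {
    /** Toy immutable plan node (hypothetical; not Calcite's RelNode). */
    static final class Node {
        final String op;          // operator name, e.g. "Filter" or "Scan"
        final String detail;      // predicate text or table name
        final List<Node> inputs;  // child nodes, in dataflow order

        Node(String op, String detail, Node... inputs) {
            this.op = op;
            this.detail = detail;
            this.inputs = Arrays.asList(inputs);
        }

        @Override public String toString() {
            StringBuilder sb = new StringBuilder(op).append('[').append(detail).append(']');
            if (!inputs.isEmpty()) {
                sb.append('(');
                for (int i = 0; i < inputs.size(); i++) {
                    if (i > 0) sb.append(", ");
                    sb.append(inputs.get(i));
                }
                sb.append(')');
            }
            return sb.toString();
        }
    }

    /** A rule is a pattern test plus a rewrite of the matched subtree. */
    interface Rule {
        boolean matches(Node n);
        Node apply(Node n);
    }

    /** Pattern: Filter directly over Filter. Rewrite: one Filter with both predicates ANDed. */
    static final Rule MERGE_FILTERS = new Rule() {
        public boolean matches(Node n) {
            return n.op.equals("Filter")
                && !n.inputs.isEmpty()
                && n.inputs.get(0).op.equals("Filter");
        }
        public Node apply(Node n) {
            Node child = n.inputs.get(0);
            return new Node("Filter", n.detail + " AND " + child.detail, child.inputs.get(0));
        }
    };

    /** Rewrites the inputs first, then fires the rule on this node until it stops matching. */
    static Node rewrite(Node n, Rule rule) {
        Node[] newInputs = new Node[n.inputs.size()];
        for (int i = 0; i < newInputs.length; i++) {
            newInputs[i] = rewrite(n.inputs.get(i), rule);
        }
        Node current = new Node(n.op, n.detail, newInputs);
        while (rule.matches(current)) {
            current = rule.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        Node plan = new Node("Filter", "qty > 0",
            new Node("Filter", "name = 'Giant Rubber Duck'",
                new Node("Scan", "sales")));
        System.out.println(rewrite(plan, MERGE_FILTERS));
    }
}
```

Keeping the pattern test separate from the rewrite is what lets a planner decide which rules to fire and in what order; the bottom-up driver here is the simplest possible choice, not how Calcite schedules its rules.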
  • 42. Apache Calcite - Optimization Rules
  • 43. Apache Calcite - Query Optimizer. The query optimizer makes all the difference in a DB. Two general categories: ● Rule-based (naive heuristic) optimizer ● Cost-based optimizer
  • 44. Apache Calcite - Query Optimizer ● Rule-based/naive heuristic ○ Fast ○ Good fit for rules that guarantee improvement ○ Limited in plan space exploration ○ Not every aspect of a query can be optimised this way, e.g. join reordering
  • 45. Apache Calcite - Query Optimizer ● Cost-based ○ Iteratively explores possible changes to the plan ○ Estimates the cost of plans ○ Chooses the cheapest one ○ Stops when the search criteria are met ❗ Can take a long time ❗ For some queries, executing the query might be faster than finding the optimal plan
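The cost-based idea on slide 45 (estimate the cost of alternative plans, keep the cheapest) can be sketched on the duck query. Everything below is an assumption made for illustration: the class `CostSketch`, the row counts, the filter selectivity, and the cost model (rows processed by each operator). A real planner would generate the candidate plans by firing rules; here the two candidates are written by hand.

```java
import java.util.Arrays;
import java.util.List;

final class CostSketch {
    /** Toy plan node that estimates its output rows and its cumulative cost. */
    interface Plan {
        double rows();  // estimated number of output rows
        double cost();  // estimated rows processed by this node and everything below it
    }

    static Plan scan(final String table, final double tableRows) {
        return new Plan() {
            public double rows() { return tableRows; }
            public double cost() { return tableRows; }
            @Override public String toString() { return "Scan(" + table + ")"; }
        };
    }

    /** Filter keeps a fixed fraction of its input and has to look at every input row. */
    static Plan filter(final Plan input, final double selectivity) {
        return new Plan() {
            public double rows() { return input.rows() * selectivity; }
            public double cost() { return input.cost() + input.rows(); }
            @Override public String toString() { return "Filter(" + input + ")"; }
        };
    }

    /** Equi-join estimated as |L| * |R| / distinctKeys; it reads both inputs once. */
    static Plan join(final Plan left, final Plan right, final double distinctKeys) {
        return new Plan() {
            public double rows() { return left.rows() * right.rows() / distinctKeys; }
            public double cost() {
                return left.cost() + right.cost() + left.rows() + right.rows();
            }
            @Override public String toString() { return "Join(" + left + ", " + right + ")"; }
        };
    }

    /** Returns the candidate with the lowest estimated cost. */
    static Plan cheapest(List<Plan> candidates) {
        Plan best = candidates.get(0);
        for (Plan p : candidates) {
            if (p.cost() < best.cost()) best = p;
        }
        return best;
    }

    public static void main(String[] args) {
        Plan sales = scan("sales", 1_000_000);   // assumed table size
        Plan product = scan("product", 1_000);   // assumed table size
        // Candidate 1: filter on product name applied after the join.
        Plan filterLate = filter(join(sales, product, 1_000), 0.001);
        // Candidate 2: the same filter pushed below the join.
        Plan filterEarly = join(sales, filter(product, 0.001), 1_000);
        Plan best = cheapest(Arrays.asList(filterLate, filterEarly));
        System.out.println(best + " cost=" + best.cost());
    }
}
```

Under these assumed numbers the plan with the filter pushed below the join wins, because the join no longer has to emit a million rows only to discard most of them. With different statistics the ranking could change, which is the point of estimating cost rather than always applying a rule.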
  • 46. Query Planner Overview ● Parse SQL query ● Create an optimal logical plan of a query ○ Relational algebra is not that scary ○ Rules can make your plan better (but not always) ○ Calcite has lots of rules and you can write your own ○ There are multiple planning algorithms: heuristic and cost-based ○ Choose wisely
  • 47. Query Planner Overview ● Parse SQL query ● Create an optimal logical plan of a query ● Create an optimal physical plan and execute
  • 48. Same query as slide 26. Logical plan nodes: Scan (sales), Scan (product), Join (product_id), Filter (name = ‘...’), Project (sale_date, qty, price), Aggregate (group = sale_day, agg = sum(qty * price)), Project (p_id, name, price), Project (p_id, qty, sale_date)
  • 49. Same query. Physical plan nodes: Scan (sales), Scan (product), SortMergeJoin (product_id), Filter (name = ‘...’), Project (sale_date, qty, price), HashAggregate (group = sale_day, agg = sum(qty * price)), Project (p_id, name, price), Project (p_id, qty, sale_date)
  • 50. Execution in DB: the physical plan is then executed
  • 52. Same query as slide 26. Logical plan nodes: Scan (sales), Scan (product), Join (product_id), Filter (name = ‘...’), Project (sale_date, qty, price), Aggregate (group = sale_day, agg = sum(qty * price)), Project (p_id, name, price), Project (p_id, qty, sale_date)
  • 53. Same query. Physical plan over two sources. Mongo nodes: MongoTableScan (sales), MongoProject (p_id, qty, sale_date), MongoToEnumerableConverter. JDBC nodes: JdbcTableScan (product), JdbcProject (p_id, name, price), JdbcFilter (name = ‘...’), JdbcToEnumerableConverter. Enumerable nodes: EnumerableJoin (product_id), EnumerableCalc (sale_date, qty, price), EnumerableAggregate (group = sale_day, agg = sum(qty * price))
  • 55. Apache Calcite - Plan Execution ● Standard JDBC ResultSet ● Specialized Iterable that returns the data
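To make "the physical plan is executed" concrete, here is a minimal in-memory sketch of the duck query, assuming rows are plain `Object[]` arrays and each operator consumes an `Iterable` of rows. The class `ExecSketch`, its method names, the column layouts and the sample data are all hypothetical; results are materialized as lists for brevity, whereas Calcite's own Enumerable operators are not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

final class ExecSketch {
    static List<Object[]> rows(Object[]... rs) {
        return Arrays.asList(rs);
    }

    /** Filter: keeps only the rows that satisfy the predicate. */
    static List<Object[]> filter(Iterable<Object[]> in, Predicate<Object[]> p) {
        List<Object[]> out = new ArrayList<>();
        for (Object[] r : in) {
            if (p.test(r)) out.add(r);
        }
        return out;
    }

    /** Hash join: builds a hash table on the right input, probes it with the left.
     *  Each output row is the left columns followed by the right columns. */
    static List<Object[]> hashJoin(Iterable<Object[]> left, int leftKey,
                                   Iterable<Object[]> right, int rightKey) {
        Map<Object, List<Object[]>> table = new HashMap<>();
        for (Object[] r : right) {
            table.computeIfAbsent(r[rightKey], k -> new ArrayList<>()).add(r);
        }
        List<Object[]> out = new ArrayList<>();
        for (Object[] l : left) {
            List<Object[]> matches = table.getOrDefault(l[leftKey], Collections.<Object[]>emptyList());
            for (Object[] r : matches) {
                Object[] joined = Arrays.copyOf(l, l.length + r.length);
                System.arraycopy(r, 0, joined, l.length, r.length);
                out.add(joined);
            }
        }
        return out;
    }

    /** Hash aggregate: groups by one column and sums qty * price per group. */
    static Map<Object, Double> sumByGroup(Iterable<Object[]> in, int groupCol, int qtyCol, int priceCol) {
        Map<Object, Double> sums = new TreeMap<>();
        for (Object[] r : in) {
            double v = ((Number) r[qtyCol]).doubleValue() * ((Number) r[priceCol]).doubleValue();
            sums.merge(r[groupCol], v, Double::sum);
        }
        return sums;
    }

    /** sales rows: {product_id, sale_day, qty}; product rows: {product_id, name, price}. */
    static Map<Object, Double> duckSalesPerDay(List<Object[]> sales, List<Object[]> product) {
        List<Object[]> ducks = filter(product, r -> "Giant Rubber Duck".equals(r[1]));
        // Joined row layout: {product_id, sale_day, qty, product_id, name, price}
        List<Object[]> joined = hashJoin(sales, 0, ducks, 0);
        return sumByGroup(joined, 1, 2, 5);
    }

    public static void main(String[] args) {
        List<Object[]> sales = rows(
            new Object[] {1, "2018-05-01", 2},
            new Object[] {2, "2018-05-01", 5},
            new Object[] {1, "2018-05-02", 4});
        List<Object[]> product = rows(
            new Object[] {1, "Giant Rubber Duck", 10.0},
            new Object[] {2, "Tiny Duck", 1.0});
        System.out.println(duckSalesPerDay(sales, product));
    }
}
```

The day truncation (`CEIL ... TO DAY`) is skipped by storing the day directly, to keep the sketch short.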
  • 56. Query Planner Overview ● Parse SQL query ● Create an optimal logical plan of a query ● Create an optimal physical plan and execute ○ It’s rules again ○ Calling conventions in Calcite are cool ○ You can convert between conventions ○ A calling convention knows how to execute its nodes
  • 58. Apache Calcite - Stuff To Remember ● Calcite knows how to parse SQL and optimize queries ● Physical nodes have a calling convention, which says how to execute the plan ● You can mix calling conventions in one plan using converter nodes
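The idea of mixing calling conventions through converter nodes can be sketched as a tree walk: wherever an input's convention differs from its parent's, a converter is placed between them. The class `ConventionSketch`, the string-valued conventions and the exact tree shape are assumptions for illustration; in Calcite the planner introduces converters itself, and this toy does it as a separate pass only to show the result.

```java
import java.util.Arrays;
import java.util.List;

final class ConventionSketch {
    /** Toy physical node tagged with the convention that knows how to run it. */
    static final class Node {
        final String name;
        final String convention;  // e.g. "Mongo", "Jdbc", "Enumerable"
        final List<Node> inputs;

        Node(String name, String convention, Node... inputs) {
            this.name = name;
            this.convention = convention;
            this.inputs = Arrays.asList(inputs);
        }

        @Override public String toString() {
            if (inputs.isEmpty()) return name;
            StringBuilder sb = new StringBuilder(name).append('(');
            for (int i = 0; i < inputs.size(); i++) {
                if (i > 0) sb.append(", ");
                sb.append(inputs.get(i));
            }
            return sb.append(')').toString();
        }
    }

    /** Returns a copy of the tree in which every input whose convention differs
     *  from its parent's is wrapped in a "<From>To<To>Converter" node. */
    static Node addConverters(Node n) {
        Node[] newInputs = new Node[n.inputs.size()];
        for (int i = 0; i < newInputs.length; i++) {
            Node child = addConverters(n.inputs.get(i));
            if (!child.convention.equals(n.convention)) {
                String converterName = child.convention + "To" + n.convention + "Converter";
                child = new Node(converterName, n.convention, child);
            }
            newInputs[i] = child;
        }
        return new Node(n.name, n.convention, newInputs);
    }

    /** The duck query split across two sources; the tree shape is an assumption. */
    static Node duckPlan() {
        Node mongoSide = new Node("MongoProject", "Mongo",
            new Node("MongoTableScan", "Mongo"));
        Node jdbcSide = new Node("JdbcFilter", "Jdbc",
            new Node("JdbcProject", "Jdbc",
                new Node("JdbcTableScan", "Jdbc")));
        return new Node("EnumerableAggregate", "Enumerable",
            new Node("EnumerableCalc", "Enumerable",
                new Node("EnumerableJoin", "Enumerable", mongoSide, jdbcSide)));
    }

    public static void main(String[] args) {
        System.out.println(addConverters(duckPlan()));
    }
}
```

Each subtree below a converter stays in a single convention, so it can be handed to that source as one unit (a Mongo query, a JDBC statement), while the nodes above the converters run in the engine itself.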
  • 61. Query Engine Architecture (diagram): Query Engine with Spark adapter, Connector adapter, JDBC, ZD, Rx
  • 63. So what? ● New features - faster ● Optimisations - more reliable ● SQL interop - cool
  • 64. Apache Calcite - Contra ● Steep learning curve ● Big, complex codebase ● Obsolete/unfinished code
  • 65. Apache Calcite - Pro ○ It really works ○ Friendly community ○ Constantly improving and evolving ○ Good place to start
  • 67. The Theory ● A SQL query is parsed into a logical plan, an abstract model of the query ● The logical plan is optimised with special rules, based on relational algebra ● The logical plan is then converted to a physical plan ● The physical plan can be executed to return the data
  • 68. The Theory ● Query optimization is fun ● Relational algebra is not scary at all ● Solid basis for optimization and execution ● At least you can mention this in interviews
  • 69. The Practice ● Apache Calcite knows how to parse SQL and optimize queries ● It has lots of optimization rules and state-of-the-art optimization algorithms ● Physical nodes have a calling convention, which says how to execute the plan ● You can mix calling conventions in one plan to query multiple sources
  • 70. The Practice ● When you want to do analytics over federated data ○ Consider using an existing product ○ Consider using an existing engine, e.g. Calcite ● But be careful ○ This is not your regular ORM ○ Use it for SQL, not instead of SQL
  • 71. Where to start? ● org.apache.calcite.schema.ScannableTable ● org.apache.calcite.rel.RelNode ● org.apache.calcite.plan.RelOptRule
  • 72. Further reading ● Apache Calcite Overview ○ https://arxiv.org/pdf/1802.10233.pdf ● Apache Calcite Documentation ○ https://calcite.apache.org/docs/ ● Readings in Database Systems, 5th Edition ○ http://www.redbook.io/