Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

Julian Hyde
Julian HydeSoftware Developer at Google, Apache Software Foundation
Apache Calcite: A Foundational
Framework for Optimized Query
Processing Over Heterogeneous Data
Sources
Edmon Begoli, Jesú s Camacho-Rodrı́guez, Julian Hyde,
Michael J. Mior, Daniel Lemire
2018 SIGMOD, Houston, Texas, USA
Outline
Background and History
Architecture
Adapter Design
Optimizer and Planner
Adoption
Uses in Research and Scholastic Potential
Roadmap and Future Work
What is Calcite?
Apache Calcite is an extensible framework for
building data management systems.
It is an open source project governed by the
Apache Software Foundation, is written in
Java, and is used by dozens of projects and
companies, and several research projects.
Origins and Design Principles
Origins 2004 – LucidEra and SQLstream were each building SQL systems;
2012 – Pare down code base, enter Apache as incubator project
Problem Building a high-quality database requires ~ 20 person years (effort)
and 5 years (elapsed)
Solution Create an open source framework that a community can contribute
to, and use to build their own DBMSs
Design
principles
Flexible → Relational algebra
Extensible/composable → Volcano-style planner
Easy to contribute to → Java, FP style
Alternatives PostgreSQL, Apache Spark, AsterixDB
Architecture
Core – Operator expressions
(relational algebra) and planner
(based on Volcano/Cascades)
External – Data storage, algorithms
and catalog
Optional – SQL parser, JDBC &
ODBC drivers
Extensible – Planner rewrite rules,
statistics, cost model, algebra, UDFs
Adapter Design
A pattern that defines how
Calcite incorporates diverse
data sources for general
access.
Model – specification of the
physical properties of the data
source.
Schema – definition of the data
(format and layouts) found in
the model.
Represent query as
relational algebra
MySQL
Splunk
join
Key: productId
group
Key: productName
Agg: count
filter
Condition:
action = 'purchase'
sort
Key: c desc
scan
scan
Table: products
Table: splunk
select p.productName, count(*) as c
from splunk.splunk as s
join mysql.products as p
on s.productId = p.productId
where s.action = 'purchase'
group by p.productName
order by c desc
Optimize query by
applying transformation
rules
MySQL
Splunk
join
Key: productId
group
Key: productName
Agg: count
filter
Condition:
action = 'purchase'
sort
Key: c desc
scan
scan
Table: splunk
Table: products
select p.productName, count(*) as c
from splunk.splunk as s
join mysql.products as p
on s.productId = p.productId
where s.action = 'purchase'
group by p.productName
order by c desc
1. Plans start
as logical
nodes.
3. Fire rules to
propagate conventions
to other nodes.
2. Assign each
Scan its table’s
native
convention.
4. The best plan may
use an engine not tied
to any native format.
To implement, generate
a program that calls out
to query1 and query2.
Join
Filter Scan
ScanScan
Join
Conventions
Join
Filter Scan
ScanScan
Join
Scan
ScanScan
Join
Filter
Join
Join
Filter Scan
ScanScan
Join
Conventions & adapters
Scan Scan
Join
Filter
Join
Scan
Convention provides a uniform
representation for hybrid queries
Like ordering and distribution,
convention is a physical property of
nodes
Adapter =
schema factory (lists tables)
+ convention
+ rules to convert nodes to convention
Stream ~= append-only table
Streaming queries return deltas
Stream-table duality: Orders is used as
both stream and table
Our contributions:
➢ Popularize streaming SQL
➢ SQL parser / validator / rules
➢ Reference implementation & TCK
select stream *
from Orders as o
where units > (
select avg(units)
from Orders as h
where h.productId = o.productId
and h.rowtime >
o.rowtime - interval ‘1’ year)
“Show me real-time orders whose size is larger
than the average for that product over the
preceding year”
Streaming SQL
Uses and Adoption
Uses in Research
● Polystore research – use as lightweight
heterogeneous data processing platform
● Optimization and query profiling –
general performance, and optimizer
research
● Reasoning over Streams, Graphs –
under consideration
● Open-source, production grade learning
and research platform
Future Work and Roadmap
● Support its use as a standalone engine – DDL, materialized views,
indexes and constraints.
● Improvements to the design and extensibility of the planner
(modularity, pluggability)
● Incorporation of new parametric approaches into the design of the
optimizer.
● Support for an extended set of SQL commands, functions, and
utilities, including full compliance with OpenGIS (spatial).
● New adapters for non-relational data sources such as array
databases.
● Improvements to performance profiling and instrumentation.
Thank you! Questions?
@ApacheCalcite
https://calcite.apache.org
https://arxiv.org/abs/1802.10233
Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources
Extra slides
Calcite framework
Cost, statistics
RelOptCost
RelOptCostFactory
RelMetadataProvider
• RelMdColumnUniquensss
• RelMdDistinctRowCount
• RelMdSelectivity
SQL parser
SqlNode
SqlParser
SqlValidator
Transformation rules
RelOptRule
• FilterMergeRule
• AggregateUnionTransposeRule
• 100+ more
Global transformations
• Unification (materialized view)
• Column trimming
• De-correlation
Relational algebra
RelNode (operator)
• TableScan
• Filter
• Project
• Union
• Aggregate
• …
RelDataType (type)
RexNode (expression)
RelTrait (physical property)
• RelConvention (calling-convention)
• RelCollation (sortedness)
• RelDistribution (partitioning)
RelBuilder
JDBC driver
Metadata
Schema
Table
Function
• TableFunction
• TableMacro
Lattice
Avatica
● Database connectivity
stack
● Self-contained sub-project
of Calcite
● Fast, open, stable
● Protobuf or JSON over
HTTP
● Powers Phoenix Query
Server
Lattice (optimized) () 1
(z, s, g, y,
m) 912k
(s, g, y,
m) 6k
(z) 43k (s) 50 (g) 2 (y) 5 (m) 12
(z, g, y,
m) 909k
(z, s, y,
m) 831k
raw 1m
(z, s, g,
m) 644k
(z, s, g,
y) 392k
(y, m)
60
(z, s)
43.4k
(z, s, g)
83.6k
(g, y) 10
(g, y, m)
120
(g, m)
24
Key
z zipcode (43k)
s state (50)
g gender (2)
y year (5)
m month (12)
Aggregation and windows on
streams
GROUP BY aggregates multiple rows into
sub-totals
➢ In regular GROUP BY each row contributes to
exactly one sub-total
➢ In multi-GROUP BY (e.g. HOP, GROUPING
SETS) a row can contribute to more than one
sub-total
Window functions (OVER) leave the number of
rows unchanged, but compute extra expressions
for each row (based on neighboring rows)
Multi
GROUP BY
Window
functions
GROUP BY
Tumbling, hopping & session windows in SQL
Tumbling window
Hopping window
Session window
select stream … from Orders
group by floor(rowtime to hour)
select stream … from Orders
group by tumble(rowtime, interval ‘1’ hour)
select stream … from Orders
group by hop(rowtime, interval ‘1’ hour,
interval ‘2’ hour)
select stream … from Orders
group by session(rowtime, interval ‘1’ hour)
Controlling when data is emitted
Early emission is the defining
characteristic of a streaming query.
The emit clause is a SQL extension
inspired by Apache Beam’s “trigger”
notion. (Still experimental… and
evolving.)
A relational (non-streaming) query is
just a query with the most conservative
possible emission strategy.
select stream productId,
count(*) as c
from Orders
group by productId,
floor(rowtime to hour)
emit at watermark,
early interval ‘2’ minute,
late limit 1;
select *
from Orders
emit when complete;
1 of 23

Recommended

Apache Calcite overview by
Apache Calcite overviewApache Calcite overview
Apache Calcite overviewJulian Hyde
19.5K views9 slides
Apache Calcite Tutorial - BOSS 21 by
Apache Calcite Tutorial - BOSS 21Apache Calcite Tutorial - BOSS 21
Apache Calcite Tutorial - BOSS 21Stamatis Zampetakis
663 views83 slides
Introduction to Apache Calcite by
Introduction to Apache CalciteIntroduction to Apache Calcite
Introduction to Apache CalciteJordan Halterman
16.3K views109 slides
SQL for NoSQL and how Apache Calcite can help by
SQL for NoSQL and how  Apache Calcite can helpSQL for NoSQL and how  Apache Calcite can help
SQL for NoSQL and how Apache Calcite can helpChristian Tzolov
1.2K views34 slides
Cost-based Query Optimization in Apache Phoenix using Apache Calcite by
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteJulian Hyde
8K views32 slides
Apache Calcite: One Frontend to Rule Them All by
Apache Calcite: One Frontend to Rule Them AllApache Calcite: One Frontend to Rule Them All
Apache Calcite: One Frontend to Rule Them AllMichael Mior
675 views44 slides

More Related Content

What's hot

Apache Calcite: One planner fits all by
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits allJulian Hyde
6.7K views10 slides
Data all over the place! How SQL and Apache Calcite bring sanity to streaming... by
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Julian Hyde
4K views56 slides
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in... by
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Julian Hyde
1.6K views45 slides
Cost-based query optimization in Apache Hive by
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache HiveJulian Hyde
8.7K views29 slides
The evolution of Apache Calcite and its Community by
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityJulian Hyde
330 views26 slides
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite by
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteOpen Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteJulian Hyde
1.1K views19 slides

What's hot(20)

Apache Calcite: One planner fits all by Julian Hyde
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits all
Julian Hyde6.7K views
Data all over the place! How SQL and Apache Calcite bring sanity to streaming... by Julian Hyde
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Julian Hyde4K views
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in... by Julian Hyde
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...
Julian Hyde1.6K views
Cost-based query optimization in Apache Hive by Julian Hyde
Cost-based query optimization in Apache HiveCost-based query optimization in Apache Hive
Cost-based query optimization in Apache Hive
Julian Hyde8.7K views
The evolution of Apache Calcite and its Community by Julian Hyde
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its Community
Julian Hyde330 views
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite by Julian Hyde
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteOpen Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Julian Hyde1.1K views
Fast federated SQL with Apache Calcite by Chris Baynes
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
Chris Baynes1.4K views
Adding measures to Calcite SQL by Julian Hyde
Adding measures to Calcite SQLAdding measures to Calcite SQL
Adding measures to Calcite SQL
Julian Hyde392 views
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P... by Databricks
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Databricks1K views
Designing Structured Streaming Pipelines—How to Architect Things Right by Databricks
Designing Structured Streaming Pipelines—How to Architect Things RightDesigning Structured Streaming Pipelines—How to Architect Things Right
Designing Structured Streaming Pipelines—How to Architect Things Right
Databricks4.5K views
Morel, a Functional Query Language by Julian Hyde
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query Language
Julian Hyde1.1K views
Building robust CDC pipeline with Apache Hudi and Debezium by Tathastu.ai
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
Tathastu.ai2.7K views
Pandas UDF: Scalable Analysis with Python and PySpark by Li Jin
Pandas UDF: Scalable Analysis with Python and PySparkPandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySpark
Li Jin343 views
SQL on everything, in memory by Julian Hyde
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memory
Julian Hyde7.4K views
Introducing DataFrames in Spark for Large Scale Data Science by Databricks
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
Databricks41K views
Introduction to PostgreSQL by Jim Mlodgenski
Introduction to PostgreSQLIntroduction to PostgreSQL
Introduction to PostgreSQL
Jim Mlodgenski6.5K views
Data Source API in Spark by Databricks
Data Source API in SparkData Source API in Spark
Data Source API in Spark
Databricks18.9K views
Apache Tez: Accelerating Hadoop Query Processing by DataWorks Summit
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
DataWorks Summit31.7K views
Hive Bucketing in Apache Spark with Tejas Patil by Databricks
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
Databricks14.9K views
SQL Query Optimization: Why Is It So Hard to Get Right? by Brent Ozar
SQL Query Optimization: Why Is It So Hard to Get Right?SQL Query Optimization: Why Is It So Hard to Get Right?
SQL Query Optimization: Why Is It So Hard to Get Right?
Brent Ozar4.6K views

Similar to Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

A Smarter Pig: Building a SQL interface to Pig using Apache Calcite by
A Smarter Pig: Building a SQL interface to Pig using Apache CalciteA Smarter Pig: Building a SQL interface to Pig using Apache Calcite
A Smarter Pig: Building a SQL interface to Pig using Apache CalciteSalesforce Engineering
322 views20 slides
Spark DataFrames and ML Pipelines by
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesDatabricks
11.8K views30 slides
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan... by
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...Jürgen Ambrosi
538 views33 slides
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite by
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteA smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteJulian Hyde
1.7K views20 slides
Querying Druid in SQL with Superset by
Querying Druid in SQL with SupersetQuerying Druid in SQL with Superset
Querying Druid in SQL with SupersetDataWorks Summit
5.5K views26 slides
Streaming SQL by
Streaming SQLStreaming SQL
Streaming SQLJulian Hyde
3.6K views30 slides

Similar to Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources(20)

A Smarter Pig: Building a SQL interface to Pig using Apache Calcite by Salesforce Engineering
A Smarter Pig: Building a SQL interface to Pig using Apache CalciteA Smarter Pig: Building a SQL interface to Pig using Apache Calcite
A Smarter Pig: Building a SQL interface to Pig using Apache Calcite
Spark DataFrames and ML Pipelines by Databricks
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
Databricks11.8K views
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan... by Jürgen Ambrosi
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
Jürgen Ambrosi538 views
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite by Julian Hyde
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteA smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
Julian Hyde1.7K views
Querying Druid in SQL with Superset by DataWorks Summit
Querying Druid in SQL with SupersetQuerying Druid in SQL with Superset
Querying Druid in SQL with Superset
DataWorks Summit5.5K views
Streaming SQL by Julian Hyde
Streaming SQLStreaming SQL
Streaming SQL
Julian Hyde3.6K views
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark by Vital.AI
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and SparkVital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital.AI15.2K views
Cost-based query optimization in Apache Hive 0.14 by Julian Hyde
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
Julian Hyde4K views
The Developer Data Scientist – Creating New Analytics Driven Applications usi... by Microsoft Tech Community
The Developer Data Scientist – Creating New Analytics Driven Applications usi...The Developer Data Scientist – Creating New Analytics Driven Applications usi...
The Developer Data Scientist – Creating New Analytics Driven Applications usi...
Spark SQL In Depth www.syedacademy.com by Syed Hadoop
Spark SQL In Depth www.syedacademy.comSpark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.com
Syed Hadoop218 views
Emerging technologies /frameworks in Big Data by Rahul Jain
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
Rahul Jain2.7K views
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer by Sachin Aggarwal
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Sachin Aggarwal1.7K views
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms by DataStax Academy
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
DataStax Academy4.1K views
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ... by Jose Quesada (hiring)
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
Jose Quesada (hiring)10.1K views
Declarative benchmarking of cassandra and it's data models by Monal Daxini
Declarative benchmarking of cassandra and it's data modelsDeclarative benchmarking of cassandra and it's data models
Declarative benchmarking of cassandra and it's data models
Monal Daxini310 views
SQL on Hadoop by nvvrajesh
SQL on HadoopSQL on Hadoop
SQL on Hadoop
nvvrajesh2.3K views
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python by Miklos Christine
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
Miklos Christine1.4K views
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive by Xu Jiang
Apache Kylin: OLAP Engine on Hadoop - Tech Deep DiveApache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Apache Kylin: OLAP Engine on Hadoop - Tech Deep Dive
Xu Jiang33.1K views

More from Julian Hyde

Building a semantic/metrics layer using Calcite by
Building a semantic/metrics layer using CalciteBuilding a semantic/metrics layer using Calcite
Building a semantic/metrics layer using CalciteJulian Hyde
259 views55 slides
Cubing and Metrics in SQL, oh my! by
Cubing and Metrics in SQL, oh my!Cubing and Metrics in SQL, oh my!
Cubing and Metrics in SQL, oh my!Julian Hyde
371 views31 slides
Morel, a data-parallel programming language by
Morel, a data-parallel programming languageMorel, a data-parallel programming language
Morel, a data-parallel programming languageJulian Hyde
174 views81 slides
Is there a perfect data-parallel programming language? (Experiments with More... by
Is there a perfect data-parallel programming language? (Experiments with More...Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...Julian Hyde
366 views97 slides
What to expect when you're Incubating by
What to expect when you're IncubatingWhat to expect when you're Incubating
What to expect when you're IncubatingJulian Hyde
402 views15 slides
Efficient spatial queries on vanilla databases by
Efficient spatial queries on vanilla databasesEfficient spatial queries on vanilla databases
Efficient spatial queries on vanilla databasesJulian Hyde
329 views35 slides

More from Julian Hyde(20)

Building a semantic/metrics layer using Calcite by Julian Hyde
Building a semantic/metrics layer using CalciteBuilding a semantic/metrics layer using Calcite
Building a semantic/metrics layer using Calcite
Julian Hyde259 views
Cubing and Metrics in SQL, oh my! by Julian Hyde
Cubing and Metrics in SQL, oh my!Cubing and Metrics in SQL, oh my!
Cubing and Metrics in SQL, oh my!
Julian Hyde371 views
Morel, a data-parallel programming language by Julian Hyde
Morel, a data-parallel programming languageMorel, a data-parallel programming language
Morel, a data-parallel programming language
Julian Hyde174 views
Is there a perfect data-parallel programming language? (Experiments with More... by Julian Hyde
Is there a perfect data-parallel programming language? (Experiments with More...Is there a perfect data-parallel programming language? (Experiments with More...
Is there a perfect data-parallel programming language? (Experiments with More...
Julian Hyde366 views
What to expect when you're Incubating by Julian Hyde
What to expect when you're IncubatingWhat to expect when you're Incubating
What to expect when you're Incubating
Julian Hyde402 views
Efficient spatial queries on vanilla databases by Julian Hyde
Efficient spatial queries on vanilla databasesEfficient spatial queries on vanilla databases
Efficient spatial queries on vanilla databases
Julian Hyde329 views
Tactical data engineering by Julian Hyde
Tactical data engineeringTactical data engineering
Tactical data engineering
Julian Hyde971 views
Don't optimize my queries, organize my data! by Julian Hyde
Don't optimize my queries, organize my data!Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!
Julian Hyde1.2K views
Spatial query on vanilla databases by Julian Hyde
Spatial query on vanilla databasesSpatial query on vanilla databases
Spatial query on vanilla databases
Julian Hyde1.1K views
Lazy beats Smart and Fast by Julian Hyde
Lazy beats Smart and FastLazy beats Smart and Fast
Lazy beats Smart and Fast
Julian Hyde975 views
Don’t optimize my queries, optimize my data! by Julian Hyde
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
Julian Hyde4.4K views
Data profiling with Apache Calcite by Julian Hyde
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache Calcite
Julian Hyde1.9K views
Data Profiling in Apache Calcite by Julian Hyde
Data Profiling in Apache CalciteData Profiling in Apache Calcite
Data Profiling in Apache Calcite
Julian Hyde1.7K views
Streaming SQL (at FlinkForward, Berlin, 2016/09/12) by Julian Hyde
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Streaming SQL (at FlinkForward, Berlin, 2016/09/12)
Julian Hyde2.8K views
Streaming SQL by Julian Hyde
Streaming SQLStreaming SQL
Streaming SQL
Julian Hyde1.3K views
Streaming SQL with Apache Calcite by Julian Hyde
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
Julian Hyde7.7K views
Streaming SQL by Julian Hyde
Streaming SQLStreaming SQL
Streaming SQL
Julian Hyde6.3K views
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ... by Julian Hyde
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Planning with Polyalgebra: Bringing Together Relational, Complex and Machine ...
Julian Hyde2.7K views
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident by Julian Hyde
 Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Julian Hyde3.3K views

Recently uploaded

HarshithAkkapelli_Presentation.pdf by
HarshithAkkapelli_Presentation.pdfHarshithAkkapelli_Presentation.pdf
HarshithAkkapelli_Presentation.pdfharshithakkapelli
11 views16 slides
.NET Developer Conference 2023 - .NET Microservices mit Dapr – zu viel Abstra... by
.NET Developer Conference 2023 - .NET Microservices mit Dapr – zu viel Abstra....NET Developer Conference 2023 - .NET Microservices mit Dapr – zu viel Abstra...
.NET Developer Conference 2023 - .NET Microservices mit Dapr – zu viel Abstra...Marc Müller
38 views62 slides
MariaDB stored procedures and why they should be improved by
MariaDB stored procedures and why they should be improvedMariaDB stored procedures and why they should be improved
MariaDB stored procedures and why they should be improvedFederico Razzoli
8 views32 slides
DSD-INT 2023 The Danube Hazardous Substances Model - Kovacs by
DSD-INT 2023 The Danube Hazardous Substances Model - KovacsDSD-INT 2023 The Danube Hazardous Substances Model - Kovacs
DSD-INT 2023 The Danube Hazardous Substances Model - KovacsDeltares
8 views17 slides
Airline Booking Software by
Airline Booking SoftwareAirline Booking Software
Airline Booking SoftwareSharmiMehta
5 views26 slides
Agile 101 by
Agile 101Agile 101
Agile 101John Valentino
7 views20 slides

Recently uploaded(20)

.NET Developer Conference 2023 - .NET Microservices mit Dapr – zu viel Abstra... by Marc Müller
.NET Developer Conference 2023 - .NET Microservices mit Dapr – zu viel Abstra....NET Developer Conference 2023 - .NET Microservices mit Dapr – zu viel Abstra...
.NET Developer Conference 2023 - .NET Microservices mit Dapr – zu viel Abstra...
Marc Müller38 views
MariaDB stored procedures and why they should be improved by Federico Razzoli
MariaDB stored procedures and why they should be improvedMariaDB stored procedures and why they should be improved
MariaDB stored procedures and why they should be improved
DSD-INT 2023 The Danube Hazardous Substances Model - Kovacs by Deltares
DSD-INT 2023 The Danube Hazardous Substances Model - KovacsDSD-INT 2023 The Danube Hazardous Substances Model - Kovacs
DSD-INT 2023 The Danube Hazardous Substances Model - Kovacs
Deltares8 views
Airline Booking Software by SharmiMehta
Airline Booking SoftwareAirline Booking Software
Airline Booking Software
SharmiMehta5 views
Fleet Management Software in India by Fleetable
Fleet Management Software in India Fleet Management Software in India
Fleet Management Software in India
Fleetable11 views
DSD-INT 2023 Wave-Current Interaction at Montrose Tidal Inlet System and Its ... by Deltares
DSD-INT 2023 Wave-Current Interaction at Montrose Tidal Inlet System and Its ...DSD-INT 2023 Wave-Current Interaction at Montrose Tidal Inlet System and Its ...
DSD-INT 2023 Wave-Current Interaction at Montrose Tidal Inlet System and Its ...
Deltares10 views
A first look at MariaDB 11.x features and ideas on how to use them by Federico Razzoli
A first look at MariaDB 11.x features and ideas on how to use themA first look at MariaDB 11.x features and ideas on how to use them
A first look at MariaDB 11.x features and ideas on how to use them
Federico Razzoli45 views
FIMA 2023 Neo4j & FS - Entity Resolution.pptx by Neo4j
FIMA 2023 Neo4j & FS - Entity Resolution.pptxFIMA 2023 Neo4j & FS - Entity Resolution.pptx
FIMA 2023 Neo4j & FS - Entity Resolution.pptx
Neo4j6 views
Dapr Unleashed: Accelerating Microservice Development by Miroslav Janeski
Dapr Unleashed: Accelerating Microservice DevelopmentDapr Unleashed: Accelerating Microservice Development
Dapr Unleashed: Accelerating Microservice Development
Miroslav Janeski10 views
Software testing company in India.pptx by SakshiPatel82
Software testing company in India.pptxSoftware testing company in India.pptx
Software testing company in India.pptx
SakshiPatel827 views
DSD-INT 2023 Leveraging the results of a 3D hydrodynamic model to improve the... by Deltares
DSD-INT 2023 Leveraging the results of a 3D hydrodynamic model to improve the...DSD-INT 2023 Leveraging the results of a 3D hydrodynamic model to improve the...
DSD-INT 2023 Leveraging the results of a 3D hydrodynamic model to improve the...
Deltares6 views
DSD-INT 2023 Salt intrusion Modelling of the Lauwersmeer, towards a measureme... by Deltares
DSD-INT 2023 Salt intrusion Modelling of the Lauwersmeer, towards a measureme...DSD-INT 2023 Salt intrusion Modelling of the Lauwersmeer, towards a measureme...
DSD-INT 2023 Salt intrusion Modelling of the Lauwersmeer, towards a measureme...
Deltares5 views
Navigating container technology for enhanced security by Niklas Saari by Metosin Oy
Navigating container technology for enhanced security by Niklas SaariNavigating container technology for enhanced security by Niklas Saari
Navigating container technology for enhanced security by Niklas Saari
Metosin Oy13 views

Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

  • 1. Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources Edmon Begoli, Jesú s Camacho-Rodrı́guez, Julian Hyde, Michael J. Mior, Daniel Lemire 2018 SIGMOD, Houston, Texas, USA
  • 2. Outline Background and History Architecture Adapter Design Optimizer and Planner Adoption Uses in Research and Scholastic Potential Roadmap and Future Work
  • 3. What is Calcite? Apache Calcite is an extensible framework for building data management systems. It is an open source project governed by the Apache Software Foundation, is written in Java, and is used by dozens of projects and companies, and several research projects.
  • 4. Origins and Design Principles Origins 2004 – LucidEra and SQLstream were each building SQL systems; 2012 – Pare down code base, enter Apache as incubator project Problem Building a high-quality database requires ~ 20 person years (effort) and 5 years (elapsed) Solution Create an open source framework that a community can contribute to, and use to build their own DBMSs Design principles Flexible → Relational algebra Extensible/composable → Volcano-style planner Easy to contribute to → Java, FP style Alternatives PostgreSQL, Apache Spark, AsterixDB
  • 5. Architecture Core – Operator expressions (relational algebra) and planner (based on Volcano/Cascades) External – Data storage, algorithms and catalog Optional – SQL parser, JDBC & ODBC drivers Extensible – Planner rewrite rules, statistics, cost model, algebra, UDFs
  • 6. Adapter Design A pattern that defines how Calcite incorporates diverse data sources for general access. Model – specification of the physical properties of the data source. Schema – definition of the data (format and layouts) found in the model.
  • 7. Represent query as relational algebra MySQL Splunk join Key: productId group Key: productName Agg: count filter Condition: action = 'purchase' sort Key: c desc scan scan Table: products Table: splunk select p.productName, count(*) as c from splunk.splunk as s join mysql.products as p on s.productId = p.productId where s.action = 'purchase' group by p.productName order by c desc
  • 8. Optimize query by applying transformation rules MySQL Splunk join Key: productId group Key: productName Agg: count filter Condition: action = 'purchase' sort Key: c desc scan scan Table: splunk Table: products select p.productName, count(*) as c from splunk.splunk as s join mysql.products as p on s.productId = p.productId where s.action = 'purchase' group by p.productName order by c desc
  • 9. 1. Plans start as logical nodes. 3. Fire rules to propagate conventions to other nodes. 2. Assign each Scan its table’s native convention. 4. The best plan may use an engine not tied to any native format. To implement, generate a program that calls out to query1 and query2. Join Filter Scan ScanScan Join Conventions Join Filter Scan ScanScan Join Scan ScanScan Join Filter Join Join Filter Scan ScanScan Join
  • 10. Conventions & adapters Scan Scan Join Filter Join Scan Convention provides a uniform representation for hybrid queries Like ordering and distribution, convention is a physical property of nodes Adapter = schema factory (lists tables) + convention + rules to convert nodes to convention
  • 11. Stream ~= append-only table Streaming queries return deltas Stream-table duality: Orders is used as both stream and table Our contributions: ➢ Popularize streaming SQL ➢ SQL parser / validator / rules ➢ Reference implementation & TCK select stream * from Orders as o where units > ( select avg(units) from Orders as h where h.productId = o.productId and h.rowtime > o.rowtime - interval ‘1’ year) “Show me real-time orders whose size is larger than the average for that product over the preceding year” Streaming SQL
  • 13. Uses in Research ● Polystore research – use as lightweight heterogeneous data processing platform ● Optimization and query profiling – general performance, and optimizer research ● Reasoning over Streams, Graphs – under consideration ● Open-source, production grade learning and research platform
  • 14. Future Work and Roadmap ● Support its use as a standalone engine – DDL, materialized views, indexes and constraints. ● Improvements to the design and extensibility of the planner (modularity, pluggability) ● Incorporation of new parametric approaches into the design of the optimizer. ● Support for an extended set of SQL commands, functions, and utilities, including full compliance with OpenGIS (spatial). ● New adapters for non-relational data sources such as array databases. ● Improvements to performance profiling and instrumentation.
  • 18. Calcite framework Cost, statistics RelOptCost RelOptCostFactory RelMetadataProvider • RelMdColumnUniquensss • RelMdDistinctRowCount • RelMdSelectivity SQL parser SqlNode SqlParser SqlValidator Transformation rules RelOptRule • FilterMergeRule • AggregateUnionTransposeRule • 100+ more Global transformations • Unification (materialized view) • Column trimming • De-correlation Relational algebra RelNode (operator) • TableScan • Filter • Project • Union • Aggregate • … RelDataType (type) RexNode (expression) RelTrait (physical property) • RelConvention (calling-convention) • RelCollation (sortedness) • RelDistribution (partitioning) RelBuilder JDBC driver Metadata Schema Table Function • TableFunction • TableMacro Lattice
  • 19. Avatica ● Database connectivity stack ● Self-contained sub-project of Calcite ● Fast, open, stable ● Protobuf or JSON over HTTP ● Powers Phoenix Query Server
  • 20. Lattice (optimized) () 1 (z, s, g, y, m) 912k (s, g, y, m) 6k (z) 43k (s) 50 (g) 2 (y) 5 (m) 12 (z, g, y, m) 909k (z, s, y, m) 831k raw 1m (z, s, g, m) 644k (z, s, g, y) 392k (y, m) 60 (z, s) 43.4k (z, s, g) 83.6k (g, y) 10 (g, y, m) 120 (g, m) 24 Key z zipcode (43k) s state (50) g gender (2) y year (5) m month (12)
  • 21. Aggregation and windows on streams GROUP BY aggregates multiple rows into sub-totals ➢ In regular GROUP BY each row contributes to exactly one sub-total ➢ In multi-GROUP BY (e.g. HOP, GROUPING SETS) a row can contribute to more than one sub-total Window functions (OVER) leave the number of rows unchanged, but compute extra expressions for each row (based on neighboring rows) Multi GROUP BY Window functions GROUP BY
  • 22. Tumbling, hopping & session windows in SQL Tumbling window Hopping window Session window select stream … from Orders group by floor(rowtime to hour) select stream … from Orders group by tumble(rowtime, interval ‘1’ hour) select stream … from Orders group by hop(rowtime, interval ‘1’ hour, interval ‘2’ hour) select stream … from Orders group by session(rowtime, interval ‘1’ hour)
  • 23. Controlling when data is emitted Early emission is the defining characteristic of a streaming query. The emit clause is a SQL extension inspired by Apache Beam’s “trigger” notion. (Still experimental… and evolving.) A relational (non-streaming) query is just a query with the most conservative possible emission strategy. select stream productId, count(*) as c from Orders group by productId, floor(rowtime to hour) emit at watermark, early interval ‘2’ minute, late limit 1; select * from Orders emit when complete;