ONE FOR ALL! Using Apache Calcite to make SQL smart

ONE FOR ALL!  
Using Apache Calcite to
Make SQL Smart
Evans Ye
Technical Expert @ Alibaba
DataCon.TW 2018

Evans Ye
• Alibaba MaxCompute team
• One of the world's leading  
cloud-based data warehouses
• Apache Member, 
Apache Bigtop PMC, former VP
• Director of Taiwan Data Engineering  
Association(TDEA)

Agenda
• SQL -> NoSQL -> NewSQL
• Introduction to Apache Calcite
• How Optimizer Works
• Shuffle Optimization in MaxCompute 2.0
• Materialized View
• Stream SQL

Way back, when Big Data
wasn't there…

Okay, let’s make  
big data work
My database can’t  
handle big data…
User Developer

Join is not supported.
Let’s just get rid of SQL
and call it NoSQL
Can I use SQL to query
my data? 
How do I do join?
Just kidding. NoSQL = Not only SQL
User Developer

NoSQL                 SQL
-ACID                 +ACID
+Fault Tolerance      -Fault Tolerance
+Scalability          -Scalability
-SQL                  +SQL
+Unstructured Data    -Unstructured Data

Let’s find some patterns
through real-world cases

Pattern I
• Phase I: Make it work
• Hive QL
• Phase II: Make it fast/efficient
• The Stinger Initiative: Making Apache Hive 100
Times Faster
• Phase III: Make it easy to use
• Standard SQL with ACID support

Pattern II
• Spark RDD
• Project Tungsten: Bringing Apache Spark Closer to
Bare Metal
• Spark SQL, pySpark, sparkR

Generalize It
• Hadoop ecosystem
• In-memory Computing, Off-heap, Caching, etc
• User friendly APIs, SQL interface

Hadoop Ecosystem SQL Adoptions
System           Query Language
Apache Drill     SQL + extensions
Apache Hive      SQL + extensions
Apache Solr      SQL
Apache Phoenix   SQL
Apache Kylin     SQL
Apache Apex      Streaming SQL
Apache Flink     Streaming SQL
Apache Samza     Streaming SQL
Apache Storm     Streaming SQL

Why SQL?
• Universal standard
• Low entry barrier, people know it
• Integration with 3rd party apps such as BI tools
• Detach user interface from actual implementation,
making query optimization possible

NewSQL
• Combining the good parts of SQL and NoSQL
• +Fault Tolerance
• +Scalability
• +Unstructured / Semi Structured Data
• +SQL
• +ACID

Apache Calcite
• Apache top-level project since Oct. 2015
• Led by Julian Hyde (Hortonworks -> Looker)
• Latest version: 1.17 released July 2018

Apache Calcite is a
dynamic data
management framework
WAT?

Let’s put it this way
• A database without:
• Storage of data
• Algorithms to process data
• Storage of metadata

Conventional DB Architecture
SQL Parser/Validator
Query Optimizer
Operators
Storage Engine
JDBC Server
Meta
Storage of
metadata
Algorithms to
process data
Storage of data

What Calcite Implements
Query Optimizer
Operators
Storage Engine
JDBC Server
Meta
Storage of
metadata
Algorithms to
process data
Storage of data

The beauty of  
software architecture

Embedded
Query Optimizer
Phoenix Operators
HBase
JDBC Server
Meta

Embedded
Hive Parser
Query Optimizer
Hive Operators
HDFS
Hive Server / CLI
Meta 
store

Query Federation
Query Optimizer
JDBC Server
Meta
MySQL
Example 1: 
A query to join  
Kafka and HBase in Spark
Example 2: 
A query to aggregate 
Hive and MySQL data

Powered by Apache Calcite
https://calcite.apache.org/docs/powered_by.html

Key Features
• JDBC driver (Avatica, a sub project of Calcite)
• SQL Parser/Validator (JavaCC)
• Query Optimizer
• Rule-based / Cost-based Optimizer
• A bunch of built-in optimization rules
• Several Adapters out-of-the-box
• Materialized View support

[Figure: two versions of the same query plan, before and after optimization. Both are built from Scan products, Scan sales, Join, Filter, Aggregate, and Sort operators.]
Query Optimization
• Merge projections
• Converting sub-queries to joins
• Reorder joins
• Push down filters
• Push down projections
• And more…

General Idea: 
Reduce the amount of data to be
processed as early as possible
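
To make the idea concrete, here is a minimal sketch of filter push-down on two toy in-memory tables. The table contents, the `join` helper, and the `is_book` predicate are all made up for illustration; this is not how Calcite represents or executes plans. Both plans return the same rows, but the second one feeds far fewer rows through the join.

```python
# Toy tables; names and data are hypothetical, for illustration only.
products = [{"id": i, "category": "book" if i % 10 == 0 else "other"}
            for i in range(100)]
sales = [{"product_id": i % 100, "amount": i} for i in range(1000)]


def join(left, right):
    """Naive hash join on left.id = right.product_id."""
    index = {}
    for p in left:
        index.setdefault(p["id"], []).append(p)
    return [{**p, **s} for s in right for p in index.get(s["product_id"], [])]


def is_book(row):
    return row["category"] == "book"


# Plan A: join everything first, filter afterwards.
joined_all = join(products, sales)
plan_a = [row for row in joined_all if is_book(row)]

# Plan B: push the filter below the join, so the join sees fewer rows.
books = [p for p in products if is_book(p)]
plan_b = join(books, sales)

print("rows produced by the join in plan A:", len(joined_all))
print("rows produced by the join in plan B:", len(plan_b))
print("same result:", plan_a == plan_b)
```
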

BUT…
A join B join C: 
(A join B) join C? 
A join (B join C)?

WHAT ALGORITHM TO CHOOSE?
A join B: 
Broadcast Hash Join? 
Shuffled Hash Join?
Sort Merge Join?

Cost-based Optimizer
• Takes statistics into consideration and selects the 
plan with the cheapest execution cost
• row count
• CPU cost
• Disk I/O cost
• Network I/O cost
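
As a rough illustration of choosing between (A join B) join C and A join (B join C) from statistics, the sketch below scores each plan by the estimated number of rows its joins produce. The row counts, selectivities, and the cost formula are assumptions invented for this example; a real optimizer such as Calcite's also weighs CPU, disk, and network cost.

```python
from itertools import combinations

# Hypothetical statistics: row count per table and selectivity per join predicate.
ROW_COUNT = {"A": 1_000_000, "B": 1_000, "C": 10}
SELECTIVITY = {frozenset({"A", "B"}): 0.001, frozenset({"B", "C"}): 0.1}


def tables_of(plan):
    """A plan is either a table name or a (left, right) tuple meaning a join."""
    if isinstance(plan, str):
        return {plan}
    left, right = plan
    return tables_of(left) | tables_of(right)


def cardinality(tables):
    """Estimated rows after joining `tables`: product of sizes times selectivities."""
    rows = 1.0
    for t in tables:
        rows *= ROW_COUNT[t]
    for pair in combinations(sorted(tables), 2):
        rows *= SELECTIVITY.get(frozenset(pair), 1.0)
    return rows


def cost(plan):
    """Toy cost: total estimated rows produced by every join in the plan."""
    if isinstance(plan, str):
        return 0.0
    left, right = plan
    return cost(left) + cost(right) + cardinality(tables_of(plan))


candidates = [(("A", "B"), "C"), ("A", ("B", "C"))]
best = min(candidates, key=cost)
for plan in candidates:
    print(plan, "estimated cost:", cost(plan))
print("cheapest:", best)
```

With these made-up numbers, joining the two small tables first wins because the intermediate result stays small.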

Calcite’s VolcanoPlanner
• The Volcano optimizer generator: Extensibility and efficient
search
• An implementation of Cost-based Optimizer
• Apply rules iteratively, select plan with cheapest cost
• Dynamic programming -> avoid duplicate search
• Heuristic stop point
• 1) Exhaustively explored, 2) Certain time elapsed, 3)
cost has not improved for several iterations
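
The loop described above can be sketched in a few lines. This is a toy search under stated assumptions, not the VolcanoPlanner itself: plans are nested tuples, the two rewrite rules (`commute`, `associate`), the cost model in `estimate`, and the `patience` stop criterion are all hypothetical, and a plain set stands in for the memo that prevents duplicate exploration.

```python
from collections import deque

ROWS = {"A": 1_000_000, "B": 1_000, "C": 10}  # hypothetical table sizes


def estimate(plan):
    """Return (output_rows, cumulative_cost) under a made-up cost model."""
    if isinstance(plan, str):
        return ROWS[plan], 0
    (l_rows, l_cost), (r_rows, r_cost) = estimate(plan[0]), estimate(plan[1])
    return min(l_rows, r_rows), l_cost + r_cost + l_rows * r_rows


def commute(plan):
    if isinstance(plan, tuple):
        yield (plan[1], plan[0])


def associate(plan):
    if isinstance(plan, tuple) and isinstance(plan[0], tuple):
        (a, b), c = plan
        yield (a, (b, c))


RULES = [commute, associate]


def rewrites(plan):
    """Plans reachable by firing one rule at the root or inside a subtree."""
    if not isinstance(plan, tuple):
        return
    for rule in RULES:
        yield from rule(plan)
    left, right = plan
    for new_left in rewrites(left):
        yield (new_left, right)
    for new_right in rewrites(right):
        yield (left, new_right)


def search(initial, patience=50):
    memo = {initial}                 # plans already seen; avoids duplicate search
    queue = deque([initial])
    best, best_cost, stale = initial, estimate(initial)[1], 0
    # Stop when the space is exhausted or the cost has not improved for a while.
    while queue and stale < patience:
        plan = queue.popleft()
        stale += 1
        for candidate in rewrites(plan):
            if candidate in memo:
                continue
            memo.add(candidate)
            queue.append(candidate)
            candidate_cost = estimate(candidate)[1]
            if candidate_cost < best_cost:
                best, best_cost, stale = candidate, candidate_cost, 0
    return best, best_cost, len(memo)


best, best_cost, explored = search((("A", "B"), "C"))
print(best, best_cost, "plans explored:", explored)
```
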

Transformer Rule
• Pattern matching

Converter Rule
• Convert from one Convention to another
• Convention is used to represent a data source
[Figure: a Flink plan in FlinkConventions.LOGICAL (FlinkLogicalJoin over two FlinkLogicalNativeTableScan) converted to FlinkConventions.DATASTREAM (DataStreamJoin over two DataStreamScan)]

Sort Merge Join
[Figure: SortMergeJoin over two inputs, each a Physical TableScan followed by Sort by Hash(key). Stages shown: Map, Shuffle, Reduce. Matching partitions are joined together (Hash=0 with Hash=0).]
What if the data has been
properly distributed, 
and sorted?

Shuffle Optimization
[Figure: SortMergeJoin directly over two Physical TableScans within a single Map stage. Matching partitions are joined together (Hash=0 with Hash=0).]
≈2X speed up!!

Calcite Trait
• Physical property associated with an operator
• 3 primary trait types:
• Convention: data source (we’ve seen this)
• Collation: sort order
• Distribution: Hash or Range distributed

Achieved via Calcite Optimizer
• MaxCompute embeds Calcite’s Cost-based Optimizer
• Model data distribution, sort order as Calcite Traits
• If required traits are satisfied, then no need to shuffle
[Figure: SortMergeJoin over two Physical TableScans, each annotated with hash(key) and sort(key)]
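
A minimal sketch of the trait check, assuming a simplified model in which each input carries only a hash distribution and a sort order. The `Traits` class and `plan_input` function are hypothetical names invented here; they do not mirror Calcite's `RelTraitSet` API or MaxCompute's implementation. The point is only that a shuffle and sort are added when the provided traits fall short of what the join requires, and skipped otherwise.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Traits:
    hash_keys: Optional[Tuple[str, ...]] = None  # None: no known hash distribution
    sort_keys: Tuple[str, ...] = ()              # empty: no known sort order


def plan_input(table, provided, required):
    """Return the operators needed to feed `table` to a consumer needing `required`."""
    steps = [f"Scan({table})"]
    needs_shuffle = (required.hash_keys is not None
                     and provided.hash_keys != required.hash_keys)
    # A sort on (a, b) also satisfies a required sort on (a): prefix match.
    sorted_ok = provided.sort_keys[:len(required.sort_keys)] == required.sort_keys
    if needs_shuffle:
        steps.append("Shuffle(hash " + ",".join(required.hash_keys) + ")")
    # A shuffle destroys any existing order, so a sort must follow it.
    if required.sort_keys and (needs_shuffle or not sorted_ok):
        steps.append("Sort(" + ",".join(required.sort_keys) + ")")
    return steps


# What a sort merge join on `key` asks of each input.
required = Traits(hash_keys=("key",), sort_keys=("key",))

unorganized = plan_input("sales", Traits(), required)
organized = plan_input("products", Traits(("key",), ("key",)), required)
print(unorganized)
print(organized)
```
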

Agenda
• Deep Dive into How Optimizer Works
• Shuffle Optimization in MaxCompute 2.0
• Stream SQL

Materialized View Rewriting
• A materialized view is a database object that
contains the results of a query
• Automatically rewrite incoming queries using
Materialized View
• Idea implemented in Calcite:  
Optimizing queries using materialized views: A
practical, scalable solution
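
To show what such a rewrite buys, here is a small sketch using Python's built-in `sqlite3`. Everything in it is hypothetical: the `sales` table, its data, and the `mv_sales_by_product_region` summary table, which stands in for a materialized view because SQLite has none. The rewrite is done by hand here, whereas an optimizer would derive it automatically; the sketch only checks that the query against the base table and the rewritten query against the pre-aggregated data agree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (product_id INTEGER, region TEXT, amount INTEGER);
    INSERT INTO sales VALUES (1, 'north', 10), (1, 'south', 20),
                             (2, 'north', 5),  (2, 'north', 7);
    -- Stand-in for a materialized view: the query result stored as a table.
    CREATE TABLE mv_sales_by_product_region AS
        SELECT product_id, region, SUM(amount) AS total
        FROM sales
        GROUP BY product_id, region;
""")

# Incoming query against the base table.
original = ("SELECT product_id, SUM(amount) FROM sales "
            "GROUP BY product_id ORDER BY product_id")

# Hand-written rewrite: roll up the pre-aggregated rows instead.
rewritten = ("SELECT product_id, SUM(total) FROM mv_sales_by_product_region "
             "GROUP BY product_id ORDER BY product_id")

original_rows = conn.execute(original).fetchall()
rewritten_rows = conn.execute(rewritten).fetchall()
print(original_rows, rewritten_rows)
```
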

Example
• Materialized View Definition:
• Query:
• Rewriting:

Agenda
• Deep Dive into How Optimizer Works
• MaxCompute Shuffle Optimization
• Stream SQL

Stream SQL
• The STREAM keyword tells the system that the user is
interested in incoming records, not existing ones

Join in Stream SQL
• Stream-Stream join 
achieved with time 
window expression
[Figure: the Orders and Shipments streams]
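
The semantics of a time-windowed join between Orders and Shipments can be sketched as follows. The event data, field layout, and one-hour window are assumptions chosen for illustration, and the function works on finite lists, whereas a streaming engine would process unbounded input and keep only the state that still falls inside the window.

```python
from datetime import datetime, timedelta

# Hypothetical events: (rowtime, order_id, product) and (rowtime, order_id).
orders = [
    (datetime(2018, 9, 1, 10, 0), 1, "pen"),
    (datetime(2018, 9, 1, 10, 5), 2, "ink"),
    (datetime(2018, 9, 1, 10, 10), 3, "cap"),
]
shipments = [
    (datetime(2018, 9, 1, 10, 30), 1),   # 30 minutes after order 1
    (datetime(2018, 9, 1, 11, 30), 2),   # 85 minutes after order 2: too late
]


def window_join(orders, shipments, window=timedelta(hours=1)):
    """Pair each order with shipments of the same order that occur within
    `window` after the order. Roughly the condition:
        o.orderId = s.orderId
        AND s.rowtime BETWEEN o.rowtime AND o.rowtime + INTERVAL '1' HOUR
    """
    ships_by_order = {}
    for ship_time, order_id in shipments:
        ships_by_order.setdefault(order_id, []).append(ship_time)
    matched = []
    for order_time, order_id, product in orders:
        for ship_time in ships_by_order.get(order_id, []):
            if order_time <= ship_time <= order_time + window:
                matched.append((order_id, product, order_time, ship_time))
    return matched


joined = window_join(orders, shipments)
for row in joined:
    print(row)
```

The time bound is what makes the join feasible on streams: without it, every order would have to be retained forever in case a matching shipment arrives.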

Recap
• NewSQL is the new industry standard
• Calcite is a highly extensible database framework:
• SQL Optimizer with a bunch of built-in rules
• Supports Query Federation
• Supports deep customization such as  
Shuffle Optimization in MaxCompute 2.0
• Supports Materialized View Rewriting & Stream SQL

Evans Ye
MaxCompute, Alibaba
yuhsin.yyh@alibaba-inc.com
SELECT questions  
FROM audience;
