ONE FOR ALL!
Using Apache Calcite to Make SQL Smart
Evans Ye
Technical Expert @ Alibaba
DataCon.TW 2018
Evans Ye
• Alibaba MaxCompute team
• One of the world's leading cloud-based data warehouses
• Apache Member, Apache Bigtop PMC, former VP
• Director of Taiwan Data Engineering Association (TDEA)
Agenda
• SQL -> NoSQL -> NewSQL
• Introduction to Apache Calcite
• How Optimizer Works
• Shuffle Optimization in MaxCompute 2.0
• Materialized View
• Stream SQL
Way back, when Big Data wasn't there…
Okay, let’s make big data work
My database can’t handle big data…
User Developer
Join is not supported.
Let’s just get rid of SQL and call it NoSQL
Can I use SQL to query my data?
How do I do a join?
Just kidding. NoSQL = Not only SQL
User Developer
NoSQL: -ACID, +Fault Tolerance, +Scalability, -SQL, +Unstructured Data
SQL: +ACID, -Fault Tolerance, -Scalability, +SQL, -Unstructured Data
Let’s find some patterns
through real-world cases
Pattern I
• Phase I: Make it work
• Hive QL
• Phase II: Make it fast/efficient
• The Stinger Initiative: Making Apache Hive 100
Times Faster
• Phase III: Make it easy to use
• Standard SQL with ACID support
Pattern II
• Phase I: Make it work
• Spark RDD
• Phase II: Make it fast/efficient
• Project Tungsten: Bringing Apache Spark Closer to
Bare Metal
• Phase III: Make it easy to use
• Spark SQL, pySpark, sparkR
Generalize It
• Phase I: Make it work
• Hadoop ecosystem
• Phase II: Make it fast/efficient
• In-memory Computing, Off-heap, Caching, etc
• Phase III: Make it easy to use
• User friendly APIs, SQL interface
Hadoop Ecosystem SQL Adoptions
System | Query Language
Apache Drill | SQL + extensions
Apache Hive | SQL + extensions
Apache Solr | SQL
Apache Phoenix | SQL
Apache Kylin | SQL
Apache Apex | Streaming SQL
Apache Flink | Streaming SQL
Apache Samza | Streaming SQL
Apache Storm | Streaming SQL
Why SQL?
• Universal standard
• Low entry barrier; people know it
• Integration with 3rd party apps such as BI tools
• Detach user interface from actual implementation,
making query optimization possible
NewSQL
• Combining the good parts of SQL and NoSQL
• +Fault Tolerance
• +Scalability
• +Unstructured / Semi Structured Data
• +SQL
• +ACID
Agenda
• SQL -> NoSQL -> NewSQL
• Introduction to Apache Calcite
• How Optimizer Works
• Shuffle Optimization in MaxCompute 2.0
• Materialized View
• Stream SQL
Apache Calcite
• Apache top-level project since Oct. 2015
• Led by Julian Hyde (Hortonworks -> Looker)
• Latest version: 1.17 released July 2018

Apache Calcite is a dynamic data management framework
WAT?
Let’s put it this way
• A database without:
• Storage of data
• Algorithms to process data
• Storage of metadata
Conventional DB Architecture
SQL Parser/Validator
Query Optimizer
Operators
Storage Engine
JDBC Server
Meta
Storage of
metadata
Algorithms to
process data
Storage of data
What Calcite Implements
SQL Parser/Validator
Query Optimizer
Operators
Storage Engine
JDBC Server
Meta
Storage of
metadata
Algorithms to
process data
Storage of data
The beauty of software architecture
Embedded
SQL Parser/Validator
Query Optimizer
Phoenix Operators
HBase
JDBC Server
Meta
Embedded
Hive Parser
Query Optimizer
Hive Operators
HDFS
Hive Server / CLI
Metastore
Query Federation
SQL Parser/Validator
Query Optimizer
JDBC Server
Meta
Example 1: A query to join Kafka and HBase in Spark
Example 2: A query to aggregate Hive and MySQL data
Powered by Apache Calcite
https://calcite.apache.org/docs/powered_by.html
Key Features
• JDBC driver (Avatica, a sub project of Calcite)
• SQL Parser/Validator (JavaCC)
• Query Optimizer
• Rule-based / Cost-based Optimizer
• A bunch of built-in optimization rules
• Several Adapters out-of-the-box
• Materialized View support
Agenda
• SQL -> NoSQL -> NewSQL
• Introduction to Apache Calcite
• How Optimizer Works
• Shuffle Optimization in MaxCompute 2.0
• Materialized View
• Stream SQL
Example
[Figure: operator tree of the example query before and after optimization, built from Scan products, Scan sales, Join, Filter, Aggregate, and Sort]
Query Optimization
• Merge projections
• Convert sub-queries to joins
• Reorder joins
• Push down filters
• Push down projections
• And more…
General Idea:

Reduce the amount of data to be
processed as early as possible
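The idea of reducing data early can be illustrated with filter push-down. The following is a minimal Python sketch, not Calcite code: the `products` and `sales` data, the `hash_join` helper, and the two plan functions are all hypothetical names invented for this illustration. Both plans return the same rows, but the pushed-down plan makes the join produce far fewer rows.

```python
# Toy data, invented for illustration: 100 products (10 of them books), 1000 sales.
products = [{"id": i, "category": "book" if i % 10 == 0 else "other"} for i in range(100)]
sales = [{"product_id": i % 100, "amount": i} for i in range(1000)]


def hash_join(build_side, probe_side, build_key, probe_key):
    """Simple in-memory hash join: index the build side, then probe it."""
    index = {}
    for row in build_side:
        index.setdefault(row[build_key], []).append(row)
    return [{**b, **p} for p in probe_side for b in index.get(p[probe_key], [])]


def plan_filter_after_join():
    """Join everything first, then filter: the join emits every sale."""
    joined = hash_join(products, sales, "id", "product_id")
    result = [row for row in joined if row["category"] == "book"]
    return result, len(joined)


def plan_filter_pushed_down():
    """Filter products first, then join: the join only emits book sales."""
    books = [p for p in products if p["category"] == "book"]
    joined = hash_join(books, sales, "id", "product_id")
    return joined, len(joined)


late_result, late_join_rows = plan_filter_after_join()
early_result, early_join_rows = plan_filter_pushed_down()
print(late_join_rows, early_join_rows)  # rows produced by the join in each plan
```

The two plans are equivalent in result, which is what allows an optimizer to pick the cheaper one.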
A join B join C:

(A join B) join C ?

A join (B join C) ?
BUT…
A join B:

Broadcast Hash Join ?

Shuffled Hash Join ?
Sort Merge Join ?
WHAT ALGORITHM TO CHOOSE?
Cost-based Optimizer
• Takes statistics into consideration and selects the plan with the cheapest execution cost
• row count
• CPU cost
• Disk I/O cost
• Network I/O cost
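As a rough illustration of how a cost model can answer the "what algorithm to choose?" question, here is a small Python sketch. Everything in it is an assumption made for this example: the cost formulas, the weights in `total()`, the `memory_rows` budget, and the function names are invented and do not reflect Calcite's or MaxCompute's actual cost model.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Cost:
    cpu: float
    disk_io: float
    network_io: float

    def total(self) -> float:
        # Assumed weights: network is the most expensive resource, then disk.
        return self.cpu + 4 * self.disk_io + 10 * self.network_io


INFEASIBLE = Cost(math.inf, math.inf, math.inf)


def broadcast_hash_join(small, big, workers, memory_rows):
    # The whole small side is copied to every worker and must fit in memory.
    if small > memory_rows:
        return INFEASIBLE
    return Cost(cpu=small + big, disk_io=0, network_io=small * workers)


def shuffled_hash_join(small, big, workers, memory_rows):
    # Both sides are shuffled once; each worker builds a hash table on its slice.
    if small / workers > memory_rows:
        return INFEASIBLE
    return Cost(cpu=small + big, disk_io=0, network_io=small + big)


def sort_merge_join(small, big, workers, memory_rows):
    # Both sides are shuffled and sorted; assumed to spill to disk, so always feasible.
    sort_cpu = sum(n * math.log2(max(n, 2)) for n in (small, big))
    return Cost(cpu=sort_cpu + small + big, disk_io=small + big, network_io=small + big)


def choose_join(left_rows, right_rows, workers=100, memory_rows=1_000_000):
    """Return the name of the cheapest feasible join algorithm."""
    small, big = sorted((left_rows, right_rows))
    algorithms = (broadcast_hash_join, shuffled_hash_join, sort_merge_join)
    costs = {f.__name__: f(small, big, workers, memory_rows) for f in algorithms}
    return min(costs, key=lambda name: costs[name].total())


print(choose_join(1_000, 10_000_000))
print(choose_join(10_000_000, 10_000_000))
print(choose_join(1_000_000_000, 1_000_000_000))
```

The point of the sketch is only that the choice flips as the row-count statistics change, which is why the optimizer needs them.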
Calcite’s VolcanoPlanner
• Based on the paper "The Volcano optimizer generator: Extensibility and efficient search"
• An implementation of a cost-based optimizer
• Applies rules iteratively and selects the plan with the cheapest cost
• Dynamic programming -> avoids duplicate search
• Heuristic stop point:
• 1) Exhaustively explored, 2) Certain time elapsed, 3) Cost has not improved for several iterations
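The search loop described here can be sketched in a few lines of Python. This is a heavily simplified illustration, not VolcanoPlanner itself: plans are nested tuples, the only rules are join commutativity and associativity, the `seen` set stands in for the planner's memo, an iteration cap stands in for the time limit, and the table sizes and the fixed 1/1000 join selectivity are invented.

```python
# Assumed table statistics (row counts), invented for illustration.
TABLE_ROWS = {"A": 1_000_000, "B": 1_000, "C": 10}


def rows(plan):
    """Estimated output rows; every join is assumed to keep 1/1000 of the cross product."""
    if isinstance(plan, str):
        return TABLE_ROWS[plan]
    _, left, right = plan
    return rows(left) * rows(right) // 1000


def cost(plan):
    """Cost = total rows produced by all join nodes in the plan."""
    if isinstance(plan, str):
        return 0
    _, left, right = plan
    return cost(left) + cost(right) + rows(plan)


def neighbors(plan):
    """Yield every plan reachable by applying one rule at one node."""
    if isinstance(plan, str):
        return
    _, left, right = plan
    yield ("join", right, left)                    # commutativity
    if isinstance(left, tuple):                    # associativity
        _, a, b = left
        yield ("join", a, ("join", b, right))
    for new_left in neighbors(left):
        yield ("join", new_left, right)
    for new_right in neighbors(right):
        yield ("join", left, new_right)


def optimize(plan, max_iterations=1000, patience=50):
    seen = {plan}            # avoid exploring the same plan twice
    frontier = [plan]
    best, best_cost = plan, cost(plan)
    iterations = stale = 0
    # Stop when exhausted, when the iteration budget is spent, or when cost stops improving.
    while frontier and iterations < max_iterations and stale < patience:
        current = frontier.pop()
        iterations += 1
        improved = False
        for candidate in neighbors(current):
            if candidate in seen:
                continue
            seen.add(candidate)
            frontier.append(candidate)
            candidate_cost = cost(candidate)
            if candidate_cost < best_cost:
                best, best_cost = candidate, candidate_cost
                improved = True
        stale = 0 if improved else stale + 1
    return best, best_cost


best_plan, best_plan_cost = optimize(("join", ("join", "A", "B"), "C"))
print(best_plan, best_plan_cost)
```

With these numbers the search moves the join of the two small tables, B and C, ahead of the join with the large table A.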
Transformer Rule
• Pattern matching
Converter Rule
• Converts from one Convention to another
• A Convention is used to represent a data source
[Figure: a Flink plan in FlinkConventions.LOGICAL (FlinkLogicalJoin over two FlinkLogicalNativeTableScan nodes) converted to FlinkConventions.DATASTREAM (DataStreamJoin over two DataStreamScan nodes)]
Agenda
• SQL -> NoSQL -> NewSQL
• Introduction to Apache Calcite
• How Optimizer Works
• Shuffle Optimization in MaxCompute 2.0
• Materialized View
• Stream SQL
Sort Merge Join
[Figure: SortMergeJoin over two Physical TableScan inputs, each followed by Sort by Hash(key); the job runs as Map, Shuffle, and Reduce stages, with Hash=0 partitions shown on both sides]
What if the data has already been properly distributed and sorted?
Shuffle Optimization
[Figure: SortMergeJoin reading directly from two Physical TableScan inputs in a single Map stage, with Hash=0 partitions shown on both sides]
≈2X speed up!!
Calcite Trait
• A physical property associated with an operator
• 3 primary trait types:
• Convention: data source (we’ve seen this)
• Collation: sort order
• Distribution: Hash or Range distributed
Achieved via Calcite Optimizer
• MaxCompute embeds Calcite’s Cost-based Optimizer
• Models data distribution and sort order as Calcite Traits
• If the required traits are satisfied, there is no need to shuffle
[Figure: SortMergeJoin over two Physical TableScan inputs, each annotated with hash(key) and sort(key)]
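The trait check behind this optimization can be sketched as follows. This is a Python illustration under stated assumptions, not the Calcite or MaxCompute API: the `Traits` class, `satisfies`, and `plan_join_input` are hypothetical names, an empty tuple is used to mean "no guarantee", and a sort order is treated as satisfying a requirement when the required keys are a prefix of it.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Traits:
    distribution: Tuple[str, ...]  # hash-distribution keys; () means no guarantee
    collation: Tuple[str, ...]     # sort keys in order; () means unsorted


def satisfies(actual: Traits, required: Traits) -> bool:
    """True if data with `actual` traits can be used where `required` is needed."""
    dist_ok = required.distribution == () or actual.distribution == required.distribution
    coll_ok = actual.collation[: len(required.collation)] == required.collation
    return dist_ok and coll_ok


def plan_join_input(table: str, actual: Traits, required: Traits):
    """Return the operators needed to feed one side of a sort merge join."""
    steps = [f"PhysicalTableScan({table})"]
    if not satisfies(actual, required):
        # Simplification: enforce both traits whenever either one is missing.
        steps.append(f"Shuffle(hash={','.join(required.distribution)})")
        steps.append(f"Sort({','.join(required.collation)})")
    return steps


required = Traits(distribution=("key",), collation=("key",))
clustered = Traits(distribution=("key",), collation=("key", "ts"))
unclustered = Traits(distribution=(), collation=())

print(plan_join_input("sales", clustered, required))
print(plan_join_input("sales", unclustered, required))
```

When the stored table already carries the required traits, the plan collapses to a bare scan, which is the case the slides describe as needing no shuffle.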
Agenda
• SQL -> NoSQL -> NewSQL
• Introduction to Apache Calcite
• Deep Dive into How Optimizer Works
• Shuffle Optimization in MaxCompute 2.0
• Materialized View
• Stream SQL
Materialized View Rewriting
• A materialized view is a database object that contains the results of a query
• Incoming queries are automatically rewritten to use the Materialized View
• Idea implemented in Calcite: "Optimizing queries using materialized views: A practical, scalable solution"
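To make the rewriting idea concrete, here is a small Python sketch using the standard-library `sqlite3` module. It is only an illustration under assumptions: SQLite has no materialized views, so a table created with `CREATE TABLE ... AS SELECT` stands in for one, the rewrite is written by hand rather than derived by an optimizer, and the `sales` schema and view name are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (product_id INTEGER, region TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        (1, 'north', 10), (1, 'south', 20), (2, 'north', 5), (2, 'north', 7);

    -- Stand-in for a materialized view: the stored result of an aggregate query.
    CREATE TABLE mv_sales_by_product_region AS
        SELECT product_id, region, SUM(amount) AS total
        FROM sales
        GROUP BY product_id, region;
    """
)

# Incoming query against the base table.
original_query = """
    SELECT product_id, SUM(amount)
    FROM sales
    GROUP BY product_id
    ORDER BY product_id
"""

# Equivalent query over the pre-aggregated view: it rolls up the stored totals
# instead of scanning every sale.
rewritten_query = """
    SELECT product_id, SUM(total)
    FROM mv_sales_by_product_region
    GROUP BY product_id
    ORDER BY product_id
"""

original_rows = conn.execute(original_query).fetchall()
rewritten_rows = conn.execute(rewritten_query).fetchall()
print(original_rows, rewritten_rows)
```

The rewrite is valid here because SUM can be rolled up from the finer `(product_id, region)` grouping to the coarser `product_id` grouping.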
• Rewriting:
• Materialized View Definition:
• Query:
Example
Agenda
• SQL -> NoSQL -> NewSQL
• Introduction to Apache Calcite
• Deep Dive into How Optimizer Works
• MaxCompute Shuffle Optimization
• Materialized View
• Stream SQL
Stream SQL
• The STREAM keyword tells the system that the user is interested in incoming records, not existing ones
Join in Stream SQL
• Stream-Stream join achieved with a time window expression
[Figure: Orders and Shipments streams]
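The semantics of a windowed stream-stream join can be sketched in Python. This is an illustration only, not how Calcite or any engine executes it: the one-hour window, the event tuple layout, and the `windowed_join` function are assumptions, events are assumed to arrive ordered by time, and each order id is assumed to appear once.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)  # assumed window: shipment within one hour of the order


def windowed_join(events):
    """Join Orders with Shipments that arrive within WINDOW of the order.

    `events` is an iterable of (rowtime, kind, order_id) tuples ordered by rowtime,
    where kind is "order" or "shipment".
    """
    pending = {}  # order_id -> rowtime of the order, kept only while its window is open
    for rowtime, kind, order_id in events:
        # The time bound is what keeps the buffered state finite on an endless stream.
        for expired in [oid for oid, t in pending.items() if rowtime - t > WINDOW]:
            del pending[expired]
        if kind == "order":
            pending[order_id] = rowtime
        elif order_id in pending:
            yield (order_id, pending[order_id], rowtime)


day = datetime(2018, 9, 1)
events = [
    (day.replace(hour=10, minute=0), "order", 1),
    (day.replace(hour=10, minute=10), "order", 2),
    (day.replace(hour=10, minute=30), "shipment", 1),  # within the window: joined
    (day.replace(hour=11, minute=30), "shipment", 2),  # 80 minutes late: dropped
]
matches = list(windowed_join(events))
print(matches)
```

Without the window, the join would have to remember every order forever, since a matching shipment could arrive at any later time.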
Recap
• NewSQL is the new industry standard
• Calcite is a highly extensible database framework:
• SQL Optimizer with a bunch of built-in rules
• Supports Query Federation
• Supports deep customization such as Shuffle Optimization in MaxCompute 2.0
• Supports Materialized View Rewriting & Stream SQL
Evans Ye
MaxCompute, Alibaba
yuhsin.yyh@alibaba-inc.com
SELECT questions 

FROM audience;
