When Hadoop was born, the big data world was focused on how to build systems that scale. Now the world has evolved: HBase has hit 2.0, Cassandra 3.0, Hive 3.0, and so on. Once scalability is conquered, what's next? That’s right, usability comes into play. Looking back at the history, NoSQL really just uses a divide-and-conquer mechanism to tackle big data problems by trading off SQL capabilities. But once the big data problem is solved, we see more and more NoSQL and data processing engines starting to build SQL or SQL-like interfaces. Therefore, a generic SQL engine that provides core SQL capabilities such as query parsing, relational algebra, and query optimization starts to shine.
In this talk, I'll walk you through the architecture, functionality, and design concepts of Apache Calcite. Note that Calcite itself is not a database, but many well-known systems already incorporate Calcite as a library, for instance Hive, Drill, Druid, Phoenix, Apex, Flink, Storm, Samza, and more. To better illustrate how Calcite works, I'll pick some of these systems and describe how they adopt Calcite and which parts Calcite enhances. Furthermore, I'll talk about several features that Calcite provides, such as query optimization, heterogeneous data sources, materialized views, and Stream SQL. From a user's perspective, knowing better how these systems work behind the scenes equips you with more knowledge to choose the system that ultimately suits your needs.
ONE FOR ALL! Using Apache Calcite to make SQL smart
1. ONE FOR ALL!
Using Apache Calcite to
Make SQL Smart
Evans Ye
Technical Expert @ Alibaba
DataCon.TW 2018
2. Evans Ye
• Alibaba MaxCompute team
• One of the world's leading
cloud-based data warehouses
• Apache Member,
Apache Bigtop PMC, former VP
• Director of Taiwan Data Engineering
Association(TDEA)
3. Agenda
• SQL -> NoSQL -> NewSQL
• Introduction to Apache Calcite
• How Optimizer Works
• Shuffle Optimization in MaxCompute 2.0
• Materialized View
• Stream SQL
5. Okay, let’s make
big data work
My database can’t
handle big data…
User Developer
6.
7. Join is not supported.
Let’s just get rid of SQL
and call it NoSQL
Can I use SQL to query
my data?
How do I do join?
Just kidding. NoSQL = Not only SQL
User Developer
10. Pattern I
• Phase I: Make it work
• Hive QL
• Phase II: Make it fast/efficient
• The Stinger Initiative: Making Apache Hive 100
Times Faster
• Phase III: Make it easy to use
• Standard SQL with ACID support
11. Pattern II
• Phase I: Make it work
• Spark RDD
• Phase II: Make it fast/efficient
• Project Tungsten: Bringing Apache Spark Closer to
Bare Metal
• Phase III: Make it easy to use
• Spark SQL, pySpark, sparkR
12. Generalize It
• Phase I: Make it work
• Hadoop ecosystem
• Phase II: Make it fast/efficient
• In-memory Computing, Off-heap, Caching, etc
• Phase III: Make it easy to use
• User friendly APIs, SQL interface
14. Why SQL?
• Universal standard
• Low entry barrier, people know it
• Integration with 3rd party apps such as BI tools
• Detach user interface from actual implementation,
making query optimization possible
15. NewSQL
• Combining the good parts of SQL and NoSQL
• +Fault Tolerance
• +Scalability
• +Unstructured / Semi Structured Data
• +SQL
• +ACID
16. Agenda
• SQL -> NoSQL -> NewSQL
• Introduction to Apache Calcite
• How Optimizer Works
• Shuffle Optimization in MaxCompute 2.0
• Materialized View
• Stream SQL
17. Apache Calcite
• Apache top-level project since Oct. 2015
• Led by Julian Hyde (Hortonworks -> Looker)
• Latest version: 1.17 released July 2018
19. Let’s put it this way
• A database without:
• Storage of data
• Algorithms to process data
• Storage of metadata
20. Conventional DB Architecture
SQL Parser/Validator
Query Optimizer
Operators
Storage Engine
JDBC Server
Meta
Storage of
metadata
Algorithms to
process data
Storage of data
21. What Calcite Implements
SQL Parser/Validator
Query Optimizer
Operators
Storage Engine
JDBC Server
Meta
Storage of
metadata
Algorithms to
process data
Storage of data
25. Query Federation
SQL Parser/Validator
Query Optimizer
JDBC Server
Meta
Example 2:
A query to aggregate
Hive and MySQL data
Example 1:
A query to join
Kafka and HBase in Spark
MySQL
27. Key Features
• JDBC driver (Avatica, a sub project of Calcite)
• SQL Parser/Validator (JavaCC)
• Query Optimizer
• Rule-based / Cost-based Optimizer
• A bunch of built-in optimization rules
• Several Adapters out-of-the-box
• Materialized View support
28. Agenda
• SQL -> NoSQL -> NewSQL
• Introduction to Apache Calcite
• How Optimizer Works
• Shuffle Optimization in MaxCompute 2.0
• Materialized View
• Stream SQL
33. A join B join C:
(A join B) join C ?
A join (B join C) ?
BUT…
34. A join B:
Broadcast Hash Join ?
Shuffled Hash Join ?
Sort Merge Join ?
WHAT ALGORITHM TO CHOOSE?
35. • Takes statistics into consideration and selects the
plan with the cheapest execution cost
• row count
• CPU cost
• Disk I/O cost
• Network I/O cost
Cost-based Optimizer
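As a rough illustration of how statistics can drive the choice between the three join algorithms from the previous slide, here is a minimal sketch. The cost formulas, `NUM_WORKERS`, and `MEMORY_ROWS` are all assumptions made up for this example; they are not Calcite's or MaxCompute's actual cost model.

```python
import math

NUM_WORKERS = 10           # assumed cluster size
MEMORY_ROWS = 1_000_000    # assumed number of rows a hash table can hold per worker


def broadcast_hash_cost(left, right):
    """Ship the smaller side to every worker, then hash join locally."""
    small = min(left, right)
    if small > MEMORY_ROWS:
        return math.inf                    # small side does not fit in memory
    network = small * NUM_WORKERS          # every worker receives the small side
    cpu = left + right                     # build + probe
    return network + cpu


def shuffled_hash_cost(left, right):
    """Repartition both sides by key, then hash join each partition."""
    small = min(left, right)
    if small / NUM_WORKERS > MEMORY_ROWS:
        return math.inf                    # a partition does not fit in memory
    network = left + right                 # both sides cross the network once
    cpu = left + right
    return network + cpu


def sort_merge_cost(left, right):
    """Repartition both sides, sort them, then merge."""
    network = left + right
    cpu = left * math.log2(left) + right * math.log2(right)
    return network + cpu


def choose_join(left_rows, right_rows):
    """Return the name of the algorithm with the cheapest estimated cost."""
    costs = {
        "BroadcastHashJoin": broadcast_hash_cost(left_rows, right_rows),
        "ShuffledHashJoin": shuffled_hash_cost(left_rows, right_rows),
        "SortMergeJoin": sort_merge_cost(left_rows, right_rows),
    }
    return min(costs, key=costs.get)
```

Under these assumed numbers, a tiny table joined with a huge one favors the broadcast variant, while two inputs that are both too large for memory fall back to sort merge.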
36. • The Volcano optimizer generator: Extensibility and efficient
search
• An implementation of Cost-based Optimizer
• Apply rules iteratively, select plan with cheapest cost
• Dynamic programming -> avoid duplicate search
• Heuristic stop point
• 1) Exhaustively explored, 2) Certain time elapsed, 3)
cost has not improved for several iterations
Calcite’s VolcanoPlanner
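The search loop described on this slide can be sketched in a few lines. This is a toy, not Calcite's VolcanoPlanner: plans are nested tuples, the two rules (`commute`, `associate`), the row counts, and the 1/1000 join selectivity are assumptions for illustration, and a plain set of already-seen plans stands in for the real planner's memo of equivalence sets. An iteration budget stands in for the elapsed-time limit.

```python
ROWS = {"A": 1_000_000, "B": 10_000, "C": 100}   # assumed table statistics


def rows(plan):
    """Estimated output rows; a join keeps 1/1000 of the cross product (assumed)."""
    if isinstance(plan, str):
        return ROWS[plan]
    left, right = plan
    return rows(left) * rows(right) // 1000


def cost(plan):
    """Cost of a plan = total rows produced by all of its joins."""
    if isinstance(plan, str):
        return 0
    left, right = plan
    return cost(left) + cost(right) + rows(plan)


def commute(plan):
    # X join Y -> Y join X
    if isinstance(plan, tuple):
        left, right = plan
        yield (right, left)


def associate(plan):
    # (X join Y) join Z -> X join (Y join Z)
    if isinstance(plan, tuple) and isinstance(plan[0], tuple):
        (x, y), z = plan
        yield (x, (y, z))


RULES = [commute, associate]


def rewrites(plan):
    """Apply every rule at the root and, recursively, inside each input."""
    for rule in RULES:
        yield from rule(plan)
    if isinstance(plan, tuple):
        left, right = plan
        for new_left in rewrites(left):
            yield (new_left, right)
        for new_right in rewrites(right):
            yield (left, new_right)


def optimize(plan, max_iterations=1000, patience=50):
    seen = {plan}        # memo: each plan is expanded at most once
    frontier = [plan]
    best = plan
    iterations = 0
    stale = 0            # iterations since the best cost last improved
    # Stop when 1) nothing is left to explore, 2) the budget is used up,
    # or 3) the cost has not improved for `patience` iterations.
    while frontier and iterations < max_iterations and stale < patience:
        current = frontier.pop()
        iterations += 1
        improved = False
        for candidate in rewrites(current):
            if candidate in seen:
                continue
            seen.add(candidate)
            frontier.append(candidate)
            if cost(candidate) < cost(best):
                best = candidate
                improved = True
        stale = 0 if improved else stale + 1
    return best, len(seen)
```

Starting from `(("A", "B"), "C")`, the search visits every join tree over the three tables and settles on one that joins the two small tables first.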
38. • Convert from one Convention to another
• Convention is used to represent a data source
Converter Rule
[Diagram: a FlinkLogicalJoin over two FlinkLogicalNativeTableScan inputs (FlinkConventions.LOGICAL) is converted into a DataStreamJoin over two DataStreamScan inputs (FlinkConventions.DATASTREAM)]
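The conversion in the diagram can be mimicked with a small sketch. The `Node` class, the `TO_DATASTREAM` table, and `convert` are hypothetical names invented here; only the operator and convention labels come from the slide. Real converter rules in Calcite and Flink are Java classes and carry far more logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    op: str              # operator name, e.g. "FlinkLogicalJoin"
    convention: str      # convention the operator belongs to
    inputs: tuple = ()   # child nodes


# Hypothetical converter rules:
# (source convention, operator) -> operator in the DATASTREAM convention.
TO_DATASTREAM = {
    ("LOGICAL", "FlinkLogicalJoin"): "DataStreamJoin",
    ("LOGICAL", "FlinkLogicalNativeTableScan"): "DataStreamScan",
}


def convert(node):
    """Convert a LOGICAL plan tree into the DATASTREAM convention, node by node."""
    new_op = TO_DATASTREAM[(node.convention, node.op)]
    new_inputs = tuple(convert(child) for child in node.inputs)
    return Node(new_op, "DATASTREAM", new_inputs)


scan = Node("FlinkLogicalNativeTableScan", "LOGICAL")
logical_plan = Node("FlinkLogicalJoin", "LOGICAL", (scan, scan))
physical_plan = convert(logical_plan)
```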
39. Agenda
• SQL -> NoSQL -> NewSQL
• Introduction to Apache Calcite
• How Optimizer Works
• Shuffle Optimization in MaxCompute 2.0
• Materialized View
• Stream SQL
43. • Physical property associated with an operator
• 3 primary trait types:
• Convention: data source (we’ve seen this)
• Collation: sort order
• Distribution: Hash or Range distributed
Calcite Trait
44. • MaxCompute embeds Calcite’s Cost-based Optimizer
• Model data distribution, sort order as Calcite Traits
• If required traits are satisfied, then no need to shuffle
Achieved via
Calcite Optimizer
[Diagram: a SortMergeJoin over two PhysicalTableScan inputs, each already providing hash(key) distribution and sort(key) collation]
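The idea that a shuffle is only needed when required traits are not already satisfied can be sketched as follows. `Traits` and `enforcers` are hypothetical names for this example, and the rule that a reshuffle discards the sort order is an assumption; this is not MaxCompute's or Calcite's actual trait machinery.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Traits:
    distribution: Optional[str] = None   # e.g. "hash(key)"
    collation: Optional[str] = None      # sort column, e.g. "key"


def enforcers(provided, required):
    """Return the extra operators needed so that `provided` satisfies `required`."""
    ops = []
    collation = provided.collation
    if required.distribution is not None and provided.distribution != required.distribution:
        ops.append(f"Shuffle[{required.distribution}]")
        collation = None   # assumed: a reshuffle does not preserve sort order
    if required.collation is not None and collation != required.collation:
        ops.append(f"Sort[{required.collation}]")
    return ops


# What a sort merge join on `key` asks of each input.
join_requires = Traits(distribution="hash(key)", collation="key")

# A table already stored hashed and sorted by `key` needs no extra work.
clustered_table = Traits(distribution="hash(key)", collation="key")
# A table with no known layout needs both a shuffle and a sort.
plain_table = Traits()
```

Calling `enforcers(clustered_table, join_requires)` yields an empty list, which is the "no need to shuffle" case from the slide.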
45. Agenda
• SQL -> NoSQL -> NewSQL
• Introduction to Apache Calcite
• Deep Dive into How Optimizer Works
• Shuffle Optimization in MaxCompute 2.0
• Materialized View
• Stream SQL
46. • A materialized view is a database object that
contains the results of a query
• Automatically rewrite incoming queries using
Materialized View
• Idea implemented in Calcite:
Optimizing queries using materialized views: A
practical, scalable solution
Materialized View Rewriting
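To make the rewriting idea concrete, here is a deliberately narrow sketch that only handles SUM aggregations over a single table. `AggQuery`, `rewrite`, the `sales` table, and the `mv_sales` view are hypothetical, and the sketch assumes the view exposes its columns under the same names as the base table. The matching in the cited paper and in Calcite covers far more cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AggQuery:
    table: str             # table (or view) being scanned
    group_by: frozenset    # grouping columns
    sums: frozenset        # columns aggregated with SUM


def rewrite(query, views):
    """Answer `query` from a materialized view when one covers it."""
    for name, view in views.items():
        if (view.table == query.table
                and query.group_by <= view.group_by   # view is at least as fine-grained
                and query.sums <= view.sums):         # view has the needed partial sums
            # SUM can be re-aggregated from the view's partial sums.
            return AggQuery(name, query.group_by, query.sums)
    return query   # no matching view: keep the original query


views = {
    "mv_sales": AggQuery("sales", frozenset({"region", "day"}), frozenset({"amount"})),
}
by_region = AggQuery("sales", frozenset({"region"}), frozenset({"amount"}))
rewritten = rewrite(by_region, views)
```

Here the per-region query is answered from the smaller per-region-per-day view instead of the base table.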
52. • NewSQL is the new industry standard
• Calcite is a highly extensible database framework:
• SQL Optimizer with a bunch of built-in rules
• Supports Query Federation
• Supports heavy customization such as
Shuffle Optimization in MaxCompute 2.0
• Supports Materialized View Rewriting & Stream SQL
Recap