ApacheCon Europe, Oct. 22nd - 24th 2019
Fast federated SQL with
Apache Calcite
Chris Baynes
@binaryexp
chris@contiamo.com
● CTO at Contiamo
● Apache Calcite committer
Content
● Intro to Apache Calcite
● Calcite toolbox
● Performance - single data sources
● Federated queries
Intro to Apache Calcite
Secret sauce behind lots of existing tech.
Intro to Apache Calcite
Important piece in the “deconstructed database” *
* https://www.slideshare.net/julienledem/from-flat-files-to-deconstructed-database
Intro to Apache Calcite
Pluggable framework:
● Standard SQL query parsing (lots of different dialects via Babel)
● Query rewriting
● Materializations (even auto-generation with lattices)
● Query optimisation (rule based and cost based optimizers)
● Support datasource connections either as JDBC drivers or custom adapters
● In memory execution model
● SQL Streams
● ...
Calcite toolbox
Calcite terminology & architecture
SqlNode: SQL tree constructed by parser
RelNode: The query plan (relational algebra AST)
Program: Planner/Optimizer
● Cost based (Volcano)
● Rule based (Hep)
Babel parser (SQL parser)
Parse almost any SQL:
● Not dialect-specific parsing
● Avoids making assumptions about keywords
● Handles special syntax (e.g. cast using :: in Postgres)
Useful when:
● Impersonating a database server
● Clients are more familiar with a dialect
SqlShuttle / RelShuttle (Rewriters)
Uses the visitor pattern to transform one SqlNode/RelNode tree into another.
Useful for:
● View rewrites
● Row level security
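A shuttle is just the visitor pattern applied to query trees: visit each node, return a (possibly new) node, and rebuild the tree bottom-up. A self-contained toy sketch of that idea (the Node class and method names are illustrative, not Calcite's SqlShuttle API), redirecting a table reference the way a view rewrite or row-level-security rule might:

```java
import java.util.ArrayList;
import java.util.List;

public class ShuttleSketch {
    // Toy SQL tree: a node is either a table reference or an operator over children.
    public static class Node {
        public final String label;
        public final boolean isTable;
        public final List<Node> children = new ArrayList<>();
        public Node(String label, boolean isTable) { this.label = label; this.isTable = isTable; }
    }

    // Visitor-style rewrite: rebuild the tree, mapping one table name to another,
    // the way a SqlShuttle returns a new SqlNode from each visit.
    public static Node rewrite(Node node, String from, String to) {
        if (node.isTable) {
            return new Node(node.label.equals(from) ? to : node.label, true);
        }
        Node copy = new Node(node.label, false);
        for (Node child : node.children) {
            copy.children.add(rewrite(child, from, to));
        }
        return copy;
    }

    public static String print(Node node) {
        if (node.children.isEmpty()) return node.label;
        List<String> parts = new ArrayList<>();
        for (Node child : node.children) parts.add(print(child));
        return node.label + "(" + String.join(", ", parts) + ")";
    }

    public static void main(String[] args) {
        Node query = new Node("SELECT", false);
        query.children.add(new Node("customers", true));
        // Redirect the scan to a secured view, as a row-level-security rewrite might.
        System.out.println(print(rewrite(query, "customers", "customers_secure_view")));
        // SELECT(customers_secure_view)
    }
}
```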
RelBuilder (Plan builder)
● Programmatically build a RelNode
● Skip SQL parsing
Useful when:
● Query is generated by model / UI
● Avoid writing SQL by hand
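In the spirit of RelBuilder's fluent style, here is a toy stack-based plan builder (illustrative only; the real API is org.apache.calcite.tools.RelBuilder) that assembles a plan for "select name, id from customers order by name limit 10" without any SQL parsing:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class PlanBuilderSketch {
    // A stack of partially built plans, loosely mimicking RelBuilder's internal stack.
    private final Deque<String> stack = new ArrayDeque<>();

    public PlanBuilderSketch scan(String table) {
        stack.push("TableScan(" + table + ")");
        return this;
    }

    public PlanBuilderSketch project(String... fields) {
        stack.push("Project[" + String.join(",", fields) + "](" + stack.pop() + ")");
        return this;
    }

    public PlanBuilderSketch sortLimit(String field, int fetch) {
        stack.push("Sort[" + field + ", fetch=" + fetch + "](" + stack.pop() + ")");
        return this;
    }

    public String build() {
        return stack.pop();
    }

    public static void main(String[] args) {
        String plan = new PlanBuilderSketch()
            .scan("customers")
            .project("name", "id")
            .sortLimit("name", 10)
            .build();
        System.out.println(plan);
        // Sort[name, fetch=10](Project[name,id](TableScan(customers)))
    }
}
```

A UI or semantic model can drive exactly this kind of chain, which is why RelBuilder pairs well with generated queries.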
Statistics
Provide metadata about physical tables.
● Number of rows (defaults to 100 when unknown)
● Column uniqueness
● Distribution of values
Useful when:
● Improving planner costs
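Why row counts matter: with the 100-row default every plan looks alike, while real cardinalities let the planner separate cheap from expensive join orders. A toy cost comparison (the cost formula is a deliberately crude hash-join model, not Calcite's cost model):

```java
public class JoinCostSketch {
    // Toy cost of ((a join b) join c): each hash join costs the sum of its
    // input row counts, and the first join's output feeds the second.
    // Deliberately crude; real planners use much richer cost models.
    public static double twoJoinCost(double a, double b, double c, double selectivity) {
        double firstJoinOut = a * b * selectivity;
        return (a + b) + (firstJoinOut + c);
    }

    public static void main(String[] args) {
        // Without statistics, every table "has" 100 rows, so all join orders
        // get the same cost and the planner cannot tell them apart.
        System.out.println(twoJoinCost(100, 100, 100, 0.01));
        // With real cardinalities, joining the small tables first is clearly cheaper.
        double smallFirst = twoJoinCost(1_000, 10_000, 5_000_000, 0.0001);
        double largeFirst = twoJoinCost(5_000_000, 10_000, 1_000, 0.0001);
        System.out.println(smallFirst < largeFirst);  // true
    }
}
```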
Materializations
● Define materialization with SQL
● Automatically rewrite queries to use materializations
1. Materialize pageviews by country & city (pviews_by_loc)
2. Make the query:

select country, city, count(*)
from pageviews
group by country, city
order by country

3. Calcite rewrites it to read from the materialization:

select country, city, cnt
from pviews_by_loc
order by country
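The rewrite is legal because the materialization's grouping covers the query's grouping. A toy coverage check (Calcite's actual substitution algorithm is far more general, handling filters, projections and roll-ups):

```java
import java.util.Set;

public class MaterializationSketch {
    // A materialization is usable when its GROUP BY columns contain all of the
    // query's GROUP BY columns; equal sets read the tile directly, a strict
    // superset means re-aggregating (rolling up) the tile.
    public static boolean covers(Set<String> matGroupKeys, Set<String> queryGroupKeys) {
        return matGroupKeys.containsAll(queryGroupKeys);
    }

    public static void main(String[] args) {
        Set<String> pviewsByLoc = Set.of("country", "city");
        System.out.println(covers(pviewsByLoc, Set.of("country", "city"))); // true: read directly
        System.out.println(covers(pviewsByLoc, Set.of("country")));         // true: roll up city
        System.out.println(covers(pviewsByLoc, Set.of("referrer")));        // false: cannot use it
    }
}
```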
SqlDialect
● Convert to dialect specific SQL string
● Many SQL dialects supported (Oracle, MySQL, Postgres, Hive, …)
● Easy to plug in your own dialect
public class CustomDialect extends SqlDialect {
  // A SqlDialect subclass is configured through a Context
  public CustomDialect(Context context) { super(context); }

  @Override public boolean supportsNestedAggregations() { return false; }
  @Override public boolean supportsWindowFunctions() { return false; }

  @Override public SqlNode getCastSpec(RelDataType type) {
    // translate type names ...
    return super.getCastSpec(type);
  }

  @Override public void unparseCall(SqlWriter writer, SqlCall call,
      int leftPrec, int rightPrec) {
    // make changes to the SqlCall (SqlNode) ...
    super.unparseCall(writer, call, leftPrec, rightPrec);
  }
}
Lattices
● Define dimensions on star schemas
● Combinations of dimensions form summary tables
● Materialized summary tables are called tiles
● Trick is to figure out which tiles to pick
https://www.slideshare.net/julianhyde/dont-optimize-my-queries-optimize-my-data-81360892
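The candidate tiles of a lattice are exactly the non-empty subsets of its dimensions, which is why tile selection gets hard quickly: 2^n - 1 candidates for n dimensions. A small sketch enumerating them:

```java
import java.util.ArrayList;
import java.util.List;

public class LatticeSketch {
    // Every non-empty subset of the declared dimensions is a candidate tile.
    public static List<List<String>> candidateTiles(List<String> dims) {
        List<List<String>> tiles = new ArrayList<>();
        for (int mask = 1; mask < (1 << dims.size()); mask++) {
            List<String> tile = new ArrayList<>();
            for (int i = 0; i < dims.size(); i++) {
                if ((mask & (1 << i)) != 0) {
                    tile.add(dims.get(i));
                }
            }
            tiles.add(tile);
        }
        return tiles;
    }

    public static void main(String[] args) {
        List<List<String>> tiles = candidateTiles(List.of("country", "city", "browser"));
        System.out.println(tiles.size());  // 7 candidates; materializing all of them rarely pays off
        tiles.forEach(System.out::println);
    }
}
```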
Conclusion - tools
● Many different tools
● Can be used to build complex applications
● Consider the effect on performance
Performance - Single data source
Measure - benchmark
Use a standard SQL benchmark tool: TPC
TPC-H:
● 22 queries
● data size configurable: 1 GB, 10 GB, 30 GB, ...
● generated for lots of dialects
Measure overhead
Configuration & hardware are not so important for us
Measure - trace
Query push down
select name, id from customers order by name limit 10
✔ Full query push down:

JdbcToEnumerableConverter
  JdbcSort(sort0=[$1], dir0=[ASC], fetch=[10])
    JdbcProject(name, id)
      JdbcTableScan(table=[customers])

✘ Sort and limit in memory:

EnumerableSort(sort0=[$1], dir0=[ASC], fetch=[10])
  JdbcToEnumerableConverter
    JdbcProject(name, id)
      JdbcTableScan(table=[customers])
Benchmark runs
4 benchmark runs:
0. Direct against database
1. Query engine - unoptimized (defaults)
2. Query engine - with statistics
3. Query engine - with multi-phase optimiser
Benchmark baseline (run 0)
Benchmark run 1 - unoptimised
total = 158.1s
Results
● Slow!
● Very little push-down
● Planning consumes a lot of time
Try
● Using table statistics
Benchmark run 2 - use table stats
Results
● Some queries noticeably faster
● Most work still being done in memory
● Query planning is very slow
Next, try: Planning phases
Benchmark run 3 - planning phases
Split planning into 2 phases (create separate Calcite Programs):
1. Logical phase: use the rule-based optimiser (Hep), apply all logical optimisation rules
2. Physical phase: use the cost-based optimiser (Volcano), apply all physical rules and conversions (jdbc, druid, …)
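The essence of the Hep-style logical phase is fixpoint rule application: keep firing rules until none changes the plan. A toy sketch over a plan represented as a root-to-leaf operator pipeline (real rules match RelNode patterns, not lists of strings):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

public class TwoPhaseSketch {
    // Phase 1 (Hep-style): apply each rule repeatedly until no rule changes
    // the plan any more (a fixpoint).
    public static List<String> logicalPhase(List<String> plan,
                                            List<UnaryOperator<List<String>>> rules) {
        List<String> current = plan;
        boolean changed = true;
        while (changed) {
            changed = false;
            for (UnaryOperator<List<String>> rule : rules) {
                List<String> next = rule.apply(current);
                if (!next.equals(current)) {
                    current = next;
                    changed = true;
                }
            }
        }
        return current;
    }

    // A toy rule: merge adjacent identical operators (e.g. Filter over Filter).
    public static UnaryOperator<List<String>> merge(String op) {
        return plan -> {
            List<String> out = new ArrayList<>();
            for (String node : plan) {
                boolean duplicate = node.equals(op)
                    && !out.isEmpty() && out.get(out.size() - 1).equals(op);
                if (!duplicate) out.add(node);
            }
            return out;
        };
    }

    public static void main(String[] args) {
        List<String> plan = List.of("Project", "Project", "Filter", "Filter", "Scan(t)");
        List<String> logical = logicalPhase(plan, List.of(merge("Filter"), merge("Project")));
        System.out.println(logical);  // [Project, Filter, Scan(t)]
        // Phase 2 (Volcano-style) would now cost physical alternatives, e.g.
        // converting the whole pipeline into a Jdbc* subtree for push-down.
    }
}
```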
Benchmark run 3 - planning phases vs stats
Benchmark run 3 - planning phases vs baseline
Results
● Planning and execution are much faster
● Full query push down
● Almost all query time being spent in data source
● Some queries are faster than baseline
Conclusion - single data source
The goal: full query push down
Fastest way to get there:
● Apply most rules in rule-based optimiser
● Avoid giving cost-based optimiser too many rules
● Rely on underlying data source optimiser
Federated queries
“Single query based on information from multiple data sources”
Obstacles to efficiency
● Joins of large tables
● Large driver fetchSize
● Aggregations
● Other in-memory computations
Can lead to:
● Slow queries
● OOM
Solution: use Spark as the execution engine
● Scales out to many nodes
● Avoids OOM for large datasets
● Some Calcite support
Bonus:
● Eases access to data lakes
Calcite support for Spark
Currently some options available:
● Calcite Spark adapter
○ Translates a Calcite plan => Spark plan
○ Sends the Spark plan to the cluster
○ (-) Needs more work to make the translation complete
● Calcite SparkSqlDialect
○ Use RelToSqlConverter to generate a query in Spark SQL
○ (-) Spark completely re-plans/optimises the query
○ (-) Spark push-downs are less complete
● Send SQL queries directly to Spark
○ (-) Can’t use the Calcite toolbox
○ (-) Spark push-downs are less complete
A hybrid approach
1. Generate a Calcite plan
2. Identify pieces related to separate data sources
3. Squash those pieces into “Spark tables”
4. Construct a dataframe for every “Spark table”
5. Convert the remaining plan to Spark SQL
Hybrid approach - example
EnumerableLimit(fetch=[10])
  EnumerableHashJoin
    JdbcToEnumerableConverter(datasource=[X])
      JdbcProject(cust_id, sale_id)
        JdbcTableScan(table=[sales])
    JdbcToEnumerableConverter(datasource=[Y])
      JdbcProject(cust_id)
        JdbcFilter(city=[Berlin])
          JdbcTableScan(table=[customer])
Join two tables from different data sources, X & Y:

select cust_id, sale_id
from X.sales join Y.customer
using (cust_id)
where customer.city = 'Berlin'
limit 10
Hybrid approach - example
EnumerableLimit(fetch=[10])
  EnumerableHashJoin
    JdbcToEnumerableConverter(datasource=[X])
      JdbcProject(cust_id, sale_id)
        JdbcTableScan(table=[sales])
    JdbcToEnumerableConverter(datasource=[Y])
      JdbcProject(cust_id)
        JdbcFilter(city=[Berlin])
          JdbcTableScan(table=[customer])
Identify data source pieces
t1: select cust_id, sale_id
from X.sales
Hybrid approach - example
EnumerableLimit(fetch=[10])
  EnumerableHashJoin
    JdbcToEnumerableConverter(datasource=[X])
      JdbcProject(cust_id, sale_id)
        JdbcTableScan(table=[sales])
    JdbcToEnumerableConverter(datasource=[Y])
      JdbcProject(cust_id)
        JdbcFilter(city=[Berlin])
          JdbcTableScan(table=[customer])
Identify data source pieces
t2: select cust_id
from Y.customer
where city = 'Berlin'
Hybrid approach - example
SparkLimit(fetch=[10])
  SparkHashJoin
    SparkTable(table=t1, ds=[X], query=[...])
    SparkTable(table=t2, ds=[Y], query=[...])
select *
from t1 join t2
using (cust_id)
limit 10
Use the Spark SqlDialect to transform into a Spark query
Conclusion - Hybrid approach
● Full scalability of Spark
● Push-down ability of Calcite
● No bottleneck through an in-memory layer
● Can be persisted to an external db / file using Spark
● Approach would also work for other engines (Drill, Flink, Presto)
Thanks!
@binaryexp
chris@contiamo.com
calcite.apache.org

More Related Content

What's hot

Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overviewJulian Hyde
 
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Julian Hyde
 
The evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityJulian Hyde
 
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...Christian Tzolov
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Julian Hyde
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteJulian Hyde
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache CalciteJulian Hyde
 
Parallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected WaysParallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected WaysDatabricks
 
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteOpen Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteJulian Hyde
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query LanguageJulian Hyde
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Data profiling in Apache Calcite
Data profiling in Apache CalciteData profiling in Apache Calcite
Data profiling in Apache CalciteDataWorks Summit
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkKazuaki Ishizaki
 
MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks EDB
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IODatabricks
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 

What's hot (20)

Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overview
 
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
Data all over the place! How SQL and Apache Calcite bring sanity to streaming...
 
The evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its CommunityThe evolution of Apache Calcite and its Community
The evolution of Apache Calcite and its Community
 
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
Using Apache Calcite for Enabling SQL and JDBC Access to Apache Geode and Oth...
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
 
Spark
SparkSpark
Spark
 
Streaming SQL with Apache Calcite
Streaming SQL with Apache CalciteStreaming SQL with Apache Calcite
Streaming SQL with Apache Calcite
 
Parallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected WaysParallelizing with Apache Spark in Unexpected Ways
Parallelizing with Apache Spark in Unexpected Ways
 
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteOpen Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...
 
Morel, a Functional Query Language
Morel, a Functional Query LanguageMorel, a Functional Query Language
Morel, a Functional Query Language
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Data profiling in Apache Calcite
Data profiling in Apache CalciteData profiling in Apache Calcite
Data profiling in Apache Calcite
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
 
MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks MongoDB vs. Postgres Benchmarks
MongoDB vs. Postgres Benchmarks
 
How We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IOHow We Optimize Spark SQL Jobs With parallel and sync IO
How We Optimize Spark SQL Jobs With parallel and sync IO
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 

Similar to Fast federated SQL queries with Apache Calcite and Spark

A Step to programming with Apache Spark
A Step to programming with Apache SparkA Step to programming with Apache Spark
A Step to programming with Apache SparkKnoldus Inc.
 
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteA smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteJulian Hyde
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2datamantra
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarSpark Summit
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Holden Karau
 
Declarative benchmarking of cassandra and it's data models
Declarative benchmarking of cassandra and it's data modelsDeclarative benchmarking of cassandra and it's data models
Declarative benchmarking of cassandra and it's data modelsMonal Daxini
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame APIdatamantra
 
Meetup developing building and_deploying databases with SSDT
Meetup developing building and_deploying databases with SSDTMeetup developing building and_deploying databases with SSDT
Meetup developing building and_deploying databases with SSDTSolidify
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKzmhassan
 
XStream: stream processing platform at facebook
XStream:  stream processing platform at facebookXStream:  stream processing platform at facebook
XStream: stream processing platform at facebookAniket Mokashi
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...Inhacking
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Аліна Шепшелей
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingOh Chan Kwon
 
Productionalizing a spark application
Productionalizing a spark applicationProductionalizing a spark application
Productionalizing a spark applicationdatamantra
 

Similar to Fast federated SQL queries with Apache Calcite and Spark (20)

A Step to programming with Apache Spark
A Step to programming with Apache SparkA Step to programming with Apache Spark
A Step to programming with Apache Spark
 
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache CalciteA smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
 
Google Cloud Dataflow
Google Cloud DataflowGoogle Cloud Dataflow
Google Cloud Dataflow
 
Declarative benchmarking of cassandra and it's data models
Declarative benchmarking of cassandra and it's data modelsDeclarative benchmarking of cassandra and it's data models
Declarative benchmarking of cassandra and it's data models
 
Couchbas for dummies
Couchbas for dummiesCouchbas for dummies
Couchbas for dummies
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
 
Meetup developing building and_deploying databases with SSDT
Meetup developing building and_deploying databases with SSDTMeetup developing building and_deploying databases with SSDT
Meetup developing building and_deploying databases with SSDT
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 
XStream: stream processing platform at facebook
XStream:  stream processing platform at facebookXStream:  stream processing platform at facebook
XStream: stream processing platform at facebook
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
Spark sql meetup
Spark sql meetupSpark sql meetup
Spark sql meetup
 
Spark cep
Spark cepSpark cep
Spark cep
 
Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event Processing
 
Productionalizing a spark application
Productionalizing a spark applicationProductionalizing a spark application
Productionalizing a spark application
 

Recently uploaded

Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noidabntitsolutionsrishis
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)jennyeacort
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfStefano Stabellini
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 

Recently uploaded (20)

Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
 

Fast federated SQL queries with Apache Calcite and Spark

  • 1. ApacheCon Europe, Oct. 22nd - 24th 2019 Fast federated SQL with Apache Calcite Chris Baynes
  • 2. @binaryexp chris@contiamo.com ● CTO at Contiamo ● Apache Calcite committer
  • 3.
  • 4. Content ● Intro to Apache Calcite ● Calcite toolbox ● Performance - single data sources ● Federated queries
  • 5. Intro to Apache Calcite Secret sauce behind lots of existing tech:
  • 6. Intro to Apache Calcite Important piece in the “deconstructed database” * * https://www.slideshare.net/julienledem/from-flat-files-to-deconstructed-database
  • 7. Intro to Apache Calcite Pluggable framework: ● Standard SQL query parsing (lots of different dialects via Babel) ● Query rewriting ● Materializations (even auto-generation with lattices) ● Query optimisation (rule-based and cost-based optimizers) ● Support for data source connections, either as JDBC drivers or custom adapters ● In-memory execution model ● SQL streams ● ...
  • 9. Calcite terminology & architecture SqlNode: SQL tree constructed by parser RelNode: The query plan (relational algebra AST) Program: Planner/Optimizer ● Cost based (Volcano) ● Rule based (Hep)
  • 10. Babel parser (SQL parser) Parse almost any SQL ● Not dialect specific parsing ● Avoids making assumptions about keywords ● Handles special syntax (e.g. cast using :: in Postgres) Useful when: ● Impersonating a database server ● Clients are more familiar with a dialect
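A minimal sketch of parsing dialect-specific SQL with the Babel parser (assumes a recent Calcite with the calcite-babel artifact on the classpath; the table and column names are made up):

```java
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.parser.SqlParser;
import org.apache.calcite.sql.parser.babel.SqlBabelParserImpl;

public class BabelExample {
  // Parse SQL containing dialect-specific syntax (here a Postgres-style
  // :: cast) that the core Calcite parser would reject.
  static SqlNode parse(String sql) throws Exception {
    SqlParser.Config config = SqlParser.config()
        .withParserFactory(SqlBabelParserImpl.FACTORY);
    return SqlParser.create(sql, config).parseQuery();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(parse("select total::int from orders").getKind());
  }
}
```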
  • 11. SqlShuttle / RelShuttle (Rewriters) Uses visitor pattern to transform one SqlNode/RelNode tree into another Useful for: ● View rewrites ● Row level security
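The rewriter pattern above can be sketched as a SqlShuttle subclass; the table names ("customers", "customers_filtered") are hypothetical, standing in for a row-level-security view swap:

```java
import org.apache.calcite.sql.SqlIdentifier;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.util.SqlShuttle;

// Hypothetical rewriter: any reference to "customers" is redirected to a
// filtered view "customers_filtered" before the query is planned.
public class ViewRewriter extends SqlShuttle {
  @Override public SqlNode visit(SqlIdentifier id) {
    if (id.isSimple() && "customers".equalsIgnoreCase(id.getSimple())) {
      return new SqlIdentifier("customers_filtered", id.getParserPosition());
    }
    return id;
  }
}
```

Applied with `rewritten = originalAst.accept(new ViewRewriter())`; a RelShuttle works the same way on RelNode trees.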
  • 12. RelBuilder (Plan builder) ● Programmatically build a RelNode ● Skip SQL parsing Useful when: ● Query is generated by model / UI ● Avoid writing SQL by hand
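A sketch of building a plan programmatically with RelBuilder, skipping SQL entirely. The tiny ReflectiveSchema-backed "customers" table here is invented to keep the example self-contained; in practice the schema would point at a real data source:

```java
import org.apache.calcite.adapter.java.ReflectiveSchema;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.RelBuilder;

public class RelBuilderExample {
  // In-memory stand-in for a real table; public fields become columns.
  public static class Customer {
    public final String name; public final int id; public final String city;
    Customer(String name, int id, String city) {
      this.name = name; this.id = id; this.city = city;
    }
  }
  public static class Db {
    public final Customer[] customers = { new Customer("Ada", 1, "Berlin") };
  }

  // Build "select name, id from customers where city = 'Berlin'" directly.
  public static RelNode build() {
    SchemaPlus root = Frameworks.createRootSchema(true);
    root.add("db", new ReflectiveSchema(new Db()));
    FrameworkConfig config = Frameworks.newConfigBuilder()
        .defaultSchema(root.getSubSchema("db"))
        .build();
    RelBuilder b = RelBuilder.create(config);
    return b.scan("customers")
        .filter(b.equals(b.field("city"), b.literal("Berlin")))
        .project(b.field("name"), b.field("id"))
        .build();
  }
}
```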
  • 13. Statistics Provide metadata about physical tables. ● Number of rows (100 default) ● Column uniqueness ● Distribution of values Useful when: ● Improving planner costs
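One way to supply such metadata is to override getStatistic() on a table implementation; the sketch below (with invented column names and row count) replaces Calcite's default estimate of 100 rows and declares column 0 a unique key:

```java
import org.apache.calcite.rel.type.RelDataType;
import org.apache.calcite.rel.type.RelDataTypeFactory;
import org.apache.calcite.schema.Statistic;
import org.apache.calcite.schema.Statistics;
import org.apache.calcite.schema.impl.AbstractTable;
import org.apache.calcite.sql.type.SqlTypeName;
import org.apache.calcite.util.ImmutableBitSet;
import java.util.Collections;

// Hypothetical table that reports a realistic row count and a unique key,
// giving the planner better costs than the 100-row default.
public class CustomersTable extends AbstractTable {
  @Override public RelDataType getRowType(RelDataTypeFactory tf) {
    return tf.builder()
        .add("id", SqlTypeName.INTEGER)
        .add("name", SqlTypeName.VARCHAR)
        .build();
  }
  @Override public Statistic getStatistic() {
    // 1M rows; column 0 ("id") is unique.
    return Statistics.of(1_000_000d,
        Collections.singletonList(ImmutableBitSet.of(0)));
  }
}
```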
  • 14. Materializations
  ● Define materialization with SQL
  ● Automatically rewrite queries to use materializations

  1. Materialize pageviews by country & city (pviews_by_loc):
     select country, city, count(*) from pageviews group by country, city
  2. Make query:
     select country, city, count(*) from pageviews group by country, city order by country
     Calcite rewrites it to read from the materialization:
     select country, city, cnt from pviews_by_loc order by country
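One way to register such a materialization is Calcite's JSON model file; this is a minimal sketch in which the schema name and the backing table name (pviews_by_loc_table) are hypothetical:

```json
{
  "version": "1.0",
  "defaultSchema": "web",
  "schemas": [ {
    "name": "web",
    "materializations": [ {
      "view": "pviews_by_loc",
      "table": "pviews_by_loc_table",
      "sql": "select country, city, count(*) as cnt from pageviews group by country, city"
    } ]
  } ]
}
```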
  • 15. SqlDialect
  ● Convert to dialect specific SQL string
  ● Many SQL dialects supported (Oracle, Mysql, Postgres, Hive, …)
  ● Easy to plug in your own dialect

  public class CustomDialect extends SqlDialect {
    @Override
    public boolean supportsNestedAggregations() { return false; }

    @Override
    public boolean supportsWindowFunctions() { return false; }

    @Override
    public SqlNode getCastSpec(RelDataType type) {
      // translate type names
      ...
    }

    @Override
    public void unparseCall(SqlWriter writer, SqlCall call, int leftPrec, int rightPrec) {
      // make changes to the SqlCall (SqlNode)
      ...
    }
  }
  • 16. Lattices ● Define dimensions on star schemas ● Combinations of dimensions form summary tables ● Materialized summary tables are called tiles ● Trick is to figure out which tiles to pick
  • 18. Conclusion - tools ● Many different tools ● Can be used to build complex applications ● Consider the effect on performance
  • 19. Performance - Single data source
  • 20. Measure - benchmark Use a standard SQL benchmark tool: TPC TPC-H: ● 22 queries ● data size configurable: 1 GB, 10, 30, .... ● generated for lots of dialects Measure overhead Configuration & hardware not so important for us
  • 22. Query push down

  select name, id from customers order by name limit 10

  Full query push down ✔:
  JdbcToEnumerableConverter
    JdbcSort(sort0=[$1], dir0=[ASC], fetch=[10])
      JdbcProject(name, id)
        JdbcTableScan(table=[customers])

  Sort and limit in memory ✘:
  EnumerableSort(sort0=[$1], dir0=[ASC], fetch=10)
    JdbcToEnumerableConverter
      JdbcProject(name, id)
        JdbcTableScan(table=[customers])
  • 23. Benchmark runs 4 benchmark runs: 0. Direct against database 1. Query engine - unoptimized (defaults) 2. Query engine - with statistics 3. Query engine - with multi-phase optimiser
  • 25. Benchmark run 1 - unoptimised total = 158.1s
  • 26. Benchmark run 1 - unoptimised Results ● Slow! ● Very little push-down ● Planning consumes a lot of time Try ● Using table statistics
  • 27. Benchmark run 2 - use table stats
  • 28. Benchmark run 2 - use table stats Results ● Some queries noticeably faster ● Most work still being done in memory ● Query planning is very slow Next, try: Planning phases
  • 29. Benchmark run 3 - planning phases Split planning into 2 phases (create separate Calcite Programs): 1. Logical phase: Use rule-based optimiser (Hep), apply all logical optimisation rules 2. Physical phase: Use cost-based optimiser (Volcano), apply all physical rules and conversions (jdbc, druid, …)
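The logical phase can be sketched as its own Hep program; the two rules shown (filter push-down into joins, projection merging) are illustrative examples, not the deck's exact rule set:

```java
import org.apache.calcite.plan.hep.HepPlanner;
import org.apache.calcite.plan.hep.HepProgram;
import org.apache.calcite.plan.hep.HepProgramBuilder;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.rules.CoreRules;

public class LogicalPhase {
  // Rule-based (Hep) pass: apply deterministic logical rewrites exhaustively,
  // leaving only physical conversions for the cost-based Volcano phase.
  public static RelNode run(RelNode logicalPlan) {
    HepProgram program = new HepProgramBuilder()
        .addRuleInstance(CoreRules.FILTER_INTO_JOIN)
        .addRuleInstance(CoreRules.PROJECT_MERGE)
        .build();
    HepPlanner planner = new HepPlanner(program);
    planner.setRoot(logicalPlan);
    return planner.findBestExp();
  }
}
```

The physical phase would then hand the result to a Volcano-based Program holding only the converter rules (JDBC, Druid, …).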
  • 30. Benchmark run 3 - planning phases vs stats
  • 31. Benchmark run 3 - planning phases vs baseline
  • 32. Benchmark run 3 - planning phases Results ● Planning and execution are much faster ● Full query push down ● Almost all query time being spent in data source ● Some queries are faster than baseline
  • 33. Conclusion - single data source The goal: full query push down Fastest way to get there: ● Apply most rules in rule-based optimiser ● Avoid giving cost-based optimiser too many rules ● Rely on underlying data source optimiser
  • 34. Federated queries “Single query based on information from multiple data sources”
  • 35. Obstacles to efficiency ● Joins of large tables ● Large driver fetchSize ● Aggregations ● Other in-memory computations Can lead to: ● Slow queries ● OOM
  • 36. Solution: use Spark as the execution engine
  ● Scales out to many nodes
  ● Avoids OOM for large datasets
  ● Some Calcite support
  Bonus:
  ● Eases access to data lakes
  • 37. Calcite support for Spark
  Currently some options available:
  ● Calcite Spark adapter
    ○ Translates a Calcite plan into a Spark plan
    ○ Sends the Spark plan to the cluster
    ○ - Needs more work to make the translation complete
  ● Calcite SparkSqlDialect
    ○ Use RelToSqlConverter to generate a query in Spark SQL
    ○ - Spark completely re-plans/optimises the query
    ○ - Spark’s own push-downs are less complete than Calcite’s
  ● Send SQL queries directly to Spark
    ○ - Can’t use the Calcite toolbox
    ○ - Spark’s own push-downs are less complete than Calcite’s
  • 38. A hybrid approach 1. Generate a Calcite plan 2. Identify pieces related to separate data sources 3. Squash those pieces into “Spark tables” 4. Construct a dataframe for every “Spark table” 5. Convert the remaining plan to SparkSQL
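Steps 4 and 5 can be sketched with Spark's Java API. The SQL strings below stand in for the per-source queries the Calcite phase would produce, and the JDBC URL is a hypothetical connection string:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HybridRunner {
  // Step 4: each pushed-down, single-source query becomes a JDBC dataframe.
  public static Dataset<Row> loadSource(SparkSession spark, String jdbcUrl,
      String pushedDownSql) {
    return spark.read().format("jdbc")
        .option("url", jdbcUrl)
        .option("query", pushedDownSql)   // the squashed "Spark table" query
        .load();
  }

  // Step 5: register the dataframes as temp views and run the remaining
  // plan, rendered as Spark SQL, on the cluster.
  public static Dataset<Row> join(SparkSession spark, Dataset<Row> t1,
      Dataset<Row> t2) {
    t1.createOrReplaceTempView("t1");
    t2.createOrReplaceTempView("t2");
    return spark.sql("select * from t1 join t2 using (cust_id) limit 10");
  }
}
```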
  • 39. Hybrid approach - example

  Join two tables from different data sources, X & Y:

  select cust_id, sale_id from X.sales join Y.customer using (cust_id) where Y.customer.city = ‘Berlin’ limit 10

  EnumerableLimit(fetch=10)
    EnumerableHashJoin
      JdbcToEnumerableConverter(datasource=[X])
        JdbcProject(cust_id, sale_id)
          JdbcTableScan(table=[sales])
      JdbcToEnumerableConverter(datasource=[Y])
        JdbcProject(cust_id)
          JdbcFilter(city=[Berlin])
            JdbcTableScan(table=[customer])
  • 40. Hybrid approach - example

  Identify data source pieces. t1: select cust_id, sale_id from X.sales

  EnumerableLimit(fetch=10)
    EnumerableHashJoin
      JdbcToEnumerableConverter(datasource=[X])
        JdbcProject(cust_id, sale_id)
          JdbcTableScan(table=[sales])
      JdbcToEnumerableConverter(datasource=[Y])
        JdbcProject(cust_id)
          JdbcFilter(city=[Berlin])
            JdbcTableScan(table=[customer])
  • 41. Hybrid approach - example

  Identify data source pieces. t2: select cust_id from Y.customer where city = ‘Berlin’

  EnumerableLimit(fetch=10)
    EnumerableHashJoin
      JdbcToEnumerableConverter(datasource=[X])
        JdbcProject(cust_id, sale_id)
          JdbcTableScan(table=[sales])
      JdbcToEnumerableConverter(datasource=[Y])
        JdbcProject(cust_id)
          JdbcFilter(city=[Berlin])
            JdbcTableScan(table=[customer])
  • 42. Hybrid approach - example

  Use the Spark SqlDialect to transform the remaining plan into a Spark query:

  select * from t1 join t2 using (cust_id) limit 10

  SparkLimit(fetch=10)
    SparkHashJoin
      SparkTable(table=t1, ds=[X], query=[...])
      SparkTable(table=t2, ds=[Y], query=[...])
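Rendering a Calcite plan as a Spark SQL string can be sketched with RelToSqlConverter and Calcite's built-in SparkSqlDialect (assuming a recent Calcite; method names may differ slightly across versions):

```java
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.rel2sql.RelToSqlConverter;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.sql.dialect.SparkSqlDialect;

public class ToSparkSql {
  // Convert the remaining (non-pushed-down) plan into a Spark SQL string
  // that can be handed to spark.sql(...).
  public static String toSparkSql(RelNode plan) {
    RelToSqlConverter converter = new RelToSqlConverter(SparkSqlDialect.DEFAULT);
    SqlNode node = converter.visitRoot(plan).asStatement();
    return node.toSqlString(SparkSqlDialect.DEFAULT).getSql();
  }
}
```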
  • 43. Conclusion - Hybrid approach ● Full scalability of Spark ● Push-down ability of Calcite ● No bottleneck through an in-memory layer ● Can be persisted to an external db / file using Spark ● Approach would also work for other engines (Drill, Flink, Presto)