Karol Sobczak
Kamil Bajda-Pawlikowski
Presto Cost-Based Optimizer
Presto Summit 2018
Menlo Park, CA
Who are we?
We are a different kind of company
© 2018 2
• No venture capital
• Employee owned
• Global reach
• Profitable
Why Starburst?
We know Presto inside out
Founded by Presto committers
Over 3 years of contributions to Presto
Supporting large enterprise customers
Notable features contributed:
● ANSI SQL syntax enhancements
● Execution engine improvements
● Security integrations
● Spill to disk
● Cost-Based Optimizer
● and much more!
Our offerings
Enterprise Support
24/7 enterprise support of Presto
On-premises or in the cloud
Includes Presto roadmap influence
Our offerings
PrestoCare™
Fully managed service
Administration, monitoring, and support of the Presto platform and services
Our offerings
Enterprise-Grade Presto in the Cloud
Starburst’s leading distribution is now available on the AWS Marketplace
● Pay-as-you-go pricing
● Deployed in your own VPC
● Integrated auto scaling
● Fully configurable
https://www.starburstdata.com/aws
Ecosystem
Presto is SQL on anything
Query anything, anywhere
Minio Object Storage
https://www.starburstdata.com/technical-blog/data-lakes-without-hadoop/
Alluxio: Virtual Data Layer for Heterogeneous Storage
• Storage abstracted and unified
• Single access point, shared data pool
• Any format, any location, cloud friendly
• Memory speed performance
• Standard application interface
• No code changes, shared access
• Simultaneous access to HDFS & Object Stores
(also transparent migration)
• Intelligent cache management
• Policy-based data placement / tiering
• Location aware
• Key use cases:
• Analytics, Machine Learning, Cloud adoption
Alluxio-Presto Deployments:
Extending Interactive SQL to stored streams - Apache Pulsar & Presto
● Pulsar makes an event available for querying as soon as it arrives
● Pulsar tiered storage lets data move between hot and cold tiers
● Presto integration provides processing parallelism at the “segment” level
● Presto reads data from both the hot tier and the cold tier while executing a query
● Presto integrates natively with the Apache Pulsar schema registry
● Presto enables integration with existing BI tools
If you are interested, talk to Jon Bock, Jerry Peng, or Matteo Merli
Presto Optimizer
CBO in a nutshell
Cost-Based Optimizer v1 includes:
• support for statistics stored in Hive Metastore
• join reordering based on selectivity estimates and cost
• automatic join type selection (repartitioned vs broadcast)
• automatic left/right side selection for joined tables
https://www.starburstdata.com/technical-blog/
Statistics & Cost
Hive Metastore statistics:
• number of rows in a table
• number of distinct values in a column
• fraction of NULL values in a column
• minimum/maximum value in a column
• average data size for a column
Cost calculation includes:
• CPU
• Memory
• Network I/O
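As a rough illustration of how the statistics above could feed the optimizer, the sketch below derives filter selectivities from the distinct-value count, NULL fraction, and min/max range, and keeps cost as separate CPU, memory, and network components. The class names, the uniform-distribution assumption, and the formulas are illustrative assumptions, not Presto's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnStats:
    """Per-column statistics of the kind listed above (hypothetical container)."""
    distinct_values: float
    null_fraction: float
    min_value: float
    max_value: float
    avg_size_bytes: float


@dataclass(frozen=True)
class Cost:
    """Cost kept as three separate components rather than a single number."""
    cpu: float
    memory: float
    network: float

    def __add__(self, other: "Cost") -> "Cost":
        return Cost(self.cpu + other.cpu,
                    self.memory + other.memory,
                    self.network + other.network)


def equality_selectivity(col: ColumnStats) -> float:
    """Fraction of rows matching `col = constant`, assuming uniform values."""
    if col.distinct_values <= 0:
        return 0.0
    return (1.0 - col.null_fraction) / col.distinct_values


def range_selectivity(col: ColumnStats, low: float, high: float) -> float:
    """Fraction of rows with low <= col <= high, assuming uniform values."""
    width = col.max_value - col.min_value
    if width <= 0:
        # Single-valued column: all non-NULL rows match, or none do.
        inside = low <= col.min_value <= high
        return (1.0 - col.null_fraction) if inside else 0.0
    overlap = max(min(high, col.max_value) - max(low, col.min_value), 0.0)
    return (1.0 - col.null_fraction) * overlap / width


def scan_cost(row_count: float, col: ColumnStats) -> Cost:
    """Assumed model: scanning a column costs CPU proportional to bytes read."""
    return Cost(cpu=row_count * col.avg_size_bytes, memory=0.0, network=0.0)
```

Multiplying a table's row count by one of these selectivities gives the estimated output row count of the filter, which is the input to every later join estimate.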
Join type selection
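A minimal sketch of how a choice between a repartitioned and a broadcast join could be made from size estimates. The cost model is an assumption for illustration: a repartitioned join ships both inputs across the network once, while a broadcast join ships the build side to every worker and needs it to fit in each worker's memory. The function name and parameters are hypothetical.

```python
def choose_join_distribution(probe_bytes: float,
                             build_bytes: float,
                             workers: int,
                             max_memory_per_node_bytes: float) -> str:
    """Pick a join distribution from estimated input sizes (assumed model)."""
    # Repartitioned: both sides are hashed and sent over the network once.
    repartitioned_network = probe_bytes + build_bytes
    # Broadcast: the build side is copied to every worker; the probe side stays put.
    broadcast_network = build_bytes * workers
    # Broadcast also requires the whole build side to fit on each worker.
    broadcast_fits = build_bytes <= max_memory_per_node_bytes

    if broadcast_fits and broadcast_network < repartitioned_network:
        return "BROADCAST"
    return "REPARTITIONED"


# A small dimension table joined to a large fact table favors broadcast.
choice = choose_join_distribution(
    probe_bytes=10**12, build_bytes=10**6, workers=10,
    max_memory_per_node_bytes=10**9)
```

Under this model, two similarly sized inputs fall back to a repartitioned join because copying either side to every worker would move more data than hashing both.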
Join left/right side decision
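A minimal sketch of the side decision, assuming a hash join in which the right input is the build side (the one loaded into a hash table) and the left input is the probe side. The `JoinInput` container and `pick_sides` helper are hypothetical names used only for illustration, and the sketch covers inner joins only, since swapping the sides of an outer join would also require flipping the join type.

```python
from typing import NamedTuple, Tuple


class JoinInput(NamedTuple):
    """Estimated size of one join input (hypothetical container)."""
    name: str
    estimated_rows: float
    avg_row_bytes: float

    @property
    def estimated_bytes(self) -> float:
        return self.estimated_rows * self.avg_row_bytes


def pick_sides(a: JoinInput, b: JoinInput) -> Tuple[JoinInput, JoinInput]:
    """Return (probe, build): the smaller input becomes the build side."""
    if a.estimated_bytes <= b.estimated_bytes:
        return b, a
    return a, b


orders = JoinInput("orders", estimated_rows=1_500_000, avg_row_bytes=100)
nation = JoinInput("nation", estimated_rows=25, avg_row_bytes=50)
probe, build = pick_sides(nation, orders)  # build side is the small table
```

Keeping the smaller input on the build side reduces the memory needed for the hash table regardless of the order in which the query text lists the tables.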
Join reordering
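To make the idea concrete, here is a minimal sketch of cost-based join reordering that tries every left-deep order and keeps the cheapest. It assumes a deliberately simple cost (the sum of estimated intermediate row counts) and hypothetical pairwise join selectivities; table pairs without a join condition are treated as cross joins. A production optimizer would prune the search rather than enumerate exhaustively.

```python
from itertools import permutations
from typing import Dict, FrozenSet, Tuple


def best_left_deep_order(
    row_counts: Dict[str, float],
    join_selectivity: Dict[FrozenSet[str], float],
) -> Tuple[Tuple[str, ...], float]:
    """Return the cheapest left-deep join order and its estimated cost."""

    def selectivity(a: str, b: str) -> float:
        # No join condition between a and b means a cross join (selectivity 1).
        return join_selectivity.get(frozenset((a, b)), 1.0)

    best_order, best_cost = None, float("inf")
    for order in permutations(row_counts):
        joined = [order[0]]
        rows = row_counts[order[0]]
        cost = 0.0
        for table in order[1:]:
            combined = 1.0
            for previous in joined:
                combined *= selectivity(previous, table)
            # Estimated output rows of joining the running result with `table`.
            rows = rows * row_counts[table] * combined
            cost += rows
            joined.append(table)
        if cost < best_cost:
            best_order, best_cost = order, cost
    return best_order, best_cost


row_counts = {"lineitem": 6_000_000, "orders": 1_500_000, "customer": 150_000}
join_selectivity = {
    frozenset(("lineitem", "orders")): 1 / 1_500_000,
    frozenset(("orders", "customer")): 1 / 150_000,
}
order, cost = best_left_deep_order(row_counts, join_selectivity)
```

With these assumed numbers the cheapest orders join `orders` with `customer` first and bring in the large `lineitem` table last, because that keeps the first intermediate result small.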
Join reordering with filter
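A minimal sketch of why a filter can change the preferred join order: the filter shrinks one input, so joining that input early shrinks every later intermediate result. The table names, row counts, and selectivities are made-up assumptions, and the cost is simply the sum of estimated intermediate row counts.

```python
from typing import Dict, FrozenSet, Sequence


def plan_cost(order: Sequence[str],
              row_counts: Dict[str, float],
              filter_selectivity: Dict[str, float],
              join_selectivity: Dict[FrozenSet[str], float]) -> float:
    """Sum of estimated intermediate row counts for a left-deep join order."""

    def base_rows(table: str) -> float:
        # Apply the table's filter (if any) before it enters a join.
        return row_counts[table] * filter_selectivity.get(table, 1.0)

    def selectivity(a: str, b: str) -> float:
        return join_selectivity.get(frozenset((a, b)), 1.0)

    joined = [order[0]]
    rows = base_rows(order[0])
    cost = 0.0
    for table in order[1:]:
        combined = 1.0
        for previous in joined:
            combined *= selectivity(previous, table)
        rows = rows * base_rows(table) * combined
        cost += rows
        joined.append(table)
    return cost


rows = {"fact": 1e9, "dim_big": 1e6, "dim_small": 1e3}
joins = {
    frozenset(("fact", "dim_big")): 1 / 1e6,
    frozenset(("fact", "dim_small")): 1 / 1e3,
}
no_filter: Dict[str, float] = {}
with_filter = {"dim_big": 0.001}  # assumed: the filter keeps 0.1% of dim_big

big_first = ("fact", "dim_big", "dim_small")
small_first = ("fact", "dim_small", "dim_big")
unfiltered = (plan_cost(big_first, rows, no_filter, joins),
              plan_cost(small_first, rows, no_filter, joins))
filtered = (plan_cost(big_first, rows, with_filter, joins),
            plan_cost(small_first, rows, with_filter, joins))
```

Without the filter the two orders cost about the same under this model; with the filter, joining the filtered `dim_big` first is far cheaper, so an order chosen by raw table size alone would be the wrong one.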
Join tree shapes
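The choice of tree shapes matters because it sets the size of the search space. As a small worked example (standard combinatorics, not a statement about which shapes Presto enumerates), the sketch below counts left-deep join orders, of which there are n!, against all binary join trees with ordered children, of which there are n! times the Catalan number C(n-1). The function names are hypothetical.

```python
from math import comb, factorial


def catalan(k: int) -> int:
    """k-th Catalan number: the number of binary tree shapes with k + 1 leaves."""
    return comb(2 * k, k) // (k + 1)


def left_deep_plans(n: int) -> int:
    """Left-deep trees: one shape, so only the table order varies."""
    return factorial(n)


def bushy_plans(n: int) -> int:
    """All binary trees with ordered children over n distinct tables."""
    return factorial(n) * catalan(n - 1)


print(left_deep_plans(4), bushy_plans(4))    # prints: 24 120
print(left_deep_plans(10), bushy_plans(10))  # prints: 3628800 17643225600
```

The gap grows quickly with the number of tables, which is why optimizers commonly restrict the shapes they consider or cap the number of tables reordered together.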
Benchmark results (on prem)
Chart: CBO off vs. CBO on
https://www.starburstdata.com/technical-blog/presto-cost-based-optimizer-rocks-the-tpc-benchmarks/
Benchmark results (cloud)
https://www.starburstdata.com/technical-blog/starburst-presto-on-aws-18x-faster-than-emr/
Cloud cost reduction
https://www.starburstdata.com/aws
● 7x improvement on average over EMR Presto
● EMR Presto cannot execute many TPC-DS queries
● All TPC-DS queries pass on Starburst Presto
Worth reading
https://www.starburstdata.com/category/technical-blog/
https://www.concurrencylabs.com/blog/starburst-presto-vs-aws-redshift/
http://bytes.schibsted.com/bigdata-sql-query-engine-benchmark/
https://virtuslab.com/blog/benchmarking-spark-sql-presto-hive-bi-processing-googles-cloud-dataproc/
Thank You!
Twitter: @starburstdata
Blog: https://www.starburstdata.com/technical-blog/

Presto Summit 2018 - 03 - Starburst CBO