SlideShare a Scribd company logo
1 of 35
Download to read offline
The Presto Cost-Based Optimizer
for interactive SQL on anything
Strata Data Conference - London
May 1, 2019
Wojciech.Biela@starburstdata.com
Piotr.Findeisen@starburstdata.com
Agenda
• Who we are?
• Presto - quick recap
• Presto’s CBO launch
• Recent CBO enhancements
• CBO Roadmap
2© 2019
Starburst
© 2019 3
Starburst Data
© 2019 4
Founded by Presto committers:
● Over 4 years of contributions to Presto
● Presto distro for on-prem and cloud env
● Supporting large customers in production
● Enterprise subscription add-ons
Notable features contributed:
● ANSI SQL syntax enhancements
● Execution engine improvements
● Security integrations
● Spill to disk
● Cost-Based Optimizer
https://www.starburstdata.com/presto-enterprise/
Starburst Presto & Cloud
© 2019 5
https://www.starburstdata.com/technical-blog/announcing-starburst-enterprise-302e-with-mission-control/
Presto
© 2019 6
7
Project History
8© 2019
FALL 2012
4 developers
start Presto
development
SUMMER 2017
180+ Releases
50+ Contributors
5000+ Commits
WINTER 2017
Starburst is founded by
a team of Presto
committers, Teradata
veterans
FALL 2013
Facebook open
sources Presto
SPRING 2015
Teradata joins the
community, begins
investing heavily in
the project
WINTER 2019
Presto Software
Foundation
established
The Presto
fan club
9© 2019
See more at
https://github.com/prestosql/presto/wiki/Presto-Users
Presto in Production
• Facebook: 10,000+ of nodes, HDFS (ORC, RCFile), sharded MySQL, 1000s of users
• Uber: 2,000+ nodes (clusters on prem.) with 160K+ queries daily over HDFS (Parquet/ORC)
• Twitter: 2,000+ nodes (several clusters on premises and GCP), 20K+ queries daily (Parquet)
• LinkedIn: 500+ nodes, 200K+ queries daily over HDFS (ORC), and ~1000 users
• Lyft: 400+ nodes in AWS, 100K+ queries daily, 20+ PBs in S3 (Parquet)
• Netflix: 300+ nodes in AWS, 100+ PB in S3 (Parquet)
• Yahoo! Japan: 200+ nodes for HDFS (ORC), and ObjectStore
• FINRA: 120+ nodes in AWS, 4PB in S3 (ORC), 200+ users
10© 2019
Why Presto?
Community-driven
open source project
High performance ANSI SQL engine
• New Cost-Based Query Optimizer
• Proven scalability
• High concurrency
Separation of compute and
storage
• Scale storage and compute
independently
• No ETL or data integration
necessary to get to insights
• SQL-on-anything
No vendor lock-in
• No Hadoop distro vendor lock-in
• No storage engine vendor lock-in
• No cloud vendor lock-in
© 2019 11
Built for Performance
Query Execution Engine:
• MPP-style pipelined in-memory execution
• Vectorized data processing
• Runtime query bytecode generation
• Memory efficient data structures
• Multi-threaded multi-core execution
• Optimized readers for columnar formats (ORC and Parquet)
• Predicate and column projection pushdown
• Now also Cost-Based Optimizer
12© 2019
Architecture
13© 2019
Data stream API
Worker
Data stream API
Worker
Coordinator
Metadata
API
Parser/
analyzer
Planner Scheduler
Worker
Client
Data location
API
Pluggable
Presto CBO
© 2019 14
Before we optimize ...
Join in Presto
• Hash Join
• Right table is in memory ("build table")
• Left table is streamed ("probe table")
• Can be broadcast or repartitioned
15© 2019
Before we optimize ...
Join in Presto
• Hash Join
• Right table is in memory ("build table")
• Left table is streamed ("probe table")
• Can be broadcast or repartitioned
• A join can be followed by a join, can be followed by a join…
16© 2019
CBO in a nutshell
Cost-Based Optimizer v1 includes:
• join reordering based on selectivity estimates and cost
• automatic join type selection (repartitioned vs broadcast)
• automatic left/right side selection for joined tables
• support for statistics stored in Hive Metastore
https://www.starburstdata.com/technical-blog/
17© 2019
Cost and Statistics
Cost calculation includes:
• CPU
• Memory usage
• Network I/O
Hive Metastore statistics:
• number of rows in a table
• number of distinct values in a column
• fraction of NULL values in a column
• minimum/maximum value in a column
• average data size for a column
18
Join left/right side decision
19© 2019
Join type selection
20© 2019
Join reordering
"Which customers are spending the most at our shop?"
SELECT c.custkey, sum(l.price)
FROM customer c
JOIN orders o ON c.custkey = o.custkey
JOIN lineitem l ON o.orderkey = l.orderkey
GROUP BY c.custkey
ORDER BY sum(l.price) DESC;
21
Join reordering
22© 2019
Join reordering with filter
"Which customers are spending the most on coffee?"
SELECT c.custkey, sum(l.price)
FROM customer c
JOIN orders o ON c.custkey = o.custkey
JOIN lineitem l ON o.orderkey = l.orderkey
WHERE l.item = 'coffee'
GROUP BY c.custkey
ORDER BY sum(l.price) DESC;
23
Join reordering with filter
24© 2019
Presto CBO Speedup
Duration of TPC-DS queries (lower is better)
© 2019 25
https://www.starburstdata.com/presto-benchmarks/
Cloud cost reduction
● on average 7x improvement vs EMR Presto (up to 18x faster!)
● EMR Presto cannot execute many TPC-DS queries (all pass on Starburst Presto)
● on average 4x improvement on simpler queries benchmark (TPC-H, 3rd party benchmark)
26© 2019
https://www.starburstdata.com/presto-aws/
https://www.concurrencylabs.com/blog/starburst-presto-vs-aws-emr-sql/
CBO Next Chapter
27© 2019
Enhancements to date
• Deciding on semi-join distribution type based on cost
• "… WHERE x IN (SELECT y FROM ...)" queries
• Capping a broadcasted table size
• Various minor fixes in cardinality estimation
• Peak memory estimation (more meaningful memory cost metric)
28© 2019
Collecting statistics
• ANALYZE table
• native in Presto
• Hive runtime no longer needed to collect stats
• Collecting table statistics on write
• low overhead, the data is being processed in memory
• Stats for AWS Glue Catalog
• Glue does not provide statistics support out of the box
• exclusive from Starburst
29© 2019
Statistics in other connectors
• CBO initially supported with Hive connector only
• We added stats for RDBMS connectors (Starburst)
• PostgreSQL
• MySQL
• SQL Server
• Teradata
• Oracle
• Enables CBO in federation use-cases
30© 2019
CBO Roadmap
© 2019 31
What’s next
• Stats support
• Improved stats for Hive and Glue, e.g. NDV estimates across partitions
• Stats for NoSQL connectors
• Core CBO and engine enhancements
• Involve connectors in optimizations (" *-pushdown ")
• Peak memory-based decisions
• Adjust cost model weights based on the hardware
• Cost and stats for more operation types
• Introduce Traits — consider alternative plans during optimizations
• Late materialization
• Adaptive optimizations
32© 2019
Further reading
https://prestosql.io/
https://www.starburstdata.com/technical-blog/
https://fivetran.com/blog/warehouse-benchmark
https://www.concurrencylabs.com/blog/starburst-presto-vs-aws-emr-sql/
http://bytes.schibsted.com/bigdata-sql-query-engine-benchmark/
https://virtuslab.com/blog/benchmarking-spark-sql-presto-hive-bi-processing-g
oogles-cloud-dataproc/
33
Thank You!
34
Twitter: @starburstdata @prestosql
Blog: www.starburstdata.com/technical-blog/
Newsletter: www.starburstdata.com/newsletter
© 2019
Rate today’s session
Session page on conference website O’Reilly Events App

More Related Content

What's hot

Presto Summit 2018 - 01 - Facebook Presto
Presto Summit 2018  - 01 - Facebook PrestoPresto Summit 2018  - 01 - Facebook Presto
Presto Summit 2018 - 01 - Facebook Prestokbajda
 
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache SparkDesigning the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache SparkDatabricks
 
Operationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At ScaleOperationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At ScaleDatabricks
 
An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...
An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...
An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...Databricks
 
Spark - Migration Story
Spark - Migration Story Spark - Migration Story
Spark - Migration Story Roman Chukh
 
End-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkEnd-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkBurak Yavuz
 
Building Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks DeltaBuilding Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks DeltaDatabricks
 
Presto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performanceDataWorks Summit
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Databricks
 
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...Data Con LA
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Spark Summit
 
NoSQL no more: SQL on Druid with Apache Calcite
NoSQL no more: SQL on Druid with Apache CalciteNoSQL no more: SQL on Druid with Apache Calcite
NoSQL no more: SQL on Druid with Apache Calcitegianmerlino
 
Membase Meetup 2010
Membase Meetup 2010Membase Meetup 2010
Membase Meetup 2010Membase
 
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...Data Con LA
 
Sparkler Presentation for Spark Summit East 2017
Sparkler Presentation for Spark Summit East 2017Sparkler Presentation for Spark Summit East 2017
Sparkler Presentation for Spark Summit East 2017Karanjeet Singh
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to OneSerg Masyutin
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureDatabricks
 
From Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache ApexFrom Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache ApexDataWorks Summit
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...HostedbyConfluent
 
Efficiently Building Machine Learning Models for Predictive Maintenance in th...
Efficiently Building Machine Learning Models for Predictive Maintenance in th...Efficiently Building Machine Learning Models for Predictive Maintenance in th...
Efficiently Building Machine Learning Models for Predictive Maintenance in th...Databricks
 

What's hot (20)

Presto Summit 2018 - 01 - Facebook Presto
Presto Summit 2018  - 01 - Facebook PrestoPresto Summit 2018  - 01 - Facebook Presto
Presto Summit 2018 - 01 - Facebook Presto
 
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache SparkDesigning the Next Generation of Data Pipelines at Zillow with Apache Spark
Designing the Next Generation of Data Pipelines at Zillow with Apache Spark
 
Operationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At ScaleOperationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At Scale
 
An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...
An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...
An End-to-End Spark-Based Machine Learning Stack in the Hybrid Cloud with Far...
 
Spark - Migration Story
Spark - Migration Story Spark - Migration Story
Spark - Migration Story
 
End-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache SparkEnd-to-End Data Pipelines with Apache Spark
End-to-End Data Pipelines with Apache Spark
 
Building Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks DeltaBuilding Robust Production Data Pipelines with Databricks Delta
Building Robust Production Data Pipelines with Databricks Delta
 
Presto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performancePresto query optimizer: pursuit of performance
Presto query optimizer: pursuit of performance
 
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metad...
 
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
 
NoSQL no more: SQL on Druid with Apache Calcite
NoSQL no more: SQL on Druid with Apache CalciteNoSQL no more: SQL on Druid with Apache Calcite
NoSQL no more: SQL on Druid with Apache Calcite
 
Membase Meetup 2010
Membase Meetup 2010Membase Meetup 2010
Membase Meetup 2010
 
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
 
Sparkler Presentation for Spark Summit East 2017
Sparkler Presentation for Spark Summit East 2017Sparkler Presentation for Spark Summit East 2017
Sparkler Presentation for Spark Summit East 2017
 
Lambda architecture: from zero to One
Lambda architecture: from zero to OneLambda architecture: from zero to One
Lambda architecture: from zero to One
 
Modern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data CaptureModern ETL Pipelines with Change Data Capture
Modern ETL Pipelines with Change Data Capture
 
From Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache ApexFrom Batch to Streaming ET(L) with Apache Apex
From Batch to Streaming ET(L) with Apache Apex
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
Efficiently Building Machine Learning Models for Predictive Maintenance in th...
Efficiently Building Machine Learning Models for Predictive Maintenance in th...Efficiently Building Machine Learning Models for Predictive Maintenance in th...
Efficiently Building Machine Learning Models for Predictive Maintenance in th...
 

Similar to Presto CBO for interactive SQL

Presto Summit 2018 - 03 - Starburst CBO
Presto Summit 2018  - 03 - Starburst CBOPresto Summit 2018  - 03 - Starburst CBO
Presto Summit 2018 - 03 - Starburst CBOkbajda
 
Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)Kent Graziano
 
Presto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 BostonPresto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 Bostonkbajda
 
Presto: Query Anything - Data Engineer’s perspective
Presto: Query Anything - Data Engineer’s perspectivePresto: Query Anything - Data Engineer’s perspective
Presto: Query Anything - Data Engineer’s perspectiveAlluxio, Inc.
 
Inawsidom - Data Journey
Inawsidom - Data JourneyInawsidom - Data Journey
Inawsidom - Data JourneyPhilipBasford
 
Query Anything, Anywhere with Kubernetes
Query Anything, Anywhere with KubernetesQuery Anything, Anywhere with Kubernetes
Query Anything, Anywhere with KubernetesAlluxio, Inc.
 
CloudOpen Japan - Controlling the cost of your first cloud
CloudOpen Japan - Controlling the cost of your first cloudCloudOpen Japan - Controlling the cost of your first cloud
CloudOpen Japan - Controlling the cost of your first cloudTim Mackey
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...DataStax
 
Standard Chartered Bank Cloud Journey
Standard Chartered Bank Cloud JourneyStandard Chartered Bank Cloud Journey
Standard Chartered Bank Cloud JourneyAmazon Web Services
 
IBM THINK 2019 - Self-Service Cloud Data Management with SQL
IBM THINK 2019 - Self-Service Cloud Data Management with SQL IBM THINK 2019 - Self-Service Cloud Data Management with SQL
IBM THINK 2019 - Self-Service Cloud Data Management with SQL Torsten Steinbach
 
Data & Analytics - Session 1 - Big Data Analytics
Data & Analytics - Session 1 -  Big Data AnalyticsData & Analytics - Session 1 -  Big Data Analytics
Data & Analytics - Session 1 - Big Data AnalyticsAmazon Web Services
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyNeo4j
 
HostBridge Virtual User Group December 2020
HostBridge Virtual User Group December 2020HostBridge Virtual User Group December 2020
HostBridge Virtual User Group December 2020HostBridge Technology
 
Neo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael MooreNeo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael MooreNeo4j
 
Loading Data into Amazon Redshift
Loading Data into Amazon RedshiftLoading Data into Amazon Redshift
Loading Data into Amazon RedshiftAmazon Web Services
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyNeo4j
 
Cruising in data lake from zero to scale
Cruising in data lake from zero to scaleCruising in data lake from zero to scale
Cruising in data lake from zero to scaleJohn Varghese
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyNeo4j
 
Azure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User StoreAzure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User StoreDataStax Academy
 

Similar to Presto CBO for interactive SQL (20)

Presto Summit 2018 - 03 - Starburst CBO
Presto Summit 2018  - 03 - Starburst CBOPresto Summit 2018  - 03 - Starburst CBO
Presto Summit 2018 - 03 - Starburst CBO
 
Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)Demystifying Data Warehousing as a Service (GLOC 2019)
Demystifying Data Warehousing as a Service (GLOC 2019)
 
Presto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 BostonPresto talk @ Global AI conference 2018 Boston
Presto talk @ Global AI conference 2018 Boston
 
Presto: Query Anything - Data Engineer’s perspective
Presto: Query Anything - Data Engineer’s perspectivePresto: Query Anything - Data Engineer’s perspective
Presto: Query Anything - Data Engineer’s perspective
 
Inawsidom - Data Journey
Inawsidom - Data JourneyInawsidom - Data Journey
Inawsidom - Data Journey
 
Query Anything, Anywhere with Kubernetes
Query Anything, Anywhere with KubernetesQuery Anything, Anywhere with Kubernetes
Query Anything, Anywhere with Kubernetes
 
CloudOpen Japan - Controlling the cost of your first cloud
CloudOpen Japan - Controlling the cost of your first cloudCloudOpen Japan - Controlling the cost of your first cloud
CloudOpen Japan - Controlling the cost of your first cloud
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
 
Standard Chartered Bank Cloud Journey
Standard Chartered Bank Cloud JourneyStandard Chartered Bank Cloud Journey
Standard Chartered Bank Cloud Journey
 
IBM THINK 2019 - Self-Service Cloud Data Management with SQL
IBM THINK 2019 - Self-Service Cloud Data Management with SQL IBM THINK 2019 - Self-Service Cloud Data Management with SQL
IBM THINK 2019 - Self-Service Cloud Data Management with SQL
 
Data & Analytics - Session 1 - Big Data Analytics
Data & Analytics - Session 1 -  Big Data AnalyticsData & Analytics - Session 1 -  Big Data Analytics
Data & Analytics - Session 1 - Big Data Analytics
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
HostBridge Virtual User Group December 2020
HostBridge Virtual User Group December 2020HostBridge Virtual User Group December 2020
HostBridge Virtual User Group December 2020
 
Neo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael MooreNeo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael Moore
 
Loading Data into Amazon Redshift
Loading Data into Amazon RedshiftLoading Data into Amazon Redshift
Loading Data into Amazon Redshift
 
Loading Data into Redshift
Loading Data into RedshiftLoading Data into Redshift
Loading Data into Redshift
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
Cruising in data lake from zero to scale
Cruising in data lake from zero to scaleCruising in data lake from zero to scale
Cruising in data lake from zero to scale
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
Azure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User StoreAzure + DataStax Enterprise Powers Office 365 Per User Store
Azure + DataStax Enterprise Powers Office 365 Per User Store
 

Recently uploaded

DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 

Recently uploaded (20)

DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 

Presto CBO for interactive SQL

  • 1. The Presto Cost-Based Optimizer for interactive SQL on anything Strata Data Conference - London May 1, 2019 Wojciech.Biela@starburstdata.com Piotr.Findeisen@starburstdata.com
  • 2. Agenda • Who we are? • Presto - quick recap • Presto’s CBO launch • Recent CBO enhancements • CBO Roadmap 2© 2019
  • 4. Starburst Data © 2019 4 Founded by Presto committers: ● Over 4 years of contributions to Presto ● Presto distro for on-prem and cloud env ● Supporting large customers in production ● Enterprise subscription add-ons Notable features contributed: ● ANSI SQL syntax enhancements ● Execution engine improvements ● Security integrations ● Spill to disk ● Cost-Based Optimizer https://www.starburstdata.com/presto-enterprise/
  • 5. Starburst Presto & Cloud © 2019 5 https://www.starburstdata.com/technical-blog/announcing-starburst-enterprise-302e-with-mission-control/
  • 7. 7
  • 8. Project History 8© 2019 FALL 2012 4 developers start Presto development SUMMER 2017 180+ Releases 50+ Contributors 5000+ Commits WINTER 2017 Starburst is founded by a team of Presto committers, Teradata veterans FALL 2013 Facebook open sources Presto SPRING 2015 Teradata joins the community, begins investing heavily in the project WINTER 2019 Presto Software Foundation established
  • 9. The Presto fan club 9© 2019 See more at https://github.com/prestosql/presto/wiki/Presto-Users
  • 10. Presto in Production • Facebook: 10,000+ of nodes, HDFS (ORC, RCFile), sharded MySQL, 1000s of users • Uber: 2,000+ nodes (clusters on prem.) with 160K+ queries daily over HDFS (Parquet/ORC) • Twitter: 2,000+ nodes (several clusters on premises and GCP), 20K+ queries daily (Parquet) • LinkedIn: 500+ nodes, 200K+ queries daily over HDFS (ORC), and ~1000 users • Lyft: 400+ nodes in AWS, 100K+ queries daily, 20+ PBs in S3 (Parquet) • Netflix: 300+ nodes in AWS, 100+ PB in S3 (Parquet) • Yahoo! Japan: 200+ nodes for HDFS (ORC), and ObjectStore • FINRA: 120+ nodes in AWS, 4PB in S3 (ORC), 200+ users 10© 2019
  • 11. Why Presto? Community-driven open source project High performance ANSI SQL engine • New Cost-Based Query Optimizer • Proven scalability • High concurrency Separation of compute and storage • Scale storage and compute independently • No ETL or data integration necessary to get to insights • SQL-on-anything No vendor lock-in • No Hadoop distro vendor lock-in • No storage engine vendor lock-in • No cloud vendor lock-in © 2019 11
  • 12. Built for Performance Query Execution Engine: • MPP-style pipelined in-memory execution • Vectorized data processing • Runtime query bytecode generation • Memory efficient data structures • Multi-threaded multi-core execution • Optimized readers for columnar formats (ORC and Parquet) • Predicate and column projection pushdown • Now also Cost-Based Optimizer 12© 2019
  • 13. Architecture 13© 2019 Data stream API Worker Data stream API Worker Coordinator Metadata API Parser/ analyzer Planner Scheduler Worker Client Data location API Pluggable
  • 15. Before we optimize ... Join in Presto • Hash Join • Right table is in memory ("build table") • Left table is streamed ("probe table") • Can be broadcast or repartitioned 15© 2019
  • 16. Before we optimize ... Join in Presto • Hash Join • Right table is in memory ("build table") • Left table is streamed ("probe table") • Can be broadcast or repartitioned • A join can be followed by a join, can be followed by a join… 16© 2019
  • 17. CBO in a nutshell Cost-Based Optimizer v1 includes: • join reordering based on selectivity estimates and cost • automatic join type selection (repartitioned vs broadcast) • automatic left/right side selection for joined tables • support for statistics stored in Hive Metastore https://www.starburstdata.com/technical-blog/ 17© 2019
  • 18. Cost and Statistics Cost calculation includes: • CPU • Memory usage • Network I/O Hive Metastore statistics: • number of rows in a table • number of distinct values in a column • fraction of NULL values in a column • minimum/maximum value in a column • average data size for a column 18
  • 19. Join left/right side decision 19© 2019
  • 21. Join reordering "Which customers are spending the most at our shop?" SELECT c.custkey, sum(l.price) FROM customer c JOIN orders o ON c.custkey = o.custkey JOIN lineitem l ON o.orderkey = l.orderkey GROUP BY c.custkey ORDER BY sum(l.price) DESC; 21
  • 23. Join reordering with filter "Which customers are spending the most on coffee?" SELECT c.custkey, sum(l.price) FROM customer c JOIN orders o ON c.custkey = o.custkey JOIN lineitem l ON o.orderkey = l.orderkey WHERE l.item = 'coffee' GROUP BY c.custkey ORDER BY sum(l.price) DESC; 23
  • 24. Join reordering with filter 24© 2019
  • 25. Presto CBO Speedup Duration of TPC-DS queries (lower is better) © 2019 25 https://www.starburstdata.com/presto-benchmarks/
  • 26. Cloud cost reduction ● on average 7x improvement vs EMR Presto (up to 18x faster!) ● EMR Presto cannot execute many TPC-DS queries (all pass on Starburst Presto) ● on average 4x improvement on simpler queries benchmark (TPC-H, 3rd party benchmark) 26© 2019 https://www.starburstdata.com/presto-aws/ https://www.concurrencylabs.com/blog/starburst-presto-vs-aws-emr-sql/
  • 28. Enhancements to date • Deciding on semi-join distribution type based on cost • "… WHERE x IN (SELECT y FROM ...)" queries • Capping a broadcasted table size • Various minor fixes in cardinality estimation • Peak memory estimation (more meaningful memory cost metric) 28© 2019
  • 29. Collecting statistics • ANALYZE table • native in Presto • Hive runtime no longer needed to collect stats • Collecting table statistics on write • low overhead, the data is being processed in memory • Stats for AWS Glue Catalog • Glue does not provide statistics support out of the box • exclusive from Starburst 29© 2019
  • 30. Statistics in other connectors • CBO initially supported with Hive connector only • We added stats for RDBMS connectors (Starburst) • PostgreSQL • MySQL • SQL Server • Teradata • Oracle • Enables CBO in federation use-cases 30© 2019
  • 32. What’s next • Stats support • Improved stats for Hive and Glue, e.g. NDV estimates across partitions • Stats for NoSQL connectors • Core CBO and engine enhancements • Involve connectors in optimizations (" *-pushdown ") • Peak memory-based decisions • Adjust cost model weights based on the hardware • Cost and stats for more operation types • Introduce Traits — consider alternative plans during optimizations • Late materialization • Adaptive optimizations 32© 2019
  • 34. Thank You! 34 Twitter: @starburstdata @prestosql Blog: www.starburstdata.com/technical-blog/ Newsletter: www.starburstdata.com/newsletter © 2019
  • 35. Rate today’s session Session page on conference website O’Reilly Events App