Query Optimizer: pursuit of performance
Martin Traverso, Facebook
Kamil Bajda-Pawlikowski, Starburst
@prestodb @starburstdata
DataWorks Summit
2018 @ San Jose, CA
Presto: SQL-on-Anything
Deploy Anywhere, Query Anything
Key Highlights
● ANSI SQL
● Interactive performance
● High concurrency
● Proven scalability
● Separation of compute and storage
● Query data where it lives (no ETL needed)
● Hadoop / cloud vendor agnostic
● Community-driven open source project
● Apache 2.0 license, hosted on GitHub
Project Timeline
©2017 Starburst Data, Inc. All Rights Reserved
FALL 2008: Facebook open sources Hive
FALL 2012: 6 developers start Presto development
FALL 2013: Facebook open sources Presto
SPRING 2015: Teradata joins the community, begins investing heavily in the project, connects Teradata to Presto via QueryGrid
SUMMER 2017: 180+ releases, 50+ contributors, 5000+ commits
WINTER 2017: Starburst is founded by a team of Presto committers and Teradata veterans
Presto Community
See more at https://github.com/prestodb/presto/wiki/Presto-Users
Presto in Production
Facebook: 1000s of nodes, HDFS (ORC, RCFile), sharded MySQL, 1000s of users
Uber: 800+ nodes (2 clusters on premises) with 200K+ queries daily over HDFS (Parquet/ORC)
Twitter: 800+ nodes (several clusters on premises) for HDFS (Parquet)
LinkedIn: 350+ nodes (2 clusters on premises), 40K+ queries daily over HDFS (ORC), 600+ users
Netflix: 250+ nodes in AWS, 40+ PB in S3 (Parquet)
Lyft: 200+ nodes in AWS, 20K+ queries daily, 20+ PBs in Parquet
Yahoo! Japan: 200+ nodes (4 clusters on premises) for HDFS (ORC), ObjectStore, and Cassandra
FINRA: 120+ nodes in AWS, 4PB in S3 (ORC), 200+ users
Built for Performance
Query Execution Engine:
● MPP-style pipelined in-memory execution
● Columnar and vectorized data processing
● Runtime query bytecode compilation
● Memory efficient data structures
● Multi-threaded multi-core execution
● Optimized readers for columnar formats (ORC and Parquet)
● Now also a Cost-Based Optimizer
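To illustrate the columnar, vectorized processing style listed above, here is a minimal toy kernel (not Presto code; all names are illustrative) that operates on one block of a single column at a time rather than row by row:

```java
public class VectorizedSketch {
    // Columnar engines process a whole block of values per column at once.
    // This toy kernel sums one column block, honoring a parallel null mask,
    // in a tight loop that is friendly to the JIT and the CPU cache.
    static long sumBlock(long[] columnBlock, boolean[] isNull) {
        long sum = 0;
        for (int i = 0; i < columnBlock.length; i++) {
            if (!isNull[i]) {
                sum += columnBlock[i];
            }
        }
        return sum;
    }
}
```

Running a per-column loop like this over large blocks is what the "columnar and vectorized" bullet refers to conceptually; the real engine also compiles such loops from query bytecode at runtime.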
Evolving the optimizer - Challenges
● Diverse and widespread production workloads
● Fast-changing codebase
● Many developers
● Large surface area and usage of plan IR
Before
● Monolithic visitor-based plan transformations
● Visitors responsible for walking and transforming plan tree
● Problems
○ Hard to add new operations (IR node types)
○ Hard to add new optimizations
○ Hard to test optimizers
class LimitPushdown {
    Plan optimize(Plan plan) { return plan.root.accept(this); }
    Node visitLimit(LimitNode node) { ... }
    Node visitProject(ProjectNode node) { ... }
    Node visitFilter(FilterNode node) { ... }
    ...
}
Now
● Granular rule-based transformations
● Rules responsible for transforming localized subplan structure
● Optimizer loop responsible for walking plan and driving rule application
● Benefits
○ Decouples traversal from rule application
○ Decouples adding new optimizations from adding new operations (IR node types)
○ Easier to reason about and test individual rule behavior
class PushLimitThroughProjectRule {
    Pattern getPattern() { return Patterns.limit().with(source().matching(project())); }
    Node apply(Node node) { ... }
}
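The optimizer loop that drives rule application can be sketched roughly as below. The class and method names are illustrative, not Presto's actual API, and the real loop also walks the plan tree to offer each rule to every sub-plan, not just the root:

```java
import java.util.List;

public class RuleLoopSketch {
    // Minimal stand-in for an IR node: just wraps an int for the demo.
    static final class Node {
        final int value;
        Node(int value) { this.value = value; }
    }

    interface Rule {
        // Returns a rewritten node, or null if the rule does not match.
        Node apply(Node node);
    }

    // Keep offering the plan to each rule until a fixpoint is reached
    // (no rule fires). This decouples traversal from rule application.
    static Node optimize(Node root, List<Rule> rules) {
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Rule rule : rules) {
                Node result = rule.apply(root);
                if (result != null) {
                    root = result;
                    changed = true;
                }
            }
        }
        return root;
    }
}
```

With a toy rule that decrements the value until it reaches zero, `optimize(new Node(3), ...)` converges to 0 after three rule firings, showing the fixpoint behavior.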
Migrating from monolithic optimizers to rules
● Fallback behavior
● Controlled via config option or per-query session property
● Removed after a few releases
optimizers = [
    RuleBasedOptimizer(
        legacy = LimitPushdown,
        rules = [
            PushLimitThroughProject,
            PushLimitThroughUnion,
            PushLimitThroughJoin
        ]
    ),
    PredicatePushdown,
    PruneUnusedColumns,
    AddExchanges,
    EliminateCrossJoins,
    ...
]
Adding cost-aware optimizers
● Just another rule
● Can reason about cost
optimizers = [
    RuleBasedOptimizer(
        rules = [
            PushLimitThroughProject,
            PushLimitThroughUnion,
            PushLimitThroughJoin,
            ReorderJoins
        ]
    ),
    ...
]

class ReorderJoins {
    ReorderJoins(CostComparator costComparator) { ... }
    Pattern getPattern() { ... }
    Node apply(Node node) { ... }
}
CBO in a nutshell
Cost-Based Optimizer v1 includes:
● support for statistics stored in Hive Metastore
● join reordering based on selectivity estimates and cost
● automatic join type selection (repartitioned vs broadcast)
● automatic left/right side selection for joined tables
https://www.starburstdata.com/technical-blog/
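The join distribution and side-selection decisions above can be sketched as a simple size-based heuristic. The class, method names, and threshold parameter are illustrative assumptions, not Presto's actual logic:

```java
public class JoinTypeSketch {
    enum Distribution { BROADCAST, PARTITIONED }

    // Broadcast replicates the build side to every node (no shuffle of the
    // probe side, but the whole build side must fit in memory per node);
    // partitioned (repartitioned) shuffles both sides on the join key.
    static Distribution choose(double estimatedBuildBytes, double perNodeMemoryLimit) {
        return estimatedBuildBytes <= perNodeMemoryLimit
                ? Distribution.BROADCAST
                : Distribution.PARTITIONED;
    }

    // Left/right side selection: build the hash table on the smaller input
    // so the larger input streams through as the probe side.
    static boolean buildOnRight(double leftBytes, double rightBytes) {
        return rightBytes <= leftBytes;
    }
}
```

Both decisions depend on size estimates derived from the statistics described on the next slide, which is why they become possible only with a cost-based optimizer.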
Statistics & Cost
Hive Metastore statistics:
● number of rows in a table
● number of distinct values in a column
● fraction of NULL values in a column
● minimum/maximum value in a column
● average data size for a column
Cost calculation includes:
● CPU
● Memory
● Network I/O
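One hedged sketch of how such per-resource components could be folded into a single comparable number is a weighted sum; the weighting scheme and names here are illustrative, not Presto's actual CostComparator implementation:

```java
public class CostSketch {
    // A cost vector with one component per resource, mirroring the
    // CPU / memory / network I/O components listed above.
    static final class Cost {
        final double cpu;
        final double memory;
        final double network;
        Cost(double cpu, double memory, double network) {
            this.cpu = cpu;
            this.memory = memory;
            this.network = network;
        }
    }

    // Collapse the vector into a weighted scalar so alternative plans
    // can be totally ordered and the cheapest one picked.
    static double weightedTotal(Cost c, double cpuWeight, double memoryWeight, double networkWeight) {
        return c.cpu * cpuWeight + c.memory * memoryWeight + c.network * networkWeight;
    }

    static boolean cheaper(Cost a, Cost b, double cpuWeight, double memoryWeight, double networkWeight) {
        return weightedTotal(a, cpuWeight, memoryWeight, networkWeight)
                < weightedTotal(b, cpuWeight, memoryWeight, networkWeight);
    }
}
```

Keeping the components separate until comparison time lets a deployment tune the weights, e.g. penalizing network I/O more heavily on clusters with slow interconnects.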
Join type selection
Join left/right side decision
Join reordering
Join reordering with filter
Filter estimation
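Filter estimation can be sketched from the Metastore statistics listed earlier (row count, distinct values, null fraction, min/max) under a uniform-distribution assumption. The formulas and names below are a common textbook approach, labeled here as an illustration rather than Presto's exact estimator:

```java
public class FilterEstimateSketch {
    // Assumed column statistics, mirroring the Hive Metastore stats
    // listed on the "Statistics & Cost" slide.
    static final class ColumnStats {
        final double distinctValues;
        final double nullFraction;
        final double min;
        final double max;
        ColumnStats(double distinctValues, double nullFraction, double min, double max) {
            this.distinctValues = distinctValues;
            this.nullFraction = nullFraction;
            this.min = min;
            this.max = max;
        }
    }

    // col = constant: assume rows are spread uniformly over distinct values.
    static double equalitySelectivity(ColumnStats s) {
        return (1.0 - s.nullFraction) / s.distinctValues;
    }

    // col < bound: assume values are uniform over [min, max]; NULLs never match.
    static double lessThanSelectivity(ColumnStats s, double bound) {
        if (bound <= s.min) {
            return 0.0;
        }
        if (bound >= s.max) {
            return 1.0 - s.nullFraction;
        }
        return (1.0 - s.nullFraction) * (bound - s.min) / (s.max - s.min);
    }

    // Output cardinality estimate = input rows scaled by selectivity.
    static double estimateRows(double tableRows, double selectivity) {
        return tableRows * selectivity;
    }
}
```

These per-filter estimates propagate upward: the estimated output cardinality of a filter becomes the input cardinality for the joins above it, which is what makes join reordering with filters cost-aware.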
Join tree shapes
Benchmark results (on prem)
(Chart: query runtimes compared with CBO off vs. CBO on)
https://www.starburstdata.com/technical-blog/presto-cost-based-optimizer-rocks-the-tpc-benchmarks/
Benchmark results (cloud)
https://www.starburstdata.com/aws
Benchmark results (Facebook)
Roadmap
● CBO enhancements:
○ Additional rewrites
○ Costing for more operators
○ Built-in statistics collection
○ Exposing statistics for additional connectors
○ Additional types of statistics (e.g., histograms)
● General functionality:
○ Spill to disk enhancements
○ Geospatial functions performance
○ New connectors (ElasticSearch, Kudu)
○ Resource-aware query submission
○ Misc performance improvements
Further reading
www.prestodb.io
www.starburstdata.com
https://eng.uber.com/presto/
https://www.kdnuggets.com/2018/04/presto-data-scientists-sql.html
https://www.oreilly.com/ideas/query-the-planet-geospatial-big-data-analytics-at-uber
https://allegro.tech/2017/06/presto-small-step-for-devops-engineer-big-step-for-big-data-analyst.html
https://www.slideshare.net/MartinTraverso/presto-at-facebook-presto-meetup-boston-1062015
http://engineering.grab.com/scaling-like-a-boss-with-presto
https://www.techatbloomberg.com/blog/reducing-application-development-time-connecting-apache-presto-accumulo/
Thank You!
@prestodb @starburstdata
www.starburstdata.com
www.prestodb.io
