The Future of PrestoDB

The past, present, and future of Presto

Philip S Bell
Developer Advocate @Meta | Presto Foundation Governing Board Member
Tim Meehan
Software Engineer @Meta | Presto Foundation Technical Steering Committee Chair

If you haven’t heard of Presto
State of the art SQL engine for Open Data Lake analytics
● Open Source
● Fast and Secure Distributed SQL
● Pluggable Connectors

Free and Open Source
● Permissive License
● Open and permissive governance
● Vibrant, evolving community

Truly Standard Query Language
● Use SQL against every data source
● Interactive speeds
● Connect all of your BI tools to power dashboards
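
A rough illustration of "SQL against every data source": Presto addresses tables as catalog.schema.table, so a single statement can join tables served by different connectors. The sketch below only builds the query text. The catalog, schema, table, and column names (hive.web.page_views, mysql.crm.customers, and so on) are hypothetical, as are the helper functions.

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Return a fully qualified Presto table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def federated_join_sql() -> str:
    """Build one SQL statement that joins tables from two different catalogs."""
    # Hypothetical names: page views stored in Hive, customers stored in MySQL.
    views = qualified("hive", "web", "page_views")
    customers = qualified("mysql", "crm", "customers")
    return (
        "SELECT c.region, count(*) AS views\n"
        f"FROM {views} AS v\n"
        f"JOIN {customers} AS c ON v.customer_id = c.id\n"
        "GROUP BY c.region"
    )


print(federated_join_sql())
```

The same text could be sent from any SQL client or BI tool connected to the cluster; nothing in the statement depends on where each table physically lives.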

Connect to everything in your architecture
● Use the Connector API to tap into your data where it lives
● Wide variety of supported file & table formats

CONNECTORS
● Accumulo
● BigQuery
● Black Hole
● Cassandra
● Delta Lake
● Druid
● Elasticsearch
● Hive
● Hive Security Config
● Iceberg
● JMX
● Kafka
● Kafka Tutorial
● Kudu
● Lark Sheets
● Local File
● Memory
● MongoDB
● MySQL
● Oracle
● Apache Pinot
● PostgreSQL
● Prometheus
● Redis
● Redshift
● SQL Server
● System
● Thrift

SUPPORTED FILETYPES
● ORC
● Parquet
● Avro
● RCFile
● SequenceFile
● JSON
● Text
● CSV
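
A connector is typically enabled by adding a catalog properties file (for example etc/catalog/mysql.properties) whose connector.name entry selects the connector. The sketch below renders such a file for the MySQL connector. The host, user, and password are placeholders, and render_catalog_properties is a hypothetical helper, not part of Presto.

```python
def render_catalog_properties(props: dict) -> str:
    """Render key=value lines in the style of a Presto catalog properties file."""
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


# Placeholder values for a MySQL catalog; replace them for a real deployment.
mysql_catalog = {
    "connector.name": "mysql",
    "connection-url": "jdbc:mysql://example.net:3306",
    "connection-user": "presto_reader",
    "connection-password": "change-me",
}

print(render_catalog_properties(mysql_catalog), end="")
```

The file name (minus the .properties suffix) becomes the catalog name used in queries, so a file named mysql.properties would be addressed as mysql.schema.table.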

Why use Presto?
● Used and maintained by industry leaders
● Proven at Meta Scale

What is Meta Scale?
● 40,000+ users
● 20+ data centers
● 100k+ servers
● Exabytes
● Intense SLA

OSS Roadmap 2021 H2
• Connectors
• Cloud Support
• Language Features
• Performance & Efficiency
• Reliability
• Security
• Transaction Manager Support
• User Experience
• Writer enhancements

What makes good Open Source?
● Clear and explicit guidelines for how to collaborate
● Code quality standards
● Mentorship of new contributors
● Open governance
● Responsive and respectful community interaction
● Room to experiment
● Truly flexible design from the beginning

State of OSS 2022 H2
• Connectors ☑
• Cloud Support ☑
• Language Features ☑
• Performance & Efficiency ☑
• Reliability ☑
• Security ☑
• Transaction Manager Support ☑
• User Experience ☑
• Writer enhancements ☑

Connectors
● BigQuery connector fixes (Ahana)
● Hive integration tests for 3.0 (Ahana)
● Kafka enhanced connector (Uber | Yang Yang, Hitarth Trivedi)
● Google sheets connector (Uber | Chen Liang)

Cloud Support
● Add flexible S3 Security Mapping, allowing for separate credentials or IAM roles (Ahana | Jalpreet)
● Additional AWS Glue integration tests (Ahana)
● Additional S3 integration tests (Ahana)

Language features
● Support for informational constraints (Ahana | Dave Simmen)
● Add semi join support (ByteDance | Yue Long)
● Add lateral view support (ByteDance | Yue Long)
● Support insert for Hive tables (ByteDance | Yue Long)
● Support insert overwrite/insert into for partitions (ByteDance | Yue Long)
● Support more implicit type conversions (ByteDance | Yue Long)
● Add functions: to_date, get_json_object, murmur3hash, max_pt (ByteDance | Yue Long)
● Add support for Hive UDF wrappers to reuse Hive functions in Presto (ByteDance | Yue Long)
● Support microsecond precision in Presto (ByteDance | Yue Long)

Performance & Efficiency
● Dynamic filtering with comparison operators (Twitter | Zhenxiao Luo)
● Support distributed spatial left join (Twitter | Zhenxiao Luo)
● Improve Hive split efficiency for small files (ByteDance | Yue Long)
● Remove redundant shuffles for special expressions and call expressions in group by and select statements (ByteDance | Yue Long)
● Add dynamic filtering for hash join (ByteDance | Yue Long)
● Limit CBO to certain known tables (Uber | Hitarth Trivedi)
● Enabling Multi level Caching with RaptorX for AWS (Ahana)
● Aria operator abstraction to support additional legs / data structures (Ahana | Vivek)
● Aria optimizations for Parquet (Ahana | Vivek)
● Co-located join via replicated read for data lake tables (Ahana | George)
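
The dynamic filtering items above can be pictured with a toy hash join. This is a minimal sketch of the general idea for an equality join, not Presto's implementation: the keys collected while building the hash table become a filter that drops probe-side rows before they reach the join. The function name and row layout are assumptions made for the illustration.

```python
from typing import Iterable


def hash_join_with_dynamic_filter(
    build_rows: Iterable[dict], probe_rows: Iterable[dict], key: str
) -> tuple[list[dict], int]:
    """Inner hash join; returns (joined rows, number of probe rows skipped early)."""
    # Build phase: hash the (usually smaller) build side on the join key.
    table: dict = {}
    for row in build_rows:
        table.setdefault(row[key], []).append(row)

    # The dynamic filter is the set of keys seen on the build side.
    # In a real engine it would be handed to the probe-side scan.
    dynamic_filter = set(table)

    joined, skipped = [], 0
    for probe in probe_rows:
        if probe[key] not in dynamic_filter:
            skipped += 1  # dropped before reaching the join
            continue
        for build in table[probe[key]]:
            joined.append({**build, **probe})
    return joined, skipped


dims = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
facts = [{"id": 1, "v": 10}, {"id": 3, "v": 30}, {"id": 2, "v": 20}, {"id": 9, "v": 90}]
rows, skipped = hash_join_with_dynamic_filter(dims, facts, "id")
print(len(rows), "joined rows;", skipped, "probe rows filtered out")
```

In this toy the filter runs in the same loop as the join, so it saves nothing. The benefit in a distributed engine comes from applying the filter at the scan, so that non-matching rows are never read or shuffled.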

Performance & Efficiency (continued)
● Execution strategy improvement for ORDER-BY performance (sort optimization) (Ahana)
● Provide Node level and table level insights into Presto-Alluxio cache (Alluxio | Beinan Wang)
● Provide Query-level insights into Presto-Alluxio cache (Alluxio | Bin Fan)
● Enhanced Soft-affinity scheduling (Alluxio | Beinan Wang)
● Enhanced Presto-Alluxio cache (Alluxio | Bin Fan)
● Unified FRC with Presto data cache (Alluxio | Bowen Ding)
● Implement filter/limit/aggregation pushdown for JDBC Database (Alibaba | Wu Yangping)
● Implement filter/limit/aggregation pushdown for Alibaba Tablestore (Alibaba | Wu Yangping)
● Implement filter/limit/aggregation pushdown for MongoDB (Alibaba | Wu Yangping)
● Implement JOIN pushdown for JDBC Database (Alibaba | Wu Yangping)
● Implement Query Result Cache (Alibaba | Ma Zhenlin)
● Improve Cassandra query performance by adding metadata cache (Alibaba | James Xu)
● Implement ORDER BY pushdown for Druid connector (Twitter | Chunxu Tang)
● Implement JOIN pushdown for Druid connector (Twitter | Chunxu Tang)
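
The pushdown items above share one idea: let the source evaluate part of the query so fewer rows travel to the engine. A minimal sketch with a stand-in source follows; ToySource and both query functions are hypothetical and do not reflect the Presto connector API.

```python
from typing import Callable, Optional


class ToySource:
    """Stand-in for a remote database; counts the rows it ships to the engine."""

    def __init__(self, rows: list[dict]):
        self.rows = rows
        self.rows_shipped = 0

    def scan(
        self,
        predicate: Optional[Callable[[dict], bool]] = None,
        limit: Optional[int] = None,
    ) -> list[dict]:
        # With pushdown, the predicate and limit are evaluated here, at the source.
        out: list[dict] = []
        for row in self.rows:
            if limit is not None and len(out) >= limit:
                break
            if predicate is None or predicate(row):
                out.append(row)
        self.rows_shipped += len(out)
        return out


def query_without_pushdown(src: ToySource, predicate, limit: int) -> list[dict]:
    # The engine pulls every row, then filters and limits locally.
    return [r for r in src.scan() if predicate(r)][:limit]


def query_with_pushdown(src: ToySource, predicate, limit: int) -> list[dict]:
    # The engine hands the predicate and limit to the source.
    return src.scan(predicate=predicate, limit=limit)


data = [{"x": i} for i in range(100)]
is_multiple_of_ten = lambda r: r["x"] % 10 == 0

plain, pushed = ToySource(data), ToySource(data)
same = query_without_pushdown(plain, is_multiple_of_ten, 3) == query_with_pushdown(
    pushed, is_multiple_of_ten, 3
)
print(same, plain.rows_shipped, pushed.rows_shipped)
```

Both paths return the same rows; the difference is how many rows the source has to ship to produce them.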

Reliability
● Disaggregated coordinators support in Presto. (Meta | Tim Meehan)
● Continued improvements to Presto on Spark. (Meta | Andrii Rosa)
● Hive metastore fall-back (Alluxio | Shouwei Chen)

Security
● Apache Ranger Integration (Ahana | Ashish and Reetika)
● Apache Ranger auditing support (Ahana | Reetika)
● Row level filtering (Ahana | Reetika)
● OAuth2 support in Presto (Ahana)
● Parquet encryption features (Uber | Chen Liang)

Transaction Manager Support
● DeltaLake connector for querying (Ahana)
● Improved Apache Hudi support in Presto (Ahana)
● Presto Iceberg connector support for ORC format (Twitter | Chunxu Tang)
● Presto Iceberg connector support for materialized view (Twitter | Chunxu Tang)
● Presto Iceberg connector v2 table (Alluxio | Beinan Wang)
● Presto Iceberg connector unit test improvement (Twitter | Chunxu Tang)

User Experience
● Adding line numbers to Presto plan. (Meta | Sreeni Viswanadha)
● Operator spilling (Meta | Vic Zhang, Rebecca Schlussel, Arjun Gupta)
○ Bug fixes
○ Memory efficiency improvement
○ Sapphire (Presto-on-Spark) support
○ Stretch goals
■ More memory efficient distinct aggregation
■ Support more joins beyond inner join

Writer Enhancements
● Parquet Writer Optimization (Uber | Chen Liang)
● Integration of Parquet Column Index (Uber | Chen Liang)

OSS Roadmap 2023
• One Presto
– Single language for batch and all of interactive
– Great performance for low latency and fast batch
– Better out of the box reliability and scalability for a variety of uses

Colocated
Analytics
Dashboards
Online Analytics
Optimized queries
Point queries
Ad-hoc querying
Notebooks/BI tools
Unknown resource usage
Typically short running, with long running outliers
Presto
Short Running Batch
Ad-hoc queries adapted into recurring pipelines
Testing
Presto
Long Running Batch
Large ETL
Spark

Problems:
● Resource usage tends to grow, not shrink
● Without fault tolerance, error rate increases with duration
● Local memory failure triggers spilling, which decreases throughput
● Replatforming
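
The fault-tolerance point above can be made concrete with a simple model. The model is an assumption for illustration, not something stated in the talk: if each of n workers fails independently at a constant rate λ per hour, and any single failure kills the query, then a query that runs for t hours finishes with probability exp(-λ·n·t). The numbers below are illustrative only.

```python
import math


def success_probability(workers: int, failure_rate_per_hour: float, hours: float) -> float:
    """P(no worker fails during the query) under independent, constant-rate failures."""
    return math.exp(-failure_rate_per_hour * workers * hours)


# Illustrative numbers: 100 workers, one failure per worker per 10,000 hours.
for hours in (0.1, 1, 10):
    print(f"{hours:>4} h -> {success_probability(100, 1e-4, hours):.3f}")
```

Under this model the chance of finishing falls exponentially with both duration and cluster size, which is why long running batch work benefits most from being able to recover from a failure instead of restarting.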

Presto
Dashboards
Online Analytics
Optimized queries
Point queries
Ad-hoc querying
Notebooks/BI tools
Unknown resource usage
Typically short running, with long running outliers
Presto
Short Running Batch
Ad-hoc queries adapted into recurring pipelines
Testing
Presto
Long Running Batch
Large ETL
Presto

One language
● Batch
○ LBM deprecation
○ Presto-on-Spark hardening
○ Functionality improvements
● Interactive
○ Native acceleration

Auto-awesome
● History based optimizer
○ Automatic hash partition count
○ Improved join ordering
○ Improve hash aggregation enablement
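
One way to picture a history based optimizer is as a lookup from a plan fragment to what that fragment actually produced the last time it ran, consulted before falling back to static estimates. The sketch below is a toy under that assumption. HistoryStore, choose_build_side, and the fragment keys are hypothetical and are not Presto's classes or plan format.

```python
class HistoryStore:
    """Maps a canonical plan-fragment key to the row count observed on a past run."""

    def __init__(self) -> None:
        self._rows: dict[str, int] = {}

    def record(self, fragment_key: str, observed_rows: int) -> None:
        self._rows[fragment_key] = observed_rows

    def rows(self, fragment_key: str, fallback_estimate: int) -> int:
        # Prefer what actually happened; fall back to the static estimate.
        return self._rows.get(fragment_key, fallback_estimate)


def choose_build_side(history: HistoryStore, left: tuple, right: tuple) -> str:
    """left and right are (fragment_key, estimated_rows) pairs.

    Returns the key of the side expected to be smaller, which a hash join
    would normally use as its build side.
    """
    left_rows = history.rows(*left)
    right_rows = history.rows(*right)
    return left[0] if left_rows <= right_rows else right[0]


history = HistoryStore()
left = ("scan(orders) filter(status='open')", 1_000_000)
right = ("scan(customers)", 50_000)

first_choice = choose_build_side(history, left, right)   # static estimates only
history.record(left[0], 1_200)                           # the filter was very selective
second_choice = choose_build_side(history, left, right)  # history overrides the estimate
print(first_choice, "->", second_choice)
```

The same lookup could inform the other items on the slide, such as picking a hash partition count from observed data sizes instead of a fixed default.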

Auto-awesome
● Adaptive query execution (Sapphire)
○ Replan between each stage based on output stats
○ Join reordering
○ Skew join support

Auto-awesome
● Presto Pack
○ Schedule workload where more cycles are available
○ Increase performance and throughput

Auto-awesome
● Native integration
○ Move JVM off of the data path
○ Velox integration on streaming
○ Velox + Sapphire for fast, native batch execution

Auto-awesome
● Interactive fault tolerance
○ Low latency failure recoverability
○ Capacity pools to put a portion of workload on spot instances

Auto-awesome
● Large cluster support
○ Make Presto more flexible to influxes and reductions in capacity
○ Less horizontal scaling and smarter local routing
○ More hands-free

Presto wants you to help build the future

Where you can contribute to Presto
● Documentation
● File issues for missing functionality or ideas you have for Presto: GitHub issues
● Take open issues and offer to work on them.
● Contribute blog posts for your Presto use cases. Create a PR
● Contribute a connector/bug fix/etc.

Path to committership
● Work through “Getting Started”
● Read and agree to the Code of Conduct
● Join the Presto Slack
● File issues and help others with theirs
● Follow commit standards and coding guidelines
● Add new tests or modify existing tests with each commit
● Run tests locally and make sure they pass before submitting a PR
● Tests run against public CI/CD (when available in the future)
● Peer review code often
● Review Pull Requests for adherence to coding guidelines
● Update documentation
● Tag Issues and Pull Requests with relevant code areas

Tools used for presentation
● gource.io
● lucidchart.com
● slides.google.com
● draculatheme.com

Q&A
events.linuxfoundation.org/prestocon
