The Past, Present, and Future of Presto
Philip S Bell
Developer Advocate @Meta | Presto Foundation Governing Board Member
Tim Meehan
Software Engineer @Meta | Presto Foundation Technical Steering Committee Chair
If you haven’t heard of Presto
State-of-the-art SQL engine for Open Data Lake analytics
● Open Source
● Fast and Secure Distributed SQL
● Pluggable Connectors
Free and Open Source
● Permissive License
● Open and permissive governance
● Vibrant, evolving community
Truly Standard Query Language
● Use SQL against every data source
● Interactive speeds
● Connect all of your BI tools to power dashboards
Connect to everything in your architecture
● Use the Connector API to tap into your data where it lives
● Wide variety of supported file & table formats
Connect to everything in your architecture
CONNECTORS
● Accumulo
● BigQuery
● Black Hole
● Cassandra
● Delta Lake
● Druid
● Elasticsearch
● Hive
● Hive Security Config
● Iceberg
● JMX
● Kafka
● Kafka Tutorial
● Kudu
● Lark Sheets
● Local File
● Memory
● MongoDB
● MySQL
● Oracle
● Apache Pinot
● PostgreSQL
● Prometheus
● Redis
● Redshift
● SQL Server
● System
● Thrift
SUPPORTED FILETYPES
● ORC
● Parquet
● Avro
● RCFile
● SequenceFile
● JSON
● Text
● CSV
How Presto works
Common Use Cases
Why use Presto?
● Used and maintained by industry leaders
● Proven at Meta Scale
Users
Contributors
What is Meta Scale?
● 40,000+ users
● 20+ data centers
● 100k+ servers
● Exabytes
● Intense SLA
3 Years of OSS in summary
OSS Roadmap 2021 H2
• Connectors
• Cloud Support
• Language Features
• Performance & Efficiency
• Reliability
• Security
• Transaction Manager Support
• User Experience
• Writer enhancements
What does that look like?
What makes good Open Source?
● Clear and explicit guidelines for how to collaborate
● Code quality standards
● Mentorship of new contributors
● Open governance
● Responsive and respectful community interaction
● Room to experiment
● Truly flexible design from the beginning
State of OSS 2022 H2
• Connectors ☑
• Cloud Support ☑
• Language Features ☑
• Performance & Efficiency ☑
• Reliability ☑
• Security ☑
• Transaction Manager Support ☑
• User Experience ☑
• Writer enhancements ☑
Connectors
● BigQuery connector fixes (Ahana)
● Hive integration tests for 3.0 (Ahana)
● Kafka enhanced connector (Uber | Yang Yang, Hitarth Trivedi)
● Google Sheets connector (Uber | Chen Liang)
Cloud Support
● Add flexible S3 Security Mapping, allowing for separate credentials or IAM roles (Ahana | Jalpreet)
● Additional AWS Glue integration tests (Ahana)
● Additional S3 integration tests (Ahana)
Language Features
● Support for informational constraints (Ahana | Dave Simmen)
● Add semi join support (ByteDance | Yue Long)
● Add lateral view support (ByteDance | Yue Long)
● Support insert for Hive tables (ByteDance | Yue Long)
● Support insert overwrite/insert into for partitions (ByteDance | Yue Long)
● Support more implicit type conversions (ByteDance | Yue Long)
● Add functions: to_date, get_json_object, murmur3hash, max_pt (ByteDance | Yue Long)
● Add support for Hive UDF wrappers to reuse Hive functions in Presto (ByteDance | Yue Long)
● Support microsecond precision in Presto (ByteDance | Yue Long)
Performance & Efficiency
● Dynamic filtering with comparison operators (Twitter | Zhenxiao Luo)
● Support distributed spatial left join (Twitter | Zhenxiao Luo)
● Improve Hive split efficiency for small files (ByteDance | Yue Long)
● Remove redundant shuffles for special expressions and call expressions in GROUP BY and SELECT statements (ByteDance | Yue Long)
● Add dynamic filtering for hash join (ByteDance | Yue Long)
● Limit CBO to certain known tables (Uber | Hitarth Trivedi)
● Enable multi-level caching with RaptorX for AWS (Ahana)
● Aria operator abstraction to support additional legs / data structures (Ahana | Vivek)
● Aria optimizations for Parquet (Ahana | Vivek)
● Co-located join via replicated read for data lake tables (Ahana | George)
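The dynamic filtering items in the list above can be illustrated with a small sketch. This is a simplified, single-process model and not Presto's implementation: the function names (`scan`, `hash_join`) and the sample tables are hypothetical. The idea shown is that the keys collected while building the hash table are reused as a filter on the probe-side scan, so rows that cannot match never reach the join.

```python
from collections import defaultdict


def scan(rows, predicate=None):
    """Yield rows from a table scan, skipping rows the predicate rejects."""
    for row in rows:
        if predicate is None or predicate(row):
            yield row


def hash_join(build_rows, probe_rows, key, use_dynamic_filter=True):
    """Inner hash join; returns (joined rows, probe rows that reached the join)."""
    # Build phase: hash the (usually smaller) build side on the join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)

    # Dynamic filter: the set of build-side keys, applied during the probe scan.
    predicate = None
    if use_dynamic_filter:
        build_keys = set(table)
        predicate = lambda row: row[key] in build_keys

    joined, probed = [], 0
    for probe in scan(probe_rows, predicate):
        probed += 1
        for build in table.get(probe[key], ()):
            joined.append({**build, **probe})
    return joined, probed


# Hypothetical sample data: one small dimension table, one larger fact table.
regions = [{"region_id": 1, "region": "EU"}]
sales = [
    {"region_id": 1, "amount": 10},
    {"region_id": 2, "amount": 5},
    {"region_id": 1, "amount": 7},
]
filtered, probed_filtered = hash_join(regions, sales, "region_id")
unfiltered, probed_unfiltered = hash_join(
    regions, sales, "region_id", use_dynamic_filter=False
)
```

Both calls return the same joined rows; the difference is only in how many probe rows flow into the join operator. In a distributed engine the saving would come from skipping I/O and network transfer on the probe side.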
Performance & Efficiency
● Execution strategy improvement for ORDER-BY performance (sort optimization) (Ahana)
● Provide node-level and table-level insights into Presto-Alluxio cache (Alluxio | Beinan Wang)
● Provide Query-level insights into Presto-Alluxio cache (Alluxio | Bin Fan)
● Enhanced Soft-affinity scheduling (Alluxio | Beinan Wang)
● Enhanced Presto-Alluxio cache (Alluxio | Bin Fan)
● Unified FRC with Presto data cache (Alluxio | Bowen Ding)
● Implement filter/limit/aggregation pushdown for JDBC Database (Alibaba | Wu Yangping)
● Implement filter/limit/aggregation pushdown for Alibaba Tablestore (Alibaba | Wu Yangping)
● Implement filter/limit/aggregation pushdown for MongoDB (Alibaba | Wu Yangping)
● Implement JOIN pushdown for JDBC Database (Alibaba | Wu Yangping)
● Implement Query Result Cache (Alibaba | Ma Zhenlin)
● Improve Cassandra query performance by adding metadata cache (Alibaba | James Xu)
● Implement ORDER BY pushdown for Druid connector (Twitter | Chunxu Tang)
● Implement JOIN pushdown for Druid connector (Twitter | Chunxu Tang)
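To make the pushdown items above concrete, here is a minimal sketch of filter and limit pushdown. It uses an in-memory SQLite database as a stand-in for a remote JDBC source; the table, column names, and helper functions are assumptions made for illustration and are not taken from any Presto connector. It compares fetching every row and filtering in the engine against sending the filter and limit to the source.

```python
import sqlite3


def make_remote_db(row_count=100):
    """In-memory SQLite database standing in for a remote JDBC source."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(i, i * 10) for i in range(1, row_count + 1)],
    )
    return conn


def query_without_pushdown(conn, min_amount, limit):
    """Fetch the whole table, then filter and limit on the engine side."""
    rows = conn.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()
    kept = [row for row in rows if row[1] >= min_amount][:limit]
    return kept, len(rows)  # second value: rows transferred from the source


def query_with_pushdown(conn, min_amount, limit):
    """Let the source evaluate the filter and the limit."""
    rows = conn.execute(
        "SELECT id, amount FROM orders WHERE amount >= ? ORDER BY id LIMIT ?",
        (min_amount, limit),
    ).fetchall()
    return rows, len(rows)


conn = make_remote_db()
slow_rows, slow_transferred = query_without_pushdown(conn, 900, 5)
fast_rows, fast_transferred = query_with_pushdown(conn, 900, 5)
```

The two strategies return identical rows, but with pushdown only the five matching rows leave the source instead of all one hundred.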
Reliability
● Disaggregated coordinator support in Presto (Meta | Tim Meehan)
● Continued improvements to Presto on Spark (Meta | Andrii Rosa)
● Hive metastore fall-back (Alluxio | Shouwei Chen)
Security
● Apache Ranger Integration (Ahana | Ashish and Reetika)
● Apache Ranger auditing support (Ahana | Reetika)
● Row level filtering (Ahana | Reetika)
● OAuth2 support in Presto (Ahana)
● Parquet encryption features (Uber | Chen Liang)
Transaction Manager Support
● Delta Lake connector for querying (Ahana)
● Improved Apache Hudi support in Presto (Ahana)
● Presto Iceberg connector support for ORC format (Twitter | Chunxu Tang)
● Presto Iceberg connector support for materialized view (Twitter | Chunxu Tang)
● Presto Iceberg connector v2 table (Alluxio | Beinan Wang)
● Presto Iceberg connector unit test improvement (Twitter | Chunxu Tang)
User Experience
● Add line numbers to Presto plans (Meta | Sreeni Viswanadha)
● Operator spilling (Meta | Vic Zhang, Rebecca Schlussel, Arjun Gupta)
○ Bug fixes
○ Memory efficiency improvement
○ Sapphire (Presto-on-Spark) support
○ Stretch goals
■ More memory efficient distinct aggregation
■ Support more joins beyond inner join
Writer Enhancements
● Parquet Writer Optimization (Uber | Chen Liang)
● Integration of Parquet Column Index (Uber | Chen Liang)
OSS Roadmap 2023
• One Presto
– Single language for batch and all of interactive
– Great performance for low latency and fast batch
– Better out-of-the-box reliability and scalability for a variety of uses
● Dashboards (Colocated Analytics)
○ Online Analytics
○ Optimized queries
○ Point queries
● Ad-hoc querying (Presto)
○ Notebooks/BI tools
○ Unknown resource usage
○ Typically short running, with long-running outliers
● Short Running Batch (Presto)
○ Ad-hoc queries adapted into recurring pipelines
○ Testing
● Long Running Batch (Spark)
○ Large ETL
Problems:
● Resource usage tends to grow, not shrink
● Without fault tolerance, error rate increases with duration
● Local memory failure triggers spilling, which decreases throughput
● Replatforming
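The claim that error rate grows with duration when there is no fault tolerance can be shown with a back-of-the-envelope model. The sketch below assumes, purely for illustration, that workers fail independently at a constant rate and that a single worker failure kills the whole query; the failure rate used is a made-up number, not a measurement from any Presto deployment.

```python
import math


def query_failure_probability(workers, hours, failures_per_worker_hour):
    """Chance that at least one worker fails while the query is running.

    Assumes independent, constant-rate failures (an exponential model) and
    no fault tolerance, so any single failure fails the query.
    """
    expected_failures = failures_per_worker_hour * workers * hours
    return 1.0 - math.exp(-expected_failures)


# Hypothetical rate: one failure per 1,000 worker-hours, on 100 workers.
one_hour = query_failure_probability(100, 1, 0.001)    # roughly 0.095
ten_hours = query_failure_probability(100, 10, 0.001)  # roughly 0.632
```

Under these assumptions a query that runs ten times longer goes from failing about one time in ten to failing more often than it succeeds, which is the pressure behind adding fault tolerance for long-running batch work.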
● Dashboards (Presto)
○ Online Analytics
○ Optimized queries
○ Point queries
● Ad-hoc querying (Presto)
○ Notebooks/BI tools
○ Unknown resource usage
○ Typically short running, with long-running outliers
● Short Running Batch (Presto)
○ Ad-hoc queries adapted into recurring pipelines
○ Testing
● Long Running Batch (Presto)
○ Large ETL
One language
● Batch
○ LBM deprecation
○ Presto-on-Spark hardening
○ Functionality improvements
● Interactive
○ Native acceleration
Auto-awesome
● History-based optimizer
○ Automatic hash partition count
○ Improved join ordering
○ Improve hash aggregation enablement
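As a rough sketch of the history-based optimizer idea on this slide: statistics observed on earlier runs of a recurring query are stored under a canonical form of the query and reused to make planning decisions such as join ordering. Everything below is hypothetical (the `HistoryStore` class, the literal-stripping canonicalization, the default row estimate) and only illustrates the general technique, not how Presto implements it.

```python
import hashlib
import re


class HistoryStore:
    """Remembers observed output row counts, keyed by a canonical query form."""

    def __init__(self):
        self._rows = {}

    @staticmethod
    def canonical_key(fragment):
        # Replace literals so recurring queries that differ only in constants
        # (for example a date partition) share one history entry.
        normalized = re.sub(r"'[^']*'", "?", fragment.lower())
        normalized = re.sub(r"\b\d+\b", "?", normalized)
        normalized = " ".join(normalized.split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def record(self, fragment, rows):
        self._rows[self.canonical_key(fragment)] = rows

    def estimate(self, fragment, default):
        return self._rows.get(self.canonical_key(fragment), default)


def choose_build_side(store, left, right, default_rows=1_000_000):
    """Pick the join input with the smaller historical row count as build side."""
    left_rows = store.estimate(left, default_rows)
    right_rows = store.estimate(right, default_rows)
    return left if left_rows <= right_rows else right


store = HistoryStore()
# Row counts observed on yesterday's run of a recurring pipeline.
store.record("SELECT * FROM orders WHERE ds = '2022-10-01'", 5_000_000)
store.record("SELECT * FROM regions WHERE id < 100", 40)

# Today's run differs only in literals, so the history still applies.
todays_orders = "SELECT * FROM orders WHERE ds = '2022-10-02'"
todays_regions = "SELECT * FROM regions WHERE id < 200"
build_side = choose_build_side(store, todays_orders, todays_regions)
```

The same stored counts could feed other decisions listed on the slide, such as picking a hash partition count, without any manual tuning by the query author.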
Auto-awesome
● Adaptive query execution (Sapphire)
○ Replan between each stage based on output stats
○ Join reordering
○ Skew join support
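Replanning between stages can be pictured with a toy planner. In this sketch, which uses hypothetical names (`StageStats`, `plan_join`) and an arbitrary broadcast threshold, the join is first planned from estimates and then planned again once an upstream stage has reported its real output size. It is a simplified illustration of the idea, not the Sapphire implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StageStats:
    name: str
    estimated_rows: int
    actual_rows: Optional[int] = None  # filled in once the stage has run

    @property
    def rows(self) -> int:
        # Prefer observed output stats over the planner's estimate.
        return self.estimated_rows if self.actual_rows is None else self.actual_rows


def plan_join(left, right, broadcast_limit=10_000):
    """Use the smaller input as build side; broadcast it if it is small enough."""
    build, probe = sorted((left, right), key=lambda stage: stage.rows)
    strategy = "broadcast" if build.rows <= broadcast_limit else "partitioned"
    return {"build": build.name, "probe": probe.name, "strategy": strategy}


orders = StageStats("orders", estimated_rows=1_000_000)
active_users = StageStats("active_users", estimated_rows=5_000_000)

# Plan made before execution, from estimates only.
initial_plan = plan_join(orders, active_users)

# The upstream stage finishes and turns out to be far smaller than estimated.
active_users.actual_rows = 500
replanned = plan_join(orders, active_users)
```

Here the initial plan builds on `orders` with a partitioned join, while the replanned version flips the join order and switches to a broadcast join because the observed output is tiny.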
Auto-awesome
● Presto Pack
○ Schedule workload where more cycles are available
○ Increase performance and throughput
Auto-awesome
● Native integration
○ Move JVM off of the data path
○ Velox integration on streaming
○ Velox + Sapphire for fast, native batch execution
Auto-awesome
● Interactive fault tolerance
○ Low latency failure recoverability
○ Capacity pools to put a portion of workload on spot instances
Auto-awesome
● Large cluster support
○ Make Presto more flexible to influxes and reductions in capacity
○ Less horizontal scaling and smarter local routing
○ More hands-free
Presto wants you to help build the future
Where you can contribute to Presto
● Documentation
● File issues for missing functionality or ideas you have for Presto: GitHub issues
● Take open issues and offer to work on them.
● Contribute blog posts for your Presto use cases. Create a PR
● Contribute a connector/bug fix/etc.
Path to committership
● Work through “Getting Started”
● Read and agree to the Code of Conduct
● Join the Presto Slack
● File issues and help others with theirs
● Follow commit standards and coding guidelines
● Add new tests or modify existing tests with each commit
● Make sure local tests pass before submitting a PR
● Tests run against public CI/CD (when available in the future)
● Peer review code often
● Review Pull Requests for coding guidelines adherence
● Update documentation
● Tag Issues and Pull Requests with relevant code areas
Tools used for presentation
● gource.io
● lucidchart.com
● slides.google.com
● draculatheme.com
Q&A
events.linuxfoundation.org/prestocon

