
Presto talk @ Global AI conference 2018 Boston



Presented at Global AI Conference in Boston 2018:

Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Facebook, Airbnb, Netflix, Uber, Twitter, LinkedIn, Bloomberg, and FINRA, Presto has seen unprecedented growth in popularity in both on-premises and cloud deployments over the last few years. Presto is truly a SQL-on-Anything engine: a single query can access data from Hadoop, S3-compatible object stores, RDBMS, NoSQL, and custom data stores. This talk will cover some of the best use cases for Presto and recent advancements in the project, such as the Cost-Based Optimizer and geospatial functions, and will discuss the roadmap going forward.

Published in: Data & Analytics


  1. Twitter: @bigdataconf #GAIC
  2. Kamil Bajda-Pawlikowski, Co-founder and CTO. SQL-on-Anything Engine for Interactive Analytics. Global AI Conference 2018 @ Boston, MA
  3. Presto is SQL on anything: query anything, anywhere. © 2018
  4. The Presto fan club. See more at
  5. Presto in production
     ● Facebook: 1000s of nodes, HDFS (ORC, RCFile), sharded MySQL, 1000s of users
     ● Uber: 800+ nodes (2 clusters on premises) with 200K+ queries daily over HDFS (Parquet/ORC)
     ● Twitter: 800+ nodes (several clusters on premises) for HDFS (Parquet)
     ● LinkedIn: 350+ nodes (2 clusters on premises), 40K+ queries daily over HDFS (ORC), 600+ users
     ● Netflix: 250+ nodes in AWS, 40+ PB in S3 (Parquet)
     ● Lyft: 200+ nodes in AWS, 20K+ queries daily, 20+ PB in Parquet
     ● Yahoo! Japan: 200+ nodes (4 clusters on premises) for HDFS (ORC), ObjectStore, and Cassandra
     ● FINRA: 120+ nodes in AWS, 4 PB in S3 (ORC), 200+ users
  6. Presto timeline
     ● Fall 2008: Facebook open sources Hive
     ● Fall 2012: 6 developers start Presto development
     ● Fall 2013: Facebook open sources Presto
     ● Spring 2015: Teradata joins the Presto community, begins investing heavily in the project, and connects Teradata to Presto via QueryGrid
     ● Summer 2017: 180+ releases, 50+ contributors, 5000+ commits
     ● Winter 2017: Starburst is founded by a team of leading Presto committers and Teradata veterans
     © 2017 Starburst Data, Inc. All Rights Reserved
  8. Why Presto?
  9. Why Presto?
     Community-driven open source project
     High performance ANSI SQL engine
       • New Cost-Based Query Optimizer
       • Proven scalability
       • High concurrency
     Separation of compute and storage
       • Scale storage and compute independently
       • No ETL or data integration necessary to get to insights
       • SQL-on-anything
     No vendor lock-in
       • No Hadoop distro vendor lock-in
       • No storage engine vendor lock-in
       • No cloud vendor lock-in
  10. Some key contributions from our team
     ● Presto-Admin: for easy installation & management of Presto
     ● Security integrations: such as Kerberos, LDAP, and in-transit encryption
     ● ANSI SQL syntax: enhancements to fully support TPC-H and TPC-DS
     ● ODBC and JDBC drivers: to enable BI tools such as Tableau, Qlik, etc.
     ● Presto connectors: SQL Server, Cassandra, and Kafka
     ● Spill to disk: capabilities for large intermediate data sets
     ● Containerization: run Presto with Docker or Kubernetes
     ● Cost-Based Query Optimizer: providing 10-15x performance boost
  11. Beyond ANSI SQL
     Presto offers a wide variety of built-in functions, including:
     ● regular expression functions
     ● lambda expressions and functions
     ● geospatial functions
     Complex data types:
     ● JSON
     ● ARRAY
     ● MAP
     ● ROW / STRUCT
     Examples:
       SELECT regexp_extract_all('1a 2b 14m', '\d+'); -- [1, 2, 14]
       SELECT filter(ARRAY [5, -6, NULL, 7], x -> x > 0); -- [5, 7]
       SELECT transform(ARRAY [5, 6], x -> x + 1); -- [6, 7]
       SELECT c.city_id, count(*) AS trip_count
       FROM trips_table AS t
       JOIN city_table AS c
         ON st_contains(c.geo_shape, st_point(t.dest_lng, t.dest_lat))
       WHERE t.trip_date = '2018-05-01'
       GROUP BY 1
  12. Tools, bindings, extensibility
     ● Tools: JDBC / ODBC drivers for BI/SQL tools
     ● Bindings: C/C++, Go, Java, Node.js, Python, PHP, R, and Ruby on Rails
     ● Extensibility: UDFs, UDAFs, Connector SPI
  13. Presto on AWS
     Fully integrated with AWS:
     ● Amazon S3
     ● AWS Glue Catalog
     ● Autoscaling
     ● AWS Marketplace
  14. Presto on Azure
     Fully integrated with Azure HDInsight:
     ● Azure Blob Storage
     ● Azure Data Lake Storage
     ● External Hive Metastore
     ● Microsoft Power BI
     osoft-azure-customers/
  15. Presto Performance
  16. Built for Performance
     Query Execution Engine:
     ● MPP-style pipelined in-memory execution
     ● Columnar and vectorized data processing
     ● Runtime query bytecode compilation
     ● Memory-efficient data structures
     ● Multi-threaded multi-core execution
     ● Optimized readers for columnar formats (ORC and Parquet)
     ● Now also a Cost-Based Optimizer
  17. CBO in a nutshell
     Cost-Based Optimizer v1 includes:
     • support for statistics stored in Hive Metastore
     • join reordering based on selectivity estimates and cost
     • automatic join type selection (repartitioned vs broadcast)
     • automatic left/right side selection for joined tables
  18. Starburst Presto vs. GitHub Presto: duration of TPC-DS queries (lower is better)
  19. Cloud cost reduction
     ● On average 7x improvement vs EMR Presto
     ● EMR Presto cannot execute many TPC-DS queries
     ● All TPC-DS queries pass on Starburst Presto
  20. Roadmap
     ● CBO enhancements:
       ○ Additional rewrites
       ○ Costing for more operators
       ○ Built-in statistics collection
       ○ Exposing statistics for additional connectors
       ○ Additional types of statistics (e.g., histograms)
     ● General functionality:
       ○ Spill to disk enhancements
       ○ Geospatial functions performance
       ○ New connectors (Elasticsearch, Iceberg, Pulsar)
       ○ Resource-aware query submission
       ○ Misc performance improvements
  21. Further reading: oogles-cloud-dataproc/
  22. Thank You! Twitter: @starburstdata @prestodb Blog: Newsletter:
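
The built-in function examples on the "Beyond ANSI SQL" slide can be mirrored in plain Python to show what each call returns. This is only an illustrative sketch: the helper names below are hypothetical Python stand-ins, not part of Presto or any Presto client library. SQL NULL is modeled as None, and the filter helper simply drops None, which matches the slide's example but is a simplification of SQL's three-valued logic.

```python
import re


def regexp_extract_all(text, pattern):
    """Return every substring of `text` that matches `pattern`."""
    return re.findall(pattern, text)


def sql_filter(values, predicate):
    """Keep the elements for which `predicate` is true.

    Simplifying assumption: None (standing in for SQL NULL) never
    satisfies the predicate, so it is always dropped.
    """
    return [v for v in values if v is not None and predicate(v)]


def sql_transform(values, func):
    """Apply `func` to every element and return the results as a list."""
    return [func(v) for v in values]


# Python counterparts of the three one-line queries on the slide.
digits = regexp_extract_all("1a 2b 14m", r"\d+")
positives = sql_filter([5, -6, None, 7], lambda x: x > 0)
incremented = sql_transform([5, 6], lambda x: x + 1)

print(digits)       # ['1', '2', '14']
print(positives)    # [5, 7]
print(incremented)  # [6, 7]
```

Note that the regular expression needs the escaped form `\d+` to match runs of digits; a bare `d+` would match the letter "d" instead.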
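
The automatic join type and side selection listed on the "CBO in a nutshell" slide can be illustrated with a toy model. The sketch below is not Presto's cost model: the `TableStats` class, the `choose_join` function, the 100 MB broadcast threshold, and the row counts are all hypothetical, chosen only to show how table statistics could drive the choice of a smaller build side and of a broadcast versus repartitioned join.

```python
from dataclasses import dataclass

# Assumed threshold for this sketch only; not a Presto default.
BROADCAST_LIMIT_BYTES = 100 * 1024 * 1024


@dataclass(frozen=True)
class TableStats:
    """Hypothetical per-table statistics, as a metastore might expose them."""
    name: str
    row_count: int
    avg_row_bytes: int

    @property
    def size_bytes(self) -> int:
        return self.row_count * self.avg_row_bytes


def choose_join(left, right, broadcast_limit=BROADCAST_LIMIT_BYTES):
    """Pick the build side and the join distribution from size estimates.

    The smaller table becomes the build side. If it fits under the
    threshold it is broadcast to every worker; otherwise both sides
    are repartitioned on the join key.
    """
    if left.size_bytes <= right.size_bytes:
        build, probe = left, right
    else:
        build, probe = right, left
    if build.size_bytes <= broadcast_limit:
        distribution = "broadcast"
    else:
        distribution = "repartitioned"
    return {"probe": probe.name, "build": build.name, "distribution": distribution}


# Made-up statistics for the two tables used in the geospatial example.
trips = TableStats("trips_table", row_count=2_000_000_000, avg_row_bytes=120)
cities = TableStats("city_table", row_count=5_000, avg_row_bytes=2_000)

plan = choose_join(trips, cities)
print(plan)
```

With these made-up numbers the small city table ends up as the broadcast build side, while two large tables would fall back to a repartitioned join.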