Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Presto: SQL-on-anything


Published on

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook. One key feature in Presto is the ability to query data where it lives via a uniform ANSI SQL interface. Presto’s connector architecture creates an abstraction layer for anything that can be expressed in a row-like format, such as HDFS, Amazon S3, Azure Storage, NoSQL stores, relational databases, Kafka streams and even proprietary data stores. Furthermore, a single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.

This talk will be co-presented by Facebook and Teradata, the two largest contributors to Presto. The talk will focus on Presto’s ability to query virtually any data source via it’s connector interface. Facebook and Teradata will present some of their use cases of Presto querying various data sources, discuss the existing connectors in Presto, and describe the anatomy of a connector.

Published in: Technology
  • Login to see the comments

Presto: SQL-on-anything

  1. 1. Presto: SQL-on-Anything DataWorks Summit, San Jose 2017 Martin Traverso, Facebook Matt Fuller, Teradata
  2. 2. Let’s Make Some Noise! @DataWorksSummit #DWS17 #prestodb #facebook #teradata
  3. 3. What is Presto? • Open source distributed SQL query engine • Originally developed by Facebook • ANSI SQL compliant • Like Hive, it’s not a database • Key Differentiators • Performance & Scale • Cross platform query capability, not only SQL on Hadoop • Supports federated queries • Used in production at many well known web-scale companies • Distributed under the Apache License, hosted on GitHub
  4. 4. Multiple clusters (1000s of nodes total) 300PB in HDFS, MySQL, and Raptor 1000s users, 10-100s concurrent queries Presto in Action
  5. 5. 250+ nodes on AWS 40+ PB stored in S3 (Parquet) Over 650 users with 6K+ queries daily Presto in Action
  6. 6. 300+ nodes (2 dedicated clusters) 100K+ & 20K+ queries daily Presto in Action
  7. 7. 200+ nodes on-premises Parquet nested data Presto in Action
  8. 8. 120+ nodes in AWS 2PB is S3 and 200+ users supported by Teradata Presto in Action
  9. 9. Data Stream API Worker Data Stream API Worker Coordinator Metadata API Parser/ analyzer Planner Scheduler Worker Client Data Location API PluggablePresto Architecture
  10. 10. Presto is not a Database! • Presto is a distributed query execution engine • Storage Independent • Pluggable extensions • Connectors • Functions • Types • System access controllers • Resource group configuration managers • Event listeners • ... • Built-in core functionalities • parser, execution, types, sql functions, monitoring
  11. 11. Parser/ analyzer Planner Worker Metadata API Hive Cassandra Kafka MySQL … Scheduler Coordinator Presto Extensibility - Connector Data Location API Hive Cassandra Kafka MySQL … Data Stream API Hive Cassandra Kafka MySQL …
  12. 12. Amazon S3 Presto Connectors
  13. 13. Plugins
  14. 14. Classloader Isolation
  15. 15. Plugin Interface
  16. 16. Connector Configuration • Catalog namespace owned by connector Catalog Name
  17. 17. Connector Configuration
  18. 18. getTableHandle(...) Table Handle getTableMetadata(...) Query Analysis
  19. 19. Query Planning
  20. 20. getSplits(...) Iterator<Split> Query Execution Split • Handle to logical chunk of a table • Attributes • Remote access? • Location
  21. 21. Coordinator Worker Worker Worker + splits Query Execution + splits + splits
  22. 22. getNextPage() Table Scan Operator PageSource Page Filter Operator Aggregation Operator Query Execution
  23. 23. Presto Connectors @Facebook • Hive connector • Warehouse (ad-hoc / batch) • Raptor connector • Dashboards • Reporting backend for A/B testing framework • Sharded MySQL connector • Reporting backend for user-facing products • Other custom connectors for specialized data stores
  24. 24. Presto Connectors @Teradata Customers • Teradata QueryGrid + Presto • Teradata, Hadoop, S3, Cassandra, RESTful • Customer Use Cases • Recent sales data in Teradata needs to be joined with archived sales data that resided in Hadoop • Hadoop user using Presto needs to access pre-computed financial record in Teradata • Existing supplier data that is in Teradata is joined with archived product data that resides in Amazon S3
  25. 25. AMP AMP AMP AMP Q G E x c h a n g e Q G E x c h a n g e PE Coordinator Worker Thread Worker Thread Worker Thread Worker Thread Init & metadata exchange Bi-directional fully parallel data exchange TERADATA PRESTO • Key features: • Low latency • High performance • Concurrency • Pushdown • Data conversion • Compression • Efficient CPU usage Teradata QueryGrid (powered by Presto)
  26. 26. Teradata QueryGrid SQL Examples Teradata query joining data from Hadoop via Presto: SELECT * FROM websales_current UNION ALL SELECT * FROM websales_archive@presto; Presto query joining data in Teradata: SELECT * FROM td.sales.websales_current UNION ALL SELECT * FROM hive.sales.websales_archive;
  27. 27. Conclusions • Presto Connector API is expressive • 3rd Party data source is 1st class citizen • Single ANSI SQL to rule them all • Use BI tools on data which is not BI friendly • Rapid data integration
  28. 28. Write your own connector! • Issue SQL to GitHub! • • SELECT count(*) FROM prestodb.presto.stargazers; • Connector Example • • Documentation •
  29. 29. Additional Resources • Website • • Presto Users Groups • • GitHub: • • (Teradata’s development “fork”)