Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Presto - SQL on anything

1,814 views

Published on

One of the key differences between Presto and Hive, also a crucial functional requirement Facebook made when launching this new SQL engine project, was to have the opportunity to query different kinds of data sources via a uniform ANSI SQL interface.
Presto, an open source distributed analytical SQL engine, implements this with it’s connector architecture, creating an abstraction layer for anything that can be expressed as in a row-like format, ranging from MySQL tables, HDFS, Amazon S3 to NoSQL stores, Kafka streams and proprietary data sources. Presto connector SPI allows anyone to implement a Presto connector and benefit from the capabilities of the Presto SQL engine, enabling them to join data from various sources within a single SQL query.

Published in: Software
  • Be the first to comment

Presto - SQL on anything

  1. 1. 1 Presto - SQL on anything January 2017 Grzegorz Kokosiński Karol Sobczak Teradata Center for Hadoop
  2. 2. 2 Agenda - Who are we? - What is Presto? - What is data federation? - Different federation strategies in other databases (HIVE) - what is supported and what are the problems - Presto Connector - Show time
  3. 3. 3 Lets make some noise • Let tweet about this presentation! – #whug – #prestodb – #teradata • Later on we will query that data!
  4. 4. 4 Who are we
  5. 5. 5 What is Presto? • 100% open source distributed SQL query engine - Originally developed by Facebook • Key Differentiators: - Performance & Scale - Cross platform query capability, not only SQL on Hadoop • Apache licensed, hosted on GitHub - Certified distro & support from Teradata
  6. 6. 6 Presto Users See more at https://github.com/prestodb/presto/wiki/Presto-Users
  7. 7. 7 • Facebook – Multiple production clusters (100s of nodes total) - 300PB in HDFS, sharded MySQL, SSD-based Raptor – 1000s of internal daily active users – 10s-100s of concurrent queries • Netflix – 250+ node on EC2, 40+ PB in S3 (Parquet format) – Over 650 active users and 6K+ queries daily • Twitter – 200+ nodes on-premises over Parquet nested data • Uber – 200+ nodes (2 dedicated clusters) with 25K+ & 3K+ queries daily • FINRA – 120+ nodes in AWS, 2PB is S3, 200+ users (supported by Teradata) Presto in Production
  8. 8. 8 • In-memory processing • Pipelined execution across nodes (MPP-style) – Vectorized columnar processing – Multithreaded execution keeps all CPU cores busy • Presto is written in highly tuned Java – Efficient memory management (reduced GC overhead) – Very careful coding of inner loops – Runtime bytecode generation • Optimized ORC & Parquet readers • Excellent performance with interactive SQL analytics – Enables to use BI tools Presto – Query Execution Performance
  9. 9. 9 • Hadoop/Hive connector & file formats (HDFS/S3): – HDFS & S3 + HCatalog – ORC, RCFile, Parquet, SequenceFile, Text • Raptor – columnar store on flash driven by Facebook • Open source data stores (driven by the community) – MySQL & PostgreSQL (non-parallel) – Cassandra (by Teradata) – Kafka – Redis – MongoDB – ElasticSearch – Accumulo (by Bloomberg) Supported data sources & file formats
  10. 10. 10 [ WITH with_query [, ...] ] SELECT [ ALL | DISTINCT ] select_expr [, ...] [ FROM table1 [[ INNER | OUTER ] JOIN table2 ON (…)] [ WHERE condition ] [ GROUP BY expression [, ...] ] [ HAVING condition] [ UNION [ ALL | DISTINCT ] select ] [ ORDER BY expression [ ASC | DESC ] [, ...] ] [ LIMIT [ count | ALL ] ] In addition: • Windowing functions • UNNEST, TABLESAMPLE • ROLLUP, CUBE, GROUPING SETS • UNION, EXCEPT, INTERSECT • Subqueries (EXISTS, IN) ANSI SQL Support
  11. 11. 11 Presto is not a database! • Presto is a query execution engine (storage independent) • Pluggable custom user functionalities – Connectors – Functions – Types – System access controllers – Resource group configuration managers – Event listeners – … • Built-in core functionalities: – parser, execution, types, sql functions, monitoring
  12. 12. 12 Data federation • Query data from several data sources (databases) • Streaming – One to One - there is a single connection between database access points - e.g. PSQL via PSQL - using storage handlers to access RDBMS data from Hive – Many to One - many connections from one database nodes to a single access point of other database - Accessing REST from UDF in (possibly each) HIVE map/reduce task – Many to Many - workers talk to each other directly • Through storage – Needs (intermittent) data materialization • Presto supports them all!
  13. 13. 13 Data federation common problems • model incompatibilities • multinode streaming is not always possible • transactions • cost based optimizations (statistics) • SQL pushdown (predicates, projections, aggregations?, joins?)
  14. 14. 14 Connector • Presto interface to access arbitrary data source (hive, mysql, jmx) • Provides: – metadata – ability to distributed, parallel and streamed read/write – transaction boundary – physical data layouts – statistics – (SQL) predicate pushdown) – indexes (index join) – session or table properties – access control – procedures (CALL … – . . . • Most (if not all) of the above points are optional
  15. 15. 15 Presto Architecture Data stream API Worker Data stream API Worker Coordinator Metadata API Parser/ analyzer Planner Scheduler Worker Client Data location API Pluggable
  16. 16. 16 Data federation with Presto • Through the storage • Demo – HIVE HDFS DataNode HDFS DataNode Hive Metastore HDFS Namenode data transfer Presto worker Presto worker Presto coordinator data transfer metadata metadata
  17. 17. 17 Data federation with Presto • One to One • Demo – psql – REST – and above with HIVE Presto worker Presto worker Presto coordinator SQL Database JDBC metadataJDBC data
  18. 18. 18 Many to many - data federation with Presto AMP AMP AMP AMP Q G E x c h a n g e Q G E x c h a n g e PE Coordinator Worker Thread Worker Thread Worker Thread Worker Thread Init & metadata exchange Bi-directional fully parallel data exchange TERADATA PRESTO • Key features: • Low latency • High performance • Concurrency • SQL pushdown • Data conversion • Compression • Efficient CPU usage
  19. 19. 19 Conclusion • Presto Connector is expressive • 3rd party data source is 1st class citizen • Single ANSI SQL to rule them all – use BI tools on data which is not BI friendly • Rapid data integration
  20. 20. 20 Certified Distro: www.teradata.com/presto Website: www.prestodb.io Presto Users Group: www.groups.google.com/group/presto-users GitHub: www.github.com/prestodb/presto www.github.com/Teradata/presto More information
  21. 21. 21 www.teradata.com/presto

×