What's new in SQL on Hadoop and Beyond

Published in: Technology

  1. What's New in SQL-on-Hadoop and Beyond
     Martin Traverso, Facebook
     Kamil Bajda-Pawlikowski, Teradata
  2. Agenda
     ● Introduction
     ● Presto at Facebook
     ● Presto users and use cases
     ● New features
     ● Roadmap
  3. Introduction
  4. What is Presto
     ● Open source distributed SQL engine
     ● ANSI SQL syntax
     ● Custom built for interactive analytic queries
     ● Queries data across multiple data stores
     ● Flexible deployment (on premise or cloud)
     ● Extensible
  5. Presto at Facebook
  6. Presto @ Facebook
     ● Ad-hoc/interactive queries for Hadoop warehouse
     ● Batch processing for Hadoop warehouse
     ● Analytics for user-facing products
     ● Analytics over various specialized stores
  7. Hadoop Warehouse - Stats
     ● 1000s of internal daily active users
     ● Millions of queries each month
     ● Scan PBs of data every day
     ● Process trillions of rows every day
     ● 10s of concurrent queries
  8. Hadoop Warehouse - Batch
  9. Presto for User-facing Products
     ● Requirements
       ○ Hundreds of ms to seconds latency, low variability
       ○ Availability
       ○ Update semantics
       ○ 10 - 15 way joins
     ● Stats
       ○ > 99.99% query success rate
       ○ 100% system availability
       ○ 25 - 200 concurrent queries
       ○ 1 - 20 queries per second
       ○ <100ms - 5s latency
  10. Presto with Raptor
     ● Large data sets (petabytes)
     ● Milliseconds to seconds latency
     ● Predictable performance
     ● 5-15 minute load latency
     ● Reliable data loads (no duplicates, no missing data)
     ● High availability
     ● 10s of concurrent queries
  11. Presto users and use cases
  12. Presto users
     See more at https://github.com/prestodb/presto/wiki/Presto-Users
  13. Netflix stats
     Interactive, reporting, and app-driven queries
     Data warehouse: 40PB in S3
     ~250 nodes across multiple clusters
     ~650 users with ~6K+ queries/day
  14. Twitter stats
     Ad-hoc and low-latency queries
     ~200 nodes dedicated to Presto
     Parquet with nested data structures
  15. Uber stats
     2 clusters
     100+ machines
     2000+ queries per day
     HDFS on premise
  16. FINRA stats
     120+ EC2 nodes (r3.4xlarge)
     2+ PBs of data on S3 (bzip2 & ORC)
     200+ users
     Distro supported by Teradata
  17. New features
  18. SQL features
     ● DDL syntax: CREATE / ALTER / DROP TABLE
     ● DML syntax: INSERT / DELETE
     ● SQL features:
       ○ Data types: DECIMAL, VARCHAR(n), INT, SMALLINT, TINYINT
       ○ CUBE, ROLLUP, GROUPING SETS
       ○ INTERSECT
       ○ Non-equi joins
       ○ Uncorrelated subqueries
  19. Other features
     ● Performance
       ○ Join and aggregation optimizations
     ● Connectors
       ○ Redis
       ○ MongoDB
     ● Kerberos
     ● Presto-Admin
     ● Ambari and YARN (via Apache Slider)
  20. Drivers and BI tools
     ● Enterprise-grade ODBC & JDBC drivers
     ● BI tools certifications
       ○ Information Builders, Looker, MicroStrategy, MS Power BI, Qlik, Tableau, ZoomData
  21. Roadmap
  22. Short term
     ● LDAP
     ● SQL features
       ○ Data types: FLOAT, CHAR(n), VAR/BINARY(n)
       ○ EXISTS, EXCEPT
       ○ Correlated subqueries
       ○ Lambda expressions
       ○ Prepared statements
     ● Connectors
       ○ Accumulo (by Bloomberg)
  23. Long term
     ● Materialized Query Tables
     ● Workload management
     ● Spill to disk
     ● Cost-based Optimizer
     See more at https://github.com/prestodb/presto/wiki/Roadmap
  24. More about Presto
     GitHub: https://github.com/prestodb & https://github.com/Teradata/presto
     Website: http://prestodb.io
     Group: https://groups.google.com/group/presto-users
     Distro: http://www.teradata.com/presto
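
Three of the SQL features named on the "SQL features" slide (INTERSECT, non-equi joins, uncorrelated subqueries) are standard SQL, so their semantics can be sketched without a Presto cluster. The example below uses Python's built-in sqlite3 module purely as a stand-in engine. The tables (orders, returns, price_bands), their data, and the run_examples helper are hypothetical and do not come from the slides, and nothing here reflects how Presto plans or executes these queries.

```python
import sqlite3


def run_examples():
    """Run three illustrative queries against an in-memory SQLite database.

    SQLite is only a stand-in for a SQL engine; all table names and rows
    are made up for this sketch.
    """
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.executescript(
        """
        CREATE TABLE orders (customer TEXT, amount INTEGER);
        CREATE TABLE returns (customer TEXT);
        CREATE TABLE price_bands (band TEXT, low INTEGER, high INTEGER);
        INSERT INTO orders VALUES ('ann', 10), ('bob', 55), ('cat', 120);
        INSERT INTO returns VALUES ('bob'), ('dan');
        INSERT INTO price_bands VALUES
            ('small', 0, 49), ('medium', 50, 99), ('large', 100, 999);
        """
    )

    # INTERSECT: customers that appear in both orders and returns.
    intersect = cur.execute(
        "SELECT customer FROM orders "
        "INTERSECT "
        "SELECT customer FROM returns "
        "ORDER BY customer"
    ).fetchall()

    # Non-equi join: the join condition is a range test, not an equality.
    non_equi = cur.execute(
        "SELECT o.customer, b.band "
        "FROM orders o JOIN price_bands b "
        "ON o.amount BETWEEN b.low AND b.high "
        "ORDER BY o.customer"
    ).fetchall()

    # Uncorrelated subquery: the inner query does not reference the outer row,
    # so it can be evaluated once.
    above_avg = cur.execute(
        "SELECT customer FROM orders "
        "WHERE amount > (SELECT AVG(amount) FROM orders) "
        "ORDER BY customer"
    ).fetchall()

    conn.close()
    return {"intersect": intersect, "non_equi": non_equi, "above_avg": above_avg}


if __name__ == "__main__":
    for name, rows in run_examples().items():
        print(name, rows)
```

CUBE, ROLLUP, and GROUPING SETS are left out because this sketch assumes the stand-in engine does not support them; they would need to be tried against an engine that does.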
