1
Presto - SQL on anything
January 2017
Grzegorz Kokosiński
Karol Sobczak
Teradata Center for Hadoop
2
Agenda
- Who are we?
- What is Presto?
- What is data federation?
- Different federation strategies in other databases (Hive)
- What is supported and what are the problems
- Presto Connector
- Show time
3
Let's make some noise
• Let's tweet about this presentation!
– #whug
– #prestodb
– #teradata
• Later on we will query that data!
4
Who are we
5
What is Presto?
• 100% open source distributed SQL query engine
- Originally developed by Facebook
• Key Differentiators:
- Performance & Scale
- Cross platform query capability, not only SQL on Hadoop
• Apache licensed, hosted on GitHub
- Certified distro & support from Teradata
6
Presto Users
See more at https://github.com/prestodb/presto/wiki/Presto-Users
7
Presto in Production
• Facebook
– Multiple production clusters (100s of nodes total)
– 300 PB in HDFS, sharded MySQL, SSD-based Raptor
– 1000s of internal daily active users
– 10s-100s of concurrent queries
• Netflix
– 250+ nodes on EC2, 40+ PB in S3 (Parquet format)
– Over 650 active users and 6K+ queries daily
• Twitter
– 200+ nodes on-premises over Parquet nested data
• Uber
– 200+ nodes (2 dedicated clusters) with 25K+ & 3K+ queries daily
• FINRA
– 120+ nodes in AWS, 2 PB in S3, 200+ users (supported by Teradata)
8
Presto – Query Execution Performance
• In-memory processing
• Pipelined execution across nodes (MPP-style)
– Vectorized columnar processing
– Multithreaded execution keeps all CPU cores busy
• Presto is written in highly tuned Java
– Efficient memory management (reduced GC overhead)
– Very careful coding of inner loops
– Runtime bytecode generation
• Optimized ORC & Parquet readers
• Excellent performance with interactive SQL analytics
– Enables the use of BI tools
9
Supported data sources & file formats
• Hadoop/Hive connector & file formats (HDFS/S3):
– HDFS & S3 + HCatalog
– ORC, RCFile, Parquet, SequenceFile, Text
• Raptor
– columnar store on flash (driven by Facebook)
• Open source data stores (driven by the community)
– MySQL & PostgreSQL (non-parallel)
– Cassandra (by Teradata)
– Kafka
– Redis
– MongoDB
– Elasticsearch
– Accumulo (by Bloomberg)
10
ANSI SQL Support
[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expr [, ...]
[ FROM table1 [ [ INNER | OUTER ] JOIN table2 ON (…) ] ]
[ WHERE condition ]
[ GROUP BY expression [, ...] ]
[ HAVING condition ]
[ UNION [ ALL | DISTINCT ] select ]
[ ORDER BY expression [ ASC | DESC ] [, ...] ]
[ LIMIT [ count | ALL ] ]
In addition:
• Window functions
• UNNEST, TABLESAMPLE
• ROLLUP, CUBE, GROUPING SETS
• UNION, EXCEPT, INTERSECT
• Subqueries (EXISTS, IN)
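To make the grammar above concrete, here is a minimal sketch of one query that uses WITH, WHERE, GROUP BY, HAVING, ORDER BY and LIMIT together. It runs against Python's built-in sqlite3 module purely as a stand-in engine, not against Presto; the `tweets` table and its rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in SQL engine (assumption:
# the clauses exercised here behave the same way in any ANSI-style engine).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tweets (author TEXT, hashtag TEXT);
    INSERT INTO tweets VALUES
        ('ann', '#prestodb'), ('ann', '#whug'),
        ('bob', '#prestodb'), ('cat', '#teradata'),
        ('cat', '#prestodb');
""")

QUERY = """
WITH tagged AS (                       -- WITH with_query
    SELECT hashtag, author
    FROM tweets
    WHERE hashtag <> '#whug'           -- WHERE condition
)
SELECT hashtag, COUNT(DISTINCT author) AS authors
FROM tagged
GROUP BY hashtag                       -- GROUP BY expression
HAVING COUNT(*) >= 1                   -- HAVING condition
ORDER BY authors DESC, hashtag ASC     -- ORDER BY expression
LIMIT 2                                -- LIMIT count
"""

rows = conn.execute(QUERY).fetchall()
for hashtag, authors in rows:
    print(hashtag, authors)
```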
11
Presto is not a database!
• Presto is a query execution engine (storage independent)
• Pluggable custom user functionalities
– Connectors
– Functions
– Types
– System access controllers
– Resource group configuration managers
– Event listeners
– …
• Built-in core functionalities:
– parser, execution, types, SQL functions, monitoring
12
Data federation
• Query data from several data sources (databases)
• Streaming
– One to One
- there is a single connection between database access points
- e.g. PSQL via PSQL
- using storage handlers to access RDBMS data from Hive
– Many to One
- many connections from the nodes of one database to a single access point of another database
- accessing REST from a UDF in (possibly each) Hive map/reduce task
– Many to Many
- workers talk to each other directly
• Through storage
– Needs (intermediate) data materialization
• Presto supports them all!
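The core idea of the slide above, one SQL statement that reads from several databases, can be sketched with Python's built-in sqlite3 module. This is only a stand-in: SQLite's ATTACH DATABASE plays the role of a second data source, whereas Presto addresses sources through catalogs (catalog.schema.table). The `warehouse` and `crm` names and all rows are hypothetical.

```python
import sqlite3

# One engine connection with two separate attached databases. Each attached
# database stands in for an independent data source (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS warehouse")
conn.execute("ATTACH DATABASE ':memory:' AS crm")

conn.executescript("""
    CREATE TABLE warehouse.orders (user_id INTEGER, amount REAL);
    CREATE TABLE crm.users (id INTEGER, name TEXT);
    INSERT INTO warehouse.orders VALUES (1, 10.0), (1, 5.0), (2, 7.5);
    INSERT INTO crm.users VALUES (1, 'ann'), (2, 'bob');
""")

# A single query joins data that lives in two different databases.
rows = conn.execute("""
    SELECT u.name, SUM(o.amount) AS total
    FROM warehouse.orders AS o
    JOIN crm.users AS u ON u.id = o.user_id
    GROUP BY u.name
    ORDER BY u.name
""").fetchall()

for name, total in rows:
    print(name, total)
```

Here the engine pulls rows from both sources over one connection each, which corresponds most closely to the "One to One" streaming case.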
13
Data federation common problems
• Model incompatibilities
• Multi-node streaming is not always possible
• Transactions
• Cost-based optimizations (statistics)
• SQL pushdown (predicates, projections, aggregations?, joins?)
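The pushdown problem from the last bullet can be illustrated with a small sketch: the engine hands the source only the predicates it can evaluate and applies the remaining ones itself. Everything here (`Predicate`, `EqualityOnlySource`, `run_query`) is a hypothetical toy model for illustration, not Presto code.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]


@dataclass(frozen=True)
class Predicate:
    column: str
    op: str        # "=" or ">" in this toy model
    value: object

    def matches(self, row: Row) -> bool:
        if self.op == "=":
            return row[self.column] == self.value
        if self.op == ">":
            return row[self.column] > self.value
        raise ValueError(f"unsupported operator: {self.op}")


class EqualityOnlySource:
    """Hypothetical data source that can only filter by equality."""

    supported_ops = frozenset({"="})

    def __init__(self, rows: Iterable[Row]) -> None:
        self.rows = list(rows)
        self.rows_sent = 0  # how many rows left the source

    def scan(self, pushed: List[Predicate]) -> Iterator[Row]:
        for row in self.rows:
            if all(p.matches(row) for p in pushed):
                self.rows_sent += 1
                yield row


def run_query(source: EqualityOnlySource, predicates: List[Predicate]) -> List[Row]:
    # Push down what the source supports; keep the rest as a residual filter.
    pushed = [p for p in predicates if p.op in source.supported_ops]
    residual = [p for p in predicates if p.op not in source.supported_ops]
    return [row for row in source.scan(pushed)
            if all(p.matches(row) for p in residual)]


source = EqualityOnlySource([
    {"region": "eu", "amount": 5},
    {"region": "eu", "amount": 20},
    {"region": "us", "amount": 30},
    {"region": "us", "amount": 1},
])
result = run_query(source, [Predicate("region", "=", "eu"),
                            Predicate("amount", ">", 10)])
print(result, source.rows_sent)
```

Only the equality filter reaches the source, so two of the four rows are transferred and the engine discards one more. The more a source can evaluate, the less data has to move.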
14
Connector
• Presto's interface for accessing an arbitrary data source (Hive, MySQL, JMX)
• Provides:
– metadata
– distributed, parallel, and streamed reads/writes
– transaction boundary
– physical data layouts
– statistics
– (SQL) predicate pushdown
– indexes (index join)
– session or table properties
– access control
– procedures (CALL …)
– . . .
• Most (if not all) of the above points are optional
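As a rough illustration of the first two capabilities in the list above (metadata and parallel reads), here is a minimal sketch of a connector-like contract. Presto's real connector SPI is a set of Java interfaces; the Python class and method names below (`Connector`, `get_splits`, `read_split`, `scan_table`) are invented for this sketch and do not mirror it.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterator, List, Tuple

Row = Tuple[int, str]


class Connector(ABC):
    """Hypothetical, simplified connector contract (not Presto's real SPI)."""

    @abstractmethod
    def list_tables(self) -> List[str]: ...          # metadata

    @abstractmethod
    def get_columns(self, table: str) -> List[str]: ...  # metadata

    @abstractmethod
    def get_splits(self, table: str) -> List[int]: ...   # units of parallel work

    @abstractmethod
    def read_split(self, table: str, split: int) -> Iterator[Row]: ...


class InMemoryConnector(Connector):
    """Toy source: each split is a fixed-size slice of an in-memory list."""

    def __init__(self, tables: Dict[str, Tuple[List[str], List[Row]]],
                 split_size: int = 2) -> None:
        self._tables = tables
        self._split_size = split_size

    def list_tables(self) -> List[str]:
        return sorted(self._tables)

    def get_columns(self, table: str) -> List[str]:
        return list(self._tables[table][0])

    def get_splits(self, table: str) -> List[int]:
        row_count = len(self._tables[table][1])
        return list(range(0, row_count, self._split_size))

    def read_split(self, table: str, split: int) -> Iterator[Row]:
        rows = self._tables[table][1]
        yield from rows[split:split + self._split_size]


def scan_table(connector: Connector, table: str, workers: int = 2) -> List[Row]:
    """Read all splits of a table in parallel, one task per split."""
    splits = connector.get_splits(table)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = pool.map(lambda s: list(connector.read_split(table, s)), splits)
        return [row for chunk in chunks for row in chunk]


users = [(1, "ann"), (2, "bob"), (3, "cat")]
connector = InMemoryConnector({"users": (["id", "name"], users)})
print(connector.list_tables(), connector.get_columns("users"))
print(scan_table(connector, "users"))
```

The engine side (`scan_table`) only depends on the abstract contract, which is what lets a new data source be plugged in without changing the engine.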
15
Presto Architecture
[Architecture diagram. Labels: Client; Coordinator (Parser/analyzer, Planner, Scheduler); Workers; pluggable Metadata API, Data location API, and Data stream API.]
16
Data federation with Presto
• Through the storage
• Demo
– Hive
[Diagram. Components: Presto coordinator, Presto workers, Hive Metastore, HDFS NameNode, HDFS DataNodes. Arrow labels: metadata, data transfer.]
17
Data federation with Presto
• One to One
• Demo
– psql
– REST
– and the above combined with Hive
[Diagram. Components: Presto coordinator, Presto workers, SQL Database. Arrow labels: JDBC metadata, JDBC data.]
18
Many to many - data federation with Presto
[Diagram. Teradata side: PE and four AMPs. Presto side: Coordinator and four Worker Threads. Two "QG Exchange" components sit between the two systems. Arrow labels: "Init & metadata exchange" and "Bi-directional fully parallel data exchange".]
• Key features:
– Low latency
– High performance
– Concurrency
– SQL pushdown
– Data conversion
– Compression
– Efficient CPU usage
19
Conclusion
• The Presto Connector is expressive
• A 3rd-party data source is a 1st-class citizen
• Single ANSI SQL to rule them all
– use BI tools on data that is not BI-friendly
• Rapid data integration
20
More information
Certified Distro: www.teradata.com/presto
Website: www.prestodb.io
Presto Users Group: www.groups.google.com/group/presto-users
GitHub:
www.github.com/prestodb/presto
www.github.com/Teradata/presto
21
www.teradata.com/presto
