Level 101 for Presto
SQL on Everything
Part 1 of the Tech Talk Series for Presto
What is PrestoDB?
What’s the difference?
Beinan Wang
Sr. Software Engineer, Twitter
Dipti Borkar
Co-Founder & CPO, Ahana
Presto 101 Outline
● What is Presto?
● How are we using Presto?
● What made Presto different?
○ Scalable architecture
○ Pluggable Connectors
○ Performance
● The life of a query
What is Presto?
● Distributed SQL query engine
○ ANSI SQL on Hadoop, Kafka, Druid, etc.
○ Designed to be interactive
○ Access to petabytes of data
● Open source, hosted on GitHub
○ https://github.com/prestodb
● Open question:
○ Is Presto a database?
How are we using Presto?
● Ad hoc queries
● BI tools
● Dashboards
● A/B testing
● ETL / scheduled jobs
● Online service *
What made Presto different?
● Scalable architecture
● Pluggable Connectors
● Performance
Scalable architecture
● Two roles -- coordinator and worker
● Easy to scale up and down
○ Scales to 1,000 workers*
○ Fits web-scale companies
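The two-role design can be illustrated with a toy scheduler. This is a minimal sketch, not Presto's actual scheduling logic: the round-robin policy, the `assign_splits` function, and the split and worker names are all hypothetical. It only shows why adding workers reduces the amount of work each one receives.

```python
from collections import defaultdict
from itertools import cycle


def assign_splits(splits, workers):
    """Coordinator role: hand out splits to workers round-robin.

    Round-robin is an assumption made for illustration only.
    """
    if not workers:
        raise ValueError("at least one worker is required")
    assignment = defaultdict(list)
    # zip stops when the splits run out; cycle repeats the worker list.
    for split, worker in zip(splits, cycle(workers)):
        assignment[worker].append(split)
    return dict(assignment)


# Twelve hypothetical units of work for one query.
splits = [f"split-{i}" for i in range(12)]

small_cluster = assign_splits(splits, ["worker-1", "worker-2"])
large_cluster = assign_splits(
    splits, ["worker-1", "worker-2", "worker-3", "worker-4"]
)

# Doubling the workers halves the splits each worker has to process.
print(max(len(s) for s in small_cluster.values()))  # 6
print(max(len(s) for s in large_cluster.values()))  # 3
```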
Pluggable Presto Connectors
Presto Connector Data Model
● Connector: Driver for a data source.
○ Examples: HDFS, Cassandra, Kafka, SQL Server
● Catalog: Contains the schemas of a data source, accessed through a connector
● Schemas: Namespace to organize tables.
● Tables: Set of unordered rows organized into columns with types.
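The hierarchy above means a table is addressed as catalog.schema.table. The sketch below models that lookup in plain Python; the `Catalog`, `Table`, and `resolve` names, and the example catalogs and tables, are hypothetical and are not part of any Presto API.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    columns: dict                                # column name -> type name
    rows: list = field(default_factory=list)     # unordered rows


@dataclass
class Catalog:
    connector: str                               # connector backing this catalog
    schemas: dict = field(default_factory=dict)  # schema -> {table name -> Table}


def resolve(catalogs, qualified_name):
    """Resolve a 'catalog.schema.table' name to a Table object."""
    catalog_name, schema_name, table_name = qualified_name.split(".")
    return catalogs[catalog_name].schemas[schema_name][table_name]


# Two made-up catalogs, each served by a different connector.
catalogs = {
    "hive": Catalog(
        "hive",
        {"web": {"clicks": Table({"user_id": "bigint", "url": "varchar"})}},
    ),
    "kafka": Catalog(
        "kafka",
        {"default": {"events": Table({"payload": "varchar"})}},
    ),
}

table = resolve(catalogs, "hive.web.clicks")
print(table.columns)
```

Because every table name starts with a catalog, one query can reference tables that live in different data sources.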
Presto Hive Connector
Presto Hive Connector -- Access Control
Presto Hive Connector -- Data File Types
● Supported File Types
○ ORC
○ Parquet
○ Avro
○ RCFile
○ SequenceFile
○ JSON
○ Text
● No data ingestion needed
Presto Druid Connector
Why Presto is Fast
● In-Memory processing
● Pull model
● Columnar storage and execution
● Bytecode generation
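Two of these ideas, the pull model and columnar execution, can be shown with Python generators. This is a toy illustration under stated assumptions, not Presto's operator interface: the function names and the page layout (a dict of column name to list of values) are invented for the example.

```python
def scan(pages):
    """Source operator: hands out one columnar page at a time, on demand."""
    for page in pages:
        yield page


def filter_zero_discount(upstream):
    """Keep only the positions where discount == 0, column by column."""
    for page in upstream:
        keep = [i for i, d in enumerate(page["discount"]) if d == 0]
        yield {name: [values[i] for i in keep] for name, values in page.items()}


def sum_column(upstream, column):
    """Sink operator: pulls pages until the upstream is exhausted."""
    total = 0
    for page in upstream:
        total += sum(page[column])
    return total


# Each page stores values per column rather than per row.
pages = [
    {"orderkey": [1, 2, 3], "discount": [0, 5, 0], "tax": [10, 20, 30]},
    {"orderkey": [4, 5], "discount": [0, 0], "tax": [40, 50]},
]

# Nothing runs until the sink asks for data; each page then flows through
# the whole chain in memory without being materialized in between.
total_tax = sum_column(filter_zero_discount(scan(pages)), "tax")
print(total_tax)  # 130
```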
The Life of a Query -- Simple Scan
SELECT *
FROM orders
WHERE discount = 0
The Life of a Query -- Join and Aggregation
SELECT
  orders.orderkey, SUM(tax)
FROM orders
LEFT JOIN lineitem
  ON orders.orderkey = lineitem.orderkey
WHERE discount = 0
GROUP BY orders.orderkey
This example is from Presto: SQL on Everything https://research.fb.com/publications/presto-sql-on-everything/
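What the join and aggregation compute can be sketched as an in-memory hash join followed by a grouped sum. This is a simplified illustration, not Presto's execution plan. It assumes that discount and tax are columns of lineitem (as in TPC-H), that the right-hand table is the one hashed, and it uses made-up rows.

```python
from collections import defaultdict

# Made-up sample rows; only the columns used by the query are present.
orders = [{"orderkey": 1}, {"orderkey": 2}, {"orderkey": 3}]
lineitem = [
    {"orderkey": 1, "discount": 0, "tax": 2},
    {"orderkey": 1, "discount": 0, "tax": 3},
    {"orderkey": 2, "discount": 1, "tax": 7},
]

# Build phase: hash the lineitem rows by join key. This table is held in
# memory, so its size matters.
build = defaultdict(list)
for item in lineitem:
    build[item["orderkey"]].append(item)

# Probe phase: stream orders, look up matches, filter, and aggregate.
totals = defaultdict(int)
for order in orders:
    # LEFT JOIN: an order without line items still yields one row (None here).
    matches = build.get(order["orderkey"]) or [None]
    for item in matches:
        discount = item["discount"] if item else None
        # WHERE discount = 0: a missing (NULL) discount fails the predicate.
        if discount == 0:
            totals[order["orderkey"]] += item["tax"]

result = dict(totals)
print(result)  # {1: 5}
```

Order 2 is dropped because its discount is not 0, and order 3 is dropped because the unmatched row has no discount value to compare.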
Logical Plan -- do NOT join two big tables
This example is from Presto: SQL on Everything https://research.fb.com/publications/presto-sql-on-everything/
Limitations
● Memory Limitation
● Fault Tolerance
● Single Point of Failure: Coordinator
Time for a demo!
Local Setup
Query TPC-DS
Cloud Setup
Query S3 / Parquet
Docker Sandbox for Presto
https://hub.docker.com/r/ahanaio/prestodb-sandbox
AWS Sandbox AMI for Presto
https://ahana.io/tutorials/aws-sandbox/
Q&A
Join the Presto Community
● Request a new feature or file a bug: github.com/prestodb/presto
● Slack: prestodb.slack.com
● Twitter: @prestodb
Stay up-to-date with Ahana
● URL: ahana.io
● Twitter: @ahanaio
