SQL for 
Everything 
Presto: Distributed SQL Query Engine 
Masahiro Nakagawa 
Nov 6, 2014 
Cloudera World Tokyo
Who are you? 
> Masahiro Nakagawa 
> github/twitter: @repeatedly 
> Ingress: Blue 
> Treasure Data, Inc. 
> Senior Software Engineer 
> Fluentd / td-agent developer 
> I love OSS :) 
> D language - Phobos committer 
> Fluentd - Main maintainer 
> MessagePack / RPC - D and Python (RPC only) 
> The organizer of Presto Source Code Reading 
> etc…
SQL on Hadoop?
SQL Players on Hadoop 
This color indicates a commercial product 
> Batch (latency: minutes - hours) 
> Hive 
> Spark SQL 
> Short Batch / Low latency (latency: seconds - minutes) 
> Presto 
> Impala 
> Drill 
> HAWQ 
> Actian 
> etc… 
> Stream (latency: immediate) 
> Norikra 
> StreamSQL
SQL Players on Hadoop 
> Batch / Short Batch / Low latency - Red Ocean 
> Hive 
> Spark SQL 
> Presto 
> Impala 
> Drill 
> HAWQ 
> Actian 
> etc… 
> Stream - Blue Ocean? 
> Norikra 
> StreamSQL
Presto 
http://prestodb.io/
Presto overview 
> Open sourced by Facebook 
> https://github.com/facebook/presto 
• GitHub is the primary repository 
> written in Java 
> latest version is 0.81 
> Built-in useful features 
> Connectors 
> Machine Learning 
> Window function 
> Approximate query 
> etc…
What’s Presto? 
A distributed SQL query engine 
for interactive data analysis 
against GBs to PBs of data.
What problems does it solve? 
> We couldn’t visualize data in HDFS directly 
using dashboards or BI tools 
> because Hive is too slow (not interactive) 
> or ODBC connectivity is unavailable/unstable 
> We needed to store daily-batch results to an 
interactive DB for quick response 
(PostgreSQL, Redshift, etc.) 
> Interactive DB costs more & less scalable 
> Some data are not stored in HDFS 
> We need to copy the data into HDFS to analyze
[Diagram] Before: HDFS → Hive (daily/hourly batch) → PostgreSQL, etc. → Dashboard (interactive query). 
After: HDFS → Hive (daily/hourly batch), with Presto answering interactive queries from the Dashboard directly.
Data analysis platform 
[Diagram] Presto provides SQL on any data sets: HDFS (via Hive, with daily/hourly batch), Cassandra, MySQL, commercial DBs. Interactive queries feed dashboards and BI tools: 
✓ IBM Cognos 
✓ Tableau 
✓ ...
Presto’s deployment 
> Facebook 
> Multiple geographical regions 
> scaled to 1,000 nodes 
> actively used by 1,000+ employees 
> processing 1PB/day 
> Netflix, Dropbox, Treasure Data, Airbnb, 
Qubole, LINE, GREE, Scaleout, etc 
> Presto as a Service 
> Treasure Data, Qubole
PostgreSQL gateway for Presto 
> A PostgreSQL protocol gateway based on 
PostgreSQL’s stable ODBC / JDBC drivers 
> Developed by Sadayuki Furuhashi 
https://github.com/treasure-data/prestogres
Distributed architecture 
[Diagram] Client → Coordinator (Connector Plugin) → Worker / Worker / Worker → Storage / Metadata; all servers register with the Discovery Service.
What are Connectors? 
> Access to storage and metadata 
> provide table schema to coordinators 
> provide table rows to workers 
> Connectors are pluggable to Presto 
> written in Java 
> Implementations: 
> Hive(CDH, HDP, Community), Cassandra, 
MySQL, JDBC, Kafka, etc… 
> Or your own connector 
• Treasure Data has its own connector
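A connector is configured per catalog with a properties file; a minimal sketch (the metastore host is a placeholder, and connector names vary by Hive distribution) might look like:

```properties
# etc/catalog/hive.properties (hypothetical host)
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore.example.com:9083
```

Tables then become addressable as catalog.schema.table in queries.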
Multiple connectors in a query 
[Diagram] Client → Coordinator → Worker / Worker / Worker. The Hive Connector reads HDFS / Metastore, the Cassandra Connector reads Cassandra, and other connectors reach other data sources. The Discovery Service finds servers in a cluster.
Distributed architecture 
> 3 types of servers: 
> Coordinator, worker, discovery service 
> Get data/metadata through connector 
plugins. 
> Presto is NOT a database 
> Presto provides SQL to existing data stores 
> Client protocol is HTTP + JSON 
> Language bindings: 
Ruby, Python, PHP, Java (JDBC), R, Node.js...
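The HTTP + JSON protocol can be sketched in a few lines of Python; this is a rough sketch, not an official client — the coordinator URL is a placeholder, and collect_rows is our own helper for merging result pages:

```python
import json
import urllib.request

def collect_rows(pages):
    """Merge the 'data' arrays from a sequence of result-page dicts."""
    rows = []
    for page in pages:
        rows.extend(page.get("data", []))
    return rows

def run_query(coordinator, sql, user="demo"):
    """POST a statement, then follow nextUri until the query finishes."""
    req = urllib.request.Request(
        coordinator + "/v1/statement",
        data=sql.encode(),
        headers={"X-Presto-User": user},  # identifies the querying user
    )
    pages = []
    with urllib.request.urlopen(req) as resp:
        page = json.load(resp)
    pages.append(page)
    # Each response page may carry a nextUri to poll for more results.
    while "nextUri" in page:
        with urllib.request.urlopen(page["nextUri"]) as resp:
            page = json.load(resp)
        pages.append(page)
    return collect_rows(pages)

# e.g. run_query("http://coordinator.example.com:8080", "SELECT 1")
```

Because the protocol is plain HTTP + JSON, bindings in any language reduce to this POST-then-poll loop.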
Presto’s execution model 
> Presto is NOT MapReduce 
> Uses its own execution engine 
> Presto’s query plan is based on DAG 
> more like Apache Tez / Spark or 
traditional MPP databases 
> Impala and Drill use a similar model
Query Planner 
SQL: 
SELECT 
  name, 
  count(*) AS c 
FROM impressions 
GROUP BY name 
Table schema: 
impressions ( 
  name varchar, 
  time bigint 
) 
Logical query plan: 
Table scan (name:varchar) → GROUP BY (name, count(*)) → Output (name, c) 
Distributed query plan: 
Table scan → Partial aggr → Sink → Exchange → Final aggr → Sink → Exchange → Output
Query Planner - Stages 
Stage-2: Table scan → Partial aggr → Sink (pipelined aggregation) 
↓ inter-worker data transfer 
Stage-1: Exchange → Final aggr → Sink 
↓ inter-worker data transfer 
Stage-0: Exchange → Output
Execution Planner 
+ Node list (✓ 2 workers) 
Worker 1 and Worker 2 each run a Stage-2 task (Table scan → Partial aggr → Sink) and a Stage-1 task (Exchange → Final aggr → Sink); a single Stage-0 task (Exchange → Output) collects the results.
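The partial/final split can be illustrated with the sample query (SELECT name, count(*) FROM impressions GROUP BY name); a toy sketch, not Presto code — each "worker" counts the rows in its own split, and the final stage merges the partial counts:

```python
from collections import Counter

def partial_aggr(rows):
    """Stage-2: each worker counts names within its own split."""
    return Counter(name for name, _time in rows)

def final_aggr(partials):
    """Stage-1: merge the partial counts exchanged from the workers."""
    total = Counter()
    for p in partials:
        total += p
    return dict(total)

# Two hypothetical splits of the impressions table (name, time):
worker1 = [("a", 1), ("b", 2), ("a", 3)]
worker2 = [("b", 4), ("c", 5)]
result = final_aggr([partial_aggr(worker1), partial_aggr(worker2)])
# result: {"a": 2, "b": 2, "c": 1}
```

The partial step shrinks the data before the exchange, so only one row per group crosses the network.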
MapReduce vs. Presto 
MapReduce: map tasks write to disk, reduce tasks read from disk, and each stage waits for the previous one to finish; intermediate data is written to disk between stages. 
Presto: tasks stream results to each other with memory-to-memory data transfer. 
All stages are pipelined: 
✓ No wait time 
✓ No disk IO 
✓ No fault-tolerance 
✓ Data chunk must fit in memory
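The pipelined, memory-to-memory model can be mimicked with generators; a toy sketch (not Presto internals) where each stage pulls rows from the previous one as they are produced, with no intermediate materialization:

```python
def table_scan(rows):
    # Source stage: stream rows one at a time.
    for row in rows:
        yield row

def filter_stage(rows):
    # Downstream stage consumes each row as soon as it is produced,
    # so there is no wait between stages and nothing hits disk.
    for row in rows:
        if row % 2 == 0:
            yield row

pipeline = filter_stage(table_scan(range(10)))
result = list(pipeline)
# result: [0, 2, 4, 6, 8]
```

The trade-off mirrors the slide: nothing is checkpointed, so a failed stage means re-running the whole query.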
Demo
Presto Meetup 
The first half of 2015
Cloud service for the entire data pipeline, 
including Presto 
Check: treasuredata.com
