Understanding Presto
Presto meetup @ Tokyo #1

Sadayuki Furuhashi
Founder & Software Architect
Treasure Data, inc.
A little about me...
> Sadayuki Furuhashi
> github/twitter: @frsyuki
> Treasure Data, Inc.
> Founder & Software Architect
> Open-source hacker
> MessagePack - Efficient object serializer
> Fluentd - A unified data collection tool
> Prestogres - PostgreSQL protocol gateway for Presto
> Embulk - A bulk data loader with plugin-based architecture
> ServerEngine - A Ruby framework to build multiprocess servers
> LS4 - A distributed object storage with cross-region replication
> kumofs - A distributed strongly-consistent key-value data store
Today’s talk
1. Distributed & plug-in architecture
2. Query planning
3. Cluster configuration
4. Recent updates
1. Distributed & Plug-in architecture
[Diagram: Client → Coordinator (with Connector Plugin) → Worker / Worker / Worker → Storage / Metadata; all servers register with the Discovery Service]

1. Coordinator finds servers in the cluster via the Discovery Service
2. Client sends a query using HTTP
3. Coordinator builds a query plan
   (the connector plugin provides metadata: table schema, etc.)
4. Coordinator sends tasks to workers
5. Workers read data through the connector plugin
6. Workers run tasks in memory
7. Client gets the result from a worker
[Diagram: the Presto cluster (Client, Coordinator, Workers, Discovery Service) reaches data sources through connector plugins:
  Hive Connector → HDFS / Metastore
  JDBC Connector → PostgreSQL, MySQL
  other connectors … → Other data sources]
[Diagram: Presto JOINs data across PostgreSQL, HDFS / Metastore, and MySQL]

select orderkey, orderdate, custkey, email
from orders
join mysql.presto_test.users
on orders.custkey = users.id
order by custkey, orderdate;
[Diagram: Presto JOINs data across PostgreSQL / HDFS / Metastore and writes the result back to MySQL with INSERT INTO]

create table mysql.presto_test.recent_user_info
as
select users.id, users.email, count(1) as count
from orders
join mysql.presto_test.users
on orders.custkey = users.id
group by 1, 2;
1. Distributed & Plug-in architecture
> 3 types of servers
> Coordinator, Worker, Discovery server
> Gets data/metadata through connector plugins.
> Presto is stateless (Presto is NOT a database).
> Presto can provide distributed SQL to any data store.
• connectors are loosely coupled (which may add some overhead)
> Client protocol is HTTP + JSON
> Language bindings: Ruby, Python, PHP, Java, R, etc.
> ODBC & JDBC support by Prestogres
> https://github.com/treasure-data/prestogres
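The HTTP + JSON protocol above can be sketched in a few lines: the client POSTs the SQL text to /v1/statement on the coordinator, then follows the nextUri link in each JSON response, collecting "data" pages until the query completes. This is an illustrative sketch, not an official client; the coordinator URL, user name, and helper names below are made up.

```python
# Sketch of Presto's client protocol, assuming a coordinator at some URL.
# POST the SQL to /v1/statement, then follow "nextUri" links until the
# query is done, concatenating the "data" pages along the way.
import json
import urllib.request

def run_query(coordinator, sql, user="demo"):
    req = urllib.request.Request(
        coordinator + "/v1/statement",
        data=sql.encode(),
        headers={"X-Presto-User": user},
    )
    pages = []
    with urllib.request.urlopen(req) as resp:
        pages.append(json.load(resp))
    while "nextUri" in pages[-1]:
        with urllib.request.urlopen(pages[-1]["nextUri"]) as resp:
            pages.append(json.load(resp))
    return collect_rows(pages)

def collect_rows(pages):
    # Each polled page may carry a "data" chunk; concatenate them in order.
    rows = []
    for page in pages:
        rows.extend(page.get("data", []))
    return rows
```

run_query needs a live coordinator, but collect_rows shows the paging logic on its own: early pages often have only a nextUri, and rows arrive spread across later pages.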
Other Presto features
> Comprehensive SQL features
> WITH cte as (SELECT …) SELECT * FROM cte …;
> implicit JOIN (join criteria in the WHERE clause)
> VIEW
> INSERT INTO … VALUES (1,2,3)
> Time & date types & functions,
  compatible with both MySQL & PostgreSQL
> Cluster management using SQL
> SELECT * FROM sys.node;
> sys.task, sys.query
2. Query Planning
Presto’s execution model
> Presto is NOT MapReduce
> Presto’s query plan is based on a DAG
> more like Spark or traditional MPP databases
All stages are pipelined
✓ No wait time
✓ No fault-tolerance
MapReduce vs. Presto

MapReduce: map → disk → reduce → disk → map → disk → reduce → …
  Writes data to disk and waits between stages.

Presto: task → task → task → …
  Memory-to-memory data transfer.
  ✓ No disk IO
  ✓ Data chunk must fit in memory
Query Planner

SQL:
  SELECT
    name,
    count(*) AS c
  FROM access
  GROUP BY name

Table schema:
  TABLE access (
    name varchar,
    time bigint
  )

Logical query plan:
  Table scan (name:varchar)
  → GROUP BY (name, count(*))
  → Output (name, c)

Distributed query plan:
  Table scan → Partial aggregation → Sink
  → Exchange → Final aggregation → Sink
  → Exchange → Output
Query Planner - Stages

Stage-2: Table scan → Partial aggregation → Sink
Stage-1: Exchange → Final aggregation → Sink
Stage-0: Exchange → Output

The Sink → Exchange hops are inter-worker data transfers; the
partial/final aggregation pair is pipelined across them.
Execution Planner

+ Node list (✓ 2 workers)

Worker 1: Table scan → Partial aggregation → Sink; Exchange → Final aggregation → Sink
Worker 2: Table scan → Partial aggregation → Sink; Exchange → Final aggregation → Sink
One worker: Exchange → Output

node-scheduler.min-candidates=2
query.initial-hash-partitions=2
node-scheduler.multiple-tasks-per-node-enabled
Execution Planner - Tasks

1 task / worker / stage
(if node-scheduler.multiple-tasks-per-node-enabled=false)

Worker 1 and Worker 2 each run one task per stage:
  Table scan → Partial aggregation → Sink
  Exchange → Final aggregation → Sink
One worker runs the output task: Exchange → Output
Execution Planner - Split

Table scan: many splits / task = many threads / worker
Final aggregation: 1 split / task = 1 thread / worker
Output: 1 split / worker = 1 thread / worker
2. Query Planning
> SQL is converted into stages, tasks and splits
> All tasks run in parallel
> No wait time between stages (pipelined)
> If one task fails, all tasks fail at once (the query fails)
> Memory-to-memory data transfer
> No disk IO
> If hash-partitioned aggregated data doesn’t fit in memory, the query fails
• Note: the query dies but the worker doesn’t die.
  Memory consumption is fully managed.
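The distributed GROUP BY plan above can be illustrated with a toy sketch (not Presto code): each Stage-2 task partially aggregates its own splits in memory, an exchange hash-partitions the partial results across workers, and each Stage-1 task merges the partials routed to it. The data and function names are made up; n_partitions plays the role of query.initial-hash-partitions.

```python
# Toy sketch of stage-2 partial aggregation -> hash-partitioned
# exchange -> stage-1 final aggregation, all memory-to-memory.
from collections import Counter

def partial_aggregate(split):
    # Stage-2: a task aggregates the rows of its own split in memory
    return Counter(row["name"] for row in split)

def exchange(partials, n_partitions):
    # Hash-partitioned, memory-to-memory transfer between stages:
    # all partial counts for a given key land in the same partition
    partitions = [[] for _ in range(n_partitions)]
    for partial in partials:
        for name, c in partial.items():
            partitions[hash(name) % n_partitions].append((name, c))
    return partitions

def final_aggregate(partition):
    # Stage-1: merge the partial counts routed to this task
    result = Counter()
    for name, c in partition:
        result[name] += c
    return dict(result)

# Two splits of the "access" table from the planner example
splits = [
    [{"name": "a"}, {"name": "b"}],
    [{"name": "a"}, {"name": "a"}],
]
partials = [partial_aggregate(s) for s in splits]
merged = {}
for partition in exchange(partials, n_partitions=2):
    merged.update(final_aggregate(partition))
print(sorted(merged.items()))  # [('a', 3), ('b', 1)]
```

This also makes the memory trade-off above concrete: each final_aggregate task must hold its whole hash partition in memory, which is why a query fails when hash-partitioned aggregated data doesn't fit.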
3. Cluster Configuration

Single-server
> Simplest setup

Coordinator + Discovery Server + Worker in one process:
✓ Task scheduling
✓ Failure detection
✓ Table scan
✓ Aggregation
  coordinator=true
  node-scheduler.include-coordinator=true
  discovery-server.enabled=true
Multi-worker cluster
> More performance

Coordinator + Discovery Server:
✓ Task scheduling
✓ Failure detection
  coordinator=true
  node-scheduler.include-coordinator=false
  discovery-server.enabled=true

Workers:
✓ Table scan
✓ Aggregation
  coordinator=false
  discovery.uri=http://the-coordinator.net:8080
Multi-worker cluster with separated Discovery Server
> More reliable

Discovery Server (standalone):
✓ Failure detection
  https://repo1.maven.org/maven2/io/airlift/discovery/discovery-server/1.20/discovery-server-1.20.tar.gz

Coordinator:
✓ Task scheduling
  coordinator=true
  node-scheduler.include-coordinator=false
  discovery-server.enabled=false
  discovery.uri=http://the-discovery.net:8080

Workers:
✓ Table scan
✓ Aggregation
  coordinator=false
  discovery.uri=http://the-discovery.net:8080
Multi-coordinator cluster
> Most reliable

Discovery Server (standalone)
Coordinator × 2: HA by failover (or load-balance)
  coordinator=true
  node-scheduler.include-coordinator=false
  discovery-server.enabled=false
  discovery.uri=http://the-discovery.net:8080

Workers:
✓ Table scan
✓ Aggregation
  coordinator=false
  discovery.uri=http://the-discovery.net:8080
4. Recent Updates
Recent updates
> Presto 0.75 (2014-08-21)
> max_by(col, compare_col) aggregation function
> Presto 0.76 (2014-09-18)
> MySQL, PostgreSQL and Kafka connectors
> Presto 0.77 (2014-10-01)
> Distributed JOIN
• enabled if distributed-joins-enabled=true
Recent updates
> Presto 0.78 (2014-10-08)
> ARRAY, MAP and JSON types
• json_extract(json, json_path)
• json_array_get(json, index)
• array || array
• contains(array, search_key)
> Presto 0.80 (2014-11-03)
> Optimized ORCFile reader
• enabled if hive.optimized-reader.enabled=true
> Metadata-only queries
• count(), count(distinct), min(), max(), etc.
> numeric_histogram(buckets, col) aggregation function
Recent updates
> Presto 0.86 (2014-12-01)
> ntile(n) window function
> Presto 0.87 (2014-12-03)
> JDK >= 8
> Presto 0.88 (2014-12-11)
> Any aggregation function can be used as a window function
> Presto 0.90 (soon)
> ConnectorPageSink SPI
> year_of_week() function
Check: www.treasuredata.com
Cloud service for the entire data pipeline,
including Presto. We’re hiring!
Understanding Presto - Presto meetup @ Tokyo #1