Introduction to Presto at Treasure Data

Introduction to Presto
Making SQL Scalable
Taro L. Saito 
leo@treasure-data.com
Treasure Data, Inc.

How do we make SQL scalable?
• Problem
• Count access logs of each web page:
• SELECT page, count(*) FROM weblog 
GROUP BY page
• A Challenge
• How do you process millions of records in a
second?
• Making SQL scalable enough to handle large
data set
2

3
HDFS
• Translate SQL into MapReduce (Hadoop) programs
• MapReduce:
• Does the same job by using many machines
Hive
A B
A0
B0
A1
A2
B
B1
B2
B3
A
map reduce mergesplit
HDFS
Single CPU Job
Distributed Processing

SQL to MapReduce
• Mapping SQL stages into MapReduce program
• SELECT page, count(*) FROM weblog 
GROUP BY page
4
HDFS
A0
B0
A1
A2
B
B1
B2
B3
A
HDFS
TableScan(weblog)
GroupBy(hash(page))
count(weblog of a page)
result

HDFS is the bottleneck
• HDFS (Hadoop File System)
• Used for storing intermediate results
• Provides fault-tolerance, but slow
5
HDFS
A0
B0
A1
A2
B
B1
B2
B3
A
HDFS
TableScan(weblog)
GroupBy(hash(page))
result

Presto
• Distributed query engine developed by Facebook
• Uses HTTP for data transfer
• No intermediate storage like HDFS
• No fault-tolerance (but failure rate is less than 0.2%)
• Pipelining data transfer and data processing
6
A0
B0
A1
A2
B
B1
B2
B3
A
TableScan(weblog)
GroupBy(hash(page))
result

Architecture Comparison
7
Hive Presto Spark BigQuery
Performance Slow Fast Fast Ultra Fast
(using many disks)
Intermediate
Storage
HDFS None Memory/Disk Colossus (?)
Data
Transfer
HTTP HTTP HTTP ?
Query
Execution
Stage-wize 
MapReduce
Run all stages 
at once
(pipelining)
Stage-wise ?
Fault
Tolerance
Yes
None
(but, TD will retry
the query)
fromscratch)
Yes, but
limited
?
Multiple Job
Support
Good 
Can handle many
jobs
limited
(~ 5 concurrent queries
per account in TD) 
Require another
resource manager
(e.g. YARN, mesos)
limited
(Query queue)

Presto Usage Stats
• More than 99.8% queries ﬁnishes without any error
• 90%~ of queries ﬁnishes within 1 minute
• Treasure Data Presto Stats
• Processing more than 100,000 queries / day
• Processing 15 trillion records / day
• Facebook’s stat:
• 30,000~100,000 queries / day
• 1 trillion records / day
• Treasure data is No.1 Presto user in the world
8

Presto can process more than 1M rows /sec.
• N
9

Presto Overview
• A distributed SQL Engine developed by Facebook
• For interactive analysis on peta-scale dataset
• As a replacement of Hive
• Nov. 2013: Open sourced at GitHub
• Facebook now has 12 engineers working on Presto
• Code
• In-memory query engine, written in Java
• Based on ANSI SQL syntax
• Isolating query execution layer and storage access layer
• Connector provides data access methods
• Cassandra / Hive / JMX / Kafka / MySQL / PostgreSQL / MongoDB /
System / TPCH connectors
• td-presto is our connector to access PlazmaDB (Columnar Message
Pack Database)
10

Architectural overview
11
https://prestodb.io/overview.html
With Hive connector

Interactive Analysis with TD Presto + Jupyter
15
• https://github.com/treasure-data/td-
jupyter-notebooks/blob/master/
imported/pandas-td-tutorial.ipynb

Presto Internal 
Query Execution

Stage 1
Stage 2
Stage 0
Presto Architecture
Query
Task 0.0
Split
Task 1.0
Split
Task 1.1 Task 1.2
Split Split Split
Task 2.0
Split
Task 2.1 Task 2.2
Split Split Split Split Split Split Split
Split
TableScan
(FROM)
Aggregation
(GROUP BY)
Output
@worker#2 @worker#3 @worker#0

Logical Query Plan
Output[nationkey, _col1] => [nationkey:bigint, count:bigint] 
- _col1 := count
Exchange[GATHER] => nationkey:bigint, count:bigint
Aggregate(FINAL)[nationkey] => [nationkey:bigint, count:bigint] 
- count := "count"("count_15")
Exchange[REPARTITION] => nationkey:bigint, count_15:bigint
Aggregate(PARTIAL)[nationkey] => [nationkey:bigint, count_15:bigint]
- count_15 := "count"("expr")
Project => [nationkey:bigint, expr:bigint]
- expr := 1
InnerJoin[("custkey" = "custkey_0")] =>
[custkey:bigint, custkey_0:bigint, nationkey:bigint]
Project => [custkey:bigint]
Filter[("orderpriority" = '1-URGENT')] =>
[custkey:bigint, orderpriority:varchar]
TableScan[tpch:tpch:orders:sf0.01, original constraint= 
('1-URGENT' = "orderpriority")] => [custkey:bigint, orderpriority:varchar] 
- custkey := tpch:custkey:1 
- orderpriority := tpch:orderpriority:5
Exchange[REPLICATE] => custkey_0:bigint, nationkey:bigint
TableScan[tpch:tpch:customer:sf0.01, original constraint=true] =>
[custkey_0:bigint, nationkey:bigint] 
- custkey_0 := tpch:custkey:0 
- nationkey := tpch:nationkey:3
select 
c.nationkey, 
count(1) 
from orders o
join customer c 
on o.custkey = c.custkey
where
o.orderpriority = '1-URGENT'
group by c.nationkey

- _col1 := count
- expr := 1
- nationkey := tpch:nationkey:3 Stage 3
Table Scan
select 
c.nationkey, 
count(1) 
from orders o
join customer c 
where

- _col1 := count
- expr := 1
Stage 2
Logical Plan Optimization
select 
c.nationkey, 
count(1) 
from orders o
join customer c 
where

- _col1 := count
- expr := 1
Stage 2
Stage 1
select 
c.nationkey, 
count(1) 
from orders o
join customer c 
where

- _col1 := count
- expr := 1
Stage 2
Stage 1
Stage 0
Output Query Results (JSON)
select 
c.nationkey, 
count(1) 
from orders o
join customer c 
where

TD Storage Architecture
23
LogLogLogLogLogLog
1-hour 
partition1-hour 
partition1-hour 
partition
Hadoop 
MapReduce
2015-09-29 01:00:00
2015-09-29 02:00:00
2015-09-29 03:00:00
Real-Time
Storage
Archive 
Storage
time column-based partitioning
…
Hive Presto
Log
many small log ﬁles log merge job
LogLogLogLogLog
Distributed SQL Query Engine

Utilizing Time Index
24
1-hour 
partition
2015-09-29 01:00:00
2015-09-29 02:00:00
2015-09-29 03:00:00
time column-based partitioning
…
Hive/Presto
1-hour 
partition1-hour 
partition1-hour 
partition
TD_TIME_RANGE(time, ‘2015-09-29 02:00:00’, ‘2015-09-29 03:00:00’)
Query Results
2015-09-29 01:00:00
2015-09-29 02:00:00
2015-09-29 03:00:00
…
Hive/Presto Query Results
TD_TIME_RANGE(non_time_column, ‘2015-09-29 02:00:00’, ‘2015-09-29 03:00:00’)
Scanning the whole data set
1-hour 
partition1-hour 
partition1-hour 
partition1-hour 
partition
Full Scan
Partial Scan

Queries with huge results
• SELECT col1, col2, col3, … FROM …
• INSERT INTO (table) SELECT col1, col2, …
• or CREATE TABLE AS
25
1-hour 
partition
header
col1
col2
…
…
Presto
Read query results in JSON
(single-thread task: slow)
msgack.gz
On Amazon S3
Presto
1-hour 
partition
1-hour 
partition
1-hour 
partition
Directly create 1-hour partition on S3 from query results
Runs in parallel: fast

Memory Consuming Operators
• DISTINCT col1, col2, … (duplicate elimination)
• Need to store the whole data set in a single node
• COUNT(DISTINCT col1), etc.
• Use approx_distinct(col1) instead
• order by col1, col2, …
• A single node task (in Presto)
• UNION
• performs duplicate elimination (single node)
• Use UNION ALL
26

Finding bottlenecks
• Table scan range
• Check TD_TIME_RANGE condition
• distinct
• duplicate elimination of all selected columns (single node)
• slow and memory consuming
• huge result output
• Output Stage (0) becomes the bottleneck
• Use DROP TABLE IF EXISTS …, then CREATE TABLE AS SELECT …
27

Resources
• Presto Query FAQs
• https://docs.treasuredata.com/articles/presto-
query-faq
• Presto Documentation
• https://prestodb.io/docs
28

Introduction to Presto at Treasure Data

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Introduction to Presto at Treasure Data

Similar to Introduction to Presto at Treasure Data (20)

More from Taro L. Saito

More from Taro L. Saito (20)

Recently uploaded

Recently uploaded (20)

Introduction to Presto at Treasure Data