
Introduction to Presto at Treasure Data

A short tutorial on Presto and its internals at Treasure Data.



  1. Introduction to Presto: Making SQL Scalable. Taro L. Saito <leo@treasure-data.com>, Treasure Data, Inc.
  2. How do we make SQL scalable? • Problem: count the access logs of each web page: SELECT page, count(*) FROM weblog GROUP BY page • The challenge: how do you process millions of records in a second? We need to make SQL scalable enough to handle large data sets.
  3. Hive • Translates SQL into MapReduce (Hadoop) programs • MapReduce does the same job as a single-CPU program by distributing it across many machines: the input is split, processed by map tasks, merged, and reduced, reading from and writing to HDFS (distributed processing).
  4. SQL to MapReduce • Mapping SQL stages onto a MapReduce program for SELECT page, count(*) FROM weblog GROUP BY page: TableScan(weblog) runs in the map phase, rows are repartitioned by GroupBy(hash(page)), and the count for each page is computed in the reduce phase, with HDFS holding the input and the result.
  5. HDFS is the bottleneck • HDFS (Hadoop Distributed File System) is used for storing intermediate results between MapReduce stages • It provides fault tolerance, but it is slow.
  6. Presto • A distributed query engine developed by Facebook • Uses HTTP for data transfer • No intermediate storage such as HDFS • No fault tolerance (but the failure rate is less than 0.2%) • Pipelines data transfer and data processing.
  7. Architecture Comparison
     • Performance: Hive: slow; Presto: fast; Spark: fast; BigQuery: ultra fast (using many disks)
     • Intermediate storage: Hive: HDFS; Presto: none; Spark: memory/disk; BigQuery: Colossus (?)
     • Data transfer: Hive: HTTP; Presto: HTTP; Spark: HTTP; BigQuery: ?
     • Query execution: Hive: stage-wise MapReduce; Presto: runs all stages at once (pipelining); Spark: stage-wise; BigQuery: ?
     • Fault tolerance: Hive: yes; Presto: none (but TD will retry the query from scratch); Spark: yes, but limited; BigQuery: ?
     • Multiple job support: Hive: good, can handle many jobs; Presto: limited (~5 concurrent queries per account in TD); Spark: requires another resource manager (e.g., YARN, Mesos); BigQuery: limited (query queue)
  8. Presto Usage Stats • More than 99.8% of queries finish without any error • More than 90% of queries finish within 1 minute • Treasure Data Presto stats: more than 100,000 queries / day, more than 15 trillion records / day • Facebook's stats: 30,000~100,000 queries / day, 1 trillion records / day • Treasure Data is the No. 1 Presto user in the world.
  9. Presto can process more than 1M rows/sec.
  10. Presto Overview • A distributed SQL engine developed by Facebook • For interactive analysis on peta-scale datasets • A replacement for Hive • Nov. 2013: open-sourced on GitHub • Facebook now has 12 engineers working on Presto • Code: an in-memory query engine written in Java • Based on ANSI SQL syntax • Separates the query execution layer from the storage access layer: a connector provides the data access methods • Cassandra / Hive / JMX / Kafka / MySQL / PostgreSQL / MongoDB / System / TPCH connectors • td-presto is our connector for accessing PlazmaDB (a columnar MessagePack database).
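     Because each data source is exposed through a connector as a catalog, a single Presto query can combine tables from different systems. A minimal sketch (the catalog, schema, and table names below are hypothetical):

       -- Join web logs stored in Hive with an orders table in MySQL.
       -- hive.web.weblog and mysql.crm.orders are made-up names.
       SELECT l.page, o.amount
       FROM hive.web.weblog l
       JOIN mysql.crm.orders o
         ON l.user_id = o.user_id;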
  11. Architectural overview (with the Hive connector): https://prestodb.io/overview.html
  12. Presto Users • Facebook
  13. • Dropbox
  14. • Airbnb
  15. Interactive Analysis with TD Presto + Jupyter • https://github.com/treasure-data/td-jupyter-notebooks/blob/master/imported/pandas-td-tutorial.ipynb
  16. Presto Internals: Query Execution
  17. Presto Architecture • A query is compiled into stages: Stage 2 scans the tables (TableScan, FROM), Stage 1 aggregates (GROUP BY), and Stage 0 produces the output • Each stage runs as parallel tasks (e.g., Task 1.0, Task 1.1, Task 1.2) placed on workers (worker#0, worker#2, worker#3) • Each task consumes and produces splits, the unit of data exchanged between tasks.
  18. Logical Query Plan. For the query:

       select
         c.nationkey,
         count(1)
       from orders o join customer c
         on o.custkey = c.custkey
       where o.orderpriority = '1-URGENT'
       group by c.nationkey

     Presto produces the following logical plan:

       Output[nationkey, _col1] => [nationkey:bigint, count:bigint]
           - _col1 := count
         Exchange[GATHER] => nationkey:bigint, count:bigint
           Aggregate(FINAL)[nationkey] => [nationkey:bigint, count:bigint]
               - count := "count"("count_15")
             Exchange[REPARTITION] => nationkey:bigint, count_15:bigint
               Aggregate(PARTIAL)[nationkey] => [nationkey:bigint, count_15:bigint]
                   - count_15 := "count"("expr")
                 Project => [nationkey:bigint, expr:bigint]
                     - expr := 1
                   InnerJoin[("custkey" = "custkey_0")] => [custkey:bigint, custkey_0:bigint, nationkey:bigint]
                     Project => [custkey:bigint]
                       Filter[("orderpriority" = '1-URGENT')] => [custkey:bigint, orderpriority:varchar]
                         TableScan[tpch:tpch:orders:sf0.01, original constraint=('1-URGENT' = "orderpriority")] => [custkey:bigint, orderpriority:varchar]
                             - custkey := tpch:custkey:1
                             - orderpriority := tpch:orderpriority:5
                     Exchange[REPLICATE] => custkey_0:bigint, nationkey:bigint
                       TableScan[tpch:tpch:customer:sf0.01, original constraint=true] => [custkey_0:bigint, nationkey:bigint]
                           - custkey_0 := tpch:custkey:0
                           - nationkey := tpch:nationkey:3
  19. The same plan and query, highlighting Stage 3: the table scans.
  20. Logical plan optimization: Stage 3 plus Stage 2.
  21. Stages 3, 2, and 1.
  22. Stages 3, 2, 1, and 0; Stage 0 outputs the query results (as JSON).
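     Plans like the ones above can be inspected directly: Presto's EXPLAIN statement prints the logical plan, and EXPLAIN (TYPE DISTRIBUTED) shows it split into stages. A sketch, assuming the tpch catalog from the example is available:

       -- Logical plan (as on slide 18):
       EXPLAIN
       SELECT c.nationkey, count(1)
       FROM orders o JOIN customer c ON o.custkey = c.custkey
       WHERE o.orderpriority = '1-URGENT'
       GROUP BY c.nationkey;

       -- The same plan divided into distributed stages (as on slides 19-22):
       EXPLAIN (TYPE DISTRIBUTED)
       SELECT c.nationkey, count(1)
       FROM orders o JOIN customer c ON o.custkey = c.custkey
       WHERE o.orderpriority = '1-URGENT'
       GROUP BY c.nationkey;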
  23. TD Storage Architecture • Logs arrive as many small log files in real-time storage • A log merge job (Hadoop MapReduce) merges them into 1-hour partitions in archive storage, partitioned by the time column (2015-09-29 01:00:00, 02:00:00, 03:00:00, …) • Hive and Presto, the distributed SQL query engines, then read these partitions.
  24. Utilizing the Time Index • With TD_TIME_RANGE(time, '2015-09-29 02:00:00', '2015-09-29 03:00:00'), Hive/Presto read only the 1-hour partitions covering that range (partial scan) • With TD_TIME_RANGE(non_time_column, '2015-09-29 02:00:00', '2015-09-29 03:00:00') the time index cannot be used, so the whole data set is scanned (full scan).
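     A sketch of the partial-scan pattern (weblog is a hypothetical table; TD_TIME_RANGE also accepts an optional time zone as a fourth argument):

       -- Reads only the 1-hour partitions between 02:00 and 03:00:
       SELECT page, COUNT(*)
       FROM weblog
       WHERE TD_TIME_RANGE(time, '2015-09-29 02:00:00', '2015-09-29 03:00:00')
       GROUP BY page;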
  25. Queries with huge results • SELECT col1, col2, col3, … FROM … returns its results as JSON through a single-threaded output task in Presto (slow) • INSERT INTO (table) SELECT col1, col2, … or CREATE TABLE AS instead creates 1-hour partitions (msgpack.gz on Amazon S3) directly from the query results, and runs in parallel (fast).
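     A sketch of the fast path (page_counts and weblog are hypothetical table names):

       -- Results are written directly as partitions, in parallel,
       -- instead of being streamed back as single-threaded JSON:
       INSERT INTO page_counts
       SELECT page, COUNT(*) AS cnt
       FROM weblog
       GROUP BY page;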
  26. Memory-Consuming Operators • DISTINCT col1, col2, … (duplicate elimination) needs to hold the whole data set in a single node • For COUNT(DISTINCT col1) etc., use approx_distinct(col1) instead • ORDER BY col1, col2, … runs as a single-node task in Presto • UNION performs duplicate elimination (single node); use UNION ALL instead.
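     A sketch of both substitutions (the table and column names are made up):

       -- approx_distinct keeps memory bounded at the cost of a small error:
       SELECT approx_distinct(user_id) FROM weblog;

       -- UNION ALL skips single-node duplicate elimination:
       SELECT page FROM weblog_2014
       UNION ALL
       SELECT page FROM weblog_2015;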
  27. Finding bottlenecks • Table scan range: check the TD_TIME_RANGE condition • DISTINCT: duplicate elimination over all selected columns runs on a single node, so it is slow and memory-consuming • Huge result output: the output stage (Stage 0) becomes the bottleneck; use DROP TABLE IF EXISTS …, then CREATE TABLE AS SELECT …, as sketched below.
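     The pattern from the last bullet, sketched with a hypothetical result_table:

       -- Recreate the result table so the large output is written in parallel:
       DROP TABLE IF EXISTS result_table;
       CREATE TABLE result_table AS
       SELECT page, COUNT(*) AS cnt
       FROM weblog
       GROUP BY page;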
  28. Resources • Presto Query FAQs: https://docs.treasuredata.com/articles/presto-query-faq • Presto Documentation: https://prestodb.io/docs
