Write Faster SQL with Trino.pdf

Jan. 09, 2023
A presentation on how to write better SQL queries in Trino (Presto) at large scale.

  1. 1. Write Faster SQL with Presto Eric Xiao, Michelle Ark, Nayeem Zen, Tristan Boudreault
  2. 2. Learning Objectives - Describe Presto’s query engine architecture - Interpret, analyze, and evaluate query plans with the EXPLAIN syntax - Recognize query optimizations and gotchas - Employ the optimization techniques discussed in the wild
  3. 3. Talk Outline - Presto Architecture - Tools for Debugging / Analyzing Query Performance - Storage Formats + Optimizations - Hands-on Query Optimization
  4. 4. Presto Architecture
  5. 5. “ Open-source, distributed SQL query engine for interactive, analytic queries ”
  9. 9. Presto is NOT… a database: - Does not store any data. Instead, it employs a ‘Connector’ architecture
  10. 10. Connector Architecture Presto - Connectors enable reading from external data sources - Can query data in different formats in same query Text Text Connector Parquet Parquet Connector MySQL MySQL Connector JSON JSON Connector
  11. 11. Presto is NOT… a transactional query engine: - Not designed for queries common in application development - e.g. point lookups. Instead, it is designed for analytic queries - e.g. full table scans and aggregations - Note: indices would not speed up these queries
  12. 12. Coordinator Presto Architecture Result Worker Worker Worker Queue Plan Schedule Processor External Data Sources Read Data Read Data Processor Processor Optimize
  13. 13. Life of a Query
  14. 14. From SQL to Execution
  15. 15. Planning Scan [shops] Filter [country=‘CAN’] Join [on shop_id] Aggregate [shop_id, COUNT(1)] Scan [buy_button] Query Plan Coordinator Queue Plan Schedule Optimize
  16. 16. Scan [shops] Filter [country=‘CAN’] Join [on shop_id] Aggregate [shop_id, COUNT(1)] Scan [buy_button] Query Plan Plan Optimization ScanFilter [table = shops] [country = ‘CAN’] Join [on shop_id] Aggregate [shop_id, COUNT(1)] Scan [buy_button] Optimized Query Plan Coordinator Queue Plan Schedule Optimize
  17. 17. ScanFilter [table = shops] [country = ‘CAN’] Join [on shop_id] Aggregate [shop_id, COUNT(1)] Scan [buy_button] Optimized Query Plan Distributed Query Plan STAGE 2 ScanFilter [table = shops] [country = ‘CAN’] Project [] STAGE 3 Scan [buy_button] Project [] STAGE 0 Local Exchange [] Result [] STAGE 1 Join [on shop_id] Aggregate [shop_id, COUNT(1)] Local Exchange [] Local Exchange [] Stages
  18. 18. Coordinator Queue Plan Schedule Optimize STAGE 2 ScanFilter [table = shops] [country = ‘CAN’] Project [] Tasks Project ScanFilter TASK 1 Project ScanFilter TASK 2 Project ScanFilter TASK N . . .
  19. 19. Understanding Query Execution
  20. 20. Find Underlying Tables & Datasets Read Presto Query Plans Understand Query Execution
  21. 21. Find underlying table for views SHOW CREATE VIEW catalog.schema.view_name Example: SHOW CREATE VIEW hive.sensitive_partitioned_monorail.monorail_shopify_admin_page_view_1
  22. 22. SHOW CREATE VIEW hive.sensitive_monorail.monorail_shopify_admin_page_view_1 CREATE VIEW hive.sensitive_monorail.monorail_shopify_admin_page_view_1 AS WITH envelope AS ( SELECT * FROM hive.raw_monorail_do_not_query_directly.monorail_shopify_admin_page_view_1 ) SELECT schema_id _schema_id , message_id _message_id , "from_unixtime"(("message_timestamp" / 1000)) _message_timestamp , TRY_CAST("json_extract_scalar"(payload, '$.user_id') AS bigint) "user_id" , TRY_CAST("json_extract_scalar"(payload, '$.shop_id') AS bigint) "shop_id" ... , edge_user_agent _edge_user_agent , edge_remote_ip _edge_remote_ip , partition_yyyy_mm_dd_hh _partition_yyyy_mm_dd_hh FROM envelope
  23. 23. Find underlying dataset for table SHOW CREATE TABLE catalog.schema.table_name Example: SHOW CREATE TABLE hive.raw_monorail_do_not_query_directly.monorail_shopify_admin_page_view_1
  24. 24. SHOW CREATE TABLE hive.raw_monorail_dnqd.monorail_shopify_admin_page_view_1 CREATE TABLE hive.raw_monorail_do_not_query_directly.monorail_shopify_admin_page_view_1 ( magic varchar, schema_id varchar, message_id varchar, message_timestamp bigint, payload varchar, edge_user_agent varchar, edge_remote_ip varchar, edge_event_created_at_ms bigint, edge_event_sent_at_ms bigint, partition_yyyy_mm_dd_hh varchar ) WITH ( external_location = 'gs://.../monorail.shopify.admin.page.view.1', partitioned_by = ARRAY['partition_yyyy_mm_dd_hh'] )
  25. 25. Reading Presto Query Plans EXPLAIN SELECT shop_id, url FROM hive.sensitive_partitioned_monorail.monorail_shopify_admin_page_view_1 LIMIT 100;
  26. 26. - Output[shop_id, url] => [expr_52:bigint, expr_54:varchar] Estimates: {rows: 100 (6.25kB), cpu: ?, memory: 0.00, network: 5500.00} shop_id := expr_52 url := expr_54 - Project[] => [expr_52:bigint, expr_54:varchar] Estimates: {rows: 100 (6.25kB), cpu: ?, memory: 0.00, network: 5500.00} expr_52 := TRY_CAST("json_extract_scalar"("payload", CAST('$.shop_id' AS jsonpath)) AS bigint) expr_54 := "json_extract_scalar"("payload", CAST('$.url' AS jsonpath)) - LocalExchange[ROUND_ROBIN] () => [payload:varchar] Estimates: {rows: 100 (5.37kB), cpu: ?, memory: 0.00, network: 5500.00} - Limit[100] => [payload:varchar] Estimates: {rows: 100 (5.37kB), cpu: ?, memory: 0.00, network: 5500.00} - LocalExchange[SINGLE] () => [payload:varchar] Estimates: {rows: 100 (5.37kB), cpu: ?, memory: 0.00, network: 5500.00} - RemoteStreamingExchange[GATHER] => [payload:varchar] Estimates: {rows: 100 (5.37kB), cpu: ?, memory: 0.00, network: 5500.00} - LimitPartial[100] => [payload:varchar] Estimates: {rows: 100 (5.37kB), cpu: ?, memory: 0.00, network: 0.00} - TableScan[TableHandle {connectorId='hive'}] => [payload:varchar] Estimates: {rows: ? (?), cpu: ?, memory: 0.00, network: 0.00} LAYOUT: raw_monorail.monorail_shopify_admin_page_view_1 payload := payload:string:4:REGULAR partition_yyyy_mm_dd_hh:string:-1:PARTITION_KEY :: [[2018-10-17-21, 2019-07-26-13]]
  27. 27. Reading Presto Query Plans • Read bottom-up • Each - is an operator • Distill only what you need to know (e.g. the partitioning scheme)
  28. 28. Reading Presto Query Plans EXPLAIN (TYPE DISTRIBUTED) SELECT shop_id, url FROM hive.sensitive_partitioned_monorail.monorail_shopify_admin_page_view_1 LIMIT 100;
  29. 29. Glossary • TableScan - Scans the table's underlying dataset for data, using partitions (if any). • Project - Selects the specified columns from the scanned data; can also transform a projected column. • ScanProject - Combines table scans and column projections into one operator. • Filter - Filters out data not matching the provided predicates. • Aggregate (Partial) - Aggregates data on a single worker. • Aggregate (Final) - Aggregation of the aggregates. • Limit (Partial) - Applies the limit to the data scanned on a single node. • Limit (Final) - Applies a limit on the limits. • LocalExchange (Single) - Used to read data from another stage. • LocalExchange (Round Robin) - Used to read data from multiple stages.
  30. 30. Fragment 0 [SINGLE] Output layout: [expr_52, expr_54] Output partitioning: SINGLE [] Stage Execution Strategy: UNGROUPED_EXECUTION - Output[shop_id, url] => [expr_52:bigint, expr_54:varchar] Estimates: {rows: 100 (6.25kB), cpu: ?, memory: 0.00, network: 5500.00} shop_id := expr_52 url := expr_54 - Project[] => [expr_52:bigint, expr_54:varchar] Estimates: {rows: 100 (6.25kB), cpu: ?, memory: 0.00, network: 5500.00} expr_52 := TRY_CAST("json_extract_scalar"("payload", CAST('$.shop_id' AS jsonpath)) AS bigint) expr_54 := "json_extract_scalar"("payload", CAST('$.url' AS jsonpath)) - LocalExchange[ROUND_ROBIN] () => [payload:varchar] Estimates: {rows: 100 (5.37kB), cpu: ?, memory: 0.00, network: 5500.00} - Limit[100] => [payload:varchar] Estimates: {rows: 100 (5.37kB), cpu: ?, memory: 0.00, network: 5500.00} - LocalExchange[SINGLE] () => [payload:varchar] Estimates: {rows: 100 (5.37kB), cpu: ?, memory: 0.00, network: 5500.00} - RemoteSource[1] => [payload:varchar] Fragment 1 [SOURCE] Output layout: [payload] Output partitioning: SINGLE [] Stage Execution Strategy: UNGROUPED_EXECUTION - LimitPartial[100] => [payload:varchar] Estimates: {rows: 100 (5.37kB), cpu: ?, memory: 0.00, network: 0.00} - TableScan[TableHandle {connectorId='hive'}, grouped = false] => [payload:varchar] Estimates: {rows: ? (?), cpu: ?, memory: 0.00, network: 0.00} LAYOUT: raw_monorail_do_not_query_directly.monorail_shopify_admin_page_view_1 payload := payload:string:4:REGULAR partition_yyyy_mm_dd_hh:string:-1:PARTITION_KEY :: [[2018-10-17-21, 2019-07-26-14]]
  31. 31. Reading Presto Query Plans EXPLAIN (TYPE DISTRIBUTED) SELECT shop_id, COUNT(1) FROM hive.sensitive_partitioned_monorail.monorail_shopify_admin_page_view_1 WHERE _partition_yyyy_mm_dd_hh >= '2019-07-25' GROUP BY 1 ORDER BY 2 DESC LIMIT 100;
  32. 32. Fragment 0 [SINGLE] Output layout: [expr_52, count] Output partitioning: SINGLE [] Stage Execution Strategy: UNGROUPED_EXECUTION - Output[shop_id, _col1] => [expr_52:bigint, count:bigint] shop_id := expr_52 _col1 := count - TopN[100 by (count DESC_NULLS_LAST)] => [expr_52:bigint, count:bigint] - LocalExchange[SINGLE] () => [expr_52:bigint, count:bigint] - RemoteSource[1] => [expr_52:bigint, count:bigint] Fragment 1 [HASH] Output layout: [expr_52, count] Output partitioning: SINGLE [] Stage Execution Strategy: UNGROUPED_EXECUTION - TopNPartial[100 by (count DESC_NULLS_LAST)] => [expr_52:bigint, count:bigint] - Aggregate(FINAL)[expr_52] => [expr_52:bigint, count:bigint] count := "count"("count_167") - LocalExchange[HASH][$hashvalue] ("expr_52") => [expr_52:bigint, count_167:bigint, $hashvalue:bigint] - RemoteSource[2] => [expr_52:bigint, count_167:bigint, $hashvalue_168:bigint] Fragment 2 [SOURCE] Output layout: [expr_52, count_167, $hashvalue_169] Output partitioning: HASH [expr_52][$hashvalue_169] Stage Execution Strategy: UNGROUPED_EXECUTION - Project[] => [expr_52:bigint, count_167:bigint, $hashvalue_169:bigint] $hashvalue_169 := "combine_hash"(bigint '0', COALESCE("$operator$hash_code"("expr_52"), 0)) - Aggregate(PARTIAL)[expr_52] => [expr_52:bigint, count_167:bigint] count_167 := "count"(*) - ScanProject[table = TableHandle {..}, grouped = false] => [expr_52:bigint] Estimates: {rows: ? (?), cpu: ?, memory: 0.00, network: 0.00}/{rows: ? (?), cpu: ?, memory: 0.00, network: 0.00} expr_52 := TRY_CAST("json_extract_scalar"("payload", CAST('$.shop_id' AS jsonpath)) AS bigint) LAYOUT: raw_monorail_do_not_query_directly.monorail_shopify_admin_page_view_1 payload := payload:string:4:REGULAR partition_yyyy_mm_dd_hh:string:-1:PARTITION_KEY :: [[2019-07-25-00, 2019-07-26-14]]
  33. 33. Fragment 2 [SOURCE] Output layout: [expr_52, count_167, $hashvalue_169] Output partitioning: HASH [expr_52][$hashvalue_169] Stage Execution Strategy: UNGROUPED_EXECUTION - Project[] => [expr_52:bigint, count_167:bigint, $hashvalue_169:bigint] $hashvalue_169 := "combine_hash"(bigint '0', COALESCE("$operator$hash_code"("expr_52"), 0)) - Aggregate(PARTIAL)[expr_52] => [expr_52:bigint, count_167:bigint] count_167 := "count"(*) - ScanProject[table = TableHandle {..}, grouped = false] => [expr_52:bigint] Estimates: {rows: ? (?), cpu: ?, memory: 0.00, network: 0.00}/{rows: ? (?), cpu: ?, memory: 0.00, network: 0.00} expr_52 := TRY_CAST("json_extract_scalar"("payload", CAST('$.shop_id' AS jsonpath)) AS bigint) LAYOUT: raw_monorail_do_not_query_directly.monorail_shopify_admin_page_view_1 payload := payload:string:4:REGULAR partition_yyyy_mm_dd_hh:string:-1:PARTITION_KEY :: [[2019-07-25-00, 2019-07-26-14]]
  35. 35. Fragment 1 [HASH] Output layout: [expr_52, count] Output partitioning: SINGLE [] Stage Execution Strategy: UNGROUPED_EXECUTION - TopNPartial[100 by (count DESC_NULLS_LAST)] => [expr_52:bigint, count:bigint] - Aggregate(FINAL)[expr_52] => [expr_52:bigint, count:bigint] count := "count"("count_167") - LocalExchange[HASH][$hashvalue] ("expr_52") => [expr_52:bigint, count_167:bigint, $hashvalue:bigint] - RemoteSource[2] => [expr_52:bigint, count_167:bigint, $hashvalue_168:bigint]
  37. 37. Fragment 0 [SINGLE] Output layout: [expr_52, count] Output partitioning: SINGLE [] Stage Execution Strategy: UNGROUPED_EXECUTION - Output[shop_id, _col1] => [expr_52:bigint, count:bigint] shop_id := expr_52 _col1 := count - TopN[100 by (count DESC_NULLS_LAST)] => [expr_52:bigint, count:bigint] - LocalExchange[SINGLE] () => [expr_52:bigint, count:bigint] - RemoteSource[1] => [expr_52:bigint, count:bigint]
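
The three fragments above can be read bottom-up as a partial count per split, a hash exchange on shop_id, a final count, and a two-step TopN. The following is a minimal Python sketch of that data flow, not Presto's implementation: the function names, the toy splits, and the worker count are all hypothetical, and Python's built-in `hash` merely stands in for the engine's hash function.

```python
from collections import Counter, defaultdict
import heapq


def partial_aggregate(split):
    """Aggregate(PARTIAL): count rows per shop_id within one split."""
    return Counter(split)


def hash_exchange(partials, n_workers):
    """Route every (shop_id, partial_count) pair to the worker owning hash(shop_id)."""
    buckets = [defaultdict(list) for _ in range(n_workers)]
    for partial in partials:
        for shop_id, count in partial.items():
            buckets[hash(shop_id) % n_workers][shop_id].append(count)
    return buckets


def final_aggregate(bucket):
    """Aggregate(FINAL): combine the partial counts collected for each shop_id."""
    return {shop_id: sum(counts) for shop_id, counts in bucket.items()}


def top_n(rows, n):
    """TopN by count descending; ties are broken by shop_id to stay deterministic."""
    return heapq.nsmallest(n, rows, key=lambda row: (-row[1], row[0]))


def run_query(splits, n_workers=3, limit=2):
    partials = [partial_aggregate(s) for s in splits]          # Fragment 2 [SOURCE]
    buckets = hash_exchange(partials, n_workers)               # HASH partitioning
    per_worker_top = [                                         # Fragment 1 [HASH]
        top_n(final_aggregate(b).items(), limit) for b in buckets
    ]
    merged = [row for rows in per_worker_top for row in rows]  # gather to one node
    return top_n(merged, limit)                                # Fragment 0 [SINGLE]


# Hypothetical input: each inner list is one split holding shop_id values.
splits = [[1, 1, 2, 3], [2, 2, 1, 4], [1, 3, 3, 1]]
result = run_query(splits)
print(result)
```

Because every shop_id lands on exactly one worker after the hash exchange, each worker's final counts are complete, which is why a per-worker TopN followed by a merged TopN gives the same answer as sorting everything in one place.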
  38. 38. Data (File) Formats
  39. 39. JSON Parquet (Columnar)
  40. 40. JSON • Nested format. • Row by row. • Used for Kafka and Monorail data at Shopify.
  41. 41. JSON Example { "edge_event_created_at_ms":…, "edge_event_sent_at_ms":…, "edge_remote_ip":"...", "edge_user_agent":"...", "event_timestamp":"...", "magic":"...", }, {…}, …. Row 1 Row 2
  42. 42. Parquet File Format • Columnar data format. • Each Parquet file is made of multiple “row groups”. • Each “row group” is made of multiple “data pages”. • Makes queries that only need a subset of columns efficient. • Metadata on a file and row group level. Reference: https://parquet.apache.org/documentation/latest/
  43. 43. Parquet File
  44. 44. Row Groups Row Group 1 Row Group n …
  45. 45. Data Pages Row 1 Column 1 Row 2 Column 1 Metadata Row 1 Column 2 Row 2 Column 2 Metadata Row n - 1 Column m Row n Column m Metadata … … Row n Column 1 Metadata
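
To make the "subset of columns" point from the Parquet slides concrete, here is a small Python sketch that compares how many values a scan touches in a row-by-row layout versus a column-by-column layout. It is only an in-memory model under stated assumptions: the records, field names, and the `values_read` counter are hypothetical, and real Parquet files add row groups, pages, encoding, and compression on top of this idea.

```python
# Hypothetical records with three fields each.
rows = [
    {"shop_id": i, "url": f"/admin/page/{i}", "user_agent": "x" * 50}
    for i in range(1000)
]


def scan_rows(rows, column):
    """Row-oriented scan: every field of every row is touched to get one column."""
    values_read = 0
    out = []
    for row in rows:
        values_read += len(row)  # the whole record has to be read
        out.append(row[column])
    return out, values_read


# Column-oriented layout: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def scan_columns(columns, column):
    """Column-oriented scan: only the requested column is touched."""
    values = columns[column]
    return list(values), len(values)


row_result, row_cost = scan_rows(rows, "shop_id")
col_result, col_cost = scan_columns(columns, "shop_id")
print(row_cost, col_cost)
```

Both scans return the same shop_id values; the difference is only in how much data had to be read to produce them.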
  46. 46. DISCLAIMER: ONLY IF FILE FORMAT IS COLUMNAR
  47. 47. Storage Layouts and their Benefits
  48. 48. Partitioning File Sizes Sorted Data
  49. 49. Partitioning • Data is stored and separated into different folders called “partitions” on disk. • e.g. partition_key=value • There can be multiple layers of partitioning * • e.g. partition_key_1=value_1/partition_key_2=value_2/etc. • To see the partitions for a table: • SELECT * FROM catalog.schema."table_name$partitions" Caveat: • Too many partitions can lead to sub-optimal performance.
  50. 50. Partitioning • We store our monorail data partitioned by year, month, day, and hour • e.g. path_to_data/year=2019/month=01/day=02/hour=03 • Partitioning by minute as well would be bad partitioning.
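
As a rough illustration of why partition layout matters, the sketch below models partition pruning in Python: partition folders that cannot match a predicate are dropped before any data is read. The folder names follow the year/month/day/hour example above, but the helper functions and the predicate are hypothetical and do not reflect how the Hive connector is implemented.

```python
def parse_partition(path):
    """Turn 'year=2019/month=01/day=02/hour=03' into {'year': '2019', ...}."""
    return dict(part.split("=", 1) for part in path.split("/"))


def prune(partitions, predicate):
    """Keep only the partitions whose key values satisfy the predicate."""
    return [p for p in partitions if predicate(parse_partition(p))]


# Hypothetical table: three days of hourly partitions.
partitions = [
    f"year=2019/month=01/day={day:02d}/hour={hour:02d}"
    for day in (1, 2, 3)
    for hour in range(24)
]

# Predicate on partition keys only; zero-padded strings compare correctly.
selected = prune(partitions, lambda k: k["day"] == "02" and k["hour"] >= "12")
print(len(partitions), "partitions total,", len(selected), "to scan")
```

Adding a minute level to this layout would multiply the number of folders by 60, which is the kind of over-partitioning the caveat warns about.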
  51. 51. File Sizes • Number of files == number of initial splits • Find a balance between reading metadata and reading data • If files are too small, your query will be degraded by I/O overhead, reading more metadata than data • The ideal file size matches the HDFS cache block size (128 MB).
  52. 52. File Sizes • But what about thick files? • Bigger row groups (multiple rows). • More likely to run into memory issues.
  53. 53. Sorted Data • Presto can read metadata about the row groups. • These include min, max, count stats for each row group. • Based on the metadata, Presto can skip row groups. Caveat: • The initial sorting of the data when writing is costly.
  54. 54. Sorted Data • Presto can read metadata about the row groups. • These include min, max, count stats for each row group. • Based on the metadata, Presto can skip row groups. • Can only sort: • Within bucketed tables. • On a file level. Caveat: • The initial sorting of the data when writing is costly.
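
The row-group skipping described above can be sketched in a few lines of Python. This is a simplified model, assuming each row group keeps min, max, and count statistics and that the query is an equality filter; the group size, the random shop_id values, and the function names are hypothetical.

```python
import random


def build_row_groups(values, group_size):
    """Split values into row groups and record min/max/count stats for each."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append(
            {"min": min(chunk), "max": max(chunk), "count": len(chunk), "rows": chunk}
        )
    return groups


def scan_equal(groups, target):
    """Count rows equal to target, skipping groups whose stats rule them out."""
    groups_read = 0
    matches = 0
    for group in groups:
        if target < group["min"] or target > group["max"]:
            continue  # stats prove no row here can match, so skip the group
        groups_read += 1
        matches += sum(1 for value in group["rows"] if value == target)
    return matches, groups_read


random.seed(0)
shop_ids = [random.randrange(1000) for _ in range(10_000)]

unsorted_groups = build_row_groups(shop_ids, 500)
sorted_groups = build_row_groups(sorted(shop_ids), 500)

unsorted_matches, unsorted_read = scan_equal(unsorted_groups, 42)
sorted_matches, sorted_read = scan_equal(sorted_groups, 42)
print(unsorted_read, sorted_read)
```

With unsorted data nearly every group spans the full value range, so the statistics rule out almost nothing; once the data is sorted, the matching rows sit in one or two adjacent groups and the rest are skipped.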
  55. 55. Hands on Query Optimizations
  56. 56. Web UI
  57. 57. Overview shop_dimension gmv_adjustment_facts payment_gateway_dimension shops + gmv shops + gmv + gateway (final) 👆BEGIN AT BOTTOM 👆
  58. 58. FROM shop JOIN gmv shop_dimension gmv_adjustment_facts
  59. 59. FROM shop + gmv JOIN gateway shops + gmv payment_gateway_dimension
  60. 60. Overview shop_dimension 24 million rows gmv_adjustment_facts 2,680 million rows payment_gateway_dimension 1 row shop + gmv 2,680 million rows gmv + shop + gateway (final) 👆BEGIN AT BOTTOM 👆
  61. 61. Optimal shop_dimension 24 million rows gmv_adjustment_facts 2,680 million rows payment_gateway_dimension 1 row gmv + gateway 575 million rows gmv + gateway + shops (final) 👆BEGIN AT BOTTOM 👆
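
The effect of join order on intermediate size can be reproduced at toy scale. The Python sketch below uses a simple in-memory hash join and made-up tables (100 shops, one gateway row, 10,000 fact rows); the table contents and the `hash_join` helper are hypothetical and only mirror the shape of the slides' example, not its real row counts.

```python
def hash_join(left, right, key):
    """Inner hash join: build a hash table on `right`, then probe it with `left`."""
    table = {}
    for r in right:
        table.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in table.get(l[key], [])]


# Hypothetical stand-ins for the dimension and fact tables.
shops = [{"shop_key": s, "shop_name": f"shop-{s}"} for s in range(100)]
gateways = [{"gateway_key": 0, "gateway_name": "example-gateway"}]  # a single row
gmv = [
    {"shop_key": i % 100, "gateway_key": i % 5, "amount": 1}
    for i in range(10_000)
]

# Order A: join the facts to shops first, then to the one-row gateway table.
shop_gmv = hash_join(gmv, shops, "shop_key")
final_a = hash_join(shop_gmv, gateways, "gateway_key")

# Order B: join the facts to the selective gateway table first, then to shops.
gmv_gateway = hash_join(gmv, gateways, "gateway_key")
final_b = hash_join(gmv_gateway, shops, "shop_key")

print(len(shop_gmv), len(gmv_gateway), len(final_a), len(final_b))
```

Both orders produce the same final rows, but order B carries a much smaller intermediate result into the second join because the selective join runs first.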
  62. 62. Many rows per merchant (slower) gmv adjustment (usd) _merchant_key $30.25 163873553166667789 $45.69 163873553166667789 $100.10 163873553166667789 $19.91 214536949654314165 _merchant_key merchant name 163873553166667789 ColourPop 214536949654314165 Triangl
  63. 63. One row per merchant (faster) total_gmv _merchant_key $176.74 163873553166667789 $19.91 214536949654314165 _merchant_key merchant name 163873553166667789 ColourPop 214536949654314165 Triangl
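
The "group before join" idea can be checked with a short Python sketch: aggregating the fact rows per merchant key first means the join only has to handle one row per merchant. The merchant keys and names are taken from the example rows above; the amounts are written as integer cents to avoid floating-point noise, and the two helper functions are hypothetical.

```python
from collections import defaultdict

# Fact rows as (merchant_key, amount_in_cents).
gmv_rows = [
    (163873553166667789, 3025),
    (163873553166667789, 4569),
    (163873553166667789, 10010),
    (214536949654314165, 1991),
]
merchants = {163873553166667789: "ColourPop", 214536949654314165: "Triangl"}


def join_then_group(gmv_rows, merchants):
    """Join every fact row to its merchant, then sum per merchant name."""
    joined = [(merchants[key], cents) for key, cents in gmv_rows if key in merchants]
    totals = defaultdict(int)
    for name, cents in joined:
        totals[name] += cents
    return dict(totals), len(joined)


def group_then_join(gmv_rows, merchants):
    """Sum per merchant key first, then join one row per merchant."""
    totals_by_key = defaultdict(int)
    for key, cents in gmv_rows:
        totals_by_key[key] += cents
    joined = [
        (merchants[key], total)
        for key, total in totals_by_key.items()
        if key in merchants
    ]
    return dict(joined), len(joined)


slow_totals, slow_rows_joined = join_then_group(gmv_rows, merchants)
fast_totals, fast_rows_joined = group_then_join(gmv_rows, merchants)
print(slow_rows_joined, fast_rows_joined)
```

The rewrite is only safe when the dimension table has one row per join key, as it does here; otherwise pre-aggregating could change the totals.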
  64. 64. Checklist Join order Grouped before join Approximation Sampling Partitions (if lucky)
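
For the "Approximation" and "Sampling" items on the checklist, the slides do not say which technique was used, so the following Python sketch is only one possible reading: keep each row with a fixed probability (Bernoulli sampling) and scale the sampled sum back up. The data, the 10% fraction, and the function names are hypothetical.

```python
import random


def bernoulli_sample(rows, fraction, rng):
    """Keep each row independently with probability `fraction`."""
    return [row for row in rows if rng.random() < fraction]


def estimate_total(rows, fraction, rng):
    """Estimate sum(rows) from a sample by scaling the sampled sum by 1/fraction."""
    sample = bernoulli_sample(rows, fraction, rng)
    return sum(sample) / fraction


rng = random.Random(42)  # fixed seed so the run is repeatable
amounts = [1 + (i % 100) for i in range(100_000)]  # made-up amounts

exact = sum(amounts)
approx = estimate_total(amounts, 0.10, rng)
relative_error = abs(approx - exact) / exact
print(exact, round(approx), f"{relative_error:.2%}")
```

The estimate reads roughly a tenth of the rows and is usually close to the exact total for large, evenly distributed data; it becomes less reliable when a few rows dominate the sum.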
  65. 65. Thank You Please leave feedback!

