AWS July Webinar Series: Amazon Redshift Optimizing Performance

Amazon Redshift is a fast, petabyte-scale data warehouse that makes it easy to analyze your data for a fraction of the cost of traditional data warehouses.

By following a few best practices for schema design and cluster design, you can unleash the high performance capabilities of Amazon Redshift. This webinar is a deep dive into performance tuning techniques based on real-world use cases.

Learning Objectives:

Learn how to get the best performance from your Redshift cluster
Design Amazon Redshift clusters based on real world use cases
See sample tuning scripts to diagnose and maximize cluster performance
Learn about increasing query performance using interleaved sorting

  1. 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Sanjay Kotecha, Solution Architect Eric Ferreira, Principal Database Engineer July 21, 2015 Best Practices: Amazon Redshift Optimizing Performance
  2. 2. Getting Started – June Webinar Series: https://www.youtube.com/watch?v=biqBjWqJi-Q Best Practices – July Webinar Series: Optimizing Performance – July 21, 2015 Migration and Data Loading – July 22, 2015 Reporting and Advanced Analytics – July 23, 2015 Amazon Redshift – Resources
  3. 3. Architecture Distribution Sort Keys Compression DDL Loading Vacuum Analyze Workload Management Agenda
  4. 4. Leader Node • SQL endpoint • Stores metadata • Coordinates query execution Compute Nodes • Local, columnar storage • Execute queries in parallel • Load, backup, restore via S3 • Parallel load from DynamoDB or SSH Hardware optimized for data processing • DS2: HDD; scale from 2TB to 2PB • DC1: SSD; scale from 160GB to 326TB 10 GigE (HPC) Ingestion Backup Restore JDBC/ODBC Amazon Redshift Architecture
  5. 5. – One slice per core – DS2 – 2 slices on XL, 16 on 8XL – DC1 – 2 slices on XL, 32 on 8XL Architecture – Nodes and Slices
  6. 6. Table Distribution Styles Distribution Key All Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 All data on every node Same key to same location Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 Even Round robin distribution
  7. 7. Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 cloudfront uri = /games/g1.exe user_id=1234 … user_profile user_id=1234 name=janet … user_profile user_id=6789 name=fred … cloudfront uri = /imgs/ad1.png user_id=2345 … user_profile user_id=2345 name=bill … cloudfront uri=/games/g10.exe user_id=4312 … user_profile user_id=4312 name=fred … order_line order_line_id = 25693 … cloudfront uri = /img/ad_5.img user_id=1234 … Data Distribution with Distribution Keys
  8. 8. Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 user_profile user_id=1234 name=janet … user_profile user_id=6789 name=fred … cloudfront uri = /imgs/ad1.png user_id=2345 … user_profile user_id=2345 name=bill … cloudfront uri=/games/g10.exe user_id=4312 … user_profile user_id=4312 name=fred … order_line order_line_id = 25693 … Distribution Keys determine which data resides on which slices cloudfront uri = /games/g1.exe user_id=1234 … cloudfront uri = /img/ad_5.img user_id=1234 … Records with same distribution key for a table are on the same slice Data Distribution and Distribution Keys
  9. 9. Node 1 Slice 1 Slice 2 cloudfront uri = /games/g1.exe user_id=1234 … user_profile user_id=1234 name=janet … cloudfront uri = /imgs/ad1.png user_id=2345 … user_profile user_id=2345 name=bill … order_line order_line_id = 25693 … cloudfront uri = /img/ad_5.img user_id=1234 … Records from other tables with the same distribution key value are also on the same slice Records with same distribution key for a table are on the same slice Distribution Keys help with data locality for join evaluation Node 2 Slice 3 Slice 4 user_profile user_id=6789 name=fred … cloudfront uri=/games/g10.exe user_id=4312 … user_profile user_id=4312 name=fred … Data Distribution and Distribution Keys
  10. 10. Example Query (TPC-H dataset) Data Distribution - Comparison Distribution Type Query against the tables with distribution key was 178% faster Key Even 14 seconds 39 seconds
  11. 11. Query plan for tables with distribution key Data Distribution - Comparison Query plan for tables without distribution key
  12. 12. Query Plan http://docs.aws.amazon.com/redshift/latest/dg/c-query-processing.html
  13. 13. Tools – AdminScripts
  14. 14. Tools – AdminViews
  15. 15. Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 cloudfront uri = /games/g1.exe user_id=1234 … cloudfront uri = /imgs/ad1.png user_id=2345 … cloudfront uri=/games/g10.exe user_id=4312 … cloudfront uri = /img/ad_5.img user_id=1234 … 2M records 5M records 1M records 4M records Poor key choices lead to uneven distribution of records… Data Distribution and Distribution Keys
  16. 16. Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 cloudfront uri = /games/g1.exe user_id=1234 … cloudfront uri = /imgs/ad1.png user_id=2345 … cloudfront uri=/games/g10.exe user_id=4312 … cloudfront uri = /img/ad_5.img user_id=1234 … 2M records 5M records 1M records 4M records Unevenly distributed data causes processing imbalances! Data Distribution and Distribution Keys
  17. 17. Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 cloudfront uri = /games/g1.exe user_id=1234 … cloudfront uri = /imgs/ad1.png user_id=2345 … cloudfront uri=/games/g10.exe user_id=4312 … cloudfront uri = /img/ad_5.img user_id=1234 … 2M records2M records 2M records 2M records Evenly distributed data improves query performance select * from v_check_data_distribution where tablename = 'lineitem'; Data Distribution and Distribution Keys
  18. 18. KEY • Large Fact tables • Large dimension tables ALL • Medium dimension tables (1K – 2M) EVEN • Tables with no joins or group by • Small dimension tables (<1000) Data Distribution
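The guidance above can be sketched in DDL. This is a minimal illustration, not from the webinar; all table and column names are hypothetical:

```sql
-- Large fact table: distribute on the join key
-- (DISTKEY on a column implies DISTSTYLE KEY).
CREATE TABLE sales (
  sale_id     BIGINT NOT NULL,
  customer_id BIGINT NOT NULL DISTKEY,
  sale_date   DATE   NOT NULL,
  amount      NUMERIC(12,2)
);

-- Medium dimension table (roughly 1K-2M rows): replicate to every node.
CREATE TABLE region (
  region_id INT NOT NULL,
  name      VARCHAR(50)
) DISTSTYLE ALL;

-- Staging table with no joins or group bys: round-robin distribution.
CREATE TABLE raw_events (
  payload VARCHAR(4096)
) DISTSTYLE EVEN;
```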
  19. 19. Tools – Admin Scripts: table_info.sql
  20. 20. SELECT COUNT(*) FROM LOGS WHERE DATE = '09-JUNE-2015' MIN: 01-JUNE-2015 MAX: 20-JUNE-2015 MIN: 08-JUNE-2015 MAX: 30-JUNE-2015 MIN: 12-JUNE-2015 MAX: 20-JUNE-2015 MIN: 02-JUNE-2015 MAX: 25-JUNE-2015 MIN: 06-JUNE-2015 MAX: 12-JUNE-2015 Unsorted Table MIN: 01-JUNE-2015 MAX: 06-JUNE-2015 MIN: 07-JUNE-2015 MAX: 12-JUNE-2015 MIN: 13-JUNE-2015 MAX: 18-JUNE-2015 MIN: 19-JUNE-2015 MAX: 24-JUNE-2015 MIN: 25-JUNE-2015 MAX: 30-JUNE-2015 Sorted By Date READ READ READ READ READ Sort Keys – Zone Maps
  21. 21. Sort Keys - How to choose Timestamp column Frequent range filtering or equality filtering on one column Join column: create table customer ( c_custkey int8 not null, c_name varchar(25) not null, c_address varchar(40) not null, c_nationkey int4 not null, c_phone char(15) not null, c_acctbal numeric(12,2) not null, c_mktsegment char(10) not null, c_comment varchar(117) not null ) distkey(c_custkey) sortkey(c_custkey) ;
  22. 22. Single Column Compound Interleaved Sort Keys
  23. 23. Table is sorted by 1 column [ SORTKEY ( date ) ] Best for: • Queries that use 1st column (i.e. date) as primary filter • Can speed up joins and group bys • Quickest to VACUUM Date Region Country 2-JUN-2015 Oceania New Zealand 2-JUN-2015 Asia Singapore 2-JUN-2015 Africa Zaire 2-JUN-2015 Asia Hong Kong 3-JUN-2015 Europe Germany 3-JUN-2015 Asia Korea Sort Keys – Single Column
  24. 24. • Table is sorted by 1st column , then 2nd column etc. [ SORTKEY COMPOUND ( date, region, country) ] • Best for: • Queries that use 1st column as primary filter, then other cols • Can speed up joins and group bys • Slower to VACUUM Date Region Country 2-JUN-2015 Oceania New Zealand 2-JUN-2015 Asia Singapore 2-JUN-2015 Africa Zaire 2-JUN-2015 Asia Hong Kong 3-JUN-2015 Europe Germany 3-JUN-2015 Asia Korea Sort Keys – Compound
  25. 25. • Equal weight is given to each column. [ SORTKEY INTERLEAVED ( date, region, country) ] • Best for: • Queries that use different columns in filter • Queries get faster the more columns used in the filter (up to 8) • Slowest to VACUUM Date Region Country 2-JUN-2015 Oceania New Zealand 2-JUN-2015 Asia Singapore 2-JUN-2015 Africa Zaire 2-JUN-2015 Asia Hong Kong 3-JUN-2015 Europe Germany 3-JUN-2015 Asia Korea Sort Keys – Interleaved
  26. 26. Sort Keys – Comparing Styles Single create table cust_sales_dt_single sortkey (c_custkey) as select * from cust_sales_date; Compound create table cust_sales_dt_compound compound sortkey (c_custkey, c_region, c_mktsegment, d_date) as select * from cust_sales_date; Interleaved create table cust_sales_dt_interleaved interleaved sortkey (c_custkey, c_region, c_mktsegment, d_date) as select * from cust_sales_date;
  27. 27. Query 1 select max(lo_revenue), min(lo_revenue) from cust_sales_date_single where c_custkey < 100000; select max(lo_revenue), min(lo_revenue) from cust_sales_date_compound where c_custkey < 100000; select max(lo_revenue), min(lo_revenue) from cust_sales_date_interleaved where c_custkey < 100000; Query 2 select max(lo_revenue), min(lo_revenue) from cust_sales_date_single where c_region = 'ASIA' and c_mktsegment = 'FURNITURE'; select max(lo_revenue), min(lo_revenue) from cust_sales_date_compound where c_region = 'ASIA' and c_mktsegment = 'FURNITURE'; select max(lo_revenue), min(lo_revenue) from cust_sales_date_interleaved where c_region = 'ASIA' and c_mktsegment = 'FURNITURE'; Query 3 select max(lo_revenue), min(lo_revenue) from cust_sales_date_single where d_date between '01/01/1996' and '01/14/1996' and c_mktsegment = 'FURNITURE' and c_region = 'ASIA'; select max(lo_revenue), min(lo_revenue) from cust_sales_date_compound where d_date between '01/01/1996' and '01/14/1996' and c_mktsegment = 'FURNITURE' and c_region = 'ASIA'; select max(lo_revenue), min(lo_revenue) from cust_sales_date_interleaved where d_date between '01/01/1996' and '01/14/1996' and c_mktsegment = 'FURNITURE' and c_region = 'ASIA'; Sort Keys – Comparing Styles
  28. 28. Sort Style Query 1 Query 2 Query 3 Single 0.25 seconds 18.37 seconds 30.04 seconds Compound 0.27 seconds 18.24 seconds 30.14 seconds Interleaved 0.94 seconds 1.46 seconds 0.80 seconds Sort Keys – Comparing Styles
  29. 29. Increased load and vacuum times More effective with large tables (100M+ rows) Use a compound sort key when appending data in order Sort Keys – Interleaved Considerations
  30. 30. Tools – Admin Scripts: table_info.sql
  31. 31. Raw encoding (RAW) Byte-dictionary (BYTEDICT) Delta encoding (DELTA / DELTA32K) Mostly encoding (MOSTLY8 / MOSTLY16 / MOSTLY32) Runlength encoding (RUNLENGTH) Text encoding (TEXT255 / TEXT32K) LZO encoding (LZO) Average compression: 2-4x Compression - Encodings
  32. 32. COPY samples data automatically when loading into an empty table • Samples up to 100,000 rows and picks optimal encoding If using temp tables or staging tables: • Turn off automatic compression • Use ANALYZE COMPRESSION to determine the right encodings • Bake those encodings into your DDL COPY <tablename> FROM 's3://<bucket-name>/<object-prefix>' CREDENTIALS 'aws_access_key_id=<AWS_ACCESS_KEY>;aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>' DELIMITER ',' COMPUPDATE OFF MANIFEST; Compression
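The staging-table workflow above can be sketched as two steps. Table and column names are illustrative, not from the webinar:

```sql
-- 1. Ask Redshift to recommend encodings from a sample of existing data:
ANALYZE COMPRESSION staging_events;

-- 2. Bake the recommended encodings into the DDL so COPY (with
--    COMPUPDATE OFF) can skip its own sampling pass:
CREATE TABLE events (
  event_time TIMESTAMP   ENCODE delta32k,
  user_id    BIGINT      ENCODE mostly32,
  event_type VARCHAR(32) ENCODE bytedict
);
```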
  33. 33. Compression Encodings Compression - Comparison No Compression Encodings
  34. 34. Example Query (TPC-H dataset) Compressed Uncompressed 14 seconds 37 seconds Query against the tables with compression was 164% faster Compression - Comparison
  35. 35. • Zone maps store min/max per block • Once we know which block(s) contain the range, we know which row offsets to scan • Highly compressed sort keys means many rows per block • You’ll scan more data blocks than you need • If your sort keys compress significantly more than your data columns, you may want to skip compression Compression – Sort Keys
  36. 36. Tools – Admin Scripts: table_info.sql
  37. 37. CREATE TABLE orders ( orderkey int8 NOT NULL DISTKEY, custkey int8 NOT NULL, orderstatus char(1) NOT NULL , totalprice numeric(12,2) NOT NULL , orderdate date NOT NULL SORTKEY , orderpriority char(15) NOT NULL, clerk char(15) NOT NULL , shippriority int4 NOT NULL, comment varchar(79) NOT NULL ); DDL
  38. 38. During queries and ingestion, the system allocates buffers based on column width Wider than needed columns mean memory is wasted Fewer rows fit into memory; increased likelihood of queries spilling to disk DDL – Make Columns as narrow as possible
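A minimal illustration of the point above, with hypothetical names: buffers are sized from the declared width, not the actual data, so declare the narrowest type that fits.

```sql
-- A two-letter country code needs CHAR(2), not a defensive VARCHAR(255).
CREATE TABLE visits (
  country_code CHAR(2),
  city         VARCHAR(80)  -- sized to the longest expected value
);
```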
  39. 39. Define Primary & Foreign Keys Not Enforced but….. Helps optimizer with query plan DDL
  40. 40. Use the COPY command Each slice can load one file at a time A single input file means only one slice is ingesting data Instead of 100MB/s, you’re only getting 6.25MB/s Loading – Use multiple input files to maximize throughput
  41. 41. Use the COPY command You need at least as many input files as you have slices With 16 input files, all slices are working so you maximize throughput Get 100MB/s per node; scale linearly as you add nodes Loading – Use multiple input files to maximize throughput
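A sketch of the parallel-load pattern above, with placeholder bucket and credential values: split the input into at least as many files as there are slices, share a common prefix, and COPY from the prefix so every slice ingests its own file.

```sql
-- venue.txt split into venue.txt.1 ... venue.txt.16 under one S3 prefix:
COPY venue
FROM 's3://<bucket-name>/load/venue.txt'
CREDENTIALS 'aws_access_key_id=<AWS_ACCESS_KEY>;aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>'
DELIMITER '|';
-- The prefix matches all 16 parts, so all slices load in parallel.
```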
  42. 42. Tools – Use the AdminScripts
  43. 43. VACUUM reclaims space and re-sorts tables VACUUM can be run in 4 modes: • VACUUM FULL • Reclaims space and re-sorts • VACUUM DELETE ONLY • Reclaims space but does not re-sort • VACUUM SORT ONLY • Re-sorts but does not reclaim space • VACUUM REINDEX • Used for INTERLEAVED sort keys. • Re-Analyzes sort keys and then runs FULL VACUUM Vacuum
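The four modes above, as commands (the table name is illustrative):

```sql
VACUUM FULL lineitem;         -- reclaim space and re-sort
VACUUM DELETE ONLY lineitem;  -- reclaim space but do not re-sort
VACUUM SORT ONLY lineitem;    -- re-sort but do not reclaim space
VACUUM REINDEX lineitem;      -- re-analyze interleaved sort keys, then FULL
```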
  44. 44. VACUUM is an I/O intensive operation and can take time to run. To minimize the impact of VACUUM: • Run VACUUM on a regular schedule • Use TRUNCATE instead of DELETE where possible • TRUNCATE or DROP test tables • Perform a Deep Copy instead of VACUUM • Load Data in sort order and remove need for VACUUM Vacuum
  45. 45. • An alternative to VACUUM • Removes deleted rows and re-sorts the table • More efficient than VACUUM • You can't make concurrent updates to the table Deep copy options: • Use original table DDL and run INSERT INTO…SELECT • Best option - retains all table attributes • Use CREATE TABLE AS • New table does not inherit encoding, distkey, sortkey, primary keys, or foreign keys • Use CREATE TABLE LIKE • New table inherits all attributes except primary and foreign keys • Use a TEMP table to COPY data out and back in again • Retains all attributes but requires two full inserts of the table Vacuum – Deep Copy
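A sketch of the first (recommended) deep-copy option above; table names are illustrative:

```sql
-- Recreate from the original DDL so all attributes are retained,
-- copy the rows, then swap the tables.
CREATE TABLE lineitem_copy ( /* same DDL as lineitem, incl. distkey/sortkey */ );
INSERT INTO lineitem_copy (SELECT * FROM lineitem);
DROP TABLE lineitem;
ALTER TABLE lineitem_copy RENAME TO lineitem;
```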
  46. 46. Redshift’s query optimizer relies on up-to-date statistics Update stats on sort/dist key columns after every load Analyze
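As a sketch of the advice above (table and column names illustrative), statistics can be refreshed for just the sort and distribution key columns after a load, which is cheaper than analyzing the whole table:

```sql
ANALYZE lineitem (l_orderkey, l_shipdate);
```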
  47. 47. Analyze – AdminScripts: missing_table_stats.sql
  48. 48. Workload Management Workload management is about creating queues for different workloads User Group A Short-running queueLong-running queue Short Query Group Long Query Group
  49. 49. Workload Management
  50. 50. Workload Management Don't set concurrency to more than you need set query_group to allqueries; select avg(l.priceperticket*s.qtysold) from listing l, sales s where l.listid <40000; reset query_group;
  51. 51. Resources Sanjay Kotecha | kotechas@amazon.com Detail Pages • http://aws.amazon.com/redshift • https://aws.amazon.com/marketplace/redshift/ Best Practices • http://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html • http://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html • http://docs.aws.amazon.com/redshift/latest/dg/c-optimizing-query-performance.html Deep Dive Webinar Series in July • Migration and Loading Data – July 22nd, 2015 • Reporting and Advanced Analytics – July 23rd, 2015
  52. 52. Thank you!