
How to Fine-Tune Performance Using Amazon Redshift

How to fine-tune Amazon Redshift performance to meet your requirements: best practices and recommendations.



  1. Amazon Redshift: Good practices for performance optimization
  2. What is Amazon Redshift?
     • Petabyte-scale columnar database
     • Fast response time
       – ~10 times faster than a typical relational database
     • Cost-effective (around $1,000 per TB per year)
  3. When do customers use Amazon Redshift?
     Replacing a traditional DWH:
     • Reduce cost by extending the DW instead of adding HW
     • Migrate from an existing DWH system
     • Respond faster to the business: provision in minutes
     Analyzing Big Data:
     • Improve performance by an order of magnitude
     • Make more data available for analysis
     • Access business data via standard reporting tools
     Providing services and SaaS:
     • Add analytics functionality to applications
     • Scale DW capacity as demand grows
     • Reduce SW and HW costs by an order of magnitude
  4. Amazon Redshift architecture
  5. Redshift reduces I/O
     Column storage:
     • With raw storage you do unnecessary I/O
     • With columnar storage you only read the data you need
     Data compression:
     • COPY compresses automatically
     • You can analyze and override
     • More performance, less cost
     Zone maps:
     • Track the minimum and maximum value for each block
     • Skip over blocks that do not contain relevant data
     Direct-attached storage:
     • > 2 GB/sec scan rate
     • Optimized for data processing
     • High disk density
  6. Redshift Recap
  7. Main possible issues with Redshift performance
     • Incorrect column encoding
     • Skewed table data
     • Queries not benefiting from sort keys – excessive scanning
     • Tables without statistics or which need vacuum
     • Tables with very large VARCHAR columns
     • Queries waiting on queue slots
     • Queries that are disk-based – incorrect sizing, GROUP BY distinct values, UNION, hash joins with DISTINCT values
     • Commit queue waits
     • Inefficient data loads
     • Inefficient use of temporary tables
     • Large nested loop JOINs
     • Inappropriate join cardinality
  8. Incorrect column encoding
     Running an Amazon Redshift cluster without column encoding is not considered a best practice, and customers see large performance gains when they ensure that column encoding is optimally applied. To determine whether you are deviating from this best practice, use the v_extended_table_info view from the Amazon Redshift Utils GitHub repository. If you find tables without optimal column encoding, use the Amazon Redshift Column Encoding Utility from the same repository to apply encoding. This command line utility runs the ANALYZE COMPRESSION command on each table. If encoding is required, it generates a SQL script that creates a new table with the correct encoding, copies all the data into the new table, and then transactionally renames the new table to the old name while retaining the original data. Available encodings:
     • Raw
     • Byte-Dictionary
     • Delta
     • LZO
     • Mostly
     • Runlength
     • Text255 and Text32k
     • Zstandard
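     As a quick manual check before running the full utility, you can ask Redshift for its encoding recommendations directly. A minimal sketch, assuming a table named lineitem (the table name is illustrative):

        -- Sample the table and report the suggested encoding per column
        ANALYZE COMPRESSION lineitem;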
  9. Skewed table data
     If skew is a problem, you typically see uneven node performance on the cluster. Use table_inspector.sql to see how data blocks in a distribution key map to the slices and nodes in the cluster. Consider changing the distribution key to a column that exhibits high cardinality and uniform distribution. Evaluate a candidate column as a distribution key by creating a new table using CTAS:
        CREATE TABLE my_test_table DISTKEY (<column name>) AS SELECT <column name> FROM <table name>;
     You can use AWS SCT to set the proper distribution key and style during the migration process.
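     To spot skewed tables across the whole cluster, the svv_table_info system view exposes a skew_rows ratio (rows on the fullest slice divided by rows on the emptiest; values near 1 are ideal). A minimal sketch:

        -- Tables with the most row skew across slices
        SELECT "table", diststyle, skew_rows
        FROM svv_table_info
        WHERE skew_rows IS NOT NULL
        ORDER BY skew_rows DESC
        LIMIT 20;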
  10. Deal with tables with very large VARCHAR columns
     During processing of complex queries, intermediate query results might need to be stored in temporary blocks. These temporary tables are not compressed, so unnecessarily wide columns consume excessive memory and temporary disk space, which can affect query performance. Use the following query to generate a list of tables whose maximum column widths should be reviewed:
        SELECT database, schema || '.' || "table" AS "table", max_varchar
        FROM svv_table_info
        WHERE max_varchar > 150
        ORDER BY 2;
     Then identify which table columns are wide and determine the true maximum width for each wide column:
        SELECT max(len(rtrim(column_name))) FROM table_name;
     If you query the top running queries for the database using the top_queries.sql admin script, pay special attention to SELECT * queries that include a JSON fragment column. If end users query these large columns but don't actually execute JSON functions against them, consider moving them into another table that contains only the primary key column of the original table and the JSON column.
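     If you decide to split a wide JSON column out of a hot table, a minimal sketch of the rewrite (the table and column names are illustrative):

        -- Move the wide JSON column into a side table keyed by the original PK
        CREATE TABLE customer_json
        DISTKEY (customer_id)
        SORTKEY (customer_id)
        AS SELECT customer_id, json_payload FROM customer;

        -- Then drop the wide column from the original table
        ALTER TABLE customer DROP COLUMN json_payload;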
  11. Queries not benefiting from sort keys
     Amazon Redshift tables can have a sort key column identified, which acts like an index in other databases but does not incur a storage cost as on other platforms (for more information, see Choosing Sort Keys). A sort key should be created on the columns most commonly used in WHERE clauses. If you have a known query pattern, COMPOUND sort keys give the best performance; if end users query different columns equally, use an INTERLEAVED sort key. If using compound sort keys, review your queries to ensure that their WHERE clauses specify the sort columns in the same order they were defined in the compound key. To determine which tables don't have sort keys, run the following query against the v_extended_table_info view from the Amazon Redshift Utils repository:
        SELECT * FROM admin.v_extended_table_info WHERE sortkey IS null;
     A missing sort key can lead to an excessive-scanning issue.
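     A sketch of declaring a compound sort key that matches a known filter pattern (the table and column names are illustrative):

        -- Queries filter on event_date first, then customer_id, so the
        -- compound key lists the columns in that same order
        CREATE TABLE events (
          event_date  date,
          customer_id bigint,
          payload     varchar(256)
        )
        COMPOUND SORTKEY (event_date, customer_id);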
  12. Queries waiting on queue slots (ICT)
     Amazon Redshift runs queries using a queuing system known as workload management (WLM). You can define up to 8 queues to separate workloads from each other and set the concurrency on each queue to meet your overall throughput requirements. In some cases, the queue to which a user or query has been assigned is completely busy, and the query must wait for a slot to open. During this time the system is not executing the query at all, which is a sign that you may need to increase concurrency. First, determine whether any queries are queuing, using the queuing_queries.sql admin script. Review the maximum concurrency that your cluster has needed in the past with wlm_apex.sql, down to an hour-by-hour historical analysis with wlm_apex_hourly.sql. Keep in mind that while increasing concurrency allows more queries to run, each query gets a smaller share of the memory allocated to its queue.
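     For a single memory-hungry statement, you can also claim more than one slot for the current session instead of permanently raising queue concurrency. A sketch using the built-in wlm_query_slot_count parameter:

        -- Temporarily take 3 slots (and their memory) in this session's queue
        SET wlm_query_slot_count TO 3;
        -- ... run the heavy query here ...
        SET wlm_query_slot_count TO 1;  -- give the slots back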
  13. Queries that are disk-based (ICT)
     If a query isn't able to execute completely in memory, it may need to use disk-based temporary storage for parts of its explain plan. The additional disk I/O slows down the query; this can be addressed by increasing the amount of memory allocated to a session (for more information, see WLM Dynamic Memory Allocation). To determine whether any queries have been writing to disk, use the following query:
        SELECT q.query, trim(q.cat_text)
        FROM (SELECT query,
                     replace(listagg(text, ' ') WITHIN GROUP (ORDER BY sequence), '\\n', ' ') AS cat_text
              FROM stl_querytext
              WHERE userid > 1
              GROUP BY query) q
        JOIN (SELECT DISTINCT query
              FROM svl_query_summary
              WHERE is_diskbased = 't'
                AND (label LIKE 'hash%' OR label LIKE 'sort%' OR label LIKE 'aggr%')
                AND userid > 1) qs
          ON qs.query = q.query;
     Based on the user or the queue assignment rules, you can then increase the amount of memory given to the affected queue to prevent queries from needing to spill to disk to complete.
  14. Commit queue waits (ICT)
     Amazon Redshift is designed for analytics queries rather than transaction processing. The cost of COMMIT is relatively high, and excessive use of COMMIT can result in queries waiting for access to a commit queue. If you are committing too often on your database, you will start to see waits on the commit queue increase, which can be viewed with the commit_stats.sql admin script. This script shows the largest queue length and queue time for queries run in the past two days. If you have queries that are waiting on the commit queue, look for sessions that are committing multiple times per session, such as ETL jobs that are logging progress, or inefficient data loads. One of the worst practices is to insert data into Amazon Redshift row by row; use the COPY command or an ETL tool compatible with Amazon Redshift instead.
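     A minimal COPY sketch (the bucket, prefix, and IAM role are illustrative):

        -- Bulk-load from S3 in one transaction instead of row-by-row INSERTs
        COPY sales
        FROM 's3://my-bucket/sales/2017/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
        GZIP
        DELIMITER '|';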
  15. Inefficient use of temporary tables
     Amazon Redshift provides temporary tables, which are like normal tables except that they are only visible within a single session. When the user disconnects the session, the tables are automatically deleted. Temporary tables can be created using the CREATE TEMPORARY TABLE syntax or by issuing a SELECT … INTO #TEMP_TABLE query. The CREATE TABLE statement gives you complete control over the definition of the temporary table, while the SELECT … INTO and CTAS commands use the input data to determine column names, sizes, and data types, and use default storage properties. These default storage properties may cause issues if not carefully considered. Amazon Redshift's default table structure uses EVEN distribution with no column encoding. This is a sub-optimal data structure for many types of queries, and with the SELECT … INTO syntax you cannot set the column encoding or the distribution and sort keys. With the CREATE TABLE AS (CTAS) syntax you can specify a distribution style and sort keys, and Redshift automatically applies LZO encoding to everything other than sort keys, booleans, reals, and doubles. If you consider this automatic encoding sub-optimal and require further control, use the CREATE TABLE syntax rather than CTAS. If you are creating temporary tables, it is highly recommended that you convert all SELECT … INTO syntax to the CREATE statement. This ensures that your temporary tables have column encodings and are distributed in a fashion that is sympathetic to the other entities that are part of the workflow.
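     A sketch of an explicitly declared temporary table (the names, encodings, and keys are illustrative):

        CREATE TEMPORARY TABLE stage_orders (
          order_id    bigint ENCODE zstd,
          customer_id bigint ENCODE zstd,
          order_date  date   ENCODE raw    -- sort key column left uncompressed
        )
        DISTKEY (customer_id)
        SORTKEY (order_date);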
  16. Tables without statistics
     • Amazon Redshift, like other databases, requires statistics about tables and the composition of the data blocks being stored in order to make good decisions when planning a query (for more information, see Analyzing Tables). Without good statistics, the optimizer may make suboptimal choices about the order in which to access tables or how to join datasets together.
     • The ANALYZE Command History topic in the Amazon Redshift Developer Guide supplies queries to help you address missing or stale statistics. You can also simply run the missing_table_stats.sql admin script to determine which tables are missing stats, or the statement below to find tables with stale statistics:
        SELECT database, schema || '.' || "table" AS "table", stats_off
        FROM svv_table_info
        WHERE stats_off > 5
        ORDER BY 2;
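     To refresh statistics by hand, a minimal sketch (the table and column names are illustrative):

        -- Recompute planner statistics for one table
        ANALYZE sales;
        -- Or restrict the work to the columns used in joins and filters
        ANALYZE sales (customer_id, order_date);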
  17. Tables which need VACUUM
     In Amazon Redshift, data blocks are immutable. When rows are DELETED or UPDATED, they are simply logically deleted (flagged for deletion), not physically removed from disk. Updates result in a new block being written with the new data appended. Both of these operations cause the previous version of the row to continue consuming disk space and to continue being scanned whenever a query scans the table. As a result, table storage grows and performance degrades due to otherwise avoidable disk I/O during scans. A VACUUM command recovers the space from deleted rows and restores the sort order. You can use the perf_alert.sql admin script to identify tables that have had alerts about scanning a large number of deleted rows raised in the last seven days. To address tables with missing or stale statistics or where a vacuum is required, run another AWS Labs utility, Analyze & Vacuum Schema. This ensures that you always keep statistics up to date and only vacuum tables that actually need reorganization.
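     Run manually, a minimal sketch against a single table (the table name and threshold are illustrative):

        -- Reclaim space from deleted rows and restore the sort order,
        -- stopping once the table is at least 99 percent sorted
        VACUUM FULL sales TO 99 PERCENT;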
  18. Analyze the performance of the queries and address issues
     Your best friends in the Redshift world are:
     • ANALYZE – for identifying out-of-date statistics
     • SVL_QUERY_SUMMARY view – for a summary of all the useful data around the behavior of your queries and a high-level overview of the cluster
     • Query alerts – for receiving important information about deviations in the behavior of your queries
     • SVL_QUERY_REPORT view – for collecting details about your queries' health and performance
  19. Set up and fine-tune monitoring and maintenance procedures
     In addition to your preferred monitoring methods and multiple partner solutions, there are some helpful tools:
     • Amazon Redshift Column Encoding Utility.
     • table_inspector.sql – see how data blocks in a distribution key map to the slices and nodes in the cluster.
     • missing_table_stats.sql – determine which tables are missing stats.
     • perf_alert.sql – identify tables that have had alerts about scanning a large number of deleted rows raised in the last seven days.
     • top_queries.sql – determine the top running queries.
     • wlm_apex.sql – review the maximum concurrency that your cluster has needed in the past, down to an hour-by-hour historical analysis with wlm_apex_hourly.sql.
     • commit_stats.sql – view the commit stats.
  20. Eliminate nested loops
     • Caused by a SQL query specifying a join condition that requires a "brute force" approach between two large tables
     • Quite easy to spot: look for "Nested Loop" in query plans or in STL_ALERT_EVENT_LOG
     • Typical offending conditions:
        ON a.date > b.date
        ON a.text LIKE b.text
        ON a.x = b.x OR a.y = b.y
     • Rewrite an inequality JOIN condition as a window function
     • Use a small nested loop instead of one over two large tables
     • Alternatively, run the nested loop join once and persist the results in a separate relational table
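     A sketch of pulling recent nested-loop alerts from the alert log (the event text shown is the one Redshift emits for this condition):

        -- Queries recently flagged with a nested-loop join, newest first
        SELECT query, trim(event) AS event, trim(solution) AS solution, event_time
        FROM stl_alert_event_log
        WHERE event LIKE 'Nested Loop Join in the query plan%'
        ORDER BY event_time DESC
        LIMIT 20;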
  21. Inappropriate join cardinality
     • Query "fan-out": look for a high number of rows generated from joins (higher than the sum of rows of all scanned tables) in the query plan or execution metrics; use SVL_QUERY_SUMMARY to detect it
     • Carefully review the join logic; a typical fan-out pattern:
        FROM house
        JOIN rooms ON rooms.house_id = house.id
        JOIN residents ON residents.house_id = house.id
     • If possible, break large fan-out queries into several smaller queries
     • Use derived tables to transform one-to-many joins into one-to-one joins (see the sketch below)
     • Try out some advanced techniques like 1 and 2
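     A sketch of the derived-table rewrite, using the slide's house/rooms/residents example so that each join becomes one-to-one:

        -- Pre-aggregate the "many" sides, then join once per house
        SELECT h.id, r.room_count, p.resident_count
        FROM house h
        JOIN (SELECT house_id, count(*) AS room_count
              FROM rooms GROUP BY house_id) r ON r.house_id = h.id
        JOIN (SELECT house_id, count(*) AS resident_count
              FROM residents GROUP BY house_id) p ON p.house_id = h.id;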
  22. Suboptimal WHERE clause
     If your WHERE clause causes excessive table scans, you might see a SCAN step in the segment with the highest maxtime value in SVL_QUERY_SUMMARY. For more information, see Using the SVL_QUERY_SUMMARY View. To fix this issue, add a WHERE clause to the query based on the primary sort column of the largest table. This approach will help minimize scanning time. For more information, see Amazon Redshift Best Practices for Designing Tables.
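     A before/after sketch, assuming event_date is the primary sort column of the large table (the names and dates are illustrative):

        -- Before: no sort-column predicate, so every block is scanned
        SELECT count(*) FROM events WHERE customer_id = 42;

        -- After: zone maps let Redshift skip blocks outside the date range
        SELECT count(*) FROM events
        WHERE event_date >= '2017-01-01' AND event_date < '2017-02-01'
          AND customer_id = 42;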
  23. Insufficiently restrictive predicate
     If your query has an insufficiently restrictive predicate, you might see a SCAN step in the segment with the highest maxtime value in SVL_QUERY_SUMMARY that has a very high rows value compared to the rows value in the final RETURN step of the query. For more information, see Using the SVL_QUERY_SUMMARY View. To fix this issue, try adding a predicate to the query or making the existing predicate more restrictive to narrow the output.
  24. Very large result set
     If your query returns a very large result set, consider rewriting the query to use UNLOAD to write the results to Amazon S3. This approach will improve the performance of the RETURN step by taking advantage of parallel processing. For more information on checking for a very large result set, see Using the SVL_QUERY_SUMMARY View.
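     A minimal UNLOAD sketch (the query, bucket, and IAM role are illustrative):

        -- Each slice writes its part of the result to S3 in parallel,
        -- instead of funneling everything through the leader node
        UNLOAD ('SELECT * FROM big_result_view')
        TO 's3://my-bucket/exports/big_result_'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
        GZIP;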
  25. Large SELECT list
     If your query has an unusually large SELECT list, you might see a bytes value that is high relative to the rows value for any step (in comparison to other steps) in SVL_QUERY_SUMMARY. This high bytes value can be an indicator that you are selecting a lot of columns. For more information, see Using the SVL_QUERY_SUMMARY View. To fix this issue, review the columns you are selecting and see if any can be removed.
  26. Working with SVL_QUERY_SUMMARY
     Select your query ID:
        SELECT query, elapsed, substring FROM svl_qlog ORDER BY query DESC LIMIT 5;
     Collect the query data:
        SELECT * FROM svl_query_summary WHERE query = MyQueryID ORDER BY stm, seg, step;
     For analysis, please refer to: https://docs.aws.amazon.com/redshift/latest/dg/using-SVL-Query-Summary.html
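     A sketch of narrowing the detail view to the slowest steps of a given query (replace MyQueryID as above; is_diskbased = 't' flags steps that spilled to disk):

        SELECT stm, seg, step, maxtime, rows, bytes, is_diskbased, label
        FROM svl_query_summary
        WHERE query = MyQueryID
        ORDER BY maxtime DESC
        LIMIT 10;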
  27. Some additional materials can be found here
     • Best practices for designing queries
     • Best practices for designing tables
     • Top 10 performance tuning techniques
     • Troubleshooting queries
     Amazon Redshift Developer Guide: https://docs.aws.amazon.com/redshift/latest/dg/redshift-dg.pdf
