
Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series


Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze big data for a fraction of the cost of traditional data warehouses. By following a few best practices, you can take advantage of Amazon Redshift’s columnar technology and parallel processing capabilities to minimize I/O and deliver high throughput and query performance. This webinar will cover techniques to load data efficiently, design optimal schemas, and tune query and database performance.

Learning Objectives:
• Get an inside look at Amazon Redshift's columnar technology and parallel processing capabilities
• Learn how to migrate from existing data warehouses, optimize schemas, and load data efficiently
• Learn best practices for managing workload, tuning your queries, and using Amazon Redshift's interleaved sorting features



  1. 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Pavan Pothukuchi, Principal Product Manager, AWS September 20, 2016 Deep Dive: Amazon Redshift for Big Data Analytics
  2. 2. Agenda • Service Overview • Best Practices • Schema / Table Design • Data Ingestion • Database Tuning • Migration • Examples
  3. 3. Service Overview
  4. 4. Relational data warehouse Massively parallel; petabyte scale Fully managed HDD and SSD platforms $1,000/TB/year; starts at $0.25/hour Amazon Redshift a lot faster a lot simpler a lot cheaper
  5. 5. Selected Amazon Redshift customers
  6. 6. Amazon Redshift system architecture Leader node • SQL endpoint • Stores metadata • Coordinates query execution Compute nodes • Local, columnar storage • Execute queries in parallel • Load, backup, restore via Amazon S3; load from Amazon DynamoDB, Amazon EMR, or SSH Two hardware platforms • Optimized for data processing • DS2: HDD; scale from 2TB to 2PB • DC1: SSD; scale from 160GB to 326TB 10 GigE (HPC) Ingestion Backup Restore JDBC/ODBC
  7. 7. A deeper look at compute node architecture Each node contains multiple slices • DS2 – 2 slices on XL, 16 on 8XL • DC1 – 2 slices on L, 32 on 8XL A slice can be thought of as a “virtual compute node” • Unit of data partitioning • Parallel query processing Facts about slices: • Each compute node has either 2, 16, or 32 slices • Table rows are distributed to slices • A slice processes only its own data
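A quick way to see the slice layout on your own cluster — a minimal sketch; STV_SLICES is a standard Redshift system view:

    -- Count slices per compute node
    SELECT node, COUNT(slice) AS slices
    FROM stv_slices
    GROUP BY node
    ORDER BY node;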
  8. 8. Amazon Redshift dramatically reduces I/O Data compression Zone maps ID Age State Amount 123 20 CA 500 345 25 WA 250 678 40 FL 125 957 37 WA 375 • Calculating SUM(Amount) with row storage: – Need to read everything – Unnecessary I/O ID Age State Amount
  9. 9. Amazon Redshift dramatically reduces I/O Data compression Zone maps ID Age State Amount 123 20 CA 500 345 25 WA 250 678 40 FL 125 957 37 WA 375 • Calculating SUM(Amount) with column storage: – Only scan the necessary blocks ID Age State Amount
  10. 10. Amazon Redshift dramatically reduces I/O Column storage Data compression Zone maps • Columnar compression – Effective due to like data – Reduces storage requirements – Reduces I/O ID Age State Amount
      analyze compression orders;
      Table  | Column | Encoding
      -------+--------+----------
      orders | id     | mostly32
      orders | age    | mostly32
      orders | state  | lzo
      orders | amount | mostly32
  11. 11. Amazon Redshift dramatically reduces I/O Column storage Data compression Zone maps • In-memory block metadata • Contains per-block MIN and MAX value • Effectively prunes blocks which don’t contain data for a given query • Minimize unnecessary I/O ID Age State Amount
  12. 12. Best Practices: Schema Design
  13. 13. Data Distribution • Distribution style is a table property which dictates how that table’s data is distributed throughout the cluster: • KEY: Value is hashed, same value goes to same location (slice) • ALL: Full table data goes to first slice of every node • EVEN: Round robin • Goals: • Distribute data evenly for parallel processing • Minimize data movement during query processing KEY ALL Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 EVEN
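A sketch of the three styles as table DDL; the table and column names here are hypothetical illustrations, not from the webinar:

    -- Large fact table: hash-distribute on the join key
    CREATE TABLE orders (
      order_id BIGINT,
      cust_id  BIGINT,
      amount   DECIMAL(12,2)
    ) DISTSTYLE KEY DISTKEY (cust_id);

    -- Small, slowly changing dimension: full copy on every node
    CREATE TABLE dim_calendar (
      cal_date   DATE,
      fiscal_qtr SMALLINT
    ) DISTSTYLE ALL;

    -- No good join or aggregation key: spread rows round robin
    CREATE TABLE web_events_raw (
      event_ts TIMESTAMP,
      payload  VARCHAR(8192)
    ) DISTSTYLE EVEN;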
  14. 14. ID Gender Name 101 M John Smith 292 F Jane Jones 139 M Peter Black 446 M Pat Partridge 658 F Sarah Cyan 164 M Brian Snail 209 M James White 306 F Lisa Green 2 3 4 ID Gender Name 101 M John Smith 306 F Lisa Green ID Gender Name 292 F Jane Jones 209 M James White ID Gender Name 139 M Peter Black 164 M Brian Snail ID Gender Name 446 M Pat Partridge 658 F Sarah Cyan Round Robin DISTSTYLE EVEN
  15. 15. ID Gender Name 101 M John Smith 292 F Jane Jones 139 M Peter Black 446 M Pat Partridge 658 F Sarah Cyan 164 M Brian Snail 209 M James White 306 F Lisa Green Hash Function ID Gender Name 101 M John Smith 306 F Lisa Green ID Gender Name 292 F Jane Jones 209 M James White ID Gender Name 139 M Peter Black 164 M Brian Snail ID Gender Name 446 M Pat Partridge 658 F Sarah Cyan DISTSTYLE KEY
  16. 16. ID Gender Name 101 M John Smith 292 F Jane Jones 139 M Peter Black 446 M Pat Partridge 658 F Sarah Cyan 164 M Brian Snail 209 M James White 306 F Lisa Green Hash Function ID Gender Name 101 M John Smith 139 M Peter Black 446 M Pat Partridge 164 M Brian Snail 209 M James White ID Gender Name 292 F Jane Jones 658 F Sarah Cyan 306 F Lisa Green DISTSTYLE KEY
  17. 17. ID Gender Name 101 M John Smith 292 F Jane Jones 139 M Peter Black 446 M Pat Partridge 658 F Sarah Cyan 164 M Brian Snail 209 M James White 306 F Lisa Green 101 M John Smith 292 F Jane Jones 139 M Peter Black 446 M Pat Partridge 658 F Sarah Cyan 164 M Brian Snail 209 M Lisa Green 306 F James White 101 M John Smith 292 F Jane Jones 139 M Peter Black 446 M Pat Partridge 658 F Sarah Cyan 164 M Brian Snail 209 M Lisa Green 306 F James White 101 M John Smith 292 F Jane Jones 139 M Peter Black 446 M Pat Partridge 658 F Sarah Cyan 164 M Brian Snail 209 M Lisa Green 306 F James White 101 M John Smith 292 F Jane Jones 139 M Peter Black 446 M Pat Partridge 658 F Sarah Cyan 164 M Brian Snail 209 M Lisa Green 306 F James White ALL DISTSTYLE ALL
  18. 18. CUSTOMERS CUST_ID GENDER NAME 101 M John Smith 306 F James White ORDERS ORDER_ID CUST_ID Amount A1600 101 120 B8765 306 340 RESULTS CUST_ID GENDER Amount 101 M 120 306 F 340 CUSTOMERS CUST_ID GENDER NAME 292 F Jane Jones 209 M Lyall Green ORDERS ORDER_ID CUST_ID Amount C0967 292 750 D8753 209 601 RESULTS CUST_ID GENDER Amount 292 F 750 209 M 601
  19. 19. CUSTOMERS CUST_ID GENDER NAME 101 M John Smith 306 F James White ORDERS ORDER_ID CUST_ID Amount A1600 101 120 B8765 306 340 RESULTS CUST_ID GENDER Amount 101 M 120 306 F 340 CUSTOMERS CUST_ID GENDER NAME 292 F Jane Jones 209 M Lyall Green ORDERS ORDER_ID CUST_ID Amount C0967 292 750 D8753 209 601 RESULTS CUST_ID GENDER Amount 292 F 750 209 M 601
  20. 20. Choosing a Distribution Style KEY • Large FACT tables • Large or rapidly changing tables used in joins • Localize columns used within aggregations ALL • Slowly changing data • Reasonable size (e.g., a few million rows, not hundreds of millions) • No common distribution key for frequent joins • Typical use case – joined dimension table without a common distribution key EVEN • Tables not frequently joined or aggregated • Large tables without acceptable candidate keys
  21. 21. Data Sorting Goals • Physically order rows of table data based on certain column(s) • Optimize effectiveness of zone maps • Enable MERGE JOIN operations Impact • Enables range-restricted scans (rrscans) to prune blocks by leveraging zone maps • Overall reduction in block I/O Achieved with the table property SORTKEY defined over one or more columns Optimal SORTKEY is dependent on: • Query patterns • Data profile • Business requirements
  22. 22. Zone Maps SELECT COUNT(*) FROM LOGS WHERE DATE = ‘09-JUNE-2013’ MIN: 01-JUNE-2013 MAX: 20-JUNE-2013 MIN: 08-JUNE-2013 MAX: 30-JUNE-2013 MIN: 12-JUNE-2013 MAX: 20-JUNE-2013 MIN: 02-JUNE-2013 MAX: 25-JUNE-2013 MIN: 06-JUNE-2013 MAX: 12-JUNE-2013 Unsorted Table MIN: 01-JUNE-2013 MAX: 06-JUNE-2013 MIN: 07-JUNE-2013 MAX: 12-JUNE-2013 MIN: 13-JUNE-2013 MAX: 18-JUNE-2013 MIN: 19-JUNE-2013 MAX: 24-JUNE-2013 MIN: 25-JUNE-2013 MAX: 30-JUNE-2013 Sorted By Date READ READ READ READ READ
  23. 23. Single Column • Table is sorted by 1 column Date Region Country 2-JUN-2015 Oceania New Zealand 2-JUN-2015 Asia Singapore 2-JUN-2015 Africa Zaire 2-JUN-2015 Asia Hong Kong 3-JUN-2015 Europe Germany 3-JUN-2015 Asia Korea [ SORTKEY ( date ) ] Best for: • Queries that use 1st column (i.e. date) as primary filter • Can speed up joins and group bys
  24. 24. Compound Date Region Country 2-JUN-2015 Africa Zaire 2-JUN-2015 Asia Korea 2-JUN-2015 Asia Singapore 2-JUN-2015 Europe Germany 3-JUN-2015 Asia Hong Kong 3-JUN-2015 Asia Korea [ SORTKEY COMPOUND ( date, region, country) ] Best for: • Queries that use 1st column as primary filter, then other columns • Can speed up joins and group bys
  25. 25. Interleaved • Equal weight is given to each column. Date Region Country 2-JUN-2015 Africa Zaire 3-JUN-2015 Asia Singapore 2-JUN-2015 Asia Korea 2-JUN-2015 Europe Germany 3-JUN-2015 Asia Hong Kong 2-JUN-2015 Asia Korea [ SORTKEY INTERLEAVED ( date, region, country) ] Best for: • Queries that use different columns in filter • Queries get faster the more columns used in the filter
  26. 26. Choosing a SORTKEY COMPOUND • Most common • Well-defined filter criteria • Time-series data INTERLEAVED • Edge cases • Large tables (> billion rows) • No common filter criteria • Non time-series data In general: • Primarily as a query predicate (date, identifier, …) • Optionally choose a column frequently used for aggregates • Optionally choose same as distribution key column for most efficient joins (merge join)
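Both styles are plain table properties; a minimal sketch with hypothetical table and column names:

    -- Time-series data with a well-defined leading filter column
    CREATE TABLE clicks_by_day (
      click_date DATE,
      region     VARCHAR(32),
      country    VARCHAR(32)
    ) COMPOUND SORTKEY (click_date, region, country);

    -- Very large table filtered by different columns with no clear leader
    CREATE TABLE clicks_adhoc (
      click_date DATE,
      region     VARCHAR(32),
      country    VARCHAR(32)
    ) INTERLEAVED SORTKEY (click_date, region, country);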
  27. 27. Compressing Data • COPY automatically analyzes and compresses data when loading into empty tables • ANALYZE COMPRESSION checks existing tables and proposes optimal compression algorithms for each column • Changing column encoding requires a table rebuild
  28. 28. Compressing Data If you have a regular ETL process and you use temp tables or staging tables, turn off automatic compression • Use analyze compression to determine the right encodings • Bake those encodings into your DML • Use CREATE TABLE … LIKE
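A sketch of that ETL pattern; the table name, S3 bucket, and IAM role are hypothetical:

    -- One-time: ask Amazon Redshift for per-column encoding recommendations
    ANALYZE COMPRESSION orders;

    -- Bake the encodings into the staging table by cloning the tuned table
    -- (the CREATE TABLE ... LIKE tip from the slide), then tell COPY to skip
    -- its own compression analysis
    CREATE TEMP TABLE orders_staging (LIKE orders);

    COPY orders_staging
    FROM 's3://example-bucket/staging/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
    COMPUPDATE OFF STATUPDATE OFF;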
  29. 29. Compressing Data • From the zone maps we know: • Which block(s) contain the range • Which row offsets to scan • Highly compressed sort keys: • Many rows per block • Large row offset Skip compression on just the leading column of the compound sortkey
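A hedged example of that exception, with hypothetical names; the leading sort key column stays RAW while the remaining columns are compressed:

    CREATE TABLE app_logs (
      log_date DATE          ENCODE RAW,   -- leading SORTKEY column: left uncompressed
      source   VARCHAR(64)   ENCODE LZO,
      message  VARCHAR(4096) ENCODE LZO
    ) COMPOUND SORTKEY (log_date, source);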
  30. 30. Best Practices: Ingestion
  31. 31. Amazon Redshift Loading Data Overview AWS Cloud / Corporate Data Center Amazon DynamoDB Amazon S3 Data Volume Amazon Elastic MapReduce Amazon RDS Amazon Redshift Amazon Glacier logs / files Source DBs VPN Connection AWS Direct Connect S3 Multipart Upload AWS Import/Export EC2 or On-Prem (using SSH)
  32. 32. Parallelism is a function of load files Each slice’s query processors are able to load one file at a time • Streaming Decompression • Parse • Distribute • Write A single input file means only one slice is ingesting data Realizing only partial cluster usage as 6.25% of slices are active
  33. 33. Maximize Throughput with Multiple Files Use at least as many input files as there are slices in the cluster With 16 input files, all slices are working so you maximize throughput COPY continues to scale linearly as you add additional nodes
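For example (bucket, key prefix, and IAM role are hypothetical), splitting the load into gzipped parts under one prefix lets every slice pull its own file:

    COPY orders
    FROM 's3://example-bucket/load/orders/part_'
    IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
    GZIP
    DELIMITER '|';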
  34. 34. New feature: ALTER TABLE APPEND ELT workloads typically “massage” or aggregate data in a staging table and then append to production table ALTER TABLE APPEND moves data from staging to production table by manipulating metadata Much faster than INSERT INTO as data is not duplicated
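A minimal sketch, assuming a staging table shaped like the production table:

    -- Moves the rows by reassigning storage rather than copying them
    ALTER TABLE orders APPEND FROM orders_staging;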
  35. 35. Best Practices: Performance Tuning
  36. 36. Optimizing a database for querying • Periodically check your table status • Vacuum and Analyze regularly • SVV_TABLE_INFO • Missing statistics • Table skew • Uncompressed Columns • Unsorted Data • Check your cluster status • WLM queuing • Commit queuing • Database Locks
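A sketch of that periodic table check (SVV_TABLE_INFO is a standard system view; which thresholds to act on is up to you):

    SELECT "table", encoded, diststyle, skew_rows, unsorted, stats_off
    FROM svv_table_info
    ORDER BY unsorted DESC;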
  37. 37. Missing Statistics • Amazon Redshift’s query optimizer relies on up-to-date statistics • Statistics are only necessary for data which you are accessing • Updated stats important on: • SORTKEY • DISTKEY • Columns in query predicates
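For example, with hypothetical table and column names:

    -- Focus ANALYZE on the columns the planner actually uses
    ANALYZE orders (order_date, cust_id);
    -- Small dimension table: just analyze everything
    ANALYZE dim_calendar;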
  38. 38. Table Maintenance and Status Table Skew • Unbalanced workload • Query completes only as fast as the slowest slice • Can cause skew in flight: • Temp data fills a single node, resulting in query failure Unsorted Table • Sort key is just a guide; data needs to actually be sorted • VACUUM or deep copy to sort • Scans against unsorted tables continue to benefit from zone maps: • Load sequential blocks
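Both remedies in SQL, with hypothetical table names:

    -- Re-sort and reclaim deleted space in place
    VACUUM orders;

    -- Deep copy: rewrite the table fully sorted in one pass, then swap names
    CREATE TABLE orders_new (LIKE orders);
    INSERT INTO orders_new SELECT * FROM orders;
    DROP TABLE orders;
    ALTER TABLE orders_new RENAME TO orders;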
  39. 39. Cluster Status: Commits and WLM WLM Queue • Identify short/long-running queries and prioritize them • Define multiple queues to route queries appropriately • Default concurrency of 5 • Leverage wlm_apex_hourly to tune WLM based on peak concurrency requirements Commit Queue • How long is your commit queue? • Identify needless transactions • Group dependent statements within a single transaction • Offload operational workloads • STL_COMMIT_STATS
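A sketch of the commit-queue check (STL_COMMIT_STATS is a standard system log table; filtering on node = -1 for the leader-node entry is an assumption):

    SELECT startqueue, startwork,
           DATEDIFF(ms, startqueue, startwork) AS queue_wait_ms,
           queuelen
    FROM stl_commit_stats
    WHERE node = -1
    ORDER BY startwork DESC
    LIMIT 20;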
  40. 40. Cluster Status: Database Locks • Database Locks • Read locks, Write locks, Exclusive locks • Reads block exclusive • Writes block writes and exclusive • Exclusives block everything • Ungranted locks block subsequent lock requests • Exposed through SVV_TRANSACTIONS
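A minimal query against that view:

    -- Open transactions and the locks they hold or are waiting on
    SELECT xid, pid, txn_owner, relation, lock_mode, granted, txn_start
    FROM svv_transactions
    ORDER BY txn_start;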
  41. 41. Migration Considerations
  42. 42. Typical ETL/ELT on legacy data warehouse • One file per table, maybe a few if too big • Many updates (“massage” the data) • Every job clears the data, then loads • Count on primary key to block double loads • High concurrency of load jobs • Small table(s) to control the job stream
  43. 43. Two questions to ask Why do you do what you do? • Many times, users don’t know What is the customer need? • Many times, needs do not match current practice • You might benefit from adding other AWS services
  44. 44. On Amazon Redshift Updates are delete + insert of the row • Deletes just mark rows for deletion Blocks are immutable • Minimum space used is one block per column, per slice Commits are expensive • 4 GB write on 8XL per node • Mirrors WHOLE dictionary • Cluster-wide serialized
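Because each commit is cluster-wide and serialized, the earlier advice to group dependent statements into a single transaction pays off; a minimal sketch with hypothetical tables and reload window:

    BEGIN;
      DELETE FROM orders WHERE order_date = '2016-09-01';
      INSERT INTO orders
      SELECT * FROM orders_staging WHERE order_date = '2016-09-01';
    COMMIT;  -- one commit instead of two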
  45. 45. On Amazon Redshift • Not all aggregations created equal • Pre-aggregation can help • Order on group by matters • Concurrency should be low for better throughput • Caching layer for dashboards is recommended • WLM parcels RAM to queries. Use multiple queues for better control.
  46. 46. Workload Management (WLM) Concurrency and memory can now be changed dynamically You can have distinct values for load time and query time Use wlm_apex_hourly.sql to monitor “queue pressure”
  47. 47. New Feature – WLM Queue Hopping
  48. 48. Query throughput vs. Concurrency • Query throughput (QPM or QPH) is more representative of end user experience than concurrency • Several improvements over the last 6 months • Commit improvements • Dynamic resource management • Query throughput doubled over the last 6 months
  49. 49. Resources https://github.com/awslabs/amazon-redshift-utils https://github.com/awslabs/amazon-redshift-monitoring https://github.com/awslabs/amazon-redshift-udfs https://s3.amazonaws.com/chriz-webinar/webinar.zip Admin scripts Collection of utilities for running diagnostics on your cluster Admin views Collection of utilities for managing your cluster, generating schema DDL, etc. ColumnEncodingUtility Gives you the ability to apply optimal column encoding to an established schema with data already loaded
  50. 50. Q&A If you want to learn more, register for our upcoming DevDay Austin: Monday, October 24, 2016, JW Marriott Austin, https://aws.amazon.com/events/devday-austin – a free, one-day developer event featuring tracks, labs, and workshops around Serverless, Containers, IoT, and Mobile
  51. 51. Appendix: Performance optimization examples
  52. 52. Use SORTKEYs to effectively prune blocks
  53. 53. Use SORTKEYs to effectively prune blocks
  54. 54. Use SORTKEYs to effectively prune blocks
  55. 55. Don’t compress initial SORTKEY column
  56. 56. Use compression encoding to reduce I/O
  57. 57. Choose a DISTKEY which avoids data skew
  58. 58. Ingest: Disable predictable compression analysis
  59. 59. Ingest: Load multiple files to match cluster slices
  60. 60. VACUUM to physically remove deleted rows
  61. 61. VACUUM to keep your tables sorted
  62. 62. Gather statistics to assist the query planner
