AWS Webcast - Amazon Redshift Best Practices for Data Loading and Query Performance
Loading very large data sets can take a long time and consume a lot of computing resources. How data is loaded can also affect query performance. We will discuss best practices for loading data efficiently using COPY commands, bulk inserts, and staging tables. We will also cover the key design decisions that heavily influence overall query performance. These design choices also have a significant effect on storage requirements: a smaller storage footprint reduces the number of I/O operations and minimizes the memory required to process queries, which in turn improves query performance.

Comments

  • Where can I get more information on WLM, creating user groups, and query groups? The docs didn't seem to help.

Presentation Transcript

  • Amazon Redshift Best Practices – Part 1
    April 2013
    Vidhya Srinivasan & David Pearson
  • Agenda
    • Introduction
    • Redshift cluster architecture
    • Best practices for:
      - Data loading
      - Key selection
      - Querying
      - WLM
    • Q&A
  • AWS Database Services
    - Amazon Redshift: fast, powerful, fully managed, petabyte-scale data warehouse service
    - Amazon DynamoDB: fast, predictable, highly scalable NoSQL data store
    - Amazon RDS: managed relational database service for MySQL, Oracle and SQL Server
    - Amazon ElastiCache: in-memory caching service
    (Diagram: the AWS platform stack – Application Services; Deployment & Administration; Compute, Storage, Database, Networking; AWS Global Infrastructure – providing scalable, high-performance application storage in the cloud)
  • Objectives
    Design and build a petabyte-scale data warehouse service. Amazon Redshift:
    - A lot faster
    - A lot cheaper
    - A whole lot simpler
  • Redshift Dramatically Reduces I/O
    • Direct-attached storage
    • Large data block sizes
    • Columnar storage
    • Data compression
    • Zone maps
    (Illustration: a sample table with columns Id, Age, State and rows 123/20/CA, 345/25/WA, 678/40/FL, laid out as row storage vs. column storage)
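    Data compression is applied per column. A hedged sketch of what that looks like in DDL (the table and the encoding choices are illustrative, not from the deck):

          create table users_example (
            id    integer  encode delta,     -- ascending ids compress well with delta
            age   smallint encode bytedict,  -- few distinct values: byte dictionary
            state char(2)  encode bytedict
          )
          sortkey (id);

          -- Or have Redshift recommend encodings from a sample of loaded data:
          analyze compression users_example;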
  • Redshift Runs on Optimized Hardware (clusters scale to 1.6 PB)
    - HS1.8XL: 128 GB RAM, 16 cores, 24 spindles, 16 TB storage, 2 GB/sec scan rate
    - HS1.XL: 16 GB RAM, 2 cores, 3 spindles, 2 TB storage
    • Optimized for I/O-intensive workloads
    • HS1.8XL available on Amazon EC2
    • Runs in HPC – fast network
    • High disk density
  • (Chart: the gap between data generated and data available for analysis widens as data volume grows and cost + effort rise. Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares")
  • Redshift is Priced to Analyze All Your Data
    - $0.85 per hour for on-demand (2 TB)
    - $999 per TB per year (3-year reservation)
  • Amazon Redshift Architecture
    • Leader node
      - SQL endpoint (Postgres-based, JDBC/ODBC)
      - Stores metadata
      - Communicates with the client
      - Compiles queries
      - Coordinates query execution
    • Compute nodes
      - Local, columnar storage
      - Execute queries in parallel across slices
      - Interconnected by a fast 10 GigE (HPC) network
      - Load, backup, and restore via Amazon S3 (ingestion / backup / restore)
    • Everything is mirrored
  • Ingestion – Best Practices
    • Goal
      - Leverage all the compute nodes (1 leader node & n compute nodes) and minimize overhead
    • Best practices
      - Preferred method: COPY from Amazon S3; it loads data in sorted order through the compute nodes
      - Use a single COPY command, but split the data into multiple files (a split-file sketch follows this slide)
      - Strongly recommended: gzip large datasets

          copy time
          from 's3://mybucket/data/timerows.gz'
          credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
          gzip
          delimiter '|';

    • If you must ingest through SQL
      - Use multi-row inserts; avoid large numbers of singleton insert/update/delete operations

          insert into category_stage values
          (default, default, default, default),
          (20, default, 'Country', default),
          (21, 'Concerts', 'Rock', default);

    • To copy from another table, use CREATE TABLE AS or INSERT INTO … SELECT
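    A note on the split: COPY accepts an S3 key prefix instead of a single object, in which case it loads every matching file and the slices work in parallel. A minimal sketch, assuming files timerows.01.gz … timerows.08.gz exist under that prefix (bucket and file names are illustrative):

          -- The prefix 'timerows.' matches timerows.01.gz ... timerows.08.gz,
          -- so all compute node slices ingest in parallel
          copy time
          from 's3://mybucket/data/timerows.'
          credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
          gzip
          delimiter '|';

    Aim for a file count that is a multiple of the number of slices in the cluster so no slice sits idle.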
  • Ingestion – Best Practices (Cont’d)
    • Verifying load data files
      - For US East, S3 provides eventual consistency
      - Verify the files are in S3 by listing object keys
    • Query Redshift after the load; this query returns entries for loading the tables in the TICKIT database:

          select query, trim(filename), curtime, status
          from stl_load_commits
          where filename like '%tickit%'
          order by query;

           query |           btrim           |          curtime           | status
          -------+---------------------------+----------------------------+--------
           22475 | tickit/allusers_pipe.txt  | 2013-02-08 20:58:23.274186 |      1
           22478 | tickit/venue_pipe.txt     | 2013-02-08 20:58:25.070604 |      1
           22480 | tickit/category_pipe.txt  | 2013-02-08 20:58:27.333472 |      1
           22482 | tickit/date2008_pipe.txt  | 2013-02-08 20:58:28.608305 |      1
           22485 | tickit/allevents_pipe.txt | 2013-02-08 20:58:29.99489  |      1
           22487 | tickit/listings_pipe.txt  | 2013-02-08 20:58:37.632939 |      1
           22593 | tickit/allusers_pipe.txt  | 2013-02-08 21:04:08.400491 |      1
           22596 | tickit/venue_pipe.txt     | 2013-02-08 21:04:10.056055 |      1
           22598 | tickit/category_pipe.txt  | 2013-02-08 21:04:11.465049 |      1
           22600 | tickit/date2008_pipe.txt  | 2013-02-08 21:04:12.461502 |      1
           22603 | tickit/allevents_pipe.txt | 2013-02-08 21:04:14.785124 |      1
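    To scope the check to the COPY you just ran rather than pattern-matching on file names, Redshift's session-level helpers can be used. A small sketch, assuming it runs in the same session as the load:

          -- Rows loaded by the most recent COPY in this session
          select pg_last_copy_count();

          -- Files committed by that same COPY
          select trim(filename) as filename, curtime, status
          from stl_load_commits
          where query = pg_last_copy_id();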
  • Ingestion – Best Practices (Cont’d)
    • Redshift does not currently support an upsert statement. Use a staging table to perform the upsert by joining it with the target: update, then insert (a sketch follows below)
    • Redshift does not currently enforce primary key constraints; if you COPY the same data twice, it will be duplicated
    • Increase the memory available to a COPY or VACUUM by increasing wlm_query_slot_count: set wlm_query_slot_count to 3;
    • Run the ANALYZE command whenever you have made a non-trivial number of changes to your data, to ensure your table statistics are current
    • The Amazon Redshift system table STL_LOAD_ERRORS is helpful in troubleshooting data load issues: it records the errors that occurred during specific loads. Adjust the COPY MAXERROR option as needed.
    • Check the character set: UTF-8 is supported, up to 3 bytes per character
    • View the console for errors
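    The update-then-insert pattern, sketched against hypothetical sales and sales_staging tables (the table and column names are illustrative, not from the deck):

          begin;

          -- Step 1: update target rows that already exist, joining on the staging table
          update sales
          set price = s.price, quantity = s.quantity
          from sales_staging s
          where sales.saleid = s.saleid;

          -- Step 2: insert staging rows that have no match in the target
          insert into sales
          select s.*
          from sales_staging s
          left join sales t on s.saleid = t.saleid
          where t.saleid is null;

          commit;

          -- TRUNCATE commits implicitly in Redshift, so clear staging after the commit
          truncate sales_staging;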
  • Console
  • Choose a Sort Key
    • Goal
      - Skip over data blocks to minimize I/O
    • Best practice
      - Sort based on your range or equality predicates (WHERE clause)
      - If you access recent data frequently, sort based on TIMESTAMP
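    The sort key is declared at table creation. A minimal sketch for a table queried mostly by time range (the table is illustrative):

          create table events_example (
            eventid   integer      not null,
            eventname varchar(200),
            starttime timestamp    not null
          )
          sortkey (starttime);

    A filter such as where starttime > '2013-01-01' can then skip whole blocks using the zone maps mentioned earlier.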
  • Choose a Distribution Key
    • Goal
      - Distribute data evenly across nodes
      - Minimize data movement among nodes: co-located joins and co-located aggregates
    • Best practice
      - Consider using the join key as the distribution key (JOIN clause)
      - With multiple joins, use the foreign key of the largest dimension as the distribution key
      - Consider using the GROUP BY column as the distribution key (GROUP BY clause)
    • Avoid
      - Using a key that serves as an equality filter as your distribution key
    • If tables are de-normalized and there are no aggregates, do not specify a distribution key: Redshift will use round robin
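    Distribution keys are likewise declared in the DDL. A hedged sketch that co-locates the join used in the worked example two slides below (the table definitions are illustrative):

          create table category_example (
            productid  integer not null,
            categoryid varchar(20)
          )
          distkey (productid);

          create table sales_example (
            saleid    integer not null,
            productid integer not null,   -- same distkey as category_example,
            price     decimal(8,2),       -- so the join needs no data movement
            quantity  integer,
            saledate  date
          )
          distkey (productid)
          sortkey (saledate);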
  • Distribution Key – Verify Data Skew
    Check the data distribution:

          select slice, col, num_values, minvalue, maxvalue
          from svv_diskusage
          where name = 'users' and col = 0
          order by slice, col;

           slice | col | num_values | minvalue | maxvalue
          -------+-----+------------+----------+----------
               0 |   0 |      12496 |        4 |    49987
               1 |   0 |      12498 |        1 |    49988
               2 |   0 |      12497 |        2 |    49989
               3 |   0 |      12499 |        3 |    49990

    Each slice holds roughly the same number of values, so this data is evenly distributed.
  • Example

          SELECT SUM(S.Price * S.Quantity)
          FROM SALES S
          JOIN CATEGORY C ON C.ProductId = S.ProductId
          JOIN FRANCHISE F ON F.FranchiseId = S.FranchiseId
          WHERE C.CategoryId = 'Produce'
            AND F.State = 'WA'
            AND S.Date BETWEEN '1/1/2013' AND '1/31/2013';
          -- Total produce sold in Washington in January 2013

    Key choices: Dist key (C) = ProductId; Dist key (S) = ProductId; Dist key (F) = FranchiseId; Sort key (S) = Date
  • Query Performance – Best Practices
    • Encode date and time using the TIMESTAMP data type instead of CHAR
    • Specify constraints
      - Redshift does not enforce constraints (primary key, foreign key, unique values), but the optimizer uses them
      - Loading and/or applications need to be aware
    • Specify a redundant predicate on the sort column:

          SELECT * FROM tab1, tab2
          WHERE tab1.key = tab2.key
            AND tab1.timestamp > '1/1/2013'
            AND tab2.timestamp > '1/1/2013';

    • WLM settings
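    Constraints are declared in the usual SQL way even though they are informational only; the planner uses them to pick better join strategies. A small sketch (the tables are illustrative):

          create table franchise_example (
            franchiseid integer primary key,  -- not enforced, but used by the planner
            state       char(2)
          );

          create table sales_example2 (
            saleid      integer primary key,
            franchiseid integer references franchise_example (franchiseid)
          );

    Because nothing is enforced, the load process and applications must guarantee uniqueness and referential integrity themselves, as the slide warns.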
  • Workload Manager
    • Allows you to manage and adjust query concurrency
    • WLM allows you to:
      - Increase query concurrency up to 15
      - Define user groups and query groups
      - Segregate short- and long-running queries
      - Help improve performance of individual queries
    • Be aware: query workload is distributed to every compute node
      - Increasing concurrency may not always help, due to contention for CPU, memory, and I/O
      - Total throughput may increase by letting one query complete first and making other queries wait
  • Workload Manager
    • Default: 1 queue with a concurrency of 5
    • Define up to 8 queues, with a total concurrency of 15
    • Internally, Redshift also has a superuser queue
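    Queues are matched to sessions through user groups and query groups. A minimal sketch of the session side, assuming a queue has already been configured for a query group named 'reports' (the group name and table are hypothetical):

          -- Route this session's queries to the queue for the 'reports' query group
          set query_group to 'reports';

          select count(*) from sales_example;  -- runs in that queue

          reset query_group;

          -- Temporarily claim extra slots in the current queue to give a
          -- heavyweight statement more memory (see wlm_query_slot_count above)
          set wlm_query_slot_count to 3;
          vacuum sales_example;
          set wlm_query_slot_count to 1;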
  • Summary
    • Avoid large numbers of singleton DML statements if possible
    • Use COPY for uploading large datasets
    • Choose sort and distribution keys with care
    • Encode date and time with the TIMESTAMP data type
    • Experiment with WLM settings
  • More Information
    Best Practices for Designing Tables: http://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html
    Best Practices for Data Loading: http://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
    View the Redshift Developer Guide at: http://aws.amazon.com/documentation/redshift/
  • Questions?