Deploying Your Data Warehouse on AWS

  1. Deploying Your Data Warehouse on AWS. Ian Robinson, Specialist SA, Analytics, EMEA. 10 April 2018. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. Legacy Architectural Models Lead to Dark Data. Very expensive; lock-in; proprietary; inflexible licensing. [Chart labels: Enterprise Data; Data in Warehouse; Dark Data.]
  3. AWS Big Data Portfolio (Collect, Store, Analyze): Amazon Kinesis Firehose, AWS Direct Connect, AWS Snowball, Amazon Kinesis Streams, Amazon S3, Amazon Glacier, Amazon CloudSearch, Amazon RDS, Amazon Aurora, Amazon DynamoDB, Amazon Elasticsearch, Amazon EMR, Amazon EC2, Amazon Redshift, Amazon Machine Learning, Amazon QuickSight, AWS Data Pipeline, AWS Database Migration Service, AWS Glue, Amazon Athena, Amazon Kinesis Analytics
  4. Amazon Redshift: managed, massively parallel, petabyte-scale, relational data warehouse
      • Scale from 160GB to 2PB online
      • Automatic streaming backup/restore to S3
      • Automatic failover and recovery
      • ANSI SQL interface
      • Load data from S3, DynamoDB and EMR
  5. Query Exabytes of Data: Redshift Spectrum. Run Amazon Redshift SQL queries against exabytes of data in Amazon S3. [Diagram labels: JDBC/ODBC; Amazon Redshift; Redshift Spectrum (1, 2, 3, 4 ... N); Amazon S3, exabyte-scale object storage; Data Catalog; Apache Hive Metastore.]
  6. Design for Throughput. For each query:
      • Use just enough cluster resources
      • Do the minimum amount of work
      • Spread the work equally across slices
  7. Do an Equal Amount of Work on Each Slice
  8. Choose Best Table Distribution Style (each diagram shows Node 1 with Slice 1 and Slice 2, and Node 2 with Slice 3 and Slice 4)
      • All: all data on every node
      • Key: same key to same location (key1, key2, key3, key4)
      • Even: round robin distribution
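To make the three distribution styles on this slide concrete, here is a minimal Python sketch that places rows on the two-node, four-slice layout from the diagram. It is an illustration only: the function names and the sample rows are hypothetical, and the CRC32 hash is a stand-in, since the slides do not say which hash function Redshift uses.

```python
import zlib
from itertools import cycle

SLICES = [1, 2, 3, 4]                      # layout from the slide
NODE_OF_SLICE = {1: 1, 2: 1, 3: 2, 4: 2}   # node 1: slices 1-2, node 2: slices 3-4


def distribute_even(rows):
    """EVEN: deal rows to slices round robin, ignoring their values."""
    placement = {s: [] for s in SLICES}
    for row, s in zip(rows, cycle(SLICES)):
        placement[s].append(row)
    return placement


def distribute_key(rows, key):
    """KEY: rows sharing a key value always land on the same slice."""
    placement = {s: [] for s in SLICES}
    for row in rows:
        h = zlib.crc32(str(row[key]).encode())  # stand-in hash (assumption)
        placement[SLICES[h % len(SLICES)]].append(row)
    return placement


def distribute_all(rows):
    """ALL: every node keeps a full copy of the table."""
    return {node: list(rows) for node in sorted(set(NODE_OF_SLICE.values()))}


# Hypothetical sample data: 8 rows over the four keys named on the slide.
rows = [{"id": i, "key": f"key{1 + i % 4}"} for i in range(8)]

even = distribute_even(rows)
keyed = distribute_key(rows, "key")
everywhere = distribute_all(rows)

for s in SLICES:
    print(s, len(even[s]), sorted({r["key"] for r in keyed[s]}))
```

With EVEN, each slice receives the same number of rows regardless of content. With KEY, how balanced the slices are depends entirely on how the key values hash.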
  9. Avoid Data Skew
  10. Avoid Selectively Filtering on Distribution Key. Example predicate: WHERE o_orderdate = current_date
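One way to see why a selective filter on the distribution key hurts is to count, per slice, the rows that survive the filter. The sketch below is a rough simulation under stated assumptions: four slices, a CRC32 stand-in for the real hash, and a made-up orders table with hypothetical columns `o_orderkey` and `o_orderdate` (only `o_orderdate` appears on the slide).

```python
import zlib

NUM_SLICES = 4  # assumed cluster size for the illustration


def slice_for(value):
    """Map a distribution-key value to a slice (stand-in hash, an assumption)."""
    return zlib.crc32(str(value).encode()) % NUM_SLICES


def matching_rows_per_slice(orders, dist_column, order_date):
    """Count rows passing `o_orderdate = order_date` on each slice."""
    counts = {s: 0 for s in range(NUM_SLICES)}
    for order in orders:
        if order["o_orderdate"] == order_date:
            counts[slice_for(order[dist_column])] += 1
    return counts


# Made-up data: 1000 orders spread over ten dates.
orders = [
    {"o_orderkey": k, "o_orderdate": f"2018-04-{1 + k % 10:02d}"}
    for k in range(1000)
]

by_date = matching_rows_per_slice(orders, "o_orderdate", "2018-04-10")
by_key = matching_rows_per_slice(orders, "o_orderkey", "2018-04-10")

print("distributed by o_orderdate:", by_date)
print("distributed by o_orderkey: ", by_key)
```

When the table is distributed on `o_orderdate`, every row for the filtered date sits on a single slice, so that slice does all the work while the others idle. Distributing on `o_orderkey` spreads the same rows over several slices.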
  11. Do the Minimum Amount of Work on Each Slice
  12. Columnar storage + Large data block sizes + Data compression + Zone maps + Direct-attached storage. Reduced I/O = Enhanced Performance.

      analyze compression listing;

       Table   | Column         | Encoding
      ---------+----------------+----------
       listing | listid         | delta
       listing | sellerid       | delta32k
       listing | eventid        | delta32k
       listing | dateid         | bytedict
       listing | numtickets     | bytedict
       listing | priceperticket | delta32k
       listing | totalprice     | mostly32
       listing | listtime       | raw

      Zone map example (minimum and maximum value per block):

       Block 1: 10 | 13 | 14 | 26 |… … | 100 | 245 | 324   (min 10, max 324)
       Block 2: 375 | 393 | 417… … 512 | 549 | 623   (min 375, max 623)
       Block 3: 637 | 712 | 809 … … | 834 | 921 | 959   (min 637, max 959)
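The zone map idea on this slide can be sketched in a few lines: keep the minimum and maximum value of each block, and skip any block whose range cannot contain the values a query asks for. The (min, max) pairs below are the ones shown on the slide. The function name and the query ranges are hypothetical, and this is a simplified model, not Redshift's internal implementation.

```python
# (min, max) per block, taken from the slide's example
ZONE_MAP = [(10, 324), (375, 623), (637, 959)]


def blocks_to_read(zone_map, low, high):
    """Return indexes of blocks whose [min, max] range overlaps [low, high]."""
    return [
        i
        for i, (block_min, block_max) in enumerate(zone_map)
        if block_max >= low and block_min <= high
    ]


print(blocks_to_read(ZONE_MAP, 400, 500))  # [1]: only the second block
print(blocks_to_read(ZONE_MAP, 600, 700))  # [1, 2]: range spans two blocks
print(blocks_to_read(ZONE_MAP, 330, 370))  # []: no block can hold these values
```

Every block left out of the result is a block that never has to be read from disk, which is the "Reduced I/O" the slide refers to.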
  13. Use Cluster Resources Efficiently to Complete as Quickly as Possible
  14. Workload Management (WLM). Clients (BI tools, SQL clients, analytics tools) submit queries, which are either waiting or running in a WLM queue:
      • Queries: 80% memory, 4 slots, 80/4 = 20% per slot
      • ETL: 20% memory, 2 slots, 20/2 = 10% per slot
      • Short Query Acceleration
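The per-slot arithmetic on this slide is simply each queue's memory share divided by its slot count. A small sketch, using the slide's two queues (the function name and the dictionary layout are illustrative assumptions, not a Redshift API):

```python
def memory_per_slot(queues):
    """Divide each queue's memory percentage evenly across its slots."""
    return {name: memory / slots for name, (memory, slots) in queues.items()}


# queue name -> (percent of memory, number of slots), as on the slide
queues = {"Queries": (80, 4), "ETL": (20, 2)}

per_slot = memory_per_slot(queues)
print(per_slot)  # {'Queries': 20.0, 'ETL': 10.0}
```

Adding slots to a queue raises its concurrency but shrinks the memory available to each running query, so the two settings need to be tuned together.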
  15. Demo
