Introduction to Amazon Redshift and What's Next (DAT103) | AWS re:Invent 2013

Amazon Redshift is a fast, fully-managed, petabyte-scale data warehouse service that costs less than $1,000 per terabyte per year—less than a tenth the price of most traditional data warehousing solutions. In this session, you get an overview of Amazon Redshift, including how Amazon Redshift uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. Finally, we announce new features that we've been working on over the past few months.

Published in: Technology, Business

  1. 1. Amazon Redshift Overview & What’s Next
      Rahul Pathak, Redshift PM (rapathak@amazon.com)
      Anurag Gupta, Redshift GM (awgupta@amazon.com)
      November 13, 2013
      © 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
  2. 2. Amazon Redshift: fast, simple, petabyte-scale data warehousing for less than $1,000/TB/year
  3. 3. Amazon Redshift dramatically reduces I/O
      • Data compression • Zone maps • Direct-attached storage
       ID  | Age | State | Amount
      -----+-----+-------+--------
       123 | 20  | CA    | 500
       345 | 25  | WA    | 250
       678 | 40  | FL    | 125
       957 | 37  | WA    | 375
      • With row storage you do unnecessary I/O
      • To get total amount, you have to read everything
  4. 4. Amazon Redshift dramatically reduces I/O
      • Data compression • Zone maps • Direct-attached storage
       ID  | Age | State | Amount
      -----+-----+-------+--------
       123 | 20  | CA    | 500
       345 | 25  | WA    | 250
       678 | 40  | FL    | 125
       957 | 37  | WA    | 375
      • With column storage, you only read the data you need
  5. 5. Amazon Redshift dramatically reduces I/O
      • Data compression • Zone maps • Direct-attached storage
      analyze compression listing;
       Table   | Column         | Encoding
      ---------+----------------+----------
       listing | listid         | delta
       listing | sellerid       | delta32k
       listing | eventid        | delta32k
       listing | dateid         | bytedict
       listing | numtickets     | bytedict
       listing | priceperticket | delta32k
       listing | totalprice     | mostly32
       listing | listtime       | raw
      • COPY compresses automatically on load
      • You can analyze and override
      • More performance, less cost
      Slides not intended for redistribution.
  6. 6. Amazon Redshift dramatically reduces I/O
      • Data compression • Zone maps • Direct-attached storage
      Block 1 (min 10, max 324): 10 | 13 | 14 | 26 | … | 100 | 245 | 324
      Block 2 (min 375, max 623): 375 | 393 | 417 | … | 512 | 549 | 623
      Block 3 (min 637, max 959): 637 | 712 | 809 | … | 834 | 921 | 959
      • Track the minimum and maximum value for each block
      • Skip over blocks that don’t contain relevant data
  7. 7. Amazon Redshift dramatically reduces I/O
      • Data compression • Zone maps • Direct-attached storage
      [Figure labels: DW.HS1.XL, DW.HS1.8XL]
      • > 2 GB/s scan rate
      • Optimized for data processing
      • High disk density
  8. 8. Amazon Redshift architecture
      • Leader Node
        – JDBC/ODBC SQL endpoint
        – Stores metadata
        – Coordinates query execution
      • Compute Nodes
        – Local, columnar storage
        – Execute queries in parallel
        – Load, backup, restore via Amazon S3
        – Parallel load from Amazon DynamoDB
      • Single node version available
      [Diagram labels: 10 GigE (HPC), Ingestion, Backup, Restore]
  9. 9. Amazon Redshift parallelizes and distributes everything • Load • Backup/Restore • Resize
  10. 10. Amazon Redshift parallelizes and distributes everything
      • Load • Backup/Restore • Resize
      Load:
      • Load in parallel from Amazon S3 or Amazon DynamoDB
      • Data automatically distributed and sorted according to DDL
      • Scales linearly with number of nodes
  11. 11. Amazon Redshift parallelizes and distributes everything
      • Load • Backup/Restore • Resize
      Backup/Restore:
      • Backups to Amazon S3 are automatic, continuous and incremental
      • Configurable system snapshot retention period
      • Take user snapshots on-demand
      • Streaming restores enable you to resume querying faster
  12. 12. Amazon Redshift parallelizes and distributes everything
      • Load • Backup/Restore • Resize
      Resize:
      • Resize while remaining online
      • Provision a new cluster in the background
      • Copy data in parallel from node to node
      • Only charged for source cluster
  13. 13. Amazon Redshift parallelizes and distributes everything
      • Load • Backup/Restore • Resize
      Resize:
      • Automatic SQL endpoint switchover via DNS
      • Decommission the source cluster
      • Simple operation via Console or API
  14. 14. Amazon Redshift lets you start small and grow big
      • Extra Large Node (DW.HS1.XL): 3 spindles, 2 TB, 16 GB RAM, 2 cores
        – Single Node (2 TB)
        – Cluster 2-32 Nodes (4 TB – 64 TB)
      • Eight Extra Large Node (DW.HS1.8XL): 24 spindles, 16 TB, 128 GB RAM, 16 cores, 10 GigE
        – Cluster 2-100 Nodes (32 TB – 1.6 PB)
      Note: Nodes not to scale
  15. 15. Amazon Redshift is priced to let you analyze all your data
                          | Price Per Hour for HS1.XL Single Node | Effective Hourly Price per TB | Effective Annual Price per TB
      --------------------+---------------------------------------+-------------------------------+-------------------------------
       On-Demand          | $0.850                                | $0.425                        | $3,723
       1 Year Reservation | $0.500                                | $0.250                        | $2,190
       3 Year Reservation | $0.228                                | $0.114                        | $999
      Simple Pricing:
      • Number of Nodes x Cost per Hour
      • No charge for Leader Node
      • No upfront costs
      • Pay as you go
  16. 16. Amazon Redshift has security built in
      • SSL to secure data in transit
      • Encryption to secure data at rest
        – AES-256; hardware accelerated
        – All blocks on disk and in Amazon S3 encrypted
      • No direct access to compute nodes
      • Amazon VPC support
      [Diagram labels: Customer VPC, JDBC/ODBC, Internal VPC, 10 GigE (HPC), Ingestion, Backup, Restore]
  17. 17. Amazon Redshift automatically manages data replication and hardware failures
      • Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times
      • Backups to Amazon S3 are continuous, automatic, and incremental
        – Designed for eleven nines of durability
      • Continuous monitoring and automated recovery from failures of drives and nodes
      • Able to restore snapshots to any Availability Zone within a region
  18. 18. Growing ecosystem
  19. 19. AWS Marketplace • Find software to use with Amazon Redshift • One-click deployments • Flexible pricing options http://aws.amazon.com/marketplace
  20. 20. Over 40 new features since launch on Feb 14
      • Regions – N. Virginia, Oregon, Dublin, Tokyo, Singapore, Sydney
      • Certifications – PCI, SOC 1/2/3
      • Security – Load/unload encrypted files, Resource-level IAM, Temporary credentials
      • Manageability – Snapshot sharing, backup/restore progress indicators
      • Query – Regex, Cursors, MD5, SHA1, Time zone, workload queue timeout
      • Ingestion – S3 Manifest, LZOP/LZO, JSON built-ins, UTF-8 4byte, invalid character substitution, CSV, auto datetime format detection, epoch
  21. 21. Amazon Redshift – What’s Next
  22. 22. Security, visibility and control
      • Audit logging • SNS Alerts
      [Diagram label: Redshift]
  23. 23. Visibility and control
      • Audit logging • SNS Alerts
      Audit logging:
      • System Activity (Creates, Changes, Deletes, Resizes): AWS CloudTrail
      • Database Activity (Logins, Login failures, Queries, Loads): Amazon Redshift, Amazon S3
  24. 24. Visibility and control
      • Audit logging • SNS Alerts
      SNS Alerts:
      • Categories shown: Monitoring, Security, Maintenance, Errors
      [Diagram labels: Amazon Redshift, SNS Topic]
  25. 25. Batch operations
      • Cluster Creation • Faster Resize
      [Diagram labels: Corporate Data Center, Amazon EC2, Amazon EMR, Amazon Redshift, Amazon S3]
  26. 26. Batch operations
      • Cluster Creation • Faster Resize
      [Diagram labels: Corporate Data Center, Amazon EC2, Amazon EMR, Amazon Redshift, Amazon S3]
  27. 27. Batch operations • Cluster Creation • Faster Resize 15-20 min 3 min
  28. 28. Batch operations • Cluster Creation • Faster Resize 29 hours 7 hours
  29. 29. Performance & Concurrency
  30. 30. Performance & Concurrency 692.8s 34.9s < 2%
  31. 31. Performance & Concurrency 5,951.7s 2,151.9s
  32. 32. Performance & Concurrency 15 50
  33. 33. Feature Delivery
      • Service Launch (2/14)
      • PDX (4/2)
      • Temp Credentials (4/11)
      • DUB (4/25)
      • SOC1/2/3 (5/8)
      • NRT (6/5)
      • JDBC Fetch Size (6/27)
      • Unload logs (7/5)
      • SHA1 Builtin (7/15)
      • Sharing snapshots (7/18)
      • 4 byte UTF-8 (7/18)
      • Statement Timeout (7/22)
      • Timezone, Epoch, Autoformat (7/25)
      • WLM Timeout/Wildcards (8/1)
      • Resource Level IAM (8/9)
      • CRC32 Builtin, CSV, Restore Progress (8/9)
      • PCI (8/22)
      • UTF-8 Substitution (8/29)
      • JSON, Regex, Cursors (9/10)
      • Split_part, Audit tables (10/3)
      • SIN/SYD (10/8)
      • Unload Encrypted Files HSM Support (11/11)
      • EMR/HDFS/SSH copy, Distributed Tables, Audit Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS Alerts, WLM Memory Management (11/13)
      • 6 weeks left
  34. 34. Redshift Customers at re:Invent
      • BDT 101: Big Data ‘State of the Union’ (earlier today)
      • DAT 305: Getting Maximum Performance from Amazon Redshift (Wednesday 11/13: 3pm in Murano 3303)
  35. 35. Redshift Customers at re:Invent
      • DAT 306: How Amazon.com is Leveraging Amazon Redshift (Thursday 11/14: 3pm in Murano 3303)
      • DAT 205: Amazon Redshift in Action: Enterprise, Big Data, SaaS (Friday 11/15: 9am in Lido 3006)
  36. 36. Please give us your feedback on this presentation (DAT 103). As a thank you, we will select prize winners daily for completed surveys!
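
Slides 3, 4 and 6 describe two ways to cut I/O: reading only the column a query needs, and using per-block minimum and maximum values (zone maps) to skip blocks. The Python sketch below illustrates both ideas with the sample values from those slides. It is an illustration only, not Amazon Redshift's implementation: every function name is hypothetical, values are counted instead of disk reads, and the elided block values from slide 6 are simply left out.

```python
# Illustrative sketch only: not Amazon Redshift's implementation.
# All function names here are hypothetical.

# Sample table from slides 3-4: (ID, Age, State, Amount)
ROWS = [
    (123, 20, "CA", 500),
    (345, 25, "WA", 250),
    (678, 40, "FL", 125),
    (957, 37, "WA", 375),
]


def total_amount_row_store(rows):
    """Sum Amount when data is stored row by row; every field is read."""
    values_read = 0
    total = 0
    for row in rows:
        values_read += len(row)  # the whole row is read to reach Amount
        total += row[3]
    return total, values_read


def total_amount_column_store(rows):
    """Sum Amount when each column is stored separately; one column is read."""
    columns = list(zip(*rows))  # pivot rows into per-column tuples
    amount = columns[3]
    return sum(amount), len(amount)


# Visible block values from slide 6 (elided values are left out).
BLOCKS = [
    [10, 13, 14, 26, 100, 245, 324],
    [375, 393, 417, 512, 549, 623],
    [637, 712, 809, 834, 921, 959],
]


def build_zone_map(blocks):
    """Record the (min, max) pair of each block."""
    return [(min(block), max(block)) for block in blocks]


def scan_range(blocks, zone_map, low, high):
    """Return values in [low, high], reading only blocks that may hold them."""
    matches = []
    blocks_read = 0
    for block, (block_min, block_max) in zip(blocks, zone_map):
        if block_max < low or block_min > high:
            continue  # the zone map shows no value in this block can match
        blocks_read += 1
        matches.extend(v for v in block if low <= v <= high)
    return matches, blocks_read


if __name__ == "__main__":
    print(total_amount_row_store(ROWS))
    print(total_amount_column_store(ROWS))
    print(scan_range(BLOCKS, build_zone_map(BLOCKS), 400, 600))
```

With this sample data, both totals are 1250, but the row-oriented version touches 16 values and the column-oriented version touches 4. The range scan for 400 to 600 reads only the middle block, because the zone map rules out the other two.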

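The effective per-TB prices on slide 15 follow from the hourly node price and the 2 TB of storage on a DW.HS1.XL node (slide 14). The short Python sketch below reproduces that arithmetic. It assumes a 365-day year (8,760 hours), which the slide does not state but which matches its figures; the function and constant names are hypothetical.

```python
# Reproduces the slide 15 arithmetic; names are hypothetical.
HOURS_PER_YEAR = 24 * 365  # assumption: 8,760 hours, which matches the slide's figures
NODE_STORAGE_TB = 2        # DW.HS1.XL storage per node, from slide 14

# Hourly price for one HS1.XL node, from slide 15.
HOURLY_NODE_PRICE = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Year Reservation": 0.228,
}


def effective_prices(hourly_node_price, storage_tb=NODE_STORAGE_TB):
    """Return (hourly price per TB, annual price per TB) for one node."""
    per_tb_hour = hourly_node_price / storage_tb
    per_tb_year = per_tb_hour * HOURS_PER_YEAR
    return per_tb_hour, per_tb_year


if __name__ == "__main__":
    for plan, price in HOURLY_NODE_PRICE.items():
        per_tb_hour, per_tb_year = effective_prices(price)
        print(f"{plan}: ${per_tb_hour:.3f}/TB/hour, ${per_tb_year:,.0f}/TB/year")
```

For example, the 3 Year Reservation works out to $0.228 / 2 TB = $0.114 per TB per hour, and $0.114 x 8,760 = $998.64, which the slide rounds to $999.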