(DAT201) Introduction to Amazon Redshift


Learn how Amazon Redshift, our fully managed, petabyte-scale data warehouse, can help you quickly and cost-effectively analyze all your data using your existing business intelligence tools. Get an introduction to how Amazon Redshift uses massively parallel processing and scale-out architecture to ensure compute resources grow with your dataset size, and columnar, direct-attached storage to dramatically reduce I/O time. Learn how top online retailer RetailMeNot moved their largest Vertica cluster on Amazon EC2 to Amazon Redshift. See how they gain insights from clickstream, location, merchant, marketing, and operational data across desktop and mobile properties.

  1. 1. DAT201: Introduction to Amazon Redshift. Pavan Pothukuchi, Amazon Redshift; Nam Nguyen, RetailMeNot. October 2015. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. 2. What to expect from the session: Amazon Redshift (what and why); benefits; use cases; Amazon Redshift at RetailMeNot; Q&A
  3. 3. AWS big data portfolio (Collect, Store, Analyze): Import/Export; Direct Connect; Amazon Kinesis; Amazon Glacier; S3; DynamoDB; Amazon Aurora; Data Pipeline; CloudSearch; EMR; EC2; Amazon Redshift; Machine Learning
  4. 4. Amazon Redshift: a lot faster, a lot simpler, a lot cheaper. Relational data warehouse; massively parallel, petabyte scale; fully managed; HDD and SSD platforms; $1,000/TB/year, starts at $0.25/hour
  5. 5. The legacy view of data warehousing ... Global 2,000 companies; sell to central IT; multi-year commitment; multi-year deployments; multi-million dollar deals
  6. 6. … Leads to dark data. This is a narrow view; small companies also have big data (mobile, social, gaming, adtech, IoT); long cycles, high costs, and administrative complexity all stifle innovation. [Chart: Enterprise Data vs. Data in Warehouse]
  7. 7. The Amazon Redshift view of data warehousing. Enterprise: 10x cheaper; easy to provision; higher DBA productivity. Big Data: 10x faster; no programming; easily leverage BI tools, Hadoop, Machine Learning, Streaming. SaaS: analysis in-line with process flows; pay as you go, grow as you need; managed availability & DR
  8. 8. Selected Amazon Redshift customers
  9. 9. Amazon Redshift architecture. Leader node: simple SQL end point; stores metadata; optimizes query plan; coordinates query execution. Compute nodes: local columnar storage; parallel/distributed execution of all queries, loads, backups, restores, resizes. Start at just $0.25/hour, grow to 2 PB (compressed). DC1: SSD; scale from 160 GB to 326 TB. DS2: HDD; scale from 2 TB to 2 PB. Diagram labels: JDBC/ODBC; 10 GigE (HPC); Ingestion; Backup; Restore
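
The division of labor on the architecture slide (a leader node that coordinates, compute nodes that execute in parallel) can be sketched in a few lines of Python. This is an illustrative model only, not how Redshift is implemented: the function names `compute_node_count` and `leader_count` and the sample rows are hypothetical, and threads merely stand in for separate compute nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample data: each inner list stands for the rows held by one compute node.
NODE_DATA = [
    ["2013-06-09", "2013-06-10", "2013-06-09"],
    ["2013-06-11", "2013-06-09"],
    ["2013-06-12", "2013-06-13", "2013-06-14"],
]


def compute_node_count(rows, wanted_date):
    """Work done locally on one compute node: count the matching rows it stores."""
    return sum(1 for d in rows if d == wanted_date)


def leader_count(node_data, wanted_date):
    """Leader role: send the same step to every node, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=len(node_data)) as pool:
        partials = list(
            pool.map(lambda rows: compute_node_count(rows, wanted_date), node_data)
        )
    return sum(partials)


total = leader_count(NODE_DATA, "2013-06-09")
print(total)
```

The point of the sketch is that each node only touches its own rows, so adding nodes adds scan capacity, while the leader's merge step stays small.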
  10. 10. Benefit #1: Amazon Redshift is fast. Dramatically less I/O: column storage; data compression; zone maps; direct-attached storage; large data block sizes. Example: analyze compression listing;

           Table   | Column         | Encoding
          ---------+----------------+----------
           listing | listid         | delta
           listing | sellerid       | delta32k
           listing | eventid        | delta32k
           listing | dateid         | bytedict
           listing | numtickets     | bytedict
           listing | priceperticket | delta32k
           listing | totalprice     | mostly32
           listing | listtime       | raw

      Zone map illustration: block 1 holds 10, 13, 14, 26, … 100, 245, 324 (min 10, max 324); block 2 holds 375, 393, 417, … 512, 549, 623 (min 375, max 623); block 3 holds 637, 712, 809, … 834, 921, 959 (min 637, max 959)
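
To make the "delta" encoding in the output above more concrete, here is a minimal Python sketch of the general idea: store the first value, then only the difference between each value and its predecessor. It is an assumption-laden illustration, not Redshift's on-disk format; the helper names `delta_encode` and `delta_decode` are hypothetical, and the sample values are the first four numbers shown on the slide.

```python
def delta_encode(values):
    """Keep the first value, then store each value as the difference from its predecessor."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas


def delta_decode(deltas):
    """Rebuild the original values by accumulating the stored differences."""
    out = []
    running = 0
    for i, d in enumerate(deltas):
        running = d if i == 0 else running + d
        out.append(running)
    return out


column = [10, 13, 14, 26]  # first values of the slide's example column
encoded = delta_encode(column)
print(encoded)
print(delta_decode(encoded) == column)
```

Small differences need fewer bytes than the full values, which is why this style of encoding suits slowly increasing columns such as sequential IDs.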
  11. 11. Benefit #1: Amazon Redshift is fast. Sort keys and zone maps. Query: SELECT COUNT(*) FROM LOGS WHERE DATE = '09-JUNE-2013'

           Block | Unsorted table (MIN to MAX)  | Sorted by date (MIN to MAX)
          -------+------------------------------+------------------------------
           1     | 01-JUNE-2013 to 20-JUNE-2013 | 01-JUNE-2013 to 06-JUNE-2013
           2     | 08-JUNE-2013 to 30-JUNE-2013 | 07-JUNE-2013 to 12-JUNE-2013
           3     | 12-JUNE-2013 to 20-JUNE-2013 | 13-JUNE-2013 to 18-JUNE-2013
           4     | 02-JUNE-2013 to 25-JUNE-2013 | 19-JUNE-2013 to 24-JUNE-2013
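
The effect of sorting on zone maps can be checked with a small Python sketch that uses the MIN/MAX ranges from this slide. A block has to be read only if the wanted date falls inside its range. The function name `blocks_to_scan` and the helper `june` are hypothetical, and the model ignores everything else a real query engine does.

```python
from datetime import date


def june(day):
    """Convenience helper for the June 2013 dates used on the slide."""
    return date(2013, 6, day)


def blocks_to_scan(zone_map, wanted):
    """Return indexes of blocks whose [min, max] range could contain `wanted`."""
    return [i for i, (lo, hi) in enumerate(zone_map) if lo <= wanted <= hi]


# (MIN, MAX) per block, copied from the slide.
UNSORTED = [(june(1), june(20)), (june(8), june(30)), (june(12), june(20)), (june(2), june(25))]
SORTED_BY_DATE = [(june(1), june(6)), (june(7), june(12)), (june(13), june(18)), (june(19), june(24))]

target = june(9)  # WHERE DATE = '09-JUNE-2013'
print(blocks_to_scan(UNSORTED, target))
print(blocks_to_scan(SORTED_BY_DATE, target))
```

With the unsorted layout three of the four blocks overlap the target date; with the sorted layout only one does, so the other three can be skipped without being read.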
  12. 12. Benefit #1: Amazon Redshift is fast. Parallel and distributed: query; load; export; backup; restore; resize
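  13. 13. Benefit #1: Amazon Redshift is fast. Distribution keys. Source table:

           ID | Name
          ----+---------------
            1 | John Smith
            2 | Jane Jones
            3 | Peter Black
            4 | Pat Partridge
            5 | Sarah Cyan
            6 | Brian Snail

      Rows as distributed: group 1 holds IDs 1 and 4 (John Smith, Pat Partridge); group 2 holds IDs 2 and 5 (Jane Jones, Sarah Cyan); group 3 holds IDs 3 and 6 (Peter Black, Brian Snail)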
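
A rough Python sketch of key-based distribution, using the six rows from this slide. It places each row by taking its ID modulo the number of groups, which happens to reproduce the layout shown (1 and 4, 2 and 5, 3 and 6). That placement rule is an assumption chosen for illustration; a real system would typically hash the distribution key, and the `distribute` function is hypothetical.

```python
ROWS = [
    (1, "John Smith"),
    (2, "Jane Jones"),
    (3, "Peter Black"),
    (4, "Pat Partridge"),
    (5, "Sarah Cyan"),
    (6, "Brian Snail"),
]


def distribute(rows, group_count):
    """Assign each row to a group based on its key (first column).

    Assumption: simple modulo placement stands in for a real hash function.
    """
    groups = [[] for _ in range(group_count)]
    for row in rows:
        key = row[0]
        groups[(key - 1) % group_count].append(row)
    return groups


for number, group in enumerate(distribute(ROWS, 3), start=1):
    print(number, [row_id for row_id, _name in group])
```

Because placement depends only on the key, rows from two tables that share a key value land in the same group, which is what lets joins on that key run without moving data between nodes.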
  14. 14. Benefit #1: Amazon Redshift is fast. H/W optimized for I/O intensive workloads, 4 GB/sec/node; enhanced networking, over 1M packets/sec/node; choice of storage type, instance size; regular cadence of auto-patched improvements. Example: our new Dense Storage (HDD) instance type improved memory 2x, compute 2x, disk throughput 1.5x, at the same cost as our prior generation!
  15. 15. Benefit #2: Amazon Redshift is inexpensive

           DS2 (HDD)          | Price per hour, DW1.XL single node | Effective annual price per TB (compressed)
          --------------------+------------------------------------+--------------------------------------------
           On-Demand          | $0.850                             | $3,725
           1 Year Reservation | $0.500                             | $2,190
           3 Year Reservation | $0.228                             | $999

           DC1 (SSD)          | Price per hour, DW2.L single node  | Effective annual price per TB (compressed)
          --------------------+------------------------------------+--------------------------------------------
           On-Demand          | $0.250                             | $13,690
           1 Year Reservation | $0.161                             | $8,795
           3 Year Reservation | $0.100                             | $5,500

      Pricing is simple: number of nodes x price/hour; no charge for leader node; no up front costs; pay as you go
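
The "effective annual price per TB" column can be approximated from the hourly price as price per hour x hours per year / TB per node. The Python sketch below applies that formula to the DS2 rows, assuming 2 TB per node (the figure the deck gives for the smallest DS2 configuration). The function names are hypothetical, and the slide's own numbers appear to be rounded, so small differences are expected (the on-demand row works out to about $3,723 against the slide's $3,725).

```python
HOURS_PER_YEAR = 24 * 365  # 8,760


def annual_price_per_tb(price_per_hour, tb_per_node):
    """Hourly node price converted to a yearly price per TB of node storage."""
    return price_per_hour * HOURS_PER_YEAR / tb_per_node


def cluster_cost_per_hour(node_count, price_per_hour):
    """'Number of nodes x price/hour', as stated on the slide (leader node is free)."""
    return node_count * price_per_hour


DS2_TB_PER_NODE = 2.0  # assumption: 2 TB per DS2 node, per the architecture slide

DS2_PRICES = [
    ("On-Demand", 0.850),
    ("1 Year Reservation", 0.500),
    ("3 Year Reservation", 0.228),
]

for label, price in DS2_PRICES:
    print(label, round(annual_price_per_tb(price, DS2_TB_PER_NODE)))
```

The same two functions can be reused for a quick what-if, for example `cluster_cost_per_hour(4, 0.850)` for a four-node on-demand cluster.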
  16. 16. Benefit #3: Amazon Redshift is fully managed. Continuous/incremental backups: multiple copies within cluster; continuous and incremental backups to S3; continuous and incremental backups across regions; streaming restore. Diagram labels: Amazon S3 (Region 1); Amazon S3 (Region 2)
  17. 17. Benefit #3: Amazon Redshift is fully managed. Fault tolerance: disk failures; node failures; network failures; Availability Zone/Region level disasters. Diagram labels: Amazon S3 (Region 1); Amazon S3 (Region 2)
  18. 18. Benefit #4: Security is built-in. Load encrypted from S3; SSL to secure data in transit; ECDHE perfect forward secrecy; Amazon VPC for network isolation; encryption to secure data at rest; all blocks on disks & in Amazon S3 encrypted; block key, cluster key, master key (AES-256); on-premises HSM & AWS CloudHSM support; audit logging and AWS CloudTrail integration; SOC 1/2/3, PCI-DSS, FedRAMP, BAA. Diagram labels: JDBC/ODBC; Customer VPC; Internal VPC; 10 GigE (HPC); Ingestion; Backup; Restore
  19. 19. Benefit #5: We innovate quickly. Well over 100 new features added since launch; release every two weeks; automatic patching. Timeline: Service Launch (2/14); PDX (4/2); Temp Credentials (4/11); DUB (4/25); SOC1/2/3 (5/8); Unload Encrypted Files NRT (6/5); JDBC Fetch Size (6/27); Unload logs (7/5); SHA1 Builtin (7/15); 4 byte UTF-8 (7/18); Sharing snapshots (7/18); Statement Timeout (7/22); Timezone, Epoch, Autoformat (7/25); WLM Timeout/Wildcards (8/1); CRC32 Builtin, CSV, Restore Progress (8/9); Resource Level IAM (8/9); PCI (8/22); UTF-8 Substitution (8/29); JSON, Regex, Cursors (9/10); Split_part, Audit tables (10/3); SIN/SYD (10/8); HSM Support (11/11); Kinesis EMR/HDFS/SSH copy, Distributed Tables, Audit Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS Alerts, Cross Region Backup (11/13); Distributed Tables, Single Node Cursor Support, Maximum Connections to 500 (12/13); EIP Support for VPC Clusters (12/28); New query monitoring system tables and diststyle all (1/13); Redshift on DW2 (SSD) Nodes (1/23); Compression for COPY from SSH, Fetch size support for single node clusters, new system tables with commit stats, row_number(), strtol() and query termination (2/13); Resize progress indicator & Cluster Version (3/21); Regex_Substr, COPY from JSON (3/25); 50 slots, COPY from EMR, ECDHE ciphers (4/22); 3 new regex features, Unload to single file, FedRAMP (5/6); Rename Cluster (6/2); Copy from multiple regions, percentile_cont, percentile_disc (6/30); Free Trial (7/1); pg_last_unload_count (9/15); AES-128 S3 encryption (9/29); UTF-16 support (9/29)
  20. 20. Benefit #6: Amazon Redshift is powerful. Approximate functions; user defined functions; machine learning; data science. Diagram label: Amazon ML
  21. 21. Benefit #7: Amazon Redshift has a large ecosystem. Data Integration; Systems Integrators; Business Intelligence
  22. 22. Benefit #8: Service oriented architecture. Amazon Redshift connected to: DynamoDB; EMR; S3; EC2/SSH; RDS/Aurora; Amazon Kinesis; Machine Learning; Data Pipeline; CloudSearch; Mobile Analytics
  23. 23. Use cases
  24. 24. Analyzing Twitter Firehose
  25. 25. Analyzing Twitter Firehose. Amazon Redshift: starts at $0.25/hour; EC2: starts at $0.02/hour; S3: $0.030/GB-month; Amazon Glacier: $0.010/GB-month; Amazon Kinesis: $0.015/hour per shard (1 MB/s in, 2 MB/s out), $0.028/million PUTs
  26. 26. 500MM tweets/day = ~5,800 tweets/sec; at 2 KB/tweet that is ~12 MB/sec (~1 TB/day). Amazon Kinesis pricing: $0.015/hour per shard, $0.028/million PUTs. Amazon Kinesis cost is $0.765/hour; Amazon Redshift cost is $0.850/hour (for a 2 TB node); S3 cost is $1.28/hour (no compression). Total: $2.895/hour. Data warehouses can be inexpensive and powerful
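
The arithmetic on this slide can be reproduced with a short Python sketch. It uses the slide's rounded figures (about 5,800 tweets/sec, about 12 MB/sec, about 1 TB/day) and two assumptions that are not stated on the slide: that 12 shards are needed because each shard ingests 1 MB/s, and that the S3 figure is one month of storage for each hour's worth of ingested data, with 1 TB taken as 1,024 GB. Under those assumptions the totals match the slide.

```python
TWEETS_PER_DAY = 500_000_000
SECONDS_PER_DAY = 86_400

# Exact rate is about 5,787 tweets/sec; the slide rounds this to ~5,800.
exact_tweets_per_sec = TWEETS_PER_DAY / SECONDS_PER_DAY

# Rounded figures from the slide.
TWEETS_PER_SEC = 5_800
MB_PER_SEC = 12
GB_PER_DAY = 1024  # assumption: "~1 TB/day" taken as 1,024 GB

# Prices quoted in the deck.
KINESIS_SHARD_PER_HOUR = 0.015
KINESIS_PER_MILLION_PUTS = 0.028
REDSHIFT_NODE_PER_HOUR = 0.850
S3_PER_GB_MONTH = 0.030

# Assumption: one shard per 1 MB/s of ingest, one PUT per tweet.
shards = MB_PER_SEC
puts_per_hour_millions = TWEETS_PER_SEC * 3600 / 1_000_000

kinesis_per_hour = shards * KINESIS_SHARD_PER_HOUR + puts_per_hour_millions * KINESIS_PER_MILLION_PUTS
# Assumption: a month of S3 storage for the data ingested in one hour.
s3_per_hour = GB_PER_DAY / 24 * S3_PER_GB_MONTH
total_per_hour = kinesis_per_hour + REDSHIFT_NODE_PER_HOUR + s3_per_hour

print(round(exact_tweets_per_sec))
print(round(kinesis_per_hour, 3), round(s3_per_hour, 2), round(total_per_hour, 3))
```

Changing `TWEETS_PER_SEC` or the per-tweet size is enough to re-run the estimate for a different stream volume.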
  27. 27. Data warehouses can be inexpensive and powerful. Use only the services you need; scale only the services you need; pay for what you use; ~40% discount with 1 year commitment; ~70% discount with 3 year commitment
  28. 28. – Weblog analysis. Web log analysis for 1 PB+ workload, 2 TB/day, growing 67% YoY; largest table: 400 TB; want to understand customer behavior. Solution. Legacy DW: query across 1 week/hr. Hadoop: query across 1 month/hr.
  29. 29. Data warehouses can be fast and simple. Query 15 months of data (1 PB) in 14 minutes; load 5B rows in 10 minutes; 21B rows joined with 10B rows: 3 days (Hive) to 2 hours; load pipeline: 90 hours (Oracle) to 8 hours; 64 clusters; 800 total nodes; 13 PB provisioned storage; 2 DBAs
  30. 30. NTT Docomo – Mobile usage analysis. Petabytes of data generated by many cell phone towers; hard to scale, expensive; needed a secure scalable system that can work with on premises. Diagram labels: Data Source; ET; Direct Connect; Client; Forwarder; Loader; State Management; Sandbox; Redshift; S3
  31. 31. The cloud can be made more secure than on premises. High speed redundant direct connect lines; load billions of rows in minutes; all data in private VPC; all data encrypted with private on-premises hardware keys; encryption of data, transport, backups, partial spills; audit of all SQL actions; audit of all configuration changes
  32. 32. Sushiro – Real-time streaming from IoT & analysis
  33. 33. Sushiro – Real-time streaming & analysis. Real-time data ingested by Amazon Kinesis is analyzed in Amazon Redshift; 380 stores stream live data from sushi plates; inventory information is combined with consumption information in near real time; forecast demand by store, minimize food waste, and improve efficiencies
  34. 34. Data warehouses can support real-time data. Big data does not mean batch: can be streamed in; can be processed in near real time; can be used to respond quickly to requests. You can mix and match: on premises and cloud; custom development and managed services; infrastructure with managed scaling, security
  35. 35. In sum… Amazon Redshift: Spend time with your data, not your database
  36. 36. Europe: 67.3M; Greater China: 27.5M; Middle East & Africa: 81.7M; Asia-Pacific: 81.7M; Latin America: 43.4M
  37. 37. Our Data
  38. 38. Our data: 100s of TBs in data warehouses; >100% year-over-year data growth. [Chart: data growth, 2012 to 2015]
  39. 39. The legacy. Diagram labels: Vertica; Reporting; Content Presentation; Source DBs; 3rd Party Data; Log Data; A/B Testing
  40. 40. Pain points: fire fights; query traffic jams; processing windows; scaling
  41. 41. Adopting cloud strategies. Diagram labels: Amazon Redshift Instances; Reporting; Content Presentation; A/B Testing; Source DBs; 3rd Party Data; Log Data
  42. 42. On-demand breakdown: only when needed; ephemeral processing; up during business hours; always up
  43. 43. Benefits to the data team: processing windows; fire fights; scaling number of clusters; scaling the size of clusters
  44. 44. DOH! Reserved instances; automated vs. manual backups; automated cluster shut down; sort/distribution keys for joins
  45. 45. Benefits to the business: 50% reduced time on administration; $0 licensing; 50% cost reduction for instances; 100% growth of internal customers
  46. 46. Q&A
  47. 47. Thank you!
