Uses and Best Practices for Amazon Redshift
In this session, you get an overview of Amazon Redshift, a fast, fully-managed, petabyte-scale data warehouse service. We'll cover how Amazon Redshift uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. We'll also discuss new features, architecture best practices, and share how customers are using Amazon Redshift for their Big Data workloads.

Uses and Best Practices for Amazon Redshift Presentation Transcript

  • 1. © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. Uses & Best Practices for Amazon Redshift Rahul Pathak, AWS (rahulpathak@) Daniel Mintz, Upworthy (danielmintz@) July 10, 2014
  • 2. Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year Amazon Redshift
  • 3. [Diagram] The AWS data pipeline: Collect (Amazon Kinesis, AWS Direct Connect); Store (Amazon S3, Amazon DynamoDB, Amazon Glacier); Analyze (Amazon Redshift, Amazon EMR, Amazon EC2)
  • 4. Amazon Redshift: petabyte scale, massively parallel, relational data warehouse; fully managed, zero admin. A lot faster, a lot cheaper, a whole lot simpler.
  • 5. Common Customer Use Cases • Traditional Enterprise DW: reduce costs by extending DW rather than adding HW; migrate completely from existing DW systems; respond faster to business • Companies with Big Data: improve performance by an order of magnitude; make more data available for analysis; access business data via standard reporting tools • SaaS Companies: add analytic functionality to applications; scale DW capacity as demand grows; reduce HW & SW costs by an order of magnitude
  • 6. Amazon Redshift Customers
  • 7. Growing Ecosystem
  • 8. AWS Marketplace • Find software to use with Amazon Redshift • One-click deployments • Flexible pricing options http://aws.amazon.com/marketplace/redshift
  • 9. Data Loading Options • Parallel upload to Amazon S3 • AWS Direct Connect • AWS Import/Export • Amazon Kinesis • Systems integrators [partner logos: data integration and systems integrator partners]
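    The most common path is a parallel COPY from Amazon S3. A minimal sketch, assuming a hypothetical events table, bucket path, and placeholder credentials (none of these appear in the deck):

      COPY events
      FROM 's3://my-bucket/events/'
      CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
      GZIP
      DELIMITER '|';

    Redshift splits the input files across slices, so several smaller files generally load faster than a single large one.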
  • 10. Amazon Redshift Architecture • Leader Node – SQL endpoint – Stores metadata – Coordinates query execution • Compute Nodes – Local, columnar storage – Execute queries in parallel – Load, backup, restore via Amazon S3; load from Amazon DynamoDB or SSH • Two hardware platforms – Optimized for data processing – DW1: HDD; scale from 2TB to 1.6PB – DW2: SSD; scale from 160GB to 256TB [diagram: JDBC/ODBC clients connect to the leader node; compute nodes interconnect over 10 GigE (HPC); ingestion, backup, and restore flow through Amazon S3]
  • 11. Amazon Redshift Node Types • Optimized for I/O intensive workloads • High disk density • On demand at $0.85/hour • As low as $1,000/TB/Year • Scale from 2TB to 1.6PB DW1.XL: 16 GB RAM, 2 Cores 3 Spindles, 2 TB compressed storage DW1.8XL: 128 GB RAM, 16 Cores, 24 Spindles 16 TB compressed, 2 GB/sec scan rate • High performance at smaller storage size • High compute and memory density • On demand at $0.25/hour • As low as $5,500/TB/Year • Scale from 160GB to 256TB DW2.L *New*: 16 GB RAM, 2 Cores, 160 GB compressed SSD storage DW2.8XL *New*: 256 GB RAM, 32 Cores, 2.56 TB of compressed SSD storage
  • 12. Amazon Redshift dramatically reduces I/O • Column storage • Data compression • Zone maps • Direct-attached storage • With row storage you do unnecessary I/O • To get total amount, you have to read everything

        ID  | Age | State | Amount
        123 | 20  | CA    | 500
        345 | 25  | WA    | 250
        678 | 40  | FL    | 125
        957 | 37  | WA    | 375
  • 13. Amazon Redshift dramatically reduces I/O • Column storage • Data compression • Zone maps • Direct-attached storage • With column storage, you only read the data you need

        ID  | Age | State | Amount
        123 | 20  | CA    | 500
        345 | 25  | WA    | 250
        678 | 40  | FL    | 125
        957 | 37  | WA    | 375
  • 14. Amazon Redshift dramatically reduces I/O • Column storage • Data compression • Zone maps • Direct-attached storage • COPY compresses automatically • You can analyze and override • More performance, less cost

        analyze compression listing;

        Table   | Column         | Encoding
        --------+----------------+---------
        listing | listid         | delta
        listing | sellerid       | delta32k
        listing | eventid        | delta32k
        listing | dateid         | bytedict
        listing | numtickets     | bytedict
        listing | priceperticket | delta32k
        listing | totalprice     | mostly32
        listing | listtime       | raw
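    To override the automatic choices, encodings can be pinned per column in the DDL. A sketch based on the encodings reported above; the column types are assumptions, since the deck never shows the listing table definition:

      CREATE TABLE listing (
        listid         INTEGER      ENCODE delta,
        sellerid       INTEGER      ENCODE delta32k,
        eventid        INTEGER      ENCODE delta32k,
        dateid         SMALLINT     ENCODE bytedict,
        numtickets     SMALLINT     ENCODE bytedict,
        priceperticket DECIMAL(8,2) ENCODE delta32k,
        totalprice     DECIMAL(8,2) ENCODE mostly32,
        listtime       TIMESTAMP    ENCODE raw
      );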
  • 15. Amazon Redshift dramatically reduces I/O • Column storage • Data compression • Zone maps • Direct-attached storage • Track the minimum and maximum value for each block • Skip over blocks that don’t contain relevant data [diagram: sorted blocks of values with the per-block minimum and maximum recorded in the zone map]
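    Zone maps pay off when a query filters on the column the table is sorted by. A hypothetical example (the events table and event_time sort key are assumptions, not from the deck):

      -- If events is sorted on event_time, blocks whose min/max fall
      -- entirely outside the range are never read from disk:
      SELECT COUNT(*)
      FROM events
      WHERE event_time BETWEEN '2014-07-01' AND '2014-07-07';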
  • 16. Amazon Redshift dramatically reduces I/O • Column storage • Data compression • Zone maps • Direct-attached storage • Use local storage for performance • Maximize scan rates • Automatic replication and continuous backup • HDD & SSD platforms
  • 17. Amazon Redshift parallelizes and distributes everything • Query • Load • Backup/Restore • Resize
  • 18. • Load in parallel from Amazon S3, Amazon EMR, Amazon DynamoDB or any SSH connection • Data automatically distributed and sorted according to DDL • Scales linearly with number of nodes Amazon Redshift parallelizes and distributes everything • Query • Load • Backup/Restore • Resize
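    The DDL that drives that distribution and sort order might look like the following sketch; the table, columns, and manifest path are illustrative, not from the deck:

      CREATE TABLE page_views (
        view_id   BIGINT,
        user_id   BIGINT    DISTKEY,
        viewed_at TIMESTAMP SORTKEY
      );

      -- Each slice pulls its share of the listed files in parallel:
      COPY page_views
      FROM 's3://my-bucket/page_views.manifest'
      CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
      MANIFEST
      GZIP;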
  • 19. • Backups to Amazon S3 are automatic, continuous and incremental • Configurable system snapshot retention period. Take user snapshots on-demand • Cross region backups for disaster recovery • Streaming restores enable you to resume querying faster Amazon Redshift parallelizes and distributes everything • Query • Load • Backup/Restore • Resize
  • 20. • Resize while remaining online • Provision a new cluster in the background • Copy data in parallel from node to node • Only charged for source cluster Amazon Redshift parallelizes and distributes everything • Query • Load • Backup/Restore • Resize
  • 21. • Automatic SQL endpoint switchover via DNS • Decommission the source cluster • Simple operation via Console or API Amazon Redshift parallelizes and distributes everything • Query • Load • Backup/Restore • Resize
  • 22. Amazon Redshift is priced to let you analyze all your data • Number of nodes x cost per hour • No charge for leader node • No upfront costs • Pay as you go

        DW1 (HDD)            Price per Hour (DW1.XL single node)   Effective Annual Price per TB
        On-Demand            $0.850                                $3,723
        1-Year Reservation   $0.500                                $2,190
        3-Year Reservation   $0.228                                $999

        DW2 (SSD)            Price per Hour (DW2.L single node)    Effective Annual Price per TB
        On-Demand            $0.250                                $13,688
        1-Year Reservation   $0.161                                $8,794
        3-Year Reservation   $0.100                                $5,498
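    A quick sanity check of the headline figure, assuming 8,760 hours per year and the 2 TB of compressed storage on a DW1.XL node: $0.228/hour × 8,760 hours ≈ $1,997 per node per year, and $1,997 ÷ 2 TB ≈ $999/TB/year, matching the 3-year reservation row above.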
  • 23. Amazon Redshift is easy to use • Provision in minutes • Monitor query performance • Point and click resize • Built in security • Automatic backups
  • 24. Amazon Redshift has security built-in • SSL to secure data in transit • Encryption to secure data at rest – AES-256; hardware accelerated – All blocks on disks and in Amazon S3 encrypted – HSM support so you control keys • Audit logging & AWS CloudTrail integration • Amazon VPC support • SOC 1/2/3, PCI DSS Level 1, FedRAMP and more [diagram: JDBC/ODBC access from the customer VPC; cluster nodes in an internal VPC on 10 GigE (HPC); ingestion, backup, and restore]
  • 25. Amazon Redshift continuously backs up your data and recovers from failures • Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times • Backups to Amazon S3 are continuous, automatic, and incremental – Designed for eleven nines of durability • Continuous monitoring and automated recovery from failures of drives and nodes • Able to restore snapshots to any Availability Zone within a region • Easily enable backups to a second region for disaster recovery
  • 26. 60+ new features since launch • Regions – N. Virginia, Oregon, Dublin, Tokyo, Singapore, Sydney • Certifications – SOC 1/2/3, FedRAMP, PCI DSS Level 1, others • Security – Load/unload encrypted files, Resource-level IAM, Temporary credentials, HSM, ECDHE for perfect forward secrecy • Manageability – Snapshot sharing, backup/restore/resize progress indicators, Cross-region backups • Query – Regex, Cursors, MD5, SHA1, Time zone, workload queue timeout, HLL, Concurrency to 50 • Ingestion – S3 Manifest, LZOP/LZO, JSON built-ins, 4-byte UTF-8, invalid character substitution, CSV, auto datetime format detection, epoch, Ingest from SSH, JSON, EMR
  • 27. Amazon Redshift Feature Delivery: Service Launch (2/14); PDX (4/2); Temp Credentials (4/11); DUB (4/25); SOC 1/2/3 (5/8); Unload Encrypted Files; NRT (6/5); JDBC Fetch Size (6/27); Unload logs (7/5); SHA1 builtin (7/15); 4-byte UTF-8 (7/18); Sharing snapshots (7/18); Statement Timeout (7/22); Timezone, Epoch, Autoformat (7/25); WLM Timeout/Wildcards (8/1); CRC32 builtin, CSV, Restore Progress (8/9); Resource-level IAM (8/9); PCI (8/22); UTF-8 Substitution (8/29); JSON, Regex, Cursors (9/10); Split_part, Audit tables (10/3); SIN/SYD (10/8); HSM Support (11/11); Kinesis, EMR/HDFS/SSH copy, Distributed Tables, Audit Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS Alerts, Cross-Region Backup (11/13); Distributed Tables, Single-Node Cursor Support, Maximum Connections to 500 (12/13); EIP Support for VPC Clusters (12/28); New query monitoring system tables and diststyle all (1/13); Redshift on DW2 (SSD) Nodes (1/23); Compression for COPY from SSH, Fetch size support for single-node clusters, new system tables with commit stats, row_number(), strtol() and query termination (2/13); Resize progress indicator & Cluster Version (3/21); Regex_Substr, COPY from JSON (3/25); 50 slots, COPY from EMR, ECDHE ciphers (4/22); 3 new regex features, Unload to single file, FedRAMP (5/6); Rename Cluster (6/2); Copy from multiple regions, percentile_cont, percentile_disc (6/30); Free Trial (7/1)
  • 28. New Features • UNLOAD to single file • COPY from multiple regions • Percentile_cont & percentile_disc window functions
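    Hedged sketches of the three features; every table, column, and bucket name below is an assumption, not from the deck:

      -- UNLOAD to a single output file instead of one file per slice:
      UNLOAD ('SELECT * FROM events')
      TO 's3://my-bucket/exports/events_'
      CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
      PARALLEL OFF;

      -- COPY from an S3 bucket that lives in a different region:
      COPY events
      FROM 's3://my-eu-bucket/events/'
      CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
      REGION 'eu-west-1';

      -- Median sale amount per category with the new window functions:
      SELECT category,
             PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount)
               OVER (PARTITION BY category) AS median_amount
      FROM sales;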
  • 29. Try Amazon Redshift with BI & ETL for Free! • http://aws.amazon.com/redshift/free-trial • 2 months, 750 hours/month of free DW2.Large usage • Also try BI & ETL for free from nine partners
  • 30. © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. A Year With Amazon Redshift Why Upworthy Chose Redshift And How It’s Going July 10, 2014
  • 31. What’s Upworthy • We’ve been called: – “Social media with a mission” by our About Page – “The fastest growing media site of all time” by Fast Company – “The Fastest Rising Startup” by The Crunchies – “That thing that’s all over my newsfeed” by my annoyed friends – “The most data-driven media company in history” by me, optimistically
  • 32. What We Do • We aim to drive massive amounts of attention to things that really matter. • We do that by finding, packaging, and distributing great, meaningful content.
  • 33. Our Use Case
  • 34. When We Started • Building a data warehouse from scratch • One engineer on the project • Object data in MongoDB • Had discovered MoSQL • Knew which two we’d choose: – Comprehensive – Ad Hoc – Real-Time
  • 35. The Decision • Downsides to self-hosting – Cost – Maintenance • Tried Redshift by the hour
  • 36. Building it out initially • ~50 events/second • Snowflake or denormalized? Both. [architecture diagram: Browser, Web Service, S3 Drain, raw S3 events, EMR, processed objects, Redshift; MongoDB objects master, MoSQL, PostgreSQL]
  • 37. Our system now • Stats: – ~5 TB of compressed data. – Two main tables = 13 billion rows – Average: ~1085 events/second – Peak: ~2500 events/second • 5-10 minute ETL cycle (Kinesis session later) • Lots of rollup tables • COPY happens quickly (5-10s) every 1-2 minutes
  • 38. What We’ve Learned
  • 39. Initial Lessons • Columnar can be disorienting: – What’s in the SELECT matters. A lot. – SELECT COUNT(*) is really, really fast. – SELECT * is really, really slow. – Wide, tall tables with lots of NULLs = Fine. • Sortkeys are hugely powerful. • Bulk Operations (COPY, not INSERT) FTW
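    A small illustration of why the SELECT list matters on columnar storage; the events table and its columns are assumptions:

      -- Touches almost no column data; very fast even on billions of rows:
      SELECT COUNT(*) FROM events;

      -- Reads only the two referenced columns:
      SELECT user_id, viewed_at FROM events WHERE viewed_at > '2014-07-01';

      -- Forces a read of every column; slow on wide tables:
      SELECT * FROM events;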
  • 40. Staging to Master Pattern

        INSERT INTO master
        WITH to_insert AS (
          SELECT col1, col2, col3 FROM staging
        )
        SELECT s.*
        FROM to_insert s
        LEFT JOIN master m ON m.col1 = s.col1
        WHERE m.col1 IS NULL;
  • 41. Hash multiple join keys into one

        -- Before: join on three separate columns
        ... FROM table1 t1
        LEFT JOIN table2 t2
          ON t1.col1 = t2.col1
         AND t1.col2 = t2.col2
         AND t1.col3 = t2.col3

        -- Precompute a single key: md5(col1 || col2 || col3) AS hashed_join_key

        -- After: join on the one hashed key
        ... FROM table1 t1
        LEFT JOIN table2 t2
          ON t1.hashed_join_key = t2.hashed_join_key
  • 42. Pain Points • Data loading can be a bit tricky at first – Actually read the documentation. It’ll help. • COUNT(DISTINCT col1) can be painful – Use APPROXIMATE COUNT(DISTINCT col1) instead • No null-safe operator. NULL = NULL returns NULL – NVL(t1.col1, 0) = NVL(t2.col1, 0) is null-safe • Error messages can be less than informative – AWS Redshift Forum is fantastic. Someone else has probably seen that error before you.
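    Sketches of the two workarounds called out above; the events table and user_id column are assumptions, while the NVL join comes straight from the slide:

      -- HyperLogLog-based approximation instead of an exact distinct count:
      SELECT APPROXIMATE COUNT(DISTINCT user_id) FROM events;

      -- Null-safe join using NVL, assuming 0 never appears as a real col1 value:
      SELECT t1.col1, t2.col2
      FROM table1 t1
      LEFT JOIN table2 t2
        ON NVL(t1.col1, 0) = NVL(t2.col1, 0);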
  • 43. Where We’re Going Next
  • 44. In The Next Few Months • More structure to our ETL (implementing luigi) • Need to avoid Serializable Isolation Violations • Dozens of people querying a cluster of 4 dw1.xlarge nodes can get ugly • A “write” cluster of dw1 nodes and a “read” cluster of dw2 “dense compute” nodes
  • 45. Thank You!