• Like
  • Save

AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)

  • 997 views
Uploaded on

Interested in finding out more about the AWS data warehouse service, Amazon Redshift? Then join us for this introductory level technical session where you can learn more about the way in which AWS …

Interested in finding out more about the AWS data warehouse service, Amazon Redshift? Then join us for this introductory level technical session where you can learn more about the way in which AWS customers are using Redshift and the benefits that they have delivered to their organisations as a result as well as tricks and tips for getting the most from Redshift.

More in: Technology
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
No Downloads

Views

Total Views
997
On Slideshare
0
From Embeds
0
Number of Embeds
3

Actions

Shares
Downloads
0
Comments
1
Likes
3

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. Uses & Best Practices for Amazon Redshift Ian Meyers, AWS (meyersi@) Adrian Wiles, The Financial Times (adrian.wiles@) April 30, 2014
  • 2. Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year Amazon Redshift
  • 3. Redshift EMR EC2 Analyze Glacier S3 DynamoDB Store Direct Connect Collect Kinesis
  • 4. Petabyte  scale Massively  parallel   Rela2onal  data  warehouse   Fully  managed;  zero  admin Amazon Redshift a lot faster a lot cheaper a whole lot simpler
  • 5. Common Customer Use Cases •  Reduce costs by extending DW rather than adding HW •  Migrate completely from existing DW systems •  Respond faster to business •  Improve performance by an order of magnitude •  Make more data available for analysis •  Access business data via standard reporting tools •  Add analytic functionality to applications •  Scale DW capacity as demand grows •  Reduce HW & SW costs by an order of magnitude Traditional Enterprise DW Companies with Big Data SaaS Companies
  • 6. Amazon Redshift Customers
  • 7. Growing Ecosystem
  • 8. AWS Marketplace •  Find software to use with Amazon Redshift •  One-click deployments •  Flexible pricing options http://aws.amazon.com/marketplace/redshift
  • 9. Data Loading Options •  Parallel upload to Amazon S3 •  AWS Direct Connect •  AWS Import/Export •  Amazon Kinesis •  SSH / Elastic MapReduce •  Systems integrators Data Integration Systems Integrators
  • 10. Amazon Redshift Architecture •  Leader Node –  SQL endpoint –  Stores metadata –  Coordinates query execution •  Compute Nodes –  Local, columnar storage –  Execute queries in parallel –  Load, backup, restore via Amazon S3; load from Amazon DynamoDB, EMR & SSH 10 GigE (HPC) Ingestion Backup Restore SQL Clients/BI Tools 128GB RAM 16TB disk 16 cores Amazon S3 / DynamoDB / SSH / EMR JDBC/ODBC 128GB RAM 16TB disk 16 coresCompute Node 128GB RAM 16TB disk 16 coresCompute Node 128GB RAM 16TB disk 16 coresCompute Node Leader Node
  • 11. Amazon Redshift Node Types •  Optimized for I/O intensive workloads •  High disk density •  On demand at $0.85/hour •  As low as $1,000/TB/Year •  Scale from 2TB to 1.6PB DW1.XL: 16 GB RAM, 2 Cores 3 Spindles, 2 TB compressed storage DW1.8XL: 128 GB RAM, 16 Cores, 24 Spindles 16 TB compressed, 2 GB/sec scan rate •  High performance at smaller storage size •  High compute and memory density •  On demand at $0.25/hour •  As low as $5,500/TB/Year •  Scale from 160GB to 256TB DW2.L *New*: 16 GB RAM, 2 Cores, 160 GB compressed SSD storage DW2.8XL *New*: 256 GB RAM, 32 Cores, 2.56 TB of compressed SSD storage
  • 12. Amazon Redshift dramatically reduces I/O •  Column storage •  Data compression •  Zone maps •  Direct-attached storage •  With row storage you do unnecessary I/O •  To get total amount, you have to read everything ID Age State Amount 123 20 CA 500 345 25 WA 250 678 40 FL 125 957 37 WA 375
  • 13. •  With column storage, you only read the data you need to evaluate the query Amazon Redshift dramatically reduces I/O •  Column storage •  Data compression •  Zone maps •  Direct-attached storage ID Age State Amount 123 20 CA 500 345 25 WA 250 678 40 FL 125 957 37 WA 375
  • 14. analyze compression listing; Table | Column | Encoding ---------+----------------+---------- listing | listid | delta listing | sellerid | delta32k listing | eventid | delta32k listing | dateid | bytedict listing | numtickets | bytedict listing | priceperticket | delta32k listing | totalprice | mostly32 listing | listtime | raw Amazon Redshift dramatically reduces I/O •  Column storage •  Data compression •  Zone maps •  Direct-attached storage •  COPY compresses automatically •  You can analyze and override •  More performance, less cost
  • 15. Amazon Redshift dramatically reduces I/O •  Column storage •  Data compression •  Zone maps •  Direct-attached storage 10 | 13 | 14 | 26 |… … | 100 | 245 | 324 375 | 393 | 417… … 512 | 549 | 623 637 | 712 | 809 … … | 834 | 921 | 959 10 324 375 623 637 959 •  Track the minimum and maximum value for each block •  Skip over blocks that don’t contain relevant data
  • 16. Amazon Redshift dramatically reduces I/O •  Column storage •  Data compression •  Zone maps •  Direct-attached storage •  Use local storage for performance •  Maximize scan rates •  Automatic replication and continuous backup •  HDD & SSD platforms
  • 17. Amazon Redshift parallelizes and distributes everything •  Query •  Load •  Backup/Restore •  Resize
  • 18. •  Load in parallel from Amazon S3, Elastic MapReduce, Amazon DynamoDB or any SSH connection •  Data automatically distributed and sorted according to DDL •  Scales linearly with number of nodes Amazon S3 / DynamoDB / EMR / SSH 128GB RAM 16TB disk 16 coresCompute Node 128GB RAM 16TB disk 16 coresCompute Node 128GB RAM 16TB disk 16 coresCompute Node Amazon Redshift parallelizes and distributes everything •  Query •  Load •  •  Backup/Restore •  Resize
  • 19. •  Backups to Amazon S3 are automatic, continuous and incremental •  Configurable system snapshot retention period. Take user snapshots on-demand •  Cross region backups for disaster recovery •  Streaming restores enable you to resume querying faster Amazon S3 128GB RAM 16TB disk 16 coresCompute Node 128GB RAM 16TB disk 16 coresCompute Node 128GB RAM 16TB disk 16 coresCompute Node Amazon Redshift parallelizes and distributes everything •  Query •  Load •  Backup/Restore •  Resize
  • 20. •  Resize while remaining online •  Provision a new cluster in the background •  Copy data in parallel from node to node •  Only charged for source cluster SQL Clients/BI Tools 128GB RAM 48TB disk 16 cores Comput e Node 128GB RAM 48TB disk 16 cores Comput e Node 128GB RAM 48TB disk 16 cores Comput e Node 128GB RAM 48TB disk 16 cores Leader Node 128GB RAM 48TB disk 16 cores Comput e Node 128GB RAM 48TB disk 16 cores Comput e Node 128GB RAM 48TB disk 16 cores Comput e Node 128GB RAM 48TB disk 16 cores Comput e Node 128GB RAM 48TB disk 16 cores Leader Node Amazon Redshift parallelizes and distributes everything •  Query •  Load •  Backup/Restore •  Resize
  • 21. SQL Clients/BI Tools 128GB RAM 48TB disk 16 cores Comput e Node 128GB RAM 48TB disk 16 cores Comput e Node 128GB RAM 48TB disk 16 cores Comput e Node 128GB RAM 48TB disk 16 cores Comput e Node 128GB RAM 48TB disk 16 cores Leader Node •  Automatic SQL endpoint switchover via DNS •  Decommission the source cluster •  Simple operation via Console or API Amazon Redshift parallelizes and distributes everything •  Query •  Load •  Backup/Restore •  Resize
  • 22. Amazon Redshift is priced to let you analyze all your data •  Number of nodes x cost per hour •  No charge for leader node •  No upfront costs •  Pay as you go DW1 (HDD) Price Per Hour for DW1.XL Single Node Effective Annual Price per TB On-Demand $ 0.850 $ 3,723 1 Year Reservation $ 0.500 $ 2,190 3 Year Reservation $ 0.228 $ 999 DW2 (SSD) Price Per Hour for DW2.L Single Node Effective Annual Price per TB On-Demand $ 0.250 $ 13,688 1 Year Reservation $ 0.161 $ 8,794 3 Year Reservation $ 0.100 $ 5,498
  • 23. Amazon Redshift is easy to use •  Provision in minutes •  Powerful Web Console •  Monitor query performance •  Point and click resize •  Built in security •  Automatic backups
  • 24. Amazon Redshift has security built-in •  SSL to secure data in transit •  Encryption to secure data at rest –  AES-256; hardware accelerated –  All blocks on disks and in Amazon S3 encrypted –  HSM Support •  No direct access to compute nodes •  Audit logging & AWS CloudTrail integration •  Amazon VPC support 10 GigE (HPC) Ingestion Backup Restore SQL Clients/BI Tools 128GB RAM 16TB disk 16 cores 128GB RAM 16TB disk 16 cores 128GB RAM 16TB disk 16 cores 128GB RAM 16TB disk 16 cores Amazon S3 / Amazon DynamoDB Customer VPC Internal VPC JDBC/ODBC Leader Node Compute Node Compute Node Compute Node
  • 25. Amazon Redshift continuously backs up your data and recovers from failures •  Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times •  Backups to Amazon S3 are continuous, automatic, and incremental –  Designed for eleven nines of durability •  Continuous monitoring and automated recovery from failures of drives and nodes •  Able to restore snapshots to any Availability Zone within a region •  Easily enable backups to a second region for disaster recovery
  • 26. Amazon Redshift Feature Delivery Service Launch (2/14) PDX (4/2) Temp Credentials (4/11) Unload Encrypted Files DUB (4/25) NRT (6/5) JDBC Fetch Size (6/27) Unload logs (7/5) 4 byte UTF-8 (7/18) Statement Timeout (7/22) SHA1 Builtin (7/15) Timezone, Epoch, Autoformat (7/25) WLM Timeout/Wildcards (8/1) CRC32 Builtin, CSV, Restore Progress (8/9) UTF-8 Substitution (8/29) JSON, Regex, Cursors (9/10) Split_part, Audit tables (10/3) SIN/SYD (10/8) HSM Support (11/11) Kinesis EMR/HDFS/SSH copy, Distributed Tables, Audit Logging/ CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS Alerts (11/13) SOC1/2/3 (5/8) Sharing snapshots (7/18) Resource Level IAM (8/9) PCI (8/22) Distributed Tables, Single Node Cursor Support, Maximum Connections to 500 (12/13) EIP Support for VPC Clusters (12/28) New query monitoring system tables and diststyle all (1/13) Redshift on DW2 (SSD) Nodes (1/23) Compression for COPY from SSH, Fetch size support for single node clusters, new system tables with commit stats, row_number(), strotol() and query termination (2/13) Resize progress indicator & Cluster Version (3/21) Regex_Substr, COPY from JSON (3/25)
  • 27. 50+ new features since launch •  Regions – N. Virginia, Oregon, Dublin, Tokyo, Singapore, Sydney •  Certifications – PCI, SOC 1/2/3 •  Security – Load/unload encrypted files, Resource-level IAM, Temporary credentials, HSM •  Manageability – Snapshot sharing, backup/restore/resize progress indicators, Cross-region snapshots •  Query – Regex, Cursors, MD5, SHA1, Time Zone, workload queue timeout, HLL •  Ingestion – S3 Manifest, LZOP/LZO, JSON built-ins, UTF-8 4byte, invalid character substitution, CSV, auto datetime format detection, epoch, Ingest from SSH
  • 28. COPY from JSON { "jsonpaths": [ "$['id']", "$['name']", "$['location'][0]", "$['location'][1]", "$['seats']" ] } COPY venue FROM 's3://mybucket/venue.json’ credentials 'aws_access_key_id=ACCESS-KEY-ID; aws_secret_access_key=SECRET-ACCESS-KEY' JSON AS 's3://mybucket/venue_jsonpaths.json';
  • 29. REGEX_SUBSTR() select email, regexp_substr(email,'@[^.]*') from users limit 5; email | regexp_substr --------------------------------------------+---------------- Suspendisse.tristique@nonnisiAenean.edu | @nonnisiAenean sed@lacusUtnec.ca | @lacusUtnec elementum@semperpretiumneque.ca | @semperpretiumneque Integer.mollis.Integer@tristiquealiquet.org | @tristiquealiquet Donec.fringilla@sodalesat.org | @sodalesat
  • 30. Resize Progress •  Progress indicator in console •  New API call
  • 31. Adrian Wiles Enterprise Data Architect Financial Times
  • 32. Better value for money
  • 33. Flexibility • Storage growth • Ability to try new things 0 2 4 6 8 10 12 14 2011 2012 2013 2014
  • 34. Democratise Data
  • 35. Why Redshift? •  Faster •  No up-front commitment –  Fail fast and cheap –  Avoid procurement process •  Scale down as well as up •  AWS ecosystem
  • 36. Source Data Archive Staging DQ Standardise ETL Data Warehouse Reporting Campaigns Cubes
  • 37. Source Data Archive Staging DQ Standardise ETL Reporting Campaigns Cubes Data Warehouse
  • 38. Source Data Archive DQ Standardise ETL Reporting Campaigns Cubes Data Warehouse Staging
  • 39. Source Data DQ Standardise ETL Reporting Campaigns Cubes Data Warehouse Staging Archive
  • 40. Data Warehouse Staging Archive sftp ETL Cubes Reporting Campaigns
  • 41. Data Warehouse Staging Archive sftp ETL Cubes Reporting Campaigns
  • 42. Performance “My query must be wrong, that was too fast” 3.7 Billion Rows Before: 7:03m Redshift: 43.82s
  • 43. Near real-time Dashboard
  • 44. “Yes we can”
  • 45. Learning •  Plan your network in advance •  Different way of thinking –  Distributed data and column stores –  No stored procedures –  Small queries •  Work with AWS –  AWS invested in our success
  • 46. Future •  Consolidate existing marts •  More sources near real-time •  Service Oriented Architecture
  • 47. AWS Partner Trail Win a Kindle Fire •  10 in total •  Get a code from our sponsors
  • 48. Please rate this session using the AWS Summits App and help us build better events
  • 49. #AWSSummit @AWScloud @AWS_UKI