
Best Practices for Supercharging Cloud Analytics on Amazon Redshift


In this webinar, we discuss how the secret sauce of your business analytics strategy remains rooted in your approach, methodologies, and the amount of data incorporated into this critical exercise. We also address best practices to supercharge your cloud analytics initiatives, plus tips and tricks for designing the right information architecture, data models, and other tactical optimizations.

To learn more, visit:

Published in: Technology, Business


  1. Best Practices for Supercharging Cloud Analytics on Amazon Redshift
     Tina Adams, Amazon Redshift; Brandon Davis, Cervello; Maneesh Joshi, SnapLogic. May 2014
  2. Featured Speakers
  3. Agenda
     • Amazon Redshift Feature and Market Update
     • SnapLogic Case Studies with Amazon Redshift
     • Demo: SnapLogic Free Trial for Amazon Redshift and RDS
     • Cervello: Implementation Best Practices
  4. Amazon Redshift: fast, simple, petabyte-scale data warehousing for less than $1,000/TB/year
  5. Amazon Redshift Architecture
     • Leader node: SQL endpoint; stores metadata; coordinates query execution
     • Compute nodes: local, columnar storage; execute queries in parallel; load, backup, and restore via Amazon S3; load from Amazon DynamoDB or SSH
     • Two hardware platforms, optimized for data processing:
       - DW1: HDD; scales from 2 TB to 1.6 PB
       - DW2: SSD; scales from 160 GB to 256 TB
     (Diagram: JDBC/ODBC clients connect to the leader node; compute nodes interconnect over 10 GigE (HPC); ingestion, backup, and restore paths shown)
  6. Amazon Redshift is priced to let you analyze all your data
     • Number of nodes x cost per hour
     • No charge for the leader node; no upfront costs; pay as you go

     DW1 (HDD), DW1.XL single node:
                           Price per hour   Effective annual price per TB
     On-Demand             $0.850           $3,723
     1-Year Reservation    $0.500           $2,190
     3-Year Reservation    $0.228           $999

     DW2 (SSD), DW2.L single node:
                           Price per hour   Effective annual price per TB
     On-Demand             $0.250           $13,688
     1-Year Reservation    $0.161           $8,794
     3-Year Reservation    $0.100           $5,498
  7. Amazon Redshift Feature Delivery
     • Service launch (2/14); PDX (4/2); temporary credentials (4/11); unload encrypted files; DUB (4/25); NRT (6/5)
     • JDBC fetch size (6/27); unload logs (7/5); SHA1 built-in (7/15); 4-byte UTF-8 (7/18); statement timeout (7/22); timezone, epoch, autoformat (7/25)
     • WLM timeout/wildcards (8/1); CRC32 built-in, CSV, restore progress (8/9); UTF-8 substitution (8/29)
     • JSON, regex, cursors (9/10); SPLIT_PART, audit tables (10/3); SIN/SYD (10/8); HSM support (11/11)
     • Kinesis, EMR/HDFS/SSH copy, distributed tables, audit logging/CloudTrail, concurrency, resize performance, approximate count distinct, SNS alerts (11/13)
     • Compliance: SOC 1/2/3 (5/8); sharing snapshots (7/18); resource-level IAM (8/9); PCI (8/22)
     • Distributed tables, single-node cursor support, maximum connections raised to 500 (12/13); EIP support for VPC clusters (12/28)
     • New query-monitoring system tables and DISTSTYLE ALL (1/13); Redshift on DW2 (SSD) nodes (1/23)
     • Compression for COPY from SSH, fetch-size support for single-node clusters, new system tables with commit stats, ROW_NUMBER(), STRTOL(), and query termination (2/13)
     • Resize progress indicator and cluster version (3/21); REGEXP_SUBSTR, COPY from JSON (3/25)
  8. Improved Concurrency: maximum concurrent queries raised from 15 to 50
  9. COPY from JSON

     venue_jsonpaths.json:
     {
       "jsonpaths": [
         "$['id']",
         "$['name']",
         "$['location'][0]",
         "$['location'][1]",
         "$['seats']"
       ]
     }

     COPY venue FROM 's3://mybucket/venue.json'
     CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
     JSON AS 's3://mybucket/venue_jsonpaths.json';
  10. COPY from Amazon Elastic MapReduce

      COPY sales FROM 'emr://j-1H7OUO3B52HI5/myoutput/part*'
      CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>';
  11. REGEXP_SUBSTR()

      select email, regexp_substr(email, '@[^.]*') from users limit 5;

      email | regexp_substr
      ------+----------------
            | @nonnisiAenean
            | @lacusUtnec
            | @semperpretiumneque
            | @tristiquealiquet
            | @sodalesat
  12. Resize Progress
      • Progress indicator in the console
      • New API call
  13. ECDHE cipher suites for perfect forward secrecy over SSL
      • ECDHE-RSA and ECDHE-ECDSA cipher suites supported
  14. Amazon Redshift integrates with multiple data sources: Amazon S3, Amazon EMR, Amazon DynamoDB, Amazon RDS, and the corporate datacenter
  15. Agenda (repeated)
  16. The SnapLogic Platform for Elastic Integration: powering analytics, apps, and APIs
  17. Why SnapLogic?
      • Multi-point orchestration: SnapStore with 160+ prebuilt Snaps; orchestration and workflow
      • Modern platform: elastic, scale-out architecture; hybrid use cases, cloud to cloud and cloud to ground
      • Faster integration: easily design, monitor, and manage; deploy in days, not months
  18. Multi-Point: Comprehensive Connectivity. Snap your apps: 160+ pre-built integrations
  19. Software-Defined Integration
      • Streams: no data is stored or cached
      • Secure: 100% standards-based
      • Elastic: scales out and handles data, app, and API integration use cases
      (Hybrid scale-out architecture respects data gravity)
  20. International Hotel Chain: Reservation Data Management
      Past:
      • 126 TB of hotel reservation data
      • Prohibitive cost per query for analytics
      • Unacceptable performance
      Present:
      • FedEx'ed 126 TB of data to load into AWS Redshift
      • Now run a daily SnapLogic sync of data changes (100-150 GB) between on-premises systems and the cloud
      • Enrich analytics with Twitter and Travelocity data
      • Improved cost per query and performance
  21. Mid-sized Pharma Creates a Cloud Data Mart
      • Consolidate databases (Customer, Address, and Order) and SFDC (Contact and Account) into Redshift
      • MicroStrategy is the visualization layer
      (Diagram: cloud-to-on-prem Snaplex and cloud-to-cloud Snaplex connected via REST; metadata and data paths)
  22. Agenda (repeated)
  23. Demo
  24. Agenda (repeated)
  25. Cervello Helps Clients Win With Data
      Practice areas: Enterprise Performance Management (Finance); Customer Relationship Management (Sales & Marketing); Business Intelligence & Analytics (IT); Data Management; Custom Development. Advise, implement, support.
      • Offices in Boston, New York, Dallas, and the UK
      • Offshore development and support teams in Russia and India
      • Partners with the leading on-premises and cloud technology companies
  26. Implementation Case Study
      • Hospitality industry analytics
        - Detailed transactional data
        - Weekly / monthly / yearly trend analysis
        - Began with a single-node cluster, adding nodes as data volumes grow
      (Diagram: Source Data, ETL, Redshift, Analytics)
  27. Best Practice #1: Choose the Right Pattern
      Requirements and the matching design:
      • Collect external data loads before merging with existing data → staging tables
      • Maintain history of cleansed and standardized source data → history tables
      • Use data structures optimized for analytics (dimension and fact tables, aggregate tables) → star schema data warehouse
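A rough sketch of the staging-table pattern described above (table and column names are hypothetical, not from the webinar). At the time of this webinar Redshift had no MERGE statement, so the common idiom was delete-then-insert inside a transaction:

```sql
-- Stage the incoming load separately from the warehouse tables.
CREATE TABLE stg_sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(12,2)
);

-- Merge into the target via delete-then-insert (upsert idiom).
BEGIN;
DELETE FROM sales
 USING stg_sales
 WHERE sales.sale_id = stg_sales.sale_id;
INSERT INTO sales
SELECT * FROM stg_sales;
COMMIT;
```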
  28. Best Practice #2: Select the Right Node Type
      Early stages:
      • Performance was good with initial volumes and small data sets on a single node
      • Evaluated dense storage (DW1) and dense compute (DW2) nodes
      • More opportunity to optimize the design as volumes grew
      Mature stage:
      • Increased nodes to handle larger volumes; the solution leverages dense storage (DW1) nodes
      • Expected to stabilize between 10-20 TB
      • Smaller volumes have also worked really well on dense compute (DW2) nodes
  29. Best Practice #3: Leverage MPP
      Goals:
      • Spread data evenly across nodes while also optimizing join performance
      • Distribution key and sort keys are the primary considerations
      Approach:
      • The initial fact table distribution key caused skewed data
      • Changed to a dimension foreign key with better distribution, for a 40%+ improvement in query times
      • Surrogate keys on dimension tables: primary key; sort key and distribution key, OR distribute to all nodes
      • Sort on foreign keys in fact tables
      (Diagram: leader node coordinating compute nodes 1 through n)
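The key choices above can be sketched as DDL (the schema is hypothetical): a small dimension replicated to every node with DISTSTYLE ALL, and a fact table distributed and sorted on a well-distributed dimension foreign key.

```sql
-- Small dimension: DISTSTYLE ALL copies it to every node,
-- so joins against it never redistribute data.
CREATE TABLE dim_hotel (
    hotel_key  INT PRIMARY KEY,  -- surrogate key; Redshift does not
                                 -- enforce PKs, the planner uses them
    hotel_name VARCHAR(200)
) DISTSTYLE ALL;

-- Fact table: distribute and sort on the dimension foreign key
-- to collocate joins and speed range-restricted scans.
CREATE TABLE fact_reservation (
    reservation_id BIGINT,
    hotel_key      INT,
    stay_date      DATE,
    revenue        DECIMAL(12,2)
)
DISTKEY (hotel_key)
SORTKEY (hotel_key, stay_date);
```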
  30. Best Practice #4: Use Columnar Compression
      Goal:
      • Reduce I/O workload by minimizing the size of data stored on disk
      Approach:
      • Started with compression settings based on general data types (VARCHAR → TEXT255, INTEGER → MOSTLY16, etc.), then iterated using ANALYZE COMPRESSION
      • Redshift applies automatic compression during COPY (used for staging tables)
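The iteration loop above might look like this (table and column names are hypothetical). ANALYZE COMPRESSION samples a table and reports a recommended encoding per column; since encodings cannot be altered in place, applying them means recreating and reloading the table:

```sql
-- Sample the table and report a recommended encoding per column.
ANALYZE COMPRESSION fact_reservation;

-- Apply the chosen encodings explicitly in a rebuilt table.
CREATE TABLE fact_reservation_v2 (
    reservation_id BIGINT       ENCODE delta,
    hotel_key      INT          ENCODE mostly16,
    stay_date      DATE         ENCODE delta32k,
    notes          VARCHAR(255) ENCODE text255
);
```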
  31. Best Practice #5: Load and Manage Data
      ETL and ELT:
      • ETL: the first set of processes prepares data for analytics: business logic, standardization, validation
      • ELT: the second set of processes loads data into Redshift and transforms it into analytical structures
      Data management:
      • Enforce constraints within the ETL processes
      • ANALYZE after loads to update statistics
      • VACUUM after large loads to existing tables, and after updates and deletes
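The load-and-maintain cycle above can be sketched as follows (the bucket, file format, and table names are hypothetical):

```sql
-- Bulk-load from S3; COPY parallelizes across slices and applies
-- automatic compression when loading into an empty table.
COPY fact_reservation
FROM 's3://mybucket/reservations/'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
GZIP DELIMITER '|';

-- Refresh planner statistics after the load.
ANALYZE fact_reservation;

-- Reclaim space and restore sort order after large loads,
-- updates, or deletes.
VACUUM fact_reservation;
```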
  32. Bringing It All Together
      • Analytic queries:
        - Minimize the number of query columns to improve performance
        - Most queries use SUM or COUNT
        - Leveraging aggregate tables for monthly dashboards
      • EXPLAIN long-running queries to help optimize the design
        - Sorting/merging within nodes, and merging at the leader node
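As an illustration of the last point, EXPLAIN on a hypothetical monthly dashboard query shows where work runs on compute nodes and where results merge at the leader node:

```sql
-- Inspect the plan for a typical aggregate query; look for
-- DS_DIST_* steps (data redistribution between nodes) and
-- the final merge/aggregate at the leader node.
EXPLAIN
SELECT hotel_key,
       DATE_TRUNC('month', stay_date) AS stay_month,
       SUM(revenue) AS total_revenue
FROM fact_reservation
GROUP BY hotel_key, DATE_TRUNC('month', stay_date);
```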
  33. Learn more…
      1. Try out the SnapLogic Free Trial for Amazon Redshift:
      2. Learn more about Amazon Redshift at:
      3. Learn more about Cervello at: