1
Best Practices for Supercharging Cloud
Analytics on Amazon Redshift
Tina Adams, Amazon Redshift
Brandon Davis, Cervello
Maneesh Joshi, SnapLogic
May 2014
2
Featured Speakers
3
Agenda
• Amazon Redshift Feature and Market Update
• SnapLogic Case Studies with Amazon Redshift
• Demo: SnapLogic Free Trial for Amazon Redshift and
RDS
• Cervello: Implementation Best Practices
4
Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year
Amazon Redshift
5
Amazon Redshift Architecture
• Leader Node
– SQL endpoint
– Stores metadata
– Coordinates query execution
• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via
Amazon S3; load from
Amazon DynamoDB or SSH
• Two hardware platforms
– Optimized for data processing
– DW1: HDD; scale from 2TB to 1.6PB
– DW2: SSD; scale from 160GB to 256TB
[Diagram: JDBC/ODBC clients connect to the leader node; compute nodes on a 10 GigE (HPC) network handle ingestion, backup, and restore]
6
Amazon Redshift is priced to let you analyze all
your data
• Number of nodes x cost per hr
• No charge for leader node
• No upfront costs
• Pay as you go
DW1 (HDD)
                     Price Per Hour (DW1.XL Single Node)   Effective Annual Price per TB
On-Demand            $0.850                                $3,723
1 Year Reservation   $0.500                                $2,190
3 Year Reservation   $0.228                                $999

DW2 (SSD)
                     Price Per Hour (DW2.L Single Node)    Effective Annual Price per TB
On-Demand            $0.250                                $13,688
1 Year Reservation   $0.161                                $8,794
3 Year Reservation   $0.100                                $5,498
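As a sanity check of the "effective annual price per TB" column (assuming 8,760 hours per year and 2 TB of usable storage per DW1.XL node), the 3-year reserved figure works out as:

```sql
-- Illustrative arithmetic only; assumes 8,760 hours/year
-- and 2 TB usable storage per DW1.XL node.
SELECT 0.228 * 8760 / 2.0 AS dw1_3yr_price_per_tb;  -- ≈ $999/TB/year
```

The DW2 figures work the same way, with 160 GB of usable SSD storage per DW2.L node.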
7
Amazon Redshift Feature Delivery
Service Launch (2/14)
PDX (4/2)
Temp Credentials (4/11)
Unload Encrypted Files
DUB (4/25)
NRT (6/5)
JDBC Fetch Size (6/27)
Unload logs (7/5)
4 byte UTF-8 (7/18)
Statement Timeout (7/22)
SHA1 Builtin (7/15)
Timezone, Epoch, Autoformat (7/25)
WLM Timeout/Wildcards (8/1)
CRC32 Builtin, CSV, Restore Progress
(8/9)
UTF-8 Substitution (8/29)
JSON, Regex, Cursors (9/10)
Split_part, Audit tables (10/3)
SIN/SYD (10/8)
HSM Support (11/11)
Kinesis EMR/HDFS/SSH copy,
Distributed Tables, Audit
Logging/CloudTrail, Concurrency, Resize
Perf., Approximate Count Distinct, SNS
Alerts (11/13)
SOC1/2/3 (5/8)
Sharing snapshots (7/18)
Resource Level IAM (8/9)
PCI (8/22)
Distributed Tables, Single Node Cursor
Support, Maximum Connections to 500
(12/13)
EIP Support for VPC Clusters (12/28)
New query monitoring system tables and
diststyle all (1/13)
Redshift on DW2 (SSD) Nodes (1/23)
Compression for COPY from SSH, Fetch
size support for single node clusters,
new system tables with commit stats,
row_number(), strtol() and query
termination (2/13)
Resize progress indicator & Cluster
Version (3/21)
Regexp_Substr, COPY from JSON (3/25)
8
Improved Concurrency
Maximum concurrent queries per cluster increased from 15 to 50
9
COPY from JSON
{
"jsonpaths":
[
"$['id']",
"$['name']",
"$['location'][0]",
"$['location'][1]",
"$['seats']"
]
}
COPY venue FROM 's3://mybucket/venue.json'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
JSON 's3://mybucket/venue_jsonpaths.json';
10
COPY from Amazon Elastic MapReduce
COPY sales
FROM 'emr://j-1H7OUO3B52HI5/myoutput/part*'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>';
Amazon EMR Amazon Redshift
11
REGEXP_SUBSTR()
select email, regexp_substr(email,'@[^.]*')
from users limit 5;
email | regexp_substr
--------------------------------------------+----------------
Suspendisse.tristique@nonnisiAenean.edu | @nonnisiAenean
sed@lacusUtnec.ca | @lacusUtnec
elementum@semperpretiumneque.ca | @semperpretiumneque
Integer.mollis.Integer@tristiquealiquet.org | @tristiquealiquet
Donec.fringilla@sodalesat.org | @sodalesat
12
Resize Progress
• Progress indicator in
console
• New API call
13
ECDHE cipher suites for perfect forward
secrecy over SSL
ECDHE-RSA & ECDHE-ECDSA cipher suites supported
14
Amazon Redshift integrates with multiple data
sources
[Diagram: Amazon Redshift loading from Amazon S3, Amazon EMR, Amazon DynamoDB, Amazon RDS, and the corporate datacenter]
15
Agenda
• Amazon Redshift Feature and Market Update
• SnapLogic Case Studies with Amazon Redshift
• Demo: SnapLogic Free Trial for Amazon Redshift and
RDS
• Cervello: Implementation Best Practices
16
The SnapLogic Platform for Elastic Integration
Powering Analytics, Apps and APIs
Data Applications APIs
17
Why SnapLogic?
Multi-Point Orchestration
• SnapStore: 160+ Prebuilt Snaps
• Orchestration & Workflow
Modern Platform
• Elastic, Scale-out Architecture
• Hybrid: Cloud to Cloud and
Cloud to Ground Use Cases
Faster Integration
• Easily Design, Monitor, Manage
• Deploy in Days not Months
18
Multi-Point: Comprehensive Connectivity
Snap your Apps: 160+ pre-built integrations
19
Software-defined Integration
Metadata
Data
• Streams: No data is
stored/cached
• Secure: 100%
standards-based
• Elastic: Scales out &
handles data, app, API
integration use cases
Hybrid Scale-out Architecture Respects Data Gravity
20
International Hotel Chain Reservation Data Mgmt.
PAST
• 126 TB of hotel reservation data
• Prohibitive cost-per-query for analytics
• Unacceptable performance
PRESENT
• FedEx’ed 126 TB of data to load into Amazon Redshift
• Now run a daily SnapLogic sync of data changes (100-150 GB) between on-premises systems and the cloud
• Enrich analytics with Twitter and Travelocity data
• Improved cost-per-query and performance
21
Mid-sized Pharma Creates Cloud Data Mart
Cloud to On-prem Snaplex
REST
Cloud to Cloud Snaplex
Metadata
Data
• Consolidate DBs
(Customer, Address,
and Order) and SFDC
(Contact and Account)
into Redshift
• MicroStrategy is the
visualization layer
22
Agenda
• Amazon Redshift Feature and Market Update
• SnapLogic Case Studies with Amazon Redshift
• Demo: SnapLogic Free Trial for Amazon Redshift
and RDS
• Cervello: Implementation Best Practices
23
DEMO
24
Agenda
• Amazon Redshift Feature and Market Update
• SnapLogic Case Studies with Amazon Redshift
• Demo: SnapLogic Free Trial for Amazon Redshift and
RDS
• Cervello: Implementation Best Practices
25
Enterprise
Performance
Management
(Finance)
Customer
Relationship
Management
(Sales &
Marketing)
Data Management
Custom Development
Business
Intelligence &
Analytics
(IT)
• We have offices in Boston, New York, Dallas and the UK
• Offshore development and support teams in Russia and India
• We partner with the leading on-premises and cloud technology companies
Advise, Implement, Support
Cervello Helps Clients Win With Data
26
Implementation Case Study
• Hospitality industry analytics
– Detailed transactional data
– Weekly / monthly / yearly trend analysis
– Began with single-node cluster, adding nodes as data volumes
grow
Source Data Redshift Analytics
ETL
27
Best Practice #1: Choose The Right Pattern
Requirements
• Collect external data loads before merging with existing data
• Maintain history of cleansed and standardized source data
• Use data structures optimized for analytics
– Dimension and fact tables for analytics
– Aggregate tables
Design
• Staging tables
• History tables
• Star schema data warehouse
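A minimal sketch of the staging-to-history pattern described above; the table and column names here are illustrative, not from the case study:

```sql
-- Staging: raw external loads land here first
CREATE TABLE stg_reservations (
    reservation_id BIGINT,
    hotel_code     VARCHAR(10),
    checkin_date   DATE,
    amount         DECIMAL(12,2)
);

-- History: cleansed, standardized rows with a load timestamp
CREATE TABLE hist_reservations (
    reservation_id BIGINT,
    hotel_code     VARCHAR(10),
    checkin_date   DATE,
    amount         DECIMAL(12,2),
    load_ts        TIMESTAMP DEFAULT GETDATE()
);

-- Merge each staged load into history after cleansing/standardization
INSERT INTO hist_reservations (reservation_id, hotel_code, checkin_date, amount)
SELECT reservation_id, UPPER(TRIM(hotel_code)), checkin_date, amount
FROM stg_reservations;
```

The star-schema dimension and fact tables are then built from the history layer, not directly from raw loads.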
28
Best Practice #2: Select the Right Node Type
Early Stages
• Performance was good with initial volumes and small data sets on a single node
• Evaluated dense storage (dw1) and dense compute (dw2) nodes
• More opportunity to optimize design as volumes grew
Mature Stage
• Increased nodes to handle larger volumes
– Solution leverages dense storage (dw1) nodes
– Expected to stabilize between 10-20 TB
• Have also seen smaller volumes that work really well on dense compute (dw2) nodes
29
Best Practice #3: Leverage MPP
Goals
• Spread data evenly across nodes while also optimizing join performance
• Distribution key and sort keys are primary considerations
Approach
• Initial fact table distribution key caused skewed data
• Changed to a dimension foreign key with better distribution for a 40%+ improvement in query times
• Surrogate keys on dimension tables
– Primary key
– Sort key and distribution key, OR distribute to all nodes
– Sort on foreign keys in fact tables
[Diagram: leader node coordinating compute nodes 1 through n]
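A hedged DDL sketch of the approach above; table and column names are illustrative, not the case study's actual schema:

```sql
-- Small dimension: replicate to every node rather than distribute,
-- and key it by a surrogate primary key
CREATE TABLE dim_hotel (
    hotel_key  INT NOT NULL,      -- surrogate key
    hotel_code VARCHAR(10),
    city       VARCHAR(50)
)
DISTSTYLE ALL
SORTKEY (hotel_key);

-- Fact table: distribute on a well-spread dimension foreign key
-- and sort on the foreign keys used in joins
CREATE TABLE fact_reservations (
    hotel_key INT NOT NULL,
    date_key  INT NOT NULL,
    amount    DECIMAL(12,2)
)
DISTKEY (hotel_key)
SORTKEY (hotel_key, date_key);
```

Co-locating the fact rows by a join key (or replicating small dimensions with DISTSTYLE ALL) lets joins run locally on each compute node instead of redistributing data at query time.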
30
Best Practice #4: Use Columnar Compression
Goals
• Reduce I/O workload by minimizing the size of data stored on disk
Approach
• Started with compression settings based on general data types
– VARCHAR to TEXT255, INTEGER to MOSTLY16, etc.
– Iterate using ANALYZE COMPRESSION
• Redshift applies automatic compression during COPY
– Staging tables
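A sketch of that compression workflow, with an illustrative table name:

```sql
-- Ask Redshift to recommend encodings from a sample of loaded data
ANALYZE COMPRESSION fact_reservations;

-- Then bake the chosen encodings into the table DDL, for example:
CREATE TABLE fact_reservations_v2 (
    hotel_key INT          ENCODE mostly16,
    comment   VARCHAR(255) ENCODE text255,
    amount    DECIMAL(12,2)
);
```

Loading a staging table with COPY (and no explicit encodings) lets Redshift's automatic compression pick encodings from the incoming data; ANALYZE COMPRESSION lets you iterate on those choices as volumes grow.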
31
Best Practice #5: Load and Manage Data
• ETL and ELT
– ETL: the first set of processes prepares data for analytics (business logic, standardization, validation)
– ELT: the second set of processes loads data into Redshift and transforms it into analytical structures
• Data management
– Enforce constraints within ETL processes
– ANALYZE after loads to update statistics
– VACUUM after large loads into existing tables, and after updates and deletes
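The maintenance steps above are plain commands (illustrative table name):

```sql
-- Refresh optimizer statistics after a load
ANALYZE fact_reservations;

-- Re-sort rows and reclaim space after large loads, updates, or deletes
VACUUM fact_reservations;
```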
32
Bringing it All Together
• Analytic queries
– Minimize number of query columns to improve performance
– Most queries use SUM or COUNT
– Leveraging aggregate tables for monthly dashboards
• EXPLAIN long-running queries to help optimize design
– Sorting / merging within nodes and merging at leader node
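Inspecting a plan for a typical aggregate query is a one-keyword change (illustrative schema):

```sql
EXPLAIN
SELECT hotel_key, SUM(amount)
FROM fact_reservations
GROUP BY hotel_key;
-- The plan shows which sort/aggregate steps run on the compute nodes
-- and what remains to be merged at the leader node.
```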
33
Learn more…
1. Try out the SnapLogic Free Trial for Amazon Redshift:
http://snaplogic.com/redshift-trial
2. Learn more about Amazon Redshift at:
http://aws.amazon.com/redshift
3. Learn more about Cervello at:
http://mycervello.com/
