Best Practices for Migrating your Data Warehouse to Amazon Redshift
Darin Briskman, AWS Technical Evangelist
briskman@amazon.com
Why Migrate to Amazon Redshift?
Transactional database
§ Redshift can be 100x faster
§ Scales from GBs to PBs
§ Analyze data without storage constraints
MPP database
§ Redshift is 10x cheaper
§ Easy to provision and operate
§ Higher productivity
Hadoop
§ Redshift can be 10x faster
§ No programming
§ Standard interfaces and integration to leverage BI tools, machine learning, streaming
Migration from Oracle @ Boingo Wireless
2000+ Commercial Wi-Fi locations
1 million+ Hotspots
90M+ ad engagements
100+ countries
Legacy DW: Oracle 11g based DW
Before migration
Rapid data growth slowed analytics
Mediocre IOPS, limited memory, vertical scaling
Admin overhead
Expensive (license, h/w, support)
After migration
180x performance improvement
7x cost savings
[Chart: Exadata $400,000, SAP HANA $300,000, Redshift $55,000. 7X cheaper than Oracle Exadata]
[Chart: latency in seconds, existing system vs. Redshift. Query performance (1 year of data): 7,200 vs. 15. Data load performance (1 million records): 2,700 vs. 15. 180X faster than Oracle database]
Migration from Greenplum @ NTT Docomo
68 million customers
10s of TBs per day of data across mobile network
6PB of total data (uncompressed)
Data science for marketing operations, logistics etc.
Legacy DW: Greenplum on-premises
After migration:
125 node DS2.8XL cluster
4,500 vCPUs, 30TB RAM
6 PB uncompressed
10x faster analytic queries
50% reduction in time for new BI app deployment
Significantly less ops overhead
Migration from SQL on Hadoop @ Yahoo
§ Analytics for website/mobile events across multiple Yahoo properties
§ On an average day
2B events
25M devices
Before migration: Hive – Found it to be slow, hard to use, share and repeat
After migration:
21 node DC1.8XL (SSD)
50TB compressed data
100x performance improvement
Real-time insights
Easier deployment and maintenance
[Chart: query time in seconds (log scale, 1 to 10,000), Amazon Redshift vs. Impala, for four workloads: count distinct devices, count all events, filter clauses, joins]
Amazon Redshift Migration Process
Evaluation
Perform PoC and Validation
Design Migration Plan
Migration, Integration and Validation
Operations and optimization
How to Migrate?
ENGINE X → Amazon Redshift
ETL scripts
SQL in reports
Ad hoc queries
Schema Conversion
Map data types
Choose compression encoding, sort keys, distribution keys
Generate and apply DDL
Database Migration
Schema & Data Transformation
Data Migration
Convert SQL Code
Bulk Load
Capture updates
Transformations
Assess Gaps
Stored Procedures
Functions
If you forget everything else…
• Lift-and-Shift is NOT an ideal approach
• Depending on where you are coming from, it is sure to fail
• AWS has a rich ecosystem of solutions
• Your final solution will use other AWS services
• AWS Solution Architects, ProServ, and Partners can help
Amazon Redshift Cluster Architecture
§ Massively parallel, shared nothing
§ Leader node
– SQL endpoint
– Stores metadata
– Coordinates parallel SQL processing
§ Compute nodes
– Local, columnar storage
– Executes queries in parallel
– Load, backup, restore
[Diagram: SQL clients/BI tools connect to the leader node over JDBC/ODBC. The leader node and three compute nodes, each box labeled 128GB RAM, 16TB disk, 16 cores, are interconnected over 10 GigE (HPC). Ingestion, backup, and restore go through S3 / EMR / DynamoDB / SSH.]
Designed for I/O Reduction
§ Columnar storage
§ Data compression
§ Zone maps
aid  loc  dt
1    SFO  2016-09-01
2    JFK  2016-09-14
3    SFO  2017-04-01
4    JFK  2017-05-14
• Accessing dt with row storage:
– Need to read everything
– Unnecessary I/O
CREATE TABLE loft_migration (
    aid INT      --audience_id
   ,loc CHAR(3)  --location
   ,dt  DATE     --date
);
• Accessing dt with columnar storage:
– Only scan blocks for relevant column
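The row versus columnar contrast can be sketched with a toy model. This is only an illustration that counts values touched when a query needs the dt column of the four sample rows; the function names are hypothetical, and real storage works in blocks rather than individual values.

```python
# Toy model: how many stored values must be read to answer a query on one column.
ROWS = [
    (1, "SFO", "2016-09-01"),
    (2, "JFK", "2016-09-14"),
    (3, "SFO", "2017-04-01"),
    (4, "JFK", "2017-05-14"),
]
COLUMNS = ("aid", "loc", "dt")


def values_read_row_store(rows, column):
    """Row storage keeps whole rows together, so every field of every row is read."""
    return sum(len(row) for row in rows)


def values_read_column_store(rows, column):
    """Columnar storage keeps each column separately, so only that column is read."""
    idx = COLUMNS.index(column)
    column_values = [row[idx] for row in rows]
    return len(column_values)


row_cost = values_read_row_store(ROWS, "dt")
col_cost = values_read_column_store(ROWS, "dt")
print(f"row store reads {row_cost} values, column store reads {col_cost}")
```

With three columns the row layout touches three times as many values; the gap widens as tables get wider.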
• Columns grow and shrink independently
• Effective compression ratios due to like data
• Reduces storage requirements
• Reduces I/O
CREATE TABLE loft_migration (
    aid INT     ENCODE LZO
   ,loc CHAR(3) ENCODE BYTEDICT
   ,dt  DATE    ENCODE RUNLENGTH
);
• In-memory block metadata
• Contains per-block MIN and MAX value
• Effectively prunes blocks which cannot contain data for a given query
• Eliminates unnecessary I/O
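A minimal sketch of zone-map pruning, assuming each block keeps the MIN and MAX of the dt values it holds. The Block class and blocks_to_scan function are hypothetical names for illustration; a real engine would store the min/max as precomputed metadata rather than recompute it.

```python
from dataclasses import dataclass


@dataclass
class Block:
    values: list  # dt values in this block; ISO date strings compare in date order

    @property
    def zone(self):
        # Per-block MIN and MAX, recomputed here for simplicity.
        return (min(self.values), max(self.values))


def blocks_to_scan(blocks, lo, hi):
    """Keep only blocks whose [min, max] range overlaps the query range [lo, hi]."""
    kept = []
    for block in blocks:
        block_min, block_max = block.zone
        if block_max >= lo and block_min <= hi:
            kept.append(block)
    return kept


blocks = [
    Block(["2016-09-01", "2016-09-14"]),
    Block(["2017-04-01", "2017-05-14"]),
]
# Query: WHERE dt BETWEEN '2017-01-01' AND '2017-12-31'
hits = blocks_to_scan(blocks, "2017-01-01", "2017-12-31")
print(f"scanning {len(hits)} of {len(blocks)} blocks")
```

The 2016 block is skipped without being read, which is the I/O saving the slide describes. Pruning works best when the column is sorted, so that block ranges do not overlap.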
Terminology and Concepts: Slices
§ A slice can be thought of as a “virtual compute node”
– Unit of data partitioning
– Parallel query processing
§ Facts about slices:
– Each compute node has either 2, 16, or 32 slices
– Table rows are distributed to slices
– A slice processes only its own data
Terminology and Concepts: Data Distribution
• Distribution style is a table property which dictates how that table’s data is distributed throughout the cluster:
• KEY: Value is hashed, same value goes to same location (slice)
• ALL: Full table data goes to first slice of every node
• EVEN: Round robin
• Goals:
• Distribute data evenly for parallel processing
• Minimize data movement during query processing
[Diagram: two nodes with two slices each (Node 1: slices 1 and 2; Node 2: slices 3 and 4), shown once for each distribution style: KEY, ALL, EVEN]
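The three distribution styles can be sketched on the two-node, four-slice layout from the diagram. This is an illustration only: zlib.crc32 is a stand-in hash, not the function Redshift uses, and the function names are hypothetical.

```python
import zlib
from collections import defaultdict
from itertools import cycle

NODES = {1: [1, 2], 2: [3, 4]}  # node -> slices, as in the diagram
ALL_SLICES = [s for slices in NODES.values() for s in slices]


def distribute_key(rows, key_index):
    """KEY: hash the key value; equal values always land on the same slice."""
    placement = defaultdict(list)
    for row in rows:
        h = zlib.crc32(str(row[key_index]).encode("utf-8"))  # stand-in hash
        placement[ALL_SLICES[h % len(ALL_SLICES)]].append(row)
    return placement


def distribute_even(rows):
    """EVEN: round robin across all slices."""
    placement = defaultdict(list)
    for row, slice_id in zip(rows, cycle(ALL_SLICES)):
        placement[slice_id].append(row)
    return placement


def distribute_all(rows):
    """ALL: a full copy of the table on the first slice of every node."""
    return {slices[0]: list(rows) for slices in NODES.values()}


rows = [
    (1, "SFO", "2016-09-01"),
    (2, "JFK", "2016-09-14"),
    (3, "SFO", "2017-04-01"),
    (4, "JFK", "2017-05-14"),
]
by_key = distribute_key(rows, key_index=1)  # distribute on loc
by_even = distribute_even(rows)
by_all = distribute_all(rows)
```

Distributing on loc puts every SFO row on one slice, which helps joins on loc but concentrates rows when a few values dominate. That trade-off is what the two goals on the slide balance.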
DS2.8XL Compute Node
§ Ingestion Throughput:
– Each slice’s query processors can load one file at a time:
§ Streaming decompression
§ Parse
§ Distribute
§ Write
§ With a single input file, only one of the 16 slices (6.25%) is active, so the node is only partially used
[Diagram: slices 0 to 15 of a DS2.8XL compute node]
Data Loading Best Practices
§ Use at least as many input files as there are slices in the cluster
§ With 16 input files, all slices are working so you maximize throughput
§ COPY continues to scale linearly as you add nodes
[Diagram: 16 input files feeding slices 0 to 15 of a DS2.8XL compute node]
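One way to prepare input for a parallel load is to split the export into as many gzip files as the cluster has slices. The sketch below assumes a single DS2.8XL node (16 slices) and pipe-delimited rows; split_for_copy and the file naming are hypothetical, and the sample data is far smaller than real input files would be.

```python
import gzip
import os
import tempfile


def split_for_copy(lines, file_count, out_dir, prefix="part"):
    """Write lines round robin into file_count gzip files and return their paths."""
    paths = [os.path.join(out_dir, f"{prefix}_{i:04d}.gz") for i in range(file_count)]
    handles = [gzip.open(p, "wt", encoding="utf-8") for p in paths]
    try:
        for n, line in enumerate(lines):
            handles[n % file_count].write(line + "\n")
    finally:
        for handle in handles:
            handle.close()
    return paths


SLICES_PER_NODE = 16  # DS2.8XL, as on the slide
NODE_COUNT = 1        # assumption for this example

lines = [f"{i}|SFO|2017-04-01" for i in range(160)]
out_dir = tempfile.mkdtemp()
paths = split_for_copy(lines, SLICES_PER_NODE * NODE_COUNT, out_dir)
print(f"wrote {len(paths)} files to {out_dir}")
```

Using a multiple of the slice count keeps every slice busy for the whole load.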
Data Loading Best Practices Continued
Data Preparation
§ Export Data from Source System
– CSV recommended (delimiter '|')
§ Be aware of UTF-8 varchar columns (a UTF-8 character can take up to 4 bytes)
§ Be aware of your NULL character (\N)
– GZIP compress files
– Split files (1MB – 1GB after gzip compression)
§ Useful COPY Options for PoC Data
– MAXERROR
– ACCEPTINVCHARS
– NULL AS
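The options above can be combined into a single COPY statement. The helper below only assembles the SQL text. The bucket, prefix, and IAM role ARN are hypothetical placeholders, and the example assumes the export writes the literal string NULL for missing values.

```python
def build_copy(table, s3_prefix, iam_role, max_errors=10, null_marker="NULL"):
    """Assemble a COPY statement for gzip-compressed, pipe-delimited files."""
    return "\n".join([
        f"COPY {table}",
        f"FROM '{s3_prefix}'",
        f"IAM_ROLE '{iam_role}'",
        "DELIMITER '|'",
        "GZIP",
        f"MAXERROR {max_errors}",      # tolerate a few bad rows during a PoC
        "ACCEPTINVCHARS",              # replace invalid UTF-8 instead of failing
        f"NULL AS '{null_marker}';",   # string that represents NULL in the files
    ])


statement = build_copy(
    table="loft_migration",
    s3_prefix="s3://example-bucket/loft_migration/part_",             # hypothetical
    iam_role="arn:aws:iam::123456789012:role/ExampleRedshiftRole",     # hypothetical
)
print(statement)
```

These tolerant settings suit a proof of concept; for production loads a lower MAXERROR makes data problems surface early.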
Keep Columns as Narrow as Possible
• Buffers allocated based on declared column width
• Wider than needed columns mean memory is wasted
• Fewer rows fit into memory; increased likelihood of queries spilling to disk
• Check SVV_TABLE_INFO(max_varchar)
• SELECT max(len(col)) FROM table
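Column widths can also be checked before loading. Redshift VARCHAR lengths are declared in bytes, so the sketch below measures the UTF-8 byte length of the source values rather than the character count; max_utf8_bytes is a hypothetical helper and the sample values are made up.

```python
def max_utf8_bytes(values):
    """Largest UTF-8 encoded size among the non-null values, in bytes."""
    return max((len(v.encode("utf-8")) for v in values if v is not None), default=0)


cities = ["SFO", "Zürich", "東京", None]

widest_chars = max(len(v) for v in cities if v is not None)
widest_bytes = max_utf8_bytes(cities)
print(f"longest value: {widest_chars} characters, {widest_bytes} bytes")
```

Here the longest value has 6 characters but needs 7 bytes, so a VARCHAR(6) would be too narrow while VARCHAR(255) would waste buffer space.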
Amazon Redshift is a Data Warehouse
§ Optimized for batch inserts
– The time to insert a single row in Redshift is roughly the same as inserting 100,000 rows
§ Updates are delete + insert of the row
– Deletes mark rows for deletion
§ Blocks are immutable
– Minimum space used is one block per column, per slice
§ Auto Compression
– Samples data automatically when you COPY into an empty table
§ Samples up to 100,000 rows and picks optimal encoding
– Turn off Auto Compression for Staging Tables
§ Bake encodings into your DDL or use CREATE TABLE (LIKE …)
§ ANALYZE COMPRESSION
– Run when the data profile has changed
– Run after changing sort key
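The "one block per column, per slice" rule gives a quick lower bound on table size. The arithmetic below assumes a 1 MB block size and ignores any hidden system columns unless they are passed in explicitly; min_table_size_mb is a hypothetical helper.

```python
BLOCK_SIZE_MB = 1  # assumption: 1 MB blocks


def min_table_size_mb(user_columns, slices, hidden_columns=0):
    """Lower bound on table size: one block per column, per slice."""
    return (user_columns + hidden_columns) * slices * BLOCK_SIZE_MB


# The three-column loft_migration table on one 16-slice node.
small_cluster = min_table_size_mb(user_columns=3, slices=16)

# The same table on a 125-node cluster with 16 slices per node.
large_cluster = min_table_size_mb(user_columns=3, slices=125 * 16)

print(f"{small_cluster} MB on 16 slices, {large_cluster} MB on 2,000 slices")
```

Even a table with a handful of rows can occupy several gigabytes on a large cluster once it has rows on every slice, which is worth keeping in mind for many small staging tables.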
Column (Compression) Encoding Types
Raw encoding (RAW)
Byte-dictionary (BYTEDICT)
Delta encoding (DELTA / DELTA32K)
Mostly encoding (MOSTLY8 / MOSTLY16 / MOSTLY32)
Runlength encoding (RUNLENGTH)
Text encoding (TEXT255 / TEXT32K)
LZO encoding (LZO)
Zstandard (ZSTD)
Average compression: 2-4x
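Two of these encodings are easy to illustrate. The sketch below shows the general idea of run-length and dictionary encoding on small sample columns; it is not Redshift's on-disk format, and the function names are hypothetical.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def byte_dict_encode(values):
    """Replace each value with a small integer index into a dictionary."""
    dictionary, index, codes = [], {}, []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


dt = ["2017-04-01"] * 3 + ["2017-05-14"] * 2
loc = ["SFO", "JFK", "SFO", "JFK"]

dt_runs = run_length_encode(dt)
loc_dictionary, loc_codes = byte_dict_encode(loc)
print(dt_runs)
print(loc_dictionary, loc_codes)
```

Run-length encoding pays off when equal values sit next to each other, as in a sorted date column; dictionary encoding suits columns with few distinct values, such as airport codes.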
Primary/Unique/Foreign Key Constraints
§ Primary/Unique/Foreign Key constraints are NOT enforced
– If you load data multiple times, Amazon Redshift won’t complain
– If you declare primary keys in your DDL, the optimizer will expect the data to be unique
§ Redshift optimizer uses declared constraints to pick optimal plan
– In certain cases it can result in performance improvements
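Because the constraints are not enforced, uniqueness has to be checked by the loading process. A minimal sketch of such a check on rows held in memory follows; duplicate_keys is a hypothetical helper, and on real tables the same check would be a GROUP BY ... HAVING count(*) > 1 query.

```python
from collections import Counter


def duplicate_keys(rows, key_index=0):
    """Return the key values that occur more than once, sorted."""
    counts = Counter(row[key_index] for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


batch = [(1, "SFO"), (2, "JFK")]
loaded_twice = batch + batch  # the same file loaded two times

print(duplicate_keys(batch))
print(duplicate_keys(loaded_twice))
```

Running a check like this after each load keeps declared primary keys truthful, so the optimizer's uniqueness assumption holds.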
Benchmarking Tips
• Verify tables are vacuumed and analyzed
• Check SVV_TABLE_INFO (STATS_OFF, UNSORTED)
• Vacuum & Analyze
• https://github.com/awslabs/amazon-redshift-utils/tree/master/src/AnalyzeVacuumUtility
• Verify column encodings
• Check PG_TABLE_DEF
• Re-encode tables
• https://github.com/awslabs/amazon-redshift-utils/tree/master/src/ColumnEncodingUtility
• Verify good distribution keys
• SVV_TABLE_INFO (SKEW_ROWS)
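The SVV_TABLE_INFO checks can be scripted. The sketch below holds the query as a string and flags tables from its result rows. The thresholds and the sample rows are hypothetical, and running the query against a cluster is left to whatever SQL client is in use.

```python
HEALTH_QUERY = 'SELECT "table", stats_off, unsorted, skew_rows FROM svv_table_info;'


def tables_needing_attention(rows, max_stats_off=10.0, max_unsorted=10.0, max_skew=4.0):
    """Map table name -> suggested actions, based on hypothetical thresholds."""
    flagged = {}
    for name, stats_off, unsorted, skew_rows in rows:
        reasons = []
        if stats_off is not None and stats_off > max_stats_off:
            reasons.append("ANALYZE")
        if unsorted is not None and unsorted > max_unsorted:
            reasons.append("VACUUM")
        if skew_rows is not None and skew_rows > max_skew:
            reasons.append("review distribution key")
        if reasons:
            flagged[name] = reasons
    return flagged


# Made-up result rows: (table, stats_off, unsorted, skew_rows)
sample_rows = [
    ("loft_migration", 0.0, 0.0, 1.0),
    ("events", 35.0, 42.5, 1.2),
    ("devices", 2.0, None, 9.7),
]
report = tables_needing_attention(sample_rows)
print(report)
```

Benchmark numbers taken before these items are cleared mostly measure stale statistics and unsorted data rather than the cluster itself.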
AWS Schema Conversion Tool (AWS SCT)
Convert schema in a few clicks
Sources include Oracle, Teradata, Greenplum, Netezza, and MS SQL DW
Automatic schema optimization
Converts application SQL code
Detailed assessment report
Database migration process
Step 1: Convert or Copy your Schema
Source DB or DW → AWS SCT → Destination DB or DW
Step 2: Move your data
Source DB or DW → AWS SCT → Destination DB or DW
SCT data extractors
Extract Data from your data warehouse and migrate to Amazon Redshift
• Extracts through local migration agents
• Data is optimized for Redshift and saved in local files
• Files are loaded to an Amazon S3 bucket (through network or Amazon Snowball) and then to Amazon Redshift
[Diagram: AWS SCT → S3 bucket → Amazon Redshift]
For Large Data Transfers
§ AWS Snowball
§ AWS Snowball Edge
§ AWS Snowmobile
Amazon Kinesis Firehose
Load massive volumes of streaming data into Amazon S3, Redshift and Elasticsearch
§ Zero administration: Capture and deliver streaming data into Amazon S3, Amazon Redshift, and other destinations without writing an application or managing infrastructure.
§ Direct-to-data store integration: Batch, compress, and encrypt streaming data for delivery into data destinations in as little as 60 secs using simple configurations.
§ Seamless elasticity: Seamlessly scales to match data throughput without intervention
§ Serverless ETL using AWS Lambda - Firehose can invoke your Lambda function to transform incoming source data.
Amazon Kinesis Firehose to Amazon Redshift
Capture and submit streaming data
Firehose loads streaming data continuously into Amazon S3, Redshift and Elasticsearch
Analyze streaming data using your favorite BI tools
Data Integration Partners
Data Integration Systems Integrators
Amazon Redshift
AWS Glue for Automated, Serverless ETL
Data Catalog
Hive metastore compatible metadata repository of data sources
Crawls data source to infer table, data type, partition format
Job Execution
Runs jobs in Spark containers – automatic scaling based on SLA
Glue is serverless – only pay for the resources you consume
Job Authoring
Generates Python code to move data from source to destination
Edit with your favorite IDE; share code snippets using Git
Resources
§ https://github.com/awslabs/amazon-redshift-utils
§ https://github.com/awslabs/amazon-redshift-monitoring
§ https://github.com/awslabs/amazon-redshift-udfs
§ Admin scripts
Collection of utilities for running diagnostics on your cluster
§ Admin views
Collection of utilities for managing your cluster, generating schema DDL, etc.
§ ColumnEncodingUtility
Gives you the ability to apply optimal column encoding to an established schema
with data already loaded
§ Amazon Redshift Engineering’s Advanced Table Design Playbook
https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings-advanced-table-design-playbook-preamble-prerequisites-and-prioritization/
Thank you!
briskman@amazon.com

OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Best Practices for Migrating Data to Redshift

  • 1. Best Practices for Migrating your Data Warehouse to Amazon Redshift Darin Briskman, AWS Technical Evangelist briskman@amazon.com
  • 2. § Redshift can be 100x faster § Scales from GBs to PBs § Analyze data without storage constraints § Redshift is 10x cheaper § Easy to provision and operate § Higher productivity § Redshift can be 10x faster § No programming § Standard interfaces and integration to leverage BI tools, machine learning, streaming Transactional database MPP database Hadoop Why Migrate to Amazon Redshift?
  • 3. Migration from Oracle @ Boingo Wireless 2000+ Commercial Wi-Fi locations 1 million+ Hotspots 90M+ ad engagements 100+ countries Legacy DW: Oracle 11g based DW Before migration Rapid data growth slowed analytics Mediocre IOPS, limited memory, vertical scaling Admin overhead Expensive (license, h/w, support) After migration 180x performance improvement 7x cost savings
  • 4. Migration from Oracle @ Boingo Wireless [Chart: cost of Exadata ($400,000), SAP HANA ($300,000), and Redshift ($55,000); 7X cheaper than Oracle Exadata] [Chart: latency in seconds, existing system vs. Redshift, for query performance (1 year of data) and data load performance (1 million records); plotted values 7,200, 2,700, 15, 15; 180X faster than Oracle database]
  • 5. Migration from Greenplum @ NTT Docomo 68 million customers 10s of TBs per day of data across mobile network 6PB of total data (uncompressed) Data science for marketing operations, logistics etc. Legacy DW: Greenplum on-premises After migration: 125 node DS2.8XL cluster 4,500 vCPUs, 30TB RAM 6 PB uncompressed 10x faster analytic queries 50% reduction in time for new BI app. deployment Significantly less ops. overhead
  • 6. Migration from SQL on Hadoop @ Yahoo § Analytics for website/mobile events across multiple Yahoo properties § On an average day 2B events 25M devices Before migration: Hive – Found it to be slow, hard to use, share and repeat After migration: 21 node DC1.8XL (SSD) 50TB compressed data 100x performance improvement Real-time insights Easier deployment and maintenance
  • 7. Migration from SQL on Hadoop @ Yahoo [Chart: query time in seconds on a log scale from 1 to 10,000, Amazon Redshift vs. Impala, for Count Distinct Devices, Count All Events, Filter Clauses, and Joins]
  • 8. Amazon Redshift Migration Process: Evaluation; Perform PoC and Validation; Design Migration Plan; Migration, Integration and Validation; Operations and optimization
  • 9. How to Migrate? (ENGINE X to Amazon Redshift, four numbered areas) Schema Conversion: map data types; choose compression encoding, sort keys, distribution keys; generate and apply DDL. Database Migration: schema & data transformation; data migration (bulk load, capture updates, transformations). Convert SQL Code: ETL scripts; SQL in reports; ad hoc queries. Assess Gaps: stored procedures; functions.
  • 10. If you forget everything else… • Lift-and-Shift is NOT an ideal approach • Depending where you are coming from, it is sure to fail • AWS has a rich ecosystem of solutions • Your final solution will use other AWS services • AWS Solution Architects, ProServ, and Partners can help
  • 11. Amazon Redshift Cluster Architecture § Massively parallel, shared nothing § Leader node – SQL endpoint – Stores metadata – Coordinates parallel SQL processing § Compute nodes – Local, columnar storage – Executes queries in parallel – Load, backup, restore [Diagram: SQL clients/BI tools connect to the leader node over JDBC/ODBC; the leader node and three compute nodes (each 128GB RAM, 16TB disk, 16 cores) are linked by 10 GigE (HPC); ingestion, backup, and restore paths run to S3 / EMR / DynamoDB / SSH]
  • 12. Designed for I/O Reduction § Columnar storage § Data compression § Zone maps. Example rows (aid, loc, dt): (1, SFO, 2016-09-01), (2, JFK, 2016-09-14), (3, SFO, 2017-04-01), (4, JFK, 2017-05-14) • Accessing dt with row storage: – Need to read everything – Unnecessary I/O. DDL: CREATE TABLE loft_migration ( aid INT /* audience_id */, loc CHAR(3) /* location */, dt DATE /* date */ );
  • 13. Designed for I/O Reduction § Columnar storage § Data compression § Zone maps. Example rows (aid, loc, dt): (1, SFO, 2016-09-01), (2, JFK, 2016-09-14), (3, SFO, 2017-04-01), (4, JFK, 2017-05-14) • Accessing dt with columnar storage: – Only scan blocks for relevant column. DDL: CREATE TABLE loft_migration ( aid INT /* audience_id */, loc CHAR(3) /* location */, dt DATE /* date */ );
  • 14. Designed for I/O Reduction § Columnar storage § Data compression § Zone maps. Example rows (aid, loc, dt): (1, SFO, 2016-09-01), (2, JFK, 2016-09-14), (3, SFO, 2017-04-01), (4, JFK, 2017-05-14) • Columns grow and shrink independently • Effective compression ratios due to like data • Reduces storage requirements • Reduces I/O. DDL: CREATE TABLE loft_migration ( aid INT ENCODE LZO, loc CHAR(3) ENCODE BYTEDICT, dt DATE ENCODE RUNLENGTH );
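
The RUNLENGTH encoding named on slide 14 can be illustrated with a generic run-length encoder. This is a minimal sketch of the general technique, not Redshift's on-disk format; the function names `rle_encode` and `rle_decode` and the sample column are assumptions made for illustration.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(v, sum(1 for _ in group)) for v, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [v for v, n in runs for _ in range(n)]


# A sorted date column repeats the same value many times in a row,
# which is the case where run-length encoding pays off.
dt_column = ["2016-09-01"] * 3 + ["2016-09-14"] * 2
runs = rle_encode(dt_column)
print(runs)
```

Five stored values shrink to two pairs here. On a column with few consecutive repeats the same encoding would store more data than the raw values, which is why the encoding is chosen per column.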
  • 15. Designed for I/O Reduction § Columnar storage § Data compression § Zone maps. Example rows (aid, loc, dt): (1, SFO, 2016-09-01), (2, JFK, 2016-09-14), (3, SFO, 2017-04-01), (4, JFK, 2017-05-14). DDL: CREATE TABLE loft_migration ( aid INT /* audience_id */, loc CHAR(3) /* location */, dt DATE /* date */ ); Zone maps: • In-memory block metadata • Contains per-block MIN and MAX value • Effectively prunes blocks which cannot contain data for a given query • Eliminates unnecessary I/O
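
The block pruning described on slide 15 can be sketched as follows. This is a simplified model under stated assumptions: the `Block` class, the `scan_range` function, and the two-rows-per-block layout are hypothetical, and real blocks hold far more values.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Block:
    """A hypothetical storage block holding values for one column."""
    values: list

    @property
    def zone(self):
        # Zone map entry: the MIN and MAX value stored in this block.
        return min(self.values), max(self.values)


def scan_range(blocks, lo, hi):
    """Return values in [lo, hi] and the number of blocks actually read."""
    matches, blocks_read = [], 0
    for block in blocks:
        block_min, block_max = block.zone
        if block_max < lo or block_min > hi:
            continue  # pruned: this block cannot contain a matching value
        blocks_read += 1
        matches.extend(v for v in block.values if lo <= v <= hi)
    return matches, blocks_read


# The dt column from the slide, split into two blocks of two rows each.
dt_blocks = [
    Block([date(2016, 9, 1), date(2016, 9, 14)]),
    Block([date(2017, 4, 1), date(2017, 5, 14)]),
]

rows, blocks_read = scan_range(dt_blocks, date(2017, 1, 1), date(2017, 12, 31))
print(rows, blocks_read)
```

A query for 2017 dates reads only the second block. Pruning works this well only because the values are sorted, so each block covers a narrow, non-overlapping range.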
  • 16. Terminology and Concepts: Slices § A slice can be thought of like a “virtual compute node” – Unit of data partitioning – Parallel query processing § Facts about slices: – Each compute node has either 2, 16, or 32 slices – Table rows are distributed to slices – A slice processes only its own data
  • 17. Terminology and Concepts: Data Distribution • Distribution style is a table property which dictates how that table’s data is distributed throughout the cluster: • KEY: Value is hashed, same value goes to same location (slice) • ALL: Full table data goes to first slice of every node • EVEN: Round robin • Goals: • Distribute data evenly for parallel processing • Minimize data movement during query processing [Diagram: KEY, ALL, and EVEN distribution across two nodes with two slices each (Node 1: Slices 1 and 2; Node 2: Slices 3 and 4)]
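
The three distribution styles on slide 17 can be modeled with a toy placement function. This is a sketch under assumptions: `distribute` and its parameters are hypothetical names, and `zlib.crc32` stands in for the hash function, which the slides do not specify.

```python
import zlib


def distribute(rows, style, num_slices, slices_per_node=2, key=None):
    """Assign rows to slices under a simplified KEY, EVEN, or ALL model."""
    slices = {s: [] for s in range(num_slices)}
    for i, row in enumerate(rows):
        if style == "KEY":
            # The same key value always hashes to the same slice.
            # crc32 is a stand-in, not the hash Redshift actually uses.
            h = zlib.crc32(str(row[key]).encode("utf-8"))
            slices[h % num_slices].append(row)
        elif style == "EVEN":
            slices[i % num_slices].append(row)  # round robin
        elif style == "ALL":
            # A full copy goes to the first slice of every node.
            for s in range(0, num_slices, slices_per_node):
                slices[s].append(row)
        else:
            raise ValueError(f"unknown distribution style: {style}")
    return slices


rows = [
    {"aid": 1, "loc": "SFO"},
    {"aid": 2, "loc": "JFK"},
    {"aid": 3, "loc": "SFO"},
    {"aid": 4, "loc": "JFK"},
]
for style in ("KEY", "EVEN", "ALL"):
    placed = distribute(rows, style, num_slices=4, key="loc")
    print(style, {s: len(r) for s, r in placed.items()})
```

Distributing on `loc` puts all rows for one airport on one slice, which helps joins on that column but leaves slices idle when the key has few distinct values. That is the trade-off between the two goals listed on the slide.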
  • 18. Data Loading Best Practices: DS2.8XL Compute Node § Ingestion throughput: – Each slice’s query processors can load one file at a time: § Streaming decompression § Parse § Distribute § Write § With a single input file, only 1 of the node’s 16 slices (6.25%) is active, so the node is only partially used [Diagram: slices 0 to 15 of one compute node]
  • 19. Data Loading Best Practices Continued § Use at least as many input files as there are slices in the cluster § With 16 input files, all slices are working so you maximize throughput § COPY continues to scale linearly as you add nodes [Diagram: 16 input files feeding slices 0 to 15 of a DS2.8XL compute node]
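
The advice on slides 18 and 19 amounts to splitting an export into one compressed file per slice before loading. Below is a minimal sketch of that preparation step, assuming a line-oriented export; `split_for_copy` and the `part_NNNN.csv.gz` naming are hypothetical, and uploading to S3 and running COPY are not shown.

```python
import gzip
import os
import tempfile


def split_for_copy(lines, num_slices, out_dir, prefix="part"):
    """Write lines round-robin into num_slices gzip files; return the paths."""
    paths = [
        os.path.join(out_dir, f"{prefix}_{i:04d}.csv.gz")
        for i in range(num_slices)
    ]
    files = [gzip.open(p, "wt", encoding="utf-8", newline="") for p in paths]
    try:
        for i, line in enumerate(lines):
            files[i % num_slices].write(line.rstrip("\n") + "\n")
    finally:
        for f in files:
            f.close()
    return paths


with tempfile.TemporaryDirectory() as tmp:
    export = [f"{i}|SFO|2017-04-01" for i in range(160)]
    parts = split_for_copy(export, num_slices=16, out_dir=tmp)
    counts = []
    for path in parts:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            counts.append(sum(1 for _ in f))

print(len(parts), counts)
```

Round-robin assignment keeps the files close to equal in size, so no slice finishes early and waits for the others. For a multi-node cluster, `num_slices` would be the total across all nodes.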
  • 20. Data Preparation § Export Data from Source System – CSV recommended (delimiter '|') § Be aware of UTF-8 varchar columns (a UTF-8 character can take up to 4 bytes) § Be aware of your NULL character (\N) – GZIP Compress Files – Split Files (1MB – 1GB after gzip compression) § Useful COPY Options for PoC Data – MAXERROR – ACCEPTINVCHARS – NULL AS
  • 21. Keep Columns as Narrow as Possible • Buffers allocated based on declared column width • Wider than needed columns mean memory is wasted • Fewer rows fit into memory; increased likelihood of queries spilling to disk • Check SVV_TABLE_INFO(max_varchar) • SELECT max(len(col)) FROM table
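
One way to act on slide 21 before loading is to measure the widest value per column in the exported data. The sketch below is an illustration with assumed names (`max_byte_widths`, the sample rows); it measures UTF-8 bytes rather than characters because Redshift declares VARCHAR lengths in bytes.

```python
def max_byte_widths(rows, columns):
    """Return the longest UTF-8 byte length seen for each column."""
    widths = {c: 0 for c in columns}
    for row in rows:
        for c in columns:
            value = row.get(c)
            if value is not None:
                # Multi-byte characters count for more than one byte each.
                widths[c] = max(widths[c], len(str(value).encode("utf-8")))
    return widths


sample = [
    {"loc": "SFO", "city": "Oslo"},
    {"loc": "ZRH", "city": "Z\u00fcrich"},      # 6 characters, 7 bytes
    {"loc": "HND", "city": "\u6771\u4eac"},     # 2 characters, 6 bytes
    {"loc": "JFK", "city": None},
]
print(max_byte_widths(sample, ["loc", "city"]))
```

The result gives a lower bound for each declared width. Some headroom above the observed maximum is usually sensible, since later loads may contain longer values than the sample.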
  • 22. Amazon Redshift is a Data Warehouse § Optimized for batch inserts – The time to insert a single row in Redshift is roughly the same as inserting 100,000 rows § Updates are delete + insert of the row • Deletes mark rows for deletion § Blocks are immutable – Minimum space used is one block per column, per slice
  • 23. Column Compression § Auto Compression – Samples data automatically when you COPY into an empty table § Samples up to 100,000 rows and picks optimal encoding – Turn off Auto Compression for Staging Tables § Bake encodings into your DDL or use CREATE TABLE (LIKE …) § Analyze Compression – Run when the data profile has changed – Run after changing sort key
  • 24. Column (Compression) Encoding Types: Raw encoding (RAW); Byte-dictionary (BYTEDICT); Delta encoding (DELTA / DELTA32K); Mostly encoding (MOSTLY8 / MOSTLY16 / MOSTLY32); Runlength encoding (RUNLENGTH); Text encoding (TEXT255 / TEXT32K); LZO encoding (LZO); Zstandard (ZSTD). Average compression: 2-4x
  • 25. Primary/Unique/Foreign Key Constraints § Primary/Unique/Foreign Key constraints are NOT enforced – If you load data multiple times, Amazon Redshift won’t complain – If you declare primary keys in your DDL, the optimizer will expect the data to be unique § Redshift optimizer uses declared constraints to pick optimal plan – In certain cases it can result in performance improvements
  • 26. Benchmarking Tips • Verify tables are vacuumed and analyzed • Check SVV_TABLE_INFO (STATS_OFF, UNSORTED) • Vacuum & Analyze • https://github.com/awslabs/amazon-redshift-utils/tree/master/src/AnalyzeVacuumUtility • Verify column encodings • Check PG_TABLE_DEF • Re-encode tables • https://github.com/awslabs/amazon-redshift-utils/tree/master/src/ColumnEncodingUtility • Verify good distribution keys • SVV_TABLE_INFO (SKEW_ROWS)
  • 27. AWS Schema Conversion Tool (AWS SCT) • Convert schema in a few clicks • Sources include Oracle, Teradata, Greenplum, Netezza, and MS SQL DW • Automatic schema optimization • Converts application SQL code • Detailed assessment report
  • 29. Database migration process Step 1: Convert or Copy your Schema (Source DB or DW → AWS SCT → Destination DB or DW) Step 2: Move your data (Source DB or DW → AWS SCT → Destination DB or DW)
  • 30. SCT data extractors: Extract data from your data warehouse and migrate to Amazon Redshift • Extracts through local migration agents • Data is optimized for Redshift and saved in local files • Files are loaded to an Amazon S3 bucket (through network or Amazon Snowball) and then to Amazon Redshift [Diagram: AWS SCT → S3 Bucket → Amazon Redshift]
  • 31. For Large Data Transfers § AWS Snowball § AWS Snowball Edge § AWS Snowmobile
  • 32. Amazon Kinesis Firehose: Load massive volumes of streaming data into Amazon S3, Redshift and Elasticsearch § Zero administration: Capture and deliver streaming data into Amazon S3, Amazon Redshift, and other destinations without writing an application or managing infrastructure. § Direct-to-data store integration: Batch, compress, and encrypt streaming data for delivery into data destinations in as little as 60 seconds using simple configurations. § Seamless elasticity: Scales to match data throughput without intervention. § Serverless ETL using AWS Lambda: Firehose can invoke your Lambda function to transform incoming source data. [Diagram: capture and submit streaming data; Firehose loads it continuously into Amazon S3, Redshift and Elasticsearch; analyze it using your favorite BI tools]
  • 33. Amazon Kinesis Firehose to Amazon Redshift
  • 34. Data Integration Partners Data Integration Systems Integrators Amazon Redshift
  • 35. AWS Glue for Automated, Serverless ETL. Data Catalog: Hive metastore compatible metadata repository of data sources; crawls data source to infer table, data type, partition format. Job Authoring: generates Python code to move data from source to destination; edit with your favorite IDE; share code snippets using Git. Job Execution: runs jobs in Spark containers, with automatic scaling based on SLA; Glue is serverless, so you only pay for the resources you consume.
  • 36. Resources § https://github.com/awslabs/amazon-redshift-utils § https://github.com/awslabs/amazon-redshift-monitoring § https://github.com/awslabs/amazon-redshift-udfs § Admin scripts: Collection of utilities for running diagnostics on your cluster § Admin views: Collection of utilities for managing your cluster, generating schema DDL, etc. § ColumnEncodingUtility: Gives you the ability to apply optimal column encoding to an established schema with data already loaded § Amazon Redshift Engineering’s Advanced Table Design Playbook: https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings-advanced-table-design-playbook-preamble-prerequisites-and-prioritization/