This spring, the data warehouse team at Ancestry flawlessly migrated and validated nearly half a trillion records from Actian Matrix to Amazon Redshift. During this session, the Ancestry team will describe how they orchestrated the entire migration in less than four months, discuss the technical challenges they faced and overcame along the way, and share tips and tricks for avoiding common pitfalls of data warehouse migrations. They will also highlight how they tuned and optimized the Amazon Redshift environment, adopted Redshift Spectrum, and leveraged their collaboration with Amazon to deliver a powerful customer experience.
2. Deep Dive Overview
• Amazon Redshift history and development
• Cluster architecture
• Concepts and terminology
• New & upcoming features
• Ancestry’s journey to Amazon Redshift
• Open Q&A
9. Designed for I/O Reduction
Columnar storage
Data compression
Zone maps
CREATE TABLE deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
);
aid loc dt
1 SFO 2016-09-01
2 JFK 2016-09-14
3 SFO 2017-04-01
4 JFK 2017-05-14
• Accessing dt with row storage:
o Need to read everything
o Unnecessary I/O
10. Designed for I/O Reduction
Columnar storage
Data compression
Zone maps
CREATE TABLE deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
);
aid loc dt
1 SFO 2016-09-01
2 JFK 2016-09-14
3 SFO 2017-04-01
4 JFK 2017-05-14
• Accessing dt with columnar storage
o Only scan blocks for relevant column
11. Designed for I/O Reduction
Columnar storage
Data compression
Zone maps
CREATE TABLE deep_dive (
aid INT ENCODE LZO
,loc CHAR(3) ENCODE BYTEDICT
,dt DATE ENCODE RUNLENGTH
);
aid loc dt
1 SFO 2016-09-01
2 JFK 2016-09-14
3 SFO 2017-04-01
4 JFK 2017-05-14
• Columns grow and shrink independently
• Reduces storage requirements
• Reduces I/O
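Not part of the deck — a minimal sketch of letting Amazon Redshift recommend encodings for this table and then checking what is applied (assumes the deep_dive table above exists):
-- Sample the table and report a suggested compression encoding per column
ANALYZE COMPRESSION deep_dive;
-- Inspect the encoding currently defined on each column
SELECT "column", type, encoding
FROM pg_table_def
WHERE tablename = 'deep_dive';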
12. Designed for I/O Reduction
Columnar storage
Data compression
Zone maps
aid loc dt
1 SFO 2016-09-01
2 JFK 2016-09-14
3 SFO 2017-04-01
4 JFK 2017-05-14
CREATE TABLE deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
);
• In-memory block metadata
• Contains per-block MIN and MAX value
• Effectively prunes blocks that cannot contain
data for a given query
• Eliminates unnecessary I/O
13. SELECT COUNT(*) FROM deep_dive WHERE dt = '09-JUNE-2013'
Zone maps (per-block MIN/MAX on dt):
Unsorted table — ranges overlap, so most blocks could contain 09-JUNE-2013 and must be scanned:
• MIN: 01-JUNE-2013, MAX: 20-JUNE-2013
• MIN: 08-JUNE-2013, MAX: 30-JUNE-2013
• MIN: 12-JUNE-2013, MAX: 20-JUNE-2013
• MIN: 02-JUNE-2013, MAX: 25-JUNE-2013
Sorted by date — ranges do not overlap, so only the single block covering 09-JUNE-2013 is scanned:
• MIN: 01-JUNE-2013, MAX: 06-JUNE-2013
• MIN: 07-JUNE-2013, MAX: 12-JUNE-2013
• MIN: 13-JUNE-2013, MAX: 18-JUNE-2013
• MIN: 19-JUNE-2013, MAX: 24-JUNE-2013
14. Terminology and Concepts: Data Sorting
• Goals:
• Physically order rows of table data based on certain column(s)
• Optimize effectiveness of zone maps
• Enable MERGE JOIN operations
• Impact:
• Enables range-restricted scans (rrscans) to prune blocks by leveraging zone maps
• Overall reduction in block I/O
• Achieved with the table property SORTKEY defined over one or more columns
• Optimal SORTKEY is dependent on:
• Query patterns
• Data profile
• Business requirements
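As an illustration (not from the deck): if most queries filter on dt, a sort key on dt keeps the per-block zone maps narrow, and svv_table_info shows how much of the table is unsorted:
CREATE TABLE deep_dive (
  aid INT      --audience_id
  ,loc CHAR(3) --location
  ,dt DATE     --date
) SORTKEY (dt);
-- A high "unsorted" percentage suggests a VACUUM is due to restore sort order
SELECT "table", sortkey1, unsorted
FROM svv_table_info
WHERE "table" = 'deep_dive';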
15. Terminology and Concepts: Slices
A slice can be thought of as a “virtual compute node”
• Unit of data partitioning
• Parallel query processing
Facts about slices:
• Each compute node has either 2, 16, or 32 slices
• Table rows are distributed to slices
• A slice processes only its own data
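A quick way to see the slice layout of a cluster (a sketch using the stv_slices system view):
-- One row per slice; grouping by node shows how many slices each compute node has
SELECT node, COUNT(*) AS slices_per_node
FROM stv_slices
GROUP BY node
ORDER BY node;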
16. Data Distribution
• Distribution style is a table property which dictates how that table’s data is
distributed throughout the cluster:
• KEY: Value is hashed, same value goes to same location (slice)
• ALL: Full table data goes to first slice of every node
• EVEN: Round robin
• Goals:
• Distribute data evenly for parallel processing
• Minimize data movement during query processing
Diagram: KEY, ALL, and EVEN placement of rows across Node 1 (Slices 1–2) and Node 2 (Slices 3–4).
17. Data Distribution: Example
CREATE TABLE deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
) DISTSTYLE (EVEN|KEY|ALL);
Diagram: compute nodes CN1 (Slices 0–1) and CN2 (Slices 2–3); on each slice, table deep_dive stores the user columns (aid, loc, dt) plus system columns (ins, del, row).
18. Data Distribution: EVEN Example
CREATE TABLE deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
) DISTSTYLE EVEN;
INSERT INTO deep_dive VALUES
(1, 'SFO', '2016-09-01'),
(2, 'JFK', '2016-09-14'),
(3, 'SFO', '2017-04-01'),
(4, 'JFK', '2017-05-14');
Result: EVEN distributes rows round robin, so each of the four slices gets one row (Rows: 1, 1, 1, 1).
(3 User Columns + 3 System Columns) x (4 slices) = 24 Blocks (24 MB)
19. Data Distribution: KEY Example #1
CREATE TABLE deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
) DISTSTYLE KEY DISTKEY (loc);
INSERT INTO deep_dive VALUES
(1, 'SFO', '2016-09-01'),
(2, 'JFK', '2016-09-14'),
(3, 'SFO', '2017-04-01'),
(4, 'JFK', '2017-05-14');
Result: rows are hashed on loc, so the two 'SFO' rows land on one slice and the two 'JFK' rows on another; only two of the four slices hold data (Rows: 2, 2, 0, 0).
(3 User Columns + 3 System Columns) x (2 slices) = 12 Blocks (12 MB)
20. Data Distribution: KEY Example #2
CREATE TABLE deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
) DISTSTYLE KEY DISTKEY (aid);
INSERT INTO deep_dive VALUES
(1, 'SFO', '2016-09-01'),
(2, 'JFK', '2016-09-14'),
(3, 'SFO', '2017-04-01'),
(4, 'JFK', '2017-05-14');
Result: aid has four distinct values that hash to different slices, so each slice gets one row (Rows: 1, 1, 1, 1) — the same spread as EVEN in this case.
(3 User Columns + 3 System Columns) x (4 slices) = 24 Blocks (24 MB)
21. Data Distribution: ALL Example
CREATE TABLE loft_deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
) DISTSTYLE ALL;
INSERT INTO deep_dive VALUES
(1, 'SFO', '2016-09-01'),
(2, 'JFK', '2016-09-14'),
(3, 'SFO', '2017-04-01'),
(4, 'JFK', '2017-05-14');
Result: the full table is copied to the first slice of each node, so two slices each hold all four rows (Rows: 4, 4).
(3 User Columns + 3 System Columns) x (2 slices) = 12 Blocks (12 MB)
22. Terminology and Concepts: Data Distribution
KEY
• Use when the distribution key yields an even distribution of data
• Use for joins between large fact/dimension tables
• Optimizes merge joins and GROUP BY
ALL
• Small and medium-sized dimension tables (< 2–3M rows)
EVEN
• Use when a key cannot produce an even distribution
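To sanity-check a chosen distribution style, svv_table_info reports the style and the row skew across slices (a sketch, not part of the deck):
-- skew_rows is the ratio of rows on the fullest slice to rows on the emptiest; values near 1 mean an even spread
SELECT "table", diststyle, tbl_rows, skew_rows
FROM svv_table_info
ORDER BY skew_rows DESC;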
24. Enter Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes
• Fast at exabyte scale
• Elastic & highly available
• On-demand, pay-per-query
• High concurrency: multiple clusters access the same data
• No ETL: query data in place using open file formats
• Full Amazon Redshift SQL support
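For context, a minimal Spectrum setup looks roughly like the following; the schema name, catalog database, IAM role ARN, and S3 paths are placeholders:
-- Register an external schema backed by the AWS Glue / Athena data catalog
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
-- Define an external table over Parquet files in S3 (no data is loaded into the cluster)
CREATE EXTERNAL TABLE spectrum.deep_dive_ext (
  aid INT
  ,loc CHAR(3)
  ,dt DATE
)
STORED AS PARQUET
LOCATION 's3://example-bucket/deep_dive/';
-- The S3 scan runs on Spectrum nodes; the cluster only aggregates the results
SELECT loc, COUNT(*) FROM spectrum.deep_dive_ext GROUP BY loc;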
25. Paradigm Shift Enabled by Redshift Spectrum
• Traditional approach — analyze subsets of data: you had to pick and choose which data to analyze, limited to the data that fits in your data warehouse
• Redshift Spectrum approach — analyze ALL available data: analyze any of the data in your data lake
27. Amazon Redshift Spectrum is secure
• End-to-end data encryption: encrypt S3 data using SSE and AWS KMS; encrypt all Amazon Redshift data using KMS, AWS CloudHSM, or your on-premises HSMs; enforce SSL with perfect forward secrecy using ECDHE
• Virtual private cloud: Amazon Redshift leader node in your VPC; compute nodes in a private VPC; Redshift Spectrum nodes in a private VPC and store no state
• Alerts & notifications: communicate event-specific notifications via email, text message, or call with Amazon SNS
• Audit logging: all API calls are logged using AWS CloudTrail; all SQL statements are logged within Amazon Redshift
• Certifications & compliance: PCI DSS, FedRAMP, SOC 1/2/3, HIPAA/BAA
28. Customers Love Amazon Redshift Spectrum
“Redshift Spectrum will let us expand the universe of the data we analyze to hundreds of petabytes over time. This is truly a game changer, and we can think of no other system in the world that can get us there.”
“Multiple teams can now query the same Amazon S3 data sets using both Amazon Redshift and Amazon EMR.”
“Redshift Spectrum will help us scale yet further while also lowering our costs.”
“Redshift Spectrum’s fast performance across massive data sets is unprecedented.”
“Redshift Spectrum enables us to directly operate on our data in its native format in Amazon S3 with no preprocessing or transformation.”
“Our data science team using Amazon EMR can now collaborate with our marketing and product teams using Redshift Spectrum to analyze the same Amazon S3 data sets.”
29. Query Monitoring Rules (QMR)
Allow automatic handling of runaway (poorly written) queries
• Metrics with operators and values (e.g., query_cpu_time > 1000) create a predicate
• Multiple predicates can be AND-ed together to create a rule
• Multiple rules can be defined for a queue in WLM; these rules are OR-ed together
If { rule } then [action]
• { rule : metric operator value }, e.g.: rows_scanned > 100000
• Metric: cpu_time, query_blocks_read, rows_scanned, query_execution_time, CPU & I/O skew per slice, join_row_count, etc.
• Operator: <, >, ==
• Value: integer
• [action]: hop, log, abort
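Rules live in the WLM configuration; once they fire, the action taken is recorded in a system log that can be queried afterwards (a sketch; columns as documented for STL_WLM_RULE_ACTION):
-- Each row is a query that violated a QMR rule and the resulting action (log, hop, or abort)
SELECT userid, query, service_class, rule, action, recordtime
FROM stl_wlm_rule_action
ORDER BY recordtime DESC
LIMIT 20;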
30. IAM Authentication
Single sign-on for BI tools, SQL clients, and analytics tools
Diagram: client applications connect to Amazon Redshift over ODBC/JDBC; identity providers (e.g., ADFS backed by corporate Active Directory) together with IAM map users to Amazon Redshift user groups or an individual user.
New Amazon Redshift ODBC/JDBC drivers grab the ticket (userid) and get a SAML assertion.
31. Coming Soon: SQL Scalar User-Defined Functions
Language SQL support added for scalar UDFs
Example:
CREATE OR REPLACE FUNCTION inet_ntoa(bigint)
RETURNS varchar(15)
IMMUTABLE
AS
$$
SELECT
  CASE WHEN $1 BETWEEN 0 AND 4294967295 THEN
    (($1 >> 24) & 255) || '.' ||
    (($1 >> 16) & 255) || '.' ||
    (($1 >> 8) & 255) || '.' ||
    (($1 >> 0) & 255)
  ELSE
    NULL
  END
$$ LANGUAGE sql;
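A quick sanity check of the function (3232235777 is the integer form of 192.168.1.1):
SELECT inet_ntoa(3232235777);  -- should return 192.168.1.1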
33. 20 billion historical records
90 million family trees
10 billion profiles
330 million user-generated photos and stories
5 million people in genomics networks
34. Why DNA matching is so powerful
• Number of people in the room: 200
• Number of ways I could pick two different people in this room: 19,900 (200 × 199 / 2)
• Chance that a random pair of individuals shares enough DNA to be considered fourth cousins: 0.0001875
• Chance that there is at least one pair of genetic fourth cousins in the room: 97.6% (1 − (1 − 0.0001875)^19,900)
35. Where we were
• Actian Matrix
• On Premises
• Hardware Bound
• Single Admin
• SSIS for ETL
• 161 Packages
• 50+ Data Sources
36. How we successfully migrated.
Step 1. PLAN
• Document ALL Components That Need to Migrate
• Identify Dependencies
• Prioritize Importance
• Create Roadmap with Milestones
• Align Resources to Swimlanes
• Rally Team
37. Roadmap
Timeline: JAN–APR (Q1–Q2); today: Apr 18
Workstreams (Infrastructure and Data Warehouse swimlanes):
• AWS Infrastructure (VPC, Subnet, Network)
• AWS and DW Security (ACL, Groups, IAM, AIM, Encryption)
• DW Environment Design & PoC
• DW Infra. Configuration (EDW)
• DW Infra. Configuration (SOX, Tableau)
• DW Infra. (Monitoring, DR)
• Design Extract; Extract and Transfer Data
• Design S3 to Amazon Redshift Load; Load Schema & Data into Amazon Redshift
• Refactor and Test ELT Packages (161 Packages, 5 Resources 80% allocated)
• Migrate to AWS; Run ELT to “Catch Up” (amount of time determines ELT “Catch Up” time)
• Export, Transfer and Load SQL Server Data
• Backup Tableau & MicroStrategy; Restore to AWS; Validate Tableau & MicroStrategy (900 Tableau Workbooks)
• Validate Environment & Cutover; Run MATRIX and Amazon Redshift in Parallel
Milestones:
• JAN 27 – Core Infrastructure Established; PoC Complete
• FEB 7 – Amazon Redshift Env. Established
• FEB 24 – Data Loaded onto S3
• MAR 17 – Data Loaded into Amazon Redshift; ELT Packages Migrated to AWS and Verified
• APR 7 – Redshift “Caught Up”; Begin Running in Parallel
• APR 20 – Business Live on Amazon Redshift
• Cutover to Amazon Redshift completed 9 days
ahead of schedule! Business is live on
Amazon Redshift.
• ETL and Tools Migration 96% complete
• Tableau/MicroStrategy Migration 94% complete
38. How we successfully migrated.
Step 2. ADJUST
• Review Tasks and Roadmap Frequently
• There Will Be Unknowns
• Evaluate Impacts
• Be Flexible and Agile
39. How we successfully migrated.
Step 3. COMMUNICATE
• Internal with Team
• Up with Executives and Management
• Out with Customers
• Be Transparent, Honest, and Realistic
40. Data Warehouse Migration Status Scorecard – 4/18
• AWS Infrastructure: Common Infrastructure – 100% complete
• Security Management – 100% complete
• DW Infrastructure Configuration – 100% complete; Monitoring and DR Setup – 100% complete
• DW Data Extract and Load to AWS – 100% complete
• ETL and Tools Migration – 96% complete
o 100% (165/165) migration complete
o 100% (127/127) ETL catchup complete
o 85% (108/127) validation complete
o In progress: 4/19 – 85% validation complete; 4/28 – Run MATRIX and Amazon Redshift in parallel, 100% validation complete
• SQL Server Data Migration – 100% complete
• Tableau and MicroStrategy Migration: Tableau/MicroStrategy Migration – 94% complete
o Final Tableau Backup and Restore
o 4/12 – Test connectivity using the ELB/ENI for Tableau
o In progress: 4/18 – Upgrade MicroStrategy; 4/19 – MicroStrategy Validation; 4/21 – Tableau Validation
Status legend: Not on schedule & high risk to complete on time / On schedule & moderate risk to complete on time / On track to deliver on time
41. How we successfully migrated.
Step 4. RUN PARALLEL
• Start as SOON as Possible
• Quickly Identify Issues
• Added Work, but WORTH IT!
• Insights Are Invaluable
42. How we successfully migrated.
Step 5. HAVE THE RIGHT PARTNER
• We Were Scared
• Meet Often
• Communicate Concerns
• Amazon Has Been AMAZING
43. How we successfully migrated.
End result:
• Anticlimactic Failover
• Flawless Migration
• Zero Business Downtime
• 88 Business Days. 9 Days EARLY!
• 470 BILLION Records. Migrated AND Validated. 100% Accurate.
• 161 ETL Packages Migrated
• 600+ Tableau Workbooks Migrated
44. Lessons learned during migration
1. AWS Infrastructure Creation
2. Data Migration
3. Code Migration
45. Lessons learned during AWS infrastructure creation
How do I set up a production-level AWS account with all appropriate services in 3 weeks?
• Preparation
• Practice
• Patience
46. Lessons learned during data migration
How do I get 80 TB of highly compressed data to the cloud
securely, quickly, and accurately?
• AWS Snowball method
• AWS Direct Connect method
48. Lessons learned during code migration
How do I migrate 161 ETL code packages
with no business logic loss or data loss?
• Automation of changes
• Strict tracking of changes
• No deleting source data
49. Optimized for the cloud
Changes we have made since migrations
• Change in mindset
• Node size and type change: 96 DC1.8XL -> 47 DS2.8XL
• Rewrite of ETL code to leverage commit blocks, temp tables, dist
keys, sort keys etc.
Before – 2 commits hitting the commit queue:
DROP TABLE IF EXISTS foo.me;
CREATE TABLE foo.me AS (
  SELECT x, y
  FROM foo.you
);
After – 1 commit:
BEGIN;
DROP TABLE IF EXISTS foo.me;
CREATE TABLE foo.me AS (
  SELECT x, y
  FROM foo.you
);
COMMIT;
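To confirm that commit queuing is actually the bottleneck, the commit statistics log can be checked (a sketch; columns as documented for STL_COMMIT_STATS):
-- Time each transaction spent waiting in the commit queue before the commit work started
SELECT xid, node, queuelen,
       DATEDIFF(ms, startqueue, startwork) AS queue_wait_ms
FROM stl_commit_stats
ORDER BY startwork DESC
LIMIT 20;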
50. Amazon Redshift Spectrum use case
• Had 3 simple use cases of analysis that needed to be
performed on the data
• Loaded 20 TB of data from source data systems to Amazon S3 in raw form, then converted it to Parquet
• Partitioned by day
• Delivered analysis that was previously impossible to
produce
• https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
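The deck does not show the DDL, but day partitioning in Spectrum looks roughly like this; the table, columns, bucket, and dates are hypothetical:
-- External table partitioned by day over Parquet data in S3
CREATE EXTERNAL TABLE spectrum.events (
  event_id BIGINT
  ,payload VARCHAR(1000)
)
PARTITIONED BY (dt DATE)
STORED AS PARQUET
LOCATION 's3://example-bucket/events/';
-- Each day's prefix is registered as a partition so queries can prune on dt
ALTER TABLE spectrum.events
ADD PARTITION (dt = '2017-05-01')
LOCATION 's3://example-bucket/events/dt=2017-05-01/';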