This spring, the data warehouse team at Ancestry flawlessly migrated and validated nearly half a trillion records from Actian Matrix to Amazon Redshift. During this session, the Ancestry team will describe how they orchestrated the entire migration in less than four months, discuss the technical challenges they faced and overcame along the way, and share tips and tricks for avoiding common pitfalls of data warehouse migrations. They will also highlight how they tuned and optimized the Amazon Redshift environment, adopted Redshift Spectrum, and leveraged their collaboration with Amazon to deliver a powerful customer experience.
2. Deep Dive Overview
• Amazon Redshift history and development
• Cluster architecture
• Concepts and terminology
• New & upcoming features
• Ancestry’s journey to Amazon Redshift
• Open Q&A
9. Designed for I/O Reduction
Columnar storage
Data compression
Zone maps
CREATE TABLE deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
);
aid loc dt
1 SFO 2016-09-01
2 JFK 2016-09-14
3 SFO 2017-04-01
4 JFK 2017-05-14
• Accessing dt with row storage:
o Need to read everything
o Unnecessary I/O
10. Designed for I/O Reduction
Columnar storage
Data compression
Zone maps
CREATE TABLE deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
);
aid loc dt
1 SFO 2016-09-01
2 JFK 2016-09-14
3 SFO 2017-04-01
4 JFK 2017-05-14
• Accessing dt with columnar storage
o Only scan blocks for relevant column
11. Designed for I/O Reduction
Columnar storage
Data compression
Zone maps
CREATE TABLE deep_dive (
aid INT ENCODE LZO
,loc CHAR(3) ENCODE BYTEDICT
,dt DATE ENCODE RUNLENGTH
);
aid loc dt
1 SFO 2016-09-01
2 JFK 2016-09-14
3 SFO 2017-04-01
4 JFK 2017-05-14
• Columns grow and shrink independently
• Reduces storage requirements
• Reduces I/O
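Not part of the deck — a minimal sketch of letting Amazon Redshift recommend encodings for this table and then checking what is applied (assumes the deep_dive table above exists):
-- Sample the table and report a suggested compression encoding per column
ANALYZE COMPRESSION deep_dive;
-- Inspect the encoding currently defined on each column
SELECT "column", type, encoding
FROM pg_table_def
WHERE tablename = 'deep_dive';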
12. Designed for I/O Reduction
Columnar storage
Data compression
Zone maps
aid loc dt
1 SFO 2016-09-01
2 JFK 2016-09-14
3 SFO 2017-04-01
4 JFK 2017-05-14
CREATE TABLE deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
);
• In-memory block metadata
• Contains per-block MIN and MAX value
• Effectively prunes blocks that cannot contain
data for a given query
• Eliminates unnecessary I/O
13. SELECT COUNT(*) FROM deep_dive WHERE dt = '09-JUNE-2013'
Zone maps (per-block MIN/MAX on dt):
Unsorted table — ranges overlap, so most blocks could contain 09-JUNE-2013 and must be scanned:
• MIN: 01-JUNE-2013, MAX: 20-JUNE-2013
• MIN: 08-JUNE-2013, MAX: 30-JUNE-2013
• MIN: 12-JUNE-2013, MAX: 20-JUNE-2013
• MIN: 02-JUNE-2013, MAX: 25-JUNE-2013
Sorted by date — ranges do not overlap, so only the single block covering 09-JUNE-2013 is scanned:
• MIN: 01-JUNE-2013, MAX: 06-JUNE-2013
• MIN: 07-JUNE-2013, MAX: 12-JUNE-2013
• MIN: 13-JUNE-2013, MAX: 18-JUNE-2013
• MIN: 19-JUNE-2013, MAX: 24-JUNE-2013
14. Terminology and Concepts: Data Sorting
• Goals:
• Physically order rows of table data based on certain column(s)
• Optimize effectiveness of zone maps
• Enable MERGE JOIN operations
• Impact:
• Enables range-restricted scans (rrscans) to prune blocks by leveraging zone maps
• Overall reduction in block I/O
• Achieved with the table property SORTKEY defined over one or more columns
• Optimal SORTKEY is dependent on:
• Query patterns
• Data profile
• Business requirements
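As an illustration (not from the deck): if most queries filter on dt, a sort key on dt keeps the per-block zone maps narrow, and svv_table_info shows how much of the table is unsorted:
CREATE TABLE deep_dive (
  aid INT      --audience_id
  ,loc CHAR(3) --location
  ,dt DATE     --date
) SORTKEY (dt);
-- A high "unsorted" percentage suggests a VACUUM is due to restore sort order
SELECT "table", sortkey1, unsorted
FROM svv_table_info
WHERE "table" = 'deep_dive';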
15. Terminology and Concepts: Slices
A slice can be thought of as a “virtual compute node”
• Unit of data partitioning
• Parallel query processing
Facts about slices:
• Each compute node has either 2, 16, or 32 slices
• Table rows are distributed to slices
• A slice processes only its own data
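A quick way to see the slice layout of a cluster (a sketch using the stv_slices system view):
-- One row per slice; grouping by node shows how many slices each compute node has
SELECT node, COUNT(*) AS slices_per_node
FROM stv_slices
GROUP BY node
ORDER BY node;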
16. Data Distribution
• Distribution style is a table property which dictates how that table’s data is
distributed throughout the cluster:
• KEY: Value is hashed, same value goes to same location (slice)
• ALL: Full table data goes to first slice of every node
• EVEN: Round robin
• Goals:
• Distribute data evenly for parallel processing
• Minimize data movement during query processing
Diagram: KEY, ALL, and EVEN placement of rows across Node 1 (Slices 1–2) and Node 2 (Slices 3–4).
17. Data Distribution: Example
CREATE TABLE deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
) DISTSTYLE (EVEN|KEY|ALL);
Diagram: compute nodes CN1 (Slices 0–1) and CN2 (Slices 2–3); on each slice, table deep_dive stores the user columns (aid, loc, dt) plus system columns (ins, del, row).
18. Data Distribution: EVEN Example
CREATE TABLE deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
) DISTSTYLE EVEN;
INSERT INTO deep_dive VALUES
(1, 'SFO', '2016-09-01'),
(2, 'JFK', '2016-09-14'),
(3, 'SFO', '2017-04-01'),
(4, 'JFK', '2017-05-14');
Result: EVEN distributes rows round robin, so each of the four slices gets one row (Rows: 1, 1, 1, 1).
(3 User Columns + 3 System Columns) x (4 slices) = 24 Blocks (24 MB)
19. Data Distribution: KEY Example #1
CREATE TABLE deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
) DISTSTYLE KEY DISTKEY (loc);
INSERT INTO deep_dive VALUES
(1, 'SFO', '2016-09-01'),
(2, 'JFK', '2016-09-14'),
(3, 'SFO', '2017-04-01'),
(4, 'JFK', '2017-05-14');
Result: rows are hashed on loc, so the two 'SFO' rows land on one slice and the two 'JFK' rows on another; only two of the four slices hold data (Rows: 2, 2, 0, 0).
(3 User Columns + 3 System Columns) x (2 slices) = 12 Blocks (12 MB)
20. Data Distribution: KEY Example #2
CREATE TABLE deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
) DISTSTYLE KEY DISTKEY (aid);
INSERT INTO deep_dive VALUES
(1, 'SFO', '2016-09-01'),
(2, 'JFK', '2016-09-14'),
(3, 'SFO', '2017-04-01'),
(4, 'JFK', '2017-05-14');
Result: aid has four distinct values that hash to different slices, so each slice gets one row (Rows: 1, 1, 1, 1) — the same spread as EVEN in this case.
(3 User Columns + 3 System Columns) x (4 slices) = 24 Blocks (24 MB)
21. Data Distribution: ALL Example
CREATE TABLE loft_deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
) DISTSTYLE ALL;
INSERT INTO deep_dive VALUES
(1, 'SFO', '2016-09-01'),
(2, 'JFK', '2016-09-14'),
(3, 'SFO', '2017-04-01'),
(4, 'JFK', '2017-05-14');
Result: the full table is copied to the first slice of each node, so two slices each hold all four rows (Rows: 4, 4).
(3 User Columns + 3 System Columns) x (2 slices) = 12 Blocks (12 MB)
22. Terminology and Concepts: Data Distribution
KEY
• Use when the distribution key yields an even distribution of data
• Use for joins between large fact/dimension tables
• Optimizes merge joins and GROUP BY
ALL
• Small and medium-sized dimension tables (< 2–3M rows)
EVEN
• Use when a key cannot produce an even distribution
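To sanity-check a chosen distribution style, svv_table_info reports the style and the row skew across slices (a sketch, not part of the deck):
-- skew_rows is the ratio of rows on the fullest slice to rows on the emptiest; values near 1 mean an even spread
SELECT "table", diststyle, tbl_rows, skew_rows
FROM svv_table_info
ORDER BY skew_rows DESC;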
24. Enter Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes
• Fast at exabyte scale
• Elastic & highly available
• On-demand, pay-per-query
• High concurrency: multiple clusters access the same data
• No ETL: query data in place using open file formats
• Full Amazon Redshift SQL support
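For context, a minimal Spectrum setup looks roughly like the following; the schema name, catalog database, IAM role ARN, and S3 paths are placeholders:
-- Register an external schema backed by the AWS Glue / Athena data catalog
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
-- Define an external table over Parquet files in S3 (no data is loaded into the cluster)
CREATE EXTERNAL TABLE spectrum.deep_dive_ext (
  aid INT
  ,loc CHAR(3)
  ,dt DATE
)
STORED AS PARQUET
LOCATION 's3://example-bucket/deep_dive/';
-- The S3 scan runs on Spectrum nodes; the cluster only aggregates the results
SELECT loc, COUNT(*) FROM spectrum.deep_dive_ext GROUP BY loc;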
25. Paradigm Shift Enabled by Redshift Spectrum
• Traditional approach — analyze subsets of data: you had to pick and choose which data to analyze, limited to the data that fits in your data warehouse
• Redshift Spectrum approach — analyze ALL available data: analyze any of the data in your data lake
27. Amazon Redshift Spectrum is secure
• End-to-end data encryption: encrypt S3 data using SSE and AWS KMS; encrypt all Amazon Redshift data using KMS, AWS CloudHSM, or your on-premises HSMs; enforce SSL with perfect forward secrecy using ECDHE
• Virtual private cloud: Amazon Redshift leader node in your VPC; compute nodes in a private VPC; Redshift Spectrum nodes in a private VPC and store no state
• Alerts & notifications: communicate event-specific notifications via email, text message, or call with Amazon SNS
• Audit logging: all API calls are logged using AWS CloudTrail; all SQL statements are logged within Amazon Redshift
• Certifications & compliance: PCI DSS, FedRAMP, SOC 1/2/3, HIPAA/BAA
28. Customers Love Amazon Redshift Spectrum
“Redshift Spectrum will let us expand the universe of the data we analyze to hundreds of petabytes over time. This is truly a game changer, and we can think of no other system in the world that can get us there.”
“Multiple teams can now query the same Amazon S3 data sets using both Amazon Redshift and Amazon EMR.”
“Redshift Spectrum will help us scale yet further while also lowering our costs.”
“Redshift Spectrum’s fast performance across massive data sets is unprecedented.”
“Redshift Spectrum enables us to directly operate on our data in its native format in Amazon S3 with no preprocessing or transformation.”
“Our data science team using Amazon EMR can now collaborate with our marketing and product teams using Redshift Spectrum to analyze the same Amazon S3 data sets.”
29. Query Monitoring Rules (QMR)
Allow automatic handling of runaway (poorly written) queries
• Metrics with operators and values (e.g., query_cpu_time > 1000) create a predicate
• Multiple predicates can be AND-ed together to create a rule
• Multiple rules can be defined for a queue in WLM; these rules are OR-ed together
If { rule } then [action]
• { rule : metric operator value }, e.g.: rows_scanned > 100000
• Metric: cpu_time, query_blocks_read, rows_scanned, query_execution_time, CPU & I/O skew per slice, join_row_count, etc.
• Operator: <, >, ==
• Value: integer
• [action]: hop, log, abort
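Rules live in the WLM configuration; once they fire, the action taken is recorded in a system log that can be queried afterwards (a sketch; columns as documented for STL_WLM_RULE_ACTION):
-- Each row is a query that violated a QMR rule and the resulting action (log, hop, or abort)
SELECT userid, query, service_class, rule, action, recordtime
FROM stl_wlm_rule_action
ORDER BY recordtime DESC
LIMIT 20;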
30. IAM Authentication
Single sign-on for BI tools, SQL clients, and analytics tools
Diagram: client applications connect to Amazon Redshift over ODBC/JDBC; identity providers (e.g., ADFS backed by corporate Active Directory) together with IAM map users to Amazon Redshift user groups or an individual user.
New Amazon Redshift ODBC/JDBC drivers grab the ticket (userid) and get a SAML assertion.
31. Coming Soon: SQL Scalar User-Defined Functions
Language SQL support added for scalar UDFs
Example:
CREATE OR REPLACE FUNCTION inet_ntoa(bigint)
RETURNS varchar(15)
IMMUTABLE
AS
$$
SELECT
  CASE WHEN $1 BETWEEN 0 AND 4294967295 THEN
    (($1 >> 24) & 255) || '.' ||
    (($1 >> 16) & 255) || '.' ||
    (($1 >> 8) & 255) || '.' ||
    (($1 >> 0) & 255)
  ELSE
    NULL
  END
$$ LANGUAGE sql;
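A quick sanity check of the function (3232235777 is the integer form of 192.168.1.1):
SELECT inet_ntoa(3232235777);  -- should return 192.168.1.1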
33. 20 billion historical records
90 million family trees
10 billion profiles
330 million user-generated photos and stories
5 million people in genomics networks
34. Why DNA matching is so powerful
• Number of people in the room: 200
• Number of ways I could pick two different people in this room: 19,900 (200 × 199 / 2)
• Chance that a random pair of individuals shares enough DNA to be considered fourth cousins: 0.0001875
• Chance that there is at least one pair of genetic fourth cousins in the room: 97.6% (1 − (1 − 0.0001875)^19,900)
35. Where we were
• Actian Matrix
• On Premises
• Hardware Bound
• Single Admin
• SSIS for ETL
• 161 Packages
• 50+ Data Sources
36. How we successfully migrated.
Step 1. PLAN
• Document ALL Components That Need to Migrate
• Identify Dependencies
• Prioritize Importance
• Create Roadmap with Milestones
• Align Resources to Swimlanes
• Rally Team
37. Roadmap
Timeline: JAN–APR (Q1–Q2); today: Apr 18
Workstreams (Infrastructure and Data Warehouse swimlanes):
• AWS Infrastructure (VPC, Subnet, Network)
• AWS and DW Security (ACL, Groups, IAM, AIM, Encryption)
• DW Environment Design & PoC
• DW Infra. Configuration (EDW)
• DW Infra. Configuration (SOX, Tableau)
• DW Infra. (Monitoring, DR)
• Design Extract; Extract and Transfer Data
• Design S3 to Amazon Redshift Load; Load Schema & Data into Amazon Redshift
• Refactor and Test ELT Packages (161 Packages, 5 Resources 80% allocated)
• Migrate to AWS; Run ELT to “Catch Up” (amount of time determines ELT “Catch Up” time)
• Export, Transfer and Load SQL Server Data
• Backup Tableau & MicroStrategy; Restore to AWS; Validate Tableau & MicroStrategy (900 Tableau Workbooks)
• Validate Environment & Cutover; Run MATRIX and Amazon Redshift in Parallel
Milestones:
• JAN 27 – Core Infrastructure Established; PoC Complete
• FEB 7 – Amazon Redshift Env. Established
• FEB 24 – Data Loaded onto S3
• MAR 17 – Data Loaded into Amazon Redshift; ELT Packages Migrated to AWS and Verified
• APR 7 – Redshift “Caught Up”; Begin Running in Parallel
• APR 20 – Business Live on Amazon Redshift
• Cutover to Amazon Redshift completed 9 days
ahead of schedule! Business is live on
Amazon Redshift.
• ETL and Tools Migration 96% complete
• Tableau/MicroStrategy Migration 94% complete
38. How we successfully migrated.
Step 2. ADJUST
• Review Tasks and Roadmap Frequently
• There Will Be Unknowns
• Evaluate Impacts
• Be Flexible and Agile
39. How we successfully migrated.
Step 3. COMMUNICATE
• Internal with Team
• Up with Executives and Management
• Out with Customers
• Be Transparent, Honest, and Realistic
40. Data Warehouse Migration Status Scorecard – 4/18
• AWS Infrastructure: Common Infrastructure – 100% complete
• Security Management – 100% complete
• DW Infrastructure Configuration – 100% complete; Monitoring and DR Setup – 100% complete
• DW Data Extract and Load to AWS – 100% complete
• ETL and Tools Migration – 96% complete
o 100% (165/165) migration complete
o 100% (127/127) ETL catchup complete
o 85% (108/127) validation complete
o In progress: 4/19 – 85% validation complete; 4/28 – Run MATRIX and Amazon Redshift in parallel, 100% validation complete
• SQL Server Data Migration – 100% complete
• Tableau and MicroStrategy Migration: Tableau/MicroStrategy Migration – 94% complete
o Final Tableau Backup and Restore
o 4/12 – Test connectivity using the ELB/ENI for Tableau
o In progress: 4/18 – Upgrade MicroStrategy; 4/19 – MicroStrategy Validation; 4/21 – Tableau Validation
Status legend: Not on schedule & high risk to complete on time / On schedule & moderate risk to complete on time / On track to deliver on time
41. How we successfully migrated.
Step 4. RUN PARALLEL
• Start as SOON as Possible
• Quickly Identify Issues
• Added Work, but WORTH IT!
• Insights Are Invaluable
42. How we successfully migrated.
Step 5. HAVE THE RIGHT PARTNER
• We Were Scared
• Meet Often
• Communicate Concerns
• Amazon Has Been AMAZING
43. How we successfully migrated.
End result:
• Anticlimactic Failover
• Flawless Migration
• Zero Business Downtime
• 88 Business Days. 9 Days EARLY!
• 470 BILLION Records. Migrated AND Validated. 100% Accurate.
• 161 ETL Packages Migrated
• 600+ Tableau Workbooks Migrated
44. Lessons learned during migration
1. AWS Infrastructure Creation
2. Data Migration
3. Code Migration
45. Lessons learned during AWS infrastructure creation
How do I set up a production-level AWS account with all appropriate services in 3 weeks?
• Preparation
• Practice
• Patience
46. Lessons learned during data migration
How do I get 80 TB of highly compressed data to the cloud
securely, quickly, and accurately?
• AWS Snowball method
• AWS Direct Connect method
48. Lessons learned during code migration
How do I migrate 161 ETL code packages
with no business logic loss or data loss?
• Automation of changes
• Strict tracking of changes
• No deleting source data
49. Optimized for the cloud
Changes we have made since migrations
• Change in mindset
• Node size and type change: 96 DC1.8XL -> 47 DS2.8XL
• Rewrite of ETL code to leverage commit blocks, temp tables, dist
keys, sort keys etc.
Before – 2 commits hitting the commit queue:
DROP TABLE IF EXISTS foo.me;
CREATE TABLE foo.me AS (
  SELECT x, y
  FROM foo.you
);
After – 1 commit:
BEGIN;
DROP TABLE IF EXISTS foo.me;
CREATE TABLE foo.me AS (
  SELECT x, y
  FROM foo.you
);
COMMIT;
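To confirm that commit queuing is actually the bottleneck, the commit statistics log can be checked (a sketch; columns as documented for STL_COMMIT_STATS):
-- Time each transaction spent waiting in the commit queue before the commit work started
SELECT xid, node, queuelen,
       DATEDIFF(ms, startqueue, startwork) AS queue_wait_ms
FROM stl_commit_stats
ORDER BY startwork DESC
LIMIT 20;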
50. Amazon Redshift Spectrum use case
• Had 3 simple use cases of analysis that needed to be
performed on the data
• Loaded 20 TB of data from source data systems to Amazon S3 in raw form, then converted it to Parquet
• Partitioned by day
• Delivered analysis that was previously impossible to
produce
• https://aws.amazon.com/blogs/big-data/10-best-practices-for-amazon-redshift-spectrum/
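The deck does not show the DDL, but day partitioning in Spectrum looks roughly like this; the table, columns, bucket, and dates are hypothetical:
-- External table partitioned by day over Parquet data in S3
CREATE EXTERNAL TABLE spectrum.events (
  event_id BIGINT
  ,payload VARCHAR(1000)
)
PARTITIONED BY (dt DATE)
STORED AS PARQUET
LOCATION 's3://example-bucket/events/';
-- Each day's prefix is registered as a partition so queries can prune on dt
ALTER TABLE spectrum.events
ADD PARTITION (dt = '2017-05-01')
LOCATION 's3://example-bucket/events/dt=2017-05-01/';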