Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Pavan Pothukuchi, Principal Product Manager , AWS
September 20, 2016
Deep Dive: Amazon Redshift for Big Data
Analytics

Agenda
• Service Overview
• Best Practices
• Schema / Table Design
• Data Ingestion
• Database Tuning
• Migration
• Examples

Relational data warehouse
Massively parallel; petabyte scale
Fully managed
HDD and SSD platforms
$1,000/TB/year; starts at $0.25/hour
Amazon
Redshift
a lot faster
a lot simpler
a lot cheaper

Selected Amazon Redshift customers

Amazon Redshift system architecture
Leader node
• SQL endpoint
• Stores metadata
• Coordinates query execution
Compute nodes
• Local, columnar storage
• Execute queries in parallel
• Load, backup, restore via
Amazon S3; load from
Amazon DynamoDB, Amazon EMR, or SSH
Two hardware platforms
• Optimized for data processing
• DS2: HDD; scale from 2TB to 2PB
• DC1: SSD; scale from 160GB to 326TB
10 GigE
(HPC)
Ingestion
Backup
Restore
JDBC/ODBC

A deeper look at compute node architecture
Each node contains multiple slices
• DS2 – 2 slices on XL, 16 on 8XL
• DC1 – 2 slices on L, 32 on 8XL
A slice can be thought as a “virtual
compute node”
• Unit of data partitioning
• Parallel query processing
Facts about slices:
• Each compute node has either 2,
16, or 32 slices
• Table rows are distributed to slices
• A slice processes only its own data
Leader Node

Amazon Redshift dramatically reduces I/O
Data compression
Zone maps
ID Age State Amount
123 20 CA 500
345 25 WA 250
678 40 FL 125
957 37 WA 375
• Calculating SUM(Amount) with
row storage:
– Need to read everything
– Unnecessary I/O
ID Age State Amount

Data compression
Zone maps
ID Age State Amount
123 20 CA 500
345 25 WA 250
678 40 FL 125
957 37 WA 375
• Calculating SUM(Amount) with
column storage:
– Only scan the necessary
blocks
ID Age State Amount

Column storage
Data compression
Zone maps
• In-memory block metadata
• Contains per-block MIN and MAX value
• Effectively prunes blocks which don’t
contain data for a given query
• Minimize unnecessary I/O
ID Age State Amount

Data Distribution
• Distribution style is a table property which dictates how that table’s data is
distributed throughout the cluster:
• KEY: Value is hashed, same value goes to same location (slice)
• ALL: Full table data goes to first slice of every node
• EVEN: Round robin
• Goals:
• Distribute data evenly for parallel processing
• Minimize data movement during query processing
KEY
ALL
Node 1
Slice 1 Slice 2
Node 2
Slice 3 Slice 4
Node 1
Slice 1 Slice 2
Node 2
Slice 3 Slice 4
Node 1
Slice 1 Slice 2
Node 2
Slice 3 Slice 4
EVEN

ID Gender Name
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
209 M James White
306 F Lisa Green
2
3
4
ID Gender Name
101 M John Smith
306 F Lisa Green
ID Gender Name
292 F Jane Jones
209 M James White
ID Gender Name
139 M Peter Black
164 M Brian Snail
ID Gender Name
446 M Pat Partridge
658 F Sarah Cyan
Round
Robin
DISTSTYLE EVEN

ID Gender Name
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
209 M James White
306 F Lisa Green
Hash
Function
ID Gender Name
101 M John Smith
306 F Lisa Green
ID Gender Name
292 F Jane Jones
209 M James White
ID Gender Name
139 M Peter Black
164 M Brian Snail
ID Gender Name
446 M Pat Partridge
658 F Sarah Cyan
DISTSTYLE KEY

ID Gender Name
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
209 M James White
306 F Lisa Green
Hash
Function
ID Gender Name
101 M John Smith
139 M Peter Black
446 M Pat Partridge
164 M Brian Snail
209 M James White
ID Gender Name
292 F Jane Jones
658 F Sarah Cyan
306 F Lisa Green
DISTSTYLE KEY

ID Gender Name
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
209 M James White
306 F Lisa Green
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
209 M Lisa Green
306 F James White
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
209 M Lisa Green
306 F James White
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
209 M Lisa Green
306 F James White
101 M John Smith
292 F Jane Jones
139 M Peter Black
446 M Pat Partridge
658 F Sarah Cyan
164 M Brian Snail
209 M Lisa Green
306 F James White
ALL
DISTSTYLE ALL

CUSTOMERS
CUST_ID GENDER NAME
101 M John Smith
306 F James White
ORDERS
ORDER_ID CUST_ID Amount
A1600 101 120
B8765 306 340
RESULTS
CUST_ID GENDER Amount
101 M 120
306 F 340
CUSTOMERS
CUST_ID GENDER NAME
292 F Jane Jones
209 M Lyall Green
ORDERS
ORDER_ID CUST_ID Amount
C0967 292 750
D8753 209 601
RESULTS
CUST_ID GENDER Amount
292 F 750
209 M 601

Choosing a Distribution Style
KEY
• Large FACT tables
• Large or rapidly changing
tables used in joins
• Localize columns used within
aggregations
ALL
• Have slowly changing data
• Reasonable size (i.e., few
millions but not 100’s of
millions of rows)
• No common distribution key for
frequent joins
• Typical use case – joined
dimension table without a
common distribution key
EVEN
• Tables not frequently joined or
aggregated
• Large tables without acceptable
candidate keys

Data Sorting
Goals
Physically order rows of table data based on certain column(s)
Optimize effectiveness of zone maps
Enable MERGE JOIN operations
Impact
Enables rrscans to prune blocks by leveraging zone maps
Overall reduction in block IO
Achieved with the table property SORTKEY defined over one or more columns
Optimal SORTKEY is dependent on:
Query patterns
Data profile
Business requirements

Zone Maps
SELECT COUNT(*) FROM LOGS WHERE DATE = ‘09-JUNE-2013’
MIN: 01-JUNE-2013
MAX: 20-JUNE-2013
MIN: 08-JUNE-2013
MAX: 30-JUNE-2013
MIN: 12-JUNE-2013
MAX: 20-JUNE-2013
MIN: 02-JUNE-2013
MAX: 25-JUNE-2013
MIN: 06-JUNE-2013
MAX: 12-JUNE-2013
Unsorted Table
MIN: 01-JUNE-2013
MAX: 06-JUNE-2013
MIN: 07-JUNE-2013
MAX: 12-JUNE-2013
MIN: 13-JUNE-2013
MAX: 18-JUNE-2013
MIN: 19-JUNE-2013
MAX: 24-JUNE-2013
MIN: 25-JUNE-2013
MAX: 30-JUNE-2013
Sorted By Date
READ
READ
READ
READ
READ

Single Column
• Table is sorted by 1 column
Date Region Country
2-JUN-2015 Oceania New Zealand
2-JUN-2015 Asia Singapore
2-JUN-2015 Africa Zaire
2-JUN-2015 Asia Hong Kong
3-JUN-2015 Europe Germany
3-JUN-2015 Asia Korea
[ SORTKEY ( date ) ]
Best for:
• Queries that use 1st column (i.e. date) as primary filter
• Can speed up joins and group bys

Compound
Date Region Country
[ SORTKEY COMPOUND ( date, region, country) ]
Best for:
• Queries that use 1st column as primary filter, then other cols
• Can speed up joins and group bys

Interleaved
• Equal weight is given to each column.
Date Region Country
[ SORTKEY INTERLEAVED ( date, region, country) ]
Best for:
• Queries that use different columns in filter
• Queries get faster the more columns used in the filter

COMPOUND
• Most Common
• Well defined filter criteria
• Time-series data
Choosing a SORTKEY
INTERLEAVED
• Edge Cases
• Large tables (>Billion Rows)
• No common filter criteria
• Non time-series data
• Primarily as a query predicate (date, identifier, …)
• Optionally choose a column frequently used for aggregates
• Optionally choose same as distribution key column for most efficient
joins (merge join)

Compressing Data
• COPY automatically analyzes and compresses data
when loading into empty tables
• ANALYZE COMPRESSION checks existing tables and
proposes optimal compression algorithms for each
column
• Changing column encoding requires a table rebuild

Compressing Data
If you have a regular ETL process and you use temp tables
or staging tables, turn off automatic compression
• Use analyze compression to determine the right encodings
• Bake those encodings into your DML
• Use CREATE TABLE … LIKE

Compressing Data
• From the zone maps we know:
• Which block(s) contain the
range
• Which row offsets to scan
• Highly compressed sort keys:
• Many rows per block
• Large row offset
Skip compression on just the
leading column of the compound
sortkey

Amazon Redshift Loading Data Overview
AWS CloudCorporate Data center
Amazon
DynamoDB
Amazon S3
Data
Volume
Amazon Elastic
MapReduce
Amazon
RDS
Amazon
Redshift
Amazon
Glacier
logs / files
Source DBs
VPN
Connection
AWS Direct
Connect
S3 Multipart
Upload
AWS Import/
Export
EC2 or On-
Prem (using
SSH)

Parallelism is a function of load files
Each slice’s query processors are able to load one file at a time
• Streaming Decompression
• Parse
• Distribute
• Write
A single input file means
only one slice is ingesting data
Realizing only partial cluster usage as 6.25% of slices are active
2 4 6 8 10 12 141 3 5 7 9 11 13 15

Maximize Throughput with Multiple Files
Use at least as many input files as
there are slices in cluster
With 16 input files, all slices are
working so you maximize
throughput
COPY continues to scale linearly
as you add additional nodes
2 4 6 8 10 12 141 3 5 7 9 11 13 15

New feature: ALTER TABLE APPEND
ELT workloads typically “massage” or aggregate data in a
staging table and then append to production table
ALTER TABLE APPEND moves data from staging to
production table by manipulating metadata
Much faster than INSERT INTO as data is not duplicated

Best Practices: Performance
Tuning

Optimizing a database for querying
• Periodically check your table status
• Vacuum and Analyze regularly
• SVV_TABLE_INFO
• Missing statistics
• Table skew
• Uncompressed Columns
• Unsorted Data
• Check your cluster status
• WLM queuing
• Commit queuing
• Database Locks

Missing Statistics
• Amazon Redshift’s query
optimizer relies on up-to-date
statistics
• Statistics are only necessary for
data which you are accessing
• Updated stats important on:
• SORTKEY
• DISTKEY
• Columns in query predicates

Table Skew
• Unbalanced workload
• Query completes as fast as the
slowest slice completes
• Can cause skew inflight:
• Temp data fills a single
node resulting in query
failure
Table Maintenance and Status
Unsorted Table
• Sortkey is just a guide, but data
needs to actually be sorted
• VACUUM or DEEP COPY to
sort
• Scans against unsorted tables
continue to benefit from zone
maps:
• Load sequential blocks

WLM Queue
Identify short/long-running queries
and prioritize them
Define multiple queues to route
queries appropriately.
Default concurrency of 5
Leverage wlm_apex_hourly to tune
WLM based on peak concurrency
requirements
Cluster Status: Commits and WLM
Commit Queue
How long is your commit queue?
• Identify needless transactions
• Group dependent statements
within a single transaction
• Offload operational workloads
• STL_COMMIT_STATS

Cluster Status: Database Locks
• Database Locks
• Read locks, Write locks, Exclusive locks
• Reads block exclusive
• Writes block writes and exclusive
• Exclusives block everything
• Ungranted locks block subsequent lock requests
• Exposed through SVV_TRANSACTIONS

Typical ETL/ELT on legacy data warehouse
• One file per table, maybe a few if too big
• Many updates (“massage” the data)
• Every job clears the data, then loads
• Count on primary key to block double loads
• High concurrency of load jobs
• Small table(s) to control the job stream

Two questions to ask
Why you do what you do?
• Many times, users don’t know
What is the customer need?
• Many times, needs do not match current practice
• You might benefit from adding other AWS services

On Amazon Redshift
Updates are delete + insert of the row
• Deletes just mark rows for deletion
Blocks are immutable
• Minimum space used is one block per column, per slice
Commits are expensive
• 4 GB write on 8XL per node
• Mirrors WHOLE dictionary
• Cluster-wide serialized

On Amazon Redshift
• Not all aggregations created equal
• Pre-aggregation can help
• Order on group by matters
• Concurrency should be low for better throughput
• Caching layer for dashboards is recommended
• WLM parcels RAM to queries. Use multiple queues for
better control.

Workload Management (WLM)
Concurrency and memory can now be changed dynamically
You can have distinct values for load time and query time
Use wlm_apex_hourly.sql to monitor “queue pressure”

New Feature – WLM Queue Hopping

Query throughput vs. Concurrency
• Query throughput (QPM or QPH) is more representative
of end user experience than concurrency
• Several improvements over the last 6 months
• Commit improvements
• Dynamic resource management
• Query throughput doubled over the last 6 months

Resources
https://github.com/awslabs/amazon-redshift-utils
https://github.com/awslabs/amazon-redshift-monitoring
https://github.com/awslabs/amazon-redshift-udfs
https://s3.amazonaws.com/chriz-webinar/webinar.zip
Admin scripts
Collection of utilities for running diagnostics on your cluster
Admin views
Collection of utilities for managing your cluster, generating schema DDL, etc.
ColumnEncodingUtility
Gives you the ability to apply optimal column encoding to an established
schema with data already loaded

Monday, October 24, 2016
JW Marriot Austin
https://aws.amazon.com/events/devday-austin
Free, one-day developer event featuring tracks,
labs, and workshops around Serverless,
Containers, IoT, and Mobile
Q&A
If you want to learn more, register for our upcoming DevDay Austin:

Appendix: Performance
optimization examples

Use SORTKEYs to effectively prune blocks

Don’t compress initial SORTKEY column

Use compression encoding to reduce I/O

Choose a DISTKEY which avoids data skew

Ingest: Disable predictable compression analysis

Ingest: Load multiple files to match cluster slices

VACUUM to physically removed deleted rows

VACUUM to keep your tables sorted

Gather statistics to assist the query planner

Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series

Similar to Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series (20)

More from Amazon Web Services

More from Amazon Web Services (20)

Recently uploaded

Recently uploaded (20)

Deep Dive Amazon Redshift for Big Data Analytics - September Webinar Series