This document provides an overview and best practices for optimizing performance on Amazon Redshift. It discusses topics like data distribution, sort keys, compression, loading data efficiently, vacuum operations, and query processing. The webinar agenda covers architecture, distribution styles, sort keys, compression, workload management and more. Examples are provided to demonstrate how different techniques can significantly improve query performance. Administrative scripts and views are also recommended as helpful tools.
2. Getting Started – June Webinar Series:
https://www.youtube.com/watch?v=biqBjWqJi-Q
Best Practices – July Webinar Series:
Optimizing Performance – July 21, 2015
Migration and Data Loading – July 22, 2015
Reporting and Advanced Analytics – July 23, 2015
Amazon Redshift – Resources
4. Leader Node
• SQL endpoint (JDBC/ODBC)
• Stores metadata
• Coordinates query execution
Compute Nodes
• Local, columnar storage
• Execute queries in parallel
• Load, backup, restore via S3
• Parallel load from DynamoDB or SSH
Hardware optimized for data processing
• DS2: HDD; scale from 2TB to 2PB
• DC1: SSD; scale from 160GB to 326TB
[Diagram: leader node connected to compute nodes over 10 GigE (HPC); ingestion, backup, and restore via S3; clients connect via JDBC/ODBC]
Amazon Redshift Architecture
5. – One slice per core
– DS2 – 2 slices on XL, 16 on 8XL
– DC1 – 2 slices on XL, 32 on 8XL
Architecture – Nodes and Slices
6. Table Distribution Styles
[Diagram: each style shown across Node 1 (slices 1–2) and Node 2 (slices 3–4)]
• Distribution Key – same key to same location
• All – all data on every node
• Even – round-robin distribution
7. Data Distribution with Distribution Keys
[Diagram: cloudfront and user_profile records (keyed by user_id) and an order_line record spread across slices 1–4 on Node 1 and Node 2]
8. Data Distribution and Distribution Keys
Distribution keys determine which data resides on which slices: records with the same distribution key for a table are on the same slice.
[Diagram: the cloudfront records for user_id=1234 land together on one slice; other records spread across slices 1–4]
9. Data Distribution and Distribution Keys
Records with the same distribution key for a table are on the same slice, and records from other tables with the same distribution key value are also on that slice. Distribution keys help with data locality for join evaluation.
[Diagram: cloudfront and user_profile records for user_id=1234 co-located on the same slice]
10. Data Distribution – Comparison
Example query (TPC-H dataset), by distribution type:
• Key: 14 seconds
• Even: 39 seconds
The query against the tables with a distribution key was 178% faster.
11. Data Distribution – Comparison
[Screenshots: query plan for tables with a distribution key vs. query plan for tables without one]
15. Data Distribution and Distribution Keys
Poor key choices lead to uneven distribution of records…
[Diagram: cloudfront records skewed across slices – 2M on slice 1, 5M on slice 2, 1M on slice 3, 4M on slice 4]
16. Data Distribution and Distribution Keys
Unevenly distributed data causes processing imbalances!
[Diagram: the same skewed layout – 2M, 5M, 1M, and 4M records across slices 1–4]
17. Data Distribution and Distribution Keys
Evenly distributed data improves query performance.
[Diagram: cloudfront records balanced at 2M per slice across slices 1–4]
To check a table's distribution:
select * from v_check_data_distribution where tablename = 'lineitem';
18. Data Distribution
• KEY – large fact tables; large dimension tables
• ALL – medium dimension tables (1K–2M rows)
• EVEN – tables with no joins or GROUP BY; small dimension tables (<1,000 rows)
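The guidance above maps directly to the DISTSTYLE clause of CREATE TABLE. A minimal sketch (table and column names are illustrative, not from the deck):

```sql
-- Large fact table: distribute on the join key
CREATE TABLE sales (
  sale_id int8 NOT NULL,
  cust_id int8 NOT NULL
) DISTSTYLE KEY DISTKEY (cust_id);

-- Medium dimension table: replicate a full copy to every node
CREATE TABLE region_dim (
  region_id int4 NOT NULL,
  name      varchar(32)
) DISTSTYLE ALL;

-- Table with no joins or group by: round-robin across slices
CREATE TABLE raw_events (
  payload varchar(4096)
) DISTSTYLE EVEN;
```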
21. Sort Keys – How to choose
• Timestamp column
• Column used in frequent range or equality filters
• Join column (often also the distribution key), for example:
create table customer (
c_custkey int8 not null,
c_name varchar(25) not null,
c_address varchar(40) not null,
c_nationkey int4 not null,
c_phone char(15) not null,
c_acctbal numeric(12,2) not null,
c_mktsegment char(10) not null,
c_comment varchar(117) not null
) distkey(c_custkey) sortkey(c_custkey) ;
23. Table is sorted by one column
[ SORTKEY ( date ) ]
Best for:
• Queries that use the 1st column (i.e. date) as the primary filter
• Can speed up joins and group bys
• Quickest to VACUUM
Date Region Country
2-JUN-2015 Oceania New Zealand
2-JUN-2015 Asia Singapore
2-JUN-2015 Africa Zaire
2-JUN-2015 Asia Hong Kong
3-JUN-2015 Europe Germany
3-JUN-2015 Asia Korea
Sort Keys – Single Column
24. • Table is sorted by the 1st column, then the 2nd column, etc.
[ SORTKEY COMPOUND ( date, region, country ) ]
• Best for:
• Queries that use the 1st column as the primary filter, then other columns
• Can speed up joins and group bys
• Slower to VACUUM
Date Region Country
2-JUN-2015 Oceania New Zealand
2-JUN-2015 Asia Singapore
2-JUN-2015 Africa Zaire
2-JUN-2015 Asia Hong Kong
3-JUN-2015 Europe Germany
3-JUN-2015 Asia Korea
Sort Keys – Compound
25. • Equal weight is given to each column
[ SORTKEY INTERLEAVED ( date, region, country ) ]
• Best for:
• Queries that filter on different columns
• Queries get faster the more columns are used in the filter (up to 8)
• Slowest to VACUUM
Date Region Country
2-JUN-2015 Oceania New Zealand
2-JUN-2015 Asia Singapore
2-JUN-2015 Africa Zaire
2-JUN-2015 Asia Hong Kong
3-JUN-2015 Europe Germany
3-JUN-2015 Asia Korea
Sort Keys – Interleaved
26. Sort Keys – Comparing Styles
Single
create table cust_sales_date_single
sortkey (c_custkey)
as select * from cust_sales_date;

Compound
create table cust_sales_date_compound
compound sortkey (c_custkey, c_region, c_mktsegment, d_date)
as select * from cust_sales_date;

Interleaved
create table cust_sales_date_interleaved
interleaved sortkey (c_custkey, c_region, c_mktsegment, d_date)
as select * from cust_sales_date;
27. Query 1
select max(lo_revenue), min(lo_revenue)
from cust_sales_date_single
where c_custkey < 100000;

select max(lo_revenue), min(lo_revenue)
from cust_sales_date_compound
where c_custkey < 100000;

select max(lo_revenue), min(lo_revenue)
from cust_sales_date_interleaved
where c_custkey < 100000;

Query 2
select max(lo_revenue), min(lo_revenue)
from cust_sales_date_single
where c_region = 'ASIA'
and c_mktsegment = 'FURNITURE';

select max(lo_revenue), min(lo_revenue)
from cust_sales_date_compound
where c_region = 'ASIA'
and c_mktsegment = 'FURNITURE';

select max(lo_revenue), min(lo_revenue)
from cust_sales_date_interleaved
where c_region = 'ASIA'
and c_mktsegment = 'FURNITURE';

Query 3
select max(lo_revenue), min(lo_revenue)
from cust_sales_date_single
where d_date between '01/01/1996' and '01/14/1996'
and c_mktsegment = 'FURNITURE'
and c_region = 'ASIA';

select max(lo_revenue), min(lo_revenue)
from cust_sales_date_compound
where d_date between '01/01/1996' and '01/14/1996'
and c_mktsegment = 'FURNITURE'
and c_region = 'ASIA';

select max(lo_revenue), min(lo_revenue)
from cust_sales_date_interleaved
where d_date between '01/01/1996' and '01/14/1996'
and c_mktsegment = 'FURNITURE'
and c_region = 'ASIA';
Sort Keys – Comparing Styles
29. Increased load and VACUUM times
More effective with large tables (100M+ rows)
Use a compound sort key when appending data in order
Sort Keys – Interleaved Considerations
32. COPY samples data automatically when loading into an empty table
• Samples up to 100,000 rows and picks the optimal encoding
If you use temp tables or staging tables:
• Turn off automatic compression (COMPUPDATE OFF)
• Use ANALYZE COMPRESSION to determine the right encodings
• Bake those encodings into your DDL
COPY <tablename> FROM 's3://<bucket-name>/<object-prefix>'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
DELIMITER ',' COMPUPDATE OFF MANIFEST;
Compression
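The staging-table workflow above can be sketched in two steps: ask Redshift for encoding recommendations against representative data, then bake them into the DDL. The table, column names, and chosen encodings below are illustrative:

```sql
-- Step 1: have Redshift recommend encodings from a sample of existing rows
ANALYZE COMPRESSION staging_orders;

-- Step 2: bake the recommended encodings into the table definition,
-- so COPY ... COMPUPDATE OFF loads don't need to re-sample
CREATE TABLE staging_orders (
  orderkey int8        ENCODE delta,
  status   char(1)     ENCODE runlength,
  comment  varchar(79) ENCODE lzo
);
```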
34. Example Query (TPC-H dataset)
Compressed Uncompressed
14 seconds 37 seconds
Query against the tables with
compression was 164% faster
Compression - Comparison
35. • Zone maps store min/max values per block
• Once we know which block(s) contain the range, we know which row offsets to scan
• Highly compressed sort keys mean many rows per block, so you'll scan more data blocks than you need
• If your sort keys compress significantly more than your data columns, you may want to skip compression on the sort key columns
Compression – Sort Keys
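One way to act on this is to leave the sort key column unencoded with ENCODE RAW while compressing the remaining columns. A sketch, with illustrative table and column names:

```sql
CREATE TABLE events (
  event_ts timestamp NOT NULL ENCODE raw SORTKEY,  -- keep the sort key uncompressed
  user_id  int8      NOT NULL ENCODE delta,        -- compress the data columns
  payload  varchar(256)       ENCODE lzo
);
```

Keeping the sort key raw means fewer rows per block on that column, so the zone maps stay selective.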
37. CREATE TABLE orders (
orderkey int8 NOT NULL DISTKEY,
custkey int8 NOT NULL,
orderstatus char(1) NOT NULL ,
totalprice numeric(12,2) NOT NULL ,
orderdate date NOT NULL SORTKEY ,
orderpriority char(15) NOT NULL,
clerk char(15) NOT NULL ,
shippriority int4 NOT NULL,
comment varchar(79) NOT NULL
);
DDL
38. During queries and ingestion, the system allocates buffers based on column width
Wider-than-needed columns waste memory
Fewer rows fit into memory, increasing the likelihood of queries spilling to disk
DDL – Make Columns as narrow as possible
39. Define Primary & Foreign Keys
Not enforced, but…
Helps the optimizer with the query plan
DDL
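Primary and foreign keys can be declared inline in the DDL; Redshift will not enforce them, but the planner uses them as hints. A sketch with illustrative names:

```sql
CREATE TABLE customers (
  cust_id int8 NOT NULL PRIMARY KEY,   -- informational only, not enforced
  name    varchar(25)
);

CREATE TABLE orders (
  order_id int8 NOT NULL PRIMARY KEY,
  cust_id  int8 NOT NULL REFERENCES customers (cust_id)
);
```

Because the constraints are not enforced, the loading process must still guarantee uniqueness and referential integrity itself.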
40. Use the COPY command
Each slice can load one file at a time
A single input file means only one slice is ingesting data
Instead of 100MB/s, you're only getting 6.25MB/s
Loading – Use multiple input files to maximize throughput
41. Use the COPY command
You need at least as many input files as you have slices
With 16 input files, all slices are working, so you maximize throughput
Get 100MB/s per node; scale linearly as you add nodes
Loading – Use multiple input files to maximize throughput
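In practice this means splitting the input into at least as many files as the cluster has slices and pointing COPY at the common key prefix, which loads all the parts in parallel. A sketch (bucket name and credential values are placeholders):

```sql
-- Files s3://mybucket/orders/part-00 .. part-15:
-- one file per slice on a 16-slice cluster
COPY orders
FROM 's3://mybucket/orders/part-'
CREDENTIALS 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
DELIMITER ',';
```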
43. VACUUM reclaims space and re-sorts tables
VACUUM can be run in 4 modes:
• VACUUM FULL
• Reclaims space and re-sorts
• VACUUM DELETE ONLY
• Reclaims space but does not re-sort
• VACUUM SORT ONLY
• Re-sorts but does not reclaim space
• VACUUM REINDEX
• Used for INTERLEAVED sort keys.
• Re-Analyzes sort keys and then runs FULL VACUUM
Vacuum
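Each mode is invoked directly against a table, for example:

```sql
VACUUM FULL orders;         -- reclaim space and re-sort
VACUUM DELETE ONLY orders;  -- reclaim space, no re-sort
VACUUM SORT ONLY orders;    -- re-sort, no space reclaim
VACUUM REINDEX orders;      -- interleaved sort keys: re-analyze keys, then full vacuum
```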
44. VACUUM is an I/O-intensive operation and can take time to run.
To minimize the impact of VACUUM:
• Run VACUUM on a regular schedule
• Use TRUNCATE instead of DELETE where possible
• TRUNCATE or DROP test tables
• Perform a deep copy instead of VACUUM
• Load data in sort order to remove the need for VACUUM
Vacuum
45. • An alternative to VACUUM
• Removes deleted rows and also re-sorts the table
• More efficient than VACUUM
• But you can't make concurrent updates to the table
Deep copy options:
• Use original table DDL and run INSERT INTO…SELECT
• Best option - Retains all table attributes
• Use CREATE TABLE AS
• New table does not inherit encoding, distkey, sortkey, primary keys, or foreign keys.
• Use CREATE TABLE LIKE
• New table inherits all attributes except primary and foreign keys
• Use a TEMP table to COPY data out and back in again
• Retains all attributes but requires two full inserts of the table
Vacuum – Deep Copy
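The first (best) option above can be sketched as follows; the table name is illustrative, and the CREATE statement stands in for re-running the table's original DDL:

```sql
-- Recreate the table from its original DDL (retains encodings, keys, etc.)
CREATE TABLE orders_new (LIKE orders);  -- or re-run the original CREATE TABLE

-- Copy only the live rows, in one sorted pass
INSERT INTO orders_new SELECT * FROM orders;

-- Swap the new table into place
DROP TABLE orders;
ALTER TABLE orders_new RENAME TO orders;
```

Note that CREATE TABLE ... (LIKE ...) does not carry over primary and foreign keys, which is why re-running the original DDL is the preferred variant.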
46. Redshift’s query optimizer relies on up-to-date statistics
Update stats on sort/dist key columns after every load
Analyze
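Restricting ANALYZE to the columns the planner actually needs keeps the post-load step cheap. A sketch, with the orders table from earlier slides (column list is illustrative):

```sql
-- After each load, refresh stats on the sort and distribution key columns
ANALYZE orders (orderdate, orderkey);
```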
48. Workload Management
Workload management is about creating queues for different workloads
[Diagram: User Group A routed to a short-running queue (Short Query Group) and a long-running queue (Long Query Group)]
50. Workload Management
Don't set concurrency to more than you need
set query_group to allqueries;
select avg(l.priceperticket*s.qtysold) from listing l, sales s where l.listid < 40000;
reset query_group;
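A query is routed to a queue by setting the query group that queue matches on, and a heavy query can borrow extra memory within its queue via the slot count. A sketch (the queue/group name and slot count are assumptions):

```sql
SET query_group TO 'long_queries';  -- route to the WLM queue matching this group
SET wlm_query_slot_count TO 3;      -- let one heavy query use 3 of the queue's slots
-- ... run the heavy query here ...
RESET wlm_query_slot_count;
RESET query_group;
```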
51. Resources
Sanjay Kotecha | kotechas@amazon.com
Detail Pages
• http://aws.amazon.com/redshift
• https://aws.amazon.com/marketplace/redshift/
Best Practices
• http://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
• http://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html
• http://docs.aws.amazon.com/redshift/latest/dg/c-optimizing-query-performance.html
Deep Dive Webinar Series in July
• Migration and Loading Data – July 22nd, 2015
• Reporting and Advanced Analytics – July 23rd, 2015