Deep Dive into AWS Redshift
Maryna Popova
Big Data Engineer
About Omio
800+ transport partners and providers
35+ countries, with full service in 15 of those
27m+ users per month from 150+ different countries
Agenda
1. What is AWS Redshift?
2. Columnar vs Row-based storage
3. Data compression
4. Redshift as MPP
5. Distkey and Sortkey
6. Scaling
7. Features and Bugs
8. Q&A
What is Amazon Redshift?
A fully managed, petabyte-scale data warehouse service in the cloud.
An enterprise-class relational database query and management system.
Amazon Redshift SQL is based on PostgreSQL 8.0.2.
Amazon Redshift and PostgreSQL have a number of very important differences.
Columnar vs Row-Based Data Storage
Row-oriented
Customer ID | Name | Surname | Email
Block 1: 1 | Ivan | Sidorov | IvanSidorov@somemail.com
Block 2: 2 | Juan | Rodrigez | JuanRodrigez@somemail.com
Block 3: 3 | Ian | Noel | IanNoel@somemail.com
Block 4: 4 | John | Smith | JohnSmith@somemail.com
Data blocks store values sequentially, row by row
Inefficient use of disk space
Row-oriented storage is designed to return a record in as few operations as possible
Optimal for OLTP databases
Disadvantage: inefficient use of disk space
Columnar
Customer ID | Name | Surname | Email
1 | Ivan | Sidorov | IvanSidorov@somemail.com
2 | Juan | Rodrigez | JuanRodrigez@somemail.com
3 | Ian | Noel | IanNoel@somemail.com
4 | John | Smith | JohnSmith@somemail.com
Block 1: 1, 2, 3, 4
Block 2: Ivan, Juan, Ian, John
Block 3: Sidorov, Rodrigez, Noel, Smith
Block 4: IvanSidorov@somemail.com, JuanRodrigez@somemail.com, IanNoel@somemail.com, JohnSmith@somemail.com
A data block stores the values of a single column for multiple rows
Far fewer I/O operations to read the same number of column field values for the same number of records, compared to row-wise storage
Same type of data in a block ⇒ a compression scheme can be used
Data compression
Data compression in Redshift
A compression encoding specifies the type of compression applied to a column of data values as rows are added to a table
Applied at the table design stage
Example
create table dwh.fact_bookings_and_cancellations(
reporting_operations_id bigint,
booking_id varchar(255) DISTKEY,
lowest_unit_value_in_euros bigint,
operation_currency varchar(255) encode bytedict,
operation_date_time timestamp SORTKEY,
...
Default encodings
Columns that are defined as sort keys are assigned RAW compression
Columns that are defined as BOOLEAN, REAL, or DOUBLE PRECISION data types are assigned RAW compression
All other columns are assigned LZO compression
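To check what Redshift itself would suggest for an existing table, you can run ANALYZE COMPRESSION; a minimal sketch against the example table above:

-- Samples the table (note: this takes a table lock) and returns one row per
-- column with a recommended encoding and the estimated space reduction
analyze compression dwh.fact_bookings_and_cancellations;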
Example (the same table, with the default encodings shown explicitly)
create table dwh.fact_bookings_and_cancellations(
reporting_operations_id bigint encode lzo,
booking_id varchar(255) encode lzo DISTKEY,
lowest_unit_value_in_euros bigint encode lzo,
operation_currency varchar(255) encode bytedict,
operation_date_time timestamp encode raw SORTKEY,
...
Encodings
Raw Encoding: data is stored in raw, uncompressed form
Byte-Dictionary Encoding: a separate dictionary of unique values is created for each block of column values on disk; effective when a column contains a limited number (<256) of unique values
Encodings
LZO Encoding: provides a very high compression ratio with good performance; works especially well for CHAR and VARCHAR columns that store very long character strings
Mostly Encoding: useful when the data type for a column is larger than most of the stored values require
Encodings
Delta Encoding: compresses data by recording the difference between values that follow each other in the column
Runlength Encoding: replaces a value that is repeated consecutively with a token consisting of the value and a count of the number of consecutive occurrences; DON'T apply it to the SORTKEY
Encodings
Text255 and Text32k Encodings: useful for compressing VARCHAR columns in which the same words recur often; a separate dictionary of unique words is created for each block of column values on disk
Zstandard Encoding: provides a high compression ratio with very good performance across diverse data sets
Encoding type | Keyword in CREATE TABLE | Data types
Raw (no compression) | RAW | All
Byte dictionary | BYTEDICT | All except BOOLEAN
Delta | DELTA | SMALLINT, INT, BIGINT, DATE, TIMESTAMP, DECIMAL
Delta | DELTA32K | INT, BIGINT, DATE, TIMESTAMP, DECIMAL
LZO | LZO | All except BOOLEAN, REAL, and DOUBLE PRECISION
Mostly | MOSTLY8 | SMALLINT, INT, BIGINT, DECIMAL
Mostly | MOSTLY16 | INT, BIGINT, DECIMAL
Mostly | MOSTLY32 | BIGINT, DECIMAL
Run-length | RUNLENGTH | All
Text | TEXT255 | VARCHAR only
Text | TEXT32K | VARCHAR only
Zstandard | ZSTD | All
What is MPP?
In computing, massively parallel refers to the use of a large number of processors to perform a set of coordinated computations in parallel.
[Diagram: sequential processing, input to output over the full processing time]
[Diagram: parallel processing with equal workload distribution]
[Diagram: parallel processing with unequal workload distribution]
Redshift as MPP
Amazon Redshift automatically distributes data and query load across all nodes.
[Charts: CPU utilization per cluster node]
Columnar storage and MPP: what is the way to influence performance?
Distkey and Sortkey
Sortkey and Distkey
Applied at the table design stage, in the initial DDL
Can be thought of as indexes
Improve performance dramatically
Both are specified at the table design stage:
create table dwh.fact_page_views (
page_type varchar(32) encode zstd,
page_view_ts timestamp SORTKEY,
event_id varchar(36) encode zstd DISTKEY,
session_id varchar(100) encode zstd
...
Sortkey
Amazon Redshift stores your data on disk in sorted order according to the sort key.
The Amazon Redshift query optimizer uses sort order when it determines optimal query plans.
Best Sortkey
If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key.
Queries will be more efficient because they can skip entire blocks that fall outside the time range.
Best Sortkey
If you do frequent range filtering or equality filtering on one column, specify that column as the sort key.
Redshift can skip reading entire blocks of data for that column because it keeps track of the minimum and maximum column values stored on each block.
Best Sortkey
If you frequently join a table, specify the join column as both the sort key and the distribution key.
This enables the query optimizer to choose a sort merge join instead of a slower hash join.
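As a sketch of this advice (hypothetical table and column names, not from the deck), making the join column both keys looks like this:

create table dwh.fact_example (
-- the join column carries both the sort key and the distribution key
session_id varchar(100) sortkey distkey,
event_count bigint encode lzo
);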
Main Rule for Sortkey
For Developers: define the column which is (or will be) used for filtering, and make it the SORTKEY
Main Rule for Sortkey
For Data Users: find out which column is the SORTKEY and use it in your queries to filter the data
The MOST Important Rule for Sortkey
For Developers: let your data USERS know the SORTKEY for the tables
Sortkey benefits
1. Queries will be more efficient because they can skip entire blocks that fall outside the range, since Redshift keeps track of the minimum and maximum column values stored on each block.
2. Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.
2.
No vertical
filter
cost=0.00..33547553.28
Timestamp
filter
cost=0.00.. 41934441.60
cost=0.00..4193.44
Demo - with
sortkey filter
Demo Summary
No filter: cost=0.00..33547553.28, rows=3354755328
Any filter: cost=0.00..41934441.60, rows=335476
Sortkey filter: cost=0.00..4193.44, rows=335476
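These figures come from the query planner; a sketch of how to reproduce them, assuming a hypothetical predicate on the dwh.fact_page_views table defined earlier (page_view_ts is its SORTKEY):

explain
select count(*)
from dwh.fact_page_views
where page_view_ts >= '2019-05-01';
-- The plan prints cost=startup..total; a predicate on the sortkey lets the
-- scan skip whole blocks, which is why the total cost drops so sharply.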
Killing the sortkey
Avoid using functions on the sortkey
If you need to use a function, also specify a wider plain range to help the optimizer
Demo: killing the sortkey
Function on the sortkey: cost=0.00..50321329.92, rows=1118251776
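A sketch of both sides of the rule, using a hypothetical weekly filter on dwh.fact_page_views (page_view_ts is the SORTKEY):

-- Kills the sortkey: the function hides the column from block min/max
-- statistics, so Redshift must scan far more blocks
select count(*)
from dwh.fact_page_views
where date_trunc('week', page_view_ts) = '2019-04-29';

-- Same logic plus a plain, wider range on the sortkey itself,
-- so block skipping still works
select count(*)
from dwh.fact_page_views
where date_trunc('week', page_view_ts) = '2019-04-29'
and page_view_ts between '2019-04-29' and '2019-05-06';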
Sortkey types
Compound
Interleaved
Compound Sortkey
More efficient when query predicates use a prefix, i.e. a subset of the sort key columns in order
Is the default
Example
...
,local_session_ts timestamp encode lzo
,vendor_id varchar(80) encode text255
,is_onsite boolean encode runlength
)
SORTKEY (session_type, session_first_ts);
alter table dwh.fact_traffic_united owner to etl;
...
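Since the compound key above is (session_type, session_first_ts), a predicate on the leading column benefits most; a hypothetical query:

-- Uses the prefix of the compound sort key, so blocks can be skipped
select count(*)
from dwh.fact_traffic_united
where session_type = 'organic';
-- A filter on session_first_ts alone (the second column) benefits far less.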
Interleaved Sortkey
Gives equal weight to each column in the sort key, so query predicates can use any subset of the columns that make up the sort key, in any order
Can use a maximum of eight columns
Prevents Concurrency Scaling
Interleaved Sortkey Example
...
Interleaved SORTKEY (session_type, session_first_ts);
Vacuum and Analyze
VACUUM: reclaims space and re-sorts rows in either a specified table or all tables in the current database
ANALYZE: updates table statistics for use by the query planner
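Minimal usage, run against the example table from earlier:

vacuum dwh.fact_page_views;   -- reclaim space and re-sort rows
analyze dwh.fact_page_views;  -- refresh statistics for the query planner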
Auto Vacuum and Auto Analyze
Auto Vacuum
Since 19 Dec 2018: Auto Vacuum on DELETE
Routinely scheduled VACUUM DELETE jobs don't need to be modified
All vacuum operations now run only on a portion of a table at a given time
Auto Analyze
Since Jan 2019: Auto Analyze runs in the background
An explicit ANALYZE skips tables with up-to-date table statistics
Columnar and Sortkey
When columns are sorted appropriately, the query
processor is able to rapidly filter out a large subset of
data blocks.
MPP and DISTKEY
Redshift distributes the rows of a table to the compute
nodes so that the data can be processed in parallel.
MPP and DISTKEY
The optimizer decides how the data needs to be located
Some rows, or the entire table, may be moved
Substantial data movement slows overall system performance
Using a DISTKEY minimizes data redistribution
Data distribution goals
To distribute the workload uniformly among the nodes in the cluster
To minimize data movement during query execution
[Charts: CPU utilization per cluster node]
Distribution Styles
1. KEY
2. EVEN
3. ALL
Example
...
provider_api varchar(500) encode lzo,
table_loaded_at timestamp default getdate()
)
DISTSTYLE EVEN;
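By contrast, a small, frequently joined dimension table can be copied in full to every node with DISTSTYLE ALL; a hypothetical sketch:

create table dwh.dim_currency (
currency_code varchar(3) encode zstd,
currency_name varchar(100) encode zstd
)
DISTSTYLE ALL;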
Bugs/features to keep in mind
ALTER TABLE statement: SORTKEY/DISTKEY changes require recreating the table
Keep this in mind for CI/CD
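A sketch of the recreate-and-swap pattern this implies, reusing the dwh.fact_page_views DDL from earlier (column list abbreviated):

-- 1. Create a new table with the desired SORTKEY/DISTKEY
create table dwh.fact_page_views_new (
page_type varchar(32) encode zstd,
page_view_ts timestamp SORTKEY,
event_id varchar(36) encode zstd DISTKEY,
session_id varchar(100) encode zstd
);
-- 2. Copy the data across
insert into dwh.fact_page_views_new
select * from dwh.fact_page_views;
-- 3. Swap the names so readers pick up the new table
alter table dwh.fact_page_views rename to fact_page_views_old;
alter table dwh.fact_page_views_new rename to fact_page_views;
drop table dwh.fact_page_views_old;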
Scaling
Concurrency Scaling
Automatically adds additional cluster capacity when you need it
In WLM, queues manage which queries are sent to the Concurrency Scaling cluster
Cluster requirements:
Node types: dc2.8xlarge, ds2.8xlarge, dc2.large, or ds2.xlarge
Node amount: more than 1 and fewer than 32
Concurrency Scaling
The following types of queries are candidates for Concurrency Scaling:
Read-only SELECT queries
Queries that don't reference tables that use an interleaved sort key
Queries that don't use Redshift Spectrum to reference external tables
Note: a query must encounter queueing to be routed to a Concurrency Scaling cluster
Cluster Resize
As your data warehousing capacity and performance needs change or grow, you can resize your cluster using one of the following approaches:
Elastic resize
Quickly add or remove nodes from a cluster
The cluster is unavailable briefly, usually only a few minutes
Redshift tries to hold connections open, and queries are paused temporarily
Classic resize
Change the node type, the number of nodes, or both
Your cluster is put into a read-only state for the duration of the operation
Snapshot, restore, and resize
Keeps your cluster available during a classic resize
When to use Elastic vs Classic Resize
Question | Elastic Resize | Classic Resize
Scaling for a 'new normal' (not a transient spike)? | No | Yes
More than doubling/halving the number of nodes? | No | Yes
Changing node type? | No | Yes
Bonus
Omio BI team's Redshift resize experience to share:
Resize operation: using the snapshot, restore, and resize operations to resize a cluster
Our difficulties and our own way
Snapshot, restore, and resize experience
1. Pause Snowplow pipeline
2. Take a snapshot of the production cluster (30 mins)
3. Spin up a new cluster from the snapshot (3 hours)
4. Resize the new cluster (~8 hours)
5. Rename the new cluster id to production
6. * Reassign the IAM roles as they were
7. Resume Snowplow pipeline
Features & Bugs
Bugs/features to keep in mind
Redshift types
Redshift scalability
Automatic backups
Quick table restoring
Advisor
Summary
Redshift is MPP: Distkey is the way to influence it
Redshift is columnar storage: Sortkey is the way to influence it
Q&A
Further questions?
Maryna Popova
Big Data Engineer
LinkedIn: www.linkedin.com/in/marinapopova05
Email: maryna.popova@omio.com
Thank you
Editor's Notes

  • #3 We enable people to find and book tickets for trains, buses and flights in more than 35 countries across Europe. We're fully operational in 15 of those countries, where it's possible to book travel to major cities and towns, and even lots of smaller villages.
  • #6 Amazon Redshift supports client connections with many types of applications, including business intelligence (BI), reporting, data, and analytics tools. My note: Amazon Redshift is a scalable DWH in the cloud. It is columnar data storage. It is MPP.
  • #25 In computing, massively parallel refers to the use of a large number of processors (or separate computers) to perform a set of coordinated computations in parallel (simultaneously).
  • #29 Massively parallel architecture
  • #30 Before switching to the next slide: "Here comes the question:"
  • #36 If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key. Queries will be more efficient because they can skip entire blocks that fall outside the time range. If you do frequent range filtering or equality filtering on one column, specify that column as the sort key. Amazon Redshift can skip reading entire blocks of data for that column because it keeps track of the minimum and maximum column values stored on each block and can skip blocks that don't apply to the predicate range. If you frequently join a table, specify the join column as both the sort key and the distribution key. This enables the query optimizer to choose a sort merge join instead of a slower hash join. Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.
  • #38 Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.
  • #46 These are all made up too.
  • #56 https://docs.aws.amazon.com/redshift/latest/dg/c_challenges_achieving_high_performance_queries.html
  • #57 By selecting an appropriate distribution key for each table, you can optimize the distribution of data to balance the workload and minimize movement of data from node to node.