Deep Dive into AWS Redshift
Maryna Popova
Big Data Engineer
About Omio
800+ transport partners and providers
35+ countries, with full service in 15 of those
27m+ users per month from 150+ different countries
Agenda
1. What is AWS Redshift?
2. Columnar vs Row-based storage
3. Data compression
4. Redshift as MPP
5. Distkey and Sortkey
6. Scaling
7. Features and Bugs
8. Q&A
What is Amazon Redshift?
A fully managed, petabyte-scale data warehouse service in the cloud.
An enterprise-class relational database query and management system.
Amazon Redshift SQL is based on PostgreSQL 8.0.2.
Amazon Redshift and PostgreSQL have a number of very important differences.
Columnar vs Row-Based Data Storage
Row-oriented
Customer ID | Name | Surname | Email
Block 1: 1 | Ivan | Sidorov | IvanSidorov@somemail.com
Block 2: 2 | Juan | Rodrigez | JuanRodrigez@somemail.com
Block 3: 3 | Ian | Noel | IanNoel@somemail.com
Block 4: 4 | John | Smith | JohnSmith@somemail.com
Data blocks store values sequentially, row by row
Inefficient use of disk space
Row-oriented storage is designed to return a record in as few operations as possible
Optimal for OLTP databases
Disadvantage: inefficient use of disk space
Columnar
Customer ID | Name | Surname | Email
1 | Ivan | Sidorov | IvanSidorov@somemail.com
2 | Juan | Rodrigez | JuanRodrigez@somemail.com
3 | Ian | Noel | IanNoel@somemail.com
4 | John | Smith | JohnSmith@somemail.com
Block 1: 1, 2, 3, 4
Block 2: Ivan, Juan, Ian, John
Block 3: Sidorov, Rodrigez, Noel, Smith
Block 4: IvanSidorov@somemail.com, JuanRodrigez@somemail.com, IanNoel@somemail.com, JohnSmith@somemail.com
A data block stores the values of a single column for multiple rows
Far fewer I/O operations to read the same number of column field values for the same number of records, compared to row-wise storage
Same type of data in a block ⇒ a compression scheme can be used
Data compression
Data compression in Redshift
A compression encoding specifies the type of compression applied to a column of data values as rows are added to a table
Applied at the table design stage
Example
create table dwh.fact_bookings_and_cancellations(
reporting_operations_id bigint,
booking_id varchar(255) DISTKEY,
lowest_unit_value_in_euros bigint,
operation_currency varchar(255) encode bytedict,
operation_date_time timestamp SORTKEY,
...
Default encodings
Columns that are defined as sort keys are assigned RAW compression
Columns that are defined as BOOLEAN, REAL, or DOUBLE PRECISION data types are assigned RAW compression
All other columns are assigned LZO compression
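To check what Redshift itself would suggest for an existing table, you can run ANALYZE COMPRESSION; a minimal sketch against the example table above:

-- Samples the table (note: this takes a table lock) and returns one row per
-- column with a recommended encoding and the estimated space reduction
analyze compression dwh.fact_bookings_and_cancellations;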
Example (the same table, with the default encodings shown explicitly)
create table dwh.fact_bookings_and_cancellations(
reporting_operations_id bigint encode lzo,
booking_id varchar(255) encode lzo DISTKEY,
lowest_unit_value_in_euros bigint encode lzo,
operation_currency varchar(255) encode bytedict,
operation_date_time timestamp encode raw SORTKEY,
...
Encodings
Raw Encoding: data is stored in raw, uncompressed form
Byte-Dictionary Encoding: a separate dictionary of unique values is created for each block of column values on disk; effective when a column contains a limited number (<256) of unique values
Encodings
LZO Encoding: provides a very high compression ratio with good performance; works especially well for CHAR and VARCHAR columns that store very long character strings
Mostly Encoding: useful when the data type for a column is larger than most of the stored values require
Encodings
Delta Encoding: compresses data by recording the difference between values that follow each other in the column
Runlength Encoding: replaces a value that is repeated consecutively with a token consisting of the value and a count of the number of consecutive occurrences; DON'T apply it to the SORTKEY
Encodings
Text255 and Text32k Encodings: useful for compressing VARCHAR columns in which the same words recur often; a separate dictionary of unique words is created for each block of column values on disk
Zstandard Encoding: provides a high compression ratio with very good performance across diverse data sets
Encoding type | Keyword in CREATE TABLE | Data types
Raw (no compression) | RAW | All
Byte dictionary | BYTEDICT | All except BOOLEAN
Delta | DELTA | SMALLINT, INT, BIGINT, DATE, TIMESTAMP, DECIMAL
Delta | DELTA32K | INT, BIGINT, DATE, TIMESTAMP, DECIMAL
LZO | LZO | All except BOOLEAN, REAL, and DOUBLE PRECISION
Mostly | MOSTLY8 | SMALLINT, INT, BIGINT, DECIMAL
Mostly | MOSTLY16 | INT, BIGINT, DECIMAL
Mostly | MOSTLY32 | BIGINT, DECIMAL
Run-length | RUNLENGTH | All
Text | TEXT255 | VARCHAR only
Text | TEXT32K | VARCHAR only
Zstandard | ZSTD | All
What is MPP?
In computing, massively parallel refers to the use of a large number of processors to perform a set of coordinated computations in parallel.
[Diagram: sequential processing, input to output over the full processing time]
[Diagram: parallel processing with equal workload distribution]
[Diagram: parallel processing with unequal workload distribution]
Redshift as MPP
Amazon Redshift automatically distributes data and query load across all nodes.
[Charts: CPU utilization per cluster node]
Columnar storage and MPP: what is the way to influence performance?
Distkey and Sortkey
Sortkey and Distkey
Applied at the table design stage, in the initial DDL
Can be thought of as indexes
Improve performance dramatically
Both are specified at the table design stage:
create table dwh.fact_page_views (
page_type varchar(32) encode zstd,
page_view_ts timestamp SORTKEY,
event_id varchar(36) encode zstd DISTKEY,
session_id varchar(100) encode zstd
...
Sortkey
Amazon Redshift stores your data on disk in sorted order according to the sort key.
The Amazon Redshift query optimizer uses sort order when it determines optimal query plans.
Best Sortkey
If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key.
Queries will be more efficient because they can skip entire blocks that fall outside the time range.
Best Sortkey
If you do frequent range filtering or equality filtering on one column, specify that column as the sort key.
Redshift can skip reading entire blocks of data for that column because it keeps track of the minimum and maximum column values stored on each block.
Best Sortkey
If you frequently join a table, specify the join column as both the sort key and the distribution key.
This enables the query optimizer to choose a sort merge join instead of a slower hash join.
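As a sketch of this advice (hypothetical table and column names, not from the deck), making the join column both keys looks like this:

create table dwh.fact_example (
-- the join column carries both the sort key and the distribution key
session_id varchar(100) sortkey distkey,
event_count bigint encode lzo
);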
Main Rule for Sortkey
For Developers: define the column which is (or will be) used for filtering, and make it the SORTKEY
Main Rule for Sortkey
For Data Users: find out which column is the SORTKEY and use it in your queries to filter the data
The MOST Important Rule for Sortkey
For Developers: let your data USERS know the SORTKEY for the tables
Sortkey benefits
1. Queries will be more efficient because they can skip entire blocks that fall outside the range, since Redshift keeps track of the minimum and maximum column values stored on each block.
2. Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.
2.
No vertical
filter
cost=0.00..33547553.28
Timestamp
filter
cost=0.00.. 41934441.60
cost=0.00..4193.44
Demo - with
sortkey filter
Demo Summary
No filter: cost=0.00..33547553.28, rows=3354755328
Any filter: cost=0.00..41934441.60, rows=335476
Sortkey filter: cost=0.00..4193.44, rows=335476
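These figures come from the query planner; a sketch of how to reproduce them, assuming a hypothetical predicate on the dwh.fact_page_views table defined earlier (page_view_ts is its SORTKEY):

explain
select count(*)
from dwh.fact_page_views
where page_view_ts >= '2019-05-01';
-- The plan prints cost=startup..total; a predicate on the sortkey lets the
-- scan skip whole blocks, which is why the total cost drops so sharply.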
Killing the sortkey
Avoid using functions on the sortkey
If you need to use a function, also specify a wider plain range to help the optimizer
Demo: killing the sortkey
Function on the sortkey: cost=0.00..50321329.92, rows=1118251776
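A sketch of both sides of the rule, using a hypothetical weekly filter on dwh.fact_page_views (page_view_ts is the SORTKEY):

-- Kills the sortkey: the function hides the column from block min/max
-- statistics, so Redshift must scan far more blocks
select count(*)
from dwh.fact_page_views
where date_trunc('week', page_view_ts) = '2019-04-29';

-- Same logic plus a plain, wider range on the sortkey itself,
-- so block skipping still works
select count(*)
from dwh.fact_page_views
where date_trunc('week', page_view_ts) = '2019-04-29'
and page_view_ts between '2019-04-29' and '2019-05-06';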
Sortkey types
Compound
Interleaved
Compound Sortkey
More efficient when query predicates use a prefix, i.e. a subset of the sort key columns in order
Is the default
Example
...
,local_session_ts timestamp encode lzo
,vendor_id varchar(80) encode text255
,is_onsite boolean encode runlength
)
SORTKEY (session_type, session_first_ts);
alter table dwh.fact_traffic_united owner to etl;
...
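Since the compound key above is (session_type, session_first_ts), a predicate on the leading column benefits most; a hypothetical query:

-- Uses the prefix of the compound sort key, so blocks can be skipped
select count(*)
from dwh.fact_traffic_united
where session_type = 'organic';
-- A filter on session_first_ts alone (the second column) benefits far less.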
Interleaved Sortkey
Gives equal weight to each column in the sort key, so query predicates can use any subset of the columns that make up the sort key, in any order
Can use a maximum of eight columns
Prevents Concurrency Scaling
Interleaved Sortkey Example
...
Interleaved SORTKEY (session_type, session_first_ts);
Vacuum and Analyze
VACUUM: reclaims space and re-sorts rows in either a specified table or all tables in the current database
ANALYZE: updates table statistics for use by the query planner
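Minimal usage, run against the example table from earlier:

vacuum dwh.fact_page_views;   -- reclaim space and re-sort rows
analyze dwh.fact_page_views;  -- refresh statistics for the query planner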
Auto Vacuum and Auto Analyze
Auto Vacuum
Since 19 Dec 2018: Auto Vacuum on DELETE
Routinely scheduled VACUUM DELETE jobs don't need to be modified
All vacuum operations now run only on a portion of a table at a given time
Auto Analyze
Since Jan 2019: Auto Analyze runs in the background
An explicit ANALYZE skips tables with up-to-date table statistics
Columnar and Sortkey
When columns are sorted appropriately, the query
processor is able to rapidly filter out a large subset of
data blocks.
MPP and DISTKEY
Redshift distributes the rows of a table to the compute
nodes so that the data can be processed in parallel.
MPP and DISTKEY
The optimizer decides how the data needs to be located
Some rows, or the entire table, may be moved
Substantial data movement slows overall system performance
Using a DISTKEY minimizes data redistribution
Data distribution goals
To distribute the workload uniformly among the nodes in the cluster
To minimize data movement during query execution
[Charts: CPU utilization per cluster node]
Distribution Styles
1. KEY
2. EVEN
3. ALL
Example
...
provider_api varchar(500) encode lzo,
table_loaded_at timestamp default getdate()
)
DISTSTYLE EVEN;
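By contrast, a small, frequently joined dimension table can be copied in full to every node with DISTSTYLE ALL; a hypothetical sketch:

create table dwh.dim_currency (
currency_code varchar(3) encode zstd,
currency_name varchar(100) encode zstd
)
DISTSTYLE ALL;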
Bugs/features to keep in mind
ALTER TABLE statement: SORTKEY/DISTKEY changes require recreating the table
Keep this in mind for CI/CD
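A sketch of the recreate-and-swap pattern this implies, reusing the dwh.fact_page_views DDL from earlier (column list abbreviated):

-- 1. Create a new table with the desired SORTKEY/DISTKEY
create table dwh.fact_page_views_new (
page_type varchar(32) encode zstd,
page_view_ts timestamp SORTKEY,
event_id varchar(36) encode zstd DISTKEY,
session_id varchar(100) encode zstd
);
-- 2. Copy the data across
insert into dwh.fact_page_views_new
select * from dwh.fact_page_views;
-- 3. Swap the names so readers pick up the new table
alter table dwh.fact_page_views rename to fact_page_views_old;
alter table dwh.fact_page_views_new rename to fact_page_views;
drop table dwh.fact_page_views_old;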
Scaling
Concurrency Scaling
Automatically adds additional cluster capacity when you need it
In WLM, queues manage which queries are sent to the Concurrency Scaling cluster
Cluster requirements:
Node types: dc2.8xlarge, ds2.8xlarge, dc2.large, or ds2.xlarge
Node amount: more than 1 and fewer than 32
Concurrency Scaling
The following types of queries are candidates for Concurrency Scaling:
Read-only SELECT queries
Queries that don't reference tables that use an interleaved sort key
Queries that don't use Redshift Spectrum to reference external tables
Note: a query must encounter queueing to be routed to a Concurrency Scaling cluster
Cluster Resize
As your data warehousing capacity and performance needs change or grow, you can resize your cluster using one of the following approaches:
Elastic resize
Quickly add or remove nodes from a cluster
The cluster is unavailable briefly, usually only a few minutes
Redshift tries to hold connections open, and queries are paused temporarily
Classic resize
Change the node type, the number of nodes, or both
Your cluster is put into a read-only state for the duration of the operation
Snapshot, restore, and resize
Keeps your cluster available during a classic resize
When to use Elastic vs Classic Resize
Question | Elastic Resize | Classic Resize
Scaling for a 'new normal' (not a transient spike)? | No | Yes
More than doubling/halving the number of nodes? | No | Yes
Changing node type? | No | Yes
Bonus
Omio BI team's Redshift resize experience to share:
Resize operation: using the snapshot, restore, and resize operations to resize a cluster
Our difficulties and our own way
Snapshot, restore, and resize experience
1. Pause Snowplow pipeline
2. Take a snapshot of the production cluster (30 mins)
3. Spin up a new cluster from the snapshot (3 hours)
4. Resize the new cluster (~8 hours)
5. Rename the new cluster id to production
6. * Reassign the IAM roles as they were
7. Resume Snowplow pipeline
Features & Bugs
Bugs/features to keep in mind
Redshift types
Redshift scalability
Automatic backups
Quick table restoring
Advisor
Summary
Redshift is MPP: Distkey is the way to influence it
Redshift is columnar storage: Sortkey is the way to influence it
Q&A
Further questions?
Maryna Popova
Big Data Engineer
LinkedIn: www.linkedin.com/in/marinapopova05
Email: maryna.popova@omio.com
Thank you
Editor's Notes

  • #3 We enable people to find and book tickets for trains, buses and flights in more than 35 countries across Europe. We're fully operational in 15 of those countries, where it's possible to book travel to major cities and towns, and even lots of smaller villages.
  • #6 Amazon Redshift supports client connections with many types of applications, including business intelligence (BI), reporting, data, and analytics tools. My note: Amazon Redshift is a scalable DWH in the cloud. It is columnar data storage. It is MPP.
  • #25 In computing, massively parallel refers to the use of a large number of processors (or separate computers) to perform a set of coordinated computations in parallel (simultaneously).
  • #29 Massively parallel architecture
  • #30 Before switching to the next slide: "Here comes the question:"
  • #36 If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key. Queries will be more efficient because they can skip entire blocks that fall outside the time range. If you do frequent range filtering or equality filtering on one column, specify that column as the sort key. Amazon Redshift can skip reading entire blocks of data for that column because it keeps track of the minimum and maximum column values stored on each block and can skip blocks that don't apply to the predicate range. If you frequently join a table, specify the join column as both the sort key and the distribution key. This enables the query optimizer to choose a sort merge join instead of a slower hash join. Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.
  • #38 Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.
  • #46 These are all made up too.
  • #56 https://docs.aws.amazon.com/redshift/latest/dg/c_challenges_achieving_high_performance_queries.html
  • #57 By selecting an appropriate distribution key for each table, you can optimize the distribution of data to balance the workload and minimize movement of data from node to node.